![]() | Monkeys at Keyboards: The Javanomicon © Michael James Heron | ||||
| Topic: Java Programming Level: 2 Version: delta | |||||
23 - More Data Structures | |||||
| Previous | Table of Contents | Next |
| Forum |
| Chapter Objectives |
By the end of this chapter, the reader will be able to:
|
In this chapter, we discuss some of the data structure considerations that Java developers must take into account when writing a program of any real complexity. Effectively representing the data in a Java program is undoubtedly the single most important thing you can do to ensure success as a developer. All programming tasks eventually resolve into two key considerations:
If you have answers to these two questions, the rest of the development process is easy. Well... it is easy for a given value of easy. In this chapter, we'll discuss one of the more powerful data structures provided for us by the Java programming language.
The formal study of data structures has long been an integral part of computer programming. It is only comparatively recently, with the introduction of fully featured object libraries in environments such as Java that it has become less necessary for programmers to study data structures in and of themselves. This is a good thing in many respects - it is important that programmers do not become bogged down in a futile game of reinvent the wheel. Once a particular data structure has been mastered, then it is a tool to be used... obscuring this fact with the implementation is counter productive. Standardising such tools means that you can spend more time actually designing computer programs and less time coding standard data structures. However, it's also a bad thing - mainly because you don't get nearly as much practise with manipulating data. Writing your own stack, queue or linked list implementation is a very educational exercise and can lead to a much deeper understanding of the underlying strengths and weakness of a particular structure than simply using an off the shelf object. Java provides a range of powerful data structures to aid in the development of sophisticated programming solutions. However, it is important to realise that these are generic tools - they will not be the best implementation for all situations. There are many cases in which a slight tweak of the structure would make it ideal for your needs, but specialisation of the object doesn't necessarily give you the control that you require. Data structures are a technically interesting subject and worth much more than the single chapter we can devote to them in this book. There are numerous books on the subject for those who wish to spend more time delving into this area with an intention of sharpening the tools in their development toolkit. In this chapter, we'll be concentrating mainly on making use of the provided data structures rather than writing our own. The act of choosing the right data representation for the right problem is one that a developer gets better at as experience starts to accrue. The situations where these structures are useful may not be immediately apparent, but sometime in your programming career you will come across a situation where they are perfect - recognising such situations is an important skill. You will already be familiar with some of the basic data structures:
Undoubtedly you've already realised how much easier it is to develop applications when using these structures than it would be without them. Our ability to tackle complex problems becomes even greater with access to some of the other data structures provided by Java... so let's get onto the real meat of the chapter!
Most data structures are a specialisation, in one way or another, of the standard Array structure. An array such as we've been dealing with during the course of the book is a collection of contextually related data elements. We place no specific restrictions on how we access the array - we can pull things out at any index, and set the value of any index as required. Sometimes we benefit from restricting this freedom through the application of certain kinds of access systems. Two common examples of these are the Stack and the Queue. A stack is known as a LIFO structure... this stands for Last In First Out, and describes the way data is accessed. The last thing we added to a stack is the first thing we pull out if it. The name is very descriptive - it works just like a stack of dishes. You don't pull dishes from the bottom of the pile (unless you're feeling lucky)... you take a dish off of the top of the pile. The last dish you added is the first one you remove. Stack data structures are useful for many kinds of problem. Consider for example a game of poker... you have a deck of cards, and you deal cards off of the top of the deck. This could be represented as a stack of Card objects... we'll look at that as an example a little later in this section. Java provides a Stack object for us... we create it in the normal way:
Stacks give us two methods to use to add and remove data elements. We push elements onto the stack, and we pop elements off. As mentioned above, new elements get added to the top of the stack (on a push operation), and we retrieve the last element that was added with each pop. Consider a simple deck of cards implementation. First we need an actual card:
And then a Deck object which creates all of the possible cards and pushes them onto an internal stack (which we'll call cards). We then shuffle our cards to get a representation of a deck:
Now we can pop cards off of the deck to deal out a hand:
Let's deal out a hand of five cards and see what we get:
Running this prints out a hand of cards:
Alas, no win there. We should also provide a method for returning cards into the deck once we're done:
And then when we're done with our hand of cards, we return them to the deck and shuffle:
The stack gives us an easy way of implementing this kind of functionality, but it's not just for simple applications such as this. Stacks are a very powerful data structure and used in many of the more complex area of programming and operating system design.
The queue works in a very similar way to the stack - the difference is that it's a FIFO structure (First in, first out). It's used for implementing things in which the order elements are added matters. Print queues are an obvious area in which they are important - print jobs are processed in the order in which they are received, and recent jobs are added to the end of the queue. There's no standard Queue object in Java, but it can be very easily implemented using a standard ArrayList and the framework for generics that we discussed in chapter four. However, there is a catch in that we must delve a little bit more deeply into what generics can do within the Java framework. As we have seen, an ArrayList requires a type decleration through the generics syntax - but if we're writing a simple Queue, we have no idea what kind of data is going to go onto it. We need a way of being generic about generics - luckily we have just that mechanism. First of all, we need to create a class for our Queue - this will use exactly the same syntax we have discussed in the past with only one small difference:
See the <T> on the class definition? That defines for us an identifier for a generic decleration - this is called a type decleration. When we create an instance of our class, we will do so using the familiar syntax:
The <T> is an identifier we can use within our class, and when the class is instantiated it is replaced with the concrete class definition. The use of T is entirely arbitrary - we can use any non-reserved word in here, and whatever word we use can be used as a class definition in the text of our code:
In this way, we get a generically typed collection that implements the functionality of a standard Queue:
Queues offer a powerful framework for implementing certain kinds of data structure, and a firm understand of how they work internally can provide great insight into the best way of representing your own data.
Java's terminology for most of its data structures is Collection. An ArrayList for example is a specialisation of a generic Collection. One of the most useful collections that Java provides is known as the HashMap - it is effectively an associative array structure. The mechanics of the HashMap are somewhat complex, so we'll only touch on them briefly. It is important to realise that you don't need to understand how a HashMap works in order to use it, but you may find that you use it better if you have some grasp of the underlying processing. HashMaps are sometimes known as lookup tables. Like an array, it contains a number of elements. Unlike an array, these elements are not referenced by a numeric index. Instead, they are accessed by what is known as a key. Each key has a value, and each key is unique. Every element in a HashMap is comprised of a key-value pair. Any object is a candidate for being used as a key or a value. However, much like as with an ArrayList we need to use the generic syntax in order to make use of them within our code. The generic syntax is little more complicated for a HashMap as we'll see. Consider the following initial representation of a HashMap:
Each of the keys in this HashMap is the name of a programming language, represented by a String object. Associated with each of these keys is a value, which is also a String object. We create a HashMap by providing a type for both the key and the value in the generic definition. The first type is the type of the key, the second is the type of the value:
We use the put method of a HashMap to add a key value pair:
After this method call, our HashMap would look like this:
When we want to get data out of the HashMap, we use the get method, passing in the key as a parameter.
After this method call, the variable temp holds the value "A wonderful, fun filled language". The important aspect about a HashMap is that a key has only a single value (although that value may be a complex object like an ArrayList or even another HashMap)... so if we then have another call to put like so:
Our HashMap then looks like this:
HashMap objects are a very quick and effective way of implementing a one to one mapping between particular objects without any significant computational overhead. The danger with the put method is that we may carelessly overwrite an existing value unintentionally - luckily the HashMap class has a method called containsKey which returns a boolean value that indicates whether the HashMap already contains a particular key:
For good practise, we should also use this method before attempting to query the value of a particular key - but the HashMap will simply return a null value if there is no key present that matches the parameter of a get call.
A data structure like a HashMap is no good if we can't manipulate it properly. We saw above how to add and query elements, but usually we want to do something a bit more than that. One of the more common acts is to step over every element in a Collection, so we should be able to do that also with a HashMap. Most of the collection classes in Java use the foreach loop as a standard way of stepping through all of their elements. The HashMap, being somewhat more complex, does not. The HashMap provides a method called keySet that returns the list of all keys in the HashMap. It returns this as an object of type Set. A Set is effectively an ArrayList that contains no duplicate entries. So we can gain a list of all the keys in a HashMap like so:
The Set object does allow for the use of the foreach structue, and so we can easily step over all of the keys contained in a particular HashMap, and call the get method using each:
Sets are an enormously powerful mathematical concept... and as usual, we don't have the room in this book to even touch on the complexities. However we can look at two of the operations modelled by sets... that of the symmetric difference and the intersection. An intersection of two sets will provide a set that contains all of the elements that are present in both sets. A symmetric difference provides the set that contains the elements that were in one or the other, but not in both. A set in Java is just an Array that contains no duplicate elements. In mathematics we can apply certain operations to pairs of sets to obtain useful information. We can do the same thing using Java's set implementations. We can't just create a Set object - it's what's known as an abstract class (more on this later in the book). Instead we need to create an instance of one of its more specialised implementations... we'll use HashSet for this. The HashSet provides two useful methods... each methods takes as a parameter another set. The removeAll method will go over each element of set one and remove any element that is also present in set two. This models a symmetric difference. It changes the state of the set the method is called on, but leaves the original set unmodified. The retainAll method models an intersection:
This is a very useful tool for quickly comparing two sets against each other. Imagine we are developing an employee tracking system for a Great Benevolant Company. The company has decreed that every employee is to get a hot stone massage as part of the company's policy of stress reduction. They have one set which is a list of employees, and another set that lists all of the employees who have had their massage. We can create a set by passing in a standard ArrayList to the constructor method, so let's have a look at a quick way of telling which employees have still to receive their Complementary Therapy:
This gives us the list of people who are in set one (the employees) but not in set two (those who have been massaged). The list we get out is: Michael, Pauline, Natalie, Ruth. These lucky people can then been notified that they have some Relaxing Therapy awaiting them.
There is one particularly obvious reason to use a HashMap - they model an incredibly useful relationship between objects in a simple and effective way. We don't have to develop complex object structures to implement a simple one to one mapping of keys to values - we just create a HashMap and let that do the work for us. They are also very fast - the setting or getting of any single element is done as a very simple operation as opposed to other data structures like arrays, where first you must find the appropriate element (which can be very costly). We will look at how a HashMap accesses its elements in a few moments. However, this power comes at a price. They are substantially more memory-hungry than they have to be. The way in which they are internally represented means that they have to consume more space than is strictly necessary to hold all their elements. We will also look at why this is the case in a later section of this chapter. They can be difficult to manipulate. They model a very common relationship between objects, and it is very easy to slip into the trap of thinking that they are therefore the best way to model all such relationships. The individual circumstances of an application will determine whether this or not this is a good idea. They are very quick for accessing particular elements, but stepping over every element in a HashMap is a more expensive process than doing the same thing with a comparable ArrayList structure. They are not particularly expandable. The relationship is one to one, and if you later want to express a more complex relationship it can require significant remodelling of your code. Often the best solution is a combination of a HashMap along with a more complex object representation that serves as a value. Consider for example a HashMap used to implement a lookup table for usernames and passwords:
By using this system it is very simple to determine whether a user has entered the right password. We get their name from a text field, we get their password from another text field, and then we check to see if there is a match:
However, consider the future - such a limited relationship seems as if it's going to be problematic in later years. What if I want a way of then recording a user's login time? Or their access privileges (such as we discussed in the password case study), or other such details? The simple mapping provided by a HashMap is not very expandable and this means that if we are likely to need to store more data, we will need a more complex data representation, such as implementing a HashMap of Strings mapped to User objects, which is a more desirable solution. The system we used for our Password case study was an ArrayList of objects, which is a very similar kind of structure. In fact, it's one of the most powerful possible structures available using the Java programming language. It is an immensely flexible way of representing even very complex sets of data. However, when using a system like this there is a considerable amount of extra overhead required in providing utility methods to query the state of objects and find particular elements in the list. We need to write code for dealing with duplication, and code for adding and removing elements. The HashMap class allow for both the flexibility of objects and a simple and efficient syntax for finding particular elements, provided there is some unique key that can be used to identify them. The HashMap essentially gives us all of the flexibility of an ArrayList but with a bundle of utility methods provided. Whether this is appropriate for any given software application depends on the data and the kind of manipulation you are likely to be performing on a regular basis.
As with most things in this book, we only have enough space to cover the basic ideas of HashMap functionality. It is a considerably richer subject that this brief overview would suggest. All objects in Java share a method called hashCode. The hashCode method computes an integer representation of any given object - this integer representation is known as a hash code. The hash code of any two different objects will, in the best case scenario, be substantially different even if the objects themselves are almost identical. Java provides a standard hashCode method for computing the code of any given object. Within a HashMap, this hash code is used as the index of an array. We compute the hash code of the key to work out an index, and then place the value at that specified point in the array. Let's say for example we have an object of our Password class, and that objectas a hash code of 102. We pass this object as a key to our put method, and the HashMap places that key and its value in the element (the key-value pairing is known as a record).
This is a simplified example, but it is effectively what is going on inside a HashMap. All that we need to know to access a value for a particular key is the hash code of that key, then we know where to look in the array. In efficiency terminology, it has what is called constant efficiency in that the time it takes to find an element doesn't increase when the HashMap is bigger, unlike a typical ArrayList where we need to search every element to find a particular value. That's great, except that in the real world, it's not really true. A HashMap doesn't really have a constant efficiency since real world constraints get in the way. For a HashMap to have truly constant efficiency all the time it would need to be of infinite size, which is impossible with the limits of modern technology (or indeed, any imaginable future technology). We must make some trade-offs between memory usage and efficiency. It's not possible for an array to hold all possible hash codes. It's not even feasible to hold a large fraction of them. Instead, we must choose some reasonable size of internal representation and then reduce computed hash codes so that they fall within these constraints. The very low hash code value shown above is unrealistic... most hash codes are many digits long. You can see how big the hash code of even a basic string is with the following simple program:
The string "Michael" has a hash code of 1610629318 - that's really big! However hash codes need to be that big to maximise the chance that two objects won't compute the same code - it is of course still possible, but much more unlikely. We can reduce a hash code to fall within a particular range by modulus dividing it. For example, if we had a one thousand element array, we'd want the hash code to fall into a valid index:
In this case, the hash code computes to 441. The smaller a range of valid indexes we use, the more chance there is of two objects sharing the same hash code. When this occurs in a HashMap, it is called a collision. There are two main methods of dealing with a collision. These are open addressing, and a process called chaining. Java uses the latter technique. When there is a collision at a particular index, Java sets up a bit bucket that contains a list of all the matching values. This is usually implemented as an array or a list. When we add a value to a HashMap, it checks to see if something is already there. If there is, it adds it to the end of the bucket. So, before we add the element:
And then after we add the element:
When it comes time to retrieve an element from the HashMap, it checks every element in the bucket until it finds the one we want, and then returns it. For this it doesn't use the hashcode, instead it uses a .equals() call on every key in the chain until it finds the one it's looking for. This has a tremendous effect on the efficiency of a HashMap. For the worst case scenario (A hash table with only one valid index, so everything goes into the same bucket), it is no more efficient than a simple linear search:
Obviously, maintaining the proper balance between memory usage and efficiency is a key consideration when using a HashMap. Open addressing resolves collisions in a brute force way - upon finding a particular element taken, it searches through the rest of the HashMap until it finds an empty element, and places it there. There are two main strategies of doing this. One is linear probing, which simply checks each position in turn until it finds an empty element:
The linear probing strategy uses a wraparound system, so that if it gets to the end of the array and finds no empty spaces, it will go to the start. Using the example above, if we tried to insert an element at position 200 (which is occupied), it would then place the element at position 0 (which is not). There is also a popular quadratic probing system that does the search in a slightly more complicated way. If a linear search has a hash code that points to index X, for example, it uses the following algorithm to find an empty slot:
Quadratic probing works as follows:
So linear probing would search position 100, then 101, then 102, and so on. Quadratic probing would search 100, then (100 + 1) then (100 + 4) then (100 + 9) and so on. HashMaps implemented using linear probing tend to end up with a clustering problem - since the search is concentrated around used indexes, it becomes gradually less efficient to insert new elements as the table fills up. Quadratic probing eliminates the clustering problem, but requires that the size of the array is a prime number, and that the array is never more than half full. In both cases, where there are no more spaces left to fill, the internal array must be resized to cope with the increase in data. This is a very expensive operation, and so is an undesirable conclusion of any given collision management strategy.
Well... there's a catch to the HashMap. Of course there is, you wouldn't expect anything less of me, would you? The hashCode method of Object works, but it's not very good if you're writing a significant object of your own with many attributes. If you plan on using such an object as a key in a Hash based structure, you need to over-ride the hashCode method of Object to provide your own better implementation. There are a number of common hashing algorithms that achieve moderately effective results. The simplest of these is to get the hash code value of each of your attributes and combine them together. The hash algorithm for a String is as follows:
A similar technique can be used to compute the hash code for your own objects. First you choose some multiplier value (a prime number is best), and then combine the hash code of all your attributes:
If you have more than two attributes, it works slightly differently:
This is not a foolproof method - it is trivial to find two Strings in Java that share the same hash code, for example:
This program shows that the strings "BB" and "Aa" both have the same hash code value of 217. The techniques above are clearly not flawless, but they work well enough for day to day use in situations where it is not particularly important that collisions are avoided (for example, using such hash codes in cryptography systems would be inadvisable!).
Additionally, all objects in Java inherit a method called equals that is used to determine the equivalence of two objects of the same type. If you over-ride the hashCode method in an object, you should also make sure that your equals method is also over-ridden and conforms to the following rule:
If this rule is not enforced in your equals method, there will be some unusually interesting behaviour exhibited by the objects you are using within your HashMap. Such irregularities in behaviour are often difficult to debug, so it is in your best interests to ensure they are avoided from the very start. There is no requirement for the converse of this relationship to be true - if the hash code of x equals the hash code of y, they are not necessarily the same object. It is merely a consequence of the hashing model.
All this talk of hash codes, collision management and such is obscuring the real point of the chapter - that HashMaps are pretty cool and easy to use. For the majority of the programs you write, you will never need to worry about computing hash codes to avoid collisions, and certainly not during the course of this book. This information is provided as a discussion of an interesting way to represent data, rather than a tutorial on how you are expected to write code for the programs in this book.
Enumeration as a programming concept refers to the process of creating a variable that has one of a fixed number of values. This is a fairly common situation. For example, consider our BorderLayout manager. We can make use of one of five possible values - NORTH, SOUTH, EAST, WEST, CENTER. The BorderLayout class offers static variables to simplify this process, but these variables are just integers... there's nothing to stop us writing the following code:
If we try to do this, we get a compiler error about having an illegal component position, which is only to be expected. For more complex implementations of the concept, the error may not be caught until much later in a program's execution (if at all). Java 1.5 now offers a new structure called the enumeration (or enum). Within an enumeration, we declare all of the valid states for that enumeration. Once we have declared all the possible states, we cannot set an enumeration object to be a value other than those that have been listed:
With our enum above, we will get a compiler error if we try to set dir to something that isn't defined in the enum... for example, Directions.Bing will give a 'cannot resolve symbol' error. Once of the most useful things about Java enumerations is that we can also define methods, attributes and constructors for them. Imagine a program that was designed to move an element around a 2D grid through the use of an enumerator that defined all the legal directions, and also the way we have to modify the x and y co-ordinates to make the move:
Methods and attributes must come after the enum declarations in an enum, and the constructor must be private and used only within the enum declaration (otherwise it defeats the whole point). We can define a public constructor, but it will error if we try to use it outside of the warm confines of the enum itself. We make use of this Direction enum in our Robot class:
The facing int tells us which way the robot is facing (obviously), and whenever we change the state of facing, we also set the dir enumeration object to the right value. Now whenever we move, we adjust the value of our x and y co-ordinates by the right values (which are set in the Direction enumeration).
A HashMap is a powerful and flexible data structure that has considerable performance efficiency benefits over other, simpler data structures. This efficiency comes at a cost in terms of memory usage, since the representation of the data requires an internal array that is significantly larger than is necessary to hold each of the elements. The efficiency of a HashMap is its primary benefit, but it is also especially useful in modelling simple relationships between otherwise unrelated objects. It is possible for a programmer to take this too far and hammer unsuitable data structures into a HashMap through a misguided sense of familiarity, so it is important that you avoid this in your development. Java provides a range of other data structures - some of these are special kinds of classes (like stacks and queues) while others are more fundamental structures (like enumerations). All of these have their place in the toolbox of the wise developer, Further ReadingThe following table details further reading on the topic in this chapter, and also any external resources that you may find useful.
|
| Previous | Table of Contents | Next |
© 2004-2006 Michael James Heron