Story Behind Java HashMap
The Java HashMap class is a implementation of what is called a hash table. But before delving into hash tables let’s recall what are some of the basic data structures in Java.
When we face the need of storing and retrieving set of objects, the first data structure that comes in mind in Java or other programming languages is an array. An array allows us to store a set of values of a single data type contiguously in memory with each element associated with an index.
We can randomly access each element of an array by its index, i.e. we can access every element in a single step which is an ideal situation from the standpoint of performance. A downside of arrays is that their size is fixed. You must specify the array size when you declare the array.
A different data structure is the so called linked list. Linked lists have elements that are not contiguous in memory and can grow. Each element of a linked list contains the stored object as well as the pointer to next element which would be null if there are no further elements. A downside of the linked list is that it cannot be accessed in a random way, i.e. to reach a specific element in the list we must traverse the list from the first element until we found it.
The limitations of these two data structures can be overcome using a data model known as “hash table”. A hash table is an array coupled with a so called hash function. A hash function gets an object that is intended to play the role of a key and returns an integer called the hash value. The hash value is then used to calculate the index of the hash table in which to put the key and the corresponding value. The positions of the hash table in which the key and value pairs are stored are called “buckets”.
To be more precise each bucket actually contains the pointer of a linked list and the key and value are put into the first available element of this linked list. The hash function must always give the same hash value for identical keys otherwise the retrieving operation would be impossible, but it could return the same hash value for different keys. When this happens we talk about “collision”. This is where linked lists associated with hash table come in handy, that is to solve the issue related to possible “collisions”. In the figure below the hash function returns the same hash value and bucket for key 1 and key 2. The result is that in the bucket number 2 the two entries related to key 1 and key 2 are stored in the linked list one after the other and entry 1 points to entry 2.
Then when we want to retrieve for instance Entry 2 we use the hash function to calculate the hash value to Key 2, from the hash value we calculate the table’s index and then we traverse the linked list checking the key contained in each entry until we find it equal to Key 2.
The approach to deal with collisions described above is called “separate chaining”. For sake of information we must say that other strategies exist to deal with collisions which do not involve linked lists. One of these strategies is called ‘linear probing’. Using this model if a collision occurs the information record is assigned the next available slot in the table. But linear probing suffers from what is called “clustering”. Clustering is the tendency of having contiguously filled slots and has the effect of negatively affecting performances.
An important factor in hash table performance is the ratio between the number of entries and the number of buckets, called Load Factor. If the load factor reaches a certain threshold the performances begin to decrease. If the load factor is low and the hash function is made well the hash values are spread uniformly in the table, there are few collisions and the performances are at their best.
How Java HashMap Implements a Hash Table?
The Java HashMap class implements the Map interface and provide a hash table data structure. Its public interface main methods are essentially put(K key, V value) and get(Object key) and it has the following constructors.
- HashMap()
- HashMap(int initialCapacity)
- HashMap(int initialCapacity, float loadFactor)
The method put() get the key and corresponding value, creates an instance of an implementation of the Entry interface with them and stores it into a bucket (The Entry object represents just the pair of key and value), or to be more precise it stores it in a linked list pointed by the bucket, in the first element available (in case of no collisions the linked list will have only one element).
It is advisable to create a HashMap instance with an initial capacity big enough to avoid its resizing when the load factor reaches its threshold.
The part of the hash function is played by the method hashCode() which is inherited by the Object class, so every class has it. Any object that is supposed to be used as a key in the HashMap must be an instance of a class that override the hashCode() method in order to give a proper hash value. If the hashCode() implementation is good enough we would expect that hash values are spread uniformly and collisions are minimized.
The Java HashMap uses an internal function to get the hash value returned by the key’s hashCode method and returns a final improved value (a value that is supposed to be distributed in a better way along the table) that is eventually used to calculate the index of the bucket in which to put the entry.
The method get() retrieves an object by its key using the hashCode() method of the key object to generate the corresponding hash value and from it the array index, traverses the linked list from the beginning and uses the equals() method of the keys already stored in the linked list to find the matching key.
The equals() method is also inherited by the Object class and must be overridden accordingly, here is its signature:
public boolean equals(Object obj)
It gets an object in input and compares it with the object it belongs to, if they are equal return true.
The default implementation provided by the Object class returns true if the two objects have the same reference which means they are exactly the same object. This is not what we want when comparing keys: the right implementation must guarantee that objects having the same internal state are found equal.
The get method returns null it the map does not contain a matching entry for the key.
It is important to note that good keys are only provided by immutable objects. If an object is not immutable its internal state can change and it is not guaranteed that the get() method will be able to retrieve a previously inserted object.
The HashMap has a load factor’s threshold of 0.75. If it reaches this limit the HashMap object will be re-sized creating a new array twice the size of the original, using again the hash function (hashCode method) to redistribute the objects in the new buckets locations, this is called re-hashing.
Accessing Java HashMap Concurrently
The Java HashMap class is not thread safe and it is not fitted for multithreaded environments unless the access to its instances are synchronized from outside.
There is also a particular situation which can lead to a race condition with an infinite loop: when the HashMap instance data reaches the threshold for the load factor (see previous paragraph), which is 0.75, a resize of the hashmap is triggered.
The resizing works by creating a new array having twice the size of the original one. Each original linked list is then assigned to the new table but its elements are in the reverse order, because they are taken and put from the head of the list in order to avoid the extra work to traverse the all list. The reverse order could be the cause the cause of a possible infinite loop that can happen if more than one thread is involved in the resizing.
This behavior is caused by the aforementioned reverse order of the re-hashed linked lists and the actual implementation of the function that does the re-hashing. If two threads run this function concurrently one of them could find the variables used to contain the linked lists pointers in a inconsistent state and that eventually causes the last element of the list pointing to the first causing an infinite loop in any subsequent operations.
Java HashMap Example
Here is an example of using HashMap in Java. The MyKey class below represents our key. It contains an integer value and it overrides the hashCode() and equals() methods. The implementations of hashCode() and equals() have been generated in this example using the utilities of the Eclipse platform (see the screenshot below).
The reason behind the hash code calculation by the use of a prime number is to minimize the possibility of collisions. From the hash value the final table’s index in which to store the entry can be calculated by <hash value> % <number of buckets>, i.e. the modulus operator applied to the hash value and table’s size, but for performance reasons some sort of bitwise calculation is used instead. That calculation, based on the fact that the HashMap has normally a number of buckets which is a power of 2 (16 is the default) can lead for even values to a bigger probability of collisions unless a prime number is used to correct it.
The equals method implementation is indeed self-explanatory: if the comparing objects are exactly the same instance or the keyValue values of the two objects are equal, i.e. if the two objects internal status is equal, returns true, otherwise returns false.
package hashmapexample; public class MyKey { Integer keyValue; public MyKey(Integer keyValue) { this.keyValue = keyValue; } public Integer getKeyValue() { return keyValue; } @Override public int hashCode() { final int prime = 31; int result = 1; result = prime * result + ((keyValue == null) ? 0 : keyValue.hashCode()); return result; } @Override public boolean equals(Object obj) { if (this == obj) return true; if (obj == null) return false; if (getClass() != obj.getClass()) return false; MyKey other = (MyKey) obj; if (keyValue == null) { if (other.keyValue != null) return false; } else if (!keyValue.equals(other.keyValue)) return false; return true; } }
The HashMapExample class below creates a HashMap in its main method, put a number of entries into it using MyKey’s instances as keys, retrieves a single value by a particular key and then retrieves all the values iterating over the set of entries of the HashMap object.
package hashmapexample; import java.util.HashMap; import java.util.Map; import java.util.Iterator; import java.util.Map.Entry; import java.util.Set; public class HashMapExample { public static void main(String args[]) { //we declare and create a HashMap with initial capacity of 4 HashMap<MyKey, String> hashMap = new HashMap<MyKey, String>(4); //we put 100 values as strings using MyKey instances as keys for (int i = 0; i < 100; i++) { MyKey myKey = new MyKey(i); hashMap.put(myKey, String.valueOf(i)); } //we get a value by its key MyKey myKey = new MyKey(4); String value = hashMap.get(myKey); System.out.println("value: " + value); //we display all the entries System.out.println("All entries:"); Set<Entry<MyKey, String>> set = hashMap.entrySet(); for (Iterator<Entry<MyKey, String>> iterator = set.iterator(); iterator.hasNext();) { Entry<MyKey, String> entry = (Entry<MyKey, String>) iterator .next(); System.out.println("key: " + entry.getKey().getKeyValue() + " - value: " + entry.getValue()); } } }
Conclusion
As an implementation of the HashTable data structure HashMap is a good choice in a non-concurrent environment since it guarantees good performances, but when concurrency is a concern then it is important to get the application behavior free of race conditions and data corruption. Some of the possible choices are:
- Syncronize the access outside the HashMap
- Use ConcurrentHashMap
Another choice would be the HashTable class since it is thread safe, but it has the downside that its performances are poor. ConcurrentHashMap performances are better and it is thread-safe as HashTable, so they are interchangeable except for the details about their locking behavior.
Author Bio:
Mario Casari is a software architect with experience in complex architectures, mainly in Java. He has worked in many different fields, among others banking unattended systems, navy messaging, e-health and pharmacovigilance. He has a blog not only Java where he writes about Java programming.