# Hashing

UPDATED: 4.02.2024

#### Hashing in Java Collections Framework

Generally, hashing is widely used in algorithms, data structures, and cryptography. In Java, probably most known case is its application into some implementations of Java Set interface and Map interface - two main interfaces of Java Collections Framework. We are often required to use HashSet over List implementation where possible due to HashSet lookup speed. Also, HashMap seems to be the most widely used Map implementation in Java.

In this article, we will take a general look at hashing in Java. Let’s start with recalling some important data structures and the importancy of `equals()`

and `hashCode()`

methods.
Then we will discuss when and how hashing is used in Java collections. Last but not least, hashing function features and its usage in cryptography and security will also be mentionned.

#### Set (interface) - recapitulation

Main features of Java Set interface:

- uniqueness of items: item which is already in a set will be omitted when adding
- Set contains no duplicates - does not allow duplicate elements
- items added to every set must at least implement
`equals()`

method (to check for equality when determining whether an element is already present in the set) - Set has exactly the same interface as Collection (Set is a subtype of Collection, Map is not!)
- it does not guarantee any particular order -
**but it is not a random order, so with every fetch it might sometimes return elements in the same order!** - Set is generally faster than List - at least for some operations.

When a set is faster than a list?

When checking for specific element (for containment = ‘contains’ operation). Since a Set does not allow duplicate elements, the internal data structure used by a set (like a hash table) allows for constant-time or near-constant-time lookup. A list, especially if it’s unsorted, requires iterating through in a search operation, and that results in a linear time complexity (O(n).

Set implementation are optimized for specific operations:

- HashSet uses a hash table for constant-time average-case performance for basic operations like add, remove, and contains
- LinkedHashSet is optimized for iteration as it keeps insertion order
- TreeSet is suitable when there is a need for maintaining a sorted order while adding and removing elements

Set implementation | Features | Requirements |
---|---|---|

HashSet | lookup speed, general purpose | equals(), hashCode() |

TreeSet | ordered Set backed by a tree Red-black tree implementation of the NavigableSet interface |
equals(), Comparable |

LinkedHashSet | lookup speed of a HashSet, keeps the insertion order using a linked list Hash table and linked list implementation of the Set interface runs nearly as fast as HashSet |
equals(), hashCode() |

#### How hashing is employed by HashSet

**1. Behind the scenes, HashSet is using HashMap!**

This class implements the Set interface, backed by a hash table (actually a HashMap instance) (from Oracle docs)

- HashSet is implemented as a wrapper around a HashMap
- it uses the HashMap to store its elements
- the elements themselves are used as keys, and the associated values are a constant placeholder (
`PRESENT`

). - HashSet methods are delegated to equivalent HashMap methods

**2. HashSet is using hashing function to identify and locate items**

- HashSet relies on the hash code of the objects that are being stored in the set (therefore it requires them to correctly implement
`hashCode()`

) - hashCode() method is invoked on each object to obtain a hash code (an integer value that represents the object)
- the hash code is then used to determine the bucket in the hash table (HashMap) where the object should be stored
- searching for an element or checking for containment, the hash code of the target object is used to locate the corresponding bucket
- owing to the hash code, HashSet is able to maintain item uniqueness

**3. Collisions, need for balancing, load factor, rehashing**

- the hash code is often subjected to additional transformations (like bitwise operations) to distribute the values more evenly across the available buckets
- HashSet uses a linked list (or a balanced binary tree) to handle collisions: each bucket can store multiple elements in the form of a linked list or a balanced binary tree
- if the bucket contains a linked list or tree, a linear search or tree traversal is performed to find the specific element

To maintain a balance between space usage and performance, HashSet dynamically resizes its underlying hash table when the number of elements exceeds a certain threshold (determined by the load factor). The resizing operation involves creating a new hash table with a larger size, rehashing the existing elements into the new table, and discarding the old table.

Let’s get even deeper - to the HashMap itself!

But first, let’s recall what do we know about Map interface in general.

#### Map (interface) - recapitulation

The implementations and featurs of the Map interface in Java:

Map implementation | Features |
---|---|

HashMap | based on a hash table constant-time performance: insert and search performance can be adjusted using constructors capacity and load factor can be set |

LinkedHashMap | insertion order or least-recently-used (LRU) order (option)slightly slower than HashMap faster iteration - using LinkedList mechanism can be configured (constructor) to use LRU algorithm based on accesses: not accessed items appear at the front of the list LRU is good periodic cleanup in order to save space |

TreeMap | implementation based on a red-black treekeys (and entries) in sorted order (Comparable or Comparator) only Map with the subMap( ) method, returns a portion of the tree |

WeakHashMap | map of weak keys that allow objects referred to by the map to be released no references to a particular key held outside the map = key may be garbage collected |

ConcurrentHashMap | thread-safe Map, no need for synchronization or locking |

IdentityHashMap | HashMap that uses == instead of equals() to compare keys not for general use |

Clear explanation of IdentityHashMap found in Cay Horstmann’s *Core Java. Vol I: Fundamentals* (shortened and slightly re-edited):

IdentityHashMap is a really special case. Hash values for the keys should not be computed by the hashCode() method but by the System.identityHashCode() method. It is the same method that Object.hashCode() uses to compute a hash code from the object’s memory address. Also, for comparison of objects, the IdentityHashMap uses == , not equals(). In other words, different key objects are considered distinct even if they have equal contents (due to hashCode() and equals() implementation they are using). This class is useful for implementing object traversal algorithms, such as object serialization, in which you want to keep track of which objects have already been traversed.

Important: when a custom object (user-defined class) is used as a key in HashMap, **it should implement equals() and hashCode() methods** - otherwise, it won’t work properly!
Results may be unpredictable. The same is true when custom collection container is designed by a user.

#### How to implement `equals()`

correctly?

Why correct defining of `equals()`

method is important?

All objects of any class implicitly inherit `equals()`

method from Object class.
It’s a part of **single-root hierarchy**, a term describing that all objects in Java inherit from the Object class.
In other words, all Java classes, no matter if in-built or programatically-created, implicitly extend the Object class and inherit its methods.

At this root level, in the Object class, `equals()`

works like `==`

, simply comparing references. So for most of everyday cases, it won’t work properly.

The Object class provides some common methods to all subclasses. It has nine instance methods (excluding overloaded methods) which can be divided into four groups: threads synchronization: wait, notify, notifyAll; object identity: hashCode, equals; object management: finalize, clone, getClass; human-readable representation: toString;

Although an enum is a reference type, two variables can be correctly compared using both ways: the method equals() and the operator ==

Good practice says: when we override the

`equals()`

method, we should also override`hashCode()`

!

Requirements that should be met by every good `equals()`

method:

1) **Reflexivity**: for any non-null reference value x, x.equals(x) should return true.

2) **Symmetry**: for any non-null reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.

3) **Transitivity**: for any non-null reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.

4) **Consistency**: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided that no information used in equals comparisons on the objects is modified.

5) **Non-nullity**: for any non-null reference value x, x.equals(null) should return false.

Mnemonic to remember the rules: RED-REFLEXIVE SYMMETRIC TRANSIT-TRAIN CONSIST ON NON-NULL CARS

**How to write a custom, good equals() method:**

First, select some fields to use them in comparison and then perform three tests inside the `equals()`

method:

1) if this and other object have the same reference, the objects are equal, otherwise - go to step 2;

2) if the other object is null or has an unsuitable type, the objects are not equal, otherwise - go to step 3;

3) if all selected fields are equal, the objects are equal, otherwise, they are not equal;

4) If you do not perform all the tests, the equals method will work properly in some cases.

#### How to implement `hashCode()`

correctly?

Why correct defining of `hashCode()`

is important?

The Object class `hashCode()`

just returns hash code value that derives the hash code from
the object’s memory address, which is not good in case when it should return some meaningful, distinct value.

**What are the requirements for good hashCode() implementation?**

1) Whenever it is invoked on the same object more than once during an execution of a Java application, the `hashCode()`

method must consistently return the same integer value,
provided no information used in equals comparisons on the object is modified. This value can be negative number.

2) If two objects are equal according to the equals() method, then calling the hashCode() method on each of the two objects must produce the same integer result.

3) It is not required for unequal objects to produce distinct hash codes. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

`Hash code`

generation should not be based on mutable data. `hashCode()`

method must be fast and meaningful: it should rely on values that indentifies the object in a meaningful way.
The identity of the object should be clearly resolved. The generated code does not need to be unique, but hashing should be fast - that is the importancy of hashing.

The `String`

class uses the following method (as of Java 17):

```
public int hashCode() {
int h = this.hash;
if (h == 0 && !this.hashIsZero) {
h = this.isLatin1() ? StringLatin1.hashCode(this.value) : StringUTF16.hashCode(this.value);
if (h == 0) {
this.hashIsZero = true;
} else {
this.hash = h;
}
}
return h;
}
```

So it invokes some hashing methods depending on the encoding. For UTF16 it is:

```
public static int hashCode(byte[] value) {
int h = 0;
int length = value.length >> 1;
for(int i = 0; i < length; ++i) {
h = 31 * h + getChar(value, i);
}
return h;
}
```

Further explanation in documentation

Joshua Bloch in his *Effective Java* describes potential pitfalls related to both methods, and shows the recipee for a good implementation.

Notice **the magic number 31** here. According to the author:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, because multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance on some architectures: 31 * i == (i « 5) - i . Modern VMs do this sort of optimization automatically.

Remember that:

You must override hashcode if you’ve overridden equals.

Why is that?

First of all, to maintain consistency within both methods.

Secondly, `hashCode()`

method is used to determine which bucket an object should be placed in, and the `equals()`

method is then used to find the correct object within that bucket.

#### Hashing gives quick item retrieval: a simple example how hash table works

**What is a hash table?**

A hash table, also known as a hash map, is a data structure that allows the storage and retrieval of key-value pairs. A hash table consists of an array of buckets, each capable of storing a key-value pair. The number of buckets in the array is determined by the size of the hash table. Hashing function computes value that serves as an index to the hash table.

**What is a bucket?**

In the context of a hash table: individual slot or container in the array where key-value pairs are stored.

**What data structures are used for buckets?**

When we want to store more than one key-value pair in a single bucket, hash table takes advantage of an additional data structure, which is then “chained” under this bucket. It may be linked list or tree-based structure.

**How hashing works in hash table?**

So `hashCode()`

method is **hashing function** that produces numeric value (called **a hash code**) pointing to some value, placed under some index in an array.
This array is called **hash table**.

Example:

Let’s say that an object has a hash code of 64, and other has 128. The first is placed in an array under index 64. The other under index 128.

This is just a simplistic example - in fact, the number is further processed: the real value is modulo of hash code by the total number of buckets.

Example: if hash code value is 7628, and there are 128 buckets, the object will be stored under index 108:

```
76268 % 128 = 108
```

When a searching operation is performed, there is no need to use searching algorithm to search for a given object among others, nor to iterate through whole array.
Once it is know that we are looking for an object, of which `hashCode()`

returns 76268, we can access the object under the index of 108.

Null key is stored at index 0. HashMap can contain multiple null values, but only one null key.

#### Even distribution problem

A correctly designed hashing function should result in an even distribution of values within a hash table. If the values tend to cluster (some areas of the hash table are heavily loaded while others are empty), it may affect the speed.

#### Collisions

If a bucket is already filled, hash collision is happening.

**A perfect hashing function** is a case when there is a fixed number of values, so that a collision is impossible. According to Eckel:

The case of a perfect hashing function is implemented in the Java SE5 EnumMap and EnumSet, because an enum defines a fixed number of instances.

On the other hand, **collision** is when different objects have the same hash code. Normally, the collisions are handled by **external chaining**:

The array doesn’t point directly to a value, but instead to a list of values. These values are searched in a linear fashion using the equals() method. Of course, this aspect of the search is much slower, but if the hash function is good, there will only be a few values in each slot. So instead of searching through the entire list, you quickly jump to a slot where you only have to compare a few entries to find the value. This is much faster, which is why the HashMap is so quick.

The hash table in Java is an array of LinkedLists, prepared for future collisions. Each bucket contain list of slots for values (objects) that have colliding hash codes. In case of hash collsion, the new object will be compared with all objects in that bucket (in the list “attached” to the bucket) to check if it is already present. Set is not accepting duplicates, so if the object is already there, it will be skipped when added again. If it is a really new object, it will be stored in the list.

As of Java SE 8, the buckets change from linked lists into balanced binary trees when they get full, for better performance. It is useful when malicious code tries to flood a hash table with many values that have identical hash codes.

Other solutions for handling the collsions is **open addressing** and its variants. Conflicting items are not stored in the list as in the case of chaining, but in normal buckets.
How so? In case of a collision, a hash is incremented using special function. Open addressing consists on linear probing, quadratic probing and double hashing.

#### Efficiency, load factor and rehashing

Getting back to the hash table - the slots are called **buckets**. How many buckets should be in terms of efficiency?
The length of a bucket array (hash table) should be either prime number or power of two. Again, Eckel says in his book:

As it turns out, a prime number is not actually the ideal size for hash buckets (evidence for this isn’t conclusive), and recent hashed implementations in Java use a power-of-two size (after extensive testing). Division or remainder is the slowest operation on a modern processor.

With a power-of-two hash table length, masking can be used instead of division. Since get() is by far the most common operation, the % is a large part of the cost, and the power-of-two approach eliminates this (but may also affect some hashCode() methods).

The standard library uses bucket counts that are powers of 2, with a default of 16 . Any custom hash table size is rounded to the next power of 2. According to Horstmann, bucket number should reach from 75% to 150% of the expected element count.

Load factor is important.

**Load factor** of a hash table says to which extent the table is filled with values (0 is empty table, 0.5 is half-full, 1 is full).

So the less, the better?

Not necessarily.

Hash table with small load has fewer chances of collisions, hence it is optimal for insertions and lookups. On the other hand, it will be slower when traversing with an iterator. HashMap and HashSet have constructors to specify the load factor.

When the load factor is reached, the container will automatically increase the capacity - the number of buckets - by roughly doubling it
and will redistribute the existing objects into the new set of buckets, which is called **rehashing**.

The default load factor of Java HashMap is 0.75, which is considered to be a good trade-off between time and space complexity.
Higher load factor means less space is required by the table, but at the same time it increases the lookup cost (time complexity).
This lookup is often performed, for example when invoking `get()`

and `put()`

methods.

#### More about HashMap in Java

To sum up, A HashMap in Java is an implementation of the Map interface, which is designed to store key-value pairs. It uses a hash table data structure.

**Initial size (capacity) and default load factor**

When you create a HashMap using its default constructor, it initializes with an initial capacity of 16 and a load factor of 0.75. (initial capacity - number of buckets in the hash table, load factor - a measure of how full the hash table is allowed to get before it’s resized).

Threshold (how many buckets can be used) = capacity x load factor

The load factor is **the trade-off between time and space complexity.**

**Preferred size (capacity)**

Either prime number or power of two. The latter solution ensures better performance: bit-shift preferred over modulo for calculations like this.

**Big O notation: time and space complexity**

Correctly designed hash table implementation with proper `hashCode()`

should allow for constant-time average-case complexity (O(1)).
Space complexity amounts to O(n), where n is the number of stored items. A higher value of load factor = lower space overhead, but higher ookup cost.

#### What is this hashing for, after all?

What we know so far is that hashing function returns numeric representation of an object. The returned value is fixed-size, e.g.
in Java `hashCode()`

returns `int`

, which has some maximum capability. The size of this primitive type is 4 bytes.
The range from -2,147,483,648 to 2,147,483,647. Any Java hash code, generated by `hashCode()`

as int value, is limited by this range and size,
no matter what is the size of the input. The input data is often variable-length, but the output always fixed-length.

Hash function must work in a **deterministic way**. Java `hashCode()`

method invoked on the same object multiple times should always return
the same value. Equal objects must have the same hash code. In short, for the same input it produces the same output.

It is not required (but advised) for different objects to have distinct hash codes. However, hash collision affects lookup speed and - in some areas, like cryptography - it is dangerous for security. Hence, cryptographic hashing functions are required to prevent collisions.

Hashing is **one-directional function** - meaning: there is no way to recreate an object from its hash representation.

Apart from data structures, hashing is used in security.

For example, passwords are not stored in database as text (at least, it should not be stored like that).
Instead, they are *hashed*. Each password is passed to some hashing function that returns hash code representation of the input.
When user is trying to log in, his input is passed to the same hashing function, and returned value is compared with the hash stored in the database.

Hashing is not an encryption, although it is a form of crypthographic security. Encryption is two-directional, and it is a separate process that uses different algorithms.
Passwords may be encrypted after hashing. Before hashing, passwords may be **salted or peppered**. **Salt** is random data added to an input before hashing. **Pepper** is also random data
added to an input before hashing, but it is often kept secret and not stored alongside the hash. Both salt and pepper are used to make cryptographic hash functions more difficult to reverse
(so-called **rainbow table attack**).

Hashing is used in data integrity validation (*checksum*) and in blockchain (e.g. cryptocurrencies).

Cryptographic hash function must be **preimage-resistant**: for essentially all pre-specified outputs, it is computationally infeasible to find any input that hashes to that output;
i.e., given y, it is difficult to find an x such that h(x) = y. Otherwise, it is vulnerable to preimage attack.

**Second-preimage resistance** is another requirement: for a specified input, it is computationally infeasible to find another input which produces the same output; i.e., given x, it is difficult to find a second input x′ ≠ x such that h(x) = h(x′).
Collision resistance implies second-preimage resistance, but does not guarantee preimage resistance. A second-preimage attack implies a collision attack.

The functions should be **indifferentiable from a Random Oracle**. **A Random Oracle** is a kind of theoretical, cryptographic black box, returning random response to every unique query. Meaning it should not be possible
to find input values making only slightly different output values (hash codes). And *vice versa*: small change in any input value should return a completely different hash code.
It should not be possible to find correlation between inputs and outputs.

Further reading:

*Core Java, Volume I: Fundamentals*, 12th Edition, Cay S. Horstmann, Oracle, 2021

*Thinking in Java 4th Edition*, Bruce Eckel, Pearson, 2006

*Effective Java, 3rd Edition*, Joshua Bloch, 2018, chapter III, item 10 & 11

Rogaway, Shrimpton, Cryptographic Hash-Function Basics: Definitions, Implications, and Separations for Preimage Resistance, Second-Preimage Resistance, and Collision Resistance