When I started blogging almost ten years ago (well, not quite that much), I always focused on coding-related posts, which genuinely helped me understand how things work behind the scenes. In the last few days, however, I’ve realised I haven’t written anything like that for a long time.

Going back to the roots, in this article I’ll show how to implement a hash table, cover its use cases, and explain how reinventing the wheel can help you learn some modern tools.

What is a hash table?

Hash tables are a type of data structure that stores data in an associative manner. Depending on the programming language, a hash table is also known as a dictionary, hash map or just map.

It uses an array under the hood: each index is generated by applying a hashing algorithm to the key and is associated with a value. The hashing algorithm is crucial because it maps keys of any type into valid positions within the array. As a result, hash tables are among the most efficient data structures for search, insertion and deletion when you already know the key of a specific piece of data.

Usually, hash tables have the following operations:

interface HashTable<U, V> {
	void put(U key, V value);
	V remove(U key);
	V get(U key);
}

Diving deeper into implementation details, we’ll explain what hashing algorithms mean and why this is a core component of this data structure.

Hashing algorithms

Hashing is the process of converting a key of any type - strings, integers, whatever - to a hash value by using a hashing function. Additionally, it transforms an input of any size into a fixed-size output.

A hash value is relevant to - but not limited to - data indexing and retrieval. Since hash tables use an array behind the scenes, this value becomes a specific index within that array.

The main requirements of a good hashing function are:

  • Fewer collisions: collision means we’ve generated the same hash value. A good hashing function produces as few duplicated hash values as possible. The idea is to minimise clustering.
  • Uniform distribution: each value has an equal probability of being generated. Non-uniform distribution can increase the number of collisions.
  • Easy and fast to compute
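
To see why uniform distribution matters, here's a toy sketch (my own illustration, not from the article) using the division method on even keys only: a table size that shares a factor with the keys clusters everything into half the buckets, while a prime size spreads the same keys evenly.

```java
public class DistributionDemo {
    // Count how many buckets get used when hashing only even keys (0..100)
    static int occupiedBuckets(int tableSize) {
        boolean[] used = new boolean[tableSize];
        for (int key = 0; key <= 100; key += 2) {
            used[key % tableSize] = true;
        }
        int occupied = 0;
        for (boolean b : used) {
            if (b) occupied++;
        }
        return occupied;
    }

    public static void main(String[] args) {
        // Table size 10 shares factor 2 with every key: only even indices fill up
        System.out.println(occupiedBuckets(10) + "/10 buckets used"); // 5/10
        // A prime table size spreads the same keys across all buckets
        System.out.println(occupiedBuckets(11) + "/11 buckets used"); // 11/11
    }
}
```

That's one reason prime numbers are a popular choice for hash table sizes.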

Several methods cover those requirements, and the most common is the division method, where you can use the modulo operator to produce a hash value. The operator ensures we’ll have valid ranges for the hash table’s array.

private int hash(int key, int arrayLength) {
	return key % arrayLength;
}

Applying it to the key 100 in a hash table whose array size is 21, we would have:

int hash = hash(100, 21);
// hash = 16

That means: we’ll store the data associated with key 100 at position 16 within the array.

However, keys can be of any type, such as strings. So, how can we apply a hashing function to a string using the division method? To achieve that, we transform each character of an alphanumeric key into its ASCII code.

For example, say we have BENE as a key, and its character codes are, respectively, 66, 69, 78 and 69. Summing up those values and applying the hashing function to the result, we’d have:

int sum = 66 + 69 + 78 + 69; // 282 
int key = hash(sum, 21); // key = 282 % 21
// key = 9

Therefore, the data associated with the key BENE will sit at position 9.

Looking at this example, it’s clear that we may see the same hash values for different keys. That’s the meaning of what we know as collisions, and most likely, you will have to deal with them.
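
Sticking with the ASCII-sum approach above, a quick sketch (a hypothetical helper, not the article's implementation) shows two different keys landing at the same index in a 21-slot table:

```java
public class CollisionDemo {
    // Division-method hash over the sum of ASCII codes, as in the examples above
    static int hash(String key, int arrayLength) {
        int sum = 0;
        for (char c : key.toCharArray()) {
            sum += c;
        }
        return sum % arrayLength;
    }

    public static void main(String[] args) {
        System.out.println(hash("BENE", 21)); // 282 % 21 = 9
        System.out.println(hash("BE", 21));   // 135 % 21 = 9 -> same index: a collision
    }
}
```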

Collision resolution

A collision will most likely happen unless you find the perfect hashing algorithm for a determined case. Even though there are many techniques to solve it, the two most common collision resolution techniques are separate chaining and open addressing. Separate chaining uses a linked list to store data with the same hash value as a key, whereas open addressing uses the hash table itself as storage. Like almost everything in life, both have pros and cons.

In this article, we’ll focus on the separate chaining technique.

Separate chaining

As I’ve mentioned before, separate chaining uses a linked list to store data with the same hash value as a key. Then, every time the hashing function produces the same hash, the algorithm groups all values in the same bucket using a linked list.
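
As a toy illustration of the idea (using java.util.LinkedList buckets, not the implementation we'll build below), the keys 9, 135 and 282 all hash to index 9 in a 21-slot table and end up chained together:

```java
import java.util.LinkedList;

public class ChainingSketch {
    // Build buckets for the given keys and return the chain length at one index
    @SuppressWarnings("unchecked")
    static int chainLength(int[] keys, int capacity, int index) {
        // One bucket per slot; each bucket is a linked list of [key, value] pairs
        LinkedList<int[]>[] buckets = new LinkedList[capacity];
        for (int key : keys) {
            int i = key % capacity;
            if (buckets[i] == null) {
                buckets[i] = new LinkedList<>();
            }
            buckets[i].add(new int[]{key, key * 10}); // arbitrary values
        }
        return buckets[index] == null ? 0 : buckets[index].size();
    }

    public static void main(String[] args) {
        // 9 % 21 == 135 % 21 == 282 % 21 == 9, so all three keys share bucket 9
        System.out.println(chainLength(new int[]{9, 135, 282}, 21, 9)); // prints 3
    }
}
```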

This approach has its advantages. With this technique, hash tables never get full, and we can always add or remove elements. That’s the main reason it’s typically used when we don’t know in advance how many keys we’ll have.

On the other hand, using separate chaining means the hash table is not cache-friendly as keys are stored as linked list elements. Also, searching performance can be significantly worse if it has too many collisions, reaching a time complexity of O(n) where n is the number of elements with the same hash value.

Yet, there are ways to improve performance by using other data structures to handle the chaining, such as a self-balancing BST (Binary Search Tree), but that’s beyond the scope of this article, in which we’ll stick to a linked list.

Implementation details

To illustrate the definitions we’ve covered, let’s focus on implementing the following operations:

  • Insert a new element
  • Remove an element by key
  • Search for the element
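
The snippets below reference a Node class and a one-argument hash method that aren't shown in the article; a minimal sketch of that surrounding structure (the class and field names here are my assumptions, not taken from the original source) might look like:

```java
// Sketch of the structure the operations below assume; names are assumptions.
public class SeparateChainingHashTable<U, T> {

    static class Node<U, T> {
        final int hashCode; // cached hash of the key
        final U key;
        T value;
        Node<U, T> next;    // next node in the same bucket's chain

        Node(int hashCode, U key, T value) {
            this.hashCode = hashCode;
            this.key = key;
            this.value = value;
        }
    }

    private final Node<U, T>[] array; // one bucket per index
    private int size;

    @SuppressWarnings("unchecked")
    public SeparateChainingHashTable(int capacity) {
        array = new Node[capacity];
    }

    // Division method over the key's hashCode, kept within array bounds
    private int hash(U key) {
        return Math.abs(key.hashCode() % array.length);
    }

    public int size() {
        return size;
    }

    // put, remove and get from the following sections would live here
}
```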

Insert

Storing a new value with the separate chaining technique means inserting it into the linked list located in a bucket. Then, when a collision happens, we’ll store the element as the last element of this linked list.

public void put(U key, T value) {
    int index = hash(key);
    Node<U, T> node = new Node<>(key.hashCode(), key, value);
    Node<U, T> head = array[index];
    if (head == null) {
        // Empty bucket: the new node starts the chain
        array[index] = node;
    } else {
        // Collision: walk the chain, updating the value if the key already exists
        Node<U, T> last = null;
        while (head != null) {
            if (head.key.equals(key) && key.hashCode() == head.hashCode) {
                head.value = value;
                return;
            }

            last = head;
            head = head.next;
        }

        // Key not found: append the new node to the end of the chain
        last.next = node;
    }

    size++;
}

This algorithm has the following steps:

  • Apply hashing function to the key
  • Check if there’s a collision
  • If so, find the bucket and push the element to the linked list
  • Otherwise, build a new linked list with the element

Remove

Removal follows pretty much the same steps as insertion. However, instead of pushing a new element onto the linked list, we remove it.

public T remove(U key) {
    int index = hash(key);
    Node<U, T> head = array[index];
    if (head == null) return null; // empty bucket: nothing to remove

    Node<U, T> prev = null;
    while (head != null) {
        if (head.key.equals(key) && key.hashCode() == head.hashCode) {
            Node<U, T> aux = head;
            if (prev != null) {
                // Unlink the node by pointing the previous node to the next one
                prev.next = head.next;
            } else {
                // Removing the chain's head: the next node becomes the bucket's head
                array[index] = head.next;
            }

            size--;
            return aux.value;
        }

        prev = head;
        head = head.next;
    }

    return null; // key not found
}

This operation has the following steps:

  • Apply hashing function to the key
  • Check if the hash value exists in the hash table
  • If so, find the element in the linked list.
  • Remove a key-value pair from the linked list by pointing the previous to the next node.
  • Return the removed value

Search

Lookups might be the most straightforward operation in a hash table. After applying the hashing function to the key, we directly access that index within the array, if it exists. Otherwise, the process returns nothing.

The tricky part happens when there’s a collision: to find the correct key, we need to traverse the linked list. For this reason, too many collisions can lead to poor performance.

public T get(U key) {
    int index = hash(key);
    Node<U, T> node = array[index];
    if (node == null) return null; // empty bucket: key not present

    // Traverse the chain until the matching key is found
    while (node != null) {
        if (node.key.equals(key) && key.hashCode() == node.hashCode) {
            return node.value;
        }

        node = node.next;
    }

    return null; // key not found
}

Real-world usages

Algorithms and data structures are at the core of many modern tools we use every day. However, as a software engineer, you may think they’re too distant from the real world since you don’t use them directly in your daily work.

Hash tables are often related to databases and cache data stores such as Redis and Memcached, but there’s more you can do by learning this data structure.

Many modern languages, like Python and Kotlin, already have hash table implementations in their standard libraries, often known as maps or dictionaries. Additionally, they often ship more than one implementation, each with specific uses.

For example, this article demonstrated how a hash table works without any particular concern for thread safety. Those standard libraries, though, include hash table implementations that support it.
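
Java is no different: java.util.concurrent.ConcurrentHashMap is its standard thread-safe hash table. A quick sketch of why that matters, with two threads hammering the same key:

```java
import java.util.concurrent.ConcurrentHashMap;

public class ThreadSafeExample {
    // Two threads increment the same key; ConcurrentHashMap.merge is atomic,
    // so no updates are lost along the way
    static int countConcurrently(int incrementsPerThread) throws InterruptedException {
        ConcurrentHashMap<String, Integer> counters = new ConcurrentHashMap<>();
        Runnable increment = () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                counters.merge("hits", 1, Integer::sum);
            }
        };
        Thread t1 = new Thread(increment);
        Thread t2 = new Thread(increment);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        return counters.get("hits");
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countConcurrently(1_000)); // prints 2000
    }
}
```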

Finally, don’t underestimate learning the fundamentals. They’re more relevant than the latest web frameworks, and with them under your belt, you can pick up any framework when needed.