Split-Ordered Lists: Lock-Free Extensible Hash Tables

How split-ordered lists build a lock-free, extensible hash table: ordering items by the bit-reversed value of their hash, using dummy nodes as stable bucket anchors, and a segmented bucket array to grow without moving items.

Hash tables are a key building block of many high-performance systems. Over the decades, there has been a huge amount of research on hash tables, with each design trying to optimize a different part of the structure. One widely studied area is how to make a hash table extensible, meaning how to adjust the number of buckets efficiently as the number of stored items grows.

At its core, a hash table has two main components: a hash function and a backing array. Each slot in the backing array is usually called a bucket. To store or look up an item, we first compute its hash value, then map that hash value to one of the buckets.

Conceptually:

hash_value = hash(key)
bucket_index = bucket_allocator(hash_value, bucket_count)
hash_value = hash(key)
bucket_index = bucket_allocator(hash_value, bucket_count)

Different keys may still end up in the same bucket. This is known as a hash collision. Because of this, a hash table also needs a collision-resolution strategy. One common strategy is separate chaining. In this approach, each bucket points to a collection of items that map to the same bucket, usually implemented as a linked list.

bucket[1] -> A -> B -> C
bucket[1] -> A -> B -> C

This keeps the hash table simple and flexible: if multiple items collide, we can attach them to the same bucket instead of finding another slot in the backing array.

For this post, we will focus on separate chaining and the problem of making a chained hash table grow efficiently.

This leads us to split-ordered lists, a technique for building lock-free extensible hash tables.

On multiprocessor machines, concurrent access makes resizing harder. Multiple threads may read from or write to the hash table at the same time. To grow the table, we usually need to create a larger bucket array, then move or copy items from the old array to the new one.

At the first glance, this sounds straightforward. But while resizing is happening, other threads may still be reading, inserting, updating or deleting items in the old table. Preserving correctness during this transition is not trivial. There are a few common approaches:

  • Block all operations during resizing: This is simple, but it makes the hash table unavailable while items are being moved. If the table is small, this may be acceptable. But if the table contains many items, the pause can become a serious latency problem.
  • Allow reads, but block writes: This works better for read-heavy workloads, but write latency can still suffer. In many real systems, this approach may still be acceptable because the read path is often much hotter than the write path.
  • Dual-write to both the old and new table: This allows migration to happen in the background, but now we need to guarantee that updates, inserts, and deletes are reflected correctly in both places. Otherwise, the new table may miss an update or bring back a deleted item.

In general, these approaches need some form of coordination during resizing. Often, that means using locks or blocking parts of the system. This can suffer from the usual drawbacks of blocking synchronization, such as lock delays, priority inversion, or even deadlocks if the protocol is not designed carefully.

So, a question arises:

Can we build an extensible hash table without relying on blocking synchronization?

This leads us to the idea of a lock-free extensible hash table.

For a lock-free hash table, the hard part is not allocating the new bucket array but is dealing with existing items without blocking other operations.

There are two obvious options: copy the items or move the items. Both are problematic.

  • Copying items can create stale data. For example, one thread may copy key K to the new table while another thread updates or deletes K in the old table. Without extra coordination, the new table may lose the update or bring back a deleted key.
  • Moving items is also difficult. In a chained hash table, moving K means removing it from one linked list and inserting it into another. That requires multiple pointer updates. But lock-free algorithms usually rely on CAS, which updates only one memory location at a time. During the move, K may temporarily disappear, appear twice, or corrupt one of the lists.

So the lock-free version is hard because we cannot simply hide these intermediate states behind a global lock. The data structure must remain correct even while operations observe and modify it concurrently.

This is where split-ordered lists come in.

The idea is to rethink the relationship between buckets and items. Instead of treating each bucket as an independent container that owns its items, split-ordered lists organize items in a way that makes bucket growth easier to manage.

At a high level, the design combines two pieces:

  • a lock-free linked list to store the items
  • a bucket directory to provide efficient entry points into that list

This gives us a different way to build an extensible hash table, where growing the table does not have to follow the traditional copy-or-move approach.

In the next section, we will go through the core design step by step.

The main problem that we described in the previous section, β€œWhy lock-free extensible hash table is hard”, is about moving or copying items within a bucket while preserving correctness.

In a normal separate chaining implementation, each bucket usually points to a linked list which contains items belonging to that bucket.

Separate chaining: each bucket points to a linked list of its items

In other words, the bucket points to the first element of a linked list, which contains multiple items. We know that, on resizing, some items in that linked list will be allocated to another bucket, aka moving to another linked list. The hard part is that we need to deal with multiple updates - update the original bucket, and the new buckets of the items. In order to achieve lock-free, we need to find a way to reduce the number of updates, and also make those updates can be done atomically.

There are two things involved here: buckets and items.

A bucket may point to many items, but each item is just one node in the data structure. If we try to move items during resizing, we may need to coordinate many node updates. But if we keep the items fixed and only add new bucket entry points, the resize operation becomes much smaller.

So the core idea is:

Instead of moving items to match the new bucket layout, keep all items in one ordered list and create new bucket entry points into that list.

In order to do so, we need to address following challenges:

  • We need a lock-free list which can support search / insert / delete.
  • We need to find an ordering for the list that matches how bucket membership changes when the table grows. Ideally, items that belong to the same bucket should appear as a contiguous range in the list. After resizing, if some of those items now belong to a different bucket, they should also form a contiguous sub-range. This way, a new bucket can point to the beginning of its range instead of collecting items from different places in the list.

Let’s see how we can resolve the challenges mentioned above.

The first requirement is that the item list itself must be lock-free. Since all items will be stored in this list, every hash table operation eventually depends on it.

At a high level, we need the list to support:

  • get: find an item by its ordered key
  • insert: add a new item at the correct position
  • delete: remove an item
  • update: modify the value of an existing item

A common way to build such a list is to use a CAS-based ordered linked list, such as the Harris-Michael lock-free linked list.

The idea is similar to a normal sorted linked list: before inserting a node, we search for the correct position, then link the new node between its predecessor and successor.

prev -> next
prev -> new_node -> next
prev -> next
prev -> new_node -> next

The difference is that the pointer update is done with CAS:

CAS(&prev.next, next, new_node)
CAS(&prev.next, next, new_node)

This means:

update prev.next to new_node only if it still points to next.

If another thread changes the list at the same time, the CAS fails and the operation retries.

Delete is usually done in two steps. First, the node is logically deleted by marking its next pointer. Then it is physically removed by unlinking it from the list.

prev -> node -> next

prev -> node(marked) -> next

prev -> next
prev -> node -> next

prev -> node(marked) -> next

prev -> next

This two-step delete allows other threads to notice that a node has already been deleted and help unlink it.

So the lock-free linked list gives us the basic building block: a shared ordered structure where multiple threads can search, insert, and delete nodes without taking a global lock.

Before thinking about how to order the items, we should first ask: Can we make bucket splitting predictable?

If the table can grow to an arbitrary size, an old bucket may be redistributed in a less structured way. We may still be able to initialize the new bucket pointers independently, but it becomes much harder to find a single global ordering that keeps bucket ranges clean across multiple resizes.

At this point, doubling and modulo become useful.

If we use the modulo to determine the bucket:

bucket_index = hash(key) % number_of_bucket
bucket_index = hash(key) % number_of_bucket

We require the bucket count to always be a power of 2. On each resize, we double the size of the bucket array from NN to 2N2N.

The interesting property is:

Each old bucket ii can only be split into at most two buckets ii and i+Ni + N

Let’s call vv is the hash(key)hash(key) value of item xx.

Before resizing, xx belong to bucket ii if:

i=vβ€Šmodβ€ŠN,0≀i<Ni = v \bmod N,\quad 0 \le i < N

After resizing, the new bucket index for the item xx becomes:

j=vβ€Šmodβ€Š2N=(kN+i)β€Šmodβ€Š2Nj = v \bmod 2N = (kN + i) \bmod 2N

Now there are only 2 cases:

  • If kk is even, then j=(2mN+i)β€Šmodβ€Š2N=ij = (2mN + i) \bmod 2N = i, because i<N<2Ni < N < 2N
  • If kk is odd, then j=((2m+1)N+i)β€Šmodβ€Š2N=(2mN+N+i)β€Šmodβ€Š2N=i+Nj = ((2m + 1)N + i) \bmod 2N = (2mN + N + i) \bmod 2N = i + N, because i<N<2Ni < N < 2N

Therefore

j∈{i, i+N}j \in \{i,\ i + N\}

So, up until this point, we agree that: We’ll use power of 2 as the bucket count, and always resize by doubling.

Now we need to solve the second requirement: how should we order items in the list so that bucket splits become clean?

As the bucket count is always power of two, which means, we can always have the bucket count in this form

2i2^i

When using modulo to determine the bucket, then an item with hash value vv belongs to bucket bb when:

vβ€Šmodβ€Š2i=bv \bmod 2^i = b

Because the bucket count is 2i2^i, this modulo operation only depends on the lowest ii bits of vv.

For example:

If the table size is: 22=42^2 = 4

Then we use the lowest 2 bits of the hash. The possible bucket indices are:

bucket 0 -> 00β‚‚
bucket 1 -> 01β‚‚
bucket 2 -> 10β‚‚
bucket 3 -> 11β‚‚
bucket 0 -> 00β‚‚
bucket 1 -> 01β‚‚
bucket 2 -> 10β‚‚
bucket 3 -> 11β‚‚

So an item belongs to bucket bb if the lowest ii bits of its hash are equal to bb written as an ii-bit binary number.

For example: 9(10012)9 (1001_2) and 13(11012)13 (1101_2)

  • When table size is 4=224 = 2^2, they’re in the same bucket 11 (as their lowest 2 bits are all 012)01_2)
  • When the table size is 8=238 = 2^3, 99 stays in bucket 1(0012)1 (001_2), while 1313 moves to bucket 5(1012)5 (101_2).

This also gives us an important observation:

When the table doubles, we reveal one more bit of the hash

If the newly revealed bit is 00, the item stays in the old bucket. If the newly revealed bit is 11, the item moves to the new sibling bucket.

The observation we got from previous section gives us the idea of using binary representation for ordering.

Bucket splitting is decided by the hash bits from right to left: first the lowest bit, then the second-lowest bit, then the third-lowest bit, and so on. But a normal sorted list compares numbers from left to right, starting from the most significant bit. That means normal numeric order does not match the order in which buckets split. So we reverse the bits. Which leads us to:

sorting items according to the bit-reversed value of their hash

For example, we have following hash values:

1, 3, 5, 7, 9, 11, 13, 15
1, 3, 5, 7, 9, 11, 13, 15
hash_valuebinaryreversedreversed value
1000110008
9100110019
50101101010
131101101111
30011110012
111011110113
70111111014
151111111115

So the order is:

1 -> 9 -> 5 -> 13 -> 3 -> 11 -> 7 -> 15
1 -> 9 -> 5 -> 13 -> 3 -> 11 -> 7 -> 15

Let’s walk through example with different bucket count

When the bucket count is 2

Bit-reversed ordered list with 2 buckets, bucket 1 points to the first item

All the items belong to the bucket 1, hence, the bucket 1 points to 1 - the first item in the list.

When the bucket count is 4

Same list with 4 buckets, items split between buckets 1 and 3

Items 1, 9, 5, 13 remain in bucket 1, while 3, 11, 7, 15 move to bucket 3=1+213 = 1 + 2^1

When the bucket count is 8

Same list with 8 buckets, items split across buckets 1, 3, 5 and 7

Items 1, 9 remain in bucket 1, 5, 13 move to bucket 5=1+225 = 1 + 2^2, 3, 11 still belong to bucket 3, meanwhile, new bucket of 7, 15 is bucket 7=3+227 = 3 + 2^2

So the larger the bucket array becomes, the more precise the bucket entry points are. The items themselves do not need to move; the list order already matches the way buckets split.

With only bucket pointers, insertion and deletion may need to update two places: the lock-free linked list and the bucket pointer itself (in case it’s the first element of the bucket).

For example, suppose a bucket points directly to the first real item in its range:

bucket[i] -> A -> B -> C
bucket[i] -> A -> B -> C

If A is deleted, we need to do two things:

  • delete A from the lock-free linked list
  • update bucket[i] to point to B

The same issue can happen on insertion. If we insert a new item whose order is smaller than the current first item of the bucket, then the bucket pointer also needs to be updated.

So the bucket pointer becomes part of the insertion and deletion protocol. That makes the lock-free algorithm more complicated, because now each operation may need to coordinate updates to both the list and the bucket directory.

To avoid this, split-ordered lists introduce a dummy node for each bucket.

Instead of pointing to the first real item, each bucket points to its own dummy node. The dummy node is inserted into the same lock-free linked list as the real items, but it does not store an actual key-value pair. It only acts as a stable entry point for the bucket.

It looks like this:

Each bucket points to its own dummy node inserted into the shared ordered list

Now, if for example item 33 is deleted, we can just update the next pointer of the dummy node to 1111. That operation can be done using CAS too!

The same thing with insertion.

With that Dummy node, every bucket has its own permanent anchor point. So new insertion / deletion, we can perform entirely using the lock-free list, without touching the bucket pointer.

There is still some extra complexity: before using a bucket for the first time, we need to initialize its dummy node and insert it into the correct position in the list. But this initialization happens once per bucket. After that, the bucket pointer remains stable.

But this introduces a new challenge:

As we still insert the Dummy node to the lock-free linked list along with the real items, how can we:

  • Distinguish the Dummy node from the real node
  • More importantly, how do we make sure the dummy node is always placed before all real items in that bucket?

Recall that items are ordered by the bit-reversed value of their hash. Items in the same bucket share the same low-order bits, so after bit reversal, they share the same prefix in the ordered list.

Split-ordered lists use a small encoding trick to distinguish dummy nodes from real item nodes.

If we use a 64-bit value for the hash value, in reality, we don’t have that many buckets, as 264β‰ˆ1.8Γ—10192^{64} \approx 1.8 \times 10^{19} buckets. So we can reserve one bit as a marker bit.

One way to think about it is:

dummy node -> marker bit = 0
real node -> marker bit = 1
dummy node -> marker bit = 0
real node -> marker bit = 1

Then we compute the ordering key by reversing the bits.

Conceptually

dummy_key(bucket) = reverse(bucket_bits with marker 0)
real_key(item) = reverse(hash_bits with marker 1)
dummy_key(bucket) = reverse(bucket_bits with marker 0)
real_key(item) = reverse(hash_bits with marker 1)

Let’s look at a small example using 4-bit values.

Assume the table size is 44, so we use the lowest 22 bits to identify buckets. Bucket 11 is represented as:

01β‚‚
01β‚‚

Now consider two real items whose hashes belong to bucket 11:

9 = 1001β‚‚
13 = 1101β‚‚
9 = 1001β‚‚
13 = 1101β‚‚

Both have the lowest two bits 0101, so they are in bucket 11 when the table size is 44.

For real items, we mark them as real by setting the marker bit to 1 before reversing. For the dummy node, we use marker bit 0. (For simplicity, the bit-reversed ordering example above ignored this marker bit and reversed the raw hash; here we add the marker so dummy and real nodes can be distinguished.)

Conceptually, the values look like this:

nodeencoded bits before reversereversed order key
dummy(1)00011000
item(9)10011001
item(13)11011011

So the order in the linked list becomes:

dummy(1) -> item(9) -> item(13)
dummy(1) -> item(9) -> item(13)

So far, we assumed that the bucket directory is stored as one big array of bucket pointers.

But this introduces another problem.

Even if split-ordered lists avoid moving the actual items during resizing, we may still need to resize the bucket array itself. With a normal array, growing from N to 2N means allocating a new array and copying the old bucket pointers into it.

During that process, other threads may need to coordinate with the resize operation, or even wait until the new array is published.

To avoid this, split-ordered lists use a segmented bucket array.

Instead of storing all bucket pointers in one contiguous array, we split the bucket directory into fixed-size segments. You can think of it as a 2D array:

segments[segment_index][offset]
segments[segment_index][offset]

Each segment contains a fixed number of bucket pointers.

Segmented bucket directory modelled as a 2D array of fixed-size segments

For example, suppose the segment size is 2.

When the table has 2 buckets, we only need one segment:

segment 0: bucket[0], bucket[1]
segment 0: bucket[0], bucket[1]

When the table grows to 4 buckets, we do not need to copy the old segment. We only allocate one more segment:

segment 0: bucket[0], bucket[1]
segment 1: bucket[2], bucket[3]
segment 0: bucket[0], bucket[1]
segment 1: bucket[2], bucket[3]

When the table grows to 8 buckets, we can add more segments:

segment 0: bucket[0], bucket[1]
segment 1: bucket[2], bucket[3]
segment 2: bucket[4], bucket[5]
segment 3: bucket[6], bucket[7]
segment 0: bucket[0], bucket[1]
segment 1: bucket[2], bucket[3]
segment 2: bucket[4], bucket[5]
segment 3: bucket[6], bucket[7]

The old bucket pointers stay where they are. Resizing only makes more bucket slots addressable.

Another benefit is that segments can be allocated lazily. Even if the logical table size is 8, we do not have to allocate all segments immediately. If no operation ever touches buckets 6 or 7, the segment containing those buckets can remain unallocated until it is actually needed.

So the segmented bucket array solves a different resizing problem:

  • split-ordered lists avoid moving items
  • bucket dummy nodes avoid changing bucket pointers after initialization
  • segmented arrays avoid copying the whole bucket directory during growth

Together, these ideas make resizing much more incremental.

I (with the help of GitHub Copilot) tried implementing it in Go. You can find the code & benchmark here: https://github.com/trunghieu99tt/splitorderedlist

Surprisingly, the benchmarks show that this implementation is nearly as good as (and sometimes even better than) Go's built-in sync.Map and a plain map guarded by an sync.RWMutex (RWMutexMap).

[1] Ori Shalev and Nir Shavit, Split-Ordered Lists: Lock-Free Extensible Hash Tables: https://dl.acm.org/doi/10.1145/1147954.1147958

[2] Lock-free Linked-list: https://timharris.uk/papers/2001-disc.pdf

Tagged:#Backend#Distributed System#Paper Notes
0