SegCache: A memory-efficient and scalable in-memory key-value cache for small objects

SegCache is an in-memory key–value cache built for small objects, improving memory efficiency, throughput, and multi-core scalability with TTL-bucketed segments, compact hash metadata (bulk chaining), and segment-level proactive expiration plus merge-based eviction.

In-memory key-value caches are widely used to deliver low-latency, high-throughput services. They are typically evaluated along three dimensions:

  • Memory efficiency: The amount of memory the cache needs to achieve a given miss ratio.
  • Throughput: Measured in queries per second (QPS) per CPU core.
  • Scalability: How well the cache can utilize multiple cores on a host.

There have been several efforts to optimize these properties of in-memory caches: some focus on improving throughput, while others try to optimize memory efficiency. Even so, there are common challenges that even the most popular in-memory caches still face.

Cache TTLs serve several purposes:

  • Limit data inconsistency: If a cache refresh fails, the TTL bounds how long stale data can be served; without a TTL, the stale data would keep being served until the next successful refresh runs.
  • Periodic re-computation: For example, a recommendation service may only want to reuse cached results within a time window, and recompute periodically to come up with newer content.
  • Save memory: With a TTL attached, the cache knows which data can safely be deleted to reclaim memory.

Having a TTL brings many benefits, but how do we efficiently track expiration and remove expired data? Based on the approach, there are two main categories:

Lazy expiration

In this approach, cached data is removed only when it is re-accessed and its TTL has already expired.

Advantage:

  • No background overhead: expiration is only checked on access.

Disadvantage:

  • Expired data is removed only when re-accessed; if an object is never accessed again, it stays in memory and occupies storage indefinitely.

Proactive expiration

This mechanism reclaims memory occupied by expired objects more quickly. There are several popular approaches:

  • Checking the LRU tail:
    • Idea: Examine a fixed number of objects at the tail of the LRU and remove the expired ones.
    • Problem: Too optimistic; it can't guarantee timely removal of expired objects, since they may sit in the middle of the LRU, far from the tail.
  • Full cache scan (adopted by Memcached and CacheLib):
    • Idea: Periodically scan all cached objects and remove the expired ones.
    • Problem: The more frequently the scan runs, the fewer expired objects linger in memory, but the more CPU and memory bandwidth it consumes.
  • Random sampling (adopted by Redis):
    • Idea: Periodically sample a subset of objects and remove the expired ones. In Redis, if the fraction of expired objects in the sample is above a threshold, the process repeats.
    • Problem: Users have to accept that sampling can only keep the fraction of expired objects around a pre-configured threshold.

Objects in most prevalent cache systems are small. For example, the mean object sizes (key + value) of Twitter's top four production clusters are 230, 55, 294, and 72 bytes, respectively. Similar patterns have been observed at Facebook and Reddit.

But apart from the object data itself, caches need extra per-object metadata for cache operations, and this metadata consumes a lot of memory. For example, Memcached stores 56 bytes of metadata with each object: 2 × 8-byte LRU pointers, an 8-byte hash pointer, a 4-byte access time, a 4-byte expire time, an 8-byte object size, and an 8-byte CAS value (compare-and-swap, used for atomic updates).

Systems that directly use an external memory allocator (e.g., malloc), such as Redis, are vulnerable to external memory fragmentation and out-of-memory (OOM) errors.

For example, in Redis it's the OS-level heap allocator's job to allocate memory for each Redis object. The problem arises when an object is deleted: the heap memory it occupied is freed, leaving holes between the still-occupied regions. This is memory fragmentation.

a. External memory allocator — figure 1

Memory fragmentation increases the chance of OOM even when there is enough total free space for the object. Why? Say we want to allocate a 5 KB object, and the free regions are 3 KB and 4 KB. In total we have 7 KB free, but no contiguous region is large enough to fit 5 KB, so the allocation fails with OOM.

a. External memory allocator — figure 2

Systems such as Memcached use a slab-based memory allocator to mitigate external fragmentation and reduce allocation overhead. Memory is partitioned into predefined slab classes, where each class corresponds to a fixed object size. Each slab class manages one or more slabs, and each slab is further divided into equal-sized slots.

When a new object arrives, Memcached selects the smallest slab class whose slot size can accommodate the object and places it into a free slot within that class. If no free slot is available, the system evicts an existing object from the same slab class (using LRU) and reuses its slot. In effect, each slab class behaves like a size-specific object pool.

By allocating and recycling fixed-size slots, slab allocation avoids frequent calls to the general-purpose allocator and eliminates external fragmentation. This significantly reduces allocation and deallocation overhead compared to malloc-based designs, at the cost of potential internal fragmentation due to size rounding.
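The class-selection step above is easy to sketch in Go. The class sizes below are illustrative, not Memcached's actual growth-factor-derived classes.

```go
package main

import "fmt"

// slabClasses is a hypothetical set of fixed slot sizes, one per class.
var slabClasses = []int{64, 128, 256, 512, 1024}

// slabClassFor returns the smallest slab class whose slot can hold an
// object of the given size, or -1 if the object is too large for any class.
func slabClassFor(objSize int) int {
	for _, c := range slabClasses {
		if objSize <= c {
			return c
		}
	}
	return -1
}

func main() {
	fmt.Println(slabClassFor(32))  // a 32-byte object lands in the 64-byte class
	fmt.Println(slabClassFor(300)) // rounds up to 512: internal fragmentation
}
```

The gap between object size and slot size (32 vs 64, 300 vs 512) is exactly the internal fragmentation discussed next.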

b. Slab-based allocator — figure 3

But it also has some downsides:

  • Internal memory fragmentation: Memory slots are fixed-size while requested objects have arbitrary sizes, so it's common for a 32-byte object to occupy a slot in the 64-byte class.
  • Wasted unused memory: If one object size dominates the requests, its slab class can be overfull and evict slots frequently, while other classes still have free space; that free space is effectively wasted.

This is where SegCache comes in: it targets small-object workloads and optimizes memory efficiency, throughput, and scalability. Next, we'll walk through the mechanisms that make this possible.

Here’s the overview of Segcache.

II. Components — figure 4

The cache heap is divided into segments, each holding objects with similar expiration times (regardless of their size). Objects are always appended to the end of a segment. If the segment is full, a new segment is allocated and appended to the chain (if no free segment is available, eviction is triggered, discussed later).

Note:

  • All segments have the same size
  • 📔 SegCache preallocates a fixed number of segments based on the heap memory passed in at initialization. The number of segments is heap size / segment size.

Objects written to segments are read-only (they can't be updated in place) once added, except for the atomic incr/decr operations.

1. Segments — figure 5

Each TTL bucket can have many segments, sorted by creation time; each segment links to the next one, forming a linked list called the segment chain.

The TTL bucket keeps pointers to the head (the first / oldest) and the tail (the last / newest) segment of the chain.

SegCache breaks the spectrum of possible TTLs into ranges called TTL buckets. Each TTL bucket contains the objects whose TTL falls between t₁ and t₂; those objects are treated as having TTL t₁.

SegCache uses 1024 TTL buckets, divided into 4 groups. From one group to the next, the bucket width (t₂ - t₁) grows by a factor of 16. When an item is written to the cache, SegCache computes which TTL bucket its TTL belongs to (using a fast bitwise computation for efficiency).
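As an illustration of that bitwise computation, here is a hedged Go sketch. The group widths (8 s, 128 s, 2048 s, 32768 s) are made-up powers of two that grow 16× per group; they are not SegCache's exact constants.

```go
package main

import "fmt"

// ttlBucketIndex maps a TTL (in seconds) to one of 1024 buckets split into
// 4 groups of 256 buckets each; each group's bucket width is 16x the
// previous group's (here: 2^3, 2^7, 2^11, 2^15 seconds). Power-of-two
// widths let the mapping be a shift instead of a division.
func ttlBucketIndex(ttl uint32) int {
	switch {
	case ttl>>3 < 256: // group 0: 8-second-wide buckets
		return int(ttl >> 3)
	case ttl>>7 < 256: // group 1: 128-second-wide buckets
		return 256 + int(ttl>>7)
	case ttl>>11 < 256: // group 2: 2048-second-wide buckets
		return 512 + int(ttl>>11)
	default: // group 3: 32768-second-wide buckets
		idx := int(ttl >> 15)
		if idx > 255 {
			idx = 255 // clamp very long TTLs into the last bucket
		}
		return 768 + idx
	}
}

func main() {
	fmt.Println(ttlBucketIndex(7))    // short TTLs share the first bucket
	fmt.Println(ttlBucketIndex(2048)) // falls into group 1
}
```

All objects mapped to the same bucket are treated as having the bucket's lower-bound TTL, which is the early-expiration trade-off discussed later.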

2. TTL buckets — figure 1

SegCache stores object data in segments; it uses a hash table to store each object's location plus a small amount of extra metadata. From now on, we'll call each item in the hash table a hash bucket.

A common problem in hash tables is hash collision: when multiple keys hash to the same bucket index (remember, a hash table is just an array underneath), they must somehow coexist in that single slot.

Traditional systems like Memcached solve this using object chaining: each bucket holds a linked list of objects, and collisions simply append new objects to the end of the list. This works, but introduces two major problems:

  • Extra metadata per object: Every object must carry a pointer to the next object in the chain.
  • Random access: Let’s say your target object is 7th in the chain, then you need to jump through 6 heap objects before reaching it. Each hop is a random access across the heap.

Since SegCache's hash table only stores each object's location in a segment (segment_id and offset), it can adopt a smarter scheme called bulk chaining, already used in systems like MICA and FASTER.

In SegCache, each hash bucket is 64 bytes, divided into 8-byte slots, so one hash bucket holds 8 slots. The first slot stores the hash bucket information, the following 6 slots store object metadata, and the last slot stores either object metadata or the index of the next hash bucket holding further objects with the same hash value.

  • The hash bucket information stores an 8-bit spin lock, an 8-bit slot usage counter, a 16-bit last-access timestamp, and a 32-bit CAS value (to coordinate concurrent updates). This information is shared across all the slots in the hash bucket.
  • Each remaining slot stores the following: a 12-bit tag, an 8-bit frequency counter, a 24-bit segment ID, and a 20-bit offset (the object's offset within the segment).
    • 📔 The tag is a value derived from the key's hash. During lookup, tags are compared first to find the matching object metadata in the hash table, so we usually don't need to access the actual object stored in the segment (which would be a random access!).
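The slot layout above packs exactly into one 8-byte word (12 + 8 + 24 + 20 = 64 bits). Here is a Go sketch of the packing; the helper names are hypothetical, but the field widths follow the description.

```go
package main

import "fmt"

// Field widths of one 8-byte hash-table slot, as described in the post:
// 12-bit tag | 8-bit frequency | 24-bit segment ID | 20-bit offset.
const (
	offsetBits = 20
	segIDBits  = 24
	freqBits   = 8
	tagBits    = 12
)

// packSlot packs the four fields into a single uint64.
func packSlot(tag, freq, segID, offset uint32) uint64 {
	return uint64(tag)<<(freqBits+segIDBits+offsetBits) |
		uint64(freq)<<(segIDBits+offsetBits) |
		uint64(segID)<<offsetBits |
		uint64(offset)
}

// unpackSlot extracts the four fields back out of the packed word.
func unpackSlot(s uint64) (tag, freq, segID, offset uint32) {
	offset = uint32(s & ((1 << offsetBits) - 1))
	segID = uint32((s >> offsetBits) & ((1 << segIDBits) - 1))
	freq = uint32((s >> (segIDBits + offsetBits)) & ((1 << freqBits) - 1))
	tag = uint32((s >> (freqBits + segIDBits + offsetBits)) & ((1 << tagBits) - 1))
	return
}

func main() {
	s := packSlot(0xABC, 42, 123456, 7890)
	tag, freq, segID, offset := unpackSlot(s)
	fmt.Printf("tag=%#x freq=%d seg=%d off=%d\n", tag, freq, segID, offset)
}
```

Packing everything into 8 bytes is what lets one 64-byte bucket (one cache line) describe up to seven objects.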

SegCache allocates more buckets than required by the hash function. The bucket array is logically split into two regions:

  • The first region consists of primary buckets, which are directly indexed using the hash value. Each primary bucket stores metadata for multiple objects that map to the same hash index.
  • The second region contains overflow buckets, which are never addressed by hashing. These buckets exist solely to support collision handling through bulk chaining. When a primary bucket runs out of slots, it links to an overflow bucket, which can store additional object metadata.
    Bulk chaining — figure 6

The bucket chain formed when multiple objects hash to the same value.

Bulk chaining — figure 7

Note:

  • If an overflow bucket runs out of slots, it links to another overflow bucket, until no available overflow bucket is left.
  • The hash bucket information is only stored in the first slot of the primary bucket. All following slots store either object metadata or a link to the next overflow bucket.
    • As a result, the spin lock / CAS value used to coordinate updates lives in the primary bucket and is shared across all hash buckets in the bucket chain!
  • Since the frequency counter uses 8 bits of the object metadata (stored in the hash bucket slot), the maximum frequency of an object is 127; the counter is only updated while it's below 127.
  • To find a free overflow bucket, SegCache uses an approach similar to linear probing: it maintains a pointer to the nearest unlinked overflow bucket, and advances that pointer whenever the bucket gets linked into a chain.
  • ⚠️ If there’s no available hash bucket left, the insertion fails!

This has some advantages:

  • No per-object next-pointer: Each bucket can store metadata for up to 7 objects while needing only one pointer to the next bucket, cutting the pointer count from 7 down to 1.
  • Cache-friendly: Each bucket is exactly 64 bytes, a single cache line. This means metadata for up to seven colliding keys is fetched together and fits nicely in the L1 cache (the fastest CPU cache level).
  • Fewer random accesses: During lookup, SegCache scans the bucket (already in cache) and compares tags. Only a matching tag requires a random access into the segment to verify the full key. Even when multiple keys share the same tag (which is rare), the number of random accesses is still dramatically lower than pointer-chasing through a linked list.

The read in SegCache has following steps:

  1. Hash lookup: A read first hashes the key and looks it up in the hash table.
  2. Tag check: After locating the hash bucket, the reader examines its slots, comparing the tag derived from the key's hash with each slot's tag. On a mismatch it advances to the next slot, or to the next bucket in the chain once it reaches the end of the current one.
  3. Segment access: If the tag matches, the reader uses the information in the slot (segment_id, offset) to read the object data from the segment. This requires no locking.
  4. Frequency update: The object’s access frequency counter is updated (we’ll go deeper into this in the Thread model and scalability section)
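Putting the read steps together, a simplified lookup could look like the following Go sketch. The `slot`/`bucket` types are hypothetical; real SegCache stores packed 64-bit words and does the full-key verification inside the segment.

```go
package main

import "fmt"

// slot is a hypothetical unpacked view of one hash-bucket slot.
type slot struct {
	tag           uint16 // 12-bit tag derived from the key hash
	segID, offset int
}

// bucket is one 64-byte hash bucket; next points to an overflow bucket.
type bucket struct {
	slots []slot
	next  *bucket
}

// lookup walks the bucket chain, comparing tags first; only a tag match
// triggers a "random access" into the segment to verify the full key.
func lookup(b *bucket, tag uint16, verify func(segID, offset int) bool) (int, int, bool) {
	for ; b != nil; b = b.next {
		for _, s := range b.slots {
			if s.tag != tag {
				continue // cheap in-cache-line comparison, no heap access
			}
			if verify(s.segID, s.offset) { // full-key check in the segment
				return s.segID, s.offset, true
			}
		}
	}
	return 0, 0, false
}

func main() {
	b := &bucket{slots: []slot{
		{tag: 1, segID: 3, offset: 40},
		{tag: 7, segID: 5, offset: 16},
	}}
	seg, off, ok := lookup(b, 7, func(segID, offset int) bool { return true })
	fmt.Println(seg, off, ok) // 5 16 true
}
```

The `verify` callback stands in for reading the full key out of the segment; it only runs on tag matches, which is the whole point of the tag.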

The write in SegCache has following steps:

  1. TTL bucket selection: The object’s TTL is mapped to the TTL bucket.
  2. Thread-local segment append: The thread handling the write appends the object data to its thread-local active segment for that TTL bucket (we'll go deeper into this in the Thread model and scalability section).
    1. If the segment becomes full in this step, the writer briefly locks the TTL bucket to link the segment to the tail of the bucket's global segment chain. After that, a new segment is allocated.
  3. Update hash table: After the object data is fully written, the writer updates the hash table to point the key to the new (segment_id, offset).

SegCache achieves low metadata overhead by sharing metadata across objects.

  • Segment level: Objects in the same segment share their creation time and TTL.
    • By sharing the TTL, some objects expire earlier than their actual TTL value, but according to the authors' evaluation, early expiration has a negligible impact on miss ratio.
  • Hash bucket level: Objects with the same hash value share the hash bucket information (spin lock, slot usage counter, last-access timestamp, CAS value).

In SegCache, objects within a segment are written sequentially and share the same TTL, and the segments linked in each TTL bucket are ordered by creation time. This makes it feasible to proactively remove expired objects in bulk.

A background thread wakes up (by default every second) and scans through the TTL buckets, checking each bucket’s oldest segment.

  • Check oldest segment: If the head segment’s creation time plus the bucket TTL has passed, SegCache removes the entire segment and all its objects in one batch operation.
  • Move to next segment: After freeing the expired segment, the thread moves to the next segment in the chain, until it reaches the end of the chain or a segment that is not yet expired.
  • Repeat for each TTL bucket: The expiration thread then proceeds to the next TTL bucket with the same process as above.
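The scan above can be sketched in a few lines of Go (hypothetical types): segments are freed wholesale from the head of the chain until an unexpired one is reached.

```go
package main

import "fmt"

// seg is a segment in a TTL bucket's chain, ordered by creation time.
type seg struct {
	createdAt int64 // unix seconds
	next      *seg
}

// bucketChain is one TTL bucket's segment chain; ttl is the bucket's TTL.
type bucketChain struct {
	ttl  int64
	head *seg
}

// expire frees expired head segments in bulk and returns how many were
// removed. Because the chain is ordered by creation time, it can stop at
// the first unexpired segment.
func (b *bucketChain) expire(now int64) int {
	freed := 0
	for b.head != nil && b.head.createdAt+b.ttl <= now {
		b.head = b.head.next // the whole segment is dropped in one step
		freed++
	}
	return freed
}

func main() {
	// Three segments created at t=0, 50, 120 in a bucket with TTL 100.
	s3 := &seg{createdAt: 120}
	s2 := &seg{createdAt: 50, next: s3}
	b := &bucketChain{ttl: 100, head: &seg{createdAt: 0, next: s2}}
	fmt.Println(b.expire(130)) // only the t=0 segment has expired by t=130
}
```

In real SegCache, freeing a segment also means invalidating the hash-table entries of its objects; the sketch omits that bookkeeping.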

💭 A potential challenge with grouping items by TTL: objects inserted later land in a segment that already contains slightly older entries, and expiration checks the segment's creation time (the time the first object was added). So even if a newly added object's own TTL has not fully elapsed yet, it's still removed as part of the segment removal. This is a trade-off SegCache accepts.

While expiration removes objects that cannot be used in the future, it's not enough; all caching systems also support eviction (removing objects that could potentially still be used) when necessary, to make room for new objects.

In SegCache, this happens when there is no free segment left to put a new object into.

SegCache uses a merge-based eviction algorithm. The basic idea is to merge some segments into one and only keep a relatively small portion of the objects that are more likely to be accessed in the future.

Segments are always merged within a single TTL bucket. SegCache takes the first N consecutive, unexpired, and not-yet-merged (in the current iteration) segments. The new segment inherits the creation time of the oldest evicted segment. By doing so:

  • The new segment can be placed in exactly the same place as the evicted segment, so the segment chain order is still retained.
  • Merging consecutive segments guarantees that the objects in that new segment still have relatively close creation / expiration time.

The TTL bucket to merge from is chosen round-robin.
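Here is a hedged Go sketch of the merge step: take the first n consecutive segments of one bucket's chain, retain only objects above a frequency cutoff, and give the merged segment the oldest victim's creation time. The types and the simple `freq >= cutoff` retention rule are illustrative simplifications of SegCache's frequency-biased policy.

```go
package main

import "fmt"

// obj is a cached object with its ASFC-style frequency.
type obj struct {
	key  string
	freq int
}

// mseg is a segment with its creation time and resident objects.
type mseg struct {
	createdAt int64
	objs      []obj
}

// merge combines the first n segments of a chain into one, keeping only
// objects whose frequency passes the cutoff. The merged segment inherits
// the oldest creation time so it can take the victims' place in the chain.
func merge(chain []mseg, n, cutoff int) mseg {
	if n > len(chain) {
		n = len(chain)
	}
	merged := mseg{createdAt: chain[0].createdAt} // inherit oldest creation time
	for _, s := range chain[:n] {
		for _, o := range s.objs {
			if o.freq >= cutoff { // retain only likely-reused objects
				merged.objs = append(merged.objs, o)
			}
		}
	}
	return merged
}

func main() {
	chain := []mseg{
		{createdAt: 10, objs: []obj{{"a", 5}, {"b", 1}}},
		{createdAt: 20, objs: []obj{{"c", 9}}},
		{createdAt: 30, objs: []obj{{"d", 0}}},
	}
	m := merge(chain, 3, 2)
	fmt.Println(m.createdAt, len(m.objs)) // 10 2 ("a" and "c" survive)
}
```

Because the merged segment keeps the oldest creation time, it slots back into the chain without disturbing the creation-time ordering that expiration relies on.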

a. Segment selection — figure 2

SegCache uses frequency-biased merge-based eviction. It uses an approximate and smoothed frequency counter (ASFC) to track frequencies; the counter lives in the object metadata in each bucket slot.

  • Approximate counter: The ASFC has two stages:
    1. While the frequency is below 16 (the last four bits of the counter), it increases by one on every request.
    2. Beyond that, it counts frequency like a Morris counter (probabilistic increments).
  • Smoothed counter: SegCache uses the last-access timestamp, shared by objects in the same hash bucket, to rate-limit updates to the frequency counters. The frequency counter of each object is increased at most once per second.

Note that after segment eviction, SegCache resets the frequency of retained objects, to avoid cache pollution from request bursts and shifting access patterns.

Segcache is designed to scale linearly with the number of threads by using a combination of techniques such as minimal critical sections, optimistic concurrency control, atomic operations, and thread-local variables.

Traditional caches like Memcached perform object-level bookkeeping, such as updating LRU state on every access and maintaining global free-object lists on deletion. These operations require frequent locking of shared structures, which quickly becomes a scalability bottleneck as thread count increases.

SegCache avoids this entirely by managing objects at the segment level: it operates on segment chains grouped by TTL, so locking occurs only during segment allocation or eviction. This reduces locking frequency by orders of magnitude and enables near-linear scalability.

Instead of locking, SegCache employs optimistic concurrency control: it performs lightweight checks or uses atomic CAS operations to detect conflicts. In many cases, updates (like inserting a new key) can be done without acquiring a lock by atomically updating a free slot or pointer using the CAS value stored in the bucket header. In the rare case that two threads target the same hash bucket at the same time, SegCache simply retries the request automatically or lets the client retry.

All writes in SegCache are append-only: new objects are appended to the end of a segment and never modify existing entries in place.

Initially, Segcache took advantage of atomic instructions to coordinate these appends. When a thread wants to write an object of size N, it performs an atomic increment on the segment's tail pointer. For example:

```c
/* Atomically reserve N bytes at the segment tail. */
old_tail = atomic_fetch_add(&segment->tail, N);
```

Here old_tail holds the previous tail position, which becomes the new object's offset in the segment. This needs no lock at all.
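In Go terms, the same reserve-then-write idea might look like the sketch below. It's illustrative, not SegCache's API; it adds the bounds check a real implementation needs and simply reports failure when the segment is full.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// atomicSeg is a segment whose tail pointer is advanced atomically.
type atomicSeg struct {
	tail atomic.Int64
	cap  int64
}

// reserve atomically claims n bytes at the segment tail. It returns the
// write offset, or ok=false when the segment is full and must be sealed.
// (A real implementation would also undo or ignore the overshoot.)
func (s *atomicSeg) reserve(n int64) (offset int64, ok bool) {
	old := s.tail.Add(n) - n // equivalent of atomic_fetch_add
	if old+n > s.cap {
		return 0, false // full: caller links this segment and allocates a new one
	}
	return old, true
}

func main() {
	s := &atomicSeg{cap: 1 << 20}
	var wg sync.WaitGroup
	var claimed atomic.Int64
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				if _, ok := s.reserve(64); ok {
					claimed.Add(64)
				}
			}
		}()
	}
	wg.Wait()
	// 8 threads x 1000 x 64 bytes = 512000 bytes, all of which fit.
	fmt.Println(claimed.Load(), s.tail.Load())
}
```

Even though this is lock-free, all eight goroutines hammer the same `tail` word, which is exactly the cache-line contention described next.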

However, the developers found that relying solely on atomic operations on a shared segment still introduced contention when scaling beyond roughly 8 threads. Threads still serialize updates to a shared memory location (the segment's tail pointer) at the hardware level: due to cache coherence, the cache line containing that pointer continuously migrates between CPU core caches. Beyond roughly 8 threads, these cache-line migrations become costly, hurting scalability even though no lock is used at all.

To eliminate cache-line contention caused by concurrent atomic updates, each thread maintains a thread-local active segment for the TTL buckets it is currently writing to, and all writes issued by that thread append exclusively to these local segments. Unlike global segments, those segments are not shared across threads. As a result, no two threads ever contend on the same segment’s append pointer, completely avoiding write-side contention.

  • Note: The threads referred to here are the worker threads spawned by SegCache itself. The number of threads is configurable and should ideally be one thread per CPU core.

For example, if Thread 1 writes to TTL bucket 1, Thread 1 gets its own segment for TTL bucket 1, and all its subsequent writes to that bucket go to that thread-local segment.

Once a local segment becomes full, it is sealed - meaning no further writes are allowed - and then linked into the TTL bucket’s global segment chain. This linking step is infrequent and amortized over thousands of writes.

Reads remain fully concurrent and unaffected by this design. Since lookup is performed through the hash table, objects written into a thread-local segment are immediately visible to all threads, even before the segment is sealed or linked. Thread-locality applies only to writes, not to visibility.

d. Thread-Local Active Segments — figure 2

📔 The original C implementation includes this optimization, but the newer Rust rewrite does not (yet) include it.

For read operations, SegCache is almost entirely lock-free. A cache lookup checks the hash table, retrieves the segment ID and offset, then reads the object from the segment; no locking is needed.

Because objects are not updated on read, multiple threads can fetch items concurrently without coordination. The only minor synchronization on the read path is updating an object's access frequency counter. But since the counter is rate-limited to one increment per second, even thousands of reads to a single item don't incur thousands of counter updates: the counter is bumped once and then held until the next epoch.

You can find the official implementation at http://www.github.com/twitter/pelikan, archived at http://www.github.com/thesys-lab/segcache. The original version was written in C; it was later rewritten in Rust.

I have implemented my own version in Go. You can find it here: Repo. In my implementation, I tried a different approach to reducing segment-tail contention: sharding, where each shard maintains its own active segments. Technically, contention can still happen when multiple keys hash to the same shard and write to the same TTL bucket.

The repo also contains benchmark results, and they're not bad at all. 🤘 Benchmark

Feel free to submit issue if you find any bug. 🫂

Memory efficiency

SegCache significantly outperforms all baselines. Compared to the best alternative system, it reduces miss ratio by up to 58%. When normalized by miss ratio, SegCache achieves the same cache effectiveness using up to 88% less memory than PCache, 60% less than Memcached, and 64% less than r-Hyperbolic. These gains come from proactive expiration, compact metadata, zero fragmentation, and merge-based eviction.

Impact of proactive expiration.

Enabling full cache scanning in Memcached reduces miss ratio by up to 40%, confirming that timely expiration is critical. However, scanning cuts throughput nearly in half. SegCache achieves similar or better miss-ratio improvements without scanning, demonstrating that proactive expiration must be structural, not reactive.

Throughput.

SegCache matches or exceeds the fastest baselines. It is up to 2.5× faster than s-Memcached, 3× faster than r-Hyperbolic, and 4× faster than r-LHD. High throughput comes from batched, segment-level bookkeeping and the absence of object-level LRU maintenance.

Scalability.

SegCache scales almost linearly with CPU cores. At 24 threads, it achieves ~70 MQPS, a 19.9× speedup over single-thread execution. In contrast, Memcached scales poorly beyond 8 threads (only 3.4× speedup), and s-Memcached deadlocks under higher concurrency.

Sensitivity.

Merge-based eviction is robust: merging 3–4 segments yields the lowest miss ratio, but performance remains stable across a wide range of parameters. Segment size has minimal impact on miss ratio, indicating the design does not require fine-tuning.

Tagged: #Backend #Distributed System #Paper Notes