Scaling Memcache at Facebook
Explore how Facebook scaled Memcache to serve billions of requests
This blog is inspired by a research paper I recently read, titled exactly the same as this post. I found it particularly intriguing and wanted to articulate its concepts in my own words. So, here we are today, diving into this discussion.
It’s important to note that this blog is heavily based on the original paper, along with several older blog posts and videos from 2008 to 2013. As a result, the metrics and figures referenced here may be outdated. However, the core focus of this blog is not on the exact numbers but rather on understanding the challenges Facebook faced at that time and the solutions they implemented - which, in many ways, remain highly relevant today.
As Facebook’s scale grew, relying on a single service or database server to handle all data operations became impractical. Instead, Facebook adopted a distributed architecture where multiple services & data servers each manage a portion of the data. This introduced a challenge: even rendering a simple page required numerous requests to different services to gather all the necessary data. In a common scenario, a single page could depend on hundreds of database queries, which would be prohibitively slow and resource-intensive.
For example, this is the data dependency graph for a small request. Each dot represents a data fetch/computation.
To address this, Facebook turned to caching as a fundamental optimization strategy, and they chose Memcached for its simplicity: it offers just three primary operations - get, set and delete. According to an early blog post from Facebook Engineering, by 2008 Facebook had already deployed over 800 memcached servers providing more than 28 TB of memory. In a talk from the same year, Mark Zuckerberg highlighted that 95% of Facebook’s requests were served by memcached servers, underscoring its critical role in handling large-scale workloads efficiently.
In case some of you don’t know about memcached, this section gives a brief introduction to it: how it works and why it is efficient.
According to memcached’s official website, it is a “free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.”
At its core, memcached is a simple key-value store, primarily structured as a hash table. It consists of two main components:
- Memcached Client: Maintains a mapping of item keys to Memcached servers. When a request is made, it uses consistent hashing to determine which server is responsible for storing the value.
- Memcached Server: The actual storage node that holds the values associated with keys. It processes get, set and delete operations, serving cached data efficiently.
Each memcached server is independent, without coordinating or communicating with the others. This share-nothing architecture is what makes memcached distributed and easy to scale horizontally.
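The client’s server-selection step can be sketched with a toy consistent-hash ring. This is a minimal illustration, not memcached’s actual implementation: the server names, virtual-node count, and use of MD5 are all assumptions.

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring mapping keys to servers."""
    def __init__(self, servers, vnodes=100):
        self.ring = []  # sorted list of (hash, server) points on the ring
        for server in servers:
            for i in range(vnodes):  # virtual nodes smooth the distribution
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, key):
        # Walk clockwise to the first ring point at or after the key's hash.
        idx = bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["cache-a:11211", "cache-b:11211", "cache-c:11211"])
print(ring.server_for("user:42:profile"))  # deterministic: always the same server
```

The key property: adding or removing a server only remaps the keys whose ring segment changed, instead of reshuffling everything.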
Memcached is fast because:
- It stores data in memory - as the name suggests, all operations are served from RAM, with no disk I/O on the request path
- It’s lockless: objects are multi-versioned internally and reference counted, so there is no blocking at all; a client can update a key while dozens of other clients read its value, and no one needs to wait
- Batching:
- Memcached supports bulk fetching of multiple keys in a single request, avoiding the latency of sequential lookups & network roundtrips.
- If multiple requested keys reside on the same Memcached server, they can be retrieved in one network round trip. To take advantage of this, applications can explicitly provide hash values with keys, ensuring that related data is stored on the same Memcached node.
Facebook chose the look-aside cache strategy for caching data. Let’s see how it works.
For a read request: when a web server requires data, it first queries Memcache. If the key exists in Memcache (cache hit), the value is returned immediately. If the key is not found (cache miss), the web server retrieves the value from the database (or whichever backend service manages the data), then stores the fetched value back into Memcache to serve future requests.
For a write request: the web server applies the update to the database (or whichever backend service manages the data). Afterward, it invalidates the corresponding key in Memcache by issuing a DELETE command.
In this caching model, hot keys remain in Memcache most of the time, significantly reducing database queries and ensuring fast response times.
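The look-aside pattern described above can be sketched in a few lines. The `cache` and `db` dicts are stand-ins for memcached and the database; the names are illustrative.

```python
cache = {}                      # stand-in for memcached
db = {"user:1": "alice"}        # stand-in for the database (source of truth)

def read(key):
    value = cache.get(key)      # 1. try the cache first
    if value is not None:       # cache hit
        return value
    value = db.get(key)         # 2. cache miss: go to the database
    if value is not None:
        cache[key] = value      # 3. populate the cache for future reads
    return value

def write(key, value):
    db[key] = value             # 1. update the source of truth first
    cache.pop(key, None)        # 2. invalidate (DELETE) the cached entry

print(read("user:1"))   # miss -> db -> cached
write("user:1", "bob")  # db updated, cache entry deleted
print(read("user:1"))   # miss again -> returns the fresh value
```

Note that the write path deletes rather than sets the cache entry; the next reader repopulates it.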
In the event of a cache miss, the system retrieves data from the database and then writes it back into Memcache. However, this approach introduces two major challenges.
So, consider this scenario: ClientA queries Memcache for a key but encounters a cache miss. It then retrieves the value from the database. Before ClientA writes this fetched value back into Memcache, another client, ClientB, updates the data for that key in the database. As part of this update, ClientB triggers a cache invalidation, removing the outdated entry from Memcache.
However, since ClientA is unaware of this update, it proceeds to write back the old value it originally retrieved from the database. As a result, Memcache now stores an outdated value, leading to stale data being served to subsequent requests.
The thundering herd problem occurs when a large number of clients simultaneously request the same key, which has either just been invalidated from the cache or was never cached in the first place. Because the key is missing in Memcache, all these requests hit the database, creating a sudden surge of queries. This can overload the database, significantly increasing latency, resource contention and the potential for service degradation.
To mitigate these issues, Facebook implemented the lease token mechanism. A lease token is:
- A 64-bit token bound to the cache key, generated and returned to the client upon a cache miss
- Deleted whenever a cache invalidation occurs
When a client encounters a cache miss, the memcached server generates a lease token and provides it to the client. Later, when the client attempts to write the fetched value back into memcached, it must include the lease token in the request. The memcached server then verifies the token:
- If the token matches the current one, the value is stored in the cache.
- If the token does not match (because a cache invalidation occurred in the meantime), the write is ignored.
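The server-side check can be sketched as follows. This is an assumed structure for illustration: real tokens are 64-bit values, while a simple counter is used here.

```python
import itertools

_tokens = itertools.count(1)
leases = {}   # key -> currently valid lease token
cache = {}

def get(key):
    """Return (value, None) on a hit, or (None, lease_token) on a miss."""
    if key in cache:
        return cache[key], None
    token = next(_tokens)
    leases[key] = token
    return None, token

def delete(key):
    """Cache invalidation: drop the value AND void any outstanding lease."""
    cache.pop(key, None)
    leases.pop(key, None)

def lease_set(key, value, token):
    """Accept the write-back only if the lease is still valid."""
    if leases.get(key) != token:
        return False              # an invalidation happened in between
    cache[key] = value
    del leases[key]
    return True

# The stale-set race from above: A misses, B invalidates, A's write-back is rejected.
_, token_a = get("k")            # ClientA: miss, receives a lease
delete("k")                      # ClientB's update triggers invalidation
print(lease_set("k", "old-value", token_a))  # False: stale write-back ignored
```

Because the invalidation voided the lease, ClientA’s outdated value never reaches the cache.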
What about the thundering herd?
Facebook introduced an optimization to the lease token mechanism by limiting token issuance to once every 10 seconds per key. If multiple clients request the same key within 10 seconds of a token being generated, instead of issuing a new lease token, they receive special notifications instructing them to wait briefly. Since the cache is typically repopulated within a few milliseconds, the waiting clients can retry their request after a short delay, at which point the data is expected to be available in cache.
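The 10-second throttle can be sketched like this. The function name and the fake clock passed in as `now` are illustrative; a real server would use its own clock.

```python
LEASE_INTERVAL = 10.0       # at most one lease token per key per 10 seconds
last_issued = {}            # key -> time the most recent token was issued
_next = 0

def request_lease(key, now):
    """Return a fresh token, or None meaning 'hot miss - wait and retry'."""
    global _next
    last = last_issued.get(key)
    if last is not None and now - last < LEASE_INTERVAL:
        return None         # another client already holds a recent lease
    last_issued[key] = now
    _next += 1
    return _next

print(request_lease("k", now=0.0))    # a token is issued
print(request_lease("k", now=1.0))    # None: told to back off and retry
print(request_lease("k", now=11.0))   # interval elapsed: a new token
```

Only one client per key refills the database-backed value every 10 seconds; the rest retry a few milliseconds later and usually hit the freshly repopulated cache.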
Temporary storage for recently deleted data
An additional optimization mentioned in the paper addresses the handling of stale data - specifically, when a key is invalidated due to an update. Instead of immediately discarding the old value, Facebook moves it to a separate data structure, where it is kept for a short amount of time. A GET request for such a key returns a lease token along with the old value, marking the data as stale. Clients that can tolerate slightly stale data can use it directly instead of waiting for the latest value from the database.
Let’s start with a single memcached server first and see how Facebook optimized it:
Memcached operated with a fixed-size hash table. As the number of keys stored in the hash table grew, performance degraded due to hash collisions. Memcached uses **separate chaining** to resolve collisions, so in the worst case, the hash table look-up time can drift toward O(n).
Facebook resolved this by allowing automatic expansion of the hash table, which reduced hash collisions.
Memcached was single-threaded, so it could only process one request at a time. As Facebook scaled, this caused underutilization of multi-core CPUs, so they introduced multi-threading to parallelize requests across multiple CPU cores.
To avoid race conditions, Facebook started with a global lock: multiple threads could process requests, but only one could access the cache at a time. Contention on the global lock soon became the bottleneck.
To eliminate contention, Facebook replaced the global lock with finer-grained locks. So, instead of locking the entire cache, each bucket of the hash table got its own lock. This allowed multiple threads to access different parts of Memcached simultaneously.
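The per-bucket (striped) locking idea can be sketched like this. The bucket count of 16 and the function names are illustrative choices, not memcached’s actual values.

```python
import threading

N_BUCKETS = 16
buckets = [dict() for _ in range(N_BUCKETS)]
locks = [threading.Lock() for _ in range(N_BUCKETS)]

def _bucket(key):
    return hash(key) % N_BUCKETS

def cache_set(key, value):
    i = _bucket(key)
    with locks[i]:                 # lock only this key's bucket
        buckets[i][key] = value

def cache_get(key):
    i = _bucket(key)
    with locks[i]:
        return buckets[i].get(key)

# Many threads writing disjoint keys proceed mostly in parallel,
# since they usually land in different buckets.
threads = [threading.Thread(target=cache_set, args=(f"k{i}", i)) for i in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(cache_get("k7"))  # 7
```

Two threads only contend when their keys happen to hash to the same bucket, instead of on every operation.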
In the early days, Facebook used TCP for GET requests to memcached servers, but this raised several challenges:
- With thousands and thousands of machines, each running a hundred or more Apache processes, they ended up with hundreds of thousands of TCP connections open to their memcached processes → high overhead from TCP congestion control & TCP handshakes
- Wasted memory: with TCP, memcached uses a per-connection buffer to read and write data over the network. With a large number of connections, this consumes a significant amount of memory - which could be better used to store data instead.
- TCP incast congestion: this happens when multiple memcached servers send responses simultaneously to a single client, overwhelming the network switch buffer and causing packet loss.
Given the above challenges, Facebook decided to use UDP for GET requests, which reduced network traffic and memory consumption.
Facebook also discovered that under load on Linux, UDP performance was downright horrible because of contention on the UDP socket lock when transmitting through a single socket from multiple threads. They resolved this by using separate UDP sockets for transmitting replies (one reply socket per thread).
And because packets can be dropped or received out of order, the client treats such cases as errors - specifically, as cache misses - but it skips inserting the data into the Memcached server after retrieving it from the database for these errors.
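The client-side behavior can be sketched as below. The address and the text wire format are illustrative, not memcached’s real protocol; the point is that any drop, timeout, or error degrades to a cache miss.

```python
import socket

def udp_get(key, addr, timeout=0.05):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(f"get {key}\r\n".encode(), addr)
        data, _ = sock.recvfrom(4096)
        return data                # a real client would parse the reply here
    except OSError:                # dropped / out-of-order / refused / timed out
        return None                # -> treat as a cache miss; the caller must
    finally:                       #    NOT set the key after its DB fallback
        sock.close()

# With no server listening, the request simply degrades to a miss:
print(udp_get("user:1", ("127.0.0.1", 59999)))  # None (nothing is listening)
```

Skipping the subsequent SET on these error paths avoids adding write load to the cache during transient network trouble.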
Memcached uses a slab allocator to manage memory in fixed-size classes. Facebook observed inefficiencies when one slab class was overfull and evicting items while another had free space. They implemented an adaptive slab rebalancer that reassigns memory between slab classes based on usage and eviction rates. If one class is evicting items more frequently than others (meaning demand for that size is higher), the allocator moves some memory to that class from the others. This keeps the cache hit rate high and improves memory utilization (the open-source project later adopted a similar idea).

They also introduced a special Transient Item Cache for very short-lived items: instead of strictly relying on LRU eviction (which can let short-lived keys clutter the cache until they naturally expire), the transient cache proactively evicts keys with a short TTL once they expire. This prevents high-churn keys from pushing out more valuable data. In one case, moving a short-TTL item family into the transient bucket reduced that item’s footprint from 6% of a cache pool down to 0.3% without hurting hit rates.
Facebook operates tens of thousands of memcached servers across multiple regions and data centers. Routing requests to the right memcached server adds unnecessary complexity to every client that uses memcached.
Hence, Facebook developed mcrouter (pronounced “mick-router”) - a memcached protocol router used at Facebook to handle all traffic to, from, and between cache servers. It sits between web servers (memcached clients) & memcached servers. To a web server, mcrouter looks like a memcached server; to a memcached server, mcrouter looks like a normal memcached client. So with mcrouter, Facebook separated the routing logic from the clients.
Apart from routing, it provides a lot of useful functionality. In this section, we’ll just go through some of it; in the following sections, we will explore more places where Facebook used mcrouter. If you are interested in digging deeper, refer to the official blog post by Facebook.
- Request routing: All requests from web servers go through mcrouter, which decides where to route each request.
- Connection pooling: Having every client open & maintain connections directly to the memcached servers is expensive; mcrouter maintains connection pools that clients reuse when making requests. By separating this concern, clients can focus on their core logic without worrying about connections to memcached servers.
- Failover management: mcrouter keeps track of memcached server status and routes requests to the appropriate healthy server.
A cluster in Facebook’s caching system is a logical unit that groups together multiple web servers (memcached clients) and memcached servers.
The source of truth - the database layer - is maintained at the regional level, meaning that all clusters within a region share the same underlying database infrastructure.
Typically, each cluster is responsible for a specific subset of data. However, for highly available or frequently accessed data, Facebook replicates those entries across multiple clusters or stores them in a regional cache pool to reduce latency and improve fault tolerance.
Each cluster can have thousands of web servers & memcached servers. Optimizing read latency for such a large-scale infrastructure is a significant challenge. Let’s explore how Facebook addressed this issue to enhance performance and efficiency.
One optimization to think about when fetching a large number of keys is to maximize parallelism. Facebook divided the keys into batches that can be fetched in parallel by constructing a Directed Acyclic Graph (DAG) to represent the dependency relationships between keys. Independent keys can be retrieved simultaneously.
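The batching step can be sketched as a Kahn-style layering of the DAG: each round contains every key whose dependencies are already satisfied, so keys in the same round can be fetched in parallel. The example graph below is made up for illustration.

```python
def fetch_rounds(deps):
    """deps maps each key to the keys it depends on; returns parallel rounds."""
    remaining = dict(deps)
    rounds, done = [], set()
    while remaining:
        # every key whose dependencies are all satisfied can go in this round
        ready = [k for k, d in remaining.items() if set(d) <= done]
        if not ready:
            raise ValueError("dependency cycle")
        rounds.append(ready)
        done.update(ready)
        for k in ready:
            del remaining[k]
    return rounds

deps = {"user": [], "friends": ["user"], "feed": ["user"], "stories": ["friends", "feed"]}
print(fetch_rounds(deps))  # [['user'], ['friends', 'feed'], ['stories']]
```

Here `friends` and `feed` land in the same round and can be fetched concurrently, which is exactly the parallelism the DAG exposes.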
For GET requests, Facebook used UDP to maximize speed and reduce overhead, while SET and DELETE operations still rely on TCP for reliability. This design choice is based on the fact that dropped packets in a GET request are acceptable (as described in the previous section), whereas losing SET or DELETE requests could lead to data inconsistency.
Previously, we discussed incast congestion, a phenomenon that can also impact SET and DELETE requests. This occurs when a large number of cache misses trigger simultaneous SET operations or when many DELETE requests happen at once.
To mitigate incast congestion, Facebook implemented a sliding window mechanism to control the number of outstanding requests at any given time. A smaller window size increases sequential processing latency, since requests must be sent in smaller batches. Conversely, a larger window size increases the risk of incast congestion. Therefore, choosing the optimal window size requires careful provisioning and continuous monitoring to balance efficiency and network stability.
The above diagram shows the average time a web request spends waiting to be scheduled for different window sizes. With a small window size, requests wait a long time to be scheduled; this waiting time drops as the window grows, reaching its minimum at a window size of around 300, and then starts to climb again beyond that point.
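The windowing mechanism can be sketched as a simple cap on in-flight requests, where each completed response releases the next queued request. The class is illustrative; the default of 300 echoes the window size discussed above.

```python
class SlidingWindow:
    def __init__(self, window=300):
        self.window = window
        self.in_flight = 0
        self.queued = []

    def submit(self, request):
        """True if the request may be sent now; False if it must queue."""
        if self.in_flight < self.window:
            self.in_flight += 1
            return True
        self.queued.append(request)
        return False

    def complete(self):
        """A response arrived; release the oldest queued request, if any."""
        self.in_flight -= 1
        if self.queued:
            self.in_flight += 1
            return self.queued.pop(0)
        return None

w = SlidingWindow(window=2)
print(w.submit("get a"), w.submit("get b"), w.submit("get c"))  # True True False
print(w.complete())  # 'get c' is released once a slot frees up
```

Capping the in-flight count bounds how many responses can converge on the client at once, which is what keeps the switch buffers from overflowing.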
Not all cached data has the same properties (size, churn, access rate), so lumping everything together is not optimal. Memcached uses LRU to evict data when it hits its memory limit. If all cached data lived in one pool, a sudden burst of requests for high-churn keys (accessed once and never again) could cause the LRU policy to evict low-churn keys (data that is almost always needed or has a higher access rate). For that reason, it is better to have different pools for different data based on its properties.
According to the paper, Facebook has a default pool named wildcard, and keys are provisioned continuously: if keeping certain keys in wildcard becomes problematic, they are moved to a more appropriate pool based on their properties.
So how did Facebook define pools and route requests to the correct one? They added special prefixes to the keys; all keys with the same prefix go to the same pool. The remaining job - routing each request to the correct pool - falls to mcrouter.
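The prefix-based selection can be sketched as follows. The prefixes and pool names below are made up for illustration; only the wildcard default comes from the paper.

```python
# Registered key prefixes and the pools they map to (hypothetical examples).
POOL_PREFIXES = {
    "profile:": "pool-low-churn",
    "feed:": "pool-high-churn",
}

def pool_for(key, default="wildcard"):
    """Route a key to its pool by prefix; unmatched keys go to wildcard."""
    for prefix, pool in POOL_PREFIXES.items():
        if key.startswith(prefix):
            return pool
    return default

print(pool_for("profile:42"))   # pool-low-churn
print(pool_for("session:abc"))  # wildcard
```

Because every key carrying the same prefix lands in the same pool, data with similar churn and access patterns shares (and competes for) the same LRU space.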
Facebook chose to replicate a category of keys within a pool when:
- The application routinely fetches many keys simultaneously
- The entire data set fits in one or two memcached servers
- The request rate is much higher than what a single server can manage
One thing to note here is that they favored replicating the entire instance over dividing the key space across multiple instances. One example they mentioned: suppose a memcached server holds 100 items and is capable of responding to 500k requests/sec, where each request asks for 100 keys. The difference in memcached overhead between retrieving 100 keys per request and 1 key is small. To scale the system to 1M requests/sec, suppose we add a second server and split the key space equally between the two. Clients now need to split each request for 100 keys into two parallel requests for ~50 keys. Consequently, both servers still have to process 1M requests per second.
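The arithmetic behind that example can be spelled out directly: splitting the key space does not halve the per-server request rate, while replicating does.

```python
client_rps = 1_000_000     # target client requests per second, 100 keys each

# Split the key space: each 100-key request becomes two ~50-key requests,
# one to each server, so every server still sees every client request.
split_per_server = client_rps

# Replicate the whole instance: each request is served entirely by one of
# the two full replicas, so load is shared.
replicate_per_server = client_rps // 2

print(split_per_server, replicate_per_server)  # 1000000 500000
```

Since per-request overhead dominates and the 100-key vs 50-key difference is small, replication keeps each server within its 500k req/s capacity while partitioning does not.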
The mcrouter takes care of routing each request to one of the replica memcached servers for the key.
In case a small number of hosts within a cluster become inaccessible due to network or server failure, Facebook sets up a small set of machines called “Gutter” that take over the responsibilities of the failed servers. Typically, Gutter accounts for about 1% of the memcached servers in a cluster. They did not use a rehashing solution, because hot keys could be shifted to another server that is not prepared to handle a large volume of requests for those keys.
When a memcached client receives no response to its get request, the client assumes the server has failed and issues the request again to a special Gutter pool.
This is the flow for the request within a cluster.
Let’s talk a bit more about Gutter: it is a set of servers acting as fallbacks for failed servers. Here are some characteristics:
- The Gutter Pool temporarily stores data for some missing keys and serves them as a backup cache.
- If the key is missing in Gutter, the request goes to the database, and the retrieved value is inserted into Gutter for future requests.
- These cached values in Gutter have a short TTL (Time-To-Live) and are automatically evicted when the main cache recovers.
The reason Facebook did not choose to have clients rehash keys across the remaining memcached servers is that this approach risks cascading failures due to non-uniform key access frequency. For example, a single key can account for 20% of a server’s requests; the server that becomes responsible for this hot key might itself become overloaded.
Routing requests to Gutter is also done by mcrouter.
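The Gutter fallback path can be sketched as below. `Pool` is a toy stand-in where a dead pool returns `None` (no response), and gutter expiry bookkeeping is omitted; the names and the TTL value are illustrative.

```python
GUTTER_TTL = 10  # seconds - short, so entries fade once the main pool recovers

class Pool:
    """Toy cache pool; a dead pool never answers."""
    def __init__(self, alive=True):
        self.alive = alive
        self.data = {}
    def get(self, key):
        return self.data.get(key) if self.alive else None
    def set(self, key, value, ttl=None):
        if self.alive:
            self.data[key] = value

def get_with_gutter(key, main_pool, gutter_pool, db):
    value = main_pool.get(key)           # no response is modeled as None here
    if value is not None:
        return value
    value = gutter_pool.get(key)         # retry against the Gutter pool
    if value is not None:
        return value
    value = db[key]                      # Gutter miss: go to the database
    gutter_pool.set(key, value, ttl=GUTTER_TTL)  # backfill Gutter, not main
    return value

main, gutter = Pool(alive=False), Pool()
db = {"user:1": "alice"}
print(get_with_gutter("user:1", main, gutter, db))  # from the database
print(get_with_gutter("user:1", main, gutter, db))  # now served from Gutter
```

Because only the small Gutter pool absorbs the failed servers’ keys, and its entries expire quickly, the database sees a bounded burst rather than the full load of the failed servers.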
In the previous section, we went through the Cluster; a larger unit is the Region. A Region contains many clusters sharing the same databases.
Previous sections focused on performance; this section focuses on consistency & handling failures.
Each cluster independently caches data based on the mix of user requests it receives. If user requests are randomly distributed across all available frontend clusters, the cached data across these clusters will be roughly the same. This redundancy allows Facebook to take a cluster offline for maintenance without significantly reducing cache hit rates.
However, over-replicating cached data across all clusters can be memory inefficient, especially for large, infrequently accessed items. To optimize memory usage, Facebook introduced the concept of a regional pool, where multiple clusters share a common set of Memcached servers rather than each cluster maintaining its own full copy of the data.
Cross-cluster queries introduce additional latency and require more bandwidth. On average, network bandwidth across clusters is 40% lower than within a single cluster. Replication reduces cross-cluster traffic by trading more memcached servers for lower inter-cluster latency, reduced network congestion, and improved fault tolerance.
However, not all data benefits from full replication. For some data types, it is more cost-efficient to maintain a single replica per region instead of replicating it across all clusters. One of the key challenges in scaling memcache within a region is deciding which keys should be fully replicated across clusters versus those that should have only one copy per region.
Facebook tried to provide best-effort eventual consistency but place an emphasis on performance and availability.
Because each region can have multiple Memcache clusters, the same piece of data might be cached in several clusters (if users are load-balanced across them). Facebook maintains cache consistency within a region using an invalidation pipeline. Rather than updating cached data on writes, Facebook employs a delete-based (invalidation) approach: when the database is updated, the corresponding cache key is deleted from memcache. Why deletes? Because deletes are idempotent - wrongly deleting a key from the cache is not a big deal, but setting stale data is. A daemon called mcsqueal runs on every database server and watches the commit log for data modifications. On detecting a change, mcsqueal broadcasts a delete for that key to all memcache clusters in the region. These invalidation messages are funneled through the mcrouter in each cluster, which then deletes the item on the relevant memcached node. This ensures that stale data is purged from every cluster’s cache soon after a write, maintaining consistency within the region. (Any subsequent read will miss in cache and fetch the updated value from the database.)
Having copies of data across multiple regions has many benefits, such as:
- Reduce latency because the servers are closer to the users
- Reduce the chance of losing data in case of disasters or power outages
- Locations with cheaper costs
With multiple regions, Facebook decided that one region would hold the master database, while the others hold read-only replica databases. So all writes must go to the master region.
Maintaining consistency across regions at the database layer relies on database replication, but what about the data in the memcached servers? Within one region, a write request triggers invalidations via mcsqueal, but things become more complex across regions due to replication lag.
Say the invalidation requests reach the memcached servers in the non-master region before the latest data has been replicated there. In that window, reading from the replica database after a cache miss would repopulate the cache with stale data. Here’s how Facebook resolved this issue:
Facebook introduced a value called a remote marker, unique across all regions - for instance, a timestamp plus the ID of the non-master region that issued the write. When a client makes a write request, it also sets a remote marker in memcache. From that moment until the marker is removed (when the replication log carrying the latest data arrives), other clients read the data directly from the master region instead. Facebook traded latency for data integrity in this case.
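The remote-marker flow can be sketched like this. The dicts and set stand in for regional memcache and the two databases; all names are illustrative, and the marker here is simply the key.

```python
markers = set()   # remote markers set in the replica region's memcache

def write_from_replica_region(key, value, master_db):
    markers.add(key)             # 1. set the remote marker in regional memcache
    master_db[key] = value       # 2. the write itself goes to the master region

def read(key, master_db, replica_db):
    if key in markers:           # replica may still be stale: pay the extra
        return master_db.get(key)   # latency and read from the master region
    return replica_db.get(key)

def on_replication_applied(key, master_db, replica_db):
    replica_db[key] = master_db[key]  # replication stream has caught up...
    markers.discard(key)              # ...so the marker can be removed

master_db, replica_db = {}, {}
write_from_replica_region("user:1", "new", master_db)
print(read("user:1", master_db, replica_db))   # 'new' - served from master
on_replication_applied("user:1", master_db, replica_db)
print(read("user:1", master_db, replica_db))   # 'new' - now from the replica
```

While the marker exists, readers pay cross-region latency; once replication catches up, reads return to the cheap local path.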
In case of a widespread outage affecting a significant percentage of servers within a cluster, Facebook chose to shift all traffic from that cluster to other clusters that also hold the requested data.
When a brand-new cluster comes online, or an existing one recovers from failure, its cache has very poor hit rates. Such a cluster is called a “Cold Cluster”, and it takes a while for it to catch up with a “Warm Cluster”. During this time, instead of fetching data from the database, web clients in the Cold Cluster fetch data from the Warm Cluster.
But there’s a race condition here.
Consider a scenario where Client B, operating in the Cold Cluster, issues a write request. This triggers cache invalidation in both the Cold Cluster and the Warm Cluster. However, if the invalidation request to the Warm Cluster is delayed, the Cold Cluster processes the invalidation first.
Now suppose ClientA makes a read request to the Cold Cluster immediately after the invalidation. Since the cache entry was evicted, ClientA fetches the data from the Warm Cluster and writes it back to the Cold Cluster. However, because the Warm Cluster has not yet processed the invalidation request, the data being read is stale. This results in stale data being re-cached in the Cold Cluster, leading to inconsistency.
→ Facebook came up with a workaround for this issue. Upon receiving a cache invalidation request, a client in the Cold Cluster will avoid reading that key from the Warm Cluster for 2 seconds and will instead fetch the data directly from the database. While this is not a perfect solution - it does not guarantee that the Warm Cluster will receive the invalidation within 2 seconds - the 2-second value was likely determined empirically, based on extensive monitoring and system performance analysis.
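The hold-off can be sketched as below. The structure and the fake clock passed as `now` are illustrative; in this toy, the warm cluster is just a dict that still holds its old value.

```python
HOLD_OFF = 2.0        # seconds to avoid the warm cluster after an invalidation
invalidated_at = {}   # key -> time we last saw an invalidation for it

def on_invalidation(key, now):
    invalidated_at[key] = now

def fetch(key, warm_cluster, db, now):
    last = invalidated_at.get(key)
    if last is not None and now - last < HOLD_OFF:
        return db[key]                 # too soon: warm copy may still be stale
    return warm_cluster.get(key, db[key])  # otherwise prefer the warm cluster

warm = {"k": "stale"}
db = {"k": "fresh"}
on_invalidation("k", now=0.0)
print(fetch("k", warm, db, now=1.0))   # 'fresh' - database wins inside the window
print(fetch("k", warm, db, now=3.0))   # warm cluster is consulted again after 2s
```

Inside the 2-second window the cold cluster pays the database-read cost; outside it, the cheap warm-cluster path resumes.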
Let’s wrap up with the overall architecture diagram.
https://www.youtube.com/watch?v=m4_7W4XzRgk&t=205s&ab_channel=USENIX
https://blog.bytebytego.com/p/how-facebook-served-billions-of-requests
https://www.youtube.com/watch?v=Y8StDey3_6Q&ab_channel=Jordanhasnolife