TAO: Facebook’s Distributed Data Store for the Social Graph
A deep dive into TAO, Facebook’s distributed data store for the social graph. Learn how it overcomes MySQL and memcache limitations with a graph-aware cache, eventual consistency, and a scalable, fault-tolerant architecture.
Facebook is a social platform with billions of users, which puts an extremely heavy workload on its backend. Each time a user visits Facebook, they’re presented with hundreds of pieces of information. For example, when a user opens News Feed, they see the latest posts from friends, followed users, pages, and communities, along with information about each post’s interactions: how many comments, who commented, who reacted. This information naturally forms a graph, where each node is an entity (post, comment, reaction, …) and each edge is the relationship connecting those entities (user likes post, user comments on post, …). With a highly customized feed for each user and a high update rate, it’s impossible to generate views ahead of time. Therefore, the dataset must be retrieved and shipped to users on the fly, within a few hundred milliseconds.
Before TAO, Facebook used MySQL with memcache as a look-aside cache, which led to complex application code and inconsistent caches. Moreover, given the requirement of graph queries, the existing architecture had some downsides:
Initially, Facebook started with persistent storage in MySQL, but the need to serve a large number of graph-shaped queries made MySQL a bottleneck. MySQL does use its buffer pool to cache recently accessed data and speed up reads, but the assumptions behind the InnoDB buffer pool algorithms don’t match the access pattern of serving a social graph. Why is that? The buffer pool is a block cache: it loads and stores fixed-size blocks of data from storage into memory, assuming that if one part of a dataset is accessed, nearby data will likely be accessed soon (spatial locality). That optimization doesn’t work in Facebook’s case for the following reasons:
- Creation-time locality > spatial locality: At Facebook, newer data is accessed more frequently than data that happens to be close in memory or on disk. For example, a new post in a trending discussion will receive rapid engagement (likes, shares, comments), making its creation time the primary predictor of access. However, just because one post was accessed does not mean adjacent posts (stored near it on disk or in memory) will be accessed too. Traditional block caches fail here because they preload adjacent data blocks, polluting the cache with unused data.
- Many queries are for non-existent relations: Facebook’s systems frequently check relationships that do not exist, for example: “Did user X like user Y’s story?” Most of the time, the answer is no. Traditional block caches would load and store unnecessary data just to answer these queries, which wastes cache space and lowers hit rates.
- Expensive read-after-write consistency: Facebook uses asynchronous primary/replica replication for MySQL, so writes are forwarded to the primary, and the update is then replicated to replicas; the replication may take some time to reflect the change on the replicas.
In the previous blog, we saw that Facebook used a large, scalable memcache architecture to serve read requests, but this architecture too has downsides that prevent it from fully resolving the challenges of the graph data model.
- Inefficient edge lists: Memcache is a simple key-value store, which makes it hard to represent a graph. Queries on an edge list always fetch the entire list, and updating or deleting a single edge requires reloading the list entirely.
- Expensive read-after-write consistency: With the look-aside memcache architecture, in order to guarantee read-after-write consistency, Facebook came up with the remote marker, which tracks cache keys that are stale (recently written, but replication hasn’t caught up). If a client in a replica region tries to read such a key, the system forwards the read to the master region to ensure it gets the latest data. These cross-region requests increase latency.
Therefore, Facebook introduced a new system called TAO (The Associations and Objects), designed specifically to support graph-oriented queries efficiently. Internally, TAO still relies on MySQL and Memcache as its foundational components, but with key modifications and optimizations to better accommodate the requirements of traversing and interacting with large-scale social graphs.
- Hide the complexity of caching, replication, and database access behind a simple, clean API (we’ll go into more detail in the following sections)
- Provide sufficiently expressive graph queries to handle most application needs
- Provide read-after-write consistency within regions, while replicating data asynchronously across them
At Facebook, the core elements revolve around people, their actions, and the relationships among them. These elements can naturally be modeled as a graph. For instance:
Alice used her mobile device to record a visit to a well-known landmark with Bob. She ‘checked in’ to the Golden Gate Bridge and ‘tagged’ Bob to indicate his presence. Cathy later added a comment, which David subsequently ‘liked.’ The social graph includes the users (Alice, Bob, Cathy, and David), their relationships, their interactions (checking in, commenting, and liking), and a physical location (the Golden Gate Bridge).
A graph has 2 primary components: nodes & edges. At Facebook, these are referred to as objects & associations.
- Objects represent entities (users, photos, etc.) and have an ID and attributes.
- Associations represent relationships or actions (Alice likes Bob’s photo) and are stored as directed edges from one object to another.
With this abstraction, Facebook came up with a straightforward data structure for TAO:
Object: (id) → (otype, (key → value)*). Represents entities like a user, comment, location, …
Association: (id1, atype, id2) → (time, (key → value)*). A directed edge of type atype from id1 to id2. The association list (id1, atype) is the list of all outgoing associations of a certain type from a given source object.
- Note that in a social graph, an association often has an inverse relationship (e.g., a “friend” edge might have a corresponding “friend_of” edge)
id is a 64-bit integer that is unique across all objects, regardless of object type (otype).
TAO offers operations for allocating a new object and its corresponding ID, as well as retrieving, updating, or deleting the object associated with a given ID.
One important point to note is that, due to TAO’s eventual consistency semantics, it does not support CAS (compare-and-set) operations. Because different regions may observe the data at slightly different times or states, there is no guarantee that the condition for a CAS check would hold consistently across all replicas.
Write: Many edges in the social graph are bidirectional, such as the FRIEND relationship. TAO supports maintaining the consistency of such bidirectional edges by defining an inverse type for the association. How they are kept in sync will be revealed in the following sections. Some notable write APIs are:
- assoc_add(id1, atype, id2, time, (k → v)*): adds or overwrites the association (id1, atype, id2), and its inverse (id2, inv(atype), id1) if defined.
- assoc_delete(id1, atype, id2): deletes the association (id1, atype, id2) and the inverse if it exists.
- assoc_change_type(id1, atype, id2, newtype): changes the association (id1, atype, id2) to (id1, newtype, id2), if (id1, atype, id2) exists.
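To make the write semantics concrete, here is a minimal in-memory sketch, not TAO’s implementation, of `assoc_add`/`assoc_delete` with inverse-type maintenance. The `INVERSE` table, the `LIKES`/`LIKED_BY` types, and the dict-based storage are all assumptions for illustration:

```python
# Hypothetical inverse-type table: which atype mirrors which.
INVERSE = {"LIKES": "LIKED_BY", "LIKED_BY": "LIKES"}

# (id1, atype) -> {id2: {"time": ..., "data": ...}}  (toy stand-in for storage)
assocs = {}

def assoc_add(id1, atype, id2, time, data, _mirror=True):
    """Add or overwrite (id1, atype, id2); also write the inverse edge if defined."""
    assocs.setdefault((id1, atype), {})[id2] = {"time": time, "data": data}
    inv = INVERSE.get(atype)
    if _mirror and inv is not None:
        assoc_add(id2, inv, id1, time, data, _mirror=False)

def assoc_delete(id1, atype, id2, _mirror=True):
    """Delete the association and its inverse if it exists."""
    assocs.get((id1, atype), {}).pop(id2, None)
    inv = INVERSE.get(atype)
    if _mirror and inv is not None:
        assoc_delete(id2, inv, id1, _mirror=False)
```

In real TAO the two edges usually live on different shards, so the mirroring is not a local recursive call; the cross-shard flow is covered later in this post.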
Read:
- assoc_get(id1, atype, id2set, high?, low?): returns all of the associations (id1, atype, id2) and their time and data, where id2 ∈ id2set and low ≤ time ≤ high (if specified). The optional time bounds are there to improve cacheability for large association lists.
- assoc_count(id1, atype): returns the size of the association list for (id1, atype), which is the number of edges of type atype that originate at id1.
- assoc_range(id1, atype, pos, limit): returns elements of the (id1, atype) association list with index i ∈ [pos, pos + limit).
- assoc_time_range(id1, atype, high, low, limit): returns limit elements from the (id1, atype) association list, starting with the first association where time ≤ high and returning only edges with time ≥ low.
Note: TAO enforces a per-atype upper bound (typically 6,000) on the actual limit used for an association query. To enumerate the elements of a longer association list, the client must issue multiple queries, using pos or high to specify a starting point.
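The read queries can be illustrated over an in-memory association list kept newest-first; the sample edges (`(id2, time)` pairs) and the single-list framing are made up for this sketch:

```python
# Per-atype cap TAO enforces on `limit` (typical value from the paper).
MAX_LIMIT = 6000

# One (id1, atype) association list as (id2, time) pairs, time descending.
EDGES = [(50, 400), (42, 300), (17, 200), (9, 100)]

def assoc_count():
    """Number of edges originating at id1 with this atype."""
    return len(EDGES)

def assoc_range(pos, limit):
    """Elements with index i in [pos, pos + limit)."""
    return EDGES[pos : pos + min(limit, MAX_LIMIT)]

def assoc_time_range(high, low, limit):
    """Up to `limit` edges with low <= time <= high, newest first."""
    hits = [e for e in EDGES if low <= e[1] <= high]
    return hits[: min(limit, MAX_LIMIT)]
```

Because the list is ordered by time descending, `assoc_range(0, k)` is exactly “the k most recent edges”, which is the common query shape (e.g., the latest comments on a post).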
TAO relies on MySQL as its underlying storage layer. However, due to the immense volume of data it must manage, a single server is insufficient. This led to the adoption of sharding. Some important characteristics include:
- In each shard, Objects and associations are stored in separate tables.
- Each object includes an embedded shard_id, which determines the shard responsible for storing it. Once assigned, an object remains bound to that shard for its entire lifetime.
- An association’s shard_id is the same as its source object’s (id1), so an entire association list can be served from a single shard.
- If a shard becomes unavailable, TAO reroutes requests to its replicas or designated failover mechanisms until the shard is recovered.
- Scaling is achieved by adding more shards. TAO avoids rebalancing by design; instead, new objects are allocated to newly added shards, while existing objects remain on their original shards. This eliminates the need to move data during scaling, and it is why the shard_id is embedded directly in the object id.
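One plausible way to embed the shard in the id, assuming a hypothetical bit layout (the paper doesn’t specify how many bits the shard_id occupies):

```python
SHARD_BITS = 16  # assumed split of the 64-bit id; illustrative only

def make_id(shard_id, local_id):
    """Pack the shard id into the top bits of a 64-bit object id."""
    return (shard_id << (64 - SHARD_BITS)) | local_id

def shard_of(obj_id):
    """Recover the shard directly from the id: no lookup table needed."""
    return obj_id >> (64 - SHARD_BITS)
```

With this scheme, routing an association query is trivial: the shard of `(id1, atype)` is just `shard_of(id1)`, and the mapping never changes for the object’s lifetime.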
For caching, TAO still relies heavily on memcache servers.
Facebook organizes multiple cache servers capable of independently serving requests into what is known as a cache tier. Each database shard’s data is cached by exactly one cache server within a tier, and this assignment is determined via consistent hashing.
Client requests are consistently routed to a specific cache server based on the shard they target. That cache server is responsible for fulfilling the request: if the data is available locally, it returns it immediately. Otherwise, it may need to fetch data from other cache servers in the same tier or query the underlying database. Because fulfilling complex requests can require coordination between multiple cache servers in a tier, the number of inter-server connections may grow quadratically with the number of servers. To avoid excessive communication overhead, the size of each cache tier is deliberately kept relatively small.
Cache data is filled on demand and evicted using an LRU policy.
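The shard-to-server assignment can be sketched with a standard consistent-hash ring. This is a generic illustration; TAO’s actual hashing scheme isn’t described at this level of detail:

```python
import hashlib
from bisect import bisect

def _h(key):
    # Stable hash for ring positions and shard keys.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

class CacheTier:
    """Maps each shard to exactly one cache server via consistent hashing."""
    def __init__(self, servers, vnodes=100):
        # Each server owns `vnodes` points on the ring to smooth the load.
        self.ring = sorted((_h(f"{s}#{v}"), s) for s in servers for v in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    def server_for(self, shard_id):
        # First ring point clockwise from the shard's hash owns the shard.
        i = bisect(self.keys, _h(shard_id)) % len(self.ring)
        return self.ring[i][1]
```

The payoff of consistent hashing is that adding or removing one cache server only remaps the shards that touched its ring points, instead of reshuffling every shard in the tier.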
The challenge: the more cache servers in a tier, the more all-to-all connections to deal with. So in order to scale out the system, Facebook decided to split the cache into 2 layers: leaders & followers.
- Followers serve requests from clients
- If the requested data is already cached, the follower returns the response directly
- On a read miss, the follower makes a request to the leader, which in turn queries the database to fetch the data
- For a write, the follower forwards the request to the leader, which updates the database
- Leaders handle database access and cache consistency: every write goes through a follower to the leader.
Since Facebook has users all over the world, scaling geographically is inevitable to reduce latency.
One of the challenges is that the social graph is tightly interconnected, which makes partitioning data by region impractical. As a result, each TAO follower must maintain access to a complete multi-petabyte copy of the social graph. However, replicating the entire dataset across all data centers would be prohibitively expensive.
To address this, Facebook strategically clusters its data centers into a small number of regions. Each region stores a full copy of the social graph, ensuring low-latency access while keeping storage costs manageable.
- TAO uses a primary-secondary model for its databases. It uses a primary region per shard and multiple secondary regions. Each region has its own leader & follower tiers.
- A region can act as the primary for some shards and a secondary for others.
- However, in Facebook’s deployment, all primary shards are typically co-located in a single region — for example, US-East.
- Within a region, followers forward read misses and write requests to the leader.
- If it’s a read miss (the data is not in the follower’s cache), the leader queries the local region’s database, regardless of whether it is the primary or a secondary database.
- In the case of a write, the regional leader forwards the request to the leader in the primary region (the region that owns the shard’s primary database)
- Once the primary region leader processes the write and returns a response, the leader in the secondary region updates its cache — even if the region’s local database hasn’t yet received the replication!
- TAO embeds invalidation and refill messages in the database replication stream. These messages are delivered within a region immediately after the corresponding transaction has been replicated to the region’s secondary database, ensuring cached data is consistent with the DB.
- Followers in a secondary region that initiated the write may receive two invalidation/refill messages:
- One directly from the leader (synchronously), and
- One from the replication stream (asynchronously).
- TAO ensures that all reads can be served within a single region, and as long as a client consistently talks to the same follower tier, it will have a logically consistent view of the data.
Here’s my guess at what data is cached in a cache server:
- Object
- Association
TAO doesn’t cache the entire association list; instead, it caches only a prefix of the list, ordered by association creation time from newest to oldest. The cached list is extended on demand. This means if the cache currently holds 5 items and the client requests 10, the cache server will query the database (through the leader tier) to retrieve the full 10 items (if they exist), update the cache, and return the result to the client.
This design choice is based on the observation that recent associations are usually more relevant, and storing older associations in cache would waste memory without significant benefit.
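A sketch of this extend-on-demand behavior, with a hypothetical in-memory `DB` list standing in for the leader/database path:

```python
# Hypothetical full association list in the database, newest first.
DB = [(60, 600), (50, 500), (40, 400), (30, 300), (20, 200)]

class AssocListCache:
    """Caches only a newest-first prefix of one association list."""
    def __init__(self):
        self.prefix = []  # cached prefix, possibly shorter than the full list

    def range(self, pos, limit):
        need = pos + limit
        if len(self.prefix) < need:
            # Cache miss on the tail: refill a longer prefix "from the DB"
            # (in real TAO this goes through the leader tier).
            self.prefix = DB[:need]
        return self.prefix[pos : pos + limit]
```

Because clients overwhelmingly ask for the most recent edges, the prefix stays small while still answering most queries entirely from cache.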
Facebook applies several memory optimizations within a single cache server.
Slab: a block of memory reserved to store cache items of the same size.
TAO divides available RAM into slabs, with different slab classes, each of which:
- is optimized to hold items of a fixed size (e.g., 64 B, 128 B, 256 B, …)
- uses LRU eviction, so only items of the same size compete for space
Advantages:
- Prevents large objects from pushing out small objects (e.g., a big edge won’t evict thousands of small counters)
- Keeps allocation fast, since TAO knows exactly where to put an item based on its size.
Dynamic rebalancer
In case some slab classes use less memory while others are overloaded, TAO’s slab rebalancer can move memory from one class to another, maintaining fairness and reducing eviction pressure.
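Size-class selection itself is simple; a minimal sketch, where the class sizes are illustrative and not TAO’s actual configuration:

```python
# Hypothetical slab classes (bytes); each class holds only items of that size.
SLAB_CLASSES = [64, 128, 256, 512, 1024]

def slab_class_for(item_size):
    """Smallest class that fits the item, so only same-size items compete."""
    for size in SLAB_CLASSES:
        if item_size <= size:
            return size
    raise ValueError("item too large for any slab class")
```

An item is rounded up to its class size (a 100-byte value lands in the 128 B class), trading a little internal fragmentation for fast, fragmentation-free allocation.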
Arena: Partitioning cache memory based on data type, rather than object type.
TAO divides available RAM into arenas per data type - for example:
- One arena for user profiles
- One for friend lists
- One for news feed edges
- One for association counts
Each arena may internally use slabs, but the arena determines which kinds of data are allowed to use which memory.
Advantage:
- Better isolation between data types:
- Prevents a “bad actor” (e.g. a super active feed list) from evicting everything else from cache.
- Helps give priority or protection to important data (e.g. user sessions, profile pictures).
- Enables manual tuning of cache space by type, or even automated tuning in the future.
One side note: you might be confused by arenas & slabs. Here’s an ELI5:
Think of the cache as a building:
- Arenas are the floors – each floor is reserved for a specific type of resident (e.g., athletes, students, artists).
- Slabs are the rooms – each room is designed for a specific size of resident (small room for 1 person, large room for 4).
Slabs ensure people don’t waste space by moving into the wrong size room. Arenas ensure that athletes don’t take over the artist’s floor.
There are a lot of tiny values (i.e., integers for counts), and using traditional hash table structures to cache them becomes inefficient, because:
- You have to store pointers, hash table metadata, and object headers
- The overhead is often bigger than the value itself (e.g., a 64-bit pointer to store a 32-bit integer)
For example, storing 10 million (id1, atype) → count entries using standard structures would waste a huge amount of memory in metadata, not actual data.
To optimize this, TAO uses a direct-mapped 8-way associative cache. Let’s break this down:
- 8-way associative: each bucket can hold up to 8 items (each item is a cached entry)
- Direct-mapped: the hash function directly maps a key (like (id1, atype)) to a specific bucket, with no extra pointers or dynamic allocations.
- LRU in each bucket is tracked by simply sliding the entries down:
- When an item is used, it’s moved to the front.
- When a new item is inserted into a full bucket (all 8 slots used), the oldest (least-recently-used) entry (usually at the end) is evicted.
Advantages: Eliminates pointers, keeps cache compact, and still enables fast lookup + replacement
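A toy version of such a bucketed cache, with per-bucket LRU tracked by sliding entries; the bucket count and the hash function are illustrative choices, not TAO’s:

```python
import hashlib

NUM_BUCKETS = 1024  # assumed; real sizing depends on available RAM
WAYS = 8            # 8-way associative: 8 slots per bucket

# Each bucket is a small list of (key, value), most-recently-used first.
buckets = [[] for _ in range(NUM_BUCKETS)]

def _bucket(key):
    return int(hashlib.md5(repr(key).encode()).hexdigest(), 16) % NUM_BUCKETS

def get(key):
    b = buckets[_bucket(key)]
    for i, (k, v) in enumerate(b):
        if k == key:
            b.insert(0, b.pop(i))  # slide to front (MRU)
            return v
    return None

def put(key, value):
    b = buckets[_bucket(key)]
    for i, (k, _) in enumerate(b):
        if k == key:
            del b[i]               # replace existing entry
            break
    b.insert(0, (key, value))
    if len(b) > WAYS:
        b.pop()                    # evict LRU entry from the last slot
```

A real implementation would use fixed-width arrays of packed entries rather than Python lists, which is exactly what removes the pointer and header overhead.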
TAO divides the space of objects & associations into shards (each shard is a logical MySQL database).
Each shard has one table for objects, one for associations, and one for association counts (to avoid expensive SELECT COUNT queries).
One thing to note is in object table, they serialize the object data into a single data column.
Shards are mapped onto cache servers within a tier using consistent hashing. Let’s make this concrete:
In a single region, in each follower tier, each shard has only one cache server serving requests for it, as determined by consistent hashing.
But this means some cache servers receive a higher request load than others → hot spots.
To mitigate this issue, TAO rebalances load among cache servers with shard cloning: reads for a cloned shard can be served by multiple cache servers in a follower tier.
They also have an additional optimization: caching hot objects/associations on the client side.
When a follower responds to a query for a hot item, it includes that item’s access rate. If that rate exceeds a threshold, the TAO client caches the data and version. Using version, client can ignore the data from the follower if the version hasn’t changed. The access rate can also be used to throttle client requests for very hot objects.
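A sketch of that client-side protocol, with an assumed follower interface: the `(data, version, access_rate)` return tuple and the threshold value are hypothetical, chosen only to illustrate the mechanism:

```python
HOT_THRESHOLD = 1000  # requests/sec above which the client caches the item (assumed)

class TaoClient:
    def __init__(self):
        self.local = {}  # key -> (version, data), only for hot items

    def read(self, follower, key):
        cached = self.local.get(key)
        # Pass our known version so the follower can answer "unchanged"
        # (data=None) instead of resending the payload.
        data, version, rate = follower(key, cached[0] if cached else None)
        if data is None and cached:
            return cached[1]            # version unchanged: reuse local copy
        if rate > HOT_THRESHOLD:
            self.local[key] = (version, data)  # hot: keep data + version locally
        return data
```

The access rate doubles as a throttling signal: for extremely hot keys the client can also back off and serve its local copy without asking the follower at all.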
Many objects have more than 6,000 associations with the same atype. Let’s say we want to check whether id_k has an atype relationship with id, which is a hot object. It’s hard to tell whether id_k is simply not cached in the association list, or id_k has no atype relationship with id in the first place. Such requests would need to query the database to return a result, which is inefficient.
To address this issue, TAO applies some optimizations:
- If atype has an inverse edge type and id_k is not a high-degree object, we can check id_k’s association list instead. They can tell whether an object is high-degree by using assoc_count.
- But what if atype doesn’t have an inverse edge type, or id_k is also a high-degree object? They observe that associations have creation times, and objects have creation times too. So if the cached list [(id, atype, id_1), (id, atype, id_2), …] contains any id_j whose association with id is older than id_k’s creation time, we know that id has no atype relation with id_k: had the edge existed, it would be newer than that association and therefore inside the cached prefix.
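The creation-time trick above can be sketched as a small predicate over the cached newest-first prefix of id’s association list; the data shapes here are hypothetical:

```python
def definitely_absent(cached_prefix, id_k, id_k_ctime):
    """cached_prefix: newest-first [(id_j, assoc_time), ...] for (id, atype).

    An edge to id_k can only have been created at or after id_k's creation
    time. If the prefix reaches back past that time (its oldest entry is
    older than id_k_ctime), every edge new enough to point at id_k is in
    the prefix, so id_k's absence from it proves the edge doesn't exist.
    """
    ids = {j for j, _ in cached_prefix}
    oldest = min((t for _, t in cached_prefix), default=None)
    return oldest is not None and oldest < id_k_ctime and id_k not in ids
```

When the predicate returns False, the check is inconclusive and TAO still has to fall back to the database.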
When a write request occurs, the follower-tier cache server forwards it to the leader. If the region is the primary (master) region, the leader directly updates the database. If it’s a replica (non-primary) region, the leader forwards the request to the corresponding leader in the primary region to perform the write.
Let’s assume we are in the primary region.
After the leader-tier cache server updates the database, it proceeds as follows:
- It updates its local cache, then returns a changeset in the response to the follower-tier cache server that initiated the request.
- That follower updates its cache synchronously with the returned changeset.
- It then sends an invalidation message (if the write was to a simple object) or a refill message (if the write was to an association list) to other follower tiers that may have cached the affected key. Those tiers would perform cache invalidation by requesting the latest data from the leader tier and update their cache data.
The TAO paper does not specify the exact type of message queue used for this propagation. For now, just note that TAO maintains a queue-based system to reliably deliver these messages asynchronously.
Each cache refill/invalidation message carries a version number, which lets the follower’s cache server determine whether it should apply that message (since a newer invalidation message might reach the cache server before an older one arrives).
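The version check might look like this minimal sketch; the message and cache-entry shapes are assumed:

```python
cache = {}  # key -> (version, value)

def apply_refill(key, version, value):
    """Apply a refill message only if it's newer than what we hold,
    so a delayed older message can't overwrite a newer value."""
    held = cache.get(key)
    if held is None or version > held[0]:
        cache[key] = (version, value)
        return True
    return False  # stale message: drop it
```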
In normal operation (at most one failure encountered by a request) TAO provides read-after-write consistency within a single tier.
Below is a simplified flow of a write request.
No matter if client makes write request to secondary or primary region, as long as the client consistently makes request to a cache server, it should see its up-to-date write!
TAO has simple rule: All writes must go to the leader in primary region.
Below is a diagram illustrating the flow of a write that originates in a secondary region.
When a write request is issued in a secondary region, it follows this path:
Follower Tier (Secondary region) → Leader Tier (Secondary region) → Leader Tier (Primary region).
Once the leader in the primary region updates the data in the primary database, it sends the response back along the same path:
Leader Tier (Primary region) → Leader Tier (Secondary region) → Follower Tier (Secondary region).
Meanwhile, the updated data is replicated asynchronously from the primary database to the replica databases in secondary regions.
TAO embeds cache invalidation and refill messages in the database replication stream, so once a transaction is committed and replicated, these messages are delivered to the appropriate caches in that region.
This propagation is designed to be very fast — the TAO paper reports that replication lag is typically less than 1 second, ensuring that caches across regions converge quickly.
Let’s say you’re adding an edge like (id1, LIKES, id2), and LIKES has an inverse type defined: LIKES → LIKED_BY.
So this single logical write creates two edge records:
- Forward edge: (id1, LIKES, id2), which belongs to id1’s shard
- Inverse edge: (id2, LIKED_BY, id1), which belongs to id2’s shard
This operation touches 2 different shards, and thus potentially two different leader cache servers, even within the same region. Let’s call these leader cache servers in the primary region L1 and L2 respectively, and the corresponding leader cache servers in the secondary region L1' and L2'. Here’s the flow:
- L1 receives the write request from the leader in the secondary region. It:
- processes the forward edge (id1, LIKES, id2)
- also generates the inverse edge (id2, LIKED_BY, id1)
- updates the database, then updates its own cache
- forwards the inverse edge write to L2
- ⚠️ L1 doesn’t wait for a response from L2; after it’s done updating the forward edge, it immediately responds with a changeset containing data for both the forward and inverse edges
- L2 receives the write request from L1 for the inverse edge; it updates the database and then its own cache.
- L1', after receiving the changeset from L1, updates its own cache, then forwards the response to the follower that originated the request. It also makes a request to L2' to notify it that it should update its cache.
- Note that L2' should have a way to know this write has already been authorized by the primary, so it can just apply it. The paper doesn’t disclose how, but I think a simple header type/flag would suffice for this check.
- The data in both databases is then replicated asynchronously.
- L1 does not wait for a response from L2, so the inverse write might fail. TAO has an asynchronous repair job to fix such missing edges!
There are a few notable race-condition cases:
- The changeset in the response from the leader in the primary region to the leader in the secondary region may race with the refill/invalidate messages arriving through the replication stream.
- Cache eviction & replication lag: the secondary region’s storage server may hold an older version of a piece of data than what is cached by the cache server. If the post-changeset entry is evicted from cache and then reloaded from the database, the client may observe a value go back in time within a single follower tier. This only happens if it takes longer for the secondary region’s storage server to receive an update than it does for a cached item to be evicted from cache, which is rare.
TAO resolves these race conditions by using a version number that is present in both the persistent store and the cache. The version number is incremented on each update, so a follower can safely invalidate its local copy of the data if the changeset indicates that its pre-update value was stale.
If a primary database fails, TAO automatically promotes a secondary to be the new primary.
If a secondary database fails:
- Cache misses in that region are redirected to TAO leaders in the primary region.
- Normally, refill/invalidation messages are delivered via the DB replication stream.
- But during the outage, this stream is unavailable — so TAO runs an extra binlog tailer on the primary DB to send refills and invalidates inter-regionally.
- Once the secondary DB comes back online, it replays the missed refills/invalidates through normal replication — effectively delivering them again, ensuring consistency.
If a leader cache server fails, TAO followers reroute requests around it:
- Read misses go directly to the database.
- Writes are rerouted to a random leader in the same tier (a temporary replacement).
This replacement leader:
- performs the write (including inverse edge updates)
- sends invalidations to followers
- logs an asynchronous invalidation targeting the failed leader, to restore its cache later
These asynchronous invalidations:
- are stored both on the coordinating node and in the replication stream
- are replayed once the failed leader comes back online
⚠️ If the failed leader is partially available, followers may serve stale data until its cache is fully repaired.
Leaders send refill and invalidation messages asynchronously to followers.
If a follower is temporarily unreachable, the leader queues the message to disk for later delivery.
⚠️ If the leader fails permanently before delivering those messages, the affected follower may retain stale data.
To recover, TAO uses a bulk invalidation operation:
- It invalidates all objects and associations from the failed leader’s shard(s) across followers.
- This ensures consistency is restored once the failed leader is replaced.
⚠️ TAO accepts temporary staleness but guarantees eventual consistency via bulk shard-wide invalidation after a leader failure.
TAO clients are configured with a primary and backup follower tier.
Under normal conditions, all requests go to the primary tier.
If the follower server for a shard is unreachable (e.g., due to timeouts):
- The client fails over to the backup tier, sending the request to the server that hosts the same shard there.
- Since the backup also owns the same shard, the request remains fully cacheable and doesn’t require special consistency handling.
This applies to both read and write requests.
⚠️ However, read-after-write consistency can be violated if the read hits the backup before the write’s refill/invalidate has propagated.
TAO favors availability through redundant follower tiers, accepting temporary inconsistency during failover for better resilience.
TAO is Facebook’s purpose-built system for serving social graph queries at massive scale with low latency. By layering a graph-aware API on top of MySQL and memcache, TAO hides complexity from application developers while enabling efficient reads, eventual consistency, and fault tolerance. Its two-tier cache, smart memory layout, and regional replication make it well-suited for highly dynamic, user-driven workloads. TAO continues to serve as a foundational piece of Facebook’s infrastructure, showing how system design can evolve to meet the demands of a global-scale social network.
https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf
https://engineering.fb.com/2013/06/25/core-infra/tao-the-power-of-the-graph/
https://www.youtube.com/watch?v=O3gv5eYfaWU&ab_channel=Jordanhasnolife