Chubby: Lock Service For Loosely-Coupled Distributed Systems
Chubby explained: why a lock service (not a library), how caching/invalidations work, master failover and sessions, plus partitioning, proxies, and name-service use
Chubby is a lock service developed by Google, used mainly for primary election, server discovery, and work partitioning.
- Before Chubby, most distributed systems at Google used ad hoc methods or operator intervention for these use cases.
- Google doesn't introduce new algorithms or techniques in Chubby; it applies existing ones. Notably, they use Paxos for distributed consensus.
- Chubby prioritizes consistency and availability over performance.
First, let's look at why Google decided to build a lock service rather than a consensus library.
Problem: Many systems at Google start as simple prototypes, not carefully architected with high availability in mind. Developers often write code assuming a single master or a single server.
Challenge: Later, as the system grows and the demand for availability rises, they need replication and master election. But retrofitting a consensus protocol (such as Paxos / Raft) directly into an existing codebase is hard, as it requires restructuring the program logic to fit the consensus steps.
Solution: A lock service like Chubby can provide the desired consensus functionality (master election, fault-tolerant state, fencing tokens) without forcing developers to restructure their systems.
Problem: Many services elect a primary or partition data across nodes; in either case, they need a way to publish the result so other clients can learn it, e.g. "Who's the current primary?" or "Which server holds this partition?"
Challenge: One obvious solution is to use DNS or a name server. But DNS caching is time-based: clients cache lookups for a configured TTL. If the TTL is too short, DNS takes a huge load; if it's too long, failover is slow (clients keep using a dead or outdated master).
Solution: Chubby already runs Paxos replication and has client sessions with consistent caching: clients cache reads locally, and when the data changes, Chubby actively invalidates those caches instead of relying on TTL expiry. This guarantees stronger consistency than DNS-style timeouts while keeping load low.
Distributed-consensus algorithms use a quorum to make decisions, which requires a certain number of running instances of the application. With a lock service, even a single client can obtain a lock and make progress safely. Engineers don't need to care how many instances of their service are running; Chubby takes care of that for them.
Given the arguments in the previous section, Google engineers made two core design decisions:
- Use a lock service, not a client-side consensus library.
- Embed the metadata (primary identity, data-partition information) as small files directly inside the lock service, rather than building and maintaining a second service.
These core decisions lead to some follow-up decisions / requirements:
- A service's metadata file served by Chubby may have thousands of clients, so Chubby must allow thousands of clients to observe that file without needing many servers.
- A notification mechanism is needed to notify instances when the service's primary changes.
- Consistent caching of the files is desirable (since clients may otherwise want to poll aggressively).
- This requires caching on the client side; the cache is invalidated via event notifications.
There are 2 types of lock usage:
- Fine-grained lock:
- Held briefly (milliseconds → seconds)
- Acquired / released frequently - often at the same rate as client transactions
- Example: row-level locking in a database
- Coarse-grained lock:
- Held for long period (mins - hours - days)
- Acquired rarely - only during master election or reconfiguration
- Example: Electing a Bigtable master that runs for hours.
These 2 types of locks come with different requirements:
- Fine-grained locks:
- High-acquisition rate: Lock service must scale with client transaction rate
- Any downtime hurts: Even short lock-service outages stall clients trying to acquire / release locks.
- Cheap recovery preferable: As locks are short-lived, it's okay to drop them on failure; recovering lost fine-grained locks should not be a big deal.
- This may seem to contradict the earlier point, but here the focus is different: the question is whether the lock service should attempt to recover lock state after a failover (e.g., during leader change). Since fine-grained locks are typically held for only milliseconds or seconds, persisting and recovering them would be costly and unnecessary. It is simpler and acceptable to drop them, as clients must already be prepared to lose locks during network partitions.
- Scalability focus: Need the ability to freely shard the lock service across servers.
- Coarse-grained locks:
- Low-acquisition rate: Lock service load is light, independent of client transaction rate
- Temporary outages less harmful: Clients rarely need to acquire locks.
- Lock survival is critical: If a lock is lost during failover, recovery can be very costly (e.g. electing a new Bigtable master, reassigning partitions).
- Availability can be modest: Fewer servers can handle the load
Chubby explicitly only supports coarse-grained locks:
- Its primary job is electing primaries, coordinating mastership and storing metadata
- It’s not designed to handle high-frequency lock churn
- By focusing on coarse-grained usage, Chubby can survive failures without losing locks (lock state is persisted in Paxos-replicated state).
Even though Chubby only supports coarse-grained locks, it's straightforward for clients to implement their own fine-grained locks tailored to their application.
Chubby has two components that communicate via RPC: a server, and a library that client applications link against.
Client:
- Must communicate with the server via the client library.
- Finds Chubby cells via traditional DNS.
Chubby cell: consists of a small set of servers (typically 5). One is the master (primary); the others are replicas.
Cell master: Receives read / write requests from clients and propagates writes to the replicas via the consensus protocol.
Cell replica: When it receives a request from a client, it returns the master's identity so the client can redirect. Each replica maintains a copy of a simple database, updated when it receives update events from the master.
Database: Each Chubby server (the master and all replicas) keeps its own copy of the database on local disk.
This database holds:
- File namespace (directories, files, ephemeral nodes).
- Metadata (ACL references, monotonically increasing counters, checksums).
- Lock state (lock generations, which sessions hold what)
- File contents written by clients, which in practice encode things like who is primary, partition mappings, configs, service discovery info.
The first version of Chubby used a replicated version of Berkeley DB.
Chubby exposes a hierarchical namespace of nodes - each node is either a file or a directory, has a unique path, and can carry a lock. There are no hard or symbolic links, and permissions are per-node.
Nodes may be permanent or ephemeral.
- An ephemeral file is deleted when no client has it open; it's often used to advertise client liveness.
- An ephemeral directory is deleted when it becomes empty.
- Opening a node returns a handle (like a UNIX fd) containing check digits and enough information for the server to recreate handle state across a master failover.
Metadata on every node.
- Three ACL names: identifiers of the files that control read, write, and change-ACL permissions for the node. (ACLs are themselves files.)
- Four 64-bit, monotonically increasing counters:
- Instance number,
- Content generation (files only)
- Lock generation
- ACL generation.
- A 64-bit file-content checksum so clients can detect differences.
⚠️ To the client, Chubby looks like a hierarchical filesystem, but internally it still uses a database to store all of the above information.
Each Chubby file and directory can act as a reader-writer lock: only one client can hold the exclusive (write) lock, or any number of clients may hold the lock in shared (read) mode.
| Requesting → / Holding ↓ | Shared (read) | Exclusive (write) |
|---|---|---|
| Shared (read) | No conflict (granted) | Conflict (blocked) |
| Exclusive (write) | Conflict (blocked) | Conflict (only 1 writer) |
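The compatibility table above reduces to one rule, shown here as a minimal model (an illustration, not Chubby code): a request can be granted alongside a held lock only when both modes are shared.

```python
# Toy model of the reader-writer compatibility rule from the table above.

SHARED, EXCLUSIVE = "shared", "exclusive"

def compatible(held_mode, requested_mode):
    """True iff the requested lock can be granted next to the held one."""
    return held_mode == SHARED and requested_mode == SHARED
```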
Locks in Chubby are advisory: Chubby arbitrates who owns the lock, but it doesn't force clients to obey it. Client code must check / enforce the lock when acting.
In Chubby, acquiring a lock (read or write) requires write permission on the node. This prevents an unprivileged reader from blocking writers.
When client acquires the lock successfully, Chubby returns a sequencer, which contains the name of the lock, the lock mode (exclusive or shared) and the lock generation number.
- The client then passes the sequencer to external services as a fencing token, protecting against delayed requests / packets: if an external service finds that the sequencer in a request is invalid, it rejects the request.
The sequencer validity check can be done against the server's Chubby cache (note: this requires the server to maintain a session with Chubby, so its cache is updated whenever there's a change) or against the most recent sequencer that the server has observed.
In cases where using sequencers is too slow, or external services don't support the sequencer check, Chubby provides an imperfect workaround. If a client releases a lock in the normal way, it's immediately available for other clients to claim; otherwise, the lock server prevents other clients from claiming the lock for a period called the lock-delay (e.g. 1 min).
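The "most recent sequencer observed" check can be sketched as follows. This is a hedged illustration of the fencing idea, not Chubby's code; field names (`lock_name`, `generation`) mirror the sequencer contents described above but the classes themselves are hypothetical.

```python
# Sketch: a downstream server remembers the highest lock generation seen
# per lock and rejects requests carrying an older (stale) sequencer.

from dataclasses import dataclass

@dataclass
class Sequencer:
    lock_name: str      # name of the Chubby lock
    mode: str           # "exclusive" or "shared"
    generation: int     # monotonically increasing lock generation

class DownstreamServer:
    def __init__(self):
        self.latest = {}  # lock_name -> highest generation observed

    def handle(self, seq, request):
        seen = self.latest.get(seq.lock_name, -1)
        if seq.generation < seen:
            # A delayed request from a client whose lock was since lost.
            return "REJECTED: stale sequencer"
        self.latest[seq.lock_name] = seq.generation
        return f"OK: {request}"
```

A delayed packet from an old lock holder arrives with a lower generation than one the server has already seen, so it is fenced off.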
A Chubby session is a relationship between a client and a cell, maintained by periodic KeepAlive RPCs. Yes, this means the master receives lots of KeepAlive requests; we'll go into the optimization in a following section. Unless the client explicitly ends the session or the master declares it expired, the client's handles, locks, and cached data remain valid. Idle sessions (no open handles / no calls) are ended after about a minute.
Initially, this session data was stored in the database, but Google engineers later changed the design: session state is now kept in memory (not persisted). After a failover, the new master recreates sessions and waits a full worst-case lease before fully reopening operations. See the Session Optimization section for more detail.
Each session has a lease - an interval of time during which the master guarantees not to terminate it. The lease is advanced by the master in three cases:
- On session creation
- After a master failover
- When responding to a `KeepAlive` RPC
The master deliberately blocks the KeepAlive reply until the lease is nearly expired, then extends it (default is 12s, but can be increased under load). The client immediately issues another KeepAlive, so one is almost always outstanding. Events and cache invalidations are piggybacked on KeepAlive replies, and all RPCs flow client → master, ensuring sessions cannot be kept alive without processing invalidations.
The client maintains a conservative local lease timeout, shorter than the master’s, to account for:
- Network flight time of the KeepAlive reply
- The possibility of the master’s clock advancing slightly faster
Let me explain a bit. The Chubby master tracks each client's session with an absolute expiry time in its own clock, for example "Client A's lease expires at 12:00:10.123 (master clock)." When the client gets a KeepAlive response saying "I'll give you 10 more seconds of lease", there may have been network delay, and the master's clock may run slightly faster. So instead of "Okay, I'll extend my lease to 10s from now", the client must enforce something like "I'll extend to 8s from now". To maintain correctness, the system assumes the master's clock cannot run faster than a known constant rate relative to the client's.
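The client's conservative deadline can be written as a small calculation. This is a toy formula under stated assumptions, not Chubby's actual code: it subtracts the observed RPC round trip (worst case, the master started counting a full round trip ago) and applies an assumed clock-drift bound (`drift_rate` is made up here).

```python
# Toy calculation of a client's conservative local lease deadline.
# Assumptions: the master grants `lease_s` seconds measured from when it
# sent the reply; `rtt_s` is the round-trip the client observed; the
# master's clock runs at most `drift_rate` faster than the client's.

def client_lease_deadline(now, lease_s, rtt_s, drift_rate=0.01):
    """Local time at which the client stops trusting its lease."""
    usable = (lease_s - rtt_s) * (1.0 - drift_rate)
    return now + usable

# e.g. a 12s grant observed after a 0.5s round trip, with 1% drift margin,
# is treated as roughly 11.4s of usable lease.
```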
If the client’s local lease expires without a reply, it enters jeopardy: caches are disabled, and the app should quiesce. The client then waits a grace period (default 45s).
- If a `KeepAlive` reply arrives during the grace period → the library delivers a safe event and the client resumes.
- If no reply arrives → the library delivers an expired event, and the client must treat all state (locks, caches, watches) as lost.
When the client's local lease times out, we say it is in jeopardy. Jeopardy is signaled as an event and denotes a temporary state (the grace period) in which the client pauses work until it learns whether the session survived (safe) or has expired.
There are several scenarios that can make a client's lease time out:
- Master failover (common case): the old master stops replying and the client's local lease expires → it enters jeopardy. If a new master replies during the grace period, the client gets a safe event; otherwise expired.
- Network hiccups / partitions / congestion: `KeepAlive` replies are delayed or lost (this is why Chubby eventually sent `KeepAlive`s over UDP; TCP backoff didn't respect lease deadlines).
- Client-side pauses: long GC pauses, CPU starvation, or event-loop stalls can delay `KeepAlive`s enough to miss the local deadline.
Let's take a look at an example of master failure.
The above image shows the sequence of events in a lengthy master fail-over, in which the client must use the grace period to preserve its session.
Typically, when the old master dies and no KeepAlive reply is received, the client continues sending KeepAlives, which time out until its local lease C2 expires. At that point, it enters jeopardy. During jeopardy, the client retries against replicas; if a new master is elected in time and responds with a new epoch, the client receives a new lease C3 and continues safely. Otherwise, if no reply arrives before the grace period ends, the session expires and all state is lost.
As mentioned before, Chubby adopts consistent caching - something you won't often see in typical systems, since it's usually a performance killer.
To reduce read requests to Chubby, each client maintains its own consistent, write-through, in-memory cache of file data and node metadata. Every client establishes a dedicated session with the Chubby master; as long as this session remains valid, all cached data is considered valid. We'll examine this process in more detail in the next section.
⚠️ One note: Chubby permits clients to cache locks - they can hold a lock longer than needed, until an event informs the lock holder that another client has requested a conflicting lock, at which point the holder should release it.
Chubby invalidates clients' caches in 3 cases:
- Before any modification to a node (file data or metadata). On a write, Chubby:
    - Sends invalidations to all clients that may cache the node; this triggers cache flushes, so subsequent reads go to the Chubby master.
    - Withholds new cache leases, so clients won't re-cache the value until the write completes.
    - ⚠️ While invalidations are in progress, reads go to the master and return the current (pre-write) value until the write commits.
- On session uncertainty / lease expiry: If a client's local lease deadline passes without a `KeepAlive` reply (e.g., during master failover or a network hiccup), the client flushes its cache and enters jeopardy (the grace period). It resumes caching only after it receives a `KeepAlive` reply (safe) or declares the session expired.
- After master failover: A newly elected master emits a fail-over event. Clients flush their caches because some invalidations may have been missed during the outage, then acknowledge via `KeepAlive` before normal operation resumes.
After a modification, subsequent reads go directly to the Chubby master, not the client's local cache. This can lead to a thundering-herd problem, in which the master is overwhelmed by read requests. To prevent this, a hybrid scheme could be applied: when the read load gets too high, block reads until the invalidation completes. This may pause or stall clients, but it protects the Chubby master.
- Created with `Open()` on a file or directory, destroyed with `Close()`.
- Act like UNIX file descriptors but scoped to Chubby's namespace.
- Options at open time include: read/write/lock rights, events to receive, lock-delay, and file creation parameters.
- Each client library starts with a handle to "/".
- `GetContentsAndStat()` → atomically read file contents + metadata.
- `GetStat()` / `ReadDir()` → read metadata or a directory listing.
- `SetContents()` → atomically replace entire file contents (with an optional generation check for CAS semantics).
- `SetACL()` → update access controls.
- `Delete()` → delete a node if it has no children.
- `Acquire()` / `TryAcquire()` / `Release()` → lock primitives.
- `GetSequencer()` / `SetSequencer()` / `CheckSequencer()` → manage sequencers (aka fencing tokens).
- Special calls: `Poison()` cancels all ops on a handle (used to stop other threads safely).
- Ops fail if the node tied to the handle is deleted (handles bind to file instances, not just names).
- Operation parameter: lets clients supply callbacks, wait for async completion, and fetch diagnostics.
- All contenders open the same lock file and try `Acquire()`.
- Winner becomes primary; losers act as replicas.
- The primary writes its identity into the file (`SetContents()`), so others can discover it via `GetContentsAndStat()` or file-modification events.
- For safety, the primary also obtains a sequencer (`GetSequencer()`) and downstream services check it (`CheckSequencer()`), or fall back to lock-delay if sequencer checks aren't supported.
The cached read path is trivial: we just read from the local cache. So we'll focus on the read path without the cache - the case where the read request goes to the Chubby master.
Here's a sequence diagram showing the steps of a read request:
When the Chubby master wants to modify a node (file data or metadata), it must ensure:
- No client will read stale data after the write
- All cached copies are either invalidated or expired.
On a write, Chubby will:
- Block the write to the node
- Master sends invalidation messages to all clients that might cache the node and waits for each client to satisfy one of these 2 conditions:
- ACK: Each client flushes its cache and piggybacks an ack
- Lease-expiration: If a client is unresponsive (crashed, partitioned, …) the master waits until that client’s cache lease expires. After expiry, the master assumes that the client’s cache is invalid.
- Process the write:
- Every mutation is encoded as a log entry
- This entry is replicated via Paxos to a majority of replicas before it's considered committed.
- Only then does the master apply the change locally and reply OK to the client.
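The invalidate-before-commit order above can be modeled in a few lines. This is a toy simulation, not Chubby code: the dict of client caches and the list standing in for the Paxos-replicated log are assumptions for illustration.

```python
# Toy simulation of the write path: invalidate caches, then commit.

class MasterSim:
    def __init__(self, clients):
        self.clients = clients   # client_id -> set of cached paths
        self.log = []            # stand-in for the Paxos-replicated log

    def write(self, path, value):
        # 1. Invalidate every cache that may hold `path`. Real Chubby waits
        #    for each client's ack, or for its cache lease to expire.
        invalidated = [cid for cid, cached in self.clients.items()
                       if path in cached]
        for cid in invalidated:
            self.clients[cid].discard(path)
        # 2. Only then commit: append to the (stub) replicated log.
        self.log.append((path, value))
        return invalidated
```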
- Master: when the master fails, the other replicas run the election protocol to elect a new master; this process typically takes 5-30s.
- Replica:
- When a replica has been down for a few hours, a replacement system spins up a new replica: it starts the lock-server binary on a free machine, then updates the DNS tables, replacing the failed replica's IP address with the new one.
- The master sees the change by periodically polling DNS, then updates the list of the cell's members in the cell's database, which is replicated to the other members.
- Once the new replica has caught up and begun processing requests, it's permitted to vote in elections.
Here’s the process of new master election
Chubby's clients are individual processes, so Chubby must handle more clients than one might expect; Google engineers have seen 90k clients communicating directly with a single Chubby master. So, assuming the master has no serious performance bug, they used several approaches:
- Create an arbitrary number of Chubby cells; Google's typical deployment uses one Chubby cell per data centre of several thousand machines.
- Increase the lease time from the default 12s up to around 60s when under heavy load, so the master processes fewer `KeepAlive` RPCs.
- Chubby clients cache file data, metadata, the absence of files, and open handles to reduce the number of calls they make to the server.
To make Chubby scale better, Google engineers came up with some further improvements.
Since a Chubby cell's namespace has many nodes, and one client typically reads and writes only some of those nodes, not all of them, this suggests partitioning the Chubby cell's namespace.
Partition unit: by directory
The key idea is that all siblings under the same parent directory are in the same partition, but they're not required to be in the same partition as their parent.
For example, with $N$ partitions:
- Formula: a node $D/C$ (child $C$ inside directory $D$) is stored in partition $P(D/C) = \mathrm{hash}(D) \bmod N$.
- The directory itself (its metadata) may be stored elsewhere, e.g. where its parent is. So different directories get mapped to different partitions.
- Read / write traffic is spread across the $N$ partitions → each partition sees roughly $1/N$ of the load.
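The mapping can be sketched directly from the formula. The choice of SHA-256 here is an assumption for the example; the scheme only requires some stable hash of the parent directory's name.

```python
# Partition a node by hashing its PARENT directory, so all siblings
# land on the same partition (per P(D/C) = hash(D) mod N).

import hashlib

def partition_of(path, n_partitions):
    """Map node `path` to a partition index in [0, n_partitions)."""
    parent = path.rsplit("/", 1)[0] or "/"
    digest = hashlib.sha256(parent.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions
```

Because only the parent is hashed, `/ls/cell/dir/a` and `/ls/cell/dir/b` always share a partition, while `/ls/cell/dir` itself may live wherever its own parent hashes to.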
Even with namespace partitioning, each Chubby master must still handle all the `KeepAlive` traffic from its clients to keep sessions alive. And given that RPC traffic is dominated by session KeepAlives (93%), this is worth optimizing.
Say we have 1,000 clients, each accessing 10 directories.
- No partitioning (1 master):
- 1 session per client = 1,000 sessions.
- The master sees all reads/writes and all 1,000 `KeepAlive`s.
- Partitioning with 4 masters:
- Reads/writes: evenly spread → ~25% of ops per master (best case!)
- Sessions: each client might touch directories in all 4 partitions → now 4,000 sessions total.
- `KeepAlive`s: multiplied by 4, so per-master load isn't reduced much.
Solution: We can introduce a proxy - an intermediate process that aggregates many clients' sessions into a smaller number of Chubby sessions. Clients connect to the proxy, and the proxy maintains one or a few sessions with the real Chubby cell.
Key benefit:
- Multiplexing: many client `KeepAlive`s are represented by fewer `KeepAlive`s to the master.
- Demultiplexing: events (like invalidations or jeopardy) from the master are forwarded to the right clients.
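The multiplex/demultiplex idea can be modeled compactly. This is a toy model of the concept (class and method names are invented): however many clients sit behind the proxy, it keeps a single upstream session, and fans master events back out only to the clients that care.

```python
# Toy model of a Chubby proxy: N client sessions -> 1 upstream session,
# with master events demultiplexed back to the affected clients.

class Proxy:
    def __init__(self):
        self.clients = {}   # client_id -> set of paths that client caches

    def register(self, client_id, paths):
        self.clients[client_id] = set(paths)

    def upstream_sessions(self):
        # One session (one KeepAlive stream) to the master, regardless
        # of how many clients are connected.
        return 1 if self.clients else 0

    def demux_invalidation(self, path):
        # Forward a master invalidation only to clients caching `path`.
        return sorted(c for c, ps in self.clients.items() if path in ps)
```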
Although designed as a lock service, Chubby’s most popular use became a name/metadata service. In DNS, caching is time-based (TTL): short TTLs speed failover but can overload servers; long TTLs slow failover. At Google scale, a job with 3,000 processes where each process talks to every other can drive ~150k DNS lookups/sec with a 60s TTL, creating severe, spiky load.
Chubby avoids this trade-off with explicit cache invalidation: clients cache names indefinitely and receive invalidations piggybacked on KeepAlive replies, so a constant KeepAlive rate maintains arbitrarily many cached entries when there are no changes. This lets Google deliver swift name updates without polling each entry. In practice, a 2-CPU 2.6 GHz Xeon Chubby master handled ~90k direct clients. Early “thundering herd” during large job starts was mitigated by batching lookups (returning ~100 related names per call).
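The ~150k lookups/sec figure above works out directly from the job shape described: 3,000 processes, each re-resolving every other process's name once per 60s TTL.

```python
# Back-of-the-envelope check of the DNS load figure cited above.

processes = 3000
ttl_seconds = 60

# Each process re-resolves the other 2,999 names once per TTL window.
lookups_per_second = processes * (processes - 1) / ttl_seconds
# ≈ 149,950 lookups/sec, i.e. roughly 150k/s
```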
Chubby now provides name service for most systems and exposes Chubby-backed names to legacy apps via a Chubby DNS server. Common patterns include master location (GFS/Bigtable), service rendezvous (MapReduce), and configuration storage.
Aside: ZooKeeper offers a similar coordination role but uses client-managed watches; Chubby manages client caches and blocks updates until caches are invalidated, trading a bit of write latency for stronger read consistency and simpler client semantics.
Initially, session creation was treated as a write, so after a failure the new master would know about every session (and could tell it to invalidate its cache). This is costly: besides writing to the database, a write always involves consensus and responses from the replicas. It also invites a thundering herd: if many clients start up at the same time, they create a burst of session-create requests to Chubby.
Google engineers first attempted to optimize this with a lazy session-create mechanism: sessions are only written on their first mutation. Read-only sessions would instead be recorded probabilistically on later KeepAlives.
Failure:
This solution introduced an issue when the master goes down before a read-only session has been recorded. For example:
- t=3: A reads (read-only); its session is scheduled to be recorded at t=7.
- t=6: B reads; its session will be recorded at t=9.
- t=8: C reads; its session would be recorded at t=11.
- t=10: Master fails.
- t=12: New master elected; it rebuilds state only from sessions that were recorded (A, B).
- t=13: A write occurs → invalidations are sent to A and B, because the new master knows only about those sessions.
- C is not in the database yet, so it receives no invalidation and can serve stale data from its local cache until its lease expires.
Final solution: Chubby stopped persisting sessions and instead recreates sessions, like handles, after failover. A newly elected master waits a full worst-case session lease (default 12s) before fully reopening operations, ensuring any unknown sessions either check in or expire - no client can keep serving cached data unnoticed.
- Developers underestimate availability issues: Many teams treated Chubby as if it were "always up," writing applications that reacted poorly to even short outages. For example, one system of hundreds of machines would trigger long, synchronous recovery procedures whenever Chubby elected a new master - magnifying a single failure into a large-scale disruption. The key lesson: design applications to tolerate brief Chubby outages so that master failover has little or no impact. This reinforces the case for coarse-grained locks, which change infrequently and are less sensitive to short interruptions.
- Global availability ≠ local availability: The global Chubby cell was almost always up, but clients often observed lower availability than with their local cell. Why? A local cell is less likely to be partitioned from its clients, and when local maintenance occurs, the client is already impacted anyway. Lesson: what matters is observed availability at the client level, not just aggregate uptime.
- APIs shape developer behavior (sometimes in bad ways): Chubby exposed a "failover event" so clients could re-check state after a master change. The intent was harmless, but many developers chose to crash their apps on receiving this event - reducing availability further. In hindsight, it might have been better to send redundant "file change" events or ensure no events were lost during failover. Lesson: API semantics influence failure handling; make the safe path the easy path.
- Guardrails for optimistic developers: Google adopted three practices to stop teams from over-relying on Chubby availability:
    - Design reviews: advising teams against fragile usage patterns.
    - Higher-level client libraries: hiding short outages automatically.
    - Postmortems: not only fixing Chubby bugs, but also reducing how sensitive apps were to its downtime.
- Fine-grained locks weren't needed: Despite early designs for a fine-grained lock server, developers found they could usually optimize by removing communication instead. Coarse-grained locks sufficed in practice, reinforcing Chubby's design choice.
- Transport protocol trade-offs matter: `KeepAlive`s were used both to refresh leases and to piggyback events / invalidation acks. This was elegant - clients couldn't refresh a lease without also acknowledging invalidations - but it clashed with TCP. TCP's congestion backoff ignored Chubby's time-sensitive lease deadlines, causing spurious session loss under network stress. The workaround was to send KeepAlives via UDP. Lesson: layering higher-level timeouts on TCP can be fragile; sometimes you need UDP or a hybrid protocol (e.g., a TCP-based GetEvent() RPC for the normal case, with UDP reserved for time-critical lease refreshes).
https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
https://medium.com/chubby-a-centralized-lock-service-for-distributed-applications-390571273052