Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. In this blog, we’ll look at how Spanner is structured and how its implementation supports external consistency and a set of powerful features: non-blocking reads in the past and lock-free read-only transactions across all of Spanner.
Spanner is organized at the highest level into a universe - a logical deployment that can span many physical locations around the world.
Within a universe are multiple zones, each representing a physical or logical data-center region.
Each zone runs a collection of spanservers. These servers are responsible for storing data, serving reads and writes, and executing transactions. Spanservers inherit ideas from Google’s Bigtable storage system, but extend them with features needed for transactions, replication, and multi-versioned data.
Spanner gives developers SQL-like, schema-based databases with real transactions built on top of a key-value system. From the developer’s view, Spanner still looks like a relational database: it has tables, and each table has columns and primary keys. Developers write queries in Spanner’s SQL-like language to fetch and update data.
But underneath, Spanner still stores key-value pairs:
(key:string,timestamp:int64)→value
The “key” is built from the table name plus the primary-key fields.
In Spanner, a directory is a set of contiguous keys that share a common prefix. This prefix comes from the table hierarchy; every database must have at least one hierarchy of tables.
For example, one User has multiple Albums, so in the table hierarchy the User table is the parent table, while Album is the child table. Hence, a row in the User table together with all Album rows whose primary keys begin with the same user_id forms one directory.
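To make the prefix idea concrete, here is a minimal sketch (with a hypothetical key encoding, not Spanner’s actual one) of how interleaving child rows under the parent’s primary key makes them sort into one contiguous directory:

```python
# Sketch (hypothetical encoding): rows of a parent table and its
# interleaved child table share a key prefix, so they sort together
# into one directory.

def user_key(user_id):
    return ("Users", user_id)

def album_key(user_id, album_id):
    # Child keys start with the parent's primary key (user_id),
    # so they fall in the same contiguous key range.
    return ("Users", user_id, "Albums", album_id)

rows = sorted([
    user_key(1),
    album_key(1, 10),
    album_key(1, 11),
    user_key(2),
    album_key(2, 20),
])

# All keys with prefix ("Users", 1) are adjacent: one directory.
directory_1 = [k for k in rows if k[:2] == ("Users", 1)]
print(directory_1)
```

Because the child key begins with the parent key, a range scan over one user’s prefix touches the user row and all of its albums in one contiguous sweep.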
Directories let applications keep related data together for locality and define geographic replication policies: the directory is the unit that Spanner moves between datacenters to balance load or co-locate frequently accessed data.
To achieve high availability, Spanner replicates each directory’s data across multiple zones. Replication is typically done by replicating the write-ahead log, which contains all updates made to the directory. To guarantee a consistent view of the data across replicas, Spanner uses the Paxos consensus algorithm: it ensures all replicas see the same history, even when machines fail.
But if Spanner ran Paxos per directory, the overhead would explode: directories can be tiny and there can be millions of them, each needing its own log, leader, heartbeats, and elections. So Spanner groups many directories into a bigger unit called a tablet and runs Paxos at the tablet level instead, which also shifts replication from the directory level to the tablet level. Each tablet has:
One Paxos state machine
One write-ahead log
Multiple directories inside
All writes to those directories append to the same log, and the replicas of a tablet across zones form a Paxos group. A write is committed after a majority of replicas agree on it. The data itself lives in Colossus; replicas just apply log entries in order.
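A toy model of this grouping, with illustrative names only, shows why the overhead drops: many directories share a single tablet, so there is one Paxos log per tablet rather than one per directory:

```python
# Sketch (names hypothetical): many small directories share one
# tablet, so there is one Paxos log per tablet, not per directory.
from dataclasses import dataclass, field

@dataclass
class Tablet:
    paxos_log: list = field(default_factory=list)    # one shared WAL
    directories: dict = field(default_factory=dict)  # dir_id -> key range

    def append_write(self, dir_id, entry):
        # Writes to any contained directory go through the same log.
        self.paxos_log.append((dir_id, entry))

t = Tablet()
t.directories["users/1"] = ("Users,1", "Users,2")
t.directories["users/2"] = ("Users,2", "Users,3")
t.append_write("users/1", "update album 10")
t.append_write("users/2", "update album 20")
print(len(t.paxos_log))  # both directories share one log
```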
A spanserver hosts many tablets (typically between 100 and 1,000). Its roles are:
Runs Paxos for all tablets it hosts to keep them in sync with other replicas.
Serves reads and writes.
Lock management: Each spanserver implements a lock table for concurrency control; it tracks read and write locks for ranges of keys and ensures serializable isolation.
Transaction manager: For each leader tablet replica it hosts, a spanserver also implements a transaction manager for that replica. If a transaction involves multiple Paxos groups (i.e. it spans multiple tablets), the transaction managers of all the participating leader replicas take part in 2-phase commit, and one of them becomes the coordinator.
Each zone has one zonemaster that assigns tablets to spanservers. It keeps track of which spanserver should host which tablets and directs clients via location proxies. It watches spanservers and handles reassignment if a server fails.
The universe master is a singleton service for an entire Spanner deployment. It does not handle traffic, instead, it collects status information and serves as a management console. Engineers use it to inspect zones, spanserver health and directory placement.
The placement driver automates movement of directories across zones to meet replication policies and balance load. It periodically talks to spanservers to find directories that need to be moved, then directs Paxos groups to relocate those directories. Movement happens on the order of minutes and is transparent to applications.
Spanner guarantees externally consistent transactions, lock-free read-only transactions, and non-blocking reads in the past. Spanner doesn’t need locks for read-only transactions because they run on a consistent snapshot: the transaction sees the database exactly as it existed at one point in time, even if other transactions update data while the read is in progress. Spanner implements this using MVCC (multi-version concurrency control). Each write gets a commit timestamp, and updates create new versions instead of overwriting old ones. A read-only transaction simply reads the newest version whose timestamp is ≤ its snapshot timestamp, ignoring any newer versions.
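The versioning idea can be sketched in a few lines (a toy in-memory store, not Spanner’s implementation): each write appends a new version, and a snapshot read returns the newest version at or below the snapshot timestamp:

```python
# Minimal MVCC sketch: each write appends a new (timestamp, value)
# version; a snapshot read returns the newest version whose
# timestamp is <= the snapshot timestamp. Names are illustrative.

class VersionedStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), ts ascending

    def write(self, key, commit_ts, value):
        self.versions.setdefault(key, []).append((commit_ts, value))

    def snapshot_read(self, key, snapshot_ts):
        result = None
        for ts, value in self.versions.get(key, []):
            if ts <= snapshot_ts:
                result = value  # newest version not after the snapshot
        return result

db = VersionedStore()
db.write("balance", commit_ts=10, value=100)
db.write("balance", commit_ts=20, value=90)
print(db.snapshot_read("balance", snapshot_ts=15))  # -> 100
print(db.snapshot_read("balance", snapshot_ts=25))  # -> 90
```

Note how the write at timestamp 20 never blocks the read at snapshot 15; the reader simply ignores versions newer than its snapshot.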
Many systems use MVCC, but Spanner’s uniqueness comes from how it assigns and orders timestamps globally - which is done by leveraging TrueTime. We’ll go into detail of TrueTime later, but in order to understanding the upcoming section, we’ll need to briefly go through the API TrueTime provides:
| Method | Returns | Description |
| --- | --- | --- |
| TT.now() | TTinterval: [earliest, latest] | The actual time is guaranteed to be in this interval. |
| TT.after(t) | boolean | True if the given timestamp t has definitely passed. |
| TT.before(t) | boolean | True if the given timestamp t has definitely not arrived. |
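The shape of this API can be sketched as follows. The stand-in below simply widens the local clock by an assumed fixed uncertainty; real TrueTime derives its interval from GPS and atomic clocks:

```python
# Sketch of the TrueTime API shape (interval-based clock). This
# implementation is a stand-in: it widens the local clock by an
# assumed uncertainty epsilon instead of talking to timemasters.
import time

EPSILON = 0.004  # assumed ~4 ms uncertainty

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)  # (earliest, latest)

def tt_after(t):
    earliest, _ = tt_now()
    return earliest > t   # t has definitely passed

def tt_before(t):
    _, latest = tt_now()
    return latest < t     # t has definitely not arrived

past = time.time() - 1.0
future = time.time() + 1.0
print(tt_after(past), tt_before(future))  # -> True True
```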
A read-write transaction consists of zero or more reads or query statements followed by a commit request.
The read-write transaction has following steps:
Acquiring read locks: The leader of the involved tablet(s) issues read locks and returns the latest committed data.
Please note that reads in a read-write transaction are always served by the leader.
After the client completes all reads and has buffered all writes, the 2-phase commit starts. One Paxos group acts as the coordinator; the others are participants.
Prepare phase:
Each leader of the participant Paxos group:
Acquires write locks.
Chooses a prepare timestamp which must be larger than any timestamp it has assigned to previous transactions.
Logs a prepare record through Paxos, then notifies the coordinator of its prepare timestamp.
The leader of the coordinator Paxos group also acquires write locks, but skips the prepare phase.
Commit phase: After receiving the responses from all the leaders of the participant Paxos groups, the leader of the coordinator group:
Chooses a commit timestamp s for the entire transaction; it must satisfy s ≥ max(all prepare timestamps, TT.now().latest).
Logs a commit record through Paxos (or abort if it timed out while waiting for other participants)
Waits until TT.after(s) is true before allowing any replica of the coordinator group to apply the commit record. Then the coordinator leader sends the commit timestamp to the client and to all participant leaders; each participant leader logs the transaction’s outcome through Paxos, all participants apply the commit at the same timestamp s, and then release their locks.
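The coordinator’s timestamp choice and commit wait can be sketched like this (helper tt_now() is a stand-in for the TrueTime API, and the Paxos logging step is elided):

```python
# Sketch of the coordinator's commit step (assumed helpers; tt_now()
# returns an (earliest, latest) interval, standing in for TrueTime).
import time

EPSILON = 0.004  # assumed clock uncertainty (seconds)

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)  # (earliest, latest)

def coordinator_commit(prepare_timestamps):
    # Commit timestamp s must be >= every prepare timestamp
    # and >= TT.now().latest.
    s = max(prepare_timestamps + [tt_now()[1]])
    # (Log the commit record through Paxos here.)
    # Commit wait: block until s is definitely in the past,
    # i.e. TT.now().earliest > s.
    while tt_now()[0] <= s:
        time.sleep(0.001)
    return s  # now safe to report to client and participants

s = coordinator_commit([time.time()])
print(tt_now()[0] > s)  # -> True: s is safely in the past
```

The commit wait is short (on the order of 2ε) but is exactly what makes the commit timestamp meaningful to every observer, not just the coordinator.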
The entire process can be visualized by the following sequence diagram:
If you only read data without modifying it, Spanner gives you more than one way to do so. The three main tools are:
Single reads.
Read-only transactions.
Snapshot read-only transactions.
For these read requests, Spanner just reads a specific version of the data while letting writes create newer versions, so there’s no locking at all. Hence they don’t block concurrent reads or read-write transactions.
A single read in Spanner is exactly what it sounds like: a one-shot lookup. Spanner picks a consistent snapshot for that statement, executes the read, and then discards the snapshot. There’s no ongoing transaction context. The important consequence is that while each single read is consistent by itself (you only see the data that is committed by the time you read them), two separate single reads are not required to observe the same point in time.
A read-only transaction is used when we need to execute multiple reads that all see the same snapshot of the database. Conceptually, it runs in 2 steps:
Step 1: Assign a timestamp s_read, the point in time at which this read-only transaction takes its snapshot of the database. Deciding s_read can be tricky.
If the transaction only involves a single tablet (a single Paxos group), then s_read is the timestamp of the last committed write at that Paxos group.
If the transaction’s values are served by multiple Paxos groups, there are two options:
Option 1: Communicate with the leaders of all involved Paxos groups, then set s_read to the minimum of their last committed write timestamps.
Option 2: Just set s_read = TT.now().latest.
Spanner adopted option 2 because of its simplicity.
Step 2: Execute the transaction’s reads as snapshot reads at s_read. These reads can be served not only by the leader but also by follower replicas. Details are in the section: Read optimization.
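A sketch of the timestamp choice in step 1, with tt_now() as an assumed stand-in for the TrueTime API:

```python
# Sketch of how a read-only transaction might pick s_read
# (option 2 from above): just take TT.now().latest. Helper
# tt_now() is an assumed stand-in for the TrueTime API.
import time

EPSILON = 0.004  # assumed clock uncertainty

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)  # (earliest, latest)

def choose_s_read(last_committed_ts_by_group):
    if len(last_committed_ts_by_group) == 1:
        # Single Paxos group: timestamp of its last committed write.
        return last_committed_ts_by_group[0]
    # Multiple groups: skip a round of negotiation and use
    # TT.now().latest, which is guaranteed >= the real current time.
    return tt_now()[1]

print(choose_s_read([100.0]))  # single group -> 100.0
```

Using TT.now().latest may pick a slightly later snapshot than strictly necessary, which can make the read wait for t_safe to catch up, but it avoids an extra round of coordination.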
In short, a snapshot read-only transaction is a read-only transaction with a user-defined timestamp. Instead of letting Spanner choose s_read, the client specifies the timestamp at which they want to read the data.
Spanner lets you choose how fresh the data returned by a read request must be.
Strong read: a read at a current timestamp, guaranteed to see all data committed up until the start of the read. This is the default read type for read requests.
Stale read: a read at a timestamp in the past. If the application is latency-sensitive but tolerant of stale data, stale reads can provide performance benefits. For stale reads, there are a few options:
Exact timestamp: You provide the timestamp, Spanner returns the database data at that timestamp.
Bounded staleness: Read the version of the data that is no staler than a given bound.
Spanner uses the standard wound-wait algorithm for deadlock avoidance. The rule is simple: Spanner tracks the age of each transaction requesting conflicting locks and lets older transactions abort younger ones. An older transaction is one whose earliest read, query, or commit occurred sooner.
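The wound-wait decision on a lock conflict reduces to a single comparison of transaction start timestamps (an illustrative sketch, not Spanner’s code):

```python
# Wound-wait sketch: on a lock conflict, an older transaction
# (earlier start timestamp) aborts ("wounds") a younger holder;
# a younger requester waits for an older holder. Illustrative only.

def resolve_conflict(requester_start_ts, holder_start_ts):
    if requester_start_ts < holder_start_ts:
        return "wound"  # older requester aborts the younger holder
    return "wait"       # younger requester waits for the older holder

print(resolve_conflict(5, 9))  # older vs younger -> "wound"
print(resolve_conflict(9, 5))  # younger vs older -> "wait"
```

Because a transaction only ever waits for strictly older transactions, the wait-for graph can never contain a cycle, so deadlock is impossible by construction.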
Spanner has three types of replicas: read-write replicas, read-only replicas and witness replicas.
Read-write replicas: support both reads and writes. These replicas:
Maintain a full copy of user’s data
Serve reads & writes.
Can vote whether to commit a write.
Participate in leadership election.
Are eligible to become a leader.
Read-only replicas: Only support reads, but not writes. These replicas:
Maintain a full copy of user’s data
Serve reads
Can’t vote whether to commit a write.
Aren’t eligible to become a leader.
Witness replicas: Don’t support reads but do participate in voting to commit writes. These replicas make it easier to achieve write quorums without the storage and compute resources of a full replica. These replicas:
Don’t maintain a full copy of user’s data
Don’t serve reads/writes.
Can vote whether to commit a write.
Participate in leader election but aren’t eligible to become a leader.
While reads in read-write transactions must be served by the replica leaders, single reads, read-only transactions, and snapshot reads can be served by read replicas.
From now on, when I say “read request”, I’m referring to a single read, read-only transaction, or snapshot read.
When a client issues a read request with timestamp s_read, it prefers a replica in the local zone to avoid cross-datacenter latency. Each replica maintains a timestamp t_safe and can serve any read request whose timestamp is ≤ t_safe. t_safe is defined by the timestamp of the last committed write recorded in the replica’s Paxos log.
If a replica receives a request but its t_safe has not yet reached the request’s timestamp, the client briefly waits a few milliseconds so t_safe can advance. If after this wait the replica is still not eligible to serve the read, the request is re-routed to the leader.
There’s one catch: what if no further writes happen at all? Then t_safe never advances and the replica can’t serve the read, even though it holds the latest requested data. To prevent this, the leader defines a value M, a “next-write floor”: it periodically (typically once per second) proposes a no-op entry with commit timestamp M. The no-op entry is replicated to followers via Paxos; when followers apply it, their last committed timestamp increases to M, so t_safe keeps advancing.
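A toy model of the t_safe rule and the leader’s no-op heartbeat (names illustrative):

```python
# Sketch of the t_safe rule: a replica may serve a snapshot read at
# s_read only if s_read <= t_safe (the timestamp of its last applied
# Paxos write); the leader's periodic no-op entries keep t_safe
# advancing even when there is no user traffic.

class Replica:
    def __init__(self):
        self.t_safe = 0.0

    def apply_log_entry(self, commit_ts):
        # Applying any entry (including a leader no-op) advances t_safe.
        self.t_safe = max(self.t_safe, commit_ts)

    def can_serve_read(self, s_read):
        return s_read <= self.t_safe

r = Replica()
r.apply_log_entry(100.0)        # last real write
print(r.can_serve_read(150.0))  # -> False: wait or go to the leader
r.apply_log_entry(200.0)        # leader heartbeat no-op at M = 200
print(r.can_serve_read(150.0))  # -> True
```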
TrueTime is one of the key innovations in Spanner, enabling its lock-free read transactions. It solves one of the most challenging problems in distributed systems: providing a globally synchronized and consistent view of time across all nodes in a system, no matter which region or data center a node is in.
API design: TrueTime provides the APIs shown in the previous section. One special thing about TrueTime is that it exposes time as a TTinterval [earliest, latest]. The now() method returns an interval guaranteed to contain the real current time: we only know that the current time lies inside that interval, not its exact value.
Time sources: TrueTime uses both GPS and atomic clocks.
GPS clocks: These rely on signals from satellites to provide time information. GPS clocks can suffer from antenna failures, interference, design bugs, and spoofing (source: https://www.cl.cam.ac.uk/teaching/2122/ConcDisSys/dist-sys-notes.pdf).
Atomic clocks: These devices measure time based on the vibrations of atoms such as caesium or rubidium. They can drift over long periods.
Combining both types helps bound error even if one fails.
Implementation:
Each data center has several timemaster machines, and every server (client) runs a timeslave daemon. A timemaster uses either a GPS clock or an atomic clock; most use GPS clocks.
The client periodically (typically every 30 seconds) requests the current time from several timemasters, both in its local data center and in others. It then applies a variant of Marzullo’s algorithm to reject inconsistent timemasters (those whose clocks diverge too much from the others) and synchronizes the local machine clock to the consensus of the trustworthy timemasters.
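A highly simplified sketch of the interval-intersection idea behind Marzullo’s algorithm (the real algorithm is more involved):

```python
# Simplified sketch of the idea behind Marzullo's algorithm: each
# timemaster reports an interval that should contain the true time;
# we find the sub-interval covered by the most sources, which
# naturally rejects timemasters whose clocks have diverged.

def consensus_interval(intervals):
    # Sweep over interval endpoints, tracking how many intervals
    # cover each point; keep the point covered by the most sources.
    events = []
    for lo, hi in intervals:
        events.append((lo, +1))
        events.append((hi, -1))
    events.sort()
    best, count, best_lo = 0, 0, None
    for point, delta in events:
        count += delta
        if count > best:
            best, best_lo = count, point
    best_hi = min(hi for lo, hi in intervals if lo <= best_lo <= hi)
    return (best_lo, best_hi)

# Two honest masters agree; the third has drifted far off.
print(consensus_interval([(10, 14), (11, 15), (40, 44)]))  # -> (11, 14)
```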
Uncertainty value: Between synchronizations, the client maintains a slowly increasing uncertainty ε based on its worst-case local clock drift, the masters’ uncertainty, and communication delay. This value indicates that if you get a timestamp from TrueTime, the actual time may be within ±ε of the returned value. The interval returned by TT.now() therefore has width 2ε and is guaranteed to contain the actual time.
In production, ε typically varies between 1–7 ms over each 30‑second poll interval and is around 4 ms most of the time. Occasional spikes can occur if time masters become unavailable or if machines/networks are overloaded.
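A sketch of how ε might evolve between polls, using illustrative constants (an assumed ~1 ms base uncertainty and a 200 µs/s worst-case drift rate), which reproduces the 1–7 ms range mentioned above:

```python
# Sketch of epsilon growth between 30-second polls: epsilon is reset
# at each synchronization, then widens with an assumed worst-case
# local drift rate. Constants are illustrative, not Spanner's actual
# configuration.

DRIFT_RATE = 200e-6   # assumed worst-case drift: 200 us per second
BASE_EPSILON = 0.001  # assumed ~1 ms right after synchronizing

def epsilon(seconds_since_sync):
    return BASE_EPSILON + DRIFT_RATE * seconds_since_sync

print(round(epsilon(0) * 1000, 1))   # -> 1.0 (ms, just after a poll)
print(round(epsilon(30) * 1000, 1))  # -> 7.0 (ms, just before the next poll)
```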
Spanner uses timed leader leases in Paxos so that leadership remains stable (typically ~10 seconds) instead of constantly switching. A replica becomes leader only after it receives a majority of lease votes. While it is leader, it can extend its lease by either successfully committing writes or by requesting another lease vote when the lease nears expiration.
However, leadership alone doesn’t remove the risk of split-brain, where two replicas both believe they are leaders at the same time. For example: the leader becomes temporarily unreachable, the rest of the replicas elect a new leader, and when the old leader comes back, its lease still appears valid.
To prevent this, Spanner defines a leader lease interval: [start, end]
start: when the replica receives a quorum of lease votes
end: when it no longer has a quorum (its lease expires)
Spanner enforces that lease intervals for leaders never overlap within a Paxos group.
It achieves this using a single-vote rule: when a replica grants a lease vote, it logs the vote along with the time and promises not to grant another lease vote until after that lease interval has passed. Because every lease vote is persisted, even across crashes, no two leaders can have overlapping valid lease periods, eliminating split-brain behavior.
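The single-vote rule can be sketched as follows (an illustrative model, not Spanner’s implementation):

```python
# Sketch of the single-vote rule: a replica persists each lease vote
# and refuses to grant another until the previous lease interval has
# passed, so two leaders' lease intervals can never overlap.

class LeaseVoter:
    def __init__(self):
        self.vote_expires_at = 0.0  # persisted across crashes

    def grant_vote(self, now, lease_length=10.0):
        if now < self.vote_expires_at:
            return None  # still bound by the previous vote
        self.vote_expires_at = now + lease_length
        return self.vote_expires_at

v = LeaseVoter()
print(v.grant_vote(now=0.0))   # -> 10.0 (vote granted, lease until t=10)
print(v.grant_vote(now=5.0))   # -> None (previous lease still valid)
print(v.grant_vote(now=12.0))  # -> 22.0 (old lease expired)
```

Since a majority of replicas must grant votes and each voter enforces this rule, a would-be new leader cannot assemble a quorum while any old leader’s lease is still valid.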
External consistency means that a distributed database enforces a single global order of transactions that matches real-time ordering: if transaction A commits before transaction B starts (as observed by any client), then A must appear earlier in the database’s history. Spanner achieves this by tying every transaction commit to a wall-clock timestamp s obtained from TrueTime and by waiting until that timestamp is known to be in the past before returning to the client. After the wait, the commit timestamp s is guaranteed to be in the past for all servers, so any later transaction will obtain a commit timestamp greater than s.
For example:
Imagine two users, Alice and Bob, working with a bank database.
Alice’s transaction (Txn A) writes “transfer $10 from account A to account B.”
The leader assigns a commit timestamp sₐ based on TrueTime.
The leader waits until it is certain that real time has advanced beyond sₐ (the commit‑wait).
Only then does Alice receive confirmation that Txn A is committed.
Bob’s transaction (Txn B) starts later and reads the balance of account A.
Because Alice’s commit-wait ensures that sₐ is already in the past on all replicas, the leader selects a timestamp s_b strictly greater than sₐ.
Bob’s read at s_b therefore sees the effect of Alice’s transfer.
This ordering holds even if different replicas process the operations, ensuring that if Txn A finishes before Txn B starts, every replica will observe A before B.