Google Bigtable: A Distributed Storage System for Structured Data

Understand Bigtable end-to-end: sparse data model, tablet assignment and lookup (root→metadata→user), Chubby coordination, GFS/SSTables, compactions, caching, Bloom filters, commit logs, and recovery.

Bigtable is a distributed storage system developed by Google to manage structured data. It achieves several goals: wide applicability, scalability, high performance, and high availability. It’s used by critical Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth.

Disclaimer: This post summarizes the 2006 paper Bigtable: A Distributed Storage System for Structured Data. Some details are dated; treat this as a historical reference.

A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map: each table is a sorted key-value map.

II. Data model — reference figure 1

  • Bigtable stores data in rows, each representing a single entity identified by a unique string key (up to 64 KB, though typically much smaller).
  • Each row contains multiple columns, organized into column families. A column is named using the format family:qualifier.
    • Note that column families must be defined up front (and are relatively few), but within a family you can have any number of columns.
    • Each row can have completely different columns, as long as they belong to the predefined families.
    • Each column family is stored separately (in its own file) → retrieving one family doesn’t drag in data from others.
  • The intersection of a row and a column forms one or more cells, each storing timestamped versions of data.
    • By default, a read returns the latest version (the one with the latest timestamp)
  • Conceptually, Bigtable is a sparse map:

(row: string, column: string, time: int64) → string

  • Values are raw byte arrays (“blobs”), leaving serialization and interpretation entirely to the client.
  • The row range for a table is dynamically partitioned. Each partition is called a tablet. Tablet data is stored in the Google File System.
  • Atomicity: All operations under a single row are atomic. If a user does multiple writes to different columns in the same row, Bigtable will apply them atomically.
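The sparse map above can be sketched in a few lines of Python. This is purely illustrative; the class and method names are hypothetical and not part of Bigtable:

```python
class SparseTable:
    """Conceptual sketch of Bigtable's sparse map:
    (row: str, column: str, time: int) -> bytes.
    Column names use the "family:qualifier" format."""

    def __init__(self):
        self._cells = {}  # {(row, column): {timestamp: value}}

    def write(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def read(self, row, column, timestamp=None):
        versions = self._cells.get((row, column), {})
        if not versions:
            return None  # the map is sparse: most cells simply don't exist
        if timestamp is None:
            # By default, return the version with the latest timestamp.
            return versions[max(versions)]
        return versions.get(timestamp)

t = SparseTable()
t.write("com.cnn.www", "contents:", 1, b"<html>v1</html>")
t.write("com.cnn.www", "contents:", 2, b"<html>v2</html>")
assert t.read("com.cnn.www", "contents:") == b"<html>v2</html>"
```

Values are opaque byte strings, matching the model where serialization is left entirely to the client.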

Bigtable uses the distributed Google File System (GFS) to store its write-ahead logs and data files. The data files are written in Google’s SSTable (Sorted String Table) file format.

An SSTable is a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings.

Internally, each SSTable is made up of a sequence of blocks (typically 64 KB in size, but configurable) and ends with a block index. This index stores the first key of every data block, allowing Bigtable to locate keys efficiently. When searching for a specific key (or a key range), Bigtable performs a binary search over the index to find the block that likely contains the target data. The index block is kept in memory whenever the SSTable is open.
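A minimal sketch of that index lookup, assuming an in-memory list of each block’s first key (the class name is hypothetical):

```python
import bisect

class SSTableIndex:
    """Sketch of an SSTable block index: the first key of every data
    block, kept in memory while the SSTable is open."""

    def __init__(self, first_keys):
        self.first_keys = sorted(first_keys)

    def block_for(self, key):
        # Binary search: the last block whose first key <= key
        # is the only block that can contain `key`.
        i = bisect.bisect_right(self.first_keys, key) - 1
        return max(i, 0)

idx = SSTableIndex(["apple", "mango", "peach"])
assert idx.block_for("banana") == 0  # inside the block starting at "apple"
assert idx.block_for("mango") == 1
assert idx.block_for("zebra") == 2
```

Only the single block returned by `block_for` needs to be read from disk.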

For performance, the entire SSTable can optionally be loaded into memory, eliminating disk access during lookups. This optimization assumes spatial locality: if you query for key x, you’ll probably look for x+1 soon after. It’s the same principle behind CPU cache lines: reading one variable often brings nearby data into the cache, so subsequent accesses are fast.

Bigtable heavily relies on Chubby for a variety of tasks:

  • Ensure only one master at a time (avoiding master split brain)
  • Store the bootstrap pointer to the root tablet’s location
  • Discover tablet servers and finalize tablet server deaths
  • Store Bigtable schema information (the column family information for each table)
  • Store access control lists

If Chubby goes down, Bigtable stops serving requests.

You can check out my previous blog about Chubby.

Bigtable has three major components:

  • A library that is linked into every client
  • One master server
  • Many tablet servers

The master server is responsible for:

  • Assigning tablets to tablet servers
  • Detecting the addition or expiration of tablet servers
  • Balancing tablet-server load
  • Garbage collection of files in GFS
  • In addition, it handles schema changes such as table and column family creation.

A tablet is just data; something must serve read/write operations on it. In Bigtable, that’s the tablet server.

One tablet server serves read/write requests for a set of tablets (typically ten to a thousand tablets per tablet server). It also splits tablets that have grown too large.

Tablet servers can be dynamically added (or removed) according to the workload.

⚠️ Clients communicate directly with tablet servers for read/write requests without going through the master.

⚠️ One tablet is only served by one tablet server at a time.

Clients talk directly to tablet servers, but how do they find the right one? Bigtable keeps a special METADATA table that maps tablet ranges to tablet servers (and also stores redo-log pointers, split markers, etc.).

Each METADATA row is keyed by (table_id, end_row) and includes columns such as:

  • info:server → tablet server address
  • info:log → redo/commit-log pointers
  • split markers and other metadata

Example of METADATA rows (simplified)

Row key (table, end-row) | Column      | Value
orders, A4000            | info:server | ts-08:port
orders, A4000            | info:log    | gs://gfs/logs/…#1234
orders, J0000            | info:server | ts-20:port
orders, J0000            | info:log    | gs://gfs/logs/…#5678

The METADATA table itself can split into multiple metadata tablets (and be served by multiple tablet servers). So which server holds the metadata you need?

Chubby stores a pointer to a special tablet called the root tablet:

  • It is the first tablet of the METADATA table
  • Never splits - so there is exactly one entry point for finding metadata tablet locations.
  • Stores the locations of all other metadata tablets
  • The Root Tablet plays an important role when the master starts up. See the Master Startup section below

From now on, we’ll have two main types of table: the METADATA table (system data) and the USER tables (application data).

So this brings up a three-level hierarchy, analogous to a B+-tree:

1. Tablet location — figure 2

Here’s how a client retrieves a tablet location. Note that to improve performance, the client keeps a cache of tablet locations. This flow illustrates the steps when the cache is stale or empty.

1. Tablet location — figure 3

From the sequence diagram, you can see that a single lookup can cost three network round-trips to get the tablet location. To reduce this, the client caches tablet locations. A smaller optimization is that the client library prefetches tablet locations: it reads the metadata for more than one tablet whenever it reads the METADATA table.
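The three-level lookup with a client-side cache can be sketched as follows; `chubby`, `root`, and `meta` are hypothetical callables standing in for the real RPCs:

```python
class LocationClient:
    """Sketch of the three-level tablet lookup (Chubby -> root tablet ->
    METADATA tablet -> tablet server), with a client-side location cache."""

    def __init__(self, chubby, root, meta):
        self.chubby, self.root, self.meta = chubby, root, meta
        self.cache = {}  # (table, row) -> tablet server address

    def locate(self, table, row):
        key = (table, row)
        if key in self.cache:                       # cache hit: 0 round-trips
            return self.cache[key]
        root_loc = self.chubby()                    # 1. root tablet location from Chubby
        meta_loc = self.root(root_loc, table, row)  # 2. which METADATA tablet
        server = self.meta(meta_loc, table, row)    # 3. which tablet server
        self.cache[key] = server
        return server

client = LocationClient(
    chubby=lambda: "root-ts",
    root=lambda loc, t, r: "meta-ts-3",
    meta=lambda loc, t, r: "user-ts-8",
)
assert client.locate("orders", "A1234") == "user-ts-8"
```

A second `locate` for the same key is served entirely from the cache, which is why the three round-trips are only paid when the cache is stale or empty.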

Each tablet is assigned to one tablet server at a time. As mentioned in the previous section, the master server is responsible for tablet assignment (deciding which tablet server serves which tablet).

Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates and acquires an exclusive lock on a unique file in a specific Chubby directory. The master monitors this directory to discover tablet servers.

A tablet server stops serving its tablets if it loses its exclusive lock (e.g., due to a network partition that caused it to lose its Chubby session). The tablet server attempts to reacquire the lock if its file still exists; if the file is gone, the server knows it can never serve again, so it kills itself.

When a tablet server terminates, it tries to release its lock so the master can reassign its tablets more quickly.

The master is responsible for detecting when a tablet server is no longer serving its tablets, so it can reassign those tablets as soon as possible.

To detect tablet server liveness, the master periodically asks each tablet server for the status of its lock. If either:

  • The tablet server reports that it lost its lock, OR
  • The master is unable to reach the tablet server during its last several attempts

Then the master attempts to acquire the exclusive lock on the tablet server’s file. If it succeeds, Chubby is alive and the tablet server is either dead or unable to reach Chubby. In that case, the master:

  1. Deletes the tablet server’s file to ensure that tablet server can never serve requests again.

  2. Moves all tablets that were assigned to the failed tablet server into the set of unassigned tablets

    ⚠️ During tablet server failure and reassignment, Bigtable temporarily leaves affected tablets unassigned. Client requests to those tablets fail until the master completes reassignment and updates the metadata table.

⚠️ In case the master loses its session, it kills itself, but the tablet assignment doesn’t change!
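The failure-detection flow above can be sketched as follows. All class and method names here (`FakeChubby`, `report_lock_lost`, `try_acquire`, …) are hypothetical stand-ins, not Bigtable or Chubby APIs:

```python
def check_tablet_server(server, chubby, unassigned):
    """Sketch of the master's liveness check for one tablet server."""
    try:
        lost = server.report_lock_lost()   # ask the server about its lock
    except ConnectionError:
        lost = True                        # unreachable on the last several attempts
    if lost and chubby.try_acquire(server.lock_file):
        # Chubby is alive, so the server is dead or cut off from Chubby.
        chubby.delete(server.lock_file)    # ensure it never serves again
        unassigned.extend(server.tablets)  # queue its tablets for reassignment

# Tiny fakes to exercise the flow:
class FakeChubby:
    def __init__(self): self.deleted = []
    def try_acquire(self, path): return True
    def delete(self, path): self.deleted.append(path)

class FakeServer:
    lock_file = "/bigtable/servers/ts-08"
    tablets = ["tablet-1", "tablet-2"]
    def report_lock_lost(self): raise ConnectionError

unassigned = []
chubby = FakeChubby()
check_tablet_server(FakeServer(), chubby, unassigned)
```

Note the ordering: the file is deleted before the tablets are reassigned, which is what guarantees the old server can never serve stale data.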

When a master starts, it executes the following steps:

  1. Grab a unique master lock in Chubby → prevent concurrent master initialization (split brain problem).
  2. Scan the tablet server directory in Chubby to find live tablet servers.
  3. Communicate with each tablet server to discover which tablets are assigned to it
  4. Check whether the Root Tablet is assigned to any tablet server; if not → assign it
  5. Scan the Root Tablet to learn about all METADATA tablets; assign any that are not yet assigned
  6. Scan the METADATA tablets to learn about all USER tablets; assign any that are not yet assigned
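Steps 3-6 can be sketched as a single function. The arguments are hypothetical stand-ins for the Chubby directory scan and the root/METADATA tablet scans:

```python
def master_startup(live_servers, root_tablet, metadata_scan, user_scan):
    """Sketch of master startup: discover what is already assigned,
    then collect everything that still needs a server."""
    assigned = set()
    for server in live_servers:          # step 3: ask each live server
        assigned |= set(server["tablets"])
    to_assign = []
    if root_tablet not in assigned:      # step 4: root tablet first
        to_assign.append(root_tablet)
    for t in metadata_scan():            # step 5: METADATA tablets
        if t not in assigned:
            to_assign.append(t)
    for t in user_scan():                # step 6: user tablets
        if t not in assigned:
            to_assign.append(t)
    return to_assign

live = [{"tablets": ["root", "meta-1"]}]
todo = master_startup(live, "root",
                      lambda: ["meta-1", "meta-2"],
                      lambda: ["user-1"])
assert todo == ["meta-2", "user-1"]
```

The ordering matters: the root tablet must be assigned before the METADATA tablets can be scanned, and the METADATA tablets before the user tablets.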

As the data in a tablet grows, it eventually needs to split into two new tablets. The split is performed by the tablet server. After splitting the tablet, the server:

  1. Commits the split by recording the information for the new tablet in the METADATA table
  2. Notifies the master.

If the notification to the master is lost, the master detects the new tablet when it later asks the tablet server to load the tablet that has now split; the tablet server then notifies the master of the split.

Here’s an overview of the read/write request flow

3. Tablet Serving — figure 4

Before going into read/write separately, we need to know some key things:

  • Tablet Log: a write-ahead log (WAL) of all changes to a tablet.
  • SSTable: the on-disk file(s) storing tablet data in GFS.
  • MemTable: The in-memory table on the tablet server that buffers recent updates; it’s periodically flushed to an SSTable.

On a write:

  1. The server checks write authorization by reading a Chubby file (usually served from the Chubby client cache).
  2. It appends the update to the tablet log (WAL).
  3. It applies the update to the memtable.

This is why Bigtable can guarantee single-row atomicity even for multi-column updates: a single tablet server serializes all mutations for a row and applies them atomically.

On a read:

  1. The server checks read authorization as above.
  2. It reads from the memtable and SSTables (as needed), merges the results, and returns them.
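Both paths can be sketched together in a toy model (in reality the tablet log and SSTables live in GFS, and authorization checks are omitted here):

```python
class TabletServer:
    """Toy sketch of the write path (WAL append, then memtable) and the
    read path (merge memtable with SSTables, newest data wins)."""

    def __init__(self):
        self.log = []        # tablet log (WAL)
        self.memtable = {}   # in-memory buffer of recent writes
        self.sstables = []   # immutable on-disk maps, oldest first

    def write(self, key, value):
        self.log.append((key, value))  # 1. append to the WAL first
        self.memtable[key] = value     # 2. then apply to the memtable

    def read(self, key):
        if key in self.memtable:             # freshest data first
            return self.memtable[key]
        for sst in reversed(self.sstables):  # then newer SSTables before older
            if key in sst:
                return sst[key]
        return None

ts = TabletServer()
ts.sstables.append({"a": b"old"})
ts.write("a", b"new")
assert ts.read("a") == b"new"
```

Writing the WAL before the memtable is what makes a crash recoverable: the memtable can always be rebuilt by replaying the log.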

To recover a tablet, a tablet server reads its metadata from the METADATA table. This metadata contains the list of SSTables that comprise the tablet and a set of redo points, which are pointers to any commit logs that may contain data for the tablet. The server reads the SSTable indices into memory and reconstructs the memtable by applying all of the updates committed since the redo points.

As writes come in, data is first stored in memory (the memtable). When it grows too large, Bigtable freezes it, writes it to disk as an SSTable, and starts a new memtable - this is a minor compaction. It frees memory and limits recovery time after crashes.

Over time, multiple SSTables build up. To keep reads fast, Bigtable merges them in the background through merging compactions, combining several files into one.

Occasionally, it runs a major compaction, rewriting all data for a tablet into a single SSTable and permanently removing deleted entries. This keeps storage efficient and ensures sensitive data is truly gone.
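The three compaction types can be sketched in a toy model, representing SSTables as plain dicts and deletions as `None` tombstones (an assumption for illustration; real SSTables are immutable files in GFS):

```python
def minor_compaction(server):
    """Freeze the memtable, persist it as a new SSTable, start fresh."""
    server["sstables"].append(dict(server["memtable"]))
    server["memtable"] = {}
    server["log"] = []   # logged updates are now durable in the SSTable

def major_compaction(server):
    """Merge all SSTables into one, dropping deleted entries for good."""
    merged = {}
    for sst in server["sstables"]:   # oldest first: newer values overwrite older
        merged.update(sst)
    # Tombstones (None) can finally be discarded, so the data is truly gone.
    server["sstables"] = [{k: v for k, v in merged.items() if v is not None}]

srv = {"memtable": {"a": 1},
       "sstables": [{"a": 0, "b": None}],   # "b" was deleted earlier
       "log": [("a", 1)]}
minor_compaction(srv)
major_compaction(srv)
assert srv["sstables"] == [{"a": 1}]
```

A merging compaction would be the same merge over only a few input SSTables, keeping the tombstones.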

Bringing it all together, this diagram illustrates the architecture of Bigtable

4. Compactions — figure 5

The implementation described in the previous sections required a number of refinements to achieve the high performance, availability, and reliability required by its users.

Bigtable lets clients combine related column families into locality groups, each stored in its own SSTable. This separation improves read efficiency, since data that isn’t usually accessed together stays in separate files. For instance, a web table might keep page metadata (like language and checksums) in one group and the page content in another, allowing metadata queries without scanning full pages.

Each locality group can also be tuned independently. Some can be marked as in-memory, meaning their SSTables are cached on the tablet server for faster access. This is ideal for small, frequently accessed data such as the location column family in Bigtable’s own METADATA table.

Bigtable allows clients to decide whether to compress SSTables for each locality group and which compression algorithm to use. Compression is applied at the block level (typically 64 KB per block); each block is compressed independently.

2. Compression — figure 6

This design trades a little extra space for much faster reads since Bigtable can load and decompress only the blocks it needs, without touching the whole SSTable.
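Block-level compression can be sketched as follows, with zlib standing in for Bigtable’s custom two-pass scheme (an assumption for illustration only):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB per block, as in the paper (configurable)

def compress_blocks(data):
    """Compress an SSTable's contents one block at a time, so a reader
    can later decompress only the block it needs."""
    return [zlib.compress(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]

def read_block(blocks, n):
    # Touch exactly one compressed block, not the whole SSTable.
    return zlib.decompress(blocks[n])

data = b"x" * (3 * BLOCK_SIZE)
blocks = compress_blocks(data)
assert len(blocks) == 3
assert read_block(blocks, 1) == b"x" * BLOCK_SIZE
```

Compressing the whole file at once would yield a slightly better ratio, but then reading any key would require decompressing everything before it.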

Many workloads use Bigtable’s two-pass custom compression:

  1. A Bentley–McIlroy pass that finds long repeated strings across large windows.
  2. A fast local 16 KB window compression pass for smaller repetitions.

This combination is both fast and surprisingly effective, achieving up to 10x space reduction on Web data, outperforming typical Gzip compression. The gains come largely from how Bigtable organizes rows: by clustering similar data (e.g., pages from the same host), compression algorithms can exploit shared boilerplate efficiently.

To speed up reads, Bigtable tablet servers use two layers of caching:

  • Scan Cache: stores recently returned key-value pairs from the SSTable interface. It helps when clients repeatedly read the same data, avoiding re-fetching and re-decoding.
  • Block Cache: stores SSTable blocks recently read from GFS. It’s effective for workloads that access nearby data, such as sequential scans or multiple columns in the same locality group.

Together, these caches reduce disk reads and improve response times for both repeated and spatially localized queries.
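Both layers behave like LRU caches over different units (key-value pairs vs. raw blocks); here is a tiny sketch with illustrative capacities:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache usable for both layers below."""

    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)       # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

scan_cache = LRUCache(10_000)   # key-value pairs from the SSTable interface
block_cache = LRUCache(100)     # SSTable blocks recently read from GFS
```

A read would consult the scan cache first, then the block cache, and only then go to GFS.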

Bigtable uses Bloom filters to cut down on unnecessary disk reads during lookups. Each SSTable can optionally have a Bloom filter that quickly answers:

“Could this SSTable contain data for this row and column?”

If the answer is no, Bigtable skips reading that file entirely.

This is especially useful when a tablet’s state spans multiple SSTables - without Bloom filters, a read might trigger several disk seeks just to confirm that data doesn’t exist.

For many workloads, keeping a small Bloom filter in memory significantly improves read performance, ensuring most lookups for missing rows or columns never even touch disk.
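A minimal Bloom filter sketch (the bit-array size `m` and hash count `k` are illustrative, not Bigtable’s actual parameters):

```python
import hashlib

class BloomFilter:
    """Answers "possibly present" or "definitely absent" for
    row/column keys in an SSTable, with no false negatives."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, key):
        # k independent positions derived from k salted hashes.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("row42/contents:")
assert bf.might_contain("row42/contents:")  # no false negatives, ever
```

For most absent keys, `might_contain` returns False, so the SSTable is skipped without any disk seek; the occasional false positive only costs one wasted read.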

4. Bloom Filters — figure 7

Because the commit log must be stored in GFS, every write incurs an expensive I/O operation: each GFS write is replicated to several replicas. So, instead of keeping a separate commit log for each tablet, Bigtable uses a single shared log per tablet server. This design avoids the overhead of writing to many GFS files simultaneously, and it enables batched writes to GFS (group commit).

When a tablet server fails, its tablets are reassigned to other servers. Each new server must replay the mutations for its tablets from the shared log - but since all tablets’ updates are interleaved, Bigtable first sorts the log entries by ⟨table, row, sequence_number⟩. This makes all mutations for a tablet contiguous, allowing fast sequential reads during recovery. The log is partitioned into 64 MB segments and sorted in parallel across tablet servers under the master’s coordination.
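The recovery sort can be sketched in one line; the entry fields below mirror the ⟨table, row, sequence_number⟩ key:

```python
def recovery_order(entries):
    """Sort interleaved shared-log entries by (table, row, sequence_number)
    so each tablet's mutations become contiguous and can be replayed
    with sequential reads."""
    return sorted(entries, key=lambda e: (e["table"], e["row"], e["seq"]))

log = [
    {"table": "orders", "row": "B", "seq": 2, "op": "put"},
    {"table": "users",  "row": "A", "seq": 1, "op": "put"},
    {"table": "orders", "row": "A", "seq": 1, "op": "put"},
]
ordered = recovery_order(log)
assert [e["table"] for e in ordered] == ["orders", "orders", "users"]
```

A server recovering the `orders` tablets now reads one contiguous run of entries instead of scanning the whole interleaved log.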

To handle occasional GFS slowdowns, each tablet server maintains two log-writing threads, each with its own log file. Only one is active at a time, but if the active writer stalls, Bigtable switches to the other. Sequence numbers in log entries ensure duplicates are ignored during recovery.

5. Commit-log implementation — figure 1

When a tablet is moved between servers, Bigtable minimizes recovery time through preemptive compaction.

Before handing off a tablet, the source tablet server performs a minor compaction, flushing any pending in-memory updates (memtable) to disk. This reduces the amount of recent, un-compacted data that would otherwise need replaying from the commit log.

Once the first compaction finishes, the tablet stops accepting new writes. The server then performs a quick second minor compaction to flush any last-minute changes that arrived mid-process.

After this, the tablet’s on-disk state is fully up to date, so the new server can load it immediately without replaying any commit log entries during recovery.

Bigtable’s design benefits greatly from the fact that SSTables are immutable. Once written, they never change, which simplifies many parts of the system.

Because SSTables can be safely shared across threads, Bigtable doesn’t need locks or synchronization when reading them, making concurrent reads and writes lightweight. The only mutable structure shared between reads and writes is the memtable, which uses a copy-on-write strategy so that reads and writes can proceed in parallel with minimal contention.

Immutability also simplifies deletion and garbage collection. Rather than modifying SSTables to remove old data, Bigtable just deletes obsolete files. The master periodically performs a mark-and-sweep process using the METADATA table to identify and clean up outdated SSTables.

Finally, immutability makes tablet splitting fast: when a tablet splits, the new child tablets can simply share the parent’s existing SSTables instead of rewriting them - another example of how immutability turns complexity into efficiency.

https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

https://brabalawuka.cc/posts/study/bigtable/

https://www.youtube.com/watch?v=x2MdvCTWh3o&t=594s
