Multi-Version Concurrency Control

A practical deep dive into MVCC: concurrency protocols, version storage, garbage collection, and index design.

Modern databases are expected to execute many transactions concurrently while preserving correctness. When multiple transactions read or update the same data at the same time, the system must prevent anomalies such as lost updates, dirty reads, or inconsistent results. This is why concurrency control exists in the first place. Traditional approaches rely on locking to serialize access to shared data, but this often limits parallelism and hurts performance under contention. Multi-Version Concurrency Control (MVCC) takes a different path: instead of blocking readers and writers, it maintains multiple versions of each tuple, allowing transactions to operate on consistent snapshots while updates proceed concurrently.

This appealing advantage makes MVCC the most popular choice among DBMSs implemented in recent years. But there is no "standard implementation" of MVCC: there are several design choices, each with its own advantages and disadvantages. In this blog, we'll examine the design decisions of an MVCC DBMS along the following axes:

  • Concurrency protocol
  • Version storage
  • Garbage collection
  • Index management

In this blog, we will use the term "tuple" to refer to a table row.

Regardless of the MVCC implementation, there must be some metadata fields in the tuple’s header.

  • txn-id: The transaction identifier, a unique, monotonically increasing timestamp assigned to the transaction. This field serves as the write lock.
    • Most DBMSs use a 64-bit txn-id so they can update it with a single CAS instruction.
    • A transaction can only update a version if its txn-id is zero or equal to that transaction's identifier (if txn-id is zero, the DBMS sets it to the transaction's identifier).
  • begin_ts and end_ts:
    • begin_ts is the logical timestamp at which the version was created. It is set to INF when a transaction deletes the tuple.
    • end_ts is the logical timestamp at which the version was superseded by an update; it remains INF while the version is the latest.
    • A logical timestamp is a system-defined, monotonically increasing value (often implemented using the transaction identifier mentioned above) that uniquely orders committed transactions and allows the system to reconstruct a global serialization order.
    • A larger begin_ts means a newer version.
  • pointer: Stores the address of the neighboring (previous or next) version, if any.
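To make the header concrete, here is a minimal Python sketch of these fields. All names (`Version`, `try_write_lock`, the `payload` field) are illustrative, not taken from any real DBMS; a real system would acquire the lock with an atomic CAS instruction rather than a plain assignment:

```python
from dataclasses import dataclass, field
from typing import Optional

INF = float("inf")  # sentinel for "still the latest version"

@dataclass
class Version:
    """Hypothetical per-version header mirroring the fields above."""
    txn_id: int = 0                       # 0 means the write lock is free
    begin_ts: float = 0.0                 # logical time the version was created
    end_ts: float = INF                   # logical time the version was superseded
    pointer: Optional["Version"] = None   # neighboring version in the chain
    payload: dict = field(default_factory=dict)  # the tuple's attributes

def try_write_lock(v: Version, tid: int) -> bool:
    # A transaction may update a version only if txn-id is 0
    # or already equals its own identifier.
    if v.txn_id == 0 or v.txn_id == tid:
        v.txn_id = tid
        return True
    return False
```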

Here's the basic layout of a physical tuple version (see Figure 1).

Every DBMS includes a concurrency control protocol that coordinates the execution of concurrent transactions. It determines:

  • Whether a transaction can access/update a tuple version.
  • Whether to allow a transaction to commit its modifications.

MVTO (multi-version timestamp ordering) is considered the original multi-version concurrency control protocol. It uses the transaction identifier T_id (which doubles as the transaction's timestamp T_ts) to precompute the serialization order. It stores an additional field in the version metadata, read_ts. The DBMS aborts a transaction that attempts to read or write a version whose write lock is held by another transaction.

Read path: When a transaction T wants to read a tuple, it searches for the newest version satisfying begin_ts ≤ T_ts < end_ts, then sets read_ts = max(read_ts, T_ts).

Validation step: Transaction T creates a new version B_{x+1} if the following two conditions are met:

  • No active transaction holds the write lock on B_x.
  • read_ts < T_ts

If T can acquire the write lock, the DBMS creates a new version B_{x+1}, sets B_{x+1}'s txn-id = T_ts, and sets B_x's read_ts = max(read_ts, T_ts).

  • Note: B_{x+1} is not visible to other transactions yet, and at the same time, other transactions can still read B_x.

When T commits:

  • Set B_{x+1}'s begin_ts = T_ts and end_ts = INF.
  • Set B_x's end_ts = T_ts using CAS (so there is no lock at all); if the CAS fails, the transaction is aborted.
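The whole MVTO flow (read, validate/write, commit) can be sketched as a single-threaded toy model. The names `Ver`, `mvto_read`, `mvto_write`, and the list-based chain are assumptions for illustration; a real system would perform the lock acquisition and the end_ts update with atomic CAS instructions:

```python
from dataclasses import dataclass
from typing import List, Optional

INF = float("inf")

@dataclass
class Ver:
    begin_ts: float
    end_ts: float = INF
    read_ts: float = 0.0   # MVTO-only metadata field
    txn_id: int = 0        # 0 = write lock free
    value: int = 0

def mvto_read(chain: List[Ver], t_ts: float) -> Optional[Ver]:
    """Find the version with begin_ts <= T_ts < end_ts and bump its read_ts."""
    for v in chain:  # chain[0] is the newest version
        if v.begin_ts <= t_ts < v.end_ts:
            v.read_ts = max(v.read_ts, t_ts)
            return v
    return None

def mvto_write(chain: List[Ver], t_ts: float, new_value: int) -> Optional[Ver]:
    """Create B_{x+1} if B_x is unlocked and read_ts < T_ts; otherwise abort."""
    bx = chain[0]
    if bx.txn_id != 0 or not (bx.read_ts < t_ts):
        return None                              # abort
    bx.txn_id = int(t_ts)                        # acquire the write lock
    bx.read_ts = max(bx.read_ts, t_ts)
    nv = Ver(begin_ts=INF, value=new_value, txn_id=int(t_ts))  # invisible until commit
    chain.insert(0, nv)
    return nv

def mvto_commit(chain: List[Ver], t_ts: float) -> None:
    """Publish B_{x+1} and retire B_x (the real CAS is simulated here)."""
    nv, bx = chain[0], chain[1]
    nv.begin_ts, nv.end_ts, nv.txn_id = t_ts, INF, 0
    bx.end_ts, bx.txn_id = t_ts, 0
```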

Note: the following points are entirely based on my own understanding, so they might not be correct. Please read with caution, and feel free to correct me if you spot anything wrong.

There are some points that make MVTO hard to use in a real system:

  • Write after read is not possible: read_ts is updated right after a tuple is read. If transaction A reads tuple B, B's read_ts is updated to A's T_ts. If A later wants to write B, it fails because the condition read_ts < T_ts is no longer satisfied.
  • The write lock is only acquired at the validation step, so for a tuple B_x we can have multiple B_{x+1} versions created by concurrent transactions. In the end only one of them wins (only one can set B_x's end_ts successfully; the others abort). Imagine long-running transactions that are all aborted at the end because another transaction committed its change before them.

The motivation behind MVOCC (multi-version optimistic concurrency control) is that the DBMS assumes transactions are unlikely to conflict, so there is no need to acquire locks up front.

For read-only transactions, it follows the same rules as timestamp ordering.

But for read-write transactions, MVOCC splits execution into three phases:

  1. Read phase: This phase begins when the transaction starts.

    • For reads, it follows the same logic as timestamp ordering, minus the step that updates read_ts (that metadata field is only used in the MVTO protocol).
    • For writes, it performs the same validation step as MVTO.
  2. Validation phase: When the transaction wants to commit, it enters this phase. The DBMS assigns the transaction another timestamp, T_commit. In this phase, the DBMS checks that no tuple in the transaction's read set has been updated by another transaction.

    This check is performed by comparing the current version of each tuple in the read set with the version the transaction saw during the read phase.

  3. Write phase: If the validation check passes, the transaction enters this phase. The DBMS installs all new versions and sets their begin_ts = T_commit and end_ts = INF.

    The installation is done using CAS, which also verifies that the tuple's current version still matches the version the transaction read beforehand.

You might wonder: why do we need a validation phase if we perform CAS anyway? The validation phase also covers the read-only part of the read set: if we observe that our read set is stale, we abort without attempting any writes.
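A toy sketch of the validation and write phases, tracking each tuple's state with a version counter. The names `Tup`, `validate`, and `occ_commit` are hypothetical; a real system compares version pointers or timestamps rather than a plain counter:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(eq=False)  # identity-based hashing so tuples can key the write set
class Tup:
    version_no: int  # bumped on every committed update
    value: int

def validate(read_set: List[Tuple[Tup, int]]) -> bool:
    """Validation phase: every tuple read must still carry the version we saw."""
    return all(t.version_no == seen for t, seen in read_set)

def occ_commit(read_set: List[Tuple[Tup, int]], write_set: Dict[Tup, int]) -> bool:
    if not validate(read_set):
        return False  # read set is stale: abort before writing anything
    for t, new_value in write_set.items():  # write phase: install new versions
        t.value = new_value
        t.version_no += 1
    return True
```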

In MV2PL, each tuple version maintains a txn_id field that indicates the transaction holding the write lock and a read_cnt field that tracks active readers. A transaction must acquire a shared read lock (by incrementing read_cnt) or an exclusive write lock (by setting txn_id) on the current version before accessing a tuple.

Writers block readers and other writers, while readers block writers. When a transaction commits, the DBMS assigns a commit timestamp (TcommitT_{commit}), updates the begintsbegin_{ts} and endtsend_{ts} fields to make new versions visible, and then releases all locks by clearing txn_id and decrementing read_cnt. If the transaction aborts, speculative versions (the uncommitted version created by the transaction) are discarded and locks are similarly released. This strict two-phase locking ensures serializability but may introduce blocking and deadlocks.

Note: one trick to reduce storage consumption is to combine txn-id and read-cnt into a single 64-bit word so they can be updated together with a single atomic CAS instruction.
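That packing trick can be sketched as follows. The bit split (16 bits for read-cnt, the rest for txn-id) is an arbitrary assumption, and the CAS is simulated with a mutex since Python has no native atomic compare-and-swap:

```python
import threading

READ_BITS = 16
READ_MASK = (1 << READ_BITS) - 1

def pack(txn_id: int, read_cnt: int) -> int:
    return (txn_id << READ_BITS) | read_cnt

def unpack(word: int):
    return word >> READ_BITS, word & READ_MASK

class LockWord:
    """Single 64-bit word holding txn-id and read-cnt; CAS simulated with a mutex."""
    def __init__(self):
        self.word = 0
        self._m = threading.Lock()

    def cas(self, expected: int, new: int) -> bool:
        with self._m:
            if self.word == expected:
                self.word = new
                return True
            return False

def acquire_read(lw: LockWord) -> bool:
    while True:
        w = lw.word
        txn_id, read_cnt = unpack(w)
        if txn_id != 0:
            return False  # a writer holds the lock
        if lw.cas(w, pack(0, read_cnt + 1)):
            return True   # read lock taken: read-cnt incremented atomically

def acquire_write(lw: LockWord, tid: int) -> bool:
    w = lw.word
    # the write lock needs no readers and no writer in the same word
    return w == pack(0, 0) and lw.cas(w, pack(tid, 0))
```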

These MVCC protocols differ primarily in how they detect conflicts.

  • MV2PL uses read and write locks on tuple versions, causing transactions to block or abort eagerly. The trade-offs are heavy blocking and deadlocks.
  • MVTO records read timestamps and aborts writers that violate timestamp ordering.
  • MVOCC avoids modifying tuple metadata during reads, reducing coordination overhead, but must validate each transaction's read set at commit. The trade-offs are starvation of long-running readers and validation cost that grows with read-set size.

Under MVCC, the DBMS always constructs a new physical version of a tuple when a transaction updates it.

The MVCC version storage scheme defines how the system stores these versions and what information each version contains. The version metadata includes the pointer field, which points to a neighboring version; the versions of a tuple thereby form a linked list, the version chain.

In the append-only storage scheme, all versions are stored in the same storage space. It's used in Postgres and in in-memory DBMSs like Hekaton, NuoDB, and MemSQL. When creating a new version, the DBMS acquires an empty slot in the table, copies the old version's data into it, then applies the modification to the new version.

The key design point in this scheme is how the DBMS orders the version chain. There are two ways:

  • Newest to oldest (N2O): The chain's HEAD is the newest version.
    • Advantage: the DBMS doesn't need to traverse the chain to get the latest version.
    • Disadvantage:
      • The chain's HEAD changes on every write, so we need to update all indexes (both primary and secondary) to point to the new version. This can be solved by using a "logical pointer" to the chain's HEAD; we'll go into more detail about logical pointers in the Index Management section.
Figure 2: Append-only storage, newest-to-oldest ordering.
  • Oldest to newest (O2N): The chain's HEAD is the oldest version.
    • Advantage: the HEAD pointer doesn't change, so the database indexes don't need to be updated on every write.
    • Disadvantage: we may need to traverse a long version chain to reach the newest version; this is slow due to random I/O and pollutes the CPU caches with stale versions.
    • The efficiency of O2N therefore depends heavily on how effectively old versions are pruned.
      • To expand on this point: yes, we eventually still need to update the HEAD pointer when old versions are reclaimed, but that happens far less often than in N2O, where the HEAD changes on every write.
Figure 3: Append-only storage, oldest-to-newest ordering.
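The trade-off between the two orderings comes down to how far a reader must walk the chain to find the visible version. A toy sketch (the names `V`, `visible`, and `nxt` are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

INF = float("inf")

@dataclass
class V:
    begin_ts: float
    end_ts: float = INF
    nxt: Optional["V"] = None  # the pointer field from the version header

def visible(head: Optional[V], ts: float) -> Optional[V]:
    """Walk the chain from HEAD until a version covering ts is found."""
    v = head
    while v is not None:
        if v.begin_ts <= ts < v.end_ts:
            return v
        v = v.nxt
    return None
```

With an N2O chain, a reader looking for the current version stops at the HEAD immediately; with an O2N chain, the same lookup walks every older version first.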

Another disadvantage of append-only storage is memory overhead: the whole old version is copied even if only a single field was updated. This is especially problematic when the tuple has non-inline attributes (e.g. BLOBs); the BLOB values are copied over and over again, wasting memory.

  • One possible solution is to store a logical pointer to the BLOB value and maintain a reference counter, so the value can be safely deleted when the counter drops to zero.

The time-travel storage scheme is similar to append-only storage, except that older versions are stored in a separate table. The DBMS maintains a master version in the main table, while older versions are kept in the time-travel table. As with append-only storage, we have two options:

  • The master version is the newest version.
  • OR the master version is the oldest. This option incurs extra cost when GC prunes older versions, since pruning requires updating the master version.

Ideally, the first option is preferred. With it, when updating a tuple, the DBMS first acquires a slot in the time-travel table, copies the master version to that location, then modifies the master version in the main table in place. Indexes are unaffected, as they always point to the master version.

This scheme also suffers from the non-inline attribute problem described above, and the same optimization applies here as well.

https://www.vldb.org/pvldb/vol10/p781-Wu.pdf

In the delta storage scheme, the DBMS keeps the latest (master) version of each tuple directly in the main table, while older values are stored as a sequence of delta versions in a separate delta area. This design is similar in spirit to the rollback segments used in MySQL and Oracle, and is also adopted by systems such as HyPer.

When a tuple is updated, the DBMS first allocates space in the delta storage and records the previous values of only the modified attributes as a delta record. It then updates the master tuple in place to reflect the new values. Each main-table tuple maintains a pointer to the head of its delta chain, which is updated on every modification to preserve version history.

While this approach reduces write amplification by copying only changed fields, it increases the cost of reads that access multiple attributes or older versions. In such cases, the DBMS may need to traverse the delta chain to reconstruct the requested tuple state, leading to additional pointer chasing and cache misses.
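A minimal sketch of the delta scheme: record before-images of only the changed attributes, then walk the delta chain to rebuild an older state. The function names and the dict-based tuple layout are assumptions made for illustration:

```python
from typing import Dict, List

def apply_update(master: Dict[str, int], deltas: List[Dict[str, int]],
                 changes: Dict[str, int]) -> None:
    """Save the before-images of only the modified attributes, then update in place."""
    deltas.insert(0, {k: master[k] for k in changes})  # newest delta first
    master.update(changes)

def reconstruct(master: Dict[str, int], deltas: List[Dict[str, int]],
                how_far_back: int) -> Dict[str, int]:
    """Walk the delta chain backwards to rebuild an older tuple state."""
    state = dict(master)
    for delta in deltas[:how_far_back]:
        state.update(delta)   # each hop re-applies an older before-image
    return state
```

Note how each update stores only the changed field, which is exactly where the write-amplification savings come from, and how reading an old state requires pointer chasing through the chain.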


Since MVCC creates a new version every time a transaction updates a tuple, if we never remove the older versions then:

  • The system will eventually run out of space.
  • Query execution time increases due to version chain traversal.

That’s why we need Garbage Collection (GC) to remove older versions and reclaim the space.

The GC process is divided into three steps:

  1. Detect expired versions. A version is considered expired if:
    • It was created by an invalid transaction (i.e. an aborted transaction), or
    • It's not visible to any active transaction, which is checked by verifying that its end_ts is less than the T_id of every active transaction.
  2. Unlink those versions from their associated chains and indexes.
  3. Reclaim their storage space.
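Step 1 above can be sketched as a simple predicate over a version's metadata and the set of active transaction identifiers. This is an illustrative model, not any particular system's exact rule:

```python
INF = float("inf")

def is_expired(end_ts: float, created_by_aborted: bool,
               active_txn_ids: set) -> bool:
    """A version can be reclaimed if its creator aborted, or if it has been
    superseded (end_ts != INF) and no active transaction can still see it."""
    if created_by_aborted:
        return True
    if end_ts == INF:
        return False  # still the live version
    oldest_active = min(active_txn_ids, default=INF)
    return end_ts < oldest_active  # end_ts below every active T_id
```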

There are two GC implementations; they differ in how the DBMS looks for expired versions. The first approach is tuple-level GC, wherein the DBMS examines the visibility of individual tuples. The second is transaction-level GC, which checks whether any version created by a finished transaction is still visible.

Note: not all GC schemes are compatible with all the storage schemes listed above.

With the tuple-level approach, the DBMS checks the visibility of each individual tuple version in one of two ways:

  • Background Vacuuming (VAC): The DBMS uses background threads that periodically scan the database for expired versions. This is the most common approach in MVCC DBMSs, as it's easy to implement and works with all version storage schemes. But it does not scale well for large databases, especially with a small number of GC threads. Another challenge is that a long-running transaction prevents GC of any version created after it started, causing versions to accumulate (possibly dramatically) until that transaction finishes. This is a well-known problem; for example, a long-running snapshot in Postgres bloats the database because vacuum cannot remove tuples that are newer than the snapshot.
  • Cooperative Cleaning (COOP): While executing a transaction, the DBMS traverses the version chain to find the visible version. During this traversal, it identifies expired versions and records them in a global data structure. This scales well, as the GC threads no longer need to detect expired versions themselves. However, it only works with O2N append-only storage, and it has the problem that if no transaction traverses the version chain of a particular tuple, the system never removes its expired versions. Hekaton calls this the "dirty corners" problem. It can be solved by periodically running a complete GC pass with a separate thread, as in VAC.

Transaction-level garbage collection reclaims tuple versions in batches based on transaction lifetimes rather than individual tuples. A version becomes eligible for deletion once no active transaction can observe it, typically determined by comparing its expiration timestamp with the oldest active transaction. Instead of tracking visibility per tuple, the DBMS groups transactions into epochs (or similar units) and removes all versions created by completed epochs in bulk. While the paper presents this approach in the context of an in-memory system, the underlying principle applies broadly: obsolete versions are reclaimed once they fall outside the visibility window of all running transactions.
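A toy model of this epoch-based, transaction-level GC: versions retired during an epoch are reclaimed in bulk once no transaction from that epoch (or earlier) is still running. The class and method names are invented for illustration, and real systems advance epochs and track membership with far cheaper mechanisms:

```python
from collections import defaultdict
from typing import Dict, List

class EpochGC:
    """Group obsolete versions by the epoch in which they were superseded;
    reclaim a whole epoch once every transaction from it is finished."""
    def __init__(self):
        self.current_epoch = 0
        self.garbage: Dict[int, List[object]] = defaultdict(list)
        self.active: Dict[int, int] = defaultdict(int)  # epoch -> running txns

    def begin_txn(self) -> int:
        self.active[self.current_epoch] += 1
        return self.current_epoch

    def end_txn(self, epoch: int) -> None:
        self.active[epoch] -= 1

    def retire(self, version: object) -> None:
        """Mark a superseded version as garbage in the current epoch."""
        self.garbage[self.current_epoch].append(version)

    def advance_and_reclaim(self) -> List[object]:
        """Advance the epoch and free all garbage older than any running txn."""
        self.current_epoch += 1
        low = min((e for e, n in self.active.items() if n > 0),
                  default=self.current_epoch)
        freed: List[object] = []
        for e in [e for e in self.garbage if e < low]:
            freed += self.garbage.pop(e)
        return freed
```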

All MVCC DBMSs keep the database's versioning information separate from its indexes. So if a key exists in an index, we know there is at least one version for that key, but the index does not know which version of the tuple matches.

We define an index entry as a key/value pair, where the key is the index’s attribute value, and the value is the pointer to that tuple.

Primary indexes always point to the current version of the tuple, but how often they must be updated depends heavily on the version storage scheme. For example, the primary index in the delta storage scheme always points to the master version, so it never needs updating; with the N2O append-only scheme, the index must be updated every time a new version is created.

For secondary indexes, things are much more complicated, as both the key and the value can change. There are two index management schemes for secondary indexes: logical pointers and physical pointers.

In the logical pointer approach, we introduce an indirection layer that maps a tuple's identifier to its version chain's HEAD. All index entries for the same identifier share the same mapping, so when the chain's HEAD changes, we only need to update the map value; no index needs to be touched. This approach is compatible with any version storage scheme.

So what is the tuple's identifier? There are two implementation choices:

  • Primary key (PKey): The identifier is the same as tuple’s primary key. When the DBMS retrieves an entry from a secondary index, it performs another look-up in the table’s primary key index to locate the version chain HEAD. If a secondary index’s attributes overlap with the primary key, then the DBMS does not have to store the entire primary key in each entry.
  • Tuple ID (TupleID): The drawback of using the primary key is the overhead of storing the key value and the cost of the extra primary-key lookup. An alternative is to use a unique 64-bit tuple identifier instead of the primary key, with a separate latch-free hash table maintaining the mapping to the tuple's version chain HEAD.
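The TupleID variant of the indirection layer can be sketched as follows. `IndirectionLayer` and its methods are invented names, and a real implementation would use a latch-free hash table rather than a Python dict:

```python
from typing import Dict, Optional

class IndirectionLayer:
    """TupleID -> version-chain HEAD map; secondary indexes store TupleIDs."""
    def __init__(self):
        self.next_id = 0
        self.head: Dict[int, object] = {}    # TupleID -> chain HEAD
        self.secondary: Dict[str, int] = {}  # secondary-index key -> TupleID

    def insert(self, key: str, version: object) -> int:
        tid = self.next_id
        self.next_id += 1
        self.head[tid] = version
        self.secondary[key] = tid
        return tid

    def update(self, tid: int, new_head: object) -> None:
        # A new version changes only the mapping; no index entry is touched.
        self.head[tid] = new_head

    def lookup(self, key: str) -> Optional[object]:
        tid = self.secondary.get(key)
        return None if tid is None else self.head[tid]
```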

With logical pointers, a query on a secondary key first fetches the logical ID, then uses it to reach the actual tuple version (which may involve scanning a version chain or a primary-index lookup). That adds cost to reads, especially for long chains: the system may need to compare key values across multiple versions to find the right one, e.g. when a logical ID was reused because a DELETE followed by an INSERT of the same key yields two different physical tuples for one logical key.

In this design, indexes point directly to the physical addresses of tuple versions. Because all versions are appended to the same table, this technique is only feasible in append-only storage layouts. Each update inserts the new version into all secondary indexes, allowing index lookups to jump straight to the relevant version rather than comparing keys across multiple historical copies. Systems like MemSQL and Hekaton use this approach to avoid expensive version filtering during index scans.

Side note: Uber switched from Postgres to MySQL because PostgreSQL's indexes use physical pointers (a TID is the page and offset of the tuple). When an UPDATE creates a new tuple version, every index gets a new entry for the new tuple, and the old entries become dead weight until vacuumed. With write-intensive applications, Postgres suffered significant write-throughput degradation.

Logical pointers optimize writes, physical pointers optimize reads, and MVCC typically prevents index-only scans unless visibility metadata is embedded in the index.

The paper's authors ran experiments to verify and benchmark these design choices; here's a summary of their findings:

  • Storage layout matters more than concurrency control: Append-only storage with newest-to-oldest version ordering consistently delivers the best performance. Short version chains and efficient memory allocation outweigh protocol-level optimizations.
  • Read-only transactions must be isolated: Explicit READ ONLY hints dramatically improve both read and write throughput by removing unnecessary conflicts.
  • Garbage collection is critical: Transaction-level GC outperforms tuple-level approaches, providing higher sustained throughput and lower memory usage. Disabling GC quickly degrades performance.
  • Logical index pointers scale better: Logical pointers outperform physical pointers (up to ~45%) under high contention by avoiding repeated secondary-index updates.
  • Memory allocation can dominate performance: Using multiple memory arenas improves throughput by up to 4× for append-only designs.

