Distributed Transactions at Scale in Amazon DynamoDB

How DynamoDB delivers ACID at scale: timestamp ordering instead of locks, a distributed 2PC, writeless reads, buffered writes, and recovery that keeps latency predictable.

Transactions are the backbone of data correctness. Their atomicity (the A in ACID) ensures that a group of operations behaves as a single, all-or-nothing unit: either every change succeeds, or none of them do. This "all-or-nothing" property, along with isolation (I) and durability (D), protects systems from partial failures, race conditions (to be fair, extra effort is still needed to handle those, but transactions do most of the heavy lifting), and data corruption.

Transactions are costly, especially when data scales beyond a single machine. A transaction isn't just a batch of writes - it's a promise that all participants either commit together or roll back together. Enforcing this across machines, networks, and clocks turns into an expensive coordination problem.

  • Coordination overhead: Once data is sharded or replicated, no single node sees the full state. To commit atomically, nodes must agree via protocols like two-phase commit or Paxos, adding network round-trips, logging, and synchronization. Any slow or failed node delays the whole system.
  • Blocking and contention: If a coordinator fails mid-commit, participants must wait, holding locks and blocking progress. Under heavy load, locks and conflict checks turn hot spots into bottlenecks, especially when spread across machines.
  • Replication and ordering costs: Every replica must reach the same commit decision, demanding log flushes and cross-replica coordination. Ensuring a consistent order across drifting clocks adds extra wait time - even Google Spanner must pause briefly before finalizing a commit.
  • Recovery and availability trade-offs: Crash recovery requires replaying logs and reconciling incomplete transactions, while network partitions force systems to choose between waiting (consistency) and staying online (availability).

Because of those costs, early NoSQL systems abandoned full ACID transactions, favoring performance and availability over strict consistency.

DynamoDB powers applications for hundreds of thousands of customers and multiple high-traffic Amazon systems, including Alexa, the Amazon.com site, and all Amazon fulfillment centers. That said, customers began requesting ACID transactions, which created a challenge: Amazon engineers had to find a way to integrate transactional operations without sacrificing DynamoDB's key characteristics: high scalability, high availability, and predictable performance at scale.

DynamoDB decided to build a different transaction design with the following capabilities:

  • Transactions are submitted as a single request: There are no begin and end operations (such as BEGIN and COMMIT in PostgreSQL). The set of operations submitted as a single request either succeeds or fails as a whole.
  • Transactions rely on a transaction coordinator, while non-transactional operations bypass two-phase coordination: Separate APIs are provided for transactional operations (TransactGetItems and TransactWriteItems). These are executed by a transaction coordinator, while normal operations (Get and Put) go directly to the storage servers.
  • Transactions update items in place: DynamoDB does not keep multiple versions of the same item. Some other databases (such as PostgreSQL and MySQL) use MVCC, so read-only transactions can access old versions while transactions that write data produce new versions.
  • Transactions do not acquire locks: DynamoDB uses an optimistic concurrency control scheme that avoids locking, since locks restrict concurrency and can lead to deadlock. Locks also need a recovery mechanism in case the lock holder fails.
  • Transactions are serially ordered using timestamps: Serializability is achieved using the timestamp assigned to each transaction.

Similar to non-transactional requests, each transactional operation is first received by a request router, which performs authentication and authorization and then forwards it to a fleet of transaction coordinators.

Any coordinator in the fleet can take the transaction. It breaks the transaction into item-level operations and runs a distributed protocol across the storage nodes that hold those items.

Figure 1: Transaction routing

Timestamp-based concurrency control is a technique used in database systems to guarantee serializability (the illusion that transactions execute one after another) without relying on locks.

Instead of blocking or waiting, the system assigns each transaction a unique timestamp when it starts. This timestamp defines its logical position in the serial order of all transactions.

Transactions with smaller timestamps are treated as older and therefore have higher priority (must be executed first). Any operation that violates this order is aborted and retried, ensuring the final schedule remains serializable and deadlock-free.

To enforce ordering, each data item x maintains two timestamp values:

  • Wts(x) - the timestamp of the last transaction that successfully wrote x.
  • Rts(x) - the timestamp of the last transaction that successfully read x.

Whenever a transaction with timestamp t wants to access x, the system checks:

  • Read rule: If t < Wts(x), the read is too "old" (a newer value already exists), so the transaction is aborted. Otherwise it proceeds, and Rts(x) is updated to max(Rts(x), t).
  • Write rule: If t < max(Wts(x), Rts(x)), the write would violate the order (it would overwrite something written or read by a newer transaction), so it is aborted. Otherwise the write succeeds, and Wts(x) is updated to t.

Example:

Transaction T1 (timestamp = 5) and transaction T2 (timestamp = 10) both access item x. Initially: Wts(x) = 0, Rts(x) = 0, x = 100.

| Step | Operation | Check | Result | Updated timestamps |
| --- | --- | --- | --- | --- |
| 1 | T1 reads x | 5 > Wts(x) = 0? Yes | Allowed | Rts = 5 |
| 2 | T2 reads x | 10 > Wts(x) = 0? Yes | Allowed | Rts = 10 |
| 3 | T1 writes x | 5 >= max(Wts(x), Rts(x)) = max(0, 10)? No | Abort T1 | - |
| 4 | T2 writes x | 10 >= max(0, 10)? Yes | Commit | Wts = 10 |

Result: T1 aborts, T2 commits.

The final schedule is serializable and consistent with the timestamps, with no locks or waiting; the aborted T1 can simply retry with a new, larger timestamp.
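The rules and the example above can be sketched in a few lines of Python. This is a minimal illustration of basic timestamp ordering, not DynamoDB's actual implementation; all names are made up.

```python
class Item:
    """A data item carrying the two bookkeeping timestamps."""
    def __init__(self, value):
        self.value = value
        self.wts = 0  # timestamp of the last successful write
        self.rts = 0  # timestamp of the last successful read

def read(item, t):
    """Read rule: abort if a newer transaction already wrote the item."""
    if t < item.wts:
        return None, "abort"
    item.rts = max(item.rts, t)
    return item.value, "ok"

def write(item, t, value):
    """Write rule: abort if a newer transaction already read or wrote it."""
    if t < max(item.wts, item.rts):
        return "abort"
    item.value = value
    item.wts = t
    return "ok"

# Replaying the example: T1 has timestamp 5, T2 has timestamp 10.
x = Item(100)
assert read(x, 5) == (100, "ok")     # step 1: Rts becomes 5
assert read(x, 10) == (100, "ok")    # step 2: Rts becomes 10
assert write(x, 5, 50) == "abort"    # step 3: 5 < max(0, 10), T1 aborts
assert write(x, 10, 200) == "ok"     # step 4: T2 commits, Wts becomes 10
```

Note that nothing ever blocks: a conflicting operation is rejected immediately, and the caller is expected to retry with a fresh timestamp.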

DynamoDB adopts timestamp-based concurrency control to eliminate the need for locks.

Each transaction is assigned a timestamp by the transaction coordinator that receives it. Since DynamoDB runs a fleet of coordinators, not a single centralized one, each must maintain a closely synchronized local clock.

These clocks are synchronized using the AWS Time Sync Service - which aligns them within a few microseconds. This service provides low skew but not a provably consistent global clock. Once a transaction obtains its timestamp, all participating storage nodes can perform their local prepare and commit actions independently, using that timestamp to decide ordering without further cross-partition coordination.

⚠️ The paper gives limited detail on the timestamp mechanism, so the following discussion is based on logical inference from timestamp-ordering principles.

Handling Identical Timestamps

If two coordinators accidentally produce the same physical timestamp, DynamoDB can still guarantee determinism by extending each timestamp with extra distinguishing fields, such as a logical counter and a coordinator identifier.

When two transactions have identical physical times, the system compares them by logical counter, and if necessary, by coordinator ID.

This ensures a total order: even if physical times match, every pair of transactions has a deterministic ordering.

The coordinator ID serves only as a tie-breaker, not a permanent hierarchy. Logical time remains the dominant ordering factor, so coordinators do not always "win" or "lose" based on their ID.
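One way to picture this compound timestamp is as a tuple compared field by field. This is a hypothetical sketch following the inference above (the paper does not spell out the exact fields); the names are illustrative.

```python
from dataclasses import dataclass

# order=True makes instances compare like tuples, field by field:
# physical time first, then the logical counter, then coordinator ID.
@dataclass(frozen=True, order=True)
class TxnTimestamp:
    physical_us: int     # microseconds from the coordinator's synced clock
    counter: int         # per-coordinator logical counter
    coordinator_id: int  # final tie-breaker, used only on exact ties

# Two coordinators happen to read the exact same physical time:
a = TxnTimestamp(physical_us=1_700_000_000_000_000, counter=0, coordinator_id=1)
b = TxnTimestamp(physical_us=1_700_000_000_000_000, counter=0, coordinator_id=2)
assert a < b  # the coordinator ID breaks the tie deterministically

# A later physical time always dominates, regardless of IDs:
assert TxnTimestamp(1, 5, 9) < TxnTimestamp(2, 0, 1)
```

Because every pair of timestamps compares deterministically, all storage nodes derive the same total order without any extra communication.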

Dealing with Clock Skew

Suppose Coordinator A’s clock runs 100 milliseconds fast, while Coordinator B’s clock is 100 milliseconds slow.

From the user's perspective, request Ta (stored by A) occurs before request Tb (stored by B).

However, due to the clock skew, B may assign a smaller timestamp than A.

As a result, DynamoDB's logical order becomes Tb → Ta, even though, in real time, Ta happened first.

This reversal does not violate correctness. Serializability only requires that the system maintain a consistent total order of transactions - not that this order reflect wall-clock time.

All partitions agree on the same comparison rule, so every node observes the same serial order. The system remains consistent, isolated, and deadlock-free, though not strictly serializable (it does not guarantee real-time ordering).

One of the most common algorithms to ensure atomic commitment across multiple nodes is the two-phase commit (2PC) protocol.

This protocol introduces a coordinator (aka transaction manager) and participants (the database nodes) for which we want to ensure atomicity: either all participants successfully perform the desired actions, or, if any of them fails, none of the actions are committed.

As the name suggests, there are two phases. When a client wants to make changes on multiple nodes (participants), it first sends the request to the coordinator, then:

  • First phase (prepare phase): The coordinator sends a prepare request to each participant in the transaction, and each participant replies with a message indicating whether it is able to commit the transaction.
    • ⚠️ In this phase, the participant:
      • Does not commit the change yet - it only checks its ability to commit the desired change.
        • This typically involves writing all transaction data to disk and checking for any conflicts or constraint violations.
      • If it replies "ok to commit", it must ensure that it will definitely be able to apply the change if the coordinator instructs it to commit in the second phase.
      • Typically, the participant locks the resources about to be updated until it receives the "Commit" or "Abort" instruction from the coordinator.
  • Second phase (commit phase): The coordinator collects the responses. If:
    • Any participant replies "no" (or times out), the coordinator sends Abort messages to all participants, instructing them to abort the transaction's changes.
    • All participants reply "ok", the coordinator sends Commit messages to all participants, instructing them to commit the transaction's changes.
    • ⚠️ The coordinator must write the final decision (commit/abort) to its transaction log on disk so it can continue after recovering from a failure.

The process is illustrated in the following diagram:

Figure 2: Two-phase commit
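The two phases can be sketched from the coordinator's point of view. This is a toy illustration of the protocol described above, not a real API; the `Participant` class and the log list are stand-ins.

```python
class Participant:
    """Toy participant: votes in phase 1, applies the decision in phase 2."""
    def __init__(self, vote=True):
        self.vote = vote
        self.state = "idle"
    def prepare(self, txn_id):
        return self.vote               # True means "ok to commit"
    def commit(self, txn_id):
        self.state = "committed"
    def abort(self, txn_id):
        self.state = "aborted"

def two_phase_commit(coordinator_log, txn_id, participants):
    # Phase 1: collect votes (in a real system a timeout counts as "no").
    votes = [p.prepare(txn_id) for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Log the decision durably BEFORE telling anyone: from here on it is
    # irrevocable, even across coordinator crashes.
    coordinator_log.append((txn_id, decision))
    # Phase 2: deliver the decision (retried until acknowledged in
    # a real system).
    for p in participants:
        p.commit(txn_id) if decision == "commit" else p.abort(txn_id)
    return decision

log = []
ok = [Participant(), Participant()]
assert two_phase_commit(log, "t1", ok) == "commit"
mixed = [Participant(), Participant(vote=False)]
assert two_phase_commit(log, "t2", mixed) == "abort"
```

The important detail is the ordering: the decision hits the durable log before any participant hears it, which is exactly what makes the decision survivable and irrevocable.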

There are two crucial "points of no return" in this protocol:

  • The first is when a participant votes "ok": it must be able to commit the change later.
  • The second is after the coordinator makes its decision (commit/abort); that decision is irrevocable. No matter how many retries it takes, the coordinator must ensure that all participants act on the final decision.

You might wonder what happens when a node crashes or becomes unreachable due to network failure.

Participant failure:

  • If a participant fails during the prepare phase, the coordinator's request will eventually time out, and the coordinator treats it as a "no" vote, aborting the transaction.
  • If it fails during the commit phase, recovery is straightforward: once the participant comes back online, it can either
    • ask the coordinator for the final decision using the transaction's global ID, or
    • simply wait for the coordinator to retry the commit message.

Coordinator failure:

This case is more problematic.

  • If the coordinator crashes before sending any prepare requests, participants can safely abort since no transaction was initiated.
  • However, if participants have already voted "yes", they are stuck in the prepared state and cannot decide on their own. They must wait until the coordinator recovers and informs them whether the transaction was committed or aborted. In other words, once a participant votes "yes", it is effectively blocked until the coordinator returns - a key weakness of the 2PC protocol.

DynamoDB uses a two-phase protocol to ensure that all of the writes within a transaction are performed atomically and in the proper order.

Source: https://drive.google.com/file/d/1Yg2R-wN7KKugx-R4yc8c080XsXtBB0JT/view

It works exactly as described above for the 2PC protocol.

In the prepare phase, there is an additional timestamp-ordering check to determine whether the transaction is valid.

The only special handling concerns item deletion. Once an item is deleted there is no last-write timestamp left, so the transaction cannot be verified in the prepare phase.

One solution could be maintaining tombstones for deleted items - a DELETED record marking each deletion - but that would incur high storage and garbage-collection costs.

DynamoDB's solution is a partition-level max delete timestamp: when an item is deleted, this value is raised to the transaction's timestamp if it is currently smaller. When a participant receives a prepare request for a non-existent item, it compares the transaction timestamp with the partition-level delete timestamp to decide whether to accept or reject the transaction.
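The delete-watermark idea can be sketched as follows. This is a minimal illustration of the mechanism described above, assuming a toy `Partition` class; it is not DynamoDB's internal code.

```python
class Partition:
    def __init__(self):
        self.items = {}            # key -> (value, last-write timestamp)
        self.max_delete_ts = 0     # highest timestamp of any delete so far

    def delete(self, key, txn_ts):
        self.items.pop(key, None)
        # No per-item tombstone: just raise the partition-wide watermark.
        self.max_delete_ts = max(self.max_delete_ts, txn_ts)

    def prepare_write(self, key, txn_ts):
        """Prepare-phase check for a transactional write at txn_ts."""
        if key in self.items:
            _, last_write_ts = self.items[key]
            return txn_ts > last_write_ts      # ordinary timestamp check
        # Item missing: it may have been deleted by a newer transaction,
        # so compare against the partition-wide delete watermark instead.
        return txn_ts > self.max_delete_ts

p = Partition()
p.items["acct"] = ("v1", 5)
p.delete("acct", 7)                  # watermark rises to 7
assert not p.prepare_write("acct", 6)  # older than the delete: reject
assert p.prepare_write("acct", 8)      # newer than the delete: accept
```

The trade-off is precision for space: a single watermark may reject some writes that a per-item tombstone would have allowed, but it costs O(1) storage per partition and needs no garbage collection.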

For read-only transactions, DynamoDB uses a variant of Two-Phase Commit called the Two-Phase Writeless Protocol.

Although no writes occur, the protocol still runs in two phases to guarantee that all items are read from a consistent snapshot.

Phase 1: Validation and Snapshot Capture

  1. The transaction coordinator requests each storage node to return the current value of the item along with its log sequence number (LSN) - a monotonically increasing version counter that represents the most recent write to that item.
  2. If any item is currently locked or being modified by another in-flight transaction, the read transaction is aborted immediately to avoid reading an inconsistent state.

Phase 2: Re-verification

  1. After gathering all the initial values and LSNs, the coordinator sends a second round of read requests for the same items.
  2. Each storage node checks whether the item’s current LSN still matches the LSN from Phase 1.
    • If all LSNs are unchanged, it means no concurrent writes occurred, so the transaction commits successfully and returns the values fetched in Phase 1.
    • If any LSN has changed, it indicates a concurrent update - the transaction is aborted and may be retried.
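The two phases above can be sketched like this. The `FakeNode` class and method names are stand-ins invented for illustration, assuming the LSN semantics described in the text.

```python
class FakeNode:
    """Stand-in storage node exposing values with their LSNs."""
    def __init__(self):
        self.data = {}      # key -> (value, log sequence number)
        self.locked = set() # keys held by in-flight transactions
    def is_locked(self, key):
        return key in self.locked
    def get_with_lsn(self, key):
        return self.data[key]

def read_only_transaction(nodes_and_keys):
    # Phase 1: collect each item's value and LSN; abort on any lock.
    snapshot = {}
    for node, key in nodes_and_keys:
        if node.is_locked(key):        # an in-flight txn may change it
            return None                # abort to avoid inconsistent reads
        snapshot[key] = node.get_with_lsn(key)
    # Phase 2: re-read the LSNs; any change means a concurrent write.
    for node, key in nodes_and_keys:
        if node.get_with_lsn(key)[1] != snapshot[key][1]:
            return None                # abort; the caller may retry
    return {key: value for key, (value, _) in snapshot.items()}

n = FakeNode()
n.data["a"], n.data["b"] = (1, 10), (2, 20)
assert read_only_transaction([(n, "a"), (n, "b")]) == {"a": 1, "b": 2}
```

The payoff is that nothing is written and nothing is locked: the second round only compares version counters, so a clean run costs two cheap round-trips per item.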

If the primary storage node fails, DynamoDB automatically promotes another replica to master. Because the transaction metadata and prepare state are persisted to the transaction log and replicated, the new master is still aware of the transaction's current state and can continue where things left off. In fact, the coordinator might not even know it is now communicating with a different set of participants!

The coordinator's decision must be written to a transaction ledger - a DynamoDB table keyed by the transaction's identifier - so that if any coordinator fails, a new coordinator can see which transactions are incomplete (stalled) and continue driving them to completion.

Stalled transactions can also be detected by storage nodes. When a node receives a read/write request for an item that is already being written by another transaction, and that pending transaction's timestamp is older than some threshold, it sends the transaction's ID to a coordinator. The coordinator then checks the transaction's status in the ledger table and resumes execution if the transaction is incomplete.

DynamoDB extends the classic timestamp ordering concurrency control (TOCC) model to work efficiently in a key-value store that mixes single-item operations and multi-item transactions.

The key idea: maintain a serializable order via timestamps, but allow non-transactional reads and writes to proceed out of order when it is safe to do so, improving throughput and reducing coordination.

  • A normal GetItem request bypasses transaction coordinators entirely.
  • The read goes straight to the storage node responsible for that key.
  • The node returns the latest committed value, even if there's a prepared transaction that will overwrite it later.

  • A direct PutItem / UpdateItem / DeleteItem can often be accepted immediately and serialized before any prepared transactions that haven't yet written their values.
  • Since prepared transactions haven't applied their writes yet, it's safe for single-item writes to "jump ahead" of them in timestamp order.
  • The only exception: if the prepared transaction already checked a precondition that a new write could violate.
    • For example, suppose a transaction withdraws $100 from a bank account that currently holds $100, with the precondition that the account has sufficient funds. If a non-transactional write withdrawing $50 is allowed to proceed first and the transaction is then instructed to commit, data consistency breaks (the account's balance goes negative).
    • In such cases, the storage node must delay or reject the new write.
    • Detecting condition violations for arbitrary expressions is complex, but simple cases (e.g., numeric bounds) can be optimized.

  • If a new write cannot safely jump ahead (because it may violate a prechecked condition), the node can buffer it until the pending transaction finishes.
  • Prepared transactions typically complete quickly, so buffering introduces little latency.
  • Once the earlier transaction commits or aborts, the buffered write can be assigned a later timestamp and applied.
  • Writes with no preconditions can always proceed immediately. If a prepared transaction with an earlier timestamp later commits, its writes are simply ignored.
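The accept-or-buffer decision can be sketched as below. The precondition shape (a simple numeric lower bound, like the bank-balance example) and all names are illustrative assumptions, not DynamoDB internals.

```python
def handle_direct_write(item, new_value, write_buffer):
    """Decide whether a non-transactional write may jump ahead of a
    prepared transaction on the same item."""
    prepared = item.get("prepared_txn")
    if prepared is None or prepared.get("min_value") is None:
        # No pending precondition: the write safely jumps ahead, since
        # the prepared transaction has not applied its value yet.
        item["value"] = new_value
        return "applied"
    if new_value >= prepared["min_value"]:
        # The prepared transaction's numeric-bound check still holds,
        # so letting this write through cannot invalidate it.
        item["value"] = new_value
        return "applied"
    # The write could break the already-verified precondition: buffer
    # it until the prepared transaction commits or aborts.
    write_buffer.append(new_value)
    return "buffered"

buf = []
acct = {"value": 100, "prepared_txn": {"min_value": 100}}  # "balance >= 100"
assert handle_direct_write(acct, 50, buf) == "buffered" and buf == [50]
free = {"value": 100, "prepared_txn": None}
assert handle_direct_write(free, 50, buf) == "applied"
```

Since prepared transactions normally resolve within a couple of round-trips, a buffered write waits only briefly before being applied with a later timestamp.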

  • A storage node can still accept a transaction with an older timestamp, even if newer writes already exist. (Note that in classic timestamp ordering, this transaction would be aborted entirely!)
  • If the transaction later commits, its effect on those items is ignored - since it would be overwritten by newer writes anyway.
  • This helps transactions that write multiple items: even if one item's write is "too old", the rest of the transaction may still succeed.
  • The exception: modify operations (partial updates) must execute strictly in timestamp order, since the outcome depends on sequence.
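This relaxation resembles the classic Thomas write rule: a stale full overwrite is acknowledged but its effect is discarded, because the newer value wins anyway. The sketch below is an illustrative assumption of how a storage node might apply it, not DynamoDB's actual code.

```python
def apply_committed_write(item, txn_ts, value, full_overwrite=True):
    """Apply a committed transactional write at a storage node."""
    if txn_ts >= item["wts"]:
        # Normal case: this write is the newest, apply it.
        item["value"] = value
        item["wts"] = txn_ts
        return "applied"
    if full_overwrite:
        # Stale full overwrite or delete: accept the commit but discard
        # the effect - a newer write already determined the final value.
        return "ignored"
    # Partial updates (modify operations) depend on execution order,
    # so a stale one cannot be silently dropped.
    return "rejected"

item = {"value": "new", "wts": 10}
assert apply_committed_write(item, 8, "old") == "ignored"
assert item["value"] == "new"                       # newer value survives
assert apply_committed_write(item, 12, "newest") == "applied"
```

This is what lets a multi-item transaction succeed even when one of its writes arrives "too old" at a single partition.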

  • Multiple transactions can prepare writes to the same item concurrently.
  • As long as these are full overwrites or deletes, they can commit in any order - only the transaction with the latest timestamp determines the final value.
  • For partial updates (modify), strict timestamp order must be preserved.

  • Read-only transactions could be done in one round instead of the two-phase writeless protocol.
  • If storage nodes supported GetItemWithTimestamp(Rts), each node could:
    • Return the item if its last write timestamp ≤ Rts and no prepared transaction has a smaller timestamp.
    • Otherwise, reject the read.
  • The coordinator collects all results and commits if all succeed.
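The single-round check could look like the sketch below. GetItemWithTimestamp is the hypothetical operation named above - it is a proposed optimization, not an existing DynamoDB API - and the `Node` class is a stand-in.

```python
class Node:
    def __init__(self):
        self.items = {}     # key -> (value, last-write timestamp)
        self.prepared = {}  # key -> timestamps of prepared transactions

def get_item_with_timestamp(node, key, read_ts):
    """Accept the read only if it can be placed at read_ts in the
    serial order without any future contradiction."""
    value, wts = node.items[key]
    newer_write = wts > read_ts
    # A prepared txn with a smaller timestamp might still commit a
    # write that should have been visible to this read.
    earlier_prepared = any(ts < read_ts
                           for ts in node.prepared.get(key, []))
    if newer_write or earlier_prepared:
        return None  # reject; the coordinator aborts and may retry
    return value

n = Node()
n.items["x"] = (42, 10)
assert get_item_with_timestamp(n, "x", 15) == 42
n.prepared["x"] = [12]   # a prepared txn with a smaller timestamp
assert get_item_with_timestamp(n, "x", 15) is None
```

If every node accepts, the coordinator already has a consistent snapshot after one round, halving the round-trips of the writeless protocol.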

  • If all items of a transaction reside in the same partition, the system can skip the full two-phase protocol.
  • The single storage node performs all precondition checks and applies writes immediately.
  • The coordinator just receives a success or failure response - no separate prepare/commit rounds are needed.

DynamoDB’s transaction design achieves all four ACID guarantees. Atomicity is ensured via distributed 2PC and a durable transaction ledger. Consistency is enforced by conditional validation and timestamp-based serialization. Isolation is provided through serializable timestamp ordering without locks. Durability comes from multi-replica persistence and automatic failover. Together, these mechanisms allow DynamoDB to maintain correctness while scaling across thousands of partitions.

I would like to show some results from their benchmarks, but honestly they are a bit vague to me. If you're interested, you can find them in the paper linked in the References section.

Paper: https://drive.google.com/file/d/1Yg2R-wN7KKugx-R4yc8c080XsXtBB0JT/view

Timestamp-Based Algorithms for Concurrency Control in Distributed Database Systems: https://www.microsoft.com/en-us/research/wp-content/uploads/2020/12/TO-Algs-Bernstein-GoodmanVLDB1980.pdf

Timestamp based Concurrency Control: https://www.geeksforgeeks.org/dbms/timestamp-based-concurrency-control/

Lecture #17: Timestamp Ordering Concurrency Control: https://15445.courses.cs.cmu.edu/fall2023/notes/17-timestampordering.pdf

Distributed Systems 7.1: Two-phase commit: https://www.youtube.com/watch?v=-_rdWB9hN1c

Tagged: #Backend #Distributed System #Paper Notes