Distributed Transactions at Scale in Amazon DynamoDB
How DynamoDB delivers ACID at scale: timestamp ordering instead of locks, a distributed 2PC, writeless reads, buffered writes, and recovery that keeps latency predictable.
Transactions are the backbone of data correctness. Their atomicity (the A in ACID) ensures that a group of operations behaves as a single, all-or-nothing unit: either every change succeeds, or none of them do. This "all-or-nothing" property, along with isolation (I) and durability (D), protects systems from partial failures, race conditions (don't get me wrong, resolving those still takes some extra effort, but transactions do help), and data corruption.
Transactions are costly, especially when data scales beyond a single machine. A transaction isn't just a batch of writes - it's a promise that all participants either commit together or roll back together. Enforcing this across machines, networks, and clocks turns into an expensive coordination problem.
- Coordination overhead: Once data is sharded or replicated, no single node sees the full state. To commit atomically, nodes must agree via protocols like Two-Phase Commit or Paxos, adding network round-trips, logging, and synchronization. Any slow or failed node delays the whole system.
- Blocking and contention: If a coordinator fails mid-commit, participants must wait, holding locks and blocking progress. Under heavy load, locks and conflict checks turn hot spots into bottlenecks, especially when spread across machines.
- Replication and ordering costs: Every replica must reach the same commit decision, demanding log flushes and cross-replica coordination. Ensuring consistent order across drifting clocks adds extra wait time - even Google Spanner must pause briefly before finalizing a commit.
- Recovery and availability trade-offs: Crash recovery requires replaying logs and reconciling incomplete transactions, while network partitions force systems to choose between waiting (consistency) and staying online (availability).
Due to those costs, early NoSQL systems abandoned full ACID transactions, favoring performance and availability over strict consistency.
DynamoDB powers applications for hundreds of thousands of customers and multiple high-traffic Amazon systems, including Alexa, the Amazon.com site, and all Amazon fulfillment centers. That being said, customers began requesting ACID transactions. This created a challenge: Amazon engineers had to find a way to integrate transactional operations without sacrificing DynamoDB's key characteristics: high scalability, high availability, and predictable performance at scale.
DynamoDB decided to build a different transaction design with the following capabilities:
- Transactions are submitted as a single request: This gets rid of begin and end operations (such as BEGIN and COMMIT in PostgreSQL). A set of operations submitted as a single request will either succeed or fail as a whole.
- Transactions rely on a transaction coordinator while non-transactional operations bypass the two-phase coordination: DynamoDB provides separate APIs for transactional operations (TransactGetItems and TransactWriteItems). These are executed by a transaction coordinator, while normal operations (Get and Put) are performed directly on the storage servers.
- Transactions update items in place: DynamoDB doesn't support multiple versions of the same item. Some other databases (such as PostgreSQL and MySQL) use MVCC so that read-only transactions can access old versions while transactions that write data produce new versions.
- Transactions do not acquire locks: DynamoDB uses an optimistic concurrency control scheme that avoids locking, since locks restrict concurrency and can lead to deadlock. Additionally, locks need a recovery mechanism in case the lock holder fails.
- Transactions are serially ordered using timestamps: Serializability is achieved using the timestamp assigned to each transaction.
Similar to non-transactional requests, each transactional request is first received by a request router, which performs authentication and authorization, then forwards it to a fleet of transaction coordinators.
Any coordinator in the fleet can take the transaction. It breaks the transaction into item-level operations and runs a distributed protocol across the storage nodes owning those items.
Timestamp-based concurrency control is a technique used in database systems to guarantee serializability (the illusion that transactions execute one after another) without relying on locks.
Instead of blocking or waiting, the system assigns each transaction a unique timestamp when it starts. This timestamp defines its logical position in the serial order of all transactions.
Transactions with smaller timestamps are treated as older and therefore have higher priority (must be executed first). Any operation that violates this order is aborted and retried, ensuring the final schedule remains serializable and deadlock-free.
To enforce ordering, each data item x maintains two timestamp values:
- W-TS(x) - the timestamp of the last transaction that successfully wrote to x.
- R-TS(x) - the timestamp of the last transaction that successfully read from x.
Whenever a transaction T with timestamp TS(T) wants to access x, the system checks:
- Read rule: If TS(T) < W-TS(x), the read is too "old" (a newer value already exists), so the transaction is aborted. Otherwise, it proceeds, and R-TS(x) is updated to max(R-TS(x), TS(T)).
- Write rule: If TS(T) < R-TS(x) or TS(T) < W-TS(x), the write would violate the order (it's trying to overwrite something written or read by a newer transaction), so it's aborted. Otherwise, the write succeeds, and W-TS(x) is updated to TS(T).
Example:
Transaction T1 (timestamp = 5) and Transaction T2 (timestamp = 10) both access item x. Initially: W-TS(x) = 0, R-TS(x) = 0.
| Step | Operation | Check | Result | Updated Timestamps |
|---|---|---|---|---|
| 1 | T1 reads x | TS(T1) = 5 ≥ W-TS(x) = 0 | Allowed | R-TS(x) = 5 |
| 2 | T2 reads x | TS(T2) = 10 ≥ W-TS(x) = 0 | Allowed | R-TS(x) = 10 |
| 3 | T1 writes x | TS(T1) = 5 ≥ R-TS(x) = 10? No | Abort | - |
| 4 | T2 writes x | TS(T2) = 10 ≥ R-TS(x) = 10 and ≥ W-TS(x) = 0? Yes | Commit | W-TS(x) = 10 |
Result: T1 aborts, T2 commits.
The final order (T1 → T2) is serializable and consistent with their timestamps, without locks or waiting.
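The rules above can be sketched in a few lines of Python. This is a toy illustration of classic timestamp ordering, not DynamoDB's implementation; the `Item` class and `TOAbort` exception are made up for the example.

```python
class TOAbort(Exception):
    """Raised when an operation violates timestamp order."""

class Item:
    def __init__(self):
        self.value = None
        self.write_ts = 0  # W-TS(x): timestamp of the last committed write
        self.read_ts = 0   # R-TS(x): timestamp of the last committed read

def read(item, ts):
    # Read rule: a transaction may not read a value written "in its future".
    if ts < item.write_ts:
        raise TOAbort(f"read at ts={ts} is older than W-TS={item.write_ts}")
    item.read_ts = max(item.read_ts, ts)
    return item.value

def write(item, ts, value):
    # Write rule: a transaction may not overwrite state already observed
    # or produced by a newer transaction.
    if ts < item.read_ts or ts < item.write_ts:
        raise TOAbort(f"write at ts={ts} conflicts "
                      f"(R-TS={item.read_ts}, W-TS={item.write_ts})")
    item.value = value
    item.write_ts = ts
```

Replaying the example: both reads succeed, T1's write at timestamp 5 raises `TOAbort` because R-TS(x) has already advanced to 10, and T2's write at timestamp 10 commits.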
DynamoDB adopts timestamp-based concurrency control to eliminate the need for locks.
Each transaction is assigned a timestamp by the transaction coordinator that receives it. Since DynamoDB runs a fleet of coordinators, not a single centralized one, each must maintain a closely synchronized local clock.
These clocks are synchronized using the AWS Time Sync Service - which aligns them within a few microseconds. This service provides low skew but not a provably consistent global clock. Once a transaction obtains its timestamp, all participating storage nodes can perform their local prepare and commit actions independently, using that timestamp to decide ordering without further cross-partition coordination.
β οΈ The paper gives limited detail on the timestamp mechanism, so the following discussion is based on logical inference from timestamp-ordering principles.
Handling Identical Timestamps
If two coordinators accidentally produce the same physical timestamp, DynamoDB can still guarantee determinism by extending each timestamp with extra distinguishing fields, such as a logical counter and a coordinator identifier.
When two transactions have identical physical times, the system compares them by logical counter, and if necessary, by coordinator ID.
This ensures a total order: even if physical times match, every pair of transactions has a deterministic ordering.
The coordinator ID serves only as a tie-breaker, not a permanent hierarchy. Logical time remains the dominant ordering factor, so coordinators do not always βwinβ or βloseβ based on their ID.
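A minimal sketch of such a composite timestamp, assuming a (physical time, logical counter, coordinator ID) layout - the field names and ordering are an assumption based on common timestamp-ordering designs, since the paper does not specify the exact format:

```python
from dataclasses import dataclass

# order=True generates comparisons over the fields in declaration order,
# so physical time dominates, then the logical counter, then the ID.
@dataclass(frozen=True, order=True)
class TxTimestamp:
    physical_us: int      # microseconds from the coordinator's local clock
    logical_counter: int  # bumped when the clock has not advanced
    coordinator_id: int   # final tie-breaker; never dominates time

# Two transactions with identical physical times still compare deterministically:
a = TxTimestamp(1_700_000_000_000_000, 0, 7)
b = TxTimestamp(1_700_000_000_000_000, 0, 3)
```

Here `b < a` only because of the coordinator ID; any difference in physical time or logical counter takes precedence.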
Dealing with Clock Skew
Suppose Coordinator A's clock runs 100 milliseconds fast, while Coordinator B's clock is 100 milliseconds slow.
From the user's perspective, request R_A (stored by A) occurs before request R_B (stored by B).
However, due to the clock skew, B may assign R_B a smaller timestamp than A assigns R_A.
As a result, DynamoDB's logical order becomes R_B → R_A, even though, in real time, R_A happened first.
This reversal does not violate correctness. Serializability only requires that the system maintain a consistent total order of transactions - not that this order reflect wall-clock time.
All partitions agree on the same comparison rule, so every node observes the same serial order. The system remains consistent, isolated, and deadlock-free, though not strictly serializable (it does not guarantee real-time ordering).
One of the most common algorithms for ensuring atomic commitment across multiple nodes is the two-phase commit (2PC) protocol.
This protocol introduces a coordinator (aka Transaction Manager) and participants (database nodes) for which we want to ensure atomicity - either all participants successfully perform the desired actions, or, if any of them fails, no actions are committed.
As the name suggests, there are two phases when a client wants to make changes across multiple nodes (participants). The client first sends the request to the coordinator, then:
- First phase (Prepare phase): Coordinator sends a prepare request to each participant in the transaction, each participant replies with a message indicating whether it is able to commit the transaction.
- ⚠️ In this phase, the participant:
    - Does not commit the change yet; it only checks its ability to commit the desired change.
    - This typically involves writing all transaction data to disk and checking for any conflicts or constraint violations.
    - If it replies "ok to commit", it must ensure that it will definitely be able to apply the change if the coordinator instructs it to commit in the second phase.
    - It likely locks the resources that are about to be updated until it receives the "Commit" or "Abort" instruction from the coordinator.
- Second phase (Commit phase): The coordinator collects the responses:
    - If any participant replies "no" (or times out), the coordinator sends Abort messages to all participants, instructing them to abort the transaction's changes.
    - If all participants reply "ok", the coordinator sends Commit messages to all participants, instructing them to commit the transaction's changes.
    - ⚠️ The coordinator must write the final decision (commit/abort) to its transaction log on disk so it can continue after recovering from a failure.
The process is illustrated in the following diagram:
There are two crucial "points of no return" in this protocol:
- The first point is when a participant votes "ok" - it must be able to commit the change later.
- The second point is after the coordinator makes the decision (commit/abort): that decision is irrevocable. No matter how many retries it takes, the coordinator must ensure that all participants act on the final decision.
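The flow above can be sketched as a toy in-memory 2PC. The `Participant` interface is hypothetical, and a real implementation would persist the prepare state and decision durably; the key step shown is logging the decision before phase 2 begins.

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self, txn_id):
        # Phase 1: durably stage the change, then vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, txn_id, commit):
        # Phase 2: apply or discard the staged change.
        self.state = "committed" if commit else "aborted"

def two_phase_commit(txn_id, participants, log):
    votes = [p.prepare(txn_id) for p in participants]
    decision = all(votes)
    # Point of no return: persist the decision before telling anyone,
    # so a recovering coordinator can finish the protocol.
    log[txn_id] = "commit" if decision else "abort"
    for p in participants:
        p.finish(txn_id, decision)
    return decision
```

A single "no" vote (or timeout, not modeled here) flips the decision to abort for every participant.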
You might wonder what happens when a node crashes or becomes unreachable due to network failure.
Participant failure:
- If a participant fails during the prepare phase, the coordinator's request will eventually time out, and the coordinator treats it as a "no" vote, aborting the transaction.
- If it fails during the commit phase, recovery is straightforward: once the participant comes back online, it can either
- ask the coordinator for the final decision using the transaction's global ID, or
- simply wait for the coordinator to retry the commit message.
Coordinator failure:
This case is more problematic.
- If the coordinator crashes before sending any prepare requests, participants can safely abort since no transaction was initiated.
- However, if participants have already voted "yes", they are stuck in the prepared state and cannot decide on their own. They must wait until the coordinator recovers and informs them whether the transaction was committed or aborted. In other words, once a participant votes "yes", it's effectively blocked until the coordinator returns - a key weakness of the 2PC protocol.
DynamoDB uses a two-phase protocol to ensure that all of the writes within a transaction are performed atomically and in the proper order. It proceeds just as the 2PC protocol described above, with one addition: in the prepare phase, there is an extra check against the timestamp ordering to determine whether the transaction is valid.
There is one case that needs special handling: item deletion. Once an item is deleted, there is no last-write timestamp for it, so the transaction can't be verified in the prepare phase.
One solution could be maintaining tombstones for deleted items - a DELETED record indicating the item's deletion. But that would incur high storage and garbage-collection costs.
DynamoDB's solution is a partition-level max delete timestamp: when an item is deleted, this timestamp is updated if it is smaller than the deleting transaction's timestamp. When a participant receives a prepare request for a non-existent item, it compares the transaction timestamp with the partition-level max delete timestamp to decide whether to accept or reject the transaction.
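Assuming the partition tracks a single `max_delete_ts` as described, the prepare-phase check for a missing item reduces to one comparison (function and parameter names are illustrative, not DynamoDB internals):

```python
def prepare_check_missing_item(txn_ts, partition_max_delete_ts):
    """Decide whether a transaction touching a now-missing item is acceptable.

    Accept only if the transaction is newer than every delete this partition
    has seen; otherwise, a newer delete may have removed the item "after"
    this transaction's position in the serial order, so reject conservatively.
    """
    return txn_ts > partition_max_delete_ts
```

The trade-off versus tombstones: a single coarse timestamp per partition may reject some transactions that a per-item tombstone would have admitted, but it needs constant space and no garbage collection.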
For read-only transactions, DynamoDB uses a variant of Two-Phase Commit called the Two-Phase Writeless Protocol.
Although no writes occur, the protocol still runs in two phases to guarantee that all items are read from a consistent snapshot.
Phase 1: Validation and Snapshot Capture
- The transaction coordinator requests each storage node to return the current value of the item along with its log sequence number (LSN) - a monotonically increasing version counter that represents the most recent write to that item.
- If any item is currently locked or being modified by another in-flight transaction, the read transaction is aborted immediately to avoid reading an inconsistent state.
Phase 2: Re-verification
- After gathering all the initial values and LSNs, the coordinator sends a second round of read requests for the same items.
- Each storage node checks whether the item's current LSN still matches the LSN from Phase 1.
- If all LSNs are unchanged, it means no concurrent writes occurred, so the transaction commits successfully and returns the values fetched in Phase 1.
- If any LSN has changed, it indicates a concurrent update - the transaction is aborted and may be retried.
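The two phases can be sketched as below, assuming a hypothetical node API that returns `(value, lsn, locked)` for an item; the `Node` stub stands in for a real storage node:

```python
class Node:
    """Stub storage node: key -> (value, lsn, locked)."""
    def __init__(self):
        self.items = {}

    def read_with_lsn(self, key):
        return self.items.get(key, (None, 0, False))

def writeless_read(nodes, keys):
    # Phase 1: capture values and LSNs; bail out on any in-flight write.
    snapshot = {}
    for key in keys:
        value, lsn, locked = nodes[key].read_with_lsn(key)
        if locked:
            return None  # abort: item is held by a prepared transaction
        snapshot[key] = (value, lsn)
    # Phase 2: re-read LSNs; any change means a concurrent write slipped in.
    for key in keys:
        _, lsn, _ = nodes[key].read_with_lsn(key)
        if lsn != snapshot[key][1]:
            return None  # abort: not a consistent snapshot; caller may retry
    return {k: v for k, (v, _) in snapshot.items()}
```

Returning `None` models the abort; a client would typically retry the whole read transaction.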
DynamoDB automatically promotes another replica to master if the primary storage node fails. Because the transaction metadata and prepare state are persisted to the transaction log and replicated, the new master is still aware of the transaction's current state and can continue where the old one left off. That being said, the coordinator might not even know it is communicating with a different set of participants!
The coordinator's decision must be written to a transaction ledger - a DynamoDB table keyed by the transaction's identifier - so that if any coordinator fails, a new coordinator can identify incomplete (stalled) transactions and continue processing them.
Stalled transactions can also be detected by the storage nodes: when a node receives a write/read request for an item that is already being written by another transaction, and that pending transaction's timestamp is older than some threshold, the node sends the transaction's id to a coordinator. The coordinator then checks the transaction's status in the ledger table and resumes execution if the transaction is incomplete.
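A sketch of the node-side detection, with an illustrative threshold and a `resume` callback standing in for the round-trip to a coordinator (the ledger layout and threshold value are assumptions, not DynamoDB's actual ones):

```python
STALL_THRESHOLD = 5.0  # seconds a prepared transaction may linger; illustrative

def on_conflicting_request(pending_txn, now, ledger, resume):
    """Called when a request hits an item held by a prepared transaction.

    Returns True if the pending transaction was reported for resumption.
    """
    if now - pending_txn["prepared_at"] <= STALL_THRESHOLD:
        return False  # still in-flight; let it finish normally
    if ledger.get(pending_txn["id"]) in ("commit", "abort"):
        return False  # already decided; the node will learn the outcome
    resume(pending_txn["id"])  # ask a coordinator to drive it to completion
    return True
```

This way, recovery is triggered lazily by traffic on the contended item rather than by a background scan alone.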
DynamoDB extends the classic timestamp ordering concurrency control (TOCC) model to work efficiently in a key-value store that mixes single-item operations and multi-item transactions.
The key idea: maintain serializable order via timestamps, but allow non-transactional reads and writes to proceed out of order when it's safe to do so, improving throughput and reducing coordination.
- A normal GetItem request bypasses transaction coordinators entirely.
- The read goes straight to the storage node responsible for that key.
- The node returns the latest committed value, even if there's a prepared transaction that will overwrite it later.
- A direct PutItem / UpdateItem / DeleteItem can often be accepted immediately and serialized before any prepared transactions that haven't yet written their values.
- Since prepared transactions haven't applied their writes yet, it's safe for single-item writes to "jump ahead" of them in timestamp order.
- The only exception: if the prepared transaction already checked a precondition and a new write could violate that check.
- For example, suppose a transaction withdraws $100, with the precondition that the bank account has sufficient funds. If a non-transactional write withdrawing $50 from the same account were allowed to proceed, and the transaction were later instructed to commit, data consistency would break (the account's balance would go negative).
- In such cases, the storage node must delay or reject the new write.
- Detecting condition violations for arbitrary expressions is complex, but simple cases (e.g., numeric bounds) can be optimized.
- If a new write cannot safely jump ahead (because it may violate a prechecked condition), the node can buffer it until the pending transaction finishes.
- Prepared transactions typically complete quickly, so buffering introduces little latency.
- Once the earlier transaction commits or aborts, the buffered write can be assigned a later timestamp and applied.
- Writes with no preconditions can always proceed immediately. If a prepared transaction with an earlier timestamp later commits, its writes are simply ignored.
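The accept-or-buffer decision for an incoming non-transactional write might look like the sketch below, where each prepared transaction's already-validated precondition is reduced to a predicate over the item's would-be new value (an illustration, not DynamoDB's actual condition-expression machinery):

```python
def admit_write(prepared_txns, new_value, buffer):
    """Accept a single-item write ahead of prepared transactions, or buffer it.

    prepared_txns: dicts with an optional 'precondition' predicate that the
    transaction already validated against the item's current value.
    """
    for txn in prepared_txns:
        pre = txn.get("precondition")
        if pre is not None and not pre(new_value):
            # The write would invalidate a check the transaction already
            # passed (e.g. a balance going negative): hold it back until
            # the transaction commits or aborts.
            buffer.append(new_value)
            return "buffered"
    # Safe to serialize the write before the prepared transactions.
    return "accepted"
```

For the bank example: with a prepared withdrawal whose precondition is `balance >= 100`, a write leaving the balance at 150 is accepted immediately, while one leaving it at 50 is buffered.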
- A storage node can still accept a transaction with an older timestamp, even if newer writes already exist. (Note that, in the classic timestamp ordering, this transaction would be aborted entirely!)
- If the transaction later commits, its effect on those items is ignored - since it would be overwritten by newer writes anyway.
- This helps transactions that write multiple items: even if one item's write is "too old," the rest of the transaction may still succeed.
- The exception: modify operations (partial updates) must execute strictly in timestamp order, since the outcome depends on sequence.
- Multiple transactions can prepare writes to the same item concurrently.
- As long as these are full overwrites or deletes, they can commit in any order - only the transaction with the latest timestamp determines the final value.
- For partial updates (modify), strict timestamp order must be preserved.
- Read-only transactions could be done in one round instead of the two-phase writeless protocol.
- If storage nodes supported GetItemWithTimestamp(TS), each node could:
- Return the item if its last write timestamp ≤ TS and no prepared transaction has a smaller timestamp.
- Otherwise, reject the read.
- The coordinator collects all results and commits if all succeed.
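A sketch of that hypothetical one-round check - GetItemWithTimestamp is an optimization the paper discusses, not an existing DynamoDB API, and `min_prepared_ts` stands for the smallest timestamp among transactions currently prepared on the item:

```python
def get_item_with_timestamp(item, read_ts, min_prepared_ts):
    """Return the item's value if it is safe to read at read_ts in one round."""
    if item["write_ts"] > read_ts:
        return None  # a newer committed write makes this read too old
    if min_prepared_ts is not None and min_prepared_ts < read_ts:
        return None  # an older prepared txn may still commit before read_ts
    return item["value"]
```

If every node answers successfully for the same `read_ts`, the coordinator has a consistent snapshot without the second verification round.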
- If all items of a transaction reside in the same partition, the system can skip the full two-phase protocol.
- The single storage node performs all precondition checks and applies writes immediately.
- The coordinator just receives a success or failure response - no separate prepare/commit rounds are needed.
DynamoDB's transaction design achieves all four ACID guarantees. Atomicity is ensured via distributed 2PC and a durable transaction ledger. Consistency is enforced by conditional validation and timestamp-based serialization. Isolation is provided through serializable timestamp ordering without locks. Durability comes from multi-replica persistence and automatic failover. Together, these mechanisms allow DynamoDB to maintain correctness while scaling across thousands of partitions.
I should probably show some results from their benchmarks, but honestly, the benchmarks are a bit vague to me. If you're interested in them, you can find them in the attached paper in the References section.
Paper: https://drive.google.com/file/d/1Yg2R-wN7KKugx-R4yc8c080XsXtBB0JT/view
TIMESTAMP-BASED ALGORITHMS FOR CONCURRENCY CONTROL IN DISTRIBUTED DATABASE SYSTEM: https://www.microsoft.com/en-us/research/wp-content/uploads/2020/12/TO-Algs-Bernstein-GoodmanVLDB1980.pdf
Timestamp based Concurrency Control: https://www.geeksforgeeks.org/dbms/timestamp-based-concurrency-control/
Lecture #17: Timestamp Ordering Concurrency Control: https://15445.courses.cs.cmu.edu/fall2023/notes/17-timestampordering.pdf
Distributed Systems 7.1: Two-phase commit: https://www.youtube.com/watch?v=-_rdWB9hN1c