Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service

A practical tour of DynamoDB’s architecture: how it partitions data, controls throughput with GAC, balances hot spots, and keeps availability and durability high.

Amazon DynamoDB is a NoSQL cloud database service that provides consistent performance at any scale. Its goal is to complete every request with low single-digit-millisecond latency.

  • Fully managed cloud service
  • Employs a multi-tenant architecture: stores data from different customers on the same physical machines to optimize resource utilization
  • Achieves boundless scale for tables: virtually unlimited table size.
  • Provides predictable performance: no matter how much data is stored, latencies remain stable.
  • Highly available: offers an availability SLA of 99.99% for regular tables (replicated across multiple AWS Availability Zones) and 99.999% for global tables (replicated across multiple AWS Regions)
  • Supports flexible use cases: tables don’t have a fixed schema; each item can have an arbitrary number of attributes with varying types

In 2007, Amazon published Dynamo, its first NoSQL database system, built for the shopping-cart service and highly available writes. But Dynamo was self-managed, requiring teams to do their own operations. During that period, Amazon S3 and Amazon SimpleDB were released, completely managed by AWS, and widely used by Amazon engineers even though Dynamo aligned better with their applications’ needs. SimpleDB had limits on storage capacity (10 GB per table) and request throughput, which forced engineers to divide data between multiple tables to meet their needs.

Amazon engineers decided to combine the best parts of Dynamo (incremental scalability and predictable performance) and SimpleDB (ease of administration of a cloud service, consistency, a table-based data model). The result is DynamoDB, launched in 2012.

The following are DynamoDB’s CRUD APIs for items:

| Operation | Description |
| --- | --- |
| PutItem | Inserts a new item, or replaces an old item with a new item. |
| UpdateItem | Updates an existing item, or adds a new item to the table if it doesn’t already exist. |
| DeleteItem | Deletes a single item from the table by the primary key. |
| GetItem | Returns a set of attributes for the item with the given primary key. |

  • Table: A collection of items
  • Item: corresponds to a row in a relational database; contains a collection of attributes. Each item must have a primary key.
  • Attribute: Key-value pairs that constitute the data within an item. They can vary in type and can also be nested.
  • A DynamoDB table is divided into multiple partitions; each partition hosts a disjoint, contiguous part of the table’s key range.
  • Primary key: Uniquely identifies an item in a table; can consist of one or two attributes:
    • Partition key: hashed to determine which partition stores the item.
    • Sort key (optional): if it exists, combined with the partition key to form a composite primary key. It’s the index within the partition.
    • ⚠️ If the primary key consists only of a partition key, the partition key must be unique. If it is a composite key (with a sort key), multiple items can share the same partition key, but their sort keys must differ.
  • Secondary index: A table can have multiple secondary indexes. A secondary index allows querying the data in the table using an alternate key, in addition to queries against the primary key.
    • Global secondary index (GSI): An index with a different partition key (and an optional sort key) from the primary key. Because the partition key determines which partition stores an item, a different partition key means the index data may be stored on entirely different physical partitions from the base table and is replicated separately.
    • Local secondary index (LSI): Uses the same partition key as the primary key, but a different sort key.
  • DynamoDB supports ACID transactions!
https://www.youtube.com/watch?v=xfxBhvGpoa0

Each partition has 3 replicas (to guarantee that a quorum is always available) across different Availability Zones per Region for high availability and durability. These replicas form a replication group; they use Multi-Paxos for leader election and consensus. The leader maintains its leadership by periodically renewing its leadership lease.

  • Leader:
    • Serves writes and strongly consistent read requests.
    • A write is acknowledged once its write-ahead log record is replicated by a quorum of peers.
    • Only serves requests after the previous leader’s lease expires.
  • Follower:
    • Serves eventually consistent read requests.
    • Proposes a new round of leader election if it detects leader failure.

There are 2 types of replica (node) in a group:

  • Storage node: Full replica that stores the partition’s data structure (B-tree) and its write-ahead logs. Serves reads and participates in writes.
  • Log node: Lightweight replica that stores only write-ahead logs. Exists to restore quorum quickly after a failure. Does not serve reads.

DynamoDB consists of tens of microservices.

  • Metadata service: stores routing information about tables, indexes, and replication groups for the keys of a given table or index
  • Routing service: authenticates, authorizes, and routes requests to the appropriate server (using information from the metadata service)
  • Storage service: stores customer data on a fleet of storage nodes. Each storage node hosts many replicas of different partitions.
  • Autoadmin service: responsible for fleet health, partition health, and scaling of tables.

And some other services, such as the point-in-time restore service, on-demand backups, and so on.

(Figure 1: DynamoDB’s core services)

The goal of DynamoDB is to provide an abstraction in which engineers don’t need to care about the underlying hardware requirements for their tables. This requires partitioning tables horizontally.

DynamoDB has 2 types of capacity unit:

  • RCU: Read Capacity Unit. One RCU = the ability to perform:
    • 1 strongly consistent read per second for an item up to 4 KB,
    • OR 2 eventually consistent reads per second for an item up to 4 KB.
    • For example, if your item size is 8 KB, a single strongly consistent read consumes 2 RCUs.
  • WCU: Write Capacity Unit. One WCU = the ability to perform:
    • 1 write per second for an item up to 1 KB.
    • So, if you write an item that’s 2.5 KB, you’ll need 3 WCUs.

By exposing these units, DynamoDB abstracts away underlying details like partition count and physical replication.

Initially, the customer must provide the provisioned capacity for their table. It’s used to control the load on the table’s partitions.

Suppose they expect ~300 reads/sec and 100 writes/sec of items around 2 KB, with strongly consistent reads.

  • Reads: each 2 KB → 1 RCU per read → 300 reads/sec → 300 RCUs.
  • Writes: each 2 KB → 2 WCUs per write → 100 writes/sec → 200 WCUs.

The customer would provision the table with at least 300 RCUs and 200 WCUs.
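The arithmetic above can be sketched as a small helper. This is a hypothetical calculator following the published RCU/WCU sizing rules, not an AWS API:

```python
import math

def rcus_needed(item_size_kb: float, reads_per_sec: int, strong: bool = True) -> int:
    """RCUs for a read workload: 1 RCU = one strongly consistent read/sec
    of up to 4 KB (or two eventually consistent reads of that size)."""
    units_per_read = math.ceil(item_size_kb / 4)
    if not strong:
        units_per_read = units_per_read / 2
    return math.ceil(units_per_read * reads_per_sec)

def wcus_needed(item_size_kb: float, writes_per_sec: int) -> int:
    """WCUs for a write workload: 1 WCU = one write/sec of up to 1 KB."""
    return math.ceil(item_size_kb) * writes_per_sec

# The example above: 2 KB items, 300 strong reads/sec and 100 writes/sec.
print(rcus_needed(2, 300))   # 300
print(wcus_needed(2, 100))   # 200
```

It also reproduces the earlier examples: an 8 KB strongly consistent read costs 2 RCUs, and a 2.5 KB write costs 3 WCUs.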

A partition must be split if it hits its maximum size capacity or throughput capacity.

Every partition in a DynamoDB table is designed to deliver a maximum capacity of 3,000 read units per second and 1,000 write units per second. https://arc.net/l/quote/ehjvxhym

And

Items are distributed across 10-GB storage units, called partitions (physical storage internal to DynamoDB) https://arc.net/l/quote/cgkyvoyd

DynamoDB uses admission control to ensure storage nodes don’t become overloaded. Initially, the customer must declare the RCUs and WCUs for their tables. DynamoDB’s job is to split the table into multiple partitions, then allocate them to storage nodes such that the two following conditions are met:

  1. If a table has provisioned RCUs $x$ and WCUs $y$, and has $n$ partitions,

    then $\sum_{p=1}^{n} RCU_p = x$ and $\sum_{p=1}^{n} WCU_p = y$

  2. If a storage node hosts $m$ partitions (not necessarily of the same table) and has an RCU cap of $x_1$ and a WCU cap of $y_1$,

    then $\sum_{p=1}^{m} RCU_p \le x_1$ and $\sum_{p=1}^{m} WCU_p \le y_1$

If a partition or storage node hits its allocated load, DynamoDB throttles the requests!
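The two allocation conditions can be sketched as simple invariant checks. The function names and data shapes here are assumptions for illustration:

```python
def table_allocation_ok(partition_rcus, partition_wcus, table_rcu, table_wcu):
    """Condition 1: a table's partitions together carry exactly the
    table's provisioned RCUs and WCUs."""
    return sum(partition_rcus) == table_rcu and sum(partition_wcus) == table_wcu

def node_placement_ok(hosted, node_rcu_cap, node_wcu_cap):
    """Condition 2: the partitions hosted on a storage node (possibly
    from different tables) must fit under the node's RCU/WCU caps."""
    return (sum(r for r, _ in hosted) <= node_rcu_cap
            and sum(w for _, w in hosted) <= node_wcu_cap)

# A 300-RCU / 200-WCU table split into two equal partitions:
print(table_allocation_ok([150, 150], [100, 100], 300, 200))  # True
# A node with 3000/1000 caps hosting partitions from several tables:
print(node_placement_ok([(150, 100), (1000, 500), (500, 300)], 3000, 1000))  # True
```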

  • If a partition is split because of its size: the allocated throughput of the parent partition is divided equally among the child partitions.
  • If a partition is split because of throughput: each child partition’s throughput becomes $\frac{table\_throughput}{total\_number\_of\_partitions}$
    • Let’s say, initially, the table was created with 3200 WCUs, DynamoDB creates 4 partitions, each would be allocated 800 WCUs
    • But then, the table’s provisioned throughput was increased to 3600 WCUs, then, each partition now would get 900 WCUs
    • If the table’s provisioned throughput was increased to 6000 WCUs, then the partitions would be split to create eight child partitions, and each partition would be allocated 750 WCUs
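The throughput-driven rule from the bullets above reduces to one division. A trivial sketch of the rule, not DynamoDB’s actual code:

```python
def per_partition_wcus(table_wcus: int, num_partitions: int) -> float:
    """Throughput-driven splits divide the table's provisioned capacity
    evenly across the current number of partitions."""
    return table_wcus / num_partitions

print(per_partition_wcus(3200, 4))  # 800.0
print(per_partition_wcus(3600, 4))  # 900.0
print(per_partition_wcus(6000, 8))  # 750.0
```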

This results in a uniform distribution of throughput across partitions. In reality, however, application workloads frequently have non-uniform access: some items are hot and accessed far more often than others. This non-uniform access pattern leads to DynamoDB’s two common scaling headaches: hot partitions and throughput dilution.

  • A hot partition happens when most traffic targets a small set of keys - say, all users checking “today’s leaderboard” item - causing that partition to throttle even if others are idle.
  • Throughput dilution kicks in when a partition grows too large and is split for size: its total capacity is divided among the new child partitions, so each gets less throughput. If requests still hit only one of those new partitions, performance drops.

Most of the time, not every partition on a storage node consumes its full share of throughput. DynamoDB exploits this by bursting: a partition can bank unused throughput for up to 5 minutes (300 seconds) and spend those credits later to temporarily exceed its provisioned rate, as long as the storage node still has headroom.

  • For example: If a table is provisioned for 800 WCUs and is split into 4 partitions, each partition gets 200 WCUs. If Partition A idles at 20 WCUs (10%), it accrues (200 − 20) × 300 = 54,000 write credits it can later spend to ride out a spike.

How it’s enforced: Each partition is governed by two token buckets - an allocated-rate bucket (its steady RCU/WCU budget) and a burst bucket (saved credits). The storage node also has a node-level bucket that caps aggregate I/O so the host can’t be overwhelmed. A request is admitted only if both the partition and the node have tokens.

One note here: write requests using burst need an additional check against the node-level tokens of the partition’s other member replicas, since a write only succeeds once it is replicated to a quorum of replicas. The leader replica of the partition periodically collects information about each member’s node-level capacity.
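The two-level admission check can be sketched with token buckets. This is a simplified model: the class and function names, and the single-bucket-per-check flow, are assumptions, not DynamoDB’s implementation:

```python
class TokenBucket:
    """Bucket refilled at `rate` tokens/sec, capped at `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity, self.tokens = rate, capacity, capacity

    def tick(self, seconds: float = 1.0):
        """Refill tokens for elapsed time, never exceeding capacity."""
        self.tokens = min(self.capacity, self.tokens + self.rate * seconds)

def admit_write(partition_alloc, partition_burst, node_bucket, wcus):
    """Admit a write only if the node bucket AND either the partition's
    allocated-rate bucket or its burst bucket can pay for it."""
    if node_bucket.tokens < wcus:
        return False                    # host is out of headroom
    bucket = partition_alloc if partition_alloc.tokens >= wcus else partition_burst
    if bucket.tokens < wcus:
        return False                    # neither steady budget nor saved credits
    bucket.tokens -= wcus
    node_bucket.tokens -= wcus
    return True

# A partition provisioned at 200 WCUs that banked credits while idle,
# on a node with 2,000 WCUs of headroom:
alloc = TokenBucket(rate=200, capacity=200)
burst = TokenBucket(rate=0, capacity=54_000)   # credits accrued earlier
node = TokenBucket(rate=2_000, capacity=2_000)
alloc.tokens = 0                               # steady budget spent this second
print(admit_write(alloc, burst, node, 150))    # True: burst pays, node admits
```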

Let’s take a look at an example of how bursting works!

Imagine a single storage node in DynamoDB’s fleet.

It hosts three partitions:

| Partition | Provisioned Throughput | Description |
| --- | --- | --- |
| P1 | 300 WCUs | handles user profile writes |
| P2 | 400 WCUs | handles session logs |
| P3 | 300 WCUs | handles analytics events |

Each partition has its own token bucket (for its 300–400 WCU budget), and the storage node itself has a global token bucket representing the host’s total hardware capacity - let’s say 2,000 WCUs max sustainable.

Step 1: Quiet period (credits accrue)

For 3 seconds, P1 and P2 run at 10% of their budgets; P3 runs at 200/300.

Burst credits after 3s:

  • P1: (300 − 30) × 3 = 810 WCUs
  • P2: (400 − 40) × 3 = 1,080 WCUs
  • P3: (300 − 200) × 3 = 300 WCUs

Step 2: Sudden spike

A marketing campaign goes live, and all three partitions get hammered simultaneously.

  • P1 tries to use 900 WCUs (3x its normal rate)
  • P2 tries to use 800 WCUs (2x its normal rate)
  • P3 stays quiet at 200 WCUs

Step 3: Two-level admission checks

Each partition spends its saved-up credits:

  • Partition level: P1 and P2 can temporarily exceed their provisioned rates by spending their burst credits. P3 stays under its allocation.
  • Node level: 1,900 WCUs is ≤ 2,000 WCUs, so the node bucket also admits the load. No throttling.

So far, everyone is happy. The total requested across all partitions is 900 + 800 + 200 = 1,900 WCUs, below the node’s 2,000-WCU capacity.

This seems to work, but problems remain:

  • If the node capacity were only 800 WCUs, the node bucket would empty and the node would throttle, even if P1/P2 still had burst credits.
  • If a partition is chronically hot, it will drain its burst bucket and hit its allocated rate.
  • Burst credits are per partition; idle partitions can’t donate credits to a hot neighbor.

If DynamoDB detects throttling on a table while the table’s overall throughput is still below its provisioned limit, it automatically boosts the capacity of the hot partitions by borrowing unused throughput from idle ones (this naturally means some partitions will have their share reduced). The exact algorithm isn’t described in the paper. 😟

If the total table consumption later exceeds its provisioned capacity, DynamoDB scales back the boosted partitions to restore balance. This can happen when the previously idle (borrowed-from) partitions suddenly become active and start bursting.

The Autoadmin service ensures that partitions receiving extra capacity are relocated to less loaded storage nodes so the node’s local throughput cap isn’t violated.

Although this approach seems greedy at first, it proved remarkably effective - eliminating more than 99.99% of throttling caused by skewed access patterns.

While Adaptive Capacity reduced throttling for skewed workloads, it wasn’t perfect:

  • Reactive response: It only activates after throttling occurs, meaning some requests are already rejected before the boost takes effect.
  • Borrowing side effects: It boosts hot partitions by borrowing throughput from cooler ones, so if those cooler partitions suddenly become active, they might now experience throttling.
  • Node constraints: Even after boosting, if several hot partitions are on the same storage node, the node can still hit its physical limits.

DynamoDB later removed the fixed five-minute burst limit and introduced Global Admission Control (GAC), a centralized system that manages throughput at the table level using the familiar token bucket model.

Each GAC instance tracks total token usage across all request routers serving a table. Routers maintain small, short-lived local buckets (also at table level) to admit requests and periodically refill them from GAC. Because GAC continuously receives reports from routers, it can rebuild its in-memory state at any time without affecting service.

When routers request new tokens, GAC allocates them based on observed table-wide activity.

  • For example, if a table has 10,000 writes per second spread across five routers, each might initially get 2,000 tokens. If one router consumes most of its share while another stays idle, GAC shifts more tokens to the busier router in the next round.

This feedback loop allows throughput to adapt smoothly to uneven workloads. To keep one hot partition from monopolizing a node, DynamoDB still keeps partition-level token buckets that cap local capacity. Together, these controls let DynamoDB absorb bursts and balance load while preserving fairness and stability across the fleet.
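The token-shifting feedback loop might look like the following proportional reallocation. This is a guess at the general shape; the paper doesn’t specify GAC’s exact algorithm:

```python
def reallocate_tokens(table_tokens: int, consumed: dict) -> dict:
    """Split the table's token budget across routers in proportion to the
    consumption each router reported in the last interval (equal split
    when nothing has been consumed yet)."""
    total = sum(consumed.values())
    if total == 0:
        share = table_tokens / len(consumed)
        return {router: share for router in consumed}
    return {router: table_tokens * c / total for router, c in consumed.items()}

# Five routers sharing 10,000 writes/sec; router "a" is doing most of the work,
# so it receives most of the tokens in the next round.
print(reallocate_tokens(10_000, {"a": 1_900, "b": 50, "c": 25, "d": 25, "e": 0}))
```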

(Figure 2: the GAC solution)

While GAC greatly improves fairness and utilization, it still has several limitations:

  • Node bottlenecks: Even if GAC allows partitions to burst, multiple hot partitions on the same storage node can still exceed the node’s physical capacity.
  • Slow reaction: Routers only refill tokens every few seconds, so a router facing a sudden spike may throttle briefly until GAC adjusts its allocation.
  • Fairness, not locality: GAC distributes tokens at the table level, so if all traffic targets one hot partition, local throttling still happens at that partition or node.

To address the limitations of Global Admission Control, DynamoDB introduced a system that proactively rebalances partitions across storage nodes based on their real throughput usage. Each storage node continuously monitors the total throughput and data size of the partitions it hosts. When a node’s throughput exceeds a defined threshold relative to its maximum capacity, it reports a list of candidate partitions to the Autoadmin service for relocation. Autoadmin then identifies a new storage node - either within the same Availability Zone or another one - that doesn’t already host a replica of those partitions, and moves them there. This automated rebalancing ensures that no single node becomes a bottleneck, keeping the system’s performance consistent even under dynamic workloads. Rebalancing, however, has its own limitations:

  • Limited placement options: If no suitable storage node has enough spare capacity, partition relocation may be delayed or skipped.
  • Migration overhead: Moving partitions between nodes consumes network and I/O bandwidth, which can momentarily affect performance.
  • Transient imbalance: While partitions are being relocated, nodes may still experience temporary overload.

DynamoDB tackles uneven load by automatically splitting partitions when their throughput exceeds a threshold. Instead of splitting key ranges in half, it uses the observed key distribution to choose smarter split points based on real access patterns. Most splits finish within minutes, though DynamoDB skips splitting partitions with hot single keys or sequential access, where splitting wouldn’t improve performance. Splitting also has limits:

  • Ineffective for single hot keys: If most traffic targets one specific key, splitting doesn’t help because all requests still hit the same partition.
  • Sequential access patterns: Workloads that access keys in order can lead to new partitions becoming hot again, repeating the problem.
  • Split overhead: Each split involves data movement and rebalancing, which adds temporary load and increases system complexity.
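A load-based split-point chooser could be sketched like this. This is hypothetical: DynamoDB’s real key-distribution logic isn’t public, and the half-traffic heuristic and hot-key threshold are assumptions:

```python
def choose_split_key(key_traffic: dict, hot_key_fraction: float = 0.5):
    """Pick the first key (in key order) at which cumulative observed
    traffic reaches half the total. Returns None when one key carries
    most of the load, since splitting then cannot separate the traffic."""
    total = sum(key_traffic.values())
    if max(key_traffic.values()) / total >= hot_key_fraction:
        return None  # single hot key: a split would not help
    cumulative = 0
    for key in sorted(key_traffic):
        cumulative += key_traffic[key]
        if cumulative >= total / 2:
            return key
    return None

print(choose_split_key({"a": 10, "b": 40, "c": 30, "d": 20}))  # "b"
print(choose_split_key({"a": 5, "b": 90, "c": 5}))             # None
```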

The remaining problems - a single hot key or sequential access - have no magic server-side fix; they need some help from the application. For example:

  • If it’s read-heavy:
    • Cache the hot item with DAX or another cache (ElastiCache, an in-memory cache, …)
  • If it’s write-heavy:
    • Throttle the writes on the client side, or shard the items!
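Write sharding for a hot key is a client-side pattern and can be sketched as follows. `NUM_SHARDS` and the `#` separator are application choices, not DynamoDB features:

```python
import random

NUM_SHARDS = 10  # chosen by the application; a tuning assumption

def sharded_key(hot_key: str) -> str:
    """Spread writes for a hot key across NUM_SHARDS partition keys,
    so they land on different partitions."""
    return f"{hot_key}#{random.randrange(NUM_SHARDS)}"

def all_shard_keys(hot_key: str) -> list:
    """Readers must query every shard and merge the results."""
    return [f"{hot_key}#{i}" for i in range(NUM_SHARDS)]

print(sharded_key("leaderboard-today"))        # e.g. "leaderboard-today#7"
print(all_shard_keys("leaderboard-today")[:2])
```

The trade-off: writes scale out across shards, but reads fan out into `NUM_SHARDS` queries whose results must be merged by the client.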

Many teams migrating to DynamoDB came from self-managed systems where they had to predict and provision capacity themselves. DynamoDB simplified this with a serverless model using RCUs and WCUs, but forecasting was still hard, often leading to over-provisioning or throttling.

On-demand tables solved this by automatically scaling with real traffic, instantly handling up to 2x the previous peak and expanding further as demand grows. DynamoDB achieves this by splitting partitions based on traffic automatically, while GAC ensures no single workload dominates shared resources, keeping performance stable across the fleet.

Data should never be lost after it has been committed. In practice, data loss can occur because of hardware failures, software bugs, or hardware bugs. DynamoDB is designed for high durability by having mechanisms to prevent, detect, and correct any potential data losses.

Write-ahead logs are central for providing durability and crash recovery. They’re stored in all three replicas of a partition and periodically archived to Amazon S3 (an object store designed for 11 nines of durability).

  • Every data transfer between two nodes (log entries, messages, log files) is validated using checksums
  • Every log file archived to S3 has a manifest containing information about the log, such as its table, partition, and start and end markers. The agent responsible for archiving files to S3 performs various checks (logs belong to the correct table, there’s no hole in the sequence numbers, checksum verification, …) before archiving the file. The file itself also has a checksum, and if a file has already been uploaded, the agent verifies its checksum as well.
  • Scrub process: runs continuously and verifies (using checksums) that:
    • All three replicas in a replication group have the same data
    • The data of the live replicas matches a copy of a replica built offline from the archived WALs
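The archiver’s pre-upload checks can be sketched as follows. This is a toy model: the record format and the use of SHA-256 as the checksum are assumptions:

```python
import hashlib

def verify_log_file(records, expected_sha256: str) -> bool:
    """Pre-archive checks: sequence numbers are contiguous (no holes in
    the WAL) and the file contents hash to the manifest's checksum."""
    seqs = [seq for seq, _ in records]
    if seqs != list(range(seqs[0], seqs[0] + len(seqs))):
        return False  # hole in the WAL sequence
    digest = hashlib.sha256(b"".join(payload for _, payload in records)).hexdigest()
    return digest == expected_sha256

records = [(100, b"put a=1"), (101, b"put b=2"), (102, b"del a")]
checksum = hashlib.sha256(b"put a=1put b=2del a").hexdigest()
print(verify_log_file(records, checksum))                   # True
print(verify_log_file([records[0], records[2]], checksum))  # False: hole at 101
```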

DynamoDB’s complex distributed architecture demands rigorous correctness. The team uses formal methods, especially TLA+, to model and verify its replication protocol, catching subtle bugs before production. Every protocol change is formally checked, while failure injection and stress testing ensure both the data and control planes remain reliable under real-world conditions.

DynamoDB protects data not just from hardware failures but also from logical corruption caused by application bugs. It offers continuous backups and point-in-time restore without affecting table performance, using archived write-ahead logs (WALs) stored in Amazon S3.

  • Backups are full, consistent copies of tables that can be restored anytime.
  • Point-in-time restore: users can recover a table to any state from the past 35 days. DynamoDB periodically snapshots partitions based on accumulated WALs and combines these snapshots with logs to recreate the table exactly as it existed at the requested moment.

DynamoDB’s replication group typically consists of three replicas across different Availability Zones. A write is considered successful once two out of three replicas acknowledge it. The system handles failures differently depending on which replica goes down:

Leader Failure

When the leader replica fails, the remaining replicas detect missing heartbeats and elect a new leader to restore write coordination.

  • Impact: Brief pause (a few seconds) for writes and strongly consistent reads while the election completes.
  • Goal: Maintain consistency and durability before resuming normal operation.

Follower Failure

When a follower replica fails, the remaining two replicas can still form a write quorum, but the group can no longer tolerate another failure, and rebuilding a full storage replica (copying its B-tree and logs) takes time.

To avoid this long window of reduced fault tolerance, DynamoDB introduces a log node that stores only write-ahead logs (WALs).

  • Effect: The log node quickly restores the group to full health, since it only needs to catch up on recent log entries, allowing writes to continue safely.
  • Limitation: The log node cannot serve reads; it simply acts as a Paxos acceptor until a full replica recovers.

  • Strongly consistent reads are served only by the leader
  • Eventually consistent reads can be served by any replica (except log nodes)
  • Just like in the previous section, there will be a tiny blip for strongly consistent reads if the leader fails and a leader election is in progress!

Any replica can initiate a new leader election if it believes the current leader has failed. However, gray failures - partial network issues between the leader and a follower - can create false positives, causing a follower to mistakenly think the leader is down. To avoid unnecessary elections, a follower first checks with other replicas to see if they’re still receiving heartbeats from the leader.

But what if the follower doesn’t get any responses? In that case, it proceeds to start an election. Two outcomes are possible:

  • If it still can’t reach any replicas, it fails to form a quorum, and no new leader is elected.
  • If it can reach others but the original leader is actually alive, those replicas will reject the proposal, and again, no new leader is chosen.

DynamoDB is built for extreme reliability - 99.999% availability for global tables and 99.99% for regional ones.

Availability is measured every five minutes based on successful requests, with continuous monitoring at both the service and table levels. If customer error rates exceed a threshold, Customer-Facing Alarms (CFAs) automatically trigger, allowing rapid mitigation either through automation or operator action.

Daily aggregation jobs also compute long-term availability metrics, stored in S3 for trend analysis. To capture real user experience, DynamoDB measures client-side availability using two sources: internal Amazon services that report API success rates, and canary applications running across all Availability Zones. These canaries mimic real customer traffic, helping detect subtle or partial (“gray”) failures before users notice them.

DynamoDB pushes software updates at a regular cadence; each update must go through a full cycle of development and testing.

A new update can introduce a new type, message, or protocol that the old software can’t understand. The strategy is to split the deployment into multiple phases: first, deploy software that can read the new message/type/protocol; once all nodes can handle the new format, update the software to start sending it.

All deployments are done on a small set of nodes before being pushed to the entire fleet. DynamoDB sets alarm thresholds on availability metrics; if any error exceeds the thresholds, the system triggers a rollback.

During deployments, the leader relinquishes its lease, so the new leader doesn’t need to wait for the previous leader’s lease to expire before serving requests.

To ensure high availability, all the services that DynamoDB depends on in the request path should be more highly available than DynamoDB.

DynamoDB also caches data from external dependencies, to preserve functionality temporarily in the event of failures.

  • These caches are periodically refreshed in the background.
  • If a request hits a server that doesn’t have the cache while a dependency is down, that request will see the impact. However, in practice, the impact when a dependency is impaired is minimal.

One critical path is finding the storage node for a specific primary key.

DynamoDB stores the metadata in DynamoDB itself and caches the metadata for the entire table in the request router. When a router receives a request for a table it hasn’t seen before, it downloads the routing information for the entire table and caches it locally.

→ This raises a thundering-herd problem: for example, when new request routers launch, their caches are empty, so every DynamoDB request would result in a metadata lookup!

MemDS

  • DynamoDB introduced MemDS, an in-memory distributed metadata store replicated across the MemDS fleet for high availability.
  • It uses a Perkle tree (a hybrid of Patricia and Merkle trees) for fast key and range lookups, keeping keys sorted for efficient scans.
  • Supports two special lookups:
    • floor: finds the entry with the largest key ≤ the given key
    • ceiling: finds the entry with the smallest key ≥ the given key
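MemDS’s Perkle internals aren’t public, but the floor/ceiling semantics can be sketched over a sorted key list with binary search:

```python
import bisect

def floor_key(sorted_keys, key):
    """Largest stored key <= key, or None if none exists."""
    i = bisect.bisect_right(sorted_keys, key)
    return sorted_keys[i - 1] if i > 0 else None

def ceiling_key(sorted_keys, key):
    """Smallest stored key >= key, or None if none exists."""
    i = bisect.bisect_left(sorted_keys, key)
    return sorted_keys[i] if i < len(sorted_keys) else None

# Partition start keys for a table; routing a key is a floor lookup.
boundaries = ["a", "g", "m", "t"]
print(floor_key(boundaries, "k"))    # "g": "k" lives in the partition starting at "g"
print(ceiling_key(boundaries, "k"))  # "m"
```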

Each request router now maintains a lightweight partition-map cache (unlike the old design, which stored full table metadata). Requests first check the router’s cache, but every lookup - hit or miss - still triggers a query to MemDS. This design keeps MemDS traffic constant and prevents uneven load spikes.

If a router’s cached metadata becomes stale, the contacted storage node either returns the latest membership info or an error, prompting the router to immediately refresh its cache from MemDS. This ensures routing remains accurate, consistent, and low-latency even at massive scale.


DynamoDB’s architecture shows how a globally distributed system can deliver simplicity on the surface while managing enormous complexity underneath. Each layer, from token-bucket-based throughput control and adaptive partitioning to formal verification and MemDS, is designed to maintain predictable performance, high availability, and developer ease of use at scale. Over time, DynamoDB has evolved from a provisioned-capacity database into a self-tuning, resilient, and globally consistent service that hides the challenges of distributed systems behind a clean interface. Its story is not just about speed or uptime but about transforming complexity into reliability through careful and disciplined engineering.

https://www.usenix.org/system/files/atc22-elhemali.pdf

https://www.youtube.com/watch?v=FeFYLKJQxTs&t=396s

https://www.youtube.com/watch?v=HaEPXoXVf2k

https://www.hellointerview.com/learn/system-design/deep-dives/dynamodb

Tagged: #Backend #Distributed System #Paper Notes