TiDB: A Raft-based HTAP Database

A write-up about TiDB’s Raft-based HTAP architecture: exploring multi-Raft storage, learner replicas, the DeltaTree columnar engine, MVCC transactions, read optimizations, and how TiKV and TiFlash deliver scalable, consistent, and isolated OLTP + real-time OLAP in one system.

Relational databases provide strong ACID transactions and SQL support but historically struggled with horizontal scalability. NoSQL systems improved scalability by relaxing consistency, and NewSQL systems later combined distributed consensus with ACID semantics to deliver scalable and strongly consistent OLTP systems.

Modern applications, however, also demand real-time analytics on operational data. OLTP (Online Transactional Processing) handles high-concurrency read/write transactions, while OLAP (Online Analytical Processing) focuses on large-scale analytical queries. Traditionally, these workloads run on separate systems connected by ETL pipelines, introducing latency and complexity.

HTAP systems aim to unify OLTP and OLAP while preserving scalability, consistency, freshness, and isolation.

TiDB is an HTAP database built by extending the Raft consensus model. It adopts a multi-Raft storage architecture that maintains both a row-based store and a column-based store within the same distributed system. The row store is optimized for OLTP workloads, where short, high-concurrency transactions benefit from indexed row access. The column store is designed for OLAP workloads, where analytical queries scan large numbers of rows but only a subset of columns, enabling better compression and vectorized execution.

Data is synchronized between the two stores through Raft-based transactional log replication. In a traditional Raft group, multiple replicas maintain a replicated log. One node acts as the leader, and the others are followers. The leader handles writes and replicates log entries to followers. A write is considered committed once a quorum of replicas acknowledges it. This ensures strong consistency and high availability: if the leader fails, a follower can be elected as the new leader.

TiDB extends this model by introducing a new role: the learner. A learner does not participate in leader election and is not counted toward quorum. Instead, it asynchronously replicates logs from the leader and uses them to build a columnar replica. This allows OLAP queries to run on separate replicas without interfering with OLTP workloads, while still maintaining consistency through the shared Raft log.
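To make the quorum rule concrete, here is a minimal sketch (illustrative Python, not TiKV code; the node names are made up) of how learner acknowledgements are excluded from the commit decision:

```python
# Sketch of the commit rule for a Raft group with voters and a learner.
# Only voter acknowledgements count toward quorum; the learner replicates
# the same log asynchronously but never votes.

def is_committed(acks: set, voters: set) -> bool:
    """An entry commits once a majority of *voters* acknowledge it."""
    return len(acks & voters) > len(voters) // 2

voters = {"tikv-1", "tikv-2", "tikv-3"}
learners = {"tiflash-1"}   # receives the log but never votes

# Two voters acked -> committed; a learner's ack is irrelevant to quorum.
print(is_committed({"tikv-1", "tikv-2"}, voters))      # True
print(is_committed({"tikv-1", "tiflash-1"}, voters))   # False
```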

However, this design introduces several technical challenges.

  1. Building a scalable multi-Raft storage engine that handles high concurrency while reducing quorum-induced network and disk latency.
  2. Replicating logs to learners with low latency and strong consistency, especially under large transactions and schema changes.
  3. Ensuring OLTP and OLAP workloads run simultaneously with strict performance isolation.

TiDB consists of three core components:

  • TiDB servers
  • Storage servers
  • Placement Driver (PD)
II. Architecture — reference figure 1

TiDB Server is the stateless SQL layer. It’s responsible for:

  • Parsing and optimizing SQL queries.
  • Generating distributed execution plans.
  • Coordinating transactions.
  • Communicating with the Storage Servers.

TiDB Server does not store data and is stateless, so it can be scaled horizontally.

The storage layer consists of two data stores:

  • Row store (TiKV)
  • Columnar store (TiFlash)

TiKV and TiFlash can be deployed on separate physical resources, offering isolation when processing OLTP and OLAP queries.

Logically, data stored in TiKV is an ordered key-value map.

  • The key is composed of the table ID and the row ID, both unique integers. The row ID comes from the primary key column.
  • The value is the actual row data.

For example, the key table{tableID}_record{rowID} maps to the value {col1, col2, col3, col4}.
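As an illustration of this layout (a readable sketch only; TiKV’s real encoding is a binary, memcomparable format):

```python
# Sketch of the logical key layout described above. The integers are
# zero-padded so that lexicographic key order matches numeric order;
# real TiKV achieves this with a binary memcomparable encoding instead.

def encode_record_key(table_id: int, row_id: int) -> str:
    return f"table{table_id:08d}_record{row_id:016d}"

# The value is the actual row data, keyed by the encoded record key.
kv = {encode_record_key(42, 7): {"col1": "Alice", "col2": 30}}
print(encode_record_key(42, 7))   # table00000042_record0000000000000007
```

Because keys sort first by table and then by row, all rows of one table form a contiguous key range, which is what makes Region-based range partitioning work.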

To scale out, the data is partitioned into contiguous key ranges, each called a Region.

Each Region has multiple replicas for high availability. The Raft consensus algorithm is used to maintain consistency between replicas, forming a Raft group. The leader of each Raft group asynchronously replicates data from TiKV to TiFlash.

Here’s the high level architecture of Distributed Storage Layer.

2. Storage servers — figure 1

The PD (Placement Driver) is the metadata and control plane of a TiDB cluster.

PD tracks Region metadata, including each Region’s key range, replica locations, leader information, and version state. It also maintains store-level metadata for every TiKV node, such as disk usage, CPU load, Region count, leader distribution, and real-time workload statistics collected through heartbeats.

Besides, it tracks real-time data distribution across TiKV nodes, and schedules Region operations such as split, merge, and migration to balance workload.

PD also serves as the Timestamp Oracle (TSO), generating strictly increasing, globally unique timestamps that distributed transactions use as transaction IDs.

To ensure high availability, PD itself runs as a small Raft cluster, typically with an odd number of members. PD nodes coordinate with each other and with TiKV nodes to maintain cluster state and make scheduling decisions. PD also holds no persistent state: on startup, the PD members gather all necessary data from the other members and from TiKV nodes.

In this section, we’ll deep-dive into the Storage Layer: TiKV and TiFlash.

TiKV deployment consists of many TiKV servers. Regions are replicated between TiKV servers using Raft. A TiKV server can be either leader or follower for different Regions. Each TiKV server stores data & metadata in RocksDB.

Here’s the original flow of how a Raft leader handles read/write requests for its Region.

1. Row-based Storage (TiKV) — figure 2

TiDB makes some optimizations to the above flow:

  • Steps (2) & (3) can be done in parallel. What matters is reaching quorum, so appending the log on the leader and on the followers can happen concurrently.
    • What if the leader fails to append the log? It can retry. If it completely fails to retry, then it’s dead → the other followers start a new round of election.
  • The leader does not wait for follower responses before processing subsequent requests. It assumes the write will succeed and continues serving requests.
  • Commit entry & apply locally can be done in another thread.

So the new flow is:

a. Optimization between Leaders and Followers — figure 3
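The optimized append path can be sketched as a toy simulation (not TiKV code; the `replicate` helper is hypothetical): the leader’s local append and the followers’ appends run in parallel, and the entry counts as committed once a quorum of appends succeeds:

```python
# Toy sketch of the parallel-append optimization: the leader's own log
# append and replication to followers are submitted concurrently, and the
# entry commits once a majority of the appends succeed.
from concurrent.futures import ThreadPoolExecutor

def replicate(entry, leader_log, follower_logs):
    with ThreadPoolExecutor() as pool:
        # Leader append and follower appends run in parallel.
        futures = [pool.submit(leader_log.append, entry)]
        futures += [pool.submit(log.append, entry) for log in follower_logs]
        acks = sum(f.exception() is None for f in futures)
    total = 1 + len(follower_logs)
    return acks > total // 2   # quorum reached -> entry is committed

leader, followers = [], [[], []]
print(replicate("put k=v", leader, followers))   # True
```

In the real system the leader also pipelines subsequent requests without waiting for this round to finish, and commit/apply happens on a separate thread; the sketch shows only the parallel-append part.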

TiKV provides linearizable read semantics: once a write completes, any subsequent read must see that write (or a newer value); the system must never return an older version of the data. This can be achieved by issuing a log entry for each read request and waiting for that entry to be committed in the Raft group before returning. It works because Raft guarantees that all committed entries are agreed upon by a majority; committing this read marker confirms two things: the leader is still valid, and all prior committed writes are reflected in its state. But it is also very expensive, as it incurs network & I/O overhead on every read.

We can also guarantee it by always reading from the leader: Raft guarantees that once the leader successfully writes its data, it can respond to read requests without synchronizing logs across servers. However, the leader role can change, so to read safely from the leader, TiKV implements the following optimizations:

  • read index: When a leader receives a read request, it records its current commit index and sends heartbeats to the followers to confirm its leadership. If it receives responses from a quorum of followers - meaning it is still the leader - it waits until its applied index reaches the recorded commit index before serving the read. This improves read performance but still incurs a little network overhead (the heartbeat round).
    • Commit index: highest log index replicated to a quorum (globally agreed).
    • Applied index: highest log index executed on the state machine (visible to reads).
  • lease read: The leader and followers agree on a lease period, during which followers do not issue leader-election requests, so leadership cannot change during that window. The leader can therefore respond to read requests without contacting its followers.
    • Note that: A lease is only considered valid as long as followers are still receiving the leader’s regular heartbeats, which keeps their election timers from expiring. If the leader crashes during the lease, heartbeats stop, followers’ timers eventually time out, and they start a new election as usual. In this sense, the heartbeat in a lease-based read mainly serves as a liveness signal that preserves leadership over a time window, while the heartbeat used in Read Index is part of a per-read quorum confirmation to explicitly verify the leader’s authority before serving the read.

Another optimization is follower read, where a follower serves the read. If the follower’s applied index is ≥ the read index, it can serve the read request; otherwise, it waits for the log to be applied.
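The rule that gates both read index and follower read can be stated in one line (hypothetical helper name, for illustration):

```python
# The gate for serving a read under the read-index protocol: the replica's
# state machine must have applied at least up to the commit index that was
# recorded when the read arrived.

def can_serve_read(read_index: int, applied_index: int) -> bool:
    return applied_index >= read_index

# Leader (or follower) recorded commit index 105 for an incoming read:
print(can_serve_read(105, applied_index=103))   # False -> wait for apply
print(can_serve_read(105, applied_index=105))   # True  -> serve the read
```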

In TiKV, data is partitioned into Regions. Over time, some Regions may become too large or too hot, while others remain small and rarely accessed. Large or hot Regions can create performance bottlenecks, while too many small Regions increase heartbeat and metadata overhead.

To maintain balance, PD dynamically issues split and merge commands based on observed workloads.

Region Split

c. Dynamic Region Split And Merge — figure 4

Above is the flow of Region Split.

Once committed, the Region is split into smaller contiguous key ranges:

  • The rightmost Region reuses the original Raft group.
  • Other new Regions create new Raft groups.

Merge Region

When two adjacent Regions are merged, one Region becomes the target and keeps its Raft group, while the other is absorbed and its Raft group is eventually removed.

  • Prepare merge and Commit merge logs: Region merging is not an immediate local operation. The source Region writes a prepare-merge log and sends a snapshot of its state to the target Region. The target writes a commit-merge log that references the source’s state. Only after both logs are committed across their respective Raft groups can the merge complete.
  • Epoch and metadata changes: Both Regions’ metadata (key ranges and version numbers) are updated through Raft logs. If a client sends a request to the old Region during or after the merge, it may receive an EpochNotMatch or RegionNotFound error because its cached metadata is stale. The client (via TiDB) then refreshes Region metadata from PD and retries the request against the correct Region. This mechanism ensures safety and consistency: no committed data is lost, and ongoing requests are either completed or retried transparently.
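A client-side retry loop of the kind described above might look like this (a sketch: `pd_lookup` and `send` are hypothetical stand-ins; only the error name comes from the text):

```python
# Sketch of the stale-epoch retry loop: on EpochNotMatch, drop the cached
# Region metadata, refetch it from PD, and retry against the new Region.

class EpochNotMatch(Exception):
    """Raised by a store when the request carries a stale Region epoch."""

def send_with_retry(key, cache, pd_lookup, send, max_retries=3):
    for _ in range(max_retries):
        region = cache.get(key)
        if region is None:
            region = pd_lookup(key)     # refresh Region metadata from PD
            cache[key] = region
        try:
            return send(region, key)
        except EpochNotMatch:
            cache.pop(key, None)        # cached metadata is stale -> drop it
    raise RuntimeError("retries exhausted")
```

RegionNotFound would be handled the same way; the essential point is that the error is recoverable by refreshing metadata, not a data-loss condition.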

TiFlash is composed of learner nodes, which receive Raft logs from Raft groups and transform row-format tuples into columnar data. They do not participate in the Raft protocol to commit logs or elect leaders, so they impose little overhead on TiKV.

The logs received by learner nodes are replayed in FIFO order. Log replay has three steps:

  1. Compacting logs: Classify the logs into three statuses: prewritten, committed, and rolled back. Then compact them by deleting invalid prewritten logs (writes that are later rolled back). The raw logs:

    a. Log Replayer — figure 5

    Compacted logs:

    a. Log Replayer — figure 6
  2. Decoding tuples: The compacted logs from step (1) are decoded into row-format tuples, transaction information is stripped, and the tuples are put into a row buffer.

    a. Log Replayer — figure 7
  3. Transforming data format: Transform the row tuples into columnar format using the locally cached schemas (which are periodically synchronized with TiKV). For example, the output could be 5 columns: operation type, commit timestamp, key, and 2 columns of data.

    a. Log Replayer — figure 8
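Step 1 can be sketched as follows, assuming a simplified log record of (status, transaction_id):

```python
# Sketch of log compaction: transactions that were rolled back invalidate
# their prewrite entries, so those entries (and the rollback markers
# themselves) are dropped before decoding.

def compact(logs):
    rolled_back = {tx for status, tx in logs if status == "rollback"}
    return [(status, tx) for status, tx in logs
            if status != "rollback"
            and not (status == "prewrite" and tx in rolled_back)]

raw = [("prewrite", "t1"), ("prewrite", "t2"),
       ("commit", "t1"), ("rollback", "t2")]
print(compact(raw))   # [('prewrite', 't1'), ('commit', 't1')]
```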

To transform tuples into columnar format in real time, learner nodes have to be aware of the newest schema - which is stored in TiKV. So if TiFlash wants to get the schema, it needs to ask TiKV. To reduce the number of times TiFlash asks TiKV, each learner node maintains a schema cache.

To refresh the schema cache, TiDB takes a two-stage strategy:

  • Regular synchronization: Fetch the newest schema periodically and refresh the cached data.
  • Compulsive synchronization: If the learner detects a mismatch between the columns and the row tuples during transformation, it immediately fetches the newest schema.
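A minimal sketch of the two strategies (the `SchemaCache` class is hypothetical; `fetch_schema` stands in for the RPC to TiKV):

```python
# Sketch of the two schema-synchronization strategies on a learner node.

class SchemaCache:
    def __init__(self, fetch_schema):
        self.fetch = fetch_schema       # stand-in for asking TiKV
        self.schema = fetch_schema()    # list of column names

    def regular_sync(self):
        # Strategy 1: refresh periodically regardless of workload.
        self.schema = self.fetch()

    def transform(self, row):
        if len(row) != len(self.schema):
            # Strategy 2: column/tuple mismatch -> compulsive synchronization.
            self.schema = self.fetch()
        return dict(zip(self.schema, row))
```

The compulsive path is what keeps transformation correct across DDL: a row written under a newer schema fails the length check and forces an immediate refresh.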

TiFlash uses a custom columnar storage engine called DeltaTree to efficiently handle frequent updates while supporting fast analytical reads.

DeltaTree separates data into two parts:

  • Stable space: immutable columnar chunks (similar to Parquet row groups). Each chunk represents a batch of rows logically, but is stored column by column physically:

    column_key: [k1, k2, k3, ...]
    column_a:   [a1, a2, a3, ...]
    column_b:   [b1, b2, b3, ...]
  • Delta space: newly inserted or deleted updates, appended in order.

Incoming updates are first written to the delta space in append-only fashion, functioning like a write-ahead log (WAL). These deltas are periodically compacted into larger files and eventually merged into the stable space. During merging, inserted, updated, and deleted tuples are applied to stable chunks, which are atomically replaced.

To accelerate reads, DeltaTree builds a B+ tree index over the delta space, ordered by key and timestamp. This allows efficient lookup and range queries without scanning all delta files.

Compared to an LSM-tree design, DeltaTree achieves roughly 2x faster read performance because it avoids multi-level file overlap during reads. Although DeltaTree has higher write amplification, the trade-off is acceptable given its significantly better read efficiency for HTAP workloads.

c. Columnar Delta Tree — figure 9

When a query arrives, TiFlash first reads the relevant columnar chunks from the stable space. These chunks are already sorted and compressed, making column scans efficient for analytical workloads.

Next, it checks for newer updates in the delta space. Instead of scanning all delta files, TiFlash uses the in-memory B+ tree index to quickly locate delta records for the requested key or key range. The index points directly to relevant delta entries.

Finally, the engine merges stable data with delta updates:

  • Inserts add new rows.
  • Updates override existing values.
  • Deletes remove rows.
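Under simplifying assumptions (stable data as a key → row mapping, delta entries as ordered (key, op, row) records), the merge reads like this:

```python
# Sketch of the DeltaTree read path: start from the stable chunks, then
# apply newer delta entries in order - upserts add or override rows,
# deletes remove them.

def read(stable: dict, deltas: list) -> dict:
    view = dict(stable)                 # stable space, already sorted
    for key, op, row in deltas:         # newer updates from the delta space
        if op == "delete":
            view.pop(key, None)
        else:                           # insert or update
            view[key] = row
    return view

stable = {"k1": {"a": 1}, "k2": {"a": 2}}
deltas = [("k2", "upsert", {"a": 20}), ("k3", "upsert", {"a": 3}),
          ("k1", "delete", None)]
print(read(stable, deltas))   # {'k2': {'a': 20}, 'k3': {'a': 3}}
```

The B+ tree index mentioned above is what lets the real engine find the delta entries for a key range without scanning the whole delta space; the sketch simply iterates.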

TiDB provides ACID transactions with Snapshot Isolation (SI) and Repeatable Read (RR) semantics using MVCC. Each transaction reads from a consistent snapshot defined by a timestamp, avoiding read-write locks while preventing write-write conflicts.

Transactions are coordinated across three components:

  • SQL Engine: Coordinates the transaction, executes SQL, and runs two-phase commit (2PC).
  • PD (Placement Driver): Generates globally increasing timestamps.
  • TiKV: Stores data, implements MVCC, and persists locks and versions.

Optimistic Transaction Flow (2PC)

  1. Client sends BEGIN.
  2. SQL Engine requests a start timestamp (start_ts) from PD.
  3. Reads use the latest committed version before start_ts.
  4. On COMMIT, SQL Engine:
    • Selects a primary key.
    • Sends prewrite requests to lock all keys.
  5. If prewrites succeed:
    • SQL Engine gets a commit timestamp (commit_ts) from PD.
    • Commits the primary key.
  6. Secondary keys are committed asynchronously.

Locks are stored in TiKV (no centralized lock manager).

Here’s the visualization for optimistic flow:

1. Transactional Processing — figure 10
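The commit steps above can be condensed into a toy sketch (not TiKV’s implementation: reads, start_ts checks, and failure recovery are elided, and in-memory dicts play the roles of TiKV’s data and lock storage):

```python
# Toy sketch of the optimistic 2PC commit path: prewrite locks all keys
# (failing on conflict), then a commit timestamp from the "TSO" is attached
# to every written version. `store` maps key -> [(commit_ts, value), ...].
import itertools

tso = itertools.count(1)   # stand-in for PD's timestamp oracle

def commit(writes: dict, store: dict, locks: dict) -> bool:
    if any(key in locks for key in writes):
        return False                          # conflict -> retry transaction
    primary = next(iter(writes))              # pick a primary key
    for key in writes:                        # prewrite: lock every key
        locks[key] = primary
    commit_ts = next(tso)                     # commit timestamp from "PD"
    for key, value in writes.items():         # primary first, then secondaries
        store.setdefault(key, []).append((commit_ts, value))
        del locks[key]
    return True
```

In the real protocol, committing the primary key is the atomic commit point; secondaries can then be committed asynchronously, which the sketch flattens into one loop.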

TiDB supports both optimistic and pessimistic locking. The key difference between optimistic and pessimistic transactions in TiDB lies in when locks are acquired.

In optimistic transactions, locks are acquired during the prewrite phase of two-phase commit. No locks are held while executing DML statements. This assumes conflicts are rare, but if a conflict is detected during prewrite, the entire transaction must be retried.

In pessimistic transactions, locks are acquired immediately when DML statements are executed, before the prewrite phase. Each lock acquisition uses a new timestamp called for_update_ts. If a lock cannot be acquired, the transaction can retry from that specific operation instead of restarting entirely. Because conflicts are resolved early, once prewrite begins, the transaction is unlikely to fail due to write conflicts.

When reading data in pessimistic mode, TiKV uses for_update_ts (rather than start_ts) to determine the visible version. This preserves Repeatable Read (RR) semantics even when partial retries occur.

Here’s the flow of pessimistic locking:

Pessimistic locking flow — figure 11

TiDB made some optimizations targeted at OLAP queries, including an optimizer, indexer, and pushing down computation in their tailored SQL engine and TiSpark.

TiDB uses a two-phase query optimizer:

  • Rule-Based Optimization (RBO): Applies transformation rules to produce a logical plan (e.g., predicate pushdown, projection elimination, constant folding, subquery unnesting).
  • Cost-Based Optimization (CBO): Chooses the cheapest physical plan based on estimated execution cost.

Because TiDB has two storage engines, TiKV (row store) and TiFlash (column store), the optimizer must choose among:

  • Row scan in TiKV
  • Index scan in TiKV
  • Column scan in TiFlash

Indexes are distributed across Regions and stored as key-value pairs in TiKV. To improve plan stability, TiDB uses skyline pruning to eliminate suboptimal index candidates. If multiple candidate indexes match different query conditions, the partial results (sets of qualified row IDs) are merged to obtain a precise result set.
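For a conjunctive predicate, merging partial results amounts to a set intersection (illustrative sketch, not TiDB’s executor code):

```python
# Sketch of merging partial index-scan results: each candidate index yields
# the set of row IDs satisfying one condition; intersecting them gives the
# precise result set for an AND of those conditions.

def index_merge(*partial_row_ids: set) -> set:
    result = set(partial_row_ids[0])
    for ids in partial_row_ids[1:]:
        result &= ids
    return result

# Row IDs matching `a > 10` via the index on a, and `b = 5` via the index on b:
print(sorted(index_merge({1, 3, 5, 9}, {3, 4, 9})))   # [3, 9]
```

For an OR of conditions the merge would be a union instead; either way, only the final row IDs trigger a fetch of the full tuples.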

After optimization, the SQL engine executes the physical plan using a pull-based iterator model. To reduce data transfer, parts of the plan are pushed down to the storage layer via coprocessors, which execute filters, expressions, aggregations, and even TopN in parallel. Coprocessors also support vectorized execution for higher efficiency.

TiSpark integrates Apache Spark with TiDB’s multi-Raft storage, enabling large-scale analytics and machine learning directly on TiDB data.

The Spark driver reads metadata from TiKV and obtains timestamps from PD to ensure a consistent MVCC snapshot. It can push parts of the execution plan down to TiKV’s coprocessor and leverage distributed indexes, similar to TiDB’s SQL engine.

Unlike typical connectors, TiSpark reads multiple Regions in parallel, accesses index data directly, and supports transactional writes using 2PC. This tight integration allows Spark to process TiDB data efficiently with reduced data movement.

To protect transactional performance, TiDB separates OLTP and OLAP workloads at the engine level. TiKV (row store) mainly serves transactional queries, while TiFlash (column store) handles analytical queries. Because TiFlash replicates data from TiKV via Raft with low overhead, analytical workloads have minimal impact on transactional performance.

Since data is consistent across both engines, the optimizer can choose among three access paths:

  • Row scan (TiKV)
  • Index scan (TiKV)
  • Column scan (TiFlash)

Each has different cost characteristics:

  • Row scan → efficient for fetching full tuples.
  • Column scan → efficient when reading only selected columns.
  • Index scan → efficient for point or range queries, but may require a double read (index lookup + data fetch).

The optimizer estimates I/O and seek costs and selects the cheapest path. In some cases, it may even combine engines in one query.

https://docs.pingcap.com/tidb/stable/tidb-architecture/


https://www.youtube.com/watch?v=mmzoSkEhYrA

Tagged: #Distributed System #Paper Notes