Inside Meta’s Migration: From InnoDB to MyRocks
A deep dive into how Meta migrated UDB from InnoDB to MyRocks, cutting storage in half while tackling isolation differences, tombstones, and large-scale correctness checks.
Facebook’s User Database (UDB) is a massively sharded MySQL-based system storing petabytes of social graph data such as likes, comments, and shares. To meet Facebook-scale requirements, MySQL was heavily customized with hundreds of extensions, most of which were released as open source. Facebook originally used InnoDB, a B+Tree-based storage engine, as the backend because it was robust, widely used, and performed well. UDB was one of the first database services at Facebook, originally built on spinning disks where the primary concern was low IOPS. As workloads grew, Facebook introduced SSDs in 2010, first as a flash cache in front of HDDs and later, in 2013, as pure flash storage. This transition removed IOPS bottlenecks but raised new challenges: storage capacity costs, fragmentation, and flash endurance.
InnoDB implements the primary key as a clustered index. This index is organized as a B+Tree. Each node of the B+Tree is a fixed-size page (16 KB by default) that contains full row data plus some metadata.
📔
InnoDB reserves 1/16th of each page as free space for data modification, so when rows within the page are updated, the chance of needing a page split is reduced.
Index fragmentation happens when index pages are left partially empty, leaving unused gaps inside and between pages.
This leads to:
- Wasted space
- Reduced cache efficiency: the InnoDB buffer pool caches whole 16 KB pages. If pages are half-empty, caching efficiency drops because:
  - Each cached page holds fewer useful rows.
  - You need more buffer pool memory for the same number of rows.
- Increased random I/O: if the index insertion pattern is not sequential (e.g., a random string as the PK), fragmentation increases random I/O since logically adjacent pages end up physically scattered.
How can fragmentation happen?
- Page split: In InnoDB, every index is built on fixed-size pages (e.g., 16 KB each). When you insert or update a row, the database tries to place the new data directly inside the page that covers its key range. If the page still has enough free space, the data is stored in that page immediately. But what if there isn’t enough space? Then there are two scenarios:
  - Sequential inserts (e.g., auto-increment primary keys, time-based UUIDs): the new record always belongs in the rightmost page. Given that fact, InnoDB simply creates a new page and puts the new row there.
    a. Index fragmentation — figure 2
  - Random inserts (e.g., UUIDv4, hashes): InnoDB finds the page where the new record belongs, then moves about half of its rows into a new page. Because rows have variable sizes, the split isn’t always exactly 50/50, but the goal is to balance them as evenly as possible. This creates partially filled pages, which over time leads to fragmentation.
    a. Index fragmentation — figure 3
    In this case, the new page is allocated wherever space is available in the tablespace, so logical neighbors may be physically far apart → more random I/O on range scans.
- Deletes: Deleting rows leaves gaps inside pages (the space the deleted rows occupied), which also causes fragmentation.
Note that this can be resolved by defragmentation, but that requires a manual operation. InnoDB can merge pages when they fall below a 50% fill threshold, but this only happens when neighboring pages are both underutilized.
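To make the fill-factor effect concrete, here’s a toy simulation (my own sketch, not Meta’s or InnoDB’s code) of fixed-capacity leaf pages under the two insert patterns described above. The page capacity, key counts, and split policy are simplified assumptions: rows are fixed-size, there are no deletes, and splits are exactly 50/50.

```python
import random

PAGE_CAPACITY = 100  # rows per page (stand-in for a 16 KB page with fixed-size rows)

def insert(pages, key):
    """Insert a key into the page covering its range, splitting on overflow."""
    # Find the last page whose smallest key <= key (pages are kept ordered).
    idx = 0
    for i, page in enumerate(pages):
        if page and page[0] <= key:
            idx = i
    page = pages[idx]
    if len(page) < PAGE_CAPACITY:
        page.append(key)
        page.sort()
        return
    if idx == len(pages) - 1 and key > page[-1]:
        # Sequential pattern: rightmost split, the old page stays 100% full.
        pages.append([key])
    else:
        # Random pattern: move half the rows into a new page (~50/50 split).
        half = page[PAGE_CAPACITY // 2:]
        del page[PAGE_CAPACITY // 2:]
        target = half if key >= half[0] else page
        target.append(key)
        target.sort()
        pages.insert(idx + 1, half)

def avg_fill(pages):
    return sum(len(p) for p in pages) / (len(pages) * PAGE_CAPACITY)

seq_pages = [[]]
for k in range(10_000):          # auto-increment style keys
    insert(seq_pages, k)

rng = random.Random(42)
rand_pages = [[]]
for _ in range(10_000):          # UUIDv4-style random keys
    insert(rand_pages, rng.random())

print(f"sequential fill: {avg_fill(seq_pages):.0%}")
print(f"random fill:     {avg_fill(rand_pages):.0%}")
```

With sequential keys every page except the last ends up completely full, while random keys leave pages roughly 50-75% full on average, which is exactly the wasted space the fragmentation discussion is about.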
We can save space by enabling page compression, but InnoDB only allows each page to compress to a predefined size, KEY_BLOCK_SIZE, which is one of 1 KB, 2 KB, 4 KB, or 8 KB.
Why?
This guarantees that pages can be individually updated. Suppose compressed pages did not have a fixed size:
- You insert a row → the page compresses nicely to 5 KB.
- Later, you update a row on that same page → the recompressed page is now 6 KB.
- Where does the extra 1 KB go?
  - If pages were variable-sized, that page no longer fits in the 5 KB “hole” it used to occupy. InnoDB would have to move the page elsewhere in the file to fit the bigger size.
  - But that means the location of that page changed, so all parent pointers in the B+Tree would need updating.
Problem
But this leads to an issue: if KEY_BLOCK_SIZE is 8 KB, then even when 16 KB of data compresses down to 5 KB, the actual space used is still 8 KB. As a result, the storage saving is capped at 50%!
InnoDB doesn’t just store your raw column data; every row carries extra metadata for transactional consistency.
Each row in InnoDB carries:
- Transaction ID (DB_TRX_ID) → 6 bytes
  - Tracks the last transaction that modified this row (needed for MVCC).
- Rollback pointer (DB_ROLL_PTR) → 7 bytes
  - Points to the undo log so older versions of the row can be reconstructed if another transaction needs them.

→ 13 bytes of overhead per row, no matter how small the row is!
By the early 2010s, many Facebook applications were already running on flash storage and had accumulated years of operational experience with its unique characteristics. To address common challenges such as write amplification and space efficiency, Facebook engineers built a new key/value store library in 2012: RocksDB. Designed specifically for flash-based SSDs, RocksDB quickly became a foundational component in several production systems, including ZippyDB, Laser, and Dragon.
At its core, RocksDB is a key/value store optimized for modern storage hardware. After evaluating different data structures, the team chose the Log-Structured Merge Tree (LSM-tree) because it significantly reduces write amplification while maintaining a good balance of read performance. The first implementation drew inspiration from LevelDB, extending and enhancing it to handle Facebook’s demanding workloads at scale.
Before going further, let’s align on some terms:
- Key: The “key” we refer to from now on is either the primary key or a secondary index key.
- MyRocks is built on top of RocksDB with some extensions, so I use RocksDB and MyRocks interchangeably throughout this blog when something is shared between the two. Otherwise, I’ll say MyRocks explicitly.
Before moving forward, I would like to introduce a bit about LSM (Log-Structured Merge-Tree)
At a high level, LSM Trees organize data in layers:
- MemTable (in memory): All incoming writes go into a sorted, in-memory structure called the MemTable.
  - Note that each change is first written to a WAL (Write-Ahead Log) on disk for durability before being applied to the MemTable.
- SSTables (on disk): Once the MemTable reaches a predetermined size, its contents are flushed to disk as an immutable, sorted file (SSTable). No in-place updates are needed; each SSTable stores data in sorted order.
- Compaction: As SSTables accumulate, a background process called compaction merges them into fewer, larger files while discarding deleted entries and outdated versions. This creates the leveled structure of the LSM-tree:
  - Level 0 may hold overlapping SSTables (two or more SSTables share the same keys when a key has been updated multiple times), but from Level 1 onward, each SSTable is range-partitioned with no overlaps, ensuring efficient lookups and reduced read amplification.
Thanks to this architecture, LSM trees are optimized for write-heavy systems, as writes are sequential and append-only!
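To make the write path concrete, here’s a toy LSM sketch of my own (not RocksDB code): WAL append, MemTable, flush to immutable sorted runs, and a naive full compaction. The class and method names are invented for illustration.

```python
class TinyLSM:
    """Toy LSM tree: WAL -> MemTable -> flush to immutable SSTables -> compact."""
    def __init__(self, memtable_limit=4):
        self.wal = []               # write-ahead log (durability before MemTable)
        self.memtable = {}          # latest value per key; sorted only on flush
        self.sstables = []          # newest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))   # 1. append to WAL first
        self.memtable[key] = value      # 2. then update the MemTable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        sst = sorted(self.memtable.items())   # immutable, sorted run
        self.sstables.insert(0, sst)
        self.memtable.clear()
        self.wal.clear()                      # WAL no longer needed after flush

    def compact(self):
        """Merge all SSTables, keeping only the newest version of each key."""
        merged = {}
        for sst in reversed(self.sstables):   # oldest first, newer versions overwrite
            merged.update(dict(sst))
        self.sstables = [sorted(merged.items())]

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sst in self.sstables:             # newest SSTable first
            for k, v in sst:
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(9):
    db.put(f"k{i}", i)      # triggers two flushes along the way
db.put("k0", 100)           # newer version shadows the flushed one
assert db.get("k0") == 100
db.compact()
assert len(db.sstables) == 1
```

Note how an update never touches the old copy on disk: it just appends a newer version, and compaction discards the stale one later.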
i. Single key query
Reads are more complex than in a B+Tree, because the database may need to check multiple places:
- The MemTable (recent writes).
- Several SSTables across different levels:
  - At level 0, we must check all SSTables, as they have overlapping ranges.
  - From level 1 onward, at most one SSTable per level needs to be searched.

To make this efficient, LSM-based systems use Bloom filters and indexes inside SSTables to avoid scanning unnecessary files.

So, in the worst case, a point lookup touches all level-0 files plus one file per lower level.
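The lookup path above can be sketched in a few lines. This is my own illustrative model, not RocksDB internals: the “Bloom filter” here is an exact Python set (an idealized filter with no false positives), and each SSTable is just a dict.

```python
class LookupLSM:
    """Point-lookup path: MemTable first, then every L0 file, then at most
    one file per deeper level. A per-SSTable 'Bloom filter' (modeled as an
    exact set, i.e. an idealized filter) lets the engine skip files that
    definitely don't contain the key."""
    def __init__(self):
        self.memtable = {}
        self.levels = []   # levels[0] may hold overlapping files; deeper levels don't

    def add_sst(self, level, data):
        while len(self.levels) <= level:
            self.levels.append([])
        self.levels[level].append((dict(data), set(data)))

    def get(self, key):
        """Return (value, number_of_files_actually_searched)."""
        probes = 0
        if key in self.memtable:
            return self.memtable[key], probes
        for files in self.levels:
            for data, bloom in files:
                if key not in bloom:   # filter says "definitely absent": skip file
                    continue
                probes += 1            # file really searched (binary search in RocksDB)
                if key in data:
                    return data[key], probes
        return None, probes

db = LookupLSM()
db.add_sst(0, {"a": 1})            # two overlapping L0 files
db.add_sst(0, {"b": 2})
db.add_sst(1, {"c": 3, "d": 4})    # non-overlapping L1 file
value, probes = db.get("c")
assert value == 3 and probes == 1  # the filter let us skip both L0 files
assert db.get("zzz") == (None, 0)  # absent key: no file touched at all
```

A real Bloom filter would occasionally return a false positive and cost one wasted probe, which is exactly the trade-off discussed later when Meta skipped filters on the lowest level.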
ii. Range query
It’s even worse with the range query
To start a range scan, RocksDB binary-searches each overlapping SSTable at every level. If an SSTable has keys in the range, its first matching key is added to the heap. In the example above, this means the heap starts with 9 pointers - one from the memtable and each SSTable that overlaps the query.
Advancing follows these steps:
- Pop the smallest key from the heap.
- Pull the next key from the same SSTable (if there is one, and it’s still in range), push it into the heap, and rebalance the heap.
Fun fact: The process is the same as the *Merge k Sorted Lists* problem.
📔 Since the same key can appear in multiple levels, RocksDB resolves conflicts by letting newer levels shadow older ones: if a key is in both level L and level L+1, the version in level L wins.
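The heap-based merge above can be sketched with Python’s `heapq`. This is my own simplified model (sequence numbers are reduced to level numbers, values omitted), showing both the k-way merge and the newest-level-wins rule:

```python
import heapq

# Each run is a sorted SSTable of (key, level); a lower level number = newer data.
runs = [
    [(1, 0), (5, 0)],            # level 0 SSTable (newest)
    [(1, 1), (3, 1), (7, 1)],    # level 1
    [(3, 2), (5, 2), (9, 2)],    # level 2 (oldest)
]

def range_scan(runs):
    """Merge sorted runs; when the same key appears in several levels,
    the lowest (newest) level wins, like RocksDB's merging iterator."""
    heap = []
    for run_id, run in enumerate(runs):
        if run:
            key, level = run[0]
            heapq.heappush(heap, (key, level, run_id, 0))
    out, last_key = [], object()
    while heap:
        key, level, run_id, idx = heapq.heappop(heap)
        if key != last_key:                  # first (newest) occurrence wins
            out.append((key, level))
            last_key = key
        if idx + 1 < len(runs[run_id]):      # advance the run we popped from
            nk, nl = runs[run_id][idx + 1]
            heapq.heappush(heap, (nk, nl, run_id, idx + 1))
    return out

print(range_scan(runs))
```

Every `heappush` performs O(log k) key comparisons, which is why the per-key comparison cost matters so much more for LSM range scans than for B+Tree leaf traversal, as the next section discusses.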
Let’s see how RocksDB outperforms InnoDB in terms of space utilization.
While a B+Tree suffers from index fragmentation, which wastes 25-30% of space because of in-place updates, an LSM-tree avoids the problem entirely thanks to its append-only approach: new writes go into memory (the MemTable) and are flushed to disk as new, immutable files, so inserts and updates leave no space gaps. The closest equivalent of fragmentation in an LSM-tree is dead data: records that have been deleted or updated but are still present in older SSTables. Because writes are append-only, multiple versions of the same row can coexist on disk. The key difference is that this dead data is actively cleaned up during compaction. By tuning how often compaction runs, systems like RocksDB can keep the amount of dead data relatively low (often below 10% of space used).
RocksDB works well with compression too: if 16 KB of data compresses to 5 KB, RocksDB uses just 5 KB, while InnoDB rounds up to 8 KB (due to the KEY_BLOCK_SIZE constraint described in the previous section).
Here’s how each key is stored in RocksDB.
Primary key
Note that if no primary key is available, MyRocks automatically creates a “hidden PK”.
Secondary key
In the paper, they mentioned that:

RocksDB uses only a 7-byte sequence number per row (seqID) for snapshot reads, compared to 13 bytes of overhead in InnoDB.

But in fact, one more byte is still needed to store the operation flag (Put, Delete, SingleDelete, …), and there’s also the internal index ID, which costs 4 bytes. So overall, I think the overhead is still ~12 bytes.

I’m not quite sure why they put that statement in the paper - if anyone knows, please drop a comment.
There were many different types of client applications accessing UDB, and rewriting them for a new database would have taken a long time, possibly multiple years. Meta engineers wanted to avoid that, so they decided to integrate RocksDB into MySQL instead. They called it the MyRocks engine.
They used MySQL’s pluggable storage engine interface, which made it possible to replace the storage layer without changing the upper layers such as client protocols, SQL, and replication.
- Reduce the number of UDB servers by 50% → MyRocks space usage must not exceed 50% of the compressed InnoDB format.
- Maintain comparable CPU and I/O utilization.
One of the trade-offs of using an LSM-tree (like RocksDB) instead of a B+Tree (like InnoDB) is the cost of key comparisons during queries.
In a B+Tree, finding a range start takes one binary search, and advancing is cheap - just follow leaf pointers with almost no extra comparisons.
- When doing key comparisons, InnoDB may have to perform a few extra steps: e.g., decode the keys, normalize them, then compare. This is CPU-heavy work, but the overhead is tolerable because comparisons are few.
In RocksDB, as shown in the previous section, a range query involves a complex process with the heap: each time we advance, the new key is compared against others already in the heap so it can be rebalanced. Such frequent key comparisons would have a significant impact on query performance if we kept the InnoDB approach.
Solution:
In order to mitigate this issue, here’s what Meta engineers did:
MyRocks encodes keys in a bytewise-comparable format (more detail at: MyRocks Record Format), so RocksDB can use a fast raw memcmp instead of costly collation logic. This makes each comparison cheaper, offsetting the extra comparisons in an LSM-tree.
📔 Side note: MyRocks is optimized for CHAR/VARCHAR indexes using case-sensitive collations, and by default it does not allow indexes with case-insensitive collations (though you can optionally allow them in the settings file).
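The general idea of a memcmp-comparable encoding can be shown with signed integers: store them big-endian with the sign bit flipped, and raw byte comparison then agrees with numeric comparison. This is my own minimal sketch of the technique; MyRocks’ actual record format covers many more types (strings with collations, NULLs, composite keys).

```python
import struct

def encode_int_memcmp(n: int) -> bytes:
    """Encode a signed 32-bit int so that bytewise comparison of the
    encodings matches numeric comparison of the values: big-endian byte
    order, with the sign bit flipped so negatives sort before positives."""
    return struct.pack(">I", (n + 2**31) & 0xFFFFFFFF)

vals = [-100, -1, 0, 1, 100, 2**31 - 1, -(2**31)]
# Sorting by raw encoded bytes (what memcmp would do) ...
by_bytes = sorted(vals, key=encode_int_memcmp)
# ... gives exactly the numeric order, with no decode or collation logic.
assert by_bytes == sorted(vals)
```

Once every column type is encoded this way, the engine never needs to decode or normalize a key to compare it - one `memcmp` per comparison, which is what makes the heap-heavy range scan affordable.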
In RocksDB, iterating keys in forward order is much faster than reverse order. There are 3 main reasons:
- Reason 1: RocksDB uses key delta encoding inside each data block to save space within the block. If multiple keys share the same prefix, only the first key (aka the restart point) is stored in full; the following keys are stored as deltas (prefix compression) relative to the previous key. For example:
  b. Reverse Key Comparator — figure 9
  Since each key is delta-encoded, when reading keys forward, RocksDB:
  - Decodes the first full key (the restart point)
  - Applies the deltas in sequence
  But if you want to read keys backward, reconstructing the last entry in a block still requires decoding all the previous deltas leading up to it. Say you want to get Entry4 from the example above: you must find the nearest restart point (using the metadata stored in the block footer) - in this case Entry3 - decode it, then build the key for Entry4. In real life the chain might be longer: if we want Entry10 but the nearest restart point is Entry2, we must decode all the entries from Entry2 to Entry10!
- Reason 2: Because LSM trees don’t do in-place updates, the same key can exist in multiple levels. During a forward scan, the first occurrence of a key is guaranteed to be the latest one. But in a backward scan, the first match we encounter may be stale, so we need to check at least one more entry for that key to confirm which is the newest.
  - If there are multiple entries for the same key, they must be adjacent in the heap, so we can simply check whether another entry for that key exists in a newer level; if not, it’s the latest data, otherwise we ignore it.
- Reason 3: The Memtable in RocksDB is implemented as a skip list with single-direction pointers, so reverse iteration requires another binary search.
As a result, an ORDER BY query that requires reverse iteration is much slower than a forward one.
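Reason 1’s delta encoding can be sketched in a few lines. This is my own toy version (one restart point per block, no varint framing); the entry names and keys are invented for illustration:

```python
def encode_block(keys):
    """Delta-encode a sorted key list: the first key (restart point) is
    stored in full, each later key as (shared_prefix_len, suffix)."""
    out = [("restart", keys[0])]
    prev = keys[0]
    for k in keys[1:]:
        shared = 0
        while shared < min(len(prev), len(k)) and prev[shared] == k[shared]:
            shared += 1
        out.append((shared, k[shared:]))
        prev = k
    return out

def decode_entry(block, target_idx):
    """Reconstructing entry i requires replaying every delta since the
    restart point: cheap going forward, expensive when iterating backward."""
    key = block[0][1]
    for i in range(1, target_idx + 1):
        shared, suffix = block[i]
        key = key[:shared] + suffix
    return key

block = encode_block(["user:1001", "user:1002", "user:1003", "user:2000"])
# Reading the last entry forces decoding all three deltas before it.
assert decode_entry(block, 3) == "user:2000"
```

A backward scan has to do exactly this replay for every step it takes toward the front of the block, which is the core of why reverse iteration is slower.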
Solution:
To resolve this issue, they implemented a reverse key comparator in RocksDB that stores keys in inverse bitwise order. This optimization is mainly applied to secondary indexes, while the primary (clustered) index keeps using the normal order.
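The core trick can be demonstrated by flipping every bit of the key bytes: ascending bytewise order over the stored form then corresponds to descending order over the original keys, so a “reverse” scan becomes a cheap forward scan. This is a deliberately simplified sketch of my own; a production comparator also has to handle variable-length keys where one key is a prefix of another (my demo keys avoid that case).

```python
def invert(b: bytes) -> bytes:
    """Flip every bit of the key. Sorting the inverted forms ascending
    (plain memcmp order) yields the original keys in descending order."""
    return bytes(255 - x for x in b)

keys = [b"apple", b"banana", b"cherry"]   # no key is a prefix of another
stored = sorted(invert(k) for k in keys)  # what the engine keeps sorted
recovered = [invert(s) for s in stored]   # forward scan over stored form
assert recovered == sorted(keys, reverse=True)
```

Because `invert(invert(k)) == k`, reads just un-flip the bytes after the (now forward) iteration, avoiding all three reverse-scan penalties described above.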
One crucial piece of storage engine functionality is query cost estimation. When MySQL tries to decide the best way to run a query, it asks the storage engine (InnoDB, MyRocks, etc.) to estimate the cost of scanning a certain range of keys.
For example, if the query is WHERE k BETWEEN 100 AND 200, MySQL gives MyRocks both the minimum key (100) and maximum key (200), and asks: “How expensive is it to scan this range?”
How RocksDB estimates the cost:
- Find the data block that contains the minimum key.
- Find the block that contains the maximum key.
- Estimate the distance (size) between these blocks, including recent data still in memory (MemTables).
This works but can be CPU heavy if done often.
Solution:
- Skip when hints are given: If a query explicitly uses FORCE INDEX, the optimizer doesn’t need cost feedback - it already knows which index to use. Since many Facebook queries are predictable (social graph lookups), adding index hints removes the need to calculate cost at all.
- Smarter range estimation:
  - Whole files first: If an SSTable file is fully inside the query range, just add its size.
  - Partial files shortcut: For files that only partly overlap, check just enough to know they won’t change the estimate much.
    - The paper doesn’t explain the exact method, but one plausible approach is: if the query range is [L…N] and the file covers [L…Z], the engine finds the first block ≥ L and the last block ≤ N, then approximates the byte span between them. If that span is small relative to the total accumulated size, the estimate can be accepted without further refinement.
  - One pass instead of two: Combine the minimum and maximum key lookups into a single operation.
    - Again, the implementation details aren’t disclosed, but it’s likely that instead of performing two binary searches (one for min, one for max), RocksDB seeks once to min and then advances forward through block handles until max. This avoids a second random probe and benefits from sequential, cache-friendly access. For very wide ranges, it may fall back to a two-seek strategy to avoid long linear walks.
These optimizations accelerate estimation at the cost of accuracy - the estimate is slightly less precise.
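The “whole files first, partial files approximated” idea can be sketched as follows. This is my own illustrative model under simple assumptions: files are non-overlapping (level ≥ 1), keys are integers, and a partially overlapping file contributes a fraction of its size proportional to the overlapped key span.

```python
def estimate_range_size(files, lo, hi):
    """files: list of (min_key, max_key, size_bytes), non-overlapping.
    Whole files inside [lo, hi] count in full; partially overlapping files
    contribute a linear approximation of their overlapped span."""
    total = 0.0
    for fmin, fmax, size in files:
        if fmax < lo or fmin > hi:
            continue                      # no overlap at all: skip the file
        if lo <= fmin and fmax <= hi:
            total += size                 # whole file inside the range
        else:
            span = (fmax - fmin) or 1     # partial overlap: linear interpolation
            overlap = min(fmax, hi) - max(fmin, lo)
            total += size * overlap / span
    return total

files = [(0, 100, 1000), (101, 200, 1000), (201, 300, 1000)]
est = estimate_range_size(files, 50, 250)
print(int(est))   # roughly 2000: ~half of each edge file + the whole middle file
```

The two edge files are never fully scanned - only their boundaries matter - which is what keeps the estimate cheap at the price of some precision.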
A tombstone is a special marker written into the LSM-tree to represent a DELETE operation. It means: “ignore any older values of this key (in lower levels / SSTables)”. During a MemTable flush or compaction, the tombstone triggers removal of all Put operations for that key. The tombstone itself must remain, since a Put for the same key may still exist at a lower level (it is only deleted completely once it reaches the lowest level). With frequent update requests, we can end up with a huge number of tombstones.
In RocksDB, updating an index means changing “keys”, and changing a key in RocksDB requires a Delete for the old key and a Put for the new one. Updating MyRocks index key fields multiple times therefore generates a huge number of tombstones, which significantly degrades range scan performance.
Solution
They introduced a new deletion type called SingleDelete, which can be dropped immediately once it removes a matching Put.
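The difference between the two deletion types can be sketched with a toy compaction pass. This is my own simplified model (newest-first entry stream, not the bottommost level, and it assumes each SingleDelete pairs with exactly one Put, as the real API requires):

```python
def compact(entries):
    """entries: newest-first list of (key, op), op in
    {'put', 'delete', 'single_delete'}. Returns what survives a compaction
    that is NOT at the bottommost level."""
    out = []
    pending_single = set()   # SingleDeletes still waiting for their matching Put
    shadowed = set()         # keys whose older versions must be dropped
    for key, op in entries:
        if key in pending_single:
            # SingleDelete meets its Put: BOTH disappear immediately.
            pending_single.discard(key)
            shadowed.add(key)
            continue
        if key in shadowed:
            continue         # older version hidden by a newer entry
        if op == "single_delete":
            pending_single.add(key)
        else:
            out.append((key, op))
            shadowed.add(key)   # a regular tombstone is KEPT: an even older
                                # Put may still hide in a lower level
    return out

stream = [("a", "single_delete"), ("a", "put"),   # cancel each other
          ("b", "delete"), ("b", "put")]          # tombstone must survive
print(compact(stream))   # only the 'b' tombstone remains
```

Key "a" vanishes entirely from the output, while key "b" still carries its tombstone down to lower levels - that retained marker is exactly the range-scan overhead SingleDelete was designed to avoid.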
Bloom filters are an essential part of LSM-tree performance and need to be cached in DRAM to be effective. This caused a significant DRAM usage regression compared to InnoDB.
📔 Each SSTable has its own bloom filter!
Solution:
Meta engineers observed that the lowest level contains about 90% of the data, so they extended RocksDB to optionally skip creating bloom filters for the lowest level. This profoundly reduced the total size of bloom filters, by about 90%, while preserving most of their effectiveness.
📔 There’s a trade-off to this approach: for empty key lookups (the key doesn’t exist), without bloom filters in the lowest level, a binary search must be performed on each lowest-level SSTable, increasing CPU time.
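The arithmetic behind the saving is simple to check. The numbers below are hypothetical (the paper doesn’t give the exact key counts), using a common filter sizing of ~10 bits per key:

```python
total_keys = 1_000_000_000     # hypothetical key count across all levels
bits_per_key = 10              # common Bloom filter sizing (~1% false positives)
lowest_level_share = 0.90      # ~90% of the data lives in the lowest level

full_gib = total_keys * bits_per_key / 8 / 2**30   # filters on every level
skip_gib = full_gib * (1 - lowest_level_share)     # filters only above Lmax
print(f"all levels: {full_gib:.2f} GiB, skipping Lmax: {skip_gib:.2f} GiB")
```

Because filter size is proportional to key count, skipping the level that holds 90% of the keys cuts filter DRAM to a tenth, regardless of the absolute numbers chosen here.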
Compactions may create hundreds of megabytes to gigabytes of SST files. Deleting all of those files at once may cause a spike in trims, leading to performance risk or even stalls on flash storage.
Another issue is the competition between Compaction I/O and user query I/O requests, causing the user query latency to increase.
Solution
They added rate limits to file deletions and compaction I/O requests.
Some data migration jobs, such as online schema changes and data loading, generated massive writes. Even though RocksDB is optimized for write amplification, such monstrous write requests still impacted user write queries.
Solution
They added a bulk loading capability to RocksDB. In bulk loading, MyRocks creates SST files and ingests them directly into the lowest level (bypassing the memtables and the compaction process). By doing so, they eliminated the write stalls caused by massive writes from data migration jobs.

📔 Bulk loading requires that the ingested key range never overlaps with existing data, since by nature each SST file in the lowest level contains non-overlapping keys!
RocksDB organizes data into levels L0 through Lmax, with about 90% of the data typically residing in Lmax. However, most compaction work happens in the intermediate levels.
Solution
Meta engineers optimized this by allowing different compression algorithms per level: lightweight or no compression in the upper levels for speed, and stronger compression in Lmax for space savings. This change reduced overall compaction time to roughly one-third.
REPLACE in MySQL normally probes the primary key first: if a row exists, InnoDB reads it, deletes it (including old secondary-index entries), then inserts the new row - incurring a random read.
Solution
MyRocks can exploit the LSM write path by issuing a blind Put for the PK (and creating new secondary entries) without first checking for an existing row. The newer version shadows older ones by sequence number, so the read is avoided and write throughput increases.
Similarly, MyRocks can optionally let INSERT skip unique-key checks, again eliding the pre-read that InnoDB must do to enforce uniqueness. These shortcuts are fast but risky: without the initial read, the engine cannot reliably compute old secondary keys to delete, enforce uniqueness with full fidelity, or preserve trigger/replication semantics.
⚠️ At Meta scale, correctness trumped raw speed, so UDB disabled these optimizations in production; they remain available for users who understand the trade-offs and can tolerate the risks in carefully bounded workloads.
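The blind-Put idea can be sketched with an append-only store where reads resolve the newest sequence number. This is my own toy model (class and method names invented); it shows why the read-before-write disappears, and implicitly why uniqueness can no longer be enforced at write time:

```python
import itertools

class BlindPutStore:
    """Sketch of REPLACE as a blind Put: no read-before-write; the newest
    sequence number shadows older versions at read time."""
    def __init__(self):
        self.log = []                    # append-only (key, seq, value)
        self.seq = itertools.count(1)    # monotonically increasing sequence numbers

    def replace(self, key, value):
        # No existence check, no random read: just append the new version.
        self.log.append((key, next(self.seq), value))

    def get(self, key):
        # Read path resolves shadowing: highest sequence number wins.
        best = None
        for k, seq, v in self.log:
            if k == key and (best is None or seq > best[0]):
                best = (seq, v)
        return best[1] if best else None

db = BlindPutStore()
db.replace("id:1", "alice")
db.replace("id:1", "bob")        # shadows the old row without ever reading it
assert db.get("id:1") == "bob"
assert db.get("id:2") is None
```

Note what the store can’t do here: at `replace` time it never learned whether "id:1" already existed, so it could not have deleted old secondary-index entries or rejected a duplicate - exactly the correctness risks described above.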
Here’s the overview of how Meta migrated from InnoDB to MyRocks
- Deploy the first MyRocks secondary: Each replica set has 1 master and 3 replicas. They took one of the 3 replicas (secondaries) out of read traffic to use as the InnoDB source, exported its data (via mysqldump), and loaded it into a MyRocks replica using bulk loading. After the bulk load finished, the MyRocks replica caught up via normal MySQL binlog replication.
- Deploy the 2nd MyRocks secondary: Operating with a single MyRocks instance is not ideal, since losing that instance means dumping and loading all over again. So they added 2 MyRocks instances (and at the same time removed 2 InnoDB instances) in each replica set.
  - The MyRocks replicas continuously replicated live updates from the InnoDB primary, but only served read traffic. During this phase, MyShadow (a system for shadowing production queries, which we’ll discuss a bit later) was heavily utilized: production read queries were shadowed and replayed on the MyRocks replicas, and their behavior was monitored and compared against the InnoDB replicas. In the meantime, Data Correctness pair checks (another tool to ensure data correctness - again, detailed in the next section) were run between the MyRocks and InnoDB replicas to ensure their data remained in sync.
  - This phase lasted for a few months; once MyRocks proved itself in a read-only capacity (and passed all consistency checks), they moved to the next phase.
The ultimate goal was to promote MyRocks to primary, which was more challenging. Before enabling the write migration, Meta extended the MyShadow system to simulate write traffic on test instances: they replayed production write patterns against a MyRocks instance configured as a primary in a test environment. Once confident, they performed a controlled switchover in production: in a few replica sets, the InnoDB primary was demoted and a MyRocks replica was promoted to primary. In these sets, MyRocks started serving both reads and writes for the first time. Meanwhile, they retained some InnoDB replicas as safety nets. The team closely monitored the system and ran data correctness checks between the MyRocks primary and one of the InnoDB secondaries to ensure the data remained identical as writes accumulated. If any serious problem happened, they fell back to an InnoDB primary.
After running MyRocks primaries with stable performance and no critical errors, Meta gradually expanded the MyRocks primary deployment to all replica sets and began decommissioning the remaining InnoDB instances. By August 2017, the migration was essentially complete, making MyRocks the sole storage engine for UDB.
MyShadow is a system for shadowing production queries and replaying them on target instances. Technically, Meta added a custom MySQL audit plugin that captures every query running on production databases and streams them to an internal logging service. A component of MyShadow (the “Replayer”) reads these logged queries and replays them against target instances.
How it is used:
- At first, it was used to capture read queries from a live InnoDB replica and run them on a MyRocks replica, so the team could check:
  - Any mismatch in results or unexpected errors on MyRocks compared to InnoDB
  - CPU usage, latency, and I/O metrics relative to InnoDB
- At a later stage, it was used for write simulation.
While MyShadow validated live query behavior, Meta also built a Data Correctness tool to perform deep consistency checks between MyRocks and InnoDB. In a production migration, it’s critical to ensure that the new engine stores and returns exactly the same data as the old one; InnoDB served as the source of truth. The tool operates in several modes to catch mismatches between engines at different levels:
- Single Mode: Verifies internal consistency within one database instance. It scans each MyRocks table’s primary key index and secondary indexes, comparing row counts and checksums of overlapping columns to ensure they match. This catches bugs where an index might be out of sync with the table (for example, a compaction bug that failed to delete some keys) - such issues would show up as a mismatched count or checksum between the primary and secondary index. Single-mode checks found subtle internal bugs (e.g., RocksDB compaction not deleting keys properly) before they affected production.
- Pair Mode: Performs a full data comparison between two instances - one running InnoDB and one running MyRocks - to ensure they have identical content. To do this without downtime, the tool uses consistent snapshots: it brings both instances to the same transaction state (using a Global Transaction ID or by briefly pausing replication) and then scans all primary keys on both, comparing row counts and checksums for every table. Any missing or extra rows in one engine are detected by this full-table comparison. In practice, Meta’s engineers leveraged MySQL GTIDs to synchronize snapshots on InnoDB and MyRocks replicas, then ran identical SELECT COUNT(*) and checksum queries on each to verify consistency. This mode ensured that MyRocks never “lost” data or had phantom rows relative to InnoDB.
- Select Mode: Replays a set of captured SELECT queries (from MyShadow’s query log) on both an InnoDB instance and a MyRocks instance, then compares the result sets. This directly validates that for arbitrary reads (beyond just counts and checksums), both engines return the same user-visible data. Select mode is powerful for catching query-level discrepancies, though it requires filtering out nondeterministic queries (e.g., those using NOW() or other functions whose results naturally differ). A mismatch in select mode indicates an inconsistency that engineers can investigate in detail. For example, this method uncovered a bug in MyRocks’s prefix bloom filter logic that caused certain range scans to return fewer rows than they should - an issue that was fixed prior to rollout.
With the help of MyShadow, Meta engineers uncovered an issue: differences in transaction isolation behavior (gap locks vs. snapshot isolation) caused MyRocks to throw more errors under certain concurrent writes.
What InnoDB does (gap locks in Repeatable Read): By default, MySQL InnoDB uses the REPEATABLE READ isolation level. To prevent phantom reads, InnoDB uses gap locks, so when you run a query like:

SELECT * FROM users WHERE id BETWEEN 10 AND 20 FOR UPDATE;

InnoDB locks not just the rows with id 10-20, but also the gaps between them. This means concurrent inserts into that range (e.g., id = 15) will be blocked until the transaction commits.
What RocksDB/MyRocks does (snapshot isolation in Repeatable Read): MyRocks implements snapshot isolation rather than gap locking for Repeatable Read.
- Each transaction sees a consistent snapshot of database (based on sequence numbers)
- Writes don’t block inserts into gaps. Instead, conflicts are detected at commit time using write-write conflict checks.
As a result, under concurrent writes, instead of blocking, MyRocks may allow both transactions to proceed; at commit time it detects the conflict and aborts one of them.
So when Meta replayed real production queries on MyRocks with MyShadow, they saw more transaction abort errors compared to InnoDB.
Example:
- Txn A: SELECT … FOR UPDATE WHERE id BETWEEN 10 AND 20
- Txn B: INSERT INTO users (id=15) …
  - InnoDB → Txn B waits (gap lock) and eventually commits fine.
  - MyRocks → Txn B proceeds, but Txn A aborts at commit because of the conflict.
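The commit-time conflict check behind the MyRocks behavior can be modeled in a few lines. This is my own sketch with invented names, simplified to key-level write-write conflicts (the range in the example above is reduced to the single contended key):

```python
class SnapshotDB:
    """Toy snapshot-isolation model: a transaction aborts at commit if any
    key it wrote was committed by someone else after its snapshot."""
    def __init__(self):
        self.commit_seq = {}   # key -> sequence number of the last committed writer
        self.seq = 0

    def begin(self):
        return {"snapshot": self.seq, "writes": {}}

    def write(self, txn, key, value):
        txn["writes"][key] = value     # buffered locally; nothing blocks

    def commit(self, txn):
        # Write-write conflict check: did anyone commit these keys after
        # our snapshot was taken?
        for key in txn["writes"]:
            if self.commit_seq.get(key, 0) > txn["snapshot"]:
                return False           # conflict -> abort (caller may retry)
        self.seq += 1
        for key in txn["writes"]:
            self.commit_seq[key] = self.seq
        return True

db = SnapshotDB()
a = db.begin(); b = db.begin()         # both start from the same snapshot
db.write(a, "id:15", "from A")         # Txn A touches the contended key
db.write(b, "id:15", "from B")         # Txn B inserts concurrently; no blocking
assert db.commit(b) is True            # first committer wins
assert db.commit(a) is False           # second one detects the conflict, aborts
```

Neither transaction ever waited, which is the behavioral difference MyShadow surfaced: the same workload that quietly blocks on InnoDB turns into extra abort errors on MyRocks.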
Solution
Meta’s solution was to update the isolation level from the application side where possible: instead of using Repeatable Read, they switched to Read Committed.
- **Space savings:** MyRocks used only ~38% of the space compared to InnoDB (even with InnoDB's compression enabled). Early deployments showed >50% space reduction. With Zstandard compression and periodic compaction, space savings improved even further.
- **CPU efficiency:**
  - Write-heavy workloads (replication only): MyRocks was ~40% more CPU efficient than InnoDB. This advantage stems from MyRocks' ability to maintain secondary indexes without requiring random reads, unlike InnoDB.
  - Mixed read + write traffic: MyRocks showed slightly lower CPU utilization than InnoDB.
- **Instance density:** MyRocks instances were 60% smaller than InnoDB, enabling 2.5× higher instance density per host, while still keeping CPU utilization below Meta's 40% target per server.
- **Read latency:** Read latency remained comparable between MyRocks and InnoDB.
- **Other factors improving MyRocks efficiency:**
  - Support for better compression algorithms (Zstandard, LZ4), while InnoDB only supported Zlib.
  - MySQL binaries built with feedback-directed optimization (FDO) specifically tuned for MyRocks workloads, reducing CPU usage by approximately 7-10%.
References:
- https://www.vldb.org/pvldb/vol13/p3217-matsunobu.pdf
- https://www.percona.com/blog/the-impacts-of-fragmentation-in-mysql/
- https://www.slideshare.net/slideshow/myrocks-deep-dive/61103198