Google MapReduce: Simplified Data Processing on Large Clusters

A clear, concise walkthrough of Google’s MapReduce, explaining how it simplified large-scale data processing while revealing its key design insights and limitations.

In the early 2000s, Google had many computations that transform raw data (crawled documents, web request logs, …) into various kinds of derived data, such as inverted indices or summaries of the number of pages crawled per host. Most of these computations were simple, but they had to process terabytes, or even petabytes, of data on thousands of commodity machines. Developers had to manually handle parallelism, data distribution, fault tolerance, and recovery.

Google built an abstraction inspired by the map and reduce primitives after noticing a simple pattern in most large-scale computations: data is repeatedly transformed (map) and then aggregated (reduce) into a final result. This idea became MapReduce - a distributed computing model (implemented as a library inside Google) that hides the messy parts of parallel processing, data distribution, and failure recovery, so engineers can focus solely on defining the map and reduce logic.

At its core, MapReduce takes a collection of input key/value pairs and produces another collection as output. The user needs to define two functions:

  • Map: function that processes each input pair and emits a series of intermediate key/value pairs. MapReduce then automatically groups all intermediate values that share the same key and sends them to the Reduce function.
  • Reduce: function that receives a key and the list of all associated values, merging or aggregating them into a smaller result (typically zero or one output value). To handle large datasets efficiently, these values are streamed to the reducer through an iterator rather than loaded entirely into memory.

In formal terms:

```
map    (k1, v1)        → list(k2, v2)
reduce (k2, list(v2))  → list(v2)
```

For example, consider the word-count program.

Pseudocode:

```
map(String key, String value):
	// key: document name
	// value: document contents
	for each word w in value:
		EmitIntermediate(w, "1");

reduce(String key, Iterator values):
	// key: a word
	// values: a list of counts
	int result = 0;
	for each v in values:
		result += ParseInt(v);
	Emit(AsString(result));
```
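The pseudocode above translates directly into runnable Python. This is a single-machine toy where the shuffle phase is just a dictionary; `run_job` is my own scaffolding, not part of the library:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # key: document name, value: document contents
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # key: a word, values: an iterator of counts
    return sum(counts)

def run_job(documents):
    # "Shuffle" phase: group intermediate values by key.
    groups = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    # Reduce phase: one output value per key.
    return {key: reduce_fn(key, iter(values)) for key, values in groups.items()}

result = run_job({"doc1": "the quick the lazy", "doc2": "the end"})
# result["the"] == 3
```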
2. Core model — figure 1

⚠️ MapReduce is a programming model: it defines an interface, not a single framework. Different systems can implement this model in their own way. At Google, engineers built a MapReduce library in C++ that realizes this interface, tailored to their infrastructure and workloads. From this point on, when I mention MapReduce, I'm referring specifically to Google's C++ implementation.

  • MapReduce runs on a cluster of machines.
  • The input is split into M pieces (M is defined by the user).

  • Map task: an execution instance of the user-defined map function on a portion of the input data.
    • Since the input is split into M pieces, there are M map tasks in total.
  • Reduce task: an execution instance of the user-defined reduce function on a partition of the intermediate data.
    • The number of reduce tasks (R) is defined by the user; it equals the number of output files the user wants.

In practice:

  • We tend to choose M so that each input piece is roughly 16 MB to 64 MB.
    • Why? Because the input data is stored on GFS. GFS divides each file into 64 MB blocks and stores several copies of each block (typically 3) on different machines.
  • R is a small multiple of the number of worker machines (workers are described in the next section) that we expect to use.

For example, Google often performed MapReduce computations with M = 200,000 and R = 5,000, using 2,000 worker machines.
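As a back-of-the-envelope sketch of these sizing rules (the helper names and the ×2 multiple are my own illustration, not from the paper):

```python
def choose_m(total_input_bytes, split_bytes=64 * 1024 ** 2):
    # One map task per input piece of at most 64 MB (the GFS block size);
    # ceiling division so every byte is covered.
    return -(-total_input_bytes // split_bytes)

def choose_r(num_workers, multiple=2):
    # R as a small multiple of the expected worker count.
    return num_workers * multiple

# 200,000 splits of 64 MB correspond to roughly 12 TB of input,
# matching the paper's example scale.
m = choose_m(200_000 * 64 * 1024 ** 2)  # 200,000 map tasks
r = choose_r(2_000)                     # a few thousand reduce tasks
```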

  • Master: A single machine that coordinates the entire MapReduce job. It assigns map or reduce tasks to idle workers in the cluster.
    • There is only one master!
    • It maintains several internal data structures to track system state: for each map and reduce task, it records the task's status (idle, in-progress, or completed) and the identity of the worker executing it.
    • To minimize network traffic, the Master attempts to schedule map tasks on workers that are physically close to the input data blocks (data locality).
  • Worker: A process that executes either a map task or a reduce task as assigned by the Master. A single worker may perform both roles during a job’s lifetime, but never at the same time.

  • The results of map tasks are:
    • stored in:

      • the memory of the worker that executes the task,
      • the worker's local disk: periodically, the buffered data is flushed to local disk.

      ⚠️ Hence, if the worker fails, all intermediate data it produced is lost!

    • partitioned into R local files on the worker that executes them.

  • The result of one reduce task is stored in one single file on GFS.

For example, if we have R = 3, this is a high-level view of how map workers and reduce workers interact.

For example, reduce worker 1 is only interested in partition 1 of each map worker's output!
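The partitioning that routes each key to exactly one reduce worker is, by default, the paper's `hash(key) mod R`:

```python
def partition(key, R):
    # The paper's default partitioning function: hash(key) mod R.
    # Every map worker applies the same function, so all intermediate
    # values for a given key land in the same partition, and hence
    # at the same reduce worker.
    return hash(key) % R

R = 3
buckets = {w: partition(w, R) for w in ["the", "quick", "brown", "fox"]}
# Each word maps to exactly one of partitions 0, 1, 2.
```

The paper also notes users can supply their own partitioning function, e.g. `hash(Hostname(urlkey)) mod R` to keep all URLs from one host in the same output file.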

c. Output files — figure 1

Following is a high-level overview of the system.

  1. There are M map tasks and R reduce tasks. The Master picks idle workers and assigns each one a map task or a reduce task.
  2. Each map worker executes the map function and buffers the results in memory; periodically, the results are flushed to its local disk. The locations of the data are passed back to the Master.
  3. When the reduce worker receives the file locations from the Master, it issues remote procedure calls to read the data from the map workers' disks at these locations. After reading all the data, it sorts it by the intermediate key, so that all occurrences of the same key are grouped together. If the data doesn't fit in its memory, an external sort is used.
    • The sorting step is crucial as it allows the reduce worker to process one key at a time without holding the entire dataset in memory. As soon as all values for a key are processed, the reduce worker can safely write the aggregated result to disk. This streaming design keeps memory usage low and produces sorted output naturally.
    • External sort: the paper does not reveal the detailed implementation, but I guess it sorts data in memory-sized chunks, writes them to disk, and then merges them into a single sorted stream. This allows reduce workers to handle massive datasets efficiently, using only sequential disk reads instead of loading everything into memory. Probably something similar to merge sort.
  4. The reduce worker iterates over the sorted intermediate data and, for each key, executes the reduce function with the key and its values, appending the results to a single final output file stored in GFS.
  5. When all map tasks and reduce tasks have completed, the Master wakes up the user program and returns the results:
    • Note: the results here are just metadata: the locations of the final output files produced by the reduce tasks.
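A toy version of the external merge sort speculated about above, using fixed-size in-memory runs as stand-ins for on-disk files (the function and its parameters are my own sketch, not the paper's implementation):

```python
import heapq
import itertools

def external_sort(records, chunk_size=1000):
    # Sort fixed-size chunks in memory (a real system would write each
    # sorted run to disk), then merge the sorted runs into one stream.
    runs = []
    it = iter(records)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        runs.append(sorted(chunk))       # in a real system: spill run to disk
    return list(heapq.merge(*runs))      # sequential merge of all runs

pairs = [("the", 1), ("a", 1), ("the", 1), ("end", 1), ("a", 1)]
sorted_pairs = external_sort(pairs, chunk_size=2)
# All occurrences of each key are now adjacent, ready for the reducer.
```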
d. High level overview — figure 2

Since the output is spread across multiple files, it can serve as the input of another MapReduce call, or be consumed by another distributed application that can handle input partitioned across multiple files.

The Master pings each worker periodically. If a worker stops responding, it is marked as failed, and the map/reduce tasks it was executing or had executed are reset to the idle state so they can be reassigned to other idle workers.

If the failed worker had completed map tasks, those must be re-executed, because their intermediate outputs are stored only on that worker’s local disk and are now lost. In contrast, completed reduce tasks don’t need to be redone as their final outputs are already safely stored in GFS.

When re-executed map tasks finish on new workers, they generate new intermediate files. The Master updates these new file locations and notifies any reducers that still need the data. Reducers that hadn’t yet fetched from the failed worker will simply pull the data from the new one.
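The reassignment rules from the last few paragraphs can be sketched as a small state machine (the task representation is my own, not from the paper):

```python
from enum import Enum

class State(Enum):
    IDLE = 0
    IN_PROGRESS = 1
    COMPLETED = 2

def handle_worker_failure(tasks, failed_worker):
    # Completed reduce tasks survive (their output is already in GFS).
    # Everything else on the dead worker goes back to idle, including
    # completed map tasks, whose intermediate files on local disk are lost.
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map" or task["state"] is State.IN_PROGRESS:
            task["state"] = State.IDLE
            task["worker"] = None
```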

The paper notes that the Master could periodically checkpoint its internal state, including the progress and status of all map and reduce tasks, to persistent storage (where is not specified, but I'd guess GFS). If the Master process crashed, a new instance could be restored from the latest checkpoint and continue coordinating the job without restarting from scratch. In their actual implementation, however, if the Master fails, the entire computation aborts!

Even though MapReduce runs across thousands of unreliable machines, it still guarantees a clean and predictable result as if the program had been executed once, sequentially, without any failure.

This strong guarantee holds when both the map and reduce functions are deterministic - the same input always produces the same output. Under this condition, re-executing a failed map or reduce task doesn’t change the final result; it simply regenerates identical data.

To preserve correctness, MapReduce relies on atomic commits:

  • Each task writes its output to temporary files first.
  • Once the task completes successfully, these files are atomically renamed to their final destinations.
  • If a task is re-executed, only one version of its output is ever made visible, thanks to the atomic rename operation provided by GFS.

This ensures that the final state of the system is consistent and contains the output of exactly one execution per task even if failures and retries occur behind the scenes.
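The temp-file-plus-atomic-rename protocol can be sketched with a POSIX `rename()` standing in for GFS's atomic rename (the function name is mine):

```python
import os
import tempfile

def commit_output(final_path, data):
    # Write to a temp file in the same directory, then atomically rename
    # it into place. Readers observe either the old file or the complete
    # new one, never a partial write.
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.rename(tmp_path, final_path)
```

If a task is re-executed, both attempts write to different temp files, but only one rename lands on `final_path`, so exactly one version of the output is ever visible.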

When the user’s functions are non-deterministic (for example, they depend on random numbers or timestamps), MapReduce still guarantees a weaker but reasonable consistency: each reduce task produces an output that corresponds to one valid sequential execution, though different reducers might have observed results from different re-executions of the same map task.

In short, as long as your map and reduce logic are deterministic, the output of a distributed, failure-ridden MapReduce job will be identical to that of a clean, single-machine run.

A straggler is a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation. This can happen for many reasons; one is a bad disk that experiences frequent correctable errors, slowing its read performance. Since one machine can be assigned multiple tasks, those tasks may compete for CPU, memory, local disk, or even network bandwidth, all of which contribute to slow task processing.

To mitigate this issue, when a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. A task is marked as completed whenever either the primary or the backup execution completes.
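A toy version of this "first copy wins" behavior, using two threads as stand-ins for the primary and backup workers (my own sketch; the real scheduler works across machines):

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_with_backup(task):
    # Schedule the same deterministic task twice and take whichever copy
    # finishes first; the straggler's result is discarded. This is safe
    # because the task is deterministic and output commits are atomic.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(task), pool.submit(task)]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return done.pop().result()
```

The paper reports this mechanism costs only a few percent extra compute but cut a large sort's completion time by 44%.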

In some cases, there is significant repetition in the intermediate keys produced by each map task. For example, in word counting, each map task can produce hundreds or thousands of records of the form <the, 1>. All these counts will be sent over the network, which is costly.

So, MapReduce allows the user to specify an optional Combiner function: a mini-reducer that runs locally on the mapper's output. It performs partial aggregation (e.g., summing all <the, 1> pairs into a single <the, N>) before the data is sent across the network. This simple optimization can dramatically shrink the amount of intermediate data transferred, speeding up the job and reducing network load.
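For word count, the combiner is essentially the reduce function run early, which only works because addition is associative and commutative. A minimal sketch:

```python
from collections import defaultdict

def combine(pairs):
    # Combiner: runs locally on each map worker's output, summing the
    # counts per word before anything crosses the network. Safe because
    # addition is associative and commutative.
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())

combine([("the", 1)] * 1000 + [("a", 1)])
# → [("a", 1), ("the", 1000)]  — 1001 records shrink to 2
```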

MapReduce provides support for reading input data in several different formats. Each format knows how to split its input into meaningful key/value pairs for distribution across the map tasks.

For example, for text input, the key is the offset of the line within the file, and the value is the contents of the line.

Users can also add support for a new input type by providing an implementation of a simple reader interface.
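A sketch of what the built-in text reader might look like as a generator of key/value pairs (the function is my own illustration of the interface, not Google's code):

```python
def text_reader(path):
    # Text input format: key = byte offset of the line within the file,
    # value = the line's contents without the trailing newline.
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            yield offset, line.decode().rstrip("\n")
            offset += len(line)
```

A custom input type would just plug in a different generator with the same (key, value) shape.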

Sometimes, user-written map or reduce functions crash deterministically on specific input records, for example due to malformed data or unhandled edge cases. These crashes prevent the whole MapReduce job from finishing, even if 99.999% of the data is perfectly fine.

Ideally, you’d fix the bug, but that’s not always practical - especially when processing billions of records and a few bad ones aren’t worth stopping the job for.

To handle this, MapReduce offers an optional “skip bad records” mode. When it detects that a task repeatedly fails on the same input record, it automatically skips that record and continues processing the rest of the data. This feature allows long-running jobs to complete successfully, trading a few missing records for uninterrupted progress.
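A toy version of this mode, tracking per-record failure counts and skipping a record once it has crashed repeatedly (my own sketch; the real mechanism uses a signal handler plus a "last gasp" UDP message to the master):

```python
def run_with_skipping(records, process, max_failures=2):
    # If the same record crashes processing max_failures times, skip it
    # on the next attempt instead of failing the whole job.
    failures = {}
    results, skipped = [], []
    for i, rec in enumerate(records):
        while True:
            if failures.get(i, 0) >= max_failures:
                skipped.append(rec)
                break
            try:
                results.append(process(rec))
                break
            except Exception:
                failures[i] = failures.get(i, 0) + 1
    return results, skipped

# Record 0 crashes deterministically (division by zero) and gets skipped.
results, skipped = run_with_skipping([1, 0, 2], lambda r: 10 // r)
# results == [10, 5], skipped == [0]
```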

MapReduce includes a lightweight counter mechanism for tracking custom metrics during execution (e.g the number of words processed or malformed records skipped). Each worker periodically reports its counters to the Master, which aggregates and displays them, ignoring duplicates from re-executed tasks. Built-in counters track basic stats (e.g., input/output pairs), while user-defined ones help with quick sanity checks and debugging.
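The duplicate-suppression part is the interesting bit: a retried or backup task reports the same counters again, and the Master must count each task only once. A minimal sketch (class and field names are mine):

```python
class CounterAggregator:
    # Sketch of the Master's counter handling: each completed task reports
    # its counters tagged with a task id; reports from re-executions of an
    # already-counted task are ignored so retries aren't double-counted.
    def __init__(self):
        self.counted = set()
        self.totals = {}

    def report(self, task_id, counters):
        if task_id in self.counted:
            return  # duplicate report from a re-executed or backup task
        self.counted.add(task_id)
        for name, value in counters.items():
            self.totals[name] = self.totals.get(name, 0) + value
```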

To be honest, this is not one of the best papers I've read so far. It gives us the abstraction without digging deep into some parts, for example the scheduling policy, the intermediate storage format, or performance optimizations.

I think it’s more about conceptual clarity than engineering depth. But it remains a foundational work worth reading.

https://drive.google.com/file/d/1fcJP_WE8j0L-QdxQhBSEzkyel8szFzbH/view

Tagged:#Backend#Distributed System#Paper Notes