Distributed Consensus: Paxos
Dive into the fundamentals of Paxos, one of the most influential distributed consensus algorithms. Learn how Paxos ensures fault-tolerant decision-making in distributed systems, its key phases, challenges, and why it's the foundation for modern consensus protocols like Raft and Multi-Paxos.
Paxos is a fundamental algorithm for achieving consensus in distributed systems, ensuring reliability even in the presence of failures and network delays. Introduced by Leslie Lamport, it serves as the foundation for modern consensus protocols like Raft and Multi-Paxos.
In this blog, I tried to explain the algorithm in my own words—at first, I thought it would be easier, but it turned out to be just as complex as ever. 😅 Still, I want to share it and hopefully make it more understandable! 🚀
Distributed consensus is the process by which multiple computers in a distributed system agree on a single value, even if some nodes fail or messages are delayed.
Distributed consensus solves the fundamental problem of achieving consistent agreement in the presence of faults:
- Ensures system-wide consistency (e.g., in a database or lock service).
- Allows the system to tolerate failures of nodes or networks.
- Prevents split-brain scenarios where different parts of the system may hold conflicting information.
A consensus algorithm defines:
- Rules that nodes follow when proposing values.
- Mechanisms for handling conflicts (e.g., multiple proposals).
- Recovery processes if a node or network fails.
Reaching consensus is hard because of several challenges:
- Latency: Delayed messages or high round-trip times can slow the consensus process.
- Node Failures: Some nodes may crash or become unresponsive.
- Network Partitions: The network may split into isolated groups of nodes, each unaware of the others.
Paxos is a distributed consensus algorithm devised by Leslie Lamport. It ensures:
- Only one value is ultimately chosen.
- If a majority of nodes are active, a value will eventually be chosen.
- Once chosen, the value does not change.
Where Is It Used?
- Google Spanner: Uses a variant of Paxos (Multi-Paxos) for global consistency.
- Apache ZooKeeper: Inspired by Paxos (though it uses ZAB), for consistent configuration and leader election.
- Others: Consul, etcd (though etcd actually uses Raft, Paxos’s conceptual sibling).
Roles
Each node in Paxos can take on different roles:
- Proposer: Suggests a value by sending proposals.
- Acceptor: Accepts or rejects proposals.
- Learner: Learns which value was ultimately chosen.
A single node can simultaneously act as a proposer, an acceptor, and a learner.
Terms:
- Proposal: When a proposer wants to put forward a value to be agreed on across the system, it sends a proposal to the acceptors.
To better understand the Paxos algorithm, let’s walk through the reasoning behind why each of its design decisions is necessary.
Problem:
Let’s say we have 5 nodes, and we want those 5 nodes to agree on a single value. Each node can propose any value it wants.
The simplest solution is to let just one node decide which value should be agreed on across all nodes. With a single acceptor, even if multiple values are proposed simultaneously, the system can still reach agreement on one value across all nodes.
The problem becomes obvious: if Node5 fails, our entire system breaks down.
While a single acceptor works, it lacks fault tolerance. A better approach is to use multiple acceptors, where a value becomes "chosen" when accepted by a majority of nodes (the quorum). This system remains operational even if some nodes fail, as long as we maintain majority acceptance.
But when we have multiple acceptors, we need to define rules that all acceptors must follow, so that we can achieve the goal: there is one and only one value that’s chosen (as stated in the previous section).
An acceptor must accept the first proposal it receives.
Since we can’t predict how many values proposers will submit, this rule guarantees that even if only one value is proposed, the acceptor will accept it. By enforcing this, Paxos ensures that at least one value is always accepted, forming the basis for reaching consensus.
But here comes another problem. Say we now let all 5 nodes be acceptors, and a value is chosen once 3 of them (the quorum) accept it.
Consider a situation like this:
In this scenario, each node acts as both a proposer and an acceptor simultaneously. Each block represents a proposal for a value. The timing of proposal arrivals matters: pay close attention to where each block sits, as it indicates when an acceptor receives a proposal. Since an acceptor must accept the first proposal it receives, this can lead to a situation where no single value is accepted by a majority; instead, different acceptors accept different values, preventing consensus. In this example, X1 is accepted by Node1 & Node2, X2 by Node3 & Node4, and X3 by Node5. Hence, no value is chosen by a majority (3 nodes).
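To make this split-vote scenario concrete, here is a minimal sketch in Python (the `FirstRuleAcceptor` class and the arrival order are illustrative assumptions, not part of any real Paxos implementation):

```python
from collections import Counter

class FirstRuleAcceptor:
    """Acceptor obeying the first rule: accept the first proposal it receives."""

    def __init__(self):
        self.accepted_value = None

    def receive_proposal(self, value):
        # Accept only the very first proposal; ignore everything after.
        if self.accepted_value is None:
            self.accepted_value = value
            return True
        return False

# Five acceptors; proposals X1, X2, X3 arrive in different orders, so the
# first value each node sees differs and the votes split 2-2-1.
acceptors = [FirstRuleAcceptor() for _ in range(5)]
first_arrivals = ["X1", "X1", "X2", "X2", "X3"]
for acceptor, value in zip(acceptors, first_arrivals):
    acceptor.receive_proposal(value)

counts = Counter(a.accepted_value for a in acceptors)
chosen = [v for v, c in counts.items() if c >= 3]  # quorum = 3 of 5
print(chosen)  # [] -- no value reached a majority
```

With the votes split this way, no value ever reaches the quorum of 3, which is exactly why the first rule alone is not enough.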
So, to avoid the situation above, we can add a second rule.
The acceptor can accept multiple proposals it receives.
By doing this, we eliminate the above case, but it introduces a new challenge. Let’s consider the following scenario:
We still have 5 nodes. This time, Node1 wants all other nodes to agree on value X1; it gets acceptance from Node2 and Node3 → quorum met (Node1, Node2, Node3) → X1 is chosen.
Right after that, Node5 also wants to propose value X2; it requests acceptance from Node3 and Node4 and gets accepted → quorum met (Node3, Node4, Node5) → X2 is chosen.
→ Split brain: we have two chosen values, X1 & X2, which violates the goal
There is one and only one chosen value
💥 We have conflict here, which value should be the chosen one then?
You might wonder why the process doesn’t stop once X1 is accepted by a majority of nodes. The key issue is:
How does Node5 know that a value has already been chosen by the majority?
Let’s go through some solutions:
- ❓ What if, before proposing its value, Node5 checks whether any value has already been chosen by a majority of nodes? This won’t work in the following situation: at the time Node5 wants to propose a value (the blue dot), it checks with the other acceptors, finds no value accepted by a majority, and then sends its own proposal anyway.
- ❓ What if each acceptor saves its accepted value? → Since we allow acceptors to accept multiple values, we would never know which accepted value is the “chosen” one.
- ❓ What if the proposer of the chosen value directly notifies the others? This may work sometimes, but it does not guarantee correctness. Notifications can be delayed or lost, meaning Node5 might start a proposal (the blue dot indicates the time Node5 starts sending proposals to other acceptors) before receiving the notification from Node1.
How do acceptors or proposers know a value has already been chosen by a majority?
- Checking with all nodes is unreliable in asynchronous networks—some might be slow or unresponsive.
- Storing multiple accepted values doesn’t reveal which one reached a majority.
- Notification from the chosen value’s proposer can be delayed or lost.
Let’s step back and examine our goal again.
In a distributed system, communication between nodes is asynchronous. Each acceptor only knows about the proposals it has accepted but has no visibility into what values have been accepted by other nodes. In a decentralized, asynchronous environment, synchronizing state across nodes is extremely difficult.
Let’s walk through some points:
(1) If value X is chosen (accepted by a majority of acceptors), no future value Y should reach consensus.
(1) leads to:
→ (2) If an acceptor (A) has already accepted X, then every subsequent proposal it accepts must also propose X.
However, there are three major issues in a distributed system:
- Acceptors only “accept” values but do not track consensus – An acceptor is only responsible for accepting proposals. It does not know whether a value it has accepted is the first, the last, or the chosen one.
- Acceptors cannot communicate with each other to determine the chosen value – Due to network asynchrony, an acceptor has no way of knowing whether a majority of acceptors have accepted the same value.
- We cannot rely on immediate synchronization – If we try to synchronize acceptors immediately, it would require real-time communication, which is impractical in a distributed setting.
So, in order to uphold (2), we arrive at (3):
(3) Once a value X has been accepted, all subsequent proposals—regardless of the proposer—must also propose X.
Since acceptors are passive nodes that can only accept or reject proposals without knowing the global state of the system, ensuring consensus must start with the proposer—the active entity responsible for initiating proposals.
The goal of consensus is for all nodes in the system to agree on a single value, not for individual proposers to "win" by getting their specific value accepted. To achieve this, we allow acceptors to store their accepted values, and proposers must retrieve this information from acceptors before proposing a value. By doing so, a proposer can determine whether a value has already been accepted by a majority or has the highest likelihood of leading to consensus. If such a value exists, the proposer must adopt it to maintain consistency.
This rule is critical for preventing conflicting values from reaching consensus. Without it, multiple proposers could suggest different values, leading to split-brain scenarios where no single value is chosen.
To enforce this, before proposing a value, the proposer must first send a "prepare" request to all acceptors, asking for their most recent accepted values. Each acceptor's response provides insight into the system's state. Since not all acceptors may respond due to failures or network delays, the proposer only needs to collect responses from a majority of acceptors to proceed.
Once a majority of responses are received, the proposer sends an "accept" request to all acceptors, containing the value it intends to propose—either a newly selected value or the highest accepted value returned in the prepare phase.
Challenges in Maintaining Consensus:
- Multiple accepted values: According to the Second Rule (an acceptor can accept multiple proposals), an acceptor may hold more than one accepted proposal, so we need to determine which value to return in the prepare response.
- No Consensus Yet? → If no majority-accepted value exists, the proposer must decide between proposing a new value or reusing an existing one, depending on the state of the system.
- Ordering Proposals → Messages may arrive out of order, so we need a way to track newer proposals.
Let’s resolve them one by one, from easiest to toughest.
- Ordering Proposals: To determine the order of proposals, we introduce a new value called the proposal number: a unique (across all proposers), increasing identifier attached to each proposal. How it’s generated can vary based on the implementation.
- Multiple accepted values: → Use the accepted value from the proposal with the highest proposal number. We must use the proposal number to pick the latest one, because with asynchronous communication the value from an old proposal can reach an acceptor later than a newer proposal.
- No Consensus Yet?: If a consensus value X has not yet emerged, there are two cases:
  - Some acceptors have accepted previous values → The proposer must adopt the value from the highest-numbered accepted proposal. Key idea: the goal is to reach consensus, not to push a specific value. If a value is already in progress (accepted by some acceptors, but not yet a majority), all future proposals must use that value as well. Since there can be multiple accepted values, we must choose the value from the proposal with the highest number.

We can’t rely on network requests alone, so each acceptor must store the highest proposal number it has seen so far; when a new proposal comes in, the acceptor compares it against that number to decide whether to accept the value.
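The “adopt the highest-numbered accepted value” step can be sketched as a small helper; the `(accepted_number, accepted_value)` response shape here is an assumption for illustration:

```python
def choose_value_to_propose(prepare_responses, my_value):
    """Pick the value a proposer must use after the prepare phase.

    prepare_responses: list of (accepted_number, accepted_value) pairs,
    where both are None if that acceptor has accepted nothing yet.
    """
    accepted = [(n, v) for n, v in prepare_responses if n is not None]
    if not accepted:
        # No acceptor has accepted anything: free to propose our own value.
        return my_value
    # Otherwise adopt the value from the highest-numbered accepted proposal.
    return max(accepted, key=lambda pair: pair[0])[1]

# An old proposal's value (X3, number 1) arriving late cannot shadow the
# newer accepted value (X4, number 2).
print(choose_value_to_propose([(None, None), (1, "X3"), (2, "X4")], "mine"))  # X4
print(choose_value_to_propose([(None, None), (None, None)], "mine"))          # mine
```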
But there’s still an issue. For example:
2.6. Breakdown the problem & solution — figure 6

When Node1 wants to propose a value, it sends a Prepare request to Node2, Node3, Node4, and Node5. They report that X4 (Proposal #2) is the highest accepted value so far, so Node1 must propose X4. However, before Node1 actually sends out Accept(X4), Node3, Node4, and Node5 receive a proposal from Proposal #1 with value X3, and since they haven’t accepted anything yet, they accept X3, forming a majority for X3. Later, Node1 still proceeds to send Accept(X4) to Node2, Node3, and Node4, which also gets accepted (perhaps as Proposal #3). As a result, two different values (X3 and X4) both appear to be accepted by a majority, creating a conflict.

To resolve this, when sending a Prepare request we must include the proposal number, and force all acceptors to reject any value (or even any Prepare request) from a proposal with a lower proposal number.
2.6. Breakdown the problem & solution — figure 7

- No acceptor has accepted any value → The proposer is free to propose a new value. However, if multiple proposers send different values simultaneously, conflicts may arise. For example:

2.6. Breakdown the problem & solution — figure 8

But with the above approach, we can guarantee that only one value is accepted (the value of the proposal with the highest proposal number).
Let’s wrap up all the pieces so far in the following sections:
Each proposal is assigned a unique and increasing identifier, known as the proposal number. This ensures that proposals can be ordered and that acceptors can always distinguish newer proposals from older ones.
How is the Proposal Number Generated?
The implementation can vary, but it must satisfy two key properties:
- Uniqueness – No two proposers should generate the same proposal number.
- Monotonicity – Proposal numbers must always increase to establish a clear order.
Common approaches include:
- Timestamp + Node ID: Use a logical timestamp (e.g., system clock or Lamport timestamp) combined with a unique node identifier to ensure uniqueness across proposers.
- Local Counter + Node ID: Each proposer maintains a local counter that increments with each new proposal. The node ID ensures uniqueness in a multi-node system.
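As a sketch of the second approach, a local counter combined with a node ID can be packed into a tuple so that numbers compare correctly across proposers (the class name and API here are assumptions for illustration):

```python
class ProposalNumberGenerator:
    """Generates unique, increasing proposal numbers as (counter, node_id).

    Tuples compare lexicographically, so the counter dominates the ordering
    and the node id breaks ties, guaranteeing uniqueness across proposers.
    """

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def next_number(self, seen_counter=0):
        # Jump past any higher counter observed from other proposers so
        # our next number is strictly larger than anything we've seen.
        self.counter = max(self.counter, seen_counter) + 1
        return (self.counter, self.node_id)

gen_a = ProposalNumberGenerator(node_id=1)
gen_b = ProposalNumberGenerator(node_id=2)
n1 = gen_a.next_number()        # (1, 1)
n2 = gen_b.next_number(n1[0])   # (2, 2) -- strictly greater than n1
print(n1 < n2)  # True
```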
We’ve seen that in order to propose a value, each proposer must go through 2 steps:
- 1st step: Send out a `prepare` request to collect information from all acceptors, and tell them to reject any proposal with a lower proposal number.
- 2nd step: If the proposer receives responses from a majority of acceptors, it starts sending out `accept` requests to propose the value.
Everything we’ve gone through so far is the core idea of the Paxos algorithm; now let’s put it together into the complete protocol.
A Paxos proposal contains:
- Proposal number: A unique, order-comparable identifier across all nodes & proposals.
- Value: The actual value the proposer wants to propose.
Each acceptor tracks:
- `minAcceptProposalNumber` - The lowest proposal number that the acceptor is still willing to accept.
- `acceptedValue` - The value of the highest-numbered proposal it has accepted so far.
- `acceptedProposalNumber` - The proposal number of that accepted proposal (returned in Prepare responses).
- Proposer initiates the Prepare Phase
  - The proposer sends a Prepare(n) request to all acceptors, where n is a unique and increasing proposal number.
- Acceptor processes the Prepare request
  Each acceptor follows these rules when handling the Prepare(n) request:
  - If the proposal number n in the request is less than `minAcceptProposalNumber`, it rejects the request. (This ensures that no outdated proposals interfere with ongoing consensus.)
  - Otherwise:
    - The acceptor updates `minAcceptProposalNumber` to n (preventing future proposals with lower numbers).
    - The acceptor responds with the latest proposal it has accepted (if any), including:
      - Proposal number: the proposal number of the Prepare request
      - Accepted value: the value of the most recent Accept request
      - Accepted number: the proposal number of the most recent Accept request
- When a proposer sends a Prepare(n) request, it waits for responses from a majority of acceptors.
Once it receives promises from a majority, the proposer determines which value to propose based on the following rules:
- If any of the responding acceptors have already accepted a proposal:
- The proposer must use the value from the proposal with the highest accepted number among the responses.
- If no acceptor has accepted any value before:
- The proposer can propose any new value of its choice.
Then the proposer sends an Accept(n, value) request to all acceptors. Once it gets accept responses back from a majority of acceptors, it knows that consensus has been reached and the value it proposed is the chosen one.
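Putting both phases together, a single proposer round might look like the following sketch (the `Acceptor` interface, tuple-shaped proposal numbers, and synchronous calls are simplifying assumptions, not the definitive implementation):

```python
class Acceptor:
    """Minimal acceptor: promises and accepts based on proposal numbers."""

    def __init__(self):
        self.min_proposal = (0, 0)  # lowest number still acceptable
        self.accepted = None        # (number, value) of latest accepted proposal

    def prepare(self, n):
        if n < self.min_proposal:
            return None             # reject outdated prepare
        self.min_proposal = n
        return self.accepted if self.accepted else (None, None)

    def accept(self, n, value):
        if n < self.min_proposal:
            return False            # a newer prepare arrived in the meantime
        self.min_proposal = n
        self.accepted = (n, value)
        return True

def run_proposal(acceptors, proposal_number, my_value):
    majority = len(acceptors) // 2 + 1
    # Phase 1: Prepare -- collect promises from a majority.
    promises = [r for r in (a.prepare(proposal_number) for a in acceptors)
                if r is not None]
    if len(promises) < majority:
        return None  # must retry with a higher proposal number
    # Adopt the highest-numbered accepted value, if any acceptor has one.
    accepted = [(n, v) for n, v in promises if n is not None]
    value = max(accepted)[1] if accepted else my_value
    # Phase 2: Accept -- the value is chosen once a majority accepts it.
    acks = sum(1 for a in acceptors if a.accept(proposal_number, value))
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(5)]
print(run_proposal(acceptors, (1, 1), "X1"))  # X1 is chosen
print(run_proposal(acceptors, (2, 2), "X2"))  # still X1: later rounds adopt it
```

Note how the second round, despite wanting to propose X2, is forced to re-propose X1: that is exactly rule (3) at work.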
⚠️
Even if a proposer successfully completes the Prepare Phase by receiving OK responses from a majority of acceptors, there is no guarantee that its Accept Phase will also succeed.
Why?
- After an acceptor responds with a promise in the Prepare Phase, it is still free to accept a new prepare request with a higher proposal number from another proposer.
- If another proposer starts a new Prepare Phase with a higher proposal number, the acceptor updates its `minAcceptProposalNumber` and will reject any subsequent accept requests with lower numbers.
What happens in the Accept Phase in Acceptor side?
- When the proposer sends an Accept(n, value) request to all acceptors, each acceptor checks whether n is still the highest proposal number it has seen.
- If n is still the highest, the acceptor accepts the proposal.
- If another proposer has sent a newer prepare request (with a higher proposal number) in the meantime, the acceptor rejects the accept request.
Implication
- A proposer may fail in the Accept Phase even after succeeding in the Prepare Phase.
- The proposer must retry with a higher proposal number if its accept request is rejected.
- This mechanism ensures that only the latest proposal progresses, preventing outdated proposals from overriding a more recent consensus effort.
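The accept-phase check on the acceptor side can be sketched like this (a simplified, single-process illustration; `min_proposal` plays the role of `minAcceptProposalNumber`):

```python
class Acceptor:
    def __init__(self):
        self.min_proposal = (0, 0)  # highest proposal number promised so far
        self.accepted = None        # (number, value) of latest accepted proposal

    def prepare(self, n):
        if n < self.min_proposal:
            return None             # reject outdated prepare
        self.min_proposal = n
        return self.accepted if self.accepted else (None, None)

    def accept(self, n, value):
        # Reject if a newer prepare raised the bar in the meantime.
        if n < self.min_proposal:
            return False
        self.min_proposal = n
        self.accepted = (n, value)
        return True

a = Acceptor()
a.prepare((1, 1))              # promise to proposer 1
a.prepare((2, 2))              # a competing proposer prepares with a higher number
print(a.accept((1, 1), "X1"))  # False: proposer 1's accept is rejected
print(a.accept((2, 2), "X2"))  # True: the newer proposal wins
```

This is the scenario described above: proposer 1 succeeded in the Prepare Phase, yet still fails in the Accept Phase because a higher-numbered prepare arrived in between.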
Earlier, we introduced the learner role, but we haven’t fully explored how it actually learns the chosen value. A learner is a node that needs to know the consensus value, but unlike proposers and acceptors, it does not actively participate in decision-making.
In Paxos, only the proposer directly knows which value has been chosen. While acceptors store the values they accept, this doesn’t necessarily mean they store the final chosen value.
How Can a Learner Obtain the Chosen Value?
There are multiple ways for learners to discover the final consensus value:
- The Learner is also the Proposer
- If a learner is the proposer that successfully proposed the chosen value, it already knows the result because it receives accept confirmations from a majority of acceptors.
- Since Paxos guarantees that only one value can be chosen, the proposer can be certain of the result.
- The Learner is an Acceptor or Another Independent Node
If the learner is not the proposer, it needs a way to find out the chosen value. There are several approaches:
- Proposers Notify Learners
- After a value is chosen, the proposer can send notifications to all learners.
- Issue: This approach can scale poorly if there are many proposers and learners: the number of messages exchanged grows with (number of proposers) × (number of learners), which can introduce significant network overhead.
- Designated Learners
- Instead of notifying all learners, the system can assign a subset of learners to receive updates.
- These designated learners then relay the chosen value to others.
- This reduces message complexity while ensuring all learners eventually receive the correct value.
- Acceptors Notify Learners
- Acceptors can notify learners when they accept a value.
- The learner then checks whether the same proposal has been accepted by a majority of acceptors.
- If a majority has accepted the same proposal number, the learner can safely conclude that this is the chosen value.
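A learner following the third approach only has to count matching notifications until one proposal reaches a quorum (this `Learner` class is a hypothetical sketch, not from any library):

```python
from collections import defaultdict

class Learner:
    """Learner that infers the chosen value from acceptor notifications."""

    def __init__(self, num_acceptors):
        self.majority = num_acceptors // 2 + 1
        # Track which acceptors reported each (proposal_number, value) pair.
        self.votes = defaultdict(set)

    def on_accepted(self, acceptor_id, proposal_number, value):
        """Called when an acceptor notifies us it accepted a proposal.

        Returns the chosen value once a majority accepted the same proposal,
        otherwise None.
        """
        self.votes[(proposal_number, value)].add(acceptor_id)
        if len(self.votes[(proposal_number, value)]) >= self.majority:
            return value
        return None

learner = Learner(num_acceptors=5)
print(learner.on_accepted(1, (1, 1), "X1"))  # None -- only 1 vote
print(learner.on_accepted(2, (1, 1), "X1"))  # None -- only 2 votes
print(learner.on_accepted(3, (1, 1), "X1"))  # X1 -- majority (3 of 5) reached
```

Counting by (proposal number, value) pairs, rather than by value alone, ensures the learner only concludes when a majority accepted the *same* proposal.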
Even though the current approach increases the likelihood of reaching consensus, a persistent issue arises when multiple proposers continuously send Prepare requests after being rejected. This typically happens due to:
- Higher Proposal Numbers Causing Rejections:
- If a proposer sends a Prepare(n) request but another proposer sends a higher-numbered proposal (n'), then the first proposer’s request will be rejected.
- This forces the proposer to restart with an even higher number, increasing contention.
- Multiple Proposers Retrying Simultaneously:
- If several proposers are competing at the same time, they continuously interfere with each other’s proposals.
- As a result, no proposer manages to complete both phases (prepare & accept), delaying consensus indefinitely.
3.4. The contention — figure 1
Solution: Exponential Backoff
Instead of immediately retrying after rejection, proposers should introduce a randomized exponential backoff delay before retrying. This reduces the likelihood of multiple proposers retrying at the same time, improving system stability.
How Exponential Backoff Works
- On rejection, a proposer waits for a delay before retrying.
- The delay doubles after each failed attempt, up to a maximum threshold.
- A random jitter is added to prevent synchronization issues (where multiple proposers retry at the exact same time).
- If the proposer gets a majority response, it proceeds with the accept phase.
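A sketch of such a backoff policy (the base delay and cap values are arbitrary assumptions):

```python
import random

def backoff_delay(attempt, base=0.05, cap=2.0):
    """Randomized exponential backoff delay for a rejected proposer.

    The delay window doubles with each failed attempt up to `cap`, and
    full jitter (a uniform draw over the window) keeps competing
    proposers from retrying in lockstep.
    """
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0, window)

# Delay windows grow with each rejection, but never exceed the cap.
for attempt in range(6):
    d = backoff_delay(attempt)
    assert 0 <= d <= min(2.0, 0.05 * 2 ** attempt)
```

In practice the proposer would sleep for `backoff_delay(attempt)` seconds after each rejection before restarting the Prepare Phase with a higher proposal number.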
- Safety: No two different values can both be chosen, because acceptors reject conflicting proposals once they have promised a higher proposal number.
- Liveness: A value will eventually be chosen if a majority of nodes are up and can communicate.
- Fault Tolerance: The system works as long as a majority of acceptors remain functional. Even if some nodes (proposers or acceptors) fail mid-protocol, the chosen value remains correct, since it must have been accepted by a majority of acceptors.
- Conceptual Complexity: Paxos is famously hard to understand and implement correctly.
- Performance Bottlenecks: Frequent leader changes or many proposers can cause overhead.
- Alternatives: Protocols like Raft or ZAB (ZooKeeper Atomic Broadcast) may be simpler to implement or reason about.
- Distributed Databases (e.g., Spanner, CockroachDB).
- Cloud Storage (Microsoft Azure Storage, etc.).
- Coordination Services (Chubby, ZooKeeper).
I’ve implemented Paxos, but as mentioned in the Limitations of Paxos, it’s notoriously difficult to implement correctly. If you spot any flaws in my implementation, I’d really appreciate your feedback. 🙏
You can take a look at my implementation here:
https://github.com/trunghieu99tt/Paxos
It’s heavily inspired by: https://github.com/xiang90/paxos
Paxos is a cornerstone of modern distributed systems, guaranteeing consensus even under failures and network issues. While it can be complex, its principles form the basis of many modern consensus protocols. Understanding Paxos provides deep insights into how distributed systems maintain consistent state across unreliable infrastructures.
https://www.microsoft.com/en-us/research/publication/paxos-made-simple/
https://www.youtube.com/watch?v=d7nAGI_NZPk&t=1102s&ab_channel=GoogleTechTalks
https://www.youtube.com/watch?v=JEpsBg0AO6o&ab_channel=DiegoOngaro