Time, Clocks and Ordering in Distributed Systems
Time is a crucial part of distributed systems. We use it everywhere: ordering events, resolving conflicts, setting timeouts, schedulers, performance measurements, expiring caches, … but wall-clock time in your service is not as reliable or correct as you might think.
In this blog, we’ll delve into the concept of time in distributed systems, and one of the most crucial tasks: ordering events.
When we talk about time, we need to talk about clocks. In a distributed system, there are two broad kinds of clocks:
- Physical clock: returns “real” time and measures it in seconds (e.g. 1:50 PM).
- Logical clock: returns time relative to something - how long since this server started, how many events have occurred, and so on. It is mostly used to determine order in the system (e.g. whether event A occurred later than event B).
Physical clocks include analogue / mechanical clocks based on pendulums or similar mechanisms, and digital clocks based on something like a vibrating quartz crystal.
- Quartz clock: Uses a laser-trimmed quartz crystal that mechanically resonates at a specific frequency. Quartz oscillators are cheap and show up in computers, phones, microwaves, … but they are not highly accurate. Crystals are manufactured to a tolerance, and manufacturing is imperfect. Temperature also affects the crystal, so different units can vibrate at slightly different rates - some run fast, others slow.
- The rate at which a clock runs fast or slow is called drift, often expressed in parts per million (ppm). 1 ppm is a relative error of one part per million: for timekeeping, that is about one microsecond per second (not one millisecond per second - that would be 1000 ppm). At 1 ppm, drift is on the order of 86 ms per day and about 30 s per year if uncorrected; typical quartz oscillators drift by tens of ppm.
- Atomic clock: Based on quantum-mechanical properties of certain atoms (e.g. caesium or rubidium). Fact: the SI second is defined to be exactly 9,192,631,770 periods of a particular hyperfine transition frequency of the caesium-133 atom. Atomic clocks provide much greater accuracy than typical quartz: roughly 1 part in 10^14 (about 1 second in 3 million years!)
Atomic clocks are expensive, so quartz oscillators are used in most machines. Since these clocks drift, they need to be adjusted from time to time. One of the most common approaches is NTP (Network Time Protocol). Servers called NTP servers (aka time sources) obtain time from high-precision references (e.g. atomic clocks, GNSS-disciplined devices) or from other NTP servers. Other machines request time from these servers (and can become NTP servers themselves if they want to serve requests for others). Those exchanges use the NTP protocol.
NTP organizes sources in a hierarchical, semi-layered system. Each level is called a stratum.
- Stratum 0: High-precision references such as atomic clocks, GNSS (including GPS), other radio clocks, or PTP-disciplined devices. They are reference sources; they typically do not act as general-purpose NTP servers on your LAN.
- Stratum 1: Machines with a direct attachment to a stratum-0 source (e.g. GPS receiver, serial/PPS to a reference) - often run by ISPs, enterprises, or public NTP projects.
- Stratum 2 … n: Application servers and other hosts that sync to stratum-1 (or lower-stratum) servers and adjust their local clocks.
⚠️ Note:
- From Stratum 1 onward, a server in stratum n can sync from stratum n−1 sources or peer with other servers in the same stratum n.
- The upper limit for stratum is 15; stratum 16 is used to indicate that a device is unsynchronized.
Time synchronization over a network is hard because of unpredictable latency and processing delay. To reduce the effect of these variations, NTP collects multiple samples and uses filtering and selection (related to ideas like Marzullo's algorithm) to discard outliers and estimate offset and delay before adjusting the local clock.
The time is corrected in two steps:
- Estimate the clock offset
- Apply the correction
Let’s say the offset the client needs to apply is the clock offset θ.
The NTP client sends the request at t1, the NTP server receives the request at t2 and responds at t3. The NTP client receives the response at t4.
If we assume the network latency is the same between request & response:
→ the network round trip is: δ = (t4 − t1) − (t3 − t2)
→ The estimated time at the NTP server when the client receives the response: t3 + δ/2
→ Estimated clock offset: θ = (t3 + δ/2) − t4 = ((t2 − t1) + (t3 − t4)) / 2
- Slewing: If |θ| < 128 ms, slightly speed the clock up or slow it down by up to 500 ppm (syncing the clock within a few minutes).
- Stepping: But if 128 ms ≤ |θ| < 1000 s, the NTP client steps its clock to the estimated time. Because θ can be positive or negative, applications that read wall-clock time can observe sudden jumps forward or backward.
- But if |θ| ≥ 1000 s, panic and do nothing (the daemon typically logs an error and leaves the correction to a human operator).
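The offset and round-trip estimates above are plain arithmetic, sketched here as a small helper class (the class and method names are made up for illustration - this is not a real NTP client):

```java
// Hedged sketch of the NTP offset/delay arithmetic; t1..t4 are the
// four timestamps from the text, expressed in seconds.
public class NtpMath {

    // Network round trip, excluding the server's processing time:
    // delta = (t4 - t1) - (t3 - t2)
    public static double delay(double t1, double t2, double t3, double t4) {
        return (t4 - t1) - (t3 - t2);
    }

    // Estimated offset of the server clock relative to the client clock:
    // theta = ((t2 - t1) + (t3 - t4)) / 2
    public static double offset(double t1, double t2, double t3, double t4) {
        return ((t2 - t1) + (t3 - t4)) / 2.0;
    }
}
```

For example, with a server clock 100 ms ahead and a symmetric 10 ms one-way latency (t1 = 0.000, t2 = 0.110, t3 = 0.120, t4 = 0.030), the round trip comes out to 20 ms and the estimated offset to +100 ms.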
As wall-clock time can jump backward (or forward), that raises important issues for systems or logic that depend on a monotonically increasing notion of “now,” for example:
- Event ordering based on server timestamps
- Timeout logic: `if now > deadline`. If you compare the current wall-clock time to a fixed deadline, a backward step can make this check misbehave. (For durations, monotonic clocks such as CLOCK_MONOTONIC on Unix avoid this class of problem.)
- Broken last-write-wins: If you resolve conflicts with last-write-wins using wall-clock timestamps, skew and steps can break the intended ordering.
There is something we programmers often ignore - the leap second.
The International Atomic Time (TAI) is based on atomic clocks: the SI second is 9,192,631,770 periods of a specific caesium-133 transition. UTC is kept close to the astronomical day (UT1) by inserting occasional leap seconds so civil time stays aligned with Earth’s irregular rotation (tides, core motion, earthquakes, etc.). There is a slow drift between atomic time and “Earth rotation” time; very roughly, mean solar days are currently a bit longer than 86,400 SI seconds, so leap seconds are needed every few years on average (not on a fixed yearly schedule).
To bridge atomic time and civil time, we use Coordinated Universal Time (UTC): it differs from TAI by a whole number of seconds (the cumulative leap-second offset), chosen so UTC stays within about 0.9 s of UT1.
When needed, a leap second is usually inserted at the end of June 30 or December 31 UTC (announced in advance by IERS). Leap seconds are defined to be +1 s or −1 s; so far, only positive leap seconds have been used.
Because of leap seconds, UTC can include a minute with 61 seconds (e.g. 23:59:60), and a calendar day is not always exactly 86,400 seconds: on a leap-second day the day is 86,401 or 86,399 seconds long. That complicates software that assumes fixed minute and day lengths.
Poor handling of the leap second on 30 June 2012 is what caused the simultaneous failures of many services on that day.
An event is something happening at one node (sending or receiving a message, or a local execution step).
Ordering events in distributed systems is a crucial part of the story, as it affects the correctness and consistency of the system. Here are the main relationships between two events:
Let’s say we have two events a and b. We say a happens-before b (written a → b) if one of the following holds:
- a and b occurred at the same node, and a occurred before b in that node’s local execution order.
- a is the sending of a message m, and b is the receipt of that same message m (assuming sent messages are unique)
- There exists an event c such that a → c and c → b
This relation is a partial order: for some a and b, neither a → b nor b → a. Then we say a and b are concurrent (a ∥ b).
Note: In Lamport’s sense, a → b means a could have influenced b (causal potential), not necessarily direct physical causation in the everyday sense.
The happens-before relation motivates causal ordering: we must guarantee that:
if a → b, then all nodes in the system must observe a before b.
But if a ∥ b, different nodes may observe a and b in different orders.
Causal ordering is often sufficient for correctness, because it never violates cause-and-effect chains. Many systems, however, want a stronger property: total ordering.
All nodes observe all events in exactly the same order.
Unlike causal ordering, which only constrains causally related pairs, total ordering picks a single global sequence over all events, including concurrent ones (a ∥ b).
Let’s take a look at an example:
At t1, userA sends message m1 to userB and userC.
At t2, userB receives message m1, then sends the message m2 to userA and userC.
Due to network delay, userC might receive m2 before m1. We need a way to infer the right order of m1 and m2 at C.
One approach is to attach t1 and t2 to m1 and m2 and compare those timestamps.
→ The problem here is: t1 is the time on userA’s clock, while t2 is the time on userB’s clock. From the previous section, we know that it’s not guaranteed that t2 ≥ t1, as B’s clock might be skewed - e.g. jumping backward when syncing via NTP. Relying on the timestamp ordering can result in the incorrect order (m2 → m1).
That’s when Logical Clock comes into play!
A logical clock does not answer “what time is it?” in the wall-clock sense; it answers “how far along a reference progression are we?” - for example, since process start or event count. The unit need not be seconds; it can be any quantity from which we can derive happens-before information.
In this post we cover two kinds of logical clocks:
- Lamport Timestamps
- Vector Clocks
Each node maintains a logical counter C, which is incremented whenever a local event occurs. This counter represents the node’s notion of logical time and advances monotonically as the system progresses.
When a node sends a message m, it attaches its current counter value to the message. Suppose this value is t; it serves as a timestamp indicating when the message was sent in the sender’s logical timeline.
Upon receiving the message, the recipient compares the received timestamp t with its own local counter C. To preserve causality, the recipient updates its counter to be greater than both values. In practice, this is done by setting:
C ← max(C, t) + 1
This update ensures that the receive event is always considered to occur after the send event in logical time, thereby maintaining the happens-before relationship across nodes.
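The update rules above fit in a few lines of code (an illustrative class, not tied to any particular library):

```java
// Minimal Lamport clock sketch. `tick` covers local and send events;
// `receive` applies the rule C <- max(C, t) + 1 from the text.
public class LamportClock {
    private long counter = 0;

    // Local event or message send: advance and return the new timestamp.
    public synchronized long tick() {
        return ++counter;
    }

    // Message receive: jump past both the local counter and the
    // timestamp t carried by the message, preserving happens-before.
    public synchronized long receive(long t) {
        counter = Math.max(counter, t) + 1;
        return counter;
    }
}
```

A sender would call `tick()` and attach the result to its message; the recipient passes that value to `receive(...)`.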
Properties of this scheme: Let’s say we have two events a and b, with Lamport timestamps L(a) and L(b) respectively. Then:
- If a → b, then L(a) < L(b)
- However, L(a) < L(b) does not imply a → b
- Possibly L(a) = L(b) even if a ≠ b
⚠️ Note that: a → b here means event a causally precedes event b in the Lamport happens-before sense.
Back to the example in “Using physical clocks to ensure ordering?” - with Lamport timestamps we can order m1 before m2 at C in the scenario we care about, because:
Let’s say userA sends the message m1 with its Lamport timestamp L = 3.
Let’s say, at the time userB receives the message m1, its counter is C = 2. On receiving m1, it updates its Lamport timestamp to max(2, 3) + 1 = 4, then attaches 4 to message m2 before sending to userC.
When userC receives m1 and m2, according to the attached Lamport timestamps (3 < 4), it can determine that m1 happens before m2 even if it receives m2 before m1.
Furthermore, if we want to avoid L(a) = L(b) for distinct events, we can pair the logical counter with a unique node id - write (L(e), id) for that pair.
Doing so, we also achieve a total order: now, for any distinct events a and b, we never have (L(a), id(a)) = (L(b), id(b)). Ordering events by (L, id) results in a total order.
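That tie-breaking scheme might look like this (the record and comparator names are invented for illustration):

```java
import java.util.Comparator;

// A (counter, nodeId) pair: comparing the Lamport counter first and
// breaking ties with the unique node id yields a total order.
record LamportStamp(long counter, String nodeId) {
    static final Comparator<LamportStamp> TOTAL_ORDER =
        Comparator.comparingLong(LamportStamp::counter)
                  .thenComparing(LamportStamp::nodeId);
}
```

Two distinct events can never compare equal under this order, because no two events share both the same counter and the same node id.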
One important caveat of Lamport timestamps is that they cannot fully determine the relationship between two arbitrary events. Given two timestamps L(a) < L(b), it is not possible to conclude whether one event happened before the other, or whether they are concurrent.
At first glance, this may seem contradictory to what we have discussed so far. However, the limitation becomes clear once we recall that Lamport time is simply a logical counter, maintained independently by each node. Different nodes may have counters that progress at very different rates.
Consider the following scenario. Suppose node A currently has a Lamport timestamp of 1, while node B has already advanced to 10. Now imagine:
- Node A sends a message a with Lamport timestamp 1 to node C.
- Node B sends a message b with Lamport timestamp 10 to node C.
- These two events are independent, meaning neither a → b nor b → a
From the definition in the Happens-before relation section, we know that a and b are concurrent. However, the Lamport timestamp of b is still greater than that of a; hence, at node C it is impossible to tell from scalar timestamps alone whether a and b are concurrent or causally ordered.
Nevertheless, Lamport timestamps remain extremely useful. While they cannot infer causality from timestamp comparison alone, they do guarantee the following: if a → b, then L(a) < L(b).
In other words, Lamport clocks never violate causal relationships. Even when concurrency is ambiguous, picking an order consistent with increasing Lamport time is safe for causality (e.g. ordering a then b in the example above), though it may not match real-time or fairness.
Given the limitation of Lamport timestamps, if we want to determine whether two events are concurrent, we need a more expressive form of logical time: the vector clock.
Assume there are N nodes in the system. In Lamport clocks, each node maintains a single logical counter. Vector clocks extend this idea by having each node maintain a vector of counters, one for every node in the system.
Formally, each node i maintains a vector:
V_i = ⟨V_i[1], V_i[2], …, V_i[N]⟩
where V_i[j] is how many events at node j are in the causal past of this vector (what this node knows about each process).
Note that: In real-world applications, a vector clock can be represented by a map from node ID to that node’s logical timestamp instead of an array.
Each node updates and propagates this vector as follows:
- Local event: When an event occurs at node i, it increments its own entry: V_i[i] ← V_i[i] + 1
- Send event: When sending a message, the node attaches its current vector clock to the message.
- Receive event: Upon receiving a message carrying vector W, the node merges the received vector with its local vector by taking the element-wise maximum: V_i[j] ← max(V_i[j], W[j]) for every j. After that, it increments its own entry V_i[i] to reflect the receive event.
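The three rules can be sketched for a fixed set of N nodes (an array-based illustration with invented names; whether the send event ticks before or after attaching the vector varies slightly between formulations - here the send ticks first):

```java
import java.util.Arrays;

// Vector clock sketch: this node is index `me` out of n nodes.
public class VectorClock {
    private final long[] v;
    private final int me;

    public VectorClock(int n, int me) {
        this.v = new long[n];
        this.me = me;
    }

    // Local event: increment our own entry v[me].
    public void tick() { v[me]++; }

    // Send event: tick for the send itself, then attach a copy.
    public long[] send() {
        tick();
        return Arrays.copyOf(v, v.length);
    }

    // Receive event: element-wise max with the incoming vector w,
    // then tick to count the receive itself.
    public void receive(long[] w) {
        for (int j = 0; j < v.length; j++) v[j] = Math.max(v[j], w[j]);
        tick();
    }

    public long[] snapshot() { return Arrays.copyOf(v, v.length); }
}
```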
Unlike Lamport timestamps, vector clocks capture causal history across all nodes. By comparing two vector clocks, we can determine:
- If one event happened before another
- If two events are concurrent
Suppose we have two events with vector clocks V and W.
Each component counts how many events from the corresponding node are included in that event’s causal history.
Comparison Rules
To compare two vector clocks, we examine them element by element:
- We say V < W (V happens-before W) if: V[i] ≤ W[i] for every i, and V[i] < W[i] for at least one i.
- Similarly, W < V if the reverse holds.
- If neither condition is satisfied, then V ∥ W: the two events are concurrent.
Example:
Let’s return to the Lamport example, where scalar timestamps cannot tell whether a happens-before b or the two sends are concurrent.
Let’s examine the case where a happens-before b first.
At node C, the vectors attached to a and b are ⟨1, 0, 0⟩ and ⟨1, 1, 0⟩ respectively. Comparing componentwise:
- 1 ≤ 1
- 0 ≤ 1
- 0 ≤ 0
Every component of a’s vector is ≤ the corresponding component of b’s vector, and the two vectors are not equal. Hence, a happens-before b!
What if those events are concurrent?
At node C, the vectors are ⟨1, 0, 0⟩ and ⟨0, 1, 0⟩ respectively:
- 1 > 0
- 0 < 1
- 0 ≤ 0
Neither vector is ≤ the other in every component → a ∥ b: the two sends are concurrent.
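The componentwise comparison used above fits in one helper function (illustrative; a single scan decides the ordering or reports concurrency):

```java
// Componentwise vector-clock comparison from the rules above:
// -1 if v happens-before w, 1 if w happens-before v,
// 0 if the vectors are equal or the events are concurrent.
public class VectorCompare {
    public static int compare(long[] v, long[] w) {
        boolean vLeW = true, wLeV = true;
        for (int i = 0; i < v.length; i++) {
            if (v[i] > w[i]) vLeW = false;
            if (w[i] > v[i]) wLeV = false;
        }
        if (vLeW && !wLeV) return -1; // v < w
        if (wLeV && !vLeW) return 1;  // w < v
        return 0;                     // equal or concurrent
    }
}
```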
- Dynamo-style databases (e.g. Amazon Dynamo): Vector clocks track object versions to detect concurrent updates and trigger reconciliation. See Dynamo: Amazon’s Highly Available Key-value Store (SOSP 2007).
- Riak: Uses vector clocks to manage sibling versions and resolve conflicts. See e.g. Vector clocks revisited and the Riak docs.
- CRDT-based systems: Often carry causal metadata (vector clocks or compact variants) so concurrent updates can be merged safely. See Shapiro et al., A comprehensive study of Convergent and Commutative Replicated Data Types.
- Distributed debugging / causal tracing: Vector clocks (or derived orderings) help reconstruct what happened-before what across nodes when you debug races. See Kleppmann, “Logical clocks are easy” (ACM Queue).
Lamport and vector clocks provide a reliable way to preserve causality, but they have no connection to real-world time. Physical clocks, on the other hand, reflect actual time but are unreliable due to clock skew and synchronization issues.
Hybrid Logical Clocks (HLC) are designed to bridge the gap. An HLC timestamp is typically represented as a pair (pt, l).
The first component, pt, is derived from the system’s physical clock. The second component, l, is a logical counter used to ensure correct ordering when physical time alone is insufficient.
The core idea of HLC is:
Use physical time whenever possible, and fall back to logical time when necessary.
For most events, the physical clock provides a reasonable approximation of ordering. However, when clock skew or message delays would cause incorrect ordering, the logical component intervenes to preserve causality.
When a local event occurs, the node compares its current physical time with its last known HLC timestamp. If the physical clock has advanced, it updates the wall-clock component and resets the logical counter. Otherwise, it increments the logical counter to maintain monotonicity.
When sending a message, the node attaches its current HLC timestamp.
Upon receiving a message, the node updates its timestamp by taking the maximum of three values: its local physical time, its current HLC timestamp, and the received timestamp. If the wall-clock component remains unchanged, the logical counter is incremented to ensure that the receive event is ordered after the send event.
This mechanism ensures that causality is preserved even when physical clocks are not perfectly synchronized.
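Put together, the local/send and receive rules might look like this (a hedged sketch following the standard HLC algorithm; the wall clock is injected as a supplier so the behavior is deterministic and testable - class and method names are invented):

```java
import java.util.function.LongSupplier;

// Hybrid Logical Clock sketch: timestamps are (pt, l) pairs compared
// lexicographically. `physical` supplies wall-clock readings.
public class HybridLogicalClock {
    private long pt = 0, l = 0;
    private final LongSupplier physical;

    public HybridLogicalClock(LongSupplier physical) { this.physical = physical; }

    // Local or send event.
    public synchronized long[] now() {
        long wall = physical.getAsLong();
        if (wall > pt) { pt = wall; l = 0; } // physical clock advanced
        else           { l++; }              // fall back to the counter
        return new long[]{pt, l};
    }

    // Receive event: merge with the sender's timestamp (mpt, ml).
    public synchronized long[] receive(long mpt, long ml) {
        long wall = physical.getAsLong();
        long newPt = Math.max(wall, Math.max(pt, mpt));
        if (newPt == pt && newPt == mpt) l = Math.max(l, ml) + 1;
        else if (newPt == pt)            l = l + 1;
        else if (newPt == mpt)           l = ml + 1;
        else                             l = 0; // wall clock won outright
        pt = newPt;
        return new long[]{pt, l};
    }
}
```

If the wall clock is stuck at, say, 100 while a message stamped (100, 5) arrives, the logical counter jumps past the sender’s counter, so the receive is still ordered after the send.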
Pure physical clocks can violate causality. A message sent later may appear to occur earlier if the receiver’s clock is behind. This leads to incorrect ordering and subtle bugs in distributed systems.
With lexicographic comparison of HLC timestamps as pairs (not wall time alone), HLC is designed so that causal precedence implies timestamp order: if a → b, then HLC(a) < HLC(b).
At the same time, unlike purely logical clocks, HLC remains close to real-world time. This makes it suitable for applications that require both ordering and time-based semantics, such as snapshot reads or time-travel queries.
When NTP steps the system clock, any API backed by a time-of-day (wall-clock) source can jump forward or backward. That breaks the idea that end − start always equals real elapsed time: the difference can be too large, too small, or even negative.
In Java, System.currentTimeMillis() is such a clock: milliseconds since the Unix epoch. Use it for “what time is it?”, not for benchmarking.
System.nanoTime() uses a monotonic clock: it always moves forward and is not reset by stepping (NTP may still slew its rate, but discontinuities come from stepping the wall clock, not from nanoTime()). So nanoTime() is the right tool for timing doSomething()-style work on one machine.
Trade-off: a nanoTime() value is meaningless in isolation - it is time since an arbitrary origin (e.g. boot). Only differences between readings on the same node are defined. Comparing monotonic timestamps across different machines is not meaningful.
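In code, the difference looks like this (the 50 ms sleep is just a stand-in for real work):

```java
// Durations on one machine: use the monotonic clock, not the wall clock.
public class Timing {
    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime(); // monotonic; origin is arbitrary
        Thread.sleep(50);               // stand-in for doSomething()
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // elapsedMs is a true elapsed duration even if NTP steps the
        // wall clock (System.currentTimeMillis()) during the sleep.
        System.out.println("elapsed ms: " + elapsedMs);
    }
}
```

Had we used `currentTimeMillis()` for both readings, a backward step in between could have produced a negative “duration.”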
https://en.wikipedia.org/wiki/Marzullo's_algorithm
https://en.wikipedia.org/wiki/Network_Time_Protocol
https://www.youtube.com/watch?v=OKHIdpOAxto&list=PLeKd45zvjcDFUEv_ohr_HdUFe97RItdiB&index=10
https://martinfowler.com/articles/patterns-of-distributed-systems/lamport-clock.html
https://martinfowler.com/articles/patterns-of-distributed-systems/hybrid-clock.html
https://cs.stackexchange.com/questions/40061/happened-before-and-causal-order
https://www.cl.cam.ac.uk/teaching/2122/ConcDisSys/dist-sys-notes.pdf
https://riak.com/posts/technical/vector-clocks-revisited/index.html?p=9545.html