Understanding distributed time using Logical Clocks

To build a strong foundation in distributed systems, it’s essential to first understand the concept of distributed time. To begin this journey, we’ll start by exploring the landmark research paper “Time, Clocks, and the Ordering of Events in a Distributed System,” published by Leslie Lamport in 1978. This paper is a must-read for anyone designing or building distributed systems.

What is time?

Time is an abstract concept and is the fundamental dimension of our universe. Though time in physical clock that we use every day (Example - Apple Watch) is an implementation of humans to bring order to the life on earth.

The point I am trying to bring is how we implement the concept of time is dependent on the use case we are trying to solve.

What problem does time solve in physical world?

Time’s fundamental use case is to provide basic concept of order in which events occur.

How we measure time in physical clock?

There are two types of physical clocks commonly used -

Quartz based clock - The most common clock on most devices and servers are quartz clock that uses a quartz crystal’s vibrations to maintain time. Its accurate upto few seconds per month, and need to be adjusted to keep it in sync. The oscillations of quartz crystal is dependent on temperature. For example - In hot summer day or in hot server room, the clock will gets slow and need to be corrected.
Atomic Clock - These are expensive clocks and are used to attain high accuracy. For example - Satellites uses atomic clock. The concept of atomic clock is to use oscillations of atoms to calculate time. It is incredibly accurate with only gaining or losing 1 second in millions of years.

Why do we care about time?

We care about ordering of events, we want to know when one event happened before, ofter or concurrently with another.

What is the problem?

We use quartz based clock for most of the devices, and we understood it either goes slow or fast based on temperature. This means the clock need to be corrected to stay in sync with time defined by UTC.
This also means if you have 10 servers with each having their own clock, then each will have their own time. The time in server 1 will not be same as time in server 3.

This is a real problem, as time in different servers cannot be trusted. This means I cannot provide order to the events happening in distributed system using time in physical clock.

Logs at Server 1
Server 1 logs Event A sent at 10:00 local time
Server 1 logs Event E received at 10:08 local time
Logs at Server 2
Server 2 logs Event B received at 10:10 local time
Server 2 logs Event D sent at 10:12 local time

Ordering of events using local physical time of servers - A, E, B, D
Correct ordering of Events - A -> B, D -> E.

The Partial Ordering.

As we can see above using the server’s physical clock is not enough to tell the ordering of events. We need to define specification that explain the order of events.
Lamport came up with the concept of “happened before” relation without using physical clock. We can use “->” or “hb” to denote the “happened before” relation.

Definition of “happened before” relation

Theory Of Relativity Thought Experiment

The relation “->” on the set of events of a system is the smallest relation satisfying the following three conditions:

If a and b are events in the same process, and a comes before b, then a -> b.
If a is the sending of a message by one process and b is the receipt of the same message by another process, then a -> b.
If a -> b and b -> c then a -> c.

Two distinct events a and b are said to be concurrent if a -/-> b and b -/-> a.
Note: concurrent does not mean simultaneous. It just means event a does not have a causality relation with event b.

Logical Clocks

We can now define clocks in the system. We define a clock C_i for each process P_i to be a function which assigns a number C_i(a) to any event a in that process.

Clock Condition: For any events a, b: if a -> b then C(a) < C(b).

Using the happens before relation we can define clock conditions - C1. If a and b are events in process P_i, and a comes before b, then C_i(a) < C_i(b). C2. If a is the sending of a message by process P_i and b is the receipt of that message by process P_i, then C_i(a) < C_i(b).

Algorithm

Lets implement C1 and C2.

To satisfy Condition C1, we can simply assign numbers in increasing order to successive events in a process.

Condition C2, will require communication of local time between process.
Process P_i will include its local time C_i(a) while sending a message to P_j.
P_j on receiving the message will assign its time with Max(C_i(a), C_j(a)) + 1.

Example of logical clock in action

Limitation of logical clocks

Since it defines only scalar time, one cannot determine the ordering by simply examining their timestamps.

Learning to take away?

In a distributed system, understanding and enforcing causality is more fundamental than trying to mimic real-world physical time.

Lamport showed that logical time is sufficient for maintaining consistency and ordering in distributed systems — a revolutionary idea that has influenced almost all later work in distributed computing.

Here are few practical learnings -

Events in a distributed system form a partial order.
Getting true total order is impractical without introducing serious bottlenecks or single point of failures.
Most large scale distributed systems uses the partial order into an arbitrary total order. Making sure the total order is deterministic is important for one to have predictable behaviour. Without deterministic order it will become nightmare to debug and impossible to reason about the behaviour of distributed systems.

References

Time clocks research paper by Lamport published in 1978.