9 Sep 2018

Time, Clocks, and the Ordering of Events in a Distributed System

The paper can be found here - a highly recommended read!

What is this about?

In simple words: while the concept of time is fundamental to our day-to-day lives, it has to be examined carefully in the world of distributed systems. This paper talks about exactly that. The objective here is three-fold:
* Show that the happened-before relation on events in a distributed system is only a partial order
* Give a method to synchronize a system of logical clocks and thereby define a total ordering of events
* Use this total ordering to synchronize physical clocks

Terminology:

  • Distributed system: any system in which the message transmission delay is not negligible compared to the time between events in a single process, e.g. a set of spatially separated processes, or even a multi-threaded system
  • happened-before relation: a classic way of referring to the order of occurrence of events, based on which of the two occurred first

Towards Objective 1

The idea of time, which is derived from the order in which events occur, is fundamental to our day-to-day lives. (E.g. something is said to occur at 3:15 if it happens after the clock reads 3:15 and before it reads 3:16.) However, in distributed systems it is often impossible to say which of two events occurred first. The “happened-before” relation is therefore only a partial ordering of events in such a system.

Some FAQs (questions in case you are confused)

  • Why can’t I simply use a real clock?
    Ans: Because it is not part of the system and is not observable from within the system. One option could be to embed real clocks inside the processes to keep track of time, but such clocks are not perfectly accurate and do not keep precise time.
  • Can I not keep track of a single clock and the events that are occurring on the two systems?
    Ans: You could do that, but then you would always have to be a part of the system to guarantee the ordering.

Nonetheless, the following definition of the happened-before relation is one that all of us would agree to:

Definition (happened-before relation, –>): The relation –> on the set of events of a system is the smallest relation satisfying:
* If a and b are events in the same process, and if a comes before b, then a –> b
* If a is the sending of a message by one process and b is the receipt of the same message by another process, then a –> b
* Transitivity: if a –> b and b –> c, then a –> c
Clearly, –> is neither reflexive nor symmetric.
(Intuitively: we can also think of this relation as saying that it is possible for event a to causally affect event b if a –> b)

Corollary: Two distinct events a and b are concurrent if a -/-> b and b -/-> a
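
To make the definition concrete, here is a minimal sketch in Python that builds the happened-before relation for a toy trace from the two base rules plus transitivity, and then tests concurrency. The event names and the (process, index) representation are mine, purely for illustration:

```python
from itertools import product

# Events are (process, index) pairs; the index gives the order within a process.
events = [("P", 1), ("P", 2), ("Q", 1), ("Q", 2)]
# Messages as (send_event, receive_event) pairs.
messages = [(("P", 1), ("Q", 2))]

# Rule 1: same process, earlier event.  Rule 2: a send happens before its receipt.
hb = {(a, b) for a, b in product(events, events)
      if a[0] == b[0] and a[1] < b[1]}
hb |= set(messages)

# Rule 3: transitive closure.
changed = True
while changed:
    changed = False
    for (a, b), (c, d) in product(list(hb), list(hb)):
        if b == c and (a, d) not in hb:
            hb.add((a, d))
            changed = True

def concurrent(a, b):
    """Two distinct events are concurrent iff neither happened before the other."""
    return a != b and (a, b) not in hb and (b, a) not in hb

print((("P", 1), ("Q", 2)) in hb)        # True: P1 --> Q2 via the message
print(concurrent(("P", 2), ("Q", 1)))    # True: neither can causally affect the other
```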

Process Diagram

Consider the image above. It represents time in the vertical direction. The dots denote events and the wavy lines denote messages. Clearly p1 –> r4, while p3 and q3 are concurrent. (Even though q3 appears to occur at an earlier physical time than p3, process P cannot know what process Q did at q3 until it receives the message at p4.)

Towards Objective 2

The author here describes a way to synchronize logical clocks across processes, which lets us extend the above happened-before relation into a total order. This is just a way to assign numbers to our events, where the number can be thought of as the time at which the event occurred.

  • Let there be a clock Ci for the process Pi, where Ci(a) assigns a number to any event a in the process Pi
  • Let C represent the system of clocks across all processes, i.e. C(a) = Ci(a) if a is an event in process Pi
  • For C to be correct, it should obey the happened-before relation. This condition is stated as follows:

Clock Condition: For any events a, b: if a –> b then C(a) < C(b)

Observe that the converse need not hold, since that would imply that any two concurrent events must occur at the same logical time.

  • It is easy to see that the clock-condition is satisfied if the following hold:
    • C1: if a and b are events in process Pi and a comes before b, then Ci(a) < Ci(b)
    • C2: if a is the sending of a message by Pi and b is the receipt of this message by Pj, then Ci(a) < Cj(b)
  • These conditions C1 and C2 are in turn satisfied by the following implementation rules, respectively:
    • IR1: Each process Pi increments Ci between any two successive events
    • IR2: If a is the sending of a message by process Pi, then the message m contains a timestamp Tm = Ci(a). Upon receiving this message, process Pj sets Cj greater than or equal to its present value and greater than Tm
  • Using the above system of logical clocks, it is rather straightforward to place a total ordering (==>) on the events: simply order the events by the times at which they occur, and break ties using an arbitrary fixed ordering of the processes. (A minimal sketch of these rules in code follows below.)
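
A minimal sketch of IR1, IR2 and the tie-breaking total order, written in Python. The class and method names are my own, and the toy exchange at the end is only there to illustrate the Clock Condition:

```python
class LamportProcess:
    def __init__(self, pid):
        self.pid = pid          # process id, used only to break timestamp ties
        self.clock = 0          # Ci

    def local_event(self):
        # IR1: increment the clock between successive events.
        self.clock += 1
        return (self.clock, self.pid)

    def send(self):
        # IR2(a): a send is an event; the message carries the timestamp Tm.
        self.clock += 1
        return self.clock

    def receive(self, tm):
        # IR2(b): advance the clock past both its present value and Tm.
        self.clock = max(self.clock, tm) + 1
        return (self.clock, self.pid)

# Total ordering ==> : order events by timestamp, break ties by process id.
def total_order_key(event):
    timestamp, pid = event
    return (timestamp, pid)

p, q = LamportProcess("P"), LamportProcess("Q")
a = p.local_event()     # (1, 'P')
tm = p.send()           # message stamped with Tm = 2
b = q.receive(tm)       # (3, 'Q') -- Clock Condition: C(a) < C(b)
c = q.local_event()     # (4, 'Q')
print(sorted([c, b, a], key=total_order_key))   # events in ==> order
```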

  • This total order relation is not unique. Different choices of clock yield different relations.

  • This total ordering of events is very important in distributed systems. This can be illustrated by using it to solve the distributed version of the mutual exclusion problem:
    • Consider a system composed of a fixed collection of processes that share a single resource.
    • We wish our mutual exclusion algorithm to satisfy: (a) a process which has been granted the resource must release it before another process can be granted access to it; (b) requests must be granted in the order in which they were made; (c) starvation freedom: if every process granted the resource eventually releases it, then every request is eventually granted.
  • This problem is non-trivial and cannot be solved simply by using a central master/scheduling process. To see this:
    • Suppose P0 is the master process.
    • Say P1 sends a request to P0 and then sends a message to P2.
    • Upon receiving this message, P2 sends its own request to P0.
    • It is possible that P0 receives the request from P2 before the request from P1, in which case condition (b) is violated.
  • To solve this, we use the logical clocks described by IR1 and IR2 to define a total ordering of events. With this ordering, the solution is straightforward.

  • Assume messages are received in the order in which they are sent (this assumption can be avoided by introducing message numbers), and that each process maintains its own request queue, hidden from the others. Initially each queue contains T0:P0, where P0 is the process initially granted the resource and T0 is less than the initial value of any clock.

  • Algorithm to obtain mutual exclusion satisfying the above conditions (a rough sketch in code follows this list):
    • To request the resource, Pi sends the message Tm:Pi to every other process and puts that message on its own request queue, where Tm is the timestamp of the message.
    • When Pj receives Tm:Pi, it places it on its request queue and sends a (timestamped) acknowledgement to Pi
    • To release the resource, Pi removes any Tm:Pi message from its request queue and sends a (timestamped) Pi releases resource message to every other process
    • When Pj receives a Pi releases resource message, it removes any Tm:Pi message from its request queue
    • Process Pi is granted the resource when the following conditions are satisfied: (a) there is a Tm:Pi message in its request queue ordered (by ==>) before any other request; (b) Pi has received a message from every other process timestamped later than Tm
  • This method allows one to implement any desired form of multiprocess synchronization in a distributed system. However, the algorithm requires the active participation of all processes, and a single process failure will halt it. Note also that without physical time there is no way to distinguish a failed process from one which is just pausing between events.
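
Below is a rough, single-machine simulation of these rules in Python, with message delivery collapsed into direct method calls. The class and method names are mine, and the initial T0:P0 entry and message reordering concerns are ignored to keep the sketch short; it is meant only to make the bookkeeping concrete, not to be a faithful implementation:

```python
import heapq

class Process:
    def __init__(self, pid, peers):
        self.pid = pid
        self.clock = 0
        self.peers = peers                        # all process ids, including self
        self.queue = []                           # request queue of (Tm, pid) entries
        self.last_seen = {p: 0 for p in peers}    # latest timestamp heard from each peer

    def tick(self):
        self.clock += 1
        return self.clock

    def request(self, network):
        # Rule 1: broadcast a timestamped request and enqueue it locally.
        tm = self.tick()
        heapq.heappush(self.queue, (tm, self.pid))
        for p in self.peers:
            if p != self.pid:
                network[p].on_request(tm, self.pid, network)
        return tm

    def on_request(self, tm, sender, network):
        # Rule 2: enqueue the received request and send back a timestamped ack.
        self.clock = max(self.clock, tm) + 1
        heapq.heappush(self.queue, (tm, sender))
        self.last_seen[sender] = max(self.last_seen[sender], tm)
        network[sender].on_ack(self.tick(), self.pid)

    def on_ack(self, tm, sender):
        self.clock = max(self.clock, tm) + 1
        self.last_seen[sender] = max(self.last_seen[sender], tm)

    def release(self, network):
        # Rule 3: drop the own request and tell everyone it has been released.
        self.queue = [(t, p) for (t, p) in self.queue if p != self.pid]
        heapq.heapify(self.queue)
        tm = self.tick()
        for p in self.peers:
            if p != self.pid:
                network[p].on_release(tm, self.pid)

    def on_release(self, tm, sender):
        # Rule 4: drop the sender's request from the local queue.
        self.clock = max(self.clock, tm) + 1
        self.last_seen[sender] = max(self.last_seen[sender], tm)
        self.queue = [(t, p) for (t, p) in self.queue if p != sender]
        heapq.heapify(self.queue)

    def granted(self, own_tm):
        # Rule 5: own request is first in the queue, and every peer has been
        # heard from with a timestamp later than the request.
        first = bool(self.queue) and self.queue[0] == (own_tm, self.pid)
        heard = all(self.last_seen[p] > own_tm for p in self.peers if p != self.pid)
        return first and heard

network = {pid: Process(pid, ["P1", "P2"]) for pid in ["P1", "P2"]}
t = network["P1"].request(network)
print(network["P1"].granted(t))   # True: P1's request is first and P2 has acknowledged it
network["P1"].release(network)
```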

Towards Objective 3: Introducing physical clocks

Consider the following situation. Suppose a person issues request A on computer A and then telephones a friend in another city to have him issue request B from computer B. It is possible for request B to receive a lower timestamp and be ordered before request A. This can happen because the system has no way of knowing that A actually preceded B, since that precedence information is external to the system.

Let T be the set of all system events. Let T1 be the set of events in T together with the relevant external events. Let ~~> denote the happened-before relation on T1. There is no way for any algorithm based solely on the events of T to correctly order A before B, since internal to the system A -/-> B. So, to avoid the anomalous behaviour, we can either ask the user to incorporate the external information into the system by explicitly assigning event B a timestamp greater than that of request A, or we can construct a system of clocks that satisfies the following condition:

Strong Clock Condition: For any events a, b in T1:
if a ~~> b then C(a) < C(b)

Note that this condition is not generally satisfied by our logical clocks. Let us identify T1 with some set of real events and let ~~> be the partial ordering of the events in T1.

Physical Clocks

  • Let Ci(t) be the reading of clock Ci at physical time t (also assume Ci to be continuous except when we reset it)
  • We want this Ci to be closely synchronized with the physical time to avoid anomalies as above. Hence, we want:
    • PC1: There exists a constant k << 1, such that |dCi(t)/dt - 1| < k (i.e. Ci runs at approximately the same rate as physical time)
    • PC2: For all i, j: |Ci(t) - Cj(t)| < e, for a small constant e (i.e. it is not enough for the clocks to run at approximately the same rate; they must also stay synchronized with each other at all times, within a tolerance of e)
  • Since it is practically impossible for two clocks to run at exactly the same rate, they will slowly drift apart and must therefore be resynchronized from time to time to maintain PC2.
  • The goal of the whole exercise is to avoid the anomalous behaviour observed previously, where events outside the system cause the weird ordering to be observed. Hence, we must ensure that the strong clock condition holds.
  • Let u > 0 be a constant less than the shortest transmission time for interprocess messages. This means that for any events a and b, where a occurs at physical time t and a ~~> b, b must occur at a physical time later than t + u.
  • Avoiding anomalous behaviour then amounts to ensuring Ci(t+u) - Cj(t) > 0 for all i, j, t. Together with PC1 and PC2, this gives the inequality e <= (1-k)*u (see the image below)

Math
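
For completeness, here is a short reconstruction of the derivation from PC1 and PC2 (the steps below are my own write-up):
  • From PC1, Ci runs at a rate greater than 1 - k, so Ci(t+u) - Ci(t) > (1-k)*u
  • From PC2, Ci(t) > Cj(t) - e
  • Adding the two: Ci(t+u) - Cj(t) > (1-k)*u - e, which is positive whenever e <= (1-k)*u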

  • This inequality, along with PC1 and PC2, ensures that anomalous behaviour will never occur. PC1 is relatively easy to establish: crystal-controlled clocks have k <= 1e-6.
  • Following is an algorithm to ensure PC2 is followed:
    • Let m be a message sent at physical time t and received at time t1.
    • Define vm = t1 - t, the total delay of m. Clearly u < vm. Call em = vm - u the unpredictable delay of m.
    • Change IR1 and IR2 to the following:
      IR1': For each i, if Pi does not receive a message at time t, then dCi(t)/dt > 0
      IR2': (a) If Pi sends a message m at time t, then m contains a timestamp Tm = Ci(t).
      (b) Upon receiving m at time t1, process Pj sets Cj(t1) to max(Cj(t1 - 0), Tm + u), where Cj(t1 - 0) is the clock value just before the receipt

    • Note that although the definitions above refer to physical time, implementing the rules only requires each process to know its own clock reading, which runs at some rate except when it is reset.
  • Using the above implementation rules, not only do we obtain PC2, but we can also establish theoretical bounds on how long it can take for the clocks to become synchronized when the system is first started. That theorem, however, is a little mathematically involved.
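
A rough sketch of how the modified rules might look in code, assuming each process can read a local hardware timer. The class, the use of time.monotonic() as the local timer, and the value of u are my own choices for illustration:

```python
import time

class PhysicalClock:
    """Toy physical clock: it advances with a local hardware timer (IR1') and is
    pushed forward, never backward, on message receipt (IR2')."""

    def __init__(self, u):
        self.u = u                      # assumed minimum message transmission delay
        self.offset = 0.0               # correction added on top of the local timer
        self._base = time.monotonic()   # stand-in for the local hardware clock

    def now(self):
        # IR1': between message receipts the clock simply runs forward with the timer.
        return (time.monotonic() - self._base) + self.offset

    def timestamp_for_send(self):
        # IR2'(a): an outgoing message m carries Tm = Ci(t).
        return self.now()

    def on_receive(self, tm):
        # IR2'(b): set the clock to max(its value just before receipt, Tm + u).
        current = self.now()
        target = tm + self.u
        if target > current:
            self.offset += target - current
        return self.now()

sender, receiver = PhysicalClock(u=0.001), PhysicalClock(u=0.001)
tm = sender.timestamp_for_send()
print(receiver.on_receive(tm) >= tm + 0.001)   # True: the receipt is stamped after Tm + u
```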

Discussion

A very important point in the world of distributed systems is that the ordering of events is only a partial order. This idea is useful in reasoning and thinking about distributed systems, and the paper helps concretise the fact in our minds. Extending this to form a total order was relatively straightforward. Using this to synchronize the physical clocks is a nice technique, since we often do not realize how much such anomalous behaviour confuses us in our regular dealings with distributed systems.

