11 May 2017

Epidemic Algorithms for Replicated Database Management


Why use?

Maintaining mutual consistency across sites on updates, insertions and deletions in a replicated database is a non-trivial and significant problem. Though it sounds reasonable to maintain a list of all replication servers and send direct updates to all of them when an update occurs at a site, this places a large network load on the link of the node holding the initial update. Moreover, when nodes are constantly joining and leaving, consistently maintaining a list of a million, or even a few hundred thousand, nodes at every site is itself difficult. In the face of these problems, the algorithms described in the paper come in handy.

The described algorithms have been used in the clearinghouse servers of the Xerox Corporate Internet and have proven to be very useful.

Formal introduction to the Problem

Consider a database replicated at many sites in a large, heterogeneous, slightly unreliable and slowly changing network of hundreds or thousands of servers. The paper discusses ways in which consistency can be maintained across these sites. The model restrictions under which the methods are formulated are as follows:

  • Each update is injected at a single site and it or a later update on the same data must be propagated to all the sites

  • A relaxed form of consistency, where databases are allowed to be in inconsistent states for a short time but eventually converge to a fully consistent state for each update.

Important factors that we take into account as performance metrics for evaluating the proposed algorithms:

  • Time required for an update to propagate to all sites

  • Network traffic generated in propagating a single update to all sites

The methods discussed in the paper are:

  • Direct mail: where each new update is directly mailed/sent from its entry site to all others. Not entirely reliable when individual sites do not know about all the sites in the network.

  • Anti-entropy: every site regularly chooses another site at random and resolves any differences between their database contents by exchanging messages. Though this is reliable, the rate at which an update propagates through this method is relatively low.

  • Rumor mongering: sites are initially ignorant. When a new update arrives at a site, it becomes a hot rumor, and the entry site periodically chooses another site at random and ensures that site receives the update. On receiving the update, other sites follow the same procedure. When a site has tried to share a hot rumor with too many sites that have already seen the update, the rumor ceases to be hot and is eventually dropped. This is a faster approach to propagating updates, but it does not guarantee eventual consistency at all sites.

The latter two methods described above fall under the category of epidemic algorithms, because they spread the way an epidemic does. In epidemic terminology, a site can be in one of three states:


Infective: A site that is holding the update and is willing to propagate it

Susceptible: A site that has not yet received the update.

Removed: A site that has the update but is no longer willing to share it.

Formal Description of replicated database and the above epidemic methods

Consider a network containing set S of n sites, each storing a copy of a database. The database copy at a site s belonging to S is a time varying partial function:

s.valueOf: K -> (v:V x t:T)

where K is a set of keys (names), V is the set of values (NIL meaning the element identified by key k has been deleted), and T is a set of timestamps.

Further discussion is simplified by considering a database that stores the value and timestamp for a single key. Therefore, s.valueOf is an element of (v:V x t:T).

Direct Mail:

This method attempts to notify all other sites as soon as an update occurs. The basic algorithm at each site is as follows:

If s is the first site that sees the update, then:

	FOR EACH s' in S DO
		PostMail to s', msg: ("Update", s.valueOf)

Upon receiving the message ("Update", (v,t)), site s executes:

IF s.valueOf.t < t THEN
	s.valueOf <- (v,t)

The reliability of this method is a function of the reliability of the network, its queues, and the underlying protocols. Though this approach generates only n messages per update, all of them traverse the link of the node sending the update, and this link becomes the overall bottleneck in the process, so the approach cannot scale beyond certain limits. Also, maintaining a consistent list of all live nodes is difficult in a scenario where nodes are allowed to leave and join the network.
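The direct mail scheme above can be sketched in a few lines of Python (my own illustration, not the paper's code; a site is just a {"v": value, "t": timestamp} dict standing in for s.valueOf):

```python
def receive_update(site, v, t):
    """Apply an ("Update", (v, t)) message only if it is newer than what we hold."""
    if site["t"] < t:
        site["v"], site["t"] = v, t

def direct_mail(sites, entry, v, t):
    """The entry site applies the update, then mails it to the n-1 other sites."""
    receive_update(sites[entry], v, t)
    for i, s in enumerate(sites):
        if i != entry:
            receive_update(s, v, t)  # stands in for PostMail over the network

# Five sites, update injected at site 0:
sites = [{"v": None, "t": 0} for _ in range(5)]
direct_mail(sites, entry=0, v="x", t=1)
```

Note that all n-1 `receive_update` calls originate from the single entry site, which is exactly the O(n) traffic bottleneck described above.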


There are three simplest kinds of anti-entropy algorithms, based on pushing an update, pulling an update, or a hybrid (push-pull) of the two.

Each node s in the network periodically runs the following algorithm:

	FOR SOME s' in S DO
		ResolveDifference[s, s']

where the ResolveDifference procedure can use one of the techniques mentioned above:


Push:

ResolveDifference: Proc[s, s'] = {
	IF s.ValueOf.t > s'.ValueOf.t THEN
		s'.ValueOf <- s.ValueOf;
}

Pull:

ResolveDifference: Proc[s, s'] = {
	IF s.ValueOf.t < s'.ValueOf.t THEN
		s.ValueOf <- s'.ValueOf;
}

Push-pull:

ResolveDifference: Proc[s, s'] = {
	SELECT TRUE FROM
		s.ValueOf.t > s'.ValueOf.t => s'.ValueOf <- s.ValueOf;
		s.ValueOf.t < s'.ValueOf.t => s.ValueOf <- s'.ValueOf;
	ENDCASE => NULL;
}

Different methods can be used to choose site s' to suit different network topologies. Borrowing results from epidemic theory, the paper states that the expected time for an update to propagate to all sites is proportional to the logarithm of the population size, roughly log2(n) + ln(n) + O(1) for large n.

This method ensures that even if direct mail leaves out a few sites, the update will eventually be distributed to all of them. Clearly, though, it is too slow to be used as a standalone method for distributing updates; it generally works well as a backup after the update has initially been distributed to a good majority of nodes.

Mathematical analysis

Let p_i be the probability that a site remains susceptible after i rounds.

Then for a pull-based algorithm, we can easily see that

p_{i+1} = (p_i)^2

because a site remains susceptible after round i+1 only if it was susceptible after round i and the site it pulled from was also susceptible.

For a push-based algorithm, the relation turns out to be:

p_{i+1} = p_i (1 - 1/n)^{n(1 - p_i)}

(Explanation: the site must have been susceptible after round i, with probability p_i. There are roughly n(1 - p_i) infective sites, and each one fails to pick this particular site with probability 1 - 1/n, giving the factor (1 - 1/n)^{n(1 - p_i)}.)

For large n and small p_i, this becomes:

p_{i+1} = p_i e^{-1}

The above expressions suggest that pull or push-pull based algorithms are greatly preferable to push.
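Iterating the two recurrences side by side (a quick numerical check of my own) makes the gap concrete: pull's residual probability squares every round, while push only shaves off roughly a factor of e per round.

```python
# Compare pull (p_{i+1} = p_i^2) with push (p_{i+1} = p_i (1-1/n)^{n(1-p_i)})
# starting from the same fraction of susceptibles.
n = 1000
p_pull = p_push = 0.5
for _ in range(5):
    p_pull = p_pull ** 2
    p_push = p_push * (1 - 1 / n) ** (n * (1 - p_push))

# After 5 rounds, p_pull is below 1e-9 while p_push is still around 1e-2.
print(p_pull, p_push)
```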

The paper also discusses a few techniques to reduce the expense of comparing the full databases at two nodes during anti-entropy, such as comparing a hash of the most recent few entries. These can be checked out in the paper, though it does not discuss them in much detail.


Anti-entropy is too slow to be used as a standalone mechanism for propagating updates. At the same time, direct mail has the problem of a bottleneck link at the source of the update. So we require a fast propagation mechanism that does not produce too much traffic at the update source. Rumor mongering is one such mechanism: a complex epidemic algorithm that overcomes the O(n) bottleneck at the originating site that occurs in the direct mail scenario. However, this kind of complex epidemic algorithm has a finite probability of failing to propagate an update to all sites.

General rumor-spreading scenario: say there is a town of n people, and a person 'A' gets hot news from somewhere outside. He calls a few people he knows and tells them the news. These people, excited by the news, tell it to their friends, and the hot rumor spreads fast like a forest fire. But soon a person realises that most other people already know the news, and loses interest in calling others to tell them. Eventually people stop talking about it. In the case of update propagation, the update is the hot news and the nodes are the people. When an active individual makes an unnecessary call (to a site that already knows the update), the active site loses interest in sharing the rumor with probability 1/k.
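The loss rule can be put into a small simulation (my own sketch; the function name and parameters are illustrative, not from the paper):

```python
import random

def rumor_monger(n=1000, k=2, seed=1):
    # S = susceptible, I = infective (knows the hot rumor), R = removed.
    random.seed(seed)
    state = ["S"] * n
    state[0] = "I"                         # the update is injected at one site
    while "I" in state:
        for i in range(n):
            if state[i] != "I":
                continue
            j = random.randrange(n)        # call a random individual
            if state[j] == "S":
                state[j] = "I"             # the rumor spreads
            elif random.random() < 1 / k:  # they already knew it:
                state[i] = "R"             # lose interest with probability 1/k
    return state.count("S") / n            # residue: fraction never reached

print(rumor_monger())
```

The returned residue is nonzero in general, illustrating why rumor mongering alone cannot guarantee that every site receives the update.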

Mathematical analysis:

Let s, i and r represent the fractions of individuals susceptible, infective and removed respectively, so that s + i + r = 1. The dynamics are:

ds/dt = -si

di/dt = si - (1/k)(1-s)i

Solving the above equations we have:

i(s) = (k+1)(1-s)/k + ln(s)/k

The constant of integration was determined using the condition i(1) = 0.


s = e^{-(k+1)(1-s)}

Residue: the value of s when i = 0, that is, the fraction of sites still susceptible when the epidemic finishes. We would like the residue to be as small as possible, and the above analysis shows it can be made arbitrarily small by increasing k.
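The implicit equation s = e^{-(k+1)(1-s)} can be solved numerically (a sketch of my own) to see how fast the residue falls with k: roughly 0.20 at k = 1, 0.06 at k = 2, and under 0.003 at k = 5.

```python
import math

def residue(k, iters=200):
    # Fixed-point iteration on s = exp(-(k+1)(1-s)). At the nontrivial root
    # the derivative is (k+1)*s*, which is below 1, so iteration converges.
    s = math.exp(-(k + 1))   # first iterate, starting from s near 0
    for _ in range(iters):
        s = math.exp(-(k + 1) * (1 - s))
    return s

for k in (1, 2, 3, 4, 5):
    print(k, residue(k))
```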

Traffic: average messages sent per site, measured as m = (total messages)/(number of sites) for simplicity.


t_avg: the average delay, i.e. the difference between the time of the initial injection of an update and the arrival of the update at a given site, averaged over all sites.

t_last: the delay until reception by the last site that will receive the update during the epidemic. Update messages may continue to appear in the network after t_last, but they will never reach a susceptible site.

The paper gives some analysis based on the above metrics for a few variations of the complex epidemic algorithms. They mainly differ in deciding when an infective site loses interest and becomes removed; some variations are blind vs. feedback and counter vs. coin.

The paper then proposes using an anti-entropy algorithm alongside a complex epidemic for propagating updates. This hybrid technique ensures that the update is eventually heard by all the sites, while a majority of sites still receive it quickly. It also avoids the O(n) bottleneck of direct mail.

Deletion and Death Certificates

With the above techniques, deletion seems non-trivial, since communicating the absence of an entry can be difficult. It turns out it is not: simply propagating a death certificate corresponding to the deleted key does the job.

During propagation, when old copies of deleted items meet death certificates, the old items are removed. If we keep a death certificate long enough, it eventually cancels all old copies of its associated item. So the next obvious question is: when should we delete the death certificate itself (since we surely do not want to store it forever)? This turns out to be a difficult problem.

A key thing to note: if the deletion has not been conveyed to all sites and the death certificate is deleted from everywhere, other processes may transfer the key's value from the non-deleted sites to others, and eventually the entry will be re-established.

One strategy to overcome this is for a few sites to retain death certificates longer than usual, even after the deletion rumor is no longer hot. If such a site later sees an update for the key to which a stored death certificate corresponds, the certificate can be reactivated and transmitted again. These are called dormant death certificates.

An easy implementation of Dormant Death Certificate

The implementation uses two thresholds, t1 and t2, and attaches a list of r retention site names to each death certificate. When a death certificate is created, its retention sites are chosen at random. The death certificate is propagated by the mechanism used for ordinary updates. Each server retains all death certificates timestamped within t1 of the current time. After t1, most servers delete the death certificate, but every server on its retention site list saves a dormant copy. Dormant death certificates are discarded once t1 + t2 has passed.
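The retention rule can be sketched as follows (my own simplified illustration; names and structure are assumed, not the paper's code):

```python
import random

def retention_sites(servers, r, seed=0):
    # The r retention sites are chosen at random when the certificate is created.
    random.seed(seed)
    return set(random.sample(servers, r))

def keeps_certificate(server, cert, now, t1, t2):
    # Decide whether this server still holds the given death certificate.
    age = now - cert["created"]
    if age <= t1:
        return True                                   # every server keeps it for t1
    if server in cert["retainers"] and age <= t1 + t2:
        return True                                   # dormant copy at retention sites
    return False                                      # discarded after t1 + t2
```

So within t1 all servers can cancel stale copies; between t1 and t1 + t2 only the r retention sites can, which is enough to reactivate the certificate if a stale update resurfaces.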

Spatial Distributions:

The paper lastly discusses the effect of network topology (or spatial distribution of servers, as they call it) on these mechanisms and how to modify them accordingly. Up to this point we have regarded the network as uniform, but in reality the cost of sending an update to a nearby site is much lower than the cost of sending it to a distant site. To reduce communication costs, we would like the protocols to favor nearby neighbors, but there are drawbacks to too much local favoritism. The paper explores the tradeoffs in the various design decisions; readers planning to actually implement these mechanisms may find it important to go through this section.

My take and conclusion

The paper replaces the deterministic algorithms used previously with the randomized algorithms discussed above, which, as the paper indicates, significantly improved performance and the ease of maintaining consistency across replicated servers. The simple observation that anti-entropy behaves much like an epidemic led to considering other epidemic-like algorithms such as rumor mongering. We may be able to devise more complex epidemic-like algorithms with even better performance, based on our observation of the real world.

The paper overall is very well presented and successfully tackles the problem of consistency amongst replicas. It was revolutionary progress in the field of distributed systems, and many similar protocols have arisen since then which may be of interest to the reader. I hope this was helpful in understanding the key features of the paper.
