14 Oct 2018

CAP Theorem

Some context

One of the most popular impossibility result in the world of distributed computing, the CAP Theorem has frequently confused people and has been misinterpretated by many. In here, I try to write in layman terms what the theorem means and what are its implications in your choice of systems.

Introduction

We often hear people stating the CAP Theorem states as:

Of three properties:

  • consistency (hereafter referred to as C)
  • availability (or A)
  • partition tolerance (P)

only two can be satisfied simultaneously.

This statement however places C, A and P in equal terms, which is bit of a mistake. Also, having any or all of these properties are not actually binary choices and hence stating the theorem as above is some way exploiting language! One needs to understand what underlying meaning behind the terms is. So, lets jump into the crux of it. We will go into the topic in the following order for peadagogical purposes:

  • Abstractions & System / Fault Model: discussing this is important because algorithms or any system is eventually built over certain abstractions. The kinds of things that can be assumed to be available and the things that are to be handled by the algorithm is further defined by the algorithm.
  • CAP Theorem: we then discuss the theorem, its implications and the intuition behind it
  • Take Aways

Abstractions & System Model

Abstractions, fundamentally are ways of taking a system, shredding the unnecessary things from it, presenting the rest in a more manageable fashion and calling it another system. A System Model is the description of these abstractions that are built into the systems and the ones that aren’t. Any distributed algorithm attains its objectives within the purview of these system specifications.

Fault Model similarly is the specification of kinds of faults we might expect in our system and thing that our distributed algorithm must be capable of handling.

In general, the description of both can be specified by answering the following questions:

  • Capabilities of the nodes and their failure models. Some of which are:
    • Crash-Stop failure: The node crashes and stops functioning
    • Crash-recovery failure: The node may crash and recover to function correctly.
    • Byzantine failure: Nodes might continue to function but behave mischeiviously! - this is one of the tougher ones to handle (i.e. takes more effort and performance hit and is not generally handled in any commercial systems)
  • Capabilities of the communication channel
    • Can the links fail?
    • Is there an upper bound on message delivery times?
    • Are messages delivered in the order that they were published? - (using TCP handles it, while systems relying on UDP don’t)
  • Clock, synchronization and ordering:
    • Synchronous: Processes execute in lock-step; upper bound on transmission delay, each process has accurate clock:
    • Aysnchronous: Processes execute at independent rate; no bounds on transmission delays; no accurate clocks!

CAP Theorem

As stated previously, the CAP theorem states that of three properties:

  • consistency (hereafter referred to as C)
  • availability (or A)
  • partition tolerance (P)

only two can be satisfied simultaneously

However, the notion of ‘2 out of 3’ is greatly misleading (as also pointed by Eric Brewer - who first put forth this conjecture). The reason for this is two-fold:

  • C and A are not really binary properties (unless we limit ourselves to strong consistency)
  • C, A, P cannot be considered on equal footings

If the things, don’t yet look clear to you, keep moving as we try to clear the smoke of the air.

Consistency and Availability

So what’s Consistency?

If querying any node in the cluster returns you the same result, it is said to be consistent. However, this is the strongest form of consistency. In CAP, the C means … surprize, surprize …. linearizability.

Consistency, more generally speaking, can be divided into:
* Strong Consistency:
* Linearizability: all operations appear to have executed atomically in an order that is consistent with the global real-time ordering of operations
* Sequential Consistency: all operations appear to have executed atomically in some order that is consistent with the order seen at individual nodes and that is equal at all nodes.

While linearizability requires order in which operation take effect to be equal to the actual real time ordering of operations, sequential consistency allows their reordering as long as order observed at each node remains consistent with each other and with that of their individual program orders. The difference, clearly is irrelevant and not noticeable to the end user interacting with the system.

  • Weak Consistency: Anything that is not strong consistency is weak consistency. A few commonly used kinds:
    • Client-centric consistency:guarantee that a client will never see older versions of a data item, from what they have already seen. This however, does not require that the client be shown the latest data immediately.
    • Eventual consistency: This form of consistency says that after some undefined amount of time, all replicas will agree on the same values and be consistent.

WTF is Availability?

A system is said to be available under a fault scenario when it continues to function through failures or that fault scenario. That is, a system serving requests even though one or more components of it might have crashed or failed is said to be available under that fault scenario. As consistency, availability again is not binary because of the term fault scenario. A system might be able to handle one or more kinds of faults that occur. While the CA kind of systems (mentioned in the section below) cannot handle any node failures, a CP system of 2n+1 nodes can handle upto n non-byzantine faults (that is it continues to function as long as the majority of n+1 can be maintained)

C, A and P of the cap theorem - not on equal footings

While, we may choose between C and A, partition tolerance or P has to be either provided by abstractions underneath (i.e. assumed to be present) or should be taken to be absent. Also, while there is no partition, the system can be both consistent and available. It is at the time of partition in a distributed system that we have to choose between strong consistency and availability.

  • CA systems: If we assume that there are no partitions that are going to occur or that we may not need to be resilient to them and we might be able to offer the CA form of a system. Examples of these include any full strict quorum kind of systems such as a 2-phase commit or 3-phase commit. The C and A are fairly obvious in these. However, a node failure (which cannot be distinguished from the network failure to that node) can be said to result in a partition, under which the CA system halts to remain consistent. So, frankly the A in here serves no real value. (If you are not sure what 2-phase commit and 3-phase commits are, let me know in the comments and I might write another post on it :-) )

  • CP systems: These include systems with majority quorum protocols such as Raft or Paxos. In these, since the decisions are made by the majority only the majority partition remains available, while the nodes in the minority partition are rendered unavailable.

  • AP systems: Any system that can resolve conflicts in incosistent data is a good example of this kind (such as Amazon Dynamo). Since the system continues to function through the network partition, different parts of it might have different data and hence the inconsisteny resolution is required.

This is depicted below:

CAP Venn Diagram

src: Distributed Systems for fun and profit

Intuition:

The intuition behind cap theorem is fairly straightforward once understood. Hopefully, the following illustration helps this understanding:

Consider two nodes that form a distributed system. Say, A and B. When there is a network partition between them, they cannot communicate. At this point you can either, we are left with two choices:

  • Allow the nodes to get out of sync (giving up consistency), or

  • Consider the system to be “down” (giving up availability)

Take Aways

  • In any distributed system, at the time of partition, the choice between C and A is binary only when dealy with strong form of consistency

  • It is possible to be both available and consistent at times, when there are no network partitions. Hence, choosing CP or AP during a partition doesn’t imply that the system is not available or consistent at other times.

  • We might be able to attain both availability and consistency during a network partition, should we be okay with weaker forms of consistency.

  • In the end, a good rephrasing of the CAP theorem from above might be as follows: In a distributed data store, at the time of network partition you have to chose either Consistency or Availability and cannot get both. (src: SO ans)

Feel free to comment your doubts or discussions in comments section below.


Tags:
Visitors: visitor counter
Stats:
0 comments