The basic protocol is first discussed which only has the safety gurantee. A simple method of extension will then be described to reduce contention on proposed values and make it more lively.
Stages in the algorithm:
Nodes receiving the propose(n) reply to the above message with a accept(n, n1, val) and promise to not accept any proposal number lower than n. Here n1 is the largest proposal number less than n seen by the node. val is the value of this proposal corresponding to the largest proposal number (possibly null). In case this node has already accepted a proposal number higher than n by sending an accept message to that, the node rejects this proposal (simply by not replying to this proposal).
Proposal Value The proposer then does the following:
if it received accept from every node in some majority, it chooses as value the highest val it received back along with a accept message. (Chooses a value it likes, if all the val it received were null). Let’s call the selected value selected_val. Proposer then goes ahead and sends a select(n,selected_val) to all the nodes in this majority set.
if it did not receive accept from a majority set of nodes, it goes back to step one and proposes a higher proposal number
The nodes upon receiving select(n, selected_val), can reply with vote(n, selected_val), or not reply at all if they made a promise to a higher proposal number via step 2.
If the proposer receives vote(n,selected_val) from a majority nodes, the selected_val actually gets selected and the proposer sends a sucess(n, selected_val) to all the nodes.
None of the above steps result in a state that violates consistency. The proof of the same can be found in the paper, but is not required. I will try to answer a few question below to make this more intuitive.
Lastly, it is important to note that the above protocol allows only to arrive at a decision on a single value. However, real world systems might require to arrive at consensus on more than one values. This can be acheived by having an index associated with the value and running the above protocol for each index. Step 1-2 are independent of the value or the index and can be run only once, after which steps 3-to-6 can be run for each index of value required to arrive consensus on.
Another thing to note is that progress is not guranteed even if more than a majority of nodes are working fine. In such a case progress can be prevented because of many proposers proposing a proposal number, where some nodes accept some of these proposals which results in proposers going back to first step and proposing more proposal numbers.
To avoid this proposal message contention, a leader can be elected using any simple leader election protocol. You may be thinking that leader election itself is a distributed consensus problem, which is true. However, leader election in this case can be done via any algorithm which does not have strong consistency gurantees etc. This is because the safety property of paxos will still hold true even if two or more leaders are elected in face of failures during the leader election. All-in-all multiple leaders/proposers only slow the consensus but do not lead to violation in the safety property.
Lastly, I am writing a faq section which includes a few questions that I struggled with personally.
Q) Why does paxos require a majority of nodes to participate in the voting?
Ans) Any two majority sets will always have an non-intersection, which means there will always be atleast one node during any voting procession that participated in a previous successful voting, and hence this node will send the already accepted value to the proposer along with its accept message (which the proposer chooses - proof of which can be seen in the paper).
Q) Assume the following hypothetical condition. A node (say N) in the majority set (say M1) fails after voting for a value but just before receiving success for it and then restarts to participate in another voting procession whose majority set (say M2) contains only N from M1. Quorum M2 might choose/select another value than the one chosen by M1, right?
Ans) No. Even though node N failed before receiving success message, it had sent accept message for the value in quorum M1. When sending accept in M2 (in case the proposal number in this voting is higher), node N will send the previously accepted value from M1, which will then be chosen by proposer in step 3 of voting in M2. Hence, if the consensus were to arrive in M2, it would again be on the same value. Note that the value from M1 was chosen in M2 since there always exists a voting M2, that occured just after the consensus on the value of M1 and the step 3 chooses the value with the largest previous proposal number.
Q) Paxos is all cool Prakhar. However, why would I use Paxos (which seems not so stragithforward) instead of a simple quorum method such as W+R > N as in Cassandra? Does Paxos provide guarantees that are not provided by W+R>N method?
Ans) Very-well mate. I like where you are going with your question. Much of the following answer is taken from here.
Yes Paxos provides gurantees not provided by read-write quorum systems. After a successful write, both kind of systems behave similarly. The difference is in what happens during writes and/or failures. Until you have received a successful ack from W nodes, then the data might be written to some nodes and not to others and hence there is no gurantee that the whole system agrees on a single value. If you try to read the data back at this point, some clients may get the new data back and some the old data back. In other words, the system is not immediately consistent. . This is because writes aren’t atomic across nodes in these systems. There are usually mechanisms to “heal” an inconsistency like this and “eventually” the system will become consistent again. With Paxos, writes can be made atomic across nodes and inconsistencies between nodes are therefore possible to avoid. The Paxos algorithm makes it possible to guarantee that non-faulty nodes never disagree on the outcome of a write, at any point in time. Either the write succeeded everywhere or nowhere. In summary paxos favours consistency over availability (or CP over AP in the CAP theorem).
Q) Fine, Prakhar. But does paxos also have advantages over 2PC or 3PC (phase commit)?
Ans) Yes. While 2-phase commit protocol is vulnerable to a single node failure, paxos is resilient to that. Infact, it is resilient to several node failures as long as there is a majority of them functioning. 3-phase commit similarly violates consistency in the face of network partition. On the other hand paxos will only take longer to arrive at a consensus in such a case, but never violate the safety property. So, in a way paxos does have an advantage if you look at the safety property.
I might dig deeper with the comparision between various consensus protocols in an another blog post if this intruiges me or others any further. Or I might write about the several variants/applications of paxos that are being used in the popular practical systems of the day.
The paper can be found here
Some more reference material: