Consensus
Consensus is hard.
When do you need it? When you have a bunch of processes, and…
- they need to deliver the same messages, in the same order (totally-ordered/atomic broadcast)
- they need to each know what other processes exist, and keep those lists up to date (group membership problem, failure detection)
- one of them needs to play a particular role and the others need to agree who that is (leader election)
- they need to be able to take turns accessing a resource that only one can access at a time (mutual exclusion problem)
- they're participating in a transaction and need to agree on a commit/abort decision (distributed transaction commit)
We can view consensus as a kind of box with multiple inputs and multiple outputs: the inputs might differ, but in a correct system, the outputs all agree.
Properties
Consensus algorithms try to satisfy the following properties:
- termination: each correct process eventually decides on a value (whatever it is)
- agreement: all correct processes decide on the same value
- validity (aka integrity, nontriviality): the agreed-upon value must be one of the proposed values
Note
In the asynchronous network + crash fault model, no algorithm actually satisfies all 3 of these (Fischer, Lynch, Paterson, 1983)! In this model, we have to compromise, and usually it’s termination that we compromise on.
Paxos
Leslie Lamport, 1998
Each process takes on some of 3 roles:
- proposer: proposes values
- acceptor: contributes to choosing from among the proposed values
- learner: learns the agreed-upon value
A process could take on multiple roles, but we usually examine the case where each process only takes on one. Each process that plays any role is called a Paxos node.
Paxos nodes must:
- persist data
- know how many nodes constitute a majority of acceptors
How it Works
Phase 1

- proposer: sends a ``Prepare(n)`` message with a unique proposal number n to at least a majority of acceptors
  - the proposal number n must be unique, and higher than any proposal number that this proposer has used before
- acceptor: upon receiving a ``Prepare(n)`` message, checks: "did I previously promise to ignore requests with this proposal number?"
  - if so, ignore the message
  - otherwise, promise to ignore requests with proposal number < n, and reply with ``Promise(n)`` (*): "I will ignore any request with a proposal number < n"

When a majority reply with a ``Promise``, we reach a milestone: it is now impossible to get a majority to promise anything lower than n.
Phase 2 - the proposer has received ``Promise(n)`` from a majority of acceptors (for some n)

- proposer: sends an ``Accept(n, val)`` message to at least a majority of acceptors, where:
  - n is the proposal number that was promised
  - val is the actual value it wants to propose (**)
- acceptor: upon receiving an ``Accept(n, val)`` message, checks: did it previously promise to ignore requests with proposal number n?
  - if so, ignore the message
  - otherwise, reply with ``Accepted(n, val)``, and also send ``Accepted(n, val)`` to all learners

When a majority of acceptors have sent an ``Accepted(n, val)`` message for a given n, we reach a milestone: we have consensus on val (but no one knows yet!)

When a participant receives ``Accepted(n, val)`` from a majority of acceptors, it knows consensus has been reached (this happens separately on the proposer and on the learners).
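The two phases above can be sketched in code. This is a minimal, illustrative single-round sketch (one proposer, no failures, and the naive acceptor rules as described so far - names like ``run_round`` are made up for this sketch):

```python
# Sketch of one basic Paxos round. Acceptors implement the naive
# phase 1/phase 2 rules described above (no n_prev/val_prev yet).

class Acceptor:
    def __init__(self):
        self.promised_n = 0  # highest promised proposal number (must be persisted!)

    def on_prepare(self, n):
        if n < self.promised_n:   # previously promised to ignore this n?
            return None           # ignore the message
        self.promised_n = n       # promise to ignore anything < n
        return ("Promise", n)

    def on_accept(self, n, val):
        if n < self.promised_n:   # promised to ignore this n?
            return None           # ignore the message
        return ("Accepted", n, val)

def run_round(acceptors, n, val):
    majority = len(acceptors) // 2 + 1
    # Phase 1: send Prepare(n), count Promise replies
    promises = [a.on_prepare(n) for a in acceptors]
    if sum(p is not None for p in promises) < majority:
        return None               # no majority promised; retry with higher n
    # Phase 2: send Accept(n, val), count Accepted replies
    accepted = [a.on_accept(n, val) for a in acceptors]
    if sum(r is not None for r in accepted) >= majority:
        return val                # milestone: consensus on val
    return None

print(run_round([Acceptor() for _ in range(3)], n=5, val="foo"))  # -> foo
```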
Getting Weird
Consider the following execution:
- P1 sends ``Prepare(5)``
- P2 sends ``Prepare(4)``
  - it doesn't receive ``Promise``s in time, so it tries again with a higher proposal number
- P1 sends ``Accept(5, foo)``, and the acceptors send ``Accepted(5, foo)``
- at the same time, P2 sends ``Prepare(6)`` - oh no! If a majority promise to ignore proposals < 6, P2 could now get a different value accepted, even though we already reached consensus on foo.
So we have to change what happens in phase 1:
- acceptor: upon receiving a ``Prepare(n)`` message, checks: "did I previously promise to ignore requests with this proposal number?"
  - if so, ignore the message
  - otherwise, check if it has previously accepted anything
    - if so, reply with ``Promise(n, (n_prev, val_prev))``, where ``n_prev`` is the highest previously-accepted proposal number and ``val_prev`` is the corresponding previously-accepted value
    - otherwise, reply with ``Promise(n)``
Now, in phase 2, the proposer has to do something different, having received ``Promise(n)`` or ``Promise(n, (n_prev, val_prev))`` from a majority of acceptors:

- proposer: sends an ``Accept(n, val)`` message to at least a majority of acceptors, where:
  - n is the proposal number that was promised
  - val is chosen as follows:
    - the ``val_prev`` corresponding to the highest ``n_prev``, if any ``n_prev`` info was received
    - or the value it wants to propose, if no ``n_prev`` info was received
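The value-selection rule can be written as a small function. A sketch, assuming each promise carries either ``None`` (nothing previously accepted) or an ``(n_prev, val_prev)`` pair - the function name is made up here:

```python
def choose_value(promises, my_val):
    """Pick val for Accept(n, val) from a majority of Promise replies.

    Each element of `promises` is None (nothing previously accepted)
    or an (n_prev, val_prev) pair.
    """
    prev = [p for p in promises if p is not None]
    if not prev:
        return my_val  # no n_prev info received: free to propose our own value
    # otherwise: adopt the val_prev with the highest n_prev
    _, val_prev = max(prev, key=lambda pair: pair[0])
    return val_prev

choose_value([None, None, None], "foo")              # -> "foo"
choose_value([None, (4, "bar"), (5, "baz")], "foo")  # -> "baz"
```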
Final Algorithm
Phase 1

- proposer: sends a ``Prepare(n)`` message with a unique proposal number n to at least a majority of acceptors
  - the proposal number n must be unique, and higher than any proposal number that this proposer has used before
- acceptor: upon receiving a ``Prepare(n)`` message, checks: "did I previously promise to ignore requests with this proposal number?"
  - if so, ignore the message
  - otherwise, promise to ignore requests with proposal number < n, then check if it has previously accepted anything
    - if so, reply with ``Promise(n, (n_prev, val_prev))``, where ``n_prev`` is the highest previously-accepted proposal number and ``val_prev`` is the corresponding previously-accepted value
    - otherwise, reply with ``Promise(n)``

Phase 2 - the proposer has received ``Promise(n)`` or ``Promise(n, (n_prev, val_prev))`` from a majority of acceptors (for some n)

- proposer: sends an ``Accept(n, val)`` message to at least a majority of acceptors, where:
  - n is the proposal number that was promised
  - val is chosen as follows:
    - the ``val_prev`` corresponding to the highest ``n_prev``, if any ``n_prev`` info was received
    - or the value it wants to propose, if no ``n_prev`` info was received
- acceptor: upon receiving an ``Accept(n, val)`` message, checks: did it previously promise to ignore requests with proposal number n?
  - if so, ignore the message
  - otherwise, reply with ``Accepted(n, val)``, and also send ``Accepted(n, val)`` to all learners

When a majority of acceptors have sent an ``Accepted(n, val)`` message for a given n, we reach a milestone: we have consensus on val (but no one knows yet!)
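The final acceptor rules can be sketched as a small class (illustrative only; in a real implementation all three fields would have to be persisted to stable storage, since acceptors must survive crashes):

```python
class Acceptor:
    """Acceptor for the final algorithm (sketch)."""

    def __init__(self):
        self.promised_n = 0       # highest promised proposal number
        self.accepted_n = None    # highest proposal number accepted so far
        self.accepted_val = None  # the value accepted with accepted_n

    def on_prepare(self, n):
        if n < self.promised_n:
            return None                           # ignore the message
        self.promised_n = n                       # promise to ignore anything < n
        if self.accepted_n is not None:           # previously accepted something?
            return ("Promise", n, (self.accepted_n, self.accepted_val))
        return ("Promise", n)

    def on_accept(self, n, val):
        if n < self.promised_n:
            return None                           # ignore the message
        self.accepted_n, self.accepted_val = n, val
        return ("Accepted", n, val)               # also sent to all learners
```

For example, an acceptor that has accepted ``(5, foo)`` and then receives ``Prepare(6)`` replies with ``Promise(6, (5, foo))``, which is exactly what forces the new proposer to re-propose foo.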
Paxos satisfies agreement and validity! What might cause it not to terminate…?
Non-Termination
Paxos can fail to terminate if you have duelling proposers: each time one proposer gets a majority of promises for some n, the other sends a ``Prepare`` with a higher proposal number before the first one's ``Accept(n, val)`` messages arrive. Those ``Accept``s are then ignored, so neither proposer ever receives ``Accepted`` from a majority - the other proposer keeps butting in.
So why don't we just always have only one proposer? Theoretically, we could have one process declare itself the leader and tell the others what the accepted value is… but picking the leader requires consensus in itself!
We could, however, use a separate leader election protocol to choose the proposer for a Paxos run - and that leader election protocol could make different guarantees (e.g. termination and validity, but not agreement: accidentally electing two leaders just gives us duelling proposers again, which is safe, merely slow).
Multi-Paxos
Paxos is good for gaining consensus on a single value; to decide on multiple values (e.g. a sequence of values), you have to rerun the whole thing each time.
For example, to implement TO delivery in a system where 2 messages are sent, you need to agree on two values: which message is delivered first, and which second. In normal Paxos, this takes a lot of messages!
What if we (well, the accepted proposer) just kept sending ``Accept`` messages with the same proposal number (i.e., kept repeating phase 2)? It turns out you can keep doing this until your messages start getting ignored (i.e., until a higher ``Prepare`` is received)!
And if a second process butts in, Multi-Paxos pretty much just becomes normal Paxos.
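The Multi-Paxos idea can be sketched like this. This assumes a stable leader that has already completed phase 1 with proposal number n, and a made-up ``send_accept`` callback standing in for the network (returning ``False`` when the acceptors start ignoring us):

```python
def multi_paxos_leader(n, values, send_accept):
    """Run repeated phase 2 rounds under one proposal number n.

    send_accept(n, slot, val) -> True if a majority accepted,
    False if our messages are being ignored (a higher Prepare won).
    """
    log = []
    for slot, val in enumerate(values):
        # one Accept(n, val) per slot in the sequence - no new Prepares
        if not send_accept(n, slot, val):
            break  # dethroned: fall back to full Paxos (redo phase 1)
        log.append(val)
    return log

# With cooperative acceptors, the whole sequence is decided
# at one Accept round per value:
multi_paxos_leader(7, ["m1", "m2"], lambda n, slot, val: True)  # -> ["m1", "m2"]
```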
Note
An alternative way to agree on a sequence is batching - queuing up multiple values and using normal Paxos to get consensus on a batch at a time.
Paxos: Fault Tolerance
We can’t have just one acceptor, since it can crash.
Acceptors
If we have 3 acceptors, only one can crash and Paxos will still work - you need to hear from both of the live acceptors.
If you have 5, you can tolerate 2 crashes. In general, a minority of acceptors can crash.
If f is the number of acceptor crashes you want to tolerate, you need \(2f+1\) acceptors.
Proposers
If f is the number of proposer crashes you want to tolerate, you need \(f+1\) proposers.
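These counts are just arithmetic, and can be sketched as a tiny helper (the function name is made up for illustration):

```python
def nodes_needed(f: int) -> dict:
    """Minimum node counts for Paxos to tolerate f crash faults."""
    return {
        "acceptors": 2 * f + 1,  # a majority (f + 1) must survive f crashes
        "proposers": f + 1,      # at least one proposer must survive
    }

nodes_needed(1)  # -> {"acceptors": 3, "proposers": 2}
nodes_needed(2)  # -> {"acceptors": 5, "proposers": 3}
```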
Omission Faults
Paxos is tolerant to omission faults (given timeouts) - it might not terminate, but that’s already not a guarantee, so eh. In this scenario, it’s safe but not live - fail-safe.
Other Consensus Protocols
- Raft (Diego Ongaro and John Ousterhout, 2014)
  - designed to be easier to understand than other protocols
- Zab (ZooKeeper Atomic Broadcast; Yahoo! Research, late 2000s)
- Viewstamped Replication (Brian Oki and Barbara Liskov, 1988)
All of these are for a sequence of values, like Multi-Paxos.