Safety & Liveness
Previously:
FIFO delivery
Causal delivery
Totally ordered delivery
These are all safety properties: properties that say a “bad” thing won’t happen; on the other hand, a liveness property says that a “good” thing will happen (e.g. eventual delivery).
Note
The word eventually is often an indicator of a liveness property.
Now, formally:
- safety property
say that a “bad” thing won’t happen
can be violated in a finite execution (can point to where it went wrong)
- liveness property
says that a “good” thing will (eventually) happen
cannot be violated in a finite execution (may take infinite time to satisfy)
It’s the combination of safety and liveness that makes systems useful - for example, otherwise, you could just drop every message to satisfy FIFO.
In fact, every property is either a safety property, a liveness property, or both.
Let’s try and define reliable delivery:
(Take 1): Let \(P_1\) be a process that sends a message to process \(P_2\). If neither \(P_1\) not \(P_2\) crashes (and not all messages are lost), then \(P_2\) eventually delivers m.
Well, it’s a start, but the whole “not crashing” thing is weak.
Fault Models
A fault model tells you which kind of faults can occur. In the example system of two processes, one asking the other for a value, we have the following faults:
message from \(M_1\) is lost - omission fault
message from \(M_1\) is slow - timing fault
\(M_2\) crashed - crash fault
\(M_2\) is slow - timing fault
message from \(M_2\) is slow - timing fault
message from \(M_2\) is lost - omission fault
\(M_2\) lies - byzantine fault
- crash fault
a process fails by halting (stops sending/receiving messages) and not everyone necessarily knows about it
(There is actually a subset, called “fail-stop faults”, where a process halts and everyone knows it halts)
- omission fault
a message is lost (a process fails to send/receive a single message)
- timing fault
a process responds too late (or too early)
- Byzantine fault
a process behaves in an arbitrary or malicious way
Note
There’s a subset of Byzantine faults between omission and Byzantine called “authentication-detectable Byzantine faults” - i.e. message corruptions that can be detected, and “downgraded” to an omission fault
If protocol X tolerates crash faults and protocol Y tolerates omission faults, does Y also tolerate crash faults?
Yes! Crash faults are just a special case of omission faults where all messages to/from a process are lost.
In fact, Byzantine faults are also a superset of crash (and omission) faults.
Note
We left out timing faults in the above diagram, since we’re dealing only in async in this class.
- fault model
A fault model is a specification that says what kind of faults a system can exhibit, and this tells you what kinds of faults need to be tolerated.
In this class, we focus on the omission model (which includes the crash model).
Two Generals Problem
Tom Scott: https://www.youtube.com/watch?v=IP-rGJKSZ3s
In the omission model, it is impossible for Alice and Bob to attack and know for sure that the other will attack.
How do we mitigate?
Probabilistic Certainty
One option is to have Alice constantly send until she receives an ACK; then the longer Bob goes without receiving a message, the more sure he is that she has received the ACK. Note that it’s not 100% guaranteed; every single message since then could have failed, but it works.
Common Knowledge
There is common knowledge of p when everyone knows p, everyone knows that everyone knows p, everyone knows that everyone knows that everyone knows p…
Fault Tolerance
What does it mean to tolerate a class of faults? Usually it’s defined by how/how much your program reacts to a fault.
A correct program satisfies both its safety and liveness properties, but often satisfying both is impossible during a fault.
So really, it’s about how wrong it goes in the presence of a fault.
live |
not live |
|
---|---|---|
safe |
masking |
fail-safe |
not safe |
non-masking |
:( |
Reliable Delivery, Take 2
Previously: Let \(P_1\) be a process that sends a message m to \(P_2\). If neither \(P_1\) nor \(P_2\) crashes (and not all messages are lost), then \(P_2\) eventually delivers m.
Do we need not all messages are lost
? Yes, if we’re working under the omission model.
So how do we implement it?
One implementation: repeat until ack
Alice puts a message in the send buffer
on timeout, send what’s in the buffer
when ack received, delete message from buffer
One problem is if Bob’s ACK gets dropped, and he receives the message twice. This may or may not be an issue, depending on the message. For example, if the message is something like “increase value by 1”, that’s an issue!
Note
Client A sending a set request, then client B, then a repeated send of client A is not actually an issue here - it’s the same if client A’s initial message was delayed (one client will be sad).
In this scenario, your messages should be idempotent - sending them multiple times should have the same effect as if it was sent once.
So this is actually at-least-once delivery! But is exactly-once delivery possible?
Not really… most systems that claim exactly-once delivery in real life are one of the following:
the messages were idempotent anyway
they’re making an effort to deduplicate messages on the receiver
Reliable Broadcast
broadcast: one sender, everyone receives
Note
If you have a unicast primitive, you can implement broadcast by sending multiple unicasts in very quick succession.
- reliable broadcast
If a correct process delivers a broadcast message m, then all correct processes deliver m.
Note
A correct process is a process that is acting correctly in a given fault model (e.g. not crashed in crash model).
Ex. If Alice sends a message to Bob and Carol in the crash model, Bob delivers it, but Carol crashes before she can, reliable broadcast is not violated because Carol is not correct.
Ex. If Alice is sending a message to Bob and Carol in the crash model but crashes after Bob’s is sent but before Carol’s is, reliable broadcast is violated, since Bob is a correct process that delivered the message, but Carol did not.
If a process can crash in the middle of a broadcast, how do we get reliable broadcast?
If you receive a broadcast message, forward it to everyone else (or, at least, everyone but the sender) before delivering it:
But if no one crashes, each process receives it twice:
But each process can keep track of messages they’ve already delivered/forwarded to not do it twice.
Important
Fault tolerance often involves making copies!