Distributed Consensus
Consensus is the problem of getting a group of processes that fail independently to agree on a single value, or on a sequence of values, in the presence of network delays, message loss, and process crashes. Consensus is the foundation on which replicated state machines, leader election, distributed locks, configuration management, and strongly consistent databases are built. This skill catalogs the core results, algorithms, and design heuristics a cloud-systems practitioner needs to reason about coordination primitives without reinventing them.
Agent affinity: lamport (consensus theory, logical clocks, TLA+), decandia (quorum mechanics in Dynamo-style stores), dean (Paxos/Spanner experience in production systems)
Concept IDs: cloud-multi-service-coordination, cloud-procedure-execution, cloud-requirements-tracing
Why Consensus is Hard
Distributed systems fail in ways that single-node systems do not. Messages arrive late, arrive out of order, arrive twice, or never arrive. Processes crash, restart, and come back with stale state. Networks partition and heal. Clocks drift. Two observers watching the same sequence of events can see them in different orders and both be telling the truth about what they saw. Building reliable systems on this substrate requires algorithms that are correct under the worst combinations of these failures, not just the common cases.
The core insight, due to Lamport, is that "time" in a distributed system is not a physical thing — it is a partial order over events derived from message causality. "Happened-before" (a -> b if a and b are on the same process in program order, or if a is a send and b is the matching receive) gives a causal structure that distributed algorithms can actually reason about. Wall-clock time is an optimization for the common case, not a correctness foundation.
The FLP Impossibility Result
Fischer, Lynch, and Paterson (1985) proved that in an asynchronous system with even one faulty process, no deterministic consensus algorithm can guarantee termination. The proof constructs an adversarial scheduler that can always delay messages to keep the system in a bivalent state — a state from which either decision is still reachable.
This result does not say consensus is impossible. It says that any consensus algorithm must give up something: either synchrony assumptions (Paxos, Raft assume partial synchrony and eventually a stable leader), or determinism (randomized consensus terminates with probability 1), or fault tolerance (can tolerate zero failures if synchrony is strong). Every real consensus algorithm sits somewhere on this trade-off curve.
Lamport Clocks
A Lamport logical clock is a function L from events to integers satisfying: if a -> b then L(a) < L(b). The simplest implementation is a per-process counter C:
- Before each local event,
C := C + 1. - When sending a message, attach
C. - On receiving a message with timestamp
t, setC := max(C, t) + 1.
Lamport clocks give a total order consistent with causality (ties can be broken by process ID). They do not detect concurrency — if L(a) < L(b), you cannot tell whether a -> b or a || b.
Vector Clocks
A vector clock for n processes is a length-n integer vector V at each process. Process i increments V[i] on local events, attaches V to outgoing messages, and on receive sets V[j] := max(V[j], t[j]) for all j, then increments V[i].
Vector clocks give full causality: a -> b iff V(a) < V(b) (component-wise less-than-or-equal with at least one strict inequality). Concurrency is detectable: a || b iff neither V(a) < V(b) nor V(b) < V(a).
Vector clocks are the backbone of Dynamo-style eventually consistent stores for detecting conflicting versions, and of causal consistency systems in general.
The Paxos Family
Paxos (Lamport, 1998 — the "Part-Time Parliament" paper, and later "Paxos Made Simple") is a protocol for agreeing on a single value among a set of processes, at least a majority of which are non-faulty.
Roles. Proposers propose values, acceptors vote, learners learn the chosen value. A process may play multiple roles.
Two phases.
Phase 1 (Prepare). A proposer picks a ballot number n larger than any it has used, and sends Prepare(n) to a majority of acceptors. Each acceptor promises not to accept any ballot with number less than n, and replies with the highest-numbered proposal it has already accepted (if any).
Phase 2 (Accept). If the proposer got promises from a majority, it picks a value: if any of the replies contained a prior accepted proposal, it must use the value from the highest-numbered one (this preserves the safety invariant); otherwise it may propose its own value. It sends Accept(n, v) to a majority. Each acceptor accepts unless it has since promised a higher ballot.
A value is chosen once a majority of acceptors have accepted it. Once chosen, it cannot change — the phase 1 "must use highest prior" rule ensures any later successful proposer picks the same value.
Multi-Paxos. Running basic Paxos for every entry in a log is wasteful. Multi-Paxos elects a stable leader and skips phase 1 for subsequent slots until the leader loses leadership. This is how Paxos is actually used in systems like Chubby and Spanner.
Fast Paxos. Allows clients to send directly to acceptors when there is no contention, at the cost of larger quorums when contention is detected.
Raft
Raft (Ongaro and Ousterhout, 2014) is a consensus protocol designed for understandability. It decomposes consensus into three explicit subproblems: leader election, log replication, and safety. It adds a state machine abstraction and a clean membership change protocol.
Key design choices.
- Strong leader. All client requests go through the leader. Followers are passive. This trades some flexibility for clarity.
- Randomized timeouts. Followers become candidates after a randomized election timeout, reducing split votes.
- Log matching property. If two logs contain an entry with the same index and term, they are identical up to and including that entry.
Raft is used in etcd, Consul, CockroachDB, and many newer systems. The trade-off versus Paxos is: Raft is easier to implement correctly, Paxos is more flexible under pathological network conditions.
Viewstamped Replication and Virtual Synchrony
VR (Oki and Liskov, 1988) predates Paxos and was rediscovered after Raft. It uses view numbers as ballot numbers and has an explicit primary/backup structure. Virtual synchrony (Birman) extends the ideas with group membership as a first-class concept.
These matter because they remind you that "Paxos vs Raft" is a false dichotomy — consensus has a family of related solutions, and which you pick depends on the application's shape, not on algorithmic superiority.
Byzantine Fault Tolerance (Sketch)
Byzantine failures are failures where a process may send arbitrary, including malicious, messages. Crash-fault-tolerant algorithms (Paxos, Raft) assume failures are crash-only. BFT algorithms (PBFT, HotStuff, Tendermint) tolerate up to f Byzantine processes out of 3f + 1.
For most cloud-systems work inside a trusted datacenter boundary, CFT is sufficient. BFT becomes relevant when the failure domain includes adversarial actors — cross-organization blockchain systems, supply-chain attestation, or hardened aerospace flight software.
Quorum Systems: N, R, W
In a replicated read/write store with N replicas, a read quorum of R and a write quorum of W satisfies strong consistency when R + W > N. This is the DeCandia/Dynamo formulation and appears in Cassandra, Riak, and many others.
W = N, R = 1. Fast reads, slow writes. Write availability fails if any replica is down.W = 1, R = N. Fast writes, slow reads. Read availability fails if any replica is down.W = R = (N+1)/2. Balanced, majority quorums. Survives any