Consistency and Consensus
The previous chapter was pessimistic—cataloging everything that can go wrong. This chapter is more optimistic: despite all those problems, we can build distributed systems that work correctly.
The key is consensus—getting nodes to agree on something. Once you have consensus, you can build powerful abstractions like leaders, transactions, and total ordering.
Consistency Guarantees
Different systems offer different guarantees about how data behaves:
Eventual Consistency
The weakest practical guarantee: if you stop writing and wait long enough, all replicas eventually converge to the same value.
Problem: "Eventually" could be seconds or hours. In the meantime, reads might return anything.
Stronger Guarantees
We often need more:
- Read-after-write: See your own writes immediately (see the sketch after this list)
- Monotonic reads: Never see older data after seeing newer
- Causal consistency: See effects in causal order
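To make the first of these concrete, read-after-write consistency is often provided by routing a user's reads to the leader for a short window after that user writes. The sketch below is a minimal illustration under assumed names (leader, replicas, READ_YOUR_WRITES_WINDOW), not any particular database's API.

```python
import random
import time

# Minimal sketch of read-after-write consistency: after a user writes, serve
# that user's reads from the leader until replicas have had time to catch up.
# The leader/replica interface and the window length are assumptions.
READ_YOUR_WRITES_WINDOW = 5.0  # seconds; tune to observed replication lag

class SessionStore:
    def __init__(self, leader, replicas):
        self.leader = leader        # node that accepts writes
        self.replicas = replicas    # possibly stale read replicas
        self.last_write_at = {}     # user_id -> time of that user's last write

    def write(self, user_id, key, value):
        self.leader.put(key, value)
        self.last_write_at[user_id] = time.monotonic()

    def read(self, user_id, key):
        wrote_recently = (
            time.monotonic() - self.last_write_at.get(user_id, float("-inf"))
            < READ_YOUR_WRITES_WINDOW
        )
        # Recent writers read from the leader and see their own writes;
        # everyone else can read from any replica.
        node = self.leader if wrote_recently else random.choice(self.replicas)
        return node.get(key)
```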
The strongest guarantee is linearizability.
Linearizability
Linearizability makes a distributed system behave like a single-copy system with atomic operations. Once a write completes, all subsequent reads see that value.
Linearizability Violated
Client A: write(x = 1) ─────────────►
                                    │
Client B:        read(x) = 1 ──────►
                                    │
Client C:                            read(x) = 0  ◄── VIOLATION!
If B saw x=1, then C reading after B should also see x=1.
What Makes a System Linearizable?
If we could draw all operations on a timeline, and there's a single point where each operation "takes effect," the system is linearizable if:
- All reads return the value of the most recent write
- Once any read returns a new value, all subsequent reads return that value or newer
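To make the second rule concrete, here is a toy checker over a trace of completed single-register operations. The trace format and the assumption that written values are increasing version numbers (so "newer" means "larger") are mine, for illustration only.

```python
# Toy check of the "reads never go backwards" rule for a single register.
# history: list of ("write" | "read", value) tuples in real-time completion
# order; written values are assumed to be increasing version numbers.
def find_stale_read(history):
    newest_seen = None  # newest value written or returned by any read so far
    for i, (op, value) in enumerate(history):
        if op == "read" and newest_seen is not None and value < newest_seen:
            return i  # this read returned older data than was already observed
        newest_seen = value if newest_seen is None else max(newest_seen, value)
    return None  # no violation of this rule found

# The diagram above as a trace: A's write of 1 completes, B reads 1, C reads 0.
trace = [("write", 1), ("read", 1), ("read", 0)]
assert find_stale_read(trace) == 2  # C's read violates linearizability
```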
The Cost of Linearizability
Linearizability requires coordination, which is expensive:
- Latency: Operations must wait for responses from other nodes
- Availability: If some nodes are unreachable, operations might block
Warning
The CAP theorem states you can't have both linearizability (C) and availability (A) during a network partition (P). Since partitions happen, you must choose between consistency and availability.
When Do You Need Linearizability?
Single-leader replication: All nodes must agree on which node is the leader, and electing one requires linearizability (only one leader at a time).
Distributed locks: Only one client can hold a lock at a time.
Uniqueness constraints: Usernames, file paths, and other identifiers must be globally unique (see the sketch after this list).
Cross-channel timing dependencies: If a message and a side effect must be seen together.
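For example, a uniqueness constraint can be implemented as an atomic create-if-absent on a linearizable store. The sketch below assumes a hypothetical key-value client (kv.create_if_absent); without linearizability, two concurrent claims could both appear to succeed.

```python
# Sketch: enforce a unique-username constraint via a linearizable
# create-if-absent. The kv client and its method are hypothetical.
class UsernameTaken(Exception):
    pass

def claim_username(kv, username, user_id):
    # Linearizability guarantees at most one of two concurrent claims on the
    # same key can succeed, so uniqueness holds even across data centers.
    created = kv.create_if_absent(key=f"username/{username}", value=user_id)
    if not created:
        raise UsernameTaken(username)
```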
The Photo Upload Problem
1. Upload service saves image to storage
2. Upload service sends "resize" message to queue
3. Resize worker reads message
4. Resize worker tries to read image from storage
If storage isn't linearizable, the read in step 4 can race the write in step 1: the queue may deliver the resize message before the image write is visible on the replica the worker reads from, so the worker finds an old version of the image, or no image at all.
Ordering Guarantees
Many problems in distributed systems reduce to ordering:
- Which write happened first?
- Which message should be processed first?
- Which node became leader first?
Causal Ordering
Causality is a partial ordering: if operation A could have influenced operation B, then A must come before B. Otherwise, they're concurrent.
Causal vs Concurrent
Causal:
A: Post question "How do I sort?"
B: Post answer "Use quicksort" ← caused by seeing A
Concurrent:
A: Alice posts about cats
B: Bob posts about dogs
(Neither influenced the other)
Causal ordering is weaker than linearizability but still useful, and cheaper to implement.
Lamport Timestamps
A simple way to create a total order consistent with causality:
Each node keeps a counter. Every operation includes (counter, nodeID):
- When a node does something, increment counter
- When receiving a message, set counter = max(local, received) + 1
Lamport Timestamps
Node A: (1, A), (2, A), (3, A)...
Node B: receives (2, A), sets counter = 3
Node B: (3, B), (4, B)...
Total order: (1,A), (2,A), (3,A), (3,B), (4,B)...
Limitation: Lamport timestamps tell you the order after the fact, but can't help you decide in real time whether you've seen all events.
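A minimal Lamport clock implementing the two rules above; the class and method names are my own.

```python
class LamportClock:
    """Minimal Lamport clock: a counter plus a node ID for breaking ties."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        # Rule 1: increment the counter for every local operation.
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, received_counter):
        # Rule 2: jump past anything already seen, then count the receive itself.
        self.counter = max(self.counter, received_counter) + 1
        return (self.counter, self.node_id)

# Timestamps compare as (counter, node_id) tuples: a total order that is
# consistent with causality, since a cause always has the smaller counter.
a, b = LamportClock("A"), LamportClock("B")
t1 = a.local_event()        # (1, "A")
t2 = a.local_event()        # (2, "A") -- attached to a message sent to B
t3 = b.receive(t2[0])       # (3, "B")
assert sorted([t3, t2, t1]) == [t1, t2, t3]
```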
Total Order Broadcast
For many problems, we need all nodes to see messages in the same order. Total order broadcast guarantees:
- Reliable delivery: If a message is delivered to one node, it's delivered to all
- Total ordering: All nodes receive messages in the same order
This is equivalent to consensus and useful for:
- Database replication (all replicas apply writes in same order)
- Implementing serializable transactions
- Creating a log that all nodes agree on
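The simplest way to get both properties is to funnel every message through a single sequencer node, as in the sketch below. It is illustrative only: retransmission to crashed or slow nodes is omitted, and the sequencer is a single point of failure, which is exactly why fault-tolerant total order broadcast requires consensus underneath.

```python
import threading

# Illustrative total order broadcast via a single sequencer (not fault-tolerant).
class Node:
    def __init__(self, name):
        self.name = name
        self.log = []               # every node ends up with an identical log

    def deliver(self, seq, msg):
        self.log.append((seq, msg))

class Sequencer:
    def __init__(self, nodes):
        self.nodes = nodes
        self.next_seq = 0
        self.lock = threading.Lock()

    def broadcast(self, msg):
        # Assign one global sequence number and deliver while holding the lock,
        # so every node sees messages in exactly the sequence-number order.
        with self.lock:
            seq = self.next_seq
            self.next_seq += 1
            for node in self.nodes:
                node.deliver(seq, msg)

nodes = [Node("n1"), Node("n2"), Node("n3")]
seq = Sequencer(nodes)
seq.broadcast("write x=1")
seq.broadcast("write y=2")
assert nodes[0].log == nodes[1].log == nodes[2].log
```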
Consensus
Consensus means getting several nodes to agree on something. It sounds simple but is surprisingly hard in the presence of failures.
The Consensus Problem
Given nodes that can propose values, consensus ensures:
- Agreement: All nodes decide the same value
- Integrity: Every node decides at most once, and only on a proposed value
- Validity: If a node decides value v, some node proposed v
- Termination: Every non-crashed node eventually decides
FLP Impossibility
A famous result: in an asynchronous system (no timing bounds), no consensus algorithm can guarantee termination if even one node can crash.
Note
FLP doesn't mean consensus is impossible in practice. Real systems have partial synchrony—timing bounds usually hold. Algorithms can make progress during stable periods.
Practical Consensus Algorithms
Paxos: The classic algorithm, notoriously difficult to understand and implement.
Raft: Designed for understandability, widely used (etcd, Consul).
Zab: Used by ZooKeeper.
These algorithms work in rounds:
- A leader proposes a value
- Followers vote on the proposal
- If a quorum agrees, the value is decided
- If leader fails, elect a new one and retry
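A highly simplified sketch of one such round, showing only the quorum-voting shape of the list above; it is not Paxos, Raft, or Zab, which additionally handle competing leaders, ballot or term numbers, and crash recovery. The follower.vote call is a hypothetical RPC.

```python
# Simplified single round of leader-driven consensus: the leader proposes a
# value and decides it once a majority of all nodes (itself included) accept.
def run_round(leader_value, followers):
    votes = 1  # the leader counts as voting for its own proposal
    for follower in followers:
        try:
            if follower.vote(leader_value):   # hypothetical RPC to a follower
                votes += 1
        except ConnectionError:
            continue                          # crashed or unreachable follower
    total_nodes = len(followers) + 1
    quorum = total_nodes // 2 + 1             # strict majority
    if votes >= quorum:
        return leader_value                   # decided
    return None                               # no quorum: elect a new leader, retry
```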
Consensus in Practice
Consensus is expensive:
- Requires multiple round-trips
- Blocks during leader elections
- Sensitive to network latency
For this reason, many systems avoid consensus where possible, using it only for:
- Leader election
- Atomic commit
- Membership changes
Coordination Services
ZooKeeper and etcd provide consensus-based coordination primitives:
What They Offer
- Linearizable key-value store (small amounts of data)
- Totally ordered operations (every write is assigned a sequence number)
- Failure detection (sessions and heartbeats)
- Change notifications (watches)
Use Cases
Leader election: Multiple nodes try to create the same ephemeral node; only one succeeds.
Distributed locks: Acquire a lock by creating a node; release by deleting it.
Service discovery: Services register themselves; clients watch for changes.
Configuration management: Store config in ZooKeeper; services watch for updates.
Leader Election with ZooKeeper
1. All nodes try to create /leader ephemeral node
2. One succeeds → becomes leader
3. Others watch /leader for deletion
4. Leader's session expires → /leader deleted
5. Watchers notified → back to step 1
Membership and Coordination
ZooKeeper implements atomic broadcast:
- All writes go through a single leader
- Leader assigns a sequence number (zxid)
- All nodes see writes in zxid order
This makes it suitable for coordination tasks, but not for bulk data storage.
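As a sketch of the election recipe above, here it is using the kazoo Python client for ZooKeeper; the connection string, znode path, and node name are placeholders, and error handling is minimal. In practice kazoo also ships a higher-level Election recipe that deals with races and the herd effect.

```python
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

# Sketch of leader election with ZooKeeper via kazoo. The host address,
# znode path, and node name are placeholders for illustration.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def campaign():
    def on_change(event):
        # Fired when /leader changes (e.g. deleted because the old leader's
        # session expired): go back to step 1 and campaign again.
        campaign()

    try:
        # Ephemeral node: ZooKeeper removes it automatically if our session dies.
        zk.create("/leader", b"node-1", ephemeral=True)
        print("I am the leader")
    except NodeExistsError:
        # Someone else won; watch the node so we hear when it goes away.
        zk.exists("/leader", watch=on_change)

campaign()
```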
Limits of Consensus
Consensus is powerful but has fundamental limitations:
Performance
Consensus requires quorum responses for every decision. This means:
- At least 3 nodes needed (5 for better fault tolerance)
- Latency is bounded by slowest quorum member
- Network partitions can stall progress
Dynamic Membership
Changing the set of participating nodes is complex. Most systems require careful procedures to add or remove nodes.
Algorithms Are Complex
Implementing consensus correctly is notoriously difficult. Use battle-tested libraries (etcd, ZooKeeper) rather than rolling your own.
Warning
Many systems that claim to implement consensus have subtle bugs. The formal proofs are complex, and edge cases are easy to miss.
Summary
Building reliable distributed systems requires careful attention to consistency and ordering:
| Guarantee | Meaning | Cost |
|---|---|---|
| Eventual consistency | Replicas converge eventually | Low |
| Causal consistency | Respects happens-before | Medium |
| Linearizability | Acts like single copy | High |
Ordering concepts:
| Concept | Properties |
|---|---|
| Lamport timestamps | Total order, consistent with causality |
| Total order broadcast | Same order on all nodes |
| Consensus | All nodes agree on a value |
Note
Linearizability, total order broadcast, and consensus are equivalent in power—if you can implement one, you can implement the others.
Practical advice:
- Use coordination services (ZooKeeper, etcd) rather than implementing consensus yourself
- Minimize consensus usage—it's expensive; use it only where truly needed
- Understand your consistency requirements—many applications work fine with weaker guarantees
- Accept that during partitions, you must choose between consistency and availability
Key insight: once you have consensus, you can build higher-level abstractions—leaders, locks, atomic broadcasts—that make the rest of your system much easier to reason about.