Consistency and Consensus
The previous chapter was pessimistic—cataloging everything that can go wrong. This chapter is more optimistic: despite all those problems, we can build distributed systems that work correctly.
The key is consensus—getting nodes to agree on something. Once you have consensus, you can build powerful abstractions like leaders, transactions, and total ordering.
Consistency Guarantees
Different systems offer different guarantees about how data behaves:
Eventual Consistency
The weakest practical guarantee: if you stop writing and wait long enough, all replicas eventually converge to the same value.
Problem: "Eventually" could be seconds or hours. In the meantime, reads might return anything.
Stronger Guarantees
We often need more:
- Read-after-write: See your own writes immediately (see the sketch after this list)
- Monotonic reads: Never see older data after seeing newer
- Causal consistency: See effects in causal order
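To make the first of these concrete, read-after-write consistency is often provided by routing a user's reads to the leader for a short window after that user writes. The sketch below is a minimal illustration under assumed names (leader, replicas, READ_YOUR_WRITES_WINDOW), not any particular database's API.

```python
import random
import time

# Minimal sketch of read-after-write consistency: after a user writes, serve
# that user's reads from the leader until replicas have had time to catch up.
# The leader/replica interface and the window length are assumptions.
READ_YOUR_WRITES_WINDOW = 5.0  # seconds; tune to observed replication lag

class SessionStore:
    def __init__(self, leader, replicas):
        self.leader = leader        # node that accepts writes
        self.replicas = replicas    # possibly stale read replicas
        self.last_write_at = {}     # user_id -> time of that user's last write

    def write(self, user_id, key, value):
        self.leader.put(key, value)
        self.last_write_at[user_id] = time.monotonic()

    def read(self, user_id, key):
        wrote_recently = (
            time.monotonic() - self.last_write_at.get(user_id, float("-inf"))
            < READ_YOUR_WRITES_WINDOW
        )
        # Recent writers read from the leader and see their own writes;
        # everyone else can read from any replica.
        node = self.leader if wrote_recently else random.choice(self.replicas)
        return node.get(key)
```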
The strongest guarantee is linearizability.
Linearizability
Linearizability makes a distributed system behave like a single-copy system with atomic operations. Once a write completes, all subsequent reads see that value.
Linearizability Violated
Client A: write(x = 1) ─────────────►
                                    │
Client B:        read(x) = 1 ──────►
                                    │
Client C:                            read(x) = 0  ◄── VIOLATION!
If B saw x=1, then C reading after B should also see x=1.
What Makes a System Linearizable?
If we could draw all operations on a timeline, and there's a single point where each operation "takes effect," the system is linearizable if:
- All reads return the value of the most recent write
- Once any read returns a new value, all subsequent reads return that value or newer
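To make the second rule concrete, here is a toy checker over a trace of completed single-register operations. The trace format and the assumption that written values are increasing version numbers (so "newer" means "larger") are mine, for illustration only.

```python
# Toy check of the "reads never go backwards" rule for a single register.
# history: list of ("write" | "read", value) tuples in real-time completion
# order; written values are assumed to be increasing version numbers.
def find_stale_read(history):
    newest_seen = None  # newest value written or returned by any read so far
    for i, (op, value) in enumerate(history):
        if op == "read" and newest_seen is not None and value < newest_seen:
            return i  # this read returned older data than was already observed
        newest_seen = value if newest_seen is None else max(newest_seen, value)
    return None  # no violation of this rule found

# The diagram above as a trace: A's write of 1 completes, B reads 1, C reads 0.
trace = [("write", 1), ("read", 1), ("read", 0)]
assert find_stale_read(trace) == 2  # C's read violates linearizability
```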
The Cost of Linearizability
Linearizability requires coordination, which is expensive:
- Latency: Operations must wait for responses from other nodes
- Availability: If some nodes are unreachable, operations might block
Warning
The CAP theorem states you can't have both linearizability (C) and availability (A) during a network partition (P). Since partitions happen, you must choose between consistency and availability.
When Do You Need Linearizability?
Single-leader replication: All nodes must agree on which node is the leader, and electing one requires linearizability (only one leader at a time).
Distributed locks: Only one client can hold a lock at a time.
Uniqueness constraints: Usernames, file paths, and other identifiers must be globally unique (see the sketch after this list).
Cross-channel timing dependencies: If a message and a side effect must be seen together.
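For example, a uniqueness constraint can be implemented as an atomic create-if-absent on a linearizable store. The sketch below assumes a hypothetical key-value client (kv.create_if_absent); without linearizability, two concurrent claims could both appear to succeed.

```python
# Sketch: enforce a unique-username constraint via a linearizable
# create-if-absent. The kv client and its method are hypothetical.
class UsernameTaken(Exception):
    pass

def claim_username(kv, username, user_id):
    # Linearizability guarantees at most one of two concurrent claims on the
    # same key can succeed, so uniqueness holds even across data centers.
    created = kv.create_if_absent(key=f"username/{username}", value=user_id)
    if not created:
        raise UsernameTaken(username)
```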
The Photo Upload Problem
1. Upload service saves image to storage
2. Upload service sends "resize" message to queue
3. Resize worker reads message
4. Resize worker tries to read image from storage
If storage isn't linearizable, the read in step 4 can race the write in step 1: the queue may deliver the resize message before the image write is visible on the replica the worker reads from, so the worker finds an old version of the image, or no image at all.
Ordering Guarantees
Many problems in distributed systems reduce to ordering:
- Which write happened first?
- Which message should be processed first?
- Which node became leader first?
Causal Ordering
Causality is a partial ordering: if operation A could have influenced operation B, then A must come before B. Otherwise, they're concurrent.
Causal vs Concurrent
Causal:
A: Post question "How do I sort?"
B: Post answer "Use quicksort" ← caused by seeing A
Concurrent:
A: Alice posts about cats
B: Bob posts about dogs
(Neither influenced the other)
Causal ordering is weaker than linearizability but still useful, and cheaper to implement.
Lamport Timestamps
A simple way to create a total order consistent with causality:
Each node keeps a counter. Every operation includes (counter, nodeID):
- When a node does something, increment counter
- When receiving a message, set counter = max(local, received) + 1
Lamport Timestamps
Node A: (1, A), (2, A), (3, A)...
Node B: receives (2, A), sets counter = 3
Node B: (3, B), (4, B)...
Total order: (1,A), (2,A), (3,A), (3,B), (4,B)...
Limitation: Lamport timestamps tell you the order after the fact, but can't help you decide in real time whether you've seen all events.
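A minimal Lamport clock implementing the two rules above; the class and method names are my own.

```python
class LamportClock:
    """Minimal Lamport clock: a counter plus a node ID for breaking ties."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        # Rule 1: increment the counter for every local operation.
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, received_counter):
        # Rule 2: jump past anything already seen, then count the receive itself.
        self.counter = max(self.counter, received_counter) + 1
        return (self.counter, self.node_id)

# Timestamps compare as (counter, node_id) tuples: a total order that is
# consistent with causality, since a cause always has the smaller counter.
a, b = LamportClock("A"), LamportClock("B")
t1 = a.local_event()        # (1, "A")
t2 = a.local_event()        # (2, "A") -- attached to a message sent to B
t3 = b.receive(t2[0])       # (3, "B")
assert sorted([t3, t2, t1]) == [t1, t2, t3]
```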
Total Order Broadcast
For many problems, we need all nodes to see messages in the same order. Total order broadcast guarantees:
- Reliable delivery: If a message is delivered to one node, it's delivered to all
- Total ordering: All nodes receive messages in the same order
This is equivalent to consensus and useful for:
- Database replication (all replicas apply writes in same order)
- Implementing serializable transactions
- Creating a log that all nodes agree on
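The simplest way to get both properties is to funnel every message through a single sequencer node, as in the sketch below. It is illustrative only: retransmission to crashed or slow nodes is omitted, and the sequencer is a single point of failure, which is exactly why fault-tolerant total order broadcast requires consensus underneath.

```python
import threading

# Illustrative total order broadcast via a single sequencer (not fault-tolerant).
class Node:
    def __init__(self, name):
        self.name = name
        self.log = []               # every node ends up with an identical log

    def deliver(self, seq, msg):
        self.log.append((seq, msg))

class Sequencer:
    def __init__(self, nodes):
        self.nodes = nodes
        self.next_seq = 0
        self.lock = threading.Lock()

    def broadcast(self, msg):
        # Assign one global sequence number and deliver while holding the lock,
        # so every node sees messages in exactly the sequence-number order.
        with self.lock:
            seq = self.next_seq
            self.next_seq += 1
            for node in self.nodes:
                node.deliver(seq, msg)

nodes = [Node("n1"), Node("n2"), Node("n3")]
seq = Sequencer(nodes)
seq.broadcast("write x=1")
seq.broadcast("write y=2")
assert nodes[0].log == nodes[1].log == nodes[2].log
```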
Consensus
Consensus means getting several nodes to agree on something. It sounds simple but is surprisingly hard in the presence of failures.
The Consensus Problem
Given nodes that can propose values, consensus ensures:
- Agreement: All nodes decide the same value
- Integrity: Every node decides at most once, and only on a proposed value
- Validity: If a node decides value v, some node proposed v
- Termination: Every non-crashed node eventually decides
FLP Impossibility
A famous result: in an asynchronous system (no timing bounds), no consensus algorithm can guarantee termination if even one node can crash.
Note
FLP doesn't mean consensus is impossible in practice. Real systems have partial synchrony—timing bounds usually hold. Algorithms can make progress during stable periods.
Practical Consensus Algorithms
Paxos: The classic algorithm, notoriously difficult to understand and implement.
Raft: Designed for understandability, widely used (etcd, Consul).
Zab: Used by ZooKeeper.
These algorithms work in rounds:
- A leader proposes a value
- Followers vote on the proposal
- If a quorum agrees, the value is decided
- If leader fails, elect a new one and retry
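A highly simplified sketch of one such round, showing only the quorum-voting shape of the list above; it is not Paxos, Raft, or Zab, which additionally handle competing leaders, ballot or term numbers, and crash recovery. The follower.vote call is a hypothetical RPC.

```python
# Simplified single round of leader-driven consensus: the leader proposes a
# value and decides it once a majority of all nodes (itself included) accept.
def run_round(leader_value, followers):
    votes = 1  # the leader counts as voting for its own proposal
    for follower in followers:
        try:
            if follower.vote(leader_value):   # hypothetical RPC to a follower
                votes += 1
        except ConnectionError:
            continue                          # crashed or unreachable follower
    total_nodes = len(followers) + 1
    quorum = total_nodes // 2 + 1             # strict majority
    if votes >= quorum:
        return leader_value                   # decided
    return None                               # no quorum: elect a new leader, retry
```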
Consensus in Practice
Consensus is expensive:
- Requires multiple round-trips
- Blocks during leader elections
- Sensitive to network latency
For this reason, many systems avoid consensus where possible, using it only for:
- Leader election
- Atomic commit
- Membership changes
Coordination Services
ZooKeeper and etcd provide consensus-based coordination primitives:
What They Offer
- Linearizable key-value store (small amounts of data)
- Totally ordered operations (every write is assigned a sequence number)
- Failure detection (sessions and heartbeats)
- Change notifications (watches)
Use Cases
Leader election: Multiple nodes try to create the same ephemeral node; only one succeeds.
Distributed locks: Acquire a lock by creating a node; release by deleting it.
Service discovery: Services register themselves; clients watch for changes.
Configuration management: Store config in ZooKeeper; services watch for updates.
Leader Election with ZooKeeper
1. All nodes try to create /leader ephemeral node
2. One succeeds → becomes leader
3. Others watch /leader for deletion
4. Leader's session expires → /leader deleted
5. Watchers notified → back to step 1
Membership and Coordination
ZooKeeper implements atomic broadcast:
- All writes go through a single leader
- Leader assigns a sequence number (zxid)
- All nodes see writes in zxid order
This makes it suitable for coordination tasks, but not for bulk data storage.
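As a sketch of the election recipe above, here it is using the kazoo Python client for ZooKeeper; the connection string, znode path, and node name are placeholders, and error handling is minimal. In practice kazoo also ships a higher-level Election recipe that deals with races and the herd effect.

```python
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

# Sketch of leader election with ZooKeeper via kazoo. The host address,
# znode path, and node name are placeholders for illustration.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def campaign():
    def on_change(event):
        # Fired when /leader changes (e.g. deleted because the old leader's
        # session expired): go back to step 1 and campaign again.
        campaign()

    try:
        # Ephemeral node: ZooKeeper removes it automatically if our session dies.
        zk.create("/leader", b"node-1", ephemeral=True)
        print("I am the leader")
    except NodeExistsError:
        # Someone else won; watch the node so we hear when it goes away.
        zk.exists("/leader", watch=on_change)

campaign()
```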
Limits of Consensus
Consensus is powerful but has fundamental limitations:
Performance
Consensus requires quorum responses for every decision. This means:
- At least 3 nodes needed (5 for better fault tolerance)
- Latency is bounded by slowest quorum member
- Network partitions can stall progress
Dynamic Membership
Changing the set of participating nodes is complex. Most systems require careful procedures to add or remove nodes.
Algorithms Are Complex
Implementing consensus correctly is notoriously difficult. Use battle-tested libraries (etcd, ZooKeeper) rather than rolling your own.
Warning
Many systems that claim to implement consensus have subtle bugs. The formal proofs are complex, and edge cases are easy to miss.
Summary
Building reliable distributed systems requires careful attention to consistency and ordering:
| Guarantee | Meaning | Cost |
|---|---|---|
| Eventual consistency | Replicas converge eventually | Low |
| Causal consistency | Respects happens-before | Medium |
| Linearizability | Acts like single copy | High |
Ordering concepts:
| Concept | Properties |
|---|---|
| Lamport timestamps | Total order, consistent with causality |
| Total order broadcast | Same order on all nodes |
| Consensus | All nodes agree on a value |
Note
Linearizability, total order broadcast, and consensus are equivalent in power—if you can implement one, you can implement the others.
Practical advice:
- Use coordination services (ZooKeeper, etcd) rather than implementing consensus yourself
- Minimize consensus usage—it's expensive; use it only where truly needed
- Understand your consistency requirements—many applications work fine with weaker guarantees
- Accept that during partitions, you must choose between consistency and availability
Key insight: once you have consensus, you can build higher-level abstractions—leaders, locks, atomic broadcasts—that make the rest of your system much easier to reason about.