Replication
Replication means keeping copies of the same data on multiple machines connected via a network. Why bother?
- Latency: Keep data geographically close to users
- Availability: System continues working even if some machines fail
- Throughput: Scale out the number of machines that can serve read queries
The challenge: keeping data consistent across replicas when it changes.
Note
If data never changes, replication is trivial—just copy it once. All the difficulty comes from handling changes to data.
Leaders and Followers
The most common replication approach: leader-based replication (also called master-slave or primary-replica).
How It Works
- One replica is designated the leader (master, primary)
- Other replicas are followers (slaves, read replicas, secondaries)
- Writes go only to the leader
- The leader sends data changes to all followers as a replication log
- Followers apply changes in the same order
- Reads can go to any replica
Leader-Based Replication
Client writes:
┌─────────────────────────────────┐
│ │
▼ │
┌──────────┐ replication log ┌───┴──────┐
│ LEADER │────────────────────►│ FOLLOWER │
│ │ │ │
└──────────┘ └──────────┘
│ ▲
│ replication log │
└──────────────────────────────►│
┌──────────┐
│ FOLLOWER │
│ │
└──────────┘
Client reads: (can use any replica)

This model is used by PostgreSQL, MySQL, MongoDB, Kafka, and many others.
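The flow above can be sketched as a toy in-memory model (names and the synchronous push are illustrative; real systems stream the log asynchronously):

```python
# Minimal sketch of leader-based replication: writes go to the leader,
# which appends them to a replication log; followers apply entries in
# the same order. Reads can then go to any replica.

class Follower:
    def __init__(self):
        self.data = {}
        self.log_position = 0  # index of the next log entry to apply

    def apply(self, log, upto):
        # Apply all log entries not yet seen, in log order.
        for key, value in log[self.log_position:upto]:
            self.data[key] = value
        self.log_position = upto

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []  # replication log: ordered list of changes
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))
        for f in self.followers:  # asynchronous in real systems
            f.apply(self.log, len(self.log))

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", "Alice")
# Reads can use any replica:
assert all(f.data["user:1"] == "Alice" for f in followers)
```

Because every follower applies the same log in the same order, all replicas converge to the same state.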
Synchronous vs Asynchronous Replication
When the leader receives a write, how long does it wait before confirming?
Synchronous
The leader waits until followers confirm they've received the write before responding to the client.
Advantage: Followers are guaranteed to have an up-to-date copy.
Disadvantage: If any follower doesn't respond, writes are blocked.
Asynchronous
The leader responds immediately after its own write, without waiting for followers.
Advantage: The leader can continue even if followers fall behind.
Disadvantage: If the leader fails, some writes may be lost.
Semi-Synchronous
Common compromise: one follower is synchronous, others are asynchronous. If the synchronous follower becomes slow, another takes over.
Warning
Fully synchronous replication is impractical—a single node failure would halt the entire system. Most systems use asynchronous or semi-synchronous replication.
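The three modes differ only in when the leader acknowledges the client. A minimal sketch of that decision (the function and mode names are illustrative):

```python
# When can the leader confirm a write to the client?
# confirmations = number of followers that have acknowledged so far.

def acknowledge(mode, confirmations, n_followers):
    if mode == "synchronous":       # wait for every follower
        return confirmations == n_followers
    if mode == "semi-synchronous":  # wait for one designated follower
        return confirmations >= 1
    if mode == "asynchronous":      # don't wait at all
        return True
    raise ValueError(f"unknown mode: {mode}")

assert acknowledge("asynchronous", 0, 3)        # ack immediately
assert not acknowledge("semi-synchronous", 0, 3)
assert acknowledge("semi-synchronous", 1, 3)    # one sync follower is enough
assert not acknowledge("synchronous", 2, 3)     # blocked on the slow follower
```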
Setting Up New Followers
Adding a new replica isn't as simple as copying files—the data is constantly changing. The typical process:
- Take a consistent snapshot of the leader's database
- Copy the snapshot to the new follower
- Follower connects to leader and requests all changes since the snapshot
- Once caught up, follower can process ongoing changes
Most databases automate this process.
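The steps above can be sketched as snapshot-plus-replay, assuming the snapshot records the log position it corresponds to (the data and positions here are illustrative):

```python
# Bootstrap a new follower: restore a snapshot, then replay every
# log entry written after the snapshot was taken.

log = [("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k2", "v2b")]

def take_snapshot(data, log_position):
    return {"data": dict(data), "position": log_position}

def bootstrap_follower(snapshot, log):
    data = dict(snapshot["data"])
    # Replay all changes made since the snapshot.
    for key, value in log[snapshot["position"]:]:
        data[key] = value
    return data

# Snapshot taken after the first two writes:
snap = take_snapshot({"k1": "v1", "k2": "v2"}, 2)
assert bootstrap_follower(snap, log) == {"k1": "v1", "k2": "v2b", "k3": "v3"}
```

The same replay logic handles follower catch-up recovery: a restarted follower replays from its last applied position instead of a snapshot position.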
Handling Node Failures
Follower Failure: Catch-up Recovery
If a follower crashes and restarts, it knows its last processed position in the replication log. It simply requests all changes since that position and catches up.
Leader Failure: Failover
Leader failure is trickier. The system must:
- Detect that the leader has failed (usually via timeout)
- Choose a new leader (often the most up-to-date follower)
- Reconfigure the system so clients send writes to the new leader
Failover Process
1. Leader stops responding to heartbeats
2. Followers start election
└─ Usually choose follower with most recent data
3. New leader is chosen
4. Old leader's unreplicated writes are... lost? conflicting?
└─ This is the hard part
5. Clients are redirected to new leader

Failover Pitfalls
Lost writes: If the old leader had accepted writes not yet replicated, they vanish. What if other systems (caches, external APIs) already acted on those writes?
Split brain: Two nodes both believe they're the leader. Without careful design, both accept writes, causing data divergence.
Timeout tuning: Too short = unnecessary failovers during load spikes. Too long = extended downtime.
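The "lost writes" pitfall can be made concrete with a toy election that picks the most up-to-date follower (node names and log positions are illustrative):

```python
# Elect the follower with the highest replication-log position.
# Any writes the old leader accepted beyond that position are lost.

def elect_leader(follower_positions):
    """follower_positions: mapping of node -> last applied log index."""
    return max(follower_positions, key=follower_positions.get)

followers = {"follower-a": 117, "follower-b": 123}
old_leader_position = 125  # old leader had accepted 2 unreplicated writes

new_leader = elect_leader(followers)
lost_writes = old_leader_position - followers[new_leader]
assert new_leader == "follower-b"
assert lost_writes == 2  # these writes simply vanish after failover
```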
Warning
Failover is surprisingly difficult to get right. Many systems opt for manual failover, with operators triggering it when confident the leader has failed.
Replication Logs
How does the leader send changes to followers?
Statement-Based Replication
Send the SQL statements themselves:
INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com');
UPDATE users SET last_login = NOW() WHERE id = 123;

Problem: Non-deterministic functions (NOW(), RAND()) give different results on each replica. Statements with side effects or auto-incrementing columns also cause issues.
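The non-determinism problem can be demonstrated with a simulated clock standing in for SQL's NOW() (a toy model; `now` and `execute` are illustrative names):

```python
import itertools

# Simulated clock: every call returns a strictly later "time".
clock = itertools.count()
def now():
    return next(clock)

def execute(statement_fn):
    # Statement-based replication: every replica re-runs the statement.
    return statement_fn()

leader_value = execute(now)    # leader evaluates NOW() at t=0
follower_value = execute(now)  # follower replays it later, at t=1
assert leader_value != follower_value  # the replicas have diverged

# Logical (row-based) replication ships the computed value instead:
assert execute(lambda: leader_value) == leader_value
```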
Write-Ahead Log (WAL) Shipping
Send the same low-level log the leader uses for crash recovery. This describes which bytes changed in which disk blocks.
Problem: Tightly coupled to storage engine internals. Followers must run the same database version.
Logical (Row-Based) Log Replication
Send a logical description of changes:
INSERT into users: {id: 1, name: "Alice", email: "alice@example.com"}
UPDATE users WHERE id=123: {last_login: "2024-01-15 10:30:00"}
DELETE from users WHERE id=456

Advantage: Decoupled from storage format. Easier for external systems to consume (change data capture).
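A follower (or a change-data-capture consumer) applies such logical records mechanically. A minimal sketch, with an illustrative record format:

```python
# Apply logical (row-based) log records to an in-memory table.

def apply_logical_record(table, record):
    op = record["op"]
    if op == "insert":
        table[record["row"]["id"]] = dict(record["row"])
    elif op == "update":
        table[record["id"]].update(record["changes"])
    elif op == "delete":
        del table[record["id"]]

users = {}
log = [
    {"op": "insert", "row": {"id": 1, "name": "Alice"}},
    {"op": "update", "id": 1, "changes": {"last_login": "2024-01-15 10:30:00"}},
    {"op": "insert", "row": {"id": 2, "name": "Bob"}},
    {"op": "delete", "id": 2},
]
for record in log:
    apply_logical_record(users, record)

assert users == {1: {"id": 1, "name": "Alice",
                     "last_login": "2024-01-15 10:30:00"}}
```

Because the records describe rows rather than bytes or SQL text, the same stream can feed a different storage engine, a search index, or a cache.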
Trigger-Based Replication
Application-level triggers copy data to another table when changes occur. Maximum flexibility but more overhead and potential for bugs.
Replication Lag
With asynchronous replication, followers may be seconds or even minutes behind the leader. This is replication lag.
For read-heavy workloads reading from followers, lag causes weird effects:
Reading Your Own Writes
You submit a comment, refresh the page, and your comment isn't there yet (because you read from a lagging follower).
Solutions:
- Read your own recent writes from the leader
- Track the timestamp of your last write; don't read from followers behind that point
- Client remembers its last write position; reject reads from followers that haven't caught up
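The last two solutions boil down to a routing check against log positions. A minimal sketch (positions are illustrative log offsets):

```python
# Read-your-writes routing: remember each user's last write position
# and refuse to read from a follower that hasn't caught up to it.

def choose_replica(user_last_write, follower_position):
    """Return which replica may serve this user's read."""
    if follower_position >= user_last_write:
        return "follower"  # follower has seen the user's write
    return "leader"        # fall back to the leader

assert choose_replica(user_last_write=42, follower_position=40) == "leader"
assert choose_replica(user_last_write=42, follower_position=42) == "follower"
```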
Read-Your-Writes Consistency
1. User posts comment → goes to Leader
2. User refreshes page → reads from Follower
3. Follower hasn't caught up → comment missing!
Fix: Route user's reads to Leader for N seconds after their writes

Monotonic Reads
You query twice and see newer data first, then older data (because you hit different followers with different lag).
Solution: Route each user's reads to the same replica (e.g., based on hash of user ID).
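Hash-based routing is a one-liner in practice. A minimal sketch (replica names are illustrative):

```python
import hashlib

# Monotonic reads: hash the user ID so every read from the same user
# deterministically lands on the same replica.

def replica_for(user_id, replicas):
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

replicas = ["replica-0", "replica-1", "replica-2"]
first = replica_for("user-123", replicas)
# Repeated reads always hit the same replica, so a user never sees
# data "go backwards" by switching to a more-lagged follower:
assert all(replica_for("user-123", replicas) == first for _ in range(10))
```

The trade-off: if that replica fails, the user's reads must be rerouted, and monotonicity can briefly break.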
Consistent Prefix Reads
In partitioned databases, you might see an answer before the question if partitions have different lag.
Causality Violation
Partition A (lagging): "What's the score?"
Partition B (current): "It's 3-2!"
An observer might see: "It's 3-2!" then "What's the score?"

Solution: Ensure causally related writes go to the same partition, or use explicit causal ordering.
Multi-Leader Replication
In leader-based replication, the leader is a bottleneck and single point of failure. Multi-leader (multi-master) replication allows writes to multiple nodes.
Use Cases
Multi-datacenter operation: Each datacenter has its own leader. Better latency, better availability when a datacenter goes down.
Clients with offline operation: Mobile apps can work offline, syncing later. Each device is effectively a leader.
Collaborative editing: Google Docs allows multiple users to edit simultaneously.
The Big Problem: Write Conflicts
What happens when two leaders accept conflicting writes?
Write Conflict
Datacenter A: UPDATE users SET name = 'Alice Smith' WHERE id = 1;
Datacenter B: UPDATE users SET name = 'Alice Jones' WHERE id = 1;
Both succeed locally. When they replicate... which one wins?

Conflict avoidance: Route all writes for a given record to the same leader. Simple, but loses the benefits if that datacenter fails.
Converging to a consistent state: All replicas must eventually agree. Strategies include:
- Last write wins (based on timestamp or ID) — simple but loses data
- Merge the values somehow ("Alice Smith-Jones"?)
- Record the conflict for later resolution (shopping cart approach)
- Prompt the user to resolve
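Last write wins is the simplest of these strategies, and also the clearest illustration of why "simple" loses data. A minimal sketch (timestamps are illustrative):

```python
# Last-write-wins (LWW) conflict resolution: keep the version with the
# highest timestamp and silently discard the other.

def last_write_wins(a, b):
    """Each version is a (timestamp, value) pair; the newer one wins."""
    return a if a[0] >= b[0] else b

datacenter_a = (1705312200.0, "Alice Smith")
datacenter_b = (1705312201.5, "Alice Jones")

winner = last_write_wins(datacenter_a, datacenter_b)
assert winner[1] == "Alice Jones"
# "Alice Smith" is gone: LWW resolves conflicts by discarding data,
# and clock skew between datacenters can make the "wrong" write win.
```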
Warning
Multi-leader replication adds significant complexity. Avoid it unless you have a strong use case (multi-datacenter, offline clients, collaborative editing).
Leaderless Replication
Some databases abandon the leader concept entirely. Any replica can accept writes. Dynamo-style databases (Cassandra, Riak, Voldemort) work this way.
Quorums
With n replicas, writes must be confirmed by w nodes, and reads must query r nodes.
As long as w + r > n, you're guaranteed that at least one node has the latest data.
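The overlap argument can be checked directly: with n = 3, w = 2, r = 2, any r replicas you read must include at least one that saw the write. A minimal sketch, assuming each replica stores a (version, value) pair:

```python
import itertools

# Quorum read: take the value with the highest version among r replies.

def quorum_read(responses):
    """responses: (version, value) pairs from r replicas."""
    return max(responses, key=lambda pair: pair[0])[1]

n, w, r = 3, 2, 2
assert w + r > n  # the overlap condition

# A write with w = 2 reached replicas 0 and 1; replica 2 is stale.
replicas = [(2, "new"), (2, "new"), (1, "old")]

# Whichever r = 2 replicas answer, at least one holds the new value:
for subset in itertools.combinations(replicas, r):
    assert quorum_read(subset) == "new"
```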
Quorum Configuration
n = 3 replicas
w = 2 (write to at least 2)
r = 2 (read from at least 2)
Write: Send to all 3, succeed when 2 confirm
Read: Query all 3, use the most recent value
Since w + r = 4 > 3, reads and writes overlap → we see latest data

Handling Stale Data
When you read from r nodes, you might get different values. Two repair mechanisms:
Read repair: When reading detects a stale value, write the newer value back to the stale replica.
Anti-entropy: Background process that compares replicas and copies missing data.
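Read repair falls naturally out of a quorum read: once the reader knows the newest version, it pushes it to the laggards. A minimal sketch, reusing the (version, value) representation (illustrative, not any particular database's API):

```python
# Read repair: when a read sees divergent versions, write the newest
# value back to the stale replicas.

def read_repair(replicas):
    """replicas: mutable list of (version, value) pairs.
    Returns the latest value and repairs stale entries in place."""
    latest = max(replicas, key=lambda pair: pair[0])
    for i, (version, _) in enumerate(replicas):
        if version < latest[0]:
            replicas[i] = latest  # push the newer value back
    return latest[1]

replicas = [(2, "new"), (1, "old"), (2, "new")]
assert read_repair(replicas) == "new"
assert replicas == [(2, "new"), (2, "new"), (2, "new")]  # repaired
```

Read repair only fixes values that are actually read; anti-entropy covers the rest in the background.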
Sloppy Quorums
If the nodes required for a quorum are unavailable, a sloppy quorum lets writes land on other reachable nodes temporarily. When the original nodes recover, the writes are handed off to them (a process called hinted handoff).
This improves availability but means you might read stale data even with w + r > n.
Summary
Replication serves three main purposes:
| Purpose | Approach |
|---|---|
| Reduce latency | Place replicas near users |
| Increase availability | Keep working if nodes fail |
| Increase read throughput | Serve reads from multiple replicas |
The main replication architectures:
| Architecture | Leaders | Complexity | Conflict Handling |
|---|---|---|---|
| Single-leader | 1 | Low | None (leader decides) |
| Multi-leader | 2+ | High | Required |
| Leaderless | 0 | Medium | Required |
Note
Replication lag is inevitable with asynchronous replication. Design your application to handle temporary inconsistency, or pay the performance cost of synchronous replication.
Key concepts:
- Synchronous vs asynchronous determines durability guarantees
- Failover is hard; lost writes and split-brain are real risks
- Replication lag causes read-your-writes, monotonic reads, and causality issues
- Multi-leader and leaderless require conflict resolution
- Quorums (w + r > n) guarantee read/write overlap, though edge cases (sloppy quorums, concurrent writes) can still return stale data