The Trouble with Distributed Systems
Working with distributed systems is fundamentally different from working with a single computer. Things that are simple on one machine become surprisingly hard across multiple machines connected by a network.
This chapter is about being pessimistic—cataloging everything that can go wrong. Understanding failure modes is essential for building systems that work despite them.
Note
If you're building a system that runs on a single machine, most of this chapter doesn't apply. But as soon as you have multiple machines communicating over a network, you're in distributed systems territory.
Faults and Partial Failures
On a single computer, operations either work or they don't. A deterministic system produces the same result every time for the same input.
Distributed systems are different:
- Some parts of the system can be broken while others work fine
- Failures are nondeterministic—the same operation might work now but fail a moment later
- You may not even know if something failed or just took a long time
Partial Failure
You send a request to three servers:
Server A: Responds successfully
Server B: Crashed, no response
Server C: Responded, but you never received it (network dropped packet)
What happened? You can't know for sure. This partial failure is what makes distributed systems fundamentally hard.
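To make the client's view concrete, here is a minimal sketch of querying a server with a timeout. The `sendRequest` stub and the 2000 ms cutoff are assumptions for illustration; the point is that a crashed node and a lost response look identical to the caller.

```typescript
// Sketch: query a server and report only what the caller can actually observe.
// `sendRequest` stands in for a real RPC; the 2000 ms cutoff is arbitrary.
async function probe(
  server: string,
  sendRequest: (server: string) => Promise<unknown>,
  timeoutMs = 2000
): Promise<string> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("no response yet")), timeoutMs)
  );
  try {
    await Promise.race([sendRequest(server), timeout]);
    return `${server}: responded`;
  } catch {
    // A crashed node, a dropped request, and a dropped response all land here,
    // and nothing distinguishes them from the client's point of view.
    return `${server}: no usable response`;
  }
}
```

Run against the three servers above, this reports "responded" for A and "no usable response" for both B and C, even though B crashed and C actually did the work.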
Unreliable Networks
Most distributed systems use asynchronous packet networks: you send a message, and the network may deliver it, but with no guarantee of when it will arrive, or whether it will arrive at all.
When you send a request and expect a response, many things can go wrong:
- Request was lost
- Request is waiting in a queue
- Remote node has crashed
- Remote node is temporarily unresponsive
- Response was lost
- Response is waiting in a queue
Warning
You have no way to distinguish these cases. The only information you have is "I haven't received a response yet."
Network Timeouts
The usual approach: declare failure after a timeout. But choosing the right timeout is hard:
- Too short: Declare nodes failed that were just slow (false positives)
- Too long: Take forever to detect actual failures
There's no "correct" timeout—it depends on your specific requirements.
The Timeout Dilemma
Network round-trip: usually 1ms
Timeout: 5ms seems safe
But then:
- Garbage collection pause: 100ms
- Network congestion: 500ms
- Disk waiting for write: 50ms
Your "failed" node was actually just slow.Network Partitions
Network Partitions
Sometimes a network fault splits nodes into groups that can communicate internally but not with each other. This is a network partition or "netsplit."
Network partitions are surprisingly common:
- Studies of data centers show network faults happen regularly
- A misconfigured switch can isolate a rack
- Cloud providers experience partition events
Unreliable Clocks
Every computer has a clock, but clocks across machines don't agree perfectly. This causes problems when you rely on timestamps for ordering or coordination.
Time-of-Day Clocks
These try to return the current date and time. Usually synchronized with NTP (Network Time Protocol).
Problem: Clocks can jump—forward or backward—when synchronized. A clock that was behind might suddenly jump ahead by seconds or minutes.
Monotonic Clocks
These always move forward, good for measuring elapsed time:
const start = process.hrtime();        // monotonic reading: [seconds, nanoseconds]
doSomething();
const elapsed = process.hrtime(start); // time elapsed since `start`, same format
Problem: Only useful for measuring duration on a single machine. Comparing monotonic clock readings across machines is meaningless.
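To set the two clock types side by side in Node terms (an illustrative sketch; `doSomething` is the same placeholder as above, and `process.hrtime.bigint()` is simply the newer form of the monotonic API):

```typescript
function doSomething(): void { /* placeholder for the work being timed */ }

// Time-of-day clock: NTP may step it mid-measurement, so the difference can be
// wrong, or even negative.
const wallStart = Date.now();
doSomething();
const wallElapsedMs = Date.now() - wallStart;

// Monotonic clock: always moves forward, so it is safe for measuring elapsed time,
// but its absolute value means nothing outside this process.
const monoStart = process.hrtime.bigint();
doSomething();
const monoElapsedNs = process.hrtime.bigint() - monoStart;
```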
Clock Synchronization Problems
NTP synchronization is approximate:
- Network delays vary
- Quartz oscillators drift
- NTP servers can give wrong time
- Leap seconds cause jumps
Relying on Timestamps is Dangerous
Node A: "I wrote this value at timestamp 100"
Node B: "I wrote this value at timestamp 105"
But Node A's clock was 10 seconds behind!
Actually: Node B's write happened first.
Using timestamps for ordering gives the wrong result.
Warning
Last-write-wins (LWW) conflict resolution based on timestamps can silently lose data when clocks disagree. The "later" write might actually be the earlier one.
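A minimal sketch of why this loses data, reusing the 10-second skew from the example above (the values and field names are invented for illustration):

```typescript
// Last-write-wins keeps whichever write carries the larger timestamp.
interface Write { value: string; timestampMs: number; }

function lastWriteWins(a: Write, b: Write): Write {
  return a.timestampMs >= b.timestampMs ? a : b;
}

// Node B wrote first with a correct clock. Node A wrote later, but its clock runs
// 10 seconds behind, so A's newer value carries the smaller timestamp.
const fromB: Write = { value: "older value", timestampMs: 105_000 };
const fromA: Write = { value: "newer value", timestampMs: 100_000 };

console.log(lastWriteWins(fromA, fromB).value); // "older value": the newer write is silently dropped
```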
Synchronized Clocks for Real
Google's Spanner uses GPS and atomic clocks to achieve tight synchronization. But even then, they expose the uncertainty:
Spanner: "The time is somewhere between [T1, T2]"When uncertainty intervals don't overlap, you know the ordering. When they do, you must wait until they don't.
Process Pauses
A node in a distributed system can be "paused" at any moment:
- Garbage collection: JVM, Go, .NET can pause for seconds
- Virtual machine suspension: VM can be suspended and resumed
- Context switches: OS might preempt your process
- Disk I/O: Waiting for synchronous writes
- Swapping: Memory swapped to disk
During the pause, the process has no idea time is passing. It continues exactly where it left off—with stale information.
GC Pause Causes Trouble
Thread thinks:                     Reality:
t=0:  Acquire lease (30 sec)       t=0:  Acquire lease
t=1:  Start processing             t=1:  GC pause starts
t=2:  Continue processing          ...
                                   t=35: GC pause ends
                                         (lease expired at t=30!)
t=35: Still has "valid" lease      Another node has the lease now
t=35: Writes to storage            CONFLICT
There's no good way to predict pauses. You must assume your code can pause at any point, for an arbitrary duration.
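The bug pattern can be made concrete with a short sketch (the `Lease` shape and the `write` callback are hypothetical): checking the lease and then acting on it proves nothing, because an arbitrary pause can fall between the two steps.

```typescript
// Anti-pattern sketch: check lease validity, then act on it.
interface Lease { holder: string; expiresAtMs: number; }

async function writeIfLeaseValid(lease: Lease, write: () => Promise<void>): Promise<void> {
  if (Date.now() < lease.expiresAtMs) {
    // The process may pause right here for 30+ seconds (GC, VM suspend, swapping).
    // By the time the next line runs, the lease may have expired and another node
    // may already hold it.
    await write();
  }
}
```

Fencing tokens, discussed below, are one way to let the storage layer catch exactly this situation.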
Knowledge, Truth, and Lies
In a distributed system, a node can't know anything for certain—it can only make decisions based on messages it receives, which may be delayed, lost, or false.
The Truth Is Defined by the Majority
A node cannot trust its own judgment. Consider:
- A node thinks it's the leader
- But it was partitioned from the network
- Meanwhile, other nodes elected a new leader
Who's right? In distributed systems, truth is decided by quorum—a majority of nodes must agree.
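In code terms, a strict majority is all that "quorum" means here (a minimal sketch):

```typescript
// A decision stands only if a strict majority of the cluster voted for it.
function hasQuorum(votesReceived: number, clusterSize: number): boolean {
  return votesReceived >= Math.floor(clusterSize / 2) + 1;
}

hasQuorum(2, 3); // true: 2 of 3 is a majority
hasQuorum(2, 4); // false: the other 2 nodes could decide the opposite
```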
Fencing Tokens
Leases and locks are problematic because a node might think it holds a lock while it doesn't.
Fencing tokens help: each time a lock is granted, include an incrementing token. The storage layer rejects requests with old tokens.
Fencing Tokens
Node A: Acquires lock, gets token 33
Node A: GC pause...
Node B: Acquires lock (A's expired), gets token 34
Node B: Writes to storage with token 34 ✓
Node A: GC pause ends
Node A: Tries to write with token 33 ✗ (rejected, token < 34)
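A sketch of the storage-side check (class and method names are illustrative): the store remembers the highest token it has accepted and refuses anything older.

```typescript
// The storage service tracks the highest fencing token it has seen and
// rejects writes that arrive with an older token.
class FencedStore {
  private highestToken = 0;
  private data = new Map<string, string>();

  write(key: string, value: string, token: number): boolean {
    if (token < this.highestToken) return false; // stale client: reject the write
    this.highestToken = token;
    this.data.set(key, value);
    return true;
  }
}

const store = new FencedStore();
store.write("x", "from B", 34); // true: token 34 accepted
store.write("x", "from A", 33); // false: token 33 < 34, the paused node is fenced off
```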
Byzantine Faults
We've assumed nodes are honest—they may be slow or crashed, but they don't lie.
Byzantine faults occur when nodes actively misbehave:
- Sending corrupted data
- Claiming to be a different node
- Pretending to have received messages they didn't
Most internal systems don't need Byzantine fault tolerance. It's mainly relevant for:
- Blockchain/cryptocurrency (nodes might be adversarial)
- Aerospace (radiation can flip bits)
- Multi-party systems where participants don't trust each other
System Models
To reason about distributed systems, we need to abstract away some complexity. A system model defines what faults we expect.
Timing Models
Synchronous model: Network delay, process pauses, and clock drift all have known upper bounds. A strong assumption that is rarely realistic in practice.
Partially synchronous model: System usually behaves synchronously, but occasionally exceeds bounds. Most realistic for real systems.
Asynchronous model: No timing assumptions at all. Very restrictive; can't even use timeouts.
Failure Models
Crash-stop faults: A node can crash at any moment, and never comes back.
Crash-recovery faults: A node can crash and restart, with durable storage preserved.
Byzantine faults: Nodes can behave arbitrarily, including maliciously.
Note
Most practical systems assume a partially synchronous model with crash-recovery faults. This balances realism with the ability to make progress.
Algorithm Correctness
Distributed algorithms must be correct despite the faults we expect. We define correctness through properties:
Safety Properties
"Nothing bad happens." If violated, we can point to a specific moment when it went wrong.
Examples:
- Uniqueness: No two nodes hold the same lock
- Consistency: Reads reflect the most recent completed write
- Ordering: Events happen in causal order
Liveness Properties
"Something good eventually happens." We can't point to a moment of violation, only observe it hasn't happened yet.
Examples:
- Availability: Every request eventually receives a response
- Termination: Algorithm eventually completes
Warning
There's often a fundamental tradeoff: we can guarantee safety but may have to sacrifice liveness under certain fault scenarios.
Summary
Distributed systems face challenges that don't exist on single machines:
| Challenge | Why It's Hard |
|---|---|
| Network failures | Packets can be lost, delayed, or reordered |
| Partial failures | Some parts work while others don't |
| Unreliable clocks | Clocks drift and jump; ordering is uncertain |
| Process pauses | Code can pause at any time for arbitrary duration |
| Lack of global knowledge | No node knows the true state of the system |
Key concepts:
- Timeouts detect failures but can't distinguish slow from crashed
- Network partitions are a fact of life in distributed systems
- Clocks are unreliable for ordering; use logical clocks instead
- GC pauses and other interruptions can violate assumptions
- Truth is defined by quorum, not individual nodes
- Fencing tokens prevent stale clients from causing damage
- System models help us reason about what faults to handle
Note
The goal isn't to eliminate all failures—that's impossible. The goal is to build systems that behave correctly despite expected failures.
This pessimistic view might seem discouraging, but it's necessary. Only by understanding what can go wrong can we design systems that work correctly in the real world.