The Trouble with Distributed Systems
Working with distributed systems is fundamentally different from working with a single computer. Things that are simple on one machine become surprisingly hard across multiple machines connected by a network.
This chapter is about being pessimistic—cataloging everything that can go wrong. Understanding failure modes is essential for building systems that work despite them.
Note
If you're building a system that runs on a single machine, most of this chapter doesn't apply. But as soon as you have multiple machines communicating over a network, you're in distributed systems territory.
Faults and Partial Failures
On a single computer, operations either work or they don't. A deterministic system produces the same result every time for the same input.
Distributed systems are different:
- Some parts of the system can be broken while others work fine
- Failures are nondeterministic—the same operation might work now but fail a moment later
- You may not even know if something failed or just took a long time
Partial Failure
You send a request to three servers:
Server A: Responds successfully
Server B: Crashed, no response
Server C: Responded, but you never received it (network dropped packet)
What happened? You can't know for sure. This partial failure is what makes distributed systems fundamentally hard.
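To make the client's view concrete, here is a minimal sketch of querying a server with a timeout. The `sendRequest` stub and the 2000 ms cutoff are assumptions for illustration; the point is that a crashed node and a lost response look identical to the caller.

```typescript
// Sketch: query a server and report only what the caller can actually observe.
// `sendRequest` stands in for a real RPC; the 2000 ms cutoff is arbitrary.
async function probe(
  server: string,
  sendRequest: (server: string) => Promise<unknown>,
  timeoutMs = 2000
): Promise<string> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("no response yet")), timeoutMs)
  );
  try {
    await Promise.race([sendRequest(server), timeout]);
    return `${server}: responded`;
  } catch {
    // A crashed node, a dropped request, and a dropped response all land here,
    // and nothing distinguishes them from the client's point of view.
    return `${server}: no usable response`;
  }
}
```

Run against the three servers above, this reports "responded" for A and "no usable response" for both B and C, even though B crashed and C actually did the work.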
Unreliable Networks
Most distributed systems use asynchronous packet networks: you send a message, and the network may deliver it, but with no guarantee of when it will arrive, or whether it will arrive at all.
When you send a request and expect a response, many things can go wrong:
- Request was lost
- Request is waiting in a queue
- Remote node has crashed
- Remote node is temporarily unresponsive
- Response was lost
- Response is waiting in a queue
Warning
You have no way to distinguish these cases. The only information you have is "I haven't received a response yet."
Network Timeouts
The usual approach: declare failure after a timeout. But choosing the right timeout is hard:
- Too short: Declare nodes failed that were just slow (false positives)
- Too long: Take forever to detect actual failures
There's no "correct" timeout—it depends on your specific requirements.
The Timeout Dilemma
Network round-trip: usually 1ms
Timeout: 5ms seems safe
But then:
- Garbage collection pause: 100ms
- Network congestion: 500ms
- Disk waiting for write: 50ms
Your "failed" node was actually just slow.Network Partitions
Network Partitions
Sometimes a network fault splits nodes into groups that can communicate internally but not with each other. This is a network partition or "netsplit."
Network partitions are surprisingly common:
- Studies of data centers show network faults happen regularly
- A misconfigured switch can isolate a rack
- Cloud providers experience partition events
Unreliable Clocks
Every computer has a clock, but clocks across machines don't agree perfectly. This causes problems when you rely on timestamps for ordering or coordination.
Time-of-Day Clocks
These try to return the current date and time. Usually synchronized with NTP (Network Time Protocol).
Problem: Clocks can jump—forward or backward—when synchronized. A clock that was behind might suddenly jump ahead by seconds or minutes.
Monotonic Clocks
These always move forward, good for measuring elapsed time:
const start = process.hrtime();        // monotonic reading: [seconds, nanoseconds]
doSomething();
const elapsed = process.hrtime(start); // time elapsed since `start`, same format
Problem: Only useful for measuring duration on a single machine. Comparing monotonic clock readings across machines is meaningless.
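To set the two clock types side by side in Node terms (an illustrative sketch; `doSomething` is the same placeholder as above, and `process.hrtime.bigint()` is simply the newer form of the monotonic API):

```typescript
function doSomething(): void { /* placeholder for the work being timed */ }

// Time-of-day clock: NTP may step it mid-measurement, so the difference can be
// wrong, or even negative.
const wallStart = Date.now();
doSomething();
const wallElapsedMs = Date.now() - wallStart;

// Monotonic clock: always moves forward, so it is safe for measuring elapsed time,
// but its absolute value means nothing outside this process.
const monoStart = process.hrtime.bigint();
doSomething();
const monoElapsedNs = process.hrtime.bigint() - monoStart;
```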
Clock Synchronization Problems
NTP synchronization is approximate:
- Network delays vary
- Quartz oscillators drift
- NTP servers can give wrong time
- Leap seconds cause jumps
Relying on Timestamps is Dangerous
Node A: "I wrote this value at timestamp 100"
Node B: "I wrote this value at timestamp 105"
But Node A's clock was 10 seconds behind!
Actually: Node B's write happened first.
Using timestamps for ordering gives the wrong result.
Warning
Last-write-wins (LWW) conflict resolution based on timestamps can silently lose data when clocks disagree. The "later" write might actually be the earlier one.
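A minimal sketch of why this loses data, reusing the 10-second skew from the example above (the values and field names are invented for illustration):

```typescript
// Last-write-wins keeps whichever write carries the larger timestamp.
interface Write { value: string; timestampMs: number; }

function lastWriteWins(a: Write, b: Write): Write {
  return a.timestampMs >= b.timestampMs ? a : b;
}

// Node B wrote first with a correct clock. Node A wrote later, but its clock runs
// 10 seconds behind, so A's newer value carries the smaller timestamp.
const fromB: Write = { value: "older value", timestampMs: 105_000 };
const fromA: Write = { value: "newer value", timestampMs: 100_000 };

console.log(lastWriteWins(fromA, fromB).value); // "older value": the newer write is silently dropped
```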
Synchronized Clocks for Real
Google's Spanner uses GPS and atomic clocks to achieve tight synchronization. But even then, they expose the uncertainty:
Spanner: "The time is somewhere between [T1, T2]"When uncertainty intervals don't overlap, you know the ordering. When they do, you must wait until they don't.
Process Pauses
A node in a distributed system can be "paused" at any moment:
- Garbage collection: JVM, Go, .NET can pause for seconds
- Virtual machine suspension: VM can be suspended and resumed
- Context switches: OS might preempt your process
- Disk I/O: Waiting for synchronous writes
- Swapping: Memory swapped to disk
During the pause, the process has no idea time is passing. It continues exactly where it left off—with stale information.
GC Pause Causes Trouble
Thread thinks:                     Reality:
t=0:  Acquire lease (30 sec)       t=0:  Acquire lease
t=1:  Start processing             t=1:  GC pause starts
t=2:  Continue processing          ...
                                   t=35: GC pause ends
                                         (lease expired at t=30!)
t=35: Still has "valid" lease      Another node has the lease now
t=35: Writes to storage            CONFLICT
There's no good way to predict pauses. You must assume your code can pause at any point, for an arbitrary duration.
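The bug pattern can be made concrete with a short sketch (the `Lease` shape and the `write` callback are hypothetical): checking the lease and then acting on it proves nothing, because an arbitrary pause can fall between the two steps.

```typescript
// Anti-pattern sketch: check lease validity, then act on it.
interface Lease { holder: string; expiresAtMs: number; }

async function writeIfLeaseValid(lease: Lease, write: () => Promise<void>): Promise<void> {
  if (Date.now() < lease.expiresAtMs) {
    // The process may pause right here for 30+ seconds (GC, VM suspend, swapping).
    // By the time the next line runs, the lease may have expired and another node
    // may already hold it.
    await write();
  }
}
```

Fencing tokens, discussed below, are one way to let the storage layer catch exactly this situation.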
Knowledge, Truth, and Lies
In a distributed system, a node can't know anything for certain—it can only make decisions based on messages it receives, which may be delayed, lost, or false.
The Truth Is Defined by the Majority
A node cannot trust its own judgment. Consider:
- A node thinks it's the leader
- But it was partitioned from the network
- Meanwhile, other nodes elected a new leader
Who's right? In distributed systems, truth is decided by quorum—a majority of nodes must agree.
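In code terms, a strict majority is all that "quorum" means here (a minimal sketch):

```typescript
// A decision stands only if a strict majority of the cluster voted for it.
function hasQuorum(votesReceived: number, clusterSize: number): boolean {
  return votesReceived >= Math.floor(clusterSize / 2) + 1;
}

hasQuorum(2, 3); // true: 2 of 3 is a majority
hasQuorum(2, 4); // false: the other 2 nodes could decide the opposite
```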
Fencing Tokens
Leases and locks are problematic because a node might think it holds a lock while it doesn't.
Fencing tokens help: each time a lock is granted, include an incrementing token. The storage layer rejects requests with old tokens.
Fencing Tokens
Node A: Acquires lock, gets token 33
Node A: GC pause...
Node B: Acquires lock (A's expired), gets token 34
Node B: Writes to storage with token 34 ✓
Node A: GC pause ends
Node A: Tries to write with token 33 ✗ (rejected, token < 34)
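A sketch of the storage-side check (class and method names are illustrative): the store remembers the highest token it has accepted and refuses anything older.

```typescript
// The storage service tracks the highest fencing token it has seen and
// rejects writes that arrive with an older token.
class FencedStore {
  private highestToken = 0;
  private data = new Map<string, string>();

  write(key: string, value: string, token: number): boolean {
    if (token < this.highestToken) return false; // stale client: reject the write
    this.highestToken = token;
    this.data.set(key, value);
    return true;
  }
}

const store = new FencedStore();
store.write("x", "from B", 34); // true: token 34 accepted
store.write("x", "from A", 33); // false: token 33 < 34, the paused node is fenced off
```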
Byzantine Faults
We've assumed nodes are honest—they may be slow or crashed, but they don't lie.
Byzantine faults occur when nodes actively misbehave:
- Sending corrupted data
- Claiming to be a different node
- Pretending to have received messages they didn't
Most internal systems don't need Byzantine fault tolerance. It's mainly relevant for:
- Blockchain/cryptocurrency (nodes might be adversarial)
- Aerospace (radiation can flip bits)
- Multi-party systems where participants don't trust each other
System Models
To reason about distributed systems, we need to abstract away some complexity. A system model defines what faults we expect.
Timing Models
Synchronous model: Network delay, process pauses, and clock drift all have known upper bounds. A strong assumption that is rarely realistic in practice.
Partially synchronous model: System usually behaves synchronously, but occasionally exceeds bounds. Most realistic for real systems.
Asynchronous model: No timing assumptions at all. Very restrictive; can't even use timeouts.
Failure Models
Crash-stop faults: A node can crash at any moment, and never comes back.
Crash-recovery faults: A node can crash and restart, with durable storage preserved.
Byzantine faults: Nodes can behave arbitrarily, including maliciously.
Note
Most practical systems assume a partially synchronous model with crash-recovery faults. This balances realism with the ability to make progress.
Algorithm Correctness
Distributed algorithms must be correct despite the faults we expect. We define correctness through properties:
Safety Properties
"Nothing bad happens." If violated, we can point to a specific moment when it went wrong.
Examples:
- Uniqueness: No two nodes hold the same lock
- Consistency: Reads reflect the most recent completed write
- Ordering: Events happen in causal order
Liveness Properties
"Something good eventually happens." We can't point to a moment of violation, only observe it hasn't happened yet.
Examples:
- Availability: Every request eventually receives a response
- Termination: Algorithm eventually completes
Warning
There's often a fundamental tradeoff: we can guarantee safety but may have to sacrifice liveness under certain fault scenarios.
Summary
Distributed systems face challenges that don't exist on single machines:
| Challenge | Why It's Hard |
|---|---|
| Network failures | Packets can be lost, delayed, or reordered |
| Partial failures | Some parts work while others don't |
| Unreliable clocks | Clocks drift and jump; ordering is uncertain |
| Process pauses | Code can pause at any time for arbitrary duration |
| Lack of global knowledge | No node knows the true state of the system |
Key concepts:
- Timeouts detect failures but can't distinguish slow from crashed
- Network partitions are a fact of life in distributed systems
- Clocks are unreliable for ordering; use logical clocks instead
- GC pauses and other interruptions can violate assumptions
- Truth is defined by quorum, not individual nodes
- Fencing tokens prevent stale clients from causing damage
- System models help us reason about what faults to handle
Note
The goal isn't to eliminate all failures—that's impossible. The goal is to build systems that behave correctly despite expected failures.
This pessimistic view might seem discouraging, but it's necessary. Only by understanding what can go wrong can we design systems that work correctly in the real world.