Reliable, Scalable, and Maintainable Applications
Modern applications are increasingly data-intensive rather than compute-intensive. The bottleneck is usually the amount, complexity, and rate of change of data, not raw CPU cycles. These applications need to store data (databases), remember the results of expensive operations (caches), allow keyword searching (search indexes), send messages to other processes asynchronously (message queues), and periodically crunch large amounts of data (batch processing).
What Makes an Application Data-Intensive?
A data-intensive application typically combines several building blocks:
- Databases — Store data for later retrieval
- Caches — Remember expensive operation results for faster reads
- Search indexes — Allow keyword or filtered searching
- Stream processing — Handle messages between processes asynchronously
- Batch processing — Periodically crunch large amounts of data
Note
The challenge isn't just using these tools individually—it's combining them effectively while ensuring the overall system remains reliable, scalable, and maintainable.
The Three Pillars
When building data systems, we care about three fundamental properties:
1. Reliability
The system should continue to work correctly even when things go wrong.
2. Scalability
The system should handle growth gracefully.
3. Maintainability
The system should be easy to work on over time.
Reliability
Reliability means the system continues to work correctly even when faults occur. Note the distinction:
- Fault: A component deviating from spec
- Failure: The system as a whole stops providing service
We can't prevent all faults, but we can build fault-tolerant systems that prevent faults from causing failures.
Warning
There's no such thing as 100% reliable. The goal is to make systems as reliable as needed for the specific use case, while accepting the cost tradeoffs.
Types of Faults
Hardware Faults
Hard disks crash, RAM becomes faulty, power grids fail. In a data center with 10,000 disks, you might expect one disk to die per day on average.
Traditional approach: Add redundancy—RAID for disks, dual power supplies, hot-swappable CPUs.
Modern approach: Software fault-tolerance techniques that allow machines to be lost entirely, enabling rolling upgrades without downtime.
Hardware Failure Rates
| Component | Mean Time to Failure (MTTF) |
|---|---|
| Hard disk | 10-50 years |
| Server | 3-5 years |
| Data center power | Hours to days |
With thousands of machines, something is always broken somewhere.
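A quick back-of-the-envelope check of that claim, using the MTTF range from the table above (a sketch, not a precise failure model):

```python
# Expected disk failures per day in a cluster of 10,000 disks,
# assuming each disk's mean time to failure is somewhere in 10-50 years.
disks = 10_000
for mttf_years in (10, 30, 50):
    failures_per_day = disks / (mttf_years * 365)
    print(f"MTTF {mttf_years:2d} years -> ~{failures_per_day:.1f} disk failures/day")
```

Across that range the answer stays between roughly half a failure and three failures per day, which is where the "one disk per day" rule of thumb comes from.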
Software Faults
Software bugs can lie dormant for years until triggered by unusual circumstances:
- A bug that crashes every server when given a particular bad input
- A runaway process using all CPU, memory, or disk
- A service that becomes slow, causing cascading slowdowns
- Dependency failures that take down dependent services
Software faults are harder to anticipate because they're correlated—a single bug can affect all nodes simultaneously.
Human Errors
Humans are unreliable. Studies of large internet services show that configuration errors by operators are the leading cause of outages.
How to minimize human errors:
- Design systems that minimize opportunities for error
  - Well-designed APIs make doing the right thing easy
  - Dangerous operations should require confirmation
- Decouple the places where people make mistakes from the places where mistakes can cause failures
  - Sandbox environments for safe experimentation
  - Gradual rollouts to detect problems before full deployment (see the sketch after this list)
- Test thoroughly at all levels
  - Unit tests, integration tests, end-to-end tests
  - Automated testing catches many issues before production
- Allow quick recovery
  - Fast rollback mechanisms for code changes
  - Gradual rollouts so problems affect few users first
- Set up detailed monitoring
  - Performance metrics and error rates
  - Early warning systems to detect problems quickly
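One minimal way a gradual rollout can be implemented is to gate a risky change behind a deterministic per-user bucket, so only a small, stable fraction of users sees it at first. This is a sketch; the function name, feature name, and bucketing scheme are illustrative, not from any particular framework:

```python
import hashlib

def in_rollout(user_id: str, feature: str, rollout_percent: float) -> bool:
    """Deterministically bucket a user for a gradual rollout.

    Hashing the feature name together with the user id yields a stable
    bucket in [0, 100), so the same user keeps the same decision as the
    rollout percentage is ramped up from 1% towards 100%.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 100.0   # 0.00 .. 99.99
    return bucket < rollout_percent

# Ship a risky change to 1% of users first, then widen once monitoring looks healthy.
print(in_rollout("user-42", "new-timeline-renderer", 1.0))
```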
Scalability
Scalability describes a system's ability to cope with increased load. It's not a one-dimensional label—you can't say "X is scalable" in absolute terms. Rather, it's about specific questions:
"If the system grows in a particular way, what are our options for coping?"
Describing Load
Before discussing handling growth, we need to describe the current load. Load parameters capture this, and the right choice depends on your system architecture:
- Web server: Requests per second
- Database: Ratio of reads to writes
- Chat room: Number of simultaneously active users
- Cache: Hit rate
Twitter's Scaling Challenge
Twitter's main operations (circa 2012):
- Post tweet: ~4,600 requests/second average, ~12,000/second peak
- Home timeline: ~300,000 requests/second
The challenge wasn't the volume of posts—it was the fan-out. When a user posts, that tweet needs to appear in all their followers' timelines.
Approach 1: Query on read

```sql
SELECT tweets.* FROM tweets
  JOIN follows ON follows.followee_id = tweets.sender_id
 WHERE follows.follower_id = current_user
```

Simple, but far too slow at 300,000 home-timeline reads per second.
Approach 2: Write to all follower timelines

When posting, insert the tweet into every follower's cached home timeline.
This works well for most users, but celebrities with millions of followers create write storms.
Hybrid solution: Regular users use approach 2. Celebrities' tweets are fetched separately and merged when reading.
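A toy in-memory sketch of that hybrid: fan out on write for regular users, store celebrity tweets once and merge them at read time. The data structures and the follower threshold are made up for illustration, not Twitter's actual implementation:

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000          # illustrative cutoff

followers = defaultdict(set)             # user_id -> ids of users following them
home_timelines = defaultdict(list)       # user_id -> cached home timeline (approach 2)
celebrity_tweets = defaultdict(list)     # celebrity user_id -> their own tweets (approach 1)

def post_tweet(author_id, tweet):
    if len(followers[author_id]) >= CELEBRITY_THRESHOLD:
        # Celebrity: store the tweet once; merge into timelines at read time.
        celebrity_tweets[author_id].append(tweet)
    else:
        # Regular user: fan out on write to every follower's cached timeline.
        for follower_id in followers[author_id]:
            home_timelines[follower_id].append(tweet)

def read_home_timeline(user_id, followees):
    # Cached entries plus any celebrity tweets from followed users, merged on read.
    merged = list(home_timelines[user_id])
    for followee_id in followees:
        merged.extend(celebrity_tweets.get(followee_id, []))
    return merged                         # a real system would sort by time and truncate
```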
Describing Performance
Once you've described load, investigate what happens when load increases:
- If you increase a load parameter while keeping resources fixed, how is performance affected?
- If you want to keep performance constant, how much do you need to increase resources?
Throughput vs Response Time
- Throughput: Number of records processed per second (batch systems)
- Response time: Time from request to response (online systems)
Warning
Response time varies with each request. Don't think of it as a single number—think of it as a distribution.
Percentiles
The median (50th percentile, p50) is a good measure of typical response time—half of requests are faster, half are slower.
But to understand outliers, look at higher percentiles:
- p95: 95% of requests are faster than this
- p99: 99% of requests are faster than this
- p999: 99.9% of requests are faster than this
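A minimal sketch of computing percentiles from a batch of measured response times. Production systems usually track these with streaming approximations (histograms) rather than sorting every sample:

```python
def percentile(samples_ms, p):
    """Simple nearest-rank percentile over a batch of samples."""
    ordered = sorted(samples_ms)
    rank = max(1, round(p / 100 * len(ordered)))   # 1-based nearest rank
    return ordered[rank - 1]

# Nine fast requests and one slow outlier: the median looks fine, the p99 does not.
times_ms = [87, 95, 102, 110, 120, 135, 150, 180, 240, 5000]
print(percentile(times_ms, 50))   # 120  -> typical experience
print(percentile(times_ms, 99))   # 5000 -> what the unluckiest requests see
```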
Why Percentiles Matter
If your median response time is 200ms but your p99 is 5 seconds:
- Most users have a good experience
- 1 in 100 users waits 25x longer than typical
- These are often your most valuable customers (more data, more activity)
Amazon observed that a 100ms increase in response time reduced sales by 1%.
Tail latency amplification: When a request requires multiple backend calls, the slowest call determines the total time. With many parallel calls, the chance of hitting a slow outlier increases.
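A quick way to see the amplification: if each backend call has a 1% chance of being slower than its p99, the fraction of user requests that hit at least one slow call grows rapidly with the number of calls per request (assuming the calls are independent, which real systems only approximate):

```python
# Probability that a request touching n backends hits at least one p99-slow call,
# assuming each call is slow independently with probability 1%.
for n in (1, 10, 50, 100):
    p_at_least_one_slow = 1 - 0.99 ** n
    print(f"{n:3d} backend calls -> {p_at_least_one_slow:.0%} of requests affected")
```

With 100 backend calls per request, roughly 63% of user requests are slowed by at least one p99 outlier.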
Approaches to Scaling
Scaling Up (Vertical Scaling)
Move to a more powerful machine. Simpler, but has limits and cost grows superlinearly.
Scaling Out (Horizontal Scaling)
Distribute load across multiple smaller machines. More complex, but more flexible.
In reality, you'll use a mix. A pragmatic approach:
- Start with a single powerful machine
- Scale out when you hit limits or for redundancy
- Design with eventual distribution in mind
Note
There's no magic formula for scaling. A system handling 100,000 requests/second at 1KB each requires different architecture than 3 requests/minute at 2GB each—even though both have the same throughput in bytes.
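The arithmetic behind that note, for the curious (assuming 1 KB = 1,024 bytes and 1 GB = 1,024³ bytes):

```python
# Both workloads move roughly 100 MB/s, yet call for very different designs.
many_small = 100_000 * 1 * 1024          # 100,000 requests/s, 1 KB each
few_large  = (3 / 60) * 2 * 1024**3      # 3 requests/min, 2 GB each
print(f"{many_small / 1e6:.0f} MB/s vs {few_large / 1e6:.0f} MB/s")
```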
Maintainability
The majority of software cost is in ongoing maintenance:
- Fixing bugs
- Keeping systems operational
- Investigating failures
- Adapting to new platforms
- Adding new features
- Repaying technical debt
Good design principles minimize pain and make maintenance enjoyable:
Operability
Make it easy for operations teams to keep the system running smoothly.
Good operability means:
- Easy monitoring of system health
- Good documentation of system behavior
- Avoiding dependency on individual machines
- Providing self-healing capabilities where possible
- Predictable behavior that minimizes surprises
- Good default configurations with manual overrides
Simplicity
New engineers should be able to understand the system easily.
Complexity manifests as:
- Tight coupling between modules
- Tangled dependencies
- Inconsistent naming conventions
- Hacks to work around performance problems
- Special cases to handle edge conditions
Abstraction is the best tool for removing complexity. A good abstraction hides implementation details behind a clean interface. Examples: SQL hiding storage engine details, programming languages hiding machine code.
Evolvability (Extensibility)
Making changes should be easy.
System requirements constantly change:
- New facts emerge
- Business priorities shift
- Users request new features
- Platforms evolve
- Legal requirements change
Agile methodologies help at the team level. For larger systems, evolvability requires good design:
- Loose coupling
- Clear interfaces
- Small, focused components
Note
There's no magic trick to making systems evolvable. Simple and easy-to-understand systems are generally easier to modify than complex ones.
Summary
This chapter introduced the foundational vocabulary for thinking about data-intensive applications:
| Concept | Key Question |
|---|---|
| Reliability | Does the system work correctly when things go wrong? |
| Scalability | Can the system handle growth? |
| Maintainability | Can people work with the system productively? |
These goals often conflict—increased reliability might reduce maintainability, optimizing for scalability might add complexity. The art is finding the right balance for your specific situation.
Warning
Don't over-engineer for hypothetical future scale. Build for current needs while designing for change. Premature optimization is the root of much unnecessary complexity.