Reliable, Scalable, and Maintainable Applications
Modern applications are increasingly data-intensive rather than compute-intensive. The bottleneck is usually the amount, complexity, and rate of change of data, not raw CPU cycles. These applications need to store data (databases), remember the results of expensive operations (caches), allow keyword searching (search indexes), send messages to other processes asynchronously (message queues), and periodically crunch large amounts of data (batch processing).
What Makes an Application Data-Intensive?
A data-intensive application typically combines several building blocks:
- Databases — Store data for later retrieval
- Caches — Remember expensive operation results for faster reads
- Search indexes — Allow keyword or filtered searching
- Stream processing — Handle messages between processes asynchronously
- Batch processing — Periodically crunch large amounts of data
Note
The challenge isn't just using these tools individually—it's combining them effectively while ensuring the overall system remains reliable, scalable, and maintainable.
The Three Pillars
When building data systems, we care about three fundamental properties:
1. Reliability
The system should continue to work correctly even when things go wrong.
2. Scalability
The system should handle growth gracefully.
3. Maintainability
The system should be easy to work on over time.
Reliability
Reliability means the system continues to work correctly even when faults occur. Note the distinction:
- Fault: A component deviating from spec
- Failure: The system as a whole stops providing service
We can't prevent all faults, but we can build fault-tolerant systems that prevent faults from causing failures.
Warning
There's no such thing as 100% reliable. The goal is to make systems as reliable as needed for the specific use case, while accepting the cost tradeoffs.
Types of Faults
Hardware Faults
Hard disks crash, RAM becomes faulty, power grids fail. In a data center with 10,000 disks, you might expect one disk to die per day on average.
Traditional approach: Add redundancy—RAID for disks, dual power supplies, hot-swappable CPUs.
Modern approach: Software fault-tolerance techniques that allow machines to be lost entirely, enabling rolling upgrades without downtime.
Hardware Failure Rates
| Component | Mean Time to Failure (MTTF) |
|---|---|
| Hard disk | 10-50 years |
| Server | 3-5 years |
| Data center power | Hours to days |
With thousands of machines, something is always broken somewhere.
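A quick back-of-the-envelope check of that claim, using the MTTF range from the table above (a sketch, not a precise failure model):

```python
# Expected disk failures per day in a cluster of 10,000 disks,
# assuming each disk's mean time to failure is somewhere in 10-50 years.
disks = 10_000
for mttf_years in (10, 30, 50):
    failures_per_day = disks / (mttf_years * 365)
    print(f"MTTF {mttf_years:2d} years -> ~{failures_per_day:.1f} disk failures/day")
```

Across that range the answer stays between roughly half a failure and three failures per day, which is where the "one disk per day" rule of thumb comes from.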
Software Faults
Software bugs can lie dormant for years until triggered by unusual circumstances:
- A bug that crashes every server when given a particular bad input
- A runaway process using all CPU, memory, or disk
- A service that becomes slow, causing cascading slowdowns
- Dependency failures that take down dependent services
Software faults are harder to anticipate because they're correlated—a single bug can affect all nodes simultaneously.
Human Errors
Humans are unreliable. Studies of large internet services show that configuration errors by operators are the leading cause of outages.
How to minimize human errors:
- Design systems that minimize opportunities for error
  - Well-designed APIs make doing the right thing easy
  - Dangerous operations should require confirmation
- Decouple the places where people make mistakes from the places where mistakes can cause failures
  - Sandbox environments for safe experimentation
  - Gradual rollouts to detect problems before full deployment (see the sketch after this list)
- Test thoroughly at all levels
  - Unit tests, integration tests, end-to-end tests
  - Automated testing catches many issues before production
- Allow quick recovery
  - Fast rollback mechanisms for code changes
  - Gradual rollouts so problems affect few users first
- Set up detailed monitoring
  - Performance metrics and error rates
  - Early warning systems to detect problems quickly
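One minimal way a gradual rollout can be implemented is to gate a risky change behind a deterministic per-user bucket, so only a small, stable fraction of users sees it at first. This is a sketch; the function name, feature name, and bucketing scheme are illustrative, not from any particular framework:

```python
import hashlib

def in_rollout(user_id: str, feature: str, rollout_percent: float) -> bool:
    """Deterministically bucket a user for a gradual rollout.

    Hashing the feature name together with the user id yields a stable
    bucket in [0, 100), so the same user keeps the same decision as the
    rollout percentage is ramped up from 1% towards 100%.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 100.0   # 0.00 .. 99.99
    return bucket < rollout_percent

# Ship a risky change to 1% of users first, then widen once monitoring looks healthy.
print(in_rollout("user-42", "new-timeline-renderer", 1.0))
```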
Scalability
Scalability describes a system's ability to cope with increased load. It's not a one-dimensional label—you can't say "X is scalable" in absolute terms. Rather, it's about specific questions:
"If the system grows in a particular way, what are our options for coping?"
Describing Load
Before discussing handling growth, we need to describe the current load. Load parameters capture this, and the right choice depends on your system architecture:
- Web server: Requests per second
- Database: Ratio of reads to writes
- Chat room: Number of simultaneously active users
- Cache: Hit rate
Twitter's Scaling Challenge
Twitter's main operations (circa 2012):
- Post tweet: ~4,600 requests/second average, ~12,000/second peak
- Home timeline: ~300,000 requests/second
The challenge wasn't the volume of posts—it was the fan-out. When a user posts, that tweet needs to appear in all their followers' timelines.
Approach 1: Query on read

```sql
SELECT tweets.* FROM tweets
  JOIN follows ON follows.followee_id = tweets.sender_id
 WHERE follows.follower_id = current_user
```

Simple, but far too slow at 300,000 home-timeline reads per second.
Approach 2: Write to all follower timelines

When posting, insert the tweet into every follower's cached home timeline.
This works well for most users, but celebrities with millions of followers create write storms.
Hybrid solution: Regular users use approach 2. Celebrities' tweets are fetched separately and merged when reading.
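A toy in-memory sketch of that hybrid: fan out on write for regular users, store celebrity tweets once and merge them at read time. The data structures and the follower threshold are made up for illustration, not Twitter's actual implementation:

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000          # illustrative cutoff

followers = defaultdict(set)             # user_id -> ids of users following them
home_timelines = defaultdict(list)       # user_id -> cached home timeline (approach 2)
celebrity_tweets = defaultdict(list)     # celebrity user_id -> their own tweets (approach 1)

def post_tweet(author_id, tweet):
    if len(followers[author_id]) >= CELEBRITY_THRESHOLD:
        # Celebrity: store the tweet once; merge into timelines at read time.
        celebrity_tweets[author_id].append(tweet)
    else:
        # Regular user: fan out on write to every follower's cached timeline.
        for follower_id in followers[author_id]:
            home_timelines[follower_id].append(tweet)

def read_home_timeline(user_id, followees):
    # Cached entries plus any celebrity tweets from followed users, merged on read.
    merged = list(home_timelines[user_id])
    for followee_id in followees:
        merged.extend(celebrity_tweets.get(followee_id, []))
    return merged                         # a real system would sort by time and truncate
```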
Describing Performance
Once you've described load, investigate what happens when load increases:
- If you increase a load parameter while keeping resources fixed, how is performance affected?
- If you want to keep performance constant, how much do you need to increase resources?
Throughput vs Response Time
- Throughput: Number of records processed per second (batch systems)
- Response time: Time from request to response (online systems)
Warning
Response time varies with each request. Don't think of it as a single number—think of it as a distribution.
Percentiles
The median (50th percentile, p50) is a good measure of typical response time—half of requests are faster, half are slower.
But to understand outliers, look at higher percentiles:
- p95: 95% of requests are faster than this
- p99: 99% of requests are faster than this
- p999: 99.9% of requests are faster than this
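A minimal sketch of computing percentiles from a batch of measured response times. Production systems usually track these with streaming approximations (histograms) rather than sorting every sample:

```python
def percentile(samples_ms, p):
    """Simple nearest-rank percentile over a batch of samples."""
    ordered = sorted(samples_ms)
    rank = max(1, round(p / 100 * len(ordered)))   # 1-based nearest rank
    return ordered[rank - 1]

# Nine fast requests and one slow outlier: the median looks fine, the p99 does not.
times_ms = [87, 95, 102, 110, 120, 135, 150, 180, 240, 5000]
print(percentile(times_ms, 50))   # 120  -> typical experience
print(percentile(times_ms, 99))   # 5000 -> what the unluckiest requests see
```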
Why Percentiles Matter
If your median response time is 200ms but your p99 is 5 seconds:
- Most users have a good experience
- 1 in 100 users waits 25x longer than typical
- These are often your most valuable customers (more data, more activity)
Amazon observed that a 100ms increase in response time reduced sales by 1%.
Tail latency amplification: When a request requires multiple backend calls, the slowest call determines the total time. With many parallel calls, the chance of hitting a slow outlier increases.
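A quick way to see the amplification: if each backend call has a 1% chance of being slower than its p99, the fraction of user requests that hit at least one slow call grows rapidly with the number of calls per request (assuming the calls are independent, which real systems only approximate):

```python
# Probability that a request touching n backends hits at least one p99-slow call,
# assuming each call is slow independently with probability 1%.
for n in (1, 10, 50, 100):
    p_at_least_one_slow = 1 - 0.99 ** n
    print(f"{n:3d} backend calls -> {p_at_least_one_slow:.0%} of requests affected")
```

With 100 backend calls per request, roughly 63% of user requests are slowed by at least one p99 outlier.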
Approaches to Scaling
Scaling Up (Vertical Scaling)
Move to a more powerful machine. Simpler, but has limits and cost grows superlinearly.
Scaling Out (Horizontal Scaling)
Distribute load across multiple smaller machines. More complex, but more flexible.
In reality, you'll use a mix. A pragmatic approach:
- Start with a single powerful machine
- Scale out when you hit limits or for redundancy
- Design with eventual distribution in mind
Note
There's no magic formula for scaling. A system handling 100,000 requests/second at 1KB each requires different architecture than 3 requests/minute at 2GB each—even though both have the same throughput in bytes.
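The arithmetic behind that note, for the curious (assuming 1 KB = 1,024 bytes and 1 GB = 1,024³ bytes):

```python
# Both workloads move roughly 100 MB/s, yet call for very different designs.
many_small = 100_000 * 1 * 1024          # 100,000 requests/s, 1 KB each
few_large  = (3 / 60) * 2 * 1024**3      # 3 requests/min, 2 GB each
print(f"{many_small / 1e6:.0f} MB/s vs {few_large / 1e6:.0f} MB/s")
```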
Maintainability
The majority of software cost is in ongoing maintenance:
- Fixing bugs
- Keeping systems operational
- Investigating failures
- Adapting to new platforms
- Adding new features
- Repaying technical debt
Good design principles minimize pain and make maintenance enjoyable:
Operability
Make it easy for operations teams to keep the system running smoothly.
Good operability means:
- Easy monitoring of system health
- Good documentation of system behavior
- Avoiding dependency on individual machines
- Providing self-healing capabilities where possible
- Predictable behavior that minimizes surprises
- Good default configurations with manual overrides
Simplicity
New engineers should be able to understand the system easily.
Complexity manifests as:
- Tight coupling between modules
- Tangled dependencies
- Inconsistent naming conventions
- Hacks to work around performance problems
- Special cases to handle edge conditions
Abstraction is the best tool for removing complexity. A good abstraction hides implementation details behind a clean interface. Examples: SQL hiding storage engine details, programming languages hiding machine code.
Evolvability (Extensibility)
Making changes should be easy.
System requirements constantly change:
- New facts emerge
- Business priorities shift
- Users request new features
- Platforms evolve
- Legal requirements change
Agile methodologies help at the team level. For larger systems, evolvability requires good design:
- Loose coupling
- Clear interfaces
- Small, focused components
Note
There's no magic trick to making systems evolvable. Simple and easy-to-understand systems are generally easier to modify than complex ones.
Summary
This chapter introduced the foundational vocabulary for thinking about data-intensive applications:
| Concept | Key Question |
|---|---|
| Reliability | Does the system work correctly when things go wrong? |
| Scalability | Can the system handle growth? |
| Maintainability | Can people work with the system productively? |
These goals often conflict—increased reliability might reduce maintainability, optimizing for scalability might add complexity. The art is finding the right balance for your specific situation.
Warning
Don't over-engineer for hypothetical future scale. Build for current needs while designing for change. Premature optimization is the root of much unnecessary complexity.