Designing Data-Intensive Applications

The Future of Data Systems

Throughout this guide, we've explored databases, distributed systems, batch processing, and stream processing. Each component has its strengths and limitations. The art of building data-intensive applications lies in combining these components effectively.

This final chapter synthesizes what we've learned and explores emerging patterns for building robust data systems.


Data Integration

Most applications need more than one data system:

  • Relational database for transactions
  • Search index for full-text search
  • Cache for low-latency reads
  • Analytics warehouse for business intelligence
  • Message queue for async processing

The challenge: keeping all these systems in sync.

The Derived Data Approach

Think of data systems in two categories:

Systems of record: The authoritative source of truth. Writes go here first.

Derived data: Computed from the system of record. Can be rebuilt if lost.

Primary and Derived Data

PLAINTEXT
System of Record: User events in Kafka
                        │
        ┌───────────────┼────────────────┐
        ▼               ▼                ▼
    (derived)       (derived)        (derived)
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ PostgreSQL   │ │ Elasticsearch│ │ Redis Cache  │
│ (OLTP)       │ │ (Search)     │ │ (Sessions)   │
└──────────────┘ └──────────────┘ └──────────────┘

If Elasticsearch fails, rebuild it from the event log.
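
Conceptually, the rebuild is just a replay: start from an empty store and apply every event in log order. A minimal Python sketch of the idea (the event shapes and the dict standing in for Elasticsearch are illustrative assumptions, not any particular client API):

PYTHON
# Rebuilding a derived view by replaying the system of record.
# The event log is modeled as a plain list of dicts; in practice it
# would be a Kafka topic read from the beginning.

def rebuild_search_index(event_log):
    """Replay every event, in order, into a fresh derived store."""
    index = {}  # stand-in for Elasticsearch
    for event in event_log:
        if event["type"] in ("doc_created", "doc_updated"):
            index[event["doc_id"]] = event["body"]
        elif event["type"] == "doc_deleted":
            index.pop(event["doc_id"], None)
    return index

events = [
    {"type": "doc_created", "doc_id": "a", "body": "hello"},
    {"type": "doc_updated", "doc_id": "a", "body": "hello world"},
    {"type": "doc_created", "doc_id": "b", "body": "another doc"},
]
print(rebuild_search_index(events))
# {'a': 'hello world', 'b': 'another doc'}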

Change Data Capture (CDC)

To keep derived systems in sync, capture every change to the primary database:

  1. Database writes a change to its log
  2. CDC tool reads the log
  3. Publishes changes to a stream (Kafka)
  4. Derived systems consume the stream

CDC Pipeline

PLAINTEXT
User action → PostgreSQL → CDC → Kafka → Consumer → Elasticsearch
                  └── Debezium reads WAL

CDC tools: Debezium, Maxwell, Databus.
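
As a rough sketch of step 4, here is a consumer that keeps a derived store in sync. It assumes the kafka-python client and a topic named db.changes populated by a CDC tool; the flat change-event shape (op/key/row) is a simplification for illustration, not the richer envelope a tool like Debezium actually emits:

PYTHON
# Consume change events from Kafka and apply them to a derived store.
import json
from kafka import KafkaConsumer  # assumes the kafka-python package

consumer = KafkaConsumer(
    "db.changes",                          # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

derived_store = {}  # stand-in for Elasticsearch, Redis, etc.

for message in consumer:
    change = message.value
    if change["op"] in ("insert", "update"):
        derived_store[change["key"]] = change["row"]
    elif change["op"] == "delete":
        derived_store.pop(change["key"], None)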

Event Sourcing Revisited

Instead of capturing changes from a database, make events the primary source:

  1. Application logic produces events
  2. Events are the system of record
  3. All data systems are derived from events

Advantages:

  • Complete audit log by design
  • Easy to evolve derived views
  • Time travel: query state at any point in history

Challenges:

  • Compaction: the event log grows without bound, so it needs log compaction or periodic snapshots
  • Schema evolution: events are kept forever, so old events must remain readable as schemas change
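
To make the core idea concrete: current state is a pure fold over the event log, and "time travel" is just folding a prefix of it. A minimal sketch, with a hypothetical account-events example:

PYTHON
# Event sourcing: the event list is the system of record; state is
# computed by folding over it.

def apply(balance, event):
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrawn":
        return balance - event["amount"]
    return balance  # ignore unknown event types

def state_at(events, upto=None):
    """Balance after the first `upto` events (all events by default)."""
    balance = 0
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance

log = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 50},
]
print(state_at(log))     # 120 -- current state
print(state_at(log, 1))  # 100 -- state after the first event only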

Unbundling Databases

Traditional databases bundle many features:

  • Storage
  • Indexing
  • Query processing
  • Transactions
  • Replication

What if we unbundle these into separate components?

Unbundled Database Architecture

PLAINTEXT
Traditional Database:
┌─────────────────────────────────────┐
│  Query Engine                       │
│  Indexes                            │
│  Storage Engine                     │
│  Replication                        │
└─────────────────────────────────────┘
 
Unbundled:
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Storage       │  │ Index         │  │ Query         │
│ (S3, HDFS)    │  │(Elasticsearch)│  │ (Presto)      │
└───────────────┘  └───────────────┘  └───────────────┘
        │                  │                  │
        └──────────────────┴──────────────────┘
                           │
                   ┌───────────────┐
                   │  Event Log    │
                   │  (Kafka)      │
                   └───────────────┘

This is essentially what modern data platforms do:

  • Kafka as the central nervous system
  • Various stores for different access patterns
  • Processing engines for batch and stream

Dataflow: A Unifying Concept

The Unix pipe philosophy applied to distributed systems:

PLAINTEXT
Input → Transform → Output → Transform → Output → ...

Each stage:

  • Reads from immutable input
  • Produces immutable output
  • Can be rerun independently
  • Can be composed with others
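
In code, each stage can be a pure function from an immutable input to a new immutable output, and composing the functions composes the pipeline. A small illustrative sketch (the log-parsing example is made up for this purpose):

PYTHON
# Dataflow as composable stages: every stage reads immutable input and
# returns new immutable output, so any stage can be rerun independently.

def parse(lines):
    """Stage 1: split raw "service,level" lines into records."""
    return tuple(tuple(line.split(",")) for line in lines)

def only_errors(records):
    """Stage 2: keep records whose level is 'error'."""
    return tuple(r for r in records if r[1] == "error")

def count_by_service(records):
    """Stage 3: aggregate error counts per service."""
    counts = {}
    for service, _level in records:
        counts[service] = counts.get(service, 0) + 1
    return counts

raw_input = ("api,error", "api,info", "worker,error")  # immutable input
print(count_by_service(only_errors(parse(raw_input))))
# {'api': 1, 'worker': 1}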

Designing for Correctness

Beyond performance and scalability, systems must be correct. Data should not be silently corrupted.

End-to-End Argument

Correctness checks at lower levels don't guarantee end-to-end correctness:

  • TCP checksums catch corruption in transit
  • But what about bugs in your application logic?
  • Or corruption in storage?
  • Or misconfiguration?

Solution: Application-level checks that verify data integrity end-to-end.

End-to-End Verification

PLAINTEXT
1. User uploads file
2. Store file in S3
3. Record metadata in database
4. Return success to user
 
End-to-end check:
- Compute hash of uploaded file
- Store hash with metadata
- On read: recompute hash, compare with stored
- Catches any corruption anywhere in the pipeline
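
A minimal Python sketch of that check, using dicts as stand-ins for S3 and the metadata database (the function names are illustrative, not a real storage API):

PYTHON
# End-to-end integrity check: hash at write time, verify on every read.
import hashlib

blob_store = {}   # stand-in for S3
metadata_db = {}  # stand-in for the metadata database

def upload(file_id, content):
    digest = hashlib.sha256(content).hexdigest()
    blob_store[file_id] = content
    metadata_db[file_id] = {"sha256": digest, "size": len(content)}

def download(file_id):
    content = blob_store[file_id]
    expected = metadata_db[file_id]["sha256"]
    if hashlib.sha256(content).hexdigest() != expected:
        raise ValueError(f"integrity check failed for {file_id}")
    return content

upload("report.pdf", b"...file bytes...")
print(download("report.pdf"))  # raises ValueError if any layer corrupted the bytes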

Exactly-Once Execution

Many operations should happen exactly once:

  • Charge a credit card once
  • Send an email once
  • Debit an account once

Warning

Networks are unreliable. A successful operation with a lost response looks like a failure. Retrying may cause duplication.

Solutions:

Idempotent operations: Doing the same thing twice has the same effect as doing it once.

SQL
-- Not idempotent
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
 
-- Idempotent (keyed by a unique request ID; the balance is
-- then derived by summing the transfers for the account)
INSERT INTO transfers (id, account, amount)
VALUES ('request-123', 1, -100)
ON CONFLICT (id) DO NOTHING;

Deduplication: Track which operations have been applied.
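
A bare-bones sketch of request-ID deduplication (in-memory only; a real system would persist the set of seen IDs atomically with the side effect, for example in the same database transaction):

PYTHON
# Deduplication: remember which request IDs have been applied and skip
# repeats, so retries become safe.

processed_requests = set()

def charge_card(request_id, account, amount):
    if request_id in processed_requests:
        return "already applied"  # a retried duplicate: do nothing
    processed_requests.add(request_id)
    # ... perform the actual charge exactly once here ...
    return f"charged {amount} to account {account}"

print(charge_card("request-123", 1, 100))  # performs the charge
print(charge_card("request-123", 1, 100))  # already applied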


Ethics and Responsibility

Data systems are not just technical artifacts—they affect people's lives:

  • Privacy: What data do you really need? How long do you keep it?
  • Bias: Machine learning models can perpetuate or amplify biases in training data
  • Surveillance: The ability to track behavior creates temptation for misuse
  • Manipulation: Recommendation algorithms can be exploited

Note

As builders of data systems, we have a responsibility to consider the human impact of our technical decisions.

Questions to Ask

  • Do we need this data? Only collect what's necessary.
  • Who has access? Minimize access, audit usage.
  • How long do we keep it? Set retention policies, actually delete.
  • What if we're wrong? Build correction and appeal mechanisms.
  • What if it's stolen? Plan for breach scenarios.

Evolving Architecture

No architecture is perfect forever. Systems must evolve:

Start Simple

"Do the simplest thing that could possibly work."

Many successful systems started simple:

  • Single database
  • Monolithic application
  • No caching

Complexity should be added when actually needed, not in anticipation.

Grow Incrementally

When scaling is needed:

  1. Measure to find actual bottlenecks
  2. Add one component at a time
  3. Validate it helps before adding more

Incremental Evolution

PLAINTEXT
Stage 1: Monolith + PostgreSQL
 
Stage 2: Add Redis cache for hot data
         (Solved: read latency)
 
Stage 3: Add Elasticsearch for search
         (Solved: full-text search)
 
Stage 4: Add Kafka for event processing
         (Solved: async processing, CDC)
 
Stage 5: Add data warehouse for analytics
         (Solved: OLAP queries)
 
Each stage: measure, add, verify.

Plan for Change

  • Use standard interfaces (SQL, HTTP, Kafka protocol)
  • Keep components loosely coupled
  • Make it possible to swap implementations
  • Maintain backward compatibility

Summary: Key Principles

Throughout this guide, several principles have recurred:

Data Models Matter

Choose the right model for your access patterns:

  • Relational for complex queries and transactions
  • Document for flexible, self-contained records
  • Graph for heavily connected data
  • Time-series for temporal data

Embrace Trade-offs

Every decision involves trade-offs:

  • Consistency vs availability (CAP)
  • Latency vs throughput (batch vs stream)
  • Flexibility vs guarantees (schema-less vs schema)
  • Simplicity vs performance (single node vs distributed)

Distributed Systems Are Hard

Expect:

  • Network failures
  • Partial failures
  • Clock skew
  • Byzantine behavior (sometimes)

Design for:

  • Fault tolerance
  • Idempotency
  • Eventual consistency where appropriate
  • Strong consistency where required

Immutability Is Powerful

  • Event logs as systems of record
  • Derived views built from events
  • Version control for data
  • Reproducible processing from immutable inputs

Monitoring and Observability

You can't fix what you can't see:

  • Metrics (what's happening: rates, latencies, error counts)
  • Logs (why it's happening: detailed per-event context)
  • Traces (where it's happening: how a request flows across services)

Final Thoughts

Building data-intensive applications is challenging but rewarding. The field continues to evolve:

  • Serverless and edge computing change deployment models
  • Machine learning becomes integral to data processing
  • Real-time expectations increase
  • Privacy regulations constrain what's possible

Note

The fundamentals covered in this guide—data models, storage, replication, partitioning, transactions, consensus, batch, and stream processing—will remain relevant as technologies evolve. Understanding why systems work the way they do helps you evaluate new technologies and make informed decisions.

Good luck building your data-intensive applications. Remember:

  1. Understand your requirements before choosing technologies
  2. Start simple and add complexity when needed
  3. Measure everything so you know what's actually happening
  4. Plan for failure because it will happen
  5. Consider the human impact of your technical decisions

The tools and techniques exist to build reliable, scalable, and maintainable data systems. It's up to you to combine them thoughtfully.