Designing Data-Intensive Applications

The Future of Data Systems

Throughout this guide, we've explored databases, distributed systems, batch processing, and stream processing. Each component has its strengths and limitations. The art of building data-intensive applications lies in combining these components effectively.

This final chapter synthesizes what we've learned and explores emerging patterns for building robust data systems.


Data Integration

Most applications need more than one data system:

  • Relational database for transactions
  • Search index for full-text search
  • Cache for low-latency reads
  • Analytics warehouse for business intelligence
  • Message queue for async processing

The challenge: keeping all these systems in sync.

The Derived Data Approach

Think of data systems in two categories:

Systems of record: The authoritative source of truth. Writes go here first.

Derived data: Computed from the system of record. Can be rebuilt if lost.

Primary and Derived Data

PLAINTEXT
System of Record: User events in Kafka
                        │
        ┌───────────────┼────────────────┐
        ▼               ▼                ▼
    (derived)       (derived)        (derived)
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ PostgreSQL   │ │ Elasticsearch│ │ Redis Cache  │
│ (OLTP)       │ │ (Search)     │ │ (Sessions)   │
└──────────────┘ └──────────────┘ └──────────────┘

If Elasticsearch fails, rebuild it from the event log.
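
Conceptually, the rebuild is just a replay: start from an empty store and apply every event in log order. A minimal Python sketch of the idea (the event shapes and the dict standing in for Elasticsearch are illustrative assumptions, not any particular client API):

PYTHON
# Rebuilding a derived view by replaying the system of record.
# The event log is modeled as a plain list of dicts; in practice it
# would be a Kafka topic read from the beginning.

def rebuild_search_index(event_log):
    """Replay every event, in order, into a fresh derived store."""
    index = {}  # stand-in for Elasticsearch
    for event in event_log:
        if event["type"] in ("doc_created", "doc_updated"):
            index[event["doc_id"]] = event["body"]
        elif event["type"] == "doc_deleted":
            index.pop(event["doc_id"], None)
    return index

events = [
    {"type": "doc_created", "doc_id": "a", "body": "hello"},
    {"type": "doc_updated", "doc_id": "a", "body": "hello world"},
    {"type": "doc_created", "doc_id": "b", "body": "another doc"},
]
print(rebuild_search_index(events))
# {'a': 'hello world', 'b': 'another doc'}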

Change Data Capture (CDC)

To keep derived systems in sync, capture every change to the primary database:

  1. Database writes a change to its log
  2. CDC tool reads the log
  3. Publishes changes to a stream (Kafka)
  4. Derived systems consume the stream

CDC Pipeline

PLAINTEXT
User action → PostgreSQL → CDC → Kafka → Consumer → Elasticsearch
                  └── Debezium reads WAL

CDC tools: Debezium, Maxwell, Databus.
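
As a rough sketch of step 4, here is a consumer that keeps a derived store in sync. It assumes the kafka-python client and a topic named db.changes populated by a CDC tool; the flat change-event shape (op/key/row) is a simplification for illustration, not the richer envelope a tool like Debezium actually emits:

PYTHON
# Consume change events from Kafka and apply them to a derived store.
import json
from kafka import KafkaConsumer  # assumes the kafka-python package

consumer = KafkaConsumer(
    "db.changes",                          # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

derived_store = {}  # stand-in for Elasticsearch, Redis, etc.

for message in consumer:
    change = message.value
    if change["op"] in ("insert", "update"):
        derived_store[change["key"]] = change["row"]
    elif change["op"] == "delete":
        derived_store.pop(change["key"], None)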

Event Sourcing Revisited

Instead of capturing changes from a database, make events the primary source:

  1. Application logic produces events
  2. Events are the system of record
  3. All data systems are derived from events

Advantages:

  • Complete audit log by design
  • Easy to evolve derived views
  • Time travel: query state at any point in history

Challenges:

  • Compaction: the event log grows without bound, so it needs log compaction or periodic snapshots
  • Schema evolution: events are kept forever, so old events must remain readable as schemas change
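
To make the core idea concrete: current state is a pure fold over the event log, and "time travel" is just folding a prefix of it. A minimal sketch, with a hypothetical account-events example:

PYTHON
# Event sourcing: the event list is the system of record; state is
# computed by folding over it.

def apply(balance, event):
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrawn":
        return balance - event["amount"]
    return balance  # ignore unknown event types

def state_at(events, upto=None):
    """Balance after the first `upto` events (all events by default)."""
    balance = 0
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance

log = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 50},
]
print(state_at(log))     # 120 -- current state
print(state_at(log, 1))  # 100 -- state after the first event only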

Unbundling Databases

Traditional databases bundle many features:

  • Storage
  • Indexing
  • Query processing
  • Transactions
  • Replication

What if we unbundle these into separate components?

Unbundled Database Architecture

PLAINTEXT
Traditional Database:
┌─────────────────────────────────────┐
│  Query Engine                       │
│  Indexes                            │
│  Storage Engine                     │
│  Replication                        │
└─────────────────────────────────────┘
 
Unbundled:
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Storage       │  │ Index         │  │ Query         │
│ (S3, HDFS)    │  │(Elasticsearch)│  │ (Presto)      │
└───────────────┘  └───────────────┘  └───────────────┘
        │                  │                  │
        └──────────────────┴──────────────────┘
                           │
                   ┌───────────────┐
                   │  Event Log    │
                   │  (Kafka)      │
                   └───────────────┘

This is essentially what modern data platforms do:

  • Kafka as the central nervous system
  • Various stores for different access patterns
  • Processing engines for batch and stream

Dataflow: A Unifying Concept

The Unix pipe philosophy applied to distributed systems:

PLAINTEXT
Input → Transform → Output → Transform → Output → ...

Each stage:

  • Reads from immutable input
  • Produces immutable output
  • Can be rerun independently
  • Can be composed with others
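
In code, each stage can be a pure function from an immutable input to a new immutable output, and composing the functions composes the pipeline. A small illustrative sketch (the log-parsing example is made up for this purpose):

PYTHON
# Dataflow as composable stages: every stage reads immutable input and
# returns new immutable output, so any stage can be rerun independently.

def parse(lines):
    """Stage 1: split raw "service,level" lines into records."""
    return tuple(tuple(line.split(",")) for line in lines)

def only_errors(records):
    """Stage 2: keep records whose level is 'error'."""
    return tuple(r for r in records if r[1] == "error")

def count_by_service(records):
    """Stage 3: aggregate error counts per service."""
    counts = {}
    for service, _level in records:
        counts[service] = counts.get(service, 0) + 1
    return counts

raw_input = ("api,error", "api,info", "worker,error")  # immutable input
print(count_by_service(only_errors(parse(raw_input))))
# {'api': 1, 'worker': 1}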

Designing for Correctness

Beyond performance and scalability, systems must be correct. Data should not be silently corrupted.

End-to-End Argument

Correctness checks at lower levels don't guarantee end-to-end correctness:

  • TCP checksums catch corruption in transit
  • But what about bugs in your application logic?
  • Or corruption in storage?
  • Or misconfiguration?

Solution: Application-level checks that verify data integrity end-to-end.

End-to-End Verification

PLAINTEXT
1. User uploads file
2. Store file in S3
3. Record metadata in database
4. Return success to user
 
End-to-end check:
- Compute hash of uploaded file
- Store hash with metadata
- On read: recompute hash, compare with stored
- Catches any corruption anywhere in the pipeline
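
A minimal Python sketch of that check, using dicts as stand-ins for S3 and the metadata database (the function names are illustrative, not a real storage API):

PYTHON
# End-to-end integrity check: hash at write time, verify on every read.
import hashlib

blob_store = {}   # stand-in for S3
metadata_db = {}  # stand-in for the metadata database

def upload(file_id, content):
    digest = hashlib.sha256(content).hexdigest()
    blob_store[file_id] = content
    metadata_db[file_id] = {"sha256": digest, "size": len(content)}

def download(file_id):
    content = blob_store[file_id]
    expected = metadata_db[file_id]["sha256"]
    if hashlib.sha256(content).hexdigest() != expected:
        raise ValueError(f"integrity check failed for {file_id}")
    return content

upload("report.pdf", b"...file bytes...")
print(download("report.pdf"))  # raises ValueError if any layer corrupted the bytes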

Exactly-Once Execution

Many operations should happen exactly once:

  • Charge a credit card once
  • Send an email once
  • Debit an account once

Warning

Networks are unreliable. A successful operation with a lost response looks like a failure. Retrying may cause duplication.

Solutions:

Idempotent operations: Doing the same thing twice has the same effect as doing it once.

SQL
-- Not idempotent
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
 
-- Idempotent (keyed by a unique request ID; the balance is
-- then derived by summing the transfers for the account)
INSERT INTO transfers (id, account, amount)
VALUES ('request-123', 1, -100)
ON CONFLICT (id) DO NOTHING;

Deduplication: Track which operations have been applied.
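
A bare-bones sketch of request-ID deduplication (in-memory only; a real system would persist the set of seen IDs atomically with the side effect, for example in the same database transaction):

PYTHON
# Deduplication: remember which request IDs have been applied and skip
# repeats, so retries become safe.

processed_requests = set()

def charge_card(request_id, account, amount):
    if request_id in processed_requests:
        return "already applied"  # a retried duplicate: do nothing
    processed_requests.add(request_id)
    # ... perform the actual charge exactly once here ...
    return f"charged {amount} to account {account}"

print(charge_card("request-123", 1, 100))  # performs the charge
print(charge_card("request-123", 1, 100))  # already applied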


Ethics and Responsibility

Data systems are not just technical artifacts—they affect people's lives:

  • Privacy: What data do you really need? How long do you keep it?
  • Bias: Machine learning models can perpetuate or amplify biases in training data
  • Surveillance: The ability to track behavior creates temptation for misuse
  • Manipulation: Recommendation algorithms can be exploited

Note

As builders of data systems, we have a responsibility to consider the human impact of our technical decisions.

Questions to Ask

  • Do we need this data? Only collect what's necessary.
  • Who has access? Minimize access, audit usage.
  • How long do we keep it? Set retention policies, actually delete.
  • What if we're wrong? Build correction and appeal mechanisms.
  • What if it's stolen? Plan for breach scenarios.

Evolving Architecture

No architecture is perfect forever. Systems must evolve:

Start Simple

"Do the simplest thing that could possibly work."

Many successful systems started simple:

  • Single database
  • Monolithic application
  • No caching

Complexity should be added when actually needed, not in anticipation.

Grow Incrementally

When scaling is needed:

  1. Measure to find actual bottlenecks
  2. Add one component at a time
  3. Validate it helps before adding more

Incremental Evolution

PLAINTEXT
Stage 1: Monolith + PostgreSQL
 
Stage 2: Add Redis cache for hot data
         (Solved: read latency)
 
Stage 3: Add Elasticsearch for search
         (Solved: full-text search)
 
Stage 4: Add Kafka for event processing
         (Solved: async processing, CDC)
 
Stage 5: Add data warehouse for analytics
         (Solved: OLAP queries)
 
Each stage: measure, add, verify.

Plan for Change

  • Use standard interfaces (SQL, HTTP, Kafka protocol)
  • Keep components loosely coupled
  • Make it possible to swap implementations
  • Maintain backward compatibility

Summary: Key Principles

Throughout this guide, several principles have recurred:

Data Models Matter

Choose the right model for your access patterns:

  • Relational for complex queries and transactions
  • Document for flexible, self-contained records
  • Graph for heavily connected data
  • Time-series for temporal data

Embrace Trade-offs

Every decision involves trade-offs:

  • Consistency vs availability (CAP)
  • Latency vs throughput (batch vs stream)
  • Flexibility vs guarantees (schema-less vs schema)
  • Simplicity vs performance (single node vs distributed)

Distributed Systems Are Hard

Expect:

  • Network failures
  • Partial failures
  • Clock skew
  • Byzantine behavior (sometimes)

Design for:

  • Fault tolerance
  • Idempotency
  • Eventual consistency where appropriate
  • Strong consistency where required

Immutability Is Powerful

  • Event logs as systems of record
  • Derived views built from events
  • Version control for data
  • Reproducible processing from immutable inputs

Monitoring and Observability

You can't fix what you can't see:

  • Metrics (what's happening: rates, latencies, error counts)
  • Logs (why it's happening: detailed per-event context)
  • Traces (where it's happening: how a request flows across services)

Final Thoughts

Building data-intensive applications is challenging but rewarding. The field continues to evolve:

  • Serverless and edge computing change deployment models
  • Machine learning becomes integral to data processing
  • Real-time expectations increase
  • Privacy regulations constrain what's possible

Note

The fundamentals covered in this guide—data models, storage, replication, partitioning, transactions, consensus, batch, and stream processing—will remain relevant as technologies evolve. Understanding why systems work the way they do helps you evaluate new technologies and make informed decisions.

Good luck building your data-intensive applications. Remember:

  1. Understand your requirements before choosing technologies
  2. Start simple and add complexity when needed
  3. Measure everything so you know what's actually happening
  4. Plan for failure because it will happen
  5. Consider the human impact of your technical decisions

The tools and techniques exist to build reliable, scalable, and maintainable data systems. It's up to you to combine them thoughtfully.