Stream Processing
Batch processing operates on bounded datasets—you know when the input ends. But in many scenarios, data arrives continuously: website clicks, sensor readings, financial transactions. Stream processing handles this unbounded data, processing it as it arrives.
Note
Stream processing is like batch processing, but for data that never ends. Many concepts carry over, but the unbounded nature creates new challenges.
What Is a Stream?
A stream is a sequence of events, each with a timestamp, that arrives over time. Events are typically small and self-contained:
{"timestamp": "2024-01-15T10:30:00Z", "user_id": 123, "action": "click", "page": "/products/456"}
{"timestamp": "2024-01-15T10:30:01Z", "user_id": 456, "action": "purchase", "amount": 99.99}
{"timestamp": "2024-01-15T10:30:02Z", "user_id": 123, "action": "click", "page": "/checkout"}Event Sourcing
An event stream can be a complete log of everything that happened:
- Rather than storing current state, store the events that led to it
- Current state is derived by replaying events
- Complete audit trail by design
Event Sourcing: Shopping Cart
Events:
{user: 123, action: "add_item", product: "A"}
{user: 123, action: "add_item", product: "B"}
{user: 123, action: "remove_item", product: "A"}
{user: 123, action: "checkout"}Derived State: Cart contents = [B] Order placed with [B]
Message Brokers
How do you transmit events from producers to consumers? Message brokers (message queues) handle this:
Two Patterns
Direct messaging (RabbitMQ, ActiveMQ):
- Consumer receives message and acknowledges
- Message is deleted after acknowledgment
- If consumer crashes before ack, message is redelivered
Log-based messaging (Apache Kafka, Amazon Kinesis):
- Messages are appended to an immutable log
- Consumers track their position (offset) in the log
- Messages are retained for a configurable period
- Multiple consumers can read the same messages
Log-Based Messaging
Log: [0: click] [1: view] [2: purchase] [3: click] [4: view]
                                ↑
Consumer A: offset = 2 (processing "purchase")
Consumer B: offset = 4 (processing "view", caught up)
Consumer C: offset = 0 (replaying from beginning)
Why Logs?
Durability: Events are persisted; consumers can replay if needed.
Multiple consumers: Each consumer tracks its own offset.
Replayability: Start from any point in history.
Ordering: Within a partition, order is guaranteed.
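The snippet below is a toy in-memory sketch of this log-plus-offsets model (not Kafka's actual API): the broker only ever appends, and each consumer remembers how far it has read, so it can replay from any point.
// Append-only log; reading never deletes entries
class EventLog<T> {
  private entries: T[] = [];
  append(event: T): number {
    this.entries.push(event);
    return this.entries.length - 1; // offset of the new entry
  }
  read(offset: number): T | undefined {
    return this.entries[offset];
  }
}
// Each consumer tracks its own position independently
class LogConsumer<T> {
  constructor(private log: EventLog<T>, private offset = 0) {}
  poll(): T | undefined {
    const event = this.log.read(this.offset);
    if (event !== undefined) this.offset++; // advance only after a successful read
    return event;
  }
}
const log = new EventLog<string>();
["click", "view", "purchase", "click", "view"].forEach((e) => log.append(e));
const consumerA = new LogConsumer(log, 2); // resumes at "purchase"
const consumerC = new LogConsumer(log);    // replays from the beginning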
Warning
Log-based brokers (Kafka) partition data across multiple logs. Ordering is guaranteed within a partition, but not across partitions. Design your partitioning accordingly.
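For instance, a common approach is to derive the partition from the event key so that all events for the same key land in the same partition and keep their relative order; the hash function below is purely illustrative.
// Route every event with the same key (e.g., user_id) to the same partition,
// preserving per-key ordering even though the topic has many partitions
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) | 0; // simple 32-bit string hash
  }
  return Math.abs(hash) % numPartitions;
}
partitionFor("user-123", 8); // always the same partition for "user-123"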
Processing Streams
What can you do with a stream of events?
1. Write to Storage
Take events and write them somewhere—database, file system, search index.
Stream → [Processor] → Database
This is essentially streaming ETL.
2. Push to Users
Send events to users in real-time—notifications, dashboards, alerts.
Stream → [Processor] → Push notification / WebSocket
3. Produce Another Stream
Transform, filter, enrich, or aggregate events:
Raw events → [Processor] → Derived events → [Another Processor] → ...
This is where stream processing gets interesting.
Stream Processing Operations
Simple Transformations
Like batch processing: map, filter, project fields.
// Filter for purchases over $100
stream
.filter((event) => event.action === "purchase" && event.amount > 100)
.map((event) => ({ user: event.user_id, amount: event.amount }));
Joins
Combining streams is trickier than in batch because data is unbounded:
Stream-Stream Join: Joining two streams based on some key within a time window.
Stream-Stream Join
Clicks and purchases both have user_id. Find clicks that led to purchases within 1 hour:
Click stream: [t=10:00, user=123, page=/product/A]
Purchase stream: [t=10:45, user=123, product=A, amount=$50]
Join result: User 123 clicked product A and bought it 45 min later
Stream-Table Join: Enrich a stream with data from a table (database lookup).
Event: {user_id: 123, action: click}
Table: {id: 123, name: "Alice", tier: "premium"}
Enriched: {user_id: 123, user_name: "Alice", tier: "premium", action: click}
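A sketch of this stream-table join, assuming the table is small enough to hold in a local map (in practice it might be a changelog-backed state store or a database lookup); the types and field names mirror the example above.
// The "table" side of the join, keyed by user id
interface UserRow { id: number; name: string; tier: string }
const users = new Map<number, UserRow>([
  [123, { id: 123, name: "Alice", tier: "premium" }],
]);
// Enrich each incoming event with the matching row, if any
function enrich(event: { user_id: number; action: string }) {
  const row = users.get(event.user_id);
  return {
    user_id: event.user_id,
    user_name: row?.name ?? "unknown",
    tier: row?.tier ?? "unknown",
    action: event.action,
  };
}
// enrich({ user_id: 123, action: "click" })
// => { user_id: 123, user_name: "Alice", tier: "premium", action: "click" }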
Table-Table Join: Maintain a materialized view of joined tables, updated as either table changes.
Windowing
For unbounded data, you often need to group events into finite windows:
Tumbling windows: Fixed-size, non-overlapping.
|--5 min--|--5 min--|--5 min--|
Hopping windows: Fixed-size, overlapping.
|--5 min--|
    |--5 min--|
        |--5 min--|
Sliding windows: Events within a duration of each other.
Each event: "all events within 5 minutes before me"
Session windows: Group by activity, with gaps as boundaries.
[activity] [gap] [activity] [activity] [gap] [activity]
|--session 1--| |------session 2------| |--session 3--|
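Before the aggregation example, here is a rough sketch of the mechanics: for a tumbling window, the window an event belongs to can be computed directly from its timestamp. The 5-minute size and the windowStart helper are illustrative.
// Assign an event to a fixed-size, non-overlapping (tumbling) window
const WINDOW_MS = 5 * 60 * 1000; // 5 minutes
function windowStart(eventTimeMs: number): number {
  return Math.floor(eventTimeMs / WINDOW_MS) * WINDOW_MS;
}
// Events at 10:00:30 and 10:04:59 share the [10:00, 10:05) window;
// an event at 10:05:00 opens the next one
windowStart(Date.parse("2024-01-15T10:03:00Z")); // => 2024-01-15T10:00:00Z in ms
A hopping window would instead assign the same event to every window whose range covers its timestamp.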
Windowed Aggregation
// Count clicks per page per 5-minute tumbling window
stream
.filter((e) => e.action === "click")
.keyBy((e) => e.page)
.window(TumblingWindow.of(5, MINUTES))
.count();
// Output: (page, window_start, count)
// ("/products", "10:00:00", 145)
// ("/products", "10:05:00", 132)
Time and Ordering
Time is surprisingly tricky in stream processing.
Event Time vs Processing Time
Event time: When the event actually occurred (from the event's timestamp).
Processing time: When the system processes the event.
These can differ significantly due to:
- Network delays
- Batching at the source
- Consumer lag
- Retries
Warning
If you use processing time for windowing, late-arriving events are assigned to the wrong window. Use event time when the business logic depends on when things actually happened.
Late Events
With event-time processing, how long do you wait for late events before closing a window?
Watermarks help: a watermark is a timestamp that says "all events with timestamps before this have arrived."
Watermarks
Time:    10:00  10:01  10:02  10:03  10:04  10:05
Events:    A      B      -      C      D      E
                         ↑
               Watermark: 10:02
        "Events before 10:02 are complete"
Late event arrives: timestamp 10:01 at clock time 10:05
→ After watermark, so it's "late"
Options for late events:
- Drop them (simplest, loses data)
- Include them and update previous results
- Send them to a side output for special handling
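As a rough sketch (not any framework's API), a windowed count can use the watermark to decide when a window is final and divert anything that arrives afterwards to a side output.
// Windowed counts keyed by window start; a window is closed once the
// watermark passes its end, and later events for it are treated as late
const WINDOW_MS = 5 * 60 * 1000;
const counts = new Map<number, number>();
const lateEvents: number[] = []; // side output for late events
let watermark = 0;               // "all events before this have arrived"
function onEvent(eventTimeMs: number) {
  const start = Math.floor(eventTimeMs / WINDOW_MS) * WINDOW_MS;
  if (start + WINDOW_MS <= watermark) {
    lateEvents.push(eventTimeMs); // window already closed: drop, update, or side-output
    return;
  }
  counts.set(start, (counts.get(start) ?? 0) + 1);
}
function onWatermark(newWatermark: number) {
  watermark = newWatermark;
  for (const [start, count] of counts) {
    if (start + WINDOW_MS <= watermark) {
      console.log(`window ${new Date(start).toISOString()}: ${count}`); // emit final result
      counts.delete(start);
    }
  }
}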
Fault Tolerance
Stream processors run continuously, so failures are inevitable. How do you recover without losing or duplicating data?
Microbatching
Process the stream in tiny batches (e.g., 1 second). Each batch is treated like a small batch job with atomic commit.
Spark Streaming uses this approach.
Advantage: Reuses batch fault tolerance mechanisms. Disadvantage: Latency is at least one batch interval.
Checkpointing
Periodically save the state of the processor. On failure, restore from checkpoint and replay events from that point.
Apache Flink uses this approach.
Advantage: True streaming (event-at-a-time processing). Disadvantage: Must handle "exactly once" semantics carefully.
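A minimal sketch of the idea, assuming a replayable input log and state small enough to snapshot; saveCheckpoint and restore stand in for writes to durable storage.
// Snapshot (input offset, state) together; on restart, restore the latest
// snapshot and replay the input from the saved offset
interface Checkpoint { offset: number; state: Record<string, number> }
const checkpoints: Checkpoint[] = []; // stands in for durable checkpoint storage
let state: Record<string, number> = {};
let offset = 0;
function saveCheckpoint() {
  checkpoints.push({ offset, state: { ...state } }); // must be written atomically
}
function restore() {
  const last = checkpoints[checkpoints.length - 1];
  if (last) { state = { ...last.state }; offset = last.offset; }
}
function run(events: string[]) {
  while (offset < events.length) {
    const event = events[offset];
    state[event] = (state[event] ?? 0) + 1; // update processor state
    offset++;
    if (offset % 100 === 0) saveCheckpoint(); // checkpoint periodically
  }
}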
Exactly-Once Semantics
The holy grail: each event affects the output exactly once, even if the processor crashes and restarts.
Exactly-Once Challenge
1. Read event
2. Update internal state
3. Write output
4. Crash!
On restart: Did we write the output or not?
If we replay the event, might we duplicate it?
Achieving exactly-once:
- Checkpoint state + output position atomically
- Use idempotent writes (writing the same thing twice = no change; see the sketch after this list)
- Use distributed transactions (expensive)
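As an example of the idempotent-write option, keying each output record by a deterministic event ID means a replay after a crash overwrites the same record rather than duplicating it; the eventId field and the in-memory sink are assumptions for illustration.
// Output keyed by event ID: writing the same record twice leaves one row
const sink = new Map<string, { eventId: string; total: number }>(); // stands in for a keyed datastore
function writeIdempotent(record: { eventId: string; total: number }) {
  sink.set(record.eventId, record); // upsert by key, not append
}
writeIdempotent({ eventId: "evt-42", total: 100 });
writeIdempotent({ eventId: "evt-42", total: 100 }); // replay after crash: no duplicate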
Note
"Exactly once" is really "effectively once"—the processing may happen multiple times, but the end result is as if it happened once.
Stream Processing Frameworks
| Framework | Approach | Key Features |
|---|---|---|
| Apache Kafka Streams | Library | Embedded in application, simple setup |
| Apache Flink | Engine | True streaming, exactly-once, powerful state |
| Apache Spark Streaming | Engine | Microbatching, unified with batch |
| Apache Storm | Engine | Early stream processor, now largely legacy |
| Amazon Kinesis | Managed | AWS integration, serverless option |
| Google Dataflow | Managed | Unified batch/stream, auto-scaling |
Batch vs Stream
The line between batch and stream processing is blurring:
Batch:
- Process bounded datasets
- Can see entire input before producing output
- High throughput, high latency
Stream:
- Process unbounded data
- Must produce output before seeing all input
- Lower throughput, lower latency
Unification (Spark, Flink, Dataflow):
- Same API for batch and stream
- Batch = stream processing on bounded input
- Makes code more portable and consistent
Unified Processing
# Same code works for batch and stream!
data \
.filter(lambda e: e['action'] == 'purchase') \
.group_by(key=lambda e: e['product_id']) \
.sum(field='amount')
# Batch: runs on historical data, outputs final result
# Stream: runs continuously, outputs updates
Summary
Stream processing extends batch processing concepts to unbounded data:
| Concept | Batch | Stream |
|---|---|---|
| Input | Bounded (files) | Unbounded (events) |
| Processing | Run to completion | Run forever |
| Output | Complete results | Incremental updates |
| Time | Historical | Real-time |
Key concepts:
- Message brokers transmit events (direct vs log-based)
- Joins are possible but require windows for stream-stream joins
- Windowing (tumbling, hopping, sliding, session) groups unbounded data
- Event time vs processing time affects correctness
- Watermarks handle late data
- Exactly-once requires careful checkpointing or idempotent writes
Note
Stream processing is increasingly important as businesses want real-time insights, recommendations, and fraud detection. Understanding these concepts helps you build responsive, event-driven systems.
The next chapter brings everything together, exploring how to combine these concepts into complete data systems.