Designing Data-Intensive Applications

Stream Processing

Batch processing operates on bounded datasets—you know when the input ends. But in many scenarios, data arrives continuously: website clicks, sensor readings, financial transactions. Stream processing handles this unbounded data, processing it as it arrives.

Note

Stream processing is like batch processing, but for data that never ends. Many concepts carry over, but the unbounded nature creates new challenges.


What Is a Stream?

A stream is a sequence of events that arrive over time, each carrying a timestamp. Events are typically small and self-contained:

JSON
{"timestamp": "2024-01-15T10:30:00Z", "user_id": 123, "action": "click", "page": "/products/456"}
{"timestamp": "2024-01-15T10:30:01Z", "user_id": 456, "action": "purchase", "amount": 99.99}
{"timestamp": "2024-01-15T10:30:02Z", "user_id": 123, "action": "click", "page": "/checkout"}

Event Sourcing

An event stream can be a complete log of everything that happened:

  • Rather than storing current state, store the events that led to it
  • Current state is derived by replaying events
  • Complete audit trail by design

Event Sourcing: Shopping Cart

Events:

PLAINTEXT
{user: 123, action: "add_item", product: "A"}
{user: 123, action: "add_item", product: "B"}
{user: 123, action: "remove_item", product: "A"}
{user: 123, action: "checkout"}

Derived State: Cart contents = [B]; order placed with [B].
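
As a minimal sketch (plain JavaScript, using the same event shapes as above), the current state can be derived by folding over the event log:

JAVASCRIPT
// Replay events to derive the current cart state.
function deriveCart(events) {
  const cart = new Set();
  let checkedOut = false;
  for (const e of events) {
    if (e.action === "add_item") cart.add(e.product);
    if (e.action === "remove_item") cart.delete(e.product);
    if (e.action === "checkout") checkedOut = true;
  }
  return { items: [...cart], checkedOut };
}

const events = [
  { user: 123, action: "add_item", product: "A" },
  { user: 123, action: "add_item", product: "B" },
  { user: 123, action: "remove_item", product: "A" },
  { user: 123, action: "checkout" },
];

console.log(deriveCart(events)); // { items: ["B"], checkedOut: true }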


Message Brokers

How do you transmit events from producers to consumers? Message brokers (message queues) handle this:

Two Patterns

Direct messaging (RabbitMQ, ActiveMQ):

  • Consumer receives message and acknowledges
  • Message is deleted after acknowledgment
  • If consumer crashes before ack, message is redelivered

Log-based messaging (Apache Kafka, Amazon Kinesis):

  • Messages are appended to an immutable log
  • Consumers track their position (offset) in the log
  • Messages are retained for a configurable period
  • Multiple consumers can read the same messages

Log-Based Messaging

PLAINTEXT
Log:     [0: click] [1: view] [2: purchase] [3: click] [4: view]

Consumer A: offset = 2 (processing "purchase")
 
Consumer B: offset = 4 (processing "view", caught up)
 
Consumer C: offset = 0 (replaying from beginning)

Why Logs?

Durability: Events are persisted; consumers can replay if needed.

Multiple consumers: Each consumer tracks its own offset.

Replayability: Start from any point in history.

Ordering: Within a partition, order is guaranteed.
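
A toy in-memory log (illustration only, not any real broker's API) shows how per-consumer offsets make multiple consumers and replay possible:

JAVASCRIPT
// A toy append-only log with per-consumer offsets (illustration only).
class Log {
  constructor() {
    this.entries = []; // the immutable, append-only log
    this.offsets = new Map(); // consumer name -> next offset to read
  }
  append(event) {
    this.entries.push(event); // producers only ever append
  }
  poll(consumer, maxEvents = 10) {
    const from = this.offsets.get(consumer) ?? 0;
    const batch = this.entries.slice(from, from + maxEvents);
    this.offsets.set(consumer, from + batch.length); // advances this consumer only
    return batch;
  }
  rewind(consumer, offset = 0) {
    this.offsets.set(consumer, offset); // replay from any point in history
  }
}

const log = new Log();
["click", "view", "purchase"].forEach((a) => log.append({ action: a }));
console.log(log.poll("A")); // all three events
console.log(log.poll("B")); // the same three events; B has its own offset
log.rewind("A"); // A can replay from the beginning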

Warning

Log-based brokers (Kafka) partition data across multiple logs. Ordering is guaranteed within a partition, but not across partitions. Design your partitioning accordingly.


Processing Streams

What can you do with a stream of events?

1. Write to Storage

Take events and write them somewhere—database, file system, search index.

PLAINTEXT
Stream → [Processor] → Database

This is essentially streaming ETL.

2. Push to Users

Send events to users in real-time—notifications, dashboards, alerts.

PLAINTEXT
Stream → [Processor] → Push notification / WebSocket

3. Produce Another Stream

Transform, filter, enrich, or aggregate events:

PLAINTEXT
Raw events → [Processor] → Derived events → [Another Processor] → ...

This is where stream processing gets interesting.


Stream Processing Operations

Simple Transformations

Like batch processing: map, filter, project fields.

JAVASCRIPT
// Filter for purchases over $100
stream
  .filter((event) => event.action === "purchase" && event.amount > 100)
  .map((event) => ({ user: event.user_id, amount: event.amount }));

Joins

Combining streams is trickier than in batch processing because the data is unbounded:

Stream-Stream Join: Joining two streams based on some key within a time window.

Stream-Stream Join

Clicks and purchases both have user_id. Find clicks that led to purchases within 1 hour:

PLAINTEXT
Click stream:   [t=10:00, user=123, page=/product/A]
Purchase stream: [t=10:45, user=123, product=A, amount=$50]
 
Join result: User 123 clicked product A and bought it 45 min later
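
A simplified sketch of this windowed stream-stream join: buffer clicks per user and, when a purchase arrives, match it against clicks from the preceding hour. Field names are illustrative, and a real implementation would also expire buffered clicks older than the window:

JAVASCRIPT
// Windowed stream-stream join: match purchases to clicks within 1 hour.
const WINDOW_MS = 60 * 60 * 1000;
const recentClicks = new Map(); // user_id -> [{ ts, page }, ...]

function onClick(click) {
  const list = recentClicks.get(click.user_id) ?? [];
  list.push(click);
  recentClicks.set(click.user_id, list);
}

function onPurchase(purchase) {
  const matches = (recentClicks.get(purchase.user_id) ?? []).filter(
    (c) => purchase.ts - c.ts >= 0 && purchase.ts - c.ts <= WINDOW_MS
  );
  // Emit one join result per click that preceded the purchase within the window.
  return matches.map((c) => ({ user: purchase.user_id, page: c.page, amount: purchase.amount }));
}

onClick({ user_id: 123, ts: Date.parse("2024-01-15T10:00:00Z"), page: "/product/A" });
const joined = onPurchase({ user_id: 123, ts: Date.parse("2024-01-15T10:45:00Z"), amount: 50 });
console.log(joined); // [{ user: 123, page: "/product/A", amount: 50 }]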

Stream-Table Join: Enrich a stream with data from a table (database lookup).

PLAINTEXT
Event: {user_id: 123, action: click}
Table: {id: 123, name: "Alice", tier: "premium"}
Enriched: {user_id: 123, user_name: "Alice", tier: "premium", action: click}
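
In code, a stream-table join is just a keyed lookup against a local copy of the table (often kept up to date via change data capture); a minimal sketch:

JAVASCRIPT
// Stream-table join: enrich each event with user data from a local table.
const users = new Map([[123, { name: "Alice", tier: "premium" }]]);

function enrich(event) {
  const user = users.get(event.user_id);
  return { ...event, user_name: user?.name, tier: user?.tier };
}

console.log(enrich({ user_id: 123, action: "click" }));
// { user_id: 123, action: "click", user_name: "Alice", tier: "premium" }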

Table-Table Join: Maintain a materialized view of joined tables, updated as either table changes.

Windowing

For unbounded data, you often need to group events into finite windows:

Tumbling windows: Fixed-size, non-overlapping.

PLAINTEXT
|--5 min--|--5 min--|--5 min--|

Hopping windows: Fixed-size, overlapping.

PLAINTEXT
|--5 min--|
    |--5 min--|
        |--5 min--|

Sliding windows: Events within a duration of each other.

PLAINTEXT
Each event: "all events within 5 minutes before me"

Session windows: Group by activity, with gaps as boundaries.

PLAINTEXT
[activity]   [gap]   [activity] [activity]   [gap]   [activity]
|--session 1--|      |------session 2------|      |--session 3--|
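
Session windows can be sketched as: sort events by time and start a new session whenever the gap since the previous event exceeds a timeout (the 30-minute default and the epoch-millisecond `ts` field are assumptions):

JAVASCRIPT
// Group events into session windows separated by gaps of inactivity.
function sessionize(events, gapMs = 30 * 60 * 1000) {
  const sorted = [...events].sort((a, b) => a.ts - b.ts);
  const sessions = [];
  for (const e of sorted) {
    const current = sessions[sessions.length - 1];
    if (current && e.ts - current[current.length - 1].ts <= gapMs) {
      current.push(e); // still within the gap: same session
    } else {
      sessions.push([e]); // gap exceeded (or first event): new session
    }
  }
  return sessions;
}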

Windowed Aggregation

JAVASCRIPT
// Count clicks per URL per 5-minute tumbling window
stream
  .filter((e) => e.action === "click")
  .keyBy((e) => e.url)
  .window(TumblingWindow.of(5, MINUTES))
  .count();
 
// Output: (url, window_start, count)
// ("/products", "10:00:00", 145)
// ("/products", "10:05:00", 132)

Time and Ordering

Time is surprisingly tricky in stream processing.

Event Time vs Processing Time

Event time: When the event actually occurred (from the event's timestamp).

Processing time: When the system processes the event.

These can differ significantly due to:

  • Network delays
  • Batching at the source
  • Consumer lag
  • Retries

Warning

If you use processing time for windowing, late-arriving events are assigned to the wrong window. Use event time when the business logic depends on when things actually happened.
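
The difference shows up directly in window assignment; a small sketch, assuming each event carries its own ISO timestamp as in the earlier examples:

JAVASCRIPT
// Window assignment: event time vs processing time.
const WINDOW_MS = 5 * 60 * 1000;

// Reflects when the event actually happened: use the timestamp inside the event.
const byEventTime = (e) => Math.floor(Date.parse(e.timestamp) / WINDOW_MS) * WINDOW_MS;

// Easy but wrong for late data: use the wall clock at the moment of processing.
const byProcessingTime = () => Math.floor(Date.now() / WINDOW_MS) * WINDOW_MS;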

Late Events

With event-time processing, how long do you wait for late events before closing a window?

Watermarks help: a watermark is a timestamp that says "all events with timestamps before this have arrived."

Watermarks

PLAINTEXT
Time:    10:00   10:01   10:02   10:03   10:04   10:05
Events:    A       B       -       C       D       E

                    Watermark: 10:02
                    "Events before 10:02 are complete"
 
Late event arrives: timestamp 10:01 at clock time 10:05
  → After watermark, so it's "late"

Options for late events:

  • Drop them (simplest, loses data)
  • Include them and update previous results
  • Send them to a side output for special handling
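
A common, simple watermark heuristic is to trail the maximum event time seen so far by a fixed allowed lateness (one minute here, chosen arbitrarily); a sketch:

JAVASCRIPT
// Watermark = max event time seen so far, minus an allowed lateness.
const ALLOWED_LATENESS_MS = 60 * 1000;
let maxEventTime = 0;

function observe(event) {
  maxEventTime = Math.max(maxEventTime, Date.parse(event.timestamp));
}

function watermark() {
  return maxEventTime - ALLOWED_LATENESS_MS;
}

function isLate(event) {
  // Late = the event's timestamp is already behind the watermark.
  return Date.parse(event.timestamp) < watermark();
}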

Fault Tolerance

Stream processors run continuously, so failures are inevitable. How do you recover without losing or duplicating data?

Microbatching

Process the stream in tiny batches (e.g., 1 second). Each batch is treated like a small batch job with atomic commit.

Spark Streaming uses this approach.

Advantage: Reuses batch fault tolerance mechanisms. Disadvantage: Latency is at least one batch interval.
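
In outline, a microbatch loop collects whatever arrived during the interval, processes it as one small batch, and commits the output together with the new input position; a sketch where `readBatch`, `process`, and `commitAtomically` are hypothetical stand-ins:

JAVASCRIPT
// Microbatching: treat each interval's events as one small, atomic batch.
// readBatch, process, and commitAtomically are hypothetical stand-ins.
async function microbatchLoop({ readBatch, process, commitAtomically, intervalMs = 1000 }) {
  let offset = 0;
  while (true) {
    const batch = await readBatch(offset, intervalMs); // events from the last interval
    const output = batch.map(process);
    // Commit output and new offset together, so a crash simply re-runs the whole batch.
    await commitAtomically(output, offset + batch.length);
    offset += batch.length;
  }
}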

Checkpointing

Periodically save the state of the processor. On failure, restore from checkpoint and replay events from that point.

Apache Flink uses this approach.

Advantage: True streaming (event-at-a-time processing). Disadvantage: Must handle "exactly once" semantics carefully.
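
A minimal sketch of checkpoint-and-restore for an event-at-a-time processor, assuming hypothetical `applyEvent`, `log.readFrom`, and `checkpointStore` helpers (real engines such as Flink do this with distributed snapshots):

JAVASCRIPT
// Checkpointing: periodically persist (state, offset); on restart, restore and replay.
function createProcessor(log, checkpointStore) {
  let { state, offset } = checkpointStore.load() ?? { state: {}, offset: 0 };

  function run() {
    for (const event of log.readFrom(offset)) {
      state = applyEvent(state, event); // update internal state, one event at a time
      offset += 1;
      if (offset % 1000 === 0) {
        checkpointStore.save({ state, offset }); // checkpoint every 1000 events
      }
    }
  }
  return { run };
}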

Exactly-Once Semantics

The holy grail: each event affects the output exactly once, even if the processor crashes and restarts.

Exactly-Once Challenge

PLAINTEXT
1. Read event
2. Update internal state
3. Write output
4. Crash!
 
On restart: Did we write the output or not?
If we replay the event, might we duplicate it?

Achieving exactly-once:

  • Checkpoint state + output position atomically
  • Use idempotent writes (writing the same thing twice = no change); see the sketch below
  • Use distributed transactions (expensive)
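
For instance, idempotent writes can be as simple as keying every output record by a deterministic event ID, so a replay after a crash overwrites rather than duplicates; a sketch:

JAVASCRIPT
// Idempotent output: keyed upsert by event ID, so replays overwrite instead of duplicating.
const output = new Map(); // stand-in for a key-value store or database table

function writeResult(event, result) {
  output.set(event.id, result); // writing the same (id, result) twice changes nothing
}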

Note

"Exactly once" is really "effectively once"—the processing may happen multiple times, but the end result is as if it happened once.


Stream Processing Frameworks

Framework               Approach   Key Features
Apache Kafka Streams    Library    Embedded in the application, simple setup
Apache Flink            Engine     True streaming, exactly-once, powerful state
Apache Spark Streaming  Engine     Microbatching, unified with batch
Apache Storm            Engine     Early stream processing engine, now largely legacy
Amazon Kinesis          Managed    AWS integration, serverless option
Google Dataflow         Managed    Unified batch/stream, auto-scaling

Batch vs Stream

The line between batch and stream processing is blurring:

Batch:

  • Process bounded datasets
  • Can see entire input before producing output
  • High throughput, high latency

Stream:

  • Process unbounded data
  • Must produce output before seeing all input
  • Lower throughput, lower latency

Unification (Spark, Flink, Dataflow):

  • Same API for batch and stream
  • Batch = stream processing on bounded input
  • Makes code more portable and consistent

Unified Processing

PYTHON
# Same code works for batch and stream!
data \
  .filter(lambda e: e['action'] == 'purchase') \
  .group_by(key=lambda e: e['product_id']) \
  .sum(field='amount')
 
# Batch: runs on historical data, outputs final result
# Stream: runs continuously, outputs updates

Summary

Stream processing extends batch processing concepts to unbounded data:

Concept      Batch                Stream
Input        Bounded (files)      Unbounded (events)
Processing   Run to completion    Run forever
Output       Complete results     Incremental updates
Time         Historical           Real-time

Key concepts:

  • Message brokers transmit events (direct vs log-based)
  • Joins are possible but require windows for stream-stream joins
  • Windowing (tumbling, hopping, sliding, session) groups unbounded data
  • Event time vs processing time affects correctness
  • Watermarks handle late data
  • Exactly-once requires careful checkpointing or idempotent writes

Note

Stream processing is increasingly important as businesses want real-time insights, recommendations, and fraud detection. Understanding these concepts helps you build responsive, event-driven systems.

The next chapter brings everything together, exploring how to combine these concepts into complete data systems.