Stream Processing
Batch processing operates on bounded datasets—you know when the input ends. But in many scenarios, data arrives continuously: website clicks, sensor readings, financial transactions. Stream processing handles this unbounded data, processing it as it arrives.
Note
Stream processing is like batch processing, but for data that never ends. Many concepts carry over, but the unbounded nature creates new challenges.
What Is a Stream?
A stream is a sequence of events, each with a timestamp, that arrives over time. Events are typically small and self-contained:
{"timestamp": "2024-01-15T10:30:00Z", "user_id": 123, "action": "click", "page": "/products/456"}
{"timestamp": "2024-01-15T10:30:01Z", "user_id": 456, "action": "purchase", "amount": 99.99}
{"timestamp": "2024-01-15T10:30:02Z", "user_id": 123, "action": "click", "page": "/checkout"}Event Sourcing
An event stream can be a complete log of everything that happened:
- Rather than storing current state, store the events that led to it
- Current state is derived by replaying events
- Complete audit trail by design
Event Sourcing: Shopping Cart
Events:
{user: 123, action: "add_item", product: "A"}
{user: 123, action: "add_item", product: "B"}
{user: 123, action: "remove_item", product: "A"}
{user: 123, action: "checkout"}Derived State: Cart contents = [B] Order placed with [B]
Message Brokers
How do you transmit events from producers to consumers? Message brokers (message queues) handle this:
Two Patterns
Direct messaging (RabbitMQ, ActiveMQ):
- Consumer receives message and acknowledges
- Message is deleted after acknowledgment
- If consumer crashes before ack, message is redelivered
Log-based messaging (Apache Kafka, Amazon Kinesis):
- Messages are appended to an immutable log
- Consumers track their position (offset) in the log
- Messages are retained for a configurable period
- Multiple consumers can read the same messages
Log-Based Messaging
Log: [0: click] [1: view] [2: purchase] [3: click] [4: view]
                                ↑
Consumer A: offset = 2 (processing "purchase")
Consumer B: offset = 4 (processing "view", caught up)
Consumer C: offset = 0 (replaying from beginning)
Why Logs?
Durability: Events are persisted; consumers can replay if needed.
Multiple consumers: Each consumer tracks its own offset.
Replayability: Start from any point in history.
Ordering: Within a partition, order is guaranteed.
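The snippet below is a toy in-memory sketch of this log-plus-offsets model (not Kafka's actual API): the broker only ever appends, and each consumer remembers how far it has read, so it can replay from any point.
// Append-only log; reading never deletes entries
class EventLog<T> {
  private entries: T[] = [];
  append(event: T): number {
    this.entries.push(event);
    return this.entries.length - 1; // offset of the new entry
  }
  read(offset: number): T | undefined {
    return this.entries[offset];
  }
}
// Each consumer tracks its own position independently
class LogConsumer<T> {
  constructor(private log: EventLog<T>, private offset = 0) {}
  poll(): T | undefined {
    const event = this.log.read(this.offset);
    if (event !== undefined) this.offset++; // advance only after a successful read
    return event;
  }
}
const log = new EventLog<string>();
["click", "view", "purchase", "click", "view"].forEach((e) => log.append(e));
const consumerA = new LogConsumer(log, 2); // resumes at "purchase"
const consumerC = new LogConsumer(log);    // replays from the beginning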
Warning
Log-based brokers (Kafka) partition data across multiple logs. Ordering is guaranteed within a partition, but not across partitions. Design your partitioning accordingly.
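For instance, a common approach is to derive the partition from the event key so that all events for the same key land in the same partition and keep their relative order; the hash function below is purely illustrative.
// Route every event with the same key (e.g., user_id) to the same partition,
// preserving per-key ordering even though the topic has many partitions
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) | 0; // simple 32-bit string hash
  }
  return Math.abs(hash) % numPartitions;
}
partitionFor("user-123", 8); // always the same partition for "user-123"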
Processing Streams
What can you do with a stream of events?
1. Write to Storage
Take events and write them somewhere—database, file system, search index.
Stream → [Processor] → Database
This is essentially streaming ETL.
2. Push to Users
Send events to users in real-time—notifications, dashboards, alerts.
Stream → [Processor] → Push notification / WebSocket
3. Produce Another Stream
Transform, filter, enrich, or aggregate events:
Raw events → [Processor] → Derived events → [Another Processor] → ...
This is where stream processing gets interesting.
Stream Processing Operations
Simple Transformations
Like batch processing: map, filter, project fields.
// Filter for purchases over $100
stream
.filter((event) => event.action === "purchase" && event.amount > 100)
.map((event) => ({ user: event.user_id, amount: event.amount }));
Joins
Combining streams is trickier than in batch because data is unbounded:
Stream-Stream Join: Joining two streams based on some key within a time window.
Stream-Stream Join
Clicks and purchases both have user_id. Find clicks that led to purchases within 1 hour:
Click stream: [t=10:00, user=123, page=/product/A]
Purchase stream: [t=10:45, user=123, product=A, amount=$50]
Join result: User 123 clicked product A and bought it 45 min later
Stream-Table Join: Enrich a stream with data from a table (database lookup).
Event: {user_id: 123, action: click}
Table: {id: 123, name: "Alice", tier: "premium"}
Enriched: {user_id: 123, user_name: "Alice", tier: "premium", action: click}
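A sketch of this stream-table join, assuming the table is small enough to hold in a local map (in practice it might be a changelog-backed state store or a database lookup); the types and field names mirror the example above.
// The "table" side of the join, keyed by user id
interface UserRow { id: number; name: string; tier: string }
const users = new Map<number, UserRow>([
  [123, { id: 123, name: "Alice", tier: "premium" }],
]);
// Enrich each incoming event with the matching row, if any
function enrich(event: { user_id: number; action: string }) {
  const row = users.get(event.user_id);
  return {
    user_id: event.user_id,
    user_name: row?.name ?? "unknown",
    tier: row?.tier ?? "unknown",
    action: event.action,
  };
}
// enrich({ user_id: 123, action: "click" })
// => { user_id: 123, user_name: "Alice", tier: "premium", action: "click" }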
Table-Table Join: Maintain a materialized view of joined tables, updated as either table changes.
Windowing
For unbounded data, you often need to group events into finite windows:
Tumbling windows: Fixed-size, non-overlapping.
|--5 min--|--5 min--|--5 min--|
Hopping windows: Fixed-size, overlapping.
|--5 min--|
    |--5 min--|
        |--5 min--|
Sliding windows: Events within a duration of each other.
Each event: "all events within 5 minutes before me"
Session windows: Group by activity, with gaps as boundaries.
[activity] [gap] [activity] [activity] [gap] [activity]
|--session 1--| |------session 2------| |--session 3--|
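Before the aggregation example, here is a rough sketch of the mechanics: for a tumbling window, the window an event belongs to can be computed directly from its timestamp. The 5-minute size and the windowStart helper are illustrative.
// Assign an event to a fixed-size, non-overlapping (tumbling) window
const WINDOW_MS = 5 * 60 * 1000; // 5 minutes
function windowStart(eventTimeMs: number): number {
  return Math.floor(eventTimeMs / WINDOW_MS) * WINDOW_MS;
}
// Events at 10:00:30 and 10:04:59 share the [10:00, 10:05) window;
// an event at 10:05:00 opens the next one
windowStart(Date.parse("2024-01-15T10:03:00Z")); // => 2024-01-15T10:00:00Z in ms
A hopping window would instead assign the same event to every window whose range covers its timestamp.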
Windowed Aggregation
// Count clicks per page per 5-minute tumbling window
stream
.filter((e) => e.action === "click")
.keyBy((e) => e.page)
.window(TumblingWindow.of(5, MINUTES))
.count();
// Output: (page, window_start, count)
// ("/products", "10:00:00", 145)
// ("/products", "10:05:00", 132)
Time and Ordering
Time is surprisingly tricky in stream processing.
Event Time vs Processing Time
Event time: When the event actually occurred (from the event's timestamp).
Processing time: When the system processes the event.
These can differ significantly due to:
- Network delays
- Batching at the source
- Consumer lag
- Retries
Warning
If you use processing time for windowing, late-arriving events are assigned to the wrong window. Use event time when the business logic depends on when things actually happened.
Late Events
With event-time processing, how long do you wait for late events before closing a window?
Watermarks help: a watermark is a timestamp that says "all events with timestamps before this have arrived."
Watermarks
Time:    10:00  10:01  10:02  10:03  10:04  10:05
Events:    A      B      -      C      D      E
                         ↑
               Watermark: 10:02
        "Events before 10:02 are complete"
Late event arrives: timestamp 10:01 at clock time 10:05
→ After watermark, so it's "late"
Options for late events:
- Drop them (simplest, loses data)
- Include them and update previous results
- Send them to a side output for special handling
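As a rough sketch (not any framework's API), a windowed count can use the watermark to decide when a window is final and divert anything that arrives afterwards to a side output.
// Windowed counts keyed by window start; a window is closed once the
// watermark passes its end, and later events for it are treated as late
const WINDOW_MS = 5 * 60 * 1000;
const counts = new Map<number, number>();
const lateEvents: number[] = []; // side output for late events
let watermark = 0;               // "all events before this have arrived"
function onEvent(eventTimeMs: number) {
  const start = Math.floor(eventTimeMs / WINDOW_MS) * WINDOW_MS;
  if (start + WINDOW_MS <= watermark) {
    lateEvents.push(eventTimeMs); // window already closed: drop, update, or side-output
    return;
  }
  counts.set(start, (counts.get(start) ?? 0) + 1);
}
function onWatermark(newWatermark: number) {
  watermark = newWatermark;
  for (const [start, count] of counts) {
    if (start + WINDOW_MS <= watermark) {
      console.log(`window ${new Date(start).toISOString()}: ${count}`); // emit final result
      counts.delete(start);
    }
  }
}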
Fault Tolerance
Stream processors run continuously, so failures are inevitable. How do you recover without losing or duplicating data?
Microbatching
Process the stream in tiny batches (e.g., 1 second). Each batch is treated like a small batch job with atomic commit.
Spark Streaming uses this approach.
Advantage: Reuses batch fault tolerance mechanisms. Disadvantage: Latency is at least one batch interval.
Checkpointing
Periodically save the state of the processor. On failure, restore from checkpoint and replay events from that point.
Apache Flink uses this approach.
Advantage: True streaming (event-at-a-time processing). Disadvantage: Must handle "exactly once" semantics carefully.
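A minimal sketch of the idea, assuming a replayable input log and state small enough to snapshot; saveCheckpoint and restore stand in for writes to durable storage.
// Snapshot (input offset, state) together; on restart, restore the latest
// snapshot and replay the input from the saved offset
interface Checkpoint { offset: number; state: Record<string, number> }
const checkpoints: Checkpoint[] = []; // stands in for durable checkpoint storage
let state: Record<string, number> = {};
let offset = 0;
function saveCheckpoint() {
  checkpoints.push({ offset, state: { ...state } }); // must be written atomically
}
function restore() {
  const last = checkpoints[checkpoints.length - 1];
  if (last) { state = { ...last.state }; offset = last.offset; }
}
function run(events: string[]) {
  while (offset < events.length) {
    const event = events[offset];
    state[event] = (state[event] ?? 0) + 1; // update processor state
    offset++;
    if (offset % 100 === 0) saveCheckpoint(); // checkpoint periodically
  }
}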
Exactly-Once Semantics
The holy grail: each event affects the output exactly once, even if the processor crashes and restarts.
Exactly-Once Challenge
1. Read event
2. Update internal state
3. Write output
4. Crash!
On restart: Did we write the output or not?
If we replay the event, might we duplicate it?
Achieving exactly-once:
- Checkpoint state + output position atomically
- Use idempotent writes (writing the same thing twice = no change; see the sketch after this list)
- Use distributed transactions (expensive)
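As an example of the idempotent-write option, keying each output record by a deterministic event ID means a replay after a crash overwrites the same record rather than duplicating it; the eventId field and the in-memory sink are assumptions for illustration.
// Output keyed by event ID: writing the same record twice leaves one row
const sink = new Map<string, { eventId: string; total: number }>(); // stands in for a keyed datastore
function writeIdempotent(record: { eventId: string; total: number }) {
  sink.set(record.eventId, record); // upsert by key, not append
}
writeIdempotent({ eventId: "evt-42", total: 100 });
writeIdempotent({ eventId: "evt-42", total: 100 }); // replay after crash: no duplicate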
Note
"Exactly once" is really "effectively once"—the processing may happen multiple times, but the end result is as if it happened once.
Stream Processing Frameworks
| Framework | Approach | Key Features |
|---|---|---|
| Apache Kafka Streams | Library | Embedded in application, simple setup |
| Apache Flink | Engine | True streaming, exactly-once, powerful state |
| Apache Spark Streaming | Engine | Microbatching, unified with batch |
| Apache Storm | Engine | Early stream processor, now largely legacy |
| Amazon Kinesis | Managed | AWS integration, serverless option |
| Google Dataflow | Managed | Unified batch/stream, auto-scaling |
Batch vs Stream
The line between batch and stream processing is blurring:
Batch:
- Process bounded datasets
- Can see entire input before producing output
- High throughput, high latency
Stream:
- Process unbounded data
- Must produce output before seeing all input
- Lower throughput, lower latency
Unification (Spark, Flink, Dataflow):
- Same API for batch and stream
- Batch = stream processing on bounded input
- Makes code more portable and consistent
Unified Processing
# Same code works for batch and stream!
data \
.filter(lambda e: e['action'] == 'purchase') \
.group_by(key=lambda e: e['product_id']) \
.sum(field='amount')
# Batch: runs on historical data, outputs final result
# Stream: runs continuously, outputs updates
Summary
Stream processing extends batch processing concepts to unbounded data:
| Concept | Batch | Stream |
|---|---|---|
| Input | Bounded (files) | Unbounded (events) |
| Processing | Run to completion | Run forever |
| Output | Complete results | Incremental updates |
| Time | Historical | Real-time |
Key concepts:
- Message brokers transmit events (direct vs log-based)
- Joins are possible but require windows for stream-stream joins
- Windowing (tumbling, hopping, sliding, session) groups unbounded data
- Event time vs processing time affects correctness
- Watermarks handle late data
- Exactly-once requires careful checkpointing or idempotent writes
Note
Stream processing is increasingly important as businesses want real-time insights, recommendations, and fraud detection. Understanding these concepts helps you build responsive, event-driven systems.
The next chapter brings everything together, exploring how to combine these concepts into complete data systems.