Batch Processing
So far we've focused on online systems—databases and services that respond to requests. But another crucial category is batch processing: taking a large amount of data as input, running a job on it, and producing output data.
Batch jobs are ubiquitous:
- Building search indexes
- Training machine learning models
- Calculating analytics and reports
- ETL (Extract, Transform, Load) for data warehouses
Note
The distinction: online systems care about response time (latency), while batch systems care about throughput—how much data can be processed in a given time.
Unix Tools: The Original Batch Processing
Unix command-line tools are a surprisingly powerful batch processing system:
cat access.log |      # read the log
  awk '{print $7}' |  # extract the URL (7th whitespace-separated field)
  sort |              # bring identical URLs together
  uniq -c |           # collapse duplicates, prefixing each URL with a count
  sort -rn |          # sort by count, descending
  head -n 10          # keep the top 10

This pipeline finds the 10 most frequently requested URLs in a log file.
Unix Philosophy
- Make each program do one thing well
- Expect the output of every program to become the input to another
- Design programs to be connected
Key features:
- Uniform interface: Everything is a file (or stream of bytes)
- Immutability: Input files are unchanged; new output files are created
- Composability: Small tools combined into complex workflows
Unix Pipeline Benefits
# Each step can be tested independently
cat log.txt | head -n 100 | awk '{print $7}' # test awk pattern
# Intermediate results can be inspected
cat log.txt | awk '{print $7}' | less
# Components can be swapped
cat log.txt | cut -d' ' -f7 | sort | uniq -c  # cut instead of awk

The biggest limitation: Unix tools run on a single machine. For truly large datasets, we need to go distributed.
MapReduce
MapReduce (popularized by Google and implemented in Hadoop) applies the Unix philosophy to distributed systems.
How It Works
A MapReduce job has two phases:
Map: Extract key-value pairs from each input record (runs on many machines in parallel)
Reduce: Collect all values for each key and combine them (all values for a key go to one reducer)
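As a minimal sketch in plain Python (not the real Hadoop API), these are the two callbacks a programmer supplies for word count; the framework handles splitting the input, shuffling, and grouping. The walkthrough below traces the same computation on two input lines.

# Called once per input record; emits (key, value) pairs.
def word_count_map(line):
    for word in line.split():
        yield (word, 1)

# Called once per key, with every value that was emitted for that key.
def word_count_reduce(word, counts):
    yield (word, sum(counts))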
Word Count with MapReduce
Input:
"hello world"
"hello hadoop"Map phase (runs in parallel on each input):
"hello world" → [("hello", 1), ("world", 1)]
"hello hadoop" → [("hello", 1), ("hadoop", 1)]Shuffle (group by key):
"hello" → [1, 1]
"world" → [1]
"hadoop" → [1]Reduce phase (sum values for each key):
"hello" → 2
"world" → 1
"hadoop" → 1Distributed Execution
- Input is split into partitions (typically HDFS blocks)
- Mapper tasks run in parallel across machines
- Mappers write output to local disk
- Shuffle phase transfers data to reducers, sorted by key (see the partitioning sketch after this list)
- Reducers process each key and write to output files
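How a key finds its reducer is typically plain hash partitioning; a minimal sketch in Python, with num_reducers standing in for the job's configured reducer count:

import zlib

# Every mapper applies the same deterministic function, so all pairs that
# share a key are routed to the same reducer, which sees them in sorted order.
def choose_reducer(key, num_reducers):
    return zlib.crc32(key.encode("utf-8")) % num_reducers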
Input Files                             Output Files
     │                                       ▲
     ▼                                       │
┌────────┐  ┌────────┐      ┌────────┐  ┌────────┐
│ Mapper │  │ Mapper │      │Reducer │  │Reducer │
└────────┘  └────────┘      └────────┘  └────────┘
     │           │               ▲           ▲
     └───────────┴─── shuffle ───┴───────────┘

Joins in MapReduce
Real-world data often requires joining datasets. MapReduce handles this through different techniques:
Sort-Merge Join: Both datasets are partitioned by join key. Records with the same key end up at the same reducer.
Broadcast Join: If one dataset is small, send it to all mappers. Each mapper can join locally.
Partitioned Hash Join: If both datasets are partitioned the same way, each partition can be joined locally.
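As a rough sketch of the reducer side of a sort-merge join (plain Python, with hypothetical record shapes matching the example below), all records that share a user id arrive at one reducer, which combines the profile with each event:

# Reduce-side join: both inputs were mapped to (user_id, record) pairs.
def join_reduce(user_id, records):
    user = None
    events = []
    for record in records:
        if "name" in record:       # the user profile record
            user = record
        else:                      # an activity event record
            events.append(record)
    for event in events:
        yield (user["name"], event["action"])

Real implementations use a secondary sort so the profile record arrives first and the events need not be buffered; a broadcast join would instead load the small users table into an in-memory dict on every mapper and skip the shuffle for it entirely.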
Sort-Merge Join
Users:                    Events:
id=1, name=Alice          user_id=1, action=click
id=2, name=Bob            user_id=2, action=view
                          user_id=1, action=view
Map phase (both inputs):
→ (1, {name: Alice})
→ (2, {name: Bob})
→ (1, {action: click})
→ (2, {action: view})
→ (1, {action: view})
Reduce (for key=1):
{name: Alice}, {action: click}, {action: view}
→ ("Alice", "click"), ("Alice", "view")Beyond MapReduce
MapReduce is powerful but verbose. Writing even simple operations requires significant boilerplate. Higher-level frameworks have emerged:
Dataflow Engines
Spark, Flink, and Tez represent workflows as directed acyclic graphs (DAGs) of operators.
Advantages over MapReduce:
- No mandatory map-then-reduce structure
- Operators can be chained without writing to disk
- Better optimization of execution plans
- More expressive APIs
Word Count: MapReduce vs Spark
MapReduce (Java):
public static class TokenizerMapper extends Mapper<...> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
// Plus Reducer class, driver class, configuration...

Spark:
textFile.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Fault Tolerance
When a machine fails mid-computation, work is lost. How do systems recover?
MapReduce: Mappers write to disk; reducers can re-read if needed. Failed tasks are rerun from the beginning.
Spark: Tracks lineage (which transformations produced which data). If a partition is lost, recompute it from the parent data.
Note
Spark's approach is more efficient when recomputation is cheap, but can be problematic if a failure late in a long chain requires recomputing everything.
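One common mitigation is checkpointing: materialize an intermediate RDD to stable storage so recovery restarts from the saved copy rather than from the original input. A sketch with PySpark, assuming an existing SparkContext named sc and made-up HDFS paths:

# Cut a long lineage by persisting an intermediate result to stable storage.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")        # hypothetical path

url_counts = (sc.textFile("hdfs:///logs/access.log")  # hypothetical input
                .map(lambda line: (line.split()[6], 1))
                .reduceByKey(lambda a, b: a + b))

url_counts.checkpoint()  # lineage is truncated once the checkpoint is written
url_counts.count()       # an action triggers computation and the checkpoint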
Graphs and Iterative Processing
Some problems involve graphs (social networks, web links, road networks) and require iterative algorithms:
- PageRank
- Shortest paths
- Connected components
- Machine learning training
MapReduce is awkward for iteration—you must chain multiple jobs, and the intermediate state is written to disk each time.
The Bulk Synchronous Parallel (BSP) Model
Graph processing frameworks (Pregel, Giraph, GraphX) use BSP:
- Each vertex is a unit of computation
- In each iteration ("superstep"), vertices process messages from previous step
- Vertices send messages to neighbors
- Barrier synchronization between supersteps
- Repeat until convergence
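A minimal single-machine sketch of the superstep loop in plain Python (a toy three-page graph; a real Pregel-style system partitions the vertices across machines and enforces the barrier between supersteps). It uses the same update rule as the walkthrough below.

# links: page -> pages it links to (toy graph for illustration)
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 / len(links) for page in links}

for superstep in range(20):                  # or: until ranks stop changing
    messages = {page: 0.0 for page in links}
    for page, outlinks in links.items():     # each vertex sends messages
        for target in outlinks:
            messages[target] += ranks[page] / len(outlinks)
    # barrier: every message is delivered before any rank is updated
    ranks = {page: 0.15 + 0.85 * messages[page] for page in links}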
PageRank in BSP
Superstep 1:
Each page sends its rank / outlink_count to all linked pages
Superstep 2:
Each page sums incoming contributions
New rank = 0.15 + 0.85 * sum
Send new contributions...
Repeat until ranks stabilize.

The Output of Batch Processing
What do you do with batch job results?
Building Search Indexes
The original MapReduce use case at Google: crawl the web, build indexes, write to servers that handle search queries.
Building Machine Learning Models
Train on historical data, produce a model file that can be loaded for inference.
Key-Value Stores for Lookup
Build a read-only database from batch output. Applications can then query it.
Batch Output to Key-Value Store
1. MapReduce job analyzes user behavior
2. Output: user_id → recommended_products
3. Build read-only key-value database from output
4. Web servers query database for recommendations
5. Next day: new batch job produces new database
6. Atomically switch to new database

Philosophy: Data as Immutable Input
A key principle: batch jobs read input and produce new output. They don't modify input.
Benefits:
- Easy to recover from bugs (rerun with fixed code)
- Easy to experiment (run two versions, compare output)
- Easy to audit (all data versions preserved)
Warning
This is in contrast to OLTP databases where data is mutated in place. The batch approach trades storage cost for operational simplicity.
Comparing Batch Processing Approaches
| Aspect | Unix | MapReduce | Dataflow (Spark) |
|---|---|---|---|
| Scale | Single machine | Cluster | Cluster |
| Intermediate storage | Pipes (memory) | HDFS (disk) | Memory + disk |
| Fault tolerance | Re-run pipeline | Re-run task | Lineage-based recompute |
| Expressiveness | Limited | Two-phase only | Arbitrary DAG |
| Latency | Low | High | Medium |
Summary
Batch processing has a different set of priorities than online systems:
| Online (OLTP) | Batch |
|---|---|
| Latency critical | Throughput critical |
| Small requests | Large datasets |
| Read/write | Read-only input, append-only output |
| Always available | Can run when convenient |
Key concepts:
- MapReduce applies Unix philosophy at scale: simple operations, immutable data, composition
- Joins in batch: sort-merge, broadcast, partitioned hash
- Dataflow engines (Spark, Flink) generalize MapReduce with DAGs and better performance
- Graph processing (Pregel, BSP) handles iterative algorithms
- Immutability is a core principle: batch jobs read input and produce new output
Note
Batch processing is the foundation of data infrastructure: building indexes, training models, generating reports. Understanding it helps you design data systems that scale.
The next chapter explores stream processing—similar concepts but for data that arrives continuously rather than in bounded batches.