Batch Processing
So far we've focused on online systems—databases and services that respond to requests. But another crucial category is batch processing: taking a large amount of data as input, running a job on it, and producing output data.
Batch jobs are ubiquitous:
- Building search indexes
- Training machine learning models
- Calculating analytics and reports
- ETL (Extract, Transform, Load) for data warehouses
Note
The distinction: online systems care about response time (latency), while batch systems care about throughput—how much data can be processed in a given time.
Unix Tools: The Original Batch Processing
Unix command-line tools are a surprisingly powerful batch processing system:
cat access.log |      # read the log
  awk '{print $7}' |  # extract the URL (7th whitespace-separated field)
  sort |              # bring identical URLs together
  uniq -c |           # collapse duplicates, prefixing each URL with a count
  sort -rn |          # sort by count, descending
  head -n 10          # keep the top 10

This pipeline finds the 10 most frequently requested URLs in a log file.
Unix Philosophy
- Make each program do one thing well
- Expect the output of every program to become the input to another
- Design programs to be connected
Key features:
- Uniform interface: Everything is a file (or stream of bytes)
- Immutability: Input files are unchanged; new output files are created
- Composability: Small tools combined into complex workflows
Unix Pipeline Benefits
# Each step can be tested independently
cat log.txt | head -n 100 | awk '{print $7}' # test awk pattern
# Intermediate results can be inspected
cat log.txt | awk '{print $7}' | less
# Components can be swapped
cat log.txt | cut -d' ' -f7 | sort | uniq -c  # cut instead of awk

The biggest limitation: Unix tools run on a single machine. For truly large datasets, we need to go distributed.
MapReduce
MapReduce (popularized by Google and implemented in Hadoop) applies the Unix philosophy to distributed systems.
How It Works
A MapReduce job has two phases:
Map: Extract key-value pairs from each input record (runs on many machines in parallel)
Reduce: Collect all values for each key and combine them (all values for a key go to one reducer)
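As a minimal sketch in plain Python (not the real Hadoop API), these are the two callbacks a programmer supplies for word count; the framework handles splitting the input, shuffling, and grouping. The walkthrough below traces the same computation on two input lines.

# Called once per input record; emits (key, value) pairs.
def word_count_map(line):
    for word in line.split():
        yield (word, 1)

# Called once per key, with every value that was emitted for that key.
def word_count_reduce(word, counts):
    yield (word, sum(counts))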
Word Count with MapReduce
Input:
"hello world"
"hello hadoop"Map phase (runs in parallel on each input):
"hello world" → [("hello", 1), ("world", 1)]
"hello hadoop" → [("hello", 1), ("hadoop", 1)]Shuffle (group by key):
"hello" → [1, 1]
"world" → [1]
"hadoop" → [1]Reduce phase (sum values for each key):
"hello" → 2
"world" → 1
"hadoop" → 1Distributed Execution
- Input is split into partitions (typically HDFS blocks)
- Mapper tasks run in parallel across machines
- Mappers write output to local disk
- Shuffle phase transfers data to reducers, sorted by key (see the partitioning sketch after this list)
- Reducers process each key and write to output files
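How a key finds its reducer is typically plain hash partitioning; a minimal sketch in Python, with num_reducers standing in for the job's configured reducer count:

import zlib

# Every mapper applies the same deterministic function, so all pairs that
# share a key are routed to the same reducer, which sees them in sorted order.
def choose_reducer(key, num_reducers):
    return zlib.crc32(key.encode("utf-8")) % num_reducers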
Input Files                             Output Files
     │                                       ▲
     ▼                                       │
┌────────┐  ┌────────┐      ┌────────┐  ┌────────┐
│ Mapper │  │ Mapper │      │Reducer │  │Reducer │
└────────┘  └────────┘      └────────┘  └────────┘
     │           │               ▲           ▲
     └───────────┴─── shuffle ───┴───────────┘

Joins in MapReduce
Real-world data often requires joining datasets. MapReduce handles this through different techniques:
Sort-Merge Join: Both datasets are partitioned by join key. Records with the same key end up at the same reducer.
Broadcast Join: If one dataset is small, send it to all mappers. Each mapper can join locally.
Partitioned Hash Join: If both datasets are partitioned the same way, each partition can be joined locally.
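As a rough sketch of the reducer side of a sort-merge join (plain Python, with hypothetical record shapes matching the example below), all records that share a user id arrive at one reducer, which combines the profile with each event:

# Reduce-side join: both inputs were mapped to (user_id, record) pairs.
def join_reduce(user_id, records):
    user = None
    events = []
    for record in records:
        if "name" in record:       # the user profile record
            user = record
        else:                      # an activity event record
            events.append(record)
    for event in events:
        yield (user["name"], event["action"])

Real implementations use a secondary sort so the profile record arrives first and the events need not be buffered; a broadcast join would instead load the small users table into an in-memory dict on every mapper and skip the shuffle for it entirely.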
Sort-Merge Join
Users:                    Events:
id=1, name=Alice          user_id=1, action=click
id=2, name=Bob            user_id=2, action=view
                          user_id=1, action=view
Map phase (both inputs):
→ (1, {name: Alice})
→ (2, {name: Bob})
→ (1, {action: click})
→ (2, {action: view})
→ (1, {action: view})
Reduce (for key=1):
{name: Alice}, {action: click}, {action: view}
→ ("Alice", "click"), ("Alice", "view")Beyond MapReduce
MapReduce is powerful but verbose. Writing even simple operations requires significant boilerplate. Higher-level frameworks have emerged:
Dataflow Engines
Spark, Flink, and Tez represent workflows as directed acyclic graphs (DAGs) of operators.
Advantages over MapReduce:
- No mandatory map-then-reduce structure
- Operators can be chained without writing to disk
- Better optimization of execution plans
- More expressive APIs
Word Count: MapReduce vs Spark
MapReduce (Java):
public static class TokenizerMapper extends Mapper<...> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
// Plus Reducer class, driver class, configuration...

Spark:
textFile.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Fault Tolerance
When a machine fails mid-computation, work is lost. How do systems recover?
MapReduce: Mappers write to disk; reducers can re-read if needed. Failed tasks are rerun from the beginning.
Spark: Tracks lineage (which transformations produced which data). If a partition is lost, recompute it from the parent data.
Note
Spark's approach is more efficient when recomputation is cheap, but can be problematic if a failure late in a long chain requires recomputing everything.
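One common mitigation is checkpointing: materialize an intermediate RDD to stable storage so recovery restarts from the saved copy rather than from the original input. A sketch with PySpark, assuming an existing SparkContext named sc and made-up HDFS paths:

# Cut a long lineage by persisting an intermediate result to stable storage.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")        # hypothetical path

url_counts = (sc.textFile("hdfs:///logs/access.log")  # hypothetical input
                .map(lambda line: (line.split()[6], 1))
                .reduceByKey(lambda a, b: a + b))

url_counts.checkpoint()  # lineage is truncated once the checkpoint is written
url_counts.count()       # an action triggers computation and the checkpoint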
Graphs and Iterative Processing
Some problems involve graphs (social networks, web links, road networks) and require iterative algorithms:
- PageRank
- Shortest paths
- Connected components
- Machine learning training
MapReduce is awkward for iteration—you must chain multiple jobs, and the intermediate state is written to disk each time.
The Bulk Synchronous Parallel (BSP) Model
Graph processing frameworks (Pregel, Giraph, GraphX) use BSP:
- Each vertex is a unit of computation
- In each iteration ("superstep"), vertices process messages from previous step
- Vertices send messages to neighbors
- Barrier synchronization between supersteps
- Repeat until convergence
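A minimal single-machine sketch of the superstep loop in plain Python (a toy three-page graph; a real Pregel-style system partitions the vertices across machines and enforces the barrier between supersteps). It uses the same update rule as the walkthrough below.

# links: page -> pages it links to (toy graph for illustration)
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 / len(links) for page in links}

for superstep in range(20):                  # or: until ranks stop changing
    messages = {page: 0.0 for page in links}
    for page, outlinks in links.items():     # each vertex sends messages
        for target in outlinks:
            messages[target] += ranks[page] / len(outlinks)
    # barrier: every message is delivered before any rank is updated
    ranks = {page: 0.15 + 0.85 * messages[page] for page in links}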
PageRank in BSP
Superstep 1:
Each page sends its rank / outlink_count to all linked pages
Superstep 2:
Each page sums incoming contributions
New rank = 0.15 + 0.85 * sum
Send new contributions...
Repeat until ranks stabilize.

The Output of Batch Processing
What do you do with batch job results?
Building Search Indexes
The original MapReduce use case at Google: crawl the web, build indexes, write to servers that handle search queries.
Building Machine Learning Models
Train on historical data, produce a model file that can be loaded for inference.
Key-Value Stores for Lookup
Build a read-only database from batch output. Applications can then query it.
Batch Output to Key-Value Store
1. MapReduce job analyzes user behavior
2. Output: user_id → recommended_products
3. Build read-only key-value database from output
4. Web servers query database for recommendations
5. Next day: new batch job produces new database
6. Atomically switch to new database

Philosophy: Data as Immutable Input
A key principle: batch jobs read input and produce new output. They don't modify input.
Benefits:
- Easy to recover from bugs (rerun with fixed code)
- Easy to experiment (run two versions, compare output)
- Easy to audit (all data versions preserved)
Warning
This is in contrast to OLTP databases where data is mutated in place. The batch approach trades storage cost for operational simplicity.
Comparing Batch Processing Approaches
| Aspect | Unix | MapReduce | Dataflow (Spark) |
|---|---|---|---|
| Scale | Single machine | Cluster | Cluster |
| Intermediate storage | Pipes (memory) | HDFS (disk) | Memory + disk |
| Fault tolerance | Re-run pipeline | Re-run task | Lineage-based recompute |
| Expressiveness | Limited | Two-phase only | Arbitrary DAG |
| Latency | Low | High | Medium |
Summary
Batch processing has a different set of priorities than online systems:
| Online (OLTP) | Batch |
|---|---|
| Latency critical | Throughput critical |
| Small requests | Large datasets |
| Read/write | Read-only input, append-only output |
| Always available | Can run when convenient |
Key concepts:
- MapReduce applies Unix philosophy at scale: simple operations, immutable data, composition
- Joins in batch: sort-merge, broadcast, partitioned hash
- Dataflow engines (Spark, Flink) generalize MapReduce with DAGs and better performance
- Graph processing (Pregel, BSP) handles iterative algorithms
- Immutability is a core principle: batch jobs read input and produce new output
Note
Batch processing is the foundation of data infrastructure: building indexes, training models, generating reports. Understanding it helps you design data systems that scale.
The next chapter explores stream processing—similar concepts but for data that arrives continuously rather than in bounded batches.