Designing Data-Intensive Applications

Batch Processing

So far we've focused on online systems—databases and services that respond to requests. But another crucial category is batch processing: taking a large amount of data as input, running a job on it, and producing output data.

Batch jobs are ubiquitous:

  • Building search indexes
  • Training machine learning models
  • Calculating analytics and reports
  • ETL (Extract, Transform, Load) for data warehouses

Note

The distinction: online systems care about response time (latency), while batch systems care about throughput—how much data can be processed in a given time.


Unix Tools: The Original Batch Processing

Unix command-line tools are a surprisingly powerful batch processing system:

BASH
cat access.log |     # read the web server access log
  awk '{print $7}' | # extract the 7th field: the requested URL
  sort |             # sort so identical URLs are adjacent
  uniq -c |          # collapse adjacent duplicates, prefixing each with its count
  sort -rn |         # sort numerically by count, descending
  head -n 10         # keep the top 10

This pipeline finds the 10 most frequently requested URLs in a log file.

Unix Philosophy

  1. Make each program do one thing well
  2. Expect the output of every program to become the input to another
  3. Design programs to be connected

Key features:

  • Uniform interface: Everything is a file (or stream of bytes)
  • Immutability: Input files are unchanged; new output files are created
  • Composability: Small tools combined into complex workflows

Unix Pipeline Benefits

BASH
# Each step can be tested independently
cat log.txt | head -n 100 | awk '{print $7}'  # test awk pattern
 
# Intermediate results can be inspected
cat log.txt | awk '{print $7}' | less
 
# Components can be swapped
cat log.txt | cut -d' ' -f7 | sort | uniq -c  # cut instead of awk

The biggest limitation: Unix tools run on a single machine. For truly large datasets, we need to go distributed.


MapReduce

MapReduce (popularized by Google and implemented in Hadoop) applies the Unix philosophy to distributed systems.

How It Works

A MapReduce job has two phases:

Map: Extract key-value pairs from each input record (runs on many machines in parallel)

Reduce: Collect all values for each key and combine them (all values for a key go to one reducer)

Word Count with MapReduce

Input:

PLAINTEXT
"hello world"
"hello hadoop"

Map phase (runs in parallel on each input):

PLAINTEXT
"hello world" → [("hello", 1), ("world", 1)]
"hello hadoop" → [("hello", 1), ("hadoop", 1)]

Shuffle (group by key):

PLAINTEXT
"hello" → [1, 1]
"world" → [1]
"hadoop" → [1]

Reduce phase (sum values for each key):

PLAINTEXT
"hello" → 2
"world" → 1
"hadoop" → 1

Distributed Execution

  1. Input is split into partitions (typically HDFS blocks)
  2. Mapper tasks run in parallel across machines
  3. Mappers write output to local disk
  4. Shuffle phase transfers data to reducers, sorted by key (see the partitioning sketch below)
  5. Reducers process each key and write to output files

PLAINTEXT
Input Files                    Output Files
    │                              ▲
    ▼                              │
┌────────┐  ┌────────┐       ┌────────┐  ┌────────┐
│ Mapper │  │ Mapper │       │Reducer │  │Reducer │
└────────┘  └────────┘       └────────┘  └────────┘
    │           │                ▲           ▲
    └───────────┴───── shuffle ──┴───────────┘
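
The routing in step 4 is deterministic: every mapper applies the same hash function, so all pairs with a given key land at the same reducer. A minimal sketch of the scheme used by Hadoop's default HashPartitioner (exact details vary by framework):

SCALA
// Same key => same partition number, on every mapper.
def partitionFor(key: String, numReducers: Int): Int =
  (key.hashCode & Int.MaxValue) % numReducers  // mask keeps the value non-negative

partitionFor("hello", 4)  // always routes "hello" to the same reducer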

Joins in MapReduce

Real-world data often requires joining datasets. MapReduce handles this through different techniques:

Sort-Merge Join: Both datasets are partitioned by join key. Records with the same key end up at the same reducer.

Broadcast Join: If one dataset is small, send it to all mappers. Each mapper can join locally.

Partitioned Hash Join: If both datasets are partitioned the same way, each partition can be joined locally.

Sort-Merge Join

PLAINTEXT
Users:          Events:
id=1, name=Alice    user_id=1, action=click
id=2, name=Bob      user_id=2, action=view
                    user_id=1, action=view
 
Map phase (both inputs):
  → (1, {name: Alice})
  → (2, {name: Bob})
  → (1, {action: click})
  → (2, {action: view})
  → (1, {action: view})
 
Reduce (for key=1):
  {name: Alice}, {action: click}, {action: view}
  → ("Alice", "click"), ("Alice", "view")

Beyond MapReduce

MapReduce is powerful but verbose. Writing even simple operations requires significant boilerplate. Higher-level frameworks have emerged:

Dataflow Engines

Spark, Flink, and Tez represent workflows as directed acyclic graphs (DAGs) of operators.

Advantages over MapReduce:

  • No mandatory map-then-reduce structure
  • Operators can be chained without writing to disk
  • Better optimization of execution plans
  • More expressive APIs

Word Count: MapReduce vs Spark

MapReduce (Java):

JAVA
// imports: java.io.IOException, java.util.StringTokenizer,
//          org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.Mapper
public static class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());   // emit (word, 1) for each token
      context.write(word, one);
    }
  }
}
// Plus a Reducer class, a driver class, job configuration...

Spark:

SCALA
// textFile is an RDD[String], e.g. created by sc.textFile("input.txt")
textFile.flatMap(_.split(" "))   // one record per word
        .map(word => (word, 1))  // pair each word with a count of 1
        .reduceByKey(_ + _)      // sum the counts for each word

Fault Tolerance

When a machine fails mid-computation, work is lost. How do systems recover?

MapReduce: Mappers write to disk; reducers can re-read if needed. Failed tasks are rerun from the beginning.

Spark: Tracks lineage (which transformations produced which data). If a partition is lost, recompute it from the parent data.

Note

Spark's approach is more efficient when recomputation is cheap, but can be problematic if a failure late in a long chain requires recomputing everything.
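
One common mitigation is checkpointing: materialize an intermediate RDD to reliable storage so recovery does not have to replay the whole lineage chain. A minimal sketch using Spark's checkpoint API (assumes a live SparkContext `sc`; the directory and input file are hypothetical):

SCALA
// Checkpointing truncates the lineage: recovery restarts from the
// checkpoint instead of recomputing every upstream transformation.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical directory

val counts = sc.textFile("hdfs:///logs/access.log")  // hypothetical input
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.checkpoint()  // written out the next time `counts` is computed
counts.count()       // action triggers the computation and the checkpoint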


Graphs and Iterative Processing

Some problems involve graphs (social networks, web links, road networks) and require iterative algorithms:

  • PageRank
  • Shortest paths
  • Connected components
  • Machine learning training

MapReduce is awkward for iteration—you must chain multiple jobs, and the intermediate state is written to disk each time.

The Bulk Synchronous Parallel (BSP) Model

Graph processing frameworks (Pregel, Giraph, GraphX) use BSP:

  1. Each vertex is a unit of computation
  2. In each iteration ("superstep"), each vertex processes the messages sent to it in the previous superstep
  3. Vertices send messages to neighbors
  4. Barrier synchronization between supersteps
  5. Repeat until convergence

PageRank in BSP

PLAINTEXT
Superstep 1:
  Each page sends its rank / outlink_count to all linked pages
 
Superstep 2:
  Each page sums incoming contributions
  New rank = 0.15 + 0.85 * sum
  Send new contributions...
 
Repeat until ranks stabilize.
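
A single-machine sketch of those supersteps on a hypothetical three-page graph. The 0.15/0.85 damping constants come from the formula above; a fixed superstep count stands in for a real convergence test:

SCALA
// BSP-style PageRank: each superstep, pages send rank/outdegree to the
// pages they link to, then every page sums what it received.
val links = Map("a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a"))
var ranks = links.keys.map(page => page -> 1.0).toMap

for (superstep <- 1 to 20) {
  // "send" phase: contributions addressed to each destination page
  val contribs = links.toSeq.flatMap { case (page, outs) =>
    outs.map(dest => dest -> ranks(page) / outs.size)
  }
  // barrier, then "receive" phase: sum incoming contributions per page
  // (every page in this toy graph has an inbound link, so no key vanishes)
  ranks = contribs.groupBy(_._1).map { case (page, cs) =>
    page -> (0.15 + 0.85 * cs.map(_._2).sum)
  }
}

println(ranks)  // ranks have stabilized after enough supersteps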

The Output of Batch Processing

What do you do with batch job results?

Building Search Indexes

The original MapReduce use case at Google: crawl the web, build indexes, write to servers that handle search queries.

Building Machine Learning Models

Train on historical data, produce a model file that can be loaded for inference.

Key-Value Stores for Lookup

Build a read-only database from batch output. Applications can then query it.

Batch Output to Key-Value Store

PLAINTEXT
1. MapReduce job analyzes user behavior
2. Output: user_id → recommended_products
3. Build read-only key-value database from output
4. Web servers query database for recommendations
5. Next day: new batch job produces new database
6. Atomically switch to new database
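
Step 6's atomic switch is often implemented as a symlink swap: build the new database under a dated directory, then repoint a `current` link in one rename. A minimal JVM sketch (all paths are hypothetical; on Linux the move maps to rename(2), which replaces the target atomically):

SCALA
import java.nio.file.{Files, Paths, StandardCopyOption}

// Readers always open /data/recs/current; the swap is a single rename,
// so they see either the old database or the new one, never a mix.
val newDb = Paths.get("/data/recs/2024-06-02")  // hypothetical build output
val tmp   = Paths.get("/data/recs/current.tmp")
val live  = Paths.get("/data/recs/current")

Files.deleteIfExists(tmp)
Files.createSymbolicLink(tmp, newDb)                   // stage the new pointer
Files.move(tmp, live, StandardCopyOption.ATOMIC_MOVE)  // atomic swap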

Philosophy: Data as Immutable Input

A key principle: batch jobs read input and produce new output. They don't modify input.

Benefits:

  • Easy to recover from bugs (rerun with fixed code)
  • Easy to experiment (run two versions, compare output)
  • Easy to audit (all data versions preserved)

Warning

This is in contrast to OLTP databases where data is mutated in place. The batch approach trades storage cost for operational simplicity.


Comparing Batch Processing Approaches

Aspect                 Unix              MapReduce        Dataflow (Spark)
Scale                  Single machine    Cluster          Cluster
Intermediate storage   Pipes (memory)    HDFS (disk)      Memory + disk
Fault tolerance        Re-run pipeline   Re-run task      Lineage-based recompute
Expressiveness         Limited           Two-phase only   Arbitrary DAG
Latency                Low               High             Medium

Summary

Batch processing has a different set of priorities than online systems:

Online (OLTP)        Batch
Latency critical     Throughput critical
Small requests       Large datasets
Read/write           Read-only input, append-only output
Always available     Can run when convenient

Key concepts:

  • MapReduce applies Unix philosophy at scale: simple operations, immutable data, composition
  • Joins in batch: sort-merge, broadcast, partitioned hash
  • Dataflow engines (Spark, Flink) generalize MapReduce with DAGs and better performance
  • Graph processing (Pregel, BSP) handles iterative algorithms
  • Immutability is a core principle: batch jobs read input and produce new output

Note

Batch processing is the foundation of data infrastructure: building indexes, training models, generating reports. Understanding it helps you design data systems that scale.

The next chapter explores stream processing—similar concepts but for data that arrives continuously rather than in bounded batches.