Partitioning
Replication gives you copies of the same data on multiple nodes. Partitioning (also called sharding) splits the data itself, so different nodes hold different subsets.
Why partition?
- Scalability: A single machine has limits. Partitioning spreads both data storage and query load across many machines.
- Performance: Queries that touch only one partition can be served by one node.
Note
Partitioning is usually combined with replication. Each partition is replicated across multiple nodes for fault tolerance. A node may store several partitions.
Partitioning Strategies
The fundamental question: which records go to which partitions?
Partitioning by Key Range
Assign a continuous range of keys to each partition, like volumes of an encyclopedia (A-D, E-H, etc.).
Key Range Partitioning
Partition 1: keys A-G
Partition 2: keys H-N
Partition 3: keys O-Z
"Alice" → Partition 1
"Zoe" → Partition 3Advantages:
- Keys are sorted within each partition
- Range queries are efficient (scan one partition)
Disadvantages:
- Hot spots if access is skewed (e.g., timestamp keys send all of today's writes to one partition)
- Manual or dynamic boundary management
Bigtable, HBase, and early MongoDB used key-range partitioning.
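The range lookup itself is just a search over sorted boundary keys. A minimal Python sketch, with hypothetical boundaries mirroring the A-G / H-N / O-Z example (partitions numbered from 0):

```python
import bisect

# Hypothetical boundary keys: partition 0 holds keys < "H", partition 1 holds "H"-"N",
# partition 2 holds "O" and above (mirrors the A-G / H-N / O-Z example, 0-indexed).
boundaries = ["H", "O"]

def partition_for(key):
    """Binary-search the sorted boundaries to find the partition owning `key`."""
    return bisect.bisect_right(boundaries, key)

def partitions_for_range(start, end):
    """A range scan only touches partitions whose ranges overlap [start, end]."""
    return list(range(partition_for(start), partition_for(end) + 1))

print(partition_for("Alice"))            # 0
print(partition_for("Zoe"))              # 2
print(partitions_for_range("B", "D"))    # [0] -- one partition serves the whole scan
```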
Partitioning by Hash of Key
Apply a hash function to the key; partition based on hash value ranges.
hash("Alice") = 0x3f7a → Partition 2
hash("Bob") = 0xc123 → Partition 5Advantages:
- Distributes data more evenly
- Less prone to hot spots
- No need to manage boundaries
Disadvantages:
- Loses sorting—range queries must touch all partitions
- Adjacent keys are scattered
Cassandra, MongoDB (in later versions), and Voldemort use hash partitioning.
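A minimal sketch of hash-range assignment, assuming a fixed number of partitions and MD5 as a stable hash (Python's built-in hash() is randomized per process, so it is not suitable here):

```python
import hashlib

NUM_PARTITIONS = 8
RANGE_SIZE = 2**32 // NUM_PARTITIONS   # split the 32-bit hash space into equal ranges

def stable_hash(key):
    # Use a stable hash so every node computes the same partition for a given key.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def partition_for(key):
    """Assign the key to the partition that owns its slice of the hash space."""
    return stable_hash(key) // RANGE_SIZE

# Adjacent keys usually land in different partitions, which is why range queries suffer.
print(partition_for("Alice"), partition_for("Alicf"), partition_for("Alicg"))
```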
Consistent Hashing
A variant where the hash ring accommodates partition changes with minimal data movement. Each partition is responsible for a section of the ring.
Warning
The term "consistent hashing" is used inconsistently. In databases, it usually just means hash-based partitioning. The original algorithm (for caching) is more specific.
Handling Skewed Workloads
Even good partitioning strategies can create hot spots:
- A celebrity's page gets millions of views
- A sensor sends data every second while others are silent
- Black Friday sales spike on certain products
Solutions:
- Append a random number to hot keys: Spread writes for key X across X_1, X_2, ... X_100. Reads must query all 100 sub-keys and merge the results (see the sketch after the example below).
- Application-level awareness: Treat known hot keys specially (cache them, route them differently).
- Time-based sub-partitioning: For time-series data, partition by sensor_id + time_bucket instead of sensor_id alone.
Splitting a Hot Key
Original: All writes for "celebrity_post_123" hit one partition
With suffix:
"celebrity_post_123_00" → Partition A
"celebrity_post_123_01" → Partition B
...
"celebrity_post_123_99" → Partition X
Reads: Fetch from all 100 suffixes and merge
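A minimal sketch of the suffix technique, using an in-memory dict as a stand-in for the partitioned store; the suffix count and key names are illustrative only:

```python
import random
from collections import defaultdict

NUM_SUFFIXES = 100                 # more suffixes spread writes further but make reads costlier
store = defaultdict(list)          # stand-in for a partitioned key-value store

def write_hot_key(key, value):
    """Spread writes across NUM_SUFFIXES sub-keys so they hash to different partitions."""
    suffix = random.randrange(NUM_SUFFIXES)
    store[f"{key}_{suffix:02d}"].append(value)

def read_hot_key(key):
    """Reads must fan out to every suffix and merge the results."""
    merged = []
    for suffix in range(NUM_SUFFIXES):
        merged.extend(store[f"{key}_{suffix:02d}"])
    return merged

for i in range(1000):
    write_hot_key("celebrity_post_123", f"comment {i}")
print(len(read_hot_key("celebrity_post_123")))   # 1000, gathered from 100 sub-keys
```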
Secondary Indexes
Primary-key lookups are straightforward—hash the key, find the partition. But what about queries like:
SELECT * FROM users WHERE country = 'France';
SELECT * FROM products WHERE color = 'red' AND price < 100;

Secondary indexes don't map neatly to partitions. Two approaches exist:
Local Indexes (Document-Partitioned)
Each partition maintains its own secondary index, covering only documents in that partition.
Local Secondary Index
Partition 0: {cars: [red_1, blue_3], motorcycles: [red_5]}
Partition 1: {cars: [silver_8], motorcycles: [red_2, black_4]}
Query "color = red":
→ Must query ALL partitions (scatter/gather)
→ Partition 0 returns: red_1, red_5
→ Partition 1 returns: red_2
→ Merge results

Advantage: Writes are local to one partition
Disadvantage: Reads by secondary index must query all partitions (expensive)
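A minimal sketch of the scatter/gather read path for a document-partitioned index, using in-memory dicts that mirror the example above:

```python
# Each partition indexes only its own documents, mirroring the example above.
partitions = [
    {"docs": {1: "red", 3: "blue", 5: "red"},
     "index": {"red": [1, 5], "blue": [3]}},
    {"docs": {2: "red", 4: "black", 8: "silver"},
     "index": {"red": [2], "black": [4], "silver": [8]}},
]

def query_by_color(color):
    """Scatter the query to every partition, then gather and merge the matches."""
    matches = []
    for p in partitions:                     # all partitions are consulted, even empty ones
        matches.extend(p["index"].get(color, []))
    return sorted(matches)

print(query_by_color("red"))   # [1, 2, 5]
```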
Global Indexes (Term-Partitioned)
Build a global index across all data, but partition the index itself.
Global Secondary Index
Index Partition A (colors a-m):
blue → [doc_3 in P0, doc_7 in P2]
black → [doc_4 in P1]
Index Partition B (colors n-z):
red → [doc_1 in P0, doc_5 in P0, doc_2 in P1]
silver → [doc_8 in P1]
Query "color = red":
→ Go to Index Partition B
→ Get doc_1, doc_5 from P0; doc_2 from P1
→ Only 2 data partition reads (not all partitions)

Advantage: Reads only touch relevant partitions
Disadvantage: Writes may need to update multiple index partitions (cross-partition transaction)
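A minimal sketch of the term-partitioned read path, again with in-memory dicts mirroring the example; the a-m / n-z split is the assumed term-range boundary:

```python
# The index itself is split by term: colors a-m in index partition 0, n-z in partition 1.
# Each posting is a (data_partition, doc_id) pair, mirroring the example above.
index_partitions = [
    {"blue": [(0, 3), (2, 7)], "black": [(1, 4)]},          # terms a-m
    {"red": [(0, 1), (0, 5), (1, 2)], "silver": [(1, 8)]},  # terms n-z
]

def index_partition_for(term):
    return 0 if term[0] <= "m" else 1

def query_by_color(color):
    """One index partition is consulted, then only the data partitions it points at."""
    postings = index_partitions[index_partition_for(color)].get(color, [])
    return postings, {data_partition for data_partition, _ in postings}

postings, touched = query_by_color("red")
print(postings)   # [(0, 1), (0, 5), (1, 2)]
print(touched)    # {0, 1} -- only the data partitions that actually hold matches
```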
Note
Global secondary indexes are often updated asynchronously, meaning reads may not see recent writes immediately.
Rebalancing Partitions
Over time, you need to move data between partitions:
- Query throughput increases → Add more nodes
- Dataset size increases → Add more storage
- A node fails → Move its partitions elsewhere
Anti-Patterns
Don't use hash(key) mod N (where N is the number of nodes): when N changes, almost every key maps to a different node, so nearly all data must move. Extremely expensive.
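A quick way to see the problem, using MD5 as a stable stand-in hash and hypothetical user keys:

```python
import hashlib

def stable_hash(key):
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = [f"user_{i}" for i in range(10_000)]

# Growing the cluster from 10 to 11 nodes under hash(key) mod N:
before = {k: stable_hash(k) % 10 for k in keys}
after = {k: stable_hash(k) % 11 for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")   # typically around 90%
```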
Fixed Number of Partitions
Create many more partitions than nodes upfront (e.g., 1000 partitions for 10 nodes). Each node handles many partitions.
To add a node: steal some partitions from every existing node.
Fixed Partitions Rebalancing
Before (10 nodes, 1000 partitions): Each node has 100 partitions
Add 11th node:
- Node 11 takes ~90 partitions from other nodes
- Each original node gives up ~9 partitions
After: Each of 11 nodes has ~90 partitions

Advantages: Simple, predictable data movement
Disadvantages: Must choose partition count upfront; too few limits scaling, too many adds overhead
Used by Riak, Elasticsearch, Couchbase, Voldemort.
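A sketch of the rebalancing move itself, assuming we track a plain partition → node assignment and let the new node steal whole partitions until loads are even:

```python
from collections import defaultdict

NUM_PARTITIONS = 1000   # chosen up front, far larger than the node count

def add_node(assignment, new_node):
    """The new node steals whole partitions from existing nodes until loads are even;
    every other partition stays where it is."""
    assignment = dict(assignment)
    per_node = defaultdict(list)
    for partition, node in assignment.items():
        per_node[node].append(partition)
    target = NUM_PARTITIONS // (len(per_node) + 1)   # fair share once the new node joins
    stolen = 0
    for owned in per_node.values():
        while len(owned) > target and stolen < target:
            assignment[owned.pop()] = new_node
            stolen += 1
    return assignment

before = {p: f"node{p % 10}" for p in range(NUM_PARTITIONS)}   # 10 nodes, 100 partitions each
after = add_node(before, "node10")
moved = sum(before[p] != after[p] for p in before)
print(f"{moved} of {NUM_PARTITIONS} partitions moved")   # ~90, versus ~900 with mod-N
```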
Dynamic Partitioning
Start with few partitions. When a partition exceeds a size threshold, split it into two. When it shrinks, merge with an adjacent partition.
Advantages: Adapts to data size; works well with key-range partitioning
Disadvantages: Initially all data is in one partition (configurable pre-splitting helps)
Used by HBase, RethinkDB, MongoDB.
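A toy sketch of dynamic splitting for a key-range-partitioned store, with an artificially small split threshold so the behavior is visible (real systems split by size on disk):

```python
SPLIT_THRESHOLD = 4   # artificially small so splits are visible in a short example

# Each partition covers a half-open key range [lo, hi) and stores its keys sorted.
partitions = [{"lo": "", "hi": None, "keys": []}]   # start with one partition for everything

def find_partition(key):
    for p in partitions:
        if key >= p["lo"] and (p["hi"] is None or key < p["hi"]):
            return p
    raise KeyError(key)

def insert(key):
    p = find_partition(key)
    p["keys"] = sorted(p["keys"] + [key])
    if len(p["keys"]) > SPLIT_THRESHOLD:        # grown too big: split at the median key
        mid = len(p["keys"]) // 2
        new = {"lo": p["keys"][mid], "hi": p["hi"], "keys": p["keys"][mid:]}
        p["hi"], p["keys"] = new["lo"], p["keys"][:mid]
        partitions.append(new)

for k in ["ant", "bee", "cat", "dog", "eel", "fox", "gnu"]:
    insert(k)
print([(p["lo"], p["hi"], len(p["keys"])) for p in partitions])
```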
Proportional to Nodes
Keep a fixed number of partitions per node (e.g., 256). Adding a node creates new partitions by splitting random existing ones.
Used by Cassandra and Ketama.
Request Routing
When a client wants to read or write, how does it find the right partition?
Three approaches:
1. Contact Any Node
Node either handles the request or forwards to the correct node.
Client → Node 3 → (forward) → Node 7 → Response → Node 3 → Client

2. Routing Tier
A dedicated routing layer (load balancer) knows the partitioning and directs requests.
Client → Router → Node 7 → Response → Router → Client

3. Client-Side Routing
Client knows the partitioning and contacts the right node directly.
Client → Node 7 → Response → Client

Routing Approaches
┌─────────┐
│ Client │
└────┬────┘
│
┌───────────┼───────────┐
│ │ │
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│Node 1 │ │Router │ │Client │
│(any) │ │Tier │ │aware │
└───────┘ └───────┘ └───────┘

Coordination Service
How does everyone know the current partition assignment? Often a coordination service like ZooKeeper:
- Nodes register with ZooKeeper
- ZooKeeper maintains partition → node mapping
- Routers/clients subscribe to updates
- When partitions move, ZooKeeper notifies subscribers
Note
Some databases (Cassandra, Riak) use a gossip protocol instead—nodes share state with each other, and any node can answer routing questions.
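A minimal sketch of the routing-tier approach, assuming a fixed set of hash partitions whose node assignment is published by a coordination service; the callback stands in for a ZooKeeper-style watch (the gossip alternative is not shown):

```python
import hashlib

def stable_hash(key):
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class Router:
    """Hypothetical routing tier: caches the partition → node mapping published by a
    coordination service and swaps it out when notified of a change."""

    def __init__(self, num_partitions, assignment):
        self.num_partitions = num_partitions
        self.assignment = assignment            # partition id -> node address

    def on_assignment_change(self, new_assignment):
        # The callback a ZooKeeper-style watch would invoke when partitions move.
        self.assignment = new_assignment

    def node_for_key(self, key):
        partition = stable_hash(key) % self.num_partitions   # partition count stays fixed
        return self.assignment[partition]

router = Router(8, {p: f"node{p % 4}" for p in range(8)})
print(router.node_for_key("alice"))
router.on_assignment_change({**router.assignment, 3: "node9"})   # partition 3 moved
print(router.node_for_key("alice"))   # routed to node9 only if "alice" hashes to partition 3
```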
Parallel Query Execution
So far we've focused on simple key-value operations. For complex analytical queries, a massively parallel processing (MPP) query engine breaks queries into stages:
- Parse and plan the query
- Push work to relevant partitions
- Execute in parallel on all partitions
- Gather and merge results
MPP Query Execution
SELECT product_category, SUM(revenue)
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY product_category;

Execution:
- Send query to all partitions
- Each partition filters and groups locally
- Coordinator merges partial results
- Final aggregation and response
Coordinator
│
├──► Partition 1: {Electronics: $10M, Books: $2M}
├──► Partition 2: {Electronics: $8M, Books: $3M}
└──► Partition 3: {Electronics: $12M, Books: $1M}
Merge: {Electronics: $30M, Books: $6M}
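The coordinator's merge in this example is just a per-group sum of the partial aggregates; a minimal sketch:

```python
from collections import Counter

# Partial aggregates computed locally by each partition, as in the example above.
partials = [
    {"Electronics": 10_000_000, "Books": 2_000_000},   # partition 1
    {"Electronics": 8_000_000, "Books": 3_000_000},    # partition 2
    {"Electronics": 12_000_000, "Books": 1_000_000},   # partition 3
]

merged = Counter()
for partial in partials:
    merged.update(partial)      # Counter.update adds the per-group sums together
print(dict(merged))             # {'Electronics': 30000000, 'Books': 6000000}
```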
Data warehouses (Teradata, Redshift, Snowflake, BigQuery) excel at this.

Summary
Partitioning is essential for scaling beyond a single machine:
| Strategy | Key Assignment | Range Queries | Hot Spot Risk |
|---|---|---|---|
| Key range | Continuous ranges | Efficient | Higher |
| Hash | Hash of key | All partitions | Lower |
Secondary indexes add complexity:
| Index Type | Write Cost | Read Cost |
|---|---|---|
| Local (document) | Low (single partition) | High (scatter/gather) |
| Global (term) | High (multiple partitions) | Low (targeted) |
Rebalancing strategies:
| Strategy | Partition Count | Best For |
|---|---|---|
| Fixed | Predetermined | Known scale |
| Dynamic | Adapts | Growing data |
| Per-node | Proportional | Consistent sizing |
Warning
Partitioning introduces complexity: cross-partition queries, distributed transactions, and operational challenges. Only partition when you've outgrown a single machine.
Key takeaways:
- Partition by hash for even distribution; by range for efficient range queries
- Hot spots are still possible; application awareness helps
- Secondary indexes are either local (scatter on read) or global (scatter on write)
- Rebalancing should minimize data movement
- Request routing requires knowing current partition assignments