Designing Data-Intensive Applications

Partitioning

Replication gives you copies of the same data on multiple nodes. Partitioning (also called sharding) splits the data itself, so different nodes hold different subsets.

Why partition?

  • Scalability: A single machine has limits. Partitioning spreads both data storage and query load across many machines.
  • Performance: Queries that touch only one partition can be served by one node.

Note

Partitioning is usually combined with replication. Each partition is replicated across multiple nodes for fault tolerance. A node may store several partitions.


Partitioning Strategies

The fundamental question: which records go to which partitions?

Partitioning by Key Range

Assign a continuous range of keys to each partition, like volumes of an encyclopedia (A-D, E-H, etc.).

Key Range Partitioning

PLAINTEXT
Partition 1: keys A-G
Partition 2: keys H-N
Partition 3: keys O-Z
 
"Alice" → Partition 1
"Zoe"   → Partition 3

Advantages:

  • Keys are sorted within each partition
  • Range queries are efficient (scan one partition)

Disadvantages:

  • Hot spots if access is skewed (e.g., timestamp keys send all of today's writes to the same partition)
  • Manual or dynamic boundary management

Bigtable, HBase, and early MongoDB used key-range partitioning.

Partitioning by Hash of Key

Apply a hash function to the key and assign each partition a range of hash values.

PLAINTEXT
hash("Alice") = 0x3f7a → Partition 2
hash("Bob")   = 0xc123 → Partition 5

Advantages:

  • Distributes data more evenly
  • Less prone to hot spots
  • No need to manage boundaries

Disadvantages:

  • Loses sorting—range queries must touch all partitions
  • Adjacent keys are scattered

Cassandra, MongoDB (in later versions), and Voldemort use hash partitioning.

Consistent Hashing

A variant in which partition boundaries are placed pseudo-randomly around a hash ring, so that adding or removing a partition moves only a small fraction of the keys. Each partition is responsible for a section of the ring.
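
A simplified sketch of a hash ring, with one token per node (production systems typically place many virtual-node tokens per physical node):

PYTHON
import bisect
import hashlib

def ring_position(value: str) -> int:
    """Place a value somewhere on a 2^32-position ring."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")

class HashRing:
    def __init__(self, nodes: list[str]):
        # One token per node for simplicity; each node owns the ring section
        # between the previous token and its own.
        self.tokens = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        pos = ring_position(key)
        idx = bisect.bisect_right(self.tokens, (pos,)) % len(self.tokens)
        return self.tokens[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("Alice"))
# Adding "node-d" would only move the keys that fall into its new ring section;
# everything else stays put.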

Warning

The term "consistent hashing" is used inconsistently. In databases, it usually just means hash-based partitioning. The original algorithm (for caching) is more specific.


Handling Skewed Workloads

Even good partitioning strategies can create hot spots:

  • A celebrity's page gets millions of views
  • A sensor sends data every second while others are silent
  • Black Friday sales spike on certain products

Solutions:

  1. Append a random number to hot keys: Spread writes for key X across X_1, X_2, ... X_100. Reads must query all 100 keys and merge (see the sketch after the example below).

  2. Application-level awareness: Treat known hot keys specially (cache them, route them differently).

  3. Time-based sub-partitioning: For time-series data, partition by sensor_id + time_bucket.

Splitting a Hot Key

PLAINTEXT
Original: All writes for "celebrity_post_123" hit one partition
 
With suffix:
  "celebrity_post_123_00" → Partition A
  "celebrity_post_123_01" → Partition B
  ...
  "celebrity_post_123_99" → Partition X
 
Reads: Fetch from all 100 suffixes and merge
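
A sketch of how an application might implement this, assuming it maintains its own set of known hot keys and a fan-out of 100 sub-keys:

PYTHON
import random

HOT_KEYS = {"celebrity_post_123"}  # maintained by the application (assumption)
FANOUT = 100

def write_key(key: str) -> str:
    """Writes to a known hot key get a random two-digit suffix, spreading the load."""
    if key in HOT_KEYS:
        return f"{key}_{random.randrange(FANOUT):02d}"
    return key

def read_keys(key: str) -> list[str]:
    """Reads of a hot key must fetch every suffixed variant and merge the results."""
    if key in HOT_KEYS:
        return [f"{key}_{i:02d}" for i in range(FANOUT)]
    return [key]

print(write_key("celebrity_post_123"))       # e.g. celebrity_post_123_37
print(len(read_keys("celebrity_post_123")))  # 100 keys to fetch and merge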

Secondary Indexes

Primary-key lookups are straightforward: determine the partition from the key (by range or hash) and send the request there. But what about queries like:

SQL
SELECT * FROM users WHERE country = 'France';
SELECT * FROM products WHERE color = 'red' AND price < 100;

Secondary indexes don't map neatly to partitions. Two approaches exist:

Local Indexes (Document-Partitioned)

Each partition maintains its own secondary index, covering only documents in that partition.

Local Secondary Index

PLAINTEXT
Partition 0: {cars: [red_1, blue_3], motorcycles: [red_5]}
Partition 1: {cars: [silver_8], motorcycles: [red_2, black_4]}
 
Query "color = red":
  → Must query ALL partitions (scatter/gather)
  → Partition 0 returns: red_1, red_5
  → Partition 1 returns: red_2
  → Merge results

Advantage: Writes are local to one partition.

Disadvantage: Reads by secondary index must query all partitions (scatter/gather is expensive).
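
A sketch of the scatter/gather read path, using the toy data layout from the example above:

PYTHON
# Each partition indexes only its own documents (toy layout from the example above).
local_indexes = [
    {"red": ["red_1", "red_5"], "blue": ["blue_3"]},                   # partition 0
    {"red": ["red_2"], "black": ["black_4"], "silver": ["silver_8"]},  # partition 1
]

def query_by_color(color: str) -> list[str]:
    """Scatter the query to every partition, then gather and merge the results."""
    results = []
    for index in local_indexes:  # in a real system these would be parallel RPCs
        results.extend(index.get(color, []))
    return results

print(query_by_color("red"))  # ['red_1', 'red_5', 'red_2'] -- every partition was asked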

Global Indexes (Term-Partitioned)

Build a global index across all data, but partition the index itself.

Global Secondary Index

PLAINTEXT
Index Partition A (colors a-m):
  blue → [doc_3 in P0, doc_7 in P2]
  black → [doc_4 in P1]
 
Index Partition B (colors n-z):
  red → [doc_1 in P0, doc_5 in P0, doc_2 in P1]
  silver → [doc_8 in P1]
 
Query "color = red":
  → Go to Index Partition B
  → Get doc_1, doc_5 from P0; doc_2 from P1
  → Only 2 data partition reads (not all partitions)

Advantage: Reads only touch relevant partitions.

Disadvantage: Writes may need to update multiple index partitions (a cross-partition transaction).
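
A sketch of the write path, using the term ranges from the example above to show why one document write can touch several index partitions:

PYTHON
# The global index is partitioned by term: terms starting a-m live on index
# partition "A", n-z on index partition "B" (layout from the example above).
def index_partition_for(term: str) -> str:
    return "A" if term[0].lower() <= "m" else "B"

def index_updates_for_write(doc_id: str, doc: dict) -> dict:
    """One document write can require updates on several index partitions."""
    updates: dict[str, list] = {}
    for field, value in doc.items():
        entry = (f"{field}:{value}", doc_id)
        updates.setdefault(index_partition_for(str(value)), []).append(entry)
    return updates

print(index_updates_for_write("doc_9", {"color": "red", "make": "audi"}))
# {'B': [('color:red', 'doc_9')], 'A': [('make:audi', 'doc_9')]}
# Two index partitions to update for a single document write.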

Note

Global secondary indexes are often updated asynchronously, meaning reads may not see recent writes immediately.


Rebalancing Partitions

Over time, you need to move data between partitions:

  • Query throughput increases → Add more nodes
  • Dataset size increases → Add more storage
  • A node fails → Move its partitions elsewhere

Anti-Patterns

Don't use hash(key) mod N: When N changes, almost all data moves. Extremely expensive.
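
A quick simulation makes the cost concrete (the hash function and key set are arbitrary):

PYTHON
import hashlib

def stable_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = [f"user_{i}" for i in range(10_000)]
moved = sum(1 for k in keys if stable_hash(k) % 10 != stable_hash(k) % 11)
print(f"{moved / len(keys):.0%} of keys move when N goes from 10 to 11")
# Prints roughly 91% -- almost every record would have to change nodes.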

Fixed Number of Partitions

Create many more partitions than nodes upfront (e.g., 1000 partitions for 10 nodes). Each node handles many partitions.

To add a node: steal some partitions from every existing node.

Fixed Partitions Rebalancing

PLAINTEXT
Before (10 nodes, 1000 partitions): Each node has 100 partitions
 
Add 11th node:
  - Node 11 takes ~90 partitions from other nodes
  - Each original node gives up ~9 partitions
 
After: Each of 11 nodes has ~90 partitions

Advantages: Simple, predictable data movement.

Disadvantages: Must choose the partition count upfront; too few limits scaling, too many adds overhead.

Used by Riak, Elasticsearch, Couchbase, Voldemort.
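
A sketch of the rebalancing step above, assuming 1,000 pre-created partitions and round-robin stealing from the existing nodes:

PYTHON
def rebalance(assignment: dict[int, list[int]], new_node: int) -> None:
    """Add a node by stealing whole partitions, round-robin, from existing nodes."""
    assignment[new_node] = []
    target = sum(len(parts) for parts in assignment.values()) // len(assignment)
    donors = [n for n in assignment if n != new_node]
    i = 0
    while len(assignment[new_node]) < target:
        donor = donors[i % len(donors)]
        assignment[new_node].append(assignment[donor].pop())  # this partition moves
        i += 1

# 1,000 pre-created partitions spread over 10 nodes, 100 each.
assignment = {n: list(range(n * 100, (n + 1) * 100)) for n in range(10)}
rebalance(assignment, new_node=10)
print(sorted(len(parts) for parts in assignment.values()))
# [90, 91, 91, ...] -- only 90 of the 1,000 partitions had to move.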

Dynamic Partitioning

Start with few partitions. When a partition exceeds a size threshold, split it into two. When it shrinks, merge with an adjacent partition.

Advantages: Adapts to data size; works well with key-range partitioning.

Disadvantages: Initially all data is in one partition (configurable pre-splitting helps).

Used by HBase, RethinkDB, MongoDB.
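
A sketch of the split step, with an artificially small size threshold and a split-at-the-median policy (both illustrative):

PYTHON
SPLIT_THRESHOLD = 4  # illustrative; real systems split at sizes like 10 GB

def maybe_split(partition: dict) -> list[dict]:
    """Split a key-range partition at its median key once it grows past the threshold."""
    keys = sorted(partition["data"])
    if len(keys) <= SPLIT_THRESHOLD:
        return [partition]
    mid = keys[len(keys) // 2]
    left = {k: v for k, v in partition["data"].items() if k < mid}
    right = {k: v for k, v in partition["data"].items() if k >= mid}
    return [
        {"range": (partition["range"][0], mid), "data": left},
        {"range": (mid, partition["range"][1]), "data": right},
    ]

p = {"range": ("A", "Z"), "data": {k: None for k in "ABCDEF"}}
for half in maybe_split(p):
    print(half["range"], sorted(half["data"]))
# ('A', 'D') ['A', 'B', 'C']
# ('D', 'Z') ['D', 'E', 'F']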

Proportional to Nodes

Keep a fixed number of partitions per node (e.g., 256). When a node is added, it splits a fixed number of randomly chosen existing partitions and takes over half of each.

Used by Cassandra and Ketama.


Request Routing

When a client wants to read or write, how does it find the right partition?

Three approaches:

1. Contact Any Node

The node either handles the request itself or forwards it to the correct node.

PLAINTEXT
Client → Node 3 → (forward) → Node 7 → Response → Node 3 → Client

2. Routing Tier

A dedicated routing tier (a partition-aware load balancer) knows the partition assignments and directs each request to the right node.

PLAINTEXT
Client → Router → Node 7 → Response → Router → Client

3. Client-Side Routing

Client knows the partitioning and contacts the right node directly.

PLAINTEXT
Client → Node 7 → Response → Client

Routing Approaches

PLAINTEXT
           ┌─────────┐
           │ Client  │
           └────┬────┘

    ┌───────────┼───────────┐
    │           │           │
    ▼           ▼           ▼
┌───────┐  ┌───────┐  ┌───────┐
│Node 1 │  │Router │  │Client │
│(any)  │  │Tier   │  │aware  │
└───────┘  └───────┘  └───────┘

Coordination Service

How does everyone know the current partition assignment? Often a coordination service like ZooKeeper is used:

  1. Nodes register with ZooKeeper
  2. ZooKeeper maintains partition → node mapping
  3. Routers/clients subscribe to updates
  4. When partitions move, ZooKeeper notifies subscribers
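
A sketch of the client side, assuming the client caches a partition → node map that the coordination service refreshes through a hypothetical update_assignment hook:

PYTHON
import hashlib

class PartitionAwareClient:
    """Keeps a local copy of the partition → node map; a coordination service such as
    ZooKeeper would push changes into update_assignment (hypothetical hook)."""

    def __init__(self, assignment: dict[int, str], num_partitions: int):
        self.assignment = assignment     # partition id → node address
        self.num_partitions = num_partitions

    def update_assignment(self, assignment: dict[int, str]) -> None:
        self.assignment = assignment     # invoked when a watch/notification fires

    def node_for(self, key: str) -> str:
        digest = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
        return self.assignment[digest % self.num_partitions]

client = PartitionAwareClient({0: "10.0.0.1", 1: "10.0.0.2", 2: "10.0.0.3"}, num_partitions=3)
print(client.node_for("Alice"))  # the client contacts this node directly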

Note

Some databases (Cassandra, Riak) use a gossip protocol instead—nodes share state with each other, and any node can answer routing questions.


Parallel Query Execution

So far we've focused on simple key-value operations. For complex analytical queries, a massively parallel processing (MPP) query engine breaks queries into stages:

  1. Parse and plan the query
  2. Push work to relevant partitions
  3. Execute in parallel on all partitions
  4. Gather and merge results

MPP Query Execution

SQL
SELECT product_category, SUM(revenue)
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY product_category;

Execution:

  1. Send query to all partitions
  2. Each partition filters and groups locally
  3. Coordinator merges partial results
  4. Final aggregation and response

PLAINTEXT
Coordinator

    ├──► Partition 1: {Electronics: $10M, Books: $2M}
    ├──► Partition 2: {Electronics: $8M, Books: $3M}
    └──► Partition 3: {Electronics: $12M, Books: $1M}
 
    Merge: {Electronics: $30M, Books: $6M}
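
A sketch of the coordinator's merge step, using the partial results from the example above:

PYTHON
from collections import Counter

# Partial aggregates computed locally on each partition (numbers from the example above).
partials = [
    {"Electronics": 10_000_000, "Books": 2_000_000},
    {"Electronics": 8_000_000, "Books": 3_000_000},
    {"Electronics": 12_000_000, "Books": 1_000_000},
]

def merge(partial_results: list[dict]) -> dict:
    """Coordinator step: sum each group's partial totals into the final answer."""
    totals = Counter()
    for partial in partial_results:
        totals.update(partial)
    return dict(totals)

print(merge(partials))  # {'Electronics': 30000000, 'Books': 6000000}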

Data warehouses (Teradata, Redshift, Snowflake, BigQuery) excel at this.


Summary

Partitioning is essential for scaling beyond a single machine:

Strategy   | Key Assignment     | Range Queries   | Hot Spot Risk
Key range  | Continuous ranges  | Efficient       | Higher
Hash       | Hash of key        | All partitions  | Lower

Secondary indexes add complexity:

Index Type        | Write Cost                  | Read Cost
Local (document)  | Low (single partition)      | High (scatter/gather)
Global (term)     | High (multiple partitions)  | Low (targeted)

Rebalancing strategies:

Strategy  | Partition Count  | Best For
Fixed     | Predetermined    | Known scale
Dynamic   | Adapts           | Growing data
Per-node  | Proportional     | Consistent sizing

Warning

Partitioning introduces complexity: cross-partition queries, distributed transactions, and operational challenges. Only partition when you've outgrown a single machine.

Key takeaways:

  • Partition by hash for even distribution; by range for efficient range queries
  • Hot spots are still possible; application awareness helps
  • Secondary indexes are either local (scatter on read) or global (scatter on write)
  • Rebalancing should minimize data movement
  • Request routing requires knowing current partition assignments