Partitioning
Replication gives you copies of the same data on multiple nodes. Partitioning (also called sharding) splits the data itself, so different nodes hold different subsets.
Why partition?
- Scalability: A single machine has limits. Partitioning spreads both data storage and query load across many machines.
- Performance: Queries that touch only one partition can be served by one node.
Note
Partitioning is usually combined with replication. Each partition is replicated across multiple nodes for fault tolerance. A node may store several partitions.
Partitioning Strategies
The fundamental question: which records go to which partitions?
Partitioning by Key Range
Assign a continuous range of keys to each partition, like volumes of an encyclopedia (A-D, E-H, etc.).
Key Range Partitioning
Partition 1: keys A-G
Partition 2: keys H-N
Partition 3: keys O-Z
"Alice" → Partition 1
"Zoe" → Partition 3Advantages:
- Keys are sorted within each partition
- Range queries are efficient (scan one partition)
Disadvantages:
- Hot spots if access is skewed (e.g., timestamp keys send all of today's writes to one partition)
- Manual or dynamic boundary management
Bigtable, HBase, and early MongoDB used key-range partitioning.
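The range lookup itself is just a search over sorted boundary keys. A minimal Python sketch, with hypothetical boundaries mirroring the A-G / H-N / O-Z example (partitions numbered from 0):

```python
import bisect

# Hypothetical boundary keys: partition 0 holds keys < "H", partition 1 holds "H"-"N",
# partition 2 holds "O" and above (mirrors the A-G / H-N / O-Z example, 0-indexed).
boundaries = ["H", "O"]

def partition_for(key):
    """Binary-search the sorted boundaries to find the partition owning `key`."""
    return bisect.bisect_right(boundaries, key)

def partitions_for_range(start, end):
    """A range scan only touches partitions whose ranges overlap [start, end]."""
    return list(range(partition_for(start), partition_for(end) + 1))

print(partition_for("Alice"))            # 0
print(partition_for("Zoe"))              # 2
print(partitions_for_range("B", "D"))    # [0] -- one partition serves the whole scan
```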
Partitioning by Hash of Key
Apply a hash function to the key; partition based on hash value ranges.
hash("Alice") = 0x3f7a → Partition 2
hash("Bob") = 0xc123 → Partition 5Advantages:
- Distributes data more evenly
- Less prone to hot spots
- No need to manage boundaries
Disadvantages:
- Loses sorting—range queries must touch all partitions
- Adjacent keys are scattered
Cassandra, MongoDB (in later versions), and Voldemort use hash partitioning.
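A minimal sketch of hash-range assignment, assuming a fixed number of partitions and MD5 as a stable hash (Python's built-in hash() is randomized per process, so it is not suitable here):

```python
import hashlib

NUM_PARTITIONS = 8
RANGE_SIZE = 2**32 // NUM_PARTITIONS   # split the 32-bit hash space into equal ranges

def stable_hash(key):
    # Use a stable hash so every node computes the same partition for a given key.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def partition_for(key):
    """Assign the key to the partition that owns its slice of the hash space."""
    return stable_hash(key) // RANGE_SIZE

# Adjacent keys usually land in different partitions, which is why range queries suffer.
print(partition_for("Alice"), partition_for("Alicf"), partition_for("Alicg"))
```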
Consistent Hashing
A variant where the hash ring accommodates partition changes with minimal data movement. Each partition is responsible for a section of the ring.
Warning
The term "consistent hashing" is used inconsistently. In databases, it usually just means hash-based partitioning. The original algorithm (for caching) is more specific.
Handling Skewed Workloads
Even good partitioning strategies can create hot spots:
- A celebrity's page gets millions of views
- A sensor sends data every second while others are silent
- Black Friday sales spike on certain products
Solutions:
- Append a random number to hot keys: Spread writes for key X across X_1, X_2, ... X_100. Reads must query all 100 sub-keys and merge the results (see the sketch after the example below).
- Application-level awareness: Treat known hot keys specially (cache them, route them differently).
- Time-based sub-partitioning: For time-series data, partition by sensor_id + time_bucket instead of sensor_id alone.
Splitting a Hot Key
Original: All writes for "celebrity_post_123" hit one partition
With suffix:
"celebrity_post_123_00" → Partition A
"celebrity_post_123_01" → Partition B
...
"celebrity_post_123_99" → Partition X
Reads: Fetch from all 100 suffixes and merge
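A minimal sketch of the suffix technique, using an in-memory dict as a stand-in for the partitioned store; the suffix count and key names are illustrative only:

```python
import random
from collections import defaultdict

NUM_SUFFIXES = 100                 # more suffixes spread writes further but make reads costlier
store = defaultdict(list)          # stand-in for a partitioned key-value store

def write_hot_key(key, value):
    """Spread writes across NUM_SUFFIXES sub-keys so they hash to different partitions."""
    suffix = random.randrange(NUM_SUFFIXES)
    store[f"{key}_{suffix:02d}"].append(value)

def read_hot_key(key):
    """Reads must fan out to every suffix and merge the results."""
    merged = []
    for suffix in range(NUM_SUFFIXES):
        merged.extend(store[f"{key}_{suffix:02d}"])
    return merged

for i in range(1000):
    write_hot_key("celebrity_post_123", f"comment {i}")
print(len(read_hot_key("celebrity_post_123")))   # 1000, gathered from 100 sub-keys
```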
Secondary Indexes
Primary-key lookups are straightforward—hash the key, find the partition. But what about queries like:
SELECT * FROM users WHERE country = 'France';
SELECT * FROM products WHERE color = 'red' AND price < 100;

Secondary indexes don't map neatly to partitions. Two approaches exist:
Local Indexes (Document-Partitioned)
Each partition maintains its own secondary index, covering only documents in that partition.
Local Secondary Index
Partition 0: {cars: [red_1, blue_3], motorcycles: [red_5]}
Partition 1: {cars: [silver_8], motorcycles: [red_2, black_4]}
Query "color = red":
→ Must query ALL partitions (scatter/gather)
→ Partition 0 returns: red_1, red_5
→ Partition 1 returns: red_2
→ Merge results

Advantage: Writes are local to one partition
Disadvantage: Reads by secondary index must query all partitions (expensive)
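A minimal sketch of the scatter/gather read path for a document-partitioned index, using in-memory dicts that mirror the example above:

```python
# Each partition indexes only its own documents, mirroring the example above.
partitions = [
    {"docs": {1: "red", 3: "blue", 5: "red"},
     "index": {"red": [1, 5], "blue": [3]}},
    {"docs": {2: "red", 4: "black", 8: "silver"},
     "index": {"red": [2], "black": [4], "silver": [8]}},
]

def query_by_color(color):
    """Scatter the query to every partition, then gather and merge the matches."""
    matches = []
    for p in partitions:                     # all partitions are consulted, even empty ones
        matches.extend(p["index"].get(color, []))
    return sorted(matches)

print(query_by_color("red"))   # [1, 2, 5]
```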
Global Indexes (Term-Partitioned)
Build a global index across all data, but partition the index itself.
Global Secondary Index
Index Partition A (colors a-m):
blue → [doc_3 in P0, doc_7 in P2]
black → [doc_4 in P1]
Index Partition B (colors n-z):
red → [doc_1 in P0, doc_5 in P0, doc_2 in P1]
silver → [doc_8 in P1]
Query "color = red":
→ Go to Index Partition B
→ Get doc_1, doc_5 from P0; doc_2 from P1
→ Only 2 data partition reads (not all partitions)

Advantage: Reads only touch relevant partitions
Disadvantage: Writes may need to update multiple index partitions (cross-partition transaction)
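A minimal sketch of the term-partitioned read path, again with in-memory dicts mirroring the example; the a-m / n-z split is the assumed term-range boundary:

```python
# The index itself is split by term: colors a-m in index partition 0, n-z in partition 1.
# Each posting is a (data_partition, doc_id) pair, mirroring the example above.
index_partitions = [
    {"blue": [(0, 3), (2, 7)], "black": [(1, 4)]},          # terms a-m
    {"red": [(0, 1), (0, 5), (1, 2)], "silver": [(1, 8)]},  # terms n-z
]

def index_partition_for(term):
    return 0 if term[0] <= "m" else 1

def query_by_color(color):
    """One index partition is consulted, then only the data partitions it points at."""
    postings = index_partitions[index_partition_for(color)].get(color, [])
    return postings, {data_partition for data_partition, _ in postings}

postings, touched = query_by_color("red")
print(postings)   # [(0, 1), (0, 5), (1, 2)]
print(touched)    # {0, 1} -- only the data partitions that actually hold matches
```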
Note
Global secondary indexes are often updated asynchronously, meaning reads may not see recent writes immediately.
Rebalancing Partitions
Over time, you need to move data between partitions:
- Query throughput increases → Add more nodes
- Dataset size increases → Add more storage
- A node fails → Move its partitions elsewhere
Anti-Patterns
Don't use hash(key) mod N (where N is the number of nodes): when N changes, almost every key maps to a different node, so nearly all data must move. Extremely expensive.
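A quick way to see the problem, using MD5 as a stable stand-in hash and hypothetical user keys:

```python
import hashlib

def stable_hash(key):
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = [f"user_{i}" for i in range(10_000)]

# Growing the cluster from 10 to 11 nodes under hash(key) mod N:
before = {k: stable_hash(k) % 10 for k in keys}
after = {k: stable_hash(k) % 11 for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")   # typically around 90%
```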
Fixed Number of Partitions
Create many more partitions than nodes upfront (e.g., 1000 partitions for 10 nodes). Each node handles many partitions.
To add a node: steal some partitions from every existing node.
Fixed Partitions Rebalancing
Before (10 nodes, 1000 partitions): Each node has 100 partitions
Add 11th node:
- Node 11 takes ~90 partitions from other nodes
- Each original node gives up ~9 partitions
After: Each of 11 nodes has ~90 partitions

Advantages: Simple, predictable data movement
Disadvantages: Must choose partition count upfront; too few limits scaling, too many adds overhead
Used by Riak, Elasticsearch, Couchbase, Voldemort.
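A sketch of the rebalancing move itself, assuming we track a plain partition → node assignment and let the new node steal whole partitions until loads are even:

```python
from collections import defaultdict

NUM_PARTITIONS = 1000   # chosen up front, far larger than the node count

def add_node(assignment, new_node):
    """The new node steals whole partitions from existing nodes until loads are even;
    every other partition stays where it is."""
    assignment = dict(assignment)
    per_node = defaultdict(list)
    for partition, node in assignment.items():
        per_node[node].append(partition)
    target = NUM_PARTITIONS // (len(per_node) + 1)   # fair share once the new node joins
    stolen = 0
    for owned in per_node.values():
        while len(owned) > target and stolen < target:
            assignment[owned.pop()] = new_node
            stolen += 1
    return assignment

before = {p: f"node{p % 10}" for p in range(NUM_PARTITIONS)}   # 10 nodes, 100 partitions each
after = add_node(before, "node10")
moved = sum(before[p] != after[p] for p in before)
print(f"{moved} of {NUM_PARTITIONS} partitions moved")   # ~90, versus ~900 with mod-N
```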
Dynamic Partitioning
Start with few partitions. When a partition exceeds a size threshold, split it into two. When it shrinks, merge with an adjacent partition.
Advantages: Adapts to data size; works well with key-range partitioning
Disadvantages: Initially all data is in one partition (configurable pre-splitting helps)
Used by HBase, RethinkDB, MongoDB.
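A toy sketch of dynamic splitting for a key-range-partitioned store, with an artificially small split threshold so the behavior is visible (real systems split by size on disk):

```python
SPLIT_THRESHOLD = 4   # artificially small so splits are visible in a short example

# Each partition covers a half-open key range [lo, hi) and stores its keys sorted.
partitions = [{"lo": "", "hi": None, "keys": []}]   # start with one partition for everything

def find_partition(key):
    for p in partitions:
        if key >= p["lo"] and (p["hi"] is None or key < p["hi"]):
            return p
    raise KeyError(key)

def insert(key):
    p = find_partition(key)
    p["keys"] = sorted(p["keys"] + [key])
    if len(p["keys"]) > SPLIT_THRESHOLD:        # grown too big: split at the median key
        mid = len(p["keys"]) // 2
        new = {"lo": p["keys"][mid], "hi": p["hi"], "keys": p["keys"][mid:]}
        p["hi"], p["keys"] = new["lo"], p["keys"][:mid]
        partitions.append(new)

for k in ["ant", "bee", "cat", "dog", "eel", "fox", "gnu"]:
    insert(k)
print([(p["lo"], p["hi"], len(p["keys"])) for p in partitions])
```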
Proportional to Nodes
Keep a fixed number of partitions per node (e.g., 256). Adding a node creates new partitions by splitting random existing ones.
Used by Cassandra and Ketama.
Request Routing
When a client wants to read or write, how does it find the right partition?
Three approaches:
1. Contact Any Node
Node either handles the request or forwards to the correct node.
Client → Node 3 → (forward) → Node 7 → Response → Node 3 → Client

2. Routing Tier
A dedicated routing layer (load balancer) knows the partitioning and directs requests.
Client → Router → Node 7 → Response → Router → Client

3. Client-Side Routing
Client knows the partitioning and contacts the right node directly.
Client → Node 7 → Response → Client

Routing Approaches
┌─────────┐
│ Client │
└────┬────┘
│
┌───────────┼───────────┐
│ │ │
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│Node 1 │ │Router │ │Client │
│(any) │ │Tier │ │aware │
└───────┘ └───────┘ └───────┘

Coordination Service
How does everyone know the current partition assignment? Often a coordination service like ZooKeeper:
- Nodes register with ZooKeeper
- ZooKeeper maintains partition → node mapping
- Routers/clients subscribe to updates
- When partitions move, ZooKeeper notifies subscribers
Note
Some databases (Cassandra, Riak) use a gossip protocol instead—nodes share state with each other, and any node can answer routing questions.
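A minimal sketch of the routing-tier approach, assuming a fixed set of hash partitions whose node assignment is published by a coordination service; the callback stands in for a ZooKeeper-style watch (the gossip alternative is not shown):

```python
import hashlib

def stable_hash(key):
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class Router:
    """Hypothetical routing tier: caches the partition → node mapping published by a
    coordination service and swaps it out when notified of a change."""

    def __init__(self, num_partitions, assignment):
        self.num_partitions = num_partitions
        self.assignment = assignment            # partition id -> node address

    def on_assignment_change(self, new_assignment):
        # The callback a ZooKeeper-style watch would invoke when partitions move.
        self.assignment = new_assignment

    def node_for_key(self, key):
        partition = stable_hash(key) % self.num_partitions   # partition count stays fixed
        return self.assignment[partition]

router = Router(8, {p: f"node{p % 4}" for p in range(8)})
print(router.node_for_key("alice"))
router.on_assignment_change({**router.assignment, 3: "node9"})   # partition 3 moved
print(router.node_for_key("alice"))   # routed to node9 only if "alice" hashes to partition 3
```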
Parallel Query Execution
So far we've focused on simple key-value operations. For complex analytical queries, a massively parallel processing (MPP) query engine breaks queries into stages:
- Parse and plan the query
- Push work to relevant partitions
- Execute in parallel on all partitions
- Gather and merge results
MPP Query Execution
SELECT product_category, SUM(revenue)
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY product_category;

Execution:
- Send query to all partitions
- Each partition filters and groups locally
- Coordinator merges partial results
- Final aggregation and response
Coordinator
│
├──► Partition 1: {Electronics: $10M, Books: $2M}
├──► Partition 2: {Electronics: $8M, Books: $3M}
└──► Partition 3: {Electronics: $12M, Books: $1M}
Merge: {Electronics: $30M, Books: $6M}
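The coordinator's merge in this example is just a per-group sum of the partial aggregates; a minimal sketch:

```python
from collections import Counter

# Partial aggregates computed locally by each partition, as in the example above.
partials = [
    {"Electronics": 10_000_000, "Books": 2_000_000},   # partition 1
    {"Electronics": 8_000_000, "Books": 3_000_000},    # partition 2
    {"Electronics": 12_000_000, "Books": 1_000_000},   # partition 3
]

merged = Counter()
for partial in partials:
    merged.update(partial)      # Counter.update adds the per-group sums together
print(dict(merged))             # {'Electronics': 30000000, 'Books': 6000000}
```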
Data warehouses (Teradata, Redshift, Snowflake, BigQuery) excel at this.

Summary
Partitioning is essential for scaling beyond a single machine:
| Strategy | Key Assignment | Range Queries | Hot Spot Risk |
|---|---|---|---|
| Key range | Continuous ranges | Efficient | Higher |
| Hash | Hash of key | All partitions | Lower |
Secondary indexes add complexity:
| Index Type | Write Cost | Read Cost |
|---|---|---|
| Local (document) | Low (single partition) | High (scatter/gather) |
| Global (term) | High (multiple partitions) | Low (targeted) |
Rebalancing strategies:
| Strategy | Partition Count | Best For |
|---|---|---|
| Fixed | Predetermined | Known scale |
| Dynamic | Adapts | Growing data |
| Per-node | Proportional | Consistent sizing |
Warning
Partitioning introduces complexity: cross-partition queries, distributed transactions, and operational challenges. Only partition when you've outgrown a single machine.
Key takeaways:
- Partition by hash for even distribution; by range for efficient range queries
- Hot spots are still possible; application awareness helps
- Secondary indexes are either local (scatter on read) or global (scatter on write)
- Rebalancing should minimize data movement
- Request routing requires knowing current partition assignments