Storage and Retrieval
A database needs to do two fundamental things: store data when you give it some, and return that data when you ask for it later. Understanding how databases accomplish this helps you select the right storage engine and tune it for your workload.
Note
You don't need to implement a storage engine from scratch, but understanding the internals helps you reason about performance and make better architectural decisions.
The Simplest Database
The most basic database is two bash functions:
```bash
#!/bin/bash

# Append a comma-separated key,value pair to the end of the data file.
db_set() {
    echo "$1,$2" >> database
}

# Find every line for the key, strip the key prefix, return the most recent value.
db_get() {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
```

This append-only log has excellent write performance: just append to a file. But reads are O(n), because you must scan the entire file.
To make reads faster, we need a different data structure: an index.
Indexes: The Fundamental Tradeoff
An index is an additional structure derived from primary data. It speeds up reads but slows down writes (because the index must be updated on every write).
Warning
Indexes are a tradeoff: faster reads, slower writes, more storage. Databases don't index everything by default—you choose indexes based on your query patterns.
Hash Indexes
The simplest indexing strategy: keep an in-memory hash map where keys map to byte offsets in the data file.
Hash Index in Action
Data file:
offset 0: {"key": "foo", "value": "bar"}
offset 50: {"key": "baz", "value": "qux"}
offset 100: {"key": "foo", "value": "updated_bar"}

Hash map:
foo → 100 (most recent offset)
baz → 50

Reads: Look up the offset in the hash map, then read from disk at that position.
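As a minimal sketch of this idea (not Bitcask's actual format), the two bash functions from earlier can be rewritten in Python with an in-memory offset index, assuming the data file is opened in binary append-plus-read mode:

```python
import os

index = {}  # key -> byte offset of the most recent record for that key

def db_set(f, key, value):
    f.seek(0, os.SEEK_END)        # records are only ever appended
    index[key] = f.tell()         # remember where this record starts
    f.write(f"{key},{value}\n".encode())

def db_get(f, key):
    if key not in index:
        return None
    f.seek(index[key])            # jump straight to the latest record for this key
    line = f.readline().decode().rstrip("\n")
    return line.split(",", 1)[1]

with open("database", "a+b") as f:
    db_set(f, "foo", "bar")
    db_set(f, "foo", "updated_bar")
    print(db_get(f, "foo"))       # updated_bar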
Bitcask (the storage engine in Riak) works this way. It's simple and blazingly fast for workloads where:
- You have many writes per key (each write updates the value)
- All keys fit in memory
- You don't need range queries
Compaction
Since we only append, the file grows forever. Compaction solves this by:
- Breaking the log into segments
- Periodically merging segments, keeping only the latest value per key (see the sketch after this list)
- Reading from the most recent segments first
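A minimal sketch of the merge step, assuming each segment is simply an in-memory list of (key, value) pairs in write order (a real engine streams segment files from disk):

```python
def compact(segments):
    """Merge log segments, keeping only the most recent value per key.
    `segments` is ordered oldest to newest, and within a segment later
    writes appear later, so each assignment overwrites older values."""
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value
    return list(latest.items())

# Three writes collapse to one entry per key, keeping the newest value.
segments = [[("foo", "bar"), ("baz", "qux")], [("foo", "updated_bar")]]
print(compact(segments))  # [('foo', 'updated_bar'), ('baz', 'qux')]
```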
SSTables and LSM-Trees
Sorted String Tables (SSTables) improve on hash indexes by sorting key-value pairs by key.
Advantages:
- Efficient merging: Like merge-sort, scan through the segment files in parallel and keep the newest value for each key (see the sketch after this list)
- Sparse index: Don't need every key in memory; keep one indexed key every few kilobytes and scan forward from the nearest indexed offset
- Range queries: Keys are sorted, so scanning a range is efficient
- Compression: Can compress blocks of sorted data before writing to disk
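A minimal sketch of the sorted merge, assuming each segment is an in-memory list of (key, value) pairs already sorted by key and segments are passed newest first:

```python
import heapq
from itertools import groupby

def merge_sstables(*segments):
    """Merge sorted segments, mergesort style; for duplicate keys the value
    from the newest (earliest-passed) segment wins."""
    # Tag each pair with the segment's age so entries for the same key sort newest first.
    tagged = ([(key, age, value) for key, value in seg]
              for age, seg in enumerate(segments))
    merged = heapq.merge(*tagged)          # streams the inputs; nothing is loaded whole
    for key, group in groupby(merged, key=lambda entry: entry[0]):
        _, _, value = next(group)          # first entry in the group is the newest
        yield key, value

newer = [("apple", 3), ("cherry", 9)]
older = [("apple", 1), ("banana", 2)]
print(list(merge_sstables(newer, older)))  # [('apple', 3), ('banana', 2), ('cherry', 9)]
```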
LSM-Trees: How to Build SSTables
Writes can arrive in any order, so you can't keep an on-disk file sorted by simply appending to it. Instead:
- Writes: Insert into an in-memory sorted tree (often a red-black tree or AVL tree), called a memtable
- When memtable gets too big: Write it to disk as an SSTable segment
- Reads: Check memtable first, then disk segments (most recent first)
- Background: Merge and compact segments periodically
This is called a Log-Structured Merge-Tree (LSM-Tree); a minimal sketch follows the diagram below.
LSM-Tree Write Path
Write: SET key=123, value="hello"
│
▼
┌───────────────┐
│ Memtable │ ← Sorted tree in memory
│ (red-black) │
└───────────────┘
│
▼ (when full)
┌───────────────┐
│ SSTable #5 │ ← Immutable sorted file on disk
└───────────────┘
┌───────────────┐
│ SSTable #4 │ ← Older segment
└───────────────┘
┌───────────────┐
│ SSTable #3 │ ← Even older
└───────────────┘

LevelDB, RocksDB, Cassandra, HBase, and many others use LSM-trees.
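As a toy sketch of this write and read path (memtable, flush, then checking segments newest first), with a plain dict standing in for the balanced tree and in-memory lists standing in for on-disk SSTables:

```python
import bisect

class TinyLSM:
    """Toy LSM-tree: an in-memory memtable plus a list of immutable sorted segments."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.segments = []            # newest segment first
        self.memtable_limit = memtable_limit

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted, immutable segment (an SSTable).
        segment = sorted(self.memtable.items())
        self.segments.insert(0, segment)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # 1. check the memtable
            return self.memtable[key]
        for segment in self.segments: # 2. then segments, newest first
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None

db = TinyLSM(memtable_limit=2)
db.set("a", 1); db.set("b", 2)    # second write triggers a flush
db.set("a", 3)
print(db.get("a"), db.get("b"))   # 3 2
```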
Bloom Filters
Reading non-existent keys is expensive—you check every SSTable and find nothing. Bloom filters help: a probabilistic data structure that can quickly tell you "definitely not present" or "maybe present."
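A minimal sketch of the idea, assuming a fixed bit array and positions derived from SHA-256 (real engines size the filter from the expected key count and a target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: answers 'definitely not present' or 'maybe present'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity over compactness

    def _positions(self, key):
        # Derive several positions per key by salting the hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was never added; True means it *might* have been.
        return all(self.bits[pos] for pos in self._positions(key))

bloom = BloomFilter()
bloom.add("foo")
print(bloom.might_contain("foo"))      # True
print(bloom.might_contain("missing"))  # almost certainly False
```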
B-Trees
B-trees are the most widely used indexing structure, standard in almost all relational databases. Unlike LSM-trees (which write sequentially), B-trees update fixed-size pages in place.
Structure
- Data is organized in pages (typically 4KB)
- Each page contains keys and references to child pages
- Leaf pages contain the actual values (or pointers to data)
- Pages are kept balanced—all leaves at the same depth
B-Tree Structure
┌────────────────────┐
│ Root Page │
│ 100 | 200 |300 │
└────────────────────┘
/ | \
/ | \
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Page: <100 │ │ Page: 100-200│ │ Page: >200 │
│ 10|40|70|90 │ │110|140|170|190│ │210|250|280 │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
▼ ▼ ▼
(leaf pages with actual data)

Updates
To update a value:
- Find the leaf page containing the key
- Update the value in place
- Write the page back to disk
To add a new key:
- Find the appropriate leaf page
- If room, add the key
- If full, split the page into two and update the parent (a toy sketch of this leaf-page logic follows)
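A toy sketch of that leaf-page logic, with a sorted Python list standing in for one fixed-size page; a real engine would serialize the resulting page(s) and write them back to their fixed positions on disk, and would also update the parent page after a split:

```python
def leaf_upsert(page, key, value, max_entries=4):
    """Update in place if the key exists, insert in sorted order otherwise,
    and split the page into two half-full pages when it overflows.
    `page` is a sorted list of (key, value) pairs standing in for one page."""
    for i, (existing_key, _) in enumerate(page):
        if existing_key == key:
            page[i] = (key, value)          # overwrite the value in place
            return [page]
    pos = next((i for i, (k, _) in enumerate(page) if k > key), len(page))
    page.insert(pos, (key, value))          # insert, keeping the page sorted
    if len(page) <= max_entries:
        return [page]                       # still fits in one page
    mid = len(page) // 2                    # overflow: split into two pages;
    return [page[:mid], page[mid:]]         # the parent must now reference both

page = [(10, "a"), (40, "b"), (70, "c"), (90, "d")]
print(leaf_upsert(page, 55, "e"))
# [[(10, 'a'), (40, 'b')], [(55, 'e'), (70, 'c'), (90, 'd')]]
```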
Write-Ahead Log (WAL)
Since B-trees modify pages in place, crashes during writes can leave the tree corrupted. A write-ahead log solves this:
- Before modifying any page, append the change to a sequential log
- Then make the actual page modifications
- If a crash occurs, replay the log on recovery (a minimal sketch of the pattern follows)
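A minimal sketch of the ordering that matters, assuming a hypothetical apply_change callback that performs the in-place page writes:

```python
import os

def apply_with_wal(wal_path, change: bytes, apply_change) -> None:
    """Write-ahead logging pattern: make the intended change durable in the log
    before touching any page in place."""
    with open(wal_path, "ab") as wal:
        wal.write(len(change).to_bytes(4, "big") + change)  # length-prefixed record
        wal.flush()
        os.fsync(wal.fileno())   # the record must hit disk before the page write
    apply_change(change)         # only now is it safe to modify pages in place
```

On recovery, the engine replays any logged records whose page writes may not have completed.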
LSM-Trees vs B-Trees
| Aspect | LSM-Trees | B-Trees |
|---|---|---|
| Write pattern | Sequential (append-only) | Random (in-place updates) |
| Write throughput | Generally higher | Lower |
| Write amplification | Higher (compaction) | Lower |
| Read performance | May need to check several SSTables | One root-to-leaf traversal |
| Predictability | Compaction can cause latency spikes | More predictable |
| Storage overhead | Lower (compaction removes duplicates) | Fixed page overhead |
Note
LSM-trees often have better write throughput, while B-trees often have more consistent read latency. But these are generalizations—always benchmark for your specific workload.
Other Index Types
Secondary Indexes
Beyond the primary key, you often need to search by other columns. Secondary indexes enable this:
```sql
-- Primary index on id
-- Secondary indexes on email and created_at
SELECT * FROM users WHERE email = 'alice@example.com';
SELECT * FROM users WHERE created_at > '2024-01-01';
```

The main difference from a primary index: secondary index keys are not unique, so multiple rows may have the same value.
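As a toy illustration of that difference, a secondary index can be modeled as a map from each column value to the list of primary keys that carry it (the table and data here are made up):

```python
from collections import defaultdict

# Rows keyed by primary key (id); the "email" column is not unique,
# so the secondary index maps each value to a list of matching primary keys.
rows = {
    1: {"email": "alice@example.com", "region": "EU"},
    2: {"email": "bob@example.com",   "region": "US"},
    3: {"email": "alice@example.com", "region": "US"},
}

email_index = defaultdict(list)
for row_id, row in rows.items():
    email_index[row["email"]].append(row_id)

# A query by email becomes an index lookup followed by row fetches.
matches = [rows[row_id] for row_id in email_index["alice@example.com"]]
print(matches)  # both rows for alice@example.com
```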
Multi-column Indexes
Concatenated indexes combine multiple columns:
```sql
-- Index on (last_name, first_name)
-- Good for:    WHERE last_name = 'Smith' AND first_name = 'John'
-- Also helps:  WHERE last_name = 'Smith'
-- Useless for: WHERE first_name = 'John'
```

Spatial indexes (R-trees) handle multi-dimensional data such as geographical coordinates.
Storing Values in the Index
Where does the actual data live?
- Heap file: Rows stored in a separate file; index contains pointers
- Clustered index: Rows stored directly in the index (faster reads, limited to one index per table)
- Covering index: Index includes some non-key columns to answer queries without touching heap
Transaction Processing vs Analytics
Databases serve different access patterns:
OLTP (Online Transaction Processing)
- Many small transactions
- Random reads/writes by primary key
- Low latency critical
- Latest data important
OLAP (Online Analytical Processing)
- Few large queries scanning many rows
- Aggregate calculations (sum, count, average)
- Query time less critical than throughput
- Historical data often sufficient
OLTP vs OLAP Queries
OLTP:

```sql
SELECT * FROM orders WHERE order_id = 12345;
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 789;
```

OLAP:

```sql
SELECT
  product_category,
  SUM(revenue),
  COUNT(DISTINCT customer_id)
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY product_category;
```

Data Warehouses
Running analytical queries on production OLTP databases slows them down. Data warehouses solve this:
- Extract data from various OLTP systems
- Transform into analysis-friendly schema
- Load into separate warehouse database
This ETL (Extract-Transform-Load) process decouples analytics from operations.
Star and Snowflake Schemas
Data warehouses typically use dimensional modeling:
Fact tables: Large tables with events (sales, clicks, page views)
- Contains foreign keys to dimension tables
- Contains measures (price, quantity, duration)
Dimension tables: Smaller tables with attributes (who, what, where, when, how)
- Products, customers, stores, dates
Star Schema
┌─────────────┐
│ dim_date │
│ date_id │
│ day, month │
│ quarter │
└──────┬──────┘
│
┌─────────────┐ ┌──────┴──────┐ ┌─────────────┐
│ dim_product │◄────│ fact_sales │────►│dim_customer │
│ product_id │ │ date_id │ │ customer_id │
│ name, cat │ │ product_id │ │ name, region│
└─────────────┘ │ customer_id │ └─────────────┘
│ store_id │
│ quantity │ ┌─────────────┐
│ revenue │────►│ dim_store │
└─────────────┘ │ store_id │
│ city, state │
└─────────────┘Column-Oriented Storage
Most OLTP databases store data row by row, which is efficient for transactions that touch entire rows. But analytical queries often need only a few columns from millions of rows.
Column-oriented storage stores each column separately:
Row vs Column Storage
Row storage:
Row 1: [2024-01-15, 101, "Widget", 5, 49.99]
Row 2: [2024-01-15, 102, "Gadget", 2, 99.99]
Row 3: [2024-01-16, 103, "Widget", 3, 49.99]

Column storage:
date: [2024-01-15, 2024-01-15, 2024-01-16, ...]
product: [101, 102, 103, ...]
name: ["Widget", "Gadget", "Widget", ...]
quantity: [5, 2, 3, ...]
price: [49.99, 99.99, 49.99, ...]

A query like SELECT SUM(quantity) WHERE name = 'Widget' only needs to read two of the five columns.
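As a tiny sketch of why that matters, with each column held in its own Python list (on disk each would be its own file or segment):

```python
# Only the columns the query touches are loaded and scanned.
name     = ["Widget", "Gadget", "Widget"]
quantity = [5, 2, 3]

# SELECT SUM(quantity) WHERE name = 'Widget' reads just these two lists;
# date, product, and price are never touched.
total = sum(q for n, q in zip(name, quantity) if n == "Widget")
print(total)  # 8
```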
Column Compression
Columns often have repeated values, enabling efficient compression:
- Bitmap encoding: for columns with few distinct values
- Run-length encoding: for sorted data with many consecutive identical values (sketched below)
- Dictionary encoding: replace strings with smaller integer codes
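A minimal sketch of run-length encoding over a sorted column, assuming the values are already grouped so identical neighbors form long runs:

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive identical values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def run_length_sum(encoded):
    """Aggregate directly on the compressed form, without decompressing."""
    return sum(value * length for value, length in encoded)

quantities = [5, 5, 5, 2, 2, 3]          # sorted/grouped column values
encoded = run_length_encode(quantities)
print(encoded)                            # [(5, 3), (2, 2), (3, 1)]
print(run_length_sum(encoded))            # 22
```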
Compressed columns can be processed directly (vectorized processing), making queries even faster.
Summary
Storage engines optimize for different access patterns:
| Engine Type | Optimized For | Examples |
|---|---|---|
| LSM-trees | Write-heavy workloads | LevelDB, RocksDB, Cassandra |
| B-trees | General-purpose OLTP | PostgreSQL, MySQL, SQLite |
| Column stores | Analytics (OLAP) | Redshift, BigQuery, ClickHouse |
Key concepts:
- Indexes trade write speed for read speed
- LSM-trees use append-only logs and compaction
- B-trees use fixed pages with in-place updates
- Column storage enables efficient analytical queries
Warning
Choosing the right storage engine requires understanding your access patterns. Profile your actual workload before making decisions.