Storage and Retrieval
A database needs to do two fundamental things: store data when you give it some, and return that data when you ask for it later. Understanding how databases accomplish this helps you select the right storage engine and tune it for your workload.
Note
You don't need to implement a storage engine from scratch, but understanding the internals helps you reason about performance and make better architectural decisions.
The Simplest Database
The most basic database is two bash functions:
```bash
#!/bin/bash

# Append a comma-separated key,value pair to the end of the data file.
db_set() {
    echo "$1,$2" >> database
}

# Find every line for the key, strip the key prefix, return the most recent value.
db_get() {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
```

This append-only log has excellent write performance: just append to a file. But reads are O(n), because you must scan the entire file.
To make reads faster, we need a different data structure: an index.
Indexes: The Fundamental Tradeoff
An index is an additional structure derived from primary data. It speeds up reads but slows down writes (because the index must be updated on every write).
Warning
Indexes are a tradeoff: faster reads, slower writes, more storage. Databases don't index everything by default—you choose indexes based on your query patterns.
Hash Indexes
The simplest indexing strategy: keep an in-memory hash map where keys map to byte offsets in the data file.
Hash Index in Action
Data file:
offset 0: {"key": "foo", "value": "bar"}
offset 50: {"key": "baz", "value": "qux"}
offset 100: {"key": "foo", "value": "updated_bar"}

Hash map:
foo → 100 (most recent offset)
baz → 50

Reads: Look up the offset in the hash map, then read from disk at that position.
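As a minimal sketch of this idea (not Bitcask's actual format), the two bash functions from earlier can be rewritten in Python with an in-memory offset index, assuming the data file is opened in binary append-plus-read mode:

```python
import os

index = {}  # key -> byte offset of the most recent record for that key

def db_set(f, key, value):
    f.seek(0, os.SEEK_END)        # records are only ever appended
    index[key] = f.tell()         # remember where this record starts
    f.write(f"{key},{value}\n".encode())

def db_get(f, key):
    if key not in index:
        return None
    f.seek(index[key])            # jump straight to the latest record for this key
    line = f.readline().decode().rstrip("\n")
    return line.split(",", 1)[1]

with open("database", "a+b") as f:
    db_set(f, "foo", "bar")
    db_set(f, "foo", "updated_bar")
    print(db_get(f, "foo"))       # updated_bar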
Bitcask (the storage engine in Riak) works this way. It's simple and blazingly fast for workloads where:
- You have many writes per key (each write updates the value)
- All keys fit in memory
- You don't need range queries
Compaction
Since we only append, the file grows forever. Compaction solves this by:
- Breaking the log into segments
- Periodically merging segments, keeping only the latest value per key (see the sketch after this list)
- Reading from the most recent segments first
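A minimal sketch of the merge step, assuming each segment is simply an in-memory list of (key, value) pairs in write order (a real engine streams segment files from disk):

```python
def compact(segments):
    """Merge log segments, keeping only the most recent value per key.
    `segments` is ordered oldest to newest, and within a segment later
    writes appear later, so each assignment overwrites older values."""
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value
    return list(latest.items())

# Three writes collapse to one entry per key, keeping the newest value.
segments = [[("foo", "bar"), ("baz", "qux")], [("foo", "updated_bar")]]
print(compact(segments))  # [('foo', 'updated_bar'), ('baz', 'qux')]
```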
SSTables and LSM-Trees
Sorted String Tables (SSTables) improve on hash indexes by sorting key-value pairs by key.
Advantages:
- Efficient merging: Like merge-sort, scan through the segment files in parallel and keep the newest value for each key (see the sketch after this list)
- Sparse index: Don't need every key in memory; keep one indexed key every few kilobytes and scan forward from the nearest indexed offset
- Range queries: Keys are sorted, so scanning a range is efficient
- Compression: Can compress blocks of sorted data before writing to disk
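A minimal sketch of the sorted merge, assuming each segment is an in-memory list of (key, value) pairs already sorted by key and segments are passed newest first:

```python
import heapq
from itertools import groupby

def merge_sstables(*segments):
    """Merge sorted segments, mergesort style; for duplicate keys the value
    from the newest (earliest-passed) segment wins."""
    # Tag each pair with the segment's age so entries for the same key sort newest first.
    tagged = ([(key, age, value) for key, value in seg]
              for age, seg in enumerate(segments))
    merged = heapq.merge(*tagged)          # streams the inputs; nothing is loaded whole
    for key, group in groupby(merged, key=lambda entry: entry[0]):
        _, _, value = next(group)          # first entry in the group is the newest
        yield key, value

newer = [("apple", 3), ("cherry", 9)]
older = [("apple", 1), ("banana", 2)]
print(list(merge_sstables(newer, older)))  # [('apple', 3), ('banana', 2), ('cherry', 9)]
```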
LSM-Trees: How to Build SSTables
Writes can arrive in any order, so you can't keep an on-disk file sorted by simply appending to it. Instead:
- Writes: Insert into an in-memory sorted tree (often a red-black tree or AVL tree), called a memtable
- When memtable gets too big: Write it to disk as an SSTable segment
- Reads: Check memtable first, then disk segments (most recent first)
- Background: Merge and compact segments periodically
This is called a Log-Structured Merge-Tree (LSM-Tree); a minimal sketch follows the diagram below.
LSM-Tree Write Path
Write: SET key=123, value="hello"
│
▼
┌───────────────┐
│ Memtable │ ← Sorted tree in memory
│ (red-black) │
└───────────────┘
│
▼ (when full)
┌───────────────┐
│ SSTable #5 │ ← Immutable sorted file on disk
└───────────────┘
┌───────────────┐
│ SSTable #4 │ ← Older segment
└───────────────┘
┌───────────────┐
│ SSTable #3 │ ← Even older
└───────────────┘

LevelDB, RocksDB, Cassandra, HBase, and many others use LSM-trees.
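As a toy sketch of this write and read path (memtable, flush, then checking segments newest first), with a plain dict standing in for the balanced tree and in-memory lists standing in for on-disk SSTables:

```python
import bisect

class TinyLSM:
    """Toy LSM-tree: an in-memory memtable plus a list of immutable sorted segments."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.segments = []            # newest segment first
        self.memtable_limit = memtable_limit

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted, immutable segment (an SSTable).
        segment = sorted(self.memtable.items())
        self.segments.insert(0, segment)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # 1. check the memtable
            return self.memtable[key]
        for segment in self.segments: # 2. then segments, newest first
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None

db = TinyLSM(memtable_limit=2)
db.set("a", 1); db.set("b", 2)    # second write triggers a flush
db.set("a", 3)
print(db.get("a"), db.get("b"))   # 3 2
```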
Bloom Filters
Reading non-existent keys is expensive—you check every SSTable and find nothing. Bloom filters help: a probabilistic data structure that can quickly tell you "definitely not present" or "maybe present."
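A minimal sketch of the idea, assuming a fixed bit array and positions derived from SHA-256 (real engines size the filter from the expected key count and a target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: answers 'definitely not present' or 'maybe present'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity over compactness

    def _positions(self, key):
        # Derive several positions per key by salting the hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was never added; True means it *might* have been.
        return all(self.bits[pos] for pos in self._positions(key))

bloom = BloomFilter()
bloom.add("foo")
print(bloom.might_contain("foo"))      # True
print(bloom.might_contain("missing"))  # almost certainly False
```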
B-Trees
B-trees are the most widely used indexing structure, standard in almost all relational databases. Unlike LSM-trees (which write sequentially), B-trees update fixed-size pages in place.
Structure
- Data is organized in pages (typically 4KB)
- Each page contains keys and references to child pages
- Leaf pages contain the actual values (or pointers to data)
- Pages are kept balanced—all leaves at the same depth
B-Tree Structure
┌────────────────────┐
│ Root Page │
│ 100 | 200 |300 │
└────────────────────┘
/ | \
/ | \
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Page: <100 │ │ Page: 100-200│ │ Page: >200 │
│ 10|40|70|90 │ │110|140|170|190│ │210|250|280 │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
▼ ▼ ▼
(leaf pages with actual data)

Updates
To update a value:
- Find the leaf page containing the key
- Update the value in place
- Write the page back to disk
To add a new key:
- Find the appropriate leaf page
- If room, add the key
- If full, split the page into two and update the parent (a toy sketch of this leaf-page logic follows)
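A toy sketch of that leaf-page logic, with a sorted Python list standing in for one fixed-size page; a real engine would serialize the resulting page(s) and write them back to their fixed positions on disk, and would also update the parent page after a split:

```python
def leaf_upsert(page, key, value, max_entries=4):
    """Update in place if the key exists, insert in sorted order otherwise,
    and split the page into two half-full pages when it overflows.
    `page` is a sorted list of (key, value) pairs standing in for one page."""
    for i, (existing_key, _) in enumerate(page):
        if existing_key == key:
            page[i] = (key, value)          # overwrite the value in place
            return [page]
    pos = next((i for i, (k, _) in enumerate(page) if k > key), len(page))
    page.insert(pos, (key, value))          # insert, keeping the page sorted
    if len(page) <= max_entries:
        return [page]                       # still fits in one page
    mid = len(page) // 2                    # overflow: split into two pages;
    return [page[:mid], page[mid:]]         # the parent must now reference both

page = [(10, "a"), (40, "b"), (70, "c"), (90, "d")]
print(leaf_upsert(page, 55, "e"))
# [[(10, 'a'), (40, 'b')], [(55, 'e'), (70, 'c'), (90, 'd')]]
```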
Write-Ahead Log (WAL)
Since B-trees modify pages in place, crashes during writes can leave the tree corrupted. A write-ahead log solves this:
- Before modifying any page, append the change to a sequential log
- Then make the actual page modifications
- If a crash occurs, replay the log on recovery (a minimal sketch of the pattern follows)
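A minimal sketch of the ordering that matters, assuming a hypothetical apply_change callback that performs the in-place page writes:

```python
import os

def apply_with_wal(wal_path, change: bytes, apply_change) -> None:
    """Write-ahead logging pattern: make the intended change durable in the log
    before touching any page in place."""
    with open(wal_path, "ab") as wal:
        wal.write(len(change).to_bytes(4, "big") + change)  # length-prefixed record
        wal.flush()
        os.fsync(wal.fileno())   # the record must hit disk before the page write
    apply_change(change)         # only now is it safe to modify pages in place
```

On recovery, the engine replays any logged records whose page writes may not have completed.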
LSM-Trees vs B-Trees
| Aspect | LSM-Trees | B-Trees |
|---|---|---|
| Write pattern | Sequential (append-only) | Random (in-place updates) |
| Write throughput | Generally higher | Lower |
| Write amplification | Higher (compaction) | Lower |
| Read performance | May need to check several SSTables | One root-to-leaf traversal |
| Predictability | Compaction can cause latency spikes | More predictable |
| Storage overhead | Lower (compaction removes duplicates) | Fixed page overhead |
Note
LSM-trees often have better write throughput, while B-trees often have more consistent read latency. But these are generalizations—always benchmark for your specific workload.
Other Index Types
Secondary Indexes
Beyond the primary key, you often need to search by other columns. Secondary indexes enable this:
```sql
-- Primary index on id
-- Secondary indexes on email and created_at
SELECT * FROM users WHERE email = 'alice@example.com';
SELECT * FROM users WHERE created_at > '2024-01-01';
```

The main difference from a primary index: secondary index keys are not unique, so multiple rows may have the same value.
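As a toy illustration of that difference, a secondary index can be modeled as a map from each column value to the list of primary keys that carry it (the table and data here are made up):

```python
from collections import defaultdict

# Rows keyed by primary key (id); the "email" column is not unique,
# so the secondary index maps each value to a list of matching primary keys.
rows = {
    1: {"email": "alice@example.com", "region": "EU"},
    2: {"email": "bob@example.com",   "region": "US"},
    3: {"email": "alice@example.com", "region": "US"},
}

email_index = defaultdict(list)
for row_id, row in rows.items():
    email_index[row["email"]].append(row_id)

# A query by email becomes an index lookup followed by row fetches.
matches = [rows[row_id] for row_id in email_index["alice@example.com"]]
print(matches)  # both rows for alice@example.com
```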
Multi-column Indexes
Concatenated indexes combine multiple columns:
```sql
-- Index on (last_name, first_name)
-- Good for:    WHERE last_name = 'Smith' AND first_name = 'John'
-- Also helps:  WHERE last_name = 'Smith'
-- Useless for: WHERE first_name = 'John'
```

Spatial indexes (R-trees) handle multi-dimensional data such as geographical coordinates.
Storing Values in the Index
Where does the actual data live?
- Heap file: Rows stored in a separate file; index contains pointers
- Clustered index: Rows stored directly in the index (faster reads, limited to one index per table)
- Covering index: Index includes some non-key columns to answer queries without touching heap
Transaction Processing vs Analytics
Databases serve different access patterns:
OLTP (Online Transaction Processing)
- Many small transactions
- Random reads/writes by primary key
- Low latency critical
- Latest data important
OLAP (Online Analytical Processing)
- Few large queries scanning many rows
- Aggregate calculations (sum, count, average)
- Query time less critical than throughput
- Historical data often sufficient
OLTP vs OLAP Queries
OLTP:

```sql
SELECT * FROM orders WHERE order_id = 12345;
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 789;
```

OLAP:

```sql
SELECT
  product_category,
  SUM(revenue),
  COUNT(DISTINCT customer_id)
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY product_category;
```

Data Warehouses
Running analytical queries on production OLTP databases slows them down. Data warehouses solve this:
- Extract data from various OLTP systems
- Transform into analysis-friendly schema
- Load into separate warehouse database
This ETL (Extract-Transform-Load) process decouples analytics from operations.
Star and Snowflake Schemas
Data warehouses typically use dimensional modeling:
Fact tables: Large tables with events (sales, clicks, page views)
- Contains foreign keys to dimension tables
- Contains measures (price, quantity, duration)
Dimension tables: Smaller tables with attributes (who, what, where, when, how)
- Products, customers, stores, dates
Star Schema
┌─────────────┐
│ dim_date │
│ date_id │
│ day, month │
│ quarter │
└──────┬──────┘
│
┌─────────────┐ ┌──────┴──────┐ ┌─────────────┐
│ dim_product │◄────│ fact_sales │────►│dim_customer │
│ product_id │ │ date_id │ │ customer_id │
│ name, cat │ │ product_id │ │ name, region│
└─────────────┘ │ customer_id │ └─────────────┘
│ store_id │
│ quantity │ ┌─────────────┐
│ revenue │────►│ dim_store │
└─────────────┘ │ store_id │
│ city, state │
└─────────────┘Column-Oriented Storage
Most OLTP databases store data row by row, which is efficient for transactions that touch entire rows. But analytical queries often need only a few columns from millions of rows.
Column-oriented storage stores each column separately:
Row vs Column Storage
Row storage:
Row 1: [2024-01-15, 101, "Widget", 5, 49.99]
Row 2: [2024-01-15, 102, "Gadget", 2, 99.99]
Row 3: [2024-01-16, 103, "Widget", 3, 49.99]

Column storage:
date: [2024-01-15, 2024-01-15, 2024-01-16, ...]
product: [101, 102, 103, ...]
name: ["Widget", "Gadget", "Widget", ...]
quantity: [5, 2, 3, ...]
price: [49.99, 99.99, 49.99, ...]

A query like SELECT SUM(quantity) WHERE name = 'Widget' only needs to read two of the five columns.
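As a tiny sketch of why that matters, with each column held in its own Python list (on disk each would be its own file or segment):

```python
# Only the columns the query touches are loaded and scanned.
name     = ["Widget", "Gadget", "Widget"]
quantity = [5, 2, 3]

# SELECT SUM(quantity) WHERE name = 'Widget' reads just these two lists;
# date, product, and price are never touched.
total = sum(q for n, q in zip(name, quantity) if n == "Widget")
print(total)  # 8
```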
Column Compression
Columns often have repeated values, enabling efficient compression:
- Bitmap encoding: for columns with few distinct values
- Run-length encoding: for sorted data with many consecutive identical values (sketched below)
- Dictionary encoding: replace strings with smaller integer codes
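A minimal sketch of run-length encoding over a sorted column, assuming the values are already grouped so identical neighbors form long runs:

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive identical values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def run_length_sum(encoded):
    """Aggregate directly on the compressed form, without decompressing."""
    return sum(value * length for value, length in encoded)

quantities = [5, 5, 5, 2, 2, 3]          # sorted/grouped column values
encoded = run_length_encode(quantities)
print(encoded)                            # [(5, 3), (2, 2), (3, 1)]
print(run_length_sum(encoded))            # 22
```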
Compressed columns can be processed directly (vectorized processing), making queries even faster.
Summary
Storage engines optimize for different access patterns:
| Engine Type | Optimized For | Examples |
|---|---|---|
| LSM-trees | Write-heavy workloads | LevelDB, RocksDB, Cassandra |
| B-trees | General-purpose OLTP | PostgreSQL, MySQL, SQLite |
| Column stores | Analytics (OLAP) | Redshift, BigQuery, ClickHouse |
Key concepts:
- Indexes trade write speed for read speed
- LSM-trees use append-only logs and compaction
- B-trees use fixed pages with in-place updates
- Column storage enables efficient analytical queries
Warning
Choosing the right storage engine requires understanding your access patterns. Profile your actual workload before making decisions.