
Designing Data-Intensive Applications

Storage and Retrieval

A database needs to do two fundamental things: store data when you give it some, and return that data when you ask for it later. Understanding how databases accomplish this helps you select the right storage engine and tune it for your workload.

Note

You don't need to implement a storage engine from scratch, but understanding the internals helps you reason about performance and make better architectural decisions.


The Simplest Database

The most basic database is two bash functions:

BASH
#!/bin/bash

# Append the key,value pair to the end of the log file.
db_set() {
  echo "$1,$2" >> database
}

# Print every line for the key, strip the key prefix, keep only the most recent value.
db_get() {
  grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

This append-only log has excellent write performance—just append to a file. But reads are O(n)—you must scan the entire file.

To make reads faster, we need a different data structure: an index.


Indexes: The Fundamental Tradeoff

An index is an additional structure derived from primary data. It speeds up reads but slows down writes (because the index must be updated on every write).

Warning

Indexes are a tradeoff: faster reads, slower writes, more storage. Databases don't index everything by default—you choose indexes based on your query patterns.

Hash Indexes

The simplest indexing strategy: keep an in-memory hash map where keys map to byte offsets in the data file.

Hash Index in Action

Data file:

PLAINTEXT
offset 0:    {"key": "foo", "value": "bar"}
offset 50:   {"key": "baz", "value": "qux"}
offset 100:  {"key": "foo", "value": "updated_bar"}

Hash map:

PLAINTEXT
foo → 100  (most recent offset)
baz → 50

Reads: Look up offset in hash map, read from disk at that position.
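
A minimal sketch of this mechanism (the class and method names are illustrative, not Bitcask's API; a real engine adds checksums, hint files, and crash recovery):

PYTHON
import os

# Sketch of a hash-indexed, append-only store. Simplifications: a single file,
# no crash recovery, and keys/values must not contain commas or newlines.
class HashIndexStore:
    def __init__(self, path="database"):
        self.path = path
        self.index = {}                        # key -> byte offset of latest record
        open(self.path, "a").close()           # make sure the log file exists

    def set(self, key, value):
        offset = os.path.getsize(self.path)    # new record starts at the current end
        with open(self.path, "ab") as f:
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset               # point the key at its newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                     # jump straight to the record
            record = f.readline().decode().rstrip("\n")
        return record.split(",", 1)[1]

Every write appends one record and updates the map; every read is one seek plus one line read, no matter how large the file grows.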

Bitcask (the storage engine in Riak) works this way. It's simple and blazingly fast for workloads where:

  • You have many writes per key (each write updates the value)
  • All keys fit in memory
  • You don't need range queries

Compaction

Since we only append, the data file grows forever. Compaction (sketched after the steps below) solves this by:

  1. Breaking the log into segments
  2. Periodically merging segments, keeping only the latest value per key
  3. Reading from the most recent segments first
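
A sketch of the merge step, assuming each segment is simply a list of (key, value) records and segments are ordered oldest to newest:

PYTHON
# Compact log segments by keeping only the newest value for each key.
def compact(segments):
    latest = {}
    for segment in segments:           # iterate oldest segment first
        for key, value in segment:     # later writes overwrite earlier ones
            latest[key] = value
    return list(latest.items())        # the new, deduplicated segment

old_segments = [
    [("foo", "bar"), ("baz", "qux")],  # oldest segment
    [("foo", "updated_bar")],          # newest segment
]
print(compact(old_segments))           # [('foo', 'updated_bar'), ('baz', 'qux')]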

SSTables and LSM-Trees

Sorted String Tables (SSTables) improve on hash indexes by sorting key-value pairs by key.

Advantages:

  • Efficient merging: Like merge-sort, scan through files and keep newest value
  • Sparse index: Don't need every key in memory—use binary search between known offsets (see the sketch after this list)
  • Range queries: Keys are sorted, so scanning a range is efficient
  • Compression: Can compress blocks of sorted data before writing to disk
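
A sketch of the sparse-index lookup, assuming the segment file is sorted by key and the in-memory index holds (key, byte offset) pairs for only a handful of keys:

PYTHON
import bisect

# Sparse index: (key, byte offset) entries for some keys of a sorted segment file.
sparse_index = [("apple", 0), ("hammock", 4096), ("pencil", 8192)]

def find_block(key):
    """Return the byte offset of the block that could contain `key`."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1   # last indexed key <= search key
    return sparse_index[max(i, 0)][1]

offset = find_block("handbag")   # -> 4096; scan forward from there for "handbag"

Everything between two indexed offsets is scanned sequentially, which stays cheap because each block is small and contiguous (and can be compressed as a unit).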

LSM-Trees: How to Build SSTables

You can't keep an append-only file on disk sorted as writes arrive. Instead:

  1. Writes: Insert into an in-memory sorted tree (often a red-black tree or AVL tree), called a memtable
  2. When memtable gets too big: Write it to disk as an SSTable segment
  3. Reads: Check memtable first, then disk segments (most recent first)
  4. Background: Merge and compact segments periodically

This is called a Log-Structured Merge-Tree (LSM-Tree).

LSM-Tree Write Path

PLAINTEXT
Write: SET key=123, value="hello"


    ┌───────────────┐
    │   Memtable    │  ← Sorted tree in memory
    │ (red-black)   │
    └───────────────┘

            ▼ (when full)
    ┌───────────────┐
    │  SSTable #5   │  ← Immutable sorted file on disk
    └───────────────┘
    ┌───────────────┐
    │  SSTable #4   │  ← Older segment
    └───────────────┘
    ┌───────────────┐
    │  SSTable #3   │  ← Even older
    └───────────────┘

LevelDB, RocksDB, Cassandra, HBase, and many others use LSM-trees.
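
A toy sketch of that write and read path, using a plain dict as the memtable and sorted lists as "on-disk" segments (real engines add a write-ahead log so the memtable survives crashes, plus bloom filters and leveled compaction):

PYTHON
# Toy LSM-tree: memtable in memory, immutable key-sorted segments on "disk" (lists here).
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # stand-in for a balanced in-memory tree
        self.segments = []            # newest segment last; each is a sorted list of (k, v)
        self.memtable_limit = memtable_limit

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()              # write out an SSTable once the memtable is full

    def flush(self):
        self.segments.append(sorted(self.memtable.items()))  # immutable, sorted by key
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                  # 1. check the memtable
            return self.memtable[key]
        for segment in reversed(self.segments):   # 2. then segments, newest first
            for k, v in segment:                  #    (real engines binary-search here)
                if k == key:
                    return v
        return None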

Bloom Filters

Reading non-existent keys is expensive—you check the memtable and every SSTable and find nothing. Bloom filters help: a bloom filter is a probabilistic data structure that can quickly answer "definitely not present" or "maybe present" for a key.
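
A minimal sketch of the idea, deriving several bit positions per key from SHA-256 (real filters choose the bit-array size and number of hashes from the expected key count and the target false-positive rate):

PYTHON
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive num_hashes pseudo-independent bit positions for the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False -> definitely absent; True -> possibly present (go check the SSTable).
        return all(self.bits[pos] for pos in self._positions(key))

Before reading a segment from disk, the engine consults that segment's filter; a negative answer lets it skip the segment entirely.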


B-Trees

B-trees are the most widely used indexing structure, standard in almost all relational databases. Unlike LSM-trees (which write sequentially), B-trees update fixed-size pages in place.

Structure

  • Data is organized in pages (typically 4KB)
  • Each page contains keys and references to child pages
  • Leaf pages contain the actual values (or pointers to data)
  • Pages are kept balanced—all leaves at the same depth

B-Tree Structure

PLAINTEXT
                  ┌─────────────────┐
                  │    Root Page    │
                  │   100  |  200   │
                  └─────────────────┘
                 /         |         \
                /          |          \
  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
  │  Page: <100   │ │ Page: 100-200 │ │  Page: >200   │
  │ 10|40|70|90   │ │110|140|170|190│ │ 210|250|280   │
  └───────────────┘ └───────────────┘ └───────────────┘
          │                 │                 │
          ▼                 ▼                 ▼
     (leaf pages with actual data)
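
A sketch of the lookup descent, assuming interior pages hold sorted separator keys plus child references and leaf pages hold key-value pairs (page splits, balancing, and actual disk I/O are omitted):

PYTHON
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted separator keys (interior) or entry keys (leaf)
        self.children = children    # len(keys) + 1 child pages (interior pages only)
        self.values = values        # values aligned with keys (leaf pages only)

def lookup(page, key):
    while page.children is not None:              # descend interior pages
        i = bisect.bisect_right(page.keys, key)   # which child range does key fall in?
        page = page.children[i]
    i = bisect.bisect_left(page.keys, key)        # search within the leaf page
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

# Mirroring the diagram: separators 100 and 200 route to three leaf pages.
leaves = [
    Page([10, 40, 70, 90],     values=["a", "b", "c", "d"]),
    Page([110, 140, 170, 190], values=["e", "f", "g", "h"]),
    Page([210, 250, 280],      values=["i", "j", "k"]),
]
root = Page([100, 200], children=leaves)
print(lookup(root, 140))   # -> "f"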

Updates

To update a value:

  1. Find the leaf page containing the key
  2. Update the value in place
  3. Write the page back to disk

To add a new key:

  1. Find the appropriate leaf page
  2. If room, add the key
  3. If full, split the page into two and update the parent

Write-Ahead Log (WAL)

Since B-trees modify pages in place, crashes during writes can leave the tree corrupted. A write-ahead log solves this:

  1. Before modifying any page, append the change to a sequential log
  2. Then make the actual page modifications
  3. If crash occurs, replay the log on recovery
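
A sketch of that ordering, assuming a JSON-lines log format and that replaying the same change twice during recovery is harmless; apply_page_change is a stand-in for the engine's page-update routine:

PYTHON
import json
import os

def write_with_wal(wal_path, apply_page_change, change):
    # 1. Append the change to the sequential log and force it to disk.
    with open(wal_path, "a") as wal:
        wal.write(json.dumps(change) + "\n")
        wal.flush()
        os.fsync(wal.fileno())
    # 2. Only then modify the B-tree page in place.
    apply_page_change(change)

def recover(wal_path, apply_page_change):
    # After a crash, replay every logged change before accepting new writes.
    if not os.path.exists(wal_path):
        return
    with open(wal_path) as wal:
        for line in wal:
            apply_page_change(json.loads(line))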

LSM-Trees vs B-Trees

Aspect               LSM-Trees                                B-Trees
Write pattern        Sequential (append-only)                 Random (in-place updates)
Write throughput     Generally higher                         Lower
Write amplification  Higher (compaction)                      Lower
Read performance     May need to check multiple files         Single page lookup
Predictability       Compaction can cause latency spikes      More predictable
Storage overhead     Lower (compaction removes duplicates)    Fixed page overhead

Note

LSM-trees often have better write throughput, while B-trees often have more consistent read latency. But these are generalizations—always benchmark for your specific workload.


Other Index Types

Secondary Indexes

Beyond the primary key, you often need to search by other columns. Secondary indexes enable this:

SQL
-- The primary key index on id already exists; add secondary indexes on other columns
CREATE INDEX idx_users_email ON users (email);
CREATE INDEX idx_users_created_at ON users (created_at);

SELECT * FROM users WHERE email = 'alice@example.com';
SELECT * FROM users WHERE created_at > '2024-01-01';

The main difference from a primary index: the indexed values are not necessarily unique, so multiple rows may match the same key.

Multi-column Indexes

Concatenated indexes combine multiple columns:

SQL
-- Concatenated index on (last_name, first_name)
CREATE INDEX idx_users_name ON users (last_name, first_name);

-- Good for:    WHERE last_name = 'Smith' AND first_name = 'John'
-- Also helps:  WHERE last_name = 'Smith'
-- Useless for: WHERE first_name = 'John' (the leading column isn't constrained)

Spatial indexes (R-trees) handle multi-dimensional data like geographical coordinates.

Storing Values in the Index

Where does the actual data live?

  • Heap file: Rows stored in a separate file; index contains pointers
  • Clustered index: Rows stored directly in the index (faster reads, limited to one index per table)
  • Covering index: Index includes some non-key columns so queries can be answered without touching the heap file

Transaction Processing vs Analytics

Databases serve different access patterns:

OLTP (Online Transaction Processing)

  • Many small transactions
  • Random reads/writes by primary key
  • Low latency critical
  • Latest data important

OLAP (Online Analytical Processing)

  • Few large queries scanning many rows
  • Aggregate calculations (sum, count, average)
  • Query time less critical than throughput
  • Historical data often sufficient

OLTP vs OLAP Queries

OLTP:

SQL
SELECT * FROM orders WHERE order_id = 12345;
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 789;

OLAP:

SQL
SELECT
  product_category,
  SUM(revenue),
  COUNT(DISTINCT customer_id)
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY product_category;

Data Warehouses

Running analytical queries on production OLTP databases slows them down. Data warehouses solve this:

  1. Extract data from various OLTP systems
  2. Transform into analysis-friendly schema
  3. Load into separate warehouse database

This ETL (Extract-Transform-Load) process decouples analytics from operations.

Star and Snowflake Schemas

Data warehouses typically use dimensional modeling:

Fact tables: Large tables with events (sales, clicks, page views)

  • Contains foreign keys to dimension tables
  • Contains measures (price, quantity, duration)

Dimension tables: Smaller tables with attributes (who, what, where, when, how)

  • Products, customers, stores, dates

Star Schema

PLAINTEXT
                    ┌─────────────┐
                    │  dim_date   │
                    │ date_id     │
                    │ day, month  │
                    │ quarter     │
                    └──────┬──────┘
                           │
┌─────────────┐     ┌──────┴──────┐     ┌─────────────┐
│ dim_product │◄────│ fact_sales  │────►│dim_customer │
│ product_id  │     │ date_id     │     │ customer_id │
│ name, cat   │     │ product_id  │     │ name, region│
└─────────────┘     │ customer_id │     └─────────────┘
                    │ store_id    │
                    │ quantity    │     ┌─────────────┐
                    │ revenue     │────►│  dim_store  │
                    └─────────────┘     │ store_id    │
                                        │ city, state │
                                        └─────────────┘

Column-Oriented Storage

OLTP databases store data row by row—efficient for transactions that touch entire rows. But analytical queries often need only a few columns from millions of rows.

Column-oriented storage stores each column separately:

Row vs Column Storage

Row storage:

PLAINTEXT
Row 1: [2024-01-15, 101, "Widget", 5, 49.99]
Row 2: [2024-01-15, 102, "Gadget", 2, 99.99]
Row 3: [2024-01-16, 103, "Widget", 3, 49.99]

Column storage:

PLAINTEXT
date:     [2024-01-15, 2024-01-15, 2024-01-16, ...]
product:  [101, 102, 103, ...]
name:     ["Widget", "Gadget", "Widget", ...]
quantity: [5, 2, 3, ...]
price:    [49.99, 99.99, 49.99, ...]

A query like SELECT SUM(quantity) … WHERE name = 'Widget' needs to read only two columns.
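
A sketch of why this helps, treating each column as a plain Python list (a real column store works on compressed blocks with vectorized execution):

PYTHON
# Column-at-a-time aggregation: SUM(quantity) for rows where name = 'Widget'.
name     = ["Widget", "Gadget", "Widget"]
quantity = [5, 2, 3]

# Only the name and quantity columns are read; date, product, and price stay on disk.
total = sum(q for n, q in zip(name, quantity) if n == "Widget")
print(total)   # 8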

Column Compression

Columns often have repeated values, enabling efficient compression:

  • Bitmap encoding: For columns with few distinct values
  • Run-length encoding: For sorted data with many consecutive identical values
  • Dictionary encoding: Replace strings with smaller integer codes

Compressed columns can be processed directly (vectorized processing), making queries even faster.
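
Minimal sketches of two of the encodings above, run-length and dictionary, again treating a column as a plain list:

PYTHON
# Run-length encoding: collapse consecutive repeats into (value, count) runs.
def run_length_encode(column):
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1                 # extend the current run
        else:
            runs.append([value, 1])          # start a new run
    return runs

# Dictionary encoding: replace repeated strings with small integer codes.
def dictionary_encode(column):
    codes, encoded = {}, []
    for value in column:
        code = codes.setdefault(value, len(codes))
        encoded.append(code)
    return codes, encoded

print(run_length_encode(["2024-01-15", "2024-01-15", "2024-01-16"]))
# [['2024-01-15', 2], ['2024-01-16', 1]]
print(dictionary_encode(["Widget", "Gadget", "Widget"]))
# ({'Widget': 0, 'Gadget': 1}, [0, 1, 0])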


Summary

Storage engines optimize for different access patterns:

Engine Type     Optimized For             Examples
LSM-trees       Write-heavy workloads     LevelDB, RocksDB, Cassandra
B-trees         General-purpose OLTP      PostgreSQL, MySQL, SQLite
Column stores   Analytics (OLAP)          Redshift, BigQuery, ClickHouse

Key concepts:

  • Indexes trade write speed for read speed
  • LSM-trees use append-only logs and compaction
  • B-trees use fixed pages with in-place updates
  • Column storage enables efficient analytical queries

Warning

Choosing the right storage engine requires understanding your access patterns. Profile your actual workload before making decisions.