Data Models and Query Languages

8 min read•Designing Data-Intensive Applications

Data Models and Query Languages

Data models are perhaps the most important part of developing software—they shape not just how we write code, but how we think about the problem we're solving. Each layer of software hides the complexity of the layer below by providing a clean data model.

The Importance of Data Models

Consider how applications are built in layers:

Application developers model the real world (people, products, actions) as objects and data structures
Storage layer represents those structures as bytes in memory, disk, or network
Hardware represents bytes as electrical currents, magnetic fields, or light pulses

Each layer hides complexity by providing a clean abstraction to the layer above.

Note

Your choice of data model has profound effects on what the software can and cannot do. Features that are easy in one model might be awkward or impossible in another.

The Relational Model

SQL, based on the relational model proposed by Edgar Codd in 1970, dominates data storage. Data is organized into relations (tables), where each relation is an unordered collection of tuples (rows).

Why Relational Became Dominant

The relational model started as a theoretical proposal for better abstracting the internal representation of data. Previous models (hierarchical, network) required developers to think about internal data representation.

By the mid-1980s, RDBMSs and SQL became the go-to tools for data storage—business data processing, e-commerce, publishing, and more.

Representing a User in Relational Tables

users table:

user_id	first_name	last_name	email
1	Bill	Gates	bill@microsoft.com

positions table (one-to-many):

position_id	user_id	title	company
1	1	Co-founder	Microsoft
2	1	Board Member	Berkshire

education table (one-to-many):

education_id	user_id	school	degree
1	1	Harvard	None

The user_id serves as a foreign key linking related records.

Object-Relational Mismatch

Most application code today is object-oriented, but SQL data is stored in relational tables. This impedance mismatch requires a translation layer between objects and rows.

ORM frameworks (ActiveRecord, Hibernate) reduce boilerplate, but they can't completely hide the differences. There's an awkward translation between the application and database.

The Document Model

Document databases like MongoDB, CouchDB, and RethinkDB store data as self-contained documents, typically in JSON format.

Advantages of Documents

JSON

{
  "user_id": 1,
  "first_name": "Bill",
  "last_name": "Gates",
  "positions": [
    { "title": "Co-founder", "company": "Microsoft" },
    { "title": "Board Member", "company": "Berkshire" }
  ],
  "education": [{ "school": "Harvard", "degree": null }]
}

Better locality: All information about a user is in one place. No joins needed for simple retrievals.

Schema flexibility: Documents don't enforce a rigid structure. You can add fields without altering all existing documents.

Closer to application objects: The JSON structure maps more naturally to in-memory objects.

Note

Document databases are sometimes called "schemaless," but that's misleading. The schema is implicit—enforced by application code reading the data, not by the database on write.

Document Limitations

Joins are hard: If you need to reference data in other documents, you're on your own. The database won't help you.

Many-to-many relationships are awkward: When multiple documents reference the same entity, you either duplicate data (and manage consistency) or do application-level joins.

When Documents Struggle

Consider representing a social network:

JSON

{
  "user_id": 1,
  "name": "Alice",
  "friends": [2, 3, 4] // Just IDs, no actual data
}

To show Alice's friends' names, you need separate queries for each friend. The database doesn't understand relationships.

When to Use Documents

Document databases excel when:

Data has a natural document structure (tree of one-to-many relationships)
You rarely need joins or many-to-many relationships
Each document is fairly independent
You want schema flexibility

Relational vs Document: A Modern View

The debate has mellowed over time as databases have borrowed from each other:

Feature	Relational	Document
Schema	Explicit, enforced on write	Implicit, interpreted on read
Joins	Built-in, optimized	Application-level
Locality	Data spread across tables	Self-contained documents
Many-to-many	Natural with foreign keys	Requires denormalization
One-to-many	Requires joins	Natural nesting

Most modern relational databases support JSON columns. Most document databases have added some form of join operations.

Warning

The choice isn't always either/or. Many applications use both—relational for transactional data with complex relationships, and documents for content, logs, or configuration.

Graph Data Models

For data where relationships are central—and potentially many-to-many—neither relational nor document models are ideal. Graph databases make relationships first-class citizens.

The Property Graph Model

A graph consists of:

Vertices (nodes): Entities with unique IDs and properties
Edges: Connections between vertices, with a label and direction

A Social Graph

PLAINTEXT

Person:Alice --[:LIVES_IN]--> City:London
Person:Alice --[:BORN_IN]--> City:Boston
Person:Bob   --[:LIVES_IN]--> City:London
Person:Bob   --[:FRIENDS_WITH]--> Person:Alice
City:London  --[:IN]--> Country:UK

Everything is connected, and you can traverse relationships easily.

Graph Queries with Cypher

Cypher (used by Neo4j) is a declarative query language for graphs:

CYPHER

// Find people who live in the US
MATCH (person)-[:BORN_IN]->()-[:IN*0..]->(us:Country {name: 'United States'})
RETURN person.name

The -[:IN*0..]->syntax means "follow zero or more IN edges"—handling the variable depth of geographic hierarchies elegantly.

When Graphs Shine

Graphs excel at:

Social networks
Road or rail networks
Fraud detection (spotting unusual patterns)
Knowledge graphs
Recommendation engines

The key advantage: you don't need to know ahead of time which joins you'll need. You can traverse any path through the data.

Query Languages

How you ask for data matters as much as how you store it.

Declarative vs Imperative

Imperative queries specify how to get data:

JAVASCRIPT

function getSharkSpecies(animals) {
  const sharks = [];
  for (const animal of animals) {
    if (animal.family === "Sharks") {
      sharks.push(animal.species);
    }
  }
  return sharks;
}

Declarative queries specify what you want:

SQL

SELECT species FROM animals WHERE family = 'Sharks';

Note

Declarative is generally better because: 1. It's more concise and easier to understand 2. The database can optimize execution 3. Implementation can change without breaking queries

SQL: The Declarative Standard

SQL lets you describe the result set you want. The query optimizer determines the best execution plan.

SQL

SELECT species, COUNT(*) as count
FROM animals
WHERE family = 'Sharks'
GROUP BY species
ORDER BY count DESC;

You don't specify:

Which index to use
In what order to execute operations
How to implement the grouping

The database figures all that out.

MapReduce: Between Declarative and Imperative

MapReduce is a programming model for processing large datasets across many machines. It's somewhere between declarative and imperative.

JAVASCRIPT

// MongoDB MapReduce example
db.observations.mapReduce(
  // Map function: emit key-value pairs
  function () {
    emit(this.family, 1);
  },
  // Reduce function: combine values for same key
  function (key, values) {
    return values.reduce((a, b) => a + b, 0);
  },
  { query: { family: "Sharks" }, out: "shark_counts" },
);

Warning

MapReduce is powerful but verbose. Many databases now offer declarative alternatives. MongoDB added an aggregation pipeline that's often preferred over MapReduce.

Aggregation Pipelines

Modern declarative approaches like MongoDB's aggregation pipeline:

JAVASCRIPT

db.observations.aggregate([
  { $match: { family: "Sharks" } },
  { $group: { _id: "$species", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
]);

More readable than MapReduce, and the database can optimize execution.

Triple Stores and SPARQL

A triple store represents data as three-part statements: (subject, predicate, object)

PLAINTEXT

(alice, livesIn, london)
(london, in, uk)
(uk, type, country)

This is essentially a graph where:

Subject and object are vertices
Predicate is the edge label

SPARQL

SPARQL is the query language for triple stores:

SPARQL

PREFIX : <http://example.org/>
 
SELECT ?person WHERE {
  ?person :livesIn ?city .
  ?city :in :uk .
}

Triple stores are common in semantic web and knowledge graph applications.

Summary

Different data models suit different use cases:

Model	Best For	Limitations
Relational	Transactional data, complex queries, many-to-many relationships	Object-relational mismatch, rigid schema
Document	Self-contained records, flexible schema, simple queries	Weak joins, data duplication
Graph	Heavily interconnected data, variable-depth traversals	Less mature, specialized use cases

Note

Most real applications use multiple data models. A single "one-size-fits-all" database rarely serves all needs optimally.

Query languages follow similar patterns:

Declarative (SQL, Cypher, SPARQL) — Describe what you want
Imperative (procedural code) — Describe how to get it
Hybrid (MapReduce, aggregation pipelines) — Somewhere in between

The trend is toward more declarative approaches, as they allow better optimization and are easier to reason about.