Learning Guides
Menu

Data Models and Query Languages

8 min readDesigning Data-Intensive Applications

Data Models and Query Languages

Data models are perhaps the most important part of developing software—they shape not just how we write code, but how we think about the problem we're solving. Each layer of software hides the complexity of the layer below by providing a clean data model.

The Importance of Data Models

Consider how applications are built in layers:

  1. Application developers model the real world (people, products, actions) as objects and data structures
  2. Storage layer represents those structures as bytes in memory, disk, or network
  3. Hardware represents bytes as electrical currents, magnetic fields, or light pulses

Each layer hides complexity by providing a clean abstraction to the layer above.

Note

Your choice of data model has profound effects on what the software can and cannot do. Features that are easy in one model might be awkward or impossible in another.


The Relational Model

SQL, based on the relational model proposed by Edgar Codd in 1970, dominates data storage. Data is organized into relations (tables), where each relation is an unordered collection of tuples (rows).

Why Relational Became Dominant

The relational model started as a theoretical proposal for better abstracting the internal representation of data. Previous models (hierarchical, network) required developers to think about internal data representation.

By the mid-1980s, RDBMSs and SQL became the go-to tools for data storage—business data processing, e-commerce, publishing, and more.

Representing a User in Relational Tables

users table:

user_idfirst_namelast_nameemail
1BillGatesbill@microsoft.com

positions table (one-to-many):

position_iduser_idtitlecompany
11Co-founderMicrosoft
21Board MemberBerkshire

education table (one-to-many):

education_iduser_idschooldegree
11HarvardNone

The user_id serves as a foreign key linking related records.

Object-Relational Mismatch

Most application code today is object-oriented, but SQL data is stored in relational tables. This impedance mismatch requires a translation layer between objects and rows.

ORM frameworks (ActiveRecord, Hibernate) reduce boilerplate, but they can't completely hide the differences. There's an awkward translation between the application and database.


The Document Model

Document databases like MongoDB, CouchDB, and RethinkDB store data as self-contained documents, typically in JSON format.

Advantages of Documents

JSON
{
  "user_id": 1,
  "first_name": "Bill",
  "last_name": "Gates",
  "positions": [
    { "title": "Co-founder", "company": "Microsoft" },
    { "title": "Board Member", "company": "Berkshire" }
  ],
  "education": [{ "school": "Harvard", "degree": null }]
}

Better locality: All information about a user is in one place. No joins needed for simple retrievals.

Schema flexibility: Documents don't enforce a rigid structure. You can add fields without altering all existing documents.

Closer to application objects: The JSON structure maps more naturally to in-memory objects.

Note

Document databases are sometimes called "schemaless," but that's misleading. The schema is implicit—enforced by application code reading the data, not by the database on write.

Document Limitations

Joins are hard: If you need to reference data in other documents, you're on your own. The database won't help you.

Many-to-many relationships are awkward: When multiple documents reference the same entity, you either duplicate data (and manage consistency) or do application-level joins.

When Documents Struggle

Consider representing a social network:

JSON
{
  "user_id": 1,
  "name": "Alice",
  "friends": [2, 3, 4] // Just IDs, no actual data
}

To show Alice's friends' names, you need separate queries for each friend. The database doesn't understand relationships.

When to Use Documents

Document databases excel when:

  • Data has a natural document structure (tree of one-to-many relationships)
  • You rarely need joins or many-to-many relationships
  • Each document is fairly independent
  • You want schema flexibility

Relational vs Document: A Modern View

The debate has mellowed over time as databases have borrowed from each other:

FeatureRelationalDocument
SchemaExplicit, enforced on writeImplicit, interpreted on read
JoinsBuilt-in, optimizedApplication-level
LocalityData spread across tablesSelf-contained documents
Many-to-manyNatural with foreign keysRequires denormalization
One-to-manyRequires joinsNatural nesting

Most modern relational databases support JSON columns. Most document databases have added some form of join operations.

Warning

The choice isn't always either/or. Many applications use both—relational for transactional data with complex relationships, and documents for content, logs, or configuration.


Graph Data Models

For data where relationships are central—and potentially many-to-many—neither relational nor document models are ideal. Graph databases make relationships first-class citizens.

The Property Graph Model

A graph consists of:

  • Vertices (nodes): Entities with unique IDs and properties
  • Edges: Connections between vertices, with a label and direction

A Social Graph

PLAINTEXT
Person:Alice --[:LIVES_IN]--> City:London
Person:Alice --[:BORN_IN]--> City:Boston
Person:Bob   --[:LIVES_IN]--> City:London
Person:Bob   --[:FRIENDS_WITH]--> Person:Alice
City:London  --[:IN]--> Country:UK

Everything is connected, and you can traverse relationships easily.

Graph Queries with Cypher

Cypher (used by Neo4j) is a declarative query language for graphs:

CYPHER
// Find people who live in the US
MATCH (person)-[:BORN_IN]->()-[:IN*0..]->(us:Country {name: 'United States'})
RETURN person.name

The -[:IN*0..]->syntax means "follow zero or more IN edges"—handling the variable depth of geographic hierarchies elegantly.

When Graphs Shine

Graphs excel at:

  • Social networks
  • Road or rail networks
  • Fraud detection (spotting unusual patterns)
  • Knowledge graphs
  • Recommendation engines

The key advantage: you don't need to know ahead of time which joins you'll need. You can traverse any path through the data.


Query Languages

How you ask for data matters as much as how you store it.

Declarative vs Imperative

Imperative queries specify how to get data:

JAVASCRIPT
function getSharkSpecies(animals) {
  const sharks = [];
  for (const animal of animals) {
    if (animal.family === "Sharks") {
      sharks.push(animal.species);
    }
  }
  return sharks;
}

Declarative queries specify what you want:

SQL
SELECT species FROM animals WHERE family = 'Sharks';

Note

Declarative is generally better because: 1. It's more concise and easier to understand 2. The database can optimize execution 3. Implementation can change without breaking queries

SQL: The Declarative Standard

SQL lets you describe the result set you want. The query optimizer determines the best execution plan.

SQL
SELECT species, COUNT(*) as count
FROM animals
WHERE family = 'Sharks'
GROUP BY species
ORDER BY count DESC;

You don't specify:

  • Which index to use
  • In what order to execute operations
  • How to implement the grouping

The database figures all that out.

MapReduce: Between Declarative and Imperative

MapReduce is a programming model for processing large datasets across many machines. It's somewhere between declarative and imperative.

JAVASCRIPT
// MongoDB MapReduce example
db.observations.mapReduce(
  // Map function: emit key-value pairs
  function () {
    emit(this.family, 1);
  },
  // Reduce function: combine values for same key
  function (key, values) {
    return values.reduce((a, b) => a + b, 0);
  },
  { query: { family: "Sharks" }, out: "shark_counts" },
);

Warning

MapReduce is powerful but verbose. Many databases now offer declarative alternatives. MongoDB added an aggregation pipeline that's often preferred over MapReduce.

Aggregation Pipelines

Modern declarative approaches like MongoDB's aggregation pipeline:

JAVASCRIPT
db.observations.aggregate([
  { $match: { family: "Sharks" } },
  { $group: { _id: "$species", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
]);

More readable than MapReduce, and the database can optimize execution.


Triple Stores and SPARQL

A triple store represents data as three-part statements: (subject, predicate, object)

PLAINTEXT
(alice, livesIn, london)
(london, in, uk)
(uk, type, country)

This is essentially a graph where:

  • Subject and object are vertices
  • Predicate is the edge label

SPARQL

SPARQL is the query language for triple stores:

SPARQL
PREFIX : <http://example.org/>
 
SELECT ?person WHERE {
  ?person :livesIn ?city .
  ?city :in :uk .
}

Triple stores are common in semantic web and knowledge graph applications.


Summary

Different data models suit different use cases:

ModelBest ForLimitations
RelationalTransactional data, complex queries, many-to-many relationshipsObject-relational mismatch, rigid schema
DocumentSelf-contained records, flexible schema, simple queriesWeak joins, data duplication
GraphHeavily interconnected data, variable-depth traversalsLess mature, specialized use cases

Note

Most real applications use multiple data models. A single "one-size-fits-all" database rarely serves all needs optimally.

Query languages follow similar patterns:

  • Declarative (SQL, Cypher, SPARQL) — Describe what you want
  • Imperative (procedural code) — Describe how to get it
  • Hybrid (MapReduce, aggregation pipelines) — Somewhere in between

The trend is toward more declarative approaches, as they allow better optimization and are easier to reason about.