Designing Data-Intensive Applications

Encoding and Evolution

Applications change over time. Features are added, data models evolve, and requirements shift. In a running system, old and new code—and old and new data formats—must coexist. This chapter explores how to encode data and evolve schemas without breaking things.

The Problem of Change

Consider a typical deployment scenario:

  • Server-side applications: Rolling upgrades deploy new code to a few nodes at a time
  • Client-side applications: Users may not update for months
  • Stored data: Data written years ago must still be readable

This means:

  • Backward compatibility: New code must read data written by old code
  • Forward compatibility: Old code must read data written by new code (often ignored, but crucial for rolling deployments)

Note

Forward compatibility is trickier—old code must somehow handle unknown fields without breaking. This requires careful schema design.
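
A minimal sketch of the idea with plain Python dicts and JSON (the field names are illustrative): the older reader simply ignores keys it does not recognize.

PYTHON
import json

# Fields this (older) version of the code knows about.
KNOWN_FIELDS = ("user_name", "favorite_number")

def read_person(payload: bytes) -> dict:
    record = json.loads(payload)
    # Forward compatibility: silently ignore fields added by newer writers
    # (e.g. "email") instead of rejecting the whole record.
    return {field: record.get(field) for field in KNOWN_FIELDS}

# A record written by newer code that already includes an "email" field:
new_payload = b'{"user_name": "martin", "favorite_number": 1337, "email": "m@example.com"}'
print(read_person(new_payload))  # {'user_name': 'martin', 'favorite_number': 1337}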


Encoding Formats

When data lives only in memory, you use language-specific structures (objects, arrays, hashmaps). When you send data over the network or write to disk, you need to encode it into a byte sequence.

Language-Specific Formats

Java's Serializable, Python's pickle, Ruby's Marshal—convenient but problematic:

Problem                    Impact
Language lock-in           Can't share data across languages
Security vulnerabilities   Deserialization can instantiate arbitrary classes
Poor versioning            Schema evolution is often an afterthought
Inefficiency               Often not optimized for size or speed

Warning

Language-specific serialization is tempting for quick prototypes but problematic for anything crossing system boundaries. Avoid for inter-service communication.
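
A short Python sketch of why this goes wrong across boundaries: the pickled bytes embed the Python class path, so only compatible Python code can read them, and unpickling untrusted bytes can execute arbitrary code.

PYTHON
import pickle

class Person:
    def __init__(self, user_name, favorite_number):
        self.user_name = user_name
        self.favorite_number = favorite_number

data = pickle.dumps(Person("martin", 1337))

# The bytes reference the module and class name ("__main__.Person"), so only
# a Python process that can import that exact class can unpickle them.
# A Java or Go service cannot read this at all, and pickle.loads() on
# untrusted input can run arbitrary code during deserialization.
restored = pickle.loads(data)  # safe here only because we produced the bytes ourselves
print(restored.user_name, restored.favorite_number)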

JSON, XML, and CSV

Text-based formats are human-readable and widely supported:

JSON has become the dominant format for web APIs:

JSON
{
  "userName": "martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}

Advantages:

  • Human-readable
  • Ubiquitous language support
  • Schema-less (flexible)

Disadvantages:

  • No distinction between integers and floats
  • No binary data support (must Base64 encode)
  • No schema enforcement
  • Verbose (wastes bandwidth)
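
Two of these limitations are easy to see in a few lines of Python (a sketch using only the standard library): binary data must be smuggled in as Base64 text, and the number type carries no integer/float distinction.

PYTHON
import base64
import json

# JSON has no binary type, so raw bytes must be Base64-encoded into a string,
# which adds roughly 33% size overhead.
avatar = bytes(range(16))  # some binary payload
doc = {"userName": "martin",
       "avatar": base64.b64encode(avatar).decode("ascii")}
encoded = json.dumps(doc)
print(encoded)

# Decoding requires out-of-band knowledge that "avatar" is Base64, not plain text.
decoded = json.loads(encoded)
assert base64.b64decode(decoded["avatar"]) == avatar

# The format itself also has no integer/float distinction: parsers in some
# languages (e.g. JavaScript) read every number as a double, so integers
# above 2**53 can silently lose precision.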

XML is more structured but verbose:

XML
<user>
  <userName>martin</userName>
  <favoriteNumber>1337</favoriteNumber>
  <interests>
    <interest>daydreaming</interest>
    <interest>hacking</interest>
  </interests>
</user>

CSV is fine for tabular data but has many edge cases (escaping, encoding, no types).
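
For instance, a field containing a comma or a quote has to be quoted and escaped, and every consumer must agree on the same dialect. Python's csv module handles the escaping, but the format still carries no type information:

PYTHON
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["martin", 1337, 'likes "hacking", daydreaming'])
print(buf.getvalue())
# martin,1337,"likes ""hacking"", daydreaming"

# Reading it back: everything comes out as a string; 1337 is no longer a number.
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)  # ['martin', '1337', 'likes "hacking", daydreaming']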


Binary Encoding

For internal services where efficiency matters, binary formats are better. They're more compact and faster to parse.

MessagePack, BSON, etc.

Binary JSON variants encode JSON-like structures more efficiently:

PLAINTEXT
JSON: {"userName":"martin","favoriteNumber":1337}
Bytes: 47 characters
 
MessagePack: Same structure
Bytes: 32 bytes

The savings aren't dramatic because field names are still included in every record.
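
A hedged sketch of that comparison, assuming the third-party msgpack package (msgpack-python) is installed:

PYTHON
import json
import msgpack  # third-party: pip install msgpack

record = {"userName": "martin", "favoriteNumber": 1337}

as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)

print(len(as_json), len(as_msgpack))  # roughly 43 vs. 35 bytes
# Field names ("userName", "favoriteNumber") are still spelled out in both
# encodings, which is why the savings are modest.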


Schema-Based Binary Formats

Real efficiency gains come from separating schema from data. Apache Thrift, Protocol Buffers, and Apache Avro take this approach.

Protocol Buffers

Define a schema in .proto files:

PROTOBUF
message Person {
  required string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
}

Generated code provides type-safe accessors. The encoded data contains only:

  • Field tag numbers (1, 2, 3) instead of names
  • Type indicators
  • Values

Protocol Buffers Encoding

For the same data as above:

PLAINTEXT
Field 1 (string): "martin"   → 0a 06 6d 61 72 74 69 6e
Field 2 (varint): 1337       → 10 b9 0a
Field 3 (string): repeated   → 1a 0b 64 61 79 64 72 65 61 6d 69 6e 67
                               1a 07 68 61 63 6b 69 6e 67

Total: 33 bytes (vs 81 bytes for the minified JSON shown earlier)
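
In practice you never write those bytes by hand: protoc compiles the .proto file into a module (commonly person_pb2.py for person.proto; that module name is an assumption here), and the generated class handles the encoding. A sketch of what usage looks like:

PYTHON
# Assumes `protoc --python_out=. person.proto` produced person_pb2.py
from person_pb2 import Person  # hypothetical generated module

p = Person()
p.user_name = "martin"
p.favorite_number = 1337
p.interests.extend(["daydreaming", "hacking"])

data = p.SerializeToString()  # compact bytes: tags + type indicators + values only
print(len(data))              # roughly 33 bytes for this record

restored = Person.FromString(data)
print(restored.user_name, list(restored.interests))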

Apache Thrift

Similar to Protocol Buffers, with two binary encodings:

  • BinaryProtocol: Simple, fast
  • CompactProtocol: More space-efficient (uses variable-length integers)

Apache Avro

Avro takes a different approach: the encoded data contains no field tags at all. Instead, the reader must know the exact schema the data was written with.

JSON
{
  "type": "record",
  "name": "Person",
  "fields": [
    { "name": "userName", "type": "string" },
    { "name": "favoriteNumber", "type": ["null", "long"], "default": null },
    { "name": "interests", "type": { "type": "array", "items": "string" } }
  ]
}

The encoded data is just values, in schema order. This makes it extremely compact but requires careful schema management.
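
A sketch of that using the third-party fastavro library (an assumption; the standard avro package works similarly): schemaless_writer emits only the values, so the schema has to travel separately.

PYTHON
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader  # pip install fastavro

schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
})

record = {"userName": "martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

buf = io.BytesIO()
schemaless_writer(buf, schema, record)
data = buf.getvalue()
print(len(data))  # roughly 32 bytes: just the values, no field names or tags

# Reading back requires the writer's schema:
print(schemaless_reader(io.BytesIO(data), schema))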


Schema Evolution

Schemas change over time. The key question: how do you maintain compatibility?

Field Tags Are Forever

In Protocol Buffers and Thrift, field tags (the numbers) are the stable identifiers. Names can change, but tags must remain constant.

Adding a field:

PROTOBUF
message Person {
  required string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
  optional string email = 4;  // NEW FIELD
}
  • Old code ignores the unknown tag 4 (forward compatibility ✓)
  • New code sees tag 4 missing, uses default (backward compatibility ✓)

Removing a field:

  • Only optional fields can be removed
  • Never reuse the tag number for a different field

Warning

Never change a field's tag number or type. This breaks both forward and backward compatibility.

Avro Schema Evolution

Avro uses schema resolution—the writer's schema and reader's schema can differ. The library matches fields by name and handles mismatches:

  • New field with default → Reader uses default
  • Removed field → Reader ignores it
  • Field in both → Types must be compatible

The trade-off: you must track which schema version was used to write each piece of data.
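
Continuing the fastavro sketch from above (library choice is an assumption), schema resolution is explicit: you pass both the writer's schema and the reader's schema, and the library matches fields by name and fills in defaults.

PYTHON
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

writer_schema = parse_schema({
    "type": "record", "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
})

# A newer reader that has added an email field with a default:
reader_schema = parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"userName": "martin"})

decoded = schemaless_reader(io.BytesIO(buf.getvalue()),
                            writer_schema,   # schema the data was written with
                            reader_schema)   # schema this code expects
print(decoded)  # {'userName': 'martin', 'email': None}  <- default filled in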


Modes of Data Flow

How does encoded data actually move between systems?

Through Databases

You write encoded data to a database, then read it back later—possibly with different code versions.

Challenge: A database might be updated by old code, new code, and everything in between. Every version must handle every other version's data.

Common pitfall: Old code reads a record, ignores new fields it doesn't understand, then writes it back—erasing those fields. Preserve unknown fields!

The Lost Field Problem

PLAINTEXT
1. New code writes: {name: "Alice", email: "alice@example.com", phone: "555-1234"}
2. Old code reads:  {name: "Alice", email: "alice@example.com"}  // phone unknown
3. Old code writes: {name: "Alice", email: "alice@updated.com"}  // phone lost!

Solution: Old code should preserve fields it doesn't understand.
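
A sketch of the fix with plain dicts (field names are illustrative): update only the fields this code owns and write back everything else unchanged.

PYTHON
import json

def update_email(stored: bytes, new_email: str) -> bytes:
    record = json.loads(stored)
    record["email"] = new_email  # change only what this code understands
    # Unknown fields such as "phone" remain in `record` and are written back intact.
    return json.dumps(record).encode("utf-8")

stored = b'{"name": "Alice", "email": "alice@example.com", "phone": "555-1234"}'
print(update_email(stored, "alice@updated.com"))
# {"name": "Alice", "email": "alice@updated.com", "phone": "555-1234"}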

Through Service Calls

When services communicate, the client and server may be running different code versions.

REST/HTTP APIs typically use JSON. Schema evolution relies on convention (add fields, don't remove; version your API).

RPC (Remote Procedure Call) tries to make network calls look like local function calls. Protocol Buffers, Thrift, and gRPC provide schema-based RPC.

Note

The key insight: a service has no control over when its clients upgrade. It must stay compatible with requests from older clients, possibly indefinitely in the case of public APIs.

Through Message Queues

Producers and consumers of messages are decoupled. A message might be consumed days after it was produced, by completely different code.

Message passing combines benefits of databases and RPC:

  • Messages are buffered by a broker, so they survive consumer downtime (like a database)
  • They are delivered to the recipient with low latency (like RPC)
  • The broker can load-balance messages across multiple consumers
  • Log-based brokers (such as Kafka) allow past messages to be replayed

Popular systems: RabbitMQ, Apache Kafka, Amazon SQS.
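
Because a consumer may be several releases behind the producer, it helps to tag each message with the schema (or format) version it was written with. A minimal sketch with JSON payloads (the envelope layout is an illustration, not a standard):

PYTHON
import json

SCHEMA_VERSION = 2  # bumped whenever the payload shape changes

def make_message(payload: dict) -> bytes:
    envelope = {"schema_version": SCHEMA_VERSION, "payload": payload}
    return json.dumps(envelope).encode("utf-8")

def consume(message: bytes) -> dict:
    envelope = json.loads(message)
    version = envelope.get("schema_version", 1)
    payload = envelope["payload"]
    if version < 2:
        payload.setdefault("email", None)  # upgrade old messages on the fly
    return payload

msg = make_message({"name": "Alice", "email": "alice@example.com"})
print(consume(msg))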


Data Flow and Compatibility Summary

Data Flow        Writer        Reader        Key Concern
Database         Any version   Any version   All versions must interoperate
Service call     Client        Server        Server must handle old clients
Message queue    Producer      Consumer      Consumer may lag behind producer

Compatibility Rules

For smooth evolution:

  1. Add fields as optional with sensible defaults
  2. Never reuse deleted field names/numbers
  3. Preserve unknown fields when reading and writing
  4. Version your schemas explicitly
  5. Test compatibility as part of CI/CD
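
Rule 5 is straightforward to automate: keep sample payloads produced by earlier releases in the repository and assert that the current reader still accepts them. A hedged pytest-style sketch (paths and field names are illustrative):

PYTHON
import json
from pathlib import Path

def read_person(payload: bytes) -> dict:
    record = json.loads(payload)
    return {
        "user_name": record["user_name"],
        "favorite_number": record.get("favorite_number"),  # optional, may be missing
    }

def test_reads_payloads_from_older_releases():
    # golden_payloads/ holds one sample message per past release (v1.json, v2.json, ...)
    for sample in Path("golden_payloads").glob("*.json"):
        person = read_person(sample.read_bytes())
        assert person["user_name"]  # every historical version must still parse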

Summary

Encoding and evolution are essential for long-lived systems:

Format             Pros                               Cons                         Best For
JSON/XML           Human-readable, universal          Verbose, no types            Public APIs, config
MessagePack        Compact JSON                       Still includes field names   Internal APIs needing JSON-like structure
Protocol Buffers   Compact, typed, schema evolution   Requires schema management   Internal services, gRPC
Avro               Most compact, dynamic schemas      Complex schema resolution    Big data, Hadoop ecosystem

Note

The choice of encoding affects far more than wire size. It determines how easily you can evolve your system, how different teams can interoperate, and how long your data remains readable.

Key takeaways:

  • Backward compatibility: New code reads old data
  • Forward compatibility: Old code reads new data (ignoring unknown fields)
  • Schema-based formats make evolution explicit and manageable
  • Field identifiers (tags or names) must remain stable
  • Default values handle missing fields gracefully