Designing Data-Intensive Applications

Encoding and Evolution

Applications change over time. Features are added, data models evolve, and requirements shift. In a running system, old and new code—and old and new data formats—must coexist. This chapter explores how to encode data and evolve schemas without breaking things.

The Problem of Change

Consider a typical deployment scenario:

  • Server-side applications: Rolling upgrades deploy new code to a few nodes at a time
  • Client-side applications: Users may not update for months
  • Stored data: Data written years ago must still be readable

This means:

  • Backward compatibility: New code must read data written by old code
  • Forward compatibility: Old code must read data written by new code (often ignored, but crucial for rolling deployments)

Note

Forward compatibility is trickier—old code must somehow handle unknown fields without breaking. This requires careful schema design.
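
A minimal sketch of the idea with plain Python dicts and JSON (the field names are illustrative): the older reader simply ignores keys it does not recognize.

PYTHON
import json

# Fields this (older) version of the code knows about.
KNOWN_FIELDS = ("user_name", "favorite_number")

def read_person(payload: bytes) -> dict:
    record = json.loads(payload)
    # Forward compatibility: silently ignore fields added by newer writers
    # (e.g. "email") instead of rejecting the whole record.
    return {field: record.get(field) for field in KNOWN_FIELDS}

# A record written by newer code that already includes an "email" field:
new_payload = b'{"user_name": "martin", "favorite_number": 1337, "email": "m@example.com"}'
print(read_person(new_payload))  # {'user_name': 'martin', 'favorite_number': 1337}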


Encoding Formats

When data lives only in memory, you use language-specific structures (objects, arrays, hashmaps). When you send data over the network or write to disk, you need to encode it into a byte sequence.

Language-Specific Formats

Java's Serializable, Python's pickle, Ruby's Marshal—convenient but problematic:

Problem                    Impact
Language lock-in           Can't share data across languages
Security vulnerabilities   Deserialization can instantiate arbitrary classes
Poor versioning            Schema evolution is often an afterthought
Inefficiency               Often not optimized for size or speed

Warning

Language-specific serialization is tempting for quick prototypes but problematic for anything crossing system boundaries. Avoid for inter-service communication.
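
A short Python sketch of why this goes wrong across boundaries: the pickled bytes embed the Python class path, so only compatible Python code can read them, and unpickling untrusted bytes can execute arbitrary code.

PYTHON
import pickle

class Person:
    def __init__(self, user_name, favorite_number):
        self.user_name = user_name
        self.favorite_number = favorite_number

data = pickle.dumps(Person("martin", 1337))

# The bytes reference the module and class name ("__main__.Person"), so only
# a Python process that can import that exact class can unpickle them.
# A Java or Go service cannot read this at all, and pickle.loads() on
# untrusted input can run arbitrary code during deserialization.
restored = pickle.loads(data)  # safe here only because we produced the bytes ourselves
print(restored.user_name, restored.favorite_number)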

JSON, XML, and CSV

Text-based formats are human-readable and widely supported:

JSON has become the dominant format for web APIs:

JSON
{
  "userName": "martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}

Advantages:

  • Human-readable
  • Ubiquitous language support
  • Schema-less (flexible)

Disadvantages:

  • No distinction between integers and floats
  • No binary data support (must Base64 encode)
  • No schema enforcement
  • Verbose (wastes bandwidth)
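
Two of these limitations are easy to see in a few lines of Python (a sketch using only the standard library): binary data must be smuggled in as Base64 text, and the number type carries no integer/float distinction.

PYTHON
import base64
import json

# JSON has no binary type, so raw bytes must be Base64-encoded into a string,
# which adds roughly 33% size overhead.
avatar = bytes(range(16))  # some binary payload
doc = {"userName": "martin",
       "avatar": base64.b64encode(avatar).decode("ascii")}
encoded = json.dumps(doc)
print(encoded)

# Decoding requires out-of-band knowledge that "avatar" is Base64, not plain text.
decoded = json.loads(encoded)
assert base64.b64decode(decoded["avatar"]) == avatar

# The format itself also has no integer/float distinction: parsers in some
# languages (e.g. JavaScript) read every number as a double, so integers
# above 2**53 can silently lose precision.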

XML is more structured but verbose:

XML
<user>
  <userName>martin</userName>
  <favoriteNumber>1337</favoriteNumber>
  <interests>
    <interest>daydreaming</interest>
    <interest>hacking</interest>
  </interests>
</user>

CSV is fine for tabular data but has many edge cases (escaping, encoding, no types).
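
For instance, a field containing a comma or a quote has to be quoted and escaped, and every consumer must agree on the same dialect. Python's csv module handles the escaping, but the format still carries no type information:

PYTHON
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["martin", 1337, 'likes "hacking", daydreaming'])
print(buf.getvalue())
# martin,1337,"likes ""hacking"", daydreaming"

# Reading it back: everything comes out as a string; 1337 is no longer a number.
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)  # ['martin', '1337', 'likes "hacking", daydreaming']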


Binary Encoding

For internal services where efficiency matters, binary formats are better. They're more compact and faster to parse.

MessagePack, BSON, etc.

Binary JSON variants encode JSON-like structures more efficiently:

PLAINTEXT
JSON: {"userName":"martin","favoriteNumber":1337}
Bytes: 47 characters
 
MessagePack: Same structure
Bytes: 32 bytes

The savings aren't dramatic because field names are still included in every record.
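
A hedged sketch of that comparison, assuming the third-party msgpack package (msgpack-python) is installed:

PYTHON
import json
import msgpack  # third-party: pip install msgpack

record = {"userName": "martin", "favoriteNumber": 1337}

as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)

print(len(as_json), len(as_msgpack))  # roughly 43 vs. 35 bytes
# Field names ("userName", "favoriteNumber") are still spelled out in both
# encodings, which is why the savings are modest.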


Schema-Based Binary Formats

Real efficiency gains come from separating schema from data. Apache Thrift, Protocol Buffers, and Apache Avro take this approach.

Protocol Buffers

Define a schema in .proto files:

PROTOBUF
message Person {
  required string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
}

Generated code provides type-safe accessors. The encoded data contains only:

  • Field tag numbers (1, 2, 3) instead of names
  • Type indicators
  • Values

Protocol Buffers Encoding

For the same data as above:

PLAINTEXT
Field 1 (string): "martin"   → 0a 06 6d 61 72 74 69 6e
Field 2 (varint): 1337       → 10 b9 0a
Field 3 (string): repeated   → 1a 0b 64 61 79 64 72 65 61 6d 69 6e 67
                               1a 07 68 61 63 6b 69 6e 67

Total: 33 bytes (vs 81 bytes for the minified JSON shown earlier)
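
In practice you never write those bytes by hand: protoc compiles the .proto file into a module (commonly person_pb2.py for person.proto; that module name is an assumption here), and the generated class handles the encoding. A sketch of what usage looks like:

PYTHON
# Assumes `protoc --python_out=. person.proto` produced person_pb2.py
from person_pb2 import Person  # hypothetical generated module

p = Person()
p.user_name = "martin"
p.favorite_number = 1337
p.interests.extend(["daydreaming", "hacking"])

data = p.SerializeToString()  # compact bytes: tags + type indicators + values only
print(len(data))              # roughly 33 bytes for this record

restored = Person.FromString(data)
print(restored.user_name, list(restored.interests))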

Apache Thrift

Similar to Protocol Buffers, with two binary encodings:

  • BinaryProtocol: Simple, fast
  • CompactProtocol: More space-efficient (uses variable-length integers)

Apache Avro

Avro takes a different approach: the encoded data contains no field tags at all. Instead, the reader must know the exact schema the data was written with.

JSON
{
  "type": "record",
  "name": "Person",
  "fields": [
    { "name": "userName", "type": "string" },
    { "name": "favoriteNumber", "type": ["null", "long"], "default": null },
    { "name": "interests", "type": { "type": "array", "items": "string" } }
  ]
}

The encoded data is just values, in schema order. This makes it extremely compact but requires careful schema management.
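
A sketch of that using the third-party fastavro library (an assumption; the standard avro package works similarly): schemaless_writer emits only the values, so the schema has to travel separately.

PYTHON
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader  # pip install fastavro

schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
})

record = {"userName": "martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

buf = io.BytesIO()
schemaless_writer(buf, schema, record)
data = buf.getvalue()
print(len(data))  # roughly 32 bytes: just the values, no field names or tags

# Reading back requires the writer's schema:
print(schemaless_reader(io.BytesIO(data), schema))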


Schema Evolution

Schemas change over time. The key question: how do you maintain compatibility?

Field Tags Are Forever

In Protocol Buffers and Thrift, field tags (the numbers) are the stable identifiers. Names can change, but tags must remain constant.

Adding a field:

PROTOBUF
message Person {
  required string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
  optional string email = 4;  // NEW FIELD
}
  • Old code ignores the unknown tag 4 (forward compatibility ✓)
  • New code sees tag 4 missing, uses default (backward compatibility ✓)

Removing a field:

  • Only optional fields can be removed
  • Never reuse the tag number for a different field

Warning

Never change a field's tag number or type. This breaks both forward and backward compatibility.

Avro Schema Evolution

Avro uses schema resolution—the writer's schema and reader's schema can differ. The library matches fields by name and handles mismatches:

  • New field with default → Reader uses default
  • Removed field → Reader ignores it
  • Field in both → Types must be compatible

The trade-off: you must track which schema version was used to write each piece of data.
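
Continuing the fastavro sketch from above (library choice is an assumption), schema resolution is explicit: you pass both the writer's schema and the reader's schema, and the library matches fields by name and fills in defaults.

PYTHON
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

writer_schema = parse_schema({
    "type": "record", "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
})

# A newer reader that has added an email field with a default:
reader_schema = parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"userName": "martin"})

decoded = schemaless_reader(io.BytesIO(buf.getvalue()),
                            writer_schema,   # schema the data was written with
                            reader_schema)   # schema this code expects
print(decoded)  # {'userName': 'martin', 'email': None}  <- default filled in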


Modes of Data Flow

How does encoded data actually move between systems?

Through Databases

You write encoded data to a database, then read it back later—possibly with different code versions.

Challenge: A database might be updated by old code, new code, and everything in between. Every version must handle every other version's data.

Common pitfall: Old code reads a record, ignores new fields it doesn't understand, then writes it back—erasing those fields. Preserve unknown fields!

The Lost Field Problem

PLAINTEXT
1. New code writes: {name: "Alice", email: "alice@example.com", phone: "555-1234"}
2. Old code reads:  {name: "Alice", email: "alice@example.com"}  // phone unknown
3. Old code writes: {name: "Alice", email: "alice@updated.com"}  // phone lost!

Solution: Old code should preserve fields it doesn't understand.
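
A sketch of the fix with plain dicts (field names are illustrative): update only the fields this code owns and write back everything else unchanged.

PYTHON
import json

def update_email(stored: bytes, new_email: str) -> bytes:
    record = json.loads(stored)
    record["email"] = new_email  # change only what this code understands
    # Unknown fields such as "phone" remain in `record` and are written back intact.
    return json.dumps(record).encode("utf-8")

stored = b'{"name": "Alice", "email": "alice@example.com", "phone": "555-1234"}'
print(update_email(stored, "alice@updated.com"))
# {"name": "Alice", "email": "alice@updated.com", "phone": "555-1234"}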

Through Service Calls

When services communicate, the client and server may be running different code versions.

REST/HTTP APIs typically use JSON. Schema evolution relies on convention (add fields, don't remove; version your API).

RPC (Remote Procedure Call) tries to make network calls look like local function calls. Protocol Buffers, Thrift, and gRPC provide schema-based RPC.

Note

The key insight: a service has no control over when its clients upgrade. It must stay compatible with requests from older clients, possibly indefinitely in the case of public APIs.

Through Message Queues

Producers and consumers of messages are decoupled. A message might be consumed days after it was produced, by completely different code.

Message passing combines benefits of databases and RPC:

  • Messages are buffered by a broker, so they survive consumer downtime (like a database)
  • They are delivered to the recipient with low latency (like RPC)
  • The broker can load-balance messages across multiple consumers
  • Log-based brokers (such as Kafka) allow past messages to be replayed

Popular systems: RabbitMQ, Apache Kafka, Amazon SQS.
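
Because a consumer may be several releases behind the producer, it helps to tag each message with the schema (or format) version it was written with. A minimal sketch with JSON payloads (the envelope layout is an illustration, not a standard):

PYTHON
import json

SCHEMA_VERSION = 2  # bumped whenever the payload shape changes

def make_message(payload: dict) -> bytes:
    envelope = {"schema_version": SCHEMA_VERSION, "payload": payload}
    return json.dumps(envelope).encode("utf-8")

def consume(message: bytes) -> dict:
    envelope = json.loads(message)
    version = envelope.get("schema_version", 1)
    payload = envelope["payload"]
    if version < 2:
        payload.setdefault("email", None)  # upgrade old messages on the fly
    return payload

msg = make_message({"name": "Alice", "email": "alice@example.com"})
print(consume(msg))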


Data Flow and Compatibility Summary

Data Flow        Writer        Reader        Key Concern
Database         Any version   Any version   All versions must interoperate
Service call     Client        Server        Server must handle old clients
Message queue    Producer      Consumer      Consumer may lag behind producer

Compatibility Rules

For smooth evolution:

  1. Add fields as optional with sensible defaults
  2. Never reuse deleted field names/numbers
  3. Preserve unknown fields when reading and writing
  4. Version your schemas explicitly
  5. Test compatibility as part of CI/CD
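
Rule 5 is straightforward to automate: keep sample payloads produced by earlier releases in the repository and assert that the current reader still accepts them. A hedged pytest-style sketch (paths and field names are illustrative):

PYTHON
import json
from pathlib import Path

def read_person(payload: bytes) -> dict:
    record = json.loads(payload)
    return {
        "user_name": record["user_name"],
        "favorite_number": record.get("favorite_number"),  # optional, may be missing
    }

def test_reads_payloads_from_older_releases():
    # golden_payloads/ holds one sample message per past release (v1.json, v2.json, ...)
    for sample in Path("golden_payloads").glob("*.json"):
        person = read_person(sample.read_bytes())
        assert person["user_name"]  # every historical version must still parse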

Summary

Encoding and evolution are essential for long-lived systems:

Format             Pros                               Cons                         Best For
JSON/XML           Human-readable, universal          Verbose, no types            Public APIs, config
MessagePack        Compact JSON                       Still includes field names   Internal APIs needing JSON-like structure
Protocol Buffers   Compact, typed, schema evolution   Requires schema management   Internal services, gRPC
Avro               Most compact, dynamic schemas      Complex schema resolution    Big data, Hadoop ecosystem

Note

The choice of encoding affects far more than wire size. It determines how easily you can evolve your system, how different teams can interoperate, and how long your data remains readable.

Key takeaways:

  • Backward compatibility: New code reads old data
  • Forward compatibility: Old code reads new data (ignoring unknown fields)
  • Schema-based formats make evolution explicit and manageable
  • Field identifiers (tags or names) must remain stable
  • Default values handle missing fields gracefully