Synchronizing Databases with Real-Time Streams


Introduction

The relationship between databases and streams goes beyond simple asynchronous updates or messaging: they are deeply interconnected concepts. At the core, databases can be seen as the materialized state derived from an unbounded stream of change events. This subchapter explores how streams sync with databases, the role of change data capture (CDC) and event sourcing, and the advantages of thinking about state using an immutable event-driven model.


Keeping Systems in Sync

Modern applications use databases alongside systems such as caches, search indexes, and data warehouses, each optimized for a specific role. As the same data appears in multiple systems, those systems must be kept in sync to avoid diverging state:

  1. Batch ETL: The traditional method: entire database snapshots are periodically extracted, transformed (e.g., for analytics), and loaded into warehouses or other systems.
    • Problem: High latency; downstream systems lag behind by hours or days between updates.
  2. Dual Writes: Applications update both the database and external systems simultaneously.
    • Problem: Race conditions between concurrent update events lead to inconsistencies, as illustrated by interleaving writes reaching systems in different orders.
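The dual-write race can be sketched as follows. This is a deliberately forced interleaving using plain in-memory dictionaries standing in for a database and a cache; it is an illustration of the failure mode, not real client code:

```python
# Two clients each write a value to both the database and the cache.
# Because the two writes are not coordinated, the systems can apply
# them in different orders and end up permanently disagreeing.

database = {}
cache = {}

# Clients A and B both update key "x". The interleaving below is the
# problematic one: B's database write lands last, but A's cache write
# lands last, so the two systems disagree forever.
database["x"] = "A"   # client A -> database
database["x"] = "B"   # client B -> database (B wins here)
cache["x"] = "B"      # client B -> cache
cache["x"] = "A"      # client A -> cache (A wins here)

assert database["x"] != cache["x"]  # the two systems have diverged
```

Neither client did anything wrong individually; the inconsistency arises purely from the lack of a single agreed-upon ordering of writes, which is exactly what CDC provides.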

Change Data Capture (CDC)

Instead of relying on batch snapshots or error-prone dual writes, CDC tracks and extracts database changes in real-time:

  1. Databases produce replication logs as they process write operations. By monitoring these logs and forwarding changes to downstream consumers (e.g., cache, search, analytics), CDC creates robust synchronization pipelines.
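The idea can be sketched with an in-memory change log standing in for the replication log. The `Change` record and the downstream consumers here are invented for illustration, not a real connector API:

```python
from dataclasses import dataclass

@dataclass
class Change:
    seq: int      # position in the replication log (a total order)
    key: str
    value: str

# The database's replication log: an ordered list of committed writes.
change_log = [
    Change(1, "user:1", "Alice"),
    Change(2, "user:2", "Bob"),
    Change(3, "user:1", "Alicia"),  # an update to an existing key
]

cache = {}
search_index = {}

# Every consumer applies changes in log order, so each downstream
# system converges on the same state as the primary database.
for change in sorted(change_log, key=lambda c: c.seq):
    cache[change.key] = change.value
    search_index[change.key] = change.value.lower()

# Both consumers now agree: user:1 holds the latest value, "Alicia".
```

Because all consumers read from the same totally ordered log, the interleaving problem of dual writes cannot occur: there is only one order of events.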

CDC Implementation

  1. Log-Based Replication: Changes captured directly from the database’s write-ahead log (WAL), preserving write order.
    • Examples: Apache Kafka connectors, Debezium.
  2. Trigger-Based Replication: Database triggers log changes manually, updating a CDC table for downstream retrieval.
    • Drawback: Fragile and incurs significant performance overhead on every write.

By ensuring downstream systems apply changes in the same sequence as the primary database, CDC eliminates problems tied to race conditions and update divergence.


Event Sourcing

While CDC tracks changes at the database level, event sourcing is an architectural pattern where the application itself stores domain-level events (not direct state changes) in an immutable log:

  1. Events define what happened, independent of how that event affects stored state.
    • Example: Instead of mutating a table directly to reflect that a “seat was reserved,” event sourcing appends an “event log” entry saying, “Seat X reserved for User Y.”
  2. Application state becomes a derived materialized view of the event stream. If new requirements arise (e.g., showing reservation history), reprocessing the stream suffices without modifying existing state.
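A sketch of the pattern, using the seat-reservation example above (the event shapes and helper names are invented for illustration):

```python
# The immutable event log is the system of record; current state and
# any history view are both derived by folding over it.
events = [
    {"type": "seat_reserved", "seat": "12A", "user": "alice"},
    {"type": "seat_reserved", "seat": "12B", "user": "bob"},
    {"type": "seat_released", "seat": "12A", "user": "alice"},
]

def current_reservations(log):
    """Materialized view: which seats are currently held, and by whom."""
    seats = {}
    for e in log:
        if e["type"] == "seat_reserved":
            seats[e["seat"]] = e["user"]
        elif e["type"] == "seat_released":
            seats.pop(e["seat"], None)
    return seats

def reservation_history(log, seat):
    """A new requirement, served by reprocessing the same stream."""
    return [e for e in log if e["seat"] == seat]

current_reservations(events)        # only 12B is still reserved
reservation_history(events, "12A")  # full audit trail for seat 12A
```

Note that `reservation_history` required no schema change and no backfill: the information was already in the log, waiting to be derived.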

State, Streams, and Immutability

Immutability complements both CDC and event sourcing by addressing data consistency and recovery challenges:

  • Immutable Event Streams: Represent state changes over time, facilitating reproduction of any application state by replaying the log.
  • Log Compaction: Discards overwritten versions of each key, keeping only the most recent value so the log stays bounded while derived state can still be rebuilt from it.

Use Cases:

  1. Database Recovery: Crash recovery by replaying event logs to rebuild the latest consistent state.
  2. Debugging and Auditing: Immutable logs allow complete traceability of all historical actions, preventing silent overwrites of important updates.

Conclusion

By bridging the gap between databases and streams through innovations like CDC, event sourcing, and immutability, modern systems achieve real-time synchronization at scale. These techniques not only eliminate issues like race conditions but also provide resilience and flexibility for evolving system designs. Adopting this mindset decouples immediate application needs from fixed schemas, enabling rich downstream processing and integration capabilities.

Series: Designing Data-Intensive Applications, Part 36 of 41
