Modes of Dataflow in Distributed Systems


In distributed systems, the mode of dataflow determines how information is exchanged between processes that lack shared memory. This exchange involves encoding data into a sequence of bytes (serialization) for transmission and decoding it back into in-memory structures (deserialization) on the receiving end. The choice of dataflow mode impacts system performance, reliability, and evolvability.


Three Common Modes of Dataflow

1. Dataflow Through Databases

In this mode, processes write to and read from a shared database:

  • Encoding and decoding: The writing process serializes the data, while the reader deserializes it when accessing the database later.
  • Backward compatibility: Required so that newer processes can read data written by older versions, since data in the database typically outlives the code that wrote it.
  • Forward compatibility: Required so that older processes can read data written by newer versions, ideally preserving any fields they do not recognize when updating records.

A practical challenge arises when an older process performs a read-modify-write on a record that contains fields added by a newer version: if the old code decodes the record into a model that drops unknown fields and then writes it back, those new fields are silently lost. To avoid this, application code should preserve fields it does not recognize when updating records.
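As a minimal sketch in Python, assuming records are stored as JSON documents and that load_record and save_record are hypothetical helpers around whatever database client is in use, an update can be written so that unknown fields survive the round trip:

import json

def load_record(db, key):
    # Hypothetical helper: fetch the raw JSON document stored under `key`.
    return json.loads(db.get(key))

def save_record(db, key, record):
    # Hypothetical helper: write the document back under `key`.
    db.set(key, json.dumps(record))

def update_email(db, key, new_email):
    record = load_record(db, key)        # decode the full document as a dict
    record["email"] = new_email          # change only the field we care about
    # Do NOT rebuild the record from a fixed list of known fields: by writing
    # back the same dict, fields added by newer code survive the rewrite.
    save_record(db, key, record)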


2. Dataflow Through Services (REST and RPC)

In this mode, processes pass data directly to one another through APIs, usually exposed via HTTP in the case of REST or via specialized network protocols in the case of RPC.

  • REST APIs:
    • Typically rely on JSON for their payloads.
    • Evolve by making small changes, such as adding optional request/response fields while maintaining backward compatibility.

Example REST snippet:

GET /api/order/123  
Accept: application/json  
  • RPC Frameworks (e.g., gRPC, Thrift):
    • Use schemas (e.g., Protocol Buffers) to strictly define interface specifications.
    • Offer stronger type checking and better performance than REST, but require the schemas used by servers and clients to remain compatible as they evolve.

One challenge with services is maintaining compatibility when servers are updated before clients during a deployment. In practice this means backward compatibility on requests (a newer server must accept requests from older clients) and forward compatibility on responses (an older client must tolerate responses from a newer server), so the two sides can evolve independently without breaking functionality.
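As a rough illustration, assuming JSON over HTTP and made-up field names, the server can default missing optional request fields while the client ignores response fields it has never seen:

import json

# Server side (backward compatibility on requests): a newer server added an
# optional "currency" field; requests from older clients simply omit it.
def handle_create_order(raw_request: bytes) -> bytes:
    request = json.loads(raw_request)
    currency = request.get("currency", "USD")   # default when the field is absent
    order = {"id": 123, "amount": request["amount"], "currency": currency,
             "status": "created"}
    return json.dumps(order).encode("utf-8")

# Client side (forward compatibility on responses): an older client only knows
# "id" and "status", so it must not fail when newer servers add extra fields.
def parse_order_response(raw_response: bytes) -> dict:
    response = json.loads(raw_response)
    return {"id": response["id"], "status": response["status"]}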


3. Message-Passing Dataflow (Asynchronous Communication)

Asynchronous dataflow via a message broker (e.g., RabbitMQ, Kafka) adds flexibility by decoupling sender and receiver processes. Key traits include:

  • Decoupling: Senders publish messages without needing to know the identity of the recipients.
  • Reliability improvements: Brokers buffer messages if recipients are unavailable or overloaded.
  • Consumer models:
    • One-to-One Delivery: Targeted message queues.
    • One-to-Many Broadcasting: Publish-subscribe topic models.

Example Workflow with Kafka:

[Producers → Publish → Kafka Topic → Consumers Process Messages Independently]  

Message brokers, however, typically treat message payloads as opaque byte sequences that the application layer must encode and decode itself, so schema evolution depends on using backward- and forward-compatible encoding formats such as Avro or Protocol Buffers.
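As a sketch using the kafka-python client (the broker address, topic, and field names are illustrative, not prescriptive), JSON-encoded messages can flow through a topic while older consumers simply skip fields they do not recognize:

import json
from kafka import KafkaProducer, KafkaConsumer  # kafka-python package

# Producer: encode each event as JSON bytes before publishing to the topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("orders", {"order_id": 123, "status": "created", "priority": "high"})
producer.flush()

# Consumer: decode the JSON and read only the fields it knows about, so a
# producer that starts adding new fields (like "priority") does not break it.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    event = message.value
    print(event["order_id"], event["status"])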


Strategies for Evolvable Dataflow

  1. Preservation of Unknown Fields: Whether data flows through databases, services, or message brokers, avoid discarding fields that were added by newer processes and aren’t recognized by older versions. Preserving them keeps data intact during rolling upgrades.
  2. Leverage Versioning: Employ explicit schema versioning for APIs and payloads to handle compatibility gracefully over time (a sketch follows this list).
  3. Decouple Processing via Queues: Embrace message brokers to reduce dependence on synchronous request/response calls, improving fault tolerance and allowing components to scale independently.
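For strategy 2, a minimal sketch of explicit versioning, assuming a JSON envelope and hypothetical field names, might look like this:

import json

def encode_event(payload: dict, version: int = 2) -> bytes:
    # Wrap every payload in an envelope that records which schema produced it.
    return json.dumps({"schema_version": version, "payload": payload}).encode("utf-8")

def decode_event(raw: bytes) -> dict:
    envelope = json.loads(raw)
    payload = envelope["payload"]
    if envelope["schema_version"] == 1:
        # Hypothetical migration: version 1 used a single "name" field, while
        # newer readers expect separate "first_name" / "last_name" fields.
        first, _, last = payload.pop("name", "").partition(" ")
        payload["first_name"], payload["last_name"] = first, last
    return payload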

Conclusion

The choice of dataflow mode, whether through databases, services, or message brokers, dictates how processes exchange data and evolve within a distributed system. Selecting the right approach for a given use case, while incorporating compatibility strategies, ensures flexibility, resilience, and maintainability. As systems grow in complexity, understanding these modes enables engineers to design scalable and robust architectures.

