Navigating Unreliable Networks in Distributed Systems


Introduction to Unreliable Networks

Distributed systems rely heavily on networks for node-to-node communication. However, networks are inherently unreliable: messages may be delayed, lost, duplicated, or delivered out of order. Designing reliable distributed systems requires acknowledging and tolerating these faults, especially in asynchronous environments where there are no guarantees about timing or delivery.


Common Network Faults in Practice

Computer networks have been a focus of optimization for decades, yet they remain fundamentally unreliable. Key problems include:

  1. Packet Loss: Data packets may be dropped due to congestion, misconfiguration, or physical link failures.
  2. Delayed Responses: Network queues and overloaded links can dramatically increase response times.
  3. Node Crashes: A requesting or responding node might crash, or be restarted, in the middle of an exchange.
  4. Network Partitions: Certain parts of the network become isolated entirely, unable to communicate with the rest of the system.

Modern datacenters experience network disruptions regularly: a medium-sized datacenter may see around a dozen network faults each month, affecting anything from a single machine to an entire rack.


Handling Unreliable Networks with Protocols

Distributed systems handle unreliable networks by employing multiple layers of error-handling mechanisms.

Error Correction and Detection

Techniques like error-correcting codes let the physical and link layers recover from minor faults. For example, data transmitted over a wireless network with moderate interference can still be reconstructed, because redundant bits allow the receiver to detect and correct a limited number of flipped bits.
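To make this concrete, here is a minimal sketch (in Python, purely for illustration) of a Hamming(7,4) code, one of the simplest error-correcting codes: three parity bits protect four data bits, so the receiver can locate and flip a single corrupted bit without asking for a retransmission. Real link layers use far more sophisticated codes, but the principle is the same.

```python
# Illustrative Hamming(7,4) encoder/decoder: corrects any single flipped bit.
def hamming74_encode(d1, d2, d3, d4):
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code):
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3   # 0 means no single-bit error detected
    if error_pos:
        c[error_pos - 1] ^= 1          # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]]    # recovered data bits

codeword = hamming74_encode(1, 0, 1, 1)
codeword[5] ^= 1                        # simulate interference flipping one bit
assert hamming74_decode(codeword) == [1, 0, 1, 1]
```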

Protocols for Network Reliability

  • TCP (Transmission Control Protocol): TCP handles packet retransmission, duplicate elimination, and reordering, offering a reliable transport layer on top of the unreliable Internet Protocol (IP). While it solves many problems at the transport level, it cannot address higher-level concerns such as application-level timeouts or unbounded network delays.
  • UDP (User Datagram Protocol): For latency-sensitive applications like video streaming, UDP avoids retransmissions, sacrificing reliability for speed.
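The difference is easiest to see from the application's point of view. The sketch below (Python, assuming a hypothetical service listening on localhost port 9000) sends the same payload over TCP and over UDP: only the TCP path gets retransmission, ordering, and duplicate suppression from the operating system, and only the TCP calls can block or fail while the OS tries to provide them.

```python
import socket

PAYLOAD = b"sensor-reading:42"
ADDR = ("127.0.0.1", 9000)   # hypothetical service, for illustration only

# TCP: the OS retransmits, deduplicates, and reorders for us, but connect()
# and sendall() may block (or raise) while it copes with a flaky network.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as tcp:
    tcp.settimeout(5.0)      # don't wait forever (see the timeouts discussion below)
    tcp.connect(ADDR)
    tcp.sendall(PAYLOAD)

# UDP: fire and forget; the datagram may be lost, duplicated, or reordered,
# and the sender is never told either way.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp:
    udp.sendto(PAYLOAD, ADDR)
```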

Challenges of Asynchronous Networks

In an asynchronous network, if no response arrives, the sender cannot tell what happened to its message:

  • Was the request lost?
  • Did the recipient crash, or is it just overloaded?
  • Was the reply message delayed or dropped?

A sender waiting indefinitely for a response hears nothing but silence, and without a reply it cannot distinguish between these cases, making fault diagnosis nearly impossible.

Timeouts: A Practical Workaround

Timeouts allow systems to abort ill-fated attempts and retry. However, timeout-based detection is itself imperfect:

  • A request might still have succeeded even though the sender gave up waiting; retrying it may then cause the same action to be performed twice.
  • A premature timeout can misclassify a slow node as a failed one, triggering recovery actions (such as transferring its load to other nodes) that make an already strained system worse.

Choosing a timeout therefore requires care: it can be tuned experimentally, based on the observed distribution of response times and the application's needs, or adjusted automatically as network conditions change.
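One practical approach, sketched below, is to adapt the timeout to observed response times instead of hard-coding a constant. This is a simplified version of the estimator TCP uses for its retransmission timeout: an exponentially weighted moving average of the round-trip time plus a multiple of its variability. The constants here are illustrative defaults, not recommendations.

```python
# Adaptive timeout based on observed round-trip times (illustrative sketch).
class AdaptiveTimeout:
    def __init__(self, initial_timeout=1.0, alpha=0.125, beta=0.25):
        self.srtt = None                 # smoothed round-trip time (seconds)
        self.rttvar = 0.0                # smoothed RTT variability
        self.timeout = initial_timeout
        self.alpha, self.beta = alpha, beta

    def observe(self, rtt):
        """Feed in the measured round-trip time of a successful request."""
        if self.srtt is None:
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt
        self.timeout = self.srtt + 4 * self.rttvar   # expected RTT plus a safety margin

est = AdaptiveTimeout()
for rtt in [0.050, 0.055, 0.060, 0.250, 0.052]:       # seconds, with one spike
    est.observe(rtt)
print(f"next timeout: {est.timeout:.3f}s")
```

A node that briefly slows down inflates the variability term and widens the timeout, reducing the chance of declaring it dead prematurely; once responses stabilise, the timeout shrinks again.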


Network Congestion and Queueing

Queueing delays occur when multiple nodes attempt to route traffic through a bottlenecked link simultaneously:

  1. Packets build up in queues at network switches, waiting their turn for transmission. In extreme cases, queues fill up entirely, resulting in packet drops.
  2. At the receiving end, incoming requests are queued until the application can process them. On an overloaded server this queue keeps growing, adding further delay and creating a vicious cycle.

Effective congestion-avoidance strategies include flow and congestion control (such as TCP's, which slows the sender down when the receiver or the network cannot keep up) and monitoring that distributes load evenly across nodes.
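Applications can apply the same idea at their own level: bound the queues and push back on producers rather than accepting unlimited work. The sketch below (Python, illustrative only) uses a bounded queue so that a fast producer blocks briefly or sheds requests once the queue is full, a simple form of backpressure, instead of letting queueing delay grow without bound.

```python
import queue
import threading
import time

requests = queue.Queue(maxsize=100)      # bounded queue: this is the backpressure

def producer():
    for i in range(500):
        try:
            # Blocks while the queue is full; gives up rather than waiting forever.
            requests.put(f"request-{i}", timeout=2.0)
        except queue.Full:
            print(f"shedding request-{i}: consumer cannot keep up")

def consumer():
    while True:
        requests.get()
        time.sleep(0.01)                 # simulate slow processing
        requests.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
requests.join()                          # wait for the backlog to drain
```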


Building a Reliable System from Unreliable Networks

Engineers can build robust systems atop unreliable networks using the following approaches:

  1. Retry Logic with Idempotency
    • Because a timed-out request may in fact have succeeded, a retry can deliver the same request twice. Systems must therefore handle repeated requests gracefully, ensuring the operation takes effect only once (see the sketch after this list).
  2. Quorum Systems
    • Distributed databases often use quorum reads and writes, where an operation succeeds once a majority of replicas have acknowledged it, removing the dependency on any single node's response.
  3. Monitoring and Failure Simulations
    • Regular stress tests and tools like Chaos Monkey mimic network faults to uncover system weaknesses, ensuring adequate recovery logic is in place.
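As an illustration of the first point, the sketch below (Python, with a hypothetical in-memory server) tags each logical operation with a client-generated request ID. The server remembers which IDs it has already applied, so a request retried after a timeout is acknowledged again without being executed a second time.

```python
import uuid

class AccountServer:
    """Hypothetical server that deduplicates retries by request ID."""
    def __init__(self):
        self.balance = 0
        self.applied = {}                    # request_id -> previous result

    def deposit(self, request_id, amount):
        if request_id in self.applied:       # duplicate retry: do not re-apply
            return self.applied[request_id]
        self.balance += amount
        self.applied[request_id] = self.balance
        return self.balance

server = AccountServer()
req_id = str(uuid.uuid4())                   # one ID per logical operation

server.deposit(req_id, 100)                  # original attempt (reply was lost)
server.deposit(req_id, 100)                  # client retries after a timeout
assert server.balance == 100                 # the deposit took effect only once
```

In a real system the table of applied request IDs would itself need to be durable and eventually garbage-collected, but the core idea of making retries safe is the same.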

Conclusion

Unreliable networks are a reality in distributed systems, but by leveraging well-designed protocols and thoughtful error-handling mechanisms, engineers can achieve resilience and reliability. Perfect reliability remains elusive; the goal of a robust design is for the system as a whole to keep working even when the network misbehaves under high demand. Handling this unpredictability is a cornerstone of successful distributed systems.

