Faults and Partial Failures in Distributed Systems


Introduction to Faults and Failures

Distributed systems, by their nature, span multiple computers communicating over unreliable networks. Partial failures are an inherent challenge: some components may fail unpredictably while others continue to function. Unlike a single computer, where a fault usually brings the whole system down in a predictable way, a distributed system must keep operating amid an ongoing background of partial faults.

Faults are especially hard to handle in such systems because they are often nondeterministic: failures and delays do not behave consistently from one occurrence to the next, which makes symptoms difficult to diagnose and programs difficult to reason about.


What are Partial Failures?

In a distributed environment, partial failures occur when some nodes or message paths fail while others function correctly. For instance:

  • A subset of servers in a data center might go offline due to network issues, even though others remain operational.
  • Failure of one power distribution unit (PDU) may leave part of a system functional while another part crashes entirely.

Real-World Anecdote

A hypoglycemic driver once crashed into a data center’s HVAC system, leaving entire racks offline while others remained fine: a vivid example of the unpredictable failure modes that distributed systems must tolerate.


Why Partial Failures Are Hard

  1. Nondeterministic Network Behavior
    • Message delivery and acknowledgment may become delayed, lost, or duplicated irregularly, leading to ambiguous states where you can’t tell whether a request succeeded or failed.
  2. Dependency on Multiple Nodes
    • Operations across multiple machines can fail inconsistently, resulting in unpredictable outcomes.
  3. System Recovery Can Be Ambiguous
    • During partitions or outages, systems may need to make decisions based on incomplete information, raising additional challenges like determining whether a failure requires retries, rollbacks, or leadership changes.
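The ambiguity described in point 1 can be made concrete with a small sketch (all names here are hypothetical, not a real API): a reply can be lost *after* the server has done the work, so a client that times out cannot tell success from failure. Attaching an idempotency key to the operation makes it safe to retry despite that uncertainty.

```python
import random
import uuid

class PaymentServer:
    """Toy in-memory 'server' that applies each charge exactly once
    per idempotency key, so retries after a timeout are safe."""
    def __init__(self):
        self.processed = {}  # idempotency key -> result

    def charge(self, key, amount):
        if key in self.processed:          # duplicate request: replay result
            return self.processed[key]
        result = f"charged {amount}"
        self.processed[key] = result
        return result

def unreliable_call(server, key, amount):
    """Simulate a network where the reply can be lost even though
    the server has already performed the operation."""
    result = server.charge(key, amount)    # work happens on the server...
    if random.random() < 0.3:              # ...but the reply may never arrive
        raise TimeoutError("no reply: did the charge happen? unknown")
    return result

def charge_with_retries(server, amount, attempts=5):
    key = str(uuid.uuid4())                # one key for all retries
    for _ in range(attempts):
        try:
            return unreliable_call(server, key, amount)
        except TimeoutError:
            continue                       # safe: the key deduplicates
    raise RuntimeError("gave up after repeated timeouts")
```

Even if several attempts time out, the server records the charge only once, because every retry carries the same idempotency key.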

Cloud Computing vs. Supercomputing: Contrasting Philosophies

Supercomputers

  • These systems often follow a stop-and-restart model:
    • Jobs periodically checkpoint their state to durable storage.
    • The entire system halts when faults occur, and recovery begins from the last consistent state after fixing the fault.
    • This prioritizes correctness over availability.
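The stop-and-restart model above can be sketched as a simple checkpointing loop (file paths and state shape are illustrative): the job periodically saves its full state, and after a crash it resumes from the most recent checkpoint instead of starting over.

```python
import json
import os
import tempfile

# Illustrative checkpoint location; a real job would use durable storage.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "job_checkpoint.json")

def save_checkpoint(state):
    # Write to a temp file and rename, so a crash mid-write
    # cannot leave a corrupted checkpoint behind.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)            # resume from last consistent state
    return {"step": 0, "total": 0}         # fresh start

def run_job(steps=100, checkpoint_every=10):
    state = load_checkpoint()
    while state["step"] < steps:
        state["total"] += state["step"]    # the "work" of one step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state)
    return state["total"]
```

If the process dies between checkpoints, only the work since the last save is repeated; the answer is the same either way, which is exactly the correctness-over-availability trade the supercomputer model makes.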

Cloud Computing

  • In contrast, cloud systems prioritize availability:
    • Fault tolerance must be built into the software because failures are treated as inevitable. Systems like cloud services need to remain operational even during node failures, network interruptions, or rolling upgrades.
    • Cloud components are often commodity hardware, making failures more frequent, but the software is designed to handle such imperfections resiliently.
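One common building block of this software-level fault tolerance is retrying transient failures with exponential backoff and jitter. A minimal sketch, with a hypothetical flaky dependency standing in for a real remote service:

```python
import random
import time

def call_with_backoff(op, max_attempts=5, base_delay=0.05):
    """Retry a flaky operation with exponential backoff and jitter,
    treating transient failure as normal rather than exceptional."""
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                      # exhausted: surface the failure
            # Full jitter: sleep a random time up to base * 2^attempt,
            # so many retrying clients don't hammer the service in lockstep.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# Hypothetical dependency that fails twice, then recovers.
calls = {"n": 0}
def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient node failure")
    return "ok"
```

Calling `call_with_backoff(flaky_service)` rides out the two transient failures and returns `"ok"` on the third attempt, without the caller ever seeing an error.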

Supercomputers’ rigid fault-handling mechanisms contrast with the adaptive, fault-tolerant strategies required in cloud environments, emphasizing the need to build reliable systems from inherently unreliable components.


Building Resilient Distributed Systems

  1. Design for Fault Tolerance
    • Components with built-in fault-tolerance mechanisms, such as quorum protocols and replication, can prevent a single failure from cascading into system-wide outages.
    • For example, rolling upgrades allow individual nodes to undergo maintenance without interrupting service continuity.
  2. Accept Continuous Failures
    • With thousands of nodes, the assumption is that something is always broken. Engineers often adopt tools like Chaos Monkey to induce controlled failures and ensure systems behave predictably under unexpected stress.
  3. Make Use of Commodity Hardware
    • In cloud environments, replacing failed nodes is quicker and cheaper than guaranteeing component reliability. Systems should anticipate failures and recover automatically, either by reallocating workloads or spinning up new instances.

    Example: If a virtual machine underperforms, operators terminate it and start a new instance without affecting ongoing operations.
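The quorum idea from point 1 can be sketched in a few lines (a toy model, not a real protocol): with N replicas, writes that reach W nodes and reads that consult R nodes overlap whenever W + R > N, so a single failed node blocks neither reads nor writes.

```python
class Replica:
    def __init__(self):
        self.store = {}    # key -> (version, value)
        self.alive = True

def quorum_write(replicas, key, version, value, w):
    acks = 0
    for r in replicas:
        if r.alive:
            r.store[key] = (version, value)
            acks += 1
    if acks < w:
        raise RuntimeError("write quorum not reached")

def quorum_read(replicas, key, r_quorum):
    responses = [r.store.get(key) for r in replicas if r.alive]
    if len(responses) < r_quorum:
        raise RuntimeError("read quorum not reached")
    # Return the value with the highest version among responders.
    return max((v for v in responses if v), key=lambda t: t[0])[1]

# N=3 replicas, W=2, R=2: one failed node does not block the system.
replicas = [Replica() for _ in range(3)]
quorum_write(replicas, "k", 1, "v1", w=2)
replicas[0].alive = False                 # partial failure strikes
quorum_write(replicas, "k", 2, "v2", w=2)
print(quorum_read(replicas, "k", r_quorum=2))  # prints v2
```

Because 2 + 2 > 3, any read quorum intersects any write quorum, so the read above is guaranteed to see the latest write even though one replica is down; lose a second replica and both operations correctly refuse rather than return stale or unconfirmed data.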


Conclusion

Partial failures are a defining characteristic of distributed systems, and addressing them requires designing with pessimism and resilience in mind. Unlike single-computer environments where software assumes ideal conditions, distributed systems engineers must embrace uncertainty to handle everything from hardware breakdowns to network glitches.

By adopting fault-tolerant principles, dynamic recovery processes, and thorough testing strategies, distributed systems can continue to perform reliably—even when certain parts fail unpredictably. This adaptability lies at the heart of modern cloud-based and internet-scale systems.

Series: Designing Data-Intensive Applications, Part 24 of 41
