Unbundling Monolithic Databases for Flexibility


Introduction

Databases have traditionally consolidated functionalities, combining features like storage, indexing, query processing, and replication into one integrated system. However, as application requirements diversify, handling everything in one database becomes impractical. Unbundling databases proposes breaking monolithic systems into smaller, specialized components. These components work collectively but focus on mastering specific responsibilities, such as full-text search, analytics, or change data capture.

The unbundling approach not only increases flexibility but also aligns with the Unix philosophy of modularity—combining small tools with well-defined purposes to create reliable, scalable systems.


Why Unbundle Databases?

The one-size-fits-all database model often falls short when an application requires niche capabilities that a general-purpose DBMS handles poorly (e.g., full-text search, graph traversals, or machine learning). Breaking the database apart enables:

  1. Specialized Performance: Each component is optimized for specific workloads, improving performance across diverse access patterns.
  2. Resilience and Scalability: Independent components can be scaled, upgraded, and recovered in isolation, enabling more fine-grained fault-tolerance strategies.

The Role of Derived Data

  • Unbundling databases aligns closely with the derived data architecture.
  • Secondary indexes, materialized views, caches, and full-text indexes may be externalized rather than built into a single database, using tools most suited to their tasks (e.g., Elasticsearch for full-text indexing).

This approach retains the flexibility to build pipelines that use multiple components, balancing performance and functionality across specialized systems.
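As a concrete (and deliberately simplified) illustration of derived data, the sketch below maintains an externalized full-text index purely by consuming change events from a system of record. Everything here is a hypothetical in-memory stand-in: in practice the index might live in Elasticsearch and the events might arrive via a change-data-capture stream.

```python
from collections import defaultdict

class DerivedFullTextIndex:
    """A derived full-text index kept up to date from change events."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # doc id -> current text

    def apply(self, event):
        """Apply one change event: {'op': 'upsert'|'delete', 'id': ..., 'text': ...}."""
        doc_id = event["id"]
        # Remove postings for the document's previous version, if any.
        for term in self.docs.pop(doc_id, "").split():
            self.postings[term].discard(doc_id)
        if event["op"] == "upsert":
            self.docs[doc_id] = event["text"]
            for term in event["text"].split():
                self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term, set()))

index = DerivedFullTextIndex()
index.apply({"op": "upsert", "id": 1, "text": "unbundled databases"})
index.apply({"op": "upsert", "id": 2, "text": "monolithic databases"})
index.apply({"op": "delete", "id": 2})
print(index.search("databases"))  # -> [1]
```

Because the index is a pure function of the change stream, it can be rebuilt from scratch by replaying the log, which is what makes externalizing it safe.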


Composing Data Storage Technologies

Unbundling databases means using different specialized systems and coordinating them for broader workflows. Components in an unbundled architecture include:

  1. Materialized Views: Precomputed aggregations for fast query response.
  2. Replication Logs: Coordinating data synchronization across multiple storage systems.
  3. Search Engines: External systems (such as Elasticsearch) tailored for full-text or fuzzy queries.

Where an individual database engine aims for depth (deep optimization of one storage and query model), unbundling aims for breadth: composing several specialized systems to cover a much wider range of workloads than any single piece of software could.

Example:
Imagine using distributed blob storage to hold raw datasets, a graph database for social network queries, and Kafka streams for log-based synchronization, all integrating seamlessly for specific application needs.
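The composition above can be sketched as a single ordered log fanned out to multiple consumers. The "blob store" and "graph store" below are hypothetical in-memory stand-ins for object storage and a graph database; in practice the log would be something like a Kafka topic rather than a Python list.

```python
# One ordered log of events, fanned out to two specialized consumers.
log = [
    {"type": "blob_put", "key": "raw/2024.csv", "bytes": b"a,b,c"},
    {"type": "follow", "src": "alice", "dst": "bob"},
    {"type": "follow", "src": "bob", "dst": "carol"},
]

blob_store = {}  # key -> bytes (stand-in for object storage)
graph = {}       # user -> set of users they follow (stand-in for a graph DB)

def apply_to_blob_store(event):
    if event["type"] == "blob_put":
        blob_store[event["key"]] = event["bytes"]

def apply_to_graph(event):
    if event["type"] == "follow":
        graph.setdefault(event["src"], set()).add(event["dst"])

# Every consumer sees every event in the same order; each picks out
# only the events relevant to its specialty.
for event in log:
    apply_to_blob_store(event)
    apply_to_graph(event)

print(sorted(blob_store))      # -> ['raw/2024.csv']
print(sorted(graph["alice"]))  # -> ['bob']
```

The key design choice is that all consumers read the same ordered log, so the specialized stores stay mutually consistent without talking to each other directly.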


Challenges in Unbundled Systems

While unbundling databases has advantages, challenges arise in orchestrating these components:

  1. Dataflow Complexity: Unbundling demands careful attention to how data flows between systems, possibly requiring custom code for communication and synchronization.
  2. Write Synchronization: Every system participating in the unbundled architecture must receive writes reliably and in a consistent order (e.g., through an event log). If two systems disagree about the order of writes, their states can silently diverge.
  3. Operational Overhead: Each piece of infrastructure introduces its own operational quirks (e.g., scaling, fault tolerance), which can increase administrative costs.
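One common mitigation for the write-synchronization challenge is to have each downstream system track the log offset of the last event it applied, making event application idempotent: redelivered events are skipped, and a restart resumes deterministically from the same ordered log. The sketch below is a hypothetical minimal version of that pattern.

```python
class OffsetTrackedStore:
    """A key-value store that applies log events idempotently by offset."""

    def __init__(self):
        self.state = {}
        self.applied_offset = -1  # offset of the last event applied

    def apply(self, offset, event):
        if offset <= self.applied_offset:
            return  # duplicate delivery: already applied, ignore
        self.state[event["key"]] = event["value"]
        self.applied_offset = offset

store = OffsetTrackedStore()
events = [(0, {"key": "k", "value": 1}), (1, {"key": "k", "value": 2})]
for offset, event in events:
    store.apply(offset, event)

# Redelivery of an old event (e.g. after a consumer restart) is a no-op:
store.apply(0, {"key": "k", "value": 1})
print(store.state["k"])  # -> 2
```

Kafka consumers use essentially this idea when they commit offsets, though real systems must also persist the offset atomically with the state change.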

Designing Applications Around Dataflow

In an unbundled database world, applications are seen as derivation functions feeding off a stream of state changes. These flows involve:

  1. Event Logs for Dataflow Coordination: Systems like Kafka or Pulsar provide a durable, ordered event log that plays the role of a replication log, keeping downstream systems such as caches and materialized views in sync.
  2. Continuous Derivations: Secondary indexes, full-text search indexes, and cached views are derived in near real-time through automation, minimizing manual coordination.
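The "application as derivation function" idea can be made concrete with a small sketch: a materialized view (here, a per-page view counter) computed as a pure function of the event stream. The event shape and names are illustrative assumptions, not a real API.

```python
from collections import Counter

def derive_view(events):
    """Derive a per-page view-count materialized view from an event stream."""
    view = Counter()
    for event in events:
        view[event["page"]] += 1
    return view

events = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
view = derive_view(events)
print(view["/home"])  # -> 2
```

Because the view is a deterministic function of the log, replaying the same events always reproduces the same view, which is what makes continuous, automated derivation trustworthy.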

Dataflow-friendly architectures emphasize loose coupling between tools while offering predictable results.


Conclusion

Unbundling databases enables organizations to leverage highly specialized tools for diverse workloads while maintaining scalable and resilient systems. Although unbundling introduces challenges in coordination and integration, its modular approach fosters long-term extensibility. By managing derived dataflows and embracing the diversity of modern database tools, engineers can create systems tailored to the evolving complexities of data processing. This trend is set to grow as applications increasingly prioritize flexibility and performance.

Series: Designing Data-Intensive Applications, Part 39 of 41
  1. Designing Reliable Data Systems
  2. What is Scalability in Data Systems?
  3. Building Maintainable Software Systems
  4. Relational Model Versus Document Model
  5. Speaking the Language of Data- A Guide to Query Languages
  6. Unraveling Connections- Exploring Graph-Like Data Models
  7. The Backbone of Databases- Data Structures that Power Storage
  8. Transaction Processing vs. Analytics Let's understand the divide
  9. Understanding Column-Oriented Storage- A Deep Dive into Analytics Optimization
  10. Formats for Encoding Data
  11. Modes of Dataflow in Distributed Systems
  12. Leaders and Followers - The Core of Replication
  13. Problems with Replication Lag - Challenges and Solutions
  14. Multi-Leader Replication in Distributed Databases
  15. Leaderless Replication Flexibility for Distributed Databases
  16. Partitioning and Replication in Scaling Distributed Databases
  17. Partitioning of Key-Value Data- Strategies and Challenges
  18. Partitioning and Secondary Indexes- Balancing Efficiency and Complexity
  19. Efficient Methods for Rebalancing Data in Distributed Systems
  20. Ensuring Accurate Request Routing in Distributed Databases
  21. The Slippery Concept of a Transaction
  22. Exploring Weak Isolation Levels in Databases
  23. Achieving Serializability in Transactions
  24. Faults and Partial Failures in Distributed Systems
  25. Navigating Unreliable Networks in Distributed Systems
  26. The Challenges of Unreliable Clocks in Distributed Systems
  27. Knowledge Truth and Lies in Distributed Systems
  28. Consistency Guarantees in Distributed Systems
  29. Linearizability in Distributed Systems
  30. Understanding Ordering Guarantees in Distributed Systems
  31. Achieving Reliability with Distributed Transactions and Consensus Mechanisms
  32. Leveraging Unix Tools for Efficient Batch Processing
  33. MapReduce and Distributed Filesystems- Foundations of Scalable Data Processing
  34. Advancing Beyond MapReduce- Modern Frameworks for Scalable Data Processing
  35. Enabling Reliable and Scalable Event Streams in Distributed Systems
  36. Synchronizing Databases with Real-Time Streams
  37. Unifying Batch and Stream Processing for Modern Pipelines
  38. Integrating Distributed Systems for Unified Data Pipelines
  39. Unbundling Monolithic Databases for Flexibility
  40. Building Correct Systems in Distributed Environments
  41. Ethical Data Practices for Building Better Systems
