Unifying Batch and Stream Processing for Modern Pipelines


Introduction

Stream processing applies the principles of batch processing to unbounded data streams, enabling real-time or low-latency operations. Unlike batch jobs with static inputs, stream processors handle continuous flows of events, introducing challenges like reasoning about time and ensuring fault tolerance. This post explores the architectures, tasks, and mechanisms of stream processing pipelines, along with advanced topics like joins and recovery techniques.


Defining Stream Processing

Broadly speaking, stream processing frameworks consume continuous data flows, perform transformations (e.g., filtering or enrichment), and produce outputs incrementally; a minimal sketch of this loop follows the examples below. Common examples include:

  1. Real-Time Dashboards:
    Processing clickstreams or user activity logs to update business metrics or visualizations dynamically.
  2. Alerts and Monitoring:
    Detecting fraudulent credit card transactions or tracking stock price anomalies.
  3. Materialized Views:
    Maintaining up-to-date search indexes or caches synchronized with a source database.
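
To make the consume-transform-produce loop concrete, here is a minimal sketch in plain Python. It stands in for a real framework: the event shape, the temperature_c field, and the alert threshold are all illustrative assumptions, not any particular API.

```python
from typing import Iterator

def source() -> Iterator[dict]:
    # Stand-in for an unbounded source (e.g., a message broker topic);
    # a real source would block waiting for new events indefinitely.
    for event in [
        {"sensor": "a", "temperature_c": 21.5},
        {"sensor": "b", "temperature_c": 99.0},
        {"sensor": "a", "temperature_c": 22.1},
    ]:
        yield event

def pipeline(events: Iterator[dict]) -> Iterator[dict]:
    for event in events:
        if event["temperature_c"] < 30.0:  # filter: drop normal readings
            continue
        # enrich: attach a derived field before emitting
        event["temperature_f"] = event["temperature_c"] * 9 / 5 + 32
        yield event  # produce output incrementally, one event at a time

for alert in pipeline(source()):
    print("ALERT:", alert)
```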

Core Stream Processing Tasks

  1. Basic Transformations
    • Operators such as filter, map, and aggregate apply transformations to records as they flow through the stream, for example converting temperature units or summing sales figures per region.
  2. Windowing for Aggregations
    Because the input is unbounded, stream processors group events into time windows and compute results per window, such as hourly averages or rolling counts (a tumbling-window sketch follows this list). Types of windows include:
    • Tumbling Window: Fixed-size, non-overlapping windows (e.g., group by 1-minute intervals).
    • Hopping Window: Fixed-size with overlap (e.g., 5-minute windows shifting every 1 minute).
    • Session Window: No fixed duration; groups events that occur close together in time and closes the window after a period of user inactivity.
  3. Stream Joins for Enrichment
    Stream processors can join multiple unbounded datasets on shared keys; a stream-table join is sketched after this list. Examples include:
    • Stream-Stream Join: Correlates events from two streams occurring within a time window.
    • Stream-Table Join: Enriches stream events with lookups against a database or changelog.
    • Table-Table Join: Synchronizes two changelog streams to generate a materialized view.
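
The windowing described in item 2 can be hand-rolled in a few lines. The sketch below is framework-free Python under two assumptions: each event carries an event-time timestamp "ts" in epoch seconds, and 60-second tumbling windows are what we want. A real framework would additionally decide when a window is complete (see the watermark discussion below).

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def window_start(ts: int) -> int:
    # Align each event to the start of its non-overlapping 60-second window.
    return (ts // WINDOW_SECONDS) * WINDOW_SECONDS

# (window_start, region) -> running count of sales in that window
counts: dict[tuple[int, str], int] = defaultdict(int)

events = [
    {"ts": 5, "region": "eu"},   # falls in window [0, 60)
    {"ts": 42, "region": "eu"},  # same window
    {"ts": 61, "region": "us"},  # next window, [60, 120)
]

for event in events:
    counts[(window_start(event["ts"]), event["region"])] += 1

for (start, region), n in sorted(counts.items()):
    print(f"window starting at {start}s: {region} had {n} sales")
```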
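
Similarly, the essence of the stream-table join from item 3 is a local copy of the table, kept current by consuming its changelog, with each stream event looked up against it. The inputs below (a user changelog and a click stream) are illustrative; in practice both arrive interleaved and unbounded.

```python
# Local copy of the users table, maintained from its changelog.
users: dict[str, dict] = {}

changelog = [
    {"user_id": "u1", "name": "Ada", "plan": "pro"},
    {"user_id": "u2", "name": "Bob", "plan": "free"},
]
clicks = [
    {"user_id": "u1", "page": "/pricing"},
    {"user_id": "u3", "page": "/docs"},  # no profile seen yet
]

for change in changelog:
    users[change["user_id"]] = change  # upsert into the local table

for click in clicks:
    profile = users.get(click["user_id"])
    enriched = {**click, "plan": profile["plan"] if profile else None}
    print(enriched)  # the click, enriched with data from the table
```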

Reasoning About Time

Handling time is a central challenge in stream processing because time appears in two forms:

  1. Event Time: When the data was generated, often provided as a timestamp in the event payload.
  2. Processing Time: When the system processes the data, which is subject to delays or stragglers.

Processing frameworks rely on watermarks to track the progression of event time: a watermark asserts that no events older than a given timestamp are still expected, which lets an operator close a window while still tolerating events that arrive up to some bounded delay late.
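
Here is a minimal illustration of the mechanism. The 5-second tolerance and the handling of late events (just printing them) are assumed policies, not any framework's API; real systems offer choices such as dropping late events or routing them to a side output.

```python
ALLOWED_LATENESS = 5.0  # seconds of tolerated lateness (an assumption)

max_event_time = float("-inf")

def process(event: dict) -> None:
    global max_event_time
    max_event_time = max(max_event_time, event["ts"])
    # The watermark asserts: "no event older than this is still expected."
    watermark = max_event_time - ALLOWED_LATENESS
    if event["ts"] < watermark:
        print("late event:", event)
    else:
        print("on time:", event, "| watermark now", watermark)

for e in [{"ts": 100.0}, {"ts": 104.0}, {"ts": 98.0}, {"ts": 112.0}, {"ts": 101.0}]:
    process(e)  # the events with ts 98.0 and 101.0 arrive late
```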


Fault Tolerance and Exactly-Once Semantics

Failures in stream processing systems are inevitable, especially as jobs run continuously for extended durations. To provide reliable results despite crashes and restarts, modern systems rely on several mechanisms:

  1. Checkpoints
    Frameworks like Apache Flink and Kafka Streams periodically capture snapshots of operator state to durable storage. After a failure, the job restarts from the most recent checkpoint and replays only the input received since then, rather than reprocessing the entire stream (a simplified sketch appears after this list).

  2. Microbatching
    Spark Streaming breaks the stream into small batches (typically around one second of data) and processes each with the batch engine, giving every microbatch the same guarantees as a traditional batch job.

  3. Idempotency
    Writes to an external sink (e.g., a database) are made idempotent: performing the same write twice has the same effect as performing it once, so duplicate events caused by retries do not affect correctness (see the second sketch after this list).
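
A heavily simplified, single-process sketch of the checkpoint loop from item 1 follows. Real frameworks like Flink coordinate snapshots across many machines with barrier markers and write to replicated storage; this version shows only the core contract, namely that operator state and the input offset are persisted atomically, so a restart can resume where the last snapshot left off.

```python
import json
import os

STATE_FILE = "checkpoint.json"  # illustrative; real systems use durable, replicated storage

def load_checkpoint() -> tuple[int, dict]:
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            cp = json.load(f)
        return cp["offset"], cp["state"]
    return 0, {}  # no checkpoint yet: start from the beginning

def save_checkpoint(offset: int, state: dict) -> None:
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset, "state": state}, f)
    os.replace(tmp, STATE_FILE)  # atomic: offset and state move together

events = ["a", "b", "a", "c", "a", "b"]  # stands in for a replayable log

offset, counts = load_checkpoint()  # after a crash, resume from here
for i in range(offset, len(events)):
    counts[events[i]] = counts.get(events[i], 0) + 1
    if (i + 1) % 3 == 0:  # snapshot every 3 events (an arbitrary interval)
        save_checkpoint(i + 1, counts)
print(counts)
```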
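
And a sketch of an idempotent sink from item 3, using SQLite from the standard library. Keying each write by a unique event ID means a retried duplicate simply overwrites the row it already wrote; the table and column names are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (event_id TEXT PRIMARY KEY, value REAL)")

def write(event_id: str, value: float) -> None:
    # Applying the same write twice has the same effect as applying it once.
    db.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (event_id, value))
    db.commit()

write("evt-1", 10.0)
write("evt-1", 10.0)  # a retry after a failure: a harmless duplicate
print(db.execute("SELECT COUNT(*) FROM results").fetchone()[0])  # -> 1
```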


Conclusion

Stream processing reshapes the data management landscape by bringing real-time capabilities to formerly batch-only workflows. With concepts like windowing, joins, and exactly-once guarantees, systems like Flink, Spark Streaming, and Kafka Streams enable reliable, low-latency applications at scale. As data streams grow in ubiquity, understanding these frameworks becomes a critical skill for modern developers.

