Unifying Batch and Stream Processing for Modern Pipelines


Introduction

Stream processing applies the principles of batch processing to unbounded data streams, enabling real-time or low-latency operations. Unlike batch jobs with static inputs, stream processors handle continuous flows of events, introducing challenges like reasoning about time and ensuring fault tolerance. This post explores the architectures, tasks, and mechanisms of stream processing pipelines, along with advanced topics like joins and recovery techniques.


Defining Stream Processing

Broadly speaking, stream processing frameworks consume continuous data flows, perform transformations (e.g., filtering or enrichment), and produce outputs incrementally; a minimal sketch of this loop follows the examples below. Common examples include:

  1. Real-Time Dashboards:
    Processing clickstreams or user activity logs to update business metrics or visualizations dynamically.
  2. Alerts and Monitoring:
    Detecting fraudulent credit card transactions or tracking stock price anomalies.
  3. Materialized Views:
    Maintaining up-to-date search indexes or caches synchronized with a source database.
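
To make the consume-transform-produce loop concrete, here is a minimal sketch in plain Python. It stands in for a real framework: the event shape, the temperature_c field, and the alert threshold are all illustrative assumptions, not any particular API.

```python
from typing import Iterator

def source() -> Iterator[dict]:
    # Stand-in for an unbounded source (e.g., a message broker topic);
    # a real source would block waiting for new events indefinitely.
    for event in [
        {"sensor": "a", "temperature_c": 21.5},
        {"sensor": "b", "temperature_c": 99.0},
        {"sensor": "a", "temperature_c": 22.1},
    ]:
        yield event

def pipeline(events: Iterator[dict]) -> Iterator[dict]:
    for event in events:
        if event["temperature_c"] < 30.0:  # filter: drop normal readings
            continue
        # enrich: attach a derived field before emitting
        event["temperature_f"] = event["temperature_c"] * 9 / 5 + 32
        yield event  # produce output incrementally, one event at a time

for alert in pipeline(source()):
    print("ALERT:", alert)
```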

Core Stream Processing Tasks

  1. Basic Transformations
    • Operators such as filter, map, and aggregate apply transformations to records as they flow through the stream, for example converting temperature units or summing sales figures per region.
  2. Windowing for Aggregations
    Because the input is unbounded, stream processors group events into time windows and compute results per window, such as hourly averages or rolling counts (a tumbling-window sketch follows this list). Types of windows include:
    • Tumbling Window: Fixed-size, non-overlapping windows (e.g., group by 1-minute intervals).
    • Hopping Window: Fixed-size with overlap (e.g., 5-minute windows shifting every 1 minute).
    • Session Window: No fixed duration; groups events that occur close together in time and closes the window after a period of user inactivity.
  3. Stream Joins for Enrichment
    Stream processors can join multiple unbounded datasets on shared keys; a stream-table join is sketched after this list. Examples include:
    • Stream-Stream Join: Correlates events from two streams occurring within a time window.
    • Stream-Table Join: Enriches stream events with lookups against a database or changelog.
    • Table-Table Join: Synchronizes two changelog streams to generate a materialized view.
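
The windowing described in item 2 can be hand-rolled in a few lines. The sketch below is framework-free Python under two assumptions: each event carries an event-time timestamp "ts" in epoch seconds, and 60-second tumbling windows are what we want. A real framework would additionally decide when a window is complete (see the watermark discussion below).

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def window_start(ts: int) -> int:
    # Align each event to the start of its non-overlapping 60-second window.
    return (ts // WINDOW_SECONDS) * WINDOW_SECONDS

# (window_start, region) -> running count of sales in that window
counts: dict[tuple[int, str], int] = defaultdict(int)

events = [
    {"ts": 5, "region": "eu"},   # falls in window [0, 60)
    {"ts": 42, "region": "eu"},  # same window
    {"ts": 61, "region": "us"},  # next window, [60, 120)
]

for event in events:
    counts[(window_start(event["ts"]), event["region"])] += 1

for (start, region), n in sorted(counts.items()):
    print(f"window starting at {start}s: {region} had {n} sales")
```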
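
Similarly, the essence of the stream-table join from item 3 is a local copy of the table, kept current by consuming its changelog, with each stream event looked up against it. The inputs below (a user changelog and a click stream) are illustrative; in practice both arrive interleaved and unbounded.

```python
# Local copy of the users table, maintained from its changelog.
users: dict[str, dict] = {}

changelog = [
    {"user_id": "u1", "name": "Ada", "plan": "pro"},
    {"user_id": "u2", "name": "Bob", "plan": "free"},
]
clicks = [
    {"user_id": "u1", "page": "/pricing"},
    {"user_id": "u3", "page": "/docs"},  # no profile seen yet
]

for change in changelog:
    users[change["user_id"]] = change  # upsert into the local table

for click in clicks:
    profile = users.get(click["user_id"])
    enriched = {**click, "plan": profile["plan"] if profile else None}
    print(enriched)  # the click, enriched with data from the table
```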

Reasoning About Time

Handling time is a central challenge in stream processing because time appears in two forms:

  1. Event Time: When the data was generated, often provided as a timestamp in the event payload.
  2. Processing Time: When the system processes the data, which is subject to delays or stragglers.

Processing frameworks rely on watermarks to track the progression of event time: a watermark asserts that no events older than a given timestamp are still expected, which lets an operator close a window while still tolerating events that arrive up to some bounded delay late.
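
Here is a minimal illustration of the mechanism. The 5-second tolerance and the handling of late events (just printing them) are assumed policies, not any framework's API; real systems offer choices such as dropping late events or routing them to a side output.

```python
ALLOWED_LATENESS = 5.0  # seconds of tolerated lateness (an assumption)

max_event_time = float("-inf")

def process(event: dict) -> None:
    global max_event_time
    max_event_time = max(max_event_time, event["ts"])
    # The watermark asserts: "no event older than this is still expected."
    watermark = max_event_time - ALLOWED_LATENESS
    if event["ts"] < watermark:
        print("late event:", event)
    else:
        print("on time:", event, "| watermark now", watermark)

for e in [{"ts": 100.0}, {"ts": 104.0}, {"ts": 98.0}, {"ts": 112.0}, {"ts": 101.0}]:
    process(e)  # the events with ts 98.0 and 101.0 arrive late
```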


Fault Tolerance and Exactly-Once Semantics

Failures in stream processing systems are inevitable, especially as jobs run continuously for extended durations. To provide reliable results despite crashes and restarts, modern systems rely on several mechanisms:

  1. Checkpoints
    Frameworks like Apache Flink and Kafka Streams periodically capture snapshots of operator state to durable storage. After a failure, the job restarts from the most recent checkpoint and replays only the input received since then, rather than reprocessing the entire stream (a simplified sketch appears after this list).

  2. Microbatching
    Spark Streaming breaks the stream into small batches (typically around one second of data) and processes each with the batch engine, giving every microbatch the same guarantees as a traditional batch job.

  3. Idempotency
    Writes to an external sink (e.g., a database) are made idempotent: performing the same write twice has the same effect as performing it once, so duplicate events caused by retries do not affect correctness (see the second sketch after this list).
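
A heavily simplified, single-process sketch of the checkpoint loop from item 1 follows. Real frameworks like Flink coordinate snapshots across many machines with barrier markers and write to replicated storage; this version shows only the core contract, namely that operator state and the input offset are persisted atomically, so a restart can resume where the last snapshot left off.

```python
import json
import os

STATE_FILE = "checkpoint.json"  # illustrative; real systems use durable, replicated storage

def load_checkpoint() -> tuple[int, dict]:
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            cp = json.load(f)
        return cp["offset"], cp["state"]
    return 0, {}  # no checkpoint yet: start from the beginning

def save_checkpoint(offset: int, state: dict) -> None:
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset, "state": state}, f)
    os.replace(tmp, STATE_FILE)  # atomic: offset and state move together

events = ["a", "b", "a", "c", "a", "b"]  # stands in for a replayable log

offset, counts = load_checkpoint()  # after a crash, resume from here
for i in range(offset, len(events)):
    counts[events[i]] = counts.get(events[i], 0) + 1
    if (i + 1) % 3 == 0:  # snapshot every 3 events (an arbitrary interval)
        save_checkpoint(i + 1, counts)
print(counts)
```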
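
And a sketch of an idempotent sink from item 3, using SQLite from the standard library. Keying each write by a unique event ID means a retried duplicate simply overwrites the row it already wrote; the table and column names are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (event_id TEXT PRIMARY KEY, value REAL)")

def write(event_id: str, value: float) -> None:
    # Applying the same write twice has the same effect as applying it once.
    db.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (event_id, value))
    db.commit()

write("evt-1", 10.0)
write("evt-1", 10.0)  # a retry after a failure: a harmless duplicate
print(db.execute("SELECT COUNT(*) FROM results").fetchone()[0])  # -> 1
```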


Conclusion

Stream processing reshapes the data management landscape by bringing real-time capabilities to formerly batch-only workflows. With concepts like windowing, joins, and exactly-once guarantees, systems like Flink, Spark Streaming, and Kafka Streams enable reliable, low-latency applications at scale. As data streams grow in ubiquity, understanding these frameworks becomes a critical skill for modern developers.

