
Shift Left or Stay Right?

Streaming vs Batch Processing Deep Dive

Boyan Balev, Software Engineer
15 min read

Most production data platforms use both batch and streaming approaches rather than choosing exclusively between them. The decision hinges on specific use case requirements rather than universal superiority.

Two Mental Models

Batch: Collect Everything, Analyze Thoroughly

Key Advantages:

  1. Complete Picture — All data exists before processing begins; no late arrivals or ordering concerns; straightforward cross-dataset joins

  2. Simple Retry Logic — Failed jobs can simply be rerun, with full recomputation as a viable fallback; idempotent by design (see the sketch after this list)

  3. Mature Ecosystem — Decades of established tooling (Airflow, dbt, Spark); widespread SQL knowledge; proven debugging patterns

  4. Predictable Costs — Efficient resource usage through on-demand compute; no always-on infrastructure for sporadic workloads
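
To make the retry and idempotency points above concrete, here is a minimal Airflow sketch. It assumes Airflow 2.4+ and a hypothetical load_partition() helper that overwrites one date partition per run; neither comes from the original article.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def rebuild_daily_revenue(ds, **_):
    # Recompute the full partition for the run date. Overwriting the same
    # partition makes reruns and backfills idempotent: no double counting.
    load_partition(table="daily_revenue", partition=ds)  # hypothetical loader

with DAG(
    dag_id="daily_revenue",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=10)},
):
    PythonOperator(
        task_id="rebuild_daily_revenue",
        python_callable=rebuild_daily_revenue,
    )

Because a failed run simply executes the same overwrite again, retries need no special compensation logic.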

Streaming: Process As It Happens

Key Advantages:

  1. Millisecond Latency — React to events as they happen: fraud detection, real-time pricing, live dashboards

  2. Event Sourcing — Event logs serve as the source of truth; tables and caches become derived views; replay rebuilds state (sketched after this list)

  3. Unified Processing — Single codebase for real-time and batch via replay; Kappa architecture eliminates dual-system complexity

  4. Decoupled Systems — Producers and consumers evolve independently; new consumers don’t require producer modifications; event-driven microservices
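
A minimal replay sketch of the event-sourcing idea, assuming the confluent-kafka client and a hypothetical account-events topic of JSON events shaped like {account_id, delta} (both are illustrative, not from the article). A fresh consumer group starting from the earliest offset rebuilds a derived view purely from the log:

import json
from collections import defaultdict

from confluent_kafka import Consumer

# A fresh group id plus 'earliest' starts the consumer at the beginning of the
# retained log, so the derived view is rebuilt purely from events.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "balance-view-rebuild",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["account-events"])

balances = defaultdict(float)        # derived view: account_id -> balance
empty_polls = 0
while empty_polls < 5:               # crude end-of-log check, fine for a sketch
    msg = consumer.poll(1.0)
    if msg is None:
        empty_polls += 1
        continue
    if msg.error():
        continue
    event = json.loads(msg.value())
    balances[event["account_id"]] += event["delta"]

consumer.close()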

Architecture Patterns

Modern Batch Architecture

Sources → Airbyte (Extract & Load) → Apache Iceberg + Trino (Storage) → dbt (Transform) → BI Tools / Reverse ETL

Provides hourly/daily freshness with simplified operations.

Streaming Event-Driven Stack

Sources → Kafka/Redpanda (Capture) → Apache Flink (Transform) → ClickHouse/Redis (Serve)

Delivers sub-second latency with continuous processing.
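
As a rough sketch of that capture → transform → serve path (not a production design), assuming the confluent-kafka and redis-py clients, an orders topic of JSON events, and a local Redis instance, all of which are illustrative:

import json

import redis
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "revenue-aggregator",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])
cache = redis.Redis(host="localhost", port=6379)

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    order = json.loads(msg.value())
    # Keep a pre-computed aggregate hot in the serving layer so dashboards
    # read a single Redis hash instead of querying the warehouse.
    cache.hincrbyfloat("revenue_by_category", order["category"], order["amount"])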

Key Components Comparison

Component      | Batch Stack                          | Streaming Stack
Ingestion      | Airbyte, Apache NiFi, custom scripts | Debezium CDC, direct producers
Storage        | Apache Iceberg, ClickHouse, Trino    | Kafka (event log), ClickHouse (OLAP)
Transformation | dbt, Spark, SQL                      | Flink, ksqlDB, Spark Structured Streaming
Orchestration  | Airflow, Dagster, Prefect            | Always-on (Kubernetes, managed Flink)
Serving        | Direct warehouse queries, caching    | Pre-computed views in Redis/ClickHouse

Best-in-Class Technology

Batch Stack

Role       | Tool    | Why
Transform  | dbt     | SQL-first, tested, documented
Query      | Trino   | Distributed SQL at scale
Format     | Iceberg | Open table format
Processing | Spark   | Petabyte-scale batch

Streaming Stack

Role        | Tool     | Why
Processing  | Flink    | Stateful, exactly-once
Event Bus   | Kafka    | Durable event logs
Alternative | Redpanda | Kafka-compatible, simpler
Stream SQL  | ksqlDB   | SQL over streams

Code Examples

Example 1: Daily Revenue by Category

dbt (Batch)

-- Batch: Simple, readable, testable
SELECT
    DATE(order_timestamp) AS order_date,
    category,
    SUM(amount) AS daily_revenue,
    COUNT(*) AS order_count
FROM {{ ref('orders') }}
WHERE order_timestamp >= DATEADD('day', -30, CURRENT_DATE())
GROUP BY 1, 2

Flink SQL (Streaming)

-- Streaming: Continuous, requires windowing
SELECT
    TUMBLE_START(order_time, INTERVAL '1' DAY) AS order_date,
    category,
    SUM(amount) AS daily_revenue,
    COUNT(*) AS order_count
FROM orders
GROUP BY
    TUMBLE(order_time, INTERVAL '1' DAY),
    category

Verdict: Batch Processing

Batch wins here on simplicity, testability via dbt tests, and trivial historical recomputation, unless revenue figures genuinely need to update in real time.

Example 2: Fraud Detection (Velocity Check)

PySpark (Batch)

# Batch: Run hourly - detects fraud after the fact
import math

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window

@F.udf(DoubleType())
def haversine_udf(lat1, lon1, lat2, lon2):
    # Great-circle distance in km; returns None when either point is missing
    if None in (lat1, lon1, lat2, lon2):
        return None
    dlat, dlon = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

user_window = Window.partitionBy("user_id").orderBy("timestamp")
# transactions: DataFrame with user_id, lat, lon, timestamp columns
flagged = (
    transactions
    .withColumn("prev_lat", F.lag("lat").over(user_window))
    .withColumn("prev_lon", F.lag("lon").over(user_window))
    .withColumn("prev_time", F.lag("timestamp").over(user_window))
    .withColumn("distance_km", haversine_udf("lat", "lon", "prev_lat", "prev_lon"))
    .withColumn("time_diff_min",
                (F.col("timestamp").cast("long") - F.col("prev_time").cast("long")) / 60)
    .filter((F.col("distance_km") > 500) & (F.col("time_diff_min") < 30))
)

Flink SQL (Streaming)

-- Streaming: Detect in real-time, block before damage
CREATE TABLE payments (
    txn_id STRING,
    user_id STRING,
    amount DECIMAL(10, 2),
    lat DOUBLE,
    lon DOUBLE,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH ('connector' = 'kafka', ...);

-- Detect impossible velocity: >500km in <30 min
-- (ST_Point / ST_Distance are assumed to be registered geospatial UDFs;
--  they are not built into Flink SQL)
INSERT INTO fraud_alerts
SELECT p1.user_id, p1.txn_id, p2.txn_id,
       ST_Distance(ST_Point(p1.lon, p1.lat), ST_Point(p2.lon, p2.lat)) / 1000 AS km
FROM payments p1, payments p2
WHERE p1.user_id = p2.user_id
  AND p1.event_time < p2.event_time
  AND p2.event_time < p1.event_time + INTERVAL '30' MINUTE
  AND ST_Distance(...) > 500000;

Verdict: Streaming

Streaming excels here—batch detects fraud hours post-incident while streaming blocks transactions within milliseconds, preventing significant financial losses.

Monitoring & Visibility

Batch Monitoring

  • Jobs produce binary outcomes (success/failure)
  • dbt tests catch data quality issues
  • Airflow displays DAG execution history
  • Clear audit trail exists per run

Streaming Monitoring

  • Continuous metrics track consumer lag, end-to-end latency, and checkpoint success rates (a lag check is sketched after the table below)
  • Requires always-on monitoring
  • Distributed debugging presents greater challenges

Metric          | Batch                            | Streaming
Data freshness  | "Last run: 2 hours ago"          | "Consumer lag: 50ms"
Pipeline health | Job success/failure rate         | Checkpoint success, backpressure
Data quality    | dbt tests, Great Expectations    | Schema registry, runtime validation
Lineage         | dbt docs, DataHub (mature)       | Event correlation IDs (manual)
Debugging       | Re-run failed job, inspect logs  | Distributed tracing, state inspection
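
As an illustration of what continuous lag monitoring looks like in code, a minimal check assuming the confluent-kafka client, a payments topic with three partitions, and a consumer group named payments-enricher (all illustrative names):

from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payments-enricher",
    "enable.auto.commit": False,
})

# Compare each partition's committed offset with the broker's latest offset
partitions = [TopicPartition("payments", p) for p in range(3)]
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low   # no commit yet for this group
    print(f"partition {tp.partition}: lag = {high - committed} messages")

consumer.close()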

Operational Complexity

Batch Operations

  • Simple retry logic — Failed jobs rerun with full recomputation as viable fallback; no state corruption risk
  • Mature scheduling — Airflow, Dagster, Prefect offer well-understood dependency management
  • Delayed detection — Issues surface only when the next job runs, hours or days after they occur
  • DAG dependency hell — Complex pipelines create cascading failures; slow jobs block downstream processes

Streaming Operations

  • Immediate detection — Consumer lag spikes visible instantly without waiting for scheduled runs
  • Always-on processing — No scheduling management; no batch windows; continuous data flow
  • State management complexity — Checkpoints, savepoints, and state backends all require care (a checkpointing sketch follows this list); corrupted state may necessitate a full rebuild from Kafka
  • Upgrade complexity — Stateful job upgrades demand savepoints; schema changes need careful coordination
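
For a feel of the state-management knobs mentioned above, a minimal PyFlink checkpointing sketch; the interval, pause, and timeout values are illustrative, not recommendations:

from pyflink.datastream import CheckpointingMode, StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Snapshot operator state every 60s so a crashed job resumes from the last
# completed checkpoint instead of reprocessing the whole topic.
env.enable_checkpointing(60_000, CheckpointingMode.EXACTLY_ONCE)

checkpoint_config = env.get_checkpoint_config()
checkpoint_config.set_min_pause_between_checkpoints(30_000)  # breathing room between snapshots
checkpoint_config.set_checkpoint_timeout(120_000)            # abort checkpoints that hang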

Cost Economics

When Batch is Cheaper

  • Infrequent processing (daily/weekly schedules)
  • Serverless warehouses (pay-per-query models)
  • Small data volumes (<1TB/day)
  • Ad-hoc analytics workloads
  • No always-on infrastructure required

When Streaming is Cheaper

  • High-volume continuous processing
  • Replacing multiple redundant batch jobs
  • Eliminating intermediate storage requirements
  • Real-time requirements where the batch alternative would mean heavily over-provisioned, tightly scheduled jobs

Decision Framework

Clear Wins for Batch

Historical Analytics

  • Ad-hoc queries spanning years of data
  • Complex joins across large datasets
  • Columnar storage optimization

ML Training

  • Feature engineering leverages historical data
  • Model training gains nothing from real-time processing
  • Simpler implementation

Compliance Reporting

  • Audit reports, regulatory filings, point-in-time snapshots
  • Full reproducibility required

Clear Wins for Streaming

Fraud Detection

  • Milliseconds are critical
  • Block fraudulent transactions before completion
  • Real-time processing is essential

Operational Monitoring

  • System health tracking, alerting, live dashboards
  • Stale data becomes useless
  • Continuous processing required

Event-Driven Systems

  • Microservices choreography
  • Real-time notifications
  • User-facing features requiring instant responsiveness

Risks & Failure Modes

Batch Risks

  • Stale data — Decisions based on outdated information
  • Delayed detection — Problems discovered hours later
  • Cascade failures — One slow job blocks everything downstream
  • Resource contention — Batch windows competing for compute resources

Streaming Risks

  • Silent data loss — Misconfigured consumers drop events undetected (see the config sketch after this list)
  • State corruption — Requires full rebuild from Kafka
  • Backpressure cascade — Slow consumer affects entire pipeline
  • Schema breaks — Producer changes break downstream consumers
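
A small sketch of how silent data loss creeps in through consumer configuration, assuming the confluent-kafka client; the topic and group names are placeholders and process() is a hypothetical handler:

from confluent_kafka import Consumer

# Risky defaults: a new consumer group with 'latest' silently skips everything
# already in the topic, and auto-commit can mark messages as done before the
# application has actually processed them.
risky = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-sink",
    "auto.offset.reset": "latest",
    "enable.auto.commit": True,
})

# Safer: start new groups from the earliest retained offset and commit only
# after a message has been durably handled.
safe = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-sink",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
safe.subscribe(["orders"])
msg = safe.poll(1.0)
if msg is not None and msg.error() is None:
    process(msg)                  # hypothetical handler
    safe.commit(message=msg)      # commit only after successful processing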

Mitigation Strategies

Risk                | Batch Mitigation                                       | Streaming Mitigation
Data quality issues | dbt tests, Great Expectations, post-run validation     | Schema registry, runtime validation, dead letter queues
Pipeline failures   | Retry policies, alerting on failure, SLAs with buffer  | Checkpointing, automatic restart, consumer lag alerting
Data loss           | Idempotent jobs, backup before overwrite               | Kafka retention, exactly-once semantics, idempotent sinks
Schema evolution    | dbt schema tests, migration scripts                    | Schema registry compatibility rules, versioned topics
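
To show the dead-letter-queue mitigation from the table in code, a minimal sketch assuming the confluent-kafka client, a payments topic of JSON events, and a payments.dlq topic for rejected records (all illustrative names):

import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payments-validator",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["payments"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        if event["amount"] < 0:
            raise ValueError("negative amount")
        # ...hand the valid event to the real pipeline here...
    except (ValueError, KeyError):
        # Route bad records to a dead-letter topic instead of dropping them silently
        producer.produce("payments.dlq", key=msg.key(), value=msg.value())
        producer.flush()
    consumer.commit(message=msg)

The key property is that every message is either processed or parked in the DLQ for inspection; nothing is dropped silently.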

The Bottom Line

Neither paradigm is universally superior. They optimize for different outcomes:

  • Batch optimizes for simplicity, completeness, and cost-effectiveness for periodic workloads
  • Streaming optimizes for latency, reactivity, and continuous processing

Most modern data platforms employ both paradigms. Stream for operational use cases prioritizing latency. Batch for analytical use cases emphasizing completeness. Let requirements drive architecture rather than ideology.
