
Shift Left or Stay Right?

Streaming vs Batch Processing Deep Dive

Boyan Balev, Software Engineer
15 min read

Most production data platforms use both batch and streaming approaches rather than choosing exclusively between them. The decision hinges on specific use case requirements rather than universal superiority.

Two Mental Models

Batch: Collect Everything, Analyze Thoroughly

Key Advantages:

  1. Complete Picture — All data exists before processing begins; no late arrivals or ordering concerns; straightforward cross-dataset joins

  2. Simple Retry Logic — Failed jobs can simply be rerun, with full recomputation as a viable fallback; idempotent by design (see the sketch after this list)

  3. Mature Ecosystem — Decades of established tooling (Airflow, dbt, Spark); widespread SQL knowledge; proven debugging patterns

  4. Predictable Costs — Efficient resource usage through on-demand compute; no always-on infrastructure for sporadic workloads
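
To make the retry and idempotency points above concrete, here is a minimal Airflow sketch. It assumes Airflow 2.4+ and a hypothetical load_partition() helper that overwrites one date partition per run; neither comes from the original article.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def rebuild_daily_revenue(ds, **_):
    # Recompute the full partition for the run date. Overwriting the same
    # partition makes reruns and backfills idempotent: no double counting.
    load_partition(table="daily_revenue", partition=ds)  # hypothetical loader

with DAG(
    dag_id="daily_revenue",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=10)},
):
    PythonOperator(
        task_id="rebuild_daily_revenue",
        python_callable=rebuild_daily_revenue,
    )

Because a failed run simply executes the same overwrite again, retries need no special compensation logic.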

Streaming: Process As It Happens

Key Advantages:

  1. Millisecond Latency — React to events as they happen: fraud detection, real-time pricing, live dashboards

  2. Event Sourcing — Event logs serve as the source of truth; tables and caches become derived views; replay rebuilds state (sketched after this list)

  3. Unified Processing — Single codebase for real-time and batch via replay; Kappa architecture eliminates dual-system complexity

  4. Decoupled Systems — Producers and consumers evolve independently; new consumers don’t require producer modifications; event-driven microservices
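
A minimal replay sketch of the event-sourcing idea, assuming the confluent-kafka client and a hypothetical account-events topic of JSON events shaped like {account_id, delta} (both are illustrative, not from the article). A fresh consumer group starting from the earliest offset rebuilds a derived view purely from the log:

import json
from collections import defaultdict

from confluent_kafka import Consumer

# A fresh group id plus 'earliest' starts the consumer at the beginning of the
# retained log, so the derived view is rebuilt purely from events.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "balance-view-rebuild",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["account-events"])

balances = defaultdict(float)        # derived view: account_id -> balance
empty_polls = 0
while empty_polls < 5:               # crude end-of-log check, fine for a sketch
    msg = consumer.poll(1.0)
    if msg is None:
        empty_polls += 1
        continue
    if msg.error():
        continue
    event = json.loads(msg.value())
    balances[event["account_id"]] += event["delta"]

consumer.close()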

Architecture Patterns

Modern Batch Architecture

Sources → Airbyte (Extract & Load) → Apache Iceberg + Trino (Storage) → dbt (Transform) → BI Tools / Reverse ETL

Provides hourly/daily freshness with simplified operations.

Streaming Event-Driven Stack

Sources → Kafka/Redpanda (Capture) → Apache Flink (Transform) → ClickHouse/Redis (Serve)

Delivers sub-second latency with continuous processing.
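
As a rough sketch of that capture → transform → serve path (not a production design), assuming the confluent-kafka and redis-py clients, an orders topic of JSON events, and a local Redis instance, all of which are illustrative:

import json

import redis
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "revenue-aggregator",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])
cache = redis.Redis(host="localhost", port=6379)

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    order = json.loads(msg.value())
    # Keep a pre-computed aggregate hot in the serving layer so dashboards
    # read a single Redis hash instead of querying the warehouse.
    cache.hincrbyfloat("revenue_by_category", order["category"], order["amount"])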

Key Components Comparison

Component      | Batch Stack                          | Streaming Stack
Ingestion      | Airbyte, Apache NiFi, custom scripts | Debezium CDC, direct producers
Storage        | Apache Iceberg, ClickHouse, Trino    | Kafka (event log), ClickHouse (OLAP)
Transformation | dbt, Spark, SQL                      | Flink, ksqlDB, Spark Structured Streaming
Orchestration  | Airflow, Dagster, Prefect            | Always-on (Kubernetes, managed Flink)
Serving        | Direct warehouse queries, caching    | Pre-computed views in Redis/ClickHouse

Best-in-Class Technology

Batch Stack

Role       | Tool    | Why
Transform  | dbt     | SQL-first, tested, documented
Query      | Trino   | Distributed SQL at scale
Format     | Iceberg | Open table format
Processing | Spark   | Petabyte-scale batch

Streaming Stack

Role        | Tool     | Why
Processing  | Flink    | Stateful, exactly-once
Event Bus   | Kafka    | Durable event logs
Alternative | Redpanda | Kafka-compatible, simpler
Stream SQL  | ksqlDB   | SQL over streams

Code Examples

Example 1: Daily Revenue by Category

dbt (Batch)

-- Batch: Simple, readable, testable
SELECT
    DATE(order_timestamp) AS order_date,
    category,
    SUM(amount) AS daily_revenue,
    COUNT(*) AS order_count
FROM {{ ref('orders') }}
WHERE order_timestamp >= DATEADD('day', -30, CURRENT_DATE())
GROUP BY 1, 2

Flink SQL (Streaming)

-- Streaming: Continuous, requires windowing
SELECT
    TUMBLE_START(order_time, INTERVAL '1' DAY) AS order_date,
    category,
    SUM(amount) AS daily_revenue,
    COUNT(*) AS order_count
FROM orders
GROUP BY
    TUMBLE(order_time, INTERVAL '1' DAY),
    category

Verdict: Batch Processing

Batch wins here on simplicity, testability via dbt tests, and trivial historical recomputation, unless revenue figures genuinely need to update in real time.

Example 2: Fraud Detection (Velocity Check)

PySpark (Batch)

# Batch: Run hourly - detects fraud after the fact
import math

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window

@F.udf(DoubleType())
def haversine_udf(lat1, lon1, lat2, lon2):
    # Great-circle distance in km; returns None when either point is missing
    if None in (lat1, lon1, lat2, lon2):
        return None
    dlat, dlon = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

user_window = Window.partitionBy("user_id").orderBy("timestamp")
# transactions: DataFrame with user_id, lat, lon, timestamp columns
flagged = (
    transactions
    .withColumn("prev_lat", F.lag("lat").over(user_window))
    .withColumn("prev_lon", F.lag("lon").over(user_window))
    .withColumn("prev_time", F.lag("timestamp").over(user_window))
    .withColumn("distance_km", haversine_udf("lat", "lon", "prev_lat", "prev_lon"))
    .withColumn("time_diff_min",
                (F.col("timestamp").cast("long") - F.col("prev_time").cast("long")) / 60)
    .filter((F.col("distance_km") > 500) & (F.col("time_diff_min") < 30))
)

Flink SQL (Streaming)

-- Streaming: Detect in real-time, block before damage
CREATE TABLE payments (
    txn_id STRING,
    user_id STRING,
    amount DECIMAL(10, 2),
    lat DOUBLE,
    lon DOUBLE,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH ('connector' = 'kafka', ...);

-- Detect impossible velocity: >500km in <30 min
-- (ST_Point / ST_Distance are assumed to be registered geospatial UDFs;
--  they are not built into Flink SQL)
INSERT INTO fraud_alerts
SELECT p1.user_id, p1.txn_id, p2.txn_id,
       ST_Distance(ST_Point(p1.lon, p1.lat), ST_Point(p2.lon, p2.lat)) / 1000 AS km
FROM payments p1, payments p2
WHERE p1.user_id = p2.user_id
  AND p1.event_time < p2.event_time
  AND p2.event_time < p1.event_time + INTERVAL '30' MINUTE
  AND ST_Distance(...) > 500000;

Verdict: Streaming

Streaming excels here—batch detects fraud hours post-incident while streaming blocks transactions within milliseconds, preventing significant financial losses.

Monitoring & Visibility

Batch Monitoring

  • Jobs produce binary outcomes (success/failure)
  • dbt tests catch data quality issues
  • Airflow displays DAG execution history
  • Clear audit trail exists per run

Streaming Monitoring

  • Continuous metrics track consumer lag, end-to-end latency, and checkpoint success rates (a lag check is sketched after the table below)
  • Requires always-on monitoring
  • Distributed debugging presents greater challenges

Metric          | Batch                            | Streaming
Data freshness  | "Last run: 2 hours ago"          | "Consumer lag: 50ms"
Pipeline health | Job success/failure rate         | Checkpoint success, backpressure
Data quality    | dbt tests, Great Expectations    | Schema registry, runtime validation
Lineage         | dbt docs, DataHub (mature)       | Event correlation IDs (manual)
Debugging       | Re-run failed job, inspect logs  | Distributed tracing, state inspection
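
As an illustration of what continuous lag monitoring looks like in code, a minimal check assuming the confluent-kafka client, a payments topic with three partitions, and a consumer group named payments-enricher (all illustrative names):

from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payments-enricher",
    "enable.auto.commit": False,
})

# Compare each partition's committed offset with the broker's latest offset
partitions = [TopicPartition("payments", p) for p in range(3)]
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low   # no commit yet for this group
    print(f"partition {tp.partition}: lag = {high - committed} messages")

consumer.close()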

Operational Complexity

Batch Operations

  • Simple retry logic — Failed jobs rerun with full recomputation as viable fallback; no state corruption risk
  • Mature scheduling — Airflow, Dagster, Prefect offer well-understood dependency management
  • Delayed detection — Issues surface only when the next job runs, hours or days after they occur
  • DAG dependency hell — Complex pipelines create cascading failures; slow jobs block downstream processes

Streaming Operations

  • Immediate detection — Consumer lag spikes visible instantly without waiting for scheduled runs
  • Always-on processing — No scheduling management; no batch windows; continuous data flow
  • State management complexity — Checkpoints, savepoints, and state backends all require care (a checkpointing sketch follows this list); corrupted state may necessitate a full rebuild from Kafka
  • Upgrade complexity — Stateful job upgrades demand savepoints; schema changes need careful coordination
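
For a feel of the state-management knobs mentioned above, a minimal PyFlink checkpointing sketch; the interval, pause, and timeout values are illustrative, not recommendations:

from pyflink.datastream import CheckpointingMode, StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Snapshot operator state every 60s so a crashed job resumes from the last
# completed checkpoint instead of reprocessing the whole topic.
env.enable_checkpointing(60_000, CheckpointingMode.EXACTLY_ONCE)

checkpoint_config = env.get_checkpoint_config()
checkpoint_config.set_min_pause_between_checkpoints(30_000)  # breathing room between snapshots
checkpoint_config.set_checkpoint_timeout(120_000)            # abort checkpoints that hang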

Cost Economics

When Batch is Cheaper

  • Infrequent processing (daily/weekly schedules)
  • Serverless warehouses (pay-per-query models)
  • Small data volumes (<1TB/day)
  • Ad-hoc analytics workloads
  • No always-on infrastructure required

When Streaming is Cheaper

  • High-volume continuous processing
  • Replacing multiple redundant batch jobs
  • Eliminating intermediate storage requirements
  • Real-time requirements where the batch alternative would mean heavily over-provisioned, tightly scheduled jobs

Decision Framework

Clear Wins for Batch

Historical Analytics

  • Ad-hoc queries spanning years of data
  • Complex joins across large datasets
  • Columnar storage optimization

ML Training

  • Feature engineering leverages historical data
  • Model training gains nothing from real-time processing
  • Simpler implementation

Compliance Reporting

  • Audit reports, regulatory filings, point-in-time snapshots
  • Full reproducibility required

Clear Wins for Streaming

Fraud Detection

  • Milliseconds are critical
  • Block fraudulent transactions before completion
  • Real-time processing is essential

Operational Monitoring

  • System health tracking, alerting, live dashboards
  • Stale data becomes useless
  • Continuous processing required

Event-Driven Systems

  • Microservices choreography
  • Real-time notifications
  • User-facing features requiring instant responsiveness

Risks & Failure Modes

Batch Risks

  • Stale data — Decisions based on outdated information
  • Delayed detection — Problems discovered hours later
  • Cascade failures — One slow job blocks everything downstream
  • Resource contention — Batch windows competing for compute resources

Streaming Risks

  • Silent data loss — Misconfigured consumers drop events undetected (see the config sketch after this list)
  • State corruption — Requires full rebuild from Kafka
  • Backpressure cascade — Slow consumer affects entire pipeline
  • Schema breaks — Producer changes break downstream consumers
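
A small sketch of how silent data loss creeps in through consumer configuration, assuming the confluent-kafka client; the topic and group names are placeholders and process() is a hypothetical handler:

from confluent_kafka import Consumer

# Risky defaults: a new consumer group with 'latest' silently skips everything
# already in the topic, and auto-commit can mark messages as done before the
# application has actually processed them.
risky = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-sink",
    "auto.offset.reset": "latest",
    "enable.auto.commit": True,
})

# Safer: start new groups from the earliest retained offset and commit only
# after a message has been durably handled.
safe = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-sink",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
safe.subscribe(["orders"])
msg = safe.poll(1.0)
if msg is not None and msg.error() is None:
    process(msg)                  # hypothetical handler
    safe.commit(message=msg)      # commit only after successful processing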

Mitigation Strategies

Risk                | Batch Mitigation                                       | Streaming Mitigation
Data quality issues | dbt tests, Great Expectations, post-run validation     | Schema registry, runtime validation, dead letter queues
Pipeline failures   | Retry policies, alerting on failure, SLAs with buffer  | Checkpointing, automatic restart, consumer lag alerting
Data loss           | Idempotent jobs, backup before overwrite               | Kafka retention, exactly-once semantics, idempotent sinks
Schema evolution    | dbt schema tests, migration scripts                    | Schema registry compatibility rules, versioned topics
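
To show the dead-letter-queue mitigation from the table in code, a minimal sketch assuming the confluent-kafka client, a payments topic of JSON events, and a payments.dlq topic for rejected records (all illustrative names):

import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payments-validator",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["payments"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        if event["amount"] < 0:
            raise ValueError("negative amount")
        # ...hand the valid event to the real pipeline here...
    except (ValueError, KeyError):
        # Route bad records to a dead-letter topic instead of dropping them silently
        producer.produce("payments.dlq", key=msg.key(), value=msg.value())
        producer.flush()
    consumer.commit(message=msg)

The key property is that every message is either processed or parked in the DLQ for inspection; nothing is dropped silently.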

The Bottom Line

Neither paradigm is universally superior. They optimize for different outcomes:

  • Batch optimizes for simplicity, completeness, and cost-effectiveness for periodic workloads
  • Streaming optimizes for latency, reactivity, and continuous processing

Most modern data platforms employ both paradigms. Stream for operational use cases prioritizing latency. Batch for analytical use cases emphasizing completeness. Let requirements drive architecture rather than ideology.
