Saga Pattern: Distributed Transactions for Microservices with Kafka

2026-04-19
NicheeLab Editorial Team

The Saga pattern is the standard way to maintain business consistency without relying on 2PC, but design and implementation that holds up in production demands a solid grasp of Kafka's official features.

Aligned with the CCDAK (Confluent Certified Developer for Apache Kafka) exam scope, this article summarizes Kafka's transactional API, Exactly-Once Semantics, topic/key design, and compensation design from a hands-on engineering perspective.

Saga Basics and Why Use Kafka

Saga achieves eventual consistency by combining a "series of local transactions" with "compensating actions on failure." There are two control styles: orchestration (a central coordinator directs the flow) and choreography (each service subscribes to events and autonomously triggers the next step). Kafka excels at durability, ordering, replay, and scaling — supporting robust implementations of either style.

From the CCDAK perspective, frequently tested topics include topic partitioning and ordering guarantees by key, producer idempotence and transactions, the producer's sendOffsetsToTransaction call in consume-process-produce flows, and Kafka Streams' exactly_once_v2. Baking these into the design from the start simplifies failure reprocessing and auditing.

  • Orchestration: a central Orchestrator dispatches commands. Observability and control are clear.
  • Choreography: each service progresses autonomously, triggered by events. Easier to keep coupling low.
  • Kafka's strengths: ordering and durability via append-only log, reprocessing, and scale. Provides an audit trail for compensation and review.
  • Exam essentials: where to correctly apply idempotence, transactional.id, EOS v2, and sendOffsetsToTransaction.

Approach | Consistency / Availability | Implementation / Operational Notes
Saga - Orchestration (Kafka) | Eventual consistency. High observability and control. | Orchestrator scaling/redundancy, centralized timeout and compensation management, separation of commands and events.
Saga - Choreography (Kafka) | Eventual consistency. Loosely coupled and easy to extend. | Event/schema evolution management, prevention of cycles and duplicate triggers, compensation implemented per service.
2PC / Distributed TX | Strong consistency, but availability and latency tend to suffer. | Risks of coordinator failure and blocking. Poor fit for microservices.
No compensation (anti-pattern) | Inconsistency remains when failures occur. | Cannot guarantee business consistency. Do not adopt.

Example of Saga on Kafka (choreography)

[Figure: choreography on a Kafka cluster. Order Service emits OrderEvt; Payment Service consumes OrderEvt; Shipping Service consumes PayEvt; each service emits the next events/commands. Topics: order-events, payment-events, shipping-events, and an optional payment-cmd.]

Event/Topic Design and Key Strategy

The basis of Saga is to record "events of facts" and decide the next action from them. In Kafka, separate events and commands logically, and use keys to secure ordering and locality. Kafka guarantees ordering only within a partition, so use the same business entity ID (e.g., orderId) as the key for every message of a Saga: a stable key routes all of those messages to the same partition.
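
To make the key-to-partition relationship concrete, here is a minimal sketch. It is a simplification: the Java client's default partitioner hashes the serialized key bytes with murmur2, whereas this sketch substitutes String.hashCode purely for illustration. The class and method names are hypothetical.

```java
final class KeyPartitioningSketch {
    // Simplified stand-in for a hash partitioner: the same key always maps to the
    // same partition, as long as the partition count does not change.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Every message keyed with the same orderId lands in one partition and is therefore read in the order it was written. Changing the partition count changes the mapping, which is one reason to size ordering-critical topics up front.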

Default to backward compatibility for schemas, and avoid breaking changes. With Schema Registry, compatibility is checked when a new schema version is registered, so schema evolution becomes enforceable rather than a convention; this also aligns with what CCDAK tests (schema compatibility modes). Topics that use log compaction (compact) are well-suited to snapshot semantics, and can be applied to project Saga state.

  • Events represent past facts (OrderCreated, PaymentAuthorized, etc.)
  • Commands represent intent (AuthorizePayment, ReserveInventory, etc.)
  • Keep keys consistent at the business entity ID level to avoid breaking ordering.
  • Attach an event ID (UUID) to events to defend against duplicate triggers.
  • Separate topics by purpose (-events vs. -commands).
  • When needed, set cleanup.policy=compact,delete so that a topic is compacted per key and old segments are still removed by retention.
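
The event-ID defense in the list above can be sketched as follows. This is a minimal in-memory illustration, assuming a hypothetical DedupingHandler class; a real service would keep the set of processed IDs in a durable store that is updated together with the business change.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

final class DedupingHandler {
    private final Set<UUID> seen = new HashSet<>();      // in production: a durable store
    private final List<String> applied = new ArrayList<>();

    /** Applies the event once; returns false when the event ID was already processed. */
    boolean handle(UUID eventId, String payload) {
        if (!seen.add(eventId)) {
            return false;                                // duplicate delivery: skip side effects
        }
        applied.add(payload);                            // stand-in for the real business effect
        return true;
    }

    List<String> applied() {
        return applied;
    }
}
```

Delivering the same event twice leaves exactly one entry in applied(), which is the property a choreographed Saga needs to avoid triggering a step twice.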

Foundations of Consistency: Outbox + Transactional Producer + EOS

Use the Outbox pattern to guarantee consistency between in-service DB updates and Kafka sends. The app commits the business row and the Outbox row in the same local transaction, and a component polling the Outbox (or CDC/Connect) delivers them to Kafka. This eliminates the dual-write problem between the DB and Kafka: the service never has to update two systems atomically.
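
A minimal in-memory sketch of the idea follows. The OutboxSketch class, its fields, and the publisher callback are hypothetical stand-ins: a synchronized method mimics the local DB transaction, and the callback stands in for the Kafka send.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

final class OutboxSketch {
    record OutboxRow(long id, String aggregateId, String payload) {}

    private final Map<String, String> orders = new HashMap<>(); // business table
    private final Deque<OutboxRow> outbox = new ArrayDeque<>(); // outbox table
    private long nextId = 1;

    // Both "tables" change inside one method, mimicking a single local transaction.
    synchronized void createOrder(String orderId, String payload) {
        orders.put(orderId, "CREATED");
        outbox.addLast(new OutboxRow(nextId++, orderId, payload));
    }

    // Relay: publish pending rows in insertion order. A row is removed only after
    // the publisher returns, so a failed publish leaves it in place for a retry.
    synchronized int relay(Consumer<OutboxRow> publisher) {
        int sent = 0;
        while (!outbox.isEmpty()) {
            publisher.accept(outbox.peekFirst());
            outbox.removeFirst();
            sent++;
        }
        return sent;
    }

    synchronized int pending() {
        return outbox.size();
    }

    synchronized String orderStatus(String orderId) {
        return orders.get(orderId);
    }
}
```

The relay is at-least-once: if it crashes after publishing but before removing the row, the row is published again on restart. That is why consumers should still deduplicate by event ID.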

Enable idempotence on the Kafka producer, and use transactions where needed. With a transaction, you can atomically group writes across multiple topics/partitions with consumer offset commits. When you set exactly_once_v2 on Kafka Streams, it manages transactions internally and suppresses duplicates and losses across input processing and output.

  • Producer settings: enable.idempotence=true (which requires acks=all), and a transactional.id that is unique per producer instance and stable across restarts.
  • In consumer-to-producer flows, use sendOffsetsToTransaction to bundle output and offsets into the same transaction.
  • processing.guarantee=exactly_once_v2 on Kafka Streams is the current recommendation.
  • The Outbox is a simple table holding JSON/Avro payloads. Always write to it in the same DB transaction as the business change.
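
As a configuration sketch, the settings above might be assembled like this. The SagaClientConfigs class and its method names are hypothetical, serializer settings are omitted for brevity, and the consumer side adds isolation.level=read_committed so that records from aborted transactions are not delivered.

```java
import java.util.Properties;

final class SagaClientConfigs {
    static Properties transactionalProducer(String bootstrapServers, String transactionalId) {
        Properties p = new Properties();
        p.put("bootstrap.servers", bootstrapServers);
        p.put("enable.idempotence", "true");
        p.put("acks", "all");                        // required when idempotence is enabled
        p.put("transactional.id", transactionalId);  // keep stable across restarts
        return p;
    }

    static Properties readCommittedConsumer(String bootstrapServers, String groupId) {
        Properties p = new Properties();
        p.put("bootstrap.servers", bootstrapServers);
        p.put("group.id", groupId);
        p.put("enable.auto.commit", "false");        // offsets go through the producer transaction
        p.put("isolation.level", "read_committed");  // skip records from aborted transactions
        return p;
    }
}
```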

Compensation, Retries, and Ordering on Failure

Saga assumes failures will happen. Set timeouts on each step, and record failures and expirations as factual events. Design compensation as business-reversible actions and make them idempotent (running the same compensation any number of times must be safe).
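
An idempotent compensation can be sketched as a state check before the reversal. The PaymentCompensation class below is a hypothetical in-memory illustration, not a payment API: cancelling twice issues only one refund.

```java
import java.util.HashMap;
import java.util.Map;

final class PaymentCompensation {
    enum Status { AUTHORIZED, CANCELLED }

    private final Map<String, Status> payments = new HashMap<>();
    private int refundsIssued = 0;

    void authorize(String orderId) {
        payments.put(orderId, Status.AUTHORIZED);
    }

    /** Cancels the payment for an order; repeated calls are no-ops and return false. */
    boolean cancel(String orderId) {
        if (payments.get(orderId) != Status.AUTHORIZED) {
            return false;                 // unknown or already cancelled: nothing to undo
        }
        payments.put(orderId, Status.CANCELLED);
        refundsIssued++;                  // stand-in for the real refund side effect
        return true;
    }

    int refundsIssued() {
        return refundsIssued;
    }
}
```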

Make retries operable with exponential backoff plus a dead-letter queue (DLQ). When ordering matters, design for sequential processing within the same key, while ensuring an error-induced stop does not block the entire system.

  • Make compensation explicit through separate commands/events (CancelPayment, ReleaseInventory, etc.)
  • Event-source compensation sequences just like normal sequences, so they remain traceable.
  • Propagating retry counts and deadlines via message headers keeps the implementation simple.
  • Attach failure context to the DLQ (exception, stack summary, original message key) to ensure observability.
  • For ordering-critical workloads, use the same key so that messages share a partition. Within a consumer group each partition is read by exactly one consumer, so those messages are processed sequentially.
  • Use an idempotency key (Idempotency-Key) on external API calls to defend against double execution.
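
The retry rules above can be sketched as two small pure functions. The RetryPolicy class, its parameters, and the convention that the attempt count and deadline travel in message headers are assumptions for illustration.

```java
final class RetryPolicy {
    enum Route { RETRY, DLQ }

    /** Delay before the next retry: base * 2^attempt, capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.min(Math.max(attempt, 0), 30);  // bound the exponent to avoid overflow
        return Math.min(maxMillis, baseMillis * (1L << shift));
    }

    /** Sends a message to the DLQ once retries are exhausted or its deadline has passed. */
    static Route route(int attempt, int maxAttempts, long deadlineEpochMillis, long nowEpochMillis) {
        boolean exhausted = attempt >= maxAttempts;
        boolean expired = nowEpochMillis >= deadlineEpochMillis;
        return (exhausted || expired) ? Route.DLQ : Route.RETRY;
    }
}
```

With a 100 ms base and a 10 s cap, attempts 0, 1, 2, 3 wait 100, 200, 400, and 800 ms, and later attempts stay at the cap. In practice some random jitter is often added so that retries from many consumers do not align.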

Implementation Example: State Management with an Orchestrator and Kafka Streams

In the orchestrator style, you keep the Saga state machine in a state store, and when you receive a result event for each step, you emit the next command. With Kafka Streams, you can manage routing between topics and state transitions within a single topology, and exactly_once_v2 suppresses duplicates and partial failures.

Below is a minimal example that takes an order event and sequentially progresses through payment → inventory → shipping. Production deployments add timeout management, compensation transitions, audit logs, DLQ, and more.

  • Always set processing.guarantee=exactly_once_v2.
  • State store: key = orderId, value = SAGA state (current step, deadline, correlation ID, etc.)
  • Make external system calls asynchronous via command topics, and advance the state on the resulting events.
  • For compensation, emit dedicated commands and record success/failure factually as events.

A minimal orchestrator with Kafka Streams (Java)

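import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Branched;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

// OrderEvent, PaymentEvent, Command, CommandType, SagaStepResult, SagaOrchestrator and the
// *Serde() helpers are application types that this excerpt does not define.

Properties p = new Properties();
p.put(StreamsConfig.APPLICATION_ID_CONFIG, "saga-orchestrator");
p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
p.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

StreamsBuilder b = new StreamsBuilder();
// Saga state per order: key = orderId, value = serialized Saga state.
StoreBuilder<KeyValueStore<String, String>> store = Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("saga-store"),
    Serdes.String(), Serdes.String());
b.addStateStore(store);

KStream<String, OrderEvent> orders = b.stream("order-events",
    Consumed.with(Serdes.String(), orderEventSerde()));

// Advance the Saga for each order event and emit the commands for the next step.
// Note: transformValues is deprecated since Kafka 3.3 in favor of processValues.
KStream<String, Command> cmds = orders
    .transformValues(() -> new SagaOrchestrator("saga-store"), "saga-store")
    .flatMapValues((SagaStepResult r) -> r.outgoingCommands());

// Route each command type to its own topic. branch() returns a BranchedKStream,
// so the sink is attached via Branched.withConsumer instead of chaining to().
cmds.split()
    .branch((k, c) -> c.type() == CommandType.AUTH_PAYMENT,
        Branched.withConsumer(ks -> ks.to("payment-commands",
            Produced.with(Serdes.String(), commandSerde()))))
    .branch((k, c) -> c.type() == CommandType.RESERVE_STOCK,
        Branched.withConsumer(ks -> ks.to("inventory-commands",
            Produced.with(Serdes.String(), commandSerde()))))
    .noDefaultBranch();

// Receive result events (payment results, etc.) on a separate stream and advance the state.
KStream<String, PaymentEvent> pay = b.stream("payment-events",
    Consumed.with(Serdes.String(), paymentEventSerde()));

pay.process(() -> new Processor<String, PaymentEvent, Void, Void>() {
  private KeyValueStore<String, String> kv;
  @Override public void init(ProcessorContext<Void, Void> ctx) {
    kv = ctx.getStateStore("saga-store");
  }
  @Override public void process(Record<String, PaymentEvent> record) {
    // Update the Saga state and build the next command (omitted).
  }
}, "saga-store");

KafkaStreams s = new KafkaStreams(b.build(), p);
Runtime.getRuntime().addShutdownHook(new Thread(s::close));
s.start();

// Note: Kafka Streams manages transactions internally to provide EOS v2.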
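
The state update that the orchestrator example leaves out could be organized as a pure transition function. The sketch below is one possible shape, not the article's implementation: the SagaStateMachine class, its states, and its event and command names are hypothetical, and the flow (payment, then stock, then shipping, with payment cancellation as compensation) follows the sequence described earlier.

```java
import java.util.Optional;

final class SagaStateMachine {
    enum State { AWAITING_PAYMENT, AWAITING_STOCK, AWAITING_SHIPMENT, COMPENSATING, COMPLETED, FAILED }
    enum Event { PAYMENT_AUTHORIZED, PAYMENT_FAILED, STOCK_RESERVED, STOCK_FAILED, SHIPPED, PAYMENT_CANCELLED }
    enum Command { RESERVE_STOCK, SHIP_ORDER, CANCEL_PAYMENT }

    record Transition(State next, Optional<Command> command) {}

    /** Returns the next state and, if any, the command to emit for the given event. */
    static Transition on(State current, Event event) {
        switch (current) {
            case AWAITING_PAYMENT:
                if (event == Event.PAYMENT_AUTHORIZED) return to(State.AWAITING_STOCK, Command.RESERVE_STOCK);
                if (event == Event.PAYMENT_FAILED) return to(State.FAILED, null);
                break;
            case AWAITING_STOCK:
                if (event == Event.STOCK_RESERVED) return to(State.AWAITING_SHIPMENT, Command.SHIP_ORDER);
                if (event == Event.STOCK_FAILED) return to(State.COMPENSATING, Command.CANCEL_PAYMENT);
                break;
            case AWAITING_SHIPMENT:
                if (event == Event.SHIPPED) return to(State.COMPLETED, null);
                break;
            case COMPENSATING:
                if (event == Event.PAYMENT_CANCELLED) return to(State.FAILED, null);
                break;
            default:
                break;
        }
        return to(current, null); // duplicate or out-of-order event: keep state, emit nothing
    }

    private static Transition to(State next, Command command) {
        return new Transition(next, Optional.ofNullable(command));
    }
}
```

Keeping the transition logic free of Kafka types makes it easy to unit-test, and ignoring unexpected events keeps redelivered results from advancing a Saga twice.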

Operations, Monitoring, and Exam-Prep Essentials

Saga succeeds or fails based on "whether you can observe it." Emit events for the start, success, failure, and compensation of each step, and make them traceable via correlation IDs. Continuously monitor metrics: consumer lag, retry rates, DLQ counts, and transaction abort rates.

For CCDAK prep, lock in transaction boundaries, offset-commit consistency, EOS prerequisites, ordering guarantees with keys and partitions, compaction and retention policies, and schema compatibility.

  • Retention strategy: use delete retention for events, and consider compact for the latest state or state projections.
  • Schema compatibility: default to backward compatibility, and move breaking changes to a new topic.
  • Failure recovery: keep duplicates down with an idempotent producer and EOS, and also make the sink idempotent.
  • Transactions: keep each transaction well within transaction.timeout.ms so it is not aborted mid-processing, and keep transactional.id stable across restarts so that stale producer instances are fenced.
  • Security: carry correlation IDs and audit metadata in headers, and tokenize PII.
  • Testing: inject synthetic failures (delays, duplicates, reordering) and verify that compensation is safe to re-run.

Check with a Question

Question 1 (CCDAK)

In a Kafka-based Saga implementation, you want to process input events, write to multiple topics, and atomically commit offsets in the same processing unit. Which is the correct implementation?

  1. Use an Idempotent + Transactional producer, in this order: beginTransaction → send to multiple topics → sendOffsetsToTransaction → commitTransaction
  2. Set enable.auto.commit=true and call producer.flush() after processing — output and offsets will line up atomically on their own
  3. Writing to the same partition is naturally atomic, so a transaction is unnecessary
  4. Calling consumer commitSync() and producer send() sequentially on the same thread is sufficient

Correct answer: 1

Kafka transactions can commit writes across multiple topics/partitions and consumer offsets within a single atomic boundary. The correct order is beginTransaction → send records → sendOffsetsToTransaction (passing the consumer group metadata, available from consumer.groupMetadata()) → commitTransaction. enable.auto.commit alone does not guarantee atomicity, and sequential calls can still produce inconsistencies on failure.
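
The call order in the answer can be sketched against a small stand-in interface. TxProducer below is hypothetical: it mirrors the method names of Kafka's transactional producer but simplifies the argument types so the example stays self-contained. With the real client, initTransactions() is called once before the first transaction.

```java
import java.util.List;
import java.util.Map;

final class TransactionalStepSketch {
    /** Hypothetical stand-in that mirrors the transactional producer's method names. */
    interface TxProducer {
        void beginTransaction();
        void send(String topic, String key, String value);
        void sendOffsetsToTransaction(Map<String, Long> offsets, String groupMetadata);
        void commitTransaction();
        void abortTransaction();
    }

    record Output(String topic, String key, String value) {}

    /** Writes all outputs and the consumed offsets in one transaction; aborts on failure. */
    static boolean processBatch(TxProducer producer, List<Output> outputs,
                                Map<String, Long> nextOffsets, String groupMetadata) {
        producer.beginTransaction();
        try {
            for (Output o : outputs) {
                producer.send(o.topic(), o.key(), o.value());
            }
            producer.sendOffsetsToTransaction(nextOffsets, groupMetadata);
            producer.commitTransaction();
            return true;
        } catch (RuntimeException e) {
            producer.abortTransaction(); // outputs and offsets are discarded together
            return false;
        }
    }
}
```

With the real client, some errors (for example, a fenced producer) are fatal and call for closing the producer rather than aborting, so production code distinguishes those cases.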

Frequently Asked Questions

Are Kafka transactions a replacement for distributed transactions (2PC)?

Kafka transactions provide atomicity for writes across multiple Kafka partitions/topics and consumer offset commits. They do not bundle an external database and Kafka into a single distributed transaction. For external systems, combine patterns such as Outbox or CDC and rely on business-level compensation to achieve eventual consistency.

Does Exactly-Once Semantics (EOS) really guarantee "once and only once"?

Kafka Streams' exactly_once_v2 and producer transactions provide consistency between processing, output, and offset commits within the Kafka boundary. However, if the sink (external DB/API) is not idempotent, you must prevent duplicates on the outside. EOS suppresses duplicates and losses inside Kafka — it does not unconditionally guarantee "absolutely once" across the entire system.

Compensation logic is complex and hard to design. Where should I start?

First, enumerate business invariants and define the "smallest reversible action" when each step is broken. Treat compensation like the normal flow — turn it into events, make it observable, and make it safe to re-run with an idempotency key. Clarifying timeouts, maximum retry counts, and setting up a DLQ path for manual intervention will move the design forward.
