Kafka Error Handling Patterns: Retries, DLQ, and Compensation

2026-04-19
NicheeLab Editorial Team

Failures in Kafka are not unusual. Network jitter, schema mismatches, and brief downstream outages all happen. This article builds on stable concepts from the official documentation to organize how you design and operate retries, DLQs, and compensation logic.

As CCDAK preparation, we also cover the key points around delivery guarantees, transactions (EOS), offset commits, and per-partition ordering.

Classifying Failures and Design Principles

Start by classifying failures. Transient errors that clear quickly (network blips, throttling) call for different handling than permanent errors caused by the data itself (schema mismatches, validation failures, unexpected formats).

Next, separate business errors from technical errors. Technical errors are often recoverable via retries and backoff; business errors should be handled explicitly via DLQ or compensation.
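The classification above can be sketched as a small helper. This is a minimal illustration, not a definitive mapping: the class name FailureClassifier and the choice of which exception types count as transient or permanent are assumptions made for this example.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

// Hypothetical classifier: the exception-to-category mapping below is an
// assumption for illustration; adapt it to the exceptions your code throws.
final class FailureClassifier {
  enum Kind { TRANSIENT, PERMANENT }

  static Kind classify(Throwable t) {
    // I/O problems and timeouts usually clear on their own, so they are retryable.
    if (t instanceof IOException || t instanceof TimeoutException) {
      return Kind.TRANSIENT;
    }
    // Bad input fails the same way on every attempt.
    if (t instanceof IllegalArgumentException || t instanceof ClassCastException) {
      return Kind.PERMANENT;
    }
    // Unknown failures are treated as permanent here so they surface in the
    // DLQ instead of looping through retries.
    return Kind.PERMANENT;
  }

  public static void main(String[] args) {
    System.out.println(classify(new java.net.SocketTimeoutException("read timed out")));
    System.out.println(classify(new NumberFormatException("not a number")));
  }
}
```

Treating unknown exceptions as permanent is one possible default; a team that prefers to retry unknown failures a bounded number of times could flip that branch.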

Kafka provides message ordering on a per-partition basis. Always evaluate the impact of error handling on ordering — route records sharing a key to the same partition, restrict in-flight requests when needed, and make sure retry topics share the same partition count and partitioner.

  • Make the cause observable first (e.g., put the exception class, original topic/partition/offset into headers)
  • Retry transient errors with backoff; route permanent errors to the DLQ early
  • Keep order-sensitive keys on the same partition, and apply the same partitioning strategy to retry topics
  • Set enable.idempotence=true on the producer to prevent duplicates from resends; if downstream processing is not idempotent, also consider transactions for exactly-once within Kafka
  • Commit offsets only after results are durably persisted (at-least-once). For EOS, use sendOffsetsToTransaction
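To illustrate why the same partitioner and partition count keep a key on the same partition number across the main and retry topics, here is a minimal sketch. The hash is a stand-in chosen for this example (Kafka's built-in partitioner uses a different hash function), and KeyPartitionSketch and partitionFor are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Stand-in for a key-based partitioner. Arrays.hashCode is used only for
// illustration; it is NOT the hash Kafka's own partitioner applies.
final class KeyPartitionSketch {
  static int partitionFor(String key, int numPartitions) {
    int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
    // floorMod keeps the result in [0, numPartitions) even for negative hashes.
    return Math.floorMod(hash, numPartitions);
  }

  public static void main(String[] args) {
    String key = "order-42"; // hypothetical record key
    // Same key, same partition count: main topic and retry topic agree.
    System.out.println(partitionFor(key, 6) == partitionFor(key, 6));
    // With a different partition count, the partition number may differ.
    System.out.println(partitionFor(key, 6) + " vs " + partitionFor(key, 4));
  }
}
```

Because the function depends only on the key bytes and the partition count, any topic created with the same count and fed through the same function places a given key on the same partition number.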

Choosing the Right Retry Strategy

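Producer retries in Kafka are effective for transient broker or network failures. Combined with enable.idempotence=true, the client itself prevents duplicates from resends, and ordering within a partition is preserved with up to 5 in-flight requests per connection. Capping max.in.flight.requests.per.connection at 1 is the conservative choice, and it is the setting to use for strict ordering when idempotence is disabled.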
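The producer settings discussed here can be collected in a plain Properties object. This is a sketch under stated assumptions: the helper name is hypothetical, the broker address is a placeholder, and the delivery.timeout.ms value is an example rather than a recommendation. The keys are the standard producer configuration names.

```java
import java.util.Properties;

final class ProducerSettingsSketch {
  // Builds settings for an idempotent producer with the conservative
  // in-flight cap of 1. The values are examples, not tuned recommendations.
  static Properties orderedIdempotentProducer(String bootstrapServers) {
    Properties p = new Properties();
    p.put("bootstrap.servers", bootstrapServers);
    p.put("acks", "all");                                // wait for all in-sync replicas
    p.put("enable.idempotence", "true");                 // broker drops duplicate resends
    p.put("max.in.flight.requests.per.connection", "1"); // conservative ordering cap
    p.put("delivery.timeout.ms", "120000");              // assumption: tune to your SLA
    return p;
  }

  public static void main(String[] args) {
    // "broker:9092" is a placeholder address.
    orderedIdempotentProducer("broker:9092").forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```

The resulting Properties can be passed to a KafkaProducer constructor together with key and value serializers.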

Consumer-side retry is the application's responsibility. Inline retries block the partition while running, hurting throughput. In practice, the most stable pattern is an N-stage chain of retry topics (retry-1, retry-2, ...) with progressively longer backoffs. Records that ultimately cannot be processed go to the DLQ. Kafka does not natively provide a "delay queue", so delays are expressed via dedicated retry topics and consumption interval control.

  • Example recommended producer settings: acks=all, enable.idempotence=true, max.in.flight.requests.per.connection=1 (values up to 5 also preserve ordering while idempotence is on), with delivery.timeout.ms tuned to your business SLA
  • Keep inline consumer retries small in count; route permanent errors to the DLQ promptly
  • Use the same partition count and partitioner on retry topics as the source topic to preserve key ordering
  • Increase backoff progressively (exponential / linear) and avoid infinite retries

Pattern | Applies to | Primary purpose | Strength
Producer retries | Send failures (technical) | Improved delivery success rate | Minimizes duplicates when combined with idempotence
Consumer inline retry | Processing failures (technical / some business) | Immediate retry | Easy to implement
Retry topics | Processing failures (mostly transient) | Staged backoff and isolation | Keeps main-stream throughput intact
DLQ | Unprocessable (permanent) | Quarantining damaged records | Prevents head-of-line blocking
Compensation | Business inconsistencies | After-the-fact correction | Restores consistency across services
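
The staged retry chain with progressively longer backoff can be sketched as two small functions. The number of stages, the base delay, and the cap are assumptions chosen for illustration, and the topic naming (<topic>.retry.N, <topic>.DLQ) is one possible convention.

```java
import java.time.Duration;

final class RetryPlanSketch {
  static final int MAX_RETRY_STAGES = 2;                 // assumption: retry.1 and retry.2
  static final Duration BASE = Duration.ofSeconds(5);    // assumption: first-stage delay
  static final Duration CAP = Duration.ofMinutes(5);     // assumption: upper bound

  // Exponential backoff: BASE * 2^(stage - 1), capped at CAP.
  static Duration backoffFor(int stage) {
    if (stage < 1) throw new IllegalArgumentException("stage must be >= 1");
    long factor = 1L << Math.min(stage - 1, 30); // bounded shift avoids overflow
    Duration d = BASE.multipliedBy(factor);
    return d.compareTo(CAP) > 0 ? CAP : d;
  }

  // attempt = number of failures so far, including the current one.
  // Attempts beyond the last retry stage go to the DLQ, so retries are bounded.
  static String nextTopic(String mainTopic, int attempt) {
    return attempt <= MAX_RETRY_STAGES ? mainTopic + ".retry." + attempt : mainTopic + ".DLQ";
  }

  public static void main(String[] args) {
    for (int attempt = 1; attempt <= 3; attempt++) {
      System.out.println(attempt + " -> " + nextTopic("orders", attempt));
    }
    System.out.println(backoffFor(1) + ", " + backoffFor(2) + ", " + backoffFor(10));
  }
}
```

Capping the backoff and bounding the number of stages together ensure that no record cycles indefinitely.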

Retry topic and DLQ flow (preserving key order)

[Figure: main-topic → Consumer (ok → downstream; on error ↓) → retry-1-topic (same partition design) → Retry G1 (delay / throttle) → retry-2-topic → Retry G2 → DLQ]
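
Because Kafka has no delay queue, a retry-stage consumer has to decide for itself whether a record is due yet. A minimal sketch of that check follows; RetryDelayGate and remainingDelay are hypothetical names, and the sketch assumes the time the record was written to the retry topic is available (for example from the record timestamp).

```java
import java.time.Duration;
import java.time.Instant;

final class RetryDelayGate {
  // Returns how long a retry consumer should still wait before processing a
  // record that was written to the retry topic at enqueuedAt.
  static Duration remainingDelay(Instant enqueuedAt, Duration backoff, Instant now) {
    Instant dueAt = enqueuedAt.plus(backoff);
    return now.isBefore(dueAt) ? Duration.between(now, dueAt) : Duration.ZERO;
  }

  public static void main(String[] args) {
    Instant enqueuedAt = Instant.now().minusSeconds(10);
    System.out.println(remainingDelay(enqueuedAt, Duration.ofSeconds(30), Instant.now()));
  }
}
```

In a real consumer the wait should not simply block the poll loop for the whole delay. A common approach is to pause the partition and keep calling poll() until the record is due, so the consumer stays within max.poll.interval.ms.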

Key Points for DLQ Design and Implementation

A DLQ is not a built-in Kafka feature; it is implemented as a regular topic. Use a naming convention that is immediately recognizable, such as <original-topic>.DLQ or <domain>.dead-letter. Set retention longer than the main stream and prepare monitoring and reprocessing tooling (manual / batch / dedicated UI). The cleanup policy should usually be delete. Avoid compaction, which keeps only the latest record per key and would discard earlier failed records that share a key, unless your design deliberately relies on that behavior.

To preserve ordering, match the partition count and partitioner of the DLQ and retry topics to the source topic. Carry the original metadata and error context in headers so reprocessing can reconstruct context. Avoid carelessly placing PII or secrets into headers or values, and design auditing and access controls deliberately.

  • Recommended headers: x-original-topic, x-original-partition, x-original-offset, x-exception-class, x-ex-message, x-attempt
  • Define a DLQ-specific "envelope" schema (original payload + metadata)
  • Make the reprocessing policy explicit (re-injection target, max attempts, whether manual approval is required)
  • Always have DLQ volume monitoring and threshold alerts in place
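
The recommended headers can be assembled in one place so that every retry and DLQ producer writes the same set. The sketch below uses a plain map rather than the Kafka header classes; the 256-character limit on the exception message is an assumption, added to keep headers small and to reduce the chance of copying sensitive payload fragments into them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DlqHeadersSketch {
  static final int MAX_MESSAGE_LENGTH = 256; // assumption: keeps headers small

  // Builds the header set for a record being forwarded to a retry topic or DLQ.
  static Map<String, String> build(String topic, int partition, long offset,
                                   Throwable error, int attempt) {
    Map<String, String> h = new LinkedHashMap<>();
    h.put("x-original-topic", topic);
    h.put("x-original-partition", Integer.toString(partition));
    h.put("x-original-offset", Long.toString(offset));
    h.put("x-exception-class", error.getClass().getName());
    h.put("x-ex-message", truncate(error.getMessage()));
    h.put("x-attempt", Integer.toString(attempt));
    return h;
  }

  // Null-safe truncation of the exception message.
  static String truncate(String message) {
    if (message == null) return "";
    return message.length() <= MAX_MESSAGE_LENGTH
        ? message
        : message.substring(0, MAX_MESSAGE_LENGTH);
  }

  public static void main(String[] args) {
    System.out.println(build("orders", 3, 42L, new IllegalStateException("boom"), 2));
  }
}
```

Truncation alone does not remove sensitive data; if exception messages can contain PII, redact them before they reach this step.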

Sending to DLQ / retry topics from a Java client (simplified)

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.header.internals.RecordHeader;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.*;

public class RetryAndDlqExample {
  static String MAIN = "orders";
  static String RETRY_PREFIX = MAIN + ".retry."; // yields orders.retry.1, orders.retry.2, etc.
  static String DLQ = MAIN + ".DLQ";

  public static void main(String[] args) throws Exception { // send(...).get() throws checked exceptions
    Properties cp = new Properties();
    cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
    cp.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-worker");
    cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
    KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cp, new ByteArrayDeserializer(), new ByteArrayDeserializer());
    consumer.subscribe(Collections.singletonList(MAIN));

    Properties pp = new Properties();
    pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
    pp.put(ProducerConfig.ACKS_CONFIG, "all");
    pp.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
    pp.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
    pp.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");
    KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pp, new ByteArraySerializer(), new ByteArraySerializer());

    while (true) {
      ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
      for (ConsumerRecord<byte[], byte[]> r : records) {
        try {
          // Business logic goes here (e.g., validation / downstream call / DB upsert)
          process(r);
          // Commit the offset explicitly only on success
          consumer.commitSync(Collections.singletonMap(new TopicPartition(r.topic(), r.partition()), new OffsetAndMetadata(r.offset() + 1)));
        } catch (Exception e) {
          int attempt = headerInt(r, "x-attempt", 0) + 1;
          String nextTopic = attempt <= 2 ? RETRY_PREFIX + attempt : DLQ;
          ProducerRecord<byte[], byte[]> pr = new ProducerRecord<>(nextTopic, r.key(), r.value());
          pr.headers().add(new RecordHeader("x-original-topic", r.topic().getBytes(StandardCharsets.UTF_8)));
          pr.headers().add(new RecordHeader("x-original-partition", Integer.toString(r.partition()).getBytes(StandardCharsets.UTF_8)));
          pr.headers().add(new RecordHeader("x-original-offset", Long.toString(r.offset()).getBytes(StandardCharsets.UTF_8)));
          pr.headers().add(new RecordHeader("x-exception-class", e.getClass().getName().getBytes(StandardCharsets.UTF_8)));
          pr.headers().add(new RecordHeader("x-ex-message", Optional.ofNullable(e.getMessage()).orElse("").getBytes(StandardCharsets.UTF_8)));
          pr.headers().add(new RecordHeader("x-attempt", Integer.toString(attempt).getBytes(StandardCharsets.UTF_8)));
          producer.send(pr).get(); // synchronous send, for simplicity
          // Advance the failed record's offset as "processed" so the main stream is not blocked
          consumer.commitSync(Collections.singletonMap(new TopicPartition(r.topic(), r.partition()), new OffsetAndMetadata(r.offset() + 1)));
        }
      }
    }
  }

  static void process(ConsumerRecord<byte[], byte[]> r) {
    // Dummy: business logic that may throw an exception
  }

  static int headerInt(ConsumerRecord<byte[], byte[]> r, String key, int dflt) {
    var h = r.headers().lastHeader(key);
    if (h == null) return dflt;
    try { return Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8)); } catch (Exception e) { return dflt; }
  }
}

Compensation, SAGA, and the Outbox Pattern

Compensation is about designing steps (cancellation / reversal) that retroactively undo a business inconsistency. The SAGA pattern keeps consistency without distributed transactions by having each step asynchronously publish success or compensation events.

The outbox pattern is effective for keeping DB updates and event publishing atomic. The application writes business data and an outbox row in the same DB transaction, and a CDC connector reliably forwards those rows to Kafka. When compensation is required, you publish a new compensation event and apply it downstream. Note that Kafka transactions (EOS) provide read-write atomicity within Kafka, not end-to-end atomicity with external systems.

  • Treat compensation as a separate event and ensure downstream consumers can apply it idempotently
  • Use the outbox to lock in the "write first, then publish" order and avoid the dual-write problem
  • Guarantee ordering and uniqueness of compensation events (key design, dedup)
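
Idempotent application of compensation events can be sketched with a set of already-applied event IDs. This is an in-memory illustration only: CompensationLedger and its event IDs are hypothetical, and a real consumer would persist the dedup state together with the business data so that both survive a restart.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory sketch: dedup state and balances would live in a durable store in practice.
final class CompensationLedger {
  private final Set<String> appliedEventIds = new HashSet<>();
  private final Map<String, Long> balanceByAccount = new HashMap<>();

  // Applies a signed amount once per eventId; returns false for a duplicate delivery.
  boolean apply(String eventId, String account, long amount) {
    if (!appliedEventIds.add(eventId)) {
      return false; // already applied, so a redelivered event has no further effect
    }
    balanceByAccount.merge(account, amount, Long::sum);
    return true;
  }

  long balance(String account) {
    return balanceByAccount.getOrDefault(account, 0L);
  }

  public static void main(String[] args) {
    CompensationLedger ledger = new CompensationLedger();
    ledger.apply("evt-1", "acct-1", -100);            // original debit
    ledger.apply("evt-1-compensation", "acct-1", 100); // compensation event
    ledger.apply("evt-1-compensation", "acct-1", 100); // duplicate delivery, ignored
    System.out.println(ledger.balance("acct-1"));
  }
}
```

The compensation is modeled as a new event with its own ID rather than as a deletion of the original, which matches the "treat compensation as a separate event" guideline.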

Monitoring, Alerting, and Operational Pitfalls

Monitor retries and the DLQ as indicators of "flow health". Sudden DLQ spikes or retry-topic backlogs are early signs of mainline degradation. In addition to the usual producer / consumer client metrics, watching per-topic latency and backlog speeds up root-cause analysis.

Bias alerts toward trends (long-running backlog, rising DLQ ratio, a growing share of later retry stages) and use tiered notifications rather than jumping straight to hard cutoffs. Reprocessing jobs should be idempotent and throttleable so that re-injection does not overwhelm the mainline.

  • Producer: record-error-rate, record-retry-rate, request-latency-avg
  • Consumer: records-lag-max, commit-latency-avg, time-between-poll-avg / time-between-poll-max
  • Topics: message count, arrival rate, and backlog age on retry topics / DLQ
  • Example alerts: DLQ ratio > X%, rising share of retry-2+ traffic, lag exceeding SLA
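
The "DLQ ratio > X%" alert can be expressed as a small calculation over two counters. The sketch assumes the counters are collected elsewhere over the same time window; the class and method names are hypothetical and the threshold is left to the caller.

```java
final class DlqAlertSketch {
  // Share of records that ended in the DLQ, in the range [0, 1].
  static double dlqRatio(long processedOk, long sentToDlq) {
    long total = processedOk + sentToDlq;
    return total == 0 ? 0.0 : (double) sentToDlq / total;
  }

  // True when the DLQ share for the window exceeds the given threshold.
  static boolean shouldAlert(long processedOk, long sentToDlq, double thresholdRatio) {
    return dlqRatio(processedOk, sentToDlq) > thresholdRatio;
  }

  public static void main(String[] args) {
    // Example window: 900 records processed, 100 routed to the DLQ, 5% threshold.
    System.out.println(shouldAlert(900, 100, 0.05));
  }
}
```

Evaluating the ratio over a sliding window rather than over lifetime totals keeps the alert sensitive to recent changes.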

CCDAK Key Points and Anti-Patterns

CCDAK frequently tests the differences between delivery guarantees (at-most-once / at-least-once / exactly-once), the idempotent producer, and the constraints of transactions (a single transaction can span writes to multiple partitions and offset commits, but does not guarantee atomicity with external systems).

Anti-patterns include unbounded retries, long blocking that exceeds max.poll.interval.ms, unmonitored DLQs, ordering breakage from mismatched partition counts on retry topics, and producer resends without idempotence enabled.

  • at-least-once delivery: commit offsets after processing
  • Exactly-once in Kafka: enable.idempotence + transactions + sendOffsetsToTransaction
  • Ordering is preserved with up to 5 in-flight requests when enable.idempotence=true; without idempotence, set max.in.flight.requests.per.connection=1 (a throughput and latency trade-off)
  • DLQ is achieved through design, not a built-in feature; use headers to keep records reprocessable
  • Retry topics must share the same key space and partitioner
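
The difference between at-most-once and at-least-once comes down to whether the offset is committed before or after processing. The following simulation is a simplified model, not Kafka client code: it replays a list of records with a single simulated crash and shows which records end up processed.

```java
import java.util.ArrayList;
import java.util.List;

final class DeliverySemanticsSim {
  // Simulates one crash at index crashAt, then a restart from the committed offset.
  // commitFirst = true models at-most-once; commitFirst = false models at-least-once.
  static List<String> run(List<String> records, int crashAt, boolean commitFirst) {
    List<String> processed = new ArrayList<>();
    int committed = 0;
    boolean crashed = false;
    while (committed < records.size()) {
      int i = committed; // a (re)started consumer resumes from the committed offset
      boolean crashNow = !crashed && i == crashAt;
      if (commitFirst) {
        committed = i + 1;                          // offset committed before processing
        if (crashNow) { crashed = true; continue; } // crash: record i is never processed
        processed.add(records.get(i));
      } else {
        processed.add(records.get(i));              // side effect happens first
        if (crashNow) { crashed = true; continue; } // crash before commit: record i is replayed
        committed = i + 1;
      }
    }
    return processed;
  }

  public static void main(String[] args) {
    List<String> records = List.of("a", "b", "c");
    System.out.println("commit first: " + run(records, 1, true));
    System.out.println("commit after: " + run(records, 1, false));
  }
}
```

Committing first loses the record that was in flight during the crash, while committing afterwards processes it twice, which is why at-least-once consumers need idempotent processing.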

Check Your Understanding

CCDAK

Question 1

A consumer reads from topic A and writes processed results to topic B, sending permanent errors to a DLQ. You want to minimize duplicates and preserve ordering per key. Which design is the most appropriate?

  A. Set enable.idempotence=true, acks=all, and max.in.flight.requests.per.connection=1 on the producer, and use a transaction so that the write to B and the offset commit for A are bundled together via sendOffsetsToTransaction. On a permanent error, abort the transaction and send to the DLQ separately.
  B. Enabling auto-commit on the consumer and simply raising producer retries eliminates both duplicate and ordering problems.
  C. Inline-retrying forever on the consumer and pausing polls until the error clears always preserves order, so that alone is sufficient.
  D. Configuring the DLQ topic to use compaction only causes old failures to disappear automatically and strengthens ordering.

Correct answer: A

Exactly-once within Kafka requires combining the idempotent producer with transactions and bundling the write and offset commit in the same transaction. Capping in-flight requests at 1 is the conservative way to protect ordering. Permanent errors are quarantined to the DLQ after the transaction is aborted. B does not guarantee duplicate or ordering safety, C hurts throughput and stability, and D is inappropriate as a DLQ design (compaction-only is not the right cleanup policy here).

Frequently Asked Questions

Is DLQ a built-in Kafka feature?

No. A DLQ is implemented as a regular topic. The full "DLQ design" comprises a naming convention, metadata (headers carrying the original topic/partition/offset), a reprocessing procedure, and monitoring.

Can ordering be preserved when using retry topics?

Within a single topic, ordering is preserved per partition. If a retry topic uses the same partition count and key-based partitioner as the original topic, the relative order of records sharing the same key is effectively maintained. However, absolute ordering across topics is outside Kafka's guarantees.

Is it always better to increase producer retries?

Not necessarily. Unbounded retries inflate latency. Resends only occur within the delivery.timeout.ms window; beyond that, the send is treated as failed. Combine retries with enable.idempotence=true and tune retries and backoff to match your SLA. If ordering is critical, also restrict in-flight requests.
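
As a rough illustration of the delivery.timeout.ms window, the sketch below checks the documented constraint that delivery.timeout.ms should be at least linger.ms + request.timeout.ms, and estimates how many attempts fit in the window. The estimate is a deliberate simplification: it assumes every attempt uses the full request.timeout.ms and then waits a fixed retry.backoff.ms, which real clients do not necessarily do. The class and method names are hypothetical.

```java
final class DeliveryTimeoutCheck {
  // delivery.timeout.ms should be >= linger.ms + request.timeout.ms.
  static boolean isConsistent(long deliveryTimeoutMs, long lingerMs, long requestTimeoutMs) {
    return deliveryTimeoutMs >= lingerMs + requestTimeoutMs;
  }

  // Simplified worst case: each attempt takes the full request timeout plus a fixed backoff.
  static long roughAttemptsInWindow(long deliveryTimeoutMs, long requestTimeoutMs, long retryBackoffMs) {
    long perAttempt = requestTimeoutMs + retryBackoffMs;
    if (perAttempt <= 0) throw new IllegalArgumentException("timeouts must be positive");
    return deliveryTimeoutMs / perAttempt;
  }

  public static void main(String[] args) {
    // Example values: 120 s delivery window, 30 s request timeout, 100 ms backoff.
    System.out.println(isConsistent(120_000, 0, 30_000));
    System.out.println(roughAttemptsInWindow(120_000, 30_000, 100));
  }
}
```

The point of the estimate is that raising the retries setting beyond what fits in the window has no practical effect; the window, not the retry count, is usually the binding limit.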
