Kafka Error Handling Patterns (2026)

Failures in Kafka are not unusual. Network jitter, schema mismatches, and brief downstream outages all happen. This article builds on stable concepts from the official documentation to organize how you design and operate retries, DLQs, and compensation logic.

As CCDAK preparation, we also cover the key points around delivery guarantees, transactions (EOS), offset commits, and per-partition ordering.

Classifying Failures and Design Principles

Start by classifying failures. Transient errors that clear quickly (network blips, throttling) call for different handling than permanent errors caused by the data itself (schema mismatches, validation failures, unexpected formats).

Next, separate business errors from technical errors. Technical errors are often recoverable via retries and backoff; business errors should be handled explicitly via DLQ or compensation.

Kafka provides message ordering on a per-partition basis. Always evaluate the impact of error handling on ordering — route records sharing a key to the same partition, restrict in-flight requests when needed, and make sure retry topics share the same partition count and partitioner.

Make the cause observable first (e.g., put the exception class, original topic/partition/offset into headers)
Retry transient errors with backoff; route permanent errors to the DLQ early
Keep order-sensitive keys on the same partition, and apply the same partitioning strategy to retry topics
If downstream is not idempotent, use enable.idempotence=true on the producer and consider transactions for exactly-once
Commit offsets only after results are durably persisted (at-least-once). For EOS, use sendOffsetsToTransaction

Choosing the Right Retry Strategy

Producer retries in Kafka are effective for transient broker or network failures. Combined with enable.idempotence=true, the client itself prevents duplicates from resends. When ordering matters, the standard practice is to cap max.in.flight.requests.per.connection at 1.

Consumer-side retry is the application's responsibility. Inline retries block the partition while running, hurting throughput. In practice, the most stable pattern is an N-stage chain of retry topics (retry-1, retry-2, ...) with progressively longer backoffs. Records that ultimately cannot be processed go to the DLQ. Kafka does not natively provide a "delay queue", so delays are expressed via dedicated retry topics and consumption interval control.

Example recommended producer settings: acks=all, enable.idempotence=true, max.in.flight.requests.per.connection=1, with delivery.timeout.ms tuned to your business SLA
Keep inline consumer retries small in count; route permanent errors to the DLQ promptly
Use the same partition count and partitioner on retry topics as the source topic to preserve key ordering
Increase backoff progressively (exponential / linear) and avoid infinite retries

Pattern	Applies to	Primary purpose	Strength
Producer retries	Send failures (technical)	Improved delivery success rate	Minimizes duplicates when combined with idempotence
Consumer inline retry	Processing failures (technical / some business)	Immediate retry	Easy to implement
Retry topics	Processing failures (mostly transient)	Staged backoff and isolation	Keeps main-stream throughput intact
DLQ	Unprocessable (permanent)	Quarantining damaged records	Prevents head-of-line blocking
Compensation	Business inconsistencies	After-the-fact correction	Restores consistency across services

Retry topic and DLQ flow (preserving key order)

Key Points for DLQ Design and Implementation

DLQ is not a built-in Kafka feature; it is implemented as a regular topic. Use a naming convention that is immediately recognizable, such as <original-topic>.DLQ or <domain>.dead-letter. Set retention longer than the main stream and prepare monitoring and reprocessing tooling (manual / batch / dedicated UI). The cleanup policy should usually be delete; avoid compaction unless key deletion has a meaningful semantic in your design.

To preserve ordering, match the partition count and partitioner of the DLQ and retry topics to the source topic. Carry the original metadata and error context in headers so reprocessing can reconstruct context. Avoid carelessly placing PII or secrets into headers or values, and design auditing and access controls deliberately.

Recommended headers: x-original-topic, x-original-partition, x-original-offset, x-exception-class, x-ex-message, x-attempt
Define a DLQ-specific "envelope" schema (original payload + metadata)
Make the reprocessing policy explicit (re-injection target, max attempts, whether manual approval is required)
Always have DLQ volume monitoring and threshold alerts in place

Sending to DLQ / retry topics from a Java client (simplified)

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.header.internals.RecordHeader;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.*;

public class RetryAndDlqExample {
  static String MAIN = "orders";
  static String RETRY_PREFIX = MAIN + ".retry."; // retry.1, retry.2 など
  static String DLQ = MAIN + ".DLQ";

  public static void main(String[] args) {
    Properties cp = new Properties();
    cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
    cp.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-worker");
    cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
    KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cp, new ByteArrayDeserializer(), new ByteArrayDeserializer());
    consumer.subscribe(Collections.singletonList(MAIN));

    Properties pp = new Properties();
    pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
    pp.put(ProducerConfig.ACKS_CONFIG, "all");
    pp.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
    pp.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
    pp.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");
    KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pp, new ByteArraySerializer(), new ByteArraySerializer());

    while (true) {
      ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
      for (ConsumerRecord<byte[], byte[]> r : records) {
        try {
          // ここで業務処理（例：検証/下流呼び出し/DB upsert）
          process(r);
          // 正常時のみオフセットを明示コミット
          consumer.commitSync(Collections.singletonMap(new TopicPartition(r.topic(), r.partition()), new OffsetAndMetadata(r.offset() + 1)));
        } catch (Exception e) {
          int attempt = headerInt(r, "x-attempt", 0) + 1;
          String nextTopic = attempt <= 2 ? RETRY_PREFIX + attempt : DLQ;
          ProducerRecord<byte[], byte[]> pr = new ProducerRecord<>(nextTopic, r.key(), r.value());
          pr.headers().add(new RecordHeader("x-original-topic", r.topic().getBytes(StandardCharsets.UTF_8)));
          pr.headers().add(new RecordHeader("x-original-partition", Integer.toString(r.partition()).getBytes(StandardCharsets.UTF_8)));
          pr.headers().add(new RecordHeader("x-original-offset", Long.toString(r.offset()).getBytes(StandardCharsets.UTF_8)));
          pr.headers().add(new RecordHeader("x-exception-class", e.getClass().getName().getBytes(StandardCharsets.UTF_8)));
          pr.headers().add(new RecordHeader("x-ex-message", Optional.ofNullable(e.getMessage()).orElse("").getBytes(StandardCharsets.UTF_8)));
          pr.headers().add(new RecordHeader("x-attempt", Integer.toString(attempt).getBytes(StandardCharsets.UTF_8)));
          producer.send(pr).get(); // シンプルに同期送信
          // 失敗レコードのオフセットは“処理済み”として進め、本流を詰まらせない
          consumer.commitSync(Collections.singletonMap(new TopicPartition(r.topic(), r.partition()), new OffsetAndMetadata(r.offset() + 1)));
        }
      }
    }
  }

  static void process(ConsumerRecord<byte[], byte[]> r) {
    // ダミー：例外を投げるかもしれない業務処理
  }

  static int headerInt(ConsumerRecord<byte[], byte[]> r, String key, int dflt) {
    var h = r.headers().lastHeader(key);
    if (h == null) return dflt;
    try { return Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8)); } catch (Exception e) { return dflt; }
  }
}

Compensation, SAGA, and the Outbox Pattern

Compensation is about designing steps (cancellation / reversal) that retroactively undo a business inconsistency. The SAGA pattern keeps consistency without distributed transactions by having each step asynchronously publish success or compensation events.

The outbox pattern is effective for keeping DB updates and event publishing atomic. The application writes business data and an outbox row in the same DB transaction, and a CDC connector reliably forwards those rows to Kafka. When compensation is required, you publish a new compensation event and apply it downstream. Note that Kafka transactions (EOS) provide read-write atomicity within Kafka, not end-to-end atomicity with external systems.

Treat compensation as a separate event and ensure downstream consumers can apply it idempotently
Use the outbox to lock in the "write first, then publish" order and avoid the dual-write problem
Guarantee ordering and uniqueness of compensation events (key design, dedup)

Monitoring, Alerting, and Operational Pitfalls

Monitor retries and the DLQ as indicators of "flow health". Sudden DLQ spikes or retry-topic backlogs are early signs of mainline degradation. In addition to the usual producer / consumer client metrics, watching per-topic latency and backlog speeds up root-cause analysis.

Tilt alerts toward trends — "long-running backlog", "rising DLQ ratio", "growing retry stages" — and use tiered notifications to avoid jumping to immediate cutoffs. Reprocessing jobs should be idempotent and throttleable so that re-injection does not crush the mainline.

Producer: record-error-rate, record-retry-rate, request-latency-avg
Consumer: records-lag-max, commit-latency, poll-interval variance
Topics: message count, arrival rate, and backlog age on retry topics / DLQ
Example alerts: DLQ ratio > X%, rising share of retry-2+ traffic, lag exceeding SLA

CCDAK Key Points and Anti-Patterns

CCDAK frequently tests the differences between delivery guarantees (at-most-once / at-least-once / exactly-once), the idempotent producer, and the constraints of transactions (a single transaction can span writes to multiple partitions and offset commits, but does not guarantee atomicity with external systems).

Anti-patterns include unbounded retries, long blocking that exceeds max.poll.interval.ms, unmonitored DLQs, ordering breakage from mismatched partition counts on retry topics, and producer resends without idempotence enabled.

at-least-once delivery: commit offsets after processing
Exactly-once in Kafka: enable.idempotence + transactions + sendOffsetsToTransaction
To preserve ordering, set max.in.flight.requests.per.connection=1 (a latency trade-off)
DLQ is achieved through design, not a built-in feature; use headers to keep records reprocessable
Retry topics must share the same key space and partitioner

Check Your Understanding

CCDAK

問題 1

A consumer reads from topic A and writes processed results to topic B, sending permanent errors to a DLQ. You want to minimize duplicates and preserve ordering per key. Which design is the most appropriate?

Set enable.idempotence=true, acks=all, and max.in.flight.requests.per.connection=1 on the producer, and use a transaction so that the write to B and the offset commit for A are bundled together via sendOffsetsToTransaction. On a permanent error, abort the transaction and send to the DLQ separately.
Enabling auto-commit on the consumer and simply raising producer retries eliminates both duplicate and ordering problems.
Inline-retrying forever on the consumer and pausing polls until the error clears always preserves order, so that alone is sufficient.
Configuring the DLQ topic to use compaction only causes old failures to disappear automatically and strengthens ordering.

正解: A

Exactly-once within Kafka requires combining the idempotent producer with transactions and bundling the write and offset commit in the same transaction. When ordering matters, cap in-flight at 1. Permanent errors are quarantined to the DLQ after the transaction is aborted. B does not guarantee duplicate or ordering safety, C hurts throughput and stability, and D is inappropriate as a DLQ design (compaction-only is not the right cleanup policy here).

Frequently Asked Questions

Is DLQ a built-in Kafka feature?

No. A DLQ is implemented as a regular topic. The full "DLQ design" comprises a naming convention, metadata (headers carrying the original topic/partition/offset), a reprocessing procedure, and monitoring.

Can ordering be preserved when using retry topics?

Within a single topic, ordering is preserved per partition. If a retry topic uses the same partition count and key-based partitioner as the original topic, the relative order of records sharing the same key is effectively maintained. However, absolute ordering across topics is outside Kafka's guarantees.

Is it always better to increase producer retries?

Unbounded retries inflate latency. Resends only occur within the delivery.timeout.ms window — beyond that, the send is treated as failed. Combine with enable.idempotence=true and tune retries/backoff to match your SLA. If ordering is critical, also restrict in-flight requests.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

Kafka Error Handling Patterns: Retries, DLQ, and Compensation