Kafka

Kafka Connect Dead Letter Queue: Failed-Message Recovery Design

2026-04-19
NicheeLab Editorial Team

The DLQ is a mechanism that diverts records the connector, SMT, or converter could not process into a dedicated Kafka topic, allowing the rest of the records to keep flowing.

This article walks through which failures land in the DLQ, the configuration essentials, recovery patterns, operations and monitoring, and the points most likely to appear on the exam — end to end.

DLQ Basics and What CCDAK Tests

The Kafka Connect DLQ keeps the connector running even when individual record processing fails, sending only the failed records to a separate topic. Setting errors.tolerance=all and naming a DLQ topic are the baseline requirements.

On CCDAK, common topics include which failure types reach the DLQ, the relationship with errors.retry.*, the difference from log output, order guarantees, and topic design (partition count, retention, ACLs).

  • The DLQ avoids pipeline halts by isolating failures at the record level
  • The DLQ preserves failures in a more structured way than logs (metadata in headers)
  • Because the DLQ is a separate topic, order consistency with the source topic is not guaranteed
  • The responsibility for reprocessing lies with the DLQ consumer (the repair and re-publish process)

Failure Categories: What Goes to the DLQ and What Does Not

Connect failures occur broadly at three layers: converters (serialization/deserialization), SMTs (Single Message Transforms), and the connector itself (especially Sink writes to external systems). The DLQ primarily covers record-processing failures the Connect framework can observe.

Converter and SMT failures generally end up in the DLQ. Sink-side external-write failures do not — unless the connector implementation supports error reporting — and are instead handled as RetriableException retries or task failures.

  • Typical DLQ cases: deserialization failures from schema mismatches and conversion errors during SMT execution
  • Typical non-DLQ cases: transient external-DB issues, unreachability, timeouts (mostly retried via RetriableException)
  • Non-retriable, record-specific causes can be isolated via errors.tolerance=all combined with the DLQ
  • Metadata such as origin topic/partition/offset, error details, and failure stage can be added to headers (when enabled in config)

Key DLQ Settings and Topic Design

The minimum to enable a DLQ is setting errors.tolerance=all and errors.deadletterqueue.topic.name. Enable header context with errors.deadletterqueue.context.headers.enable. Relying on auto-create yields the broker's default partition count, so pre-create the topic when you need explicit partitions or retention.

Set retention to match your recovery SLA, sized together with capacity estimates. Because the DLQ is for investigation and reprocessing, log compaction is normally left disabled. From a security standpoint, tighten ACLs independently from regular business topics.

  • errors.tolerance=all (required): tolerates failures at the record level
  • errors.deadletterqueue.topic.name (required): destination DLQ topic
  • errors.deadletterqueue.context.headers.enable=true: writes error info into headers
  • errors.deadletterqueue.topic.replication.factor: RF used on auto-create (pre-creation recommended)
  • errors.log.enable / errors.log.include.messages: pair with log-based observation
  • errors.retry.timeout / errors.retry.delay.max.ms: retry time cap and max delay (for Retriable cases)

Kafka Connect and DLQ flow (conceptual diagram)

non-retriableProducerKafka TopicordersSink TaskExternal SystemDLQ Topicdlq.ordersRemediator / Repair Appfixed.ordersRe-publish targetNon-retriable record errors go to the DLQ. Headers: error, origin t/p/o. A Remediator repairs and re-publishes them.

Example: enabling the DLQ on a Sink connector (excerpt)

{
  "name": "jdbc-sink-orders",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "orders",
    "connection.url": "jdbc:postgresql://db:5432/app",
    "tasks.max": "3",

    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dlq.orders",
    "errors.deadletterqueue.context.headers.enable": "true",
    "errors.deadletterqueue.topic.replication.factor": "3",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true",

    "errors.retry.timeout": "600000",
    "errors.retry.delay.max.ms": "60000"
  }
}

Failed-Message Recovery Patterns (Comparison)

Do not let records pile up in the DLQ — design recovery as a deliberate pattern. Representative options include an automated isolate-investigate-repair-republish loop, a delayed re-publish (retry) pattern, and a pattern that inserts manual review.

Across every pattern, the practical key is to assume at-least-once duplicates and no ordering, and to combine that with idempotent sink-side design (primary keys, upserts, deduplication).

  • A Remediator app consumes the DLQ and, after repair, re-publishes to the source topic or a dedicated reprocessing topic
  • Route by cause (e.g., schema mismatches to A, missing foreign keys to B)
  • Use staged backoff on re-publish to avoid spikes
  • Keep the DLQ as an audit trail so it can be correlated with re-publish outcomes
StrategyBehavior on FailureReprocessing CostBest Suited For
Halt (errors.tolerance=none)Task stops on the first failureLow (but big downtime impact)When data integrity is paramount and you want immediate intervention
Log only (no DLQ)Processing continues; check via logsMedium (investigation is tedious)When failures are extremely rare and low-impact
DLQ (errors.tolerance=all)Only the failed record is isolatedMedium (downstream repair and re-publish)Close to the default choice for typical production use
DLQ + retry controlRetriable failures retry; non-retriable go to the DLQMedium to high (extra control logic)When transient and permanent errors coexist

Re-publishing from the DLQ (minimal investigate → repair → republish flow)

# 1) Inspect the DLQ with headers
kafka-console-consumer \
  --bootstrap-server broker:9092 \
  --topic dlq.orders \
  --from-beginning \
  --property print.headers=true \
  --max-messages 10

# 2) Use a repair script to reshape the JSON (e.g., fill missing fields)
#    Implement this to match each system's validation rules

# 3) Publish the repaired messages to the reprocessing topic
kafka-console-producer \
  --broker-list broker:9092 \
  --topic fixed.orders

Operations Monitoring and SLO Design

The DLQ is the last safety net. Ideally it stays near zero in steady state, and any rise should trigger an alert and investigation. Size retention and capacity from rate × SLA so the audit trail is not lost when retention expires.

At minimum, monitor per-task DLQ write success/failure counters, retry metrics, and DLQ-consumer latency/lag. Also consider rate limiting and batching so re-publish does not spike load on the source topic.

  • Establish a DLQ-rate baseline and detect deviations (thresholds and rate-of-change)
  • Respond immediately to DLQ write failures (verify config, permissions, and topic existence)
  • Dashboard retention and capacity (estimated days remaining, projected full date)
  • Visualize processed count, failure rate, and lead time of the repair/re-publish pipeline

Exam Checklist and Common Pitfalls

CCDAK targets the relationship between errors.tolerance and the DLQ, RetriableException retries, the investigative value of headers, ordering and duplicate handling, and the fact that external-system write failures do not always end up in the DLQ.

In production, enabling the DLQ without deciding who reprocesses it and how just creates technical debt. Design recovery ownership, SLO, topic/ACL/retention, monitoring, and capacity estimates as a single package.

  • Enable the DLQ with errors.tolerance=all + errors.deadletterqueue.topic.name
  • errors.retry.timeout / errors.retry.delay.max.ms control the retry window (Retriable cases only)
  • The DLQ is a separate topic — assume no ordering and duplicate reprocessing (keep the sink idempotent)
  • External-system write failures may not reach the DLQ depending on connector implementation
  • Pre-create the DLQ topic with explicit partition count, retention, and RF
  • Use headers to make investigation easier (origin topic/partition/offset and similar)

Check Your Understanding

CCDAK

問題 1

On a Kafka Connect Sink connector, you want to keep processing even when individual records fail conversion, and be able to investigate and reprocess the failures later. Which is the minimum required configuration?

  1. errors.tolerance=none and errors.log.enable=true
  2. errors.tolerance=all and setting errors.deadletterqueue.topic.name
  3. Only raise errors.retry.timeout to a large value
  4. Increase the connector's tasks.max

正解: B

To tolerate record-level failures and divert them to a DLQ, you need errors.tolerance=all together with errors.deadletterqueue.topic.name. Enabling logs, extending retries, or increasing parallelism alone cannot isolate failed records.

Frequently Asked Questions

Does the order of records sent to the DLQ match the source topic?

Order is not guaranteed. The DLQ is a separate topic written to in parallel by multiple tasks, so the original order and relative positions are not preserved. If order matters for reprocessing, you need extra mechanisms such as per-key serialization.

Are transient external-system failures (e.g., DB connection drops) sent to the DLQ?

Generally not. In most cases the connector throws a RetriableException and is retried according to errors.retry.*. Records may only reach the DLQ for non-retriable record-specific errors, or when the connector implementation supports error reporting.

How should a DLQ topic be designed?

Pre-create the topic with explicit partition count and replication factor, and set retention long enough to meet your investigation and reprocessing SLO. Disable log compaction in most cases, and tighten ACLs to separate read and re-publish permissions.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
Kafka

Kafka Topics & Partitions: Distribution Fundamentals (2026)

How Kafka topics and partitions enable scale — ordering guar...

Kafka

CCDAK Exam Guide: Confluent Certified Developer (2026)

Complete prep for the CCDAK exam — Producer/Consumer API, St...

Kafka

CCAAK Exam Guide: Confluent Certified Administrator (2026)

Pass the CCAAK exam — cluster management, partitions, securi...

Kafka

Kafka Replicas & ISR: Fault Tolerance Explained (2026)

Replica placement, in-sync replicas (ISR), leader election. ...

Kafka

Kafka Offsets: Commit Modes & Consumer Position (2026)

Offset semantics — auto vs. manual commit, __consumer_offset...

Browse all Kafka articles (101)
© 2026 NicheeLab All rights reserved.