The DLQ is a mechanism that diverts records the connector, SMT, or converter could not process into a dedicated Kafka topic, allowing the rest of the records to keep flowing.
This article walks through which failures land in the DLQ, the configuration essentials, recovery patterns, operations and monitoring, and the points most likely to appear on the exam — end to end.
The Kafka Connect DLQ keeps the connector running even when individual record processing fails, sending only the failed records to a separate topic. Setting errors.tolerance=all and naming a DLQ topic are the baseline requirements.
On CCDAK, common topics include which failure types reach the DLQ, the relationship with errors.retry.*, the difference from log output, order guarantees, and topic design (partition count, retention, ACLs).
Connect failures occur broadly at three layers: converters (serialization/deserialization), SMTs (Single Message Transforms), and the connector itself (especially Sink writes to external systems). The DLQ primarily covers record-processing failures the Connect framework can observe.
Converter and SMT failures generally end up in the DLQ. Sink-side external-write failures do not — unless the connector implementation supports error reporting — and are instead handled as RetriableException retries or task failures.
The minimum to enable a DLQ is setting errors.tolerance=all and errors.deadletterqueue.topic.name. Enable header context with errors.deadletterqueue.context.headers.enable. Relying on auto-create yields the broker's default partition count, so pre-create the topic when you need explicit partitions or retention.
Set retention to match your recovery SLA, sized together with capacity estimates. Because the DLQ is for investigation and reprocessing, log compaction is normally left disabled. From a security standpoint, tighten ACLs independently from regular business topics.
Kafka Connect and DLQ flow (conceptual diagram)
Example: enabling the DLQ on a Sink connector (excerpt)
{
"name": "jdbc-sink-orders",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"topics": "orders",
"connection.url": "jdbc:postgresql://db:5432/app",
"tasks.max": "3",
"errors.tolerance": "all",
"errors.deadletterqueue.topic.name": "dlq.orders",
"errors.deadletterqueue.context.headers.enable": "true",
"errors.deadletterqueue.topic.replication.factor": "3",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"errors.retry.timeout": "600000",
"errors.retry.delay.max.ms": "60000"
}
}Do not let records pile up in the DLQ — design recovery as a deliberate pattern. Representative options include an automated isolate-investigate-repair-republish loop, a delayed re-publish (retry) pattern, and a pattern that inserts manual review.
Across every pattern, the practical key is to assume at-least-once duplicates and no ordering, and to combine that with idempotent sink-side design (primary keys, upserts, deduplication).
| Strategy | Behavior on Failure | Reprocessing Cost | Best Suited For |
|---|---|---|---|
| Halt (errors.tolerance=none) | Task stops on the first failure | Low (but big downtime impact) | When data integrity is paramount and you want immediate intervention |
| Log only (no DLQ) | Processing continues; check via logs | Medium (investigation is tedious) | When failures are extremely rare and low-impact |
| DLQ (errors.tolerance=all) | Only the failed record is isolated | Medium (downstream repair and re-publish) | Close to the default choice for typical production use |
| DLQ + retry control | Retriable failures retry; non-retriable go to the DLQ | Medium to high (extra control logic) | When transient and permanent errors coexist |
Re-publishing from the DLQ (minimal investigate → repair → republish flow)
# 1) Inspect the DLQ with headers
kafka-console-consumer \
--bootstrap-server broker:9092 \
--topic dlq.orders \
--from-beginning \
--property print.headers=true \
--max-messages 10
# 2) Use a repair script to reshape the JSON (e.g., fill missing fields)
# Implement this to match each system's validation rules
# 3) Publish the repaired messages to the reprocessing topic
kafka-console-producer \
--broker-list broker:9092 \
--topic fixed.ordersThe DLQ is the last safety net. Ideally it stays near zero in steady state, and any rise should trigger an alert and investigation. Size retention and capacity from rate × SLA so the audit trail is not lost when retention expires.
At minimum, monitor per-task DLQ write success/failure counters, retry metrics, and DLQ-consumer latency/lag. Also consider rate limiting and batching so re-publish does not spike load on the source topic.
CCDAK targets the relationship between errors.tolerance and the DLQ, RetriableException retries, the investigative value of headers, ordering and duplicate handling, and the fact that external-system write failures do not always end up in the DLQ.
In production, enabling the DLQ without deciding who reprocesses it and how just creates technical debt. Design recovery ownership, SLO, topic/ACL/retention, monitoring, and capacity estimates as a single package.
CCDAK
問題 1
On a Kafka Connect Sink connector, you want to keep processing even when individual records fail conversion, and be able to investigate and reprocess the failures later. Which is the minimum required configuration?
正解: B
To tolerate record-level failures and divert them to a DLQ, you need errors.tolerance=all together with errors.deadletterqueue.topic.name. Enabling logs, extending retries, or increasing parallelism alone cannot isolate failed records.
Does the order of records sent to the DLQ match the source topic?
Order is not guaranteed. The DLQ is a separate topic written to in parallel by multiple tasks, so the original order and relative positions are not preserved. If order matters for reprocessing, you need extra mechanisms such as per-key serialization.
Are transient external-system failures (e.g., DB connection drops) sent to the DLQ?
Generally not. In most cases the connector throws a RetriableException and is retried according to errors.retry.*. Records may only reach the DLQ for non-retriable record-specific errors, or when the connector implementation supports error reporting.
How should a DLQ topic be designed?
Pre-create the topic with explicit partition count and replication factor, and set retention long enough to meet your investigation and reprocessing SLO. Disable log compaction in most cases, and tighten ACLs to separate read and re-publish permissions.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
Kafka Topics & Partitions: Distribution Fundamentals (2026)
How Kafka topics and partitions enable scale — ordering guar...
CCDAK Exam Guide: Confluent Certified Developer (2026)
Complete prep for the CCDAK exam — Producer/Consumer API, St...
CCAAK Exam Guide: Confluent Certified Administrator (2026)
Pass the CCAAK exam — cluster management, partitions, securi...
Kafka Replicas & ISR: Fault Tolerance Explained (2026)
Replica placement, in-sync replicas (ISR), leader election. ...
Kafka Offsets: Commit Modes & Consumer Position (2026)
Offset semantics — auto vs. manual commit, __consumer_offset...