Failures in Kafka are not unusual. Network jitter, schema mismatches, and brief downstream outages all happen. This article builds on stable concepts from the official documentation to organize how you design and operate retries, DLQs, and compensation logic.
As CCDAK preparation, we also cover the key points around delivery guarantees, transactions (EOS), offset commits, and per-partition ordering.
Start by classifying failures. Transient errors that clear quickly (network blips, throttling) call for different handling than permanent errors caused by the data itself (schema mismatches, validation failures, unexpected formats).
Next, separate business errors from technical errors. Technical errors are often recoverable via retries and backoff; business errors should be handled explicitly via DLQ or compensation.
Kafka provides message ordering on a per-partition basis. Always evaluate the impact of error handling on ordering — route records sharing a key to the same partition, restrict in-flight requests when needed, and make sure retry topics share the same partition count and partitioner.
Producer retries in Kafka are effective for transient broker or network failures. Combined with enable.idempotence=true, the client itself prevents duplicates from resends. When ordering matters, the standard practice is to cap max.in.flight.requests.per.connection at 1.
Consumer-side retry is the application's responsibility. Inline retries block the partition while running, hurting throughput. In practice, the most stable pattern is an N-stage chain of retry topics (retry-1, retry-2, ...) with progressively longer backoffs. Records that ultimately cannot be processed go to the DLQ. Kafka does not natively provide a "delay queue", so delays are expressed via dedicated retry topics and consumption interval control.
| Pattern | Applies to | Primary purpose | Strength |
|---|---|---|---|
| Producer retries | Send failures (technical) | Improved delivery success rate | Minimizes duplicates when combined with idempotence |
| Consumer inline retry | Processing failures (technical / some business) | Immediate retry | Easy to implement |
| Retry topics | Processing failures (mostly transient) | Staged backoff and isolation | Keeps main-stream throughput intact |
| DLQ | Unprocessable (permanent) | Quarantining damaged records | Prevents head-of-line blocking |
| Compensation | Business inconsistencies | After-the-fact correction | Restores consistency across services |
Retry topic and DLQ flow (preserving key order)
DLQ is not a built-in Kafka feature; it is implemented as a regular topic. Use a naming convention that is immediately recognizable, such as <original-topic>.DLQ or <domain>.dead-letter. Set retention longer than the main stream and prepare monitoring and reprocessing tooling (manual / batch / dedicated UI). The cleanup policy should usually be delete; avoid compaction unless key deletion has a meaningful semantic in your design.
To preserve ordering, match the partition count and partitioner of the DLQ and retry topics to the source topic. Carry the original metadata and error context in headers so reprocessing can reconstruct context. Avoid carelessly placing PII or secrets into headers or values, and design auditing and access controls deliberately.
Sending to DLQ / retry topics from a Java client (simplified)
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.header.internals.RecordHeader;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.*;
public class RetryAndDlqExample {
static String MAIN = "orders";
static String RETRY_PREFIX = MAIN + ".retry."; // retry.1, retry.2 など
static String DLQ = MAIN + ".DLQ";
public static void main(String[] args) {
Properties cp = new Properties();
cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
cp.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-worker");
cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cp, new ByteArrayDeserializer(), new ByteArrayDeserializer());
consumer.subscribe(Collections.singletonList(MAIN));
Properties pp = new Properties();
pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
pp.put(ProducerConfig.ACKS_CONFIG, "all");
pp.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
pp.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
pp.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");
KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pp, new ByteArraySerializer(), new ByteArraySerializer());
while (true) {
ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
for (ConsumerRecord<byte[], byte[]> r : records) {
try {
// ここで業務処理(例:検証/下流呼び出し/DB upsert)
process(r);
// 正常時のみオフセットを明示コミット
consumer.commitSync(Collections.singletonMap(new TopicPartition(r.topic(), r.partition()), new OffsetAndMetadata(r.offset() + 1)));
} catch (Exception e) {
int attempt = headerInt(r, "x-attempt", 0) + 1;
String nextTopic = attempt <= 2 ? RETRY_PREFIX + attempt : DLQ;
ProducerRecord<byte[], byte[]> pr = new ProducerRecord<>(nextTopic, r.key(), r.value());
pr.headers().add(new RecordHeader("x-original-topic", r.topic().getBytes(StandardCharsets.UTF_8)));
pr.headers().add(new RecordHeader("x-original-partition", Integer.toString(r.partition()).getBytes(StandardCharsets.UTF_8)));
pr.headers().add(new RecordHeader("x-original-offset", Long.toString(r.offset()).getBytes(StandardCharsets.UTF_8)));
pr.headers().add(new RecordHeader("x-exception-class", e.getClass().getName().getBytes(StandardCharsets.UTF_8)));
pr.headers().add(new RecordHeader("x-ex-message", Optional.ofNullable(e.getMessage()).orElse("").getBytes(StandardCharsets.UTF_8)));
pr.headers().add(new RecordHeader("x-attempt", Integer.toString(attempt).getBytes(StandardCharsets.UTF_8)));
producer.send(pr).get(); // シンプルに同期送信
// 失敗レコードのオフセットは“処理済み”として進め、本流を詰まらせない
consumer.commitSync(Collections.singletonMap(new TopicPartition(r.topic(), r.partition()), new OffsetAndMetadata(r.offset() + 1)));
}
}
}
}
static void process(ConsumerRecord<byte[], byte[]> r) {
// ダミー:例外を投げるかもしれない業務処理
}
static int headerInt(ConsumerRecord<byte[], byte[]> r, String key, int dflt) {
var h = r.headers().lastHeader(key);
if (h == null) return dflt;
try { return Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8)); } catch (Exception e) { return dflt; }
}
}
Compensation is about designing steps (cancellation / reversal) that retroactively undo a business inconsistency. The SAGA pattern keeps consistency without distributed transactions by having each step asynchronously publish success or compensation events.
The outbox pattern is effective for keeping DB updates and event publishing atomic. The application writes business data and an outbox row in the same DB transaction, and a CDC connector reliably forwards those rows to Kafka. When compensation is required, you publish a new compensation event and apply it downstream. Note that Kafka transactions (EOS) provide read-write atomicity within Kafka, not end-to-end atomicity with external systems.
Monitor retries and the DLQ as indicators of "flow health". Sudden DLQ spikes or retry-topic backlogs are early signs of mainline degradation. In addition to the usual producer / consumer client metrics, watching per-topic latency and backlog speeds up root-cause analysis.
Tilt alerts toward trends — "long-running backlog", "rising DLQ ratio", "growing retry stages" — and use tiered notifications to avoid jumping to immediate cutoffs. Reprocessing jobs should be idempotent and throttleable so that re-injection does not crush the mainline.
CCDAK frequently tests the differences between delivery guarantees (at-most-once / at-least-once / exactly-once), the idempotent producer, and the constraints of transactions (a single transaction can span writes to multiple partitions and offset commits, but does not guarantee atomicity with external systems).
Anti-patterns include unbounded retries, long blocking that exceeds max.poll.interval.ms, unmonitored DLQs, ordering breakage from mismatched partition counts on retry topics, and producer resends without idempotence enabled.
CCDAK
問題 1
A consumer reads from topic A and writes processed results to topic B, sending permanent errors to a DLQ. You want to minimize duplicates and preserve ordering per key. Which design is the most appropriate?
正解: A
Exactly-once within Kafka requires combining the idempotent producer with transactions and bundling the write and offset commit in the same transaction. When ordering matters, cap in-flight at 1. Permanent errors are quarantined to the DLQ after the transaction is aborted. B does not guarantee duplicate or ordering safety, C hurts throughput and stability, and D is inappropriate as a DLQ design (compaction-only is not the right cleanup policy here).
Is DLQ a built-in Kafka feature?
No. A DLQ is implemented as a regular topic. The full "DLQ design" comprises a naming convention, metadata (headers carrying the original topic/partition/offset), a reprocessing procedure, and monitoring.
Can ordering be preserved when using retry topics?
Within a single topic, ordering is preserved per partition. If a retry topic uses the same partition count and key-based partitioner as the original topic, the relative order of records sharing the same key is effectively maintained. However, absolute ordering across topics is outside Kafka's guarantees.
Is it always better to increase producer retries?
Unbounded retries inflate latency. Resends only occur within the delivery.timeout.ms window — beyond that, the send is treated as failed. Combine with enable.idempotence=true and tune retries/backoff to match your SLA. If ordering is critical, also restrict in-flight requests.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
Kafka Topics & Partitions: Distribution Fundamentals (2026)
How Kafka topics and partitions enable scale — ordering guar...
CCDAK Exam Guide: Confluent Certified Developer (2026)
Complete prep for the CCDAK exam — Producer/Consumer API, St...
CCAAK Exam Guide: Confluent Certified Administrator (2026)
Pass the CCAAK exam — cluster management, partitions, securi...
Kafka Replicas & ISR: Fault Tolerance Explained (2026)
Replica placement, in-sync replicas (ISR), leader election. ...
Kafka Offsets: Commit Modes & Consumer Position (2026)
Offset semantics — auto vs. manual commit, __consumer_offset...