A poison pill is a message that a consumer fetches but cannot process - the application or Serde fails on it, and the same record keeps failing. Unlike a transient outage, leaving it alone stalls the processing pipeline and, depending on your offset commit strategy, can halt the whole system.
Based on the official documentation, this article concisely covers failure isolation, DLQs (Dead Letter Queues), retry design, schema compatibility configuration, and monitoring points for Kafka Connect, Kafka Streams, and the plain Consumer API. The closing chapter summarizes the essentials for CCDAK (Confluent Certified Developer for Apache Kafka) candidates.
A poison pill is a state where deserialization or application logic fails permanently on a specific record. Typical causes include schema mismatches, Serde mismatches, corrupted data, unexpected compression or encoding, and oversized payloads. Because the Kafka broker just stores messages as-is, the failure almost always surfaces on the consumer side.
The key is to isolate failures, keep healthy records flowing, and leave behind enough breadcrumbs (original headers and exception details) to reprocess later. Unbounded retries and naive commits invite infinite loops or data loss, so avoid them.
Deserialization failure caused by a value-type mismatch (example)
// Avro schema (age is int)
{
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "age", "type": "int"}
]
}
// Actual record (age is a string) -> deserialization fails
{"name":"taro","age":"twenty"}
// Mismatches like this can be prevented up front via Schema Registry compatibility settings and producer-side validation.When you detect a poison pill, the default response is to route that record around the healthy pipeline without stopping it. The three common approaches are: send it to a DLQ (Dead Letter Queue), use a dedicated retry topic with exponential backoff, or explicitly stop processing to prioritize consistency.
Kafka Connect officially supports error handling and DLQ configuration. Kafka Streams lets you build error paths via DeserializationExceptionHandler or topology branching. With the plain Consumer API, you produce to a DLQ yourself and control your own commit points.
| Strategy | Strengths | Caveats / when to use |
|---|---|---|
| DLQ (isolation) | Keeps healthy flow running. Manual or batch repair afterward | DLQ operations and storage costs; requires post-processing design |
| Retry topic (delayed reprocessing) | Self-recovers from transient issues; backoff smooths load | Permanent failures end up in the DLQ anyway; watch ordering requirements |
| Immediate stop (fail-fast) | Consistency-first; surfaces anomalies immediately | Reduced availability; a recovery playbook is mandatory |
Poison pill isolation flow (conceptual diagram)
Example Kafka Connect configuration to enable DLQ (works for both sinks and sources)
# connector.properties (excerpt)
errors.tolerance=all
errors.deadletterqueue.topic.name=dlq.user-v1
errors.deadletterqueue.context.headers.enable=true
errors.log.enable=true
errors.log.include.messages=true
# Tuning example for larger records (apply as needed)
producer.max.request.size=1048576
consumer.max.partition.fetch.bytes=1048576Most poison pills come from schema evolution mismatches. Use BACKWARD or BACKWARD_TRANSITIVE compatibility in Schema Registry as your default, and let producers preserve a coexistence window for old and new clients. Stage breaking changes - making a field required, changing a type - through default values or by adding a new field first and removing the old one later.
The subject naming strategy (TopicNameStrategy, RecordNameStrategy, etc.) determines the scope of compatibility checks. In both production and the exam, being able to clearly articulate the unit at which schemas evolve is crucial.
Set compatibility level in Schema Registry (example: backward)
# Global compatibility
curl -s -X PUT -H 'Content-Type: application/json' \
--data '{"compatibility":"BACKWARD"}' \
http://schema-registry:8081/config
# Per-subject (e.g., topic-value)
curl -s -X PUT -H 'Content-Type: application/json' \
--data '{"compatibility":"BACKWARD"}' \
http://schema-registry:8081/config/my-topic-value
# Compatibility check (dry-run before registering)
curl -s -X POST -H 'Content-Type: application/vnd.schemaregistry.v1+json' \
--data '{"schema":"<avro schema string>"}' \
http://schema-registry:8081/compatibility/subjects/my-topic-value/versions/latestWith the plain Consumer API, the DLQ is yours to build. The rule is: commit offsets only on success. On failure, produce the record to a DLQ and explicitly advance (seek) or fall through to skip handling so you do not loop on the same record forever.
Stamping the cause (exception class, schema ID, source partition/offset) into headers makes downstream repair and reprocessing much easier. Once you have fixed the issue, a batch job that replays records from the DLQ back into the source topic lightens the operational load.
Java (Kafka Consumer) example: DLQ forwarding and safe commits
var consumer = new KafkaConsumer<String, byte[]>(consumerProps);
var dlqProducer = new KafkaProducer<String, byte[]>(producerProps);
consumer.subscribe(List.of("input"));
while (true) {
ConsumerRecords<String, byte[]> batch = consumer.poll(Duration.ofSeconds(1));
for (ConsumerRecord<String, byte[]> r : batch) {
try {
// Custom decode (e.g., Avro/JSON validation)
var record = decode(r.value());
process(record);
// Commit only on success (manual)
consumer.commitSync(Collections.singletonMap(
new TopicPartition(r.topic(), r.partition()),
new OffsetAndMetadata(r.offset() + 1)));
} catch (Exception e) {
// Forward to DLQ (attach original info as headers)
Headers h = new RecordHeaders(r.headers());
h.add("error.class", e.getClass().getName().getBytes(StandardCharsets.UTF_8));
h.add("error.message", Optional.ofNullable(e.getMessage()).orElse("").getBytes(StandardCharsets.UTF_8));
h.add("source.topic", r.topic().getBytes(StandardCharsets.UTF_8));
h.add("source.partition", Integer.toString(r.partition()).getBytes(StandardCharsets.UTF_8));
h.add("source.offset", Long.toString(r.offset()).getBytes(StandardCharsets.UTF_8));
dlqProducer.send(new ProducerRecord<>("dlq.input", null, r.key(), r.value(), h));
// Advance to the next record (avoid infinite loop)
consumer.seek(new TopicPartition(r.topic(), r.partition()), r.offset() + 1);
}
}
}Visibility is non-negotiable for poison-pill defense. Continuously observe consumer lag, DLQ throughput and backlog, retry-topic reprocessing rate, and the distribution of exception types, then alert on thresholds. Connect and Streams expose rich metrics that you can visualize with JMX Exporter or Confluent Control Center.
The classic operational trap is a DLQ that quietly piles up. Dashboard the backlog (residency time) and record count, page the on-call when thresholds trip, and standardize periodic replay and repair procedures (reinjection, schema fixes, alternative transforms).
Quick command examples for lag/backlog checks
# Consumer lag
kafka-consumer-groups.sh \
--bootstrap-server broker:9092 \
--group my-group \
--describe
# Topic record count (approximate)
kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list broker:9092 \
--topic dlq.input --time -1 | awk -F: '{sum+=$3} END {print sum}'CCDAK tests whether you can explain which strategy belongs at which layer. Frequent topics include Kafka Connect DLQ settings, Kafka Streams exception handlers, Schema Registry compatibility, and Consumer commit strategies. Approach them as production-design questions and be able to articulate the availability vs. consistency tradeoff.
Questions also extend beyond plain retries to message ordering, Exactly-Once Semantics (EOS), and transaction visibility (isolation.level=read_committed). Poison pills are outside EOS scope, but where you commit offsets directly affects consistency, so keep that in mind.
Kafka Streams: handling deserialization exceptions (handler configuration example)
Properties p = new Properties();
p.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
org.apache.kafka.streams.errors.LogAndContinueExceptionHandler.class);
// Choosing LogAndFail stops the app by default (health-first)
StreamsBuilder b = new StreamsBuilder();
KStream<String, String> in = b.stream("input");
KStream<String, String>[] branches = in.branch(
(k, v) -> isValid(v), // healthy path
(k, v) -> true // error path
);
branches[0].to("output");
branches[1].to("dlq.input");CCDAK
問題 1
On a Kafka Connect sink connector, some records cannot be written due to schema mismatches. You want to keep healthy records flowing while making failed records reprocessable later. Which configuration combination is most appropriate?
正解: B
To keep healthy records flowing while isolating failures, use Connect's DLQ feature. The standard combination is errors.tolerance=all to tolerate errors, errors.deadletterqueue.topic.name to set the DLQ destination, and errors.deadletterqueue.context.headers.enable=true to include cause information in headers.
Does Kafka itself include a DLQ feature?
The broker itself has no dedicated DLQ feature. Kafka Connect provides DLQ configuration, and Kafka Streams provides exception handlers and branching to route errors. With the plain Consumer API, you build your own logic to produce failed records to a DLQ topic.
How should offsets be handled when a poison pill is encountered?
The rule is to commit only on success. Forward failed records to a DLQ and advance with consumer.seek so you do not loop forever. Relying on auto commit can commit offsets for unprocessed records, raising the risk that they can never be reprocessed.
How do you reduce schema mismatches?
Keep backward compatibility in Schema Registry and roll out breaking changes in staged migrations. Adding schema validation on the producer side and automating compatibility checks (the /compatibility API) in CI/CD is highly effective.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
Kafka Topics & Partitions: Distribution Fundamentals (2026)
How Kafka topics and partitions enable scale — ordering guar...
CCDAK Exam Guide: Confluent Certified Developer (2026)
Complete prep for the CCDAK exam — Producer/Consumer API, St...
CCAAK Exam Guide: Confluent Certified Administrator (2026)
Pass the CCAAK exam — cluster management, partitions, securi...
Kafka Replicas & ISR: Fault Tolerance Explained (2026)
Replica placement, in-sync replicas (ISR), leader election. ...
Kafka Offsets: Commit Modes & Consumer Position (2026)
Offset semantics — auto vs. manual commit, __consumer_offset...