Handling Kafka Poison Pills (2026)

A poison pill is a message that a consumer fetches but cannot process - the application or Serde fails on it, and the same record keeps failing. Unlike a transient outage, leaving it alone stalls the processing pipeline and, depending on your offset commit strategy, can halt the whole system.

Based on the official documentation, this article concisely covers failure isolation, DLQs (Dead Letter Queues), retry design, schema compatibility configuration, and monitoring points for Kafka Connect, Kafka Streams, and the plain Consumer API. The closing chapter summarizes the essentials for CCDAK (Confluent Certified Developer for Apache Kafka) candidates.

What Is a Poison Pill: Symptoms and Root Causes

A poison pill is a state where deserialization or application logic fails permanently on a specific record. Typical causes include schema mismatches, Serde mismatches, corrupted data, unexpected compression or encoding, and oversized payloads. Because the Kafka broker just stores messages as-is, the failure almost always surfaces on the consumer side.

The key is to isolate failures, keep healthy records flowing, and leave behind enough breadcrumbs (original headers and exception details) to reprocess later. Unbounded retries and naive commits invite infinite loops or data loss, so avoid them.

Common causes: schema evolution mismatches (e.g., making a new field required), Serde misconfiguration, record corruption, character-encoding mismatches, exceeding max message size
Initial triage: consumer app logs, whether a DLQ exists, lag on the affected partition, reproducibility on the same record
Kafka does not have a DLQ as a built-in feature (Connect and Streams have dedicated strategies). With the plain Consumer API, you implement the pattern yourself

Deserialization failure caused by a value-type mismatch (example)

// Avro schema (age is int)
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}

// Actual record (age is a string) -> deserialization fails
{"name":"taro","age":"twenty"}

// Mismatches like this can be prevented up front via Schema Registry compatibility settings and producer-side validation.

Isolating and Routing Around Failures: DLQs, Retry Topics, and Stop Strategies

When you detect a poison pill, the default response is to route that record around the healthy pipeline without stopping it. The three common approaches are: send it to a DLQ (Dead Letter Queue), use a dedicated retry topic with exponential backoff, or explicitly stop processing to prioritize consistency.

Kafka Connect officially supports error handling and DLQ configuration. Kafka Streams lets you build error paths via DeserializationExceptionHandler or topology branching. With the plain Consumer API, you produce to a DLQ yourself and control your own commit points.

Include original headers, key, partition info, and exception message in the DLQ (so you have breadcrumbs for later)
Cap retries with both attempt count and wait time to avoid infinite retries (backoff plus a final DLQ landing spot)
A stop strategy is useful in strict SLAs like finance, but pair it with recovery runbooks and monitoring

Strategy	Strengths	Caveats / when to use
DLQ (isolation)	Keeps healthy flow running. Manual or batch repair afterward	DLQ operations and storage costs; requires post-processing design
Retry topic (delayed reprocessing)	Self-recovers from transient issues; backoff smooths load	Permanent failures end up in the DLQ anyway; watch ordering requirements
Immediate stop (fail-fast)	Consistency-first; surfaces anomalies immediately	Reduced availability; a recovery playbook is mandatory

Poison pill isolation flow (conceptual diagram)

Example Kafka Connect configuration to enable DLQ (works for both sinks and sources)

# connector.properties (excerpt)
errors.tolerance=all
errors.deadletterqueue.topic.name=dlq.user-v1
errors.deadletterqueue.context.headers.enable=true
errors.log.enable=true
errors.log.include.messages=true

# Tuning example for larger records (apply as needed)
producer.max.request.size=1048576
consumer.max.partition.fetch.bytes=1048576

Schema Compatibility Strategy: Prevention and Staged Migration

Most poison pills come from schema evolution mismatches. Use BACKWARD or BACKWARD_TRANSITIVE compatibility in Schema Registry as your default, and let producers preserve a coexistence window for old and new clients. Stage breaking changes - making a field required, changing a type - through default values or by adding a new field first and removing the old one later.

The subject naming strategy (TopicNameStrategy, RecordNameStrategy, etc.) determines the scope of compatibility checks. In both production and the exam, being able to clearly articulate the unit at which schemas evolve is crucial.

Default to backward compatibility (older consumers can read newer messages)
When breaking changes are necessary, run a coexistence window and roll out gradually
Distinguish global vs. per-subject compatibility settings in Schema Registry

Set compatibility level in Schema Registry (example: backward)

# Global compatibility
curl -s -X PUT -H 'Content-Type: application/json' \
  --data '{"compatibility":"BACKWARD"}' \
  http://schema-registry:8081/config

# Per-subject (e.g., topic-value)
curl -s -X PUT -H 'Content-Type: application/json' \
  --data '{"compatibility":"BACKWARD"}' \
  http://schema-registry:8081/config/my-topic-value

# Compatibility check (dry-run before registering)
curl -s -X POST -H 'Content-Type: application/vnd.schemaregistry.v1+json' \
  --data '{"schema":"<avro schema string>"}' \
  http://schema-registry:8081/compatibility/subjects/my-topic-value/versions/latest

Robust Design with the Consumer API: Commit Control and Retry Design

With the plain Consumer API, the DLQ is yours to build. The rule is: commit offsets only on success. On failure, produce the record to a DLQ and explicitly advance (seek) or fall through to skip handling so you do not loop on the same record forever.

Stamping the cause (exception class, schema ID, source partition/offset) into headers makes downstream repair and reprocessing much easier. Once you have fixed the issue, a batch job that replays records from the DLQ back into the source topic lightens the operational load.

Set enable.auto.commit=false and commitSync/Async only after successful processing
If you need raw bytes before deserialization, accept with ByteArrayDeserializer, validate, then convert
Balance max.poll.interval.ms against processing time, implement backoff, and account for re-entry on rebalance

Java (Kafka Consumer) example: DLQ forwarding and safe commits

var consumer = new KafkaConsumer<String, byte[]>(consumerProps);
var dlqProducer = new KafkaProducer<String, byte[]>(producerProps);
consumer.subscribe(List.of("input"));

while (true) {
  ConsumerRecords<String, byte[]> batch = consumer.poll(Duration.ofSeconds(1));
  for (ConsumerRecord<String, byte[]> r : batch) {
    try {
      // Custom decode (e.g., Avro/JSON validation)
      var record = decode(r.value());
      process(record);
      // Commit only on success (manual)
      consumer.commitSync(Collections.singletonMap(
        new TopicPartition(r.topic(), r.partition()),
        new OffsetAndMetadata(r.offset() + 1)));
    } catch (Exception e) {
      // Forward to DLQ (attach original info as headers)
      Headers h = new RecordHeaders(r.headers());
      h.add("error.class", e.getClass().getName().getBytes(StandardCharsets.UTF_8));
      h.add("error.message", Optional.ofNullable(e.getMessage()).orElse("").getBytes(StandardCharsets.UTF_8));
      h.add("source.topic", r.topic().getBytes(StandardCharsets.UTF_8));
      h.add("source.partition", Integer.toString(r.partition()).getBytes(StandardCharsets.UTF_8));
      h.add("source.offset", Long.toString(r.offset()).getBytes(StandardCharsets.UTF_8));
      dlqProducer.send(new ProducerRecord<>("dlq.input", null, r.key(), r.value(), h));

      // Advance to the next record (avoid infinite loop)
      consumer.seek(new TopicPartition(r.topic(), r.partition()), r.offset() + 1);
    }
  }
}

Operations and SLA: Lag, Error Rate, and DLQ Backlog

Visibility is non-negotiable for poison-pill defense. Continuously observe consumer lag, DLQ throughput and backlog, retry-topic reprocessing rate, and the distribution of exception types, then alert on thresholds. Connect and Streams expose rich metrics that you can visualize with JMX Exporter or Confluent Control Center.

The classic operational trap is a DLQ that quietly piles up. Dashboard the backlog (residency time) and record count, page the on-call when thresholds trip, and standardize periodic replay and repair procedures (reinjection, schema fixes, alternative transforms).

Lag check: use kafka-consumer-groups.sh --describe to inspect per-group lag instantly
Monitor record count, arrival rate, and residency time (processing latency) on DLQ/retry topics
Connect: leverage metrics like errors-total and deadletterqueue-produce-failures-total

Quick command examples for lag/backlog checks

# Consumer lag
kafka-consumer-groups.sh \
  --bootstrap-server broker:9092 \
  --group my-group \
  --describe

# Topic record count (approximate)
kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list broker:9092 \
  --topic dlq.input --time -1 | awk -F: '{sum+=$3} END {print sum}'

CCDAK Tips: Get Strong on Design-Choice Questions

CCDAK tests whether you can explain which strategy belongs at which layer. Frequent topics include Kafka Connect DLQ settings, Kafka Streams exception handlers, Schema Registry compatibility, and Consumer commit strategies. Approach them as production-design questions and be able to articulate the availability vs. consistency tradeoff.

Questions also extend beyond plain retries to message ordering, Exactly-Once Semantics (EOS), and transaction visibility (isolation.level=read_committed). Poison pills are outside EOS scope, but where you commit offsets directly affects consistency, so keep that in mind.

Connect: learn errors.tolerance, errors.deadletterqueue.topic.name, and headers.enable as a set
Streams: the difference between LogAndContinue and LogAndFail, and routing errors via branching
Consumer: commit only on success, attach metadata to the DLQ, and prevent infinite loops (seek/skip)

Kafka Streams: handling deserialization exceptions (handler configuration example)

Properties p = new Properties();
p.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
      org.apache.kafka.streams.errors.LogAndContinueExceptionHandler.class);
// Choosing LogAndFail stops the app by default (health-first)

StreamsBuilder b = new StreamsBuilder();
KStream<String, String> in = b.stream("input");
KStream<String, String>[] branches = in.branch(
  (k, v) -> isValid(v),    // healthy path
  (k, v) -> true           // error path
);
branches[0].to("output");
branches[1].to("dlq.input");

Check Your Understanding

CCDAK

問題 1

On a Kafka Connect sink connector, some records cannot be written due to schema mismatches. You want to keep healthy records flowing while making failed records reprocessable later. Which configuration combination is most appropriate?

Set errors.tolerance=none and stop the connector on failure
Set errors.tolerance=all and errors.deadletterqueue.topic.name, and enable header output
Set consumer.auto.offset.reset=earliest to retry automatically
Set producer acks=0 to prioritize throughput

正解: B

To keep healthy records flowing while isolating failures, use Connect's DLQ feature. The standard combination is errors.tolerance=all to tolerate errors, errors.deadletterqueue.topic.name to set the DLQ destination, and errors.deadletterqueue.context.headers.enable=true to include cause information in headers.

Frequently Asked Questions

Does Kafka itself include a DLQ feature?

The broker itself has no dedicated DLQ feature. Kafka Connect provides DLQ configuration, and Kafka Streams provides exception handlers and branching to route errors. With the plain Consumer API, you build your own logic to produce failed records to a DLQ topic.

How should offsets be handled when a poison pill is encountered?

The rule is to commit only on success. Forward failed records to a DLQ and advance with consumer.seek so you do not loop forever. Relying on auto commit can commit offsets for unprocessed records, raising the risk that they can never be reprocessed.

How do you reduce schema mismatches?

Keep backward compatibility in Schema Registry and roll out breaking changes in staged migrations. Adding schema validation on the producer side and automating compatibility checks (the /compatibility API) in CI/CD is highly effective.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.

Kafka Poison Pill Messages: A Practical Guide to Handling Unconsumable Data