Debezium CDC: Real-Time Replication from RDB to Kafka

2026-04-19
NicheeLab Editorial Team

The classic approach to turning transactional DB updates into near-real-time events and pushing them into Kafka is CDC with Debezium. You can integrate data with low latency and strong consistency while keeping application changes to a minimum.

This article organizes the points that commonly trip people up in production, aligned with Kafka's documented behavior. It also clarifies the areas the CCDAK exam tends to ask about: Connect, serializers, partitioning, and delivery guarantees.

Why Debezium for CDC: Mechanism and Big Picture

Debezium subscribes to the database change log (e.g., MySQL binlog, PostgreSQL WAL) and emits change events to Kafka topics as a Kafka Connect Source connector. Its strength is making full use of the ordering and consistency guaranteed by the DB engine without adding triggers or application changes.

Events are typically encoded in an envelope that contains the before/after row images, the operation type (c: create, u: update, d: delete, r: snapshot read), a timestamp, and source metadata. Use the primary key (or a logical key) as the message key so that all changes to the same row land in the same partition and keep their order in Kafka.
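As a rough illustration of the envelope described above, the following Python sketch reads the op code and picks the row image that matters for it. The sample key and value are hand-written for illustration (the customers table and all field values are assumptions); real events carry more fields depending on connector and converter settings.

```python
# Hypothetical, hand-written change event; real events carry more fields.
OP_NAMES = {"c": "create", "u": "update", "d": "delete", "r": "snapshot read"}

sample_key = {"id": 1001}
sample_value = {
    "before": {"id": 1001, "email": "old@example.com"},
    "after": {"id": 1001, "email": "new@example.com"},
    "op": "u",
    "ts_ms": 1713500000000,
    "source": {"db": "inventory", "schema": "public", "table": "customers"},
}


def summarize(value):
    """Return (operation name, most relevant row image) for one envelope."""
    op = value["op"]
    # For deletes only the "before" image is populated; otherwise use "after".
    row = value["before"] if op == "d" else value["after"]
    return OP_NAMES[op], row


print(sample_key, summarize(sample_value))
```

The message key (here the primary key column) travels separately from the value, which is why key and value can be serialized with different schemas.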

  • Log-based CDC with low latency and low risk of data loss
  • Unified operations through Kafka Connect (scaling, restarts, offset management)
  • Key design and log compaction fit downstream upsert processing

Approach | Typical Latency | Duplicate / Loss Tolerance | Schema Change Handling
Log-based CDC (Debezium) | Hundreds of ms to a few seconds | Loss is unlikely. Reprocessing can cause duplicates | Robust to column additions (assumes history management and schema evolution)
Polling (diff queries) | Tens of seconds to minutes | Risk of loss (update-timing races) | Requires query and application-side changes
DB triggers / queues | Application-dependent | Depends on implementation. Watch out for overload and failure resilience | Fragile under change (triggers must be updated)

[Figure: RDB → Debezium → Kafka data flow. Relational DB (generates binlog/WAL) → Debezium Source (Kafka Connect) → Kafka (topics: db.server.schema.table) → Stream/Sink Apps (ksqlDB/Sink CT)]

Kafka Connect and Debezium Connector Configuration (PostgreSQL example)

Debezium runs as a Kafka Connect Source connector. You POST a connector configuration to a Connect worker (distributed or standalone), and tasks read the DB log and emit to Kafka. For the serializer, choose between Avro/Protobuf/JSON Schema paired with Confluent Schema Registry, or plain JSON (schemaless). CCDAK frequently asks about key/value converter configuration and schema evolution handling.

PostgreSQL requires logical decoding through an output plugin (e.g., pgoutput). Publication and replication privileges, slot management, and WAL configuration (wal_level=logical, sufficient retention) are prerequisites. Topic names follow the topicPrefix.schema.table format by default (dbserver1.public.customers in the example below), and a separate topic for transaction metadata is created when provide.transaction.metadata is enabled.

  • Connect worker: in distributed mode, source offsets are stored in the internal topic named by offset.storage.topic (e.g., connect-offsets); standalone mode uses a local file. Design for duplicate tolerance to handle reprocessing during restarts and rebalances.
  • Converter: keep key.converter and value.converter consistent. A schema-bearing format makes evolution management easier.
  • Use topic.creation to lock down partition count and compaction policy at initialization time.

Example Debezium PostgreSQL connector configuration POSTed to /connectors

{
  "name": "inventory-postgres-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "<redacted>",
    "database.dbname": "inventory",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.autocreate.mode": "filtered",
    "tombstones.on.delete": "true",
    "topic.prefix": "dbserver1",
    "snapshot.mode": "initial",
    "schema.include.list": "public",
    "table.include.list": "public.customers,public.orders",
    "max.batch.size": "2048",
    "max.queue.size": "8192",
    "heartbeat.interval.ms": "10000",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "topic.creation.default.replication.factor": "3",
    "topic.creation.default.partitions": "6",
    "topic.creation.default.cleanup.policy": "compact,delete"
  }
}
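A minimal sketch of how the configuration above could be submitted and sanity-checked from Python, using only the standard library. The Connect URL http://localhost:8083 is an assumption (8083 is the usual REST port), and the helper names build_create_request and expected_topics are hypothetical. The request is built but not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request


def build_create_request(connect_url, connector):
    """Build (but do not send) the POST /connectors request for a connector definition."""
    body = json.dumps(connector).encode("utf-8")
    return urllib.request.Request(
        url=connect_url.rstrip("/") + "/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def expected_topics(config):
    """Derive data topic names as <topic.prefix>.<schema>.<table> from table.include.list."""
    prefix = config["topic.prefix"]
    tables = [t.strip() for t in config["table.include.list"].split(",")]
    return [f"{prefix}.{table}" for table in tables]


connector = {
    "name": "inventory-postgres-cdc",
    "config": {
        "topic.prefix": "dbserver1",
        "table.include.list": "public.customers,public.orders",
        # ... remaining properties as in the configuration above ...
    },
}

request = build_create_request("http://localhost:8083", connector)
print(request.get_method(), request.full_url)
print(expected_topics(connector["config"]))
# To actually submit it: urllib.request.urlopen(request)
```

Deriving the topic names up front is a cheap way to confirm that downstream consumers and topic.creation rules point at the names the connector will actually write to.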

Schema and Message Design: Keys, Envelope, Compaction

Debezium messages use an envelope structure containing before/after, op, source, ts_ms, transaction, and more. When feeding upsert-style downstreams (e.g., ksqlDB tables, JDBC Sink upserts, DWH MERGE), the standard is to enable log compaction on the Kafka topic and put the primary-key equivalent in the message key. Deletes are represented as tombstone messages, and compaction preserves the final state.

Control schema evolution (column additions, NULL allowance changes, etc.) with Schema Registry compatibility settings. Prioritize changes that maintain backward compatibility, and configure things so deployments that violate compatibility are blocked. CCDAK frequently asks about compatibility modes (BACKWARD/FORWARD/FULL) and the fact that record keys and values have separate schemas.
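To make the BACKWARD rule concrete, here is a deliberately simplified sketch for flat records. It is not Schema Registry's actual algorithm: the function backward_compatible and the field-dictionary shape are assumptions made for illustration, and real Avro resolution also allows some type promotions that this toy check rejects.

```python
def backward_compatible(old_fields, new_fields):
    """Simplified BACKWARD check for flat records.

    Each argument maps field name -> {"type": ..., "default": ...} ("default" optional).
    The new schema must be able to read data written with the old one, so every
    added field must declare a default; removed fields are simply ignored.
    Type changes are treated as incompatible (real Avro allows some promotions).
    """
    for name, spec in new_fields.items():
        if name not in old_fields:
            if "default" not in spec:
                return False
        elif old_fields[name]["type"] != spec["type"]:
            return False
    return True


v1 = {"id": {"type": "long"}, "email": {"type": "string"}}
v2_ok = {**v1, "phone": {"type": ["null", "string"], "default": None}}
v2_bad = {**v1, "phone": {"type": "string"}}

print(backward_compatible(v1, v2_ok))   # True
print(backward_compatible(v1, v2_bad))  # False
```

This mirrors the common case for CDC: adding a nullable column with a default is safe for consumers on the new schema, while adding a required column without a default is not.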

  • Use the primary key or a logical key as the message key. Avoid keyless messages (you lose the basis for partition distribution and ordering).
  • One topic per table is the default. For high throughput, reconsider the sharding key (but watch ordering requirements).
  • Delete representation: the delete event has op=d, the old row in before, and after set to null. A tombstone (same key, null value) follows so that compaction can eventually drop the key.
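The key, delete, and tombstone rules above can be sketched as a small in-memory upsert. The functions apply_change and compact are hypothetical helpers, the events are hand-written, and compact only mimics the eventual end state of log compaction (the broker compacts asynchronously and keeps tombstones for a retention period).

```python
def apply_change(table, key, value):
    """Apply one Debezium record to an in-memory table keyed by primary key."""
    if value is None:
        # Tombstone: null value, used by log compaction to drop the key.
        table.pop(key, None)
    elif value["op"] == "d":
        # Delete event: "after" is null, the old row (if any) is in "before".
        table.pop(key, None)
    else:
        # c (create), u (update), r (snapshot read): upsert the "after" image.
        table[key] = value["after"]
    return table


def compact(log):
    """Mimic the end state of log compaction: latest value per key, tombstoned keys dropped."""
    latest = {}
    for key, value in log:
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}


log = [
    (1, {"op": "c", "before": None, "after": {"id": 1, "qty": 5}}),
    (2, {"op": "c", "before": None, "after": {"id": 2, "qty": 1}}),
    (1, {"op": "u", "before": {"id": 1, "qty": 5}, "after": {"id": 1, "qty": 7}}),
    (2, {"op": "d", "before": {"id": 2, "qty": 1}, "after": None}),
    (2, None),  # tombstone that follows the delete event
]

table = {}
for key, value in log:
    apply_change(table, key, value)
print(table)               # {1: {'id': 1, 'qty': 7}}
print(list(compact(log)))  # [1]
```

Both views end with only key 1, which is the property upsert-style consumers rely on: replaying the compacted topic yields the same final table as replaying the full history.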

Practical Points on Offsets, Snapshots, and Reprocessing

Kafka Connect stores the source read position in an internal offsets topic. Debezium uses this to manage the binlog/WAL position and resume from the same place after restart. Small numbers of duplicates can occur during cluster rebalances or failures, so design the downstream to receive idempotently or via upsert.

A snapshot first reads every row in the captured tables, then transitions to log following. For large tables, consider the incremental snapshot feature, which reads the table in chunks while streaming continues, or run the snapshot outside business hours and apply I/O throttling. Debezium reconciles the snapshot and log application while respecting transaction boundaries.
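Incremental snapshots are started by writing a signal row to a signaling table that the connector watches. The sketch below only builds that INSERT; it assumes a signaling table named public.debezium_signal (a hypothetical name that has to match the connector's signal.data.collection setting) and psycopg-style %s placeholders.

```python
import json
import uuid


def incremental_snapshot_signal(tables, signal_table="public.debezium_signal"):
    """Return (sql, params) for an execute-snapshot signal row.

    signal_table is a hypothetical name; it must match the connector's
    signal.data.collection setting and have id, type, and data columns.
    """
    data = json.dumps({"data-collections": list(tables), "type": "incremental"})
    sql = f"INSERT INTO {signal_table} (id, type, data) VALUES (%s, %s, %s)"
    return sql, (str(uuid.uuid4()), "execute-snapshot", data)


sql, params = incremental_snapshot_signal(["public.orders"])
print(sql)
print(params[1], params[2])
```

Executing the statement with a database driver is left out on purpose so the snippet runs without a database.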

  • snapshot.mode: initial / initial_only / never, among others (choose based on requirements and downtime tolerance; the available values differ between Debezium versions).
  • Updates during a snapshot are filled in by log following. On the topic side, later events drive convergence to the final state.
  • If reprocessing is needed, reset the connector's source offsets and re-read (re-snapshot if necessary). Make sure the downstream eliminates duplicates.
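One way to make a consumer tolerate the redelivery described above is to remember the highest log position applied per key and skip anything at or below it. This is a sketch under assumptions: it relies on the lsn field in the PostgreSQL connector's source block increasing for successive changes to the same key, the class IdempotentApplier is hypothetical, and the events are hand-written.

```python
class IdempotentApplier:
    """Upsert rows by key and skip events at or below the last applied log position."""

    def __init__(self):
        self.rows = {}       # key -> current row image
        self.last_lsn = {}   # key -> highest source LSN applied so far
        self.skipped = 0

    def apply(self, key, value):
        if value is None:
            # Tombstone: the preceding delete event already removed the row.
            return
        lsn = value["source"]["lsn"]
        if lsn <= self.last_lsn.get(key, -1):
            self.skipped += 1   # duplicate or replayed event
            return
        self.last_lsn[key] = lsn
        if value["op"] == "d":
            self.rows.pop(key, None)
        else:
            self.rows[key] = value["after"]


events = [
    (1, {"op": "c", "after": {"id": 1, "qty": 5}, "source": {"lsn": 100}}),
    (1, {"op": "u", "after": {"id": 1, "qty": 7}, "source": {"lsn": 200}}),
]
applier = IdempotentApplier()
for key, value in events + events:   # second pass simulates redelivery after a restart
    applier.apply(key, value)
print(applier.rows, applier.skipped)  # {1: {'id': 1, 'qty': 7}} 2
```

In a real sink the position map would have to live in the same store (and ideally the same transaction) as the rows, otherwise it is lost on restart.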

Operations, Monitoring, and Performance Optimization

Monitor across three layers: DB (WAL/binlog progress and retention), Connect/Debezium (task state, queue length, error rate, source position), and Kafka (publish latency, producer retries, broker-side throughput and replication). Connect and Debezium expose JMX metrics, so define clear alerting thresholds.

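Balance performance across batch and queue sizes, poll intervals, topic partition counts, and replication factor. More partitions raise parallelism, but ordering holds only within a partition, a few hot keys can still skew load, and changing the partition count later remaps keys to different partitions. Size the partition count up front based on key cardinality and throughput.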

  • Debezium: tune max.batch.size, max.queue.size, and poll.interval.ms against load and latency.
  • Kafka: lock down compaction and partition count up front via topic.creation.*.
  • DB: keep WAL/binlog retention long enough. Make sure slots/positions cannot expire during lag.
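For the Connect/Debezium layer, task state is also available over the Connect REST API at GET /connectors/<name>/status. The sketch below checks such a payload for anything that is not RUNNING. The payload is hand-written in the shape that endpoint returns, and unhealthy_components is a hypothetical helper.

```python
def unhealthy_components(status):
    """List connector/task entries whose state is not RUNNING in a status payload."""
    problems = []
    if status["connector"]["state"] != "RUNNING":
        problems.append(("connector", status["connector"]["state"]))
    for task in status.get("tasks", []):
        if task["state"] != "RUNNING":
            problems.append((f"task-{task['id']}", task["state"]))
    return problems


# Hand-written example payload; worker_id values are placeholders.
sample_status = {
    "name": "inventory-postgres-cdc",
    "connector": {"state": "RUNNING", "worker_id": "connect:8083"},
    "tasks": [{"id": 0, "state": "FAILED", "worker_id": "connect:8083"}],
    "type": "source",
}

print(unhealthy_components(sample_status))  # [('task-0', 'FAILED')]
```

A connector can report RUNNING while its task has FAILED, so alerting on the task list rather than the connector state alone avoids a silent stall.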

CCDAK Prep: Frequent Topics and Pitfalls

CCDAK frequently asks about Kafka Connect basics, serializers and Schema Registry, partitioning, log compaction, and delivery guarantees (the difference between at-least-once and exactly-once). Debezium-specific behavior is rarely asked directly, but how you design CDC topics and how downstream should consume them is fundamentally a Kafka design problem.

In particular, questions about key design, selection of topic cleanup policy, handling duplicates and consistency during reprocessing, and understanding Connect offset management are where candidates separate themselves.

  • Put the primary key in the message key, and route the same key to the same partition. Order is only guaranteed within a partition.
  • Enable log compaction and understand tombstones. Achieve both delete propagation and storage efficiency.
  • Connect source connectors deliver at-least-once by default. Receive idempotently or via upsert downstream.
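The key-to-partition rule in the list above can be illustrated with a stand-in hash. Kafka's default partitioner hashes the serialized key with murmur2; the sketch uses zlib.crc32 instead purely for illustration, so the partition numbers it prints are not the ones Kafka would choose. The function partition_for and the customer-N keys are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Illustrative stand-in for the default partitioner (Kafka itself uses murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"customer-{i}" for i in range(1000)]

# Same key, same partition count: always the same partition, so per-key order holds.
stable = all(partition_for(k, 6) == partition_for(k, 6) for k in keys)

# Changing the partition count remaps part of the key space.
moved = sum(1 for k in keys if partition_for(k, 6) != partition_for(k, 12))

print(stable)
print(f"{moved} of {len(keys)} keys changed partition going from 6 to 12")
```

The second number is the reason to fix the partition count before data arrives: after a resize, new events for a remapped key land in a different partition than its older events, and their relative order is no longer guaranteed.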

Check with a Sample Question

CCDAK

Question 1

You are ingesting RDB row updates into Kafka with Debezium. Downstream will upsert-aggregate as a ksqlDB table. Which topic design is most appropriate?

  A. More partitions is better. Leave the key unset (null) for random distribution and set cleanup.policy=delete
  B. Use the table's primary key as the record key and configure cleanup.policy=compact or compact,delete
  C. Use the update timestamp as the key and set cleanup.policy=delete. Only the latest message needs to remain
  D. Issue a new UUID as the key every time and set cleanup.policy=compact to avoid duplicates

Correct answer: B

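For upsert-style downstreams (table semantics), the standard is to partition by primary key and let log compaction converge to the latest state per key. Use compact or compact,delete since deletes propagate via tombstones. Null, timestamp, or per-message UUID keys do not identify a row, so neither per-row ordering nor compaction to the latest state works with them.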

Frequently Asked Questions

Is Schema Registry required?

It is not required. You can ship plain JSON. However, in both real-world operations and CCDAK, schema compatibility management and evolution matter, so using Avro/Protobuf/JSON Schema with Schema Registry is the recommended approach.

Are updates that happen during a snapshot lost?

No. After the snapshot completes, changes that occurred during the snapshot window are caught up from the log and applied. The Kafka topic eventually converges with subsequent events.

Can duplicate events occur, and how should they be handled?

Kafka Connect delivers at-least-once, so duplicates can appear on restarts and rebalances. Maintain consistency through key-based upserts, idempotent sink processing, and use of transaction metadata.
