The classic approach to turning transactional DB updates into near-real-time events and pushing them into Kafka is CDC with Debezium. You can integrate data with low latency and strong consistency while keeping application changes to a minimum.
This article walks through the points that commonly trip people up in production, in line with Kafka's documented behavior. It also highlights the areas the CCDAK exam tends to cover: Connect, serializers, partitioning, and delivery guarantees.
Debezium subscribes to the database change log (e.g., MySQL binlog, PostgreSQL WAL) and emits change events to Kafka topics as a Kafka Connect Source connector. Its strength is making full use of the ordering and consistency guaranteed by the DB engine without adding triggers or application changes.
Events are typically encoded in an envelope format that contains the before/after row images, the operation type (c: create, u: update, d: delete, r: read, emitted during a snapshot), a timestamp, and source metadata. Use the primary key (or a logical key) as the message key so that every change to a given row lands in the same partition and keeps its order there. Kafka guarantees ordering only within a partition.
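As a rough illustration of how a consumer might interpret the envelope, the sketch below reduces one event to an upsert or a delete. It assumes the value has already been deserialized into a plain dict without a schema wrapper; the function name `to_change` and the sample event are hypothetical.

```python
from typing import Any, Optional

# Operation codes carried in the envelope's "op" field.
UPSERT_OPS = {"c", "u", "r"}  # create, update, snapshot read


def to_change(key: Any, envelope: Optional[dict]) -> tuple:
    """Reduce one change event to ("upsert", key, row) or ("delete", key, None)."""
    if envelope is None:
        # A null value is a tombstone, which follows a delete event.
        return ("delete", key, None)
    op = envelope["op"]
    if op in UPSERT_OPS:
        # The "after" image holds the row state once the change is applied.
        return ("upsert", key, envelope["after"])
    if op == "d":
        return ("delete", key, None)
    raise ValueError(f"unexpected op: {op!r}")


# Hypothetical event for illustration only.
event = {
    "before": None,
    "after": {"id": 42, "email": "a@example.com"},
    "op": "c",
    "ts_ms": 1700000000000,
    "source": {"table": "customers"},
}
print(to_change({"id": 42}, event))
```

Treating c, u, and r identically is what lets a downstream table rebuild itself from a snapshot followed by streamed changes.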
| Approach | Typical Latency | Duplicate / Loss Tolerance | Schema Change Handling |
|---|---|---|---|
| Log-based CDC (Debezium) | Hundreds of ms to a few seconds | Loss is unlikely. Reprocessing can cause duplicates | Robust to column additions (assumes history management and schema evolution) |
| Polling (diff queries) | Tens of seconds to minutes | Risk of loss (update-timing races) | Requires query and application-side changes |
| DB triggers / queues | Application-dependent | Depends on implementation. Watch out for overload and failure resilience | Fragile under change (triggers must be updated) |
RDB → Debezium → Kafka data flow
Debezium runs as a Kafka Connect Source connector. You POST a connector configuration to the REST API of a Connect worker (distributed or standalone), and tasks read the DB log and emit to Kafka. Serialization in Connect is handled by converters rather than by producer serializers: choose the Avro, Protobuf, or JSON Schema converter paired with Confluent Schema Registry, or the JSON converter with schemas disabled (schemaless). CCDAK frequently asks about key/value converter configuration and schema evolution handling.
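A minimal sketch of preparing that POST with the Python standard library is shown here. The worker address `http://connect:8083`, the helper name `build_create_request`, and the trimmed config are assumptions for illustration; the request is built but deliberately not sent.

```python
import json
import urllib.request

CONNECT_URL = "http://connect:8083"  # assumed Connect worker address


def build_create_request(name: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /connectors request."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        f"{CONNECT_URL}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Trimmed config for illustration; a real connector needs the DB settings too.
request = build_create_request(
    "inventory-postgres-cdc",
    {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
    },
)
print(request.get_method(), request.full_url)
# To actually submit it:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Converter settings placed in the connector config override the worker-level defaults for that connector only.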
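PostgreSQL requires logical decoding (e.g., the pgoutput plugin). Prerequisites are wal_level set to logical, a publication, replication privileges for the connector user, replication slot management, and enough WAL retention for the slot to catch up. Topic names follow the <topic.prefix>.<schema>.<table> format (for example, dbserver1.public.customers). When provide.transaction.metadata is enabled, a <topic.prefix>.transaction topic carrying transaction boundary events is also created.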
Example Debezium PostgreSQL connector configuration POSTed to /connectors
```json
{
  "name": "inventory-postgres-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "<redacted>",
    "database.dbname": "inventory",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.autocreate.mode": "filtered",
    "tombstones.on.delete": "true",
    "topic.prefix": "dbserver1",
    "snapshot.mode": "initial",
    "schema.include.list": "public",
    "table.include.list": "public.customers,public.orders",
    "max.batch.size": "2048",
    "max.queue.size": "8192",
    "heartbeat.interval.ms": "10000",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "topic.creation.default.replication.factor": "3",
    "topic.creation.default.partitions": "6",
    "topic.creation.default.cleanup.policy": "compact,delete"
  }
}
```

Debezium messages use an envelope structure containing before/after, op, source, ts_ms, transaction, and more. When feeding upsert-style downstreams (e.g., ksqlDB tables, JDBC Sink upserts, DWH MERGE), the standard approach is to enable log compaction on the Kafka topic and use the primary key (or its equivalent) as the message key. A delete produces a delete event followed by a tombstone (same key, null value), so compaction keeps the latest state per key and eventually drops keys that were deleted.
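To make the effect of compaction concrete, here is a small model of the state a compacted topic converges to. It is a simplification: real compaction runs asynchronously in the broker and keeps tombstones for a retention period before removing them. The function name `compact` and the sample records are illustrative.

```python
from typing import Hashable, Iterable, Optional, Tuple


def compact(records: Iterable[Tuple[Hashable, Optional[dict]]]) -> dict:
    """Return the latest value per key, dropping keys whose last record is a tombstone."""
    latest = {}
    for key, value in records:
        if value is None:
            # Tombstone: the key is eventually removed from the log.
            latest.pop(key, None)
        else:
            latest[key] = value
    return latest


log = [
    (1, {"id": 1, "status": "new"}),
    (2, {"id": 2, "status": "new"}),
    (1, {"id": 1, "status": "paid"}),
    (2, None),  # row 2 deleted
]
print(compact(log))
```

This only works when every change to a row shares one key, which is why the primary key belongs in the message key.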
Control schema evolution (column additions, nullability changes, etc.) with Schema Registry compatibility settings. Prioritize changes that maintain backward compatibility, and configure things so that deployments violating compatibility are blocked. CCDAK frequently asks about compatibility modes (BACKWARD/FORWARD/FULL) and the fact that record keys and values have separate schemas, registered under separate subjects.
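The intuition behind BACKWARD compatibility (a consumer on the new schema can still read data written with the old one) can be sketched with a toy check. This is a deliberately simplified model, not Schema Registry's actual algorithm: it represents a schema as a dict of field specs and ignores Avro type promotion, unions, and aliases. All names are hypothetical.

```python
def is_backward_compatible(old: dict, new: dict) -> bool:
    """Toy check: can a reader using `new` decode data written with `old`?"""
    for name, spec in new.items():
        if name in old:
            # Simplification: an existing field must keep its type.
            if old[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            # A field the writer never produced needs a default to fill in.
            return False
    # Fields removed in `new` are simply ignored by the reader.
    return True


v1 = {"id": {"type": "long"}, "email": {"type": "string"}}
v2_ok = {**v1, "tier": {"type": "string", "default": "free"}}
v2_bad = {**v1, "tier": {"type": "string"}}

print(is_backward_compatible(v1, v2_ok))
print(is_backward_compatible(v1, v2_bad))
```

In practice the registry performs this check when a new schema version is registered, so an incompatible change fails at registration time rather than in consumers.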
Kafka Connect stores the source read position in an internal offsets topic. Debezium uses this to manage the binlog/WAL position and resume from the same place after restart. Small numbers of duplicates can occur during cluster rebalances or failures, so design the downstream to receive idempotently or via upsert.
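One way to receive idempotently is to remember the last applied source position per key and skip anything at or before it. The sketch below assumes each event carries a monotonically increasing position, such as the lsn in the source block of PostgreSQL events; the class `IdempotentTable` is hypothetical and keeps state in memory only.

```python
from typing import Hashable, Optional


class IdempotentTable:
    """In-memory upsert target that ignores replayed events."""

    def __init__(self) -> None:
        self.rows = {}
        self.applied_position = {}

    def apply(self, key: Hashable, position: int, row: Optional[dict]) -> bool:
        """Apply one change; return False if it was a duplicate or stale replay."""
        if position <= self.applied_position.get(key, -1):
            return False
        self.applied_position[key] = position
        if row is None:
            self.rows.pop(key, None)  # delete
        else:
            self.rows[key] = row  # upsert
        return True


table = IdempotentTable()
print(table.apply(1, 100, {"id": 1, "status": "new"}))
print(table.apply(1, 110, {"id": 1, "status": "paid"}))
print(table.apply(1, 100, {"id": 1, "status": "new"}))  # replay after a restart
print(table.rows)
```

A real sink would persist the position together with the row in the same transaction so that a crash cannot separate the two.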
A snapshot first reads every row in the captured tables, then switches to following the log. For large tables, consider the incremental snapshot feature, which reads the table in chunks while streaming continues, or run the snapshot outside business hours and apply I/O throttling. Debezium reconciles snapshot rows with log events while respecting transaction boundaries.
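The idea of a chunked snapshot reconciled against the log can be illustrated with the sketch below. It is not Debezium's implementation: it only shows reading in primary-key order and discarding snapshot rows that a log event superseded while the chunk was being read. The function names and the `id` key column are assumptions.

```python
from typing import Iterator, List, Set


def chunk_by_key(rows: List[dict], chunk_size: int) -> Iterator[List[dict]]:
    """Yield rows in primary-key order, chunk_size rows at a time."""
    ordered = sorted(rows, key=lambda row: row["id"])
    for start in range(0, len(ordered), chunk_size):
        yield ordered[start:start + chunk_size]


def merge_chunk(chunk: List[dict], keys_changed_in_window: Set[int]) -> List[dict]:
    """Drop snapshot rows whose key changed in the log during the chunk read.

    The log event is newer, so it wins over the snapshot row.
    """
    return [row for row in chunk if row["id"] not in keys_changed_in_window]


table_rows = [{"id": i, "v": "snap"} for i in (3, 1, 2, 5, 4)]
chunks = list(chunk_by_key(table_rows, chunk_size=2))
print([[row["id"] for row in chunk] for chunk in chunks])
print(merge_chunk(chunks[0], keys_changed_in_window={2}))
```

Small chunks keep each read short, which limits both the load on the database and the window in which a row can be superseded.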
Monitor across three layers: DB (WAL/binlog progress and retention), Connect/Debezium (task state, queue length, error rate, source position), and Kafka (publish latency, producer retries, broker-side throughput and replication). Connect and Debezium expose JMX metrics, so define clear alerting thresholds.
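For the Connect layer, task state can also be read from the worker's REST API at GET /connectors/<name>/status. The sketch below inspects an already-fetched status document and lists anything not RUNNING; the helper name `unhealthy_parts` and the sample payload are illustrative.

```python
from typing import List, Tuple


def unhealthy_parts(status: dict) -> List[Tuple[str, str]]:
    """Return (component, state) pairs for everything that is not RUNNING."""
    problems = []
    connector_state = status["connector"]["state"]
    if connector_state != "RUNNING":
        problems.append(("connector", connector_state))
    for task in status.get("tasks", []):
        if task["state"] != "RUNNING":
            problems.append((f"task-{task['id']}", task["state"]))
    return problems


# Sample status document shaped like the Connect REST response.
sample_status = {
    "name": "inventory-postgres-cdc",
    "connector": {"state": "RUNNING", "worker_id": "connect:8083"},
    "tasks": [{"id": 0, "state": "FAILED", "worker_id": "connect:8083"}],
}
print(unhealthy_parts(sample_status))
```

A connector can report RUNNING while its task has failed, so alerting on task state separately is worthwhile.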
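Balance performance across batch and queue sizes, poll intervals, topic partition counts, and replication factor. Ordering is guaranteed only within a partition, so adding partitions gives up ordering across keys, and changing the partition count later remaps keys to different partitions. Hot keys can also skew load regardless of partition count. Size partitions up front based on key cardinality and required throughput.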
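The key-to-partition relationship can be illustrated with a hash-modulo sketch. Kafka's default partitioner uses a murmur2 hash of the serialized key; `zlib.crc32` is swapped in here purely as a stand-in so the example runs with the standard library, and the key names are made up.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition by hash modulo (crc32 as a stand-in hash)."""
    return zlib.crc32(key) % num_partitions


keys = [f"customer-{i}".encode("utf-8") for i in range(1000)]

# The same key always maps to the same partition for a fixed partition count.
assert all(partition_for(k, 6) == partition_for(k, 6) for k in keys)

# Growing from 6 to 12 partitions moves some keys, which splits a key's
# history across two partitions.
moved = sum(1 for k in keys if partition_for(k, 6) != partition_for(k, 12))
print(f"{moved} of {len(keys)} keys change partition")
```

This is why the partition count of a compacted CDC topic is usually fixed at creation time rather than grown later.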
CCDAK frequently asks about Kafka Connect basics, serializers and Schema Registry, partitioning, log compaction, and delivery guarantees (the difference between at-least-once and exactly-once). Debezium-specific behavior is rarely asked directly, but how you design CDC topics and how downstream should consume them is fundamentally a Kafka design problem.
In particular, questions about key design, selection of topic cleanup policy, handling duplicates and consistency during reprocessing, and understanding Connect offset management are where candidates separate themselves.
CCDAK
Question 1
You are ingesting RDB row updates into Kafka with Debezium. Downstream will upsert-aggregate as a ksqlDB table. Which topic design is most appropriate?
Correct answer: B
For upsert-style downstreams (table semantics), the standard is to partition by primary key and let log compaction converge to the latest state per key. Use compact or compact,delete since deletes propagate via tombstones.
Is Schema Registry required?
It is not required. You can ship plain JSON. However, in both real-world operations and CCDAK, schema compatibility management and evolution matter, so using Avro/Protobuf/JSON Schema with Schema Registry is the recommended approach.
Are updates that happen during a snapshot lost?
No. After the snapshot completes, changes that occurred during the snapshot window are caught up from the log and applied. The Kafka topic eventually converges with subsequent events.
Can duplicate events occur, and how should they be handled?
By default, Kafka Connect source connectors deliver at-least-once, so duplicates can appear on restarts and rebalances. Maintain consistency through key-based upserts, idempotent sink processing, and use of transaction metadata.
NicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.