Kafka Connect Source Connectors are the official mechanism for continuously ingesting data from external systems — databases, files, SaaS — into Kafka topics.
This article organizes architecture, scaling, schema and ordering, reliability and operations, deployment options, and security and performance tuning into 5-7 sections, and calls out the points that matter for CCDAK.
Kafka Connect standardizes data integration between external systems and Kafka through pluggable connectors. Source Connectors handle one-way sync from outside into Kafka, structured as a job definition (Connector) and parallel execution units (Tasks). Configuration, execution, and monitoring all happen through the REST API.
In production, you consider Source Connectors before custom Producer apps because of operational stability, reusability, and maintainability. They shine for RDB ingestion (JDBC/CDC), object storage, log/file ingestion, and SaaS event subscription.
Kafka Connect Workers (processes) form a cluster, distributing Connector definitions and assigning Tasks across Workers. Scale out mainly by raising tasks.max and ensuring enough Kafka partitions. Rebalances occur when Workers come or go, or when configurations change.
Reliability rests on internal topic replication and quorum, acks=all, and a sufficient min.insync.replicas. Topic design and producer settings are delegated to or overridden by the Connector, so make them explicit per requirement.
Logical structure of a Source Connector
Key design is the single most important decision. If you need ordering at the entity level, use the same key for the same entity to funnel it into the same partition. Kafka only guarantees ordering within a partition. Plan key stability and cardinality with post-ingestion aggregation and downstream shuffling in mind.
Designing around Schema Registry makes change management easier. Default to Backward compatibility, prefer adding columns, and avoid deleting fields or changing types. For JDBC/CDC, also account for DDL diffs and nullability.
JDBC Source Connector example (incremental sync)
{
"name": "jdbc-users-source",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:postgresql://db:5432/app",
"connection.user": "appuser",
"connection.password": "${file:/opt/connect/secrets.properties:db.password}",
"mode": "timestamp+incrementing",
"incrementing.column.name": "id",
"timestamp.column.name": "updated_at",
"table.whitelist": "public.users",
"topic.prefix": "src.postgres.",
"poll.interval.ms": "10000",
"batch.max.rows": "5000",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"producer.override.acks": "all",
"producer.override.enable.idempotence": "true",
"errors.tolerance": "all",
"errors.deadletterqueue.topic.name": "_dlq.src.postgres.users",
"errors.deadletterqueue.context.headers.enable": "true"
}
}Kafka Connect Sources generally deliver at-least-once. Because offsets are committed after a successful send, duplicates can occur on crash recovery. Preserve consistency with idempotent key design, downstream deduplication, or a change-detection column on the source side.
Design error handling in stages. Send record-conversion errors and schema mismatches to a DLQ and fix them operationally. Retry transient errors with backoff, and route permanent errors to the DLQ with an alert.
Use standalone for development and testing, distributed in production. In distributed mode, Connector configurations are shared via internal topics and Tasks are reassigned automatically across Workers. Updates and scale-out happen online through the REST API.
Cases where you should build a custom Producer are limited. Consider it when the connector cannot cover your API spec or when you need strict custom transaction control. Try the existing connector plus SMT combination first.
| Mode | Scale / Availability | Rebalancing / Horizontal Scaling | Management Cost / Use Cases |
|---|---|---|---|
| Standalone | Small scale / single machine (no redundancy) | No rebalance; runs inside a single process | Lowest cost; for development, PoC, and one-off ingestion |
| Distributed | Medium to large scale / high availability (multiple Workers) | Automatic rebalance reassigns Tasks; supports rolling updates | Production standard; centralized management of multiple connectors |
| Custom Producer implementation | Requirement-dependent; availability is your responsibility | Scale managed at the application layer | Only when the connector does not cover the API or you need strict custom control |
Protect Kafka connections with SASL/SSL and apply least-privilege ACLs. Reference external source credentials indirectly through a Config Provider (file, environment variable, external secret store) and avoid plaintext. For auditing, classify data including the DLQ and internal topics.
Performance requires end-to-end optimization. Tune task parallelism, the external source's paging/scan strategy, Connector-specific poll/batch settings, Producer batching and latency, Kafka topic partitions and compression, and network MTU together.
CCDAK
問題 1
You want to continuously ingest an internal PostgreSQL users table into Kafka. After the initial full snapshot, only updates should be ingested, consistency must be preserved, and downstream can absorb duplicates. Which connector configuration is most appropriate?
正解: A
JDBC Source is the right fit for RDB-to-Kafka ingestion; mode=timestamp+incrementing safely handles the initial full load plus incremental updates. acks=all and idempotence minimize the impact when duplicates occur. The other options do not match the requirements.
Do Source Connectors provide exactly-once delivery guarantees?
Generally at-least-once. Duplicates can occur, so design around key strategies and downstream deduplication. Idempotent producers and transactions can minimize duplicates under specific conditions, but strict end-to-end exactly-once depends on the source characteristics and the connector implementation.
Throughput is plateauing. What should I check first?
Check in order: the balance between tasks.max and the Kafka partition count, the splittability of the external source (table sharding / parallel scans), poll.interval.ms and batch settings, Producer linger.ms / batch.size / compression, network and disk IO, and the broker health of the internal topics.
How are connector config updates applied? Do I need to stop the connector?
In distributed mode, PUT/POST the config via the REST API; the Connect cluster propagates the change to the internal topics and restarts or reassigns Tasks as needed. Stopping is usually unnecessary, but some settings trigger a task restart and a brief rebalance.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
Kafka Topics & Partitions: Distribution Fundamentals (2026)
How Kafka topics and partitions enable scale — ordering guar...
CCDAK Exam Guide: Confluent Certified Developer (2026)
Complete prep for the CCDAK exam — Producer/Consumer API, St...
CCAAK Exam Guide: Confluent Certified Administrator (2026)
Pass the CCAAK exam — cluster management, partitions, securi...
Kafka Replicas & ISR: Fault Tolerance Explained (2026)
Replica placement, in-sync replicas (ISR), leader election. ...
Kafka Offsets: Commit Modes & Consumer Position (2026)
Offset semantics — auto vs. manual commit, __consumer_offset...