Kafka Connect is a pluggable framework that connects external systems with Kafka with minimal code. By simply deploying Source/Sink connectors and tasks onto workers and scaling them out, you can build a reliable data integration foundation.
This article covers both CCDAK exam topics (mode selection, internal topics, tasks/connectors, converters/SMTs, error handling) and real-world operational procedures.
Kafka Connect is the official framework for implementing and operating data integration with external systems (databases, object storage, SaaS, etc.) as connector plugins. Integration is configuration-based, eliminating the need to write dedicated Producers and Consumers in applications.
Workers (the Connect cluster) execute tasks in parallel, and in distributed mode they manage configuration, offsets, and status fault-tolerantly via internal Kafka topics. Operations are performed through the REST API, allowing rolling scale-out and upgrades.
Logical architecture of Kafka Connect
Create a minimal Source connector (REST)
curl -s -X POST http://localhost:8083/connectors \
-H 'Content-Type: application/json' \
-d '{
"name": "fs-source",
"config": {
"connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
"tasks.max": "1",
"file": "/tmp/input.txt",
"topic": "fs-input",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.storage.StringConverter"
}
}'Kafka Connect has two modes: Standalone and Distributed. The former runs as a single, simple process; the latter clusters workers to provide fault tolerance and scale-out. Managed offerings (e.g., Confluent Cloud) also exist, letting you outsource operations.
In distributed mode, state is held in Kafka via internal topics (config/offsets/status), and rebalancing happens automatically when workers are added, removed, or fail. Standalone is suited to development and single-purpose use cases.
| Aspect | Standalone | Distributed | Managed (example) |
|---|---|---|---|
| Operational unit | Single JVM process | Cluster of multiple workers | Managed by the provider |
| Fault tolerance | Halts on process failure | Automatic reassignment on worker failure | Redundancy handled by the service |
| Scaling | Mainly manual vertical scaling | Horizontal distribution of tasks across workers | Plan changes or auto-scaling |
| State management | Local files, etc. | Internal topics (config/offsets/status) | Service-internal metadata |
| Typical use case | Development / single-purpose / simple batch | Always-on production integration foundation | Fully managed requirements |
Comparison of startup commands (example)
# Standalone (single process)
connect-standalone worker.properties file-source.properties
# Distributed (worker cluster)
connect-distributed worker.properties # then create connectors via RESTConnectors own the responsibility of 'connecting to external systems and divide-and-conquer,' while tasks own the responsibility of being the 'unit of parallel execution.' Tasks are created according to tasks.max and, in distributed mode, are assigned across multiple workers.
Converters (key/value.converter) define the serialization format between Connect and Kafka. JSON/String/ByteArray are bundled; Avro/Protobuf and others generally require separate plugins and schema management (e.g., Schema Registry). SMTs are a lightweight, single-record transformation chain useful for shaping data before or after wiring.
SMT and converter configuration example (excerpt)
{
"name": "jdbc-sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"topics": "orders",
"tasks.max": "4",
"connection.url": "jdbc:postgresql://db:5432/app",
"insert.mode": "upsert",
"pk.mode": "record_key",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"transforms": "route,mask",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "(.*)",
"transforms.route.replacement": "prod_$1",
"transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.mask.fields": "card_number"
}
}In distributed mode, the internal topics connect-configs, connect-offsets, and connect-status are provisioned. All are subject to log compaction, and the recommendation is to set a sufficient replication factor (commonly 3) to ensure fault tolerance.
Source offsets store the external system's position (source partition + position) as a map, and Sinks store Kafka Consumer offsets. This allows processing to resume from where it left off after a worker failure or restart.
Topic names are specified in the worker configuration. They may be auto-created when not pre-created, but in production it is safer to manually create them per policy with explicit compaction and replication settings.
Distributed worker configuration (excerpt)
# Kafka connection
bootstrap.servers=broker1:9092,broker2:9092
group.id=connect-cluster
# Internal topics
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
# Converters (example: JSON without schemas)
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
# REST port
rest.port=8083
# (Optionally) security and Config Provider
# ssl.endpoint.identification.algorithm=https
# sasl.mechanism=PLAIN
# config.providers=env,file
# config.providers.env.class=org.apache.kafka.common.config.provider.EnvVarConfigProviderScale by adjusting tasks.max and the number of workers. Rebalancing automatically reassigns task allocations. Increasing Sink tasks beyond the input topic's partition count has little effect, so co-design with the partitioning strategy.
Combine REST and JMX for monitoring. You can check operational status via /connectors, /connectors/{name}/status, /connectors/{name}/tasks, etc. Failure behavior is controllable through errors.* settings, including offloading to a DLQ (dead-letter topic).
For rolling updates, restart workers one at a time and verify status after each restart. If prolonged rebalances or task start-up failures persist, prioritize checking the availability of internal topics and the health of external dependencies (DB connections, authentication).
DLQ and monitoring REST (example)
# Monitoring
curl -s http://localhost:8083/connectors | jq .
curl -s http://localhost:8083/connectors/my-conn/status | jq .
curl -s http://localhost:8083/connectors/my-conn/tasks | jq .
# Pause / Resume
curl -s -X PUT http://localhost:8083/connectors/my-conn/pause
curl -s -X PUT http://localhost:8083/connectors/my-conn/resume
# Error handling (part of connector configuration)
"errors.tolerance": "all", # all or none
"errors.retry.timeout": "60000", # ms
"errors.retry.delay.max.ms": "5000",
"errors.deadletterqueue.topic.name": "dlq.my-conn",
"errors.deadletterqueue.context.headers.enable": "true"Exam questions frequently cover mode selection, the role of internal topics, the relationship between tasks and partitions, the separation of responsibilities between converters and SMTs, and error-handling configuration. Design questions tend to ask whether you pick idempotent downstream updates (upserts / idempotent APIs) under at-least-once delivery.
A common pattern is CDC Source → Kafka → multiple Sinks (DWH, search, lake). Account for schema evolution and decide on schema management and a compatibility strategy (such as backward compatibility). For performance, fine-tune with task parallelism, appropriate partition counts, batch sizes, and flush intervals.
Scaling design notes (approximation)
# Estimate tasks.max from target throughput and partition count
# P = input partition count, Rt = target total rows/sec, Rc = rows/sec per task
# tasks.max ≈ min(P, ceil(Rt / Rc))
# Example: P=24, Rt=48k rps, Rc=3k rps → tasks.max=min(24, ceil(48000/3000)=16)=16CCDAK
問題 1
You want to choose a Kafka Connect configuration that meets production fault tolerance and scale requirements. The requirements are that tasks be automatically reassigned upon worker failure and that processing resume from where it left off after restart. Which is the most appropriate choice?
正解: A
The requirements are fault tolerance, automatic reassignment, and resumption from the previous position. These are satisfied by Distributed mode plus redundancy of the internal topics. Standalone runs as a single process and does not reassign, and an internal-topic replica count of 1 is a single point of failure. Simply increasing the task count does not trigger automatic reassignment.
What is the difference between Kafka Streams and Kafka Connect?
Kafka Connect is a framework specialized in I/O with external systems (the I/O part of ETL), centered on configuration-based, pluggable connectors. Kafka Streams is a library for writing stream processing (aggregations, joins, state management, etc.) in application code. They serve different purposes and complement each other.
Can Kafka Connect achieve exactly-once semantics?
Kafka Connect generally operates with at-least-once semantics. When duplicates are not acceptable, design for idempotency downstream (upserts by primary key, deduplication keys, transactional APIs, and so on).
Is Schema Registry required?
It is not required. You can operate with the JSON/String/ByteArray converters bundled with Kafka Connect. However, when schema evolution and type safety matter, combining Avro/Protobuf converters with schema management (such as Schema Registry) is the practical choice.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
Kafka Topics & Partitions: Distribution Fundamentals (2026)
How Kafka topics and partitions enable scale — ordering guar...
CCDAK Exam Guide: Confluent Certified Developer (2026)
Complete prep for the CCDAK exam — Producer/Consumer API, St...
CCAAK Exam Guide: Confluent Certified Administrator (2026)
Pass the CCAAK exam — cluster management, partitions, securi...
Kafka Replicas & ISR: Fault Tolerance Explained (2026)
Replica placement, in-sync replicas (ISR), leader election. ...
Kafka Offsets: Commit Modes & Consumer Position (2026)
Offset semantics — auto vs. manual commit, __consumer_offset...