Kafka

Kafka Connect Overview: The Standard Foundation for Data Integration

2026-04-19
NicheeLab Editorial Team

Kafka Connect is a pluggable framework that connects external systems with Kafka with minimal code. By simply deploying Source/Sink connectors and tasks onto workers and scaling them out, you can build a reliable data integration foundation.

This article covers both CCDAK exam topics (mode selection, internal topics, tasks/connectors, converters/SMTs, error handling) and real-world operational procedures.

Kafka Connect: The Big Picture and Its Role

Kafka Connect is the official framework for implementing and operating data integration with external systems (databases, object storage, SaaS, etc.) as connector plugins. Integration is configuration-based, eliminating the need to write dedicated Producers and Consumers in applications.

Workers (the Connect cluster) execute tasks in parallel, and in distributed mode they manage configuration, offsets, and status fault-tolerantly via internal Kafka topics. Operations are performed through the REST API, allowing rolling scale-out and upgrades.

  • Source connectors: ingest from external systems into Kafka
  • Sink connectors: deliver from Kafka to external systems
  • Tasks: the unit of parallel execution for a connector (key to throughput and fault tolerance)
  • Converters and SMTs: separate the responsibilities of serialization and transformation to improve reusability

Logical architecture of Kafka Connect

pollconfigsoffsetsdeliverSource SystemAKafka Clusterconnect-configs / offsets / statusSink SystemBConnect Worker #1REST:8083 / Task S1,S2Connect Worker #2REST:8083 / Task K1,K2Kafka TopicsLogical architecture of Kafka Connect

Create a minimal Source connector (REST)

curl -s -X POST http://localhost:8083/connectors \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "fs-source",
    "config": {
      "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
      "tasks.max": "1",
      "file": "/tmp/input.txt",
      "topic": "fs-input",
      "key.converter": "org.apache.kafka.connect.storage.StringConverter",
      "value.converter": "org.apache.kafka.connect.storage.StringConverter"
    }
  }'

Comparing Execution Modes and Selection Criteria

Kafka Connect has two modes: Standalone and Distributed. The former runs as a single, simple process; the latter clusters workers to provide fault tolerance and scale-out. Managed offerings (e.g., Confluent Cloud) also exist, letting you outsource operations.

In distributed mode, state is held in Kafka via internal topics (config/offsets/status), and rebalancing happens automatically when workers are added, removed, or fail. Standalone is suited to development and single-purpose use cases.

  • Use Distributed for production as a rule. Start quickly with Standalone for local development or PoCs
  • Even with tight latency requirements, first consider solving them via task parallelism and sufficient partition design
  • Evaluate managed services holistically including SLA, operations, and connectivity requirements (VPC Peering, etc.)
AspectStandaloneDistributedManaged (example)
Operational unitSingle JVM processCluster of multiple workersManaged by the provider
Fault toleranceHalts on process failureAutomatic reassignment on worker failureRedundancy handled by the service
ScalingMainly manual vertical scalingHorizontal distribution of tasks across workersPlan changes or auto-scaling
State managementLocal files, etc.Internal topics (config/offsets/status)Service-internal metadata
Typical use caseDevelopment / single-purpose / simple batchAlways-on production integration foundationFully managed requirements

Comparison of startup commands (example)

# Standalone (single process)
connect-standalone worker.properties file-source.properties

# Distributed (worker cluster)
connect-distributed worker.properties  # then create connectors via REST

Components: Connectors / Tasks / Converters / SMTs

Connectors own the responsibility of 'connecting to external systems and divide-and-conquer,' while tasks own the responsibility of being the 'unit of parallel execution.' Tasks are created according to tasks.max and, in distributed mode, are assigned across multiple workers.

Converters (key/value.converter) define the serialization format between Connect and Kafka. JSON/String/ByteArray are bundled; Avro/Protobuf and others generally require separate plugins and schema management (e.g., Schema Registry). SMTs are a lightweight, single-record transformation chain useful for shaping data before or after wiring.

  • Tasks are the basic unit of throughput and availability. When scaling, be mindful of the relationship between tasks.max and the input partition count
  • SMTs are for lightweight transformations. Implement complex logic in Kafka Streams or downstream
  • Converters bridge Connect's internal representation and Kafka byte arrays. Pay attention to the schema-enable settings

SMT and converter configuration example (excerpt)

{
  "name": "jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "orders",
    "tasks.max": "4",
    "connection.url": "jdbc:postgresql://db:5432/app",
    "insert.mode": "upsert",
    "pk.mode": "record_key",

    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",

    "transforms": "route,mask",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "(.*)",
    "transforms.route.replacement": "prod_$1",
    "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
    "transforms.mask.fields": "card_number"
  }
}

Internal Topics and the Reliability Mechanism

In distributed mode, the internal topics connect-configs, connect-offsets, and connect-status are provisioned. All are subject to log compaction, and the recommendation is to set a sufficient replication factor (commonly 3) to ensure fault tolerance.

Source offsets store the external system's position (source partition + position) as a map, and Sinks store Kafka Consumer offsets. This allows processing to resume from where it left off after a worker failure or restart.

Topic names are specified in the worker configuration. They may be auto-created when not pre-created, but in production it is safer to manually create them per policy with explicit compaction and replication settings.

  • Explicitly set config.storage.topic / offsets.storage.topic / status.storage.topic
  • Set an appropriate replication.factor for each internal topic
  • Prepare stable broker-side configuration and monitoring to handle rebalances during worker changes and rolling restarts

Distributed worker configuration (excerpt)

# Kafka connection
bootstrap.servers=broker1:9092,broker2:9092

group.id=connect-cluster

# Internal topics
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3

# Converters (example: JSON without schemas)
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false

# REST port
rest.port=8083

# (Optionally) security and Config Provider
# ssl.endpoint.identification.algorithm=https
# sasl.mechanism=PLAIN
# config.providers=env,file
# config.providers.env.class=org.apache.kafka.common.config.provider.EnvVarConfigProvider

Operations: Scaling, Monitoring, and Error Handling

Scale by adjusting tasks.max and the number of workers. Rebalancing automatically reassigns task allocations. Increasing Sink tasks beyond the input topic's partition count has little effect, so co-design with the partitioning strategy.

Combine REST and JMX for monitoring. You can check operational status via /connectors, /connectors/{name}/status, /connectors/{name}/tasks, etc. Failure behavior is controllable through errors.* settings, including offloading to a DLQ (dead-letter topic).

For rolling updates, restart workers one at a time and verify status after each restart. If prolonged rebalances or task start-up failures persist, prioritize checking the availability of internal topics and the health of external dependencies (DB connections, authentication).

  • Sink: set tasks.max with the input partition count as the upper-bound guideline
  • Source: align tasks.max with the external system's sharding
  • Use a two-stage approach of retry + DLQ for error handling. Isolate permanent failures in the DLQ for post-processing
  • Incorporate REST-based Pause/Resume into operational runbooks

DLQ and monitoring REST (example)

# Monitoring
curl -s http://localhost:8083/connectors | jq .
curl -s http://localhost:8083/connectors/my-conn/status | jq .
curl -s http://localhost:8083/connectors/my-conn/tasks | jq .

# Pause / Resume
curl -s -X PUT http://localhost:8083/connectors/my-conn/pause
curl -s -X PUT http://localhost:8083/connectors/my-conn/resume

# Error handling (part of connector configuration)
"errors.tolerance": "all",                      # all or none
"errors.retry.timeout": "60000",                # ms
"errors.retry.delay.max.ms": "5000",
"errors.deadletterqueue.topic.name": "dlq.my-conn",
"errors.deadletterqueue.context.headers.enable": "true"

CCDAK Key Points and Design Patterns

Exam questions frequently cover mode selection, the role of internal topics, the relationship between tasks and partitions, the separation of responsibilities between converters and SMTs, and error-handling configuration. Design questions tend to ask whether you pick idempotent downstream updates (upserts / idempotent APIs) under at-least-once delivery.

A common pattern is CDC Source → Kafka → multiple Sinks (DWH, search, lake). Account for schema evolution and decide on schema management and a compatibility strategy (such as backward compatibility). For performance, fine-tune with task parallelism, appropriate partition counts, batch sizes, and flush intervals.

  • Tasks ≤ input partition count (the Sink basic). Match Source to the external system's sharding
  • Internal topics need compaction plus sufficient replication
  • Kafka Connect is at-least-once as a rule. Guarantee consistency via external-side idempotency and reprocessing design
  • SMTs are for lightweight transformations. Offload heavy processing to another layer (such as Kafka Streams)
  • Centralize security and secrets management in the worker configuration

Scaling design notes (approximation)

# Estimate tasks.max from target throughput and partition count
# P = input partition count, Rt = target total rows/sec, Rc = rows/sec per task
# tasks.max ≈ min(P, ceil(Rt / Rc))
# Example: P=24, Rt=48k rps, Rc=3k rps → tasks.max=min(24, ceil(48000/3000)=16)=16

Check Your Understanding

CCDAK

問題 1

You want to choose a Kafka Connect configuration that meets production fault tolerance and scale requirements. The requirements are that tasks be automatically reassigned upon worker failure and that processing resume from where it left off after restart. Which is the most appropriate choice?

  1. Run in Distributed mode with the internal topics (config/offsets/status) at replication factor 3
  2. Use Standalone mode, write offsets to a local file, and back them up periodically
  3. Use Distributed mode but leave the internal topics at the default replication factor of 1
  4. Increasing the task count in any mode causes automatic reassignment on failure

正解: A

The requirements are fault tolerance, automatic reassignment, and resumption from the previous position. These are satisfied by Distributed mode plus redundancy of the internal topics. Standalone runs as a single process and does not reassign, and an internal-topic replica count of 1 is a single point of failure. Simply increasing the task count does not trigger automatic reassignment.

Frequently Asked Questions

What is the difference between Kafka Streams and Kafka Connect?

Kafka Connect is a framework specialized in I/O with external systems (the I/O part of ETL), centered on configuration-based, pluggable connectors. Kafka Streams is a library for writing stream processing (aggregations, joins, state management, etc.) in application code. They serve different purposes and complement each other.

Can Kafka Connect achieve exactly-once semantics?

Kafka Connect generally operates with at-least-once semantics. When duplicates are not acceptable, design for idempotency downstream (upserts by primary key, deduplication keys, transactional APIs, and so on).

Is Schema Registry required?

It is not required. You can operate with the JSON/String/ByteArray converters bundled with Kafka Connect. However, when schema evolution and type safety matter, combining Avro/Protobuf converters with schema management (such as Schema Registry) is the practical choice.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
Kafka

Kafka Topics & Partitions: Distribution Fundamentals (2026)

How Kafka topics and partitions enable scale — ordering guar...

Kafka

CCDAK Exam Guide: Confluent Certified Developer (2026)

Complete prep for the CCDAK exam — Producer/Consumer API, St...

Kafka

CCAAK Exam Guide: Confluent Certified Administrator (2026)

Pass the CCAAK exam — cluster management, partitions, securi...

Kafka

Kafka Replicas & ISR: Fault Tolerance Explained (2026)

Replica placement, in-sync replicas (ISR), leader election. ...

Kafka

Kafka Offsets: Commit Modes & Consumer Position (2026)

Offset semantics — auto vs. manual commit, __consumer_offset...

Browse all Kafka articles (101)
© 2026 NicheeLab All rights reserved.