Kafka Source Connectors Practical Guide: Ingestion Design from External Systems and CCDAK Prep

2026-04-19
NicheeLab Editorial Team

Kafka Connect Source Connectors are the official mechanism for continuously ingesting data from external systems (databases, files, SaaS) into Kafka topics.

This article covers architecture and scaling, schema and ordering, reliability and operations, deployment options, and security and performance tuning, and calls out the points that matter for the CCDAK exam.

Source Connector Basics and When to Use Them

Kafka Connect standardizes data integration between external systems and Kafka through pluggable connectors. Source Connectors handle one-way sync from outside into Kafka and consist of a job definition (Connector) and parallel execution units (Tasks). In distributed mode, configuration, lifecycle control, and status monitoring all go through the REST API; standalone mode reads connector settings from properties files.

In production, you consider Source Connectors before custom Producer apps because of operational stability, reusability, and maintainability. They shine for RDB ingestion (JDBC/CDC), object storage, log/file ingestion, and SaaS event subscription.

  • Delivery guarantee: generally at-least-once. Design for duplicate tolerance. Kafka 3.3 and later add exactly-once support for source connectors (exactly.once.source.support on the Workers) when the connector implements it; otherwise combine deduplication or idempotent design as your business requirements demand.
  • Internal state: source offsets are stored in the Kafka topic named by offset.storage.topic (commonly connect-offsets). Connector configurations (config.storage.topic, commonly connect-configs) and status (status.storage.topic, commonly connect-status) are also stored in internal topics.
  • Prefer SMT (Single Message Transform) for lightweight transformations. For heavy aggregation or joins, consider Kafka Streams or ksqlDB.

Architecture and Scaling: Workers, Connectors, Tasks

Kafka Connect Workers (processes) form a cluster that shares Connector definitions and spreads Connectors and Tasks across Workers. Scale out mainly by adding Workers and raising tasks.max, and provision enough Kafka partitions for write throughput and downstream consumers. Rebalances occur when Workers join or leave, or when connector configurations change.

Reliability rests on replicated internal topics, producer acks=all, and a sufficient min.insync.replicas on the target topics. Producer settings come from the Worker configuration and can be overridden per connector with producer.override.* when the Worker's connector.client.config.override.policy allows it, so make them explicit per requirement.

  • Worker modes: standalone (single machine) or distributed (cluster). Distributed is the standard in production.
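  • Task parallelism: capped by tasks.max. Effective parallelism also depends on how the external source can be split (for example, a table-based source cannot usefully run more tasks than it has tables), while the topic's partition count bounds write distribution and downstream consumer parallelism.
  • Internal topics: a sufficient replication factor (e.g., 3) is recommended for the config, offset, and status topics (config.storage.replication.factor, offset.storage.replication.factor, status.storage.replication.factor).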
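
As a rough illustration of why tasks.max is only an upper bound, the sketch below spreads tables across tasks the way a table-based source might. The function name and the round-robin strategy are assumptions made for illustration, not the assignment code of any real connector.

```python
def assign_tables(tables, tasks_max):
    """Split tables across at most tasks_max tasks (hypothetical round-robin)."""
    if tasks_max < 1:
        raise ValueError("tasks_max must be >= 1")
    # A task needs at least one table, so the effective task count is capped
    # by the number of tables as well as by tasks.max.
    task_count = min(tasks_max, len(tables))
    groups = [[] for _ in range(task_count)]
    for i, table in enumerate(sorted(tables)):
        groups[i % task_count].append(table)
    return groups


if __name__ == "__main__":
    # Three tables with tasks.max=8 still yield only three task groups.
    print(assign_tables(["users", "orders", "payments"], tasks_max=8))
    # A single table yields a single task regardless of tasks.max.
    print(assign_tables(["users"], tasks_max=4))
```

Under this assumption, raising tasks.max beyond the number of splittable units adds no parallelism; the source itself has to be split further.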

Figure: Logical structure of a Source Connector. Source Tasks (N parallel) in the Connect Worker cluster poll the external system (DB, files, SaaS), send records to Kafka topic partitions (M), and store offsets in the internal topics.

Ingestion Design: Schema, Key, Partitioning, and Ordering

Key design is the single most important decision. If you need ordering at the entity level, use the same key for the same entity to funnel it into the same partition. Kafka only guarantees ordering within a partition. Plan key stability and cardinality with post-ingestion aggregation and downstream shuffling in mind.

Designing around Schema Registry makes change management easier. Default to Backward compatibility, prefer adding columns, and avoid deleting fields or changing types. For JDBC/CDC, also account for DDL diffs and nullability.

  • Duplicate assumption: at-least-once means the same record can appear more than once. Aim for an idempotent reconciliation design using key + value version, or a source-side incremental column.
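  • Partition count: balance consumption throughput against ordering requirements. Ordering holds only within a partition, so per-entity ordering is preserved as long as each entity's key always maps to the same partition; adding partitions later changes the key-to-partition mapping.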
  • Use SMTs for key extraction, header injection, routing, field masking, and other lightweight transforms.
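
To make the key-to-partition relationship concrete, here is a minimal sketch. It uses zlib.crc32 as a stand-in hash; Kafka's default partitioner uses murmur2, so the partition numbers printed here will not match a real cluster. The function name is hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition with a stable hash (stand-in for murmur2)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    return zlib.crc32(key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    events = [("user-1", "created"), ("user-2", "created"), ("user-1", "updated")]
    for key, change in events:
        # Every event for the same key lands in the same partition,
        # which is what preserves per-entity ordering.
        print(key, change, "-> partition", partition_for(key, 6))

    # Changing the partition count can move a key to a different partition,
    # so ordering across the resize is not guaranteed.
    keys = ["user-%d" % i for i in range(100)]
    moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 8)]
    print(len(moved), "of 100 keys change partition when going from 6 to 8")
```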

JDBC Source Connector example (incremental sync)

{
  "name": "jdbc-users-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db:5432/app",
    "connection.user": "appuser",
    "connection.password": "${file:/opt/connect/secrets.properties:db.password}",
    "mode": "timestamp+incrementing",
    "incrementing.column.name": "id",
    "timestamp.column.name": "updated_at",
    "table.whitelist": "public.users",
    "topic.prefix": "src.postgres.",
    "poll.interval.ms": "10000",
    "batch.max.rows": "5000",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "producer.override.acks": "all",
    "producer.override.enable.idempotence": "true",
    "errors.tolerance": "all",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true"
  }
}
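
The config above does not set a record key explicitly. One common way to key records by a column is to chain the built-in ValueToKey and ExtractField$Key transforms. The sketch below builds those properties as a Python dict that could be merged into the connector's "config" object; the aliases createKey and extractKey and the helper name are arbitrary choices, and id is assumed to be the primary-key column from the example.

```python
import json


def key_by_column_smt(column: str) -> dict:
    """Return SMT properties that copy `column` from the value into the record key."""
    return {
        "transforms": "createKey,extractKey",
        # ValueToKey builds a struct key from the listed value fields.
        "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
        "transforms.createKey.fields": column,
        # ExtractField$Key unwraps that struct so the key is the bare column value.
        "transforms.extractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
        "transforms.extractKey.field": column,
    }


if __name__ == "__main__":
    print(json.dumps(key_by_column_smt("id"), indent=2))
```

With the key set to the entity's id, every change to the same row goes to the same partition, which is the precondition for the per-entity ordering discussed earlier.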

Reliability, Delivery Semantics, and Error Handling

Kafka Connect Sources generally deliver at-least-once. Source offsets are committed periodically after records have been sent, so a crash between a send and the next offset commit replays those records on recovery. Preserve consistency with idempotent key design, downstream deduplication, or a change-detection column on the source side.

Design error handling in stages. Retry transient errors with backoff. For record-conversion and transform errors, errors.tolerance=all skips the failing record, so enable error logging and alert on it. Note that the dead letter queue settings (errors.deadletterqueue.*) apply only to sink connectors; a source connector cannot route failed records to a DLQ through them.

  • Common settings: errors.tolerance, errors.log.enable, errors.log.include.messages, errors.retry.timeout, errors.retry.delay.max.ms
  • Hardening the Producer: acks=all, enable.idempotence=true, revisit delivery.timeout.ms, and tune compression.type to optimize throughput and cost.
  • Explicitly set min.insync.replicas on internal topics to ensure resilience during broker failures.
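
A minimal sketch of downstream deduplication under at-least-once delivery, assuming each record carries an entity key and a monotonically increasing version (for example, an updated_at and id pair from a JDBC source). The class and field names are hypothetical.

```python
class IdempotentApplier:
    """Apply each (key, version) at most once; replays and stale records are skipped."""

    def __init__(self):
        self.state = {}          # key -> latest applied payload
        self.last_version = {}   # key -> highest version applied so far

    def apply(self, key, version, payload) -> bool:
        """Return True if the record was applied, False if it was a duplicate or stale."""
        seen = self.last_version.get(key)
        if seen is not None and version <= seen:
            return False
        self.state[key] = payload
        self.last_version[key] = version
        return True


if __name__ == "__main__":
    applier = IdempotentApplier()
    records = [
        ("user-1", (100, 1), {"name": "Ann"}),
        ("user-1", (105, 1), {"name": "Anne"}),
        ("user-1", (105, 1), {"name": "Anne"}),  # redelivered after a task restart
    ]
    for key, version, payload in records:
        applied = applier.apply(key, version, payload)
        print(key, version, "applied" if applied else "skipped")
```

In a real consumer the version map would need to live in durable storage (or be expressed as an upsert condition in the target database) so that it survives restarts.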

Choosing a Deployment Mode and Operating It: Standalone vs Distributed

Use standalone for development and testing, distributed in production. In distributed mode, Connector configurations are shared via internal topics and Tasks are reassigned automatically across Workers. Updates and scale-out happen online through the REST API.

Cases where you should build a custom Producer are limited. Consider it when the connector cannot cover your API spec or when you need strict custom transaction control. Try the existing connector plus SMT combination first.

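  • Rolling updates: restart Workers one at a time. With incremental cooperative rebalancing, scheduled.rebalance.max.delay.ms (default 5 minutes) controls how long the cluster waits for a restarted Worker to return before reassigning its Tasks.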
  • Scaling: before raising the Worker count and tasks.max, revisit the topic partition count and the external source's splitting strategy.
  • Monitoring: connector/task state, send rate, error and skipped-record rates, rebalance frequency, and internal topic replication.

Mode | Scale / Availability | Rebalancing / Horizontal Scaling | Management Cost / Use Cases
Standalone | Small scale / single machine (no redundancy) | No rebalance; runs inside a single process | Lowest cost; for development, PoC, and one-off ingestion
Distributed | Medium to large scale / high availability (multiple Workers) | Automatic rebalance reassigns Tasks; supports rolling updates | Production standard; centralized management of multiple connectors
Custom Producer implementation | Requirement-dependent; availability is your responsibility | Scale managed at the application layer | Only when the connector does not cover the API or you need strict custom control
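
For the monitoring item in the list above, connector and task state are available from the Connect REST API at GET /connectors/{name}/status. The sketch below only parses such a response; the sample payload is illustrative, and the helper name is hypothetical.

```python
import json


def failed_tasks(status: dict) -> list:
    """Return the ids of tasks whose state is FAILED in a status response."""
    return [t["id"] for t in status.get("tasks", []) if t.get("state") == "FAILED"]


# Illustrative payload shaped like a /connectors/{name}/status response.
SAMPLE_STATUS = json.loads("""
{
  "name": "jdbc-users-source",
  "connector": {"state": "RUNNING", "worker_id": "connect-1:8083"},
  "tasks": [
    {"id": 0, "state": "RUNNING", "worker_id": "connect-1:8083"},
    {"id": 1, "state": "FAILED", "worker_id": "connect-2:8083", "trace": "..."}
  ],
  "type": "source"
}
""")

if __name__ == "__main__":
    for task_id in failed_tasks(SAMPLE_STATUS):
        # A failed task can be restarted with
        # POST /connectors/{name}/tasks/{id}/restart
        print("task", task_id, "is FAILED")
```

Note that the connector itself can report RUNNING while one of its tasks has failed, so alerting should look at task state as well.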

Security and Performance Tuning

Protect Kafka connections with SASL/SSL and apply least-privilege ACLs. Reference external source credentials indirectly through a Config Provider (file, environment variable, external secret store) and avoid plaintext. For auditing, classify data including the DLQ and internal topics.

Performance requires end-to-end optimization. Tune task parallelism, the external source's paging/scan strategy, Connector-specific poll/batch settings, Producer batching and latency, Kafka topic partitions and compression, and network MTU together.

  • Authorization: grant the minimum required ACLs (Write/Create/DescribeConfigs, etc.) on every topic the connector writes to.
  • Config Provider: reference secret values via ${file:...}, ${env:...}, etc. Manage reload policy and permissions.
  • Tuning examples: poll.interval.ms, batch.max.rows (connector-dependent), producer.override.linger.ms / batch.size, and the compression-vs-CPU trade-off.
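
As a small aid for the Config Provider point, the sketch below checks a connector config for sensitive-looking properties whose value is not a ${provider:...} placeholder. It only illustrates the placeholder syntax; it is not how Connect resolves values, it assumes the placeholder makes up the whole value, and the helper names and the list of "sensitive" name hints are assumptions.

```python
import re

# Matches ${provider:path:key} or ${provider:key}; the path part is optional.
PLACEHOLDER = re.compile(
    r"^\$\{(?P<provider>[^:}]+):(?:(?P<path>[^:}]*):)?(?P<key>[^}]+)\}$"
)

# Hypothetical hints for property names that usually hold secrets.
SENSITIVE_HINTS = ("password", "secret", "token")


def parse_placeholder(value: str):
    """Return (provider, path, key) for a placeholder value, or None otherwise."""
    m = PLACEHOLDER.match(value)
    if m is None:
        return None
    return m.group("provider"), m.group("path"), m.group("key")


def plaintext_secrets(config: dict) -> list:
    """List sensitive-looking properties whose value is not a placeholder."""
    return sorted(
        name
        for name, value in config.items()
        if any(hint in name.lower() for hint in SENSITIVE_HINTS)
        and parse_placeholder(str(value)) is None
    )


if __name__ == "__main__":
    config = {
        "connection.user": "appuser",
        "connection.password": "${file:/opt/connect/secrets.properties:db.password}",
    }
    print(parse_placeholder(config["connection.password"]))
    print("plaintext secrets:", plaintext_secrets(config))
```

A check like this could run in CI before a config is submitted to the REST API.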

Check Your Understanding

CCDAK

Question 1

You want to continuously ingest an internal PostgreSQL users table into Kafka. After the initial full snapshot, only updates should be ingested, consistency must be preserved, and downstream can absorb duplicates. Which connector configuration is most appropriate?

  1. Use JDBC Source Connector with mode=timestamp+incrementing, specify the incrementing/timestamp columns, and enable acks=all and idempotence
  2. Tail a CSV with File Source Connector and rotate the file periodically
  3. Use a Sink Connector to push from Kafka to the DB and feed changes back to Kafka via DB triggers
  4. Replicate directly from the DB to Kafka using MirrorMaker

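Correct answer: 1

JDBC Source is the right fit for RDB-to-Kafka ingestion; mode=timestamp+incrementing reads the existing rows on the first run and then picks up new and updated rows using the timestamp and incrementing columns. acks=all and idempotence protect against data loss and producer-retry duplicates; any duplicates from task restarts are absorbed downstream, as the scenario allows. The other options do not match the requirements: a CSV tail is not a database source, a Sink Connector moves data out of Kafka, and MirrorMaker replicates between Kafka clusters, not from a database.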

Frequently Asked Questions

Do Source Connectors provide exactly-once delivery guarantees?

Generally at-least-once. Duplicates can occur, so design around key strategies and downstream deduplication. Idempotent producers and transactions can minimize duplicates under specific conditions, but strict end-to-end exactly-once depends on the source characteristics and the connector implementation.

Throughput is plateauing. What should I check first?

Check in order: the balance between tasks.max and the Kafka partition count, the splittability of the external source (table sharding / parallel scans), poll.interval.ms and batch settings, Producer linger.ms / batch.size / compression, network and disk IO, and the broker health of the internal topics.

How are connector config updates applied? Do I need to stop the connector?

In distributed mode, send the full configuration with PUT /connectors/{name}/config via the REST API (POST /connectors creates a new connector); the Connect cluster writes the change to the config topic and restarts or reassigns Tasks as needed. Stopping the connector first is usually unnecessary, but an update restarts the connector's tasks and can trigger a brief rebalance.
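
A minimal sketch of such an update using only the Python standard library. It builds the request without sending it; the base URL http://localhost:8083 is an assumed Worker address, and the helper name is hypothetical. The request body is the flat config map, and it replaces the connector's whole configuration, so send every property rather than only the changed ones.

```python
import json
import urllib.request


def build_config_update(base_url: str, name: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT /connectors/{name}/config request."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/connectors/{name}/config",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_config_update(
        "http://localhost:8083",  # assumed Connect REST address
        "jdbc-users-source",
        {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "tasks.max": "2",
        },
    )
    print(req.get_method(), req.full_url)
    # To actually apply the change: urllib.request.urlopen(req)
```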
