Kafka Streams is a library that lets you run stream processing inside your application process, without adding a dedicated processing cluster alongside Kafka. Built on the same reliability foundation as Kafka's Consumer/Producer APIs, it provides a DSL and state management that make aggregations, joins, and windowed operations easy to express.
This article walks through the core concepts such as KStream/KTable, topology design, state management and fault recovery, time semantics, and operational best practices. We also call out the topics most commonly tested on the CCDAK (Confluent Certified Developer for Apache Kafka) exam.
Kafka Streams is the official Apache Kafka library — a lightweight runtime for stream processing inside your application. It requires no external distributed processing cluster; instead, each instance scales out as tasks. Kafka topics serve as both inputs and outputs, and offset management, retries, and failover are all handled in tight integration with Kafka itself.
For the CCDAK exam, you need to clearly understand the differences between Kafka Streams, hand-rolled Consumer/Producer code, and ksqlDB — and be able to explain which option fits which use case.
| Aspect | Kafka Streams | Custom Consumer/Producer | ksqlDB |
|---|---|---|---|
| Model | In-app library with DSL and state | Build on top of low-level APIs yourself | Server-side SQL-like queries |
| Execution model | Each app process runs tasks | Up to the app (depends on your design) | Runs on dedicated servers |
| State management | Local store plus changelog | Build it yourself, often with an external DB | Managed by the server |
| Failure recovery | Tasks move to another instance and state is restored | Entirely your responsibility | Server resumes processing |
| Learning & operational cost | Moderate (flexible in code) | Heavy implementation burden | Easy to learn, but operations means server management |
Kafka Streams execution model (in-app tasks and state flow)
A minimal Kafka Streams app (Java)
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.state.Stores;
import java.time.Duration;
import java.util.Properties;
public class WordCountApp {
public static void main(String[] args) {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
// Exactly-once 対応の設定(サポートされる方式を指定)
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("text-input");
KTable<Windowed<String>, Long> counts = textLines
.flatMapValues(value -> java.util.Arrays.asList(value.toLowerCase().split("\\W+")))
.selectKey((k, word) -> word)
.groupByKey()
.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
.count(Materialized.as("word-counts-store"));
counts.toStream()
.map((windowedKey, count) -> new KeyValue<>(windowedKey.key(), count.toString()))
.to("wordcount-output", Produced.with(Serdes.String(), Serdes.String()));
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
}
A KStream represents an unbounded sequence of records. Each record is independent, and you typically filter, map, or group before aggregating. A KTable is a table view of the latest value per key, reconstructable from a changelog. A GlobalKTable replicates every partition to every instance, making it a good fit for read-heavy reference joins.
Serializers/deserializers (Serdes) ensure type safety across topic I/O and state stores. A classic CCDAK question targets missing key/value Serde configuration or runtime errors caused by misconfigured Serdes during joins.
Minimal KStream-KTable join example (Java)
StreamsBuilder b = new StreamsBuilder();
KStream<String, String> orders = b.stream("orders");
KTable<String, String> users = b.table("users");
KStream<String, String> enriched = orders.join(
users,
(order, user) -> user + ":" + order
);
enriched.to("orders-enriched");A topology is a directed graph of sources, processors, and sinks. When you chain operations on the Kafka Streams DSL builder, the topology is built up internally. Grouping and joining require keys to line up, and repartition topics are auto-generated when necessary.
Common CCDAK questions cover whether repartitioning is triggered by keyBy/selectKey, whether join keys are aligned, and how null keys are handled within a topic.
Design example that includes repartitioning (Java)
KStream<String, Event> s = builder.stream("raw-events");
// userId をキーに再割当(selectKey によりリパーティションが必要になりうる)
KStream<String, Event> byUser = s.selectKey((k, v) -> v.userId());
KTable<String, Long> counts = byUser.groupByKey()
.count(Materialized.as("by-user-counts"));Kafka Streams state stores are held locally in RocksDB by default, and updates are written to the corresponding changelog topic. On failure, tasks move to another instance within the same application group, restoring state from the changelog. Standby tasks can optionally pre-warm replicas as needed.
Exactly-Once processing is achieved by combining the idempotent producer with transactions, so read, process, and write are committed as a single transaction. Atomically committing state updates and output records avoids both duplicates and lost data.
Declaring and reading a local state store (using the Processor API)
StreamsBuilder builder = new StreamsBuilder();
var storeBuilder = Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("kv-store"),
Serdes.String(), Serdes.Long()
);
builder.addStateStore(storeBuilder);
KStream<String, Long> s = builder.stream("metrics");
s.process(() -> new org.apache.kafka.streams.processor.api.Processor<>() {
private org.apache.kafka.streams.state.KeyValueStore<String, Long> store;
@Override public void init(org.apache.kafka.streams.processor.api.ProcessorContext<String, Long> ctx){
store = ctx.getStateStore("kv-store");
}
@Override public void process(org.apache.kafka.streams.processor.api.Record<String, Long> r){
Long cur = store.get(r.key());
long next = (cur == null ? 0L : cur) + r.value();
store.put(r.key(), next);
}
}, "kv-store");Kafka Streams primarily uses event time (record time). By default it takes the record's timestamp, and you can override that with a custom TimestampExtractor. Supported window types include tumbling, hopping, and session windows.
Late events are governed by the grace period set on the window. To finalize aggregation results and control when they flow downstream, the Suppress operator is an effective way to temporarily hold back intermediate updates.
Windowed aggregation with Suppress (Java)
KStream<String, Long> metrics = builder.stream("latency");
KTable<Windowed<String>, Long> agg = metrics
.groupByKey()
.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(30)))
.reduce(Long::sum, Materialized.as("latency-sum"));
KStream<String, Long> finalized = agg
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
.toStream()
.map((wKey, v) -> new KeyValue<>(wKey.key(), v));
finalized.to("latency-30s");Scaling depends on topic partition count: the maximum number of tasks is bounded by the input topic's partition count. Keep an eye on version compatibility, Serde evolution (schema compatibility), and minimizing pause time during rebalance (e.g., static membership).
Frequently tested CCDAK topics include processing guarantees (at-least-once vs. exactly-once), the relationship between state stores and changelogs, conditions that trigger joins and repartitioning, the effects of timestamps and grace, and scaling strategies (partition design).
Key operational properties (Java configuration excerpt)
Properties p = new Properties();
p.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 3); // 同一プロセス内の並列度
p.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3); // 内部トピックのRF
p.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 50 * 1024 * 1024L); // キャッシュ
p.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2); // 処理保証
p.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/streams"); // 状態ディレクトリ
// 必要に応じて consumer/producer のタイムアウトやリトライも調整CCDAK
問題 1
You are designing a Kafka Streams app that joins a KStream and a KTable by key and then performs windowed aggregation. Which statement is correct about repartitioning and the handling of late events?
正解: A
KStream-KTable joins assume matching keys; if the KStream side's key does not match, internal repartitioning may be triggered. Late events can still be incorporated into the aggregation if they fall within the window's grace period. Suppress controls when results are emitted — whether late events are accepted or rejected is determined by grace and window boundaries themselves.
When should I choose Kafka Streams vs. ksqlDB?
Kafka Streams fits when you need fine-grained control in code or integration with app-specific libraries. ksqlDB suits cases where you want to develop quickly in SQL and push operations to the server side. Choose based on your team's skill set, operational model, and latency/throughput requirements.
Does enabling Exactly-Once hurt performance?
Yes, transactional overhead can increase latency and reduce throughput compared to at-least-once. Mitigate by limiting EOS to flows that need it, tuning batch sizes and transaction timeouts, and optimizing intermediate topics.
Does the state store have to be RocksDB?
RocksDB is the default, but in-memory stores are also available. That said, the persistent store (RocksDB) combined with a changelog is the common choice for fault tolerance and memory efficiency.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
Kafka Topics & Partitions: Distribution Fundamentals (2026)
How Kafka topics and partitions enable scale — ordering guar...
CCDAK Exam Guide: Confluent Certified Developer (2026)
Complete prep for the CCDAK exam — Producer/Consumer API, St...
CCAAK Exam Guide: Confluent Certified Administrator (2026)
Pass the CCAAK exam — cluster management, partitions, securi...
Kafka Replicas & ISR: Fault Tolerance Explained (2026)
Replica placement, in-sync replicas (ISR), leader election. ...
Kafka Offsets: Commit Modes & Consumer Position (2026)
Offset semantics — auto vs. manual commit, __consumer_offset...