Kafka

Kafka Streams Primer: A Practical Guide to In-App Stream Processing

2026-04-19
NicheeLab Editorial Team

Kafka Streams is a library that lets you run stream processing inside your application process, without adding a dedicated processing cluster alongside Kafka. Built on the same reliability foundation as Kafka's Consumer/Producer APIs, it provides a DSL and state management that make aggregations, joins, and windowed operations easy to express.

This article walks through the core concepts such as KStream/KTable, topology design, state management and fault recovery, time semantics, and operational best practices. We also call out the topics most commonly tested on the CCDAK (Confluent Certified Developer for Apache Kafka) exam.

What Kafka Streams Is: Its Place and Use Cases

Kafka Streams is the official Apache Kafka library — a lightweight runtime for stream processing inside your application. It requires no external distributed processing cluster; instead, each instance scales out as tasks. Kafka topics serve as both inputs and outputs, and offset management, retries, and failover are all handled in tight integration with Kafka itself.

For the CCDAK exam, you need to clearly understand the differences between Kafka Streams, hand-rolled Consumer/Producer code, and ksqlDB — and be able to explain which option fits which use case.

  • Library-based execution: add it as a dependency and embed it in your app
  • Scaling: tasks are distributed according to partition count
  • State management: local state plus a changelog provides fault tolerance
  • Processing guarantees: at-least-once or effectively exactly-once is configurable (Exactly-Once relies on transactions and the idempotent producer)
  • DSL and Processor API: build the graph declaratively, or take low-level control
AspectKafka StreamsCustom Consumer/ProducerksqlDB
ModelIn-app library with DSL and stateBuild on top of low-level APIs yourselfServer-side SQL-like queries
Execution modelEach app process runs tasksUp to the app (depends on your design)Runs on dedicated servers
State managementLocal store plus changelogBuild it yourself, often with an external DBManaged by the server
Failure recoveryTasks move to another instance and state is restoredEntirely your responsibilityServer resumes processing
Learning & operational costModerate (flexible in code)Heavy implementation burdenEasy to learn, but operations means server management

Kafka Streams execution model (in-app tasks and state flow)

assign by group.idassign by group.idKafka Clusterinput-topic [P0][P1][P2] / output-topicStreams App Instance ATask T0 (P0) / Task T1 (P1) / State(RDB)ChangelogRocksDB writesStreams App Instance BTask T2 (P2) / State(RDB)Kafka Streams execution model

A minimal Kafka Streams app (Java)

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.state.Stores;
import java.time.Duration;
import java.util.Properties;

public class WordCountApp {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    // Exactly-once 対応の設定(サポートされる方式を指定)
    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

    StreamsBuilder builder = new StreamsBuilder();

    KStream<String, String> textLines = builder.stream("text-input");

    KTable<Windowed<String>, Long> counts = textLines
        .flatMapValues(value -> java.util.Arrays.asList(value.toLowerCase().split("\\W+")))
        .selectKey((k, word) -> word)
        .groupByKey()
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
        .count(Materialized.as("word-counts-store"));

    counts.toStream()
        .map((windowedKey, count) -> new KeyValue<>(windowedKey.key(), count.toString()))
        .to("wordcount-output", Produced.with(Serdes.String(), Serdes.String()));

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();

    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}

Core Concepts: KStream / KTable / GlobalKTable and Serdes

A KStream represents an unbounded sequence of records. Each record is independent, and you typically filter, map, or group before aggregating. A KTable is a table view of the latest value per key, reconstructable from a changelog. A GlobalKTable replicates every partition to every instance, making it a good fit for read-heavy reference joins.

Serializers/deserializers (Serdes) ensure type safety across topic I/O and state stores. A classic CCDAK question targets missing key/value Serde configuration or runtime errors caused by misconfigured Serdes during joins.

  • KStream: sequential event processing, grouping before aggregation, the stream side of a join
  • KTable: final-result view, upsert-style updates, counterpart in table joins
  • GlobalKTable: full replication of reference data, foreign-key joins with a stream
  • Serde: keeps types consistent across topics, state stores, and intermediate state

Minimal KStream-KTable join example (Java)

StreamsBuilder b = new StreamsBuilder();
KStream<String, String> orders = b.stream("orders");
KTable<String, String> users = b.table("users");
KStream<String, String> enriched = orders.join(
  users,
  (order, user) -> user + ":" + order
);
enriched.to("orders-enriched");

Topology Design and the Repartitioning Gotchas

A topology is a directed graph of sources, processors, and sinks. When you chain operations on the Kafka Streams DSL builder, the topology is built up internally. Grouping and joining require keys to line up, and repartition topics are auto-generated when necessary.

Common CCDAK questions cover whether repartitioning is triggered by keyBy/selectKey, whether join keys are aligned, and how null keys are handled within a topic.

  • Aggregations after groupByKey/selectKey are key-based, so partition boundaries matter
  • Joins with mismatched keys can trigger automatic repartitioning (watch for extra I/O and latency)
  • null keys are dropped by many operations — decide how to handle them before joining or aggregating
  • Intermediate topic names can be controlled explicitly via Named or Materialized#withName

Design example that includes repartitioning (Java)

KStream<String, Event> s = builder.stream("raw-events");
// userId をキーに再割当(selectKey によりリパーティションが必要になりうる)
KStream<String, Event> byUser = s.selectKey((k, v) -> v.userId());
KTable<String, Long> counts = byUser.groupByKey()
  .count(Materialized.as("by-user-counts"));

State Management and Fault Tolerance

Kafka Streams state stores are held locally in RocksDB by default, and updates are written to the corresponding changelog topic. On failure, tasks move to another instance within the same application group, restoring state from the changelog. Standby tasks can optionally pre-warm replicas as needed.

Exactly-Once processing is achieved by combining the idempotent producer with transactions, so read, process, and write are committed as a single transaction. Atomically committing state updates and output records avoids both duplicates and lost data.

  • Naming with Materialized.as determines both the state store and the changelog topic name
  • The number of standby replicas can be tuned per topology (trading recovery time against cost)
  • Recovery is a replay from Kafka — reducing external DB dependencies makes consistency easier to preserve
  • When EOS is enabled, mind producer transaction timeouts and buffer sizing

Declaring and reading a local state store (using the Processor API)

StreamsBuilder builder = new StreamsBuilder();
var storeBuilder = Stores.keyValueStoreBuilder(
  Stores.persistentKeyValueStore("kv-store"),
  Serdes.String(), Serdes.Long()
);
builder.addStateStore(storeBuilder);
KStream<String, Long> s = builder.stream("metrics");
s.process(() -> new org.apache.kafka.streams.processor.api.Processor<>() {
  private org.apache.kafka.streams.state.KeyValueStore<String, Long> store;
  @Override public void init(org.apache.kafka.streams.processor.api.ProcessorContext<String, Long> ctx){
    store = ctx.getStateStore("kv-store");
  }
  @Override public void process(org.apache.kafka.streams.processor.api.Record<String, Long> r){
    Long cur = store.get(r.key());
    long next = (cur == null ? 0L : cur) + r.value();
    store.put(r.key(), next);
  }
}, "kv-store");

Time Semantics and Windowed Processing

Kafka Streams primarily uses event time (record time). By default it takes the record's timestamp, and you can override that with a custom TimestampExtractor. Supported window types include tumbling, hopping, and session windows.

Late events are governed by the grace period set on the window. To finalize aggregation results and control when they flow downstream, the Suppress operator is an effective way to temporarily hold back intermediate updates.

  • Supports TimeWindows (fixed length) and SessionWindows (bucketed by idle gaps)
  • TimestampExtractor extracts event time — defaults to the record's timestamp if unset
  • grace tunes the window's lateness tolerance (larger values consume more memory and state)
  • Suppress can hold back intermediate updates until the window closes

Windowed aggregation with Suppress (Java)

KStream<String, Long> metrics = builder.stream("latency");
KTable<Windowed<String>, Long> agg = metrics
  .groupByKey()
  .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(30)))
  .reduce(Long::sum, Materialized.as("latency-sum"));
KStream<String, Long> finalized = agg
  .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
  .toStream()
  .map((wKey, v) -> new KeyValue<>(wKey.key(), v));
finalized.to("latency-30s");

Operational Tips and CCDAK Exam Highlights

Scaling depends on topic partition count: the maximum number of tasks is bounded by the input topic's partition count. Keep an eye on version compatibility, Serde evolution (schema compatibility), and minimizing pause time during rebalance (e.g., static membership).

Frequently tested CCDAK topics include processing guarantees (at-least-once vs. exactly-once), the relationship between state stores and changelogs, conditions that trigger joins and repartitioning, the effects of timestamps and grace, and scaling strategies (partition design).

  • The application ID drives state and internal topic names — change it with care
  • Tune internal topic retention and replication factor to match your operational SLOs
  • Standby tasks cut recovery time but add broker load and storage overhead
  • Monitor processing lag — track both end-to-end lag and the gap from stream-time

Key operational properties (Java configuration excerpt)

Properties p = new Properties();
p.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 3); // 同一プロセス内の並列度
p.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3); // 内部トピックのRF
p.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 50 * 1024 * 1024L); // キャッシュ
p.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2); // 処理保証
p.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/streams"); // 状態ディレクトリ
// 必要に応じて consumer/producer のタイムアウトやリトライも調整

Check Your Understanding

CCDAK

問題 1

You are designing a Kafka Streams app that joins a KStream and a KTable by key and then performs windowed aggregation. Which statement is correct about repartitioning and the handling of late events?

  1. If you did not selectKey on the KStream side to match the KTable's key, repartitioning may be triggered automatically. Late events can still be included in the aggregation as long as they arrive within the window's configured grace period.
  2. KStream-KTable joins never trigger repartitioning. Late events are always discarded regardless of any Suppress configuration.
  3. The KTable's key is ignored, and the join always uses the KStream's original key. Late events are automatically promoted to the most recent window.
  4. Repartitioning never happens unless you explicitly call repartition(). Late event handling has nothing to do with timestamp extraction.

正解: A

KStream-KTable joins assume matching keys; if the KStream side's key does not match, internal repartitioning may be triggered. Late events can still be incorporated into the aggregation if they fall within the window's grace period. Suppress controls when results are emitted — whether late events are accepted or rejected is determined by grace and window boundaries themselves.

Frequently Asked Questions

When should I choose Kafka Streams vs. ksqlDB?

Kafka Streams fits when you need fine-grained control in code or integration with app-specific libraries. ksqlDB suits cases where you want to develop quickly in SQL and push operations to the server side. Choose based on your team's skill set, operational model, and latency/throughput requirements.

Does enabling Exactly-Once hurt performance?

Yes, transactional overhead can increase latency and reduce throughput compared to at-least-once. Mitigate by limiting EOS to flows that need it, tuning batch sizes and transaction timeouts, and optimizing intermediate topics.

Does the state store have to be RocksDB?

RocksDB is the default, but in-memory stores are also available. That said, the persistent store (RocksDB) combined with a changelog is the common choice for fault tolerance and memory efficiency.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
Kafka

Kafka Topics & Partitions: Distribution Fundamentals (2026)

How Kafka topics and partitions enable scale — ordering guar...

Kafka

CCDAK Exam Guide: Confluent Certified Developer (2026)

Complete prep for the CCDAK exam — Producer/Consumer API, St...

Kafka

CCAAK Exam Guide: Confluent Certified Administrator (2026)

Pass the CCAAK exam — cluster management, partitions, securi...

Kafka

Kafka Replicas & ISR: Fault Tolerance Explained (2026)

Replica placement, in-sync replicas (ISR), leader election. ...

Kafka

Kafka Offsets: Commit Modes & Consumer Position (2026)

Offset semantics — auto vs. manual commit, __consumer_offset...

Browse all Kafka articles (101)
© 2026 NicheeLab All rights reserved.