Kafka Streams State Stores: RocksDB & Changelog (2026)

State management is the core of stateful operations in Kafka Streams: aggregations, joins, and windowing. By default, state is kept in a RocksDB instance embedded in the application process, and after a failure it is rebuilt from changelog topics.

This article walks through the fundamentals of RocksDB, the restart recovery flow, standby replicas, operational tuning, and the points most often tested on the CCDAK exam, all from a practical standpoint.

State Store Basics and Where RocksDB Fits In

Kafka Streams stores intermediate computation results in a State Store. A materialized State Store is backed by a changelog topic named after the application and the store (<application.id>-<store name>-changelog), which can be replayed to recover after failure.
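As a small illustration of that naming convention, the sketch below derives the changelog topic name for the application and store used in this article's configuration example. The helper class is hypothetical (it is not part of the Kafka Streams API) and only reproduces the <application.id>-<store name>-changelog pattern.

```java
// Hypothetical helper, not part of the Kafka Streams API: it only mirrors the
// naming pattern Kafka Streams uses for a store's internal changelog topic.
final class ChangelogTopicName {

    static String of(String applicationId, String storeName) {
        return applicationId + "-" + storeName + "-changelog";
    }

    public static void main(String[] args) {
        // prints "payments-app-payments-counts-changelog"
        System.out.println(of("payments-app", "payments-counts"));
    }
}
```

Knowing the name is handy when inspecting the topic's replication factor or cleanup policy with the standard Kafka command-line tools.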

The default persistent store is RocksDB, embedded in the process. An on-heap memory store is also available, but its contents disappear when the process exits, so a restart requires replaying the entire changelog. Because RocksDB persists to local disk, recovery is fast when the checkpoint matches — only the delta needs to be applied.

By default, every materialized State Store is backed by a changelog topic with log compaction enabled (changelog logging can be turned off per store)
RocksDB is the default; memory stores are fast but expensive at restart
Enabling Exactly-once v2 guarantees consistency across store updates, changelog writes, and output topics

Store type	Durability	Recovery on restart	Performance / I-O
In-memory (on-heap)	None (lost when the process exits)	Full replay of changelog from the beginning	Low latency, but affected by GC
RocksDB (default)	Yes (local disk)	Can apply only the delta since the checkpoint	Disk I/O with caching delivers low latency
RocksDB + standby	Yes (warmed across multiple nodes)	On failover, the standby is promoted and the delta is minimal	Extra network and storage overhead

Restart Recovery Flow (RocksDB)

When the process restarts, each task inspects the RocksDB and checkpoint file (the changelog position already applied) under the local state.dir. If local state exists and the checkpoint matches the task assignment, only the changelog delta is read to restore consistency before processing resumes.

If local state is lost or the checkpoint is inconsistent, recovery falls back to a full replay of the store's changelog from the beginning. With Exactly-once v2 enabled, restoration reads only committed changelog records, so the restored store lines up with transaction boundaries and stays consistent with the output topics. In clusters with cooperative rebalancing (sticky placement of stateful tasks), tasks tend to stay on the same host, so the reuse rate of local state goes up.

Conditions for incremental recovery: local RocksDB exists, checkpoint matches task assignment, no corruption
Triggers for full recovery: missing or corrupted local state, task moved so local cannot be reused, checkpoint mismatch
The recovery source is always the changelog — input topic offsets are not used for recovery
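The conditions above can be sketched as a small decision function. This is a simplified model, not Kafka Streams' internal logic: the class and method names are hypothetical, and the bounds check on the checkpoint is an assumption made for this sketch.

```java
import java.util.OptionalLong;

// Simplified, hypothetical model of the restore decision. Kafka Streams' real
// logic lives in its internal state management and handles more cases.
final class RestorePlan {

    /**
     * Returns the changelog offset from which replay would start.
     *
     * @param localStoreUsable local RocksDB files exist and are not corrupted
     * @param checkpoint       offset recorded in the checkpoint file, if any
     * @param changelogStart   earliest offset still available in the changelog
     * @param changelogEnd     end offset of the changelog partition
     */
    static long restoreFrom(boolean localStoreUsable, OptionalLong checkpoint,
                            long changelogStart, long changelogEnd) {
        boolean checkpointValid = checkpoint.isPresent()
                && checkpoint.getAsLong() >= changelogStart
                && checkpoint.getAsLong() <= changelogEnd;
        if (localStoreUsable && checkpointValid) {
            return checkpoint.getAsLong(); // incremental: replay only the delta
        }
        return changelogStart;             // full replay from the beginning
    }

    public static void main(String[] args) {
        System.out.println(restoreFrom(true, OptionalLong.of(9_500), 0, 10_000));  // 9500 (delta only)
        System.out.println(restoreFrom(false, OptionalLong.of(9_500), 0, 10_000)); // 0 (local state lost)
        System.out.println(restoreFrom(true, OptionalLong.empty(), 0, 10_000));    // 0 (no checkpoint)
    }
}
```

In every branch the replay source is the changelog; the only thing that changes is the starting offset.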

Figure: Recovery path for a RocksDB store

Changelog Topics, Log Compaction, and Tombstones

Each State Store writes key-value updates to its changelog topic. Log compaction keeps the latest update per key and discards older versions. Deletions are written as tombstones (key only, value null), so deletes are correctly replayed during recovery.

For accurate recovery, it is critical to use the same serializer/deserializer (Serde) for the store and its changelog, and to keep the replication factor of internal topics high enough. RocksDB-side flush timing and caching do not affect recovery logic — recovery is always applied in changelog order.

Tombstones are processed during recovery as well, so key deletions are reproduced
A replication factor of 3 is recommended for internal topics in production (availability)
Changing a Serde can break recovery, so migrate with care
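To make the tombstone and compaction behavior concrete, here is a toy, in-memory model written against the JDK only. The class, the Entry type, and the sample keys are hypothetical; real compaction runs on the broker and also removes tombstones after a retention period, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of changelog replay and compaction (not Kafka code).
final class ChangelogReplay {

    // One changelog entry; a null value is a tombstone (delete marker).
    static final class Entry {
        final String key;
        final Long value;
        Entry(String key, Long value) { this.key = key; this.value = value; }
    }

    // Rebuild store contents by applying entries strictly in changelog order.
    static Map<String, Long> replay(List<Entry> changelog) {
        Map<String, Long> store = new LinkedHashMap<>();
        for (Entry e : changelog) {
            if (e.value == null) {
                store.remove(e.key);       // tombstone: reproduce the delete
            } else {
                store.put(e.key, e.value); // upsert: latest value wins
            }
        }
        return store;
    }

    // Keep only the last entry per key, preserving the order of the survivors.
    static List<Entry> compact(List<Entry> changelog) {
        Map<String, Integer> lastIndex = new HashMap<>();
        for (int i = 0; i < changelog.size(); i++) {
            lastIndex.put(changelog.get(i).key, i);
        }
        List<Entry> compacted = new ArrayList<>();
        for (int i = 0; i < changelog.size(); i++) {
            if (lastIndex.get(changelog.get(i).key) == i) {
                compacted.add(changelog.get(i));
            }
        }
        return compacted;
    }

    public static void main(String[] args) {
        List<Entry> log = Arrays.asList(
                new Entry("alice", 1L), new Entry("bob", 1L),
                new Entry("alice", 2L), new Entry("bob", null),
                new Entry("carol", 5L));
        System.out.println(replay(log));          // {alice=2, carol=5}
        System.out.println(replay(compact(log))); // {alice=2, carol=5}
    }
}
```

Replaying the full log and replaying the compacted log produce the same store contents, which is why compaction is safe for recovery as long as tombstones are still present when the delete needs to be reproduced.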

Cutting Recovery Time with Standby Replicas and Rebalancing

Setting num.standby.replicas keeps each partition's State Store warm in the background on other nodes. On failover, the standby is promoted to active and recovery finishes by applying only the delta. The trade-off is extra network and storage cost.

Cooperative rebalancing uses sticky placement for stateful tasks, so tasks tend to stay on the same host even during scale-in/out events, and reusing the local state.dir keeps recovery cost down. In container deployments, putting state.dir on a persistent volume delivers the same benefit.

For tight SLOs, consider num.standby.replicas=1 or higher
Place state.dir on fast persistent storage (for example, NVMe with a persistent volume)
Use cooperative rebalancing for scaling operations to preserve task stickiness
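A back-of-the-envelope calculation shows why a warm standby matters. The sketch below simply divides the number of changelog records to replay by a sustained restore rate; every number in it is a hypothetical placeholder, so substitute figures measured on your own cluster.

```java
// Rough restore-time estimate. The class name and all input values are
// hypothetical; they are not measurements.
final class RestoreEstimate {

    // Seconds needed to replay recordsToReplay changelog records at a
    // sustained restore rate of recordsPerSecond.
    static double seconds(long recordsToReplay, long recordsPerSecond) {
        if (recordsPerSecond <= 0) {
            throw new IllegalArgumentException("recordsPerSecond must be positive");
        }
        return (double) recordsToReplay / recordsPerSecond;
    }

    public static void main(String[] args) {
        long changelogRecords = 200_000_000L; // assumed size of the whole changelog
        long standbyLag = 50_000L;            // assumed lag of a warm standby
        long restoreRate = 100_000L;          // assumed records restored per second

        System.out.println(seconds(changelogRecords, restoreRate)); // 2000.0 (full replay)
        System.out.println(seconds(standbyLag, restoreRate));       // 0.5 (delta only)
    }
}
```

With these assumed inputs, a full replay takes over half an hour while promoting a standby takes well under a second, which is the gap the extra network and storage cost is buying.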

Operational Tuning: state.dir and RocksDB Settings

Put state.dir on a local persistent disk with enough I/O and capacity, and in containers mount it from a host volume. The file descriptor limit and I/O scheduler settings matter as well.

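RocksDB lets you tune write buffers and the block cache through RocksDBConfigSetter. The record cache (cache.max.bytes.buffering, superseded by statestore.cache.max.bytes since Kafka 3.4) and the commit interval (commit.interval.ms) trade off latency against changelog write volume: a larger cache and a longer interval coalesce more updates per key before they are flushed downstream and to the changelog. Recovery throughput is dominated by network and disk I/O while consuming the changelog, so revisit consumer settings such as max.partition.fetch.bytes too; they can be scoped to the restore consumer with the restore.consumer. prefix.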
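A RocksDBConfigSetter along the following lines is one way to apply the write-buffer and block-cache tuning. Treat it as a sketch: the class name and every size are illustrative assumptions rather than recommendations, and it needs the kafka-streams dependency (which bundles the RocksDB Java classes) on the classpath.

```java
import java.util.Map;

import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Cache;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;

// Illustrative sizes only; measure the working set before choosing values.
public class BoundedRocksDbConfig implements RocksDBConfigSetter {

    // One block cache shared by all stores in this JVM (assumed 64 MB).
    private static final Cache SHARED_BLOCK_CACHE = new LRUCache(64 * 1024L * 1024L);

    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        // Reuse the table config Kafka Streams already set up and swap in the shared cache.
        final BlockBasedTableConfig tableConfig =
                (BlockBasedTableConfig) options.tableFormatConfig();
        tableConfig.setBlockCache(SHARED_BLOCK_CACHE);
        tableConfig.setCacheIndexAndFilterBlocks(true);
        options.setTableFormatConfig(tableConfig);

        // Write buffers (memtables): assumed 16 MB each, at most two per store.
        options.setWriteBufferSize(16 * 1024L * 1024L);
        options.setMaxWriteBufferNumber(2);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // The cache is shared across stores, so it is deliberately not closed here.
    }
}
```

The class is registered through the rocksdb.config.setter property, for example props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedRocksDbConfig.class).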

Exactly-once v2 (processing.guarantee=exactly_once_v2) is recommended when results must be neither duplicated nor lost. Set the replication factor of internal topics (changelog/repartition) to 3 to guard against data loss during broker failures.

Set state.dir explicitly and carry the volume across node reschedules
Set replication.factor and min.insync.replicas on internal topics to production levels
Size the RocksDB block cache to match the working set
Tune commit.interval.ms between hundreds of milliseconds and a few seconds, based on business requirements
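The checklist above maps onto configuration keys roughly as follows. The sketch uses plain string keys so that it depends only on the JDK; the class name is hypothetical, and the value min.insync.replicas=2 is an assumption (a common pairing with a replication factor of 3), not something this article prescribes.

```java
import java.util.Properties;

// Hypothetical helper that collects the durability-related settings in one place.
final class InternalTopicSettings {

    static Properties build() {
        Properties props = new Properties();
        // Explicit state directory, expected to live on a persistent volume.
        props.setProperty("state.dir", "/var/lib/kafka-streams/state");
        // Replication factor for changelog and repartition topics.
        props.setProperty("replication.factor", "3");
        // "topic." prefix: passed to the internal topics Streams creates (assumed value).
        props.setProperty("topic.min.insync.replicas", "2");
        // "restore.consumer." prefix: applies only to the consumer that replays changelogs.
        props.setProperty("restore.consumer.max.partition.fetch.bytes",
                String.valueOf(4 * 1024 * 1024));
        props.setProperty("commit.interval.ms", "500");
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These entries can be merged into the same Properties object that is passed to KafkaStreams; StreamsConfig.topicPrefix(...) and StreamsConfig.restoreConsumerPrefix(...) build the prefixed keys if you prefer constants over strings.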

Configuration example (Java Properties + Streams DSL)

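import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

// Payment is an application-specific value type; it and the default key/value
// Serdes are assumed to be defined and configured elsewhere.
public class PaymentsCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // Keep state under a persistent disk
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams/state");
        // Shorter recovery (trades off against cost)
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        // Exactly-once v2
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        // Record cache and commit interval
        // (cache.max.bytes.buffering is deprecated since Kafka 3.4 in favor of statestore.cache.max.bytes)
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 256 * 1024 * 1024L); // 256 MB
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 500);
        // Replication of internal topics (3 recommended in production)
        props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
        // Affects restore throughput (unprefixed, so it applies to every consumer Streams creates)
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 4 * 1024 * 1024); // 4 MB

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Payment> payments = builder.stream("payments");
        KTable<String, Long> counts = payments
            .groupByKey()
            .count(Materialized.as("payments-counts")); // RocksDB + changelog by default

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        // Close cleanly on shutdown so local state and checkpoints are flushed
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}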

Troubleshooting and Exam Tips

Accidentally deleting the local state.dir forces a full recovery (a complete changelog replay), which can take a long time. To shorten the recovery window, combine standby replicas, persistent volumes, and cooperative rebalancing.

When a checkpoint is broken, the logs record a recovery fallback and the system switches to replay from the beginning. If you conclude that local consistency cannot be restored, the safest move is to stop the instance, delete the affected task's local directory, and let it rebuild from the changelog on restart.

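Topology changes that rename a store or alter its Serde break compatibility with the existing changelog: a renamed store starts from a new, empty changelog, and an incompatible Serde can leave restored data unreadable. When a safe reset is required, use the application reset tool (kafka-streams-application-reset) to clean up input offsets and internal topics, clear the local state.dir on each instance (for example with KafkaStreams#cleanUp() before start()), and then redeploy.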

The changelog is the only source of truth for recovery — not input topic offsets
Local state reuse is conditional (matching checkpoint, no corruption, same host)
Serde compatibility and internal topic reliability settings are must-know exam points

Check Your Understanding

CCDAK

Question 1

A Kafka Streams app crashes and is restarted immediately on the same host. The local RocksDB state and checkpoint are healthy, processing.guarantee is exactly_once_v2, and num.standby.replicas=0. Which statement best describes the recovery at restart?

A. Trust local RocksDB up to the checkpoint, then apply only the delta from the changelog before resuming processing
B. Use input topic offsets to restore the store and ignore the changelog
C. Recovery is impossible without a standby, and the app stays down
D. Processing cannot resume until the entire changelog is replayed from start to end

Answer: A

When local RocksDB state and the checkpoint are both healthy, Streams applies only the changelog delta for an incremental recovery and resumes processing. The changelog is the source of truth for recovery; input topic offsets are not used. Recovery is possible without standbys, but if local state is missing, a full replay from the beginning is required.

Frequently Asked Questions

How does switching to an in-memory store change recovery?

Because state is lost when the process exits, restart always means replaying the changelog from the beginning. Disk I/O drops, but recovery typically takes longer. For production workloads that prioritize resilience, RocksDB is recommended.

What happens if the checkpoint is corrupted or inconsistent?

Streams plays it safe and falls back to a full recovery from the beginning. If the issue keeps recurring, delete the affected task's state.dir contents to regenerate, then verify disk health and Serde compatibility.

How can you speed up recovery in container deployments?

Put state.dir on a persistent volume, use cooperative rebalancing to keep task stickiness, set num.standby.replicas to 1, set internal topic replication to 3, and ensure adequate I/O bandwidth at bootstrap time.


Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
