ksqlDB is a stream processing engine that runs on Apache Kafka. Built on top of Kafka Streams, it lets you write continuous queries in a SQL-like language. You can filter, aggregate, join, and create derived topics without writing producer or consumer code.
This article aligns with the CCDAK (Confluent Certified Developer for Apache Kafka) exam scope and walks through the topics that trip people up in practice: key design, time semantics, internal topics, and schema management.
ksqlDB takes Kafka topics as input, runs continuous queries, and either writes the results to another topic or materializes them into an internal store you can query. Persistent queries (CSAS/CTAS) run continuously on the server and are rebuilt from internal topics and state stores after a failure.
Developers declare the logic in SQL while explicitly configuring Kafka-specific concerns like keys, partitions, serialization formats, and timestamps. These show up often on CCDAK—especially remembering to repartition (PARTITION BY) before joins and aggregations.
| Aspect | ksqlDB | Kafka Streams (Java API) | Kafka Connect |
|---|---|---|---|
| Primary use | Streaming transforms, aggregations, and joins in SQL | Fine-grained stream processing in code | Data integration with external systems |
| Style | Declarative (SQL) | Imperative (Java) | Declarative (connector configuration) |
| State management | Automatic (state store + internal topics) | Controlled in code (RocksDB + changelog) | None (mostly stateless) |
| Deployment | Submit queries to a ksqlDB server | Deploy as an application | Deploy connectors on a Connect cluster |
Event flow and query execution at a glance
A minimal stream definition and derived topic
CREATE STREAM clicks (
user_id STRING KEY,
page STRING,
ts TIMESTAMP
) WITH (
kafka_topic='raw_clicks',
value_format='JSON',
timestamp='ts'
);
CREATE STREAM clicks_en AS
SELECT user_id, page, ts
FROM clicks
WHERE page LIKE 'en/%'
EMIT CHANGES;A STREAM is interpreted as an append-only sequence of immutable events, ideal for filtering, mapping, and feeding into windowed aggregations. A TABLE represents a set of latest values (the folded result of a changelog), holding the current state per key.
Push queries are subscriptions that emit changes continuously—great for live dashboards. Pull queries are key lookups against a materialized TABLE, suited to request/response patterns. Pull queries are based on primary-key equality predicates.
Examples of STREAM, TABLE, and push/pull queries
-- 集計で TABLE を作成(CTAS)
CREATE TABLE pv_per_user AS
SELECT user_id, COUNT(*) AS pv
FROM clicks
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY user_id
EMIT CHANGES;
-- Push: 変化を逐次取得
SELECT user_id, pv FROM pv_per_user EMIT CHANGES;
-- Pull: 特定ユーザーの現在値を 1 回取得(TABLE が必要)
-- SELECT pv FROM pv_per_user WHERE user_id='u-123';Windowed aggregations slice time by event time. The three main shapes are TUMBLING (fixed width, non-overlapping), HOPPING (fixed width, overlapping), and SESSION (bursts of activity). The grace period for late-arriving events and the output timing are key design decisions.
Joins require matching keys and co-partitioned data. STREAM-STREAM joins need a WITHIN window, STREAM-TABLE joins are reference lookups with no window, and TABLE-TABLE joins combine latest values. If keys don't line up, repartition with PARTITION BY.
Windowed aggregation and join examples
-- 5 分の TUMBLING 集計
CREATE TABLE page_views_5m AS
SELECT user_id, COUNT(*) AS cnt
FROM clicks
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY user_id
EMIT CHANGES;
-- STREAM-STREAM ジョイン(到着の近いイベント同士を結合)
CREATE STREAM enriched AS
SELECT c.user_id, c.page, p.product_id
FROM clicks c
JOIN purchases p
WITHIN 10 MINUTES
ON c.user_id = p.user_id
EMIT CHANGES;ksqlDB handles Avro, JSON, Protobuf, and KAFKA (raw bytes) via VALUE_FORMAT and KEY_FORMAT. Pairing it with the Schema Registry makes compatibility checks and schema evolution (e.g., backward compatibility) easier to operate. Default subject names are <topic>-value and <topic>-key.
Aggregations and joins depend on keys being defined correctly. With the wrong key, partitions scatter and the aggregation results won't match expectations. The standard play is to PARTITION BY the correct key before joining or aggregating.
Schema, format, and repartitioning examples
-- Avro で値を管理し、イベント時刻を明示
CREATE STREAM orders (
order_id STRING KEY,
user_id STRING,
amount DECIMAL(10,2),
event_time TIMESTAMP
) WITH (
kafka_topic='orders_raw',
key_format='KAFKA',
value_format='AVRO',
timestamp='event_time'
);
-- 集計前にユーザー単位へ再パーティション
CREATE STREAM orders_by_user AS
SELECT * FROM orders PARTITION BY user_id EMIT CHANGES;Persistent queries (CSAS/CTAS) automatically maintain internal topics and state stores. To scale them, your input topics need enough partitions and an even key distribution. Review the replication factor and cleanup policy on internal topics against your operational policies.
On failure, state is rebuilt from internal changelogs. Plan for pending commits and replays by running at least 3 brokers with a sufficient replication factor and durable storage. Exactly-once processing guarantees only kick in when both Kafka and ksqlDB are configured correctly.
Common operational commands
-- クエリ一覧と説明
SHOW QUERIES;
EXPLAIN <QUERY_ID>;
-- 一時的な動作確認(早期テストに有効)
SET 'auto.offset.reset'='earliest';
PRINT 'raw_clicks' FROM BEGINNING LIMIT 5;The exam targets keys and co-partitioning, the preconditions for each join type, the difference between STREAM and TABLE, when to use push vs. pull queries, schema compatibility and evolution, the role of internal topics, and the differences between processing guarantees. Memorizing ksqlDB syntax matters less than being able to explain the why through Kafka's design principles.
Lock in the staples that people miss most often: PARTITION BY before joins and aggregations, declaring event time (the timestamp setting), and the rule that pull queries require a TABLE.
Minimal examples of the patterns most often confused on the exam
-- 典型的な CSAS と再パーティション
CREATE STREAM s AS SELECT * FROM src EMIT CHANGES;
CREATE STREAM s_by_key AS SELECT * FROM s PARTITION BY some_key EMIT CHANGES;
-- Pull クエリは TABLE が対象
-- SELECT v FROM some_table WHERE id='k';CCDAK
問題 1
You want to join a 10-minute windowed page-view aggregation per user (a TABLE) with a user-attribute TABLE and push the top pages to a dashboard. Which combination of preconditions and steps is correct?
正解: A
A windowed aggregation with GROUP BY produces a TABLE, and the attribute side is a TABLE reference, so the standard play is a STREAM-TABLE join (no window). Joins and aggregations require matching keys and co-partitioned data; when they don't align, PARTITION BY repartitions to fix it.
What is the difference between push and pull queries?
A push query (SELECT ... EMIT CHANGES) is a subscription that continuously emits changes. A pull query is a point lookup against a materialized TABLE by its primary key, returning the current value once. Pull queries require a TABLE and a primary-key predicate.
How do I choose between ksqlDB and Kafka Streams?
Pick ksqlDB when you want to build pipelines quickly in SQL and let the platform handle operations. Pick Kafka Streams (the Java API) when you need fine-grained control, custom logic, or complex state management.
Can I use exactly-once semantics?
Yes—exactly-once processing guarantees are available when Kafka and ksqlDB are configured correctly. The broker, producer, and consumer (Streams) configurations must be consistent, and internal topics need adequate replication and storage.
Practice with certification-focused question sets
無料で問題を解いてみるNicheeLab Editorial Team
NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
Kafka Topics & Partitions: Distribution Fundamentals (2026)
How Kafka topics and partitions enable scale — ordering guar...
CCDAK Exam Guide: Confluent Certified Developer (2026)
Complete prep for the CCDAK exam — Producer/Consumer API, St...
CCAAK Exam Guide: Confluent Certified Administrator (2026)
Pass the CCAAK exam — cluster management, partitions, securi...
Kafka Replicas & ISR: Fault Tolerance Explained (2026)
Replica placement, in-sync replicas (ISR), leader election. ...
Kafka Offsets: Commit Modes & Consumer Position (2026)
Offset semantics — auto vs. manual commit, __consumer_offset...