Kafka

ksqlDB Primer: Streaming Processing in SQL (CCDAK-Aligned)

2026-04-19
NicheeLab Editorial Team

ksqlDB is a stream processing engine that runs on Apache Kafka. Built on top of Kafka Streams, it lets you write continuous queries in a SQL-like language. You can filter, aggregate, join, and create derived topics without writing producer or consumer code.

This article aligns with the CCDAK (Confluent Certified Developer for Apache Kafka) exam scope and walks through the topics that trip people up in practice: key design, time semantics, internal topics, and schema management.

The Big Picture: Where ksqlDB Fits

ksqlDB takes Kafka topics as input, runs continuous queries, and either writes the results to another topic or materializes them into an internal store you can query. Persistent queries (CSAS/CTAS) run continuously on the server and are rebuilt from internal topics and state stores after a failure.

Developers declare the logic in SQL while explicitly configuring Kafka-specific concerns like keys, partitions, serialization formats, and timestamps. These show up often on CCDAK—especially remembering to repartition (PARTITION BY) before joins and aggregations.

  • Query types: transient (SELECT ... EMIT CHANGES) and persistent (CREATE STREAM/TABLE AS SELECT)
  • Query modes: push (streams changes continuously) and pull (key lookup against a materialized table)
  • Time semantics: event time by default; can be set explicitly with WITH (TIMESTAMP=...)
  • Auto-generated internals: repartitioning and aggregations create internal topics and state stores automatically
  • Failure recovery: state stores are rebuilt from internal topics and changelogs
AspectksqlDBKafka Streams (Java API)Kafka Connect
Primary useStreaming transforms, aggregations, and joins in SQLFine-grained stream processing in codeData integration with external systems
StyleDeclarative (SQL)Imperative (Java)Declarative (connector configuration)
State managementAutomatic (state store + internal topics)Controlled in code (RocksDB + changelog)None (mostly stateless)
DeploymentSubmit queries to a ksqlDB serverDeploy as an applicationDeploy connectors on a Connect cluster

Event flow and query execution at a glance

Source Topic(s)ksqlDB QueryCSAS / CTASSink Topic / Materialized TableStateClient / AppPull Query (by key) / Push Query (changes)

A minimal stream definition and derived topic

CREATE STREAM clicks (
  user_id STRING KEY,
  page STRING,
  ts TIMESTAMP
) WITH (
  kafka_topic='raw_clicks',
  value_format='JSON',
  timestamp='ts'
);

CREATE STREAM clicks_en AS
  SELECT user_id, page, ts
  FROM clicks
  WHERE page LIKE 'en/%'
  EMIT CHANGES;

STREAM vs. TABLE and Push vs. Pull Queries

A STREAM is interpreted as an append-only sequence of immutable events, ideal for filtering, mapping, and feeding into windowed aggregations. A TABLE represents a set of latest values (the folded result of a changelog), holding the current state per key.

Push queries are subscriptions that emit changes continuously—great for live dashboards. Pull queries are key lookups against a materialized TABLE, suited to request/response patterns. Pull queries are based on primary-key equality predicates.

  • Move from STREAM to TABLE via aggregations like GROUP BY
  • TABLE-TABLE and STREAM-TABLE joins require key alignment (co-partitioning)
  • Pull queries require a materialized TABLE (via CTAS or aggregation)
  • Push: SELECT ... EMIT CHANGES. Pull: SELECT ... WHERE key=...

Examples of STREAM, TABLE, and push/pull queries

-- 集計で TABLE を作成(CTAS)
CREATE TABLE pv_per_user AS
  SELECT user_id, COUNT(*) AS pv
  FROM clicks
  WINDOW TUMBLING (SIZE 5 MINUTES)
  GROUP BY user_id
  EMIT CHANGES;

-- Push: 変化を逐次取得
SELECT user_id, pv FROM pv_per_user EMIT CHANGES;

-- Pull: 特定ユーザーの現在値を 1 回取得(TABLE が必要)
-- SELECT pv FROM pv_per_user WHERE user_id='u-123';

Stateful Processing: Windowed Aggregations and Join Basics

Windowed aggregations slice time by event time. The three main shapes are TUMBLING (fixed width, non-overlapping), HOPPING (fixed width, overlapping), and SESSION (bursts of activity). The grace period for late-arriving events and the output timing are key design decisions.

Joins require matching keys and co-partitioned data. STREAM-STREAM joins need a WITHIN window, STREAM-TABLE joins are reference lookups with no window, and TABLE-TABLE joins combine latest values. If keys don't line up, repartition with PARTITION BY.

  • TUMBLING: fixed width, no overlap
  • HOPPING: fixed width, slides at a set interval with overlap
  • SESSION: bounded by idle gaps
  • STREAM-STREAM joins need WITHIN; other join types do not
  • When keys don't align, PARTITION BY rebuilds an internal topic

Windowed aggregation and join examples

-- 5 分の TUMBLING 集計
CREATE TABLE page_views_5m AS
  SELECT user_id, COUNT(*) AS cnt
  FROM clicks
  WINDOW TUMBLING (SIZE 5 MINUTES)
  GROUP BY user_id
  EMIT CHANGES;

-- STREAM-STREAM ジョイン(到着の近いイベント同士を結合)
CREATE STREAM enriched AS
  SELECT c.user_id, c.page, p.product_id
  FROM clicks c
  JOIN purchases p
    WITHIN 10 MINUTES
    ON c.user_id = p.user_id
  EMIT CHANGES;

Schemas, Serialization, and the Schema Registry

ksqlDB handles Avro, JSON, Protobuf, and KAFKA (raw bytes) via VALUE_FORMAT and KEY_FORMAT. Pairing it with the Schema Registry makes compatibility checks and schema evolution (e.g., backward compatibility) easier to operate. Default subject names are <topic>-value and <topic>-key.

Aggregations and joins depend on keys being defined correctly. With the wrong key, partitions scatter and the aggregation results won't match expectations. The standard play is to PARTITION BY the correct key before joining or aggregating.

  • Use the Schema Registry to automatically check schema evolution and compatibility
  • Key and value can use different formats (e.g., KEY=KAFKA, VALUE=AVRO)
  • Use WITH (TIMESTAMP='column') to declare the event-time column explicitly
  • PARTITION BY repartitions and creates an internal topic

Schema, format, and repartitioning examples

-- Avro で値を管理し、イベント時刻を明示
CREATE STREAM orders (
  order_id STRING KEY,
  user_id STRING,
  amount DECIMAL(10,2),
  event_time TIMESTAMP
) WITH (
  kafka_topic='orders_raw',
  key_format='KAFKA',
  value_format='AVRO',
  timestamp='event_time'
);

-- 集計前にユーザー単位へ再パーティション
CREATE STREAM orders_by_user AS
  SELECT * FROM orders PARTITION BY user_id EMIT CHANGES;

Operational Tips: Topic Design, Internal Topics, and Availability

Persistent queries (CSAS/CTAS) automatically maintain internal topics and state stores. To scale them, your input topics need enough partitions and an even key distribution. Review the replication factor and cleanup policy on internal topics against your operational policies.

On failure, state is rebuilt from internal changelogs. Plan for pending commits and replays by running at least 3 brokers with a sufficient replication factor and durable storage. Exactly-once processing guarantees only kick in when both Kafka and ksqlDB are configured correctly.

  • Input-topic key design and even distribution are the most important factors
  • Monitor the replication factor and storage capacity of internal topics
  • Manage CSAS/CTAS by query ID and prepare procedures for stopping or deleting them
  • Choose your processing guarantee (at-least-once vs. exactly-once) based on requirements and cost

Common operational commands

-- クエリ一覧と説明
SHOW QUERIES;
EXPLAIN <QUERY_ID>;

-- 一時的な動作確認(早期テストに有効)
SET 'auto.offset.reset'='earliest';
PRINT 'raw_clicks' FROM BEGINNING LIMIT 5;

CCDAK Key Points and Common Pitfalls

The exam targets keys and co-partitioning, the preconditions for each join type, the difference between STREAM and TABLE, when to use push vs. pull queries, schema compatibility and evolution, the role of internal topics, and the differences between processing guarantees. Memorizing ksqlDB syntax matters less than being able to explain the why through Kafka's design principles.

Lock in the staples that people miss most often: PARTITION BY before joins and aggregations, declaring event time (the timestamp setting), and the rule that pull queries require a TABLE.

  • STREAM-STREAM needs WITHIN; STREAM-TABLE is a reference lookup and needs no window
  • Pull queries are fundamentally primary-key equality predicates
  • When repartitioning is required, an internal topic is created
  • Schema evolution: pay attention to the compatibility rules (backward, forward, full)

Minimal examples of the patterns most often confused on the exam

-- 典型的な CSAS と再パーティション
CREATE STREAM s AS SELECT * FROM src EMIT CHANGES;
CREATE STREAM s_by_key AS SELECT * FROM s PARTITION BY some_key EMIT CHANGES;

-- Pull クエリは TABLE が対象
-- SELECT v FROM some_table WHERE id='k';

Check Your Understanding

CCDAK

問題 1

You want to join a 10-minute windowed page-view aggregation per user (a TABLE) with a user-attribute TABLE and push the top pages to a dashboard. Which combination of preconditions and steps is correct?

  1. Build the PV aggregation as a TUMBLING CTAS TABLE and STREAM-TABLE join it with the user-attribute TABLE on matching keys. If keys don't align, PARTITION BY before aggregating.
  2. Build the PV aggregation as a HOPPING STREAM and STREAM-STREAM join it with the user-attribute TABLE using WITHIN. Keys aren't needed because we use a pull query.
  3. Build the PV aggregation as a SESSION TABLE and TABLE-STREAM join it with the user-attribute STREAM with no window. Either push or pull works.
  4. Build the PV aggregation as a TUMBLING TABLE; even with key mismatch against the user-attribute TABLE, ksqlDB repartitions automatically so no setup is needed.

正解: A

A windowed aggregation with GROUP BY produces a TABLE, and the attribute side is a TABLE reference, so the standard play is a STREAM-TABLE join (no window). Joins and aggregations require matching keys and co-partitioned data; when they don't align, PARTITION BY repartitions to fix it.

Frequently Asked Questions

What is the difference between push and pull queries?

A push query (SELECT ... EMIT CHANGES) is a subscription that continuously emits changes. A pull query is a point lookup against a materialized TABLE by its primary key, returning the current value once. Pull queries require a TABLE and a primary-key predicate.

How do I choose between ksqlDB and Kafka Streams?

Pick ksqlDB when you want to build pipelines quickly in SQL and let the platform handle operations. Pick Kafka Streams (the Java API) when you need fine-grained control, custom logic, or complex state management.

Can I use exactly-once semantics?

Yes—exactly-once processing guarantees are available when Kafka and ksqlDB are configured correctly. The broker, producer, and consumer (Streams) configurations must be consistent, and internal topics need adequate replication and storage.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
Kafka

Kafka Topics & Partitions: Distribution Fundamentals (2026)

How Kafka topics and partitions enable scale — ordering guar...

Kafka

CCDAK Exam Guide: Confluent Certified Developer (2026)

Complete prep for the CCDAK exam — Producer/Consumer API, St...

Kafka

CCAAK Exam Guide: Confluent Certified Administrator (2026)

Pass the CCAAK exam — cluster management, partitions, securi...

Kafka

Kafka Replicas & ISR: Fault Tolerance Explained (2026)

Replica placement, in-sync replicas (ISR), leader election. ...

Kafka

Kafka Offsets: Commit Modes & Consumer Position (2026)

Offset semantics — auto vs. manual commit, __consumer_offset...

Browse all Kafka articles (101)
© 2026 NicheeLab All rights reserved.