Kafka

Kafka Capacity Planning Guide: Sizing Partitions, Brokers, and Storage

2026-04-19
NicheeLab Editorial Team

Kafka scales out well, but getting the initial partition count, broker count, and storage sizing wrong leads to painful after-the-fact reassignments and cost overruns. This guide distills the stable concepts from the official docs into the points CCAAK tests most and a sizing workflow that holds up in production.

Three keywords matter: parallelism (partitions), fault tolerance (replication/brokers), and retention (capacity). The fastest path is to back-solve all three from your workload.

1. Setting the Stage: Quantify Workload and SLA

Capacity planning starts by pinning down ingress rate, message size, retention window, and availability/latency SLAs as concrete numbers. Separate averages from peaks and call out the peak factor explicitly (e.g., p95/p99).

Kafka throughput scales with batching efficiency, so fixing target values for producer batch.size and linger.ms, plus consumer fetch settings, as part of your assumptions makes the partition-count math much more stable.

  • Average/peak ingress (MB/s) and average message size (bytes)
  • Per-topic retention policy: retention.ms / retention.bytes and cleanup.policy (delete/compact)
  • replication.factor and min.insync.replicas aligned with availability targets and RPO/RTO
  • Read patterns: number of consumer groups, concurrent subscribers, and write-to-read traffic ratio
  • Operational headroom: typically 20-30% as an initial guard rail

Workload-assumptions template (example)

# unit: MB/s unless noted
workload:
  topic: orders
  ingress_avg: 50
  ingress_peak: 120
  msg_size_bytes_avg: 1024
  consumers_groups: 3
  retention_ms: 604800000   # 7 days
  cleanup_policy: delete
  availability:
    replication_factor: 3
    min_insync_replicas: 2
  headroom: 0.25            # 25% extra

2. Partition Design: Match Parallelism to Throughput

Throughput scales roughly linearly with partition count. A single partition is a serial write to one leader (to preserve ordering), so hardware and batching settings impose a practical ceiling. Start with a small load test, measure effective throughput per partition (Tp), and back-solve the required partition count.

Skewed key distributions create hot partitions and prevent linear scaling. Where possible, add a sharding component to the key space, or tune StickyPartitioner behavior and batch size to even out the load.

  • Required partitions ≈ ceil(peak ingress / effective Tp per partition)
  • Consumer parallelism is capped at "partitions per group". Also derive a lower bound from your read-parallelism requirements
  • Over-provision modestly up front to leave room to grow; adding partitions later incurs rebalance cost
  • Segment size and indexes consume memory/FDs in proportion to partition count, so keep broker memory in mind when setting the ceiling

Rough partition-count formula and example

# 実測に基づく計算を推奨
# Tp_per_partition_MBps: 1パーティションの実効MB/s(負荷試験で取得)
# required_partitions = ceil(peak_ingress_MBps / Tp_per_partition_MBps)

peak_ingress_MBps=120
Tp_per_partition_MBps=8
required_partitions=$(( (peak_ingress_MBps + Tp_per_partition_MBps - 1) / Tp_per_partition_MBps ))
echo ${required_partitions}  # => 15 (例)

# 作成例(RF=3, min.insync.replicas=2)
# kafka-topics.sh --create --topic orders --partitions 16 --replication-factor 3 \
#   --config min.insync.replicas=2

3. Broker Count: Availability and Placement Principles

Replication factor (RF) is the foundation of availability. Each partition has RF replicas, and writes require the leader plus an ISR set of at least min.insync.replicas. For typical availability targets, RF=3 with min.insync.replicas=2 is the starting point.

Provision at least RF brokers and enough to spread replicas across failure domains (racks or AZs). Then add more so you do not exceed the per-broker partition ceiling, disk bandwidth, or CPU limits.

  • Minimum brokers ≥ RF. With rack-aware placement, the number of racks should also be ≥ RF
  • Target min.insync.replicas = RF - 1 and combine with acks=all to balance fault tolerance and consistency
  • Per-broker partition limit is hardware-dependent. GC/metadata load climbs into the thousands, so cap based on operational experience
  • Keep 20-30% headroom to bound controller load and reassignment duration
StrategyKey settingsPrimary use caseSizing note
Time-based deletionretention.ms, cleanup.policy=deleteRetain log for a fixed windowCapacity = avg ingress × retention × RF + headroom
Size-based deletionretention.bytes, cleanup.policy=deleteOperations driven by a capacity capCapacity = configured size × RF + headroom; oldest data is deleted first
Log compactioncleanup.policy=compact(,delete)Stream tables that keep the latest value per keyEstimate capacity as total unique-key size plus the delta between compactions

RF=3 rack-aware cluster placement

Producersacks=allRack A / Broker1P0[L] P1[F] P2[F]Rack B / Broker2P0[F] P1[L] P2[F]Rack C / Broker3P0[F] P1[F] P2[L]ConsumersL=Leader, F=Follower, ISR>=2 (min.insync.replicas=2)

Rack identification and topic creation example

# server.properties(各ブローカー)
# broker.rack=rack-a|rack-b|rack-c

# 新規トピック(RF=3, min.insync.replicas=2)
# kafka-topics.sh --create --topic orders --partitions 18 --replication-factor 3 \
#   --config min.insync.replicas=2

4. Storage Capacity: Back-Solving from Retention Policy

Size capacity by multiplying effective ingress (pre/post compression), retention, and RF, then add index/segment overhead and operational headroom. Official retention behavior is controlled via retention.ms, retention.bytes, and cleanup.policy (delete/compact).

When using log compaction, size primarily off total unique-key footprint and update frequency; if you also enable delete, manage the upper bound with both policies in tandem.

  • Base formula: single-replica capacity ≈ effective ingress (MB/s) × retention seconds × compression factor
  • Cluster capacity ≈ single-replica capacity × RF × (1 + headroom)
  • Overhead: index/time-index plus segment-rollover slack; budget +10-20% to stay on the safe side
  • Keep spare disk and evaluate IOPS/bandwidth so the log cleaner and compactor do not stall

Worked example (delete retention, RF=3, 7 days, 25% headroom)

# 前提
# 実効Ingress = 50 MB/s(圧縮後)
# 保持 = 7日 = 604800 秒
# RF = 3, ヘッドルーム = 25%
# オーバーヘッド係数 = 1.1 (インデックス等)

echo "scale=2; 50*604800*1.1*3*1.25/1024/1024" | bc
# => 約 120.4 TB (クラスター総容量の目安)
# 実機ではディスク単位(例 4TB x 台数)に割り付け、将来増設余地を残す

5. Network and Disk Bandwidth: Budgeting for Replication and Consumption

Cluster ingress is the producer → leader write path. With RF>1, followers also fetch replicas from the leader, so total receive bytes converge on roughly "client ingress × RF" (with duplicate ACKs and metadata traffic being relatively small). Egress scales with consumer count and read patterns.

HDDs can sustain high throughput on sequential writes, but for mixed workloads (read/write/compression/compaction), SSDs deliver lower jitter and make latency SLAs easier to hit.

  • NIC budget: receive ≈ ingress × RF, transmit ≈ ingress × (RF-1) + total egress (approximate)
  • Disk budget: evaluate writes against ingress, and reads at the peak where egress, replication, and compaction overlap
  • Batching and compression (LZ4/ZSTD) improve NIC/disk efficiency, but trade off against CPU usage
  • Producer acks=all combined with min.insync.replicas can add wait time, so validate latency SLA and bandwidth together

Network rule-of-thumb formula

# 例: Ingress=120 MB/s, RF=3, 総Egress=240 MB/s(複数グループ合算)
# 受信Rx ≈ 120 * 3 = 360 MB/s
# 送信Tx ≈ 120 * (3-1) + 240 = 480 MB/s
# 10GbE(実効~1,200 MB/s)なら1本で足りるが、冗長/将来増分を見て2ポート以上を推奨

6. Operational Headroom and Scaling Strategy: Safety Margin and Reassignment

Capacity is not static. Anticipate topic and traffic growth, hold 20-30% headroom at all times, and add brokers or disks as you approach the threshold. After scaling, use partition reassignment to rebalance.

Reassignment puts load on the network and disks. Limit impact by using maintenance windows, throttling, and preferred-leader election.

  • During reassignment, throttle bandwidth with quotas (replication.throttled.rate)
  • Use preferred-leader election to spread leaders deliberately
  • Monitor metrics: partition/broker count, disk usage, ISR stability, and request latency
  • Phased scaling: expand disks first, then add brokers and reassign

Reassignment and preferred-leader election example

# reassignment.json(例)
# {"version":1, "partitions":[{"topic":"orders","partition":0,"replicas":[1,2,3]}]}

# 実行
# kafka-reassign-partitions.sh --bootstrap-server <broker> --reassignment-json-file reassignment.json --execute
# 帯域スロットルを適用する場合は broker/topic 設定で replication.throttled.* を活用

# 優先リーダー選出
# kafka-preferred-replica-election.sh --bootstrap-server <broker>

Check Your Understanding

CCAAK

問題 1

From a CCAAK perspective, topic orders is configured with RF=3, min.insync.replicas=2, and producer acks=all. Which condition must hold for new writes to continue after two brokers fail?

  1. The failures occur outside the partition's ISR and the remaining ISR stays at 2 or higher
  2. With RF=3, writes always continue through any 2 broker failures
  3. Because acks=all is set, broker failures have no effect on writes
  4. Lowering min.insync.replicas to 1 lets writes continue under any 2-broker failure

正解: A

With acks=all, writes succeed only when ISR size is at least min.insync.replicas. With RF=3 and min.insync.replicas=2, two simultaneous failures typically drop ISR below 2 and writes are rejected. The exception is when the failures hit replicas outside the ISR (e.g., replicas already lagging out of ISR) so that ISR stays at 2 or more. B and C are incorrect, and D trades consistency for availability, which conflicts with the original design intent.

Frequently Asked Questions

How many partitions per broker is safe?

It depends on hardware, Kafka version, and GC strategy. Past several thousand partitions, metadata handling, GC, and cleanup tend to spike, so set a limit based on load testing and keep 20-30% headroom. Controller stability and reassignment duration should also be evaluated.

Should I use retention.ms or retention.bytes?

Use retention.ms as the primary lever when your SLA is time-based, and retention.bytes when cost ceilings are strict; let the other one act as a backstop. If you also use compaction, baseline on unique key size and combine with delete to cap old segments for stability.

What happens when I add partitions later?

Existing key-ordering guarantees do not carry over to new partitions (with unkeyed round-robin/sticky assignment). You can also see momentary latency spikes from data rebalancing and leader election. It is safer to over-provision a bit at the start to absorb future growth.

Check what you learned with practice questions

Practice with certification-focused question sets

無料で問題を解いてみる
Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.


Related articles
Kafka

Kafka Topics & Partitions: Distribution Fundamentals (2026)

How Kafka topics and partitions enable scale — ordering guar...

Kafka

CCDAK Exam Guide: Confluent Certified Developer (2026)

Complete prep for the CCDAK exam — Producer/Consumer API, St...

Kafka

CCAAK Exam Guide: Confluent Certified Administrator (2026)

Pass the CCAAK exam — cluster management, partitions, securi...

Kafka

Kafka Replicas & ISR: Fault Tolerance Explained (2026)

Replica placement, in-sync replicas (ISR), leader election. ...

Kafka

Kafka Offsets: Commit Modes & Consumer Position (2026)

Offset semantics — auto vs. manual commit, __consumer_offset...

Browse all Kafka articles (101)
© 2026 NicheeLab All rights reserved.