Kafka Cost Optimization: Brokers, Storage, Network (2026)

Kafka cost optimization starts by aligning three levers: partition count, compression, and retention (delete versus compaction). Too many partitions inflate memory, file descriptors, and replication traffic; a mismatched compression scheme overloads the CPU and degrades throughput; and the wrong retention policy makes storage balloon.

This article organizes the high-yield points commonly tested on the CCAAK exam together with practical answers from real-world operations, grounded in the official documentation. To avoid version-dependent behavior, the discussion is limited to stable features (topic settings, basic producer/consumer settings, and the principles of log compaction).

Partition Count Design Principles and Cost Impact

Partition count caps parallelism, but each partition carries fixed costs: metadata, page cache, log segments, replication threads, and so on. Over-provisioning rapidly increases broker memory consumption, open file count, and the load on the controller and metadata propagation.

As a rough estimate, total segment count ≈ partition count × (active + rotated segments). A larger ratio of retention.bytes to segment.bytes means more segments per partition, which increases file descriptor and disk seek costs. Replication network egress scales roughly with write throughput × (replication.factor - 1).
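The segment estimate above can be sketched in shell for a size-bounded topic. This is an illustration under stated assumptions, not a sizing rule: it assumes each partition holds about ceil(retention.bytes / segment.bytes) rotated segments plus one active segment, that every replica stores its own copy, and it ignores index files and time-based rolling. The function name and the example numbers are hypothetical.

```shell
# Rough segment-count estimate (a sketch; ignores index files and time-based segment rolling).
# Assumption: rotated segments per partition ~= ceil(retention.bytes / segment.bytes), plus 1 active segment.
# $1 = partition count, $2 = retention.bytes (per partition), $3 = segment.bytes, $4 = replication.factor
estimate_segments() {
  rotated=$(( ($2 + $3 - 1) / $3 ))      # integer ceiling division
  echo $(( $1 * (rotated + 1) * $4 ))    # segment files across all replicas
}

# Hypothetical topic: 24 partitions, 10 GiB retention.bytes, 1 GiB segments, RF=3
estimate_segments 24 10737418240 1073741824 3    # prints 792
# Same topic with 128 MiB segments
estimate_segments 24 10737418240 134217728 3     # prints 5832
```

Shrinking segment.bytes from 1 GiB to 128 MiB multiplies the file count roughly sevenfold in this example, which is the file descriptor cost the paragraph above refers to.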

Throughput-first starting point: target around 1-3 MB/s per partition, then add headroom for future growth
Consumer parallelism is capped by partition count. Back-calculate from the target parallelism per group
Partition count can be increased but not decreased (re-partitioning is required). Do not get too far ahead
Once a single broker approaches thousands of partitions, GC and replication lag tend to rise. Monitor and scale gradually
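One way to turn the guidelines above into a starting number is the sketch below. It assumes integer MB/s inputs, takes the larger of the throughput-derived count and the required consumer parallelism, and then adds a growth headroom percentage. The function name, the headroom parameter, and the sample workload are hypothetical; the per-partition throughput should come from your own measurements.

```shell
# Starting-point partition count (a sketch under stated assumptions, not a sizing rule).
# $1 = target topic throughput in MB/s, $2 = assumed per-partition throughput in MB/s,
# $3 = required consumer parallelism per group, $4 = growth headroom in percent (e.g. 30)
suggest_partitions() {
  by_throughput=$(( ($1 + $2 - 1) / $2 ))    # ceil(target / per-partition)
  base=$by_throughput
  if [ "$3" -gt "$base" ]; then
    base=$3                                  # active consumers per group cannot exceed partitions
  fi
  echo $(( (base * (100 + $4) + 99) / 100 )) # add headroom, rounding up
}

# Hypothetical workload: 40 MB/s target, 2 MB/s per partition, 12 consumers, 30% headroom
suggest_partitions 40 2 12 30    # prints 26
```

Because partition count cannot be reduced later, it is safer to treat the result as an upper bound for the first rollout and grow toward it while monitoring.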

Gradually increasing partition count (since shrinking is not possible, plan ahead)

# Increase partition count. Watch cluster balance as assignment proceeds
kafka-topics.sh --bootstrap-server <broker:9092> \
  --alter --topic orders --partitions 24

# Beware of over-provisioning. After assignment, plan for the network/disk I/O of rebalancing

Choosing and Tuning the Compression Algorithm

Kafka compression is fundamentally done on the producer side. With the topic setting compression.type=producer (default), the producer's setting is applied. Compression lowers disk and network usage but affects CPU cost and latency. Pick an algorithm by understanding its characteristics and how it interacts with message size and batching.

In general, zstd offers high compression at high CPU cost, lz4 delivers low latency and moderate compression, snappy is lightweight and stable, gzip is broadly compatible but CPU-heavy, and no compression minimizes CPU at the cost of bandwidth and disk. Compression has little effect on very small batches, so the standard practice is to combine linger.ms and batch.size to form meaningful units.

Small-to-medium JSON/Avro: consider lz4 or zstd
Ultra-low latency: lz4/snappy with a smaller batch
Bandwidth/storage pressure: zstd with moderate linger.ms and a larger batch.size
Compression is end-to-end. Avoid broker-side re-compression and decide on the producer

Algorithm	Typical compression ratio	CPU cost	Throughput / latency impact
zstd	High (2-5x compression is common)	High	Highly efficient with large batches; CPU-heavy
lz4	Medium (around 1.5-3x)	Medium-low	Stable low latency, good throughput
snappy	Medium (around 1.5-2x)	Low	Stable, light on CPU
gzip	Medium-high	Medium-high	Latency rises when CPU is the bottleneck
none	None	Minimal	Consumes bandwidth and disk

Example producer compression and batching settings

props.put("compression.type", "zstd");
props.put("linger.ms", "15");         // Adjust within latency tolerance
props.put("batch.size", "131072");    // ~128KB target (A/B test for your workload)
props.put("acks", "all");             // Strongest durability: wait for all in-sync replicas
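To sanity-check whether linger.ms or batch.size is the binding limit for settings like those above, a rough fill estimate helps. The sketch below treats record sizes as uncompressed and assumes a steady per-partition record rate, both of which are simplifications; the function name and the sample load are hypothetical.

```shell
# Estimate how full a producer batch gets before linger.ms expires
# (a sketch; uses uncompressed record sizes and a steady arrival rate).
# $1 = records/sec to one partition, $2 = average record size in bytes,
# $3 = linger.ms, $4 = batch.size in bytes
batch_fill_percent() {
  accumulated=$(( $1 * $2 * $3 / 1000 ))   # bytes gathered within the linger window
  if [ "$accumulated" -gt "$4" ]; then
    accumulated=$4                         # batch.size caps the batch; it is sent before linger.ms expires
  fi
  echo $(( accumulated * 100 / $4 ))
}

# Hypothetical load: 2000 records/s per partition, 500-byte records, linger.ms=15, batch.size=131072
batch_fill_percent 2000 500 15 131072    # prints 11
```

A low percentage suggests linger.ms is what triggers the send, so raising batch.size alone would change little; a value of 100 suggests batch.size is the limit.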

Retention Period and Delete/Compaction Strategy

Retention design combines delete (drop entire segments by time/size) and compact (keep the latest record per key). For keyed data where the latest version must always remain (e.g. current entity state), compact is the choice; for event history, delete is the default. The combined compact,delete is useful when you want to retain history for a fixed window while still keeping the latest version, but if retention is too short, delete may race ahead of compaction and break your assumptions.

retention.bytes and retention.ms can each be set (or used together). When you must strictly stay within a storage budget, prioritize bytes; when compliance dictates a retention window, prioritize ms. segment.bytes affects compactor efficiency, file count, and page cache efficiency, so set it by balancing I/O against memory.

Latest-version guarantee is top priority: cleanup.policy=compact (do not add delete)
Want recent history too: cleanup.policy=compact,delete + a generously long retention.ms
Strict storage cap: set retention.bytes (estimate the total multiplied by replication.factor)
Pushing compaction harder: lowering min.cleanable.dirty.ratio too far increases I/O. Adjust in small steps from the default

[Figure: delete vs compact (conceptual diagram)]

Safe combinations of topic retention/compaction

# Latest-version guarantee (no history needed)
kafka-configs.sh --bootstrap-server <broker> --alter \
  --topic entity-state \
  --add-config cleanup.policy=compact,segment.bytes=134217728,min.cleanable.dirty.ratio=0.5

# Keep recent history too (compact + delete, with care)
kafka-configs.sh --bootstrap-server <broker> --alter \
  --topic entity-state-history \
  --add-config 'cleanup.policy=[compact,delete],retention.ms=1209600000,segment.ms=604800000'

# Strict storage cap (bytes-first)
kafka-configs.sh --bootstrap-server <broker> --alter \
  --topic metrics \
  --add-config retention.bytes=10737418240
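Topic configs take plain milliseconds and bytes, so the literal values above are easy to mistype. The small helpers below are a convenience sketch (the function names are hypothetical) that reproduce the numbers used in these commands.

```shell
# Unit helpers for Kafka topic configs, which expect plain milliseconds and bytes.
days_to_ms()   { echo $(( $1 * 24 * 60 * 60 * 1000 )); }
mib_to_bytes() { echo $(( $1 * 1024 * 1024 )); }
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

days_to_ms 14      # 1209600000  (retention.ms in the compact + delete example)
days_to_ms 7       # 604800000   (segment.ms in the compact + delete example)
mib_to_bytes 128   # 134217728   (segment.bytes in the compact-only example)
gib_to_bytes 10    # 10737418240 (retention.bytes in the storage-cap example)
```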

Throughput and Storage Estimation Procedure

Storage estimate: with write rate B (bytes/sec), post-compression ratio r (0<r<=1), retention period T (seconds), partition count P, and replicas R, total required disk ≈ B × r × T × R. When using bytes limits, retention.bytes × P × R is close to the upper bound for cluster consumption (overhead aside).

Network replication traffic (broker ingress): B × r × (R-1). Client egress depends on consumer count and filtering. At the estimation stage, plan with both peak and average values so spikes can be absorbed.

Calibrate r from real measurements. Gather it from a 10-30 minute producer A/B test sample
Larger segment.bytes improves compression cohesion but coarsens deletion granularity
When retention.ms and retention.bytes are both set, deletion proceeds with whichever triggers first

Quick calculation note (shell)

# B=20MB/s, post-compression r=0.4, R=3, T=7 days
B=$((20*1024*1024))
r=0.4
R=3
T=$((7*24*3600))
echo "Disk ~= $(awk -v b=$B -v r=$r -v t=$T -v R=$R 'BEGIN{printf "%.1f GiB\n", b*r*t*R/1024/1024/1024}')"
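The two other formulas from this section, the retention.bytes upper bound and the replication traffic, can be sketched the same way. To stay in integer shell arithmetic, the compression ratio r is passed as a percentage here; the function names and the 24-partition topic are hypothetical.

```shell
# Companion estimates for the formulas in this section (a sketch; integer math, r given in percent).

# Upper bound on disk when retention.bytes is the limit: retention.bytes x P x R
# $1 = retention.bytes, $2 = partition count P, $3 = replicas R
disk_cap_bytes() { echo $(( $1 * $2 * $3 )); }

# Replication traffic between brokers: B x r x (R - 1), in bytes/sec
# $1 = write rate B in bytes/sec, $2 = post-compression ratio r in percent, $3 = replicas R
replication_bytes_per_sec() { echo $(( $1 * $2 / 100 * ($3 - 1) )); }

# Hypothetical topic: 10 GiB retention.bytes, 24 partitions, R=3
disk_cap_bytes 10737418240 24 3            # prints 773094113280 (720 GiB)
# Same inputs as the note above: B=20 MiB/s, r=0.4, R=3
replication_bytes_per_sec 20971520 40 3    # prints 16777216 (16 MiB/s)
```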

Cutting Costs via Producer/Consumer Settings

On the producer side, batching via linger.ms and batch.size, plus compression and the right acks, suppress wasteful retries and small-grained sends. The smaller the records, the larger the gain from batching, directly cutting network/disk overhead.

On the consumer side, fetch.min.bytes and fetch.max.wait.ms aggregate fetches, and max.partition.fetch.bytes and session.timeout.ms are tuned to the workload. Values that are too small increase RPC count and context switches, driving up CPU and network cost.

Producer: optimize compression.type, linger.ms, and batch.size as a set
Pair acks=all with appropriate retries/delivery.timeout.ms so that durability does not come at the cost of wasted retries
Consumer: start A/B testing fetch.min.bytes from somewhere between 64KB and 1MB
Tune max.in.flight.requests.per.connection based on ordering requirements and retry rate

Example consumer fetch optimization

props.put("fetch.min.bytes", "1048576");     // 1MB aggregation
props.put("fetch.max.wait.ms", "50");         // batching wait
props.put("max.partition.fetch.bytes", "5242880");
props.put("enable.auto.commit", "false");     // commit offsets manually after processing, so reprocessing stays idempotent
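To see how fetch.min.bytes and fetch.max.wait.ms interact, the sketch below approximates the number of fetch responses per second for one consumer reading from one broker. It relies on the broker answering when either fetch.min.bytes is available or fetch.max.wait.ms elapses, whichever comes first, and it ignores network round-trip time and per-partition size caps. The function name and the 5 MiB/s data rate are hypothetical.

```shell
# Approximate fetch responses per second for one consumer reading from one broker
# (a sketch; ignores network round-trip time and max.partition.fetch.bytes caps).
# $1 = incoming bytes/sec for this consumer, $2 = fetch.min.bytes, $3 = fetch.max.wait.ms
fetches_per_sec() {
  by_bytes=$(( $1 / $2 ))     # responses triggered by reaching fetch.min.bytes
  by_wait=$(( 1000 / $3 ))    # responses triggered by the wait timeout
  if [ "$by_bytes" -gt "$by_wait" ]; then echo "$by_bytes"; else echo "$by_wait"; fi
}

# Hypothetical 5 MiB/s stream with the settings above (1 MB minimum, 50 ms wait)
fetches_per_sec 5242880 1048576 50     # prints 20
# Same stream with a 500 ms wait
fetches_per_sec 5242880 1048576 500    # prints 5
```

In the first case the 50 ms timeout fires before 1 MB accumulates, so the aggregation target is not reached; this is the kind of interaction worth checking in an A/B test.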

Operational Monitoring and Automation (Frequently Tested in CCAAK)

For monitoring, continuously visualize disk usage (trended per topic/partition), replication lag (ISR, UnderReplicatedPartitions, Follower lag), network egress, compaction metrics (log cleaner progress/JMX), and request latency. Some retention and compaction settings take effect immediately, so always track behavior after a dynamic change.

For automation, suppress noisy neighbors with quotas (produce/consume bandwidth limits per client-id), and pair storage thresholds with alerts and write controls. CCAAK tests your understanding of dynamic settings (kafka-configs.sh), safe rolling application, and configuration precedence (topic-level overrides take precedence over broker-level defaults, and client-side settings such as the producer's compression.type apply where the topic defers to them).

Key JMX example: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec/BytesOutPerSec
Cleaner threads: tune log.cleaner.threads up or down in light of disk I/O
Quotas: rate-limit produce and consume bandwidth per client-id

Example dynamic settings for quotas/cleaner threads

# Throttle produce bandwidth per client-id (overload prevention)
kafka-configs.sh --bootstrap-server <broker> --alter \
  --add-config 'producer_byte_rate=1048576' --entity-type clients --entity-name appA

# Log cleaner threads (broker restart may be required; check whether dynamic change is supported in your environment)
# server.properties: log.cleaner.threads=2

Check Your Understanding

CCAAK

Question 1

For an audit-driven requirement, you must always retain the latest version per key while cutting older values as much as possible to reduce storage cost. Which topic configuration is most appropriate?

A. Set cleanup.policy=compact only, and tune segment.bytes and min.cleanable.dirty.ratio
B. Set only cleanup.policy=delete with retention.ms=7d
C. Set compression.type=gzip on the producer (leave topic settings at defaults)
D. Set cleanup.policy=compact,delete with retention.ms set to a few hours

Correct answer: A

Forcing latest-version retention requires log compaction. With delete alone, the latest version can be lost once it exceeds retention. Combining compact and delete with a short retention risks losing the latest version because compaction cannot keep up. Changing only the compression scheme cannot satisfy the retention requirement.

Frequently Asked Questions

Can compression be set on the broker as well? Where is the right place to decide it?

Set compression.type to "producer" (the default) at the topic level. Producers do the actual compression, and keeping the same algorithm end-to-end is the most efficient path. Avoid re-compression at the broker since it adds CPU and latency cost.

If I combine compact and delete, will the latest version always be retained?

If you need a guaranteed latest-version retention, compact alone is safest. With compact,delete combined, a short retention can cause delete to remove old segments before compaction finishes, which may break the latest-version guarantee.

What are the minimum safety measures when increasing partition count?

Scale up gradually and monitor broker disk usage, file descriptors, replication lag, and GC. Execute during a maintenance window to absorb the network/disk load of rebalancing, and tighten throttling (quotas) temporarily if needed. Since you cannot shrink partition count, do not over-provision in advance.

Author

NicheeLab Editorial Team

NicheeLab editorial team focused on data engineering and cloud certification learning. Content is structured around practical study needs and official exam domains.
