Rightsizing Kafka


The performance and reliability of a Kafka cluster depend largely on how brokers and partitions are sized and balanced. Brokers provide the underlying compute, storage, and network capacity, while partitions define parallelism and data distribution. Sizing both together is essential to achieving scalability without unnecessary operational overhead.

The number of brokers primarily affects fault tolerance and overall capacity. Production clusters typically require at least three brokers to support replication and leader election. As data volume and throughput increase, adding brokers expands disk I/O, network bandwidth, and storage capacity. However, more brokers also increase operational complexity, so clusters are usually sized to keep resource utilization comfortably below peak limits, allowing room for failover and traffic spikes.

Partitions determine how work is parallelized across producers and consumers. Each partition can be processed by only one consumer in a consumer group at a time, which makes partition count a key factor in consumer scalability. More partitions increase throughput and parallelism but also add overhead in memory usage, file handles, and recovery time. Ordering guarantees are preserved only within a single partition, so partitioning strategies must also consider data ordering requirements.

Brokers and partitions must be balanced together. Too few partitions on many brokers can lead to underutilized resources, while too many partitions on too few brokers can cause instability and slow leader elections or rebalances. Partitions should be evenly distributed across brokers, and replication factors should align with the broker count to maintain resilience.

Kafka clusters are designed with moderate headroom and reviewed periodically as workloads evolve. It is generally easier to add brokers or partitions later than to correct an overly aggressive design. Starting with a balanced, conservative configuration and scaling based on observed usage is a best practice.

The Sizing Process

Determining the proper size of a Kafka cluster starts with estimating data throughput and retention requirements. This includes calculating peak ingress and egress rates (MB/s), expected message sizes, and daily data volume. Total required disk capacity is derived from the daily data volume, the retention period, and the replication factor.

Storage Capacity

To calculate the total storage required, use the following formula:

Stotal = (Vdaily * Rperiod * Frep) * (1 + Hdisk)

  • Stotal: Storage total volume (GB)
  • Vdaily: Daily data volume (GB)
  • Rperiod: Retention period in days
  • Frep: Replication factor
  • Hdisk: Headroom factor (typically 0.20 to 0.30)
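The formula translates directly into code. Here is a minimal Python sketch; the function name, the 0.25 default headroom, and the example figures are illustrative, not prescriptive:

```python
def total_storage_gb(v_daily_gb, retention_days, replication_factor, headroom=0.25):
    """S_total = (V_daily * R_period * F_rep) * (1 + H_disk)."""
    return v_daily_gb * retention_days * replication_factor * (1 + headroom)

# Example: 500 GB/day, 7-day retention, replication factor 3, 25% headroom
print(total_storage_gb(500, 7, 3))  # → 13125.0 GB
```

Note that the headroom is applied on top of the replicated volume, so it also buffers the temporary duplication that occurs during partition reassignment.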

The next step is validating broker-level resource capacity, particularly disk I/O, network bandwidth, and CPU. Network capacity calculations should include inbound producer traffic, outbound consumer traffic, and inter-broker replication, which can double write-related network usage.

Network Bandwidth

Total network load is calculated as follows:

Btotal = Ipeak + Epeak + (Ipeak * (Frep - 1))

  • Btotal: Bandwidth total rate (MB/s)
  • Ipeak: Peak ingress rate (MB/s)
  • Epeak: Peak egress rate (MB/s)
  • Frep: Replication factor
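A quick sketch of the same calculation in Python (names and example rates are illustrative). Note how replication traffic adds Frep − 1 extra copies of every write to the inter-broker network:

```python
def total_bandwidth_mbps(i_peak, e_peak, replication_factor):
    """B_total = I_peak + E_peak + I_peak * (F_rep - 1)."""
    return i_peak + e_peak + i_peak * (replication_factor - 1)

# Example: 100 MB/s producer ingress, 300 MB/s consumer egress, replication factor 3
# Replication adds 100 * (3 - 1) = 200 MB/s of inter-broker traffic.
print(total_bandwidth_mbps(100, 300, 3))  # → 600 MB/s
```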

CPU sizing depends on message rates, compression type, and features such as encryption or schema validation. It is typically validated by ensuring average utilization remains well below saturation during peak loads.

Partition Sizing and Parallelism

The number of partitions is derived from required throughput per topic and the sustainable throughput per partition on the target hardware.

Ptopic = max ( Ttotal / Tp , Ttotal / Tc )

  • Ptopic: Number of partitions for the topic
  • Ttotal: Total required throughput for the topic
  • Tp: Maximum throughput a single producer can achieve
  • Tc: Maximum throughput a single consumer can process
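In code, the result should be rounded up, since partition counts are whole numbers. A minimal sketch with illustrative throughput figures:

```python
import math

def partitions_for_topic(t_total, t_producer, t_consumer):
    """P_topic = max(ceil(T_total / T_p), ceil(T_total / T_c))."""
    return max(math.ceil(t_total / t_producer), math.ceil(t_total / t_consumer))

# Example: topic needs 250 MB/s; each producer sustains 25 MB/s,
# each consumer processes 10 MB/s → the consumer side dominates.
print(partitions_for_topic(250, 25, 10))  # → 25 partitions
```

Because the consumer side usually has the lower per-instance throughput, it tends to drive the partition count in practice.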

Consumer group sizing must be aligned so that the number of partitions supports the maximum desired consumer parallelism. Total partition count across the cluster must remain within safe operational limits per broker to avoid excessive metadata and long recovery times.

Finally, resilience and failure scenarios are factored into the calculations. The cluster must tolerate the loss of at least one broker while maintaining acceptable performance. This means verifying that remaining brokers can absorb the additional partition leaders, traffic, and storage load without exceeding safe utilization thresholds.

Broker Failure Impact

Use this calculation to estimate the load on remaining nodes during a failure:

Ufail = ( Ucurrent * N ) / (N - 1)

  • Ufail: Expected utilization after losing one broker
  • Ucurrent: Current resource utilization (%)
  • N: Total number of brokers in the cluster

Any utilization over 100% after a broker failure means that you are overcommitted, and the remaining brokers will not be able to absorb the load of the failed broker. This formula also assumes that all brokers are equally sized and carrying an evenly split workload.
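The same check, expressed as a small Python function (names and example figures are illustrative):

```python
def failover_utilization(u_current_pct, n_brokers):
    """U_fail = (U_current * N) / (N - 1).

    Assumes equally sized brokers carrying an evenly split workload.
    """
    return (u_current_pct * n_brokers) / (n_brokers - 1)

# Example: 4 brokers each running at 60% utilization
print(failover_utilization(60, 4))  # → 80.0% after one failure: still safe

# Example: 3 brokers each at 70% utilization
print(failover_utilization(70, 3))  # → 105.0%: overcommitted
```

The second example illustrates why the checklist below recommends keeping peak utilization around 60–70%: small clusters lose a large fraction of capacity when a single broker fails.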

Rightsizing Checklist

Scaling Kafka is a complex task. I find it easier to work from a checklist.

Data Volume & Retention

  • Estimate peak and average producer throughput (MB/s).
  • Calculate daily data volume (message size × messages per second × time).
  • Confirm retention period (time- or size-based).
  • Multiply total storage by the replication factor.
  • Add at least 20–30% disk headroom for rebalancing and uneven load.
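The daily-volume step in the checklist above is simple arithmetic; a quick sketch (decimal GB, figures illustrative):

```python
def daily_volume_gb(avg_msg_bytes, msgs_per_sec):
    """Daily data volume: message size * messages/sec * seconds per day, in GB."""
    return avg_msg_bytes * msgs_per_sec * 86_400 / 1e9

# Example: 1 KB messages at 5,000 messages per second
print(daily_volume_gb(1_000, 5_000))  # → 432.0 GB/day (before replication)
```

This pre-replication figure is the Vdaily input to the storage formula given earlier.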

Broker Capacity

  • Verify broker count supports the replication factor (minimum 3 for production).
  • Check disk I/O can sustain writes, reads, and replication traffic.
  • Ensure network bandwidth covers producer ingress, consumer egress, and inter-broker replication.
  • Validate CPU capacity based on message rate, compression, and security features.
  • Keep peak resource utilization below safe thresholds (≈60–70%).

Partitions & Parallelism

  • Calculate required partitions based on peak throughput per topic.
  • Ensure partitions support maximum consumer parallelism.
  • Verify partitions are evenly distributed across brokers.
  • Confirm total partitions per broker remain within operational limits.

Resilience & Failure Scenarios

  • Validate the cluster can handle at least one broker failure.
  • Ensure remaining brokers can absorb leader and traffic shifts.
  • Confirm disk and network capacity remain within limits during failover.
  • Check recovery and rebalance times are acceptable.

Validation & Operations

  • Load test with realistic peak traffic.
  • Monitor consumer lag, disk usage, and partition skew.
  • Revisit sizing assumptions as workloads and retention change.

Building and maintaining a Kafka cluster requires a rigorous approach to resource allocation and architectural design. By aligning broker capacity with partition density and throughput requirements, organizations can ensure high availability and consistent performance under peak loads. Continuous monitoring of metrics such as consumer lag and disk utilization allows teams to refine these configurations as data patterns shift. Adopting a conservative starting point and scaling incrementally remains the most effective strategy for managing production environments without compromising stability. What other factors do you take into consideration?
