The performance and reliability of a Kafka cluster depend largely on how brokers and partitions are sized and balanced. Brokers provide the underlying compute, storage, and network capacity, while partitions define parallelism and data distribution. Sizing both together is essential to achieving scalability without unnecessary operational overhead.
The number of brokers primarily affects fault tolerance and overall capacity. Production clusters typically require at least three brokers to support replication and leader election. As data volume and throughput increase, adding brokers expands disk I/O, network bandwidth, and storage capacity. However, more brokers also increase operational complexity, so clusters are usually sized to keep resource utilization comfortably below peak limits, allowing room for failover and traffic spikes.
Partitions determine how work is parallelized across producers and consumers. Each partition can be processed by only one consumer in a consumer group at a time, which makes partition count a key factor in consumer scalability. More partitions increase throughput and parallelism, but they also add overhead in memory usage, file handles, and recovery time. Ordering guarantees are preserved only within a single partition, so partitioning strategies must also consider data ordering requirements.
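To make the ordering point concrete, here is a minimal sketch using the kafka-python client; the broker address, the orders topic, and the customer key are illustrative assumptions rather than part of any specific setup. Because the default partitioner hashes the record key, events that share a key land in the same partition and keep their relative order.

```python
# Minimal sketch with kafka-python (assumes a broker at localhost:9092 and a
# hypothetical "orders" topic). Records sharing a key hash to one partition.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# All events for customer "c-42" go to the same partition and stay ordered;
# events for other customers may be interleaved across other partitions.
for step in ("created", "paid", "shipped"):
    producer.send("orders", key=b"c-42", value=step.encode("utf-8"))

producer.flush()
```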
Brokers and partitions must be balanced together. Too few partitions on many brokers can lead to underutilized resources, while too many partitions on too few brokers can cause instability and slow leader elections or rebalances. Partitions should be evenly distributed across brokers, and replication factors should align with the broker count to maintain resilience.
Kafka clusters are designed with moderate headroom and reviewed periodically as workloads evolve. It is generally easier to add brokers or partitions later than to correct an overly aggressive design. Starting with a balanced, conservative configuration and scaling based on observed usage is a best practice.
Verifying the proper size of a Kafka cluster starts with estimating data throughput and retention requirements. This includes calculating peak ingress and egress rates (MB/s), expected message sizes, and daily data volume. Total required disk capacity is derived from the daily data volume, the retention period, and the replication factor.
To calculate the total storage required, use the following formula:
S_total = (V_daily * R_period * F_rep) * (1 + H_disk)

where V_daily is the daily data volume, R_period is the retention period in days, F_rep is the replication factor, and H_disk is the fractional disk headroom reserved for growth and failover.
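As a quick sanity check, the formula can be wrapped in a small helper. This is a hypothetical sketch; the function name and the example figures (500 GB/day, 7-day retention, replication factor 3, 30% headroom) are assumptions chosen for illustration.

```python
def total_storage_gb(daily_volume_gb: float,
                     retention_days: int,
                     replication_factor: int,
                     disk_headroom: float) -> float:
    """S_total = (V_daily * R_period * F_rep) * (1 + H_disk)."""
    return daily_volume_gb * retention_days * replication_factor * (1 + disk_headroom)

# Example: 500 GB/day, 7-day retention, replication factor 3, 30% headroom.
print(total_storage_gb(500, 7, 3, 0.3))  # roughly 13,650 GB of raw disk across the cluster
```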
The next step is validating broker-level resource capacity, particularly disk I/O, network bandwidth, and CPU. Network capacity calculations should include inbound producer traffic, outbound consumer traffic, and inter-broker replication, which multiplies write-related network usage by the replication factor.
Total network load is calculated as follows:
B_total = I_peak + E_peak + (I_peak * (F_rep - 1))

where I_peak is peak producer ingress, E_peak is peak consumer egress, and F_rep is the replication factor.
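The same approach works for network sizing. The sketch below is illustrative; the 100 MB/s ingress, 200 MB/s egress, and replication factor 3 are assumed example values.

```python
def total_network_mb_per_s(ingress_peak: float,
                           egress_peak: float,
                           replication_factor: int) -> float:
    """B_total = I_peak + E_peak + I_peak * (F_rep - 1)."""
    return ingress_peak + egress_peak + ingress_peak * (replication_factor - 1)

# Example: 100 MB/s producer ingress, 200 MB/s consumer egress, replication factor 3.
print(total_network_mb_per_s(100, 200, 3))  # 500 MB/s cluster-wide
```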
CPU sizing depends on message rates, compression type, and features such as encryption or schema validation. It is typically validated by confirming that CPU utilization stays well below saturation during peak load.
The number of partitions for a topic is derived from the required topic throughput and the sustainable per-partition throughput on the target hardware, for both producers and consumers.

P_topic = max( (T_total / T_p), (T_total / T_c) )

where T_total is the required topic throughput, T_p is the achievable producer throughput per partition, and T_c is the achievable consumer throughput per partition.
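Here is a hedged sketch of that calculation, rounding up to a whole number of partitions; the per-partition throughput figures are assumptions that should come from benchmarks on the target hardware.

```python
import math

def partitions_for_topic(target_throughput: float,
                         producer_throughput_per_partition: float,
                         consumer_throughput_per_partition: float) -> int:
    """P_topic = max(T_total / T_p, T_total / T_c), rounded up."""
    return math.ceil(max(target_throughput / producer_throughput_per_partition,
                         target_throughput / consumer_throughput_per_partition))

# Example: 300 MB/s topic throughput, 10 MB/s per partition on the producer
# side, 20 MB/s per partition on the consumer side.
print(partitions_for_topic(300, 10, 20))  # 30 partitions
```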
Consumer group sizing must be aligned so that the number of partitions supports the maximum desired consumer parallelism. Total partition count across the cluster must remain within safe operational limits per broker to avoid excessive metadata and long recovery times.
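Both constraints can be checked with small helpers as well. The per-broker replica ceiling below is an assumed operational limit, not a Kafka constant; the appropriate value depends on your hardware and Kafka version.

```python
def replicas_per_broker(total_partitions: int,
                        replication_factor: int,
                        broker_count: int) -> float:
    # Each partition places one replica on replication_factor brokers.
    return (total_partitions * replication_factor) / broker_count

def max_consumer_parallelism(topic_partitions: int) -> int:
    # A consumer group can run at most one active consumer per partition.
    return topic_partitions

# Example: 3,000 partitions, replication factor 3, 6 brokers,
# assumed ceiling of 4,000 replicas per broker.
assert replicas_per_broker(3000, 3, 6) <= 4000
```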
Finally, resilience and failure scenarios are factored into the calculations. The cluster must tolerate the loss of at least one broker while maintaining acceptable performance. This means verifying that remaining brokers can absorb the additional partition leaders, traffic, and storage load without exceeding safe utilization thresholds.
Use this calculation to estimate the load on remaining nodes during a failure:
U_fail = (U_current * N) / (N - 1)

where U_current is the current average broker utilization and N is the number of brokers.
Any utilization over 100% after a broker failure means that you are overcommitted, and the remaining brokers will not be able to absorb the load of the failed broker. This formula also assumes that all brokers are equally sized and carrying an evenly split workload.
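Here is a sketch of the failure check, under the same assumption of equally sized brokers with an evenly split workload; the 6 brokers at 70% utilization are example figures.

```python
def utilization_after_failure(current_utilization: float, broker_count: int) -> float:
    """U_fail = (U_current * N) / (N - 1)."""
    return (current_utilization * broker_count) / (broker_count - 1)

# Example: 6 brokers running at 70% average utilization.
print(utilization_after_failure(0.70, 6))  # roughly 0.84, so one broker loss is absorbable
```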
Scaling Kafka is a complex task. I find it easier to work from a checklist.
Building and maintaining a Kafka cluster requires a rigorous approach to resource allocation and architectural design. By aligning broker capacity with partition density and throughput requirements, organizations can ensure high availability and consistent performance under peak loads. Continuous monitoring of metrics such as consumer lag and disk utilization allows teams to refine these configurations as data patterns shift. Adopting a conservative starting point and scaling incrementally remains the most effective strategy for managing production environments without compromising stability. What other factors do you take into consideration?