Tuning Linux for Kubernetes

Running Kubernetes is practically the norm. Every major cloud provider has built its own branded Kubernetes service: Google GKE, AWS EKS, Azure AKS, Oracle OKE, and many others. Sometimes, though, you just want to run Kubernetes on-premises. The reason could be cost, security, or just plain control over your environment. If you do run Kubernetes on-prem, you have the opportunity to eke as much performance as you can out of your hardware.

Some Linux distributions provide tuned profiles for specific workloads, including Kubernetes. These profiles adjust various kernel parameters and settings based on the workload requirements. You can install and activate a tuned profile specific to Kubernetes to apply the necessary tunings automatically, which gives you a bit of a head start. Keep in mind that this is still a generic, one-size-fits-all configuration, so additional tuning may still be needed. These are some of the settings that I usually tweak.

CPU and Memory

Disable swap. Kubernetes requires a lot of memory to operate efficiently, and swapping to disk can significantly degrade performance. It's recommended to disable swap altogether. Remove or comment out the swap entry in the /etc/fstab file so that swap stays off after a reboot, and run swapoff -a to turn it off immediately.
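
Editing /etc/fstab by hand is easy to get wrong, so here is a minimal sketch of how the commenting-out step could be scripted. The function name comment_swap_entries is my own, and it assumes a standard fstab layout where the third field is the file system type. It keeps a .bak copy next to the file.

```shell
# Comment out active swap entries in an fstab-style file.
# Hypothetical helper; assumes the third field is the file system type.
comment_swap_entries() {
    fstab="$1"
    # Keep a backup copy next to the file before editing it.
    cp "$fstab" "$fstab.bak" || return 1
    # A swap entry has "swap" as its third field.
    # Lines that are already comments are left alone.
    awk '$1 !~ /^#/ && $3 == "swap" { print "#" $0; next } { print }' \
        "$fstab.bak" > "$fstab"
}

# Typical use, as root:
#   swapoff -a
#   comment_swap_entries /etc/fstab
```

Try it against a copy of your fstab first and compare the result with diff before touching the real file.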

Enable the cgroup memory controller. Kubernetes uses cgroups to manage resource allocation, and the memory controller is what lets it account for and limit the memory each container uses. Many distributions enable it by default. Where it is disabled, you can enable it by adding cgroup_enable=memory to the kernel command line in your bootloader configuration.
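
Before editing the bootloader, it helps to know whether the flag is already there. This is a small sketch that checks a kernel command line for an exact argument. The name has_kernel_arg is hypothetical, and by default it reads /proc/cmdline, which holds the command line of the running kernel.

```shell
# Return success if the given argument is present on a kernel command line.
# The optional second argument is the file to read (default: /proc/cmdline).
has_kernel_arg() {
    arg="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Split the command line into one argument per line and look for an
    # exact, whole-line match.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$arg"
}

# Typical use:
#   has_kernel_arg cgroup_enable=memory && echo "flag is set"
```

On GRUB-based systems the flag typically goes into the GRUB_CMDLINE_LINUX variable in /etc/default/grub, followed by regenerating the GRUB configuration and rebooting. Check your distribution's documentation for the exact steps.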

Networking

Increase connection queue sizes. Kubernetes involves a significant number of network connections. You can make room for more pending connections and packets by raising net.core.somaxconn (the accept queue limit for listening sockets), net.ipv4.tcp_max_syn_backlog (the queue of half-open connections), and net.core.netdev_max_backlog (the per-CPU queue of received packets waiting to be processed). For example, you can set net.core.somaxconn to a higher value like 65535.
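
As a sketch, the three queue settings could be kept together in their own drop-in file. Only the 65535 for net.core.somaxconn comes from the text above. The other two values and the file name are illustrative assumptions, not recommendations.

```shell
# Write the three queue-related settings to a sysctl drop-in file.
# Only the somaxconn value comes from the article; the other two values
# are illustrative assumptions.
write_backlog_dropin() {
    dropin="$1"
    cat > "$dropin" <<'EOF'
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 8192
net.core.netdev_max_backlog = 16384
EOF
}

# Typical use, as root (the file name is hypothetical):
#   write_backlog_dropin /etc/sysctl.d/90-k8s-backlog.conf
#   sysctl --system
```

Keeping related settings in one file under /etc/sysctl.d makes it easier to see later what was changed and why.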

Enable TCP window scaling. Window scaling allows receive windows larger than 64 KB, which can improve throughput on fast or high-latency links. It is on by default in modern kernels. You can confirm or enable it by setting net.ipv4.tcp_window_scaling to 1.

Kubernetes involves communication between various components, and adjusting TCP keepalive settings can help detect and clean up stale connections. You can adjust parameters like net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_probes, and net.ipv4.tcp_keepalive_intvl to optimize keepalive behavior.
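
To see what the three keepalive parameters mean together, it helps to work out how long a dead peer can go unnoticed. The sketch below adds the idle time before the first probe to one interval per unanswered probe. The function name is my own, and the second set of values is purely illustrative.

```shell
# Approximate worst-case time, in seconds, before the kernel gives up on an
# idle peer that has stopped responding: the idle time before the first
# probe plus one interval for every unanswered probe.
keepalive_detect_seconds() {
    idle_time="$1"   # net.ipv4.tcp_keepalive_time
    probes="$2"      # net.ipv4.tcp_keepalive_probes
    interval="$3"    # net.ipv4.tcp_keepalive_intvl
    echo $(( idle_time + probes * interval ))
}

keepalive_detect_seconds 7200 9 75   # usual kernel defaults: prints 7875
keepalive_detect_seconds 600 5 15    # illustrative tighter values: prints 675
```

With the usual defaults, a stale connection can linger for more than two hours. Keep in mind that these settings only affect sockets that have keepalive enabled by the application.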

Enable TCP Fast Open. TCP Fast Open is a mechanism that allows for faster connection establishment by sending data in the initial SYN packet. Enabling TCP Fast Open can reduce latency for certain workloads. You can enable it by setting net.ipv4.tcp_fastopen to 3, which turns it on for both outgoing and incoming connections.

Several TCP/IP kernel parameters can be adjusted to optimize network performance. Some commonly tuned parameters include net.core.rmem_max, net.core.wmem_max, net.ipv4.tcp_rmem, net.ipv4.tcp_wmem, and net.ipv4.tcp_mtu_probing. These parameters control the receive and transmit buffer sizes, as well as TCP path MTU probing behavior. The memory settings are measured in bytes. Be careful not to set the values too high, or they can consume valuable system memory.

These values can be changed live on a running node before making the settings permanent.

sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
sysctl -w net.ipv4.tcp_wmem="4096 87380 8388608"

To make the changes persistent across reboots, add the corresponding lines to your /etc/sysctl.conf file. Remember to run sysctl -p to reload the configuration from the file.

net.core.rmem_max=8388608
net.core.wmem_max=8388608
net.ipv4.tcp_rmem=4096 87380 8388608
net.ipv4.tcp_wmem=4096 87380 8388608
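
One rough way to sanity-check a maximum buffer size is the bandwidth-delay product: the number of bytes that can be in flight on a link at once. This is a sketch of that arithmetic, not a rule from any Kubernetes guide. The function name and the example link speeds and round-trip times are my own assumptions.

```shell
# Bandwidth-delay product in bytes for a link speed in Mbit/s and a
# round-trip time in milliseconds.
#   bytes = (Mbit/s * 1,000,000 / 8) * (ms / 1000) = Mbit/s * ms * 125
bdp_bytes() {
    mbit_per_s="$1"
    rtt_ms="$2"
    echo $(( mbit_per_s * rtt_ms * 125 ))
}

bdp_bytes 10000 2   # 10 Gbit/s at 2 ms RTT: prints 2500000
bdp_bytes 1000 50   # 1 Gbit/s at 50 ms RTT: prints 6250000
```

Both example results fit within the 8388608-byte maximum used above. If your own numbers come out much larger, that is a hint the maximum may be worth raising. If they are much smaller, the extra headroom is probably not buying you anything.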

The ring parameters on a network interface refer to the settings that control the size and behavior of the receive (RX) and transmit (TX) rings. These rings are data structures used by the network driver to buffer incoming and outgoing network packets. The ring parameters can affect network performance and behavior, especially in high-throughput or low-latency scenarios. Use the ethtool command to check the current settings for your network interfaces.

ethtool -g <interface>

For example:

$ ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             18139
RX Mini:        0
RX Jumbo:       0
TX:             2560
Current hardware settings:
RX:             9362
RX Mini:        0
RX Jumbo:       0
TX:             170

To modify the buffer sizes, use the following commands:

ethtool -G <interface> rx <buffersize>
ethtool -G <interface> tx <buffersize>

For example:

ethtool -G eth0 rx 18139
ethtool -G eth0 tx 2560
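
If you have many nodes or interfaces, copying the maximums by hand gets tedious. Here is a minimal sketch that pulls the pre-set maximum RX and TX values out of ethtool -g output. The function name ring_maximums is hypothetical, and it assumes the output layout shown in the example above.

```shell
# Read `ethtool -g` output on stdin and print the pre-set maximum RX and TX
# ring sizes as "RX TX". Assumes the layout shown in the example above.
ring_maximums() {
    awk '
        /^Pre-set maximums:/          { in_max = 1; next }
        /^Current hardware settings:/ { in_max = 0 }
        in_max && $1 == "RX:"         { rx = $2 }
        in_max && $1 == "TX:"         { tx = $2 }
        END                           { print rx, tx }
    '
}

# Typical use, as root (eth0 is just an example interface):
#   set -- $(ethtool -g eth0 | ring_maximums)
#   ethtool -G eth0 rx "$1" tx "$2"
```

Settings applied with ethtool -G are generally lost on reboot unless your distribution's network configuration reapplies them, so plan to persist them there once you are happy with the values.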

Files and Processes

Increase file descriptor limits. Kubernetes uses many file descriptors, so it's beneficial to increase the file descriptor limits to avoid reaching the system limits. A common recommendation is to set both the soft limit and the hard limit to at least 102,400. These values should be sufficient for most Kubernetes clusters. However, for larger deployments or clusters with heavy workloads, you may need to further increase these limits. The limits are set in the /etc/security/limits.conf file:

*         soft    nofile      102400
*         hard    nofile      102400

These lines set the soft and hard limits for the maximum number of open file descriptors (nofile) to 102,400 for all users except root, which the wildcard does not cover and which needs its own entries. After making the changes, you may need to log out and log back in for the changes to take effect. You can verify the new limits using the ulimit -n command.
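
One caveat: limits.conf is applied to login sessions, while services started by systemd normally take their limit from the LimitNOFILE setting of their unit instead. So it is worth checking what a running process actually got. This sketch reads the soft and hard limits from a /proc/<pid>/limits file. The function name is my own.

```shell
# Print the soft and hard open-file limits found in a /proc/<pid>/limits
# file as "SOFT HARD".
nofile_limits() {
    awk '/^Max open files/ { print $4, $5 }' "$1"
}

# Typical use (assumes a process named kubelet is running on the node):
#   nofile_limits "/proc/$(pgrep -o kubelet)/limits"
#   nofile_limits "/proc/$$/limits"    # the current shell
```

If the kubelet or the container runtime reports a lower number than you expected, look at the LimitNOFILE value in its systemd unit rather than at limits.conf.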

Disk and File Systems

The I/O scheduler determines how disk reads and writes are prioritized and scheduled. For Kubernetes nodes, it's often recommended to use the "deadline" or "noop" I/O scheduler. On current kernels, which use the multi-queue block layer, these appear as "mq-deadline" and "none". The deadline scheduler is a good choice for most workloads, while noop (or none) is suitable for SSDs or when an external storage system handles scheduling. Check and change the I/O scheduler through the /sys/block/<device>/queue/scheduler file, substituting the appropriate device name in the file path.
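
The scheduler file lists every available scheduler on one line and wraps the active one in square brackets. As a small sketch, this helper prints just the active one. The function name is hypothetical, and sda in the usage lines is only an example device.

```shell
# Print the active I/O scheduler from a queue/scheduler file. The active
# scheduler is the one in square brackets, for example:
#   [mq-deadline] kyber bfq none
active_scheduler() {
    tr ' ' '\n' < "$1" | sed -n 's/^\[\(.*\)\]$/\1/p'
}

# Typical use (sda is an example device; writing requires root):
#   active_scheduler /sys/block/sda/queue/scheduler
#   echo mq-deadline > /sys/block/sda/queue/scheduler
```

Writing to the file changes the scheduler immediately but does not survive a reboot. A udev rule is one common way to make the choice permanent.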

Certain kernel parameters can be adjusted to optimize Kubernetes performance. Some commonly recommended changes include raising the maximum process ID, and with it the number of processes and threads the node can run (kernel.pid_max), and increasing the maximum number of inotify watches per user (fs.inotify.max_user_watches).
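
Before raising anything, it is worth checking what the node currently has. This sketch compares a sysctl value against a wanted minimum by reading it from /proc/sys. The function name, the SYSCTL_ROOT override, and the thresholds in the usage lines are my own assumptions, so pick minimums that match your workload.

```shell
# Report whether a single-integer sysctl value is at least the wanted
# minimum. The key is mapped to its file under /proc/sys by replacing dots
# with slashes. SYSCTL_ROOT (hypothetical) can point elsewhere for testing.
check_sysctl_min() {
    key="$1"
    wanted="$2"
    file="${SYSCTL_ROOT:-/proc/sys}/$(printf '%s' "$key" | tr . /)"
    current=$(cat "$file") || return 2
    if [ "$current" -ge "$wanted" ]; then
        echo "$key = $current (ok, wanted >= $wanted)"
    else
        echo "$key = $current (below $wanted)"
        return 1
    fi
}

# Typical use (the thresholds are illustrative assumptions):
#   check_sysctl_min kernel.pid_max 4194304
#   check_sysctl_min fs.inotify.max_user_watches 524288
```

Anything that reports "below" can then be raised with sysctl -w and made permanent the same way as the network settings earlier.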

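When mounting disks or file systems, the right mount options can improve performance, though some of them trade away safety. Commonly cited options for Kubernetes nodes include noatime (disable access time updates; this already covers directories, so nodiratime is redundant alongside it), discard (enable online TRIM/discard for SSDs), and, on ext4, barrier=0 (disable disk write barriers) and data=writeback (journal metadata only, without ordering data writes). The last two can speed up writes but risk file system corruption or stale data after a crash or power loss, so use them only where the storage has a protected write cache or the data is disposable. You can specify these options in the /etc/fstab file.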

Conclusion

Remember that there is no single "correct" configuration for tuning a Kubernetes environment. The end configuration depends on the running workload and the available resources, and it may evolve over time as the workload changes. I recommend implementing consistent logging and monitoring practices across your environment. Well-developed applications will log when they are approaching or exceeding configured resource limits.
