Dealing with OOM Killed Pods

When memory on a Linux system runs out, performance degrades and the system becomes unstable. Kubernetes nodes are no exception. To free memory and stabilize a node, the kernel's Out Of Memory (OOM) Killer begins killing processes. These can be system-level processes or your running apps. When this happens, Kubernetes records OOMKilled as the reason the container terminated, and you will find OOMKilled messages like the following in your logs.

time="2023-07-16T10:06:16Z" level=fatal msg="OOMKilled" pod="web-api-v1" container="web-api" 
reason="OOMKilled" message="Out of memory: Kill process 1234 (web-api) score 983 or sacrifice child"

The OOM Killer in Linux decides which process to kill using a per-process value called the OOM score (oom_score), which the kernel calculates for every process. The higher the oom_score, the more likely the process is to be killed. On current kernels the score is driven almost entirely by memory footprint: the kernel adds up the process's Resident Set Size (RSS, the physical memory it is using), its swap usage, and its page tables, and weighs the total against the memory available. Older kernels also factored in things like child processes and how long the process had been running, but those heuristics have been removed.

Each process also has an oom_score_adj value, ranging from -1000 to 1000, which is added to the score. A system administrator can adjust it manually, although most admins do not. Higher values make a process a more likely target, and -1000 exempts it from the OOM Killer entirely.

When the kernel cannot reclaim enough memory any other way, the OOM Killer kills the eligible process with the highest score. It is a last resort measure to keep the system from crashing. If there is no process left that it is allowed to kill, the kernel panics. The OOM score is a heuristic, not a foolproof mechanism: in certain scenarios the process it selects is not the one actually causing the memory pressure, which can lead to disruption or instability.
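To see which processes the OOM Killer would currently favor, you can read the scores straight from /proc. The following is a minimal sketch, not a polished tool: the function name read_oom_scores and its proc_root parameter are my own, and it assumes a Linux host where /proc/<pid>/oom_score, oom_score_adj, and comm are readable.

```python
from pathlib import Path


def read_oom_scores(proc_root="/proc"):
    """Return (pid, oom_score, oom_score_adj, name) tuples, highest score first.

    proc_root is a parameter only so the function can be pointed at a
    different directory; on a real node it is /proc.
    """
    rows = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            score = int((entry / "oom_score").read_text())
            adj = int((entry / "oom_score_adj").read_text())
            name = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # the process exited or the files are unreadable
        rows.append((int(entry.name), score, adj, name))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

Calling read_oom_scores()[:5] on a node lists the five processes most likely to be chosen at that moment.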

In a Kubernetes cluster, each node runs multiple pods, and each pod contains one or more containers. A container's process can be OOM killed in two ways. First, if a container uses more memory than its own memory limit, the kernel invokes the OOM Killer inside that container's cgroup, and only processes in that container are candidates. Second, if the node as a whole runs out of memory, the OOM Killer chooses from every process on the node. The kernel has no notion of pods; it sees only processes, their memory usage, and their oom_score_adj values. In that case the victim can belong to any pod on the node, not necessarily the one responsible for the memory pressure. The kubelet does influence the choice: it sets oom_score_adj on each container's processes according to the pod's Quality of Service (QoS) class, described below, so that pods with stronger resource guarantees are killed last.
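The kubelet steers the node-level OOM Killer by setting oom_score_adj on container processes according to the pod's QoS class. The sketch below follows the formula published in the Kubernetes node-pressure eviction documentation; the function name is hypothetical, and the kubelet's real implementation may differ in edge cases (for example, it also treats node-critical pods specially).

```python
def kubelet_oom_score_adj(qos_class, memory_request_bytes, node_capacity_bytes):
    """Approximate the oom_score_adj assigned to a container's processes.

    Based on the documented formula, not on the kubelet source code.
    """
    if qos_class == "Guaranteed":
        return -997  # killed only as a last resort
    if qos_class == "BestEffort":
        return 1000  # first in line
    # Burstable: the larger the memory request relative to the node,
    # the lower the adjustment, so the less likely the kill.
    adj = 1000 - (1000 * memory_request_bytes) // node_capacity_bytes
    return min(max(2, adj), 999)


MIB = 2**20
GIB = 2**30
# A Burstable container requesting 128Mi on a node with 8Gi of memory.
print(kubelet_oom_score_adj("Burstable", 128 * MIB, 8 * GIB))  # prints 985
```

A container that requests a larger share of the node's memory gets a lower adjustment, which is one more reason to set realistic requests.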

The impact on running pods depends on how the application handles abrupt process termination. The OOM Killer sends SIGKILL, so the process gets no chance to shut down cleanly, and Kubernetes records the container as terminated with reason OOMKilled (exit code 137). Ideally the application is resilient to this kind of failure and has mechanisms in place to recover; however, the outcome may not always be graceful.

What happens next is governed by the pod's restart policy. With Always (the default) or OnFailure, the kubelet on the node restarts the terminated container in place, with an increasing back-off delay if it keeps failing. Only the killed container is restarted; the pod is not rescheduled, and its other containers keep running. With Never, the container stays terminated.

If the terminated process is critical to the application, expect errors, service disruptions, or degraded performance until the container is back up and the application has recovered.
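To confirm that a restart was caused by the OOM Killer, check the container statuses in the pod's status. Here is a small sketch that scans the JSON produced by kubectl get pod <name> -o json. The function name find_oom_killed is made up for this example; the fields it reads (containerStatuses, state, lastState, terminated.reason, restartCount) are part of the standard pod status.

```python
def find_oom_killed(pod):
    """Return one record per container whose current or last state is OOMKilled.

    pod is the parsed JSON of a single Pod object.
    """
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # A restarted container reports the kill under lastState;
        # one that was not restarted reports it under state.
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                hits.append({
                    "container": status["name"],
                    "restarts": status.get("restartCount", 0),
                    "exit_code": terminated.get("exitCode"),
                })
                break  # report each container once
    return hits
```

Feed it the result of json.load() on the kubectl output. An empty list means no container in that pod currently reports an OOM kill.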

To minimize the impact of the OOM Killer, configure appropriate resource limits for your containers and give them enough memory for their workload. Building resilience into the application, implementing health checks, and adding fault tolerance mechanisms also help mitigate the impact of process terminations and improve overall stability.

With some research and a few minor configuration changes, we can attempt to prevent the OOM Killer from being triggered. Memory requests and limits are set per container, in the resources section of the pod manifest. The resources section has two keys: requests and limits. The request is the amount of memory the scheduler reserves for the container when it chooses a node. The limit is the maximum the container is allowed to use; a container that goes over its limit is OOM killed. These are effectively the lower and upper bounds for memory consumption.

apiVersion: v1
kind: Pod
metadata:
  name: web-api-v1
spec:
  containers:
  - name: memory-hungry-api
    image: nginx
    resources:
      requests:
        memory: 128Mi
      limits:
        memory: 256Mi

Adjust the requests and limits based on the actual requirements and behavior of your application. Analyze historical usage patterns to determine appropriate allocations, and account for the memory overhead of the application, its background processes, and system components. If you already have Prometheus and Grafana installed in your environment, use them to monitor memory, CPU, and other relevant metrics. Look for pods or containers that frequently run close to their memory limits or are repeatedly OOM killed.
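One way to turn historical usage into numbers is to take a high percentile as the request and add headroom over the observed peak for the limit. The sketch below is only a starting heuristic of my own, not a Kubernetes rule: the function name, the 90th percentile, and the 25% headroom are all assumptions you should tune for your workload.

```python
import math


def suggest_memory_mib(samples_mib, request_percentile=90, limit_headroom=1.25):
    """Suggest a memory request and limit from usage samples given in MiB.

    request: nearest-rank percentile of the samples (default 90th).
    limit:   the observed peak multiplied by a headroom factor.
    """
    if not samples_mib:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples_mib)
    # Nearest-rank percentile, computed with integer arithmetic.
    rank = max(1, (request_percentile * len(ordered) + 99) // 100)
    request = math.ceil(ordered[rank - 1])
    limit = math.ceil(ordered[-1] * limit_headroom)
    return {"request": f"{request}Mi", "limit": f"{limit}Mi"}
```

The samples could come from a Prometheus query exported to a list; the returned strings drop directly into the requests and limits fields of a manifest.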

Kubernetes automatically assigns every pod a Quality of Service (QoS) class based on the resource requests and limits of its containers. There is nothing to enable; you influence the class through the values you set. The QoS class is a good indicator of eviction order. When a node comes under memory pressure, the kubelet first evicts pods whose usage exceeds their requests, which in practice means BestEffort pods go first, then Burstable pods that are over their requests, and Guaranteed pods last. Eviction is separate from limits: a container that exceeds its own memory limit is OOM killed regardless of its QoS class.

  • Guaranteed: every container in the pod sets both CPU and memory requests and limits, and each request equals its limit. These pods are the last to be evicted under memory pressure.
  • Burstable: at least one container sets a CPU or memory request or limit, but the pod does not meet the Guaranteed criteria. These pods can be evicted under memory pressure, particularly when they use more memory than they requested. The example pod above, with a 128Mi request and a 256Mi limit, is Burstable.
  • BestEffort: no container sets any CPU or memory requests or limits. These pods have no memory guarantees and are the first candidates for eviction.
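The classification is easy to get wrong by eye, so here is a rough sketch of the rule as the Kubernetes documentation describes it: Guaranteed requires every container to have CPU and memory requests equal to its limits, BestEffort means nothing is set at all, and everything else is Burstable. The function name qos_class is hypothetical, and the sketch compares quantities as plain strings, so it assumes equal values are written the same way (it would not recognize "1" and "1000m" as equal).

```python
def qos_class(containers):
    """Classify a pod from its containers' resources blocks.

    containers is a list of dicts shaped like entries under spec.containers.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources") or {}
        limits = resources.get("limits") or {}
        # Kubernetes defaults a missing request to the limit.
        requests = {**limits, **(resources.get("requests") or {})}
        for name in ("cpu", "memory"):
            if name in requests or name in limits:
                any_set = True
            if name not in limits or requests.get(name) != limits[name]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# The example pod from earlier: 128Mi request, 256Mi limit, no CPU settings.
example = [{"resources": {"requests": {"memory": "128Mi"},
                          "limits": {"memory": "256Mi"}}}]
print(qos_class(example))  # prints Burstable
```

On a live cluster you do not need to compute this yourself: Kubernetes reports the assigned class in the pod's status.qosClass field.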

Apply resource quotas at the namespace level to cap the total resources that pods in a namespace can consume. This helps prevent resource hogging and keeps distribution fair among applications or teams. A ResourceQuota object can limit memory, CPU, and other resources. Its spec.hard field sets the ceilings: requests.memory caps the sum of memory requests across all pods in the namespace, and limits.memory caps the sum of their memory limits. The quota applies to the namespace named in its metadata. Once limits.memory is under quota, every new pod in that namespace must declare a memory limit (unless a LimitRange supplies a default) or it will be rejected. For example, the following ResourceQuota caps the combined memory limits of all pods in the hungry-namespace namespace at 512 MiB:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: hungry-resourcequota
  # A ResourceQuota is scoped to the namespace it is created in.
  namespace: hungry-namespace
spec:
  hard:
    # Sum of memory limits across all pods in the namespace.
    limits.memory: 512Mi
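Because a memory quota is enforced against the sum across pods, it is worth checking whether a planned set of pods fits before deploying them. The following sketch uses two hypothetical helpers, parse_memory and fits_quota. It handles only whole-number quantities with the common suffixes; Kubernetes accepts more forms (decimals, exponents) that this deliberately ignores.

```python
import re

# Multipliers for the memory suffixes handled by this sketch.
_UNITS = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def parse_memory(quantity):
    """Convert a quantity such as '512Mi' or '1G' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti|k|M|G|T)?", quantity)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit or ""]


def fits_quota(pod_memory_limits, quota_limit):
    """Return True if the summed pod memory limits stay within the quota."""
    total = sum(parse_memory(q) for q in pod_memory_limits)
    return total <= parse_memory(quota_limit)


# Two pods with 256Mi limits exactly fill a 512Mi quota.
print(fits_quota(["256Mi", "256Mi"], "512Mi"))  # prints True
```

Note the difference between Mi (2^20 bytes) and M (10^6 bytes); mixing them up is a common source of quotas that are slightly smaller than intended.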

Configure Horizontal Pod Autoscaling (HPA) to scale the number of pod replicas automatically based on resource utilization. This lets your application adjust its capacity as load increases, spreading the work across more pods instead of exhausting the memory of a few. The HorizontalPodAutoscaler is built into Kubernetes, but it needs a metrics source such as metrics-server installed in the cluster. I will save this topic for a future post.

The TL;DR version is that the OOM Killer in Linux is a last resort for freeing memory and stabilizing a system. It is best to avoid triggering it by setting memory requests and limits. If you can, adding compute resources to your K8s cluster is always an option. Of course, this will cost money.
