Guide to OOMKill Alerting in Kubernetes Clusters

Dragan Milic Monday 23 Nov 2020

Intro

RAM is most likely the first resource to be exhausted on your servers. If you’re serious about running software under Linux/Unix, you’re certainly aware of what an OOMKill is.

Short refresher: when a program requests a new memory page from the kernel, one of two things can happen.

There is a free memory page: The kernel assigns the page to the process and everything is great.
The system is Out Of Memory (OOM): The kernel chooses a process based on its ‘badness’ (mainly by how much RAM it uses) and sends it a SIGKILL. This forces the receiving process to exit with exit code 137. All the memory pages belonging to that process are freed, and the kernel can now fulfill the memory request.
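The 137 is not arbitrary: by shell convention, a process killed by a signal is reported with 128 plus the signal number, and SIGKILL is signal 9. A minimal sketch of that arithmetic, assuming a POSIX system (the helper name describe_exit_code is made up for illustration):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit code.

    Codes above 128 conventionally mean "killed by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number unknown on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited normally with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
    print(describe_exit_code(1))
```

On Linux, 137 decodes to SIGKILL, which is why this exit code is the usual fingerprint of an OOMKill.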

Lately, I had a task to add alerting to a sizeable Kubernetes cluster. The cluster has ~100 active Deployments and autoscales up to ~50 nodes at peak times. The cluster is well maintained and has a robust autoscaling strategy. All deployments have resource limits defined. Sometimes, some of the deployed pods would breach their memory limits. In those cases, it would be nice to find out when that happens and investigate the cause.

Prometheus and Alertmanager were already deployed, so I thought that alerting on OOMKills would be easy. I just had to find the right metric(s) indicating that an OOMKill had happened and write an alerting rule for it. Given the length of this post, you can imagine how wrong I was!

First Attempt

A brief Google search led me to kube-state-metrics. It turns out it has a metric called kube_pod_container_status_last_terminated_reason. The value of the metric is 1 when a container in a pod has terminated with an error. The reason label is set to OOMKilled if the exit code was 137. That sounded promising! So I created an alert for that.

As usual, things are rarely straightforward. Once the container has restarted, the value of this metric stays at 1, so on its own it does not tell you when a new OOMKill has happened. For alerting purposes, one has to combine it with another metric that changes when a pod restarts. kube_pod_container_status_restarts_total does that. Combine the two - and bingo! It worked!
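To make the idea of combining the two metrics concrete, here is a small sketch of the logic in Python rather than as an actual alerting rule. It is not the rule used in the cluster; the Sample type and the new_oomkill function are hypothetical names, and the two fields simply stand for the values of the two kube-state-metrics series at one scrape:

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Values of the two series for one container at one scrape (illustrative)."""

    restarts_total: int          # kube_pod_container_status_restarts_total
    last_reason_oomkilled: int   # ..._last_terminated_reason{reason="OOMKilled"}


def new_oomkill(previous: Sample, current: Sample) -> bool:
    """Fire only when the container restarted AND its last termination was an OOMKill."""
    restarted = current.restarts_total > previous.restarts_total
    return restarted and current.last_reason_oomkilled == 1


if __name__ == "__main__":
    # The reason series stays at 1 after the restart, so without the restart
    # counter the second comparison would keep firing.
    print(new_oomkill(Sample(0, 0), Sample(1, 1)))
    print(new_oomkill(Sample(1, 1), Sample(1, 1)))
```

The restart counter supplies the "something just happened" signal, while the reason series says what kind of termination it was.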

“Invisible” OOMKills

For a brief moment, I thought that I was done. I was about to declare victory over OOMKills in production! But then a puzzle came my way: one of our software developers came forward. He claimed that one of his pods was running out of memory, yet he couldn’t see any alerts for it.

At first, I wasn’t inclined to believe that his diagnosis of running out of memory was correct. Mainly because his Pod didn’t even restart! But then I looked at the graph of the memory use of the Pod. It did show the usual pattern: Memory usage would grow, reach its peak at the memory limit, and then suddenly drop.

I asked the developer for the gory details of the implementation. It turned out that the init process in the container would start a child process and wait for its result. If the child process exited with an error, the init process would return an error to the requester and not terminate (because - why should it?).

That is when it dawned on me - my alerting is effective only if the container exits. This is usually the case when the init process of the container is OOMKilled. But there is no guarantee this will happen if a child of the init process is OOMKilled. In the case where the container’s init process handles the OOMKill of a child by itself, my alerting does not trigger!
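The effect is easy to reproduce outside of Kubernetes. The sketch below is a stand-in for the developer’s setup, not his actual code: the child simulates an OOMKill by sending SIGKILL to itself, and the parent (playing the role of the container’s init process) reports an error and keeps running. It assumes a POSIX system, where Python’s subprocess module reports death by signal N as return code -N; handle_request and CHILD_CODE are made-up names.

```python
import signal
import subprocess
import sys

# Hypothetical worker: it simulates being OOMKilled by sending SIGKILL to itself.
CHILD_CODE = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"


def handle_request() -> str:
    """Run the worker as a child process and translate its fate into a reply."""
    result = subprocess.run([sys.executable, "-c", CHILD_CODE])
    if result.returncode == -signal.SIGKILL:
        # The child was killed, but nothing forces the parent to exit.
        return "error: worker was killed"
    return "ok"


if __name__ == "__main__":
    print(handle_request())
    print("init process is still running")
```

Because the parent survives, the container never exits, the restart counter never moves, and an alert built on container restarts stays silent.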

Trying the Existing Solutions

Given that OOMKills are as old as Unix, I thought: surely someone will have a solution for this already.

I embarked on a frantic search for some kind of metric exporter for this. I just needed the number of OOMKill events in a pod, or at least in a Docker container. Here is what I found:

cAdvisor

My first stop was cAdvisor itself. It turns out that cAdvisor receives the OOMKill events but does not export them as a Prometheus metric, and no one really seems to care. So that was a dead end.

kubernetes-oomkill-exporter

My second stop was kubernetes-oomkill-exporter. A very promising-sounding project with two huge disadvantages:

There is really no documentation for it, literally anywhere.
It does not work.

I tried the latest version of the Docker image, but once started, it crashes and burns with:

standard_init_linux.go:211: exec user process caused "no such file or directory"

Going back one minor version, one gets the following output:

F1120 22:04:21.571246       1 main.go:73] Could not create log watcher
I1120 22:04:21.572066       1 main.go:64] Starting prometheus metrics

It seems no one has committed any code to it in over a year, and it has a low number of stars (14). All that meant that I was back to square one.

Rolling my Own: `missing-container-metrics`

Having a hard time finding an existing solution meant only one thing: I would have to write my own.

A cursory look at Docker’s events delivered everything I needed. There is an event called oom. Docker emits this event every time the OOM killer strikes inside a container. Now I was only missing a piece of code that would listen to those events and export them as Prometheus metrics.

This is how missing-container-metrics was born. It connects to a local Docker instance (via /var/run/docker.sock) and lists all existing containers as a starting point. Then it listens to Docker events and uses them to keep track of the currently running containers. It also gathers the basic stats of each container it knows about:

Number of restarts
Last exit code
Number of OOMKills
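The core of this bookkeeping can be sketched in a few lines. The following is not the tool’s actual implementation, only an illustration of the idea: it assumes the events arrive as JSON lines in the shape printed by `docker events --format '{{json .}}'` and tracks two of the stats listed above. The function name count_container_events and the sample events are made up.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def count_container_events(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Track OOM count and last exit code per container ID from Docker events."""
    stats: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"ooms": 0, "last_exit_code": -1}
    )
    for line in lines:
        event = json.loads(line)
        if event.get("Type") != "container":
            continue  # ignore network, volume, image events
        container_id = event["Actor"]["ID"]
        action = event.get("Action")
        if action == "oom":
            stats[container_id]["ooms"] += 1
        elif action == "die":
            # Docker reports the exit code as a string attribute on "die".
            attrs = event["Actor"].get("Attributes", {})
            stats[container_id]["last_exit_code"] = int(attrs.get("exitCode", -1))
    return dict(stats)


if __name__ == "__main__":
    sample = [
        '{"Type":"container","Action":"oom","Actor":{"ID":"abc","Attributes":{}}}',
        '{"Type":"container","Action":"die","Actor":{"ID":"abc","Attributes":{"exitCode":"137"}}}',
    ]
    print(count_container_events(sample))
```

A real exporter would read the event stream from the Docker socket and expose these counters as Prometheus metrics instead of returning a dictionary.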

By design, it is not Kubernetes-specific, so it can be used with plain Docker. But it also has a couple of very convenient Kubernetes-specific features.

Whenever it finds a container label for the pod name or namespace, it adds them as labels to the exported metrics. Also, label naming is compatible with kube-state-metrics.

This keeps things simple for metric joins in PromQL.
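As an illustration of that mapping, here is a hedged sketch. It assumes the container labels io.kubernetes.pod.name and io.kubernetes.pod.namespace, which the kubelet sets on containers when Docker is the runtime, and the pod and namespace label names used by kube-state-metrics. The function metric_labels and the container_id label are hypothetical and not taken from the tool’s source.

```python
from typing import Dict

# Container labels assumed to be set by the kubelet on Docker containers.
POD_NAME_LABEL = "io.kubernetes.pod.name"
POD_NAMESPACE_LABEL = "io.kubernetes.pod.namespace"


def metric_labels(container_id: str, container_labels: Dict[str, str]) -> Dict[str, str]:
    """Build metric labels, adding pod/namespace only when the container has them."""
    labels = {"container_id": container_id}
    if POD_NAME_LABEL in container_labels:
        labels["pod"] = container_labels[POD_NAME_LABEL]
    if POD_NAMESPACE_LABEL in container_labels:
        labels["namespace"] = container_labels[POD_NAMESPACE_LABEL]
    return labels


if __name__ == "__main__":
    # A Kubernetes-managed container gets the extra labels...
    print(metric_labels("abc", {POD_NAME_LABEL: "web-1", POD_NAMESPACE_LABEL: "prod"}))
    # ...while a plain Docker container does not.
    print(metric_labels("def", {}))
```

With matching pod and namespace labels on both sides, the exported series can be joined with kube-state-metrics series on those labels.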

Running it in the Cluster

In a Kubernetes cluster, missing-container-metrics needs to run on every node. The simplest way to achieve this is to use a DaemonSet. The source code comes with an example DaemonSet deployment.

An Interesting Find Using `missing-container-metrics`

The most interesting issue I found was where I least expected it: Fluentd!

Fluentd is the log forwarder that ships node, pod, and kubelet logs to the log aggregator. When the volume of logs was very high, Fluentd was OOMKilled.

Looking at the details of how Fluentd works, it becomes clear what is going on.

Fluentd has one main process (that ends up being the init process in the container). This main process forks a worker process that forwards the logs. When the worker process dies for some reason (for example, an OOMKill), the main process starts a new one. This leads to an endless loop of spawn/OOMKill.
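The spawn/OOMKill loop can be imitated with a toy supervisor. This is not Fluentd’s code, just a sketch of the pattern under simplifying assumptions: the worker simulates an OOMKill by sending itself SIGKILL, the loop is capped at max_restarts so the example terminates (a real supervisor would keep going), and the names supervise and WORKER_CODE are invented. It assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical worker that is "OOMKilled" every time it runs.
WORKER_CODE = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"


def supervise(max_restarts: int) -> int:
    """Respawn the worker whenever it is killed; return how many kills were seen."""
    kills = 0
    for _ in range(max_restarts):
        worker = subprocess.run([sys.executable, "-c", WORKER_CODE])
        if worker.returncode == -signal.SIGKILL:
            kills += 1
            continue  # the supervisor itself is fine, so it simply starts a new worker
        break  # worker exited on its own: stop supervising in this sketch
    return kills


if __name__ == "__main__":
    print(f"worker was killed {supervise(3)} times, supervisor still alive")
```

The supervisor never exits, so from the outside the container looks healthy while its worker dies over and over.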

The fact that Fluentd is the log forwarder is very unfortunate. The OOMKill loop would stop the log forwarding, so you could not ‘see’ what is going on by inspecting the logs.

Epilogue

If you want to make sure that your Kubernetes cluster is healthy, it is essential to alert on OOMKills. This enables you to know when processes hit their memory limits, be it because of memory leaks or wrongly configured memory limits.

It turns out that monitoring for OOMKills in Kubernetes is not as easy a task as one might think. Using missing-container-metrics makes it much easier, though.

So go ahead, deploy missing-container-metrics to your cluster. You might be surprised how many OOMKills you have not been noticing.

I hope that it will be useful to you and will save you the time that I spent searching for a solution.