[What is it?]
When a Linux system runs out of memory (OOM), you have the choice between killing a random task (bad), letting the system crash (worse), or trying to be smart about which process to kill. Note that we don't have to be perfect here, we just have to be good.
The kernel is the one in charge of killing the process, and it does so brutally (SIGKILL, the signal behind `kill -9`), targeting the process with the highest badness score.
[How it works]
The way the kernel computes the score is well defined but somewhat confusing and full of exceptions. You can read further about it in the "Who is bad" section. Basically, the score measures the "badness" of a process: the higher it is, the more likely the process is to be killed. The user-adjustable part of the score (`oom_adj`) takes values in the range -17 to +15, where -17 has the special meaning of "never kill this process". The badness is computed from the process's original memory size, its CPU time (utime + stime), its run time (uptime - start time), its niceness, whether it has called swapoff, etc.
When the system runs out of memory, the kernel searches for the process with the highest badness score and kills it. If that process has children, the kernel tries to kill one of them first; otherwise it kills the process itself.
[Can I deactivate it?]
As you can imagine, killing processes semi-randomly can lead to system instability. Is there anything else we can do, like, you know, deactivating the OOM-killer?
In theory, yes, for example by disabling memory overcommit (`vm.overcommit_memory = 2`), though this is not recommended, by the way. With the OOM-killer out of the picture, if a process requests memory but there is none available, the allocation fails with ENOMEM. This error is rarely handled in application code anyway, so highly unexpected behavior can be expected.
[How to avoid it]
Basically, I can think of two different situations where the OOM-killer has to do its job:
- You are running a lot of processes/services on your server that require a lot of memory for their normal use, and the memory on your system is not enough to keep all of them up and running. You can fix this problem temporarily by adding swap to the server, and add more RAM when possible. You can also try moving each service to its own physical machine; this way the services will not have to compete for RAM.
- You have a bug in your software: a memory leak that consumes all your memory. This is the most difficult one to solve, but you want to track down the source of the problem and fix it, because it doesn't really matter how much memory your system has if your application always ends up consuming all of it (by not releasing it, or whatever). While you are figuring out the source of the problem, you can restart the service every few hours so it releases its memory, keeping the OOM-killer from killing your service for you.
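For the leak case, the stopgap restart can be as simple as a cron entry. A hedged sketch (the service name is a placeholder; pick an interval shorter than the time the leak takes to eat your RAM):

```
# Hypothetical crontab entry: restart the leaking service every 6 hours,
# before it grows big enough to wake the OOM-killer.
0 */6 * * * systemctl restart leaky-service
```

This is duct tape, not a fix: it only buys you time to find the leak.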
[How do I know the OOM-killer has been awakened?]
First, you'll notice that your application or service is down. Take a look at syslog, messages and the kernel log, and search for something like "Killed process", hehe.
- Code of the Linux kernel about OOM killer:
- Adjusting the badness score: search for oom_score_adj
- Funny analogy that explains why the OOM-killer problem is not a trivial one.
OK guys, this is it for today. Take care!