How to hunt for kernel memory leaks


Problem

Processes on one LVS cluster node were being killed by the oom-killer, while other nodes with the same software setup (but slightly different hardware) were running just fine. I tried protecting the system's vital processes with OOM_DISABLE, only to find the rest of the system (all unprotected processes) killed by the oom-killer and the machine left in an unusable state.

Solution

After a brief consultation with Rik van Riel, author of the oom-killer, I was advised to watch slabinfo for changes, because the machine had over 800MB of slab cache (non-swappable kernel memory [buffers]) allocated at the time the oom-killer started its crusade.

The script below was used to detect changes in the slab cache and to pinpoint the culprit (a scsi_cmd_cache leak). A sample run:

p3 slabinfo # ./slabdiff.rb ac 20060328-101914 /proc/slabinfo  | tail -n 5
reiser_inode_cache: 907752
skbuff_fclone_cache: 1028736
size-512: 1368576
size-8192: 2105344
scsi_cmd_cache: 18525696

-> scsi_cmd_cache has grown by 18MB between snapshots -- might be a LEAK!
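
For orientation, each figure in that output is a difference in bytes: for every cache the script multiplies the object count by the object size in both snapshots and subtracts. A minimal sketch of that arithmetic follows. The two slabinfo lines are made-up sample values, not data from the affected machine, and cache_bytes is a hypothetical helper name used only for this illustration.

```ruby
# Bytes used by one cache, computed from a single /proc/slabinfo (version 2.1) line.
# Leading columns: name, active_objs, num_objs, objsize, ...
def cache_bytes(line, active = true)
	_name, active_objs, num_objs, objsize, _rest = line.strip.split(/\s+/, 5)
	(active ? active_objs.to_i : num_objs.to_i) * objsize.to_i
end

# Made-up sample lines (illustrative values only).
before = "example_cache  1000  1200  512  8  1 : tunables 54 27 8 : slabdata 150 150 0"
after  = "example_cache  3000  3200  512  8  1 : tunables 54 27 8 : slabdata 400 400 0"

# (3000 - 1000) active objects * 512 bytes each
growth = cache_bytes(after) - cache_bytes(before)
puts "example_cache: #{growth}"
```

With these sample values the sketch prints "example_cache: 1024000", i.e. roughly 1MB of growth in active objects.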

Usage

The script below shows you non-zero differences between two snapshots (dumps) of /proc/slabinfo, which is useful for discovery of kernel memory leaks.

To use it, simply set up a cron job that dumps /proc/slabinfo at regular intervals (every few minutes) and then use the script to diff either two snapshots, or an older snapshot and the current /proc/slabinfo.
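
The dump step can be as small as the following sketch. It is not part of the original setup: snapshot_slabinfo, the timestamped file name (chosen to match the 20060328-101914 snapshot name in the sample run) and the directory layout are all assumptions made for illustration.

```ruby
#!/usr/bin/ruby
require 'fileutils'

# Copy a slabinfo file into dest_dir under a timestamped name (YYYYMMDD-HHMMSS).
# Returns the path of the snapshot that was written.
# NOTE: hypothetical helper; name, arguments and layout are assumptions.
def snapshot_slabinfo(dest_dir, source = '/proc/slabinfo', now = Time.now)
	FileUtils.mkdir_p(dest_dir)
	target = File.join(dest_dir, now.strftime('%Y%m%d-%H%M%S'))
	File.write(target, File.read(source))
	target
end

# When run with a target directory as its only argument, take one snapshot.
puts snapshot_slabinfo(ARGV[0]) if ARGV.size == 1
```

Saved as, say, slabsnap.rb (a hypothetical name) and called from cron every few minutes with a directory argument, this leaves a series of snapshots that can be fed to the script below. On many current kernels reading /proc/slabinfo requires root, so the cron job may have to run as root.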

Script

#!/usr/bin/ruby

=begin
Author: Wejn <wejn at box dot cz>
Thanks to: Rik van Riel <riel at redhat dot com>
License: GPLv2 (without the "later" option)
Requires: Ruby
TS: 20060328175500

Background: https://wejn.org/stuff/slabdiff.rb.html
=end

if ARGV.size != 3
	$stderr.puts "Usage: #{File.basename($0)} <[ac]tive|[al]located> <file1> <file2>"
	$stderr.puts "\twhere <file[12]> is /proc/slabinfo dump"
	exit 1
end

active = true

case ARGV.shift
when "active", "ac"
	# no action
when "allocated", "al"
	active = false
else
	$stderr.puts "Error: you must select one of: active (ac), allocated (al)"
	exit 1
end

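# Parse a /proc/slabinfo dump into { cache name => bytes used }.
# Counts active objects when `active` is true, allocated objects otherwise.
def load_slab(filename, active)
	slab = {}
	# Block form closes the file even if parsing raises.
	File.open(filename, 'r') do |content|
		# to_s guards against an empty file, where gets returns nil
		raise "unsupported version" unless content.gets.to_s.strip == 'slabinfo - version: 2.1'
		content.each do |ln|
			next if ln =~ /^\s*#/ # skip the column header line
			name, active_objs, num_objs, objsize, _rest = ln.strip.split(/\s+/, 5)
			slab[name] = (active ? active_objs.to_i : num_objs.to_i) * objsize.to_i
		end
	end
	slab
end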

old = load_slab(ARGV.first, active)
new = load_slab(ARGV.last, active)

def slabdiff(old, new)
	diff = {}
	(old.keys + new.keys).uniq.each do |k|
		diff[k] = (new[k] || 0) - (old[k] || 0)
	end
	diff
end

# Print non-zero differences in ascending order, so the biggest growth comes last.
slabdiff(old, new).sort_by { |_name, delta| delta }.each do |k, v|
	puts "#{k}: #{v}" unless v.zero?
end