How to hunt for kernel memory leaks
Problem
Processes on one LVS cluster node were being killed by the oom-killer while
other nodes with the same software setup (but slightly different hardware) were
running just fine. I tried protecting the vital system processes with
OOM_DISABLE, only to find the rest of the system (the unprotected processes)
killed by the oom-killer and the machine left in an unusable state.
Solution
After a brief consultation with Rik van Riel, author of the oom-killer, I was advised to watch slabinfo for changes, because the machine had over 800MB of slab cache (non-swappable kernel memory [buffers]) allocated at the time the oom-killer started its crusade.
The following script was used to detect changes in the slab cache and to
pinpoint the culprit (it was a scsi_cmd_cache leak).
p3 slabinfo # ./slabdiff.rb ac 20060328-101914 /proc/slabinfo | tail -n 5
reiser_inode_cache: 907752
skbuff_fclone_cache: 1028736
size-512: 1368576
size-8192: 2105344
scsi_cmd_cache: 18525696
-> scsi_cmd_cache has grown by 18MB between snapshots -- might be a LEAK!
Usage
The script below shows you non-zero differences between two snapshots (dumps)
of /proc/slabinfo, which is useful for discovering kernel memory leaks.
To use it, simply set up a cronjob that dumps /proc/slabinfo at regular
intervals (once every few minutes) and then use the script to diff either two
snapshots, or an older snapshot and the current /proc/slabinfo.
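The snapshot step can be sketched in Ruby as well. This is only a minimal illustration, not part of the original setup: the helper name snapshot_slabinfo, the script name slabsnap.rb and the target directory are all hypothetical. The timestamp format is chosen to match the snapshot name used in the example above (20060328-101914).

```ruby
require 'fileutils'

# Hypothetical helper (not part of slabdiff.rb): copies +src+ into +dest_dir+
# under a name like 20060328-101914 and returns the path of the new snapshot.
def snapshot_slabinfo(dest_dir, src = '/proc/slabinfo', now = Time.now)
  FileUtils.mkdir_p(dest_dir)
  dest = File.join(dest_dir, now.strftime('%Y%m%d-%H%M%S'))
  # Files under /proc report a size of 0, so read the content explicitly
  # and write it out instead of relying on a size-based copy.
  File.write(dest, File.read(src))
  dest
end

# Only runs when called directly with exactly one argument (the directory).
puts snapshot_slabinfo(ARGV[0]) if __FILE__ == $0 && ARGV.size == 1
```

Assuming the sketch is saved as /usr/local/bin/slabsnap.rb (a made-up path), a crontab entry along the lines of `*/5 * * * * ruby /usr/local/bin/slabsnap.rb /var/log/slabinfo` would take a snapshot every five minutes. Reading /proc/slabinfo may require root, depending on the kernel.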
Script
#!/usr/bin/ruby
=begin
Author: Wejn <wejn at box dot cz>
Thanks to: Rik van Riel <riel at redhat dot com>
License: GPLv2 (without the "latter" option)
Requires: Ruby
TS: 20060328175500
Background: https://wejn.org/stuff/slabdiff.rb.html
=end
if ARGV.size != 3
  $stderr.puts "Usage: #{File.basename($0)} <[ac]tive|[al]located> <file1> <file2>"
  $stderr.puts "\twhere <file[12]> is /proc/slabinfo dump"
  exit 1
end
active = true
case ARGV.shift
when "active", "ac"
  # no action
when "allocated", "al"
  active = false
else
  $stderr.puts "Error: you must select one of: active (ac), allocated (al)"
  exit 1
end
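# Returns a hash mapping cache name => bytes (object count * object size).
def load_slab(filename, active)
  slab = {}
  # Block form closes the file handle even if an exception is raised.
  File.open(filename, 'r') do |content|
    raise "unsupported version" unless content.gets.to_s.strip == 'slabinfo - version: 2.1'
    content.each do |ln|
      next if ln =~ /^\s*(#|$)/ # skip the column header and blank lines
      name, active_objs, num_objs, objsize, _rest = ln.strip.split(/\s+/, 5)
      slab[name] = (active ? active_objs.to_i : num_objs.to_i) * objsize.to_i
    end
  end
  slab
end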
old = load_slab(ARGV.first, active)
new = load_slab(ARGV.last, active)
def slabdiff(old, new)
  diff = {}
  (old.keys + new.keys).uniq.each do |k|
    diff[k] = (new[k] || 0) - (old[k] || 0)
  end
  diff
end
# Print caches in ascending order of growth, so the biggest grower comes last.
slabdiff(old, new).sort_by { |_name, delta| delta }.each do |k, v|
  puts "#{k}: #{v}" unless v.zero?
end
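
To make the numbers in the output easier to interpret, here is a small worked example of the per-cache arithmetic the script performs: object count times object size, in bytes. The sample line is made up for illustration (it only follows the slabinfo 2.1 column layout), and slab_bytes is a hypothetical helper that mirrors the expression inside load_slab.

```ruby
# Made-up line in the slabinfo 2.1 layout:
# name active_objs num_objs objsize objperslab pagesperslab : tunables ... : slabdata ...
SAMPLE = "example_cache 100 120 512 8 1 : tunables 54 27 8 : slabdata 15 15 0"

# Hypothetical helper mirroring load_slab: bytes in use by one cache,
# counting either active objects or all allocated objects.
def slab_bytes(line, active)
  _name, active_objs, num_objs, objsize, _rest = line.strip.split(/\s+/, 5)
  (active ? active_objs.to_i : num_objs.to_i) * objsize.to_i
end

puts slab_bytes(SAMPLE, true)  # active:    100 * 512 = 51200
puts slab_bytes(SAMPLE, false) # allocated: 120 * 512 = 61440
```

The values printed by the script are differences of such byte counts between two snapshots, so a cache that keeps growing from one snapshot to the next is the one worth a closer look.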