Proper ZFS setup for peace of mind


Problem statement

I use ZFS on my storage boxes. And while it is very powerful, I think that just plopping your data on ZFS and leaving it there without additional setup is a lot worse than it could be.

This article discusses a slightly more elaborate setup that provides me with additional peace of mind – roughly in the sense that silent corruption won’t fly by undetected, there’s some health monitoring, and fat-fingering rm -rf something has its blast radius attenuated[1].

In other words:

  1. I’m quickly notified of failures
  2. ZFS is kept squeaky clean
  3. There’s reliable auto-snapshotting going on

Solution

The solution is going to be three-pronged, just like the problem statement.

I’m quickly notified of failures

The ZFS Event Daemon is a beauty.

It doesn’t come enabled by default, but it’s simply amazing: if there’s some noteworthy event (a failure, a finished scrub/resilver, etc.) going on in the life of your ZFS array, ZED will run a bunch of preconfigured scripts (ZEDLETs) and dispatch notifications (where it makes sense).

And you can also write your own scripts[2] if that floats your boat…
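
For illustration, here is a minimal sketch of such a custom ZEDLET: a script dropped into /etc/zfs/zed.d/, made executable, and named so that its prefix matches the event class it should react to. The script name and the Healthchecks UUID are my own placeholders; ZED itself supplies the ZEVENT_* environment variables:

#!/bin/sh
# hypothetical /etc/zfs/zed.d/scrub_finish-healthchecks.sh
# runs when a scrub finishes; ZED exports event details as ZEVENT_* variables
curl -fsS -m 10 --retry 5 -o /dev/null \
  --data-raw "scrub finished on pool ${ZEVENT_POOL}" \
  https://hc-ping.com/FIXME-uuid-here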

The best part is, it’s ridiculously easy to have it push to Pushover or ntfy, in addition to root@’s email[3], which is the default.

So, enable ZED, and maybe make it go verbose by default (at first):

sed -i 's,^#*ZED_NOTIFY_VERBOSE=.*,ZED_NOTIFY_VERBOSE=1,' \
  /etc/zfs/zed.d/zed.rc
/etc/init.d/zfs-zed start
rc-update add zfs-zed default
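
To sanity-check that ZED is alive and actually has events to look at, peeking at the event stream it consumes is enough (zpool events prints the recent ZFS events):

/etc/init.d/zfs-zed status
zpool events | tail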

Also, configuring Pushover is literally a two-liner:

sed -i 's,^#*ZED_PUSHOVER_TOKEN=.*,ZED_PUSHOVER_TOKEN=tokenhere,' \
  /etc/zfs/zed.d/zed.rc
sed -i 's,^#*ZED_PUSHOVER_USER=.*,ZED_PUSHOVER_USER=userhere,' \
  /etc/zfs/zed.d/zed.rc
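
If ntfy is more your thing, recent OpenZFS releases expose it through zed.rc as well. The variable name below is from memory, so treat it as an assumption and double-check against your zed.rc:

sed -i 's,^#*ZED_NTFY_TOPIC=.*,ZED_NTFY_TOPIC=topichere,' \
  /etc/zfs/zed.d/zed.rc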

ZFS is kept squeaky clean

One is supposed to run the zpool scrub operation on the pools regularly, to make sure all data still checksums correctly.

I’ve seen weekly scrubbing recommended, which is super easy, barely an inconvenience:

cat - > /etc/periodic/weekly/zfs-scrub <<'EOF'
#!/bin/sh
for pool in $(zpool list -Ho name); do
  zpool scrub "$pool"
done
EOF
chmod a+x /etc/periodic/weekly/zfs-scrub

Of course, you could complicate this a lot more… but I don’t mind running scrubs in parallel, at the default[4] schedule.
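
(For reference, this is roughly what the stock Alpine root crontab does with /etc/periodic; exact times can differ between releases, so check /etc/crontabs/root on your box.)

# /etc/crontabs/root (Alpine defaults, roughly)
0  *  *  *  *  run-parts /etc/periodic/hourly
0  2  *  *  *  run-parts /etc/periodic/daily
0  3  *  *  6  run-parts /etc/periodic/weekly
0  5  1  *  *  run-parts /etc/periodic/monthly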

There’s reliable auto-snapshotting going on

I’m super spoiled when it comes to online snapshots with ZFS. The thought of not having to worry about screw-ups even with fileops outside of a git repository is downright pleasant.

For that, I’ve adopted Dave Eddy’s Automatic ZFS Snapshots and Backups with a slight twist on it:

Since I run podman containers with ZFS backing, I think having the tens of containers/.images/<hash> datasets also snapshotted is a bit of overkill. And I like them snapshots all taken at roughly the same point in time…

So for snapshotting, I adopted a slightly different strategy. Here’s my zfs-snapshot-all:

#!/bin/sh
if [ $# -ne 1 ]; then
  echo "Usage: $0 <name>" >&2
  exit 1
fi
name="$1"
now=$(date +%s)
code=0
for pool in $(zpool list -Ho name); do
  # recursive snapshot of the whole pool
  zfs snapshot -r "${pool}@${name}_${now}" || code=1
  # special case -- murder image snapshots
  zfs list -t snapshot "${pool}/containers/.images@${name}_${now}" >/dev/null 2>&1 &&
    zfs destroy -r "${pool}/containers/.images@${name}_${now}"
done
exit $code

In other words, I run a recursive snapshot on each pool, and then selectively purge anything under ${pool}/containers/.images, if present.
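
To make the naming concrete, a run looks roughly like this (tank is just an illustrative pool name; the suffix is the date +%s epoch at the time of the run):

zfs-snapshot-all hourly
zfs list -t snapshot -o name | grep '@hourly_'
# -> tank@hourly_1712345678, tank/containers@hourly_1712345678, ...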

For snapshot pruning I also took a much simpler approach, gutting Dave’s zfs-prune-snapshots. Here’s my zfs-snapshots-prune[5]:

#!/bin/bash

if [[ $# -lt 2 ]]; then
  echo "Usage: $0 <snapshot_prefix> <timespec>" 2>&1
  exit 1
fi

prefix="$1"
timespec="$2"
dryrun=""

time_re='^([0-9]+)([smhdwMy])$'
seconds=
if [[ $timespec =~ $time_re ]]; then
    # ex: "21d" becomes num=21 spec=d
    num=${BASH_REMATCH[1]}
    spec=${BASH_REMATCH[2]}

    case "$spec" in
        s) seconds=$((num));;
        m) seconds=$((num * 60));;
        h) seconds=$((num * 60 * 60));;
        d) seconds=$((num * 60 * 60 * 24));;
        w) seconds=$((num * 60 * 60 * 24 * 7));;
        M) seconds=$((num * 60 * 60 * 24 * 30));;
        y) seconds=$((num * 60 * 60 * 24 * 365));;
        *) echo "error: unknown spec '$spec'" >&2; exit 1;;
    esac
elif [[ -z $timespec ]]; then
    echo 'error: empty timespec' >&2
    exit 2
else
    echo "error: failed to parse timespec '$timespec'" >&2
    exit 2
fi

code=0

now=$(date +%s)
# read via process substitution (not a pipe) so that `code=3` set inside
# the loop survives into the `exit $code` below
while read -r snap creation; do
    # filter prefix
    [[ "$snap" =~ "@${prefix}_" ]] || continue

    delta=$((now - creation))
    if ((delta > seconds)); then
        echo "Removing $snap, creation: $creation, now: $now, ts: $timespec"
        zfs destroy -r "$snap" || code=3
    else
        #echo "Not ripe yet: $snap, creation: $creation, now: $now, ts: $timespec"
        true
    fi
done < <(zfs list -Hpo name,creation -t snapshot $(zpool list -Ho name))

exit $code

Unlike Dave’s much more generic solution, I opted for a simple keep-or-destroy decision on the pool-level snapshot, followed by a recursive zfs destroy -r. Because I don’t really care about du stats or other niceties.

To run this without silent failures, I’m pushing the result of the hourly snapshotting to Healthchecks.io, with the assumption that if the hourly snapping ain’t failing, neither are the daily/weekly/monthly/yearly ones[6]:

cat - > /etc/periodic/hourly/zfs-snapshot <<'EOF'
#!/bin/sh
exec >> /var/log/zfs-snapshots.log 2>&1

code=0

if zfs-snapshot-all hourly; then
  zfs-snapshots-prune hourly 25h || code=2
else
  code=1
fi

# FIXME: change the UUID
curl -fsS -m 10 --retry 5 -o /dev/null \
  https://hc-ping.com/FIXME-uuid-here/$code
exit $code
EOF

cat - > /etc/periodic/daily/zfs-snapshot <<'EOF'
#!/bin/sh
exec >> /var/log/zfs-snapshots.log 2>&1

set -e

zfs-snapshot-all daily
zfs-snapshots-prune daily 8d
EOF

cat - > /etc/periodic/weekly/zfs-snapshot <<'EOF'
#!/bin/sh
exec >> /var/log/zfs-snapshots.log 2>&1

set -e

zfs-snapshot-all weekly
zfs-snapshots-prune weekly 5w
EOF

cat - > /etc/periodic/monthly/zfs-snapshot <<'EOF'
#!/bin/sh
exec >> /var/log/zfs-snapshots.log 2>&1

set -e

zfs-snapshot-all monthly
zfs-snapshots-prune monthly 13M

# also handles yearly
if [ $(date +%m) -eq 1 ]; then
  zfs-snapshot-all yearly
  zfs-snapshots-prune yearly 10y
fi
EOF

chmod a+x /etc/periodic/*/zfs-snapshot
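
Before trusting cron with all of this, it doesn’t hurt to run the hourly job by hand once and peek at the log (and at the Healthchecks dashboard) to confirm the wiring:

/etc/periodic/hourly/zfs-snapshot
echo $?
tail -n 20 /var/log/zfs-snapshots.log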

Closing words

There you have it, a somewhat automatic ZFS setup[7] for increased peace of mind, in three easy pieces.

  1. Compared to conventional systems. Yes, you surely have proper backups.

    But those can be, depending on schedule, strictly worse. Definitely more painful to handle than just cd .zfs/snapshot/$name/, no?

  2. Say, to monitor that scrubs keep finishing regularly, without silent failures…

  3. Linked is my take on a painless send-only email setup.

  4. 3am on Saturday

  5. apk add bash if you wanna run it on Alpine

  6. Time will tell how smart of a shortcut this one was.

  7. And without any trace of backups. You shall have them; 3-2-1 or whatever. I’m a side note, not a cop.