Grub 2.12 broke my secureboot setup (again)


Problem statement

I’ve had my fair share of quarrels with grub. I taught it to secureboot Alpine with everything encrypted, then convinced 2.06 to give up on “verification requested but nobody cares”, which even survived an upgrade from Alpine 3.15 to 3.18.

However, upgrading to grub 2.12 broke my setup again. I guess I should be used to that [1].

So what went wrong, and what needed fixing? Let’s dig in…

Discussion

After I upgraded Alpine to 3.20, and with it grub from 2.06 to 2.12, I was greeted by a strange “alloc magic is broken”.

What the hell is “alloc magic is broken”? The internet seems to think that it’s the result of bad memory chips. Well, it doesn’t have to be.

In my case, grub simply decided not to cooperate with secureboot: with secureboot disabled, the very same setup booted just fine again.
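To double-check from the running system whether secureboot really is on or off after such a test boot, the firmware's SecureBoot variable can be read through efivarfs. A minimal sketch, assuming efivarfs is mounted at the usual /sys/firmware/efi/efivars and that od supports the POSIX -A/-t/-j/-N options; the function name secureboot_state is made up for illustration:

```shell
#!/bin/sh
# Print "enabled", "disabled" or "unknown" for a SecureBoot efivarfs file.
# An efivarfs file starts with 4 bytes of attributes; the payload follows.
# For SecureBoot the payload is a single byte: 1 = on, 0 = off.
secureboot_state() {
    var=${1:-/sys/firmware/efi/efivars/SecureBoot-8be4df61-93ca-11d2-aa0d-00e098032b8c}
    [ -r "$var" ] || { echo unknown; return 0; }
    # Skip the 4 attribute bytes, then read one byte as an unsigned decimal.
    byte=$(od -An -tu1 -j4 -N1 "$var" | tr -d ' \n')
    case "$byte" in
        1) echo enabled ;;
        0) echo disabled ;;
        *) echo unknown ;;
    esac
}
```

Calling secureboot_state without arguments reads the real variable; passing a path makes it easy to try on a copy.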

I’ve spent several hours debugging the setup [2], trying to minimize the config:

insmod luks
insmod cryptodisk
insmod part_gpt
insmod lvm
set default="0"
menuentry_id_option="--id"
export menuentry_id_option
insmod all_video
loadfont unicode
set gfxmode=auto
insmod gfxterm
terminal_output gfxterm
#set timeout_style=menu
#set timeout=2
set timeout=-1

insmod all_video
insmod gzio
insmod part_gpt
insmod diskfilter
insmod mdraid1x
insmod cryptodisk
insmod luks
insmod lvm
insmod gcry_rijndael
insmod gcry_sha256
insmod ext2
cryptomount -u df53a1b51a01450a9608ef27a805caa8
#set root='cryptouuid/df53a1b51a01450a9608ef27a805caa8'
#search --no-floppy --fs-uuid --set=root --hint='cryptouuid/df53a1b51a01450a9608ef27a805caa8'  343958a7-af92-4fb7-a1ea-b31b5410be7c
echo    'Loading Linux lts ...'
linux   (crypto0)/vmlinuz-lts root=ZFS=nvmetank/ROOT/alpine ro  modules=nvme,zfs quiet rootfstype=zfs  ro modules=nvme,zfs quiet rootfstype=zfs nomodeset nofb video=vesafb:off
#set pager=1
#set debug=all
set debug=linux
echo    'Loading initial ramdisk ...'
initrd  (crypto0)/initramfs-lts
boot
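While minimizing a config like the one above, it helps to see which insmod lines are simply repeated (here luks, cryptodisk, part_gpt, lvm and all_video each show up twice). The repeats are harmless to grub, just noise. A rough sketch for listing them; the function name dup_insmods is hypothetical, and it only understands plain "insmod NAME" lines:

```shell
#!/bin/sh
# List modules that are insmod'ed more than once in a grub config.
# Commented-out lines are ignored because they do not start with "insmod".
dup_insmods() {
    sed -n 's/^[[:space:]]*insmod[[:space:]][[:space:]]*\([^[:space:]#]*\).*/\1/p' "$1" \
        | sort | uniq -d
}
```

Usage would be something like dup_insmods /boot/grub/grub.cfg, which prints one module name per line.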

and even trying to convince grub-mkstandalone to get me some usable result:

grub-mkstandalone -O x86_64-efi -o g.efi \
  --modules="luks cryptodisk part_gpt lvm" \
  'boot/grub/grub.cfg=/boot/grub/grub.cfg' \
  --disable-shim-lock

All to no avail, because when the “alloc magic is broken” error disappeared, it was replaced by something like “unable to load image”, or “you must load kernel first”.

Well, you sad little thing, I am loading a kernel, you just ignore it.

Desperate, I kept searching around and found the “Create UEFI secureboot USB” article on the Alpine wiki, which talks about the need to sign the kernel.

Which I thought rather strange – after all, why would I need to sign a kernel that resides on an encrypted /boot partition (that EFI bios can’t even see)?

Turns out, grub 2.12 – for a reason wholly unknown to me – will somehow pass this decrypted kernel image through something (EFI bios? IDK) that will choke when it’s not signed [3].

So one little:

sbsign \
  --key /boot/secureboot/sb.key \
  --cert /boot/secureboot/sb.crt \
  /boot/vmlinuz-lts
mv /boot/vmlinuz-lts.signed /boot/vmlinuz-lts

and we’re back in business. Until the next kernel upgrade.

But there must be some way to make this change permanent?

Solution

The solution is straightforward (on Alpine):

apk add kernel-hooks
cat > /etc/kernel-hooks.d/sbsign-kernel <<'EOF'
#!/bin/sh
CERT=/boot/secureboot/sb.crt
KEY=/boot/secureboot/sb.key
KERNEL=/boot/vmlinuz-lts
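# Skip kernels that already carry a valid signature for our cert.
if sbverify --cert "$CERT" "$KERNEL" >/dev/null 2>&1; then
        echo "+ $KERNEL is already signed, skipping"
        exit 0
fi
echo "+ $KERNEL signing:"
set -e
# Sign into a separate file, verify it, and only then replace the kernel.
sbsign --output "$KERNEL.signed" --key "$KEY" --cert "$CERT" "$KERNEL"
sbverify --cert "$CERT" "$KERNEL.signed"
mv -f "$KERNEL.signed" "$KERNEL"
exit 0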
EOF
chmod a+x /etc/kernel-hooks.d/sbsign-kernel
apk fix kernel-hooks
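The hook above hard-codes the lts kernel. Assuming kernel-hooks passes the kernel flavor as the first argument to each hook (an assumption worth verifying against the kernel-hooks documentation), the path could be derived instead. A sketch of that idea; kernel_for_flavor, sign_kernel and the BOOT_DIR, SBSIGN and SBVERIFY overrides are all hypothetical names, added only so the logic can be dry-run without touching a real kernel:

```shell
#!/bin/sh
# ASSUMPTION: $1 is the kernel flavor (e.g. "lts", "virt"); defaults to lts.
kernel_for_flavor() {
    flavor=${1:-lts}
    echo "${BOOT_DIR:-/boot}/vmlinuz-$flavor"
}

# Sign one kernel in place unless it already verifies against the cert.
# SBSIGN and SBVERIFY default to the real tools and can be overridden.
sign_kernel() {
    kernel=$1 cert=$2 key=$3
    if "${SBVERIFY:-sbverify}" --cert "$cert" "$kernel" >/dev/null 2>&1; then
        echo "+ $kernel is already signed, skipping"
        return 0
    fi
    echo "+ $kernel signing:"
    # Replace the kernel only if both signing and verification succeed.
    "${SBSIGN:-sbsign}" --output "$kernel.signed" --key "$key" --cert "$cert" "$kernel" &&
        "${SBVERIFY:-sbverify}" --cert "$cert" "$kernel.signed" &&
        mv -f "$kernel.signed" "$kernel"
}
```

Inside a hook this would be called roughly as sign_kernel "$(kernel_for_flavor "$1")" /boot/secureboot/sb.crt /boot/secureboot/sb.key.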

Closing words

Well, in the end, the whole incident didn’t hurt that much.

The solution, I mean. The path to get to it was a different matter. Debugging grub boot failures is not fun, even with all the debug options.

But hey, at least I got some grub-fu – and a simple way to ditch grub – out of it. ;)

I call that a win.

  1. Although, I also found a way to secure boot without grub, while I was debugging this, so that’s my most likely future.

  2. Turns out, set debug=all is a really poor experience, even with the pager enabled (set pager=1). Narrowing it down to a single category with set debug=something isn’t much better, because built-in timeouts (menu and otherwise) might clear your screen exactly when you don’t want them to. Plus, the debug messages aren’t all that helpful for debugging secureboot issues anyway.

  3. “Yes wejn, real technical explanation right there, you must be pro at this.”