Running latest kernels on ARM Scaleway servers

Adventures in kexec land

December 21, 2018

Scaleway, a French cloud provider, has been renting baremetal ARM servers for a few years now, and virtual ARM64 servers more recently. They ship with a Scaleway-provided kernel and initrd, which aren't updated as often as I'd like. The latest ARMv7 (32-bit) kernel is 4.9.93, while the latest 4.9 LTS at the time of this writing is 4.9.146. Being 53 versions behind is a lot, so I've been looking at how to work around this.

A bad surprise

At the latest Golang Paris meetup, I did a livecoding introduction to autocert. Unfortunately, the demo at the end failed, despite the code being correct (it was still my fault, though). After digging in, everything pointed to something wrong on the server.

This server, a C1 ARM server from Scaleway, was one of the first baremetal ARM servers ever available from a cloud provider. Based on custom hardware with a Marvell ARMv7 SoC, it was also very cheap at launch, and it remains one of the cheapest baremetal servers to rent out there. Since then, Packet has launched ARM64 servers based on the Cavium ThunderX (much more expensive, with 96 cores and 2 SoCs on board), and Scaleway followed suit with virtual servers based on the same platform (with 4 to 64 cores), which are much more affordable.

The C1 server was updated regularly, in addition to having unattended-upgrades enabled. But what seemed odd was the old kernel version (4.5.7). It had been running that same kernel version more or less since I provisioned it, despite having been rebooted a few times. That isn't really a good idea, at least for security reasons.

And it turned out, for at least one other reason as well: starting with Go 1.9, Go binaries failed to initialize the crypto RNG using the getrandom syscall, blocking forever. Updating to a more recent kernel (4.9.93) fixed the issue. But the update process required using the Scaleway web interface or the API; the CLI tool does not (AFAIK) support this operation. Sidenote: I know that in a cloud world I should just spin up a new server and redeploy to it. I'll get there once I'm comfortable enough that it can work with my apps :-)
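To confirm that a binary is stuck in this particular way, tracing the getrandom syscall with strace is a quick check. This is just a diagnostic sketch (the binary name is a placeholder), not part of the original debugging session:

# Trace only getrandom calls (replace ./my-go-binary with the binary to test).
# On an affected kernel, the process hangs on a getrandom() call that never returns.
strace -f -e trace=getrandom ./my-go-binary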

While this fixed this particular issue, it got me thinking about the general process for managing these servers. Should I set up a script or an Ansible role to update the bootscript regularly? Isn't there a better way that would let me use the distro kernels? That led me to contemplate using kexec.

ARMv7 kexec attempts

Fortunately, I was not the first to have this idea, since Scaleway's initramfs scripts directly support using kexec to boot a new kernel! You can find a tutorial here, but unfortunately it only covers x86 servers.

I quickly learned that the serial console on the web interface is inferior to the one provided by the CLI tool: ./scw attach <server-name>. All the boot logs in this post were captured with it.

My first attempts therefore used the KEXEC_KERNEL=/vmlinuz and KEXEC_INITRD=/initrd.img server tags, but they failed to work. Here is the boot log output with INITRD_VERBOSE=1:

** Message: /dev/nbd6 is not used[   30.528536] kexec_core: Starting new kernel

** Message: cm[   30.583224] Disabling non-boot CPUs ...
d check mode
** Message: /dev/n[   30.672735] CPU1: shutdown
bd7 is not used
** Message: cmd check mode
** Message: /dev/nbd8 is not used
[   30.791469] CPU2: shutdown
** Message: cmd check mode
** Message: /dev/nbd9 is not used
*[   30.891720] CPU3: shutdown
* Message: cmd check mode
** Message: /dev/nbd1[   30.960773] Bye!
0 is not used

The output is a bit mangled, and I lack visibility into how it's being done. So I wanted to add more kernel debug options: I tried using KEXEC_APPEND="debug initcall_debug". But then I discovered that the server tags do not support spaces, since the tokenisation is space-based.
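To illustrate the problem (this is just a guess at the mechanism based on the observed behaviour, not Scaleway's actual initrd code): unquoted word-splitting in POSIX shell breaks a tag containing a space into two separate tokens.

#!/bin/sh
# Hypothetical illustration of space-based tokenisation.
TAGS='KEXEC_KERNEL=/vmlinuz KEXEC_APPEND=debug initcall_debug'
for tag in $TAGS; do        # unquoted expansion splits on spaces
        echo "token: $tag"
done
# "initcall_debug" ends up as its own token instead of staying part of KEXEC_APPEND.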

I then decided to use INITRD_DROPBEAR=1 to start an SSH shell in the initrd, giving me control over how the kexec is run. Initially, I wondered whether the fact that I hadn't booted with a device tree was causing an issue. So I dumped the device tree from the running system and re-built it with dtc. I made sure to re-use the command line from the current boot, and to properly detach the nbd block device. I also attempted to use a more recent kexec userspace tool, and added a debugging option. After many attempts, I had a script to run inside the initramfs that looked like this:

#!/bin/sh

export PATH=/sbin/:/usr/sbin:$PATH

# Copy the target kernel, initrd and the rootfs' (more recent) kexec binary
# out of /newroot, so they remain usable after the rootfs is unmounted.
cp /newroot/initrd.img /
cp /newroot/vmlinuz /
cp /newroot/sbin/kexec /

# Rebuild a device tree blob from the one exposed by the running kernel.
/newroot/usr/bin/dtc -I fs -O dtb -o /generated-dtb /proc/device-tree/

# Unmount the rootfs and cleanly detach the nbd block device.
umount /newroot
xnbd-client -c /dev/nbd0
xnbd-client -d /dev/nbd0

# Load the new kernel (current command line plus debug options), then execute it.
/kexec -d -l --append="verbose debug $(cat /proc/cmdline) is_in_kexec=yes root=/dev/nbd0 nbdroot=10.1.52.66,4448,nbd0" --dtb=/generated-dtb --ramdisk=/initrd.img --type=zImage /vmlinuz
/kexec -d -e

Since the dropbear in the initramfs lacks the scp server side, and kept generating new host keys on every boot, I pushed the script like this:

cat kexec-initramfs-script.sh | ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@myc1server "tee -a kexec.sh && chmod +x kexec.sh"

Then, it ran without any obvious error:

kernel: 0xb66ef008 kernel_size: 0x7d8200
MEMORY RANGES
0000000000000000-000000007fffefff (0)
zImage header: 0x016f2818 0x00000000 0x007d8200
zImage size 0x7d8200, file size 0x7d8200
zImage has tags
  offset 0x0000ae48 tag 0x5a534c4b size 8
kernel image size: 0x015c5d14
kexec_load: entry = 0x8000 flags = 0x280000
nr_segments = 3
segment[0].buf   = 0xb66ef008
segment[0].bufsz = 0x7d8200
segment[0].mem   = 0x8000
segment[0].memsz = 0x7d9000
segment[1].buf   = 0xb3603008
segment[1].bufsz = 0x30eba91
segment[1].mem   = 0x15ce000
segment[1].memsz = 0x30ec000
segment[2].buf   = 0x4f45a8
segment[2].bufsz = 0x45bc
segment[2].mem   = 0x46ba000
segment[2].memsz = 0x5000

But the serial console output was always the same:

[  129.248360] kexec_core: Starting new kernel
[  129.298586] Disabling non-boot CPUs ...
[  129.393572] CPU1: shutdown
[  129.532515] CPU2: shutdown
[  129.632399] CPU3: shutdown
[  129.700758] Bye!

And no new kernel seemed to boot… That's when I gave up and decided to try something new. While writing this post, I also opened an issue to inform Scaleway of this status.

ARMv8 servers

I decided to check out the ARMv8 virtual servers I had heard about. I already had some arm64 experience, and I noticed that the pricing was similar (3€ per month for 4 cores + 2GB). So I instantiated one and tried to see if kexec could work on it. I first used the KEXEC_KERNEL and KEXEC_INITRD parameters, but that failed since there is no kexec in the arm64 initramfs:

>>> kexec: kernel=/vmlinuz initrd=/initrd.img append=''
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: kexec: not found 

It wasn't really an issue, since I had already resolved to use the rootfs' kexec tool anyway (to have a more recent version), so I just enabled the INITRD_DROPBEAR SSH server and ran a script from there. And it worked. Well, mostly: the kernel booted, but it couldn't mount the rootfs, because it was looking for it in /dev/vda, which is the full block device, not the root partition /dev/vda1. This is due to a bad root= parameter on the kernel command line; it doesn't affect Scaleway's initramfs because it does clever things to find the root device.

After passing root=/dev/vda1, I finally had a working distro, with an up-to-date kernel.
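A minimal sketch of what such an initramfs-side script can look like, adapted from the ARMv7 one above; the exact paths and options are assumptions for illustration, not the script I actually ran:

#!/bin/sh
# Hypothetical arm64 variant, run from the INITRD_DROPBEAR shell; the initramfs
# ships no kexec binary, so the one from the mounted rootfs is used instead.
export PATH=/sbin:/usr/sbin:$PATH

cp /newroot/vmlinuz /newroot/initrd.img /
cp /newroot/sbin/kexec /

umount /newroot

# Keep the current command line, but point root= at the partition, not the whole disk.
/kexec -d -l --append="$(cat /proc/cmdline) is_in_kexec=yes root=/dev/vda1" --ramdisk=/initrd.img /vmlinuz
/kexec -d -e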

Tutorial

After installing kexec-tools, I added the following /boot.sh script:

#!/bin/sh
# Do nothing if we are already running the kexec'd kernel.
if grep -q is_in_kexec=yes /proc/cmdline; then
        exit 0
fi
# Load the distro's kernel and initrd and jump into them immediately (-f),
# keeping the current command line but fixing the root= parameter.
kexec -f --ramdisk=/initrd.img --append="$(cat /proc/cmdline) is_in_kexec=yes root=/dev/vda1" /vmlinuz

I don't use systemctl kexec, because it goes back to the initramfs, and kexec does not exist there…

And this kexec.service systemd unit (to be improved: it starts very late, and doesn't unmount filesystems or stop services):

[Unit]
Description=Boot to kexec kernel if needed

[Service]
Type=oneshot
ExecStart=/boot.sh

[Install]
WantedBy=network.target basic.target

And then enabled it with systemctl enable kexec.service. That's all that's needed to always boot to the distribution's shipped kernel!
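Putting it together, the whole setup boils down to a few commands; this assumes a Debian-based image and that the unit file above is saved as /etc/systemd/system/kexec.service:

apt-get install kexec-tools
chmod +x /boot.sh
systemctl daemon-reload
systemctl enable kexec.service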

Bug notes

During my tests, I ran into IRQ exceptions on reboot many times; the VM was then broken and needed a reboot through the API. During the last tests for this blog post, a reboot caused a permanent crash: even after using the API restart, the server was stuck in a transient state ("rebooting server"), preventing any other action. I hope a simple reboot in a VM can't crash the orchestrator or worse (the hypervisor), affecting other clients. Update: after I contacted Scaleway support, they gave me back access to the server; it was still rebooting endlessly, and I was able to restart it with the API. I'm guessing the hypervisor didn't crash, and probably no other customers were affected.

Also during my explorations, I accidentally accessed the boot menu on the server (using a keyboard shortcut on the serial console). I don't think that's an issue in itself, since the full EFI stack is emulated as well. It might be possible to configure the bootloader to directly boot the kernel I want, but I haven't explored this; it would probably require the EFI bootloader to understand virtio block devices, which may well be the case.

Conclusion

The boot time is quite slow with this solution, since the system has to boot twice (56 seconds before kexec, about 31 seconds after). Once the root= bug and the missing kexec in the arm64 initramfs are fixed, I'll be able to use the server tags and get a faster boot; otherwise I might publish an Ansible role to automate this process.

I also decided to migrate my services to the ARMv8 server, since it performs much better: +50% to +1300% on sysbench; only the threads and hackbench message-passing tests were slower, I'm guessing due to virtualization. It also has IPv6 available, if enabled.

Be careful though: these servers are often out of stock. I hadn't noticed, and I was lucky one was available when I provisioned mine, since they are no longer in stock in the Paris (par1) region; they are still available in the Amsterdam (ams1) datacenter (with low stock, though). There might be a trick to bypass the "out of stock" status, but I doubt it works reliably.
