ELCE, OS Summit and KVM Forum 2019 notes

Summary of some of the talks I attended

October 31, 2019

After a two-year hiatus, I couldn't miss this year's ELCE in Lyon, the first one held in France since Grenoble ten years ago, which coincidentally was one of my first conferences. The OS Summit and KVM Forum events are co-located and covered by the same ticket, which is nice, and is why I attended a few talks there as well.

Making device identity trustworthy with TPMs — Matthew Garrett and Brandon Weeks

For access to any internal Google Service (BeyondCorp), both the user and the device need to be authenticated. The devices that are allowed need to be well-known and inventoried.

A device identity needs to be unique (like a serial number), bound to hardware, and stable for the lifetime of the device. It should also be unforgeable and resistant to tampering.

Existing solutions are inadequate: self-identification can be forged, keys on disk can be duplicated, and trust bootstrapping is hard, especially remotely.

TPMs are dedicated chips that generate and store keys that never leave the hardware: they can't be extracted and duplicated to other hardware. Modern TPMs allow tracking the hardware manufacturer thanks to Endorsement Keys (EK): they provide proof that a TPM comes from a certain manufacturer. Attestation Keys (AK) are then used to prove that a key has been generated on a specific TPM.

An issue is that TPMs aren't directly tied to devices. One solution is runtime binding: run a script that extracts the TPM EK and places it in an internal database. The problem is that you don't know whether the device state can be trusted at that time, or whether the TPM is indeed internal to the device.

Binding at provisioning helps with this by reducing the window where a system could be compromised by having an IT officer do this operation before it's given to a user. But it's still dependent on having a well-functioning inventory system.

Binding at manufacture goes even further: it maps a given device with a TPM with the help of Platform Certificates. Those are sent out-of-band at device ordering.

The lifecycle then looks like this:
- first, the device is provisioned: at this point an Attestation Key is generated by the TPM, and both the EK and AK are registered through the Attestation CA
- then, a Client certificate is issued: it is signed by the AK
- the client certificate is then provided to the access gateway as part of mutual TLS auth to access services.
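To make the bookkeeping behind this lifecycle more concrete, here is a toy Python model of the inventory side. It is only a sketch: the Inventory class, its method names and the byte strings standing in for public keys are all hypothetical, no real TPM is involved, and the cryptographic proofs that tie an AK to an EK are left out entirely.

```python
import hashlib


def fingerprint(public_key: bytes) -> str:
    """Stable identifier for a public key (here simply its SHA-256 digest)."""
    return hashlib.sha256(public_key).hexdigest()


class Inventory:
    """Hypothetical device inventory binding TPM keys to serial numbers."""

    def __init__(self):
        self._ek_to_serial = {}  # EK fingerprint -> device serial
        self._ak_to_ek = {}      # AK fingerprint -> EK fingerprint

    def bind_device(self, serial: str, ek_public: bytes) -> None:
        # Binding step (at runtime, at provisioning or at manufacture).
        self._ek_to_serial[fingerprint(ek_public)] = serial

    def register_ak(self, ek_public: bytes, ak_public: bytes) -> None:
        # Provisioning: only accept an AK for an EK that is already known.
        ek = fingerprint(ek_public)
        if ek not in self._ek_to_serial:
            raise KeyError("unknown endorsement key")
        self._ak_to_ek[fingerprint(ak_public)] = ek

    def device_for_ak(self, ak_public: bytes):
        # Gateway side: map the AK behind a client certificate to a device.
        ek = self._ak_to_ek.get(fingerprint(ak_public))
        return self._ek_to_serial.get(ek)
```

The point of the model is that the gateway never trusts what the device says about itself: it only follows the chain from AK to EK to an inventoried serial number.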

Another possibility, once you know how to bootstrap device trust, is to use TPM-backed trust to authenticate services, for example.

The low level code, including Platform Certificates parsing has been released here.

What's new in Buildroot — Thomas Petazzoni

Two years ago, LTS support was added: the February release is maintained for a year with security updates and bug fixes.

Internal toolchain support has been updated, with new gcc, binutils, and various libc versions. They are now tested automatically for architectures supported by qemu. External toolchains have been updated as well. It's also now possible to declare external toolchains from BR2_EXTERNAL.

Two new common package infrastructures were added for go and meson packages.

Git caching has been improved, for git-fetched packages; as well as the whole package download infrastructure which has been rewritten.

Many packages were updated, and added; a few obsolete ones have been removed.

New global options have been added to force building with security-related options (relro, stack protection, etc.)

A new make show-info target was added to dump the state of enabled packages as JSON, to be used by external tools.

Work has been done to improve the testability of reproducible builds, as part of a GSoC project: if BR2_REPRODUCIBLE=y, the build is done twice, and the outputs of the two are compared with diffoscope.
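The idea of "build twice and compare" can be sketched in a few lines of Python. This is not what Buildroot or diffoscope actually do (diffoscope inspects the contents of archives and binaries in depth); it is a minimal stand-in that only hashes files, and the function names are made up for this example.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Map each file's relative path to the SHA-256 of its contents."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def differing_files(build_a, build_b):
    """Return the sorted relative paths that are missing or differ between two trees."""
    a, b = tree_digest(build_a), tree_digest(build_b)
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))
```

An empty result means the two output trees are bit-for-bit identical; anything else is a file worth inspecting with a proper tool.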

Work has also continued to improve parallel builds; one of the last series on the subject is on the per-package directories for HOST.

The runtime test infrastructure has been improved to add more tests. The tooling around buildroot has been augmented with the support of release-monitoring.org to track packages that are outdated.

Boot time optimization with systemd — Chris Simmonds

systemd runs as init, so it's PID 1. It launches and monitors daemons, configures stuff, etc.

For embedded systems, systemd is a much bigger init system, with 50 binaries and a 34MB footprint. It supports many features: event logging with journald, user login with logind, device management with udevd, etc.

systemd has many features for resource control, gives you parallel boot for free, can boot a system without a shell, etc. It has units (a generic type), services (a given job), and targets (a group of services, e.g. a runlevel).

systemd searches for units first in /etc/systemd/system for local configuration, then in /run/systemd/system for runtime config, then in /usr/lib/systemd/system .
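The lookup order described above amounts to "first match wins". Here is a small Python sketch of that rule, restricted to the three directories named in the talk (the real search path has more entries); the find_unit helper and its root parameter are hypothetical and only exist to make the example testable.

```python
from pathlib import Path

# Directories in the order given above; the first match wins.
UNIT_DIRS = ["/etc/systemd/system", "/run/systemd/system", "/usr/lib/systemd/system"]


def find_unit(name, root="/", unit_dirs=UNIT_DIRS):
    """Return the first existing path for unit `name`, or None if not found."""
    for directory in unit_dirs:
        candidate = Path(root) / directory.lstrip("/") / name
        if candidate.exists():
            return candidate
    return None
```

This is why dropping a file of the same name in /etc/systemd/system is enough to override a unit shipped by the distribution.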

Units can depend on each other, with three types of dependencies: Requires: describes a hard dependency; Wants: is a weaker one, meaning the unit won't be stopped if the dependency fails; and Conflicts: describes a negative dependency between units that cannot run together.

systemd also provides another concept: ordering. Before: and After: determine when a unit is started. It's used for example when starting a web server unit after network.target. Without ordering, units are started in no particular order.
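Ordering constraints like these form a graph, and a valid start order is a topological sort of it. The Python sketch below illustrates only that constraint, using the standard graphlib module (Python 3.9+); systemd itself starts independent units in parallel, and webserver.service is a made-up unit name.

```python
from graphlib import TopologicalSorter


def start_order(after):
    """`after` maps each unit to the set of units it must be started after."""
    return list(TopologicalSorter(after).static_order())


# Hypothetical example: a web server ordered after network.target.
units = {
    "webserver.service": {"network.target"},
    "network.target": set(),
}
order = start_order(units)
```

A Before: relation is simply the same edge declared from the other side, so it could be folded into the same mapping.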

At boot, systemd starts the default.target. On most systems, this is by default a symbolic link to the multi-user.target.

It's also possible to describe a reverse dependency with WantedBy:, which is used to add services to be started by a target for example: WantedBy: multi-user.target. This is implemented by creating a symbolic link in the multi-user.target.wants directory.
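The symbolic link mechanism is simple enough to sketch in Python. This is only an illustration of the directory layout described above, not a replacement for systemctl enable; the enable_unit function and its config_dir parameter are hypothetical.

```python
import os
from pathlib import Path


def enable_unit(unit_path, target, config_dir):
    """Create <config_dir>/<target>.wants/<unit name> as a symlink to unit_path."""
    wants_dir = Path(config_dir) / f"{target}.wants"
    wants_dir.mkdir(parents=True, exist_ok=True)
    link = wants_dir / Path(unit_path).name
    if not link.is_symlink():
        os.symlink(unit_path, link)
    return link
```

Removing the link is all it takes to stop the target from pulling the service in, which is handy when trimming a generic image.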

systemctl is the cli tool used to interface with systemd at runtime.

How to reduce boot time, then?

Boot time is defined as the time from power-on to the critical application running.

When using a generic system image (yocto, debian), those are designed conservatively to cater to all common cases. So to reduce boot time, one should make it less generic, either by disabling services, or reducing their dependencies.

The main tool to optimize boot time is systemd-analyze, which gives a summary of the boot time; systemd-analyze blame lists units ordered by start-up time. The most important is systemd-analyze critical-chain, which shows the time spent in the units on the critical path.

In an example, Chris showed that the critical chain depended on a timeout caused by a non-existent ttyGS0; removing the associated getty unit saved a lot of time. Changing the default target and disabling unused daemons also helped a lot.

Other useful features in embedded systems

The watchdog is a very useful feature of systemd: if a service does not reply to the watchdog, it can be restarted automatically. It's even possible to force a reboot if the watchdog has been triggered more than a given number of times.
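From the service side, replying to the watchdog means sending a small datagram to the socket named by the NOTIFY_SOCKET environment variable. Below is a minimal Python sketch of that protocol, assuming the unit has WatchdogSec= set; the function names are mine, and a real service would call ping_watchdog() periodically from its main loop.

```python
import os
import socket


def sd_notify(message):
    """Send a notification datagram to systemd; return False if not supervised."""
    address = os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):
        # A leading '@' denotes a Linux abstract socket.
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), address)
    return True


def ping_watchdog():
    """Tell systemd that the service is still alive."""
    return sd_notify("WATCHDOG=1")
```

If the pings stop arriving within the configured interval, systemd considers the service hung and applies its restart policy.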

Resource limits like CPU and memory limiting can also be very useful; this is implemented through cgroups.

Crypto API: using hardware protected keys — Gilad Ben Yossef

In the Linux crypto API, there are transformation providers that can use dedicated hardware, specialized instructions, or a software implementation. These are used by the crypto user API, dm-crypt, or IPsec, for example.

The crypto API is used in multiple steps:
- crypto_alloc_skcipher, for example to get an xts(aes) transformation handle
- set the key on the tfm
- get a request handle
- set the request callback
- set input, output, IVs
- etc.

Transformation providers have a generic name (the algorithm), a driver name (the implementation), and a priority, used to pick one implementation when several provide the same algorithm. There are other properties describing the synchronicity, min/max key size, etc.

The key is usually just stored in RAM, like everything else. This makes it vulnerable to various key-extraction attacks. It should be possible to have a transformation provider that supports a hardware-backed key.
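The naming and priority scheme can be modelled with a tiny registry. This Python sketch is not kernel code: the Provider class, the select function, the driver names and the priority values are all invented for illustration. It only shows the rule that a lookup by generic name returns the highest-priority implementation, while a lookup by driver name returns that exact implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    generic_name: str  # the algorithm, e.g. "xts(aes)"
    driver_name: str   # one specific implementation
    priority: int      # higher wins when looked up by generic name


def select(providers, name):
    """Pick a provider by exact driver name, else the highest-priority generic match."""
    for p in providers:
        if p.driver_name == name:
            return p
    matches = [p for p in providers if p.generic_name == name]
    return max(matches, key=lambda p: p.priority) if matches else None


# Driver names and priorities below are made up for illustration.
providers = [
    Provider("xts(aes)", "xts-aes-software", 100),
    Provider("xts(aes)", "xts-aes-hw-example", 300),
]
```

Asking for a driver name explicitly is what makes it possible to target one particular piece of hardware instead of whatever implementation ranks highest.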

It was implemented a few years ago for IBM mainframes, which means that the infrastructure could be reused for embedded devices.

In the implementation, it means the user of the API passes a tag instead of the key bytes. The tag identifies a key stored inside a secure domain. The tag can be an index, or an encrypted key in the case of a key ladder.

In practice the security of this key depends on the security of the secure domain (hardware or software, e.g. a TEE), its provisioning, etc.

The cipher's name is prefixed with 'p', for example "paes", for protected key. Because the tag value is specific to hardware implementation, when requesting a cipher, the specific name of the driver is used instead of just the algorithm name.

When instantiating it with dm-crypt, one should use the crypto API driver name and, instead of the key, a tag describing the key (e.g. a key slot).

A future challenge is that TPMs act very much in the same way, yet aren't using the same API.

iwd - state of the union — Marcel Holtmann

iwd 1.0 was released on the day of the talk (October 30th, 2019).

Marcel says Wi-Fi on Linux sucks. It's because the roles are split between many projects (kernel, wpa_supplicant, dhcpcd, network manager, etc.), and there's still a lot of code to write on top of this to ship a consumer product.

iwd's goal is to consolidate the Wi-Fi information in one place, which is then used by network-manager. The goal is to only have one entity interacting with nl80211 for better performance.

For example, when you wake up your laptop, you don't want to rescan the ever-growing list of channels before re-joining a network.

In addition to being the central known-network database, iwd has many features:
- it has optimized scanning, since it's the only daemon doing scanning in a system
- it can do enterprise provisioning
- it supports fast roaming and transitions
- it supports WPA3 and OWE (Opportunistic Wireless Encryption), and no UI change was needed to add this support
- there's an integrated EAP engine that uses the kernel keyring system
- it supports the Hotspot 2.0 spec
- push-button methods work (WPS, etc.)
- address randomization is supported
- AP mode is supported to do tethering

Enterprise provisioning can be very complex. Most OSes have a lot of settings that are hard to manage, etc. With Windows 10 and iOS there's now a downloadable configuration file, like for OpenVPN for example.

iwd now supports configuration files with embedded certificates so that everything can be in a single file. An enterprise admin can now provide this configuration, the user installs it, and connects to the network. This format is documented in the manpage man iwd.network.5. Unfortunately, there's still no standard for Wi-Fi provisioning, and Marcel wants to address that.

Marcel says that in some cases, just the overhead of communicating with other daemons (systemd-networkd, connman or network-manager) in order to trigger DHCP is too big. Some systems also don't necessarily have those daemons. That's why iwd added experimental support for a built-in DHCPv4 client. This is documented in iwd.config.5.

The goal with iwd is to complete a connection in 100ms or less (with an IP address). Right now, it's not there yet. PAE in the kernel nl80211 interface helps reduce this. Address randomization adds 300ms on top of this. In Android, it can add up to a 3s penalty, because with Linux one needs to power down the phy and power it up again. There's work in the kernel to reduce this time as well, but it's not there yet, Marcel says.

iwd does not depend on wpa_supplicant, and has improved a lot.

Marcel says they have reached the limit of what is possible to improve inside iwd. There needs to be other features in nl80211 to continue doing optimizations.

iwd has 40k SLOC, which might seem like a lot, but is only a tenth of wpa_supplicant.

There are other daemons in the works: ead for ethernet authentication; its code is in the repo, and still being worked on.

apd, the access point daemon is still private and being prototyped and should land next year in the repo.

rsd, a resolving service daemon, is pretty much a replacement for systemd-resolved; the DNS part is quite tricky according to Marcel; the goal is to be able to choose the correct path (e.g. through a proxy or not) for a given URL. It's not planned to be released anytime soon though.

VirtIO without the Virt: towards implementations in hardware — Michael Tsirkin

virtio enables reuse of drivers that are already in the OS. There are already many types of devices that are supported.

Hardware implementations make it easier to write userspace drivers, which can also be simpler. Another motivation for hardware virtio would be passthrough for performance, while retaining the advantages of software implementations.

If there is a precise virtio spec, when a bug happens it's possible to find if it's the fault of the driver or the card. If you have hardware, you can switch to a different card or software implementation to find out.

Virtio feature negotiation allows implementing only certain features in the driver or hardware, and then using only the intersection.
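Since features are advertised as bits, using "only the intersection" boils down to a bitwise AND. Here is a short Python sketch of that idea; the bit numbers are the ones from the virtio 1.1 specification, while the helper functions are made up for this example and leave out the actual negotiation handshake.

```python
# Feature bit numbers as defined in the virtio 1.1 specification.
VIRTIO_F_VERSION_1 = 32
VIRTIO_F_ACCESS_PLATFORM = 33
VIRTIO_F_RING_PACKED = 34
VIRTIO_F_ORDER_PLATFORM = 36


def mask(*bits):
    """Build a feature bitmask from individual bit numbers."""
    value = 0
    for bit in bits:
        value |= 1 << bit
    return value


def negotiate(device_features, driver_features):
    """Only features offered by the device AND understood by the driver are used."""
    return device_features & driver_features


def has(features, bit):
    """Check whether a given feature bit is set."""
    return bool(features & (1 << bit))
```

For instance, a device offering packed rings to a driver that doesn't know about them simply ends up using the split ring layout.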

For virtio-net, the virtualized hardware uses PCI, so it's possible to forward guest access to real hardware by giving it access directly to hardware memory range for example.

The virtio ring has a standard lockless access model that looks a lot like the DMA systems that hardware vendors are used to implementing.

Depending on the hardware, there might be cache coherency issues, which means that hardware needs different feature flags, in particular VIRTIO_F_ORDER_PLATFORM and VIRTIO_F_ACCESS_PLATFORM. Version 1 needs to be implemented as well, without the legacy interface.

The simplest way to implement hardware virtio is to just use passthrough; for example, if a network card implements the spec properly, just pass-through everything. Another possibility is to only have data path offloading: the control path is intercepted in an mdev driver.

It would also be possible to do partitioning in the last case, by tagging requests for a given virtqueue depending on each VM if we want to share a device between VMs. Another use of the mdev driver is for migration: force it to a matching subset of features between two machines, and then do the migration between the two transparently.

If there are device quirks, the best way to address that is to use feature bits instead.

The Virtio 1.2 spec is planned to be frozen by the end of November 2019. When adding a device to the spec, an ID should be reserved, as well as a flag for new features.

Authenticated encryption storage — Jan Lübbe

It's possible to integrate authentication and encryption at various layers of the storage stack, from userspace, filesystems and VFS to device-mapper.

With dm-verity, a tree of hashes is built, and the root hash is provided out-of-band (kernel command line), or via signature in super block since Linux 5.4. It's the best choice for read-only data.
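The "tree of hashes" is a Merkle tree: each block is hashed, hashes are grouped and hashed again, until a single root remains. The Python sketch below shows the principle only; it does not reproduce dm-verity's on-disk format, and the function name, block size and fanout are arbitrary choices for this example.

```python
import hashlib


def root_hash(data, block_size=4096, fanout=2):
    """Compute a Merkle root over fixed-size blocks of `data`."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)] or [b""]
    # Leaf level: one hash per data block.
    level = [hashlib.sha256(block).digest() for block in blocks]
    # Upper levels: hash groups of `fanout` child hashes until one remains.
    while len(level) > 1:
        level = [
            hashlib.sha256(b"".join(level[i:i + fanout])).digest()
            for i in range(0, len(level), fanout)
        ]
    return level[0].hex()
```

Changing a single byte anywhere in the data changes the root, so trusting the root hash is enough to verify any block read later.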

dm-integrity arrived in Linux 4.12, and provides integrity for writing as well. There's one metadata block for n data blocks, and they are interleaved. It needs additional space, and has a performance overhead, because each write happens twice due to journalling (for both data and metadata), which protects against power loss.

dm-crypt handles sector-based encryption with multiple algorithms. It's length preserving, which means that data cannot be authenticated. It's a good choice for RW block devices without authentication.

Recently, dm-crypt added support for authentication as well with AEAD cipher modes. But it authenticates individual sectors, so replay is possible (is it the latest version?). The recommended algorithm is AEGIS-128-random.

fsverity is now "dm-verity for files", and has been integrated into ext4. A single (large) file has a root hash (provided out-of-band), and once written, is then immutable. The biggest user is likely Android for .apk files.

fscrypt applies the same idea to encryption, at the file level. It's interesting for a multi-user system where each user has their own keys. It's possible to mount a filesystem and remove files without having the keys. It also has no authentication.
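To see why per-sector authentication still allows replay, here is a small Python sketch. It uses an HMAC over the sector number and data as a stand-in for the AEAD tag (this is my simplification, not what dm-crypt does internally), and the seal and verify names are made up.

```python
import hashlib
import hmac


def seal(key, sector_no, data):
    """Return a tag binding `data` to its sector number."""
    msg = sector_no.to_bytes(8, "little") + data
    return hmac.new(key, msg, hashlib.sha256).digest()


def verify(key, sector_no, data, tag):
    """Check that `tag` matches `data` at `sector_no`."""
    return hmac.compare_digest(seal(key, sector_no, data), tag)
```

Moving a sector elsewhere is detected, because the sector number is part of the authenticated message. But an attacker who saved an old (data, tag) pair for a sector can write it back later, and it still verifies: nothing ties the sector to the current state of the rest of the device.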

Since Linux v4.20, UBIFS can provide authentication. The root hash is authenticated via HMAC or signature since Linux 5.3. It's the only FS that authenticates both data and metadata. It's the best choice for raw NAND/MTD devices.

ecryptfs is a stacked filesystem (mounts on top of another fs), and was used by Ubuntu at some point for per-user home directory encryption, but has now been superseded by fscrypt.

IMA/EVM was initially developed for remote attestation with TPMs, and uses extended attributes. It protects from file data modification, but it is vulnerable to directory modifications (file move/cp).

Master key storage is also problematic, and platform dependent. Many SoCs provide key wrapping to encrypt secrets per-device, but it needs a secure boot chain. Other possibilities include using a TPM or (OP-)TEE.

Authenticated writable storage can only detect offline attacks, not runtime ones, so a read-only part of the system (recovery) is needed in order to be able to restore the system to a good state.

How do you analyze problems on devices that are returned from the field? There might be the need for a mode to erase the keys (protects the data) and disable authenticated boot (for HW analysis).

That's it?

If you want more notes, I invite you to read Arnout Vandecappelle's at Mind Embedded Development's blog.

I'm also attending Linux Security Summit, so stay tuned!