Linux Security Summit Europe 2019

Notes on some of the talks

October 31, 2019

Following my ELCE notes, you'll find below my notes for some of the talks the Linux Security Summit Europe 2019.

Exploiting race conditions using the scheduler — Jann Horn

The first bug Jann talked about is a race between mremap and fallocate. mremap allows moving a memory mapping from an address to another. The associated page table entries are moved as well, and then the TLB is flushed.

fallocate allocates and de-allocates space for a file. It interactes with the page cache so it's possible to exploit a race between the page table modification and the TLB flushing.

To widen the race-exploitation window, the scheduler is used. The approach is to pin two own tasks to the same CPU, and set on of the task to idle (low) priority; it's then possible to interrupt the idle task, even when it's inside the kernel, simply by having the other task being active. This is used to have a syscall take much longer that needed, and make the race window also last longer.

The second example is a refcount decrement on struct file.

Both userfaultd and FUSE allow handling page faults in userspace.

The kcmp syscall, made for checkpoint/restore in userspace is useful for reliable Linux user-after-free exploitation. It compares two processes' kernel resources (file descriptors) to determine if they are sharing pointers.

Combined exploitation then creates a FUSE mapping, open a writable file, make a write that then blocks to FUSE write, then open another RO destination file and verify that we reuse the same file structure with kcmp() before then resolving the FUSE page fault and writing to the destination (forbidden) file.

The third bug uses getpidcon(); it exploits an issue in Android binder that used the caller PID to get the context, before then making a decision based on this.

Here the race window is widened by creating an artificial priority inversion inside the kernel, which mutexes are vulnerable to. This is done by interacting with the VFS's mutexes with userfaultd.

Kernel Runtime Security Instrumentation — KP Singh

Signals from Audit or perf can correlate with malicious activity, but not necessarily. LSM errors and denies are Mitigations that deny certain actions. You need both to have security.

To detect a new attack, you need both an Audit update (to log new events), and a mitigation update (e.g new LD_PRELOAD signature).

KP gave a few examples of Signals: for example a process that runs and deletes its binary. For Mitigations, the first example was to have a dynamic whitelist of known Kernel modules. The point is that both go hand in hand, and you need both.

KP says they want to implement KRSI with eBPF; LSM hooks should be implemented as eBPF programs. The LSMs are a good match because they map to security behaviours, are associated with kernel data structures with "blobs", and benefit from years of research and verification of the LSM system. It also benefits the LSMs because the framework can then get extended from security analysts feedback.

With this eBPF infrastructures, both actions are possible on an LSM action: returning an error and/or logging an audit event.

The new BTF format helps with eBPF programs because now the addresses of symbols are distilled in a compact format, allowing to read inside kernel data structures in a version and platform-independent way.

KP showed an example BPF program to do the denial of unlinking its own executable (or a parent, or another process's), or to logging a write /proc/self/mem.

Compared to the landlock patchset, KP says that KRSI has different roles: the goal is to do system-wide MAC, not unprivileged DAC.

Performance-wise, the latency impact of KRSI is three times lower than audit, and with less variation.

Address space for namespaces — Mike Rappoport

Address space isolation is on of the best protection methods since the invention of virtual memory, Mike says.

Page Table Isolation (PTI) is the first example of this type of isolation. There are other work in progress, like KVM address space isolation or process local memory.

Mike's group at IBM wants to improve container isolation; a way to do that can be to assign dedicated page tables to namespaces.

Previously, they tried something called System Call Isolation (SCI), to run system calls with very limited page tables, but it didn't provide enough guarantees, and had a heavy performance impact. Another thing they worked on was mmap(MAP_EXCLUSIVE) to create private memory mappings that can't be accessed by the rest of the system. A similar thing was done with a char device mapping : can transfer the rights with SCM_RIGHTS, and doesn't need a new page flag.

The Address spaces for namespaces approach would have processes within a namespace sharing a page table, but their mappings wouldn't be available outside of that namespace. A first version was built for netns in order to have an isolated network stack for network namespaces. Even the internal kernel pages and objects aren't visible outside of the namespace. It required extending alloc_page() and kmalloc().

The proof of concept implementation still does not work, but there is good hope that it can be improved.

Using a different LSM from the Host in a Container — John Johansen

The main use case to have a different LSM in a container is cross-distro container running: e.g, Fedora (selinux) inside a container on Ubuntu(apparmor), or an Android (selinux) inside something else.

LSM Stacking is very similar to what needs to be done here, so one can leverage it for this use case.

But an issue John found is that LSM order matters. His first tries with apparmor (main LSM), then selinux (stacked) failed to boot, because in Ubuntu dbus is built with both apparmor and selinux. It finds that selinux is enabled, and then when initializing apparmor, it thinks it already has an (empty) policy. That was the first gotcha.

Then, another gotcha was selinux was blocking apparmor code.

There also needs to be a way to namespace LSMs: inside a chroot (or container), you don't an applied policy to impact the whole system.

In order to do that, Apparmor implements virtualization of parameters, policy, etc. That's the way to be namespace aware, and it will need to be done in every LSM.

Another issue that was encountered was securityfs, which isn't multi-mount capable, so a non-container aware image can't boot because it would fail to mountit.

It's also not possible to use seccomp and no_new_privs. So to work around that, AppArmor support for no_new_privs with stacking was added: at lockdown, the apparmor confinement is saved, and then checked again whenever necessary to make sure it's enforced properly.

Container nesting is also another issue. There are specifically issues around user namespaces: for example there's no way to know when a user namespace is created; new LSM hooks are probably missing to fix these issues.

It's now possible to run an Ubuntu container on Fedora, with the help of LXD (handling mappings, etc.).

Keylime: Open Source software for remote trust — Luke Hinds

Traditionally, software trust used to reside in memory or disks. TPMs changed this by providing hardware-backed software trust sources. They are now ubiquituous, and used pervasively accross various industries.

The basic model relies on an RSA keypair, with the private key being inaccessible to software. A TPM can hash critical sections of firmware, a boot process, and then those hashes can be made public and verified with the public key.

Keylime is a remote trust framework using TPMs. TPM 2.0 and 1.2 are supported, but the latter will be deprecated. It provides Measured Boot (for firmware, shim, grub, kernel, etc.), it can measure secure boot (EFI DBX, MokListX, etc.), and an IMA Runtime for attestation. It' possible to have encrypted payload execution, depending on the previous measurements, which would then unwrap appropriate keys. Keylime also has a revocation framework.

In the target node (IoT / Edge platform / Provider Cloud), the Keylime Agent runs, with the TPM software stack. On the trusted infrastructure, the Keylime server runs, with these components: Verifier, Registrar, Revocation Service, and Keylime CA.

It's possible to have many nodes, and multiple verifiers combinations.

Enrollment relies on the TPM-backed hardware keys. Once it is done, the keylime infrastructure (Verifier and Registrar) will provide a way for a tenant to access an enrolled node, and send it a cryptographically secure payload.

With IMA, it's possible to have continous remote attestations (with polling). When an attacker triggers an IMA error, the Keylime Verifier will notice, and the revocation mechanisms can kick in. The Keylime Verifier will send a signed revocation event to all nodes, for example to revoke trust to the compromised node.

The Keylime project is active, has documentation and working CI. Luke would of course like to have even more contributors.

The agent is being ported from Python to Rust for performance and security reasons. Support for virtual TPM (vTPM) is being worked on as well, while still binding it to the hardware TPM's cryptographic trust.

Securing TPM Secrets with Intel TXT and Kernel Signatures — Paul Moore

Paul's goal is to store secrets in the TPM, and only allow access to these secrets to authorized kernels. He also wants this to work on both UEFI Secure boot systems and legacy BIOS systems.

Sealing TPM secrets against a set of PCR values. These values are measured against system state with secure boot, in the firmware, bootloader, kernel, etc.

For UEFI Secure boot, PCR 7 is a hash of the kernel signing authority; it's stable across kernel updates with the same signer.

With Intel TXT, the hardware and firmware create a dynamic root of trust; it uses the tboot bootloader, which makes a hash of the kernel image. But that hash isn't stable across updates, so the PCRs are unstable.

So Paul's solution to this is to extend tboot to support the UEFI PECOFF signature format. Its verification is rooted in the TPM; and it then updates the PCRs.

The prototype code has been released.

There's still need to work on command line and initrd verification, but the solutions that were applied with UEFI secure boot could work.

That's it for this edition of LSS EU!