Kernel Recipes 2017 day 2 notes

Day 2 of the best open kernel conference in the world

September 28, 2017

This is a continuation of yesterday's live blog of Kernel Recipes 2017.

Linux Kernel Self Protection Project

by Kees Cook

Presentation slides

The aim of the project is more than protecting the kernel.


Kees' motivation for working on Linux is the two billion Android devices running Linux. The majority of those are running a 3.4 kernel.

CVE lifetimes — the time between bug introduction and fix — are pretty long, averaging many years.

Kees says the kernel team is fighting bugs, they are finding them, but just doing that isn't enough. The analogy Kees gave was that the Linux security is in the same place the car industry was in the 60s, where most work done was on making sure the car worked, but not necessarily that they were safe.

Killing bug classes is better than simply fixing bugs. There's some truth in the upstream philosophy that all bugs might be security bugs. Shutting down exploitation targets and methods is more valuable in the long term, even if it has a development cost.

Modern exploit chains are built on a series of bugs, and just breaking the chain at one point is enough to stop or delay exploitation.

There are many out-of-tree defenses that have existed over the years: PaX/grsec, or the many articles presenting novel methods that were never merged upstream. Being out-of-tree is not anything special, since the development mode in Linux is to fork. Distros integrate custom mitigations, like Red Hat's ExecShield, Ubuntu's AppArmor, grsecurity or Samsung's Knox for Android.

But in the end, upstreaming is the way to go, Kees says. It protects more people and reduces maintenance cost, letting developers focus on new work instead of playing catch-up.

Many defenses are powerful only because they're not the default and aren't widely examined. Kees gave the example of custom email server configurations that were very effective at fighting spam because they're not the default; otherwise the spammers would adapt.

Kees then showed another example with grsecurity, where the stack clash protection was not upstreamed, not reviewed, and was in the end weaker than the solution finally merged upstream.

Kernel self protection project

In 2015, Kees announced this project because he realized he wouldn't be able to do all the upstreaming work by himself. It is now an industry-wide project, with many contributors.

There are various types of protections: probabilistic protections reduce the probability of success of an exploit, while deterministic protections completely block an exploitation mechanism.

Stack overflow and exhaustion is an example of a bug class that was closed down upstream with vmap stack. Kees is still porting a PaX and grsecurity gcc plugin to work on that. The stack canary is essential as well, Kees said. For instance, it mitigates the latest BlueBorne vulnerability.

Integer over/underflow protection went into the kernel with the new refcount patches. Buffer overflows are mitigated upstream through hardened usercopy or the recent FORTIFY_SOURCE integration. Format string injection was mitigated in 3.13 when the %n format option was completely removed.
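To make the refcount idea concrete, here is a toy model of why a saturating counter is safer than a wrapping one. It is only a sketch: the kernel's refcount_t is written in C, and the 32-bit limit and function names below are assumptions chosen for illustration.

```python
# Toy model only: the kernel's refcount_t is C, and its exact saturation
# value and warning behaviour differ from what is shown here.
UINT32_MAX = 0xFFFFFFFF


def wrapping_inc(count: int) -> int:
    """Plain 32-bit counter: silently wraps to 0 on overflow."""
    return (count + 1) & UINT32_MAX


def saturating_inc(count: int) -> int:
    """Saturating counter: once at the limit, it stays pinned there."""
    return count if count == UINT32_MAX else count + 1


def saturating_dec(count: int) -> int:
    """A saturated counter is never decremented again, so the object is
    leaked rather than freed while references may still exist."""
    if count == UINT32_MAX:
        return count
    if count == 0:
        raise ValueError("underflow: decrement of a zero refcount")
    return count - 1
```

With the wrapping counter, an attacker who can trigger enough increments brings the count back to 0 and gets the object freed while it is still in use; the saturating variant trades that for a memory leak.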

Kernel pointer leaks aren't entirely plugged, despite various fixes. Uninitialized variable use was mitigated by porting the structleak PaX gcc plugin. Kees says it's more than an infoleak, and that it might be exploitable in some cases.

Use-after-free was mitigated with page zero poisoning in Linux 4.6, and freelist randomization in 4.7 and 4.8.


A basic step in exploitation is to find the kernel in memory (e.g. through kernel pointer leaks). To mitigate this, there are various types of kASLR, or the ported grsecurity randstruct plugin.

A very basic protection is to make sure executable memory cannot be writable, and this was merged for various architectures a long time ago.

Function pointer overwrite is a very standard exploitation method, and this was mitigated by the PaX constify plugin, and then the __ro_after_init annotation in the kernel.

Mitigating userspace execution is still a work in progress on x86, but arm64 already has fixes for that.

The next stages are mitigating user data reuse and reused code chunks (ROP); PaX has a closed-source technology, RAP, to do this.

Understanding the Linux Kernel via ftrace

by Steven Rostedt

Steven started by saying that this talk is really fast, and you should watch it three times to understand it.

Ftrace is an infrastructure with several features. Ftrace is the exact opposite of security hardening: it gives visibility into the kernel, provides the instrumentation used for live kernel patching, and of course for rootkits.

Ftrace is already in the kernel. It was initially accessed through debugfs, but it now has its own filesystem, tracefs, usually mounted at /sys/kernel/tracing. All files and even documentation are in there, so it's usable through echo and cat, because Steve wanted busybox to be enough to control these features. This is where the files described in the rest of the talk live.

The basic file is trace, showing the raw data. Then there's available_tracers. The default tracer is the nop one, which does nothing. The most interesting one is the function tracer, which shows every called function in the kernel. The most beautiful one, according to Steve, is the function_graph tracer, which follows the call graph.

The tracing_on file controls writes to the ring buffer. When it is off, the tracing infrastructure is still in place, but the ring buffer isn't filled with data. It's there for temporary pauses of tracing.
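Since everything is plain files, the echo and cat workflow translates directly to any language. The sketch below wraps it in Python; the current_tracer file is the one I believe is used to select a tracer (the notes don't name it), and the helper names and the root parameter are my own.

```python
from pathlib import Path

# Assumed mount point; older setups expose the same files under
# /sys/kernel/debug/tracing instead.
TRACEFS = Path("/sys/kernel/tracing")


def write_control(name: str, value: str, root: Path = TRACEFS) -> None:
    """Equivalent of `echo value > root/name`."""
    (root / name).write_text(value + "\n")


def read_control(name: str, root: Path = TRACEFS) -> str:
    """Equivalent of `cat root/name`."""
    return (root / name).read_text()


def capture(tracer: str, workload, root: Path = TRACEFS) -> str:
    """Select a tracer, record only while workload() runs, return the trace."""
    available = read_control("available_tracers", root).split()
    if tracer not in available:
        raise ValueError(f"{tracer!r} not in available_tracers: {available}")
    write_control("tracing_on", "0", root)       # pause ring buffer writes
    write_control("current_tracer", tracer, root)
    write_control("tracing_on", "1", root)       # resume ring buffer writes
    try:
        workload()
    finally:
        write_control("tracing_on", "0", root)   # pause again when done
    output = read_control("trace", root)         # read before switching tracer
    write_control("current_tracer", "nop", root)
    return output
```

Run as root, something like `print(capture("function_graph", lambda: open("/etc/hostname").read()))` should show the call graph recorded during that read, along with whatever else the system did in the meantime.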

There are a few files that let you filter what ftrace records: set_ftrace_filter, for example, matches function names and supports glob matching, appending, and clearing.
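The replace, append and clear behaviours map onto how the file is opened, which is what the shell's `>` and `>>` redirections do. A minimal sketch, assuming tracefs is mounted at /sys/kernel/tracing (the function names are mine):

```python
from pathlib import Path

TRACEFS = Path("/sys/kernel/tracing")  # assumed mount point


def set_filter(patterns, root: Path = TRACEFS, append: bool = False) -> None:
    """Write function-name patterns (globs allowed) to set_ftrace_filter.

    Opening for write truncates, which replaces the current filter (like
    `echo pat > set_ftrace_filter`); opening in append mode adds to it
    (like `echo pat >> set_ftrace_filter`).
    """
    mode = "a" if append else "w"
    with open(root / "set_ftrace_filter", mode) as f:
        for pattern in patterns:
            f.write(pattern + "\n")


def clear_filter(root: Path = TRACEFS) -> None:
    """Truncate without writing anything: every function is traced again."""
    open(root / "set_ftrace_filter", "w").close()
```

For example, `set_filter(["vfs_*"])` followed by `set_filter(["*_lock"], append=True)` would trace both families of functions, and `clear_filter()` removes the restriction.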

The file available_filter_functions shows the available functions; it does not include all kernel functions, depending on gcc instrumentation: inlined functions and functions annotated as non-traceable (timers, ftrace itself, boot-time code) are missing.

When using the function tracer, it shows the function calls as well as the parent.

The filter file set_ftrace_pid limits tracing to functions executed by a given task. If you have multiple threads, it's the thread id.

To trace syscalls, you need to know that the definition macros add a sys_ prefix to the syscall names. If you want to trace the read syscall, you should trace the SyS_read function, because the upper case function comes first. You can find it in the available_filter_functions file.

The set_graph_function filter helps when you want to trace starting from a given point, and follow the call graph across function pointer boundaries, giving you insight that's harder to get with just the code. Steven gave an example with the sys_read syscall, where you can know exactly which function is called, even when you have the file_operations structure making code reading harder, but the graph is very clear. You can combine this with set_ftrace_notrace to set a boundary of functions or set_graph_notrace for call graphs you're not interested in, to ease reading the call graph and reduce the ftrace performance impact.

There are many options in the options directory or the trace_options file. Steven likes the func_stack_trace option: it creates a stack trace of traced functions. Be careful: if you don't set a filter, it's going to bring your machine to its knees. Also remember to turn it off when done. The sym-offset and sym-addr options show the function relative and absolute locations in memory.

When you set a filter starting with :mod:module_name, it will trace all the functions in a given module.

Function triggers are useful when you want to start tracing, stop tracing, or even add a stacktrace when a function is hit. For example, you set a filter with function_name:stacktrace, and it will give you a stacktrace every time this particular function is called.
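The trigger syntax is simple enough to build programmatically. This sketch only formats the string that would then be written to set_ftrace_filter; the list of commands and the optional trailing count are what I recall from the ftrace documentation, so check them against your kernel version.

```python
from typing import Optional

# Commands I believe set_ftrace_filter accepts after the colon; this set
# is an assumption, not taken from the talk.
KNOWN_COMMANDS = {"traceon", "traceoff", "stacktrace", "snapshot", "dump", "cpudump"}


def function_trigger(function: str, command: str, count: Optional[int] = None) -> str:
    """Build a `function:command[:count]` trigger string."""
    if command not in KNOWN_COMMANDS:
        raise ValueError(f"unknown trigger command: {command!r}")
    trigger = f"{function}:{command}"
    if count is not None:
        if count < 1:
            raise ValueError("count must be a positive integer")
        trigger += f":{count}"   # fire at most `count` times
    return trigger
```

`function_trigger("schedule", "stacktrace")` gives the form used in the talk, and `function_trigger("schedule", "traceoff", 1)` would stop tracing the first time schedule is hit.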

You might not want to see the function graph of interrupts that fire while you are tracing: turning off the default-on funcgraph-irqs option hides it.

It's possible to limit the graph depth of the function_graph tracer with the max_graph_depth file.

You can also trace with events. The events are listed by subsystem in the events directory. The most commonly used ones are the sched, irq or timer families of events. You enable events separately from the specific tracers. If you only want events, use the nop tracer, but this can be combined with the others.
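Events follow the same file-based pattern, with one enable file per event and one per subsystem. A small sketch, assuming the usual events/<subsystem>/<event>/enable layout (the helper names are hypothetical):

```python
from pathlib import Path
from typing import Optional

TRACEFS = Path("/sys/kernel/tracing")  # assumed mount point


def event_enable_path(subsystem: str, event: Optional[str] = None,
                      root: Path = TRACEFS) -> Path:
    """Path of the enable file for one event, or for a whole subsystem."""
    base = root / "events" / subsystem
    return base / event / "enable" if event else base / "enable"


def set_event(subsystem: str, event: Optional[str] = None, on: bool = True,
              root: Path = TRACEFS) -> None:
    """Equivalent of `echo 1 > events/<subsystem>/<event>/enable` (or 0)."""
    event_enable_path(subsystem, event, root).write_text("1\n" if on else "0\n")
```

`set_event("sched", "sched_switch")` enables a single event, while `set_event("irq")` turns on the whole irq family.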

There are two useful options to control event and function tracing: event-fork and function-fork make tracing follow the children of a traced process.

Finally, Steve introduced the trace-cmd program, which wraps all the custom echos and cats in a single program. trace-cmd has nice tricks to make sure you only stack-trace a single function, and can do everything the raw files can with a simpler interface.

Introduction to Generic PM domains

by Kevin Hilman

Two years ago, Kevin did an introduction on various power management subsystems at Kernel Recipes. This talk focuses on PM domains.

The driver model starts with the struct dev_pm_ops. You control the global system suspend through /sys/power/state, and this then calls the appropriate driver callbacks. It's very powerful, but also fragile since any driver failing will stop the whole chain. This is static power management or system-wide suspend.

The focus of this talk is dynamic power management, in particular for devices.

Dynamic power management

It starts with runtime PM, a per-device idle mode, one device at a time. It's handled by the driver, based on activity. In this mode, devices are independent, and one device cannot affect other devices. When using powertop, the "device stats" tell you how long your device is idle.

The runtime PM core keeps a usage count for driver uses. When the count hits 0, the core calls runtime_suspend on a device. If you have a device on a bus_type, it sits between you and the runtime PM core. In driver callbacks, one can ensure context is saved, and the wakeups are enabled, restore context on resume, etc.
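The usage-count mechanism can be summarised with a toy model. This is not the kernel API: the class and its get/put methods are hypothetical stand-ins for the runtime PM core, and the callbacks stand in for the driver's runtime_suspend and runtime_resume.

```python
class RuntimePmDevice:
    """Toy model of a device managed by runtime PM (not the kernel API)."""

    def __init__(self, name, runtime_suspend, runtime_resume):
        self.name = name
        self.usage_count = 0
        self.suspended = True
        self._runtime_suspend = runtime_suspend   # driver callback: save context
        self._runtime_resume = runtime_resume     # driver callback: restore context

    def get(self):
        """A user starts using the device: resume it if it was idle."""
        self.usage_count += 1
        if self.suspended:
            self._runtime_resume(self)
            self.suspended = False

    def put(self):
        """A user is done: suspend the device once nobody uses it."""
        if self.usage_count == 0:
            raise RuntimeError("unbalanced put()")
        self.usage_count -= 1
        if self.usage_count == 0:
            self._runtime_suspend(self)
            self.suspended = True
```

Two nested get() calls resume the device only once, and only the second put() triggers the suspend callback.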

PM domains map the power domains of modern SoCs onto the Linux kernel: in such SoCs, hardware blocks are grouped in domains that can be turned on and off independently.

PM domains are similar to bus types in the kernel, but orthogonal, since devices in the same domain might sit on different buses.


Generic PM domains (genpd) are the reference implementation of PM domains; they handle the grouping, and the actions to take when a device becomes idle or active.

In order to implement a genpd, you first implement the power_on/power_off functions. This typically means messaging a power domain controller on a separate core, but might be related to clock management or voltage regulators. This is then described in a Device Tree node, which makes it possible to reorder domains for different chip revisions.

Power domains have a notion of governors, allowing custom decision making before cutting power. This gives flexibility with regard to ramp up/down delays, for example. The governor is usually implemented in the genpd, but there are two built-in ones: the always-on and simple QoS governors. You can attach system-wide or per-device runtime QoS constraints to control the governors.
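Putting domains and governors together, the logic looks roughly like the toy model below. Everything here is hypothetical naming for illustration, not the genpd API: the domain powers on when its first device becomes active, and asks a governor before powering off once all its devices are idle.

```python
class PmDomain:
    """Toy model of a power domain with a governor (not the genpd API)."""

    def __init__(self, name, power_on, power_off, governor=None):
        self.name = name
        self.powered = False
        self.active_devices = set()
        self._power_on = power_on
        self._power_off = power_off
        # governor(domain) -> bool: may the domain be powered off right now?
        self._governor = governor or (lambda domain: True)

    def device_active(self, device: str) -> None:
        if not self.powered:
            self._power_on(self)
            self.powered = True
        self.active_devices.add(device)

    def device_idle(self, device: str) -> None:
        self.active_devices.discard(device)
        if self.powered and not self.active_devices and self._governor(self):
            self._power_off(self)
            self.powered = False


def always_on_governor(domain) -> bool:
    """Never allow the domain to be powered off."""
    return False
```

A QoS-style governor would instead compare the domain's power-up latency against the constraints attached to its devices before returning True.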

There has been a lot of work recently upstream, like IRQ-safe domains, or always-on domains. Statistics and debug instrumentation were also added recently.

Under discussion is a way to unify CPU and devices power domain management. Upstream is also interested in having a better interaction between static and runtime PM. Support for more complex domains, in order to have the same driver for an IP block whether it's used through ACPI or genpds, is still in the works.

Performance Analysis Superpowers with Linux BPF

by Brendan Gregg

Presentation slides

Boldly starting the presentation with a demo, Brendan showed how to analyze how top works, with funccount and funcslower, kprobe, funcgraph and other ftrace-based tools he wrote.

He then switched to an eBPF frontend called trace, that was used to dig into the arguments of a kernel function. You can leverage eBPF even more with other tools like execsnoop or ext4dist.

eBPF and bcc

BPF comes from network filtering, originally used with tcpdump. It's a virtual machine in the kernel.

BPF sources can be tracepoints, kprobes, or uprobes. It uses the perf event ring buffer for efficiency. You can use maps as associative arrays inside the kernel. The general tracing philosophy is to have a very precise filter to only get the data you need, instead of dumping all the data in userspace, and filtering it later.

Many features were added recently to eBPF, and it keeps being improved.

BPF Compiler Collection (BCC) is the most used BPF frontend. It allows you to write BPF programs in C instead of assembly, and load the programs. You can then combine this with a Python userspace.
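As an illustration of that split, here is a minimal sketch of a BCC script: a C program that counts calls per process in an in-kernel map, and a Python side that reads the map afterwards. It was not shown in the talk; the probed function (vfs_read) and all names are my own choices, and it needs the bcc Python bindings, a BPF-capable kernel and root.

```python
# Sketch only: needs the bcc Python bindings, a BPF-capable kernel and root.
# The probed function (vfs_read) and all names below are illustrative.
PROGRAM = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);          // in-kernel map: pid -> number of calls

int count_vfs_read(struct pt_regs *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);           // aggregate in the kernel, not userspace
    return 0;
}
"""


def run(seconds: int = 5) -> dict:
    """Attach the probe, let it run, and return {pid: call_count}."""
    import time
    from bcc import BPF              # imported lazily: only needed here

    b = BPF(text=PROGRAM)            # compiles the C text and loads it
    b.attach_kprobe(event="vfs_read", fn_name="count_vfs_read")
    time.sleep(seconds)
    return {k.value: v.value for k, v in b["counts"].items()}
```

Calling `run()` as root returns the per-process counts; only the aggregated map crosses into userspace, which is the filtering philosophy described above.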

bpftrace is a new in-development frontend, with a simple-to-use philosophy.

Installing bcc on your distro is becoming easier as it gets packaged. There are many tools, each with a different use giving visibility into a different kernel part.

Heatmaps are very useful to visualize event distribution. Flamegraphs are also very powerful when combined with kernel stacktraces generation. It's now even possible to merge userspace and kernelspace stacktraces for analysis.

Future work

Support for higher level languages to write BPF programs like ply or bpftrace is in progress.

In conclusion, eBPF is very useful to understand Linux internals, and you should use it.

Kernel ABI Specification

by Sasha Levin

What's an ABI? ioctls, syscalls, and the vDSO are examples of the Linux ABI.

Sasha repeated the ABI promise from Greg's talk yesterday. The issue, he says, is that the kernel lacks tools to detect a broken ABI.

Sometimes basic syscall argument checks are forgotten, and discovered as a security vulnerability. Sometimes, some interfaces have undefined behaviour, making the ABI stability uncertain.

Breakage is sometimes difficult to fix when detected late, because new userspace might depend on the new behaviour.

In the end, some userspace programs like glibc, strace, or syzkaller might rewrite their understanding of the kernel ABI, and those might be out of sync. Man pages might not document everything either, and they're not real documentation of the ABI contract.

ABI Contract

Right now it's in the form of kernel code. Unfortunately, code evolves, so it's not an optimal format for this.

The goal is to fix many issues at the same time: ensure backwards compatibility, prevent kernel to userspace errors, document the contract, and encourage re-use. Sasha looked for a format that would only require writing this once, and be machine readable. syzkaller's description looked like a good starting point. He wanted this to be reusable by userspace tools that need this information. And finally, he wanted to use this as a tool to help ABI fixes and fast breakage detection.

It also helps reassure distributions that the ABI promise is really kept. In Sasha's view, it would also greatly help the security aspect of things, since the ABI is the main interface by which the kernel is attacked.

The hard part is to determine the format of this contract, document all syscalls and ioctls and write the tools to test it out.

Sasha already started with a few system calls, and is currently looking for help to get the ball rolling.

Lightning Wireguard talk

by Jason A. Donenfeld

Jason's background is in breaking VPNs. He wanted to create one that was more secure. That's how Wireguard was born.

Wireguard is UDP based, and uses modern cryptographic principles. The goal is to make it simple and auditable. To prove his point, he showed that it clocks in at 3900 lines of code, while OpenVPN, Strongswan or SoftEther have between 116730 and 405894 lines of code each.

It uses normal interfaces, added through the standard ip tool. Jason says it's blasphemous because it breaks through layering assumptions, as opposed to IPsec for example.

A given interface has a 1 to N mapping between public keys and IP addresses representing the peers. To configure the cryptokey routing, you use the wg tool for now. Once Wireguard is merged, the intention is to have this integrated into the iproute2 project.
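The cryptokey routing table can be pictured as a two-way lookup. The sketch below is a simplified model, not Wireguard's implementation: the class name is hypothetical, keys are plain strings, and picking the most specific matching prefix is my understanding of how allowed IPs are resolved.

```python
import ipaddress


class CryptokeyRouter:
    """Toy model of a public key <-> allowed IPs table."""

    def __init__(self):
        self._peers = {}   # public key -> list of allowed networks

    def add_peer(self, public_key: str, allowed_ips) -> None:
        self._peers[public_key] = [ipaddress.ip_network(n) for n in allowed_ips]

    def peer_for_destination(self, dst: str):
        """Outgoing packet: which peer's key should encrypt it?"""
        addr = ipaddress.ip_address(dst)
        best = None
        for key, networks in self._peers.items():
            for net in networks:
                if addr.version == net.version and addr in net:
                    # Prefer the most specific (longest) matching prefix.
                    if best is None or net.prefixlen > best[1].prefixlen:
                        best = (key, net)
        return best[0] if best else None

    def accept_source(self, public_key: str, src: str) -> bool:
        """Incoming packet, already authenticated as coming from public_key:
        accept it only if its source address belongs to that peer."""
        addr = ipaddress.ip_address(src)
        return any(addr.version == net.version and addr in net
                   for net in self._peers.get(public_key, []))
```

The same table thus acts as a routing table on the way out and as an access control list on the way in.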

In Wireguard, the interface appears stateless, while under the hood, session state and connections are handled transparently.

The key distribution between peers is left to userspace.

Wireguard works well with network namespaces. You can for example limit a container to only communicate through a wireguard interface.

As a design principle, wireguard has no parsing. It also won't interact at all with unauthenticated packets, making it un-scannable unless you have the proper peer private key.

Under the hood, it uses the Noise Protocol Framework (used by WhatsApp) by Trevor Perrin, with modern algorithms like ChaCha20, BLAKE2s, etc. It lacks crypto agility, but supports a transition path.

To conclude, Jason says that Wireguard is the fastest and lowest-latency VPN available.

Modern Key Management with GPG

by Werner Koch

What's new

GnuPG 2.2 was released a few weeks ago, while 2.1 has been around for nearly 3 years. There's now easy key discovery going through key servers to search keys associated with an email address.

You can now use gpg-agent over the network, so that you don't have to upload your private keys to a server.

In the pipeline for version 2.3 are SHA2 fingerprinting, an AEAD mode, and new default algorithms. The goal is also to help higher-level applications integrate GPG in their projects. Werner says he also wants to make the Gnuk open hardware USB token easier to buy in Europe. Improving documentation is also planned.

GPG will be moving to ECC. While this is a well-researched field, some curves (specific ECC parameter sets) have a pretty bad reputation according to Werner, and some of those are required by NIST or European standards. The new de-facto standard curves are Curve25519 and Curve448-Goldilocks.

An advantage of ECC key signatures is that they are much shorter than RSA signatures, and faster to compute when signing. Verification is slower though.

User experience

The command line interface is being improved with new --quick- options, that are simpler to use. There's now a quick command to generate a key, update the expiration time, add subkeys, update your email address (uid), revoke the old address, sign key, verify a key locally for key signing parties.

The main issue with key servers is that they can't map an address to a key. Anyone can publish a key with a given email. The proper way to handle this is through the email server, but this isn't solved yet. Werner's opinion is that the Web-of-Trust is too complex a tool; he believes that Trust On First Use (TOFU) is a better paradigm.
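The core idea of TOFU fits in a few lines: remember the first key seen for an address, and complain if a different one shows up later. This is a deliberately simplified sketch with hypothetical names; as far as I know, GnuPG's actual TOFU model also keeps usage statistics and is more nuanced.

```python
class TofuStore:
    """Toy Trust On First Use store: email address -> pinned fingerprint."""

    def __init__(self):
        self._pinned = {}

    def check(self, email: str, fingerprint: str) -> str:
        """Return 'new', 'match' or 'conflict' for this (email, key) pair."""
        known = self._pinned.get(email)
        if known is None:
            self._pinned[email] = fingerprint   # first use: pin this key
            return "new"
        return "match" if known == fingerprint else "conflict"
```

A "conflict" is the only case where the user has to be involved, which is what makes the model simpler to live with than managing a web of signatures.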

There are two GPG interfaces: one for humans, and one for scripting. You should always use the scripting one with your programs, as it's more stable.
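For key listings, the scripting interface I have in mind is the --with-colons output (my assumption; the notes don't name it). A small parsing sketch, based on my reading of the colon format where field 5 of a pub record is the key ID and field 10 of fpr and uid records holds the fingerprint and user ID:

```python
import subprocess


def parse_colon_listing(text: str) -> list:
    """Parse `gpg --with-colons --list-keys` output into a list of dicts.

    Only pub, fpr and uid records are handled; user IDs are returned
    as-is, without undoing gpg's escaping of special characters.
    """
    keys = []
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) < 10:
            continue
        record = fields[0]
        if record == "pub":
            keys.append({"keyid": fields[4], "fingerprint": None, "uids": []})
        elif not keys:
            continue                             # record before any pub line
        elif record == "fpr" and keys[-1]["fingerprint"] is None:
            keys[-1]["fingerprint"] = fields[9]  # first fpr after pub is the primary key's
        elif record == "uid":
            keys[-1]["uids"].append(fields[9])
    return keys


def list_keys() -> list:
    """Run gpg non-interactively and parse its machine-readable listing."""
    result = subprocess.run(["gpg", "--batch", "--with-colons", "--list-keys"],
                            capture_output=True, text=True, check=True)
    return parse_colon_listing(result.stdout)
```

The point is less the parser itself than the habit: the human-readable output can change between releases, while the colon-separated one is meant to stay parseable.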

There are now import/export filters in GPG to reduce the size impact of keys with lots of signatures.

You can now ssh-add keys into the gpg-agent. The only caveat is that in this case, GnuPG stores the key forever in its private key directory instead of just in memory.

In conclusion, GPG isn't set in stone, and it keeps improving and evolving. The algorithms, user interface, scriptability are getting better.

(That's it for today! Continue reading on the last day!)