Kernel Recipes 2017 day 3 notes
Day 3 of the best open kernel conference in the world
September 29, 2017
perf at Netflix
by Brendan Gregg
Brendan started with a ZFS on Linux case study, where ZFS was eating 30% of the CPU resources, which it should never be doing. He started by generating a flame graph with perf, through Netflix's Vector dashboard tool, and the flame graph instantly confirmed the CPU usage, despite the initial hunch. The cause was then quickly thought to be the container teardown cleanup using lots of resources. The only issue here is that this particular project never used ZFS. It was in fact the free code path trying to get real entropy to free empty lists. It was later fixed in ZFS.
A particular point underlined is that when profiling, you want to see everything, from the kernel to userspace C or Java code. perf allows doing that, because it has no blind spots, is accurate, and has low overhead.
This is useful at Netflix, because they scale the number of instances based on the percentage of CPU usage. At Netflix scale, a small performance improvement might lead to a scale-down saving the company a lot of money. While perf can do many things, Netflix uses it to profile CPU usage 95% of the time.
perf originated from the implementation of CPU Performance Monitoring Counters (PMCs) support in Linux, and now supports many features.
The main workflow is to run perf list to look at the available tracepoint events, then perf stat to count particular events. perf record captures the events and dumps them to the file system; perf report or perf script is then used to analyze the dumped perf data. perf top can be used to look at events in real time.
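As a rough illustration of that workflow, the following Python sketch only builds the command lines for one list, stat, record, script session. The helper name perf_workflow, the sched:sched_switch event and the ten-second duration are assumptions picked for the example, not commands shown in the talk.

```python
import shlex


def perf_workflow(event, seconds=10, data_file="perf.data"):
    """Return the perf command lines for a list -> stat -> record -> script session."""
    sleep = ["--", "sleep", str(seconds)]
    return [
        # 1. Look at the events perf knows about.
        ["perf", "list"],
        # 2. Count one event system-wide for a fixed duration.
        ["perf", "stat", "-e", event, "-a"] + sleep,
        # 3. Record the same event with call graphs into a data file.
        ["perf", "record", "-e", event, "-a", "-g", "-o", data_file] + sleep,
        # 4. Dump the recorded samples as text for further processing.
        ["perf", "script", "-i", data_file],
    ]


if __name__ == "__main__":
    for argv in perf_workflow("sched:sched_switch"):
        print(shlex.join(argv))
```

The commands are only printed here; running them typically needs root privileges or a relaxed perf_event_paranoid setting.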
Brendan maintains a list of perf one-liners, useful to explore and learn about perf capabilities.
Brendan came up with Flame Graphs when he was profiling a MySQL issue. It's a perl script that converts input data to svg. To use it with perf, run perf script and feed the output into the Flame Graph scripts.
An important thing is to have working stack traces and symbol resolution. To fix stack traces you should either use frame-pointer based stack walking, libunwind or DWARF. You probably want to add -fno-omit-frame-pointer to your gcc options for C code. For Java, you might want to use perf-map-agent to do symbol resolution and de-inlining.
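To show why complete stack traces matter, here is a minimal sketch of the "folded" representation that flame graph tooling works from: one line per unique stack, frames joined by semicolons, followed by a sample count. The function name fold_stacks and the sample data are made up for the example; the real conversion is done by the Flame Graph scripts, not by this code.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled stacks (root frame first) into sorted folded lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Three hypothetical samples: two landed in read(), one in render().
samples = [
    ["main", "parse", "read"],
    ["main", "parse", "read"],
    ["main", "render"],
]

for line in fold_stacks(samples):
    print(line)
```

A truncated stack (for instance, one missing its main frame) would fold into a separate line, which is how broken stack walking shows up as detached towers in a flame graph.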
When you go down to instruction level, the problem is that the resolution isn't really precise, so you don't really know which instruction was executing when the sample was taken. This is because of modern out-of-order CPU architectures. Intel's PEBS helps with this issue.
When using VMs, you might want to have your hypervisor (Xen, etc.) enable PMCs for your OS and handle this properly. For containers, perf might have issues finding the symbol files, since they are in a different namespace; this is fixed in 4.14.
In conclusion, there's a lot to say about perf, and this talk only scratched the surface of what's possible; Brendan pointed us to the many resources available about it online.
The Serial Device Bus
by Johan Hovold
While serial buses are ubiquitous, the TTY layer fails to model the resources associated with a serial line.
The TTY layer exposes a character device to userspace. It supports line disciplines for switching modes, handling errors, etc.
It's possible to write drivers on top in userspace, and Johan used gpsd as an example of this. But you need to know the associated port in advance, and resources aren't necessarily accessible. And you lose the ability to interact with other subsystems in the kernel. Another example of this is Bluetooth, where you register further devices (hci0) in order to be able to control the line discipline and properly initialize ports.
To initialize Bluetooth, you use hciattach to configure a TTY as a Bluetooth device, then the hci device appears, and then you use hciconfig to manage this device. The problem with this type of ldisc driver is that you cede control over some information to userspace, and you don't have the full picture of GPIOs and other resources, for handling power management for example.
Serial Device Bus
serdev was originally written by Rob Herring; it was created as a bus for UART-attached devices. It was merged in 4.11, but only enabled in 4.12 following some issues.
The new bus name is "serial"; it refers to serdev controllers and clients (or slaves). The only controller available is the TTY-port controller. The hardware description happens in the Device Tree.
serdev allows a new architecture, with simpler interaction and layering, without the need to have userspace change the mode of a TTY first, since all the necessary data is in the Device Tree. For bluetooth, this would mean hci0 would appear at dt probe time, making it possible to use
There are currently three bluetooth drivers using this infrastructure in the kernel, as well as one ethernet driver (qca_uart).
The main limitation is that it's serial-core only. It currently only supports Device Tree, but ACPI support is being worked on. Hotplug support isn't solved either. Patches adding multiplexing to support multiple slaves have been posted.
eBPF and XDP seen from the eyes of a meerkat
by Éric Leblond
Suricata is an open-source Intrusion Detection System that relies on kernel features. It starts with dumping all packets at the IP level with Linux raw sockets, then does stream reconstruction and application protocol analysis. It works at 10Gb/s in normal use in enterprise networks. It analyses the data, and outputs JSON, or even feeds a web dashboard.
Suricata uses Linux raw sockets with AF_PACKET in memory-mapped fan-out mode for multi-threaded processing.
One issue Suricata encountered was the asymmetrical hash being changed in Linux 4.2, breaking ordering so that Suricata couldn't properly analyze the streams. This was fixed later in 4.6.
eBPF came to the rescue by enabling Suricata to customize the hash function, and then properly tag packets so that they go to the proper thread (load-balanced), hence preserving ordering.
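The property that matters here is symmetry: both directions of a flow must hash to the same worker thread. The sketch below illustrates that idea in plain Python; it is not Suricata's eBPF code, and the function name flow_thread and the use of CRC32 are assumptions made for the example.

```python
import zlib


def flow_thread(src_ip, src_port, dst_ip, dst_port, n_threads):
    """Map a packet to a worker so both directions of a flow share a thread."""
    # Order the two endpoints so A->B and B->A produce the same key.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a[0]}:{a[1]}|{b[0]}:{b[1]}".encode()
    return zlib.crc32(key) % n_threads


request = flow_thread("10.0.0.1", 40000, "10.0.0.2", 80, 8)
reply = flow_thread("10.0.0.2", 80, "10.0.0.1", 40000, 8)
print(request == reply)
```

An asymmetric hash would feed the endpoints in packet order instead of sorting them, so a request and its reply could land on different threads and be analyzed out of order.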
Another issue related to load-balancing is the handling of big flows, which is hard to do without losing packets or ordering. One solution is to bypass selected packets, discarding them as early as possible in the kernel to reduce the performance impact.
Suricata implemented a new "stream depth" bypass that starts discarding packets once a flow is under way, while still capturing the most interesting part at the beginning.
For the kernel part of this bypass implementation, nftables did not work because it was too late in the process, after AF_PACKET handling. An eBPF filter using maps helped Suricata achieve this.
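The logic of such a bypass can be sketched in a few lines, with a Python dict standing in for the eBPF map that tracks per-flow state. The class name StreamDepthBypass, the flow key format and the byte threshold are all assumptions for illustration; the real filter runs in the kernel and is not shown in these notes.

```python
class StreamDepthBypass:
    """Keep the first depth_bytes of each flow, then drop the rest early."""

    def __init__(self, depth_bytes):
        self.depth_bytes = depth_bytes
        self.seen = {}  # flow key -> bytes captured so far (stand-in for an eBPF map)

    def keep(self, flow, length):
        """Return True to capture the packet, False to bypass it."""
        captured = self.seen.get(flow, 0)
        if captured >= self.depth_bytes:
            return False  # depth reached: discard without further processing
        self.seen[flow] = captured + length
        return True


bypass = StreamDepthBypass(depth_bytes=3000)
decisions = [bypass.keep("10.0.0.1:40000-10.0.0.2:80", 1500) for _ in range(4)]
print(decisions)
```

In this toy version, the packet that crosses the threshold is still captured in full; only the packets after it are dropped.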
bcc didn't match Suricata's requirements, so they used libbpf, which is hosted inside the kernel tree in tools/lib/bpf. Éric says it's easy enough to use.
The eXtreme Data Path (XDP) project was started to give access to raw packet data from the network card, before it reaches the Linux network subsystem and an skb is created. You can even interact with it using an eBPF filter. This needs modified drivers, and many are already supported; in 4.12 there's even a generic driver usable for development, although it is less performant.
Éric started integrating XDP in Suricata, and found that it meant doing more parsing, since it works on raw packets.
libbpf support isn't done yet either. To hand over the capture to userspace, the strategy is to use the perf event system, with its memory mapped ring buffer.
This is still a bit fresh, Éric says, but promising and very efficient.
HDMI CEC Status Report
by Hans Verkuil
The Voyager space probe launched in 1977 communicates at 1477 bits per second, and CEC is a bus that communicates at 400 bits per second, making Hans the maintainer of the slowest bus in the kernel.
CEC is an optional part of HDMI that provides high-level functions and communication for audio and video products. It's a 1-line protocol. It has physical addresses, the TV always being 0, and inputs having others. Then there are logical addresses from 0 to 15.
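To make the two address types concrete, here is a small sketch. It relies on details from the CEC specification rather than from the talk notes: a physical address is 16 bits, usually written as four dotted hexadecimal digits, and each message starts with a header byte holding the initiator and destination logical addresses in its high and low nibbles. The function names are made up for the example.

```python
def format_phys_addr(addr):
    """Render a 16-bit CEC physical address as four dotted hex digits."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("physical address must fit in 16 bits")
    return ".".join(f"{(addr >> shift) & 0xF:x}" for shift in (12, 8, 4, 0))


def header_byte(initiator, destination):
    """Pack two 4-bit logical addresses (0 to 15) into a CEC header byte."""
    for la in (initiator, destination):
        if not 0 <= la <= 15:
            raise ValueError("logical address must be between 0 and 15")
    return (initiator << 4) | destination


print(format_phys_addr(0x0000))  # the TV, at the root of the HDMI tree
print(format_phys_addr(0x1000))  # a device on the TV's first input
print(hex(header_byte(4, 0)))    # logical address 4 talking to logical address 0
```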
CEC allows waking up or shutting down a device (TV or otherwise), switching sources, and getting remote control passthrough. You can also tell other devices the name of your device. You can also configure the Audio Return Channel (ARC) to send the audio from the sink (TV) to a device through the HDMI Ethernet pins.
Inside the kernel, the CEC framework implements most of the features. The drivers only need to implement the low-level CEC adapter operations. The framework handles core messages automatically, but you can also receive them yourself if you enable passthrough. If you need to assemble or decode CEC messages, there's a dual BSD/GPL-licensed header-only implementation in cec-funcs.h that can be used by applications. The framework driver API is pretty compact and simple to implement.
The userspace API has various messages to set a physical or logical address, set the mode of the fd, etc.
The Hotplug Detect use case is complex, since it depends on the status of the HDMI Hotplug Detect (HPD) pin. If the pin is down, some devices won't be able to send CEC messages. Some TVs turn off HPD, but still receive CEC messages. Hans says that the most reliable way to wake up a TV is to just send a message, regardless of the HPD status. It's out-of-spec, but it is the only way to make it work.
cec-ctl is the tool that implements the userspace API and allows interacting with the framework from the command line.
In kernel 4.14, many devices are now supported, including the Raspberry Pi. It can now be emulated with the vivid driver. It passed CEC 1.4 and 2.0 compliance tests. This makes Linux the only OS with built-in CEC support, Hans says.
In the pipeline is support for many new devices, as well as a brand new cec-gpio driver that bit-bangs CEC over a GPIO. It will also allow injecting errors, but this should come later.
20 years of Linux Virtual Memory
by Andrea Arcangeli
Virtual Memory (VM) is practically unlimited and costs virtually nothing. Virtual pages point to physical pages, which are the real memory.
In x86, the pagetable format is a radix tree. With the traditional 4 levels of page tables you can address 256TiB of memory; with 5-level page tables, you can address 128PiB of memory, but it has a performance impact.
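The 256TiB and 128PiB figures can be checked with a little arithmetic, assuming the usual x86-64 layout of 4KiB pages (12 offset bits) and 9 index bits per page table level. These two constants are standard x86-64 values rather than something stated in the talk notes; 256TiB corresponds to 48 address bits, which is what four such levels give.

```python
PAGE_SHIFT = 12      # 4KiB pages: 12 bits of offset within a page
BITS_PER_LEVEL = 9   # 512 eight-byte entries fit in one 4KiB table

TIB = 1 << 40
PIB = 1 << 50


def addressable_bytes(levels):
    """Size of the virtual address space reachable with this many levels."""
    return 1 << (PAGE_SHIFT + BITS_PER_LEVEL * levels)


print(addressable_bytes(4) // TIB, "TiB with 4 levels")
print(addressable_bytes(5) // PIB, "PiB with 5 levels")
```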
The VM algorithms in Linux use heuristics to solve the hard problem of using the memory as well as possible. One such choice is to overcommit by default; another is to use all free memory as cache.
In the VM, the basic structure is struct page. It's currently 64 bytes, and is using 1.56% of all memory in a given system.
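The 1.56% figure follows from one 64-byte struct page per 4KiB page of memory. A quick check, where the 4KiB page size and the 16GiB machine are assumptions for the example:

```python
STRUCT_PAGE_BYTES = 64
PAGE_BYTES = 4096  # assumed base page size


def struct_page_overhead_percent():
    """Share of memory spent on struct page metadata."""
    return 100 * STRUCT_PAGE_BYTES / PAGE_BYTES


def struct_page_bytes(ram_bytes):
    """Total metadata size for a machine with ram_bytes of memory."""
    return (ram_bytes // PAGE_BYTES) * STRUCT_PAGE_BYTES


print(struct_page_overhead_percent())          # percentage of all memory
print(struct_page_bytes(16 << 30) // (1 << 20), "MiB on a 16GiB machine")
```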
MM is the memory of a process, and is shared by threads.
virtual_memory_area VMA is inside the MM. The LRU cache is a combination of two lists of recently used pages, and uses an active and inactive optimum balancing algorithm. The status of those lists is visible in
Reverse mapping of the objects (objrmap) is used as well to find reverse references of pages to processes.
There are other LRUs for anonymous and file-based mappings, or cgroups.
Automatic NUMA Balancing helps running various workloads, without having to adapt it to NUMA mode with hard bindings.
Transparent Hugepages are a way to automatically use huge pages if an application uses lots of memory, instead of manually with hugetlbfs.
The MMU notifier allows reducing page pinning, making it possible to swap-out DMAed memory with proper driver interactions.
HMM or Unified Virtual Memory allows going even further for GPUs and seamless computing, without requiring cache-coherency.
Andrea showed auto-NUMA balancing benchmarks, where it improves transactions by as much as 10%. A remark from the audience pointed out that in some pathological cases, the performance might actually be worse, but the feature can be disabled.
With hugepages, you can go from 4KiB pages to 2MiB pages. This allows completely removing a pagetable level, and thus improving performance in some cases. But it has a cost when clearing pages, making it less cache friendly. For that last problem, a huge improvement in performance was seen when clearing the faulting sub-page last, so that it's still in the cache.
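The "clear the faulting sub-page last" idea can be sketched as an ordering problem over the 512 base pages that make up one 2MiB huge page. This is a simplified illustration, not the kernel's actual clearing routine; the function name clear_order and the farthest-first ordering are assumptions for the example.

```python
BASE_PAGE = 4 * 1024
HUGE_PAGE = 2 * 1024 * 1024
SUBPAGES = HUGE_PAGE // BASE_PAGE  # number of 4KiB pages in one 2MiB page


def clear_order(n_subpages, faulting):
    """Indices of sub-pages to clear, leaving the faulting one for last."""
    others = [i for i in range(n_subpages) if i != faulting]
    # Clear the farthest sub-pages first, so the most recently written
    # cache lines are the ones closest to the faulting address.
    others.sort(key=lambda i: abs(i - faulting), reverse=True)
    return others + [faulting]


order = clear_order(SUBPAGES, faulting=7)
print(SUBPAGES, "sub-pages, last one cleared:", order[-1])
```

Clearing in plain ascending order would instead leave the end of the huge page in cache, which is rarely where the application is about to touch memory.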
Transparent Hugepage (THP) works by simply handing out 2MiB pages when the mmap region is 2MiB aligned and the request is big enough. It is tunable in /sys/kernel/mm/transparent_hugepage; it can be disabled, enabled only for madvise, or always enabled. The THP defragmentation/compaction is also tunable.
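As a small example of inspecting that tunable, the sketch below parses the "enabled" file in that directory, which lists the possible values with the active one in square brackets, for instance "always [madvise] never". The file name and its format come from the kernel's THP documentation rather than from the talk notes, and the helper name active_choice is made up.

```python
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_choice(text):
    """Return the bracketed word from a line such as 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print("THP mode:", active_choice(THP_ENABLED.read_text()))
    else:
        print("THP is not available on this system")
```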
Since Linux 4.8, it's possible to use THP with tmpfs and shmem. This is also tunable and disabled by default.
KSM and userfaultfd
Virtual memory deduplication (KSM) is practically unlimited, which affects migration during compaction for example. With KSMscale, a maximum limit is set on deduplication per physical page; the default is 256, so that a given KSM page would only be referenced by 256 virtual pages. This is tunable. Answering a question from the audience, Andrea said that if you care about cross-VM sidechannel attacks, you should probably disable KSM after disabling HyperThreading.
userfaultfd gives userspace more visibility and control over page-faulting. It enables postcopy live migration with VMs (efficient snapshotting). It can be used to drop write bits for JITs, and has many other uses.
Andrea concluded that, after 20 years of working on Linux memory management, he is amazed by how much room there still is for innovation and further improvements.
An introduction to the Linux DRM subsystem
by Maxime Ripard
In the beginning, there was the framebuffer. That's how fbdev was born, to do very basic graphics handling. Then, GPUs came along, getting bigger and bigger. In parallel in the embedded space, piles of hacks were accumulated in display engines to accelerate some operations.
At first, DRM was only for GPUs' needs, without any kind of modesetting. It required mapping device registers to userspace, which would then do the modesetting itself. But since Kernel Mode-Setting (KMS), this has moved back into the kernel.
fbdev is now obsolete, and dozens of ARM DRM drivers have been merged since 2011.
Traditionally in embedded devices, there were two completely different devices for the GPU and the display engine. In Linux, there's the divide between DRM and KMS.
KMS has planes, that can be used for double-buffering. It also has the CRTC, that does the composition. Encoders take the raw data from the CRTC, and convert it to a useful hardware bus format (HDMI, VGA). Connectors output the data, handle hotplug events and EDIDs.
In the DRM stack, GEM can be used to allocate buffers and share them with the kernel without copying. PRIME can interact with GEM and dma-buf to also handle buffers shared with hardware.
Vendors also have their own solutions, like ARM's Mali proprietary driver. Blob access for userspace is tightly controlled.
Build farm again
by Willy Tarreau
This is a followup of last year's presentation. The old build farm had shortcomings: it wasn't reliable (HDMI sticks), had a bad power supply, and heating issues. Yet the RK3288 was quite powerful, so Willy wanted to try again with the same CPU.
He got 10 MiQi boards, which are even faster thanks to dual-channel DDR3, although still having shortcomings when combining them with foam. Willy fixed the heatsink by using 3M thermal tape. Instead of microUSB, Willy simply soldered thicker cables directly on the board. And to solve the switch attrition issue, he tried a Clearfog-A1 board.
distcc was updated to the latest version for more flexibility, and its settings were bumped in order to saturate all the cores on all CPUs. LZO compression helped reduce upload time. He also found that there was a hardcoded limit of 50 parallel jobs in distcc, and fixed it.
He improved the distcc distribution by putting haproxy in front with the leastconn algorithm; this helped a lot.
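The idea behind a least-connections balancer is simple: send each new job to the backend that currently has the fewest jobs in flight. The sketch below illustrates that selection rule only; it is not haproxy's implementation, and the board names and job counts are invented.

```python
def pick_leastconn(active):
    """Return the backend with the fewest active connections."""
    return min(active, key=active.get)


def dispatch(active):
    """Assign one new job and update the connection count."""
    backend = pick_leastconn(active)
    active[backend] += 1
    return backend


# Hypothetical farm state: compile jobs currently running on each board.
active = {"board1": 3, "board2": 1, "board3": 2}
assigned = [dispatch(active) for _ in range(3)]
print(assigned, active)
```

Compared to plain round-robin, this keeps slow or busy boards from accumulating a backlog, since long compile jobs keep their connection open until they finish.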
Using the cluster in addition to his local beefy machine, he went from 13 minutes for kernel builds to 4m45s.
To help with monitoring, Willy submitted a new led-activity LED trigger for the kernel to change the blinking speed depending on CPU usage.
To build haproxy, he went from 11s to 3s with the added farm. With up to 200 builds a day, it saves less than half an hour per day.
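The numbers above can be checked quickly: saving 8 seconds on each of 200 builds adds up to a bit under 27 minutes, and the kernel build time went down by a factor of roughly 2.7. The helper names are made up for the example.

```python
def daily_saving_minutes(builds_per_day, before_s, after_s):
    """Minutes saved per day when each build gets faster."""
    return builds_per_day * (before_s - after_s) / 60


def speedup(before_s, after_s):
    """How many times faster the build became."""
    return before_s / after_s


haproxy_saving = daily_saving_minutes(200, before_s=11, after_s=3)
kernel_speedup = speedup(13 * 60, 4 * 60 + 45)

print(f"haproxy: {haproxy_saving:.1f} minutes saved per day")
print(f"kernel: {kernel_speedup:.2f}x faster")
```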
Feedback was sent to MiQi's maker; patches to distcc. The quest for a good USB power supply continues. Willy is now exploring alternative boards for even faster builds.
(That's it for Kernel Recipes 2017! See you next year!)