Kernel Recipes 2017 day 1 live-blog
Day 1 of the best open kernel conference in the world
September 27, 2017
What's new in the world of storage for Linux
by Jens Axboe
Jens started with the status of blk-mq conversions. Most drivers are now converted, including stec, nbd, MMC, scsi-mq and ciss. There are about 15 drivers left, but Jens says it isn't over until floppy.c is converted, and he re-offered the prize he first offered two years ago.
blk-mq scheduling was the only missing feature, in order to tag I/O requests, have better flush handling, or help with scalability. To address this, blk-mq-sched was added in 4.11, with the "none" and "mq-deadline" algorithms. 4.12 saw the addition of the BFQ and Kyber algorithms.
Writeback throttling is a feature that prevents overwhelming the device with requests, in order to keep peak performance high. It was inspired by the CoDel algorithm from networking. It was tested with io.go, and proved to improve latency tremendously on both NVMe devices and hard drives.
IO polling helps get faster completion times, but it has a high CPU cost. Hybrid polling was added, which uses predictive algorithms in the kernel to wake up the driver just before the IO completes. The kernel tracks IO completion times and sleeps for half the mean, allowing both fast completion and a lower CPU load, which leads to better power management. This is configurable through sysfs, with the proper fd configuration. Results show that adaptive polling is comparable to active polling in completion times, but at half the CPU cost.
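The "sleep for half the mean" heuristic can be sketched in a few lines. This is only a user-space illustration of the idea as described in the talk, not the kernel's implementation; the class and method names are made up for the example.

```python
class HybridPoller:
    """Tracks IO completion times and suggests a sleep before polling."""

    def __init__(self):
        self.count = 0
        self.mean_us = 0.0

    def record_completion(self, latency_us):
        # Incremental running mean of observed completion times.
        self.count += 1
        self.mean_us += (latency_us - self.mean_us) / self.count

    def sleep_before_poll_us(self):
        # With no history yet, start polling immediately.
        if self.count == 0:
            return 0.0
        # Sleep for half the mean, then busy-poll for the remainder.
        return self.mean_us / 2


poller = HybridPoller()
for latency in (80, 100, 120):
    poller.record_completion(latency)
print(poller.sleep_before_poll_us())  # 50.0
```

The CPU is free during the sleep, and polling only covers the second half of the expected completion time.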
Faster O_DIRECT and faster IO accounting were also worked on. IO accounting used to be invisible in profiling, but with the huge scaling efforts in the IO stack, it started showing up at 1-2% in testing. In synthetic tests, disabling iostat improved performance greatly. It was rewritten and merged in 4.14.
A new mechanism called write lifetime hints allows applications to signal the expected lifetime of their writes with fcntl. The hints are passed down to flash-based storage (supported in NVMe 1.3) so that data with similar lifetimes can be placed together, which reduces the write amplification caused by the internal Flash Translation Layer (FTL) when you do big writes. The device can make more intelligent decisions and do better garbage collection internally. It showed improvements in RocksDB benchmarks.
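As a rough sketch of what setting a hint looks like from user space: the constants below are copied from the Linux UAPI header linux/fcntl.h (the interface appeared in kernel 4.13), while the helper names are made up for this example. The call is wrapped so that kernels or filesystems that reject the command simply report failure.

```python
import fcntl
import struct

# Values from the Linux UAPI header <linux/fcntl.h>.
F_LINUX_SPECIFIC_BASE = 1024
F_GET_RW_HINT = F_LINUX_SPECIFIC_BASE + 11
F_SET_RW_HINT = F_LINUX_SPECIFIC_BASE + 12

RWH_WRITE_LIFE_NOT_SET = 0
RWH_WRITE_LIFE_NONE = 1
RWH_WRITE_LIFE_SHORT = 2
RWH_WRITE_LIFE_MEDIUM = 3
RWH_WRITE_LIFE_LONG = 4
RWH_WRITE_LIFE_EXTREME = 5


def pack_hint(hint):
    # The kernel expects a pointer to a 64-bit unsigned integer.
    return struct.pack("=Q", hint)


def set_write_hint(fd, hint):
    """Try to set the write lifetime hint on fd; return True on success."""
    try:
        fcntl.fcntl(fd, F_SET_RW_HINT, pack_hint(hint))
        return True
    except OSError:
        # Older kernels or unsupported filesystems reject the command.
        return False
```

An application writing short-lived data, such as a log it will soon overwrite, could call set_write_hint(f.fileno(), RWH_WRITE_LIFE_SHORT) right after opening the file.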
IO throttling was initially tied to CFQ, which isn't ideal with the new blk-mq framework. It now scales better on SSDs, supports cgroup2, and was merged for 4.10.
Jens came back to a slide from Kernel Recipes 2015 where he predicted future work: all the features previously discussed in this talk were completed in the two-year timespan.
In the future, IO determinism is going to be a focus of work, as well as continuous performance improvements.
Testing on device with LAVA
by Olivier Crête
Continuous integration is as simple as "merge early, merge often" Olivier says. But the core of the value is more in Continuous Testing, and that's what most people think when they say CI.
Upstream kernel code is properly reviewed, so why should it be tested, Olivier asked. Unfortunately, ARM boards aren't easy to test, so the kernel used to rely on users to do the testing.
That's until kernelci.org came along, doing thousands of compiles and boots every day, catching a lot of problems. kernelci.org is very good at breadth of testing, but not depth. If you have any serious project, you should do your own testing, with your own hardware and patches.
Unfortunately, automation isn't ubiquitous, because the perceived value is low compared to the cost. To overcome this, the first thing to have is a standardized, single-click build system, with no manual operation. The build infrastructure should be the same for everyone, and Olivier recommends using docker images.
The second step is to close the CI loop, which means sending automated messages to the developer as soon as possible on failure. Public infrastructure such as GitLab, GitHub or Phabricator supports CI, as well as blocking the merge of anything that breaks the build.
Linaro Automation and Validation Architecture (LAVA) is not a CI system. It just focuses on board management, making boards easier to test. It can install images, do power control, and supports serial, ssh, etc. It's packaged for Debian and has docker images available. It should be combined with a CI system like Jenkins.
The first thing you need is a way to power a board on and off. You can find various power switch relay boards from APC, Energenie, Devantech, or even other USB relays.
LAVA supports different bootloaders: u-boot, fastboot, and others. The best strategy is to configure the bootloader for network booting.
LAVA is configured with Jinja2 templates, in which you set variables for the commands needed to connect to the board, reset it, and power it on or off.
Tests are defined by YAML files, and can be submitted directly through the API or via command line tools like lava-tool, lqa, etc. You specify the name of the job, timeouts, visibility, priority, and a list of actions to do.
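To give an idea of the shape of such a job, here is a minimal sketch that builds the top-level fields listed above as a Python dictionary and prints it as JSON (which YAML parsers generally accept). The job name, device type, and action parameters are illustrative placeholders, not a complete or validated LAVA job; a real deploy, boot, or test action needs board-specific parameters.

```python
import json

REQUIRED_KEYS = ("job_name", "device_type", "timeouts",
                 "visibility", "priority", "actions")


def make_job(job_name, device_type):
    """Build a skeleton job description (placeholder values only)."""
    return {
        "job_name": job_name,
        "device_type": device_type,
        "timeouts": {"job": {"minutes": 15}, "action": {"minutes": 5}},
        "visibility": "public",
        "priority": "medium",
        # Placeholder actions: real ones need images, prompts, test repos...
        "actions": [
            {"deploy": {"to": "tftp"}},
            {"boot": {"method": "u-boot"}},
            {"test": {"definitions": []}},
        ],
    }


def missing_keys(job):
    """Return the top-level keys that the skeleton expects but job lacks."""
    return [key for key in REQUIRED_KEYS if key not in job]


job = make_job("boot-smoke-test", "example-board")
print(json.dumps(job, indent=2))
```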
You should do CI, Olivier says. It requires a one-time investment, and saves a lot of time in the end. According to Olivier, starting from nothing, a LAVA+Jenkins setup is at most two days of work. Adding a new board to an existing infrastructure takes one or two hours.
Container FS interfaces
by James Bottomley
James began with an introduction to virtualization and hypervisor OSes. Within Linux, there are two hypervisor OSes: Xen and KVM. Both use QEMU to emulate most devices, but they differ in approach. Xen introduced para-virtualization, modifying the OS to enhance emulation. But hardware advancements killed para-virt, except in a few devices. In James' opinion, the time lost working on paravirt in Linux made it lose the enterprise virtualization market to VMware.
Container "guests" just run on the same kernel: there is one kernel that sees everything. The disadvantage is that you can't really run Windows on Linux.
The container interface is mostly cgroups and namespaces. There are label-based namespaces, the first one being the network namespace. There are also mapping namespaces, which map some resources to somewhere else so that they are seen differently, like the PID namespace, which can map a given PID on the host to be PID 1 inside the container.
Containers are used in Mesos, LXC and docker, and they all use the same standard kernel API of cgroups and namespaces. There are many sorts of cgroups (block IO, CPU, devices, etc.), but they aren't a focus of the talk. James intends to focus on namespaces instead.
James claims that you don't need any of the "user-friendly" systems, and you can just use the clone, unshare, and standard kernel syscall API to configure namespaces.
User namespaces are what ties it all together, allowing you to run as root inside a contained environment. When you buy a machine in the cloud, you expect to run stuff on it as root. Since they give enhanced privileges to the user, user namespaces were unfortunately the source of a lot of exploits, although there haven't been any serious security breaches since 3.14, James said.
User namespaces also map uids; on Linux, shadow-utils provides newuidmap and newgidmap for this. The user namespace hides unmapped uids, so they are inaccessible, even to "root" in the namespace. This creates an issue since a container image will mostly have files owned by uid 0, which then has to be mapped to the real kuid and to the fsuid across the userspace/kernel/storage boundary.
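The uid mapping can be made concrete with a small sketch. Each line of /proc/<pid>/uid_map holds three numbers: the first uid inside the namespace, the first uid outside it, and the length of the range. The function names and the example mapping below are made up for illustration.

```python
def parse_uid_map(text):
    """Parse uid_map content into (inside, outside, count) tuples."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            ranges.append((inside, outside, count))
    return ranges


def to_host_uid(ranges, uid_in_ns):
    """Translate a uid seen inside the namespace to the host uid, or None."""
    for inside, outside, count in ranges:
        if inside <= uid_in_ns < inside + count:
            return outside + (uid_in_ns - inside)
    return None  # unmapped: hidden from the namespace


# Hypothetical mapping: 65536 uids, with "root" mapped to host uid 100000.
ranges = parse_uid_map("0 100000 65536\n")
print(to_host_uid(ranges, 0))      # 100000
print(to_host_uid(ranges, 1000))   # 101000
print(to_host_uid(ranges, 70000))  # None
```

With this mapping, a file owned by uid 0 in the container image has to end up owned by uid 100000 on the host storage, which is the mismatch described above.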
In kernel 4.8, the superblock namespace was added to allow plugging in a USB key or running a FUSE driver in a container. But to be useful, you need a superblock, so it doesn't help with bind mounts, because you only have one superblock per underlying device.
The mount namespace works by cloning the tree of mounts when you do unshare --mount; at first it's identical to the original one, but once you modify it, it diverges. All the modified mounts, however, still point to the same refcounted super_block structure. This can create issues when you add new mounts inside a sub-namespace, like the USB key you plugged into your container: the new mount pins the other refcounted super_blocks from the host until you can umount it, which completely locks the mount namespace trees.
James then did a demo, showing with unshare that if you first create a user namespace, you can then create mount namespaces, despite being unable to do so before entering the user namespace. It shows how you can elevate your privileges with user namespaces, despite not being root when seen from the outside.
He then showed how you can create a file that is really owned by root by manipulating the mount points inside the user/mount namespace, using marks with shiftfs.
shiftfs isn't yet upstream, and other alternatives are being explored to solve the issues brought by the container world.
Refactoring the Linux kernel
by Thomas Gleixner
The main motivation for Thomas' refactoring over the years was to get the RT patch in the kernel, and to get rid of the annoyances.
One of his pet peeves is the CPU hotplug infrastructure. At first, the notifier design was simple enough for the needs, but it had its quirks, like the uninstrumented locking evading lockdep, or the obscure ordering requirements.
While CPU hotplug was known to be fragile, people kept applying duct tape on top of it, which just broke down when the RT patch started adding hotplug support. After ten years, in 2012, Thomas attempted to rewrite it but ran out of spare time. He picked it up again in 2015 and it was finalized in 2017.
It started by analysing all notifiers, and adding instrumentation and documentation in order to make the ordering requirements explicit. Then, one by one, the notifiers were converted to states.
The biggest rework was that of the locking. Adding lockdep coverage unearthed at least 25 deadlock bugs, and running Steven Rostedt's cpu-hotplug stress test tool could find one in less than 10 minutes. Answering a question from Ben Hutchings in the audience, Thomas said that these fixes are unfortunately very hard to backport, leaving old kernels with the races and deadlocks.
The lessons learned are that if you find a bug, you are expected to fix it. Don't rely on upstream to do that for you. There's a lot of bad code in the kernel, so don't assume you've seen the worst yet. You also shouldn't give up if you have to rewrite more things. Estimation in this context is very hard, and the original estimate for the task was off by a factor of three. In the end, the whole refactoring took 2 years, with about 500 patches in total.
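The idea behind converting notifiers to states is that the ordering becomes explicit: each state has a fixed position, a startup callback and a teardown callback, and teardown runs in the reverse order of startup. The following is a toy model of that idea only; the class, state numbers and names are invented for the example and do not mirror the kernel's code.

```python
class HotplugStateMachine:
    """Toy model: ordered states with symmetric startup and teardown."""

    def __init__(self):
        self.states = []   # (order, name, startup, teardown), sorted by order
        self.current = 0   # number of states currently brought up

    def register(self, order, name, startup, teardown):
        # In this sketch, all states are registered before bring_up().
        self.states.append((order, name, startup, teardown))
        self.states.sort(key=lambda state: state[0])

    def bring_up(self):
        # Run startup callbacks in ascending state order.
        for _order, _name, startup, _teardown in self.states[self.current:]:
            startup()
            self.current += 1

    def tear_down(self):
        # Run teardown callbacks in the exact reverse order.
        while self.current > 0:
            _order, _name, _startup, teardown = self.states[self.current - 1]
            teardown()
            self.current -= 1


log = []
machine = HotplugStateMachine()
machine.register(20, "timers", lambda: log.append("timers up"),
                 lambda: log.append("timers down"))
machine.register(10, "sched", lambda: log.append("sched up"),
                 lambda: log.append("sched down"))
machine.bring_up()
machine.tear_down()
print(log)  # ['sched up', 'timers up', 'timers down', 'sched down']
```

Unlike a notifier chain with implicit priorities, the order here is visible in one place, and registration order does not matter.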
The second refactoring Thomas discussed was the timer wheel. Its base concept was implemented in 1997, and extended over time. Initially it was the base for all sorts of timers; after 2005 it was mostly used for timeouts.
Those timeouts aren't triggered most of the time, but re-cascading them caused a lot of performance issues for timers that would get canceled immediately after re-cascading. This is a process that holds a spin-lock with interrupts disabled, and therefore very costly.
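The cost of re-cascading can be illustrated with a toy two-level wheel. This is a simplified model written for this post, with invented names and a tiny slot count; it is not the kernel's data structure. Timers that are far away sit in the coarse level and are moved ("cascaded") into the fine level when their time range comes up, even if they are cancelled right afterwards.

```python
class CascadingWheel:
    SLOTS = 8  # slots per level (toy value)

    def __init__(self):
        self.now = 0
        self.level0 = [[] for _ in range(self.SLOTS)]  # 1 tick per slot
        self.level1 = [[] for _ in range(self.SLOTS)]  # SLOTS ticks per slot
        self.cascaded = 0  # number of timers moved between levels
        self.fired = []

    def _place(self, expiry):
        if expiry - self.now < self.SLOTS:
            self.level0[expiry % self.SLOTS].append(expiry)
        else:
            self.level1[(expiry // self.SLOTS) % self.SLOTS].append(expiry)

    def add(self, expiry):
        delta = expiry - self.now
        if not 1 <= delta < self.SLOTS * self.SLOTS:
            raise ValueError("expiry out of range for this toy wheel")
        self._place(expiry)

    def cancel(self, expiry):
        for bucket in self.level0 + self.level1:
            if expiry in bucket:
                bucket.remove(expiry)
                return True
        return False

    def tick(self):
        self.now += 1
        if self.now % self.SLOTS == 0:
            # Level 0 wrapped: cascade one coarse bucket down into level 0.
            index = (self.now // self.SLOTS) % self.SLOTS
            pending, self.level1[index] = self.level1[index], []
            for expiry in pending:
                self.cascaded += 1
                self._place(expiry)
        slot = self.now % self.SLOTS
        self.fired.extend(self.level0[slot])
        self.level0[slot] = []


wheel = CascadingWheel()
wheel.add(10)            # a timeout 10 ticks away lands in level 1
for _ in range(8):
    wheel.tick()         # at tick 8 it is cascaded into level 0
wheel.cancel(10)         # ...and cancelled right after: the move was wasted
for _ in range(8):
    wheel.tick()
print(wheel.cascaded, wheel.fired)  # 1 []
```

In the kernel, that wasted move happens with a spin-lock held and interrupts disabled, which is what made it so expensive.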
It took a 3-month effort to analyze the problem, then 2 months for a design and POC phase, followed by 1 month for the implementation, posting and review process. Some enhancements are still in flight.
The conversion was mostly smooth, except for a userspace visible regression that was detected 1 year after the code was merged upstream.
The takeaway of this refactoring is to be prepared to do palaeontological research; don't expect anyone to know anything, or even care. And finally, be prepared for late surprises.
Git is the absolutely necessary tool for this work, with grep, log and blame. And if you need to dig through historical code, use the tglx/history merged repository.
Coccinelle is also very useful, but it's a bit hard to learn and remember the syntax.
Mail archives are very useful, but they need to be searchable. Other useful tools are quilt, ctags, and of course a good espresso machine.
In the end, this isn't for the faint of heart, says Thomas. But it brings a lot of understanding of kernel history. It also gives you the skill to understand undocumented code. The hardest part is to fight the "it worked well until now" mentality. But it is fun, for some definition of fun.
What's inside the input stack?
by Benjamin Tissoires
Why talk about input, isn't it working already, Benjamin asked. But the hardware makers are creative, and keep creating new devices with questionable designs.
The usages keep evolving as well, with the ubiquitous move to touchscreen devices for example.
The kernel knows about hardware protocols (HID), talks over USB, and sends evdev events to userspace.
libinput was created on top of libevdev "because input is easy"; but it keeps being enhanced after three years, showing the simplicity of the task. It handles fancy things like gestures.
The toolkits use libevdev, but they also handle gestures because of different touchscreen use cases.
On top of that, the apps use toolkits.
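At the bottom of this stack, the evdev events that the kernel sends to userspace are small fixed-size records. As a rough illustration, the sketch below decodes one such record using the layout of struct input_event on 64-bit Linux (a timestamp made of two longs, then a 16-bit type, a 16-bit code and a 32-bit value). The constants come from the kernel's input event codes; the helper name is made up, and real applications would normally go through libevdev or libinput rather than parse this by hand.

```python
import struct

# struct input_event on 64-bit Linux: timeval (sec, usec), type, code, value.
EVENT_FORMAT = "llHHi"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)

EV_SYN = 0x00  # synchronization marker between groups of events
EV_KEY = 0x01  # key or button state change
KEY_A = 30


def decode_event(data):
    """Decode one raw evdev record into a dictionary."""
    sec, usec, ev_type, code, value = struct.unpack(EVENT_FORMAT, data)
    return {"time": sec + usec / 1_000_000,
            "type": ev_type, "code": code, "value": value}


# Build a fake "A key pressed" record instead of reading a real device.
raw = struct.pack(EVENT_FORMAT, 1, 500000, EV_KEY, KEY_A, 1)
event = decode_event(raw)
print(event["type"] == EV_KEY, event["code"] == KEY_A, event["value"])
```

On a real system the same records can be read from a /dev/input/event* node in chunks of EVENT_SIZE bytes, which usually requires elevated permissions.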
The good, bad and ugly
Keyboards are mostly working, so it's good. Except for that Caps Lock LED in a TTY being broken since UTF-8 support isn't in the kernel.
Mice are old too, so they are a solved problem. Except for those featureful gaming mice, for which the libratbag project was created to configure all the fancy features.
Most touchpads are still using PS/2, but extending the protocol to add support for more fingers. On Windows, the touchpads communicate over i2c (in addition to PS/2). Sometimes the i2c enumeration goes through PS/2, but other times through UEFI.
There were a few security issues, with an issue on Chromebook where they allowed the webapp to inject HID events through the uhid driver, and this enabled exploiting a buffer overflow in the kernel.
In 2016, the MouseJack vulnerability enabled remotely hacking wireless mice, which meant you could remotely send key events to a computer. You could also force a device to connect to your receiver. A receiver firmware update for Logitech mice was pushed through GNOME Software.
Linux Kernel Release Model
by Greg Kroah-Hartman
While the kernel has 24.7M lines of code in more than 60k files, you only run a small percentage of that at a given time. There are a lot of contributors, and a lot of changes per hour. The rate of change is in fact accelerating.
This is something downstream companies don't realize. They're getting behind faster than ever when not working with upstream.
The release model is now that there's a new release every 2 or 3 months. All releases are stable. This time-based release model works really well.
The "Cambridge Promise", is that the kernel will never break userspace. On purpose. This promise was formalised in 2007, and kept as best as possible.
Version numbers mean nothing. Greg predicts that the first number will be incremented every 4 years, so we might see Linux 5.0 in 2019.
The stable kernels are branched after each release. They have publicly documented rules for what is merged, the most important one being that a patch has to be in Linus' tree.
Longterm kernels are special stable versions, selected once a year, that are maintained for at least 2 years. This rule is now even applied by Google for every future Android device. This makes Greg think he might want to maintain some of those kernels for a longer time. Since people care, the longterm kernels also have a higher rate of bugfixes.
Greg says you should always have a mechanism to update your kernel (and OS). What if you can't? Blame your SoC provider. He took the example of a Pixel phone, where there's a 2.8M-line patch on top of mainline, for a total of 3.2M lines of running code. 88% of the running code isn't reviewed. It's very hard to maintain and update.
Greg's stance is that all bugs can eventually be a "security" issue. Even a benign fix might become a security fix years later once someone realizes the security implications. Which is why you should always update to your latest stable kernel, and apply fixes as soon as possible.
In conclusion, Greg says to take all stable kernel updates, and enable hardening features. If you don't use a stable/longterm kernel, your device is insecure.
Fixing Coverity Bugs in the Linux Kernel
by Gustavo A. R. Silva
Coverity is a static source code analyzer. There are currently around 6000 issues reported by the tool for the Linux kernel; those are sorted in different categories.
The first category is illegal memory access, followed by the medium category.
Gustavo first worked on a missing break in a switch in the usbtest driver. He first sent a patch to fix the issue, then a second one to refactor the code, following advice from the maintainer.
Then he worked on arguments passed in the wrong order in scsi drivers. Next came an uninitialized scalar variable, and others. Gustavo showed many examples with obvious commenting or logic bugs.
Tracking exactly which bugs were fixed was really useful to take note of similar issues. He sent in total more than 200 patches in three months, in twenty-six different subsystems.
Software Heritage: Our Software Commons, Forever
by Nicolas Dandrimont
Open Source Software is important, Nicolas says. Its history is part of our heritage.
Code disappears all the time, whether maliciously, or when a service like Google Code is shut down.
Software Heritage is an open project to preserve all the open source code ever made available. The main targets are VCS repositories and source code releases. Everything is archived in the most VCS-agnostic data model possible.
The project fetches source code from many sources, and then deduplicates it using a Merkle tree. There are currently 3.7B source files from 65M projects. It's already the richest source code archive available, and growing daily.
How do you store all of this on a limited budget (100k€ hardware budget)? It all fits in a single (big) machine. The metadata is stored in PostgreSQL, and the files are in filesystems. XFS was selected, and they hit its bottlenecks pretty quickly.
They are thinking of moving to a scale-out object storage system like Ceph. The project wants to lower the bar for anyone wanting to do the same thing. They also have plans to use more recent filesystem features.
Software Heritage is currently looking for contributors and sponsors for this project.
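Deduplication with a Merkle tree boils down to naming every object by a hash of its content, so that identical files and identical directories are stored only once no matter how many projects contain them. The sketch below shows the general technique with SHA-1 and invented class and method names; the talk notes do not describe Software Heritage's actual identifier scheme, so this should not be read as their format.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: objects are keyed by their hash."""

    def __init__(self):
        self.blobs = {}

    def add_blob(self, data):
        digest = hashlib.sha1(data).hexdigest()
        # Identical content hashes to the same key and is stored once.
        self.blobs.setdefault(digest, data)
        return digest

    def add_tree(self, files):
        # A directory's id is the hash of its (name, blob id) entries,
        # so two identical directories get the same id.
        entries = [f"{name}\0{self.add_blob(files[name])}"
                   for name in sorted(files)]
        return hashlib.sha1("\n".join(entries).encode()).hexdigest()


store = ContentStore()
tree_a = store.add_tree({"README": b"hello", "main.c": b"int main;"})
tree_b = store.add_tree({"main.c": b"int main;", "README": b"hello"})
tree_c = store.add_tree({"README": b"hello", "util.c": b"int util;"})
print(tree_a == tree_b, tree_a == tree_c, len(store.blobs))  # True False 3
```

Three directories with six files in total end up as only three stored blobs, and the fork-like copy (tree_b) costs nothing extra.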