Kernel Recipes 2016 notes
A "live" blogging attempt
September 28, 2016
Update 2016-10-21: I've added links to the videos and LWN articles, which are of much higher quality than these live notes.
This year I'm trying a live blog of Kernel Recipes 2016, live from Paris, at Mozilla's headquarters. You can watch the live stream here.
The kernel report
by Jonathan Corbet; video
We've had 5 kernel releases since last November, with 4.8 hopefully coming out on Oct 2nd. There were between 12 and 13k changesets for each release, and about 1.5k developers contributed to each one.
The number of developers contributing to each release is stable, growing slowly. Each new release sees about 200 first-time contributors.
The development process continues to run smoothly, and not much is changing.
Security
Security is a hot topic right now. Jon showed an impressive list of CVE numbers, estimating that the actual number of flaws is about double that.
The current process for fixing security flaws is like a game of whack-a-mole: new flaws keep appearing, and it's not clear we can keep up.
The distributors also aren't playing their part in pushing updates to users.
So vulnerabilities will always be with us, but what is possible is eliminating whole classes of exploits. Examples of this include:
- Post-init read-only memory in 4.6
- Use of GCC plugins in 4.8
- Kernel stack hardening in 4.9
- Hardened usercopy in 4.8
- Reference-count hardening, which is being worked on.
A lot of this originates in grsecurity.net, some of it is being funded by the Core Infrastructure Initiative.
The catch is that there are performance impacts, so it's a tradeoff. Can kernel developers be convinced it's worth the cost? Jonathan is optimistic that mindsets are shifting towards a yes.
Kernel bypass
A new trend is to bypass the kernel, for instance the network stack for people doing high-frequency trading.
Transport over UDP (TOU) is an example of this, enabling applications to implement transport protocols in userspace. The QUIC protocol in Chrome is an example.
The goal here is to be able to make faster changes to the protocol. For instance, TCP Fast Open has been available for a long time in the kernel, but most devices out there (Android, etc.) run such old kernels that almost nobody uses it.
Another goal is to avoid middlebox interference (for example, they mess with TCP congestion, etc.). So here, the payload is "encrypted" and not understood by those middleboxes, so they can't interfere with it.
The issue with TOU is that we risk having every app (Facebook, Google, etc.) speak its own protocol, killing interoperability. So the question is: will the kernel still be a strong unifying force for the net?
BPF
The Berkeley Packet Filter is a simple in-kernel virtual machine. Users can load code in the kernel with the bpf() syscall.
It's safe because there are many rules and limitations ensuring BPF programs don't pose a problem: they can't loop, access arbitrary memory, read uninitialized memory, or leak kernel pointers to user space, for example.
The original use case of BPF was of course packet filtering. Nowadays it also allows system call restriction with seccomp(), perf event filtering, and tracepoint data filtering and analysis. This is finally the Linux "dtrace".
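To give a concrete feel for the seccomp() use case, here is a minimal sketch (not from the talk) of a classic-BPF filter that denies chmod() with EPERM and allows everything else; a production filter would also check the architecture field in seccomp_data, and error handling is omitted.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
	struct sock_filter filter[] = {
		/* load the syscall number from struct seccomp_data */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* if it is chmod, fall through to the EPERM return */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_chmod, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* mandatory before loading a filter without CAP_SYS_ADMIN */
	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

	return chmod("/tmp/x", 0644);	/* now fails with EPERM */
}
```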
Process
A lot has changed since 2.4. At the time distributors backported lots of code and out-of-tree features.
Since then, the "upstream first" rule and the regular release cadence (every 10 weeks or so) have solved a lot of problems.
Yet there are still issues. For instance, a phone running the absolute latest release of Android (7.0) is still running kernel 3.10, which was released in June 2013 and is 221k+ patches behind mainline.
So why is this? Jonathan says fear of the mainline kernel is one reason: with the rate of change, there's always the possibility of new bugs and regressions.
Jon then showed a table compiled by Tim Bird showing that most phones have a vast amount of out-of-tree code to forward-port: between 1.1M and 3.1M lines of inserted code!
Out-of-tree code exists partly because upstreaming can take a long time. For example, wakelocks or USB charging aren't upstream. Other changes, like scheduler rewrites, are simply not upstreamable. The kernel moves too slowly for people shipping phones every 6 months.
This is a collision of two points of view: manufacturers say that "nobody will remember our product next year", while kernel developers say they've been here for 25 years and intend to continue to be here. This is quite a challenge that the community will have to overcome.
GPL enforcement
To sue or not to sue?
Some say that companies will not comply without the threat of enforcement. Others say that lawsuits would just shut down any discussions with companies that might become contributors in the future.
Contribution stats show that independent contributors account for at most about 15%; the rest of the contributions come from people paid by companies to do the work. Alienating those companies might therefore not be the best idea.
Corbet put it this way: do we want support for this current device eventually, or do we want support from companies indefinitely?
entry_*.S: A carefree stroll through kernel entry code
by Borislav Petkov; video
There are a few reasons for entry into the kernel: system calls, interrupts (software/hardware), and architectural exceptions (faults, traps and aborts).
Interrupt or exception entry needs an IDT (Interrupt Descriptor Table); the interrupt number is used as an index into it, for example.
Legacy syscalls had quite an overhead due to segment-based protections. This evolved with long mode, which requires a flat memory model with paging. Borislav then explained how to set up the MSRs used for syscall entry.
The ABI described is x86-specific (Borislav is one of the x86 maintainers): which registers to set up (rcx, rip, r11) in order to do a long-mode syscall, and what the kernel does on entry. Which flags should be set or cleared? Read his slides (or the kernel code) for a nice description.
entry_SYSCALL_64 …
… is the name of the entry point that runs once we're in the kernel; the syscall arguments (up to six) are passed in registers.
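For illustration (this is not from the talk), the register convention that entry code implements can be seen from the userspace side: the syscall number goes in rax, arguments in rdi/rsi/rdx/r10/r8/r9, and rcx/r11 are clobbered by the SYSCALL instruction itself.

```c
#include <sys/syscall.h>

/* write(2) issued as a raw x86-64 syscall, bypassing the libc wrapper */
static long raw_write(int fd, const void *buf, unsigned long count)
{
	long ret;

	asm volatile("syscall"
		     : "=a"(ret)                      /* return value in rax */
		     : "a"(SYS_write), "D"((long)fd), /* nr in rax, arg1 in rdi */
		       "S"(buf), "d"(count)           /* arg2 in rsi, arg3 in rdx */
		     : "rcx", "r11", "memory");       /* clobbered by SYSCALL */
	return ret;
}

int main(void)
{
	return raw_write(1, "hello from a raw syscall\n", 25) < 0;
}
```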
SWAPGS is then executed, GS and FS being among the only segment registers still used. Then the userspace stack pointer is saved.
Then the kernel stack is set up (using a per-CPU variable), reading the cpu_tss struct.
Once the stack is set up, the user pt_regs structure is constructed and handed to helper functions. A full IRET frame is set up in case of preemption.
After that, the thread-info flags are checked in case there's a special situation that needs handling, like ptrace'd syscalls.
Then the syscall table is indexed with the syscall number in RAX. Depending on what the syscall needs, it is called slightly differently.
Once the syscall has been called, there is some exit work, like saving the regs, moving pt_regs on stack, etc.
A newer thing on the return path is SYSRET, which is faster than IRET (IRET being implemented in microcode), saving ~80ns of syscall overhead. SYSRET does fewer checks; whether it can be used depends on the syscall, and on whether it's on the slowpath or the fastpath.
If the opportunistic SYSRET fails, an IRET is done instead, after restoring registers and swapping GS again.
On the legacy path, for 32-bit compat syscalls, 16 bits of ESP might be leaked; this is fixed with per-CPU ministacks of 64B, the cacheline size. Those ministacks are RO-mapped so that IRET faults are promoted and get their own stack[…].
cgroup v2 status update
by Tejun Heo; video
The new cgroup rework started in Sep 2012 with gradual cleanups.
The experimental v2 unified hierarchy support was implemented in Apr 2014.
Finally, the cgroup v2 interface was exposed in 4.5.
Differences in v2
The resource model is now consistent across memory, IO, etc.: accounting and control work the same way.
Some resources spent can't be charged immediately. For instance, an incoming packet might consume a lot of CPU in the kernel before we know to which cgroup to charge these resources.
There's also a difference in granularity and delegation. For example, what to do when a cgroup is empty is well defined, with proper notification of the root controllers.
The interface conventions have been unified; for example, for weight-based resource sharing, the interfaces are consistent across controllers.
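As a rough idea of what that unified interface looks like in practice, here is a minimal sketch; the mount point and the limit values are illustrative, and it assumes cgroup v2 is mounted at /sys/fs/cgroup.

```c
#include <stdio.h>
#include <sys/stat.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) { perror(path); return; }
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	/* create a child group in the single unified hierarchy */
	mkdir("/sys/fs/cgroup/demo", 0755);

	/* enable the memory and pids controllers for children of the root */
	write_file("/sys/fs/cgroup/cgroup.subtree_control", "+memory +pids");

	/* per-controller files follow the same conventions inside the group */
	write_file("/sys/fs/cgroup/demo/memory.high", "512M");
	write_file("/sys/fs/cgroup/demo/pids.max", "64");

	/* move the current process into the new group */
	write_file("/sys/fs/cgroup/demo/cgroup.procs", "0");
	return 0;
}
```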
Cpu controller controversy
The CPU controller is still out of tree. There are disagreements around core v2 design features, see this LWN article for details.
One disagreement comes from page-writeback granularity, i.e. how to tie a specific writeback operation to a specific thread as opposed to a resource domain.
Another major disagreement is process granularity: the scheduler only deals with threads, while cgroup v2 offers only process-level granularity, not thread-level.
The scheduler priority control (the nice syscall) is also a very different type of interface from the cgroup control interface (echoing into a file).
Discussion on this subject is still ongoing.
The rest
A new pids controller was added in 4.3. It allows controlling the small resource that is the PID space (15 bits by default) and preventing its depletion.
Namespace support was added in 4.6, hiding the full cgroup path when you're inside a namespace, for example. A few bugs remain.
An rdma controller is incoming as well.
Userland support
systemd 232 will start using cgroup v2, including the out-of-tree cpu controller. It can use both cgroup v1 and v2 interfaces at the same time.
libvirt support is being worked on by Tejun Heo as well; he is currently deploying cgroup v2 with systemd at Facebook.
We've had some interesting questions from the audience with regards to some old quirks and security issues in cgroups, but Tejun is quite optimistic that v2 will fix many of those issues and bugs.
Old userland tools will probably be broken once cgroup v2 is the norm, but they are fixable.
from git tag to dnf update
by Konstantin Ryabitsev; video
How is the kernel released? (presentation)
Step 1: the git tag
It all starts with a signed git tag pushed by Linus. The transport is git+ssh for the push.
It connects to git master, a server in Portland Oregon maintained by the Linux Foundation.
The ssh transport passes the established connection to a gitolite shell. gitolite uses the public key of the connection (through an env variable) to identify the user. Then the user talks to the gitolite daemon.
Before the push is accepted, a two-factor authentication is done via 2fa-val. This daemon allows the user to validate an IP address for a push. It uses the TOTP protocol. The 2fa token is sent through ssh by the user. It allows the user to authorize an IP address for a certain period of time (usually 24h).
Once the push is accepted, gitolite passes control to git for the git protocol transfer.
As a post-commit hook, the "grokmirror" software is used to propagate changes to the frontend servers.
grokmirror updates a manifest that is served through httpd (a gzipped json file), on a non-publicly accessible server.
On a mirror server connected through a VPN, the manifest is checked for changes every 15 seconds, and if there's a change, the git repo is pulled.
On the frontend, the git daemon is running, serving the updated repos.
Step 2: the tarball
To generate the tar, the git archive command is used. The file is then signed with gpg.
kup (kernel uploader) is then used to upload the tarball. Alternatively, it can ask the remote to generate the tarball itself from a given tag, saving lots of bandwidth; only the signature is then uploaded. The archive is then compressed and put in the appropriate public directory.
kup uses the ssh transport as well to authenticate users. The kup server stores the tarball in temporary storage.
The tarball is then downloaded by the marshall server, and copied over NFS to the pub master server.
The pub master server is mounted over NFS on a Raspberry Pi that watches for directory changes and updates the sha256sums file signatures. On marshall, a builder server checks whether the git tag and tarball are available and then runs pelican to update the kernel.org frontpage.
Finally, to publicly get the tarballs, you shouldn't use ftp. It is recommended to use https or rsync, or even https://cdn.kernel.org which uses Fastly.
Maintainers Don't Scale
by Daniel Vetter; video, LWN article
I took a break here, so you'll only find a summary of the talk. Talk description here
Daniel presented the new model adopted by the open source Intel graphics team: include every regular contributor as a maintainer. His trick? Give them all commit access.
The foreseen problems failed to materialize, and everything now works smoothly. Can this process be applied elsewhere?
Patches carved into stone tablets
by Greg Kroah-Hartman; video, LWN article
Why do we use mail to develop the kernel? presentation
Because it is faster than anything else. There are 7 to 8 changes per hour, and 75 maintainers took on average 364 patches.
There are a lot of reviewers.
A good person knows how to choose good tools. So Greg reviews a few tools.
Github is really nice: free hosting, drive-by contributors, etc. It's great for small projects. The downside is that it doesn't scale for large projects. Greg gives kubernetes as an example: there are 4000+ issues, 500+ outstanding pull requests. Github is getting better at handling some issues, but still requires constant Internet access, while the kernel has contributors that don't have constant Internet access.
gerrit's advantage is that project managers love it, because it gives them a sense of understanding what's going on. Unfortunately, it makes patch submission hard, it's difficult to handle patch series, and it doesn't allow viewing a whole patch at once if it touches multiple files. It's slow to use, it makes local testing hard (people have to work around it with scripts), and it's hard to maintain as a sysadmin.
Plain text email has been around forever, and it's what the kernel uses. Everybody has access to email, it works with many types of clients, and it's the same tool you use for other types of work. A disadvantage is that many clients suck: Gmail, Exchange, Outlook (Gmail as a mail server is fine; the web client is the problem).
Read Documentation/email_clients.txt in order to learn how to configure yours.
Another advantage of email is that you don't need to impose any tool; some kernel developers don't even use git! Git still works really well with email: it understands patches in mailbox format (git am), and you can pipe emails to it.
Project managers don't like it though because they don't see the status.
But there's a simple solution: you can simply install Patchwork, which you plug into your mailing list, and it gives you a nice overview of the current status. There's even a command line client.
Why does it matter? Greg says it's simple, has a wide audience, it's scalable, and it grows the community by allowing everybody to read and understand how the development process works. And there are no project managers.
Kubernetes and docker (github-native projects) are realizing this.
Greg's conclusion is that email is currently the best (or least bad?) tool for the job.
Why you need a test strategy for your kernel development
by Laurent Pinchart; video
Laurent showed us an example of how a very small, seemingly inconsequential change might introduce quite a bug. There's a need to test everything before submitting.
The toolbox he used when he started testing his V4L capture driver was quite simple, composed of a few tools run in the console, over two different telnet connections.
He quickly realized that running the commands by hand every time wouldn't scale. After writing a script simplifying the commands, he realized that running the script in each of the 5 different terminal connections wouldn't scale either.
After this, he automated even further by putting images to be compared in a directory and comparing them with the output. But the test set quickly grew to over a gigabyte of test files.
Instead of using static files, the strategy was then to generate the test files on the fly with an appropriate program.
He then ran into an issue where the hardware wasn't sending data according to the datasheet. While looking at the data, he discovered he had to reverse engineer how the hardware implemented a specific image conversion algorithm (RGB to HSV).
The rule of thumb Laurent advises is to have one test per feature, to add one test for each bug, and finally to add a test for each non-feature: for example, when you pass two opposite flags, you should get an error.
The test suite Laurent developed is called vsp-tests and is used to test the specific VSP driver he has been working on. There are many other kinds of tests in the kernel (selftests, virtual drivers...) or outside of it (intel-gpu-tools, v4l2-compliance, Linaro LAVA tests...).
While there are many test suites in kernel development, there's no central place to run them all.
Regarding CI, the 0-Day project now monitors git trees and kernel mailing lists, and performs kernel builds for many architectures, in a patch-by-patch way. On failure it sends you an email. It also runs coccinelle, providing a patch to fix the issues it detects. Finally, it does all that in less than one hour.
kernelci.org is another tool doing CI for kernel developers. There will be a talk about it on the next day.
There's also Mark Brown's build bot and Olof Johansson's autobuilder/autobooter.
That's it for day one of Kernel Recipes 2016!
Man-pages: discovery, feedback loops and the perfect kernel commit message
by Michael Kerrisk; video
Michael has been contributing man pages since around 2000. There are around 1,400 pages.
When providing an overview, there are a few challenges: providing a history of the API, the reason for the design, etc.
One of Michael's recent goals has been preventing the addition of new buggy Linux APIs. There are quite a few examples of such APIs; one of the reasons is the lack of prerelease testing.
There are design inconsistencies, like the different clone() versions. Behavioral inconsistencies might also creep in, like the differences between mlock() and remap_file_pages() in handling page boundaries.
Many process-changing APIs have different combinations of rules for matching the credentials of the process allowed to make the change.
Another issue is long-term maintainability: an API must be extensible, make sure flags are properly handled, and reject bad combinations.
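A sketch of the flag-handling pattern Michael argues for; the syscall and flag names here are made up for illustration.

```c
#include <errno.h>

#define FROB_CLOEXEC	0x01
#define FROB_NONBLOCK	0x02
#define FROB_EXCL	0x04
#define FROB_SHARED	0x08
#define FROB_VALID_FLAGS \
	(FROB_CLOEXEC | FROB_NONBLOCK | FROB_EXCL | FROB_SHARED)

long sys_frobnicate(unsigned int flags)
{
	/* reject unknown bits, so they stay available for future extensions */
	if (flags & ~FROB_VALID_FLAGS)
		return -EINVAL;

	/* reject nonsensical combinations instead of guessing */
	if ((flags & FROB_EXCL) && (flags & FROB_SHARED))
		return -EINVAL;

	/* ... real work would go here ... */
	return 0;
}
```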
We don't do proper API design in Michael's opinion. And when it fails, userspace can't be changed, and the kernel has to live with the problems forever.
Mitigations
In order to fix this, unit tests are a good first step; the goal is to prevent regressions. But where should they live? One of the historical homes of kernel testing was the Linux Test Project, but those tests are out of tree, with only partial coverage.
In 2014, the kselftest project was created; it lives in-tree and is still maintained.
A test needs a specification. It turns out specifications help tell the difference between the implementation and the intent of the programmer. It's recommended to put the specification at minimum in the commit message, and at best to send a man-page patch.
Another great mitigation is to write a real application. inotify is a good example of that: it took Michael 1,500 lines of code to fully understand the limitations and tricks of inotify. For example, you can't know which user/program made a given file modification. The non-recursive monitoring nature of inotify also turned out to be quite expensive for large directories. A few other limitations were found while writing an example program.
The main point is that you need to write a real-world application if you're writing any non-trivial API, in order to find its issues.
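For reference, a minimal sketch (not Michael's program) of the kind of inotify consumer he describes; it makes two of the limitations above visible: the event never says which process made the change, and the watch only covers one directory, not its subdirectories.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(int argc, char **argv)
{
	char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
	ssize_t len;
	int fd, wd;

	fd = inotify_init1(IN_CLOEXEC);
	/* non-recursive: every subdirectory would need its own watch */
	wd = inotify_add_watch(fd, argc > 1 ? argv[1] : ".",
			       IN_CREATE | IN_DELETE | IN_MODIFY);
	if (fd < 0 || wd < 0) {
		perror("inotify");
		return 1;
	}

	while ((len = read(fd, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (p < buf + len) {
			struct inotify_event *ev = (struct inotify_event *)p;

			/* mask and name only: no PID or UID of the culprit */
			printf("mask=0x%x name=%s\n",
			       ev->mask, ev->len ? ev->name : "");
			p += sizeof(*ev) + ev->len;
		}
	}
	return 0;
}
```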
Last but not least, writing good documentation is a great idea: it widens the audience, allows easier understanding, etc.
Issues
A problem though is the discoverability of new APIs. A good idea is to Cc the linux-api mailing list; Michael runs a script to watch for changes, for example. This matters because ABI changes sometimes happen involuntarily, while they are a complete no-no in kernel development.
Sometimes we get silent API changes. One example was an adjustment of the POSIX message queue implementation that was only discovered years later; by then it was too late to revert. Of course, this API had no unit tests.
The goal is to get as much feedback as possible before the API is released to the public: shorten the feedback loop.
Examples
The example of a recent cgroup change was given, where improvements to the commit message over successive versions gave people a better understanding of the problem being corrected. It makes life easier for the reviewer, for userspace developers, etc.
The advice to developers for a commit message is to assume as little knowledge as possible on the part of the audience. This needs to be done at the beginning of the patch series, so that many people can give feedback.
The last example is Jeff Layton's OFD locks, a near-perfect API change proposal: well explained, with example programs, publicity, a man-page patch, a glibc patch, and even going as far as proposing a POSIX standard change.
In response to a question in the audience about the state of process to introduce Linux kernel changes, Michael went as far as to propose that there be a full-time Linux Userspace API maintainer, considering the huge amount of work that needs to be done.
Real Time Linux: who needs it? (Not you!)
by Steven Rostedt; video
What is Real Time?
It's not about being fast. It's about determinism. It gives us repeatability, reliability, a known worst-case scenario and known reaction times.
Hard Real Time is mathematically provable, and has bounded latency. The more code you have, the harder it is to prove.
With soft Real Time you can deal with outliers, but have unbounded latency.
Examples of hard real time include airplane engine controls, nuclear power plants, etc. Soft real time examples include video systems, video games, and some communication systems. Linux today is a Soft Real Time system.
PREEMPT_RT in Linux
It's not a Soft Real Time system because it doesn't allow for outliers or unbounded latency. But it's not Hard Real Time either because it can't be mathematically proven. Steven says it's Hard Real Time "Designed".
If it had no bugs, Linux with PREEMPT_RT would be a Hard Real Time system. It's used by the financial industry, audio recording (JACK), and navigational systems.
Lots of features from PREEMPT_RT have been integrated into the kernel. Examples include high-resolution timers, the deadline scheduler, lockdep, ftrace, the mostly tickless kernel, etc. It allowed people to test SMP-related bugs with only one CPU, since it changed the way spinlocks worked, giving Linux years of advance in SMP performance.
But PREEMPT_RT also keeps evolving and getting bigger every year. Features still only in PREEMPT_RT include the conversion of spinlocks to sleeping mutexes.
Latency always happens. When an interrupt fires, it might run and steal processor time from a high-priority thread. But with threaded interrupts, you can make sure the "top half" runs for as little time as possible, just waking up the thread that will handle the interrupt.
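A sketch of that split using the kernel's threaded-interrupt API (the device specifics are made up): the hard handler only acknowledges the hardware, and the heavy work runs in a kernel thread whose priority can be tuned against the rest of the system.

```c
#include <linux/interrupt.h>

static irqreturn_t demo_hardirq(int irq, void *dev_id)
{
	/* "top half": ack the device and get out as fast as possible */
	return IRQ_WAKE_THREAD;
}

static irqreturn_t demo_thread_fn(int irq, void *dev_id)
{
	/* runs in a schedulable thread; this is where the real work goes */
	return IRQ_HANDLED;
}

/* in the driver's probe path:
 *	ret = request_threaded_irq(irq, demo_hardirq, demo_thread_fn,
 *				   IRQF_ONESHOT, "demo", dev);
 */
```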
Hardware matters
The hardware needs to be real-time (cache/TLB misses, etc.) as well, but this is the topic of Steven's next talk. Come back tomorrow!
kernelci.org: 2 million kernel boots and counting
by Kevin Hilman; video
Kevin showed his growing collection of boards sitting in his home office, which is in fact part of kernelci.org.
Over the last years, the number of different boards supported by device trees has exploded, while board files have been slowly removed. The initial goal was therefore to test as many boards as possible, while trying to keep up with the growing number of new machines.
It started with the automation of a small board farm, and then grew into kernelci.org, which builds, boots and reports on the status through web, mail or RSS.
Many trees are being tested, with many maintainers requesting that their tree be part of the project.
The goal of kernelci.org is to boot those kernels; building is just a required step. There are currently 31 unique SoCs, across four different architectures, with 200+ unique boards.
A goal is to quickly find regressions on a wide range of hardware. Another goal is to be distributed: anyone with a board farm can contribute. There are currently labs at Collabora, Pengutronix, BayLibre, etc. And all of this is done in the open, by a small team, none of whose members works full-time on it.
Once booted, a few test suites are run, but no reporting or regression detection is done, and this only happens on a small subset of platforms. The project is currently looking for help with visualization and regression detection, since the logs of these tests aren't automatically analyzed. They would also like to have more hardware dedicated to long-running tests.
They have a lot of ideas for new features that might be needed, like comparing size of kernel images, boot times, etc.
The project is also currently working on energy regressions. The project uses the ARM energy probe and BayLibre's ACME to measure power during boot, tests, etc. The goal is to detect major changes, but this is still under development. Data is being logged, but not reported or analyzed either.
How to help? A good way to start might be to just try it, and watch the platforms/boards you care about. The project is looking for contributors to the tools, but also for people to automate their own labs and submit results. For the lazy, Kevin says you can just send him the hardware, as long as it's not noisy.
Kevin finally showed the schematics he uses to connect many boards, using an ATX power supply, USB-controlled relays and huge USB hubs. USB port usage explodes since, in the ARM space, many boards need a USB power supply, plus another USB port for the serial converter.
Debian's support for Secure Boot in x86 and arm
by Ben Hutchings; LWN article
Secure Boot is an optional feature in UEFI that protects against persistent malware if implemented correctly. The only commonly trusted certificates on PCs are for Microsoft's signing keys. Microsoft will sign bootloaders for PCs for a small fee, but Windows ARM systems are completely locked down.
For GNU/Linux, the first stage needs an MS signature. Most distributions ship "shim" as a first-stage bootloader that won't need updating often.
For the kernel, you can use Matthew Garrett's patchset to add a 'securelevel' feature, activated when booted with Secure Boot, that makes module signatures mandatory, and disables kexec, hibernation and other peek/poke kernel APIs. Unfortunately this patchset is not upstream.
The issue with signatures is that you don't want to expose signing keys to build daemons. You need reproducible builds that can't depend on anything secret, therefore you can't auto-build the signed binary in a single step. Debian's solution is to have an extra source package: the first one builds the unsigned binary, and the second one contains the signatures generated offline.
This new package is called linux-signed. It contains detached signatures for a given version, and a script to update them for a new kernel version. This is currently done on Ben's machine, and the keys aren't trusted by grub or shim.
Signing was added to the Debian archive software dak. This allows converting unsigned binaries to signed ones.
While this was already done in Ubuntu, the process is different for Debian (doesn't use Launchpad). Debian signs kernel modules, has detached signatures (as opposed to Ubuntu's signed binaries), and supports more architectures than amd64. Finally, the kernel packages from Ubuntu and Debian are very different.
Julien Cristau then came on stage to explain his work on signing with a PKCS#11 hardware security module (Yubikey for now). Signing with an HSM is slow though, so this is only done for the kernel image, not modules.
You can find more information on the current status of Secure Boot on the Debian wiki. The goal is to have all of this ready for the stretch release, which freezes in January 2017.
The current state of kernel documentation
by Jonathan Corbet; video
Documentation is unsearchable, and not really organized. There is no global vision, and everything is a silo.
Formatted documentation (in the source code) is interesting because it's next to the code. It's generated with "make htmldocs", a complex multi-step system developed by kernel developers. It parses the source files numerous times for various purposes, and is really slow. The output is ugly, and doesn't integrate with the rest of the Documentation/ directory.
How to improve this? Jon says it needs to be cleaned up, while preserving plain-text access.
Recently, asciidoc support was added in kernel comments. It has some advantages but adds a dependency on yet-another tool.
Jon suggests that it would have been better to get rid of DocBook entirely, and rework the whole documentation build toolchain instead of adding new tools on top.
To do this, Jon had a look at Sphinx, a documentation system in Python using reStructuredText. It is designed for documenting code and generating large documents, and is widely supported.
After posting a proof of concept, Jani Nikula took responsibility and pushed it into a working system. It now supports all the old comments, but also supports RST formatting. To include kerneldoc comments, Jani Nikula wrote an extension module to Sphinx.
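For readers who haven't seen one, this is roughly what a kernel-doc comment consumed by the new toolchain looks like (the function itself is made up):

```c
/**
 * frob_widget() - frobnicate a widget
 * @widget: the widget to operate on
 * @flags: FROB_* behaviour flags
 *
 * Longer description goes here; with the Sphinx-based toolchain the body
 * may contain reStructuredText markup such as *emphasis* or item lists.
 *
 * Return: 0 on success, a negative errno on failure.
 */
int frob_widget(struct widget *widget, unsigned int flags);
```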
All this work has been merged for 4.8, and there are now Sphinx documents for the kernel doc HOWTO, GPU and media subsystems.
Developers seem to be happy for now, and a new manual is coming in 4.9: Documentation/driver-api is a conversion of the device drivers book. Of course, this is just the beginning, as there are lots of files to convert to the new format, and Jon estimates this might take years until it's done.
For 4.10, a goal would be to consolidate the development process docs (HOWTO, SubmittingPatches, etc.) into a single document. The issue here is that some of these files are really well known and often pointed to, so this would in a way break a lot of "links".
Landlock LSM: Unprivileged sandboxing
by Mickaël Salaün; video, LWN article
The goal of landlock is to restrict processes without needing root privileges.
The use case is to be used by sandboxing managers (flatpak for example).
Landlock rules are attached to the current process via seccomp(). They can also be attached to a cgroup via bpf().
Mickaël then showed a demo of the sandboxing with a simple tool limiting the directories a given process can access.
The approach is similar to Seatbelt or OpenBSD Pledge. It's here to minimize the risk of sandbox escape and prevent privilege escalation.
Why don't existing features fit this model? The four other LSMs didn't fit the needs because they are designed to be controlled by the root/admin user, while Landlock is accessible to regular users.
seccomp-BPF can't be used because it can't filter syscall arguments that are pointers: you can't dereference userland memory to do deep filtering of syscalls.
The goal of Landlock is to have a flexible and dynamic rule system. It of course has hard security constraints: it aims to minimize the attack surface, prevent DoS, and be able to work for multiple users by supporting independent and stackable rules.
The initial thought was to extend the seccomp() syscall, but then it was switched to eBPF. The access rules are therefore sent to the kernel with bpf().
Landlock uses LSM hooks for atomic security checks; each rule is tied to one LSM hook. It uses maps of handles, a native eBPF structure, to give rules access to kernel objects. It also exposes filesystem helpers to eBPF rules, which are used to handle tree access or fs properties (mount point, inode, etc.).
Finally, bpf rules can be attached to a cgroup thanks to a patch by Daniel Mack, and Landlock uses this feature.
Rules are enforced either on the process hierarchy, via the seccomp() interface to which Landlock adds a new command, or via cgroups for container sandboxing.
The third RFC patch series for Landlock is available here.
Lightning talks
the Free Software Bastard Guide
by Clement Oudot; video
This is a nice compilation of things not to do as a user, developer or enterprise. While the talk was very funny, I won't do you the offense of making a summary, since I'm sure all my readers are very disciplined open source contributors.
Mini smart router
by Willy Tarreau; video
This is about a small device made by Gl-inet. It has an Atheros SoC (AR9331) with a MIPS processor, 2 ethernet ports, wireless, 64MB of RAM and 16MB of flash.
The documentation and sources are available for the Aloha Pocket, a small distro running on this hardware.
Corefreq
by Cyril
Corefreq measures Intel CPU frequencies and states, and gives you a few hardware metrics. You can learn more on the Corefreq GitHub page.
That's it for day two of Kernel Recipes 2016!
Speeding up development by setting up a kernel build farm
by Willy Tarreau; video, LWN article
Some people might spend a lot of time building the Linux kernel, and this hurts the development cycle/feedback loop. Willy says that during backporting sessions, build time might dominate development time. The goal here is to reduce the wait.
In addition, build times are often impossible to predict, since an error in the middle might break the build.
Potential solutions include buying a bigger machine or using a compiler cache, but these do not fit Willy's use case.
Distributed building is the solution chosen here. As a first step, this requires a distributable workload, which isn't trivial at all for most projects. Fortunately, the Linux kernel fits this model.
You need multiple machines, with the exact same compiler everywhere. Willy's proposed solution is to build the toolchain yourself with crosstool-ng, and then combine it with distcc, a distributed build controller with low overhead.
Distcc still does the preprocessing and linking steps locally, which will consume approx 20% to 30% of the build time. And you need to disable gcov profiling.
In order to measure efficiency of a build farm, you need to compare performance. This requires a few steps to make sure the metric is consistent, as it might depend on number of files, CPU count, etc. Counting lines of code after preprocessing might be a good idea to have a per-line metric.
Hardware
In order to select suitable machines, you first need to consider what you want to optimize for. Is it build performance at given budget, number of nodes, or power consumption ?
Then, you need to wonder what impacts performance. CPU architecture, DRAM latency, cache sizes and storage access time are all important to consider.
For the purpose of measuring performance, Willy invented a metric he calls "BogoLocs". He found that DRAM latency and the L3 cache matter more for performance than CPU frequency.
To optimize for performance, you must make sure your controller isn't the bottleneck: its CPU or network access shouldn't be saturated for instance.
PC-type machines are by far the fastest, with their huge cache and multiple memory channels. However, they can be expensive. A good idea might be to look at gamer-designed hardware, that provides the best cost-to-performance ratio.
If you're optimizing for a low number of nodes, buy a single dual-socket, high-frequency x86 machine with all RAM slots populated.
If you're optimizing for hardware cost, a single 4-core computer can cost $8 (NanoPi). But there are a few issues: hidden costs (accessories, network, etc.), throttling when the board heats up, instability due to overclocking, and only up to 1/16th the performance of an $800 PC.
You can also look at mid-range hardware (NanoPi-M3, Odroid C2), up to quad-core Cortex A9 at 2GHz, but then they run their own kernels. "High-range" low-cost hardware is often sold as "set-top boxes" (MiQi, RKM-v5, etc.); some of these can even achieve 1/4th the performance of an $800 PC. But there are gotchas as well: varying build quality, high power draw, thermal throttling.
The MiQi board at $35 (or its CS-008 clone) is Willy's main choice according to his performance measurements. It's an HDMI dongle that can be opened and used barebones. You don't need a big Linux distribution; a simple chroot is enough for gcc and distcc.
All the data from this presentation is on a wiki.
Understanding a real-time system: more than just a kernel
by Steven Rostedt; video
Real-time is hard. Having a PREEMPT_RT-patched kernel is far from enough. You need to look at the hardware underneath and the software on top of it, and in general have a holistic view of your system.
A balance needs to be found between a Real-Time system and a "Real-Fast" system.
You have to go with real-time hardware if you want a real-time system. It's the foundation, and if you don't have it, you can forget about your goal.
Non-real-time hardware features
The memory cache impacts determinism. One should find the worst-case scenario by trying to run with the cache disabled.
Branch prediction misses can severely impact determinism as well.
NUMA, used on multi-CPUs hardware, can cause issues whenever a task tries to access memory from a remote node. So the goal is to make sure a real-time task always uses local memory.
Hyper-Threading on Intel processors (or AMD's similar technology) is recommended to be disabled for real-time.
The Translation Lookaside Buffer (TLB) is a cache for page tables, which means any miss can kill determinism. For a real-time system, you need to look for the worst-case scenario during testing by constantly flushing the TLB.
Transactional Memory allows for parallel action in the same critical section, so it makes things pretty fast, but makes the worst case scenario hard to find when a transaction fails.
A System Management Interrupt (SMI) puts the processor in System Management Mode. On a customer box, Steven found that every 14 minutes an interrupt was eating CPU time; it turned out to be an SMI for ECC memory.
CPU frequency scaling needs to be disabled (idle polling); while not environmentally friendly, it's a necessity for determinism.
Non-real-time software features
When you're using threaded interrupts, you need to be careful about priority, especially if you're waiting for important interrupts, like network if you're waiting for data.
Softirqs need to be looked at carefully. They are treated differently in PREEMPT_RT kernels, since they are run in the context of who raises them. Except when they are raised by real Hard interrupts like RCU or timers.
System Management Threads like RCU, watchdog or kworker also need to be taken into account, since they might be called as side-effect of a syscall required by the real-time application.
Timers are not straightforward either; they might be triggered with signals, which have weird POSIX requirements, making the system complex and also impacting determinism.
CPU Isolation, whether used with the isolcpus kernel command line parameter, or with cgroup cpusets can help determinism if configured properly.
NO_HZ is good for power management thanks to longer sleeps, but might kill latency since coming out of sleep can take a long time, leading to missed deadlines.
NO_HZ_FULL might be able to help with real-time once ready, since it can keep the kernel from bothering real-time tasks by removing the last 1-second tick.
When writing an RT task, memory locking with mlockall() is necessary to prevent page faults from interrupting your threads. Enabling priority inheritance is a good idea to prevent some types of locking situations that lead to unbounded latency.
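A minimal sketch of that setup for a userspace RT task, under the usual assumptions (SCHED_FIFO available, sufficient privileges); the priority value is arbitrary.

```c
#include <pthread.h>
#include <sched.h>
#include <string.h>
#include <sys/mman.h>

static pthread_mutex_t shared_lock;

int setup_rt_task(void)
{
	struct sched_param sp;
	pthread_mutexattr_t attr;

	/* lock current and future memory so no page fault hits the RT path */
	if (mlockall(MCL_CURRENT | MCL_FUTURE))
		return -1;

	/* run under the FIFO real-time scheduling class */
	memset(&sp, 0, sizeof(sp));
	sp.sched_priority = 50;
	if (sched_setscheduler(0, SCHED_FIFO, &sp))
		return -1;

	/* priority inheritance on locks shared with lower-priority threads */
	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&shared_lock, &attr);

	return 0;
}
```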
Linux Driver Model
by Greg Kroah-Hartman; video
Greg says nobody needs to know about the driver model.
If you're doing reference counting, use struct kref; it handles lots of really tricky edge cases. You still need to provide your own locking though.
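A small sketch of the usual kref pattern (the object and names are illustrative): embed the kref, initialize it to one, and let the release callback free the object when the last reference is dropped.

```c
#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct demo_obj {
	struct kref refcount;
	/* ... payload ... */
};

static void demo_release(struct kref *kref)
{
	struct demo_obj *obj = container_of(kref, struct demo_obj, refcount);

	kfree(obj);
}

static struct demo_obj *demo_create(void)
{
	struct demo_obj *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

	if (obj)
		kref_init(&obj->refcount);	/* refcount starts at 1 */
	return obj;
}

/* elsewhere:
 *	kref_get(&obj->refcount);
 *	kref_put(&obj->refcount, demo_release);
 */
```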
The base object type is struct kobject, it handles the sysfs representation. You should probably never use it, it's not meant for drivers.
On top of that, struct attribute provides sysfs files for kobjects; these too should never be managed individually. The goal here is to have only one text or binary value per file. This prevents a problem seen in /proc, where multiple values per file broke lots of applications when values were added or unavailable.
kobj_type handles sysfs functions, namespaces, and release().
struct device is the universal structure, that everyone sees. It either belongs to a bus or a "class".
struct device_driver handles a driver that controls a device. It does the usual probe/remove, etc.
struct bus_type binds devices and drivers; it does the matching and handles uevents and shutdown. Writing a bus is a complex task: it requires at least 300 lines of code, carries lots of responsibilities, and has few helper functions.
Creating a device is not easy either, as you should set its position in the hierarchy (bus type, parent) and its attributes, and initialize it in a two-step way to prevent race conditions.
Registering a driver is a bit simpler (probe/release, ownership), but still complex. A struct class is a userspace-visible device class, very simple to create (30-40 lines of code). A class has a lot of responsibilities, but most of them are handled by default, so not every driver has to implement them.
Greg says USB is not a good example for understanding the driver model, since it's complex and stretches it to its limits. The usb2serial bus is a good example.
The implementation relies on multiple levels of hierarchy, and has lots of pointer indirections throughout the tree in order to find the appropriate function for an operation (like shutdown())
Driver writers should only use attribute groups, and (almost) never call sysfs_*() functions directly. Greg says you should never use platform_device: this interface is abused instead of using a real bus, or the virtual bus.
Greg repeated that raw sysfs/kobjects should never be used.
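As an illustration of the attribute-group approach (driver and attribute names made up), the files are declared once and created by the core when the device is registered, instead of calling sysfs_*() by hand:

```c
#include <linux/device.h>
#include <linux/sysfs.h>

static ssize_t temperature_show(struct device *dev,
				struct device_attribute *attr, char *buf)
{
	/* one value per file, as the sysfs rules require */
	return sprintf(buf, "%d\n", 42);
}
static DEVICE_ATTR_RO(temperature);

static struct attribute *demo_attrs[] = {
	&dev_attr_temperature.attr,
	NULL,
};
ATTRIBUTE_GROUPS(demo);

/* the core creates the files race-free at registration time, e.g.:
 *	device_create_with_groups(class, parent, devt, drvdata,
 *				  demo_groups, "demo%d", id);
 */
```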
Analyzing Linux Kernel interface changes by looking at binaries
by Dodji Seketeli; LWN article
What if we could see changes in the interfaces between the kernel and its modules just by looking at the ELF binaries? It would be a kind of diff for binaries, showing changes in a meaningful way.
abidiff already does almost all of this for userspace binaries. It builds an internal representation of an ABI corpus, and can compute differences. Dodji showed us how abidiff works.
Unfortunately, there's nothing like it yet for the Linux kernel. Dodji entertains the idea of a "kabidiff" tool that would work like abidiff, but for the Linux kernel.
For this to work, it would need to handle special Linux ELF symbol sections. For instance, it would restrict itself to the "__export_symbol" and "__export_symbol_gpl" sections. It would also need to support augmenting an ABI corpus with artifacts from modules.
In fact, work on this has just started in the dodji/kabidiff branch of libabigail.git.
Video color spaces
by Hans Verkuil; LWN article
struct v4l2_pix_format introduced in kernel 3.18 is the subject of the talk.
Hans started by saying that Color is an illusion, interpreted by the brain.
A colorspace is actually the definition of the type of light source, where the white point is, and how to reproduce it.
Colorspaces might be linear, but neither human vision nor early CRTs were. So to convert from a linear to a non-linear colorspace, you define a transfer function.
In video, we often use the Y'CbCr (YUV) colorspace. Converting to and from RGB is possible. You can represent all colors in all colorspaces, as long as you don't do quantization (cutting off values <0 and >1), which is why you should always do it last.
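As a small numeric illustration (not from the talk) of those two concepts, here is the well-known sRGB transfer function and a BT.601 R'G'B' to Y'CbCr conversion, with quantization deliberately left as the last step:

```c
#include <math.h>

/* sRGB transfer function: linear light -> non-linear (gamma-encoded) value */
static double srgb_encode(double lin)
{
	return lin <= 0.0031308 ? 12.92 * lin
				: 1.055 * pow(lin, 1.0 / 2.4) - 0.055;
}

/* BT.601 R'G'B' (range 0..1) -> Y'CbCr; quantize to integer ranges last */
static void rgb_to_ycbcr_601(double r, double g, double b,
			     double *y, double *cb, double *cr)
{
	*y  = 0.299 * r + 0.587 * g + 0.114 * b;
	*cb = (b - *y) / 1.772;		/* == 0.564 * (B' - Y') */
	*cr = (r - *y) / 1.402;		/* == 0.713 * (R' - Y') */
}
```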
There are a few standards to describe colorspaces: Rec 709, sRGB, SMPTE 170M, and lately BT 2020 used for HDTVs.
Typically, colorspace names might be confusing, the conversion matrices might be buggy, and applications would just ignore colorspace information. Sometimes, hardware uses a bad transfer function.
In the end Hans found that only half of the v4l2_pix_format structure fields were useful.
Hans showed examples of the difference in transfer functions between SMPTE 170M and Rec. 709. The difference between Rec. 709 and sRGB, or between Rec. 709 and BT.601 Y'CbCr, is more visible. Those examples would be impossible to see on a projector, but luckily the room at Mozilla has huge LCD screens. Even there it wasn't enough: with Full/Limited Range Quantization, a light grey color visible on Hans' screen was simply white on the big screen and in the recording stream. Some piece of the video chain was just doing quantization badly.
State of DRM graphics driver subsystem
by Daniel Vetter; LWN article
The Direct Rendering Management (drm) subsystem is slowly taking over the world.
Daniel started by saying that the new kerneldoc toolchain (see above talk by Jonathan Corbet) is really nice. Everything with regards to the new atomic modesetting is documented. Lots of docs have been added.
Some issues in the old userspace-facing API are still there. Those old DRI1 drivers can't be removed, but have been prefixed with drm_legacy_ and isolated.
The intel-gpu-tools tests have been ported to be generic, and are starting to be used on many drivers. Some CI systems have been deployed, and documentation added.
The open userspace requirement has been documented: userspace-facing api in DRM kernel code requires an open source userspace program.
Atomic display drivers have brought flicker-free modesetting, with check/commit semantics. It was implemented because of hardware restrictions. It also allows userspace to know in advance whether a given modification is possible, so you can write userspace that tries different approaches without becoming too complex.
20 drivers and counting have been merged with an atomic interface, at 2 or 3 per release, as opposed to roughly one per year (1 per 4 or 5 releases) in the 7 years before atomic modesetting. There's a huge acceleration in development, driving lots of small polish, boilerplate removal, documentation and new helpers.
There's a bright future, with the DRM API being used in Android, Wayland, Chrome OS, etc. Possible improvements include a benchmark mode, or more buffer management like Android's ION.
A generic fbdev implementation has been written on top of KMS.
Fences are like struct completion, but for DMA. Implicit fences are taken care of by the kernel; explicit fences can be passed around by userspace. Fences allow synchronization between components of a video pipeline, like a decoder and an upscaler for example.
With upcoming explicit fencing support in kernel and mesa, you can now run Android on upstream code, with graphics support.
The downside right now is the bleak state of rendering support in open drivers: there are 3 vendor-supported drivers, 3 reverse-engineered ones, and the rest is nowhere to be seen.
The new hwmon device registration API
by Jean Delvare
The hwmon subsystem is used for hardware sensors available in every machine, like temperature sensors for example.
hwmon has come a long way. 10 years ago it became unmaintainable, with lots of device-specific code in userspace libraries.
lm-sensors v2 in 2004 was based on procfs for kernel 2.4, and sysfs for kernel 2.6.x.
In 2006 there was still no standard interface. Therefore, for lm-sensors v3, documentation was written, standards were enforced, and the one-value-per-sysfs-file rule was adopted. No more device-specific code in libsensors and applications was allowed; support for new devices could finally be added without touching userspace.
kernel-space
Once the userspace interface was fixed, it did not mean the end of the road.
It turned out that every driver implemented its own UAPI. So in 2005, a new hwmon sysfs class was submitted. It was quite simple, and all drivers were converted to the new subsystem at once.
It worked for a while, but wasn't sufficient. In 2013, a new hwmon device registration API was introduced: hwmon_device_register_with_groups(). It gives the core flexibility, and allows it to validate the device name. Later that year, a new API was added to help with unregistration and cleanup.
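A sketch of what registration with that 2013 API looks like from a driver (the sensor and its value are made up):

```c
#include <linux/device.h>
#include <linux/err.h>
#include <linux/hwmon.h>

static ssize_t temp1_input_show(struct device *dev,
				struct device_attribute *attr, char *buf)
{
	return sprintf(buf, "%d\n", 42000);	/* millidegrees Celsius */
}
static DEVICE_ATTR_RO(temp1_input);

static struct attribute *demo_hwmon_attrs[] = {
	&dev_attr_temp1_input.attr,
	NULL,
};
ATTRIBUTE_GROUPS(demo_hwmon);

static int demo_register(struct device *dev)
{
	struct device *hwmon;

	/* the core creates and validates the hwmon class device */
	hwmon = devm_hwmon_device_register_with_groups(dev, "demo", NULL,
							demo_hwmon_groups);
	return PTR_ERR_OR_ZERO(hwmon);
}
```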
Finally, in July 2016 a new registration API was proposed, moving hwmon attribute handling into the core and doing the heavy lifting of setting up sysfs properly. This patchset is still under review and discussion. Driver conversion won't be straightforward at all, but it still deletes more code than it adds.
In conclusion, a good subsystem should help drivers, integrate well into the kernel, and offer a standard interface. It should lead to smaller binaries and fewer bugs. But there are still concerns regarding performance, and about the complexity added by having too many registration functions.
That's it for Kernel Recipes 2016! Congratulations if you managed to read everything!