Viewing adjacent French towns on Wikipedia
Exploring the messy world of Wikidata property P47
May 05, 2025
It all started with this really simple nerd snipe by Joachim (in French):
How many clicks would it take to go from the Wikipedia page of Cerbère (the southernmost town of metropolitan France) to that of Bray-Dunes (the northernmost town of France), using only Wikipedia's "adjacent communes" links?
Quite a simple question. Many surprises.
Querying Wikidata
So, we want to know how to go from one page to another by clicking links. This is a graph problem, where the nodes are towns and the edges are the links between adjacent towns. The first step is therefore to build this graph of towns and their connections.
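To make the goal concrete, here is a minimal sketch (in TypeScript; not the code used for this project) of the search we will ultimately want to run, assuming we already have an adjacency list mapping each town's Wikidata id to the ids of its neighbours:

// Minimal sketch: breadth-first search for the shortest "click path" between two towns.
// Assumption: `graph` maps a town's Wikidata id to the ids of its adjacent towns (P47).
type Graph = Map<string, string[]>;

function shortestPath(graph: Graph, from: string, to: string): string[] | null {
  const previous = new Map<string, string>(); // visited towns, and how we reached them
  const queue: string[] = [from];
  previous.set(from, from);
  while (queue.length > 0) {
    const current = queue.shift()!;
    if (current === to) {
      // Walk back through `previous` to rebuild the path.
      const path = [current];
      let step = current;
      while (step !== from) {
        step = previous.get(step)!;
        path.push(step);
      }
      return path.reverse();
    }
    for (const neighbour of graph.get(current) ?? []) {
      if (!previous.has(neighbour)) {
        previous.set(neighbour, current);
        queue.push(neighbour);
      }
    }
  }
  return null; // no route: the towns are in disconnected components
}

The number of clicks is then just the length of that path minus one; the hard part, as we will see, is building a correct graph in the first place.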
My first thought was simple: since this is about Wikipedia, why not query the structured data (more on that later) on Wikidata and see the result? Surely a quick shortest-path search on the resulting graph and it would quickly be done. Luckily, there's a whole page on the Wikidata wiki on how to query the French towns' data. The query would look like this, even displayed directly as a graph! Wow, so simple; does that mean we're done? Nope, there's a catch: you can't use the Wikidata query service to get all the data, because there is just too much, and it does not support pagination either. So, one has to find another solution.
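Before moving on, for reference, the core of such a query is quite small; the sketch below (simplified, not the exact query from that wiki page) shows the idea, wrapped in a TypeScript call to the public query service:

// Sketch of the kind of SPARQL query involved: every item whose type (P31) is
// "commune of France" (Q484170), together with the items it shares a border with (P47).
// The real query on the wiki page is more elaborate.
const sparql = `
  SELECT ?commune ?neighbour WHERE {
    ?commune wdt:P31 wd:Q484170 .   # a commune of France
    ?commune wdt:P47 ?neighbour .   # shares border with
  }`;

async function fetchAdjacency() {
  const url = "https://query.wikidata.org/sparql?query=" + encodeURIComponent(sparql);
  const response = await fetch(url, {
    headers: { Accept: "application/sparql-results+json" },
  });
  const results = await response.json();
  // This is where the limits described above bite: the full result set is too big
  // for the query service, and there is no pagination to fall back on.
  return results.results.bindings;
}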
A kind soul is running the WDumper service, so I tried it. Unfortunately after 20+ days my query still hasn't completed. Luckily, after a few hours I had already lost patience and sought to address this another way…
The Wikidata dumps are big (for my poor RPi4)
So, let's get the full dumps instead. Those dumps are well documented and at the time of writing only take ~89GiB of disk space… compressed. They are bzip2-compressed, and lbzip2 can (de)compress them using multiple CPUs, nice! In addition, each entity is on a single line, so entities can be processed independently. Time to whip out the CLI text-processing fu!
First, one needs to download the dumps. Unfortunately, the most recent dump available over torrent is already a year old. And one needs to store all of this. In addition, when I wrote this project I was abroad, with a 35GB quota on a 4G connection. Luckily, I always have at least one computer available remotely, and I settled on an RPi4 at home with a big disk attached and a fast internet connection.
So I initially went with curl, downloaded the latest dump over 12h or so, and continued processing it. I would later find that the RPi4's anemic CPU would be an issue, so I re-downloaded the dump from multiple mirrors with aria2c (~25 minutes) on a cloud instance to saturate the bandwidth.
Before going into more details of processing this data, let's look into what we want from it…
What's a commune anyway?
It's an administrative status in France. In Wikidata parlance, it is an item whose nature (property P31, "instance of") is Q484170: commune of France.
Communes are created and terminated all the time in France
At least a few every year. Most of the time those are mergers, because mergers use fewer resources and we are stronger together. So any data from the previous year is most probably already obsolete. In Wikidata, this is represented with the "end time" property (P582), a qualifier marking a statement as ended at a given date; it can be used to filter out old towns.
Also, since 2019, Paris is no longer a "commune", but has a specific administrative status. Other people have queried Wikidata following "sub-class of" relations to get all the types of items that would match. There are 7 such categories, which I did not use in my data extract for simplicity; in addition, Paris is currently not an instance of any of those categories, it uses yet another type. So I simply added an exception for Paris instead of filtering on its status itself.
Processing the data, line by line
Even using all available CPUs, extracting the file takes about 10 hours on the RPi4, and I didn't want to store the (presumably) huge JSON file, so everything is processed in a pipe. To extract, I used lbzip2's lbzcat tool, which allows starting the pipeline for further processing.
Since this is JSON, I went with jq to extract the data. It started with a very simple (but wrong) query:
lbzcat latest-all.json.bz2 | tail -n+2 | sed 's/,$//' \
| jq --indent 0 \
'select(.claims.P31[]?.mainsnak.datavalue.value.id=="Q484170")
| { id: .id, name: .labels.fr.value, conns: [.claims.P47[]?.mainsnak.datavalue.value.id] }' \
> communes.json
To process the JSON line by line, we use tail to go directly to the second line and skip the array opening; sed cleans up the trailing commas of the array elements. jq's --indent 0 option makes sure we also get one line per item. We only select items whose claims include the property P31 (type) with value Q484170 (commune of France). The data is then re-formatted to keep only the elements we care about: the unique id, the French name of the town, and its connections, i.e. the list of ids of the items in its P47 (shares-border-with) claims.
Note that in this version, there is no array in the output: just one JSON object per line, with no trailing commas either.
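For reference, here is roughly the shape of a dump entity, trimmed down to the few fields these jq paths (and the later ones) rely on; a sketch, not the full schema:

// Trimmed-down sketch of a Wikidata dump entity; real entities carry many more fields.
interface WikidataEntity {
  id: string;                                      // e.g. "Q90" for Paris
  labels: { [lang: string]: { value: string } };   // .labels.fr.value is the French name
  claims: {
    [property: string]: Array<{                    // e.g. "P31", "P47", "P625"
      mainsnak: { datavalue?: { value: any } };    // P31/P47: { id: "Q…" }; P625: { latitude, longitude }
      qualifiers?: { [property: string]: unknown[] }; // e.g. P582, the end date
    }>;
  };
}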
This version is nice, but unfortunately quite slow; jq is single-threaded and will only use a single core.
lbzcat is slow, grep is faster than jq
I iterated quite a while over the query, and each time I realized a mistake, I had to wait 12+ hours (the lbzcat + jq bottleneck) for the pipeline to complete. After a while I lost patience and rented temporary beefy cloud instances to do the download and the processing. And then it hit me: I was duplicating too much work.
So, the trick I used to reduce the processing time was to store only the items that contain "Q484170":
lbzcat latest-all.json.bz2 | grep \"Q484170\" |lbzip2 --fast > contains-Q484170.json.bz2
GNU grep is also very optimized (and we could even go faster with hyperscan/ripgrep), so even if we get more data than we need, this saves us a lot of time.
The temporary file only weighs 223MiB, which is much more manageable, and can be downloaded to be processed on my faster laptop, which can uncompress it in less than 30 seconds.
The final iteration of the processing looks like this:
echo [ > communes-new-only.json; \
lbzcat contains-Q484170.json.bz2 | sed 's/,$//' | jq --indent 0 \
'select(.claims.P47 != null and .claims.P31[]?.mainsnak.datavalue.value.id=="Q484170")
| . as $r | .claims.P31[]?
| select($r.id == "Q90" or (.mainsnak.datavalue.value.id == "Q484170" and .qualifiers.P582 == null))
| { id: $r.id, name: $r.labels.fr.value, conns: [$r.claims.P47[]?.mainsnak.datavalue.value.id],
    coord: { latitude: $r.claims.P625[]?.mainsnak.datavalue.value.latitude,
             longitude: $r.claims.P625[]?.mainsnak.datavalue.value.longitude }}' \
| sed 's/$/,/' >> communes-new-only.json ; \
truncate -s-2 communes-new-only.json; \
echo ] >> communes-new-only.json
Wow, that's a lot of text and commands. Let's unpack what's new:
- we now want to create a directly usable, valid JSON file. For this, we first open and close an array in the first and last echo commands;
- the second sed command adds a comma at the end of each generated line;
- in JSON, trailing commas aren't valid, so we trim the last one with the truncate -s-2 command;
- and the jq query is now a bit more complex:
  - we still only want items for which one of the types (P31) is a commune (Q484170);
  - we filter for communes that do have neighbouring communes, i.e. their P47 is not null;
  - to ease manipulation, the root item is bound to $r, and we iterate over the .claims.P31[] array;
  - then comes the selection; this is where the special exception for Paris (Q90) lives;
  - otherwise, we keep only the commune statements that have no end date (P582 is null), as a proxy for the commune still existing (anything with an end date is dropped, even if the date is in the future);
  - we also add the communes' coordinates (P625) to the generated JSON.
Wikidata is not Wikipedia: the data is not that clean
Did I say that Wikidata is just the structured version of the data in Wikipedia? Well, I was wrong. Wikidata is an entirely separate database and project. There might have been imports of the data in the past, but obviously it's not kept in sync automatically. I know because I have found a few inconsistencies.
A few examples of issues I've seen:
- Links between communes across France; for example, Saint-Jacques-de-Néhou - Sault (which I fixed):
Most of the time those are because of homonyms, but sometimes there is no apparent good reason. There's even a query about this on the Wikidata wiki, but not all the results have been fixed yet.
- Missing links; for example, Escorpain (which I fixed):
This is weird, because on the corresponding Wikipedia pages the adjacent towns are well documented. That's how I found out that the Wikidata and Wikipedia data aren't really in sync; I had always thought that Wikidata was just an export of the infoboxes found on the Wikipedia pages. It is not. Of course, it's possible to write scripts and bots to do the sync, but by default it isn't done.
- Incorrect coordinates; here, Trois-Rivières:
This one is a recent commune (created in 2019), and its coordinates have already been fixed (by someone else). And you can see in its article (in French) that it does not have the French-Wikipedia-specific "shares border with" section (it's a paragraph instead) that is (or was) the subject of this article.
Viewing the data
Apache ECharts is nice…
So, to view all of this, I started with Apache ECharts, a generic and full-featured JS visualization library. I used the graph chart with fixed coordinates, so that the force-layout simulation isn't used. On top of this, I added a map layer to show the shape of France, generated by the France GeoJSON project.
It soon became apparent that I could not show all the data at once, so I added code to arbitrarily limit the number of nodes (and their edges) shown on the graph. One needs to zoom in enough to be able to see all the nodes and their links. You can look at what I came up with here.
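The setup looked roughly like this; a simplified sketch rather than the exact code, assuming the France outline GeoJSON and the already-limited node and edge arrays are loaded elsewhere:

// Simplified sketch of the ECharts setup: a geo map of France as the background,
// and a graph series pinned to geographic coordinates, so no force layout runs.
import * as echarts from "echarts";

// Assumed to be loaded elsewhere: the France outline and the (limited) graph data.
declare const franceGeoJSON: any;
declare const nodes: { name: string; coord: [number, number] }[];
declare const edges: { source: string; target: string }[];

echarts.registerMap("france", franceGeoJSON);

const chart = echarts.init(document.getElementById("chart")!);
chart.setOption({
  geo: { map: "france", roam: true },      // background map, zoomable and pannable
  series: [
    {
      type: "graph",
      coordinateSystem: "geo",             // pin each node to its [longitude, latitude]
      data: nodes.map((n) => ({ name: n.name, value: n.coord })),
      edges: edges,
      symbolSize: 3,
      lineStyle: { opacity: 0.4 },
    },
  ],
});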
but not the right tool for the job
Unfortunately there are bugs that I wasn't able to shake out: for example, the graph and map get de-synchronized when zooming. And the map wasn't very usable, including on mobile; I even reported such an issue.
With all this, I realized I just wasn't using the right tool for the job. I needed to start looking at libraries dedicated to viewing geographical data: maps.
Maplibre GL JS
After a quick look at Leaflet, I settled on Maplibre GL JS for performance reasons. Maplibre is a fork of the Mapbox libraries from their last open-source versions, and it has great engineering and documentation.
So I rewrote it all and transformed the graph into a GeoJSON structure, with nodes as circles. All the data was sent to the library without any filtering. And it was buttery-smooth, despite the 34830 nodes and 102094 edges, in addition to the GeoJSON background; I was able to throw away a lot of code and let Maplibre handle the performance details. Hopefully it's not too slow in your browser.
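In outline, the conversion and the Maplibre setup look like this; again a hedged sketch rather than the exact code, assuming the communes extract from earlier is already loaded and using Maplibre's demo style as the background:

// Sketch: turn the extracted communes into a single GeoJSON FeatureCollection
// (one Point per commune, one LineString per P47 adjacency) and let Maplibre render it.
import maplibregl from "maplibre-gl";

// Assumed to be loaded elsewhere: the communes-new-only.json extract from above.
declare const communes: {
  id: string;
  name: string;
  conns: string[];
  coord: { latitude: number; longitude: number };
}[];

const byId = new Map(communes.map((c) => [c.id, c]));
const features: any[] = [];
for (const c of communes) {
  features.push({
    type: "Feature",
    properties: { name: c.name },
    geometry: { type: "Point", coordinates: [c.coord.longitude, c.coord.latitude] },
  });
  for (const other of c.conns) {
    const n = byId.get(other);
    if (n === undefined) continue; // neighbour not in the extract (e.g. a foreign town)
    // Each adjacency appears twice, once from each side; harmless for display.
    features.push({
      type: "Feature",
      properties: {},
      geometry: {
        type: "LineString",
        coordinates: [
          [c.coord.longitude, c.coord.latitude],
          [n.coord.longitude, n.coord.latitude],
        ],
      },
    });
  }
}

const map = new maplibregl.Map({
  container: "map",                                      // id of the page's map <div>
  style: "https://demotiles.maplibre.org/style.json",    // Maplibre's demo basemap
  center: [2.5, 46.5],                                   // roughly the centre of France
  zoom: 5,
});
map.on("load", () => {
  map.addSource("communes", {
    type: "geojson",
    data: { type: "FeatureCollection", features },
  });
  map.addLayer({
    id: "edges",
    type: "line",
    source: "communes",
    filter: ["==", ["geometry-type"], "LineString"],
    paint: { "line-width": 0.5 },
  });
  map.addLayer({
    id: "nodes",
    type: "circle",
    source: "communes",
    filter: ["==", ["geometry-type"], "Point"],
    paint: { "circle-radius": 2 },
  });
});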
The maps shown earlier in this article are in fact based on this Maplibre viewing code: I copy/pasted examples from the documentation and added a filter-by-name option plus center and zoom parameters. Here is another example, viewing the smallest commune of France by area (Castelmoron-d’Albret):
And the largest:
Or the commune with a population of 1, Rochefourchat:
… what's the answer by the way?
Oh yes, that question about the number of clicks... Well, I did not get to that part, but someone else did.
But they did it using another data source from the French government, itself extracted from OpenStreetMap. The dataset was extracted in 2022, so it's probably obsolete; the towns are referenced by their unique INSEE ID, which is also in Wikidata (property P374), so it could be compared to the dataset I extracted pretty easily. This is left as an exercise to the reader.