Making a Twitter bot that looks for hashes

September 09, 2016

This is a follow-up to "What do you find when you search Twitter for hashes?"

Why?

I'm not sure I remember how it started.

It all started four years ago. Jon Oberheide was still an independent security researcher and not yet CTO of a successful product company. He posted some hashes on Twitter. I was perplexed at first, but I quickly understood that they were meant to serve as proof in case someone later disputed his research findings (and the time at which he found them). He was posting hash proofs. And then Matthew Garrett did it too.

At the time Twitter wasn't very reliable for accessing old tweets (it has vastly improved since). I thought that by finding these hash proofs and indexing them, we could serve as an independent verifier. Nowadays all the kids put their hashes in the Bitcoin blockchain, and there are even services to do it from your browser.

So how to do it?

This ought to be easy, right? The initial idea was to simply search for random strings of hexadecimal characters and hope that they show up inside hashes. Well, not really. At first I thought it could be done, but it can't: Twitter search only works on full words, because it tokenizes text for indexing. That means you can't search for parts of words hoping to stumble upon hashes. So much for using n-grams.

Therefore, I had to use the public sample stream, and filter every tweet in order to find relevant ones.

Firehose? Not likely.

Twitter has a special stream that contains all the tweets being posted, called "Firehose". Few people get access to it. There are two other streams: Gardenhose, containing 10% of the tweets, and Spritzer, the sample stream containing 1% of the tweets. The bot currently runs on Spritzer. I requested Gardenhose access, but I never got an answer. It's part of the monetization strategy. No place here for hackers and hobbyists.

So the bot only sees 1% of tweets (I tried to verify that against other public data, and it seems about right despite what I initially thought), which is why the bots haven't been talking much together yet. It also means there's a 99% chance of missing your tweet, and that development iteration is a hundred times slower.

How does it work?

The initial version used a naive regex, but it had too many false positives, ranging from repeated characters to magnet links for P2P files. The current one is much harder to match.

The regex currently matches MD5, SHA1, SHA256 and SHA512 sizes. Most uses are covered.

I added a naive exclusion filter (all letters or all numbers), which might reject extremely well-crafted hashes a researcher might be working on. That is out of scope for hashproofs; the anti-spam measures are already pretty strong and might miss interesting content.
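As an illustration, the exclusion rule can be written in a couple of lines. This is a sketch in Python 3, not the code from the bot; the function name `is_trivial` is made up for the example.

```python
def is_trivial(candidate):
    """Return True for candidates made only of digits or only of letters.

    Such strings have the right alphabet for a hex digest but are far more
    likely to be counters, padding or words than real hashes.
    """
    return candidate.isdigit() or candidate.isalpha()
```

A real digest made only of letters or only of digits would be rejected too, which is the trade-off described above.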

Current approach

The first stage is a simple regex: [a-f0-9]{32,128}. I wanted it as simple as possible because it is run on every tweet, and should be as fast as possible.
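Here is roughly what that first stage looks like in Python 3. The regex is the one quoted above; the names `CANDIDATE_RE` and `might_contain_hash` are placeholders for this sketch, not the bot's own identifiers.

```python
import re

# First-stage prefilter: any run of 32 to 128 lowercase hex characters.
# Compiled once because it is applied to every incoming tweet.
CANDIDATE_RE = re.compile(r"[a-f0-9]{32,128}")


def might_contain_hash(text):
    """Cheap check: True if the text contains a hex run of plausible length."""
    return CANDIDATE_RE.search(text) is not None
```

Note that the character class is lowercase only, so an uppercase digest would not pass this filter unless the text is lowercased first.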

The second stage is a much more complex regex (harder to match), with specific sizes of various hashes.
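The bot's actual second-stage regex is not reproduced here, so the following is only a sketch of the idea, assuming it boils down to accepting hex runs of exactly 32, 40, 64 or 128 characters that are not embedded in a longer alphanumeric run. `STRICT_RE`, `HASH_LENGTHS` and `find_hashes` are names invented for the example.

```python
import re

# Digest length in hex characters -> the hash family of that size.
# Other algorithms share these lengths, so the label is only a guess.
HASH_LENGTHS = {32: "MD5", 40: "SHA1", 64: "SHA256", 128: "SHA512"}

# Exactly one of the known lengths, with no letter or digit on either side.
# Longest alternatives come first so a 64-char run is not cut down to 32.
STRICT_RE = re.compile(
    r"(?<![0-9a-zA-Z])"
    r"(?:[a-f0-9]{128}|[a-f0-9]{64}|[a-f0-9]{40}|[a-f0-9]{32})"
    r"(?![0-9a-zA-Z])"
)


def find_hashes(text):
    """Return (family, digest) pairs for every well-sized hex run in text."""
    return [(HASH_LENGTHS[len(m.group(0))], m.group(0))
            for m in STRICT_RE.finditer(text)]
```

With the lookarounds, a 33-character hex string yields nothing, whereas the simple first-stage regex would happily match it.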

Then there are lots of manually crafted filters to fight off spam. Blocked keywords. Users banned automatically. Embedded images and most links are blocked.

Finally there is entropy measurement, making sure we have a hash and not a mindless series of characters.
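One common way to measure this is Shannon entropy over the characters of the candidate. The sketch below (Python 3) assumes that approach; the 3.0 bits-per-character threshold is an arbitrary value picked for the example, not the one the bot uses.

```python
import math
from collections import Counter

# Assumed cut-off for this sketch; a hex string tops out at 4 bits per char.
MIN_BITS_PER_CHAR = 3.0


def shannon_entropy(s):
    """Shannon entropy of s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def looks_random(hex_string):
    """True if the character distribution is varied enough to be a digest."""
    return shannon_entropy(hex_string) >= MIN_BITS_PER_CHAR
```

A string like "abababab..." scores 1 bit per character and is rejected, while a real digest usually lands well above 3.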

Performance research

To improve performance, I built in quite a few tools. For example, there's a command that dumps the sample stream to a temporary file (that you're not allowed to keep). This file is then used to measure performance in a repeatable fashion (there's no contradiction here, right?), isolated from the network.

I implemented different versions of the core line processing, some of which are still in the tests. I was trying to see how to speed up the code. But after some profiling, I realized that most of the time was spent in JSON processing. Moving to ultrajson (ujson) cut the processing time by a factor of 5, compared to python2's cjson module.
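A comparison like this is easy to reproduce on a dumped file. The sketch below (Python 3) times the standard library's json module and, if it happens to be installed, ujson; the helper names and the synthetic `sample` lines are assumptions for the example, and the numbers you get will depend on your machine and data.

```python
import json
import time

try:
    import ujson  # third-party package; optional for this sketch
except ImportError:
    ujson = None


def time_parser(loads, lines, repeat=3):
    """Best wall-clock time, in seconds, to parse every line with `loads`."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for line in lines:
            loads(line)
        best = min(best, time.perf_counter() - start)
    return best


def compare_parsers(lines):
    """Map each available parser name to its best parsing time."""
    results = {"json": time_parser(json.loads, lines)}
    if ujson is not None:
        results["ujson"] = time_parser(ujson.loads, lines)
    return results


# Synthetic stand-in for a dumped stream: one JSON document per line.
sample = [json.dumps({"id": i, "text": "tweet %d" % i}) for i in range(1000)]
```

Calling `compare_parsers(sample)` returns a dict of timings; with a real dump you would read the lines from the temporary file instead.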

Bot detection and spam fighting algorithm

What I did was initially mostly manual: keyword based, username and client based. I kept adding new keywords and banning new clients, but it didn't scale.

I then implemented an analysis of a matched user's timeline. Within the last 200 tweets, if more than 5% contained hashes, the account was probably a bot. It greatly cut the spam at first, and since its implementation in 2013 it has detected 14k+ accounts posting more than 5% of hashes, and 2.7k+ accounts posting more than 50%.

There were still a LOT of things passing through (including porn). But the strategy is to use automatic (algorithmic) filtering, not manual. I had to resort to blocking most outgoing URLs, meaning there's nothing to spam for. I had to filter tweets containing images.

Earlier this year, I discovered that a spam network selling followers was using the new Twitter Cards to embed links & images without having a URL in the tweet, so I added a filter for that too. For some reason, they were posting lots of hashes. Maybe adding entropy helps circumvent Twitter's detection systems.
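The timeline check can be sketched as follows in Python 3. It assumes the timeline has already been fetched as a list of tweet texts, oldest first; `HASH_RE`, `hash_ratio` and `is_probable_bot` are names made up for the example, and the regex is a simplified stand-in for the bot's own matcher.

```python
import re

# Simplified matcher: a standalone hex run of a known digest length.
HASH_RE = re.compile(
    r"\b(?:[a-f0-9]{128}|[a-f0-9]{64}|[a-f0-9]{40}|[a-f0-9]{32})\b"
)


def hash_ratio(tweets, window=200):
    """Fraction of the last `window` tweets that contain a hash."""
    recent = tweets[-window:]
    if not recent:
        return 0.0
    hits = sum(1 for text in recent if HASH_RE.search(text))
    return hits / len(recent)


def is_probable_bot(tweets, threshold=0.05):
    """Flag accounts whose recent timeline is more than 5% hashes."""
    return hash_ratio(tweets) > threshold
```

The appeal of a ratio over keyword lists is that it needs no maintenance: a human posting the occasional hash proof stays far below the threshold.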

Challenges

The code is not py3k compatible for historical reasons (it used to need requests-oauth, but has since moved to requests-oauthlib, which at some point was inside requests), although I love py3k. I also had to use ur"" strings; the u"" prefix was only restored in Python 3.3 (the combined ur"" form was not), which wasn't available at the time. The porting shouldn't be very hard.

It was very hard to deal with Twitter's intermittent service. I developed a watchdog specifically to detect hangs, and then auto-restart. It's the easy way out, but it has allowed the bot to work quite well, with months-long uptimes between updates.

As I said earlier, it's hard to debug with a very slow stream that makes errors appear a hundred times more slowly.

Finally, this "light" stream means there's a 99% chance of missing your tweet. Unless you have a lot of followers that RT you, but then you don't need hashproofs, do you?
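The core of such a watchdog is tiny. Here is a sketch in Python 3, assuming a design where the stream reader "feeds" the watchdog on every received line and a supervisor polls it; the class and its `clock` parameter are inventions for the example, not the bot's code.

```python
import time


class Watchdog:
    """Tracks the last sign of life from a long-running stream reader."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout    # seconds of silence tolerated
        self.clock = clock        # injectable so tests can fake time
        self.last_seen = clock()

    def feed(self):
        """Call whenever data arrives from the stream."""
        self.last_seen = self.clock()

    def expired(self):
        """True if nothing has arrived for longer than the timeout."""
        return self.clock() - self.last_seen > self.timeout
```

A supervising thread or process would check `expired()` every few seconds and restart the reader when it returns True. Because the check relies on elapsed time rather than socket state, it still fires when the connection stays open but goes silent.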

Potential improvements

Follow the user stream and watch for hashes. The bot already auto-follows people below a certain rate, as a potentially good feed.
Use a hashtag (e.g. #hashproof) that security researchers can use so that their important tweets are seen.

Gimme the code, gimme the data

Today I'm publishing the source code for hashbot on GitHub. The data is available there as well, and is analyzed in the earlier article.

Who noticed?

I actually implemented Georg's suggestion and all hashes were entropy checked after this.

Yeah, spam was this bad (and still is to an extent).

It was also noticed by @adulau

He asked about the code. Which is why you're seeing this here today.

A few successful findings

There ought to be some after all? Here are a few:

Lessons from the project

Always test; it makes for robust code.

Always benchmark; you might be surprised, cf. ultrajson, which gave a 5x speedup.

A watchdog is essential when interacting with an external, long-lived service. Many times, Twitter has stopped the stream while keeping the TCP socket open, which would otherwise hang the bot.