What do you find when you search Twitter for hashes?

Results of four years of research

September 09, 2016

This image:

[image]

This is what I found with hashbot, a Twitter bot that looks for hashes.
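How hashbot recognizes a hash is not described here, so the following is only a minimal sketch of one plausible approach: a regular expression that matches standalone runs of hexadecimal digits whose length corresponds to a common digest size. The name `find_hashes` and the chosen lengths (32, 40 and 64) are assumptions made for illustration, not hashbot's actual rules.

```python
import re

# Assumption: only hex strings whose length matches a common digest size are
# interesting (32 = MD5, 40 = SHA-1, 64 = SHA-256). hashbot may use other rules.
# The lookarounds reject hex runs glued to other letters or digits, so a
# 33-character run or a hash embedded in a longer word is not reported.
HASH_RE = re.compile(
    r"(?<![0-9A-Za-z])"
    r"(?:[0-9a-fA-F]{64}|[0-9a-fA-F]{40}|[0-9a-fA-F]{32})"
    r"(?![0-9A-Za-z])"
)


def find_hashes(text):
    """Return every hash-like hex string found in text, lowercased."""
    return [match.group(0).lower() for match in HASH_RE.finditer(text)]
```

For example, calling `find_hashes` on a tweet containing "2f404a288d1b564fadee944827a39a14" surrounded by spaces returns a one-element list with that string, while a short hex word such as "deadbeef" is ignored.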

What is this image?

It was posted with the hash "2f404a288d1b564fadee944827a39a14" by Japanese accounts (of which @furueru_zekkei, now suspended, used to be the top poster).

After a bit of research on Google Images and elsewhere, I found that this image is a photo of the White Desert in New Mexico, by Greg Riegler. This might or might not be the same Greg Riegler as here.

Why is this?

Bots. There are lots of them. The Internet is made of bots.

Until 2015, this was the single most likely thing you would find (about a 10% chance).

How do I know that? Well, I searched. But this is a story for another post.

What else do you find? Bots, bots, bots.

Along with this, I found many Japanese bots mentioning @null.

Porn-posting bots. The Internet is made of them. For some reason they post hashes... maybe to make sure their tweets are unique and not detected as a spam network?

Occasional Git and Mercurial commit IDs.

Security researchers posting proof-of-work. This was the initial motivation behind hashproofs.

iPhone UDIDs. Apparently there's a 'market' on Twitter between devs and users to enable beta builds on iPhones:

Giveaways of various activation codes for games and digital products.

People crowd-sourcing password hashes, and bots running rainbow table queries.

Bitcoin transaction IDs:

Torrent hashes:

Some things that are just impossible to understand:

LOTS of bots whose timelines contain more than 5% of tweets with hashes (I found a lot). These won't appear in the results, but here is the list.
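Several of the categories above look identical on the wire: a bare run of hex digits. Length is the only cheap hint, and it is ambiguous. The sketch below maps a hex string's length to the kinds of identifier it could be. The table and the name `guess_kinds` are illustrative assumptions rather than anything hashbot does; the 40-character entry for UDIDs reflects the format iPhones used at the time of writing, as far as I know.

```python
# Candidate interpretations keyed by hex-string length. Length alone cannot
# tell a Git commit ID from a torrent info-hash, so each entry is a list.
CANDIDATES = {
    32: ["MD5 digest"],
    40: [
        "SHA-1 digest",
        "Git or Mercurial commit ID",
        "BitTorrent info-hash",
        "iPhone UDID",
    ],
    64: ["SHA-256 digest", "Bitcoin transaction ID"],
}

HEX_DIGITS = set("0123456789abcdef")


def guess_kinds(hex_string):
    """Return the possible kinds for a hex string, or [] if none apply."""
    s = hex_string.strip().lower()
    if not s or any(c not in HEX_DIGITS for c in s):
        return []
    return CANDIDATES.get(len(s), [])
```

The hash attached to the desert photo is 32 characters long, so `guess_kinds` reports it as a possible MD5 digest; telling the 40-character candidates apart requires context from the rest of the tweet.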

I realize how ironic it is to criticize Twitter for having a lot of bots, because the same thing that allowed all these bots (the API) also permitted this research; a scraping bot would have had to be updated more often. Of course, hashproofs isn't really spamming: it just acts as a "curator" and does a job that would be impossible for a human (i.e. analyzing many tweets per second).

The full list of results can be found on hashproofs' Twitter feed.

Give me the data

I published the code on GitHub, along with the full results of the four years of research. (WARNING: contains spam and porn links)

This should give you all the data you need to re-analyze the results or run your own hashbot instance (with a better algorithm? or access to a better stream?)

Unveiling a few bot networks

As I explained earlier, hashproofs analyzes the user's timeline for every matching tweet. If the percentage of matching tweets in that timeline is above a certain arbitrary level (5%), the username is banned locally. If it's over 50%, the account is blocked. That's why you'll find two different lists in the results. One is from Twitter and lists the IDs of blocked accounts. The other is the content of the bot's "banlist" state file.
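The two thresholds can be sketched as a small decision function. This is not the bot's published code: the names `classify_account`, `BAN_THRESHOLD` and `BLOCK_THRESHOLD` are hypothetical, the matching test is passed in as a predicate, and returning only the strongest action is an assumption (the real bot may also add blocked accounts to the local banlist).

```python
BAN_THRESHOLD = 0.05    # above 5% matching tweets: ban the handle locally
BLOCK_THRESHOLD = 0.50  # above 50% matching tweets: block the account


def classify_account(timeline, is_match):
    """Return 'block', 'ban' or 'ok' from the share of matching tweets.

    timeline is a list of tweet texts; is_match is a predicate that says
    whether a single tweet contains a hash.
    """
    if not timeline:
        return "ok"  # nothing to judge; assumed harmless
    ratio = sum(1 for tweet in timeline if is_match(tweet)) / len(timeline)
    if ratio > BLOCK_THRESHOLD:
        return "block"
    if ratio > BAN_THRESHOLD:
        return "ban"
    return "ok"
```

With a ten-tweet timeline, one matching tweet (10%) yields "ban" and six matching tweets (60%) yield "block".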

By analyzing the list of blocked users, I found a few legitimate bots (e.g. posting commits on Twitter or running rainbow tables, see earlier). I also found a lot of spam bots, some of which were taken care of by Twitter. I also discovered that spammers tend to rename their accounts, and my younger self only thought of tracking the usernames, not the account IDs, which is why you'll see discrepancies if you try to match the two lists.

If you look at historical data from 2014, you'll also see that even regular users rename their accounts.

Here are a few excerpts from the banlist showing Twitter handles that I doubt were created by legitimate users:

  3924fe95e2cd5f8
  68c59dbbb15c5a4
  6298c2a08ef9b3b
  a33262acc8e5c77
  b2dc44d67994d44
  21332a575639f58
  […]
  Cloud404aa
  cloud405aa
  cloud406aa
  cloud407aa
  […]
  000xxx_6wy
  000xxx_897
  000xxx_dr3
  […]
  Death_ldo
  Death_y7s
  Death_mew
  Death_ojy

All of those are in sequence, which means they were detected by hashproofs one after the other. There are many other examples like this if you want to look through all 14k+ automatically banned handles.
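Runs like these can be surfaced mechanically. The sketch below is one possible way to do it, not part of hashproofs: it walks a banlist in file order and reports consecutive handles that share a case-insensitive prefix. The function name `prefix_runs` and the defaults (a 5-character prefix, runs of at least 3) are arbitrary assumptions.

```python
from itertools import groupby


def prefix_runs(handles, prefix_len=5, min_run=3):
    """Return (prefix, handles) pairs for consecutive handles sharing a prefix."""
    runs = []
    # groupby only merges neighbours, which is what "in sequence" means here.
    for prefix, group in groupby(handles, key=lambda h: h[:prefix_len].lower()):
        run = list(group)
        if len(run) >= min_run:
            runs.append((prefix, run))
    return runs


banlist_excerpt = [
    "Cloud404aa", "cloud405aa", "cloud406aa", "cloud407aa",
    "000xxx_6wy", "000xxx_897", "000xxx_dr3",
    "Death_ldo", "Death_y7s", "Death_mew", "Death_ojy",
]
for prefix, run in prefix_runs(banlist_excerpt):
    print(prefix, len(run))
```

The all-hex handles at the top of the excerpt share a shape (15 hex characters) rather than a prefix, so they would need a different test, such as checking that every character of the handle is a hex digit.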

If you're interested in the historical and technical details, read on to the following article.