What do you find when you search Twitter for hashes?
Results of four years of research
September 09, 2016
This is what I found with hashbot, a Twitter bot that looks for hashes.
What is this image?
Posted with the hash "2f404a288d1b564fadee944827a39a14" by Japanese accounts (of which @furueru_zekkei used to be the top poster, now suspended).
Why is this?
Bots. There are lots of them. The Internet is made of bots.
This is what you were most likely to find until 2015 (with a 10% chance).
How do I know that? Well, I searched. But this is a story for another post.
What else do you find? Bots bots bots.
Along with this, I found many Japanese bots mentioning @null.
Porn-posting bots. The internet is made of them. For some reason they post hashes... maybe to make sure their tweets are unique and not detected as a spam network?
Occasional git and Mercurial commit IDs.
Security researchers posting proof-of-work. This was the initial motivation behind hashproofs.
iPhone UDIDs. Apparently there's a 'market' on Twitter between devs and users to enable iPhones with beta builds:
Giveaways of various activation codes for games and digital products.
People crowd-sourcing password hashes, and bots running rainbow table queries.
Bitcoin transaction IDs:
Some things that are just impossible to understand:
LOTS of bots for which more than 5% of tweets contain hashes (I found a lot). These won't appear in the results, but here is the list.
I realize how ironic it is to criticize Twitter for having a lot of bots, because the same conditions that allowed all these bots (the API) also permitted this research (as compared to a scraping bot, which would have to be updated more often). Of course, hashproofs isn't really spamming; it just acts as a "curator" and does a job that would be impossible for a human (i.e. analyzing many tweets per second).
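To make the categories above a bit more concrete, here is a minimal sketch of how hex strings of common digest lengths could be spotted in tweet text. It is only an illustration: the names `DIGEST_LENGTHS`, `HEX_RUN` and `find_hashes` are made up for this example, and the actual matching rules used by the bot are not described in this post.

```python
import re

# Hex digest lengths (in characters) of a few common hash functions.
# Assumption: these lengths are chosen for illustration only; the bot's
# real matching rules are not given in this post.
DIGEST_LENGTHS = {32: "MD5-sized", 40: "SHA-1-sized", 64: "SHA-256-sized"}

# A maximal run of hex characters that is not glued to other letters or digits.
HEX_RUN = re.compile(r"(?<![0-9A-Za-z])[0-9a-fA-F]+(?![0-9A-Za-z])")

def find_hashes(text):
    """Return (hex_string, label) pairs for hex runs of a known digest length."""
    found = []
    for match in HEX_RUN.finditer(text):
        token = match.group(0)
        label = DIGEST_LENGTHS.get(len(token))
        if label is not None:
            found.append((token.lower(), label))
    return found

print(find_hashes("posted 2f404a288d1b564fadee944827a39a14 today"))
```

Length alone cannot tell the categories apart: a 40-character hex string could be a SHA-1 digest, a git commit ID or an iPhone UDID, which is part of why the results are such a mixed bag.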
The full list of results can be found on hashproofs' Twitter feed.
Give me the data
This should give you the full data you need to re-analyze the results or run your own hashbot instance (with a better algorithm? or access to a better stream?).
Unveiling a few bot networks
As I explained earlier, hashproofs analyzes the user's timeline for every matching tweet. If the percentage of matching tweets in that timeline is above a certain arbitrary level (5%), the username is banned locally. If it's over 50%, the account is blocked. That's why you'll find two different lists in the results. One is from Twitter, listing the IDs of blocked accounts. The other is the content of the bot's "banlist" state file.
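The two thresholds can be sketched roughly as follows. This is a simplified illustration, not the bot's actual code: `classify_account`, `HASH_RE` and the assumption that a "matching" tweet is one containing a 32-, 40- or 64-character hex string are all hypothetical; only the 5% and 50% levels come from the description above.

```python
import re

# Thresholds from the text: above 5% the handle is banned locally,
# above 50% the account is blocked on Twitter.
BAN_THRESHOLD = 0.05
BLOCK_THRESHOLD = 0.50

# Assumption: a "matching" tweet contains a 32-, 40- or 64-character hex string.
HASH_RE = re.compile(
    r"\b(?:[0-9a-f]{64}|[0-9a-f]{40}|[0-9a-f]{32})\b", re.IGNORECASE
)

def classify_account(timeline):
    """Return 'block', 'ban' or 'keep' for a list of tweet texts."""
    if not timeline:
        return "keep"
    matching = sum(1 for tweet in timeline if HASH_RE.search(tweet))
    ratio = matching / len(timeline)
    if ratio > BLOCK_THRESHOLD:
        return "block"  # would go on Twitter's list of blocked accounts
    if ratio > BAN_THRESHOLD:
        return "ban"    # would go into the local "banlist" state file
    return "keep"
```

For example, a timeline where 1 tweet out of 10 contains a hash would be banned locally, while one where 6 out of 10 do would be blocked.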
By analyzing the list of blocked users, I found a few legitimate bots (e.g. posting commits on Twitter or running rainbow tables, see earlier). I also found a lot of spam bots, some of which were taken care of by Twitter. I also discovered that spammers tend to rename their accounts, and my younger self only thought of tracking the usernames, not the account IDs, which is why you'll see discrepancies if you try to match the two lists.
You'll also see that even regular users rename their account if you look at historical data from 2014.
Here are a few excerpts from the banlist that show Twitter handles that I doubt were created by legitimate users:
3924fe95e2cd5f8 68c59dbbb15c5a4 6298c2a08ef9b3b a33262acc8e5c77 b2dc44d67994d44 21332a575639f58 […] Cloud404aa cloud405aa cloud406aa cloud407aa […] 000xxx_6wy 000xxx_897 000xxx_dr3 […] Death_ldo Death_y7s Death_mew Death_ojy
All of those are in sequence, which means they were detected by hashproofs one after the other. There are many other examples like this if you want to look at all the 14k+ automatically banned handles.
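If you want to hunt for such sequences yourself, one simple approach is to group consecutive banlist entries that share a prefix. The sketch below is only one possible heuristic; `prefix_runs`, the prefix length of 5 and the minimum run of 3 are arbitrary choices made for this example.

```python
from itertools import groupby

def prefix_runs(handles, prefix_len=5, min_run=3):
    """Group consecutive handles that share a case-insensitive prefix.

    Assumption: handles are given in the order they were banned, so that
    accounts from the same network end up next to each other.
    """
    runs = []
    for prefix, group in groupby(handles, key=lambda h: h[:prefix_len].lower()):
        members = list(group)
        if len(members) >= min_run:
            runs.append((prefix, members))
    return runs

# Handles taken from the excerpt above.
banlist = [
    "3924fe95e2cd5f8", "68c59dbbb15c5a4",
    "Cloud404aa", "cloud405aa", "cloud406aa", "cloud407aa",
    "000xxx_6wy", "000xxx_897", "000xxx_dr3",
    "Death_ldo", "Death_y7s", "Death_mew", "Death_ojy",
]
for prefix, members in prefix_runs(banlist):
    print(prefix, len(members))
```

Note that this would miss the random-looking hexadecimal handles at the start of the excerpt, since they share a shape rather than a prefix.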
If you're interested in the historical and technical details, read on to the following article.