A while ago I was browsing the public pastes on Pastebin and I came across a few e-mail/password dumps from either malware or some hacker trying to make a name for himself.
As I perused the information, I was shocked to find usernames, emails, passwords, social security numbers, credit card numbers, and more in these dumps. I reported the posts, as credit card info and SSNs are nothing to trifle with, but the thought lingered as to why they were public in the first place. There must be a way to automate the process of reporting these posts, I thought. Usernames and especially passwords have a distinctive signature: at least one upper-case letter, at least one lower-case letter, at least one digit, and a length of at least 8 characters.
How many words in the English language have that particular combination?
This question inevitably led to an experiment.
The parameters were quite simple: how accurately can I identify a password that is surrounded by junk text in a post?
This is actually harder than it seems, as we can’t simply assume that the posts will be in English, or that they will be in a human language at all (many pastes are code). This presented an interesting problem to work with and I started development of a framework to solve it.
The solution
The system to test my question is quite simple. It includes a web page scraper and an analysis engine.
The scraper is simple enough: it goes to Pastebin’s public post archive and pulls all of the links to “pastes” contained therein. It then grabs only the paste text from each page and adds it to a list. This list is sent to the analysis engine.
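As a rough illustration of the link-gathering step, here is a minimal sketch that pulls paste links out of an archive page's HTML using only the standard library. It is not PBPWScraper's actual code. The `PasteLinkParser` and `extract_paste_links` names are hypothetical, and the link pattern (a slash followed by eight alphanumeric characters) is an assumption about how paste URLs look, so it would also match an unrelated eight-letter path.

```python
import re
from html.parser import HTMLParser

# Assumption: paste links look like "/AbCd1234" (a slash followed by
# eight alphanumeric characters). Adjust the pattern if the site differs.
PASTE_HREF = re.compile(r"^/[A-Za-z0-9]{8}$")


class PasteLinkParser(HTMLParser):
    """Collects hrefs from <a> tags that look like paste links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Keep each matching link once, in the order it appears.
            if name == "href" and value and PASTE_HREF.match(value):
                if value not in self.links:
                    self.links.append(value)


def extract_paste_links(archive_html):
    """Return the paste paths found in the archive page HTML."""
    parser = PasteLinkParser()
    parser.feed(archive_html)
    return parser.links
```

Fetching the archive page and each paste's raw text is left out here; any HTTP client would do for that part.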
The analysis engine uses a spam filter-like merit score to help identify interesting pastes and discard pastes that do not have anything interesting in them.
It uses a series of filters to affect the merit score:
The first one is a simple password identification. It uses a master list of popular passwords and searches each paste for them. If one is found, the post’s merit score is increased.
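A minimal sketch of this first filter might look like the following. The four-entry password set, the `PASSWORD_HIT_MERIT` value, and the choice to split tokens on whitespace and common dump separators are all assumptions made for illustration, not the library's actual list or weights.

```python
import re

# Hypothetical sample; a real master list would be much longer.
POPULAR_PASSWORDS = {"123456", "password", "qwerty", "letmein"}

# Assumed merit added for each distinct popular password found.
PASSWORD_HIT_MERIT = 1


def popular_password_score(paste_text, passwords=POPULAR_PASSWORDS):
    """Return the merit earned by exact, case-sensitive matches against the list."""
    # Split on whitespace and the separators often seen in dumps (":", ";", ",", "|").
    tokens = set(re.split(r"[\s:;,|]+", paste_text))
    return PASSWORD_HIT_MERIT * len(tokens & set(passwords))
```

Matching whole tokens rather than substrings keeps a word like "passwords" in ordinary prose from counting as a hit.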
The second filter is keyword identification. This is similar to the password identification but it includes words and phrases that are not passwords but might signal a paste that is more likely to have passwords in it. These keywords are held in a dictionary that also stores the associated merit value (positive or negative).
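The keyword dictionary described above could be sketched like this. The keywords and their merit values are hypothetical examples; the real dictionary and its weights may differ.

```python
# Hypothetical keywords mapped to merit values. Positive values suggest a
# credential dump; negative values suggest source code or other noise.
KEYWORD_MERIT = {
    "password": 3,
    "login": 2,
    "dump": 2,
    "def ": -1,
    "#include": -2,
}


def keyword_score(paste_text, keyword_merit=KEYWORD_MERIT):
    """Sum the merit of every keyword that appears in the paste (case-insensitive)."""
    lowered = paste_text.lower()
    return sum(
        merit for keyword, merit in keyword_merit.items() if keyword in lowered
    )
```

Each keyword counts once per paste in this sketch, so a long paste that repeats the same word does not run away with the score.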
The third filter applies the basic password rules:
- Must have at least one capital letter
- Must have at least one lower-case letter
- Must have at least one digit
- Must be at least 8 characters long
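The four rules above translate fairly directly into code. This is a minimal sketch; the function names and the way the paste is split into tokens are assumptions.

```python
import re


def looks_like_password(token):
    """Check a single token against the four basic password rules."""
    return (
        len(token) >= 8
        and any(c.isupper() for c in token)
        and any(c.islower() for c in token)
        and any(c.isdigit() for c in token)
    )


def find_password_candidates(paste_text):
    """Return every token in the paste that satisfies all four rules."""
    # Split on whitespace and the separators often seen in dumps.
    tokens = re.split(r"[\s:;,|]+", paste_text)
    return [t for t in tokens if looks_like_password(t)]
```

Note that an identifier such as `getElementById2` passes all four rules, which hints at why pastes containing code make this filter noisy.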
The analysis engine then returns a list of all of the links sorted by “most likely to have a password”, with the highest probability at the top.
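The final ranking step can be sketched as summing each filter's contribution and sorting. Here `rank_pastes` is a hypothetical name, and each filter is assumed to be a plain function that takes the paste text and returns a number.

```python
def rank_pastes(pastes, filters):
    """Rank paste links by total merit score, highest first.

    pastes  -- dict mapping a paste link to its text
    filters -- list of callables, each taking the text and returning a number
    """
    scored = [
        (link, sum(f(text) for f in filters)) for link, text in pastes.items()
    ]
    # sorted() is stable, so links with equal scores keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Keeping the filters as a list of interchangeable functions makes it easy to add, remove, or reweight one without touching the ranking code.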
Results and Conclusion
I initially found that the basic filters I had created were getting fewer false positives than the basic password filter (#3) but still weren’t producing promising results. The accuracy of the identification would have to be improved before I attempted any sort of automation for reporting. So I have open-sourced the software and made it available on pip (as it is written in Python).
The project is called “Pastebin Password Scraper” or PBPWScraper:
Here is the PBPWScraper GitHub
You can use pip to install the latest release version of the library by entering:
pip install PBPWScraper
It was an interesting experiment and it is fun to tweak the filters to improve certain aspects of the analysis. I will continue to work on the system and see if I am able to decrease its false-positive count enough to warrant an automated reporting module.