Programming, Security, Privacy, Technical

ShillBot: A Study in Identifying Reddit Trolls Through Machine Learning

For those who would rather just read the code:



We’ve all been there. You’re browsing Reddit and see a post that you’re passionate about. You click the comment box and reach for the keyboard — but hesitate. Reddit’s reputation precedes it. You type anyway and punch out your thoughts. Submit.


A comment already? You click the icon and read the most disproportionately vitriolic response to a comment about cats you have ever seen. What a jerk! But you’re not going to play that game; instead, you view the author’s previous posts and comments. Through your review, a trend of tactless comments and inflammatory responses bubbles to the surface. They’re a troll. You promptly ignore the comment.



The process of inspecting a user’s previous posts and determining how to respond based on that information is a repeatable process — and if it is repeatable, it is surely automatable. This was my thought as I wrapped up my own analysis — and spawned a project to figure out the ‘how’. ShillBot is the fruit of my efforts.

I approached the problem by breaking it into two parts: first, extracting a target user’s comment history from Reddit; second, training an appropriate algorithm on a corpus of data representative of the group I wanted to identify.

Extracting the post history from Reddit was more complicated than initially expected. When a user views another user’s page, one of three separate page versions may be returned: the ‘new’ style, the ‘old-new’ style, or the ‘old’ style of Reddit. I had to create a separate parser for each version. Once this was done, I was able to extract the post information and construct a corpus for that specific target user.
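The dispatch between those parsers can be sketched as follows (the layout-detection markers here are hypothetical placeholders, not the fingerprints the real pages use):

```python
def detect_reddit_layout(page_html):
    # The marker strings below are illustrative placeholders; the
    # real pages would need to be inspected for stable fingerprints.
    if "data-redesign" in page_html:
        return "new"
    if 'id="siteTable"' in page_html:
        return "old"
    return "old-new"

# One parser per layout; stubbed out here as placeholders.
PARSERS = {
    "new": lambda html: [],
    "old-new": lambda html: [],
    "old": lambda html: [],
}

def parse_user_page(page_html):
    # Route the page to whichever parser matches its layout.
    return PARSERS[detect_reddit_layout(page_html)](page_html)
```

Keeping the routing separate from the parsers means a fourth layout only requires a new entry in the table, not changes to the callers.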

Training an algorithm on a representative set of Reddit trolls required manual identification. This exercise was both entertaining and depressing as the posts included some of the most vile aspects of Reddit. I was able to find more than enough of a representation simply by finding ‘hot-topic’ and controversial posts and then sorting by controversial (or simply finding the most down-voted post). Then I would inspect that user’s history and determine if they were truly a troll. I also created a list of ‘normal’ Reddit users. This would be used to counteract the troll set. In essence, I would need to give the algorithm an accurate representation of both troll and ‘not troll’ to accurately classify each set.

If all the algorithm knows how to classify is a hammer, everything starts to look like a hammer.

The algorithm was trained by combining the post text, post title, post author, and subreddit for all posts in a target user’s history. This provided more context than simply recording the post’s text. For example, including the subreddit and post author (OP) allows the algorithm to identify common trends such as cross-posting from one subreddit to another, trolls commenting on other trolls’ posts to boost controversy, and so on.
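A minimal sketch of that feature construction, assuming each post is a dict with these field names (the names are illustrative, not the project’s actual schema):

```python
def build_document(posts):
    # Collapse a user's history into one training document. Each post
    # contributes its subreddit, author (OP), title, and body text so
    # the classifier can pick up cross-subreddit patterns, not just
    # the raw text. The field names are assumptions for this sketch.
    parts = []
    for post in posts:
        parts.append(" ".join([post["subreddit"], post["author"],
                               post["title"], post["text"]]))
    return " ".join(parts)
```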

For this application I used a basic stochastic gradient descent (SGD) classifier as it has traditionally had some success in the text classification space. In the future I may play around with other classifiers to see what results they produce.
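A minimal sketch of such a classifier with scikit-learn, using a tiny made-up corpus in place of the real troll/not-troll dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: one combined history document per user,
# labelled 1 for troll and 0 for not-troll. Real training data
# would come from the manually verified user lists described above.
docs = [
    "worst take i have ever seen you are an idiot",
    "nobody cares get wrecked garbage post",
    "great writeup thanks for sharing this",
    "interesting point, here is a source to back it up",
]
labels = [1, 1, 0, 0]

# TF-IDF features feeding a stochastic gradient descent classifier.
model = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
model.fit(docs, labels)

prediction = model.predict(["thanks for the source, great post"])[0]
```

With only four documents this is purely illustrative; the pipeline shape (vectorizer into SGD) is the part that carries over to a real dataset.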



Promising. I was able to successfully differentiate Reddit trolls from ‘not trolls’ to a reasonable extent. The main limitation is my manual verification process for data points — I still have to check the post history of each suspected troll manually before I can add them to my dataset.

Bias. My search method may be prone to bias, as ‘searching for controversial topics’ depends on the topic du jour. For example, political topics have lately been highly represented in Reddit posts, which leads to an over-representation in the classifier’s training set. Even when I am conscious of this effect, the dataset inevitably reflects the over-representation.

Scalable. Partially. The system is obviously limited to a reasonable rate of requests to Reddit. With that said, the system itself is capable of handling a relatively large number of requests through a standard consumer/producer multi-threaded model: workers complete the scraping and parsing actions, then send the data to the server for analysis.
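The consumer/producer model can be sketched with the standard library, with the scraping itself stubbed out by a placeholder string:

```python
import threading
from queue import Queue

def worker(jobs, results):
    # Scrape-and-parse worker: pulls a username off the job queue
    # and sends a parsed result to the analysis side. The actual
    # scraping is stubbed out with a placeholder string.
    while True:
        username = jobs.get()
        if username is None:          # sentinel: shut this worker down
            jobs.task_done()
            break
        results.put((username, "parsed-history-for-" + username))
        jobs.task_done()

jobs, results = Queue(), Queue()
threads = [threading.Thread(target=worker, args=(jobs, results))
           for _ in range(4)]
for t in threads:
    t.start()

for name in ["alice", "bob", "carol"]:   # producer side
    jobs.put(name)
for _ in threads:
    jobs.put(None)                       # one sentinel per worker

jobs.join()
for t in threads:
    t.join()
```

Throttling against Reddit would sit inside the worker (e.g. a sleep between fetches); the queue structure itself is what lets the scraper scale to more workers.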



Always with the Neural Networks! I just like throwing an ANN at the problem to see what it finds — they are great for teasing out relationships that you may not have found otherwise.

Better extraction of data. My data points right now include a combination of text post, text author, post title, and subreddit. I suspect that I can tease out more relationships between these aspects if I represent them in a better format. Will consider mapping relationships between subreddits and posters, etc.

Trying to address the bias problem — although I am not entirely sure how.



All in all, this was a fun project with some interesting challenges. This project confirms, if there was any doubt, that it is possible to take a corpus of text posts from a selected group and apply a few basic algorithms to answer the question ‘does this text belong in that group’ — to a certain extent.

Programming, Technical, Uncategorized

Installing scikit-learn; Python Data Mining Library

Update: The instructions of this post are for Python 2.7. If you are using Python 3, the process is simplified. The instructions are here:

Starting with a Python 3.6 environment.

Assumptions (What I expect to already be installed):

  1. Install numpy: pip install numpy
  2. Install scipy: pip install scipy
  3. Install scikit-learn: pip install scikit-learn

Test installation by opening a python interpreter and importing sklearn:
import sklearn

If it successfully imports (no errors), then sklearn is installed correctly.


Scikit-learn is a great data mining library for Python. It provides a powerful array of tools to classify, cluster, reduce, select, and so much more. I first encountered scikit-learn when I was developing prototypes for my first business venture. I wanted to use something that was easy and powerful. Scikit-learn was just that tool.

The only problem with scikit-learn is that it builds off of some powerful-yet-finicky libraries, and you will need to install those libraries, NumPy and SciPy, before you can proceed with installing scikit-learn.

To a novice, this can be a frustrating task since the order of installation matters and many Google searches will only produce unhelpful and long-winded responses. Thus, my motivation to set the record straight and provide a quick tutorial on how to install scikit-learn — mostly on Windows, but I have provided links and notes on both Linux and Mac installations as well.

In the process of this tutorial, you will install (or already have) the following, in this order:

  1. Python 2.7
  2. NumPy
  3. SciPy
  4. pip
  5. scikit-learn

NOTE: I have provided the links unlabeled above because, like all tech/installation tutorials, over time they become obsolete. By providing the links as they are, it is my hope that even if new versions come out, you will be able to use this tutorial to find the resources you need.

Step 1: Install Python

If you do not already have Python, install it now from the link provided above. I will be using Python 2.7 for this tutorial.

The Python installer is quick and straightforward. Once installed, we will need to check that Python is available on the command line. Open a terminal by searching for ‘cmd’ or running C:\Windows\System32\cmd.exe. I would recommend creating a shortcut if you are doing this a lot.

In the command line, enter:

python --version

Something similar to “Python 2.7.6” should display, which shows that Python is working and accessible from the command line.

Step 2: Install NumPy

NumPy is a powerful library for Python that contains advanced numerical capabilities.

Install NumPy by downloading the correct installer using the link provided above, then run the installer.

NOTE: There are a few installers based on your OS version AND the version of Python you have. It is important that you find the right installer for your OS and Python version!

Step 3: Install SciPy

Download the SciPy installer using the link provided above and run it.

NOTE: There are a few installers based on your OS version AND the version of Python you have. It is important that you find the right installer for your OS and Python version!

Step 4: Install Pip

Pip is a package manager specifically for Python. It comes in handy so often that I highly recommend installing it to help manage Python packages.

Go to the link provided above.

The easiest way to install pip on Windows is by downloading the ‘get-pip.py’ script and then running it in your command line:

python get-pip.py
If you are on Linux you can use apt-get (or whatever package manager you have):

sudo apt-get install python-pip

Step 5: Install scikit-learn

NOTE: More information on installing scikit-learn at the link provided above.

On Windows: use pip to install scikit-learn:

pip install scikit-learn

On Linux: Use the package manager or follow the build instructions at

Step 6: Test Installation

Now we must see if everything installed correctly. Open up a command line terminal and type:

python
This will open a python interpreter. You will know this because there will be some text and three chevrons, “>>>”, prompting input. Type:

import sklearn

If nothing happens and another prompt appears, scikit-learn has been installed correctly.

If an error occurs, there might have been a misstep in the process. Go back through the tutorial to see if any steps were missed, or follow the error message that was given.

Programming, Technical

An Experiment on PasteBin

A while ago I was browsing the public pastes on PasteBin and I came across a few e-mail/password dumps from either malware or some hacker trying to make a name for himself.

As I perused the information, I was shocked to find usernames, emails, passwords, social security numbers, credit card numbers, and more in these dumps. I reported the posts, since credit card info and SSNs are nothing to trifle with, but the thought lingered as to why they were public in the first place. There must be a way to automate the process of reporting these posts, I thought. Usernames and especially passwords hold a distinctive signature: at least one upper-case letter, at least one lower-case letter, at least one digit, and at least 8 characters long.

How many words in the English language have that particular combination?

This question inevitably led to an experiment.

The parameters were quite simple: How accurately can I identify a password that is surrounded by junk text in a post?
This is actually harder than it seems, as we can’t simply assume that the posts will be in English, or that they will be in a human language at all (they could be code). This presented an interesting problem to work with, and I started development of a framework to solve it.

The solution

The system to test my question is quite simple. It includes a web page scraper and an analysis engine.

The scraper is simple enough: it goes to Pastebin’s public post archive and pulls all of the links to “pastes” contained therein. It then grabs only the paste text from each page and adds it to a list. This list is sent to the analysis engine.
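That link-extraction step might look like this; the href pattern (an eight-character paste key) and the raw-paste URL form are assumptions about Pastebin’s markup, not the project’s actual code:

```python
import re

def extract_paste_links(archive_html):
    # Pull paste links out of an archive page by matching hrefs that
    # look like eight-character paste keys. Navigation links such as
    # "/archive" do not match the pattern and are skipped.
    keys = re.findall(r'href="/(\w{8})"', archive_html)
    return ["https://pastebin.com/raw/" + k for k in keys]

sample = '<a href="/aB3dE5gH">one</a> <a href="/archive">nav</a>'
links = extract_paste_links(sample)
```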

The analysis engine uses a spam filter-like merit score to help identify interesting pastes and discard pastes that do not have anything interesting in them.

It uses a series of filters to affect the merit score:

The first one is a simple password identification. It uses a master list of popular passwords and searches each paste for them. If a keyword is found, the post’s merit score is increased.

The second filter is keyword identification. This is similar to the password identification but it includes words and phrases that are not passwords but might signal a paste that is more likely to have passwords in it. These keywords are held in a dictionary that also stores the associated merit value (positive or negative).

The third filter applies the basic password rules:

  • Must have at least one capital letter
  • Must have at least one lower-case letter
  • Must have at least one digit
  • Must be at least 8 characters long

The analysis engine then returns a list of all of the links sorted by “most-likely to have a password” — Highest probability at the top.
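The three filters and the merit-score ranking can be sketched together; the keyword weights, password list, and score increments here are made up for illustration:

```python
import re

# Hypothetical keyword weights (positive or negative merit) and a
# tiny stand-in for the master list of popular passwords.
KEYWORDS = {"password": 5, "leak": 3, "lorem": -2}
COMMON_PASSWORDS = ["hunter2", "letmein"]

# Tokens matching the basic password rules: at least 8 characters,
# with at least one upper-case letter, one lower-case letter, and
# one digit (enforced by the three lookaheads).
PASSWORD_RE = re.compile(r'\b(?=\w*[A-Z])(?=\w*[a-z])(?=\w*\d)\w{8,}\b')

def merit_score(paste_text):
    score = 0
    lowered = paste_text.lower()
    for pw in COMMON_PASSWORDS:           # filter 1: known passwords
        if pw in lowered:
            score += 5
    for word, value in KEYWORDS.items():  # filter 2: signal keywords
        if word in lowered:
            score += value
    score += 2 * len(PASSWORD_RE.findall(paste_text))  # filter 3
    return score

def rank_pastes(pastes):
    # Return (link, score) pairs, most password-like first.
    return sorted(((link, merit_score(text)) for link, text in pastes),
                  key=lambda pair: pair[1], reverse=True)
```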

Results and Conclusion

I initially found that the filters I had created were producing fewer false positives than the basic password rules alone (filter #3), but the results were still not promising. The accuracy of the identification would have to be improved before I attempted any sort of automation for reporting. So I have open-sourced the software and made it available on pip (as it is written in Python):

The project is called “Pastebin Password Scraper” or PBPWScraper:

Here is the PBPWScraper Github

You can use pip to install the latest release version of the library by entering:

pip install PBPWScraper

It was an interesting experiment and it is fun to tweak the filters to improve certain aspects of the analysis. I will continue to work on the system and see if I am able to decrease its false-positive count enough to warrant an automated reporting module.

Programming, Technical

Auto-generate HTML using a tree in Python

Here’s something I’ve been working on and thought was interesting.

I needed to dynamically generate HTML with varying degrees of nesting and attributes in Python. All I found was a few Stack Overflow questions on generating HTML – and some blogs talking about hard-coding the tags.

This is far from what I needed, so I started to look into making my own – besides, it looked fun!

I started by looking at what I actually wanted:

  • Needs to be easily extensible
  • Needs to be structured in a way that makes sense

With these requirements in mind, I elected to go the tree approach. HTML works much like a hierarchy: since the <html> tag is the root node and the <head> and <body> nodes are children of <html> and so on. Obviously it would be an n-ary tree as a node can have lots of children – or none.

Also, I would have to set up a specific tree traversal that would execute commands pre and post traversal – Namely printing out the beginning and ending tags.

This is what I came up with for the tree structure:

class DefTreeNode(object):
    """A Protocol Definition Tree Node.

    It will contain a single payload (ex: a tag in HTML),
    a list of child nodes, a label, and what to do prefix
    and postfix during traversal.

    The payload is of type 'Generic_HTML_Tag'.

    The label is a unique string that identifies the node.
    Both the label and the payload are required to initialize
    a node.
    """

    def __init__(self, label, payload, contents=""):
        self.children = []
        self.payload = payload
        self.label = label
        self.contents = contents

    def addChild(self, child):
        if child:
            self.children.append(child)
            return True
        return False

    def setPayload(self, payload):
        if payload:
            self.payload = payload
            return True
        return False

    def setLabel(self, label):
        self.label = label

    def setContents(self, contents):
        self.contents = contents

    def getChildren(self):
        return self.children

    def getPayload(self):
        return self.payload

    def getLabel(self):
        return self.label

    def getPrefix(self):
        if self.payload:
            return self.payload.getPrefix()
        return None

    def getPostfix(self):
        if self.payload:
            return self.payload.getPostfix()
        return None

    def getContents(self):
        return self.contents

Note that the payload is not a string but (as the class descriptor comment says) is of type ‘Generic_HTML_Tag’. We will get to that in a second. Let’s finish the tree structure first.

Now that I have a node to work with, let’s make the tree. The tree class will contain the traversal code and a “find node” function, and it will hold the root node for the structure:

class DefinitionTree(object):
    def __init__(self, node):
        self.root = node

    def getRoot(self):
        return self.root

    def findNode(self, label):
        return self.recursive_findNode(label, self.root)

    def traverse(self):
        if self.root:
            return self.recursive_traverse(self.root)
        return ""

    def recursive_traverse(self, node, construction=""):
        """Traverse the tree.

        This algorithm will run a pre-order traversal. When
        the algorithm encounters a new node it immediately
        calls its 'prefix' function, it then appends the
        node's payload, and traverses the node's children.
        After visiting the node's children, its 'postfix'
        function is called and the traversal for this node
        is complete.

        Returns a complete construction of the nodes' prefix,
        content, and postfix in a pre-order traversal.
        """
        construction = construction + node.getPrefix() + node.getContents()

        for child in node.getChildren():
            construction = self.recursive_traverse(child, construction)

        return construction + node.getPostfix()

    def recursive_findNode(self, label, node):
        """Executes a search of the tree to find the node
        with the specified label. This algorithm finds the first
        label that matches the search and is case insensitive.

        Returns the node with the specified label or None.
        """
        if node is None or node.getLabel().lower() == label.lower():
            return node

        for child in node.getChildren():
            found = self.recursive_findNode(label, child)
            if found is not None:
                return found
        return None

Now, let's create the payload for the nodes. This will be done by creating a file to store our 'html tag classes':


class Generic_HTML_Tag(object):
    """A Generic HTML tag class.

    This class should not be called directly, but contains
    the information needed to create HTML tag subclasses.
    """

    def __init__(self):
        self.prefix = ""
        self.postfix = ""
        self.indent = 0
        self.TAB = " "

    def getPrefix(self):
        return self.prefix

    def getPostfix(self):
        return self.postfix

    def setPrefix(self, prefix):
        self.prefix = prefix

    def setPostFix(self, postfix):
        self.postfix = postfix

class HTML_tag(Generic_HTML_Tag):
    def __init__(self, indent_level=0):
        # Initialize the base class first so prefix, postfix,
        # and TAB exist, then generate this tag's markup.
        Generic_HTML_Tag.__init__(self)
        self.indent = indent_level
        self.generateHTMLPrefix()
        self.generateHTMLPostfix()

    def generateHTMLPrefix(self):
        self.prefix = self.indent*self.TAB + "<html>"

    def generateHTMLPostfix(self):
        self.postfix = self.indent*self.TAB + "</html>"

class HEAD_tag(Generic_HTML_Tag):
    def __init__(self, indent_level=0):
        # Initialize the base class, then generate the markup.
        Generic_HTML_Tag.__init__(self)
        self.indent = indent_level
        self.generateHTMLPrefix()
        self.generateHTMLPostfix()

    def generateHTMLPrefix(self):
        self.prefix = self.indent*self.TAB + "<head>"

    def generateHTMLPostfix(self):
        self.postfix = self.indent*self.TAB + "</head>"

class TITLE_tag(Generic_HTML_Tag):
    def __init__(self, indent_level=0):
        # Initialize the base class, then generate the markup.
        Generic_HTML_Tag.__init__(self)
        self.indent = indent_level
        self.generateHTMLPrefix()
        self.generateHTMLPostfix()

    def generateHTMLPrefix(self):
        self.prefix = self.indent*self.TAB + "<title>"

    def generateHTMLPostfix(self):
        self.postfix = self.indent*self.TAB + "</title>"

class BODY_tag(Generic_HTML_Tag):
    def __init__(self, indent_level=0):
        # Initialize the base class, then generate the markup.
        Generic_HTML_Tag.__init__(self)
        self.indent = indent_level
        self.generateHTMLPrefix()
        self.generateHTMLPostfix()

    def generateHTMLPrefix(self):
        self.prefix = self.indent*self.TAB + "<body>"

    def generateHTMLPostfix(self):
        self.postfix = self.indent*self.TAB + "</body>"

class P_tag(Generic_HTML_Tag):
    def __init__(self, indent_level=0):
        # Initialize the base class, then generate the markup.
        Generic_HTML_Tag.__init__(self)
        self.indent = indent_level
        self.generateHTMLPrefix()
        self.generateHTMLPostfix()

    def generateHTMLPrefix(self):
        self.prefix = self.indent*self.TAB + "<p>"

    def generateHTMLPostfix(self):
        self.postfix = self.indent*self.TAB + "</p>"


Great! Obviously this is quite basic and incomplete, but the idea is there. The generateHTMLPrefix()/generateHTMLPostfix() functions are the modifiers. You can add extra parameters and such here without breaking your code elsewhere. Also, you can add additional logic to, say, only add a specific attribute if the tag has it.

Ex: only a few tags have an onload attribute
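A sketch of how that could look, extending the pattern above with an optional onload parameter (this BODY_onload_tag class is hypothetical, not part of the original code; a minimal stand-in for the base class is included so the snippet runs on its own):

```python
class Generic_HTML_Tag(object):
    # Minimal stand-in for the base class defined earlier.
    def __init__(self):
        self.prefix = ""
        self.postfix = ""
        self.indent = 0
        self.TAB = " "

class BODY_onload_tag(Generic_HTML_Tag):
    # A body tag that only emits an onload attribute when one was
    # actually supplied; the conditional logic lives in the prefix
    # generator, so the rest of the tree never has to care.
    def __init__(self, indent_level=0, onload=None):
        Generic_HTML_Tag.__init__(self)
        self.indent = indent_level
        self.onload = onload

    def generateHTMLPrefix(self):
        attr = ' onload="%s"' % self.onload if self.onload else ""
        self.prefix = self.indent * self.TAB + "<body%s>" % attr

    def generateHTMLPostfix(self):
        self.postfix = self.indent * self.TAB + "</body>"
```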


Now that you have, let’s say, a file with your tree and node classes called HTMLTree.py and a file with your HTML markup classes called HTML_Tags.py, let’s bring it all together:

import HTML_Tags
from HTMLTree import DefTreeNode, DefinitionTree

if __name__ == "__main__":

    html = DefTreeNode("html_tag", HTML_Tags.HTML_tag())
    head = DefTreeNode("head_tag", HTML_Tags.HEAD_tag())

    head.addChild(DefTreeNode("title_tag", HTML_Tags.TITLE_tag(), "basic title here!"))

    body = DefTreeNode("body_tag", HTML_Tags.BODY_tag())
    body.addChild(DefTreeNode("p_tag", HTML_Tags.P_tag(), "paragraph here"))

    # Attach head and body to the root html node
    html.addChild(head)
    html.addChild(body)

    htmltree = DefinitionTree(html)

    searchlabel = "p_tag"
    print 'Searching for a node labeled: ' + searchlabel

    node = htmltree.findNode(searchlabel)
    if node is not None:
        print "\n\n" + node.getPrefix() + node.getContents() + node.getPostfix()
    else:
        print '\n\nNode with label "' + searchlabel + '" not found!'

    print '\n\nTree Traversal:\n'

    print htmltree.traverse()

The output should look similar to this:

Searching for a node labeled: p_tag

<p>paragraph here</p>

Tree Traversal:

<html><head><title>basic title here!</title></head><body><p>paragraph here</p></body></html>

Great! You dynamically generated HTML in an extensible way!