Programming, Technical, Uncategorized

Installing scikit-learn; Python Data Mining Library

Update: The instructions of this post are for Python 2.7. If you are using Python 3, the process is simplified. The instructions are here:

Starting with a Python 3.6 environment.

Assumptions (What I expect to already be installed):

  1. Install numpy: pip install numpy
  2. Install scipy: pip install scipy
  3. Install scikit-learn: pip install scikit-learn

Test installation by opening a python interpreter and importing sklearn:
python
import sklearn

If it successfully imports (no errors), then sklearn is installed correctly.

Introduction

Scikit-learn is a great data mining library for Python. It provides a powerful array of tools to classify, cluster, reduce, select, and so much more. I first encountered scikit-learn when I was developing prototypes for my first business venture. I wanted to use something that was easy and powerful. Scikit-learn was just that tool.

The only problem with scikit-learn is that it builds off of some powerful-yet-finicky libraries, and you will need to install those libraries, NumPy and SciPy, before you can proceed with installing scikit-learn.

To a novice, this can be a frustrating task, since the order of installation matters and many Google searches produce only unhelpful, long-winded responses. That is my motivation for setting the record straight with a quick tutorial on how to install scikit-learn, mostly on Windows, though I have provided links and notes for Linux and Mac installations as well.

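In the process of this tutorial, you will install (or already have) the following, in this order:

  1. https://www.python.org/downloads/
  2. http://sourceforge.net/projects/numpy/files/NumPy/1.10.2/
  3. http://sourceforge.net/projects/scipy/files/scipy/0.16.1/
  4. https://pip.pypa.io/en/stable/installing/
  5. http://scikit-learn.org/stable/install.html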

NOTE: I have provided the links unlabeled above because, like all tech/installation tutorials, over time they become obsolete. By providing the links as they are, it is my hope that even if new versions come out, you will be able to use this tutorial to find the resources you need.

Step 1: Install Python

If you do not already have Python, install it now from the address provided above (https://www.python.org/downloads/). I will be using Python 2.7 for this tutorial.

The Python installer is quick and works well. Once it has finished, we need to check that Python is available on the command line. Open a terminal by searching for ‘cmd’ or running C:\Windows\System32\cmd.exe. I recommend creating a shortcut if you do this a lot.

In the command line, enter:

python --version

Something similar to “Python 2.7.6” should display. That shows that Python is working and accessible from the command line.

Step 2: Install NumPy

NumPy is a powerful library for Python that contains advanced numerical capabilities.

Install NumPy by downloading the correct installer using the link provided above (http://sourceforge.net/projects/numpy/files/NumPy/1.10.2/) then run the installer.

NOTE: There are a few installers based on your OS version AND the version of Python you have. It is important that you find the right installer for your OS and Python version!
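
If you are not sure which installer matches your setup, you can ask Python itself. The sketch below uses only the standard library; the helper name describe_python is my own invention for illustration and is not part of NumPy or its installers.

```python
import platform
import sys


def describe_python():
    """Return the details that matter when picking a binary installer."""
    return {
        # e.g. "2.7" or "3.6": installers are built per minor version
        "python_version": "%d.%d" % (sys.version_info[0], sys.version_info[1]),
        # "32bit" or "64bit": the bitness of the interpreter, not of the OS
        "bits": platform.architecture()[0],
        # e.g. "Windows", "Linux", "Darwin"
        "os": platform.system(),
    }


if __name__ == "__main__":
    for key, value in sorted(describe_python().items()):
        print("%s: %s" % (key, value))
```

Note that a 32-bit Python on 64-bit Windows needs the 32-bit installer, which is why the sketch reports the interpreter's bitness.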

Step 3: Install SciPy

Download the SciPy installer using the link provided above (http://sourceforge.net/projects/scipy/files/scipy/0.16.1/) and run it.

NOTE: There are a few installers based on your OS version AND the version of Python you have. It is important that you find the right installer for your OS and Python version!

Step 4: Install Pip

Pip is a package manager specifically for Python. It comes in handy so often that I highly recommend installing it to help manage Python packages.

Go to the link provided above (https://pip.pypa.io/en/stable/installing/).

The easiest way to install pip on Windows is to download the ‘get-pip.py’ script from that page and then run it from your command line:

python get-pip.py

If you are on Linux you can use apt-get (or whatever package manager you have):

sudo apt-get install python-pip

Step 5: Install scikit-learn

NOTE: More information on installing scikit-learn at the link provided above (http://scikit-learn.org/stable/install.html)

On Windows: use pip to install scikit-learn:

pip install scikit-learn

On Linux: Use the package manager or follow the build instructions at http://www.bogotobogo.com/python/scikit-learn/scikit-learn_install.php

Step 6: Test Installation

Now we must see if everything installed correctly. Open up a command line terminal and type:

python

This will open a python interpreter. You will know this because there will be some text and three chevrons, “>>>”, prompting input. Type:

import sklearn

If nothing happens and another prompt appears, scikit-learn has been installed correctly.

If an error occurs, there might have been a mis-step in the process. Go back through the tutorial to see if any steps were missed or follow the error message that was given.
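
If you would rather check all three libraries at once, a small script can report which ones Python cannot find. This is only a sketch: it assumes Python 3 (importlib.util.find_spec is not available in Python 2.7), and the function name missing_packages is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["numpy", "scipy", "sklearn"])
    if missing:
        print("Not installed: " + ", ".join(missing))
    else:
        print("numpy, scipy and sklearn can all be found")
```

Finding a package is a weaker check than importing it, so the import test above is still the final word.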

Programming, Technical

An Experiment on PasteBin

A while ago I was browsing the public pastes on PasteBin and I came across a few e-mail/password dumps from either malware or some hacker trying to make a name for himself.

As I perused the information, I was shocked to find usernames, emails, passwords, social security numbers, credit card numbers, and more in these dumps. I reported the posts, as credit card info and SSNs are nothing to trifle with, but the thought lingered as to why they were public in the first place. There must be a way to automate the process of reporting these posts, I thought. Passwords in particular have a distinctive signature: at least one upper-case letter, at least one lower-case letter, at least one digit, and at least 8 characters in length.

How many words in the English language have that particular combination?

This question inevitably led to an experiment.

The parameters were quite simple: How accurately can I identify a password that is surrounded by junk text in a post?
This is actually harder than it seems, as we can’t simply assume that the posts will be in English, or that they will be in a human language at all (they might be code). This presented an interesting problem to work with, and I started developing a framework to solve it.

The solution

The system to test my question is quite simple. It includes a web page scraper and an analysis engine.

The scraper is simple enough: it goes to Pastebin’s public post archive and pulls all of the links to “pastes” contained therein. It then grabs only the paste text from each page and adds it to a list. This list is sent to the analysis engine.

The analysis engine uses a spam filter-like merit score to help identify interesting pastes and discard pastes that do not have anything interesting in them.

It uses a series of filters to affect the merit score:

The first one is a simple password identification. It uses a master list of popular passwords and searches each paste for them. If one is found, the post’s merit score is increased.

The second filter is keyword identification. This is similar to the password identification but it includes words and phrases that are not passwords but might signal a paste that is more likely to have passwords in it. These keywords are held in a dictionary that also stores the associated merit value (positive or negative).
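
A minimal sketch of how such a keyword dictionary could drive the score is shown here. The keywords, their merit values, and the function name keyword_score are all made up for illustration; they are not taken from the actual project.

```python
# Hypothetical keyword table: phrase -> merit value (positive or negative).
KEYWORD_MERIT = {
    "password": 5,
    "login": 3,
    "username": 3,
    "lorem ipsum": -4,  # filler text, unlikely to be a credential dump
    "import ": -2,      # likely source code rather than a dump
}


def keyword_score(paste_text, table=KEYWORD_MERIT):
    """Sum the merit of every keyword that occurs in the paste (once each)."""
    lowered = paste_text.lower()
    return sum(merit for keyword, merit in table.items() if keyword in lowered)
```

For example, keyword_score("Username: bob Password: Hunter22") adds 3 for "username" and 5 for "password", giving 8.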

The third filter applies the basic password rules:

  • Must have at least one capital letter
  • Must have at least one lower-case letter
  • Must have at least one digit
  • Must be at least 8 characters long
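
The four rules above translate almost word for word into code. The sketch below assumes that candidate passwords are whitespace-separated tokens, which is a simplification of my own; the function names are hypothetical.

```python
def looks_like_password(token):
    """Apply the four basic rules to a single whitespace-free token."""
    return (
        len(token) >= 8                       # at least 8 characters long
        and any(c.isupper() for c in token)   # at least one capital letter
        and any(c.islower() for c in token)   # at least one lower-case letter
        and any(c.isdigit() for c in token)   # at least one digit
    )


def find_candidates(text):
    """Return every token in the text that satisfies the basic rules."""
    return [token for token in text.split() if looks_like_password(token)]
```

Rules this loose also match things like URLs and hashes with mixed case and digits, which is one source of false positives.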

The analysis engine then returns a list of all of the links sorted by “most-likely to have a password” — Highest probability at the top.
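
The final ranking step could be sketched as follows, assuming each filter is a function that takes the paste text and returns a number to add to the merit score. The two example filters and all names here are hypothetical stand-ins, not the project's real filters.

```python
def count_password_mentions(text):
    """Example filter: one point per mention of the word 'password'."""
    return text.lower().count("password")


def penalize_code(text):
    """Example filter: pastes that look like Python code lose merit."""
    return -2 if "def " in text else 0


def rank_pastes(pastes, filters):
    """pastes: list of (link, text) pairs. Returns links, highest merit first."""
    scored = []
    for link, text in pastes:
        merit = sum(f(text) for f in filters)
        scored.append((merit, link))
    # The sort is stable, so pastes with equal merit keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [link for merit, link in scored]
```

Keeping the filters as plain functions means a new one can be added to the list without touching the ranking code.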

Results and Conclusion

I initially found that the filters I had created were getting fewer false positives than the basic password filter (#3) alone, but the results were still not promising. The accuracy of the identification will have to be improved before I attempt any sort of automation for reporting. In the meantime, I have open-sourced the software and made it installable with pip (as it is written in Python):

The project is called “Pastebin Password Scraper” or PBPWScraper:

Here is the PBPWScraper GitHub

You can use pip to install the latest release version of the library by entering:

pip install PBPWScraper

It was an interesting experiment and it is fun to tweak the filters to improve certain aspects of the analysis. I will continue to work on the system and see if I am able to decrease its false-positive count enough to warrant an automated reporting module.

Programming, Technical

Auto-generate HTML using a tree in Python

Here’s something I’ve been working on and thought was interesting.

I needed to dynamically generate HTML with varying degrees of nesting and attributes in Python. All I found were a few Stack Overflow questions on generating HTML – and some blogs talking about hard-coding the tags.

This is far from what I needed, so I started to look into making my own – besides, it looked fun!

I started by looking at what I actually wanted:

  • Needs to be easily extensible
  • Needs to be structured in a way that makes sense

With these requirements in mind, I elected to go with the tree approach. HTML works much like a hierarchy: the <html> tag is the root node, the <head> and <body> nodes are children of <html>, and so on. Obviously it would be an n-ary tree, as a node can have lots of children, or none.

Also, I would have to set up a specific tree traversal that executes commands before and after visiting each node's children, namely printing out the opening and closing tags.

This is what I came up with for the tree structure:


class DefTreeNode(object):
    '''
        A Protocol Definition Tree Node.
        It will contain a single payload (ex: a tag in HTML),
        a list of child nodes, a label, and what to do prefix
        and postfix during traversal.

        The payload is of type 'Generic_HTML_Tag'

        The label is a unique string that identifies the node.
        Both the label and the payload are required to initialize
        a node.
    '''

    def __init__(self, label, payload, contents=""):
        self.children = []
        self.payload = payload
        self.label = label
        self.contents = contents

    def addChild(self, child):
        if child:
            self.children.append(child)
            return True
        return False

    def setPayload(self, payload):
        if payload:
            self.payload = payload
            return True
        return False

    def setLabel(self, label):
        self.label = label

    def setContents(self, contents):
        self.contents = contents

    def getChildren(self):
        return self.children

    def getPayload(self):
        return self.payload

    def getLabel(self):
        return self.label

    def getPrefix(self):
        if self.payload:
            return self.payload.getPrefix()
        else:
            return None

    def getPostfix(self):
        if self.payload:
            return self.payload.getPostfix()
        else:
            return None

    def getContents(self):
        return self.contents

Note that the payload is not a string but (as the class descriptor comment says) is of type ‘Generic_HTML_Tag’. We will get to that in a second. Let’s finish the tree structure first.

Now that I have a node to work with, let’s make the tree. The tree class will contain the traversal code and a “find node” function, and it will hold the root node for the structure:


class DefinitionTree(object):
    def __init__(self, node):
        self.root = node

    def getRoot(self):
        return self.root

    def findNode(self, label):
        return self.recursive_findNode(label, self.root)

    def traverse(self):
        if self.root:
            return self.recursive_traverse(self.root)
        else:
            return ""

    def recursive_traverse(self, node, construction=""):
        '''
            Traverse the tree.
            This algorithm will run a pre-order traversal. When
            the algorithm encounters a new node it immediately
            appends the node's prefix and contents, and then
            traverses the node's children. After visiting the
            node's children, the node's postfix is appended and
            the traversal for this node is complete.

            Returns a complete construction of the nodes' prefix,
            contents, and postfix in a pre-order traversal.
        '''

        if(node):
            construction = construction + node.getPrefix() + node.getContents()

            for child in node.getChildren():
                construction = self.recursive_traverse(child, construction)

            return construction + node.getPostfix()

    def recursive_findNode(self, label, node):
        '''
            Executes a depth-first search of the tree to find the
            node with the specified label. This algorithm finds the
            first label that matches the search and is case insensitive.

            Returns the node with the specified label or None.
        '''

        if(node is None or node.getLabel().lower() == label.lower()):
            return node

        for child in node.getChildren():
            found = self.recursive_findNode(label, child)
            if found is not None:
                return found

        # No match anywhere in this subtree
        return None


Now, let's create the payload for the nodes. This will be done by creating a file to store our 'html tag classes':

 


class Generic_HTML_Tag(object):
    '''
        A Generic HTML tag class
        This class should not be called directly, but contains 
        the information needed to create HTML tag subclasses
    '''

    def __init__(self):
        self.prefix = ""
        self.postfix = ""
        self.indent = 0
        self.TAB = " "

    def getPrefix(self):
        return self.prefix

    def getPostfix(self):
        return self.postfix

    def setPrefix(self, prefix):
        self.prefix = prefix

    def setPostFix(self, postfix):
        self.postfix = postfix


class HTML_tag(Generic_HTML_Tag):
    def __init__(self, indent_level=0):
        Generic_HTML_Tag.__init__(self)
        self.indent = indent_level
        self.generateHTMLPrefix()
        self.generateHTMLPostfix()

    def generateHTMLPrefix(self):
        self.prefix = self.indent*self.TAB + "<html>"

    def generateHTMLPostfix(self):
        self.postfix = self.indent*self.TAB + "</html>"

class HEAD_tag(Generic_HTML_Tag):
    def __init__(self, indent_level=0):
        Generic_HTML_Tag.__init__(self)
        self.indent = indent_level
        self.generateHTMLPrefix()
        self.generateHTMLPostfix()

    def generateHTMLPrefix(self):
        self.prefix = self.indent*self.TAB + "<head>"

    def generateHTMLPostfix(self):
        self.postfix = self.indent*self.TAB + "</head>"

class TITLE_tag(Generic_HTML_Tag):
    def __init__(self, indent_level=0):
        Generic_HTML_Tag.__init__(self)
        self.indent = indent_level
        self.generateHTMLPrefix()
        self.generateHTMLPostfix()

    def generateHTMLPrefix(self):
        self.prefix = self.indent*self.TAB + "<title>"

    def generateHTMLPostfix(self):
        self.postfix = self.indent*self.TAB + "</title>"

class BODY_tag(Generic_HTML_Tag):
    def __init__(self, indent_level=0):
        Generic_HTML_Tag.__init__(self)
        self.indent = indent_level
        self.generateHTMLPrefix()
        self.generateHTMLPostfix()

    def generateHTMLPrefix(self):
        self.prefix = self.indent*self.TAB + "<body>"

    def generateHTMLPostfix(self):
        self.postfix = self.indent*self.TAB + "</body>"

class P_tag(Generic_HTML_Tag):
    def __init__(self, indent_level=0):
        Generic_HTML_Tag.__init__(self)
        self.indent = indent_level
        self.generateHTMLPrefix()
        self.generateHTMLPostfix()

    def generateHTMLPrefix(self):
        self.prefix = self.indent*self.TAB + "<p>"

    def generateHTMLPostfix(self):
        self.postfix = self.indent*self.TAB + "</p>"

 

Great! Obviously this is quite basic and incomplete, but the idea is there. The generateHTMLPrefix() and generateHTMLPostfix() functions are the modifiers. You can add extra parameters and such there without breaking your code elsewhere. You can also add logic to, say, only emit a specific attribute if the tag supports it.

Ex: only a few tags have an onload attribute
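
To make that idea concrete, here is one way an attribute-aware payload might look. This is a sketch under my own assumptions: the class name AttributeTag and the ALLOWED table are hypothetical, the table is far from a complete list of HTML attributes, and attribute values are inserted as-is without escaping.

```python
class AttributeTag(object):
    """Hypothetical tag payload that renders only the attributes it supports."""

    # Attributes each tag may carry in this sketch (not a complete list).
    ALLOWED = {
        "body": {"onload", "id"},
        "p": {"id"},
    }

    def __init__(self, name, **attributes):
        self.name = name
        allowed = self.ALLOWED.get(name, set())
        # Silently drop attributes that this tag does not support.
        self.attributes = dict(
            (key, value) for key, value in attributes.items() if key in allowed
        )

    def getPrefix(self):
        # Sort the attributes so the output is the same on every run.
        rendered = "".join(
            ' %s="%s"' % (key, value)
            for key, value in sorted(self.attributes.items())
        )
        return "<%s%s>" % (self.name, rendered)

    def getPostfix(self):
        return "</%s>" % self.name
```

Because it exposes getPrefix() and getPostfix(), an object like this can be handed to a DefTreeNode as its payload in place of the fixed tag classes. For example, AttributeTag("body", onload="init()") renders its prefix as <body onload="init()">, while the same attribute on a "p" tag is dropped.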

 

Now that you have, let’s say, a file with your tree and node called HTMLTree.py and a file with your html markup classes called HTML_Tags.py, let’s bring it all together:

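import HTML_Tags
# Import the tree classes by name so they can be used unqualified below
from HTMLTree import DefTreeNode, DefinitionTree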


if __name__ == "__main__":

    html = DefTreeNode("html_tag", HTML_Tags.HTML_tag())
    head = DefTreeNode("head_tag", HTML_Tags.HEAD_tag())

    head.addChild(DefTreeNode("title_tag", HTML_Tags.TITLE_tag(), "basic title here!"))
    html.addChild(head)

    body = DefTreeNode("body_tag", HTML_Tags.BODY_tag())
    body.addChild(DefTreeNode("p_tag", HTML_Tags.P_tag(), "paragraph here"))
    html.addChild(body)

    htmltree = DefinitionTree(html)

    searchlabel = "p_tag"
    print('Searching for a node labeled: ' + searchlabel)

    node = htmltree.findNode(searchlabel)
    if node:
        print("\n\n" + node.getPrefix() + node.getContents() + node.getPostfix())
    else:
        print('\n\nNode with label "' + searchlabel + '" not found!')

    print('\n\nTree Traversal:\n')

    print(htmltree.traverse())

The output should look similar to this:

Searching for a node labeled: p_tag


<p>paragraph here</p>


Tree Traversal:

<html><head><title>basic title here!</title></head><body><p>paragraph here</p></body></html>

Great! You dynamically generated HTML in an extensible way!
