
A Study in Password Generation and Validation

For those who would rather just read the code: https://github.com/calebshortt/pwanalysis

[Updated April 16, 2019] I have included further results in an update at the bottom of this post.

 

Introduction

A study.

On many a dreary evening, once the sun sets beyond the Pacific and leaves only the gold and pink flakes of cloud and sky, I take to my studies — in the literal sense. On my desktop I maintain a growing number of “Studies” that pique my interest and encourage further research. When time permits I dive into one and if such a study bears fruit it will find its way upon these pages.

This study continues my obsession with passwords.

Firstly, I must ask you a question: Have you ever read through a password dump?

I do not mean to ask if you perused the listing and then used it in some attack, but did you truly “digest” the list of passwords one by one? Did you consider their meaning, or guess at the context in which they were conjured within the mind of one unsuspecting human, a place where immensely intimate feelings and thoughts often manifest themselves? For who chooses a password with the thought that this very string of characters, words, symbols, and what-have-you will one day be revealed in its entirety to the world?

“*Gasp!* Plaintext?”, you balk, “But this is 2019!”. Perhaps you should ask Facebook what year it is.

I have found that passwords often reveal much more about an individual than the crude challenge-answer mechanisms utilized for authentication. If a user’s digital identity (i.e. username, etc.) is their public-facing persona then their password is often their private form. It was through my inspection of these password lists that I encountered three thoughts:

  • “Password language” is a language all its own, and it DOES have loosely established, unspoken syntax rules.
  • The frequency distribution of characters in passwords could be quite different from that of a spoken language.
  • The relationship between the personality of a user and their password is non-random (sounds obvious).

 

My Claim:

It is not enough to simply have unique passwords. There are ways to generate passwords that are not found in a word list AND can have better results than a full brute-force attack.

And thus my journey began.

Let’s hash out these thoughts a bit more…

 

Thoughts

Thought 1: Password Rules as a Language

Even without the mountain of research and rule-sets available for Hashcat, we need only look at the body of information security standards to see this. Both NIST SP 800-53 and ISO 27002 have general rules around passwords that create an environment which can be interpreted as a language. Let us look at some common rules:

Must have X of the following (if not all)

  • Upper-case letter
  • Lower-case letter
  • A number/digit
  • A special character/symbol
  • Minimum length of 8

Looks like a language definition to me. The “language” defines what counts as a valid password with respect to whatever policy/standard/etc. is in use.

Interesting … If I apply some predictive language models to this I might get some interesting results… Hmmm.

 

Thought 2: Frequency Distribution of Characters in Passwords

This seems obvious: Given the language rules outlined in “Thought 1”, a character-frequency distribution that differs from other languages would emerge. I suspect that similarities would exist to whatever language the “password maker” speaks, but they would not be the same.

For example:

If I require all words in a language to have at least one number in them I will significantly skew the frequency distribution of characters:

  • English Letters: a-z: 26 possible options
  • Numbers: 0-9: 10 possible options

A “password maker” (human) will have 10 possible numbers to choose from for their required number but will have 26 possible letters to choose from for the rest of the password. How many English words have a letter AND a number in them? This not only changes frequency distributions, it also makes passwords a bit easier to identify in a large corpus of English text… Interesting. I used this technique in my pastebin scraper: https://github.com/calebshortt/pastebin-scraper

Spoiler: You get into trouble with encrypted text and when people post code, but it is still possible. I had to train an ANN to differentiate.

All of this to simply say that if you require non-standard characters in a language it will uniquely skew the language’s character-frequency distribution.

Important foreshadowing: This means that, if we were to generate passwords on a predictive model, we would have to recalculate the frequency distributions of the “password” language… but HOW???

 

Thought 3: The Non-Randomness of Personality and Passwords

Now let’s get philosophical.

*pulls up a backwards chair and sits while resting arms on the back #CdrRiker*

The first time I encountered this was when I was looking at a pastebin paste of passwords from a fake porn site. I kept noticing certain passwords popping up. The same ones. Things like batman and semperfi. Now I am not so naive as to think that someone would use their banking password for a porn site — I sure hope not. There were plenty of “f-you scammer” style passwords out there too. This is why actually looking at what is in the password dumps is important. Filtering is an important phase.

Now ‘semperfi’ caught my attention (‘batman’ did too, but for different reasons. He’s obviously better than Superman). ‘Semper fi’, for the uninitiated, is the motto of the US Marine Corps. It is short for “Semper fidelis”, Latin for “always faithful”. Very nice motto and all, but why was it popping up en masse in password dumps? Turns out the associated emails definitely gave off a … military vibe.

I suppose this is a good time to say: Worry not! All of the pastes I find are reported immediately and I do not save any of them. I look but will not touch.

Back to the thought: Non-randomness. Personality and passwords.

Turns out that correlations exist. I started looking for similar correlations in classic password dumps like “RockYou”, “MySpace”, etc. For example, some form of “batman” pops up 906 times, and “semper” 195 times, in RockYou (total of 14,344,392 passwords). Batman wins.

What is even more interesting is that you get even more information from these passwords. It isn’t just the “semperfi” aspect. Remember that a number is usually required — this usually manifests in the form of a year… The year they joined the military. Some have their state of origin, their gender, birthdays, etc. Lots going on here.

The point? Non-randomness. Again, this seems obvious, but it is an important point when thinking of creating predictive models based on such data.

 

Now What?

With these thoughts in mind, I started contemplating a mechanism to take advantage of these flaws in the password armour (It could be argued that ‘password armour’ is nothing more than a tattered cloth that barely covers the intimate portions of the wearer. Be careful of strong winds).

 

The Algorithm

My thoughts immediately went to frequency analysis. I needed to confirm that my suspicion (Thought 2) was correct. For this, I resorted to the RockYou password dump. Remember that the RockYou dump contains around 14 million passwords. The interesting thing about RockYou is that these are non-throw-away plaintext passwords. Getting your hands on such a dump for analysis is a very rare occurrence.

You can listen to the history and context around RockYou and why it is such an important password dump at a recent Darknet Diaries podcast episode. Very informative and interesting!

 

My second goal would be to generate a predictive model based on an existing corpus of the “password language”. This, again, would require RockYou. I played around with a few methods but settled on generating a Markov Model based on the password list. I looked around and Hashcat has a bruteforce++ method that uses a Markov chain based on the RockYou password list as well.

I was also interested in all of the possible ngrams generated for a password dump. If I could extend the single character Markov models to use any-sized ngrams for password generation my results may see improvement.

I would use the generated Markov model to create net new passwords that have not been seen in the word list but follow the same “password language” rules.

 

My third goal was to add a final filter layer that acts as a verification check against all of the generated passwords. A Markov model (even with tweaking) can generate a wide range of “potential passwords” with high variance from the parent dataset. This is required for a good generator algorithm, but if the “reins” are too loose we approach general brute-force run-times.

 

Frequency Analysis

This was the easy part: Write a script to go through every password in the list and count the number of unique characters.

I expanded this script to generate and count all possible ngrams in each password for the entire list. A single character is just an ngram of size 1, so I simply looped over all possible ngram sizes. This ended up helping with the Markov model generation later.

For reference:

  • The RockYou password list is roughly 133MB in size
  • The list of generated ngrams was ~3.7GB (Contains all duplicates)
  • The counted list of ngrams was ~1.39GB (No duplicates — just ngram and count per line)

The generation of such data required a streaming-style programming paradigm, as it is not a good idea to assume that you can load the entire file into memory. A significant amount of the programming effort was spent here, as MemoryErrors can pop up in lots of places where you don’t expect them.
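
As a rough illustration of the streaming approach (not the actual pwanalysis code), the sketch below reads one password per line and counts ngrams as it goes. It keeps everything in an in-memory Counter, whereas the real run had to push intermediate counts to disk to avoid exactly those MemoryErrors; the file name, encoding, and the max_n cap are assumptions.

from collections import Counter

def stream_ngram_counts(path, max_n=None):
    # Read the password list line by line so the whole file never sits in memory.
    counts = Counter()
    with open(path, encoding='latin-1', errors='ignore') as handle:
        for line in handle:
            password = line.rstrip('\n')
            longest = len(password) if max_n is None else min(max_n, len(password))
            for size in range(1, longest + 1):
                for start in range(len(password) - size + 1):
                    counts[password[start:start + size]] += 1
    return counts

# counts = stream_ngram_counts('rockyou.txt', max_n=3)
# print(counts.most_common(10))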

To answer the character frequency question, I generated a bar chart of the top 50 characters from the password lists.

NOTE: The trailing off after ‘C’ continued into obscurity. This is a full count of characters derived from all NGRAMS.

The above bar chart shows that my suspicion was correct. The frequency distribution of characters in the “password language” is slightly different than the usual (primarily English) language.

 

Generating a Model and then Passwords

This process required ingestion of the ngram count file (~1.39GB), generation of a Markov model based on the ngram count file, then generation of new passwords based on the Markov Model.

Again, a streaming programming paradigm was taken to not load the entire dataset into memory. The generation of the Markov model was relatively simple.

First, another counting and structuring exercise was completed: each ngram was broken down into characters and iterated through. The current character was treated as a node, and the next character became a child of that node with the ngram’s count associated to it. This allowed the cumulative counts to be applied to each relationship within the ngram.

For example, if the ngram was “root” with a count of 100:

  • Current character: ‘r’
    • ‘o’ follows ‘r’: Add a child called ‘o’ with count 100 to top-level ‘r’ node.
  • Current character: ‘o’
    • ‘o’ follows ‘o’: Add a child called ‘o’ with count 100 to top-level ‘o’ node.
  • Current character: ‘o’
    • ‘t’ follows ‘o’: Add a child called ‘t’ with count 100 to top-level ‘o’ node.

Then, once the data structure is built, we can normalize it using the totals to generate probabilities for each node. This makes our Markov model satisfy the property that each node’s outgoing probabilities sum to 1 (i.e. 100%).
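
A minimal sketch of that build-and-normalize step, assuming the counted ngram file has already been parsed into (ngram, count) pairs; the structure and names here are illustrative, not the pwanalysis implementation.

from collections import defaultdict

def build_markov_model(ngram_counts):
    # Accumulate 'next character follows current character' counts,
    # weighted by how often the parent ngram occurred.
    transitions = defaultdict(lambda: defaultdict(int))
    for ngram, count in ngram_counts:
        for current, following in zip(ngram, ngram[1:]):
            transitions[current][following] += count

    # Normalize so each node's outgoing probabilities sum to 1.
    model = {}
    for current, followers in transitions.items():
        total = sum(followers.values())
        model[current] = {char: n / total for char, n in followers.items()}
    return model

# build_markov_model([('root', 100), ('rose', 40)])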

Now the model is complete and we can use it to generate a new password … but how do we choose the very first character of our password? This is where our character frequency distribution map chimes in. I use this map to select a random character that is weighted based on the frequency distribution we generated (Remember we can’t use an English one — has to be based on the ‘password language’).

Now we have a basic password generator that is using weighted random selection of the first character, then generating the remaining password from a Markov model built from the RockYou password list. All we have to do is tell it when to stop — that is, how long we want the password to be (8 characters? 10? etc.).

An improvement

I ran the generator and analyzed the results. It sure worked, but there were a few low-frequency characters that would never show up in the generated passwords. These weren’t a huge deal as they were low probability, but I took a page from the genetic algorithm playbook and included a “mutation” value that I could configure. This would add a randomization element to the generation of a password and potentially avoid any “local minima passwords” (passwords that are so common that they override the probability of similar passwords — which would then never be produced). The results seemed to produce a better “population” of passwords.
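
Putting the pieces together, here is a sketch of a generator that picks the first character from the weighted frequency map, walks the Markov model for the rest, and applies a configurable mutation chance; again, this is an illustration under my own naming, not the pwanalysis code.

import random

def generate_password(model, char_freq, length=8, mutation=0.05):
    # char_freq: {character: frequency} from the 'password language' distribution.
    alphabet = list(char_freq)
    weights = [char_freq[c] for c in alphabet]
    password = random.choices(alphabet, weights=weights, k=1)

    while len(password) < length:
        current = password[-1]
        if random.random() < mutation or current not in model:
            # Mutation step: occasionally pick any character at all,
            # so low-frequency characters still show up now and then.
            password.append(random.choice(alphabet))
        else:
            followers = model[current]
            password.append(random.choices(list(followers),
                                           weights=list(followers.values()), k=1)[0])
    return ''.join(password)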

 

Filtering Generated Passwords

Now that I have a set of generated passwords I could start using them and compare the results against other brute-force algorithms such as the Hashcat one. But first I wanted to build a filtering mechanism to help hone the generated passwords.

The problem with generated passwords is that they vary. A lot. This is both good and bad.

The lines in the above diagram need some explaining. If we think about generated password variance, we are thinking about how our generated password varies from what we would consider an actual password. The diagram uses the ‘B’ line to represent a ‘perfectly’ generated password (the generated password fully reflects what the model thinks is a password and IS a password). As we move horizontally away from the ‘B’ line in either direction, we start to diverge from what we think is a password and what IS a password.

  • If we move to the left from B, our model ‘thinks’ the string looks like a password but it starts to be LESS probable that it actually is a password.
  • If we move to the right, our model starts to ‘think’ the string doesn’t look like a password but it starts to be MORE probable that it actually is a password.
  • The lines A and C represent bounds that I was hoping to keep my model within, so as not to get too far from the center of the diagram. I suspect there is a correlation between the distance from the center B line and the entropy of the password — I would consider a high-entropy password to be at the far right of the “strings that are passwords” category. This is a string that my algorithm would never generate.

The problem is that the only way to check whether a string not only looks like a password but actually is one is to verify it against known passwords. This isn’t always possible (for obvious reasons). Some sort of filter or measure of “password likeness” would help here. It would be a heuristic at best, but it may still provide the model with some guidance.

Enter the machine learning aspect of the system.

This is a classic 2-label classification problem: “Password” and “Not password”. The problem is that “Not password” is not a helpful label. At any point a “Not password” string could be used as a password and change categories. Not fun. Also, training an algorithm requires a LABELED dataset (unless I start looking at unsupervised clustering algorithms). I had a list of passwords (a labeled set), but no “not password” list.

I went for a unary classifier. To be specific, I used scikit-learn’s one-class SVM with a non-linear kernel, trained on my known ‘password’ list. The goal here was not to be 100% accurate — in fact, an algorithm that was 100% accurate against my training set (no matter how large) would signal a serious issue (over-fitting). In reality, I was hoping for a range of inaccuracies because, by definition, my generated passwords would not be in the training dataset. I wanted an algorithm that measured more of a “looks close enough to be a password based on what I’ve seen” vibe.

I trained the classifier on the same dataset the Markov models were generated from — the RockYou dataset. This could be considered adequate or poor form depending on who you talk to — I will leave it as a future exercise to retrain the classifier on a different dataset and see how the results differ. The ultimate goal of the classifier was to filter out generated passwords that differed too greatly from the ‘established norm’ … in essence, that B line in the Venn Diagram above.

This “verifier” was put in front of the password generator to filter its results. Depending on its configuration it could block any generated password that differed too greatly from its training set. This ultimately resulted in only passwords that looked like a RockYou password (but weren’t actually in the RockYou dataset) passing through the filter.
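
For a sense of what such a filter can look like in code, here is a minimal sketch using scikit-learn’s OneClassSVM. The character n-gram feature representation and the function names are my own assumptions for illustration, not necessarily what pwanalysis does.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import OneClassSVM

def train_password_filter(known_passwords, nu=0.1):
    # Character n-gram counts as features (an assumed representation);
    # fit a one-class SVM with a non-linear (RBF) kernel on known passwords only.
    vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2))
    features = vectorizer.fit_transform(known_passwords)
    classifier = OneClassSVM(kernel='rbf', nu=nu, gamma='scale')
    classifier.fit(features)
    return vectorizer, classifier

def looks_like_a_password(vectorizer, classifier, candidate):
    # OneClassSVM.predict returns +1 for inliers, -1 for outliers.
    return classifier.predict(vectorizer.transform([candidate]))[0] == 1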

 

Results, Conclusions, and Future Work

Passwords were generated! These passwords looked close enough to real passwords from visual inspection that I am satisfied that this method could be useful. With that said, tweaking is always required. I think I can do better. My system would also benefit from a numerical comparison of passwords such as Hamming distance or other string similarity algorithms. This could better inform my classifier.

As far as lessons learned, I had quite a time coding up a way to count a dataset whose counts could not even fit in memory, so I had to get creative. I learned a lot about the limits of Python and memory allocation, and a few tricks to keep your application from tipping over.

I am interested in testing this framework further to see how much better the generated passwords perform against a standard Markov model-generated password set. I would have to evaluate it based on the number of passwords compared, not on timing, as optimizations in that space can always be made.

What is interesting about this method is that the resulting passwords could be generated for any length, then hashed and placed in a rainbow table — this would create a rainbow table of arbitrary-length passwords that have some similarity to actual passwords. I suspect a better hit rate when all wordlists fail and you have time to precompute a huge list of passwords.

Other future work: There are plenty of places for improvement. My code is littered with “TODOs” where I have identified issues or potential areas for improvement. I would like to improve the Markov model usage to include ngrams of length >1 — this project was only on single characters.

 

If you are interested in the code and don’t want to scroll back up to the top, here it is:

https://github.com/calebshortt/pwanalysis

 

Update: Analysis of the Generated Passwords

April 16, 2019. My continued work and analysis has generated some results on this topic and so I will post them here.

This update will outline in more detail the generated passwords and my resulting analysis on these passwords to ensure that they are generated and validated correctly.

Generated Passwords and their Source Material

Let us look at a list of generated passwords (from my tool) and the list of passwords from the source material. In this case I used the rockyou password dump as source material.

Generated Passwords (Of length 8, with validation) Sample:

enleiffa
ngegiena
leetlolo
ramtiohe
uisshler
whoenioa
iacilele
setlmsii
kioaeil3

The tool currently requires the user to specify a password length. For example, the user will ask the tool to generate 1000 passwords of length 8 based on the given source dump. Remember, validation uses a trained model filter which only allows through strings that have been classified as ‘passwords’. This second level of validation prevents the raw passwords generated by the Markov model from diverging too much from the source content (remember that I included a ‘mutation’ aspect in the model to inject variance into the password ‘population’; this additional filter keeps that mutation aspect in check).

Here are a few generated passwords without validation enabled:

2&0๑1003
02957878
weeaihie
teehieőc
onvlk0nl
oturnnor
laeillòe
6M574872
hoooaol9

The set of passwords generated in this way can be thought of as a superset of the validated passwords: the raw, nonvalidated output includes the validated passwords, just with more noise. Also notice the increased use of characters that are not in the ASCII set but are in the UTF-8 character set. This is because the model generated from rockyou included these lower-probability paths.

Methods: Comparison and Similarities

My method for comparing the generated passwords against the source material is relatively straightforward.

When comparing strings one generally tends to think of either Hamming or Levenshtein distances. There are others, I know, but I will limit my assessment to these two, as I am not trying to spark a debate on the merits of various string comparison algorithms.

Hamming distances are great comparison calculations if the two strings are of equal length. The distance is the number of character positions at which one string differs from the other. E.g. hamming_distance(abc, abb) = 1, as only one character position differs. I could not, and did not want to, guarantee the equal-length property when comparing my generated passwords to their source. Length is an important value in the comparison, and in the future I want my tool to decide when to terminate the string (password) by itself, thus determining length automatically.

Levenshtein distances can compare two strings of different lengths. This is done by counting the number of insertions, deletions, and substitutions required to match the strings. This method compensates for length variation with the insertion and deletion operations (whereas the Hamming distance can sort of be thought of as a Levenshtein distance with only the ‘substitution’ operation). I chose the Levenshtein distance as my string metric because I wanted to use variable-length strings.

In practice, I chose the Levenshtein ratio, a similarity score derived from the Levenshtein distance and the lengths of the strings being compared. This gives me a normalized value from 0.0 to 1.0 that I can work with.

I used the python-Levenshtein library to execute the actual calculation.
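
Here is a minimal sketch of the comparison step, assuming the generated and source passwords have been loaded into Python lists; how the per-password ratios are aggregated (best match per generated password, then the mean) is my own choice for illustration.

import Levenshtein  # pip install python-Levenshtein

def mean_best_ratio(generated, source):
    # For each generated password, find its best Levenshtein ratio against
    # the source list, then average those best scores (0.0 .. 1.0).
    total = 0.0
    for candidate in generated:
        total += max(Levenshtein.ratio(candidate, password) for password in source)
    return total / len(generated)

# mean_best_ratio(generated_passwords, rockyou_top_1000)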

Process

My analysis included the following steps:

  1. Run validated password generation using rockyou as the source (Generate 1000 sample passwords)
  2. Run nonvalidated password generation using rockyou as the source (Generate 1000 sample passwords)
  3. Calculate the Levenshtein Ratio for each generated dataset (compare with rockyou)

 

Analysis of Results

I generated four sets of passwords for my initial assessment:

  1. Validated passwords of length 8
  2. Validated passwords of length 9
  3. Nonvalidated passwords of length 8
  4. Nonvalidated passwords of length 9

For each password set I compared them against the rockyou source using the Levenshtein ratio.

Here are the results (NOTE: The ratio is from 0.0 to 1.0. The higher the ratio the higher the similarity between the sets):

  1. Validated passwords of length 8
    • L. Ratio Compared to Whole RockYou DataSet: 0.99561
    • L. Ratio Compared to Top 1000 RockYou Passwords: 0.99923
  2. Validated passwords of length 9
    • L. Ratio Compared to Whole RockYou DataSet: 0.99563
    • L. Ratio Compared to Top 1000 RockYou Passwords: 0.99924
  3. Nonvalidated passwords of length 8
    • L. Ratio Compared to Whole RockYou DataSet: 0.99560
    • L. Ratio Compared to Top 1000 RockYou Passwords: 0.99923
  4. Nonvalidated passwords of length 9
    • L. Ratio Compared to Whole RockYou DataSet: 0.99560
    • L. Ratio Compared to Top 1000 RockYou Passwords: 0.99922

After I ran the initial evaluations against the entire rockyou dataset, I wanted to see how my values compared to just the top 1000 rockyou passwords; that is, comparing my 1000 generated passwords to the top 1000 passwords in rockyou. As you can see above, I was quite happy with the results.

The comparison against the entire rockyou dataset showed a ratio of roughly 0.995, whereas the ratio for the top-1000 comparisons was 0.999. This makes sense, as the rockyou dataset has roughly 14 million entries and therefore includes significantly more variance than the top 1000. With that said, the top 1000 passwords of rockyou are often enough (with perhaps some Hashcat rule trickery) to crack plenty of passwords.

As I ran the generator in validation mode for passwords of length 10 or greater, I noticed a significant drop-off in production. This was not due to the tool’s inability to produce passwords of length 10+, but to the validation model rejecting most generated strings of that length as insufficiently similar to the password dataset. The validation step was filtering out most of the passwords that were generated. Unfortunately the solution would be to relax some of the constraints on the model and allow for more variance — which would have invalidated the consistency of my results. I chose to only do validated passwords of length 8 and 9 for now and to look into more flexible models than the single-class SVM I was using.

With that said, I ran my comparisons on all nonvalidated results with interesting outcomes:

  • Nonvalidated passwords of length 10 against the top 1000: 0.99922
  • Nonvalidated passwords of length 11 against the top 1000: 0.99922
  • Nonvalidated passwords of length 12 against the top 1000: 0.99921
  • Nonvalidated passwords of length 13 against the top 1000: 0.99921
  • Nonvalidated passwords of length 14 against the top 1000: 0.99921
  • Nonvalidated passwords of length 15 against the top 1000: 0.99920
  • Nonvalidated passwords of length 16 against the top 1000: 0.99920
  • Nonvalidated passwords of length 17 against the top 1000: 0.99920

Notice that as the password length increases there isn’t much of a change in the Levenshtein Ratio. There is a downward trend, but only very slightly.

 

Final Remarks on Results

Considering the similarity ratios calculated, if I were to use this tool in a practical sense, I would simply forego the validation step and generate the nonvalidated passwords as their similarity calculations are very close to those of the validated sets. The difference is the significant time required to generate passwords of length 10 (or more) that can pass the validation step.

It is important to note that the passwords generated were not already in the rockyou dataset (no overlap). This means that my tool can be used as an intermediate step — between “naive” brute forcing and using a dictionary. This method can be thought of as a type of “password squeeze problem” as it is close enough to the source dataset to possibly provide useful alternatives that may not be found within the dataset itself.

 


ShillBot: A Study in Identifying Reddit Trolls Through Machine Learning

For those who would rather just read the code: https://github.com/calebshortt/shillbot

 

Introduction

We’ve all been there. You’re browsing Reddit and see a post that you’re passionate about. You click the comment box and reach for the keyboard — but hesitate. Reddit’s reputation precedes it. You type anyways and punch out your thoughts. Submit.

*Bliip*

A comment already? You click the icon and read the most disproportionately vicious response to a comment about cats you have ever seen. What a jerk! But you’re not going to play that game, so you view the author’s previous posts and comments. Through your review, a trend of tactless comments and inflammatory responses bubbles to the surface. They’re a troll. You promptly ignore the comment.

 

Method

The process of inspecting a user’s previous posts and determining how to respond based on that information is a repeatable process — and if it is repeatable, it is surely automatable. This was my thought as I wrapped up my own analysis — and spawned a project to figure out the ‘how’. ShillBot is the fruit of my efforts.

I approached the problem by breaking it down into two parts: the first was extracting a target user’s comment history from Reddit; the second was training an appropriate algorithm with a corpus of data representative of the group I wanted to identify.

Extracting the post history from Reddit was more complicated than initially expected. When a user views another user’s page one of three separate page versions may be returned: The ‘new’ style, ‘old-new’ style, and the ‘old’ style of Reddit. I had to create a separate parser for each version. Once this was done, I was able to extract the post information and construct a corpus for that specific target user.

Training an algorithm on a representative set of Reddit trolls required manual identification. This exercise was both entertaining and depressing as the posts included some of the most vile aspects of Reddit. I was able to find more than enough of a representation simply by finding ‘hot-topic’ and controversial posts and then sorting by controversial (or simply finding the most down-voted post). Then I would inspect that user’s history and determine if they were truly a troll. I also created a list of ‘normal’ Reddit users. This would be used to counteract the troll set. In essence, I would need to give the algorithm an accurate representation of both troll and ‘not troll’ to accurately classify each set.

If all the algorithm knows how to classify is a hammer everything starts to look like a hammer.

The algorithm was trained by combining the post text, post title, post author, and subreddit for all posts in a target user’s history. This provided more context than simply recording the post’s text. For example, including the subreddit and post author (OP) allows the algorithm to identify common trends such as cross-posting from one subreddit to another, trolls commenting on other trolls’ posts to boost controversy, etc.

For this application I used a basic stochastic gradient descent (SGD) classifier as it has traditionally had some success in the text classification space. In the future I may play around with other classifiers to see what results they produce.
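
As a rough sketch of what that training setup can look like (my own minimal example, not the shillbot code; the TF-IDF feature choice and all names here are assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

def train_troll_classifier(user_docs, labels):
    # user_docs: one combined string per user (post text + title + author + subreddit)
    # labels: 1 = troll, 0 = 'not troll', both hand-labelled as described above
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        SGDClassifier(max_iter=1000, tol=1e-3),
    )
    model.fit(user_docs, labels)
    return model

# model = train_troll_classifier(docs, labels)
# model.predict(['<combined post history of a target user>'])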

 

Results

Promising. I was able to successfully differentiate Reddit trolls from ‘not trolls’ to a reasonable extent. The main limitation is my manual verification process for data points — I still have to check the post history of each suspected troll manually before I can add them to my dataset.

Bias. My search method may be prone to bias, as ‘searching for controversial topics’ depends on the topic du jour. For example, political topics have been highly represented in Reddit posts as of late, which leads to an over-representation in the classifier’s training set. Even when I am conscious of such an effect, the dataset will inevitably reflect it.

Scalable. Partially. Obviously limited to a reasonable number of requests to Reddit. With that said the system itself is capable of handling a relatively large number of requests through a standard consumer/producer multi-threaded model. Workers complete the scraping and parsing actions then send the data to the server for analysis.

 

Future

Always with the Neural Networks! I just like throwing an ANN at the problem to see what it finds — they are great for teasing out relationships that you may not have found otherwise.

Better extraction of data. My data points right now include a combination of text post, text author, post title, and subreddit. I suspect that I can tease out more relationships between these aspects if I represent them in a better format. Will consider mapping relationships between subreddits and posters, etc.

Trying to address the bias problem — although I am not entirely sure how.

 

Conclusion

All in all, this was a fun project with some interesting challenges. This project confirms, if there was any doubt, that it is possible to take a corpus of text posts from a selected group and apply a few basic algorithms to answer the question ‘does this text belong in that group’ — to a certain extent.


A Hacker’s View of Passwords

Passwords You Say?

Passwords. The bastion of authentication. Defenders of data. Bane of those shadowy figures wearing hoods and ski masks in darkened basements whilst attacking your servers. Passwords protect your secrets, but how effective are they really?

Plenty of articles have been written on the shortcomings of passwords — mainly around complexity, reuse, expiry, and how these additional “controls” may not truly solve the problems inherent to passwords. I will touch on these, but in the spirit of education I felt a duty to provide context and to answer the inevitable question one hears when enacting some new policy or control in the security world: “Why?”

I will start by saying that, in my humble opinion, passwords are here to stay — in one form or another. “What about biometrics?” you may ask — to which I will reply with another question: “What happens when your fingerprint is stolen?”. You can easily change a password. You can’t (easily) change your fingerprints. What about the tokens used in two-factor authentication? Couldn’t we simply just use those instead? Yes we could, but they can be lost or stolen, and can be expensive relative to a password. Economically speaking, we would have to see executives, as a whole, start taking security a lot more seriously if that is to happen.

So, for now, let us say that passwords will be with us for the foreseeable future. Maybe I’m wrong and some new technology will supplant passwords as the de facto standard — but for now they are here and we have to deal with them.

Now, let us take a look at the current “state of the art” of passwords.

 

Passwords Currently

Password policies vary from organization to organization, but in general they seem to follow the lines of NIST SP 800-53’s example. Don’t know what NIST SP 800-53 is? Not to worry, it’s the US Federal Government’s catalogue of controls for “information systems” (aka software systems, etc.).

In other words, it’s a list of things you need to do security-wise if you want to play ball with the US Feds. It’s good practice to do much of this if you are in the private industry as it’s a crazy world out there with crazy-hooded-masked-basement-dwellers around every corner.

So what does NIST SP… bla bla bla … say? It’s quite simple actually. It wants your passwords to:

  • Have a mix of upper-case letters, lower-case letters, numbers, and/or special characters (symbols and such) (Usually 3 of the 4)
  • Have a minimum password length (Usually 8+)
  • Not be the same as your last password (Some say “can’t be one of your last 10 passwords” or so)
  • Expire after some period of time (Usually 60 or 90 days)

Sound familiar?

NOTE: There are a few other requirements, but they don’t directly relate to password complexity so I’m leaving them out. If you really want to check it out, here’s the NIST SP 800-53 standard (pdf). It’s on page 253.

So what passwords fit these criteria?

  • Password123
  • Test1234
  • 12345678ABC!

Why is this important? Because adding complexity makes it hard to guess — in a way. What it actually does is make it hard to do what is called “brute-force”. That simply means to check every possible combination of values.

Time for some math. I promise it won’t be too much.
Let’s look at our criteria again:

  • Upper, lower, number, symbol
    • There are 26 upper-case letters (English alphabet)
    • There are 26 lower-case letters (English)
    • 10 numbers (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
    • 16 symbols (~!@#$%^&*()_+,.?) (Yes I know there can be more but I had to choose)
    • TOTAL = 26 + 26 + 10 + 16 = 78 possible “characters” for each position in the password
  • Minimum password length
    • Assume this is 8 (can be longer, set by policy)
  • Not the same as your last password(s)
    • Doesn’t affect complexity
  • Expires after 60 or 90 days
    • Will affect how long we have to crack the code!

With all this, we can calculate how many possibilities there are in this 8-character password:

78 x 78 x 78 x 78 x 78 x 78 x 78 x 78 = 78^8 = 1,370,114,370,683,136 possible passwords
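
If you want to double-check that arithmetic, a quick Python one-liner does it:

>>> (26 + 26 + 10 + 16) ** 8   # upper + lower + digits + our 16 symbols, 8 positions
1370114370683136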

That’s a lot! Even with some decent-powered computers this would take a long time to go through. But a hacker is smarter than that. They know that it’s not worth trying every combination — especially on a live system where an increase in traffic might get noticed.

So they think about how people choose their passwords instead.

 

A Hacker’s Approach to Passwords

Hackers aren’t going to brute-force the passwords. There are just too many possibilities. They are going to use their brains. They’re going to think about how you choose your passwords. They see the standards too! They know you have to change them. They know the complexity requirements. But they also know that it is hard to remember lots of passwords. They take advantage of this by:

  • Looking at “Common Password Lists” that are occasionally published
  • Looking at common themes:
    • 60-90 day password changes are around the time of Fall, Winter, Summer, and Spring
    • Fun Fact: ‘Fall’, ‘Winter’, ‘Spring’, and ‘Summer’ come up in passwords
  • Use of a year range — usually at the end of a word. For example: Winter2018
    • Or, if they know it, your birth year or date in different formats
  • Common substitutions of numbers for letters. For example: W1nter2018
  • General words with numbers at the end. For example: Test1234
  • Noticing that most passwords start with a capital letter and end with one or more numbers
  • Noticing that people use the same or similar passwords for other accounts
    • If another account get hacked, they will look for published password dumps and try those

These are a lot of rules, but let’s take a look at just the “common password list” one.

If the list has 100,000 of the most used passwords, the hacker is expecting to have decent luck and will only have to try 100,000 times per user. That’s far better than the 1,370,114,370,683,136 possible passwords per user we calculated earlier.

Sadly, in many cases this is as far as the hacker has to go to get into a system. Common passwords are published as “common” for a reason. They are out there in numbers.

Let’s take a look at another few rules.

“Starts with a capital, ends with a number” (Ex: Testing1)

The hacker can brute-force using this rule to reduce the possibilities, but brute-forcing will always be their slowest option. They are more likely to use a pre-existing dictionary of words and apply this rule to those words. For example, let’s say that the hacker obtained a dictionary of the top 100,000 words in the English language that are 7 characters or more in length. Then they apply their rule to it:

“abashed” becomes Abashed1, Abashed2, Abashed3, Abashed4, Abashed5, Abashed6, Abashed7, Abashed8, Abashed9, Abashed0 and so on.

We add 10 words for every one word in the dictionary since we are adding a number to the end. If there are 100,000 words in the dictionary, this rule will have 1,000,000 possibilities.
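
As a sketch of how cheaply such a rule can be applied (real tools such as Hashcat rules do this far faster; the function name here is just for illustration):

def capital_plus_digit(words):
    # Apply the 'starts with a capital, ends with a single digit' rule
    # to every word in a dictionary, yielding candidates one at a time.
    for word in words:
        for digit in '1234567890':
            yield word.capitalize() + digit

# list(capital_plus_digit(['abashed']))[:3]  ->  ['Abashed1', 'Abashed2', 'Abashed3']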

This is a very effective method to guess passwords when a hacker has no other information to go by.

Seasons or Trends

Similar to the “dictionary” method above, except our list of words consist of only Winter, Fall, Summer, and Spring (Or some other noticed trend). The hacker may try a variety of date ranges on the end to try and find old passwords too. For example:

Winter2015, Winter2016, Winter2017, Winter2018, Winter2019

This is quick and dirty, but sometimes works. This is an attack that may even be done manually (no scripts or automation). If a hacker sees a username and password portal, they may try these out (along with username:password combinations such as admin:admin, admin:<blank>, root:admin, etc.)

Fun fact: After the Equifax breach it was found that one of their servers had the admin:admin username and password combo. It works because it is still used. Hackers know.

NOTE: Hackers will also think about who they think you are. If your email is usmarine@marinemail.com, they might just passively try semperfi as the password (US Marine’s motto). Or batman. For some reason the password batman is always in password dumps. Odd. Or awesome. Not awesome. Choose better passwords.

Ultimately, the hacker is going to try and think of ways that people try and remember passwords — because those methods also make it easier to guess. So if the only good password is one that is hard to guess and it seems that our techniques to remember them are making things worse, what do we do?

I’m glad you asked.

 

How Can We Make Better Passwords

Password Managers

The examples above are guessable because they aren’t random. If you had a truly random password there would be no patterns and it would be hard to guess. The problem with this is that it would also be hard to remember. Enter the password manager.

The password manager … wait for it … manages passwords! Now you only have to remember one strong password and it will deal with the rest — better yet, add two-factor authentication for that added kick. +1 to awesome!

In addition to storing and managing your passwords securely, some password managers also give you the ability to auto-generate strong random passwords for you, enter them when needed, and update them when your password expires. Very nice indeed!

Don’t Use Predictable Patterns

If you can’t (or don’t want to) go down the password manager route, you can always just up your password-making game, but know this:

  • Common substitutions are almost as easy to guess as if they were just the normal character/letter
  • Don’t follow a predictable pattern such as a single word that starts with a capital letter, then followed by some numbers. You’re better than that.
  • Never use the seasons as part of your password — They are one of the first few passwords guessed
  • Know what the top passwords used are (they get published from password breaches) and make sure your passwords don’t follow any of those patterns
  • If your password expires, don’t make the new password the same as your old password just with a capital or number or other slight change
  • Where possible, use two-factor authentication (AKA: 2FA)

NOTE: Adding 2-factor authentication is good, but the solutions aren’t all equal: Getting a text is the weakest 2-factor authentication you can have. It is better than nothing, but not by much. You can buy a physical token but this has an added cost. It’s ultimately about economics… and how tinfoil-hatty you are.

Passphrases

There has been some discussion that points towards passphrases as a way forward. A passphrase is an entire phrase (as in multiple words) as opposed to a simple password (which is a single word). A passphrase will be longer and include multiple words that make simple dictionary methods fail. They will also be easier to remember.

 

Conclusion

It is my hope that in this journey through the darkened minds of hooded deviants you are able to think a little like them when you choose your next password. Don the tinfoil hat with pride. And know that your data is just a little bit safer because of it.

 


A Study in IDN Homograph Attack Detection

A Brief Introduction…

How well do you scrutinize the URLs that you click in a browser? Are you the wild type who clicks links before reading them? Or perhaps you are the cautious type — one of the careful number who hover over the link (without clicking), check the address that appears in the browser bar, and verify that the site has a valid certificate (which might not mean as much as people think).

Let’s say that you are trying to get to ‘facebook.com’. You see a link and hover over it. The URL that appears in the browser bar says ‘facebook.com’. You click the link, but it takes you to a phishing site.

What happened? You checked the link. It said it was ‘facebook.com’!

You may have fallen victim to an IDN homograph attack. A homograph is a word that looks the same but has two meanings. In the example above, the user clicked a link that looked like ‘facebook.com’ but had a different meaning. How can that happen? The ‘how’ comes from the ‘IDN’ portion of the name: ‘internationalized domain name’.

The URL in the address bar is actually pointing to a domain name. A domain name is the human-readable name of a website. The real address of a website (server) is an IP address, but people aren’t that great at remembering those, so domain names make them memorable. For example, you can remember ‘facebook.com’ (the domain name) or you can remember ‘31.13.76.68’ (the IP address).

The problem with domain names is that there are many languages. How do you support Mandarin characters in a domain name? Or Russian? You have to internationalize your ‘charset’ (the allowed characters, or individual letters). By internationalizing your charset, you now accept a far greater number of characters than just simple English (Latin-based) characters. This lets people have domain names in their own language. But there is a downside: in the charset, the same-looking character may appear multiple times under different languages. For example, the Cyrillic and Greek charsets (or alphabets) share similar characters (letters) with Latin. These characters have separate identifiers, but look very close, if not identical, to each other.

In the facebook example, the lower-case letter ‘o’ is identified by the character code ‘006F’ in the Latin alphabet (“small letter o”), but could also be ‘03BF’ in the Greek alphabet (small letter omicron). If I were to register a domain name of facebook.com but with the Greek version of those letters, it might pass an internationalized domain name registration. I could then use this URL to trick people into clicking the link because it will visually look like the real thing. An even simpler example is the substitution of a capital ‘O’ with the number zero (0) in the domain name.

I wanted to see if I could detect these types of attacks and flag them using basic statistics.

The Idea…

My idea was quite simple. Each character in a charset can be broken down into its unique identifier (a number), and alphabets (or charsets) are grouped together, so their identifiers will be close together. The plan was to run basic statistical analysis (mean, median, max, min, and standard deviation) on the individual characters of a URL (or domain name). If there are characters outside the usual range, throw a ‘red flag’. I also had to account for the fact that punctuation may throw off any analysis, so there would have to be at least a few methods to identify anomalies — hence the calculation of mean, median, max, min, and standard deviations. I also avoided using neural networks for this particular problem although, as you will see in the summary/future work section, I can see merit in taking it that way.

I wanted a basic proof-of-concept system that would take a corpus of known valid URLs and derive a statistical model from them. Then I would use this model to compare new URLs and determine if they are within the same alphabet or if there was a possibility that they were homograph attacks.
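
To make the idea concrete, here is a tiny standalone illustration (not the project’s analysis.py) of the kind of statistics involved, using Python’s ord() to get each character’s identifier:

import statistics

def char_stats(text):
    # Basic statistics over the Unicode code points of a string.
    codes = [ord(c) for c in text]
    return {
        'mean': statistics.mean(codes),
        'median': statistics.median(codes),
        'stdev': statistics.pstdev(codes),
        'min': min(codes),
        'max': max(codes),
    }

print(char_stats('facebook.com'))   # pure Latin: every code point is below 0x80
print(char_stats('facebοοk.com'))   # Greek omicrons push the max up to 0x3BF (959)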

The Project…

You can see/fork/download the project code here: https://github.com/calebshortt/homograph_detector

The project is quite simple and has two files in the main ‘base’ package. The ‘meat’ of the system is in the analysis.py file. The ‘results.py’ file was used to manage and compare results.

The flow is as such:

  1. Train the system using either a given string or a file with many strings
  2. Pass the system either a given string or a file with many strings to test
  3. Generate results and output them

The comparison function simply takes the statistics for the given test string and compares it (based on a defined number of standard deviations — default 2) to the model statistics — these are based on mean, median, and the standard deviations themselves (I actually created a metric that is the standard deviation of standard deviations). It also compares the max and min ranges of the character identifiers — this is a crude way to check the number ranges.

Here is the comparison function code:

def compare(self, str_stats, stdev_threshold=2):
    """
    Compare a test string's statistics against the trained model (self).
    :param str_stats: ResultSet object for the string under test
    :param stdev_threshold: (int) number of standard deviations allowed
    :return: (passed, all_stats, reason_for_failure_or_None)
    """

    print('Analysing: Threshold: %s standard deviations...' % stdev_threshold)

    str_max = str_stats.result_max
    str_min = str_stats.result_min
    str_mean = str_stats.mean_means
    str_median = str_stats.mean_medians
    str_stdev = str_stats.mean_stdevs

    # Crude range check: the test string's character codes must fall
    # within the min/max range seen in the training corpus.
    if not (self.result_min <= str_min <= str_max <= self.result_max):
        return False, str_stats.all_stats, 'max/min range'

    # Mean of means must be within N standard deviations of the model's.
    r_mean_low = self.mean_means - stdev_threshold*self.stdev_means
    r_mean_high = self.mean_means + stdev_threshold*self.stdev_means
    if not (r_mean_low <= str_mean <= r_mean_high):
        return False, str_stats.all_stats, 'mean'

    # Same check for the medians.
    r_median_low = self.mean_medians - stdev_threshold*self.stdev_medians
    r_median_high = self.mean_medians + stdev_threshold*self.stdev_medians
    if not (r_median_low <= str_median <= r_median_high):
        return False, str_stats.all_stats, 'median'

    # And for the standard deviations (the 'stdev of stdevs' metric).
    r_std_low = self.mean_stdevs - stdev_threshold*self.stdev_stdevs
    r_std_high = self.mean_stdevs + stdev_threshold*self.stdev_stdevs
    if not (r_std_low <= str_stdev <= r_std_high):
        return False, str_stats.all_stats, 'stdev'

    return True, str_stats.all_stats, None

NOTE: The above code compares the given string to the model ResultSet data in ‘self’.

Preliminary Results, Conclusion, and Future Work

The system was able to digest a Latin alphabet-based corpus (included in the source) and correctly identify URLs that had characters from other alphabets in them. The interesting thing about this method is that once the model is generated based on the corpus, it can be recorded and reused — no need to regenerate the model (which took the most time). The system was fast and quite accurate at first glance, although more analysis would be needed to make any real claims on its accuracy.

My file load function breaks on some charsets as they are not included in the default encoding for the python file-loading function (that or I missed something). I am working to fix this without clobbering the original encoding of the URL (which would defeat the purpose of this experiment).

This is already starting to look like a neural network would work nicely for this particular problem. I would like to feed the model stats into an ANN with perhaps some other inputs and see how it does. The basic stats I used were able to get decent results, but there are obvious edge cases that standard statistics wouldn’t identify (such as the old ‘substitute the 0 (zero) in for capital O’ trick). A neural network may help catch those.

I would say that it is quite possible to identify IDN homograph attacks using basic statistics and there are a few paths forward to improve accuracy with the results already demonstrated. With that said, nothing will compare to the standard ‘look before you click’ mentality. Users who identify the ‘0 instead of O’ substitution will not leave as much to chance. Systems like these aren’t catch-alls.


Growing Threats, Growing Attack Surface

It has been an interesting year of breaches, vulnerabilities, and scares: from the recent ROCA vulnerability in Infineon’s TPM, a widely used module in the smart card industry, to the less-exploitable-but-still-serious KRACK vulnerability that makes attacking the WPA2 Wi-Fi security protocol possible, to the supply chain attack on CCleaner, a popular utility for cleaning up a user’s computer.

What is clearly emerging from such events is that there is still much work to be done in the security space. The Ixia Security Report for 2017 describes an increase in the amount of malware and an increase in the size of companies’ attack surfaces. Attack surface is the exposed, or public-facing, “surface area” of a company. Some have attributed the increase in attack surface to increased usage and misconfiguration of cloud infrastructure. They argue that misconfiguration of servers looks to be replacing some of the more traditional OWASP Top 10 vulnerabilities — such as SQL injection.

 

But the vulnerabilities listed above (ROCA, KRACK, and the CCleaner supply chain attack) aren’t necessarily cloud-related.

ROCA relates to an incorrectly-implemented software library — specifically the key pair generation. It allows an attacker to factor the private (secret) key just from using the public key. Modern encryption relies on a public key (that is sent to whoever wants it) and a secret (private) key that only the owner has and is used to decrypt messages. Only the private key can decrypt messages to the owner. This vulnerability allows anyone with the public key and some decent computational power (say an AWS cluster used to number-crunch) to get the original private key and decrypt the messages sent to the original owner. The computational power required is in the realm of “expensive but possible”. A targeted attack is a very real possibility, but widespread breaching would be infeasible.

KRACK involves a replay attack. “By repeatedly resetting the nonce transmitted in the third step of the WPA2 handshake, an attacker can gradually match encrypted packets seen before and learn the full keychain used to encrypt the traffic”. This vulnerability requires a physical component as the attacker will have to be on the Wi-Fi network. The cause is inherent to the standard — which means all correctly-implemented versions of the standard are vulnerable (i.e. Libraries that implemented it to spec). Many security practitioners have taken this particular moment in time to explain that the usage of a Virtual Private Network (VPN) would mitigate such attacks, and that Wi-Fi should be an un-trusted source to begin with.

The CCleaner supply chain attack involved an injection of malware into a library used in the implementation of CCleaner. When the CCleaner program is packaged and deployed, it includes this third-party library in its package. This type of attack takes advantage of consumers’ trust in CCleaner, and it is becoming a more popular attack for hackers. For the record, CCleaner has been sanitized and is no longer a threat from this malware. I imagine Avast, the company that offers CCleaner, also took a look at their supply chain trust and revamped some policies around it.

 

Each of these attacks comes in addition to the general increase in attack surface and the misconfiguration of servers (which is becoming more common). There is an increase in supply chain attacks because they clearly work. There are plenty of incorrect implementations of standards or protocols that hackers can take advantage of. There are, far less often, errors in the actual standards or protocols themselves.

It may seem like the odds are stacked against an organization’s security team. They are. That is why security is not only the responsibility of the security team, but of the entire organization, from the executives to the developers (who choose which libraries to implement in their software) to the tech support teams that are often on the “front line” with customers.

Education is always a good first step.


Software Development, Morality, ‘The Secret Life of Walter Mitty’, and Victor Frankenstein

For those who haven’t watched ‘The Secret Life of Walter Mitty’, I highly recommend a showing. It follows Walter Mitty, a daydreaming “negative asset manager” at LIFE magazine during its conversion to a fully-online offering. It truly is a visually stunning work.

The opening premise, LIFE magazine moving online and the inevitable downsizing and layoffs, struck a chord that has been, and still is, resonating: is there a place for morality in software developers’ drive toward automation and efficiency?

One would be quite right in saying that the issue of ‘worker layoffs due to automation’ is not a new problem. History is full of examples. What piques my interest, however, is the generality of software automation. The immense reach of software naturally leads to an immense number of avenues for automation.

For example: I found myself talking with a colleague about the problems that they were having with some of their staff. When we finally distilled the problem down to its essence, we discovered that a great portion of his department was dedicated to the handling and sorting of files (originally electronic, then printed, then sorted and filed). I found myself flippantly stating that I could replace most of his department with a script.

My watching of ‘Walter Mitty’ sparked a wave of introspection, and a single question welled within me: If I could write a script that replaces an entire department, should I?

The script would increase the company’s efficiency through a significant reduction in cost. But why is efficiency so important that one would look for avenues to terminate the employment of others? Who benefits from it? Recently, it seems, the cost savings would not make their way to the remaining employees but would manifest as bonuses for an executive or manager, or perhaps dividends for shareholders.

Is inefficiency really that bad? In this case a department is being employed to do work. They are doing the work satisfactorily. Their wages pay for local food, rent, and expenses. This provides a boon to the local economy. If the populace is scraping by financially they surely will not be purchasing cars, houses, or other ‘big ticket items’. Would this not stagnate the greater economy?

Would a 100%-efficient company have anyone working there?

My authorship of this script directly instigates the termination of those employees. The causative relationship is undeniable.

Such scenarios are drenched with hubris, as such mechanisms are en route to replacing developers as well. In this we are the architects of our own obsolescence and ultimate demise: it is pure arrogance to assume such devices would not also be applied to our craft.

Some may argue that apparatuses are in place to mitigate such effects, or that the evolution of the market warrants the employees’ termination: ‘They have become obsolete and must retool to stay competitive’, or ‘that is what welfare is for’, or ‘universal basic income is the future for this very reason’. Such comments do not address my question, ‘If one could write a script to replace a large group of people’s jobs, should they?’; rather, they address the symptom, or after-effects, of such a decision — the employees are terminated, now what?

Perhaps this is the issue?

At the risk of sounding defensive I must note that I am not one to resist change. Resistance to change in our particular field is a doomed prospect to say the least. But one must address the social and economic implications of their decisions. One must have a conscience.

I do not have an answer. The creation of software is a technical achievement, a work of art, a labor of love, and wildly creative. It behooves those who embark on such journeys to consider their implications. Perhaps it is our hubristic tendencies as developers, or our arrogance, that drives us to construct our own monsters. Dr. Frankenstein would surely have words with us.

 
