A Brief Introduction…
How well do you scrutinize the URLs that you click in a browser? Are you the wild type who click links before you read them? Or perhaps you are the cautious type — One of the careful number who hover over the link (without clicking), check the address that appears in the browser bar, and that it has a valid certificate (which might not mean so much as people think).
Let’s say that you are trying to get to ‘facebook.com’. You see a link and hover over it. The URL that appears in the browser bar says ‘facebook.com’. You click the link, but it takes you to a phishing site.
What happened? You checked the link. It said it was ‘facebook.com’!
You may have fallen victim to an IDN homograph attack. A homograph is a word that looks the same but has two meanings. In the example above, the user clicked a link that looked like ‘facebook.com’ but had a different meaning. How can that happen? The ‘how’ comes from the ‘IDN’ portion of the name: ‘internationalized domain name’. The URL in the address bar is actually pointing to a domain name. A domain name is the human-readable name of a website. The real address of a website (server) is an IP address, but people aren’t that great at remembering those so domain names make them memorable. For example, you can remember ‘facebook.com’ (the domain name) or you can remember ‘18.104.22.168’ (the ip address). The problem with domain names is that there are many languages. How do you support Mandarin characters in a domain name? Or Russian? You have to internationalize your ‘charset’ (The allowed characters, or individual letters). By internationalizing your charset, you now accept a far greater amount of characters than just simple English (latin-based) characters. This now lets people have domain names in their own language. But there is a down-side. In the charset, the same character may appear multiple times under different languages. For example, Cyrillic and Greek charsets (or alphabets) share similar characters (letters) to Latin. These characters will have separate identifiers, but will look very close, if not identical, to each other.
In the facebook example, the lower-case letter ‘o’ is identified by a character code ‘006F’ for the Latin alphabet (“Small letter o”), but could also be ’03BF’ in the Greek alphabet (small letter Omicron). If I were to register a domain name of facebook.com but with the Greek version of those letters, it may pass an internationalized domain name registration. I could then use this URL to trick people into clicking the link because it will visually look like the real thing. An even simpler example is the substitution of a capital ‘O’ with the number zero (0) in the domain name.
I wanted to see if I could detect these types of attacks and flag them using basic statistics.
My idea was quite simple. Each character in a charset can be broken down into their unique identifier (number) and alphabets (or charsets) are grouped together and thus their identifiers will be close together. The plan was to run basic statistical analysis (mean, median, max, min, and standard deviation) on the individual characters of a URL (or domain name). If there are characters that are outside the usual range, then throw a ‘red flag’. I also had to account for the fact that punctuation may throw off any analysis so there will have to be at least a few methods to identify anomalies — hence the calculation of mean, median, max, min, and standard deviations. I also avoided using neural networks for this particular problem although, as you will see in my summary/future works section, I can see merit in taking it that way.
I wanted a basic proof-of-concept system that would take a corpus of known valid URLs and derive a statistical model from them. Then I would use this model to compare new URLs and determine if they are within the same alphabet or if there was a possibility that they were homograph attacks.
You can see/fork/download the project code here: https://github.com/calebshortt/homograph_detector
The project is quite simple and has two files in the main ‘base’ package. The ‘meat’ of the system is in the analysis.py file. The ‘results.py’ file was used to manage and compare results.
The flow is as such:
- Train the system using either a given string or a file with many strings
- Pass the system either a given string of a file with many strings to test
- Generate results and output them
The comparison function simply takes the statistics for the given test string and compares it (based on a defined number of standard deviations — default 2) to the model statistics — these are based on mean, median, and the standard deviations themselves (I actually created a metric that is the standard deviation of standard deviations). It also compares the max and min ranges of the character identifiers — this is a crude way to check the number ranges.
Here is the comparison function code:
def compare(self, str_stats, stdev_threshold=2): """ :param str_stats: ResultSet object :param stdev_threshold: (int) number of standard deviations allowed :return: """ print('Analysing: Threshold: 2 standard deviations...') str_max = str_stats.result_max str_min = str_stats.result_min str_mean = str_stats.mean_means str_median = str_stats.mean_medians str_stdev = str_stats.mean_stdevs if not (self.result_min <= str_min <= str_max <= self.result_max): return False, str_stats.all_stats, 'max/min range' r_mean_low = self.mean_means - stdev_threshold*self.stdev_means r_mean_high = self.mean_means + stdev_threshold*self.stdev_means if not (r_mean_low <= str_mean <= r_mean_high): return False, str_stats.all_stats, 'mean' r_median_low = self.mean_medians - stdev_threshold*self.stdev_medians r_median_high = self.mean_medians + stdev_threshold*self.stdev_medians if not (r_median_low <= str_median <= r_median_high): return False, str_stats.all_stats, 'median' r_std_low = self.mean_stdevs - stdev_threshold*self.stdev_stdevs r_std_high = self.mean_stdevs + stdev_threshold*self.stdev_stdevs if not (r_std_low <= str_stdev <= r_std_high): return False, str_stats.all_stats, 'stdev' return True, str_stats.all_stats, None
NOTE: The above code compares the given string to the model ResultSet data in ‘self’.
Preliminary Results, Conclusion, and Future Work
The system was able to digest a Latin alphabet-based corpus (included in the source) and correctly identify URLs that had characters from other alphabets in them. The interesting thing about this method is that once the model is generated based on the corpus, it can be recorded and reused — no need to regenerate the model (which took the most time). The system was fast and quite accurate at first glance, although more analysis would be needed to make any real claims on its accuracy.
My file load function breaks on some charsets as they are not included in the default encoding for the python file-loading function (that or I missed something). I am working to fix this without clobbering the original encoding of the URL (which would defeat the purpose of this experiment).
This is already starting to look like a neural network would work nicely for this particular problem. I would like to feed the model stats into an ANN with perhaps some other inputs and see how it does. The basic stats I used were able to get decent results, but there are obvious edge cases that standard statistics wouldn’t identify (such as the old ‘substitute the 0 (zero) in for capital O’ trick). A neural network may help catch those.
I would say that it is quite possible to identify IDN homograph attacks using basic statistics and there are a few paths forward to improve accuracy with the results already demonstrated. With that said, nothing will compare to the standard ‘look before you click’ mentality. Users who identify the ‘0 instead of O’ substitution will not leave as much to chance. Systems like these aren’t catch-alls.