Cyber Security: Sparse coding and anomaly detection

I’ve recently published the thesis I wrote in fulfillment of my Master’s in Computer Security, entitled

BioRFID: A Patient Identification System using Biometrics and RFID

Anyone interested can download and read the whole thesis here:

In this article I’ll give an extremely compressed version of the thesis and how the work therein can be translated to the cybersecurity domain – along with some practical code to illustrate my points.

In the physical world, we often translate visual data from one “dimension” to another. For example, in the picture below, the left-hand side shows a night-vision view in which we’re still unable to pick out any “anomalies”. The anomaly (a standing person) becomes pretty clear when we translate the night-goggle data to infrared instead: as can be seen on the right-hand side, we lose some image detail but can now easily pick out our “anomaly”.


In machine learning, we spend a lot of time trying to find “dimensions” that represent our data in such a way that the anomalies we’re looking for stand out far more than if we left the data in its original form. There are a multitude of dimensions we can use; the one presented in the thesis is called “Sparse Coding”. The essence of sparse coding can be explained by examining the figure below:


Imagine we have a set of data (images of a forest in the figure above). We can pass this data through a “dictionary learner”. The job of the dictionary learner is to decompose our data into a set of unique “bases” or “atoms”. This works just like a language dictionary in the real world, which can be used to construct sentences: any sentence I write can be decomposed into individual words that can subsequently be looked up in a dictionary.


Similarly, in our example above, any picture can be decomposed into bases or atoms found in the dictionary we just built from our training data. In the specific example in the figure, the bottom “test example” is expressed in terms of three bases, each in a different proportion (0.8 for the first one, 0.3 for the second one, and 0.5 for the last one).
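The decomposition in the figure amounts to a weighted sum of atoms. Here is a minimal Python sketch of that idea. Only the weights 0.8, 0.3 and 0.5 come from the figure; the four-value atoms are made up for illustration and stand in for learned image patches.

```python
# Hypothetical dictionary of five tiny "atoms"; a real one would be
# learned from training data and each atom would be an image patch.
atoms = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 2.0],
]

# Sparse code: most weights are zero, only three atoms are used.
weights = [0.8, 0.0, 0.3, 0.0, 0.5]


def reconstruct(atoms, weights):
    """Return the weighted sum of the atoms, element by element."""
    length = len(atoms[0])
    return [
        sum(w * atom[i] for w, atom in zip(weights, atoms))
        for i in range(length)
    ]


test_example = reconstruct(atoms, weights)
print(test_example)
```

The "sparse" part is that the weight list is mostly zeros: the example is described by three numbers and three atom indices rather than by every raw value.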

Applying this to Cyber Security

Intuitively, such a system will struggle to express data it has never seen before – because it lacks the words or bases to decompose this data. Similarly, unusual or uncommon data will be expressed using a different set of words than those used to express common or normal data. Let’s test this theory.

Take the following practical scenario:

You collect data logs from your firewall every 5 minutes. Being a good DevOps engineer, you write a quick script to summarize this data, converting all the data in each 5-minute time window to:

  • The destination BGP AS number (because tracking each individual destination IP provides too many entries…)
  • The bytes transferred between your network and the destination AS number during those 5 minutes
  • The number of clients in your network that communicated with the destination AS number
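The summarizing step described above could look roughly like the sketch below. The record layout (timestamp, client IP, destination AS number, bytes) and the sample values are assumptions made for illustration; real firewall logs would need parsing and an IP-to-AS lookup first.

```python
from collections import defaultdict

# Hypothetical pre-parsed firewall records:
# (unix_time_seconds, client_ip, destination_as_number, bytes_transferred)
records = [
    (0, "10.0.0.1", 100, 500),
    (60, "10.0.0.2", 100, 300),
    (120, "10.0.0.1", 200, 120),
    (310, "10.0.0.1", 100, 50),
]

WINDOW_SECONDS = 5 * 60


def summarise(records):
    """Group records by (5-minute window, AS number).

    Returns {(window_index, as_number): (total_bytes, distinct_clients)}.
    """
    totals = defaultdict(int)
    clients = defaultdict(set)
    for timestamp, client_ip, as_number, n_bytes in records:
        key = (timestamp // WINDOW_SECONDS, as_number)
        totals[key] += n_bytes
        clients[key].add(client_ip)
    return {key: (totals[key], len(clients[key])) for key in totals}


summary = summarise(records)
for key, value in sorted(summary.items()):
    print(key, value)
```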

You would end up with a dataset that looks something like the below. I built this data set using LibreOffice Calc, randomly generating numbers for each entry. The only exception is the last entry, which I purposely made anomalous for demo purposes.

Now, you are required to find any anomalous or weird data within these entries. Ideally, you should be able to use your work to calculate whether future data points are anomalous or not.

We can apply the sparse coding principles I introduced in this article, as follows – using python, pandas and scipy:

The above code is basically using sparse coding to translate our data from one dimension to another (keep in mind that doing so often lets us pick out details that are otherwise hidden, as in our night vision vs infrared example). The resulting data is shown at the end of the article, but it’s easier to visualise the data as a plot, shown below:


We immediately note three anomalies. One corresponds to the purposely anomalous data point I inserted at the end of our toy data set (as expected), while the other two are anomalies introduced by the randomly generated numbers. If we examine these further, it turns out that both come from AS number “200”, which typically has a “number of bytes transferred” of over 100. However, for these two cases the number of bytes transferred turned out to be lower than expected – at about 80.

And there you have it – a quick and easy way of detecting anomalous data from firewall logs. Not only that, but you can use the dictionary generated by your code to see if new data points are anomalous or not. Of course this method doesn’t cover all cases and probably has its own set of problems but it’s a very good start considering the minimal amount of work we just put in.
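To make the idea of checking new data points against an existing dictionary concrete, here is a small pure-Python sketch. It is not the code from this article: it swaps in a basic matching pursuit as the sparse coder, and the two-atom dictionary and the sample vectors are hand-written toy values rather than anything learned from the firewall data. A point that the dictionary can express with few atoms gets a low score; one it cannot express gets a high score.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to length 1, as matching pursuit expects of its atoms."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def sparse_code(signal, dictionary, n_atoms=1):
    """Matching pursuit: greedily pick the atom that best explains the residual."""
    residual = list(signal)
    code = [0.0] * len(dictionary)
    for _ in range(n_atoms):
        scores = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        code[best] += scores[best]
        residual = [r - scores[best] * a
                    for r, a in zip(residual, dictionary[best])]
    return code, residual


def anomaly_score(signal, dictionary, n_atoms=1):
    """Share of the signal's length that the sparse code fails to reconstruct."""
    signal_norm = math.sqrt(dot(signal, signal))
    if signal_norm == 0:
        return 0.0
    _, residual = sparse_code(signal, dictionary, n_atoms)
    return math.sqrt(dot(residual, residual)) / signal_norm


# Toy dictionary (assumption): two patterns seen in "normal" data.
dictionary = [unit([1.0, 1.0, 0.0]), unit([0.0, 1.0, 1.0])]

normal_score = anomaly_score([2.0, 2.0, 0.0], dictionary)
odd_score = anomaly_score([3.0, 0.0, 3.0], dictionary)
print(normal_score, odd_score)
```

Any cut-off between "normal" and "anomalous" scores would have to be chosen from your own data.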

At CyberSift we develop more advanced techniques which leverage machine learning and artificial intelligence to perform anomaly detection as we presented above – but on a much more advanced scale and in a more user friendly manner. Check us out!

Resulting data after sparse coding:



Anomaly detection vs Ransomware

A big part of what we do at CyberSift is anomaly detection. The recent WannaCry attack highlighted the growing threat of ransomware in the security landscape. The WannaCry authors may have made amateur mistakes, and there may be more stealthy and profitable attacks than WannaCry, but the negative impact it has had on Windows users (as it turns out… especially Windows 7 users) is undeniable, even bringing the UK’s NHS to a halt. Microsoft promptly issued a patch, and vendors started releasing signatures to detect WannaCry, mostly lists of file hashes or domains:

Sample WannaCry filehashes
Sample WannaCry domains

It’s still largely a game of cat and mouse. A simple update and the above lists become invalid. CyberSift already does a pretty good job of detecting ransomware activity via its DNS module (one notices the domains shown above look nothing like the usual English domains the majority of users visit, a dead giveaway for CyberSift)… but we wanted to take the concept of anomaly detection further and try to help block ransomware as it happens, not after the fact.

We don’t usually venture into the realm of endpoint protection, but in this particular case we did, and we’re releasing a Windows anti-ransomware tool called “RansomSift”. RansomSift doesn’t contain any signatures (no file hashes or domains); it’s a pure anomaly-based system. In this two-part blog post we’ll explore the methods we used.

First — let’s see RansomSift in action against WannaCry on a Windows XP machine:

Anomaly Indicators

How do we achieve the above? RansomSift uses two classes of anomaly detectors:

  1. File-based statistical indicators. In this blog post we’ll highlight exactly what we use to detect when files are being encrypted, and why it doesn’t always work.
  2. OS-based statistical indicators. In the next part of this two-part series, we’ll explore which operating system features RansomSift monitors for anomalies to further reduce false positives and block ransomware quicker.

Note: for these tests all files were encrypted using AES-256 with the command below (this becomes important later on so keep this in mind…):

openssl enc -aes-256-cbc -a -pass pass:word

File-based statistical indicators

A couple of academic research papers deal with detecting ransomware as it encrypts files [1][2]. The basic idea is that if one were to plot the histogram of the data in a “normal” file and compare it to that of an encrypted file, there are differences that can be detected. Let’s have a quick rundown of the methods used:

Shannon Entropy

A simplistic explanation of entropy is “randomness” in a file. If we compare the probability distribution of an unencrypted file (an MS Word DOCX in this particular case) with that of the same file encrypted, we see the following:

A Microsoft Word file (unencrypted vs encrypted)
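The entropy measure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the RansomSift code. As an assumption, random bytes from `os.urandom` stand in for raw ciphertext, and the Base64 step mirrors the `-a` flag in the openssl command quoted earlier.

```python
import base64
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy of the byte-value distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Stand-in for raw ciphertext (assumption): uniformly random bytes.
# The length is a multiple of 3 so the Base64 output needs no padding.
raw = os.urandom(3 * 16384)
encoded = base64.b64encode(raw)

raw_entropy = shannon_entropy(raw)
encoded_entropy = shannon_entropy(encoded)
print(round(raw_entropy, 3), round(encoded_entropy, 3))
```

Because `-a` Base64-encodes the output, the encrypted files in these tests draw on only 64 symbols (plus line breaks), which caps their byte entropy at roughly 6 bits per byte, whereas a compressed format such as DOCX can approach 8.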

Clearly, the unencrypted file is a lot more “random” than the encrypted file. This feature holds true across multiple file types:


Another statistical feature we can measure is “skewness”. The below diagram sums up the concept of skewness:


Since the histograms are different, the unencrypted and encrypted versions of a file have different skewness. Plotting this for different file types we get another marked difference (though it looks like HTML would give us some difficulty here):


The last measure we looked at was “kurtosis”. Again, a simple diagram explains the concept succinctly:


Again, plotting kurtosis for different file types we get quite a difference (though again, if we had to rely on this statistic alone, we’d have problems with TXT, HTML and DOC):
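Both skewness and kurtosis are standardised moments, so one helper covers them. The sketch below is a minimal illustration and assumes that each byte value in the file is treated as one sample; the post does not say exactly what the statistics were computed over, and this is not the RansomSift code.

```python
def standardised_moment(data: bytes, order: int) -> float:
    """Central moment of the byte values, divided by variance ** (order / 2)."""
    n = len(data)
    if n == 0:
        return 0.0
    mean = sum(data) / n
    variance = sum((b - mean) ** 2 for b in data) / n
    if variance == 0:
        return 0.0  # all bytes identical: no spread to measure
    moment = sum((b - mean) ** order for b in data) / n
    return moment / variance ** (order / 2)


def skewness(data: bytes) -> float:
    """Third standardised moment: 0 for a symmetric distribution."""
    return standardised_moment(data, 3)


def kurtosis(data: bytes) -> float:
    """Fourth standardised moment (not 'excess' kurtosis)."""
    return standardised_moment(data, 4)


flat = bytes(range(256))           # every byte value exactly once
lopsided = b"\x00" * 9 + b"\xff"   # mostly low values, one high outlier

print(skewness(flat), kurtosis(flat))
print(skewness(lopsided), kurtosis(lopsided))
```

A flat byte histogram gives a skewness of about 0 and a kurtosis of about 1.8, so values far from those hint at a file whose bytes are not evenly spread.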

Victories and Defeats

The above results alone are quite convincing. By combining the weak models and having each of them “vote” on whether a file is encrypted or not, we end up with a strong model that can tell with a good deal of reliability if a file is encrypted or not. RansomSift leverages this concept by monitoring files that have been changed in the “My Documents” directory and determining whether each file has been encrypted or not.
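The voting idea can be sketched as a simple majority vote over weak rules. The thresholds below, and even their directions, are invented placeholders and are not RansomSift’s values; in practice they would be tuned on samples of known encrypted and unencrypted files.

```python
from typing import Callable, Dict, List

Stats = Dict[str, float]


# Placeholder rules (assumptions): each returns True for "looks encrypted".
def entropy_rule(stats: Stats) -> bool:
    return stats["entropy"] < 6.1


def skewness_rule(stats: Stats) -> bool:
    return abs(stats["skewness"]) < 0.2


def kurtosis_rule(stats: Stats) -> bool:
    return stats["kurtosis"] < 2.0


RULES: List[Callable[[Stats], bool]] = [entropy_rule, skewness_rule, kurtosis_rule]


def looks_encrypted(stats: Stats, rules=RULES) -> bool:
    """Return True when a strict majority of the weak rules vote 'encrypted'."""
    votes = sum(1 for rule in rules if rule(stats))
    return 2 * votes > len(rules)


# Hypothetical statistics for one changed file.
sample = {"entropy": 6.0, "skewness": 0.05, "kurtosis": 2.4}
print(looks_encrypted(sample))
```

With an odd number of rules there are no ties, and a single misleading statistic (such as kurtosis for HTML) can be outvoted by the other two.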

However, depending on this file-based statistical approach alone is not enough. During testing we ran into a couple of false positives (files being marked as encrypted when they are not) and false negatives (files being marked as not encrypted when in fact they are):

  • Compressed files are extremely similar to some forms of encrypted output. Depending on how files are compressed, and how they are subsequently encrypted, both look like very random byte streams, so their histograms look very similar. This becomes quite an issue when you consider that nowadays programs like MS Office compress their files (DOCX, XLSX, etc.). Depending on the encryption scheme used, it’s hard to tell them apart using just statistics.

In this series of tests we used openssl to encrypt our files — just as a malware author might do. However, there is more than one way to encrypt a file. You could:

  • Use a popular tool like AxCrypt. The encryption and compression used by this program make it harder to tell files apart, statistically speaking. During testing we found similar behavior with some other compression / encryption programs.

Since we try to make our anomaly detection systems as robust as possible, we added another layer of anomaly detection that doesn’t depend on file statistics. In the next blog post we’ll explore the operating system features that we monitor in order to detect (and block) suspect activity such as WannaCry or other ransomware.

Interested in trying out RansomSift or CyberSift for enterprise? Contact Us!


[1] Scaife, N., Carter, H., Traynor, P. and Butler, K.R., 2016, June. Cryptolock (and drop it): stopping ransomware attacks on user data. In Distributed Computing Systems (ICDCS), 2016 IEEE 36th International Conference on (pp. 303–312). IEEE.

[2] Mbol, F., Robert, J.M. and Sadighian, A., 2016, November. An efficient approach to detect torrentlocker ransomware in computer systems. In International Conference on Cryptology and Network Security (pp. 532–541). Springer International Publishing.


The code used to generate the above statistics can be found below (written in GOLANG v1.8). Please note this is not the actual RansomSift code.