What do Smartphone Predictive Text and Cybersecurity have in common?

Maybe the link between your smartphone keyboard and current machine learning research in cybersecurity is not apparent at first glance, but the technology behind both is extremely similar: both leverage deep learning architectures called Recurrent Neural Networks [RNNs], specifically a type of RNN called Long Short Term Memory [LSTM].

One of the main advantages of LSTMs is their ability to deal with sequences very well. Due to the composition of the building blocks of LSTMs, these RNNs are able to predict the next step in a sequence given previous steps by taking into account not only the statistical properties of a sequence in question (e.g. frequency) but also the temporal properties of a sequence. To give a practical example of “temporal properties”, let’s imagine a simplistic example. Say an LSTM has been trained with sequences similar to the following:

previous steps -> next step

“1 1 1” -> 2

“4 4 4” -> 5

Given the never-before-seen sequence of “8 8 8” the LSTM is very well able to predict “9” correctly. This may seem simplistic but a neural network typically deals with thousands or millions of different sequences, but the LSTM is anyway capable of learning the intuitive rule in our example that if you see three repeated numbers, the next number is simply +1. This is different from spatial or frequency based machine learning techniques (such as One Class SVMs) where a never-before-seen sequence gets classified as an anomaly — precisely because it’s never been seen before.

Your smartphone keyboard is actually powered by deep learning

You probably use LSTMs every day without realizing it — in the form of the predictive text suggestions that appear whenever you are typing something in your smartphone. As we just explained, LSTMs are very good with sequences. Sequences can just as well be letters rather than numbers. So given enough training, given a previous sequence of letters, an LSTM gets very good at suggesting the next letter, couple of letters, or the whole word.

The screenshot above is familiar to all of you… start typing and given a sequence of characters, the LSTM will predict the most probable next few characters. These “predictions” are what we call suggestions.



Where things get interesting for cybersecurity analysts is what happens when we feed an LSTM a sequence of characters which are abnormal.



An example of doing this on your smartphone is shown above. When we feed the LSTM an abnormal sequence of characters, it cannot predict with any certainty what the next character is. This manifests itself in very limited suggestions. In the screenshot, note how the keyboard suggestions are limited to the sequence itself (LSTM could not predict the next character, or it simply prepends common characters).


The cybersecurity tie-in

One man’s trash is another man’s gold. While the above might not seem very useful to the smartphone user — it is to a cybersecurity analyst who is looking for anomalies within the millions of logs that are generated by security devices.

For example, let’s consider CyberSift’s Docker anomaly detection engine. The concept is pretty simple: detect anomalous sequences of system calls. Any operating system’s activity can be characterized as a stream of system calls like so:


We can imagine each system call as being a character or number in a longer sequence — exactly what LSTM is designed to handle. To give a practical example, let’s imagine we are using an LSTM that has been trained on common sequences of system calls. Next, we see how the LSTM reacts when we ask it to predict the next system call, given a sequence of syscalls which is relatively common. The LSTM output could look similar to this:



The above graph shows that the LSTM is 90% certain that the next syscall is going to be “open”. Similar to what we saw before with the smartphone keyboard, the LSTM network has a good chance of being correct.

Contrast this to what happens when we feed the LSTM network an unusual syscall sequence. Just like before, the LSTM network will get confused and give very uncertain predictions:


The above graph still shows “open” as being the next most probable system call, but the network is a lot less certain about it (16% vs the 90% we had previously)



This is exactly how CyberSift leverages deep learning to help detect anomalies in your docker environment — or to detect anomalies within your logs, highlighting those sequences that are different or unusual and therefore are more worthy of your limited time.

These types of protections are becoming increasingly important as novel attacks are discovered against docker and other systems which do not necessarily trigger signatures, but definitely generate anomalous behavior.


Source: https://threatpost.com/attack-uses-docker-containers-to-hide-persist-plant-malware/126992/


Consider the attack presented in Black Hat just last month — where hackers were able to spin up a docker container just by having a target visiting a specially crafted webpage. Their attack consists of leveraging the docker API to start a container and then use that to laterally attack the network. In a busy docker environment, where containers are being started and stopped multiple times within a short period of time, keeping your eye on all the containers being started may be a bit too much to handle, but as we can see from CyberSift’s anomaly detection engine output below — starting a container that performs unusual actions shows up as a highly anomalous period:


Note the significantly higher anomaly score for the time period where a docker container was spun up and performed a range of lateral attacks and data exfiltration. For further information about the test environment used to capture the above results, please have a quick read here


For more posts like this, written in my capacity as CTO of CyberSift, please follow us on Medium! We include more technical, marketing, and management articles all relating to InfoSec



Anomaly detection vs Ransomware

A big part of what we do at CyberSift is anomaly detection. The recent WannaCry attack highlighted the growing threat of ransomware in the security landscape. The WannaCry authors may have made amateur mistakes, and there may be more stealthy and profitable attacks than WannaCry, but the negative impact it has had on Windows users (as it turns out… especially Windows 7 users) is undeniable — even bringing UK’s NHS to a halt. Microsoft promptly issued a patch, and vendors started releasing signatures to detect WannaCry — mostly lists of file hashes or domains:

Sample WannaCry filehashes
Sample WannaCry domains

It’s still a largely a game of cat and mouse. A simple update and the above lists become invalid. CyberSift already does a pretty good job of detecting ransomware activity via it’s DNS module (one notices the domains shown above look nothing like the usual english domains the majority of users visit — a dead giveaway for CyberSift)… but we wanted to take the concept of anomaly detection further and try help block ransomware as it happens — not after the fact.

We don’t usually venture into the realm of endpoint protection, but in this particular case we did — and we’re releasing a Windows anti-ransomware tool called “RansomSift”. Ransomsift doesn’t contain any signatures — no file hashes or domains, it’s a pure anomaly-based system. In this two part blog post we’ll explore the methods we used.

First — let’s see RansomSift in action against WannaCry on a Windows XP machine:

Anomaly Indicators

How do we achieve the above? Ransomsift uses two classes of anomaly detectors:

  1. File-based statistical indicators. In this blog post we’ll highlight what exactly we use to detect when files are being encrypted — and why it doesn’t always work
  2. OS based statistical indicators. In the next part of this two part series, we’ll explore which Operating System features RansomSift monitors for anomalies to further reduce false positives and block ransomware quicker.

Note: for these tests all files were encrypted using AES 256 using the command below (this becomes important later on so keep this in mind…)

openssl enc -aes-256-cbc -a -pass pass:word

File-based statistical indicators

A couple of academic research papers deal with detecting ransomware as it encrypts files [1][2]. The basic idea is that if one where to plot the histogram of data in a “normal” file and compare it to that of an encrypted file, there are differences that can be detected. Let’s have a quick rundown of the methods used:

Shannon Entropy

A simplistic explanation of entropy is “randomness” in a file. If we compare the probability distribution of an unencrypted file (MS WORD DOCX in this particular case) and an the same file encrypted we see the following:

A Microsoft Word file (unencrypted vs encrypted)

Clearly, the unencrypted file is a lot more “random” than the encrypted file. This feature holds true across multiple file types:


Another statistical feature we can measure is “skewness”. The below diagram sums up the concept of skewness:

source: https://www.kullabs.com/uploads/skewness1.jpg

Since the histograms are different, the unencrypted and encrypted versions of a file have different skewness. Plotting this for different file types we get another marked difference (though it looks like HTML would give us some difficulty here):


The last measure we looked at was “kurtosis”. Again, a simple diagram explains the concept succinctly:

Source: https://stats.stackexchange.com/questions/84158/how-is-the-kurtosis-of-a-distribution-related-to-the-geometry-of-the-density-fun

Again, plotting kurtosis for different file types we get quite a difference (though again had we to rely on this statistic only, we’d have problems with TXT, HTML and DOC) :

Victories and Defeats

The above results alone are quite convincing. By combining the weak models and having each of them “vote” if a file is encrypted or not, we end up with a strong model that can tell with a good deal of reliability if a file is encrypted or not. RansomSift leverages this concept by monitoring files that have been changed in the “My Documents” directory, and determines if the file has been encrypted or not.

However, depending on this file-based statistical approach alone is not enough. During testing we ran into a couple of false positives (files being marked as encrypted when they are not) and false negatives (files being marked as not encrypted when in fact they are):

  • Compressed files are extremely similar to some forms of encryption. Depending on how files are compressed, and how they are subsequently encrypted, they both look like very random byte streams so their histograms would look very similar. This becomes quite an issue when you consider that nowadays programs like MS OFFICE compresses it’s files (DOCX, XLSX, etc…). Depending on the encryption scheme used, it’s hard to tell them apart using just statistics

In this series of tests we used openssl to encrypt our files — just as a malware author might do. However, there is more than one way to encrypt a file. You could:

  • Use a popular tool like AxCrypt. The encryption and compression used by this program makes it harder to tell files apart statistically speaking. During testing we found similar behavior with some other compression / encryption programs

Since we try to make our anomaly detection systems as robust as possible, we added another layer of anomaly detection that doesn’t depend on file statistics. In the next blog post we’ll explore the operating system features that we monitor in order to detect (and block) suspect activity such as WannaCry or other ransomware.

Interested in trying out RansomSift or CyberSift for enterprise? Contact Us!




[1] Scaife, N., Carter, H., Traynor, P. and Butler, K.R., 2016, June. Cryptolock (and drop it): stopping ransomware attacks on user data. In Distributed Computing Systems (ICDCS), 2016 IEEE 36th International Conference on (pp. 303–312). IEEE.

[2] Mbol, F., Robert, J.M. and Sadighian, A., 2016, November. An efficient approach to detect torrentlocker ransomware in computer systems. In International Conference on Cryptology and Network Security (pp. 532–541). Springer International Publishing.


The code used to generate the above statistics can be found below (written in GOLANG v1.8). Please note this is not the actual RansomSift code.