What do Smartphone Predictive Text and Cybersecurity have in common?

The link between your smartphone keyboard and current machine learning research in cybersecurity may not be apparent at first glance, but the technology behind both is very similar: both leverage deep learning architectures called Recurrent Neural Networks (RNNs), specifically a type of RNN called Long Short-Term Memory (LSTM).

One of the main advantages of LSTMs is how well they handle sequences. Because of the way their building blocks are composed, these RNNs can predict the next step in a sequence from the previous steps, taking into account not only the statistical properties of the sequence (e.g. frequency) but also its temporal properties, that is, the order in which items occur. To illustrate “temporal properties”, let’s take a simple example. Say an LSTM has been trained with sequences similar to the following:

previous steps -> next step

“1 1 1” -> 2

“4 4 4” -> 5

Given the never-before-seen sequence “8 8 8”, the LSTM can correctly predict “9”. The example may seem simplistic, and a real neural network typically deals with thousands or millions of different sequences, but the LSTM is still capable of learning the intuitive rule in our example: if you see three repeated numbers, the next number is that number plus one. This is different from spatial or frequency-based machine learning techniques (such as One-Class SVMs), where a never-before-seen sequence gets classified as an anomaly precisely because it has never been seen before.
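To make the contrast concrete, here is a minimal Python sketch. It is not an LSTM: a lookup table stands in for a frequency-style detector, and a single fitted offset stands in for a model that learns the underlying rule. The function names are hypothetical and chosen only for this illustration.

```python
# Training pairs in the form (previous steps, next step), as in the example above.
training = [((1, 1, 1), 2), ((4, 4, 4), 5)]


def lookup_predict(sequence, pairs):
    """Frequency-style stand-in: only knows sequences it has literally seen."""
    seen = dict(pairs)
    # None means "never seen before", which such a detector would flag as anomalous.
    return seen.get(tuple(sequence))


def fit_offset(pairs):
    """Rule-learning stand-in: average gap between the next step and the last step."""
    diffs = [nxt - prev[-1] for prev, nxt in pairs]
    return sum(diffs) / len(diffs)


def offset_predict(sequence, offset):
    """Apply the learned gap to the last element of a new sequence."""
    return sequence[-1] + offset


offset = fit_offset(training)
unseen = (8, 8, 8)
lookup_result = lookup_predict(unseen, training)
learned_result = offset_predict(unseen, offset)
print(lookup_result, learned_result)
```

The lookup has nothing to say about “8 8 8”, while the fitted rule carries over to it. An LSTM learns far richer rules than a single offset, but the generalization idea is the same.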

Your smartphone keyboard is actually powered by deep learning

You probably use LSTMs every day without realizing it, in the form of the predictive text suggestions that appear whenever you type on your smartphone. As we just explained, LSTMs are very good with sequences, and those sequences can just as well be letters rather than numbers. Given enough training and a previous sequence of letters, an LSTM gets very good at suggesting the next letter, the next couple of letters, or the whole word.

The screenshot above will be familiar to all of you: start typing and, given a sequence of characters, the LSTM will predict the most probable next few characters. These “predictions” are what we call suggestions.



Where things get interesting for cybersecurity analysts is what happens when we feed an LSTM an abnormal sequence of characters.



An example of doing this on your smartphone is shown above. When we feed the LSTM an abnormal sequence of characters, it cannot predict with any certainty what the next character is. This manifests itself in very limited suggestions. In the screenshot, note how the keyboard suggestions are limited to the sequence itself (LSTM could not predict the next character, or it simply prepends common characters).


The cybersecurity tie-in

One man’s trash is another man’s gold. While the above might not seem very useful to the smartphone user, it is very useful to a cybersecurity analyst looking for anomalies within the millions of logs generated by security devices.

For example, let’s consider CyberSift’s Docker anomaly detection engine. The concept is pretty simple: detect anomalous sequences of system calls. Any operating system’s activity can be characterized as a stream of system calls like so:


We can imagine each system call as being a character or number in a longer sequence, which is exactly what an LSTM is designed to handle. To give a practical example, let’s imagine we are using an LSTM that has been trained on common sequences of system calls. We then ask it to predict the next system call, given a relatively common sequence of syscalls. The LSTM output could look similar to this:
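Before an LSTM can be trained on such a stream, the syscall names have to be turned into numbers and cut into “previous steps -> next step” pairs. The sketch below shows one simple way this could be done. The trace is made up, and the window size of 3 and the helper names are assumptions for illustration only.

```python
def build_vocabulary(syscalls):
    """Map each distinct syscall name to an integer id, in order of first appearance."""
    vocab = {}
    for name in syscalls:
        vocab.setdefault(name, len(vocab))
    return vocab


def make_training_pairs(tokens, window=3):
    """Slide a fixed-size window over the stream: (previous steps, next step)."""
    return [
        (tuple(tokens[i:i + window]), tokens[i + window])
        for i in range(len(tokens) - window)
    ]


# Made-up trace; a real one would be captured from the running system.
trace = ["open", "read", "read", "close", "open", "read", "close"]

vocab = build_vocabulary(trace)
tokens = [vocab[name] for name in trace]
pairs = make_training_pairs(tokens, window=3)

for previous_steps, next_step in pairs:
    print(previous_steps, "->", next_step)
```

Each pair has the same shape as the “1 1 1” -> 2 example earlier, only with syscall ids instead of plain numbers.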



The above graph shows that the LSTM is 90% certain that the next syscall is going to be “open”. Similar to what we saw before with the smartphone keyboard, the LSTM network has a good chance of being correct.

Contrast this with what happens when we feed the LSTM network an unusual syscall sequence. Just like before, the LSTM network will get confused and give very uncertain predictions:


The above graph still shows “open” as being the next most probable system call, but the network is a lot less certain about it (16% vs the 90% we had previously).
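One common way to turn this kind of uncertainty into a single number is the negative log of the probability the model assigned to the syscall that actually occurred. The sketch below assumes that scoring rule; it is not necessarily how CyberSift’s engine computes its score. Only the 90% and 16% figures come from the graphs described above, and the remaining probabilities are invented.

```python
import math


def surprisal(predicted, observed, floor=1e-9):
    """Negative log-probability the model gave to the syscall that actually occurred."""
    # The floor avoids log(0) when the model gave the observed call no probability.
    return -math.log(max(predicted.get(observed, 0.0), floor))


# Hypothetical model outputs (partial distributions, the rest of the
# probability mass is assumed to be spread over other syscalls).
common_context = {"open": 0.90, "read": 0.06, "close": 0.04}
unusual_context = {"open": 0.16, "read": 0.15, "close": 0.14, "mmap": 0.13}

score_common = surprisal(common_context, "open")
score_unusual = surprisal(unusual_context, "open")

print(round(score_common, 2), round(score_unusual, 2))
```

Even though “open” is the top prediction in both cases, the unusual context produces a much larger score, which is what lets it stand out on a timeline.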



This is exactly how CyberSift leverages deep learning to help detect anomalies in your Docker environment or within your logs, highlighting the sequences that are different or unusual and therefore more worthy of your limited time.

These types of protections are becoming increasingly important as novel attacks are discovered against Docker and other systems, attacks which do not necessarily trigger signatures but definitely generate anomalous behavior.


Source: https://threatpost.com/attack-uses-docker-containers-to-hide-persist-plant-malware/126992/


Consider the attack presented at Black Hat just last month, where hackers were able to spin up a Docker container just by having a target visit a specially crafted webpage. The attack leverages the Docker API to start a container and then uses that container to attack the network laterally. In a busy Docker environment, where containers are started and stopped many times within a short period, keeping an eye on every container being started may be too much to handle. However, as we can see from CyberSift’s anomaly detection engine output below, starting a container that performs unusual actions shows up as a highly anomalous period:


Note the significantly higher anomaly score for the time period where a Docker container was spun up and performed a range of lateral attacks and data exfiltration. For further information about the test environment used to capture the above results, please have a quick read here


For more posts like this, written in my capacity as CTO of CyberSift, please follow us on Medium! We include more technical, marketing, and management articles all relating to InfoSec



Cyber Security: Sparse coding and anomaly detection

I’ve recently published the thesis I wrote in fulfillment of my Master’s in Computer Security, entitled

BioRFID: A Patient Identification System using Biometrics and RFID

Anyone interested can download and read the whole thesis here:


In this article I’ll give an extremely compressed version of the thesis and how the work therein can be translated to the cybersecurity domain – along with some practical code to illustrate my points.

In the physical world, we often translate visual data from one “dimension” to another. For example, in the picture below, the left-hand side shows a view using night vision, in which we are still unable to pick out any “anomalies”. The anomaly (a standing person) becomes clear when we translate the night-vision goggle data to infrared instead: as can be seen on the right-hand side, we lose some image detail but can now easily pick out our “anomaly”.


In machine learning, we spend a lot of time trying to find “dimensions” in which to represent our data so that the anomalies we’re looking for stand out far more than if we leave the data in its original form. There are a multitude of dimensions we can use; the one presented in the thesis is called “Sparse Coding”. The essence of sparse coding can be explained by examining the figure below:


Imagine we have a set of data (images of a forest in the figure above). We can pass this data through a “dictionary learner”. The job of the dictionary learner is to decompose our data into a set of unique “bases” or “atoms”. This works just like a language dictionary in the real world: any sentence I write can be decomposed into individual words, each of which can be looked up in the dictionary, and the same words can be recombined to construct new sentences.


Similarly, in the example above, any picture can be decomposed into bases or atoms found in the dictionary we just built from our training data. In the specific example in the figure, the bottom “test example” is expressed in terms of three bases, each in a different proportion (0.8 for the first one, 0.3 for the second one, and 0.5 for the last one).
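In code, “expressed in terms of three bases” is just a weighted sum. The sketch below uses the 0.8, 0.3 and 0.5 proportions from the figure, while the atoms themselves are tiny made-up four-value patches standing in for the image bases.

```python
# Three hypothetical atoms, each a tiny four-value "image patch".
atoms = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]

# Proportions taken from the figure description.
weights = [0.8, 0.3, 0.5]


def reconstruct(weights, atoms):
    """Weighted sum of atoms: one output value per position in the patch."""
    return [
        sum(w * atom[i] for w, atom in zip(weights, atoms))
        for i in range(len(atoms[0]))
    ]


patch = reconstruct(weights, atoms)
print(patch)
```

The list of weights is the “sparse code” for the patch. In a realistic dictionary with many atoms, most of those weights would be zero, which is where the name comes from.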

Applying this to Cyber Security

Intuitively, such a system will struggle to express data it has never seen before, because it lacks the words or bases needed to decompose this data. Similarly, unusual or uncommon data will be expressed using a different set of words than those used to express common or normal data. Let’s test this theory.

Take the following practical scenario:

You collect data logs from your firewall every 5 minutes. Being a good DevOps engineer, you write a quick script to summarize this data, converting all the data in each 5-minute time window to:

  • The destination BGP AS number (because tracking each individual destination IP provides too many entries…)
  • The bytes transferred between your network and the destination AS number during those 5 minutes
  • The number of clients in your network that communicated with the destination AS number
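A summarizing script along these lines could look like the sketch below. The record fields (timestamp, client_ip, dest_as, bytes) are assumptions made for the example rather than a real firewall log format, and the standard library is used to keep it self-contained; a pandas groupby would do the same job.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60


def summarize(records):
    """Group records by (5-minute window, destination AS); total bytes, count clients."""
    total_bytes = defaultdict(int)
    clients = defaultdict(set)
    for rec in records:
        # Round the timestamp down to the start of its 5-minute window.
        window_start = rec["timestamp"] - rec["timestamp"] % WINDOW_SECONDS
        key = (window_start, rec["dest_as"])
        total_bytes[key] += rec["bytes"]
        clients[key].add(rec["client_ip"])
    return [
        {
            "window_start": window,
            "dest_as": asn,
            "bytes": total_bytes[(window, asn)],
            "clients": len(clients[(window, asn)]),
        }
        for (window, asn) in sorted(total_bytes)
    ]


# Made-up records with timestamps in seconds.
records = [
    {"timestamp": 10, "client_ip": "10.0.0.1", "dest_as": 100, "bytes": 40},
    {"timestamp": 200, "client_ip": "10.0.0.2", "dest_as": 100, "bytes": 60},
    {"timestamp": 250, "client_ip": "10.0.0.1", "dest_as": 200, "bytes": 120},
    {"timestamp": 310, "client_ip": "10.0.0.1", "dest_as": 100, "bytes": 5},
]

summary = summarize(records)
for row in summary:
    print(row)
```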

You would end up with a dataset that looks something like the one below. I built this data set using LibreOffice Calc, randomly generating the numbers for each entry. The only exception is the last entry, which I purposely made anomalous for demo purposes.

Now, you are required to find any anomalous or weird data within these entries. Ideally, you should also be able to use your work to determine whether future data points are anomalous.

We can apply the sparse coding principles I introduced in this article as follows, using Python, pandas and SciPy:

The above code basically uses sparse coding to translate our data from one dimension to another (keep in mind that doing so often lets us pick out details that are otherwise hidden, as in our night vision vs infrared example). The resulting data is shown at the end of the article, but it’s easier to visualise the data as a plot, shown below:


We immediately note three anomalies. One corresponds to the purposely anomalous data point I inserted at the end of our toy data set (as expected), while the other two are anomalies introduced by the randomly generated numbers. If we examine these further, it turns out that both come from AS number “200”, which typically has a “number of bytes transferred” over 100. For these two cases, however, the number of bytes transferred was lower than expected, at about 80.

And there you have it: a quick and easy way of detecting anomalous data from firewall logs. Not only that, but you can use the dictionary generated by your code to check whether new data points are anomalous. Of course, this method doesn’t cover all cases and probably has its own set of problems, but it’s a very good start considering the minimal amount of work we just put in.
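As a rough illustration of checking new data points against a dictionary, the sketch below codes each point with a greedy matching pursuit and uses the size of the leftover residual as an anomaly score. Both choices are assumptions made for this example. The two unit-length atoms are written by hand rather than learned, and the three features are imagined to be already scaled.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(x, atoms, n_nonzero=2):
    """Greedy sparse coding: repeatedly pick the atom most correlated with the residual.

    Assumes every atom has unit length.
    """
    residual = list(x)
    code = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        scores = [dot(residual, atom) for atom in atoms]
        best = max(range(len(atoms)), key=lambda k: abs(scores[k]))
        code[best] += scores[best]
        residual = [r - scores[best] * a for r, a in zip(residual, atoms[best])]
    return code, residual


def anomaly_score(x, atoms):
    """Length of whatever the dictionary could not express."""
    _, residual = matching_pursuit(x, atoms)
    return math.sqrt(dot(residual, residual))


# Hand-written dictionary: in "normal" traffic the first two features rise together.
atoms = [
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]

normal_point = [1.2, 1.6, 0.5]    # 2.0 * first atom + 0.5 * second atom
unusual_point = [1.6, -1.2, 0.0]  # a direction no atom can express

normal_score = anomaly_score(normal_point, atoms)
unusual_score = anomaly_score(unusual_point, atoms)
print(normal_score, unusual_score)
```

The normal point is rebuilt almost perfectly from the two atoms, so its score is close to zero, while the unusual point is left almost entirely in the residual.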

At CyberSift we develop more advanced techniques which leverage machine learning and artificial intelligence to perform anomaly detection as we presented above, but on a much more advanced scale and in a more user-friendly manner. Check us out!

Resulting data after sparse coding: