Cyber Security: Sparse coding and anomaly detection

I’ve recently published the thesis I wrote in fulfillment of my Masters in Computer Security, entitled

BioRFID: A Patient Identification System using Biometrics and RFID

Anyone interested can download and read the whole thesis here:

In this article I’ll give an extremely compressed version of the thesis and how the work therein can be translated to the cybersecurity domain – along with some practical code to illustrate my points.

In the physical world, we often translate visual data from one “dimension” to another. For example, in the picture below, the left-hand side shows a view through night vision – and we’re still unable to pick out any “anomalies”. The anomaly (a standing person) becomes clear when we translate the night vision data to infrared instead: as can be seen on the right-hand side, we lose some image detail but can now easily pick out our “anomaly”.


In machine learning, we spend a lot of time trying to find “dimensions” that represent our data in a way that makes the anomalies we’re looking for stand out far more than they would in the data’s original form. There are a multitude of dimensions we can use; the one presented in the thesis is called “Sparse Coding”. The essence of sparse coding can be explained by examining the figure below:


Imagine we have a set of data (images of a forest in the figure above). We can pass this data through a “dictionary learner“. The job of the dictionary learner is to decompose our data into a set of unique “bases” or “atoms“. Just like in the real world, a language dictionary can be used to construct sentences. Any sentence I write can be decomposed into individual words that can subsequently be looked up in a dictionary.


Similarly, in our example above, any picture can be decomposed into bases or atoms which can be found in the dictionary we just built from our training data. In the specific example in the figure, the bottom “test example” is expressed in terms of three bases, each in a different proportion (0.8 for the first one, 0.3 for the second one, and 0.5 for the last one).
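To make the “proportions” idea concrete, here is a minimal sketch in plain Python. The five 4-value atoms are made up for illustration (they stand in for the image bases in the figure), and `combine` is a hypothetical helper, not code from the thesis. Only the weights 0.8, 0.3 and 0.5 come from the example above.

```python
def combine(atoms, weights):
    """Return the weighted sum of the atoms (all atoms share one length)."""
    length = len(atoms[0])
    return [sum(w * atom[i] for w, atom in zip(weights, atoms))
            for i in range(length)]

# A made-up dictionary of five tiny atoms (assumption, for illustration only).
dictionary = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

# The "sparse code": most weights are zero, and only three atoms are used,
# in the proportions 0.8, 0.3 and 0.5.
sparse_code = [0.8, 0.0, 0.3, 0.0, 0.5]

test_example = combine(dictionary, sparse_code)
print(test_example)
```

The code is “sparse” because most of the weights are zero: the test example is described by three numbers and three dictionary entries rather than by every raw value.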

Applying this to Cyber Security

Intuitively, such a system will struggle to express data it has never seen before – because it lacks the words or bases to decompose this data. Similarly, unusual or uncommon data will be expressed using a different set of words than those used to express common or normal data. Let’s test this theory.
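One way to see this intuition in code is to measure how much of a data point is left unexplained after encoding it with the dictionary. The sketch below uses greedy matching pursuit, which is just one simple sparse coding method I picked for illustration, together with a made-up two-atom dictionary. All function and variable names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(signal, atoms, n_nonzero=2):
    """Greedy sparse coding. Atoms are assumed to have unit length.
    Returns (weights, residual) after at most n_nonzero greedy steps."""
    residual = list(signal)
    weights = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        # Pick the atom that best matches what is still unexplained.
        scores = [dot(residual, atom) for atom in atoms]
        best = max(range(len(atoms)), key=lambda k: abs(scores[k]))
        weights[best] += scores[best]
        residual = [r - scores[best] * a
                    for r, a in zip(residual, atoms[best])]
    return weights, residual

def reconstruction_error(signal, atoms, n_nonzero=2):
    """Length of the part of the signal the dictionary could not express."""
    _, residual = matching_pursuit(signal, atoms, n_nonzero)
    return math.sqrt(dot(residual, residual))

# Made-up dictionary "learned" from normal data: it only has words for
# the first two coordinates.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

normal_point = [0.9, 0.4, 0.0]   # looks like the training data
unseen_point = [0.1, 0.2, 0.9]   # mostly lives where we have no atoms

normal_error = reconstruction_error(normal_point, atoms)
unseen_error = reconstruction_error(unseen_point, atoms)
print(normal_error, unseen_error)
```

The normal point is reconstructed almost perfectly, while most of the unseen point is left over as residual. That leftover is one possible anomaly score.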

Take the following practical scenario:

You collect data logs from your firewall every 5 minutes. Being a good DevOps engineer, you write a quick script to summarize this data, converting all the data in each 5-minute time window to:

  • The destination BGP AS number (because tracking each individual destination IP provides too many entries…)
  • The bytes transferred between your network and the destination AS number during those 5 minutes
  • The number of clients in your network that communicated with the destination AS number
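A minimal sketch of such a summary script is shown here. It assumes each log record has already been reduced to a `(unix_timestamp, client_ip, dest_asn, n_bytes)` tuple, so the IP-to-AS lookup is out of scope. The record layout, the `summarise` function and the sample values are all assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute windows

def summarise(records):
    """Group records by (window start, destination AS number).
    Returns {(window_start, dest_asn): {"bytes": int, "clients": int}}."""
    total_bytes = defaultdict(int)
    clients = defaultdict(set)
    for timestamp, client_ip, dest_asn, n_bytes in records:
        # Round the timestamp down to the start of its 5-minute window.
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        key = (window_start, dest_asn)
        total_bytes[key] += n_bytes
        clients[key].add(client_ip)   # a set counts each client only once
    return {key: {"bytes": total_bytes[key], "clients": len(clients[key])}
            for key in total_bytes}

# Made-up records: (unix_timestamp, client_ip, dest_asn, n_bytes)
records = [
    (0,   "192.0.2.10", 64500, 120),
    (30,  "192.0.2.11", 64500, 80),
    (200, "192.0.2.10", 64500, 50),
    (299, "192.0.2.10", 64501, 10),
    (300, "192.0.2.12", 64500, 70),
]

summary = summarise(records)
for key, row in sorted(summary.items()):
    print(key, row)
```

Each resulting row holds the three values from the list above: the destination AS number, the bytes transferred, and the number of distinct clients for that window.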

You would end up with a dataset that looks something like the one below. I built this data set using LibreOffice Calc, randomly generating numbers for each entry. The only exception is the last entry, which I purposely made anomalous for demo purposes.

Now, you are required to find any anomalous or weird data within these entries. Ideally, you should be able to use your work to calculate whether future data points are anomalous or not.

We can apply the sparse coding principles I introduced in this article, as follows – using Python, pandas and scipy:

The above code is basically using sparse coding to translate our data from one dimension to another (keep in mind that when doing so we can often pick out details that are normally hidden, as in our night vision vs infrared example). The resulting data is shown at the end of the article, but it’s easier to visualise the data as a plot, shown below:


We immediately note three anomalies. One translates to the purposely anomalous data point I inserted at the end of our toy data set (as expected), while the other two are anomalies introduced by the random numbers generated. If we examine these further, it turns out that both these anomalies come from AS number “200”, which typically has a “number of bytes transferred” of over 100. In these two cases, however, the number of bytes transferred turned out to be lower than expected, at about 80.

And there you have it – a quick and easy way of detecting anomalous data from firewall logs. Not only that, but you can use the dictionary generated by your code to see if new data points are anomalous or not. Of course this method doesn’t cover all cases and probably has its own set of problems but it’s a very good start considering the minimal amount of work we just put in.
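How you decide that a new data point is “anomalous” is a design choice. One simple option, sketched below, is to compute an anomaly score for every training point (for example, how badly the dictionary reconstructs it), derive a cut-off from those scores, and flag any new point that scores above it. The mean-plus-three-standard-deviations rule, the function names and the sample scores are all assumptions of this sketch.

```python
import statistics

def fit_threshold(training_scores, k=3.0):
    """Cut-off = mean + k standard deviations of the training scores."""
    mean = statistics.fmean(training_scores)
    spread = statistics.pstdev(training_scores)
    return mean + k * spread

def is_anomalous(score, threshold):
    return score > threshold

# Made-up anomaly scores for the data the dictionary was trained on.
training_scores = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10, 0.08, 0.12]
threshold = fit_threshold(training_scores)

# Made-up scores for two new data points encoded with the same dictionary.
new_scores = [0.11, 0.95]
flags = [is_anomalous(score, threshold) for score in new_scores]
print(threshold, flags)
```

A lower `k` flags more points (and more false alarms), while a higher `k` only catches the most extreme outliers.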

At CyberSift we develop more advanced techniques which leverage machine learning and artificial intelligence to perform anomaly detection as we presented above – but on a much more advanced scale and in a more user friendly manner. Check us out!

Resulting data after sparse coding:



Bleeding edge: The intersection of Bitcoin and cyber-security

The good, the bad, & the ugly…

There are some very obvious connections between bitcoin and cybersecurity; almost every hacker who blackmails their victims via ransomware or some other hack demands payment in bitcoin. This is the ugly side of bitcoin and cybersec; by its very nature bitcoin is pseudo-anonymous (read: difficult to trace), decentralized (read: difficult to take down) and increasingly easy to use. No wonder hackers love bitcoin.

But what are the other facets to bitcoin melding with cybersec?

The bad…

… it can be used to control botnets

The bitcoin blockchain is intended to be extremely difficult to take down, to be private and unregulated. Sounds like the perfect medium for a Command and Control [C&C] service. Enter ZombieCoin 2.0:

The authors of this paper successfully manage to design

[…] ZombieCoin bots which we then deploy and successfully control over the Bitcoin network.

They do this by embedding simple botnet commands into the bitcoin transaction field OP_RETURN, which is normally used for transaction identifiers similar to what you’d have in your online ebanking portal. This field allows you to include up to 80 bytes of data, which the authors use to control their bots. The resulting bot is only 7MB in size and stores only about 626kB worth of blockchain, with the traffic generated by this C&C method being indistinguishable from normal bitcoin traffic.
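To make the 80-byte field a little more concrete, here is a minimal sketch of how a payload sits inside an OP_RETURN output script, and how a defender could pull it back out for inspection. The opcode values (0x6a for OP_RETURN, 0x4c for OP_PUSHDATA1) are the standard Bitcoin script encodings. The function names are hypothetical, and the paper’s actual command format is not reproduced here.

```python
OP_RETURN = 0x6A
OP_PUSHDATA1 = 0x4C
MAX_PAYLOAD = 80  # the limit quoted above

def build_op_return_script(payload: bytes) -> bytes:
    """Return an output script of the form: OP_RETURN <push> <payload>."""
    if not 1 <= len(payload) <= MAX_PAYLOAD:
        raise ValueError("payload must be between 1 and 80 bytes")
    if len(payload) <= 75:
        push = bytes([len(payload)])                 # direct push of N bytes
    else:
        push = bytes([OP_PUSHDATA1, len(payload)])   # one-byte length prefix
    return bytes([OP_RETURN]) + push + payload

def extract_op_return_payload(script: bytes):
    """Return the embedded payload, or None if the script is not a
    single-push OP_RETURN script."""
    if len(script) < 2 or script[0] != OP_RETURN:
        return None
    if 1 <= script[1] <= 75:
        length, start = script[1], 2
    elif script[1] == OP_PUSHDATA1 and len(script) >= 3:
        length, start = script[2], 3
    else:
        return None
    payload = script[start:start + length]
    # Reject truncated scripts and scripts with trailing data.
    if len(payload) != length or start + length != len(script):
        return None
    return payload

script = build_op_return_script(b"hello")
print(script.hex())
print(extract_op_return_payload(script))
```

Being able to parse the field does not make malicious payloads easy to spot, since they look like any other short blob of application data.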

Time to start blocking bitcoin traffic on your enterprise network

The good…

… it can make Man In The Middle attacks a thing of the past

Most MiTM attacks rely on being able to change data that is supplied to a client – for example changing DNS entries or HTTPS certificates. Current DNS / SSL / TLS protocols struggle to make this data tamper-proof. DNSSEC hasn’t really taken off and SSL/TLS relies on central authorities that can be compromised or abused.

However… if attackers can embed data in the blockchain, so can developers and defenders. Inheriting all the benefits of the blockchain, this embedded data will not only be resilient and de-centralized (like you’d hope DNS is…) but also backed by cryptography to result in tamper-proof data. Any entries into this system would have to be validated and agreed upon by at least 51% of the network to be accepted…
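The “backed by cryptography” part can be illustrated with a toy hash chain: each entry stores the hash of the previous one, so changing an old entry breaks every link after it. This is only a sketch of tamper evidence. It leaves out mining, consensus and the network entirely, and the record layout and function names are made up for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def block_hash(index, prev_hash, record):
    body = json.dumps({"index": index, "prev": prev_hash, "record": record},
                      sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append(chain, record):
    """Add a record, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    index = len(chain)
    chain.append({"index": index, "prev": prev_hash, "record": record,
                  "hash": block_hash(index, prev_hash, record)})

def is_valid(chain):
    """Recompute every hash and check every link."""
    prev_hash = GENESIS
    for index, block in enumerate(chain):
        if block["index"] != index or block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(index, prev_hash, block["record"]):
            return False
        prev_hash = block["hash"]
    return True

chain = []
append(chain, {"name": "example.org", "ip": "192.0.2.1"})
append(chain, {"name": "example.net", "ip": "192.0.2.2"})
intact = is_valid(chain)

# An attacker quietly rewrites an old entry...
chain[0]["record"]["ip"] = "203.0.113.66"
tampering_detected = not is_valid(chain)
print(intact, tampering_detected)
```

In a real blockchain the attacker would also have to redo the work for every later block and outpace the rest of the network, which is where the majority requirement comes in.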

So what if we use the blockchain instead of DNS and HTTPS certificate authorities? Entities would use the blockchain to resolve their IP addresses and provide their public certificates in a safe, secure manner. This is the basic concept behind a blockchain-based technology aptly named NAMECOIN:

While it may look rather theoretical, or at least difficult, to migrate our systems to use blockchain, it turns out that there already is some excellent work being done by Greg Slepak to simplify this and make namecoin extremely easy to use for both webmasters and websurfers, in the form of okTurtles:

For a more in-depth read about okTurtles, have a look at their overview: