Data mining firewall logs : Principal Component Analysis

In this article we’ll explore how Principal Component Analysis [PCA] [1] – a popular data reduction technique – can help a busy security or network administrator. Any such administrator has often been faced with a daunting problem… going through reams of firewall or router connection logs trying to figure out if any of the thousands of entries is abnormal or not, if it’s worth looking into, or even where to start. We’ll use an open source data analysis tool called RapidMiner [2] to explore how to tackle this exact problem and give admins a head start.

PCA may sound complicated – but breaking it down to its bare essence reveals a not so complicated concept. Given a CSV or table, PCA applies statistical and mathematical techniques to figure out which of the columns in the data actually have the most “impact” or variance in statistical speak. To illustrate this point… imagine having a the following really simple table:

| Destination IP Address | Protocol |
|                   |   tcp         |
|                 |   tcp         |
|                   |   tcp         |

It’s pretty intuitive that the only column adding any real value or having any real impact is the destination IP address, because it’s the only column whose value changes. The protocol column only contains “tcp” (it has no variance) and so doesnt really add much to figuring out if a connection is anomalous. PCA automates this process over a potentially large number of columns, and tells the analyst which columns have the most impact. I recommend you read the article [3] to get a good background into exactly how we’ll be extracting this information using RapidMiner.

Preparing the firewall logs

In this case, we’ll be using PaloAlto “Traffic” logs, exported to a CSV file. This results in a rather large data set, so either beef up your machine, or reduce the data set. For the purposes of the article, we’ll have to reduce the data set. So we keep only the first 10,000 rows, and reduce the number of columns by making some educated guesses as to which columns can be removed. For example, constant columns like “Serial Number”, “Domain”, “Config Version”, and so on can be eliminated. Other columns that do not add too much value such as “receive time” are also eliminated. It takes some practice but you should be able to very quickly determine which columns actually do make a difference. In this case, we retained the following columns:

Source address
Destination address
Source Port
Destination Port
IP Protocol
Bytes Sent
Bytes Received
Destination Country

PCA normally only takes numeric input, and some of these columns contain alphanumeric data. So we now need to convert all data to numbers only. A really simple way of doing this is just to convert each cell to it’s ASCII equivalent. So “abc” would be 97+98+99=294

I wrote a really simple python script that iterates through the CSV file and converts everything to the decimal ASCII equivalent. The script takes as an input a file called “log_prepared_truncated.csv” and outputs to a file called “log_transformed_truncated.csv”, and it assumes the first row contains column names so it leaves them untouched.

Here we can see the original CSV file on the left, and the transformed CSV file on the right:


Introducing RapidMiner

RapidMiner has a very simple UI…. it should be very intuitive to use. We first import our transformed CSV file. Just type “csv” into the search bar on the left hand side and click-drag the “Read CSV” operator. Clicking on the Read CSV operator gives you the option to run “Import Configuration Wizard” which makes it all dead simple. You can leave all parameters as default, except for ticking “first row as names”:


Next, we add the PCA operator. Simply type “pca” into the search box and click-drag the “Principal Component Analysis” operator, and connect the two operators together. I also like to include a “generate ID” operator which will number the outputted rows and make it easier to refer back to the original data. So the final process should look like this:


Focus on PCA

Before moving to the final results, let’s stop and look a bit at the PCA stage. Hit the run button… In the results, have a look at the “PCA (PCA)” tab:


In the above, we note that the first two component – PC1 and PC2 – together account for 94.7% of the variance in the data (that’s the last column in the above screenshot). So these two components are what we’ll focus on. Which columns do these component map to? That’s where “Eigenectors” come in. Click on the Eigenvectors button:Selection_275

PC1 is most influenced by “Destination Country” because that Attribute contributes -0.953 to the component (the maximum can be +/- 1), while PC2 is influenced most by “Application”. So now we know that these two attributes are what impact our data most.

Visualizing the results

A picture is worth a thousand words. So let’s generate some 🙂

Click on the “ExampleSet (Generate ID)” tab, and then hit the “Charts” button. To get a good idea of which is the best plot to choose, my favorite first port of call is the “Scatter Matrix”, which gives all possible permutations of scatter plots… in our case it’s quite simple… either (PC1 vs PC2) or (PC2 vs PC1). Make sure to select “id” as plots, and we should see something like the below:


The above is much more helpful when you have more than two components being plotted. Switching to a simple scatter plot, we already see useful data emerging:


Most points are grouped together… but there are a few that dont quite fit in… these are our “outliers” that we should have a closer look into… Hover your mouse over a point of interest and note the ID.

As a quick example, one point from the above screenshot immediately stands out, circled in green below:


The ID for this point is 6671. So we go back to our original CSV file (log_prepared_truncated.csv) and see what comes up on row 6671. In this case its:    incomplete    44181    80   tcp    120   60    60       3    MT    2    1

Something is already weird here… the destination port is “80” but application is not “incomplete” rather than “HTTP” as I’d have expected it. So we put “” into my next favorite tool… IBM’s X-Force Exchange [4], and indeed… this is an IP that’s associated with scanning…


Rinse and repeat…. 🙂


[1] Principal Component Analysis – Wikipedia. Available from:

[2] RapidMiner. Available from:

[3] How to run Principal Component Analysis with RapidMiner – Part 2. Available from:

[4] IBM X-Force Exchange. Available from:


First Steps in applying machine learning to InfoSec – WEKA

The intersection between machine learning [ML] and information security [InfoSec] is currently quite a hot topic. The allure of this intersection is easy to see, security analysts are drowning in alerts and data which need to be painstakingly investigated and if necessary acted upon. This is no easy processes and as was seen in the now infamous Target hack, more often than not alarms go by unnoticed. ML promises to alleviate the torrent of alerts and logs and (ideally) present to the analyst only those alerts which are really worthwhile investigating.

This is by no means an easy task however the rise of several enabling factors has made this goal reachable to the average InfoSec professional:

  • Cloud Computing
  • Big Data technologies such as Hadoop
  • Python (and other language) libraries like Scikit-Learn [1] which abstract away the nuances of Machine Learning and Data Mining
  • Distributed Data/Log collection and search technologies such as ElasticSearch [2]

From personal experience the process of learning about machine learning can be daunting, especially to those not of a mathematical background. However, in this series of articles I plan on outlining my learning process and enumerating the various excellent resources that are freely available on the internet to help anyone interested in getting started in this exciting field.

A good introduction into this field is a talk by @j_monty and @rsevey about “Using Machine Learning Solutions to Solve Serious Security Problems” which can be found here:

The talk really whets your appetite for this field. A small distinction that should be pointed out is the difference between “machine learning” and “data mining“. Data mining is the process of turning raw data into actionable information, while machine learning is one of the many tools/algorithms that help in this process. The presenters mention using WEKA [3] to get started in the field and get to grips with understanding the data that will eventually power our algorithms and machine learning. Before anything else, it will be very useful to manually try some data mining techniques to understand our data, which algorithms to apply to this data for best results and understand the challenges and rewards of doing so. This will allow us to better understand which machine learning algorithms we can later apply to infosec related data such as logs, pcaps and so on.

So it would seem WEKA is as good a place as any to get started! Some quick research turns up a hidden gem…. an online course from the creators of WEKA on how to use the program:

The course may not be open when reading this, however the course videos are still available on YouTube and this should be your first stop:

Note: if you need to find the datasets the instructor is using (the WEKA installation from the Ubuntu repositories do not include these), then you can find them here:


[1] Scikit-learn :

[2] ElasticSearch:

[3] WEKA: