Data mining firewall logs : Principal Component Analysis

In this article we’ll explore how Principal Component Analysis [PCA] [1] – a popular data reduction technique – can help a busy security or network administrator. Any such administrator has often been faced with a daunting problem… going through reams of firewall or router connection logs trying to figure out if any of the thousands of entries is abnormal or not, if it’s worth looking into, or even where to start. We’ll use an open source data analysis tool called RapidMiner [2] to explore how to tackle this exact problem and give admins a head start.

PCA may sound complicated – but breaking it down to its bare essence reveals a not so complicated concept. Given a CSV or table, PCA applies statistical and mathematical techniques to figure out which of the columns in the data actually have the most “impact” or variance in statistical speak. To illustrate this point… imagine having a the following really simple table:

| Destination IP Address | Protocol |
|                   |   tcp         |
|                 |   tcp         |
|                   |   tcp         |

It’s pretty intuitive that the only column adding any real value or having any real impact is the destination IP address, because it’s the only column whose value changes. The protocol column only contains “tcp” (it has no variance) and so doesnt really add much to figuring out if a connection is anomalous. PCA automates this process over a potentially large number of columns, and tells the analyst which columns have the most impact. I recommend you read the article [3] to get a good background into exactly how we’ll be extracting this information using RapidMiner.

Preparing the firewall logs

In this case, we’ll be using PaloAlto “Traffic” logs, exported to a CSV file. This results in a rather large data set, so either beef up your machine, or reduce the data set. For the purposes of the article, we’ll have to reduce the data set. So we keep only the first 10,000 rows, and reduce the number of columns by making some educated guesses as to which columns can be removed. For example, constant columns like “Serial Number”, “Domain”, “Config Version”, and so on can be eliminated. Other columns that do not add too much value such as “receive time” are also eliminated. It takes some practice but you should be able to very quickly determine which columns actually do make a difference. In this case, we retained the following columns:

Source address
Destination address
Source Port
Destination Port
IP Protocol
Bytes Sent
Bytes Received
Destination Country

PCA normally only takes numeric input, and some of these columns contain alphanumeric data. So we now need to convert all data to numbers only. A really simple way of doing this is just to convert each cell to it’s ASCII equivalent. So “abc” would be 97+98+99=294

I wrote a really simple python script that iterates through the CSV file and converts everything to the decimal ASCII equivalent. The script takes as an input a file called “log_prepared_truncated.csv” and outputs to a file called “log_transformed_truncated.csv”, and it assumes the first row contains column names so it leaves them untouched.

Here we can see the original CSV file on the left, and the transformed CSV file on the right:


Introducing RapidMiner

RapidMiner has a very simple UI…. it should be very intuitive to use. We first import our transformed CSV file. Just type “csv” into the search bar on the left hand side and click-drag the “Read CSV” operator. Clicking on the Read CSV operator gives you the option to run “Import Configuration Wizard” which makes it all dead simple. You can leave all parameters as default, except for ticking “first row as names”:


Next, we add the PCA operator. Simply type “pca” into the search box and click-drag the “Principal Component Analysis” operator, and connect the two operators together. I also like to include a “generate ID” operator which will number the outputted rows and make it easier to refer back to the original data. So the final process should look like this:


Focus on PCA

Before moving to the final results, let’s stop and look a bit at the PCA stage. Hit the run button… In the results, have a look at the “PCA (PCA)” tab:


In the above, we note that the first two component – PC1 and PC2 – together account for 94.7% of the variance in the data (that’s the last column in the above screenshot). So these two components are what we’ll focus on. Which columns do these component map to? That’s where “Eigenectors” come in. Click on the Eigenvectors button:Selection_275

PC1 is most influenced by “Destination Country” because that Attribute contributes -0.953 to the component (the maximum can be +/- 1), while PC2 is influenced most by “Application”. So now we know that these two attributes are what impact our data most.

Visualizing the results

A picture is worth a thousand words. So let’s generate some 🙂

Click on the “ExampleSet (Generate ID)” tab, and then hit the “Charts” button. To get a good idea of which is the best plot to choose, my favorite first port of call is the “Scatter Matrix”, which gives all possible permutations of scatter plots… in our case it’s quite simple… either (PC1 vs PC2) or (PC2 vs PC1). Make sure to select “id” as plots, and we should see something like the below:


The above is much more helpful when you have more than two components being plotted. Switching to a simple scatter plot, we already see useful data emerging:


Most points are grouped together… but there are a few that dont quite fit in… these are our “outliers” that we should have a closer look into… Hover your mouse over a point of interest and note the ID.

As a quick example, one point from the above screenshot immediately stands out, circled in green below:


The ID for this point is 6671. So we go back to our original CSV file (log_prepared_truncated.csv) and see what comes up on row 6671. In this case its:    incomplete    44181    80   tcp    120   60    60       3    MT    2    1

Something is already weird here… the destination port is “80” but application is not “incomplete” rather than “HTTP” as I’d have expected it. So we put “” into my next favorite tool… IBM’s X-Force Exchange [4], and indeed… this is an IP that’s associated with scanning…


Rinse and repeat…. 🙂


[1] Principal Component Analysis – Wikipedia. Available from:

[2] RapidMiner. Available from:

[3] How to run Principal Component Analysis with RapidMiner – Part 2. Available from:

[4] IBM X-Force Exchange. Available from:


Getting started with Neo4J and security data analysis

During a recent study module for a MSc I am undertaking we discussed the importance of continuous monitoring of data sources as part of a sound security defensive strategy. This lead me down a very interesting path, eventually culminating in my discovery of an entire subset in security discipline many refer to as “Security visualization”. There are plenty of examples on the internet as regards this discipline, one of the best links I found is “secviz”, located here:

My focus during this was actually trying to use graph databases to visualize security data, which in it’s raw form is simply CSV logs from our corporate firewall. I chose to explore graph databases via Neo4J, an extremely popular, open source, graph database. This article is intended as a quick reference or high level overview to any who’d like to get started in this field, and also as a reminder to my future self as to what I did so I can pick it up again sometime soon…

Getting Neo4J installed on ubuntu was child’s play. Copy/pasting the debian install script provided from the Neo4J site was all it took. A quick overview of cypher (Neo4J’s declarative language) is also easy for those coming from a SQL background. What did require thorough reading was the procedure on how to import CSVs into Neo4J, found here:

the firewall logs I used in this study came from a Palo Alto firewall, specifically the threat logs. These logs give the source (attacker) and destination (target) IP address of an attack, as well as the name of the attack. So first step is to open the raw CSV file, and whittle it down to the important columns, ending up with the following (ip addresses changed to protect the innocent):


Source Destination Threat Microsoft RPC Endpoint Mapper(30845) SSH2 Login Attempt(31914) DNS ANY Request(34842) Win32.Conficker.C p2p(12544) Suspicious or malformed HTTP Referer field(35554) Microsoft Windows SMB Negotiate Request(35364) Microsoft RPC Endpoint Mapper(30845)

— Warning —

The logs can be quite voluminous, and I didn’t pay much attention to the various optimizations that can and should be done (like indexing), so some queries can take a significant amount of time to execute.

First step is to import that csv file into Neo4J. The cypher code I used was:
LOAD CSV WITH HEADERS FROM "file:///home/davidvassallo/Downloads/log.csv" AS csvLine
MERGE (source:IP { address: csvLine.Source })
MERGE (destination:IP { address: csvLine.Destination })
MERGE (attack:Threat { attack: csvLine.Threat })
CREATE (source)-[:USED]->(attack)-[:AGAINST]->(destination)

In the first part we load the appropriate CSV file (note the triple slash in the beginning for Linux systems), and loop through each line of the csv file, storing it in the variable csvLine. Breaking down a MERGE command, we have:


The first “MERGE” line basically instruct Neo4J to create nodes of type IP, with an attribute/property called address from the “Source” column of the CSV file. If the node already exists, then don’t create a new one (hence the use of MERGE instead of CREATE). We do the same for the Destination column in the next line, and finally we create nodes of type “Threat” out of the threat column in the CSV line. Last but not least, the CREATE lines creates the edges (or relationships) between the source, destination, and attack. There are multiple ways to do this, but in my head it came down to:

“A source used an attack againstdestination

One notices that the cypher instruction very closely resembles the english sentence I wrote. In the cypher instruction, nodes are enclosed in parenthesis, while relationships are enclosed in square brackets. With that out of the way, we can get to visualizing our data. After pointing our browser to http://localhost:7474, we load up all nodes of type “threat”:

Click to enlarge

Note how the generating cypher command is also give to the user (number 2 above). Even at this stage, without running any queries, we can get some really useful information, by double clicking on the threat nodes. To give a concrete example, it’s worth while analyzing the following picture:

click to enlarge
Click to enlarge

You see two attackers (on the right hand side) which launched a variety of attacks designed to try to break into RDP. I checked the firewall and it really does have RDP open for some reason (a definite plus to using security visualization… auditing your own traffic). You see that one attacker ( is probably a poor schmuck who got infected by a worm. Morto is a worm that propagates using RDP. The other attacker on the top of the diagram is a more organized attacker…. First he did reconnaissance and found an open RDP port (that’s the “Ncrack RDP” scan) , then tested if the scan was actually correct by trying to use the actual RDP client (that’s the microsoft rdp initial attempt), then, finding out he couldnt guess the password, he launched a full blown brute force attempt against the server.

I haven’t even begun to scratch at the surface of Neo4J…. there’s plenty more to discover, but I’ll leave you with some very useful queries I constructed while on the learning journey:

--- Find sources and targets of SSH2 login attempts:
MATCH (a)-[r]-(b) WHERE a.attack =~ "SSH2.*" OR b.attack =~ "SSH2.*" RETURN a,b LIMIT 25
--- Node Traversal (who attacked this node id)
--- the node ID can be determined by clicking once on the node
--- it will appear in parenthesis in the black box that appears on the right
START n=node(233)
MATCH (x)-[r]->(n)
-- who used attack id 78, return only distinct attackers
START n=node(78)
MATCH (a)-[r]->(n)