Getting started with Neo4J and security data analysis

During a recent study module for a MSc I am undertaking we discussed the importance of continuous monitoring of data sources as part of a sound security defensive strategy. This lead me down a very interesting path, eventually culminating in my discovery of an entire subset in security discipline many refer to as “Security visualization”. There are plenty of examples on the internet as regards this discipline, one of the best links I found is “secviz”, located here:

http://secviz.org/

My focus during this was actually trying to use graph databases to visualize security data, which in it’s raw form is simply CSV logs from our corporate firewall. I chose to explore graph databases via Neo4J, an extremely popular, open source, graph database. This article is intended as a quick reference or high level overview to any who’d like to get started in this field, and also as a reminder to my future self as to what I did so I can pick it up again sometime soon…

Getting Neo4J installed on ubuntu was child’s play. Copy/pasting the debian install script provided from the Neo4J site was all it took. A quick overview of cypher (Neo4J’s declarative language) is also easy for those coming from a SQL background. What did require thorough reading was the procedure on how to import CSVs into Neo4J, found here:

http://docs.neo4j.org/chunked/stable/cypherdoc-importing-csv-files-with-cypher.html

the firewall logs I used in this study came from a Palo Alto firewall, specifically the threat logs. These logs give the source (attacker) and destination (target) IP address of an attack, as well as the name of the attack. So first step is to open the raw CSV file, and whittle it down to the important columns, ending up with the following (ip addresses changed to protect the innocent):

 

Source Destination Threat
192.168.7.6 192.168.1.2 Microsoft RPC Endpoint Mapper(30845)
192.168.9.5 192.168.9.4 SSH2 Login Attempt(31914)
95.169.160.24 1.2.3.4 DNS ANY Request(34842)
202.43.161.114 1.2.3.5 Win32.Conficker.C p2p(12544)
195.190.146.40 5.4.7.8 Suspicious or malformed HTTP Referer field(35554)
192.68.7.106 192.168.21.2 Microsoft Windows SMB Negotiate Request(35364)
192.16.2.200 92.168.123.3 Microsoft RPC Endpoint Mapper(30845)

— Warning —

The logs can be quite voluminous, and I didn’t pay much attention to the various optimizations that can and should be done (like indexing), so some queries can take a significant amount of time to execute.

First step is to import that csv file into Neo4J. The cypher code I used was:
CREATE INDEX ON :IP(address)
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///home/davidvassallo/Downloads/log.csv" AS csvLine
MERGE (source:IP { address: csvLine.Source })
MERGE (destination:IP { address: csvLine.Destination })
MERGE (attack:Threat { attack: csvLine.Threat })
CREATE (source)-[:USED]->(attack)-[:AGAINST]->(destination)

In the first part we load the appropriate CSV file (note the triple slash in the beginning for Linux systems), and loop through each line of the csv file, storing it in the variable csvLine. Breaking down a MERGE command, we have:

blog_neo4j_1

The first “MERGE” line basically instruct Neo4J to create nodes of type IP, with an attribute/property called address from the “Source” column of the CSV file. If the node already exists, then don’t create a new one (hence the use of MERGE instead of CREATE). We do the same for the Destination column in the next line, and finally we create nodes of type “Threat” out of the threat column in the CSV line. Last but not least, the CREATE lines creates the edges (or relationships) between the source, destination, and attack. There are multiple ways to do this, but in my head it came down to:

“A source used an attack againstdestination

One notices that the cypher instruction very closely resembles the english sentence I wrote. In the cypher instruction, nodes are enclosed in parenthesis, while relationships are enclosed in square brackets. With that out of the way, we can get to visualizing our data. After pointing our browser to http://localhost:7474, we load up all nodes of type “threat”:

blog_neo4j_2
Click to enlarge

Note how the generating cypher command is also give to the user (number 2 above). Even at this stage, without running any queries, we can get some really useful information, by double clicking on the threat nodes. To give a concrete example, it’s worth while analyzing the following picture:

click to enlarge
Click to enlarge

You see two attackers (on the right hand side) which launched a variety of attacks designed to try to break into RDP. I checked the firewall and it really does have RDP open for some reason (a definite plus to using security visualization… auditing your own traffic). You see that one attacker (80.82.65.60) is probably a poor schmuck who got infected by a worm. Morto is a worm that propagates using RDP. The other attacker on the top of the diagram is a more organized attacker…. First he did reconnaissance and found an open RDP port (that’s the “Ncrack RDP” scan) , then tested if the scan was actually correct by trying to use the actual RDP client (that’s the microsoft rdp initial attempt), then, finding out he couldnt guess the password, he launched a full blown brute force attempt against the server.

I haven’t even begun to scratch at the surface of Neo4J…. there’s plenty more to discover, but I’ll leave you with some very useful queries I constructed while on the learning journey:

--- Find sources and targets of SSH2 login attempts:
MATCH (a)-[r]-(b) WHERE a.attack =~ "SSH2.*" OR b.attack =~ "SSH2.*" RETURN a,b LIMIT 25
--- Node Traversal (who attacked this node id)
--- the node ID can be determined by clicking once on the node
--- it will appear in parenthesis in the black box that appears on the right
START n=node(233)
MATCH (x)-[r]->(n)
RETURN n,x
-- who used attack id 78, return only distinct attackers
START n=node(78)
MATCH (a)-[r]->(n)
RETURN DISTINCT(a),n
LIMIT 500

References:

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.