First Steps in Applying Machine Learning to InfoSec – WEKA

The intersection between machine learning [ML] and information security [InfoSec] is currently quite a hot topic. The allure of this intersection is easy to see: security analysts are drowning in alerts and data which need to be painstakingly investigated and, if necessary, acted upon. This is no easy process and, as was seen in the now infamous Target hack, alarms all too often go unnoticed. ML promises to stem the torrent of alerts and logs and (ideally) present to the analyst only those alerts which are really worth investigating.

This is by no means an easy task; however, the rise of several enabling factors has put this goal within reach of the average InfoSec professional:

  • Cloud Computing
  • Big Data technologies such as Hadoop
  • Python (and other language) libraries like Scikit-Learn [1] which abstract away the nuances of Machine Learning and Data Mining
  • Distributed Data/Log collection and search technologies such as ElasticSearch [2]

From personal experience, the process of learning about machine learning can be daunting, especially to those without a mathematical background. However, in this series of articles I plan on outlining my learning process and enumerating the various excellent resources that are freely available on the internet to help anyone interested in getting started in this exciting field.

A good introduction to this field is the talk “Using Machine Learning Solutions to Solve Serious Security Problems” by @j_monty and @rsevey, which can be found here:

https://youtu.be/48O6L_DfE2o

The talk really whets your appetite for this field. A small distinction that should be pointed out is the difference between “machine learning” and “data mining”. Data mining is the process of turning raw data into actionable information, while machine learning provides some of the many tools and algorithms that help in this process. The presenters mention using WEKA [3] to get started in the field and to get to grips with the data that will eventually power our machine learning algorithms. Before anything else, it is very useful to try some data mining techniques manually, in order to understand our data, which algorithms give the best results on it, and the challenges and rewards of applying them. This will allow us to better understand which machine learning algorithms we can later apply to InfoSec-related data such as logs, pcaps and so on.
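To give a flavour of what "understanding the data" can mean before any learning algorithm is involved, here is a minimal Python sketch. Everything in it is an assumption for illustration: the records, the field names (failed_logins, bytes_out_kb) and the labels are made up and do not come from the talk or from WEKA. It counts how the classes are distributed, works out how accurate we would be if we always guessed the most common class (the same idea as WEKA's ZeroR baseline classifier), and compares one feature across classes.

```python
from collections import Counter

# Hypothetical, made-up records: (failed_logins, bytes_out_kb, label).
records = [
    (0, 120, "benign"),
    (1, 90, "benign"),
    (0, 300, "benign"),
    (7, 4500, "malicious"),
    (0, 150, "benign"),
    (9, 5200, "malicious"),
    (2, 110, "benign"),
    (1, 80, "benign"),
]


def class_distribution(rows):
    """Count how many rows carry each label (the last field of a row)."""
    return Counter(row[-1] for row in rows)


def majority_baseline(rows):
    """Return the most common label and the accuracy of always predicting it."""
    counts = class_distribution(rows)
    label, hits = counts.most_common(1)[0]
    return label, hits / len(rows)


def mean_by_class(rows, column):
    """Average one numeric column separately for each label."""
    totals, counts = Counter(), Counter()
    for row in rows:
        totals[row[-1]] += row[column]
        counts[row[-1]] += 1
    return {label: totals[label] / counts[label] for label in counts}


if __name__ == "__main__":
    print(class_distribution(records))
    print(majority_baseline(records))
    print(mean_by_class(records, 0))  # column 0 is failed_logins
```

The baseline matters because security data is usually heavily skewed towards benign events: any algorithm tried later has to beat "always say benign" to be worth anything, and raw accuracy alone can be misleading.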

So it would seem WEKA is as good a place as any to get started! Some quick research turns up a hidden gem, an online course from the creators of WEKA on how to use the program:

https://weka.waikato.ac.nz/dataminingwithweka/preview

The course may not be open when you read this; however, the course videos are still available on YouTube, and this should be your first stop:

http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/

Note: if you need to find the datasets the instructor is using (the WEKA installation from the Ubuntu repositories does not include these), then you can find them here:

http://storm.cis.fordham.edu/~gweiss/data-mining/datasets.html
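WEKA's native dataset format is ARFF, a plain-text file with an @relation line, a list of @attribute declarations and a comma-separated @data section. Assuming the datasets you download are in that format, the following is a minimal Python sketch for peeking inside one outside of WEKA. The sample content is made up for illustration, and the reader only handles the simple dense case: it ignores attribute types and does not support quoted attribute names containing spaces or the sparse data format.

```python
import csv
import io

# Made-up ARFF content, for illustration only.
SAMPLE_ARFF = """\
% Made-up sample for illustration only
@relation logins

@attribute failed_logins numeric
@attribute source {internal,external}
@attribute label {benign,malicious}

@data
0,internal,benign
7,external,malicious
1,internal,benign
"""


def read_arff(text):
    """Parse a simple dense ARFF document into (attribute names, rows)."""
    names, rows, in_data = [], [], False
    for raw in io.StringIO(text):
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and % comments
        if in_data:
            # Each data line is one comma-separated record.
            rows.append(next(csv.reader([line])))
        elif line.lower().startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name.
            names.append(line.split(None, 2)[1])
        elif line.lower().startswith("@data"):
            in_data = True
    return names, rows


if __name__ == "__main__":
    names, rows = read_arff(SAMPLE_ARFF)
    print(names)
    for row in rows:
        print(dict(zip(names, row)))
```

To try it on a real file, read the file's contents into a string and pass that to read_arff. All values come back as strings, so numeric attributes need converting before any arithmetic.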

References

[1] Scikit-learn: http://scikit-learn.org/stable/

[2] ElasticSearch: https://www.elastic.co/

[3] WEKA: http://www.cs.waikato.ac.nz/ml/weka/