In a previous blog post we explored how to use the ELK stack to build a fast, flexible and efficient log forensics platform. In this post we’ll move beyond the basics and address some issues that are specific to configuring ELK for forensic work.
In particular, we’ll be addressing querying – specifically running regex queries. First though, a quick high-level overview of how elasticsearch implements search. One of the first things elasticsearch does on receiving a log entry is pass it through an analyzer [1]. An analyzer is itself made up of several components; the two of interest to us are the tokenizer [2] and the token filter [3]. By default, elasticsearch uses the standard analyzer, which breaks a string up into tokens suitable for search. For example:
“The quick brown fox”
would get analysed into:
“quick”, “brown”, “fox”
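If you want to see exactly what an analyzer will do to a string, the analyze API [1] shows the resulting token stream. A quick sketch, assuming elasticsearch is listening on localhost:9200:
curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty' -d 'The quick brown fox'
Each token in the response is what actually gets indexed – and therefore what your queries are matched against.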
In forensics, this presents us with a couple of problems:
* While words like “the”, “and” and “a” are not very useful for search, removing them – as elasticsearch does by default – means removing potentially important information. The golden rule of forensics is to always preserve the evidence in its original state.
* The tokenization into individual words affects querying too. Matches are now made against tokens, i.e. individual words rather than the entire string. As we’ll see, this has a big impact especially when using regexes – the illustration just below shows why.
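To make the second point concrete, consider these lucene query strings run against the analyzed example above ( the field name “message” is just an example ). A regexp query in lucene is matched against individual terms, and the regex has to match an entire term:
message:/qui.*/ – matches, because the single token “quick” satisfies the whole expression
message:/quick.brown/ – never matches, because “quick” and “brown” were indexed as separate tokens and no single token matches the expression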
In the previous blog post we briefly touched on the problem of tokenization by creating a dynamic template, used by logstash, that simply turns off the analyzers. While this is a valid suggestion, it does have a major drawback: Kibana won’t allow you to issue regex queries against non-analysed fields ( just try to issue a regex query against a “raw” field – which by default is not analysed – for example data.raw:/.*my_test.*/ ).
That just won’t do – as any security analyst knows – regex is your friend!
So we need to modify the default dynamic template to keep analyzing the fields, but without tokenizing them into individual words. The best way I found of doing this is to use the “keyword” analyzer [4], which keeps the field string as a single token ( remember, we don’t want to simply set the fields to “not analyzed” – we’d like to keep regex queries! ). During troubleshooting I wasn’t completely sure whether the keyword analyzer converts all text to lowercase, so in the dynamic template below we define a new analyzer called “custom_keyword”, which is based on the keyword analyzer but adds the “lowercase” filter. Converting everything to lowercase is important for regex expressions because lucene query strings are automatically converted to lowercase, as I discovered through an interesting stackoverflow answer [5].
Below is the dynamic template used by logstash:
{
  "template" : "logstash-*",
  "settings" : {
    "analysis": {
      "analyzer": {
        "custom_keyword": {
          "filter": ["lowercase"],
          "type": "keyword"
        }
      }
    },
    "index.refresh_interval" : "5s"
  },
  "mappings" : {
    "_default_" : {
      "_all" : {"enabled" : true, "omit_norms" : true},
      "dynamic_templates" : [ {
        "message_field" : {
          "match" : "message",
          "match_mapping_type" : "string",
          "mapping" : {
            "type" : "string", "index" : "analyzed", "omit_norms" : true
          }
        }
      }, {
        "string_fields" : {
          "match" : "*",
          "match_mapping_type" : "string",
          "mapping" : {
            "type" : "string", "index" : "analyzed", "analyzer": "custom_keyword", "omit_norms" : true,
            "fields" : {
              "raw" : {"type": "string", "index" : "not_analyzed", "ignore_above" : 256}
            }
          }
        }
      } ],
      "properties" : {
        "@version": { "type": "string", "index": "not_analyzed" },
        "geoip" : {
          "type" : "object",
          "dynamic": true,
          "properties" : {
            "location" : { "type" : "geo_point" }
          }
        }
      }
    }
  }
}
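To get logstash to actually apply this template, the elasticsearch output plugin can be pointed at the file. A rough sketch – the file path is an example, and the exact option names ( e.g. host vs hosts ) depend on your logstash version:
output {
  elasticsearch {
    host => "elasticsearch_IP"                            # your elasticsearch node
    template => "/etc/logstash/forensics-template.json"   # the dynamic template shown above (example path)
    template_overwrite => true                            # replace the default logstash template
  }
}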
Once data is pumped into the elasticsearch cluster and the dynamic template is applied, you can verify which analyzer is in use by viewing the index mappings. Visiting a URL similar to the following ( change the index name to match your own ):
http://elasticsearch_IP:9200/logstash-2015.06.25/_mapping/
will show something similar to the following:
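The exact output depends on your index and fields, but for a string field picked up by the “string_fields” dynamic template ( “userdata2” here is simply the field used in the query example further down ) the relevant part should look roughly like this:
"userdata2" : {
  "type" : "string",
  "analyzer" : "custom_keyword",
  "fields" : {
    "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256 }
  }
}
If you see “custom_keyword” listed as the analyzer, the template has been applied correctly.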
Now that the information is being stored in elasticsearch the way we need it, we can start issuing regex queries against that data. For example, here is a regex query against the “userdata2” field:
userdata2:/.*ajaxid=[1-9]{1}435220730000.*/
One thing to note is the .* prepended and appended to the search expression. This is needed because of our use of the “keyword” analyzer: the whole field is now a single token, so the regex has to match the entire string and not just part of it. This is important to keep in mind and took me a while to figure out, since we’re all probably used to tools which handle this for us automatically.
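You can sanity-check both the lowercasing and the single-token behaviour by running the analyze API against the new analyzer on one of the logstash indices ( the index name and sample text are just examples ):
curl -XGET 'http://elasticsearch_IP:9200/logstash-2015.06.25/_analyze?analyzer=custom_keyword&pretty' -d 'Some Log Entry With ajaxid=1435220730000'
The response should contain a single token holding the entire string in lowercase – which is exactly why the query above is written in lowercase and wrapped in .* on both sides.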
References
[1] Elasticsearch Analyzers: http://www.elastic.co/guide/en/elasticsearch/reference/1.4/indices-analyze.html
[2] Elasticsearch Tokenizers: https://www.elastic.co/guide/en/elasticsearch/reference/1.x/analysis-tokenizers.html
[3] Elasticsearch Token Filters: https://www.elastic.co/guide/en/elasticsearch/reference/1.x/analysis-tokenfilters.html
[4] Elasticsearch Keyword Analyzer: https://www.elastic.co/guide/en/elasticsearch/reference/1.x/analysis-keyword-analyzer.html
[5] Elasticsearch: hyphen in PrefixQuery on Keyword-analyzed field: http://stackoverflow.com/questions/30400408/elasticsearch-hyphen-in-prefixquery-on-keyword-analyzed-field