David Vassallo's Blog

If at first you don't succeed; call it version 1.0

Category Archives: Security

Bringing reliability to OSSEC

As we saw in a previous blog post, OSSEC is UDP based. This is great for performance, and can scale to thousands of nodes. However, it means there is an inherent reliability problem. UDP is a connectionless protocol, so the OSSEC agent has no guaranteed way of knowing that a particular event has been delivered to the OSSEC server. Instead, the architecture relies on heartbeats and keepalives. Even so, events can still be lost no matter how short the interval between keepalives. In this article we explore a simple Python-based broker solution that introduces some (but not complete) reliability into the OSSEC architecture, at the cost of performance.

The first requirement of the broker solution is that it absolutely does not touch any existing code from the current OSSEC solution. It must interfere as little as possible with the current solution, so that if there are any updates or changes in OSSEC the broker can either continue to work as normal, or at least be removed and allow OSSEC to work as originally intended. To achieve this, the broker is split into two components: a TCP server which is installed on the same machine as the OSSEC server, and a proxy-like client which is installed on the same machine as the OSSEC agent.

The general idea is that the OSSEC client is configured to send its traffic to 127.0.0.1 rather than directly to the server. The broker client intercepts the UDP packets (which are kept encrypted and compressed, maintaining end-to-end security), and before sending them on to the OSSEC server, it checks via TCP (reliably) whether the broker server is still reachable and whether the ossec-remoted process is still alive. If the broker server responds, the broker client “releases” the packets and forwards them on to the original OSSEC server. If no answer is received from the broker server, the broker client assumes the server is down and buffers the original UDP packets in a queue. After a while, the OSSEC agent will realise the server is down and pause operations (other than keepalives). When the server comes back online, the broker client replays all the packets that have been buffered, so no events are lost. The general architecture is as follows:

 

Proposed OSSEC Broker architecture


 

Starting from the client, we have the following code, commented so one can follow along:
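As a rough illustration of the client-side logic described above, here is a minimal sketch. It is not a definitive implementation and makes several assumptions: the broker health check listens on a hypothetical TCP port 5555 and answers a `PING` line with `OK`, the names `broker_server_alive`, `PacketBuffer` and `run` are made up for this example, and the buffer size is arbitrary. Port 1514 is OSSEC's default agent port.

```python
import socket
from collections import deque

OSSEC_PORT = 1514    # OSSEC's default agent port (UDP)
BROKER_PORT = 5555   # assumption: hypothetical TCP port for the health check


def broker_server_alive(host, port=BROKER_PORT, timeout=2.0):
    """Ask the broker server over TCP whether it (and ossec-remoted) is up."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\n")              # assumed one-line protocol
            return conn.recv(16).strip() == b"OK"
    except OSError:                              # refused, unreachable or timed out
        return False


class PacketBuffer:
    """Forward datagrams while the server is up; queue them while it is down."""

    def __init__(self, send, is_alive, max_packets=10000):
        self.send = send
        self.is_alive = is_alive
        # A bounded deque silently drops the oldest packet once it is full.
        self.pending = deque(maxlen=max_packets)

    def handle(self, datagram):
        """Return True if the datagram was forwarded, False if it was buffered."""
        if not self.is_alive():
            self.pending.append(datagram)
            return False
        while self.pending:                      # replay the backlog in order first
            self.send(self.pending.popleft())
        self.send(datagram)
        return True


def run(ossec_server, listen_addr=("127.0.0.1", OSSEC_PORT)):
    """Listen where the agent sends its traffic and relay it to the real server."""
    inbound = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    inbound.bind(listen_addr)
    outbound = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    buffer = PacketBuffer(
        send=lambda data: outbound.sendto(data, (ossec_server, OSSEC_PORT)),
        is_alive=lambda: broker_server_alive(ossec_server),
    )
    while True:
        datagram, _agent_addr = inbound.recvfrom(65535)
        buffer.handle(datagram)                  # payload is relayed untouched
```

The sketch only covers the agent-to-server direction; replies from the OSSEC server would also have to be relayed back to the agent. Checking the server over TCP before every packet is what buys the reliability, and it is also where the performance cost comes from.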

 

 

The server is significantly simpler, shown below:
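A minimal sketch of what such a server could look like follows; treat it as an illustration under stated assumptions rather than the definitive code. It assumes a one-line `PING` request answered with `OK` or `DOWN`, a hypothetical TCP port 5555, and that `pgrep` is available on the OSSEC server to check for the ossec-remoted process. The function and class names are invented for this example.

```python
import socketserver
import subprocess


def remoted_running(process_name="ossec-remoted"):
    """Return True if a process with exactly this name exists (relies on pgrep)."""
    try:
        result = subprocess.run(
            ["pgrep", "-x", process_name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:          # pgrep not installed: report the service as down
        return False
    return result.returncode == 0


def reply_for(alive):
    """Map the process state to the assumed one-line reply."""
    return b"OK\n" if alive else b"DOWN\n"


class HealthHandler(socketserver.StreamRequestHandler):
    """Answer each TCP connection with the current state of ossec-remoted."""

    def handle(self):
        self.rfile.readline()                          # consume the client's PING
        self.wfile.write(reply_for(remoted_running()))


def serve(host="0.0.0.0", port=5555):                  # port is an assumption
    """Run the health-check server until interrupted."""
    with socketserver.ThreadingTCPServer((host, port), HealthHandler) as server:
        server.serve_forever()
```

If the machine itself is unreachable, the TCP connection simply fails, which the broker client can treat the same way as a `DOWN` reply.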

Kicking the tires and testing

We use the same troubleshooting techniques as in the previous blog post.

First we set up the server, which is quite straightforward. We just run the ossec_broker_server.py file, and of course ensure that the OSSEC process is actually running properly. Next, the client. We begin by starting the Python client on the Windows machine (assuming Python is installed), and pointing the OSSEC agent to 127.0.0.1:

Selection_085

We immediately see some output on the ossec broker client, something like so:

 

Selection_086

 

We should also check the OSSEC agent logs to make sure it connected successfully to 127.0.0.1:

Selection_087

So far so good… we have communication between the OSSEC agent and the OSSEC server, through the broker. Now, time to test a network interruption. If we simply stop the ossec broker server (simulating such an interruption), we should see the OSSEC agent fail to keep communicating with the OSSEC server:

 

Selection_088

Now, during this interruption (but before the agent keepalives force a lock on the event viewer, so within a minute in default installs…) we generate some events:

Selection_089

These events would normally be lost, because the agent has not yet had time to realise there is a disconnection. So we now turn the server back on, and check the OSSEC archive logs to see whether the above events were delivered anyway:

Selection_090

Success! :) There are some improvements to be made, but the principle is sound, if one can look past the added overhead introduced to accommodate reliability.

OSSEC event loss troubleshooting

There is a general consensus that OSSEC will lose events if the main OSSEC server goes offline for whatever reason ([1], [2]), be it a stopped service, a network disconnection, or anything in between. However, there doesn’t seem to be much information on when exactly event loss can occur, for how long, and how the OSSEC agent recovers. In this article we explore some troubleshooting steps taken in order to answer these questions.

Test Environment:

  • Windows Server 2012, installed as a VMware guest.

  • VMware ESX v5

  • AlienVault All-in-one (commercial, not OSSIM) acting as OSSEC server

The OSSEC agent was deployed to the Windows server using the AlienVault GUI, and the agent was confirmed to be active:

1

The OSSEC server was placed in debug logging mode (using the <logall> global directive [3]). Next, we tested whether application event logs were being sent from the agent to the server. On the client, which is the Windows Server 2012 VM, we do the following:

2

The screenshot above shows the “eventcreate” command being used to generate two application logs:

before disconnecting 1

before disconnecting 2
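For repeated test runs, the same events can be generated from a script. The sketch below builds `eventcreate` command lines for numbered test events; the event ID (100) and the INFORMATION type are assumptions chosen for illustration, and the helper names are hypothetical. Commands are only executed when `run=True`, which requires Windows and an elevated prompt.

```python
import subprocess


def eventcreate_command(description, event_id=100):
    """Build an eventcreate call that writes one Application log entry."""
    return [
        "eventcreate",
        "/L", "APPLICATION",      # target the Application log
        "/T", "INFORMATION",      # assumed event type
        "/ID", str(event_id),     # eventcreate accepts IDs from 1 to 1000
        "/D", description,
    ]


def generate_events(label, count, run=False):
    """Create numbered test events such as 'before disconnecting 1'."""
    commands = [eventcreate_command(f"{label} {n}") for n in range(1, count + 1)]
    if run:                       # only meaningful on the Windows test machine
        for command in commands:
            subprocess.run(command, check=True)
    return commands
```

Calling `generate_events("before disconnecting", 2, run=True)` on the Windows VM would produce the two entries listed above.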

We then confirmed that these were being received by the OSSEC server as expected:

3

Of note in the screenshot above is that the “logall” directive of OSSIM logs to the /var/ossec/logs/archives/archives.log path.
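Checking the archive for the test events can also be scripted. This is a small sketch that assumes the archive path mentioned above and simply returns every line containing a marker string; the function name `find_events` is hypothetical.

```python
ARCHIVE_LOG = "/var/ossec/logs/archives/archives.log"


def find_events(marker, log_path=ARCHIVE_LOG):
    """Return the archive lines that contain the marker text."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        return [line.rstrip("\n") for line in log if marker in line]
```

For example, `find_events("before disconnecting")` run on the OSSEC server should return one line per delivered test event, and an empty list if none arrived.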

Now we disconnect the NIC from the VM as shown below:

4

This simulates a general network failure, as shown by the ping results below. Immediately after, we generate a further three application logs (after disconnecting 1, after disconnecting 2, after disconnecting 3):

5

These events never make it through to the OSSEC server. However, if we wait for 240 seconds – the configured timeout value – we now see in the OSSEC agent logs:

6

So now we generate three more application events:

8

these being after 240 seconds 1, after 240 seconds 2 and after 240 seconds 3. For now, these still do not appear on the OSSEC server since the NIC is still disconnected.

We proceed to enable the NIC and connect it again, reversing the step shown before. We also generate some more application events (after reconnect 1, after reconnect 2, after reconnect 3) and monitor the OSSEC server logs. After a few seconds we see the logs being sent to the server, correctly timestamped, with no intervention or restarting from our end:

7

So in conclusion, caching does occur, but only once the 240 seconds have elapsed. Events generated during those 240 seconds will be lost. One should note that this 240-second timeout is configurable on the OSSEC agent via the <time-reconnect> directive, shown below:

9

One can of course reduce the values of notify_time and time_reconnect. In the source code for the OSSEC agent, the default time_reconnect value is set to three times (3x) the notify_time value, which is a sane default in my opinion. There’s a trade-off here between performance and reliability: since OSSEC is UDP based, the only way of knowing if the OSSEC server is offline is for the agent to send periodic keepalives (notifications). Setting notify_time too low means more network traffic and processing on the OSSEC agents and server, but fewer logs lost in the event of a disconnection.
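To make the trade-off concrete, here is a small worked calculation. It assumes the 3x relationship described above and, following the test results, treats the reconnect timeout as a rough upper bound on the window in which events can be lost. The notify_time values are hypothetical examples, not recommendations.

```python
def reconnect_timeout(notify_time, multiplier=3):
    """Seconds before the agent declares the server down (3x notify_time)."""
    return notify_time * multiplier


def keepalives_per_hour(notify_time):
    """Keepalive messages each agent sends per hour."""
    return 3600 / notify_time


for notify_time in (80, 40, 20):                 # hypothetical values, in seconds
    window = reconnect_timeout(notify_time)      # approximate loss window
    rate = keepalives_per_hour(notify_time)
    print(f"notify_time={notify_time}s  loss window~{window}s  keepalives/hour={rate:.0f}")
```

Halving notify_time halves the window in which events can go missing, but doubles the keepalive traffic from every agent.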

So unless OSSEC gets converted to TCP (at the cost of performance due to TCP overhead compared to UDP), it seems there will always be the possibility of log loss.

References

[1] https://groups.google.com/forum/#!topic/ossec-list/F_izIq3zEi4

[2] https://groups.google.com/forum/#!topic/ossec-list/mQr3L_sqJ-Q

[3] http://ossec-docs.readthedocs.org/en/latest/syntax/head_ossec_config.global.html#element-logall
