OSSEC event loss troubleshooting

There is a general consensus that OSSEC will lose events if the main OSSEC server goes offline for whatever reason ([1], [2]), be it a stopped service, a network disconnection, or anything in between. However, there doesn’t seem to be much information on exactly when event loss can occur, for how long, and how the OSSEC agent recovers. In this article we walk through the troubleshooting steps we took to answer these questions.

Test Environment:

Windows Server 2012, installed as a VMware guest
VMware ESX v5
AlienVault All-in-one (commercial, not OSSIM) acting as the OSSEC server

The OSSEC agent was deployed to the Windows server using the AlienVault GUI, and the agent was confirmed to be active:

The OSSEC server was set to log every received event, not only those that trigger alerts (using the <logall> global directive [3]). Next, we tested whether application event logs were being sent from the agent to the server. On the client, which is the Windows Server 2012 VM, we do the following:

The above screenshot shows the “eventcreate” command being used to generate two application log entries:

before disconnecting 1

before disconnecting 2
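
The event generation step can also be scripted. The following is a minimal sketch, assuming the built-in Windows eventcreate utility and its /L, /T, /ID, /SO and /D switches. The source name "ossec-test", the event ID 100 and all function names are hypothetical choices made for illustration; they are not taken from the original test.

```python
import subprocess


def build_eventcreate_command(description, event_id=100, source="ossec-test"):
    """Return the argument list for one eventcreate call.

    The event is written to the Application log as an informational entry.
    The default event_id and source are illustrative assumptions.
    """
    if not 1 <= event_id <= 1000:
        raise ValueError("eventcreate accepts event IDs from 1 to 1000")
    return [
        "eventcreate",
        "/L", "APPLICATION",   # target log
        "/T", "INFORMATION",   # event type
        "/ID", str(event_id),  # event ID
        "/SO", source,         # event source name (hypothetical)
        "/D", description,     # free-text description used as the marker
    ]


def marker_descriptions(prefix, count):
    """Build numbered marker strings such as 'before disconnecting 1'."""
    return [f"{prefix} {i}" for i in range(1, count + 1)]


def create_events(prefix, count):
    """Run eventcreate once per marker; only works on a Windows host."""
    for description in marker_descriptions(prefix, count):
        subprocess.run(build_eventcreate_command(description), check=True)
```

Calling create_events("before disconnecting", 2) on the Windows client would write the two entries listed above. Registering a new event source may require an elevated prompt.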

We then confirmed that these were being received by the OSSEC server as expected:

Of note in the screenshot above is that OSSEC’s “logall” directive writes to the /var/ossec/logs/archives/archives.log path.
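
Checking the archive by eye works for a handful of events, but the same check can be scripted. Below is a minimal sketch that looks for the marker descriptions as plain substrings, so it makes no assumption about the exact layout of an archives.log line. The function names are hypothetical.

```python
def find_markers(log_lines, markers):
    """Map each marker string to True if any log line contains it."""
    lines = list(log_lines)
    return {marker: any(marker in line for line in lines) for marker in markers}


def missing_markers(log_lines, markers):
    """Return the markers that do not appear anywhere in the log lines."""
    found = find_markers(log_lines, markers)
    return [marker for marker in markers if not found[marker]]


def check_archive(path, markers):
    """Read a log file from disk and report which markers never arrived."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return missing_markers(handle, markers)
```

On the OSSEC server, check_archive("/var/ossec/logs/archives/archives.log", ["before disconnecting 1", "before disconnecting 2"]) would return an empty list if both events were received.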

Now we disconnect the NIC from the VM as shown below:

This simulates a general network failure, as shown by the ping results below. Immediately afterwards, we generate a further three application logs (after disconnecting 1, after disconnecting 2, after disconnecting 3):

These events never make it through to the OSSEC server. However, if we wait for 240 seconds, the configured timeout value, the following appears in the OSSEC agent logs:

So now we generate three more application events:

These are after 240 seconds 1, after 240 seconds 2 and after 240 seconds 3. For now, they still do not appear on the OSSEC server since the NIC is still disconnected.

We then re-enable the NIC and connect it again, reversing the step shown earlier. We also generate some application events (after reconnect 1, after reconnect 2, after reconnect 3) and monitor the OSSEC server logs. After a few seconds the logs arrive at the server, correctly timestamped, with no intervention or restart on our part:

So in conclusion, caching does occur, but only once the 240 seconds have elapsed. Events generated during those 240 seconds are lost. Note that this 240-second value is configurable on the OSSEC agent via the <time-reconnect> directive, shown below:
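
To check which timer values an agent is actually using, the client section of its configuration can be read programmatically. The sketch below parses an embedded sample rather than a real file. The sample is an assumption made for illustration: the server address is a documentation-range placeholder, and the notify_time of 80 was simply chosen as one third of 240, not read from the test machine. Real ossec.conf files can contain more than one <ossec_config> block, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical agent configuration; values are illustrative only.
SAMPLE_AGENT_CONFIG = """
<ossec_config>
  <client>
    <server-ip>192.0.2.10</server-ip>
    <notify_time>80</notify_time>
    <time-reconnect>240</time-reconnect>
  </client>
</ossec_config>
"""


def read_client_timers(config_text):
    """Return notify_time and time-reconnect (in seconds) from a <client> section.

    A timer that is not set in the configuration is reported as None.
    """
    root = ET.fromstring(config_text.strip())
    client = root.find("client")
    if client is None:
        raise ValueError("no <client> section found")
    timers = {}
    for name in ("notify_time", "time-reconnect"):
        value = client.findtext(name)
        timers[name] = int(value) if value is not None else None
    return timers
```

Calling read_client_timers(SAMPLE_AGENT_CONFIG) returns both values as integers, which makes it easy to compare agents or to flag one where a timer has not been set explicitly.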

One can of course reduce the values of notify_time and time_reconnect. In the source code for the OSSEC agent, the default time_reconnect value is set to three times (3x) the notify_time value, which is a sane default in my opinion. There’s a trade-off here between performance and reliability. Since OSSEC is UDP based, the only way for the agent to know whether the OSSEC server is offline is to send periodic keepalives (notifications). Setting notify_time lower means more network traffic and processing on the OSSEC agents and server, but fewer logs lost in the event of a disconnection.
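
The behaviour observed in this test can be summarised as a simple model: events generated before the reconnect timeout elapses are lost, and later ones are cached. The sketch below encodes only that observation plus the 3x default described above. It is not derived from the agent source code, the function names are hypothetical, and the behaviour exactly at the boundary was not tested.

```python
def default_time_reconnect(notify_time):
    """Default reconnect timeout as described above: three times notify_time."""
    return 3 * notify_time


def classify_events(event_offsets, time_reconnect):
    """Label each event by the number of seconds it occurred after the disconnect.

    Events generated before time_reconnect has elapsed are treated as lost,
    later ones as cached, matching what was observed in this test.
    """
    return [
        "lost" if offset < time_reconnect else "cached"
        for offset in event_offsets
    ]
```

With a time_reconnect of 240, events at 5 and 120 seconds after the disconnect are classified as lost and an event at 250 seconds as cached. Lowering notify_time shrinks the loss window proportionally, at the cost of more keepalive traffic.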

So unless OSSEC is converted to TCP (at the cost of performance, given TCP’s overhead compared to UDP), it seems there will always be the possibility of log loss.

References

[1] https://groups.google.com/forum/#!topic/ossec-list/F_izIq3zEi4

[2] https://groups.google.com/forum/#!topic/ossec-list/mQr3L_sqJ-Q

[3] http://ossec-docs.readthedocs.org/en/latest/syntax/head_ossec_config.global.html#element-logall