Bringing reliability to OSSEC
As we saw in a previous blog post, OSSEC is UDP based. This is great for performance, and can scale to 1000s of nodes. However, it means there is an inherent problem of reliability. UDP is a connection-less protocol, hence the OSSEC agent has no guaranteed way of knowing that a particular event has been delivered to the OSSEC server. Instead, the architecture relies on heartbeats and keepalives. However, there is still a potential for lost events no matter how short the interval between keepalives. In this article we explore a simple python based broker solution that introduces some (but not complete) reliability into the OSSEC architecture, at the cost of performance.
The first requirement of the broker solution is that it absolutely does not touch any existing code from the current OSSEC solution. It must interfere as little as possible with the current solution, so that if there any updates or changes in OSSEC the broker can either continue to work as normal, or at least be removed and allow OSSEC to work as originally intended. To achieve this, the broker is also going to be split into two components: a TCP server which is installed on the same machine as the OSSEC server, and a proxy-like solution which is installed on the same machine as the OSSEC client.
The general idea is that the OSSEC client is configured to send it’s traffic to 127.0.0.1 rather than directly to the server. The broker client intercepts the UDP packets (which are kept encrypted and compressed, maintaining end to end security), and before sending them on to the OSSEC server, it checks via TCP (reliably) if the broker server is still reachable and if the ossec-remoted process is still alive. If the broker server responds, the the broker client “releases” the packets and forwards them on to the original OSSEC server. If no answer is received from the broker server, the broker client assumes the server is down and buffers the original UDP packets into a queue. After a while, the OSSEC agent will realise the server is down and pause operations (other than keepalives) When the server comes back online the broker client replays back all the packets that have been buffered, so no events would be lost. The general architecture is as follows:
Starting from the client, we have the following code, commented so one can follow along:
The server is significantly simpler, shown below:
Kicking the tires and testing
We use the same troubleshooting and techniques we used in the previous blog post.
First we setup the server, which is also quite straightforward. We just run the ossec_broker_server.py file, and of course ensure that the ossec process is actually running properly. Next, the client. We start off by starting the python client on the windows machine (assuming python is installed), and pointing the OSSEC agent to 127.0.0.1:
We immediately see some output on the ossec broker client, something like so:
We should also check the OSSEC agent logs to make sure it connected successfully to 127.0.0.1:
So far so good… we have communication between the OSSEC agent and the OSSEC server, through the broker. Now, time to test a network interruption. If we simply stop the ossec broker server (simulating such an interruption), we should see the OSSEC agent fail to keep communicating with the OSSEC server:
Now, during this interruption (but before the agent keepalives force a lock on the event viewer, so within a minute in default installs…) we generate some events:
These events would normally be lost, because the agent has not yet had time to realise there is a disconnection. So we now turn the server back on, and check the OSSEC archive logs to check if the above events were delivered anyways:
Success! 🙂 There are some improvements to be made, but the principle is sound, if one can look past the added overhead introduced to accommodate reliability.