David Vassallo's Blog

If at first you don't succeed; call it version 1.0

OSSEC event loss troubleshooting

There is a general consensus that OSSEC will lose events if the main OSSEC server goes offline for whatever reason ([1], [2]), be it a stopped service, a network disconnection, or anything in between. However, there doesn’t seem to be much information on exactly when event loss can occur, for how long, and how the OSSEC agent recovers. In this article we walk through the troubleshooting steps we took to answer these questions.

Test Environment:

  • Windows Server 2012, installed as a VMware guest.

  • VMware ESX v5

  • AlienVault All-in-one (commercial, not OSSIM) acting as OSSEC server

The OSSEC agent was deployed to the Windows server using the AlienVault GUI, and the agent was confirmed to be active:


The OSSEC server was placed in debug logging mode (using the <logall> global directive [3]). Next, we tested whether application event logs were being sent from the agent to the server. On the client, which is the Windows Server 2012 VM, we do the following:


The above screenshot shows using the “eventcreate” command to generate two application logs:

before disconnecting 1

before disconnecting 2
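
Since the screenshot is not reproduced here, the following is a minimal sketch of how such numbered test events could be generated from a script. It assumes the standard Windows “eventcreate” switches (/T, /ID, /L, /SO, /D); the event ID of 100 and the source name “OSSEC-Test” are hypothetical values picked for illustration, not the ones used in the original test.

```python
import subprocess
import sys


def build_eventcreate_command(description, event_id=100, source="OSSEC-Test"):
    """Return the argument list for one Application-log test event.

    event_id and source are illustrative assumptions, not values from the post.
    """
    return [
        "eventcreate",
        "/T", "INFORMATION",   # event type
        "/ID", str(event_id),  # event ID (eventcreate accepts 1-1000)
        "/L", "APPLICATION",   # target event log
        "/SO", source,         # event source name (hypothetical)
        "/D", description,     # description, used as the marker text
    ]


def create_events(prefix, count):
    """Create `count` numbered events such as 'before disconnecting 1'."""
    for n in range(1, count + 1):
        cmd = build_eventcreate_command(f"{prefix} {n}")
        if sys.platform == "win32":
            # Needs an elevated prompt on the Windows server.
            subprocess.run(cmd, check=True)
        else:
            # Dry run on other platforms: only show the command.
            print("would run:", " ".join(cmd))
```

Calling create_events("before disconnecting", 2) would then produce the two markers listed above.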

We then confirmed that these were being received by the OSSEC server as expected:


Of note in the screenshot above is that the “logall” directive causes the server to log to the /var/ossec/logs/archives/archives.log path.
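
Checking the archive by eye gets tedious once there are several batches of test events. As a rough sketch, the helper below reports which marker strings appear anywhere in the archive file. The function names are made up for this example, and it makes no assumption about the log line format beyond the marker text appearing in the line.

```python
ARCHIVES_LOG = "/var/ossec/logs/archives/archives.log"  # path mentioned in the post


def markers_seen(lines, markers):
    """Map each marker string to True if any log line contains it."""
    seen = {marker: False for marker in markers}
    for line in lines:
        for marker in markers:
            if marker in line:
                seen[marker] = True
    return seen


def check_archives(markers, path=ARCHIVES_LOG):
    """Scan the archive file on the OSSEC server for the given markers."""
    with open(path, errors="replace") as fh:
        return markers_seen(fh, markers)
```

On the OSSEC server, check_archives(["before disconnecting 1", "before disconnecting 2"]) would return a dictionary showing which of the two events arrived.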

Now we disconnect the NIC from the VM as shown below:


This simulates a general network failure, as shown by the ping results below. Immediately afterwards, we generate three more application logs (after disconnecting 1, after disconnecting 2, after disconnecting 3):


These events never make it through to the OSSEC server. However, if we wait for 240 seconds – the configured timeout value – we now see in the OSSEC agent logs:


So now we generate three more application events:


These are after 240 seconds 1, after 240 seconds 2 and after 240 seconds 3. For now, these still do not appear in the OSSEC server since the NIC is still disconnected.

We then re-enable the NIC and reconnect it, reversing the step shown before. We also generate some more application events (after reconnect 1, after reconnect 2, after reconnect 3) and monitor the OSSEC server logs. After a few seconds the logs arrive at the server, correctly timestamped, with no intervention or restart on our end:


So in conclusion, caching does occur, but only if one waits for the 240 seconds to elapse. Events during those 240 seconds will be lost. One should note that this 240 seconds is configurable from the OSSEC agent via the <time-reconnect> directive, shown below:


One can of course reduce the values of notify_time and time_reconnect. In the source code for the OSSEC agent, the default time_reconnect value is set to three times (3x) the notify_time value, which is a sane default in my opinion. There’s a trade-off here between performance and reliability. Since OSSEC is UDP based, the only way for the agent to know whether the OSSEC server is offline is to send periodic keepalives (notifications). Setting notify_time lower means more network traffic and more processing on the OSSEC agents and server, but fewer logs lost in the event of a disconnection.

So unless OSSEC is converted to TCP (at the cost of performance, due to TCP’s overhead compared to UDP), it seems there will always be the possibility of log loss.
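
To make the loss window concrete, here is a toy model of what we observed in the test. It is only a sketch of the behaviour seen above, not OSSEC’s actual logic: it assumes the loss window starts exactly at the disconnection and lasts exactly time_reconnect seconds, and the function names are invented for this example.

```python
def classify_event(t_event, t_disconnect, t_reconnect, time_reconnect=240):
    """Classify an event by when it was generated (all times in seconds).

    Toy model of the observations in this post, not OSSEC's real behaviour:
      - before the disconnect or after the reconnect: delivered normally
      - within `time_reconnect` seconds after the disconnect: lost
      - later during the outage: cached by the agent, sent after reconnect
    """
    if t_event < t_disconnect or t_event >= t_reconnect:
        return "delivered"
    if t_event < t_disconnect + time_reconnect:
        return "lost"
    return "cached"


def expected_lost_events(events_per_second, outage_seconds, time_reconnect=240):
    """Estimate the events lost in one outage at a constant event rate."""
    return events_per_second * min(outage_seconds, time_reconnect)
```

Under this model, an agent producing 2 events per second would lose about 480 events in any outage of 240 seconds or longer, which is why lowering the timeout shrinks the loss.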


[1] https://groups.google.com/forum/#!topic/ossec-list/F_izIq3zEi4

[2] https://groups.google.com/forum/#!topic/ossec-list/mQr3L_sqJ-Q

[3] http://ossec-docs.readthedocs.org/en/latest/syntax/head_ossec_config.global.html#element-logall

Plotting the 95th percentile using Centreon

Calculating the 95th percentile of the bandwidth used by a client is a common billing method for ISPs and other service providers [1]. It is therefore also of great interest to the client to plot these values, to keep track of service provider fees and double-check bills.

Plotting the 95th percentile in Centreon is currently not very straightforward, but it is possible without too much hassle once you know what to do. This article documents what you have to do to get a proper bandwidth plot showing the 95th percentile of both incoming and outgoing traffic.

We assume that you already have a normal bandwidth graph showing traffic in / traffic out over an interface. Once you have this graph, perform the following steps:

  • Navigate to Views > Graphs > Virtuals > Metrics


  • Create a new metric. Enter a valid name, and select the host and service for which you’d like to plot the 95th percentile. Normally the host would be the internet facing router, and the interface would be the WAN interface


  • Select a DEF type of VDEF. We use VDEF because the 95th percentile is calculated over an entire range of values, not on individual data points [2]
  • As an RPN, enter the following:

traffic_in,95,PERCENT

Make sure there are no spaces in the above, else you will get an error. For those interested in what is going on above, RPN (reverse polish notation) works by using a “stack”. So the way to read the above is:

- “Push the ‘traffic_in’ dataset onto the stack”

- “Push the variable ‘95’ onto the stack”

- “Calculate the result of the PERCENT function on the previous two values in the stack”

More details can be found here [3]

  • Note that the “traffic_in” will probably need to change for a specific installation, depending on what you have named this variable. Use the “list of known metrics” to select the appropriate name
  • Save the above metric. This will actually return the 95th percentile
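
For those who prefer code to stack diagrams, the sketch below mimics how the VDEF expression described above is evaluated. It is an illustration only: the percentile is taken by sorting the samples and picking one by index, and rrdtool’s own PERCENT may round that index slightly differently. The function names and the sample data are made up for the example.

```python
def percentile(samples, pct):
    """Return the sample at or below which roughly `pct` percent of samples fall."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Pick an element by position in the sorted list (no interpolation).
    index = round(pct * (len(ordered) - 1) / 100)
    return ordered[index]


def eval_vdef(rpn, datasets):
    """Evaluate a tiny subset of RPN: dataset names, numbers and PERCENT."""
    stack = []
    for token in rpn.split(","):
        if token == "PERCENT":
            pct = stack.pop()       # the number pushed last (95)
            samples = stack.pop()   # the dataset pushed first
            stack.append(percentile(samples, pct))
        elif token in datasets:
            stack.append(datasets[token])
        else:
            stack.append(float(token))
    return stack.pop()


# Made-up sample data: 101 readings with values 0..100.
traffic = {"traffic_in": list(range(101))}
print(eval_vdef("traffic_in,95,PERCENT", traffic))
```

The whole dataset collapses into one number, which is exactly why a VDEF cannot be drawn as a curve on its own.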

Now, unfortunately Centreon does not allow you to plot a VDEF directly (notice the “hidden graph” checkbox in the screenshot above? It cannot be de-selected). So we need to employ a little trick to turn this VDEF into a CDEF, which can be plotted. This is how that is done:

  • Again create a new virtual metric, giving it an appropriate name
  • Again select the same host and service you had previously used for the VDEF above.
  • This time, select a DEF type of CDEF


  • Here’s the trick… in the RPN field, enter the following:

traffic_in,95_in_vdef,+,traffic_in,-

Again, make sure you have no spaces in the above. Also, note that “95_in_vdef” is the name of the VDEF you previously created; it should be listed under “List of known metrics”. What we’re doing here is simple. Break down the RPN and it should be clearer:

traffic_in,95_in_vdef,+

This adds “95_in_vdef”, which you previously calculated, to each individual data point of “traffic_in” (the traffic of the interface) and places the result back onto the stack. We’re only really interested in the value of “95_in_vdef”, so we then subtract traffic_in from that result to leave just 95_in_vdef, which is what “traffic_in,-” is doing.

So in essence we have the following formula:

traffic_in + 95th percentile - traffic_in = 95th percentile
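
The same formula applied point by point looks like the sketch below. The function name and the numbers are invented for illustration; the point is only that every data point of the series ends up carrying the single percentile value, which gives a flat line that can be plotted.

```python
def cdef_flat_line(series, vdef_value):
    """Apply 'series + vdef_value - series' to every data point.

    With integer samples the result is exactly vdef_value at each point;
    with floating-point samples tiny rounding differences are possible.
    """
    return [(x + vdef_value) - x for x in series]


# Made-up traffic samples and a made-up 95th percentile of 200.
print(cdef_flat_line([10, 250, 40], 200))  # prints [200, 200, 200]
```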

Now since it’s a CDEF, we can plot this. You need to do the same for traffic_out of course. First the VDEF:

VDEF for outbound traffic: traffic_out,95,PERCENT

Followed by the CDEF:

CDEF for outbound traffic: traffic_out,95_out_vdef,+,traffic_out,-

Now, one should modify the curves of the two CDEFs we just defined to make them stand out a little on the graphs. This is done via the “curves” option shown in the first screenshot above. For example, below we set the color for 95_percent_out (our CDEF for the outbound traffic 95th percentile) to red:


The result would be something like this:


Here you can see both the inbound and outbound 95th percentile lines, in blue and red respectively.


[1] http://en.wikipedia.org/wiki/Burstable_billing

[2] http://oss.oetiker.ch/rrdtool/doc/rrdgraph_data.en.html#___top

[3] http://oss.oetiker.ch/rrdtool/tut/rpntutorial.en.html

