Troubleshooting Centreon graphs

Symptom: Centreon stopped graphing performance data completely.

There are many possible reasons for this, and a quick Google search turns up some very good articles. The few I found useful were:

http://en.doc.centreon.com/Setup:Graphs#Perfdata_activation_in_Nagios

http://en.doc.centreon.com/Troubleshooting:Graphs

http://felipeferreira.net/?p=1019

Unfortunately, none of the points mentioned in the above articles brought my graphs back. During the course of troubleshooting, we noted the following:

  • Graphs stopped at almost exactly 2am, so suspicion immediately fell on some sort of scheduled (cron) job, since these normally run between 2 and 4am
  • The RRD files in the default path where Centreon stores its metrics (/var/lib/centreon/metrics) had all stopped being updated at 2am
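
To confirm the second point quickly, something like the following can list the RRD files that have not been written to recently. This is only a rough sketch: the function name, the 15 minute threshold and the lowercase ".rrd" extension are my own assumptions, so adjust them to your install.

```python
import time
from pathlib import Path


def stale_rrd_files(metrics_dir="/var/lib/centreon/metrics",
                    max_age_seconds=900, now=None):
    """Return the .rrd files last modified more than max_age_seconds ago.

    The default path is the one mentioned above; the 900 second threshold
    and the "*.rrd" pattern are assumptions, not Centreon settings.
    """
    now = time.time() if now is None else now
    stale = []
    for rrd in Path(metrics_dir).glob("*.rrd"):
        age = now - rrd.stat().st_mtime  # seconds since the last write
        if age > max_age_seconds:
            stale.append(rrd)
    return sorted(stale)
```

If every file in the directory comes back as stale, and they all share roughly the same last modification time, that time is a good place to start looking in the logs and crontabs.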

However, rrdtool (/usr/bin/rrdtool), the program that creates those RRD files, was still responding properly.

For those lacking some background, the Centreon procedure for graphing metrics follows this flow:

 

[Figure: centreon_graphing (Centreon graphing flow)]

Following the above flow, one thing we noticed in our case was that /usr/local/nagios/var/service-perfdata was quite big (150MB). This was unusual because centstorage normally reads this file about every 5 minutes and, in doing so, empties service-perfdata into service-perfdata_read, so the former file should never grow very large.
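
A check like this is easy to script. The sketch below simply reports the size of the perfdata file and flags it when it passes a threshold; the function names and the 50MB threshold are my own picks for illustration (ours was at 150MB when we noticed), not anything Centreon defines.

```python
import os


def perfdata_size_mb(path="/usr/local/nagios/var/service-perfdata"):
    """Return the size of the perfdata file in MB, or None if it is missing."""
    try:
        return os.path.getsize(path) / (1024 * 1024)
    except FileNotFoundError:
        return None


def looks_backed_up(path="/usr/local/nagios/var/service-perfdata",
                    threshold_mb=50):
    """True if the file exists and is larger than threshold_mb.

    threshold_mb is an arbitrary assumption; pick a value that suits
    the amount of perfdata your pollers produce in 5 minutes.
    """
    size = perfdata_size_mb(path)
    return size is not None and size > threshold_mb
```

A file that keeps growing between two runs a few minutes apart is a stronger hint than any single size reading, since it shows that nothing is consuming the data.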

This pointed towards a centstorage issue. After checking the logs under /usr/local/centreon/log/, we noticed the following entries at exactly the time our problem started:

Waiting for centstorage to exit .. done.
17/2/2012 02:00:13 – Begin centstorage.data_bin purge

17/2/2012 02:02:00 – Finishing centstorage.data_bin purge

Starting centstorage Collector : centstorage

Looking at the script that logs the above (centreonPurge.sh), it appears to create a backup file of the performance data (service-perfdata.bckp). In turn, it seems that centstorage will not read the service perfdata while the .bckp file is present.

 

So the solution was to rerun centreonPurge.sh and delete the file /usr/local/nagios/var/service-perfdata.bckp. Once this file was deleted, Centreon again started processing its graph data.
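
If you want to check for the leftover file before touching anything, a small helper along these lines would do. It is a sketch under my own assumptions: the function names are made up, it defaults to a dry run, and it knows nothing about whether the purge is still running, so check that yourself before deleting.

```python
from pathlib import Path


def leftover_backup(var_dir="/usr/local/nagios/var"):
    """Return the path of service-perfdata.bckp if it exists, else None."""
    backup = Path(var_dir) / "service-perfdata.bckp"
    return backup if backup.exists() else None


def remove_leftover_backup(var_dir="/usr/local/nagios/var", dry_run=True):
    """Report (and optionally delete) a leftover backup file.

    Returns True if a backup file was found. With dry_run=True (the
    default) nothing is deleted; pass dry_run=False to remove the file.
    """
    backup = leftover_backup(var_dir)
    if backup is None:
        return False
    if not dry_run:
        backup.unlink()
    return True
```

Calling remove_leftover_backup() with the defaults only tells you whether the file is there; remove_leftover_backup(dry_run=False) actually deletes it.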