Symptom: Centreon stopped graphing performance data completely.
There are quite a few reasons why this can happen; in fact, a quick Google search turns up some very good articles. The ones I found useful were:
Alas, none of the suggestions in those articles brought my graphs back. While troubleshooting, we noted the following:
- Graphs stopped at almost exactly 2am (suspicion immediately fell on some sort of scheduled or cron job, since such jobs normally run between 2am and 4am)
- The RRD files in the default path where Centreon stores its metrics (/var/lib/centreon/metrics) had all stopped being updated at 2am
However, the program that creates and updates those RRD files (/usr/bin/rrdtool) was still responding properly.
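To confirm the second symptom across every metric at once rather than spot-checking files, a short script can list the RRD files that have not been written recently. This is a minimal Python sketch: stale_rrd_files is a hypothetical helper, and it assumes the files end in lower-case .rrd and that 15 minutes (roughly three missed 5-minute update cycles) is a reasonable definition of "stale". Adjust both for your setup.

```python
import time
from pathlib import Path


def stale_rrd_files(metrics_dir, max_age_seconds, now=None):
    """Return (path, age_seconds) for each *.rrd file older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(metrics_dir).glob("*.rrd")):
        age = now - path.stat().st_mtime  # seconds since the file was last written
        if age > max_age_seconds:
            stale.append((path, age))
    return stale


if __name__ == "__main__":
    # Assumed threshold: 15 minutes, i.e. about three missed update cycles.
    for path, age in stale_rrd_files("/var/lib/centreon/metrics", 15 * 60):
        print(f"{path}: last updated {age / 3600:.1f} h ago")
```

If every file reports the same age, pointing back to the same hour, that is a strong hint that something stopped the whole pipeline rather than a single check misbehaving.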
For those lacking some background, the Centreon procedure for graphing metrics follows this flow:
Following that flow, one thing we noticed in our case was that /usr/local/nagios/var/service-perfdata was quite big (150MB). This was unusual because centstorage normally reads this file about every 5 minutes and, when doing so, moves its contents into service-perfdata_read, so service-perfdata itself should never grow very large.
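A similar check can catch a perfdata file that is no longer being consumed. This is a minimal Python sketch: perfdata_backlog is a hypothetical helper, and its 10 MB threshold is an assumption rather than a Centreon setting (our stuck file had reached 150MB, while a healthy one should stay small).

```python
from pathlib import Path


def perfdata_backlog(perfdata_path, max_bytes=10 * 1024 * 1024):
    """Return (size_bytes, looks_stuck) for a Nagios service-perfdata file.

    max_bytes is an assumed threshold, not a Centreon setting.
    """
    path = Path(perfdata_path)
    size = path.stat().st_size if path.exists() else 0  # a missing file counts as empty
    return size, size > max_bytes


if __name__ == "__main__":
    size, stuck = perfdata_backlog("/usr/local/nagios/var/service-perfdata")
    note = " (is anything consuming it?)" if stuck else ""
    print(f"service-perfdata: {size / 1024 ** 2:.1f} MB{note}")
```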
The oversized perfdata file pointed towards a centstorage issue. After checking the logs under /usr/local/centreon/log/, we noticed the following entries, logged at exactly the time our problem started:
Waiting for centstorage to exit .. done.
17/2/2012 02:00:13 – Begin centstorage.data_bin purge
17/2/2012 02:02:00 – Finishing centstorage.data_bin purge
Starting centstorage Collector : centstorage
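To line the log up with the point where the graphs stopped, the purge start and end times can be extracted programmatically. This Python sketch assumes the day/month/year timestamp format shown in the log excerpt and accepts either an en dash or a hyphen as the separator; purge_windows is a hypothetical name. On the four log lines from our server it reports a purge running from 02:00:13 to 02:02:00, which matches the 2am gap in our graphs.

```python
import re
from datetime import datetime

# Matches e.g. "17/2/2012 02:00:13 – Begin centstorage.data_bin purge".
# The separator may be an en dash or a plain hyphen depending on how the log was copied.
PURGE_LINE = re.compile(
    r"(\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2})\s+[–-]\s+"
    r"(Begin|Finishing) centstorage\.data_bin purge"
)


def purge_windows(lines):
    """Yield (start, end) datetimes for each Begin/Finishing purge pair."""
    start = None
    for line in lines:
        match = PURGE_LINE.match(line.strip())
        if not match:
            continue  # skip unrelated lines such as the centstorage stop/start messages
        stamp = datetime.strptime(match.group(1), "%d/%m/%Y %H:%M:%S")
        if match.group(2) == "Begin":
            start = stamp
        elif start is not None:
            yield start, stamp
            start = None


if __name__ == "__main__":
    log = [
        "Waiting for centstorage to exit .. done.",
        "17/2/2012 02:00:13 – Begin centstorage.data_bin purge",
        "17/2/2012 02:02:00 – Finishing centstorage.data_bin purge",
        "Starting centstorage Collector : centstorage",
    ]
    for start, end in purge_windows(log):
        print(start, end, end - start)  # 2012-02-17 02:00:13 2012-02-17 02:02:00 0:01:47
```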
Looking at the script that writes those log entries (centreonPurge.sh), it appears to create a backup of the performance data file (service-perfdata.bckp). In turn, centstorage appears not to read service-perfdata while the .bckp file is present.
So the solution was to rerun centreonPurge.sh and delete /usr/local/nagios/var/service-perfdata.bckp. Once this file was deleted, Centreon started processing its graph data again.
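The deletion step can be scripted as a quick check for the future. This Python sketch covers only removing the leftover backup, not rerunning the purge script; clear_stale_backup is a hypothetical helper, and defaulting to a dry run is a deliberate choice so that nothing is deleted until you have confirmed the purge has actually finished.

```python
from pathlib import Path


def clear_stale_backup(perfdata_path, dry_run=True):
    """Report (and unless dry_run, delete) <perfdata_path>.bckp; return True if it existed."""
    backup = Path(str(perfdata_path) + ".bckp")
    if not backup.exists():
        return False
    if dry_run:
        print(f"would remove {backup}")
    else:
        backup.unlink()
        print(f"removed {backup}")
    return True


if __name__ == "__main__":
    # Dry run by default; pass dry_run=False once you are sure the purge has finished.
    clear_stale_backup("/usr/local/nagios/var/service-perfdata")
```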