Check_mk, pnp4nagios and mod_gearman

Ever since I’ve been working with Centreon, I’ve found it to be a relatively stable all-in-one solution. Its all-in-one aspect makes it extremely convenient (for example, graphing is taken care of without any third-party modules). Moreover, since Centreon is built on top of nagios / icinga, you get all the openness and flexibility of the Nagios core. Using Nagios, it’s possible to monitor pretty much anything.

But Centreon has a major shortcoming… its dependence on a database. Most commonly, this database is MySQL. The database slows things down once you monitor more than a certain number of checks. Centreon’s WebUI reads its information from the centstatus / centreon_status database. In my case, we’ve been monitoring approximately 3,000 checks, and we hit a bottleneck. For some reason, after a nagios restart/reload (say, to implement a new check), it would take ages (about 15 mins) for the Centreon WebUI to display all the hosts and services being checked, if it showed them at all. Also, after scheduling an immediate check, it would take about 5-7 minutes for Centreon’s WebUI to get updated.

At first, we thought it was the nagios core, having heard that nagios doesn’t scale well. But that turned out not to be true. I installed mod_gearman and nagios easily handled the workload: 3,000 checks from a single server without any problems. Mod_Gearman is a keeper, easy to install, with massive benefits. Until Nagios core includes mod_gearman functionality, it’s always my first performance-enhancing install right after nagios. The nagios and mod_gearman combo executed all checks on time.

So next we turned our attention to the database. This turned out to be the bottleneck. No matter what method we used to interface between centreon and MySQL, it was taking too long to read from and write to the database. I used several methods to try to improve performance. We updated to the latest NdoUtils, and we tried two alternatives: IdoUtils (from icinga) and Centreon Broker. I tried decreasing the flush queue, tried unix sockets instead of tcp sockets, tried tweaking mysql itself… Still no joy.

So, it seems the best way is to ditch the database completely. Mathias Kettner came up with a brilliant solution for reading the current state of checks straight from the nagios core: check_mk livestatus. His solution is simple to implement, extremely fast and scalable, and easy to interface with. It seems the natural way to go. Unfortunately, this means ditching centreon until they decide to implement livestatus functionality. They seem to be (unwisely) ignoring the request to implement it:

http://forge.centreon.com/issues/2402
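
To give a feel for what “easy to interface with” means, here is a minimal sketch of a Livestatus query from Python. It is only an illustration: the socket path is an assumption (it depends on how the livestatus module is configured on your box), and the helper names build_query, query_livestatus and print_problems are made up for this example. Livestatus itself is just a Unix socket that accepts a short text request terminated by a blank line.

```python
import json
import socket

# Assumption: path of the Livestatus Unix socket; it depends on how the
# livestatus broker module was configured on your system.
LIVESTATUS_SOCKET = "/var/lib/nagios/rw/live"


def build_query(table, columns, filters=()):
    """Build a Livestatus request that asks for JSON output."""
    lines = ["GET " + table, "Columns: " + " ".join(columns)]
    lines += ["Filter: " + f for f in filters]
    lines.append("OutputFormat: json")
    # A blank line terminates the request.
    return "\n".join(lines) + "\n\n"


def query_livestatus(query, path=LIVESTATUS_SOCKET):
    """Send one query over the Unix socket and return the decoded rows."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # tell Livestatus the request is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


def print_problems():
    """Print every service that is currently not OK (state 0 means OK)."""
    query = build_query(
        "services", ["host_name", "description", "state"], ["state != 0"]
    )
    for host, service, state in query_livestatus(query):
        print(host, service, state)
```

Calling print_problems() on the monitoring server should list the services that are not OK, with no database involved at any point.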

With the case made against centreon, what to use to replace it? There is a myriad of open source tool combinations that seem to work. I wanted a scalable, fast and all-in-one solution similar to centreon. This is what I went for:

– Core: nagios and mod_gearman. This combination worked wonders for me. It’s fast and easy. Nagios’ community base makes it easy to troubleshoot. Nagios’ API makes it easy to script any monitoring check.

– GUI: check_mk multisite. Built on Mathias Kettner’s aforementioned Livestatus. Fast, lightweight, and easy to use, this GUI is exactly what I was looking for. It also provides configuration of nagios. Multisite also makes it easy to integrate multiple sites into a single dashboard.

– Graphing: pnp4nagios. This tool uses rrdtool (same as centreon). The other main contender was nagiosgraph, but pnp4nagios won out for two main reasons:

* it integrates really well with multisite. Multisite allows pnp4nagios graphs to be integrated seamlessly within its GUI

* it integrates with mod_gearman. Yep, it does… meaning perfdata from nagios gets parsed quickly… again, it’s fast and scalable. Since I already use mod_gearman for the performance benefits it gives to nagios, it’s a no-brainer to include mod_gearman as a perfdata processor.
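
For readers who haven’t written a check before, here is a rough sketch of what a plugin looks like and where the perfdata comes from. The exit codes (0 to 3) and the “text | perfdata” output line are the standard Nagios plugin conventions; everything else (the disk_used label, the thresholds, the function names) is hypothetical and only there for illustration. The part after the pipe is the perfdata that, in this setup, ends up in the gearman queue for pnp4nagios to graph.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a plugin exit code (higher value is worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(label, value, warn, crit, uom="%"):
    """Return (exit_code, output line) in 'text | perfdata' form."""
    code = evaluate(value, warn, crit)
    text = "{} - {} is {}{}".format(STATUS_NAMES[code], label, value, uom)
    # Perfdata format: label=value[UOM];warn;crit
    perfdata = "{}={}{};{};{}".format(label, value, uom, warn, crit)
    return code, text + " | " + perfdata


def main():
    # Hypothetical measurement; a real check would read this from the system.
    code, line = format_output("disk_used", 42, warn=80, crit=90)
    print(line)
    sys.exit(code)
```

Running main() prints a single status line and exits with the matching code, which is all nagios needs from a check.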

Note: kudos to the developers of these tools. The tight and seamless integration between the different tools is amazing, and the advantages of each tool blend together to make an amazing single product.

Here’s the techie part of this article… I had some difficulty getting pnp4nagios to integrate with mod_gearman. That was only due to a lack of clear documentation, so I’ll try to explain what worked for me.

  • I installed mod_gearman via the repositories provided here:

https://labs.consol.de/repo/

  • I followed the usual instructions for integrating mod_gearman into nagios, adding the following lines into nagios.cfg:
event_broker_options=-1
broker_module=/usr/lib/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_worker.conf

– I set up pnp4nagios with mod_gearman as per the instructions here:

http://docs.pnp4nagios.org/pnp-0.6/config#gearman_mode

However, I needed to make the following changes before it would work:

1. I checked the mod_gearman_neb.conf file and saw the following interesting option (and changed it from the default no to yes):

# defines if the module should distribute perfdata
# to gearman.
# Note: processing of perfdata is not part of
# mod_gearman. You will need additional worker for
# handling performance data. For example: pnp4nagios
# Performance data is just written to the gearman
# queue.
# Default: no
perfdata=yes

2. I changed the nagios.cfg configuration so that the broker module points to mod_gearman_neb.conf instead of mod_gearman_worker.conf:

event_broker_options=-1
broker_module=/usr/lib/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_neb.conf

That did it 🙂 Check out the screenshots from Multisite’s site:

http://mathias-kettner.de/bilder/1276533368.png
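
If you want to double-check the two changes from steps 1 and 2 on your own install, something along these lines would do. It is a sketch, not a supported tool: the function names are made up, it only looks at the text of the two files, and it assumes that when perfdata= appears more than once the last uncommented line is the one that counts.

```python
import re


def broker_config_path(nagios_cfg_text):
    """Return the config= path of the mod_gearman broker_module line, or None."""
    for line in nagios_cfg_text.splitlines():
        line = line.strip()
        if not line.startswith("broker_module=") or "mod_gearman" not in line:
            continue
        match = re.search(r"\bconfig=(\S+)", line)
        if match:
            return match.group(1)
    return None


def perfdata_enabled(neb_conf_text):
    """Return True if perfdata is set to 'yes' (assumption: last setting wins)."""
    enabled = False
    for line in neb_conf_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "perfdata":
            enabled = value.strip().lower() == "yes"
    return enabled


def find_problems(nagios_cfg_text, neb_conf_text):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    path = broker_config_path(nagios_cfg_text)
    if path is None:
        problems.append("no mod_gearman broker_module line in nagios.cfg")
    elif not path.endswith("mod_gearman_neb.conf"):
        problems.append("broker_module points at " + path + ", not the NEB config")
    if not perfdata_enabled(neb_conf_text):
        problems.append("perfdata is not set to yes in the NEB config")
    return problems
```

Feed it the contents of nagios.cfg and /etc/mod_gearman/mod_gearman_neb.conf (for example via open(path).read()) and print whatever find_problems returns.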