Check_mk, pnp4nagios and mod_gearman

Ever since I’ve been working with Centreon, I’ve found it to be a relatively stable all-in-one solution. Its all-in-one aspect makes it extremely convenient (for example, graphing is taken care of without any third-party modules). Moreover, since Centreon is built on top of nagios / icinga, you get all the openness and flexibility of the Nagios core. Using Nagios, it’s possible to monitor pretty much anything.

But Centreon has a major shortcoming… its dependence on a database. Most commonly, this database is MySQL. The database makes things slow once the number of monitored checks grows large enough. Centreon’s WebUI reads information from the centstatus / centreon_status database. In my case, we were monitoring approximately 3,000 checks when we hit a bottleneck. For some reason, after a nagios restart/reload (say, to implement a new check), it would take ages (about 15 mins) for Centreon’s WebUI to display all the hosts and services being checked, if it showed them at all. Also, after scheduling an immediate check, it would take about 5-7 minutes for Centreon’s WebUI to get updated.

At first, we thought it was the nagios core, having heard that nagios doesn’t scale well. But that’s actually a lie. I installed mod_gearman and nagios easily handled the workload: 3,000 checks from a single server without any problems. Mod_Gearman is a keeper: easy to install, with massive benefits. Until Nagios core includes mod_gearman functionality, it’s always my first performance-enhancing install right after nagios. The nagios and mod_gearman combo executed all checks on time.

So next we turned our attention to the database. This turned out to be the bottleneck. No matter what method was used to interface between centreon and MySQL, it was taking too long to read from and write to the database. I tried several methods to improve performance. We updated to the latest NdoUtils, and we tried two alternatives: IdoUtils (from icinga) and Centreon Broker. I tried decreasing the flush queue, tried unix sockets instead of tcp sockets, tried tweaking mysql itself… Still no joy.

So, it seems the best way is to completely ditch the database. Mathias Kettner came up with a brilliant solution of reading the current state of checks from the nagios core: check_mk livestatus. His solution is simple to implement, extremely fast and scalable, and easy to interface with. It seems the natural way to go. Unfortunately, this means ditching centreon until they decide to implement livestatus functionality. They seem to be (unwisely) ignoring the request to implement it:

http://forge.centreon.com/issues/2402
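
To give a feel for how easy Livestatus is to interface with, here is a minimal Python sketch that asks the nagios core for all services that are not OK. It is only an illustration: the socket path is an assumption (it depends on how the livestatus broker module is configured on your system), and the helper names build_query and query_livestatus are made up for this example.

```python
import json
import socket

# Assumed socket path; use whatever path your livestatus
# broker_module line in nagios.cfg actually specifies.
LIVESTATUS_SOCKET = "/var/lib/nagios/rw/live"


def build_query(table, columns, filters=()):
    """Build a Livestatus query string that requests JSON output."""
    lines = ["GET %s" % table, "Columns: %s" % " ".join(columns)]
    lines += ["Filter: %s" % f for f in filters]
    lines.append("OutputFormat: json")
    # A blank line terminates the query.
    return "\n".join(lines) + "\n\n"


def query_livestatus(query, path=LIVESTATUS_SOCKET):
    """Send one query over the unix socket and return the decoded rows."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
        sock.sendall(query.encode("utf-8"))
        # Closing our write side tells Livestatus the request is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    return json.loads(b"".join(chunks).decode("utf-8"))


# Example usage (needs a running nagios with livestatus loaded):
#   q = build_query("services", ["host_name", "description", "state"],
#                   ["state != 0"])
#   for host, description, state in query_livestatus(q):
#       print(host, description, state)
```

The point is that there is no database in the path at all: the answer comes straight from the running nagios process, which is why the GUI stays up to date after a restart or a forced check.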

Case made against centreon, so what to use to replace it? There is a myriad of open source tool combinations that seem to work. I wanted a scalable, fast and all-in-one solution similar to centreon. This is what I went for:

– Core: nagios and mod_gearman. This combination worked wonders for me. It’s fast and easy. Nagios’ community base makes it easy to troubleshoot. Nagios’ API makes it easy to script any monitoring check.

– GUI: check_mk multisite. Built on Mathias Kettner’s aforementioned Livestatus. Fast, lightweight, and easy to use, this GUI is exactly what I was looking for. It also provides configuration of nagios. Multisite also makes it easy to integrate multiple sites into a single dashboard.

– Graphing: pnp4nagios. This tool uses rrdtool (same as centreon). The other main contender was nagiosgraph, but pnp4nagios won out for two main reasons:

* it integrates really well with multisite. Multisite allows pnp4nagios graphs to be integrated seamlessly within its GUI

* it integrates with mod_gearman. Yep, it does… meaning perfdata from nagios gets parsed quickly… again, it’s fast and scalable. Since I already use mod_gearman for the performance benefits it gives to nagios, it’s a no-brainer to include mod_gearman as a perfdata processor.

Note: kudos to the developers of these tools. The tight and seamless integration between the different tools is amazing, and the advantages of each tool blend together to make an amazing single product.

Here’s the techie part of this article… I had some difficulty getting pnp4nagios to integrate with mod_gearman. It’s only due to a lack of clear documentation, so I’ll try to explain what worked for me.

– I installed mod_gearman via the repositories provided here:

https://labs.consol.de/repo/

– I followed the usual instructions for integrating mod_gearman into nagios, adding the following lines into nagios.cfg:

event_broker_options=-1
broker_module=/usr/lib/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_worker.conf

– I set up pnp4nagios with mod_gearman as per the instructions here:

http://docs.pnp4nagios.org/pnp-0.6/config#gearman_mode

However, I needed to make the following changes before it would work:

1. I checked the mod_gearman_neb.conf file and saw the following interesting option (and changed it from the default no to yes):

# defines if the module should distribute perfdata
# to gearman.
# Note: processing of perfdata is not part of
# mod_gearman. You will need additional worker for
# handling performance data. For example: pnp4nagios
# Performance data is just written to the gearman
# queue.
# Default: no
perfdata=yes

2. I changed the broker_module line in nagios.cfg to point to mod_gearman_neb.conf instead of mod_gearman_worker.conf:

event_broker_options=-1
broker_module=/usr/lib/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_neb.conf

That did it 🙂 Check out the screenshots from Multisite’s site:

http://mathias-kettner.de/bilder/1276533368.png
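
If you want to double-check the two changes described above, a small script along these lines could help. It is a rough sketch rather than an official tool: the function names are made up for this example, and it assumes that when perfdata appears more than once in the file the last uncommented setting wins (I have not verified how mod_gearman itself treats duplicates).

```python
import re


def broker_module_configs(nagios_cfg_text):
    """Return the config= path of every mod_gearman broker_module line."""
    paths = []
    for line in nagios_cfg_text.splitlines():
        line = line.strip()
        # Skip comments and anything that is not a broker_module line.
        if line.startswith("#") or not line.startswith("broker_module="):
            continue
        if "mod_gearman" not in line:
            continue
        match = re.search(r"\bconfig=(\S+)", line)
        if match:
            paths.append(match.group(1))
    return paths


def perfdata_enabled(neb_conf_text):
    """Return True if the last uncommented perfdata= setting is 'yes'."""
    enabled = False  # the file's own comment documents "Default: no"
    for line in neb_conf_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "perfdata":
            enabled = value.strip().lower() == "yes"
    return enabled


# Example usage with the paths from this article:
#   with open("/etc/nagios/nagios.cfg") as f:   # assumed location
#       print(broker_module_configs(f.read()))
#   with open("/etc/mod_gearman/mod_gearman_neb.conf") as f:
#       print(perfdata_enabled(f.read()))
```

For the setup described here, the first function should report the mod_gearman_neb.conf path and the second should return True.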


9 thoughts on “Check_mk, pnp4nagios and mod_gearman”

    1. Hi Tobias,

      Thanks for the feedback. Interesting link. We actually do have satellite pollers in another environment, 3 pollers to be exact, running mod_gearman too to reduce load. I found this helped in reducing the load on the central server, but even with less load the database still performed poorly. We had actually moved the database to a separate virtual server, but we ran into another problem… The database caused a large I/O load on the ESX host and impacted other virtual guests on the system, so we abandoned that idea. Admittedly though, I had not tuned the mysql instance in that particular case, which could have reduced the issue, plus the server was quite low end if I recall correctly.

      One suggestion I’ve had was to use Centreon Broker and point it towards the Oracle RACs which we had present in our environment. However, due to management reasons I was never able to test this out 😦

      Dave

      1. thanks for your response (also to the other post you’ve made)

        hm… it’s really a hard decision to make.
        currently i’ve implemented centreon with nagios in our environment on a centos 5 server
        this is running for about a year now. we have around 400 hosts and about 2000 service checks on a single server (esx vm).

        now by reading your post i’m a little afraid that we also come to the mentioned bottleneck over time as new customer servers will be installed and we’ll reach the 3000 checks next year probably.

        today we’re running three (3!) monitoring solutions at the same time with different approaches (nagios, orion, mom) but the aim is to just have a single solution and my idea was to implement a new installation of icinga / centreon in hope for better performance.

        but now you’ve got me thinking 😉
        i just had a look at omd (check mk) for a few hours, playing around with this solution. it’s great and easy to use but i felt a bit more flexible with centreon because of the “nagios like” configuration (like having my own checks, different thresholds for customers, complex notifications, …). i know this would also be possible with check mk and probably even easier, i just need to get used to this different approach since i’m so used to centreon / groundwork like frontends.

        (btw. just following you on twitter now as you suggested in the other post 😉 )

      2. Thanks for the follow!

        Your concerns were exactly one of our debates at work… I ended up suggesting we keep the centreon frontend since it was the best one we had tried, and besides, we had already gone through the learning curve… but intentionally “breaking” or leaving out the parts of the config that the other tools provide more effectively.

        The idea is to remove the ndomod and ndocfg configurations by simply disabling them or disabling mysql. That should stop activity to the database but leave the configuration part of centreon still active, leaving pnp4nagios, livestatus, check_mk and mod_gearman to do the rest.

        Still untested though, so it’s still at the idea phase at the moment.

  1. Hi Dvas0004,

    Thanks for this post.
    Working at Merethis, Centreon editor, I would like to add some information.

    First, the bottleneck part: it seems really strange that Nagios needs 15 minutes to flush data to the database with so few services.
    Generally this kind of problem appears with more than 40 000 services. A Nagios restart should not take more than a couple of seconds in your situation.
    You should have used the NDOutils patched by Merethis, which flushes less info to the DB.
    For your information, we support several installations with roughly 100 000 services on a distributed architecture.

    The software suite you are talking about is really interesting, and we will certainly propose support for livestatus for our monitoring screens. But we are also working on performance and logs to provide advanced reporting, a big need for a number of users, and we need SQL support to do that. So both usages might coexist.

    If you can take time again in the next weeks, I will be happy if you can test again Centreon Enterprise Server with several performance enhancements, in particular the fact that Engine+Broker will be fully integrated in it, and other stuff.

    Best regards.

    1. Hi

      Thanks so much for taking the time to reply!

      It is a strange situation that I had encountered, which I found difficult to reproduce since I do not have the scale necessary in my test environments. However, it was clear that the DB was the issue, since nagios would take such a long time to restart because it would get “stuck” on NDOUtils. Without NDOUtils, as you point out, nagios restarted in a couple of seconds. It is very probably some database issue, which is why I had wanted to bypass the database itself, due to time constraints and my not having the necessary DB skills.

      The Centreon Enterprise Server is a very attractive option, I will definitely test it out as soon as I can, and hopefully see it scale up to the figures you mention

      Overall, it would be awesome to see the check_mk functionality integrated into centreon, because I think Centreon’s WebUI is by far the best Nagios UI I’ve tried, both in terms of administering and actual monitoring, and having one less database to worry about is an added bonus! 🙂

      Dave

    thanks for replying Romain. great to read that check mk integration could be possible in the future and that it’s possible to have more checks than the roughly 3000 with some tweaks.

    by patched version you mean “c) the patch for official version” and not the “b) the modified version” as it’s just for an older release of ndoutils, right? http://en.doc.centreon.com/Setup:ndoutils

    Both Check_MK (livestatus/multisite) and mod_gearman are used in setups with multiple million checks (in Check_MK’s case *per minute*), so I hope you continue to optimize DB calls and people will be able to run similar size setups!

    Btw, also look into Naemon, the Nagios4 fork, where the restart computation of dependencies is down to 3-4 seconds for 0.5m services even in a VM on a *laptop*.

    Many cool changes ahead 🙂
