Monitoring with Nagios
Using the above criteria, we explored a few packages and ended up choosing Nagios. The decision was clinched when we discovered that, in addition to the main website, there is a separate community website where people discuss the package. We also found a number of people offering commercial support for Nagios, which suggests some people think it is worth paying for.
Nagios is a classic open-source offering: you download the source code, configure it, compile it and install it, after which you read the manual and realise it is not going to be as simple as you thought. To be honest, the first few hours were very confusing, but now that we know how it works, we can share the information with you.
First, to make Nagios work, you will need more than just Nagios. The Nagios package co-ordinates a set of other tools into a single monitoring interface. It does all the hard stuff: working out what should be checked and when, deciding who should be told when something breaks, accepting acknowledgements so people do not get warned twice, and scheduling downtime. However, there are two important things the Nagios package does not do: it does not include any software to carry out the checks, and it does not include any conduit for telling people when things are broken. The latter is not a big problem, as you can use your existing mail or messaging software, but the former is less obvious. You soon discover there is a separate project that produces the actual tools (called plug-ins) for carrying out the checks, and a starter set is available from the Nagios Plug-ins project. To make Nagios work, you have to download and install these plug-ins as well.
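To make the division of labour concrete, here is a minimal sketch in Python of how a scheduler can run a plug-in and interpret the result. It is not how Nagios itself is implemented. It relies only on the plug-in convention that a check prints a line of status text and exits with 0 (OK), 1 (warning), 2 (critical) or 3 (unknown). The function name run_plugin and the stand-in plug-in command are our own, chosen for illustration.

```python
import subprocess
import sys

# Conventional plug-in exit codes and the states they stand for.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(argv, timeout=10):
    """Run one check command and return (state, first line of its output)."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A check that cannot be run tells us nothing about the service.
        return "UNKNOWN", f"could not run check: {exc}"
    lines = result.stdout.strip().splitlines()
    first_line = lines[0] if lines else ""
    # Any exit code outside the convention is treated as UNKNOWN.
    return STATES.get(result.returncode, "UNKNOWN"), first_line


if __name__ == "__main__":
    # Hypothetical stand-in for a real plug-in: prints a status line, exits 1.
    fake_plugin = [
        sys.executable,
        "-c",
        "print('DISK WARNING - 85% used'); raise SystemExit(1)",
    ]
    state, output = run_plugin(fake_plugin)
    print(state, "|", output)
```

The point of the sketch is that the monitoring side needs to know nothing about disks or CPUs; everything specific to a service lives in the plug-in.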
Now you are ready to get started and, for reasons we will explain in a second, we suggest you set up Nagios to only check services on the machine you are running it on: monitor things like the load on the CPU and the disk usage. To do this, you need to define a collection of information in a set of config files, and to achieve that you really must read the documentation. However, here is a very sketchy overview.
Let’s assume you are going to check a bunch of services on a machine called mymachine. First, you need to describe that machine in the hosts.cfg configuration file. Look at the documentation and the example file to see how to do this. You then need to define the services you want to check on that machine in the services.cfg file. In the example file, you will see that you define a service on a hostname (mymachine in this example) and that the service description includes a check command declaration. How this works is defined in the checkcommands.cfg file, which states which plug-in is called to check which service. The example configuration files also include contactgroups.cfg, which contains the information that decides who gets contacted when a service fails. Once you have defined a set of services for your local machine, you can check your configuration files and start the software.
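The way the three files refer to one another can be sketched as follows. This short Python script only prints text in the style of Nagios object definitions; it is an illustration, not a working configuration. The directives shown are a small subset, and the command name check_local_disk, its arguments and the helper render_object are assumptions for the example. Real definitions need further directives (check intervals, contact groups and so on), so treat the documentation and the example files as the authority.

```python
def render_object(kind, directives):
    """Render one Nagios-style object definition as text."""
    width = max(len(name) for name in directives)
    lines = [f"define {kind} {{"]
    for name, value in directives.items():
        lines.append(f"    {name.ljust(width)}  {value}")
    lines.append("}")
    return "\n".join(lines)


# hosts.cfg: describes the machine itself.
HOST = {
    "host_name": "mymachine",
    "alias": "My local machine",
    "address": "127.0.0.1",
}

# services.cfg: host_name ties the service to the host above, and
# check_command names a command plus its '!'-separated arguments.
SERVICE = {
    "host_name": "mymachine",
    "service_description": "Root disk",
    "check_command": "check_local_disk!20%!10%!/",
}

# checkcommands.cfg: maps the command name to the plug-in that is run.
COMMAND = {
    "command_name": "check_local_disk",
    "command_line": "$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$",
}

if __name__ == "__main__":
    for kind, directives in (
        ("host", HOST),
        ("service", SERVICE),
        ("command", COMMAND),
    ):
        print(render_object(kind, directives))
        print()
```

Reading the output from top to bottom follows the chain described above: the service names a host and a check command, and the check command names the plug-in.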
When you define any service, you typically give two levels of service that can cause an event: what will cause a warning event and what will cause a critical event. For example, if you are monitoring disk space, you might specify that when a disk is 80 per cent full you want a warning, but once it is 90 per cent full this is critical. Again, within the service description, you can say who gets notified of warnings and who gets told of critical events.
Once the software has started, you can check the state of your services via the web interface, which was installed along with Nagios. To really make this work, you should install it into a web server tree where people have to log in to see content. If you are using Apache, you need to use authentication directives and require a valid user. You can then control who is able to see what, and what they are permitted to do, via the Nagios configuration file cgi.cfg.
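The two-level idea in the disk space example can be sketched as a tiny plug-in in Python. This is our own illustration rather than one of the standard plug-ins: the names classify and disk_percent_used and the default thresholds of 80 and 90 per cent are taken from the example above, and the exit codes follow the plug-in convention of 0 for OK, 1 for warning and 2 for critical. Note that the standard check_disk plug-in expresses its thresholds as free space rather than used space.

```python
import shutil
import sys


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to a plug-in exit code and a label."""
    if percent_used >= crit:
        return 2, "CRITICAL"
    if percent_used >= warn:
        return 1, "WARNING"
    return 0, "OK"


def disk_percent_used(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    pct = disk_percent_used("/")
    code, label = classify(pct)
    # One line of human-readable status, then the exit code carries the state.
    print(f"DISK {label} - {pct:.1f}% used on /")
    sys.exit(code)
```

Checking the critical threshold first matters: a disk that is 95 per cent full is past both levels, and it should be reported as critical, not as a warning.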