Monitoring with Nagios
Last Friday wasn’t a particularly bad day for us, but it certainly was for one of our clients. The previous evening, half their network had disappeared, at least from where I was sitting, and the next morning they told us a network switch had broken and been replaced. Later that afternoon, some of their services disappeared again and we discovered a router was wrongly configured. By six o’clock on Friday evening no-one was answering the phone, so we sent them an email and went down the pub. An hour later, I received a call on my mobile phone: the router was fixed, and would I check everything was working? Normally, we would have asked if it could wait until we got home or returned to the office, but we had just moved all our monitoring over to a piece of software called Nagios, which has a WAP interface. We checked that all their services were working from the comfort of the pub.
However, Nagios is much more than a belated justification for WAP phones. One day, we woke up to discover we had more servers than we knew what to do with and that our business depended on machines we weren’t monitoring very well. How could we find out what all those servers were doing, and know what had gone wrong and when? We already ran a bunch of scripts on various machines to check things like disk space usage, but these were becoming increasingly hard to maintain and didn’t check enough. We also needed to monitor service quality as well as availability: our company creates and hosts websites, and we need to know that pages are being served quickly, both from within our own network and from outside it.
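To give a flavour of the kind of ad-hoc script we were maintaining, here is a minimal sketch of a disk space check, written in the style Nagios plugins use: a one-line status message on standard output and an exit code of 0, 1 or 2 for OK, WARNING or CRITICAL. The thresholds and the choice of the root filesystem are illustrative assumptions, not our actual configuration.

```shell
#!/bin/sh
# Hypothetical disk space check in Nagios plugin style.
# Exit codes follow the Nagios convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.

WARN=80   # warn when the filesystem is 80% full (assumed threshold)
CRIT=90   # go critical at 90% (assumed threshold)

# Classify a usage percentage: print a status line, return the Nagios code.
check_usage() {
    used=$1
    if [ "$used" -ge "$CRIT" ]; then
        echo "DISK CRITICAL - ${used}% used"; return 2
    elif [ "$used" -ge "$WARN" ]; then
        echo "DISK WARNING - ${used}% used"; return 1
    else
        echo "DISK OK - ${used}% used"; return 0
    fi
}

# Current usage of the root filesystem, with the trailing % stripped.
used_pct=$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
check_usage "$used_pct"
```

Scripts like this work well enough one at a time; the problem is running dozens of them across many machines, collecting the results, and noticing when one stops reporting — which is exactly the job Nagios takes over.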
We had to choose between writing more software ourselves, finding an open-source solution or buying in commercial monitoring software. The first route wasn’t really viable, as we needed a cross-platform solution to monitor both Windows and Unix servers; we could write easily enough for Unix, but weren’t confident about writing for Windows. There is commercial software out there that seemed to do what we wanted, but it is often expensive (we hate to spend more on monitoring software than we paid for the server itself). That left open source as our only option.
We soon uncovered a large collection of open-source monitoring software. However, unlike some applications where you know which solution is best – for example, the dominant open-source web server is Apache – in network monitoring there is no clear winner. We had to devise a way to choose, so we applied these three criteria, which we tend to use with all open-source solutions:
1. Does it work with what we have already?
2. Do other people use it?
3. Are many people working on it?
On the first point, we needed to monitor both Unix and Windows servers, and to monitor those parameters that interest us: resource usage (CPU and disk); service availability (is the web server up?); and service quality (how fast is this page generated?). It would also be good if we could monitor the hardware, to check that it is not getting too hot. We didn’t want to have to install much other software either. For example, we use MySQL for virtually all our database work and didn’t want to install a different database server (say, PostgreSQL) just to monitor the network.
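The checks described above map naturally onto Nagios’s object configuration, where hosts and services are declared in plain text files. The fragment below is a sketch, not our actual configuration: the host name, address and thresholds are invented, and `generic-host` and `generic-service` are the template names from Nagios’s sample configuration files.

```
# Hypothetical host definition (name and address are invented).
define host {
    use         generic-host
    host_name   web1
    address     192.0.2.10
}

# Availability and quality in one check: check_http warns if the page
# takes more than 2 seconds and goes critical after 5 (example values).
define service {
    use                  generic-service
    host_name            web1
    service_description  HTTP
    check_command        check_http!-w 2 -c 5
}

# Resource usage: warn below 20% free disk, critical below 10% (example values).
define service {
    use                  generic-service
    host_name            web1
    service_description  Disk
    check_command        check_disk!20%!10%
}
```

The `!`-separated values after the command name are passed to the underlying plugin as arguments, so the same plugin can be reused with different thresholds on different hosts.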
The second and third points, about an existing user base and active development, are equally important. On open-source sites like SourceForge, you will see a lot of projects that seem to do what you want, but either aren’t used by anyone else or have had no development done on them for years. For software you are going to rely on, it is important to know there is a community that can help. A good way to check is to examine any mailing list that comes with the package. If there isn’t one, or it shows no postings for a year or so, there may be a reason.