Where did the speed go?

Recognise this scenario? You buy a new computer to host some application, not a top-of-the-range machine but one that everyone agrees is powerful enough for the job. It runs like a dream for a while, then all goes pear-shaped and the finger-pointing starts: “Application’s fault”; “Admin’s fault”; “Wrong disks in it…” The graphing tools are broken out, everyone has a theory and the machine takes the blame. But if the machine really is to blame, where did all that performance you bought go? With packages bought from multiple vendors this can be really difficult to uncover, and the traditional tools don’t always help.

So in this month’s column, I’ll examine how to analyse machine performance and use the results in diagnosing problems. I’ll mostly be talking about dtrace running on Solaris and Macs, but before the rest of you slink off to another column, this utility will end up on Linux soon – and something like it is bound to be copied by Windows, too – so read on and admire the abstract beauty of this solution!

Analysing system performance means first getting the right information and then understanding what it means, the trouble being that you very quickly wind up with too much information that’s too difficult to analyse. The simplest performance analysis is straightforward observation, typified by ill-informed comments such as, “When the application was slowing down the disk drive/network/street lights kept flickering” or “When the app was running slowly, there seemed to be more fan noise”. All these comments could be true (even the street lights), but they lack scientific rigour – flashing lights might distinguish a busy disk from an idle one, but I defy anyone to distinguish busy from very busy that way.

The next step is to use standard operating system utilities – such as Windows Performance Monitor or Unix’s “top” – to investigate resource utilisation and find any bottlenecks. For example, if an application is using 90% of the CPU, the machine may not be fast enough, and that might be what’s slowing down the application; or it could be that the other 10% of the time it’s waiting for something else to finish and that’s the problem; or, worse still, a poorly written app is consuming CPU time while it waits for something else to finish.

Some well-known metrics are almost designed to mislead you, the worst example in the Unix world being the “iowait” time, the percentage of its time the CPU has spent waiting for input-output (I/O). For a simple system where an application talks to a disk drive, this figure might make sense, but in more complex systems it soon becomes nonsense. A system has to wait for slow devices precisely because they’re slow, but how can you tell whether it’s waiting for something that’s supposed to be slow or something that isn’t? Indeed the confusion caused by users misunderstanding iowait has led Sun to remove it altogether from Solaris 10, which always reports zero time! This caught me out when I was trying to work out why a Solaris 9 system seemed to be working slower than a Solaris 10 system, not realising that the Solaris 10 system wasn’t reporting the same thing.

If some of these operating system utilities aren’t very good, and some of the statistics they provide can be misleading, which of them are useful? The truth is that none of them should be considered in isolation – they all measure different things and they all work at a very high level, which may not be the level where your real problem is located. In general, to cure a symptom is not to cure the underlying disease. For example, an application might run slowly because of some disk-related performance issue, and putting in faster disks may alleviate that symptom as a quick fix, but it doesn’t touch the real problem inside the application, which is still there and may return.

There’s a need for a utility that analyses system performance more rigorously and in more detail (that’s where the devil is said to lurk). Consider an application that manipulates a large collection of files: conventional performance tools will show the disk in heavy use and may even suggest that disk I/O is the problem, but they won’t show whether this specific application is at fault and, if so, which part of it. Is the app opening too many files, or accessing some particular file more often than others? This style of analysis means we need to see what’s going on under the hood of the application – what files is it opening and how much data is it reading from or writing to each one? Most Unix systems have an application called “truss” that will show what system calls an application is making, but this sort of information is too diffuse to be much use.
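
As a quick illustration of the sort of summary truss can give you (the process ID here is a made-up example), the following attaches to a running process and, when you interrupt it, prints a count of every system call the process made in the meantime:

truss -c -p 1234

It’s a start, but it tells you nothing about which files those calls touched or how much data moved through each one, which is exactly the detail we’re after.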

Dtrace

A modern operating system, whether Windows or some Unix variant, maintains a large collection of counters about its various operations. We use these counters to reveal the external performance of the system, but the operating system itself also uses them to tailor its own behaviour. Within the operating system, there’s always a crucial component called the scheduler, which manages how much time on the processors (or processor cores) each process will get compared with others. The most refined of these scheduling modules are referred to as “fair share” schedulers, based on a set of design ideas developed in the 1970s.

The problem with fair-share schedulers is that the schedule they generate is based on information collected by counters, which makes the scheduling process quite complex. Aside from updating its counters, the operating system has to handle a huge number of events. Processes are continually asking it to do something such as read from a disk, peripherals are announcing that they’ve done something such as transferring a disk block into main memory, and internal decisions – such as which process to run next – have to be taken. Machine performance depends on how these events get ordered in time with respect to changes in the counters, and understanding this whole sequence is virtually impossible. A different approach is required: develop a theory of how the system is behaving, then test that theory against the running system.

This requires more proactive system administration – the system administrator no longer just sits there ploughing through pages of numbers and graphs, but must attempt to develop a theory of the system’s operation and then set out to prove that theory correct. At its highest level and in its purest form, dtrace is a language for developing and proving such theories.

Dtrace is just such a programming language for monitoring systems. The idea is that as a system runs, it carries out actions that generate events; dtrace is based on the idea of probes, which are handlers for these events. Put simply, your dtrace program enables probes that fire whenever a particular event occurs, and these probes can report more than just the fact that the event occurred – they can collect information about the event. Depending on the probe being monitored, a number of different types of information can be returned. For example, if you’re probing a system call then all the arguments to that call will be available. Kernel modules and some other programs provide sets of probes and so are known as providers – a dtrace program can associate actions with the probes implemented by providers.
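
To make that concrete, here’s a minimal sketch of my own (not something lifted from a toolkit) using the syscall provider: it fires on entry to the open() system call and prints which program opened which file. The exact probe name varies slightly between releases, hence the wildcard, so treat the probe description as an assumption to check on your own system.

syscall::open*:entry
{
    /* execname is the calling program; arg0 points at the pathname */
    printf("%s opened %s\n", execname, copyinstr(arg0));
}

Save that in a file and run it with “dtrace -s”, or squash it onto one line and pass it to “dtrace -n”.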

The dtrace programming language (sometimes just referred to as D) is very similar to C in its syntax, but it has useful additions such as associative arrays and aggregations, which are simple ways of combining collections of data points to produce results such as “average”, “count” and “sum”. To get dtrace to work, it has to be introduced into the operating system kernel. In a similar way to Java, D gets compiled into an intermediate code that can be safety-checked before execution. What’s very clever about dtrace is that it adds no overhead to the system when it isn’t in use, and when it is, the cost is confined to the probes you’ve actually enabled. Also, it can be turned off and on in a running system, and can monitor an executing process without that process knowing it’s being watched.
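
To give a flavour of those aggregations, here’s another sketch of my own: it keeps a running count of write() calls per program, along with the average number of bytes each one asks to write (arg2 being the size argument to write()). When you stop the script with Ctrl-C, dtrace prints the aggregations for you.

syscall::write:entry
{
    /* one count and one running average per program name */
    @calls[execname] = count();
    @avgsize[execname] = avg(arg2);
}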

I won’t bore you with the usual trivial programming examples – “hello world” written in dtrace is no more interesting than in any other language. There are some good introductions around, and I recommend downloading the Dtrace Toolkit from http://opensolaris.org/os/community/dtrace/dtracetoolkit.

To give you some idea of the scope and power of dtrace, we first need to consider the number of probes. On my Mac OS X system, running the command “dtrace -l” says there are 18,156 probes, while on a Solaris 10 system it says there are 44,261. Using just a few of those probes in a single line of D can produce interesting results. Take, for example, this dtrace program:

io:::start { @size[execname] = quantize(args[0]->b_bcount); }

This says monitor the “io:::start” probe, which fires whenever an I/O operation starts. The program builds an aggregation called “size”, keyed by program name, that quantises the size in bytes (b_bcount) of each I/O operation, sorting the sizes into power-of-two buckets. If you run this one-liner for a few minutes, as in the screenshot opposite, it will show you how much I/O each program on your machine is doing, grouped by size. Note that you can also get inside each individual program via its process (pid) provider; using that, you can trace each function call in the process, in a similar way to tracing each probe in the kernel.
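
As a sketch of that per-process tracing (the library glob and the process ID below are placeholders you’d adjust for your own system), this one-liner counts how many times the process with ID 1234 enters each C library function:

dtrace -n 'pid$target:libc*::entry { @[probefunc] = count(); }' -p 1234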

Two real-life examples

To show the power of dtrace, we’re going to give you two examples based on real problems on real systems – a Macintosh running Mac OS X 10.5 and a Sun server running Solaris 10. I’ve always been a Mac owner and like most Mac owners I tend to hang on to my machines for a long time, so that when I finally come to replace them they strike me as pitifully slow. This is being written on my PowerBook G4, although by the time it’s printed I hope a MacBook Air will be in my hands.

Apple ported the dtrace utility to the latest variant of Mac OS X (Leopard) so it was available when my Mac began to run slowly. The problem was related to email: for years, I’d been using Eudora, which worked but was beginning to feel rather tired. Eudora is now an open-source product based on Mozilla Thunderbird (which is very mature) so I thought it was time to move to that, but I lasted about three months before I got sick of it and decided there had to be something better.

I considered going back to the old Eudora, but then decided Apple Mail was a better bet and changed over. Everything was fine at first and worked well with an IMAP server, but then it started to slow down to a point where it ground to a halt – sending an email seemed to take an age and my New Year’s resolution to read all my emails was rapidly broken because I just couldn’t get to them. In desperation, I decided to apply dtrace to the problem.

It was obvious there was a problem, and there seemed to be a lot of disk activity, but what was that activity achieving? By tracing with dtrace, I discovered that Apple Mail was putting an awful lot of activity through one file: it turns out that Mail uses an open-source SQL database engine called SQLite to track its mail messages, and mine had become distinctly less than optimal. A quick check on Google for more information about this file yielded a command to optimise it, and once again I could read my emails – hurrah!
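
For the curious, the sort of one-liner that points the finger at a single file looks roughly like this. It leans on D’s built-in fds[] array, which maps a file descriptor back to its pathname; I believe it’s available in Solaris 10 and in Leopard’s dtrace, but treat that as an assumption to check:

syscall::write:entry
{
    /* total up the bytes requested per file; arg0 is the fd, arg2 the size */
    @bytes[fds[arg0].fi_pathname] = sum(arg2);
}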

My second example concerns a much bigger problem. My company operates a collection of websites out of two hosting centres, which are connected by a 100Mb/sec link. In each hosting centre is a load-balanced web server farm that talks to a back-end database server, which is a quad-socket Opteron Sun with lots of memory and a dual-controller EMC fibre channel RAID array. These servers run MySQL and are duplicated so that an update in one centre gets copied to the other centre and runs there, too.

A few weeks ago, the performance of one of these systems began to routinely fall off a cliff, with web pages that would normally take tenths of a second to create taking several seconds instead. The front-end web servers were all fine, barely breaking sweat, and to all intents and purposes the back-end databases looked fine, too. The obvious tools such as “top” yielded no useful information, and tweaking MySQL’s performance variables didn’t produce any major improvement either, particularly as reproducing the problem was very difficult. There had to be another cause. The most suggestive aspect of the problem was that it affected only one hosting centre, so something was happening in one centre but not in the other. It’s true that different websites did run out of the different centres, but all database write transactions were duplicated.

Ironically, the first hint came from the humble “iostat” I criticised earlier. It showed me that one particular device was very busy – up to 90% load – and that the average wait time was longer than I’d have liked. It was a real surprise that we were soaking up all of the 2Gb/sec bandwidth of fibre channel and the RAID. Without dtrace, we’d now have been pretty much on our own, down to pure guesswork. We needed to balance up the I/O load between the controllers in the RAID unit, but how were we going to do that? And could we really be sure what was causing it?
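
The incantation in question, for those who want to try it on Solaris (the flags vary between Unix flavours), prints extended statistics for every active device at five-second intervals; the columns to watch are %b, which shows how busy each device is, and the wait and active service times:

iostat -xnz 5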

This is where dtrace really came into its own. The first thing I tried was using “rfileio” from the dtrace toolkit to find out the identity of the really busy files, which told me that a particular type of table in the databases was getting hit a lot – but what sort of traffic was it being hit by? Entering a script from the MySQL dtrace toolkit (available from www.pcpro.co.uk/links/163open) let me monitor a particular call in the MySQL database server (using the process provider mentioned above) and therefore told me what SQL queries the database was running – and it was running a lot of them. A pattern soon emerged: in one particular case two queries were being run where one would have done, and the trace also exposed a logic problem.
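
I can’t reproduce that toolkit script here, but the general shape of the trick is something like the one-liner below. It assumes the server still routes every query through an internal function called mysql_parse and that the query text is its second argument, both of which vary between MySQL versions, so verify them against your own build before trusting the output (1234 again stands in for the real mysqld process ID):

dtrace -n 'pid$target::*mysql_parse*:entry { printf("%s\n", copyinstr(arg1)); }' -p 1234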

The case of these two queries – which could be replaced with one – was no fault of the developer. What was happening was that a query was made to fetch one column from a table and then – very soon after – another column from the same table. I could see both these queries executing next to each other, which enabled me to merge them, a technique rather like peephole optimisation in compilers, where redundant statements can be removed. Doing this work allowed me to completely transform the everyday performance of the application as well as its performance under heavy load. One particular operation that routinely took more than a second now consistently completes in under 0.4 seconds, or roughly three times faster. It’s also shown me some weaknesses in the application that I’ll now be able to improve.
