Dare to be wrong

How do you feel about being wrong? That’s a basic question about a pretty basic emotion, and there’s a particular kind of network-orientated wrongness that I think deserves further exploration. An astonishing proportion of the emails I receive every month from readers of this column could broadly be said to be on the subject of being wrong, and how that affects the correspondent’s ability to do anything useful about their problem. I personally have very little fear of being wrong, because I spend so much time in the over-complicated world of networks where perpetual rightness is an impossible dream.

You might nowadays encounter a good half-dozen different network structures, each of which possesses a set of attributes that you can only just count on your fingers. That makes the number of permutations so enormous that perpetual rightness is beyond reach. Think about this a little. There are at least seven attributes for the cables (speed, duplex, topology, fibre/copper, interior/exterior, shielded/unshielded, bought-in/home-made) and the TCP/IP protocols add roughly the same again (address, netmask, CIDR yes/no, IPv6 yes/no, hop count, MAC, ARP). And it's easy enough to construct a similarly sized list for most of the other structures you need in order to get your LAN to work well: name resolution, traffic management, standards, platform requirements. All these have more than two but fewer than ten fundamental parameters you can set.
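If you want a feel for why "perpetual rightness" is genuinely out of reach, here's a back-of-envelope sketch of the arithmetic. The attribute counts come from the lists above; the assumption that each attribute is a simple either/or choice is mine, and it's absurdly generous, since something like an IP address can take far more than two values.

```python
# Rough count of network configuration permutations.
# Attribute counts follow the column's lists; treating every attribute as a
# binary choice is a deliberately conservative assumption.

cable_attrs = 7        # speed, duplex, topology, fibre/copper, interior/exterior,
                       # shielded/unshielded, bought-in/home-made
tcpip_attrs = 7        # address, netmask, CIDR, IPv6, hop count, MAC, ARP
other_structures = 4   # name resolution, traffic management, standards, platform
other_attrs_each = 5   # "more than two but fewer than ten" - a middling guess

total_attrs = cable_attrs + tcpip_attrs + other_structures * other_attrs_each
settings_per_attr = 2  # assume each is a simple either/or choice (most aren't)

print(f"{total_attrs} attributes, at least {settings_per_attr ** total_attrs:,} combinations")
# 34 attributes -> at least 17,179,869,184 combinations, even before any
# attribute is allowed more than two possible settings.
```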

I happen to believe that when dealing with problems of this complexity, humans do best when they tackle them as a team and talk them through together. Networks in general can benefit from “debugging by conversation” – almost the instant you open your mouth to describe some problem you can’t fix, the solution you need will occur to you, whereas if you’d remained silent that solution would probably never have sprung to mind. Apparently, this effect is well understood by cognitive scientists, and various academic and technical bodies (the one I’ve heard of specifically being the BBC) now tell their technical teams that such a technique is an officially sanctioned approach to problem solving.

Now, in a lot of businesses, the person with responsibility for IT is the finance director, and most often they’ll have some nerd (be that internally employed or “outsourced”) reporting to them. Finance is a very different kind of discipline from IT, in that there isn’t much uncertainty involved in the quantities, and diagnostic processes aren’t often called for while you’re balancing accounts or setting a budget – the numbers remain the numbers. Very often I watch downtrodden techies desperately trying to engage in the process of diagnosis-by-discussion with their finance director, whose approach to the whole affair will be – to put it politely – somewhat antagonistic. “I thought you were supposed to know all this stuff…”, he or she will bark, “…no, you can’t have any more help – this is what we pay you for”. It’s hardly surprising that in such an environment people will develop a massive reluctance to be seen to be wrong (or even to be seen asking for help).

I was already hinting at this problem a few columns ago when I described my adventures with the Netgear FS728TS Smart Stacking Switch, and glibly mentioned a few systems I’d seen whose administrators were rather worried by the prospect of their network expanding beyond a single 24-port switch, because they didn’t know how to daisy-chain switches together. As always, it’s the most seemingly insignificant asides that set my mailbox bulging, and this one worked a treat.
Several of you wrote in to describe some variation on that kind of network topology, the most common solution seemingly being to put multiple network cards into the server, with each card serving a different physical switch and the server being the only connection between the different switches and segments. Now, this kind of layout will work, and in a network with just one server, whose users never move around or print on one another’s printers, it will probably never excite very much attention, because the files can be shared and the work can get done without much impediment. However, the minute you try anything even slightly smart, such as load-balancing, teaming, routing traffic to or from another server, remote access, or even passing through a PDA for internet synchronisation, it will all fall over in a big heap. As a long-term solution, one network card per switch in the server is a disaster area.

One chap was in even more dire straits than this might suggest. He’d been managing very well with his single server, mainly because he was supporting an industry-standard package that adds tons of value to his business – one that every single member of staff uses from the minute they sit down in the morning to the final whistle – and because his finance director, while clearly being of the irascible and accusatory temperament I just described, also understands quite clearly how vital the network is to the business. This clear linkage between the company’s bottom line and its IT investment, when combined with a “no switch interlink” topology and a “just one server” configuration, meant that a rather painful and unique crunch point had arisen for this chap.

His server, now bulging with 144GB disks, had filled up its single tape drive and he wanted to add a second unit. The only problem was, if he attached it to the same SCSI card as the existing unit, the overnight backup wouldn’t finish in time to allow his people back into that overwhelmingly important database at start of play the following morning. To make matters worse still, this small but high-value company was particularly vulnerable to one of the more dread consequences of the Second Law of Thermodynamics, the little-remarked corollary to Sod’s Law that insists that the more useful a product is, the less flexible, tolerant or reliable it must be.
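To see the shape of his backup-window problem, here's some illustrative arithmetic. The column gives neither his actual data volume nor his tape drive's throughput, so the figures below are assumptions chosen only to show why a single drive (or two drives that can't genuinely stream in parallel) overruns an overnight window.

```python
# Illustrative backup-window arithmetic; data size, drive speed and window
# length are all assumptions, not figures from the column.

data_gb = 250            # assumed data to back up each night
tape_mb_per_sec = 5.0    # assumed sustained throughput of one tape drive
window_hours = 12        # assumed overnight window, say 19:00 to 07:00

hours_needed = (data_gb * 1024) / (tape_mb_per_sec * 3600)
print(f"Single drive needs {hours_needed:.1f} h against a {window_hours} h window")
# ~14.2 h - the job overruns the window. A second drive only helps if it can
# stream at the same time, which is the whole point of giving it its own bus.
```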

In my man’s case, he’d had to perform four complete database restores during the preceding year thanks to client PC crashes taking out part of the central database structure. On the one hand, this was good news, because at least he knew that his chosen backup software and hardware combo did the job they were intended to do. On the other hand, it was bad news, because the reason he had to do these restores in the first place wasn’t amenable to being “tuned out” by system improvements – the database package’s developer simply wasn’t going to fix the bug, as it had other priorities. So I went to have a look at his situation for myself.

His server really was full up, with each available PCI slot having either a SCSI, RAID or Ethernet card in it. He was using the 10.x.x.x private address range, with each separate Ethernet card sitting on a different subnet, the first card being 10.0.1.x, the second 10.0.2.x and so on. The population of hubs and switches connected to these cards, as well as the cards themselves, was quite miscellaneous. One was a fibre link to a distant building, two were major 24-port switches on different floors, and one was a small eight-port hub on the “directors’ suite” floor (which he’d have liked to update, but the chairman’s secretary literally had a key to access that area and would only very rarely let him in…)
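For readers who want to see why that layout forces every cross-segment packet through the server, here's a minimal sketch using Python's ipaddress module. The /24 netmasks, and which subnet belongs to which physical segment, are my assumptions; the column only says each card carried a different 10.0.n.x range.

```python
# Sketch of the per-card addressing described above. Netmasks and the
# subnet-to-segment mapping are assumptions for illustration.
import ipaddress

segments = {
    "fibre link to distant building": ipaddress.ip_network("10.0.1.0/24"),
    "24-port switch, one floor":      ipaddress.ip_network("10.0.2.0/24"),
    "24-port switch, another floor":  ipaddress.ip_network("10.0.3.0/24"),
    "8-port hub, directors' suite":   ipaddress.ip_network("10.0.4.0/24"),
}

a = ipaddress.ip_address("10.0.1.20")   # workstation on the fibre segment
b = ipaddress.ip_address("10.0.2.31")   # workstation on a 24-port switch

same_segment = any(a in net and b in net for net in segments.values())
print(same_segment)   # False: with no switch interlinks and no router other
                      # than the server itself, these two machines can only
                      # reach each other if the server forwards between cards.
```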
Quick fixes like adding a multiport Ethernet card such as the Intel PRO/1000 GT (see www.pcpro.co.uk/links/networks145a) were out of the question, as was installing a central switch. Devices that could handle the wide range of connection types he was using can be bought relatively easily, but converting each connection to a pooled setup, then running a single Ethernet interface that either multihomed each of the subnets or routed the traffic in a way that emulated the multicard layout, was a long-term project best suited to a bank holiday weekend – and that was far too long to wait to do his backup. I know it should have been trivial to re-assign the workstation IP addresses to get out of this multiple-subnet configuration, but he wasn’t using DHCP, and touring around the whole site (including the locked executive floor) took more than a complete working day.
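Just to give a flavour of the "multihome everything onto one interface" half of that long-weekend job, here's a sketch of what it might look like on a Windows server of that era, driven from Python. The interface name and the extra addresses are illustrative placeholders; netsh and its add address syntax are real, but treat this as a sketch of the shape of the work, not a tested procedure for his network.

```python
# Sketch: stack the legacy per-card subnets onto one interface so the old
# addressing keeps working while the switches get interlinked properly.
# The interface name and addresses below are hypothetical examples.
import subprocess

INTERFACE = "Local Area Connection"      # placeholder interface name
extra_addresses = [
    ("10.0.2.1", "255.255.255.0"),       # one address per legacy segment,
    ("10.0.3.1", "255.255.255.0"),       # mirroring the old per-card ranges
    ("10.0.4.1", "255.255.255.0"),
]

for ip, mask in extra_addresses:
    subprocess.run(
        ["netsh", "interface", "ip", "add", "address", INTERFACE, ip, mask],
        check=True,
    )
```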

This left us with two short-term fixes – run a D2D2T (disk-to-disk-to-tape) backup by beefing up one workstation and moving the tape drives onto it, or try to squeeze more SCSI capability out of the over-filled server. A quick look at all the candidate workstations ruled out the D2D2T option: while money had been spent on the server, the workstations were bargain-basement machines, their cheap tin cases mostly full of hot, sticky dust bunnies and PCI slots on wobbly riser cards. Their onboard Ethernet interfaces were weird, slow and ill-matched to the switches – certainly not up to continuously force-feeding a top-dollar tape drive all night. Upgrading the switches would be just as painful as reconfiguring the server, so it was back to the server – and that’s when luck finally came down on our side.

The server was a Compaq shed, a double-fronted tower on fat castor wheels with plenty of power and several drive cages. These workhorses are the backbone of the networking business, and despite the fact that they’re now turning up on eBay at £200 a pop for a pallet-load of 12, I still think they have sensible applications if you ignore their CPU rating and focus instead on the number of drives they can fit in, and how well they power, cool and access those drives. The biggest problem with the whole ProLiant range was that seemingly trivial options could make a very large difference to how the whole server operated. I’ve mentioned here before how often I discover dual- or quad-capable ProLiants toiling away with just a single processor, not enough memory or a paltry pair of 9.1GB SCSI disks, just because the original purchaser had no idea what delights lay waiting in the options section of the catalogue.

My man had gone some way towards collecting the right bits to service his ravenous database application: the server contained separate SCSI RAID cards to service each drive cage, and all three slots available for drives were filled. He’d made rational use of the number of storage devices available in this configuration and distributed his volumes sensibly across the devices. However, when it came to running the tape drive, he’d fallen victim to Options-Catalogue Myopia, compounded by accepting dodgy industry wisdom concerning tape drive misbehaviour. Like all cautious LAN administrators, he’d kept his tape drive on a separate SCSI card from his RAID arrays, despite each RAID card having an external connector on the back. He’d resisted the temptation to use those and instead drove the tape device via an Adaptec 29160, leaving the two onboard Symbios Logic integrated non-RAID SCSI chipsets entirely unemployed.
These chips are included on the ProLiant motherboard, where they appear as simple 68-pin headers buried close to the edges of the board near the drive bays. Almost everyone ignores them, because the “right thing” to do with those drive bays is to hook them up to a caching RAID controller and ignore the simple onboard chipset: however, they’re almost always left operational. The standard ProLiant backplate includes little punched-out slots, sited away from the expansion card cage, which can be opened and then used with the right ribbon cable and mounting plate to present the SCSI connector on the back of the machine – in this case, the ribbon connector was even present, but not plugged in. We plugged it in with the server still running – despite his nerves – and managed to get the second tape drive seen by Backup Exec after judicious stopping and starting of the relevant services, without rebooting the server.
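For anyone wanting to repeat that last trick, the "judicious stopping and starting" amounts to cycling the backup services so the newly attached drive is re-detected. The service names vary between Backup Exec versions, so the ones below are placeholders; check the Services console for the real names on your own box before relying on anything like this.

```python
# Sketch: cycle the backup services so a newly attached tape drive is
# re-detected without rebooting. Service names are placeholders only.
import subprocess

SERVICES = [
    "Backup Exec Device & Media Service",   # placeholder name
    "Backup Exec Job Engine",               # placeholder name
]

for name in reversed(SERVICES):
    subprocess.run(["net", "stop", name], check=False)   # stop dependents first

for name in SERVICES:
    subprocess.run(["net", "start", name], check=True)   # restart in order
```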

He has now ordered another of these little external-presentation ribbon adapters (for about $30 from a spares company in the US), which will release another PCI slot. This may not put his network into the optimal state it could reach with a little more re-engineering, but at least the pressure is off for now.
