Recently I have experienced at first hand a couple of these failures, and listened to the excuses of the two service organisations directly affected. In hindsight, neither problem was that surprising, but both caused damage and loss. I think it is worth recounting these as cautionary tales. I expect there are many other hidden vulnerabilities just waiting to bite the ankles of the unwary.

The first was a fierce winter storm accompanied by very heavy rain that ultimately caused a severe flood. Unfortunately, the storm brought down power lines, which, in turn, disabled critical parts of the mobile phone network. Orange’s site, on a hilltop, lost power, and with its microwave hub and its GSM network down, the local council (an all-Orange user) lost much of its ability to direct emergency staff as the floodwaters rose (so no sandbags got to those critical and vulnerable points). The powers of nature seemed determined to share the misery, with O2’s base station in the valley then becoming submerged in the very same flood – another network partially out of action.

The chief executive of the local council, ruefully addressing the subsequent and rather lively public meeting, confessed that they were now considering systems that would give access to all networks. As he said, ‘we learn from our mistakes’, and the lack of communications had become a major problem at a critical time – luckily no lives were lost, just properties damaged.

The second ‘impossible’ failure was at the internet service provider (ISP) that hosts the MCUG email servers. This well-established ISP makes bold statements about reliability, multiple data centres, and multiple networks, all very comforting to the business user. At least, comforting until it failed. It turns out that a critical part of the multiple network connectivity was an ATM backbone loop: one reputable ATM ‘wires’ provider gave access to two separate routes (one on each side of the ATM loop) to the international internet backbone.
What the ISP and the ‘wires’ provider didn’t allow for was where those independent loop ends terminated. It turns out that they both terminated at the same data centre in Docklands, on the same floor, and on co-located devices. Of course, that data centre offered 100% reliability. Yes, 100%, so no problem there. What a surprise when that data centre lost power on that floor for some time, and then restored power with a surge so large that it blew up both the terminal devices. A day or so later the ‘100% redundant and reliable’ system was still struggling back to life, as were the thousands of businesses nationwide that were affected.

In the old days of private mobile radio systems and specially built networks, businesses knew the strengths and weaknesses of their infrastructure. It was expensive, imperfect, and bespoke, so they took a very close interest. Today these services are competitively priced commodities. They are very sophisticated, and it is hard for the user to establish what might happen in a storm, a flood, or when a data centre fails.

The Environment Agency confidently predicts more severe weather incidents, and is now revising its estimates. A few years ago, increased summer air-conditioning load took power out in most of downtown Auckland, New Zealand, for weeks; can the same happen here? How many of the UK’s businesses are now dependent on common failure points?

One cannot help feeling that the plot was lost somewhere. When the internet was originally planned by the US military many years ago, one of its strengths was as a network that could withstand attack. What a good idea. Single points of failure were avoided, and IP data packets, each with their own address, could be routed to their destinations in many ways, but not, it seems, today. In the scramble for performance and cost reduction we seem to have lost sight of redundancy, just at the point where our businesses are becoming dangerously dependent on that most addictive of drugs – data.
It could be that redundancy has ceased to be a term in the modern IT manager’s vocabulary. Of course, it may well be the term used after a critical system failure, when that hapless soul meets his chief executive.
www.mcug.org.uk