Disasters in IT, and Ninja Networking

Other than Unix Beards and “funny” T-shirts with hex code on them – which more accurately qualify as fashion disasters – the biggest project disasters in IT, according to today’s top story in Computer World, tend to repeat themselves:

When you look at the reasons for project failure, “it’s like a top 10 list that just repeats itself over and over again,” says Holland, who is also a senior business architect and consultant with HP Services.

You’ve got your usual run of top-ten disasters in the article, including IBM’s Stretch project (overpromised and underdelivered), Knight-Ridder’s Viewtron (misread the market), the California and Washington State DMV overhauls and FoxMeyer’s ERP program (didn’t make sure the new system worked better than the old one), Apple’s Copland (succumbed to feature creep), Sainsbury’s warehouse automation (just plain didn’t work), Canada’s gun registration system (cost far more than anticipated due to poor planning), and three U.S. government projects (multiple failures, with perhaps more to come).

But one of the things I noticed was that it’s relatively rare (not unheard of, but relatively rare) to see networking take a prime role in the huge IT disaster stories that get passed around the campfire at IT tribe meetings. I think there are a few reasons for that. The first is that most of these blunders fall under the category of “strategic errors” rather than “tactical errors,” whereas network problems are usually subtle failures caused by misconfigurations and highly technical mistakes. The networking screw-up is one of the most subtle, stealthy kinds of failure, compared to the grandiosity of all-out strategic incompetence.

Or, in other words: networking performance problems can cause the best-laid plans to go astray; the worst-laid plans need no additional help.

Take, for example, a common error from the early days of VoIP deployments: companies would roll out VoIP on the network as if it were just another data application, and then find that their other applications slowed to a crawl or even stopped working.

The problem was that VoIP packets ride on protocols designed to take as much of the pipe as they can get, while most applications run over TCP, which is designed to throttle back its use of the pipe when packets don’t go through. So the VoIP packets would take more of the pipe, the TCP applications would get crowded out and start dropping packets, TCP would throttle back, the VoIP packets would see the free space and take up even more of the pipe, crowding out the TCP packets further and forcing TCP to throttle back again… a vicious cycle.
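To make that feedback loop concrete, here’s a toy back-of-the-envelope sketch in Python. It is not a real network simulation, and all the numbers are invented; it just follows the description above, with one flow that expands into any free capacity and ignores loss, and a TCP-style flow that halves its rate whenever the link overflows.

```python
# Toy sketch of the vicious cycle described above (not a real network
# simulator; all numbers are made up purely for illustration).
# The "VoIP-style" flow grabs spare capacity and never slows down on loss;
# the "TCP-style" flow halves its rate on loss and probes back up slowly.

LINK_CAPACITY = 100      # arbitrary bandwidth units per tick

voip_rate = 40           # starts modest, but expands into any free space
tcp_rate = 60            # TCP-style additive-increase / multiplicative-decrease

for tick in range(1, 13):
    offered = voip_rate + tcp_rate
    if offered > LINK_CAPACITY:
        # Over-subscribed link: packets drop. TCP treats loss as congestion
        # and halves its rate; the VoIP-style flow just keeps sending.
        tcp_rate = max(1, tcp_rate // 2)
        event = "loss: TCP throttles back"
    else:
        # Spare capacity appears. TCP probes up slowly, but the VoIP-style
        # flow claims the freed headroom first -- the vicious cycle.
        voip_rate = min(LINK_CAPACITY, voip_rate + (LINK_CAPACITY - offered))
        tcp_rate = min(LINK_CAPACITY, tcp_rate + 2)
        event = "free space: VoIP expands"

    print(f"tick {tick:2d}: voip={voip_rate:3d}  tcp={tcp_rate:3d}  ({event})")
```

Run it and you can watch the TCP-style flow get squeezed down to almost nothing within a dozen ticks, which is roughly what those early VoIP rollouts looked like from the application side.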

Was this a problem with strategy? Was it some form of bureaucratic incompetence? No: it was simply a very subtle effect, and if you didn’t know enough about the TCP and VoIP protocols (or even if you did, but didn’t put two and two together until after deployment), you ended up with a problem.

Networking problems may have major effects, but they’re rarely caused by major boneheaded screw-ups. I think that’s one of the reasons networking and security are two of the areas where IT departments spend a great deal of money: both kinds of problems are extremely subtle to detect and tricky to solve, security problems by malicious design, networking problems by nature.

Networking problems are subtle, strike quickly, and often leave little trace of their presence. They’re the ninjas of IT problems.

Of course, ninjas can be defeated.

