Fingerpointing, Frustrated Network Engineers, and the Application Performance Blame Game

By Brian Boyko
Fingerpointing – it’s a frustrating and lingering problem for IT organizations. Whenever application performance degrades, the application, server, and networking teams suddenly start pointing fingers at one another in an attempt to pass the blame. But isn’t that why we have network monitoring tools in the first place – to tell you where the problem is, so that you can fix it faster?
Theoretically, yes. Unfortunately, network teams may be suffering from an undeserved “credibility gap” that prevents companies from taking timely action when problems arise.
For a long time, performance problems have been laid at the feet of the networking team because, quite frankly, to the end user, an application problem, a server problem, and a network link problem all look like the same thing: “The network is slow.” So even when it’s not the network, the network team often gets the blame.

Application and network performance monitoring solutions enable companies to diagnose and determine – prove, if you will – where the problem lies. And by and large, those who deploy and use these products are able to solve problems faster and get better performance from their network.
However, there are two basic problems that no network monitoring solution, no matter how comprehensive, no matter how well-written, no matter how effective (no matter how expensive!), can solve.
First, once you buy your network monitoring product, you have to actually look at the data it provides and trust that data. It seems obvious, but it’s true – an unused network monitoring solution is a useless network monitoring solution.
Second, once you get the data from the network monitoring product, don’t discard it and blame whoever you wanted to blame anyway. Sadly, this sometimes happens, with the network team often becoming a scapegoat even when they can clearly show that the problem resides somewhere else.
If you don’t know where the problem is, that’s ignorance. But if you insist the problem is where you know it isn’t, that’s – well, that’s just dumb.
This phenomenon is covered in depth in “When apps are slow, net managers are wrong until proven right,” an article by Steve Taylor and Jim Metzler, who contribute newsletter articles to Network World. For that article, Taylor and Metzler talked to a network engineer at a large healthcare-industry company.
We also talked to this network engineer. Because of the sensitive nature of the conversation, we’ve decided to refer to the network engineer only by initial.
J is assigned to an engineering team that handles network monitoring using the NetQoS Performance Center suite of monitoring products, including SuperAgent, ReporterAnalyzer, and NetVoyant. [Full Disclosure for new readers: Network Performance Daily is the company blog of NetQoS.] The team has about ten critical enterprise applications being monitored, and manages tickets based on performance issues called in by the help desk.
“The first thing that comes out of people’s mouths is that ‘the network is slow,’ regardless of what the problem might actually be,” J explained, pointing out that when a user makes a complaint about “the network,” it sets off a lengthy investigation of the network infrastructure, regardless of whether there is any evidence that it is actually a network problem.
One particular issue at J’s company remained unresolved because, despite J’s insistence, no one believed that it wasn’t the network that was the problem.
“This particular issue had been going on for probably over a month and a half, and it came to a head when we were able to see in NetQoS SuperAgent – and we have all this in print-screens and presentations – that retransmissions went through the roof, while no other metric did anything,” J said. “We had long term baselines for latency on the WAN, long time baselines for server performance and other tiers and other applications that run behind the same load balancer – all this mounting evidence that we hoped the server team for this particular Web server would just take a look at.”
“But it never went that way. The [Web server team] kept asking us to bring in our vendors, or bring in AT&T and do all this other stuff, when we had already called AT&T to verify that we had an error-free network. It got to batting back-and-forth ‘Whose problem is it?’”
“What frustrated us is that we’ve got these great graphs and great tools that tell us specifically – maybe not what is wrong – but where it’s going wrong and where we should focus, and it’s never an easy sell. They don’t want to believe it could be something in the application or server itself, and they always want to point to – what about the routers, what about this? We have to constantly go back and tell them, well, if that were the case, it wouldn’t be just your app, it would probably be all the applications that run off this router or load-balancer.”
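The reasoning J describes (compare each metric against its long-term baseline, then check whether other applications behind the same router or load balancer are affected) can be sketched in a few lines of Python. This is only an illustration of the logic, not how SuperAgent or any NetQoS product works: the metric names, the sample numbers, the three-standard-deviation threshold, and the function names `is_anomalous` and `localize` are all assumptions made up for this example.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Return True if `current` sits more than `threshold` standard
    deviations above the baseline formed by `history` (hypothetical rule)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > threshold


def localize(apps):
    """`apps` maps an application name to {metric: (history, current)}.
    All applications are assumed to share the same router/load balancer."""
    anomalies = {
        app: sorted(m for m, (hist, cur) in metrics.items()
                    if is_anomalous(hist, cur))
        for app, metrics in apps.items()
    }
    affected = [app for app, found in anomalies.items() if found]
    if not affected:
        return "no anomaly", anomalies
    # If every app behind the shared device misbehaves, suspect the device.
    if len(apps) > 1 and len(affected) == len(apps):
        return "shared infrastructure suspected", anomalies
    # Otherwise the trouble follows one app, so look at that app or its server.
    return "application- or server-specific", anomalies


# Invented sample data: one app's retransmissions spike, nothing else moves.
apps = {
    "web_app": {
        "retransmission_pct": ([0.4, 0.5, 0.6, 0.5, 0.4], 9.0),
        "wan_latency_ms": ([40, 42, 41, 39, 40], 41),
    },
    "other_app": {
        "retransmission_pct": ([0.5, 0.4, 0.5, 0.6, 0.5], 0.5),
        "wan_latency_ms": ([40, 42, 41, 39, 40], 41),
    },
}

verdict, anomalies = localize(apps)
print(verdict, anomalies)
```

With this made-up data, only `web_app` shows an anomaly (its retransmissions), so the sketch labels the problem application- or server-specific rather than pointing at the shared network path, which mirrors the argument J’s team kept having to make.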
J sent us an e-mail after our interview, mentioning that she read this blog, and that an early post of ours helped to explain the frustration that she felt:

“This post from Network Performance Daily nails it:

‘Imagine a man walking into a hospital, saying that he doesn’t feel good, and doctors around the country are immediately called in, starting with the cardiologist, who rules out heart trouble. The man is next wheeled to a podiatrist, who rules out any problems with his feet. He’s then wheeled to a gynecologist (But I’m a man… Ma’am, I’m a doctor. I think I should make that determination – and only after the tests come back.) If your diagnostic process is trial by error, you’re not, technically, diagnosing.’”

So, what can be done to help solve this problem? One thing IT managers can do is consider consolidating the application, server, and network teams into one “application delivery” team. After all, all three teams are ultimately tasked with doing the same thing: delivering applications in a timely manner to the end user. (This is probably the solution Jim Metzler would prefer, as the “application delivery” idea is one of his recurring themes in the many articles he’s written and speeches he’s given, including four different panels at Interop 2007. He even talked about it with us at NetQoS in a podcast.)
We hope to have more information on the IT consolidation approach in future articles. In the meantime, please share with us your frustrations with finger-pointing in the IT department.
