Problem Management: Major Payback for the Data Center

September, 2006

One of the most popular studies on the Computer Economics website is a report on help desk staffing ratios, which our clients use to determine benchmarks for help desk and technical support headcount levels. Why is this report so popular? Because the help desk and tech support staff is a large and visible presence in many data centers. Across all organizations, nearly 10% of the IT headcount is devoted to help desk and technical support. Therefore, these positions are often a target for cost-cutting or outsourcing.

But in focusing on these highly visible positions, IT managers are often missing the bigger picture. Help desk and technical support are just the tip of the iceberg when it comes to what the IT Infrastructure Library (ITIL) calls “incident management” and “problem management.” Incident management includes all activities that take place to restore services as quickly as possible in the event of a disruption, whether the disruption affects only a single user or is system-wide. Incident management includes nearly all the work of the help desk and second-level technical support. But in many data centers, incident management also occupies a large part of other jobs, such as those of console operators, production schedulers, system programmers, system engineers, and even managers.

Problem management, which is closely related to incident management, analyzes the causes of incidents and identifies trends so that solutions to reduce the volume of incidents in the future may be developed.

How much data center staff time goes to incident and problem management? Our partner, Metrics Based Assessments, recently did an analysis, based on data collected from hundreds of data center benchmark studies, to specifically identify the percentage of data center staff time spent on incident and problem management. The study analyzed these activities in the following categories:

  • Incident Management Level 1: first-level resolution, such as calls to the help desk.
  • Incident Management Level 2-3: actual resolution of production problems.
  • Problem Management: identifying underlying problems behind incidents, including root cause and trend analysis.

The study allocated staff time to each of these three categories across hardware/OS platforms, as shown in Figure 1.

[Figure 1: Staff time allocated to incident and problem management categories, by hardware/OS platform]

The results of this analysis are striking: overall, a full 20% of all data center staff time is spent in incident and problem management, about double the level of activity represented in the Computer Economics staff ratio for help desk and technical support. In other words, help desk and tech support, the tip of the iceberg, represent only half of the time spent by data center personnel on incident and problem management.

Furthermore, although there are differences between platforms, the differences are not so great as to indicate that the choice of hardware or operating system is the primary factor driving the level of effort required for incident and problem management.

What then is the primary factor? Drilling further into the details of Figure 1, we see that across all platforms, incident management takes up the bulk of the effort, with problem management accounting for less than 3% of data center staff time. This indicates that, generally, organizations spend much more time chasing individual incidents than fixing the root cause of problems that give rise to incidents. If the data center would devote even a small portion of the effort currently expended on incident management to problem management, the overall number of incidents would decrease, resulting in genuine cost savings and enormous payback.

For example, assume a data center has 2,000 installed z/OS MIPS, 1,000 UNIX processors, and 1,000 Windows processors. Based on standard staffing ratios for each platform, the total staff size would be about 221 full-time equivalents (FTEs). Using the ratios provided in Figure 1, we can therefore calculate that about 42 FTEs would be devoted to incident and problem management. If this number could be reduced by a third (14 FTEs), and assuming average annual compensation including benefits of $84,000 per FTE, the data center could save approximately $1.2 million annually.
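The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration using the figures quoted above (221 total FTEs, 42 FTEs on incident and problem management, a one-third reduction, and $84,000 per FTE); the function and variable names are our own and are not part of the underlying benchmark study.

```python
# Illustrative sketch of the savings estimate in the example above.
# All input figures come from the text; the names are hypothetical.

def estimate_savings(total_ftes, incident_problem_ftes,
                     reduction_fraction, cost_per_fte):
    """Return (share of staff time, FTEs freed, annual savings in dollars)."""
    # Share of total staff devoted to incident and problem management
    share = incident_problem_ftes / total_ftes
    # FTEs no longer needed if that workload shrinks by reduction_fraction
    ftes_freed = incident_problem_ftes * reduction_fraction
    # Savings at the assumed fully loaded cost per FTE
    annual_savings = ftes_freed * cost_per_fte
    return share, ftes_freed, annual_savings


share, ftes_freed, annual_savings = estimate_savings(
    total_ftes=221,
    incident_problem_ftes=42,
    reduction_fraction=1 / 3,
    cost_per_fte=84_000,
)

print(f"Share of staff on incident/problem management: {share:.0%}")
print(f"FTEs freed by a one-third reduction: {ftes_freed:.0f}")
print(f"Estimated annual savings: ${annual_savings:,.0f}")
```

With these inputs, the estimate works out to 14 FTEs and $1,176,000, which rounds to the $1.2 million cited above. Substituting a data center's own staffing figures and compensation levels gives a rough first estimate; it is not a replacement for a full benchmark study.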

Data centers are under tremendous pressure to reduce personnel and operating costs. Devoting more time to problem management, including root cause and trend analysis, will reduce the number of incidents and the associated need for resources to resolve them. Working smarter, not just harder, is the key to optimizing operating costs in the data center.

Figure 1 statistics in this article were provided by Mark Levin, a Partner at Metrics Based Assessments, LLC, from data collected from thousands of data center benchmarking studies over the past 16 years.

Current benchmark data is available in Levin’s book, Best Practices and Benchmarks in the Data Center, which may be purchased from the Computer Economics website.