Reducing IT User Downtime Using TQM – A Case Study

This Information Technology (IT) case study was done during the implementation of total quality management (TQM) in a financial services company with several hundred computers and computer users in multiple locations throughout India. The results have widespread applicability and in particular are aimed at organizations with large computer networks, IT facilities management companies and customer service providers. Success in any improvement effort is a function of techniques accompanied by a mindset change in the organization. This project was undertaken as part of the second wave of projects aimed at spreading the quality mindset in the organization.

The narrative unfolds in the chronological sequence of TQM’s seven steps of problem solving (similar to DMAIC [Define, Measure, Analyze, Improve, Control] in Six Sigma), describing the critical process stages where results were achieved and mindsets changed.

Step 1 – Define the Problem

Selecting the theme: After an initial two-day TQM awareness program, the company’s senior management selected a theme by consensus: dramatic improvements in customer service. As part of the theme, one of the improvement areas selected was “reducing the response time to resolve IT (hardware and software) problems faced by internal customers.” The company had outsourced its network and facility management. A small technical services management team and help desk oversaw the vendors’ work.

Problem = Customer desire – actual status: Detailed data was available regarding the time of receipt of each call from the customer (in this case, the network users) and the time of call closure. Monthly management reports aggregated the performance by enumerating the number of calls that were resolved in the following categories:

Call Closure Time	< 30 Mins.	< 60 Mins.	< 2 Hours	> 2 Hours	< 24 Hours	< 48 Hours	> 48 Hours

While the information about what happened was well recorded, there was no information about what users had desired to have happen. The deviation from user desires or even the service standard promised to users was not measured.

Defining the problem therefore resulted in a changed mindset from data being used just as an internal record to measuring and “assuring a service standard to the user.” The calls were categorized into groups that would be expected to have a service standard time of closure as defined in the table above.

A month of data was analyzed by subtracting the service standard time from the actual time taken to resolve each call. The gaps between the actual closure time and the standard time were a measure of the problem. It was clear that the data needed to be prioritized in order to proceed, so a Pareto diagram was drawn (Figure 1). It indicated that two categories, < 30 minutes (67 percent) and > 120 minutes (27 percent), constituted 87 percent of the incoming load. It was decided to attack the < 30 minutes category first.
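The gap calculation and the Pareto ordering described above can be sketched in a few lines of Python. This is only an illustration: the call log, the category labels and the function names are invented for the example and are not the company's data.

```python
from collections import Counter

# Service standard (minutes) per call category; labels are illustrative.
SERVICE_STANDARD_MIN = {"<30 min": 30, "<60 min": 60, "<2 h": 120}

# Hypothetical call log: (category, actual closure time in minutes).
calls = [
    ("<30 min", 12), ("<30 min", 95), ("<30 min", 28), ("<30 min", 240),
    ("<60 min", 45), ("<60 min", 130), ("<2 h", 100), ("<2 h", 300),
]

def closure_gap(category, actual_min):
    """Minutes by which a call overshot its service standard (0 if on time)."""
    return max(0, actual_min - SERVICE_STANDARD_MIN[category])

def load_share(call_log):
    """Percentage of incoming calls per category, largest first (Pareto order)."""
    counts = Counter(category for category, _ in call_log)
    total = sum(counts.values())
    return [(category, 100 * n / total) for category, n in counts.most_common()]

gaps = [closure_gap(category, minutes) for category, minutes in calls]
shares = load_share(calls)
print(gaps)
print(shares)
```

Clamping the gap at zero treats an early closure as "no problem" rather than as credit against late calls, which matches the idea of measuring deviation from a promised standard.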

Definition of metrics: In order to define clear metrics, the concept of sigma was introduced to represent variability in timeliness of service. The group quickly grasped that a 3-sigma standard translates into a 99.7 percent on-time performance. In other words, the average plus 3 sigma of the actual closure times should be less than the service standard.

This meant that for the < 30-minute call category:

If T30 = average + 3 sigma of 30-minute calls’ closure times
T30 < 30 minutes for a 99.7 percent on time performance

The past month’s data revealed:
T30 = 239 minutes

The objective was now clearly defined:
Reduce T30 from 239 minutes to < 30 minutes
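For readers who want to compute the metric themselves, a minimal sketch follows. It assumes the closure times are available as a list of minutes and uses the population standard deviation; the article does not say which form of sigma the group used, and the sample values are invented.

```python
import statistics

def t_metric(closure_minutes, k=3):
    """Return average + k * sigma of the closure times, in minutes."""
    mean = statistics.mean(closure_minutes)
    # Assumption: population sigma. The sample form (statistics.stdev)
    # would be an equally defensible choice.
    sigma = statistics.pstdev(closure_minutes)
    return mean + k * sigma

# Hypothetical closure times (minutes) for calls in the < 30-minute category.
sample = [10, 20, 30, 40, 50]
t30 = t_metric(sample)
meets_standard = t30 < 30
print(round(t30, 1), meets_standard)
```

Note that an average of exactly 30 minutes is nowhere near good enough under this metric: the spread of the closure times counts three times over.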

Dividing the Task into Phase A and Phase B

Since such a big reduction was too daunting a task for a team embarking on its first project, the group applied the concept that “improvement occurs step by step”: the initial objective, or Phase A, was to reduce T30 by 50 percent. A project charter was drawn up accordingly.

Step 2 (Phase A) – Analyze the Problem: The T30 calls were arranged in descending order according to actual time of closure. Those calls that had taken more than 30 minutes were segregated for analysis. It was recognized that the problem of quality was one of variability, and that the most effective solution to the problem would be ending the causes of calls with a very high time of closure. Thus, T30 calls that had taken more than 130 minutes (T30:130) were analyzed first (Figure 2).

The top three categories contributed approximately 75 percent of the problem. To sequence the order of attack, the group chose “big and easy” to precede “big and difficult” problems. Using that criterion, “not aware of change rule” was chosen.
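The "vital few" reading of a Pareto diagram amounts to sorting the causes and accumulating their shares until a threshold is reached. The sketch below assumes delay minutes per cause as input. Only the cause name "not aware of change rule" comes from the case; the other names and all the numbers are placeholders.

```python
def pareto_table(minutes_by_cause):
    """Return (cause, share %, cumulative %) rows, largest cause first."""
    total = sum(minutes_by_cause.values())
    rows, cumulative = [], 0.0
    ordered = sorted(minutes_by_cause.items(), key=lambda kv: kv[1], reverse=True)
    for cause, minutes in ordered:
        share = 100 * minutes / total
        cumulative += share
        rows.append((cause, round(share, 1), round(cumulative, 1)))
    return rows

def vital_few(rows, threshold=75.0):
    """Causes, in Pareto order, needed to reach the cumulative threshold."""
    picked = []
    for cause, _, cumulative in rows:
        picked.append(cause)
        if cumulative >= threshold:
            break
    return picked

# Placeholder figures for illustration only.
delay_minutes = {
    "not aware of change rule": 300,
    "cause B": 250,
    "cause C": 200,
    "cause D": 150,
    "cause E": 100,
}
rows = pareto_table(delay_minutes)
top = vital_few(rows)
print(rows)
print(top)
```

The ranking only identifies what is "big"; judging which of the big causes is also "easy" remains a decision for the group.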

Step 3 (Phase A) – Find the Root Cause: In these cases the engineer attending to the call had not closed the call after attending to it. The 5-Whys technique was used to determine the root cause – Why had he not closed the call? Why was he not aware that he was supposed to close the call? Why was the procedure of call closure changed and he was not informed? Why is there no standard operating procedure to inform employees before a procedure is changed?

Step 4 (Phase A) – Generate and Test Countermeasure Ideas: Countermeasures were easily identified – first, inform all the engineers; second, develop a standard procedure for informing all users before making a change in procedure which affects them. The engineers were informed of the new procedure.

Step 5 (Phase A) – Check the Results: The next three weeks showed a dramatic drop in the T30 value from 239 to 121 minutes. The objective of 50 percent reduction had been achieved.

Step 6 (Phase A) – Standardize the Results: A standard operating procedure was drawn up for future reference. An X Bar control chart (Figure 3) was introduced for routine day-to-day control.
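The article does not say how the control limits were set. One common textbook construction for an X-bar chart uses daily subgroups of equal size and the average range, as sketched below. The subgroup size of five, the closure times and the function name are assumptions made for the example.

```python
import statistics

# Standard X-bar/R chart constants A2, indexed by subgroup size.
A2 = {2: 1.880, 3: 1.023, 4: 0.729, 5: 0.577}

def xbar_limits(subgroups):
    """Return (subgroup means, LCL, center line, UCL) for an X-bar chart."""
    n = len(subgroups[0])
    if any(len(group) != n for group in subgroups):
        raise ValueError("all subgroups must have the same size")
    means = [statistics.mean(group) for group in subgroups]
    ranges = [max(group) - min(group) for group in subgroups]
    center = statistics.mean(means)
    half_width = A2[n] * statistics.mean(ranges)
    # Closure times cannot be negative, so the lower limit is floored at 0.
    return means, max(0.0, center - half_width), center, center + half_width

# Hypothetical closure times (minutes): five sampled calls per day.
days = [
    [22, 30, 25, 41, 32],
    [18, 27, 35, 20, 25],
    [40, 33, 28, 45, 29],
    [21, 38, 26, 30, 35],
]
means, lcl, center, ucl = xbar_limits(days)
out_of_control = [i for i, m in enumerate(means) if m < lcl or m > ucl]
print(means, lcl, center, ucl, out_of_control)
```

A day whose mean falls outside the limits is a signal to look for a special cause. Points inside the limits reflect ordinary variation and do not call for a reaction.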

Step 7 (Phase A) – Present a Quality Improvement Report: The quality improvement report was deferred because the project was continuing in pursuit of further improvements.

Figure 3: Control Chart for 30-Minute Calls (September)

Phase B to Further Reduce Downtime

Step 2 (Phase B) – Analyze the Problem: The second phase of the project, or Phase B, was to reduce the T30 value by 50 percent again, from less than 120 minutes to less than 60. The T30 calls which took more than 30 minutes to close were collated and arranged by category in descending order of time to close. There were two categories with the following data:

Categories	Calls	Minutes	Minutes/Call
Log-in	39	2720	70
Printing	16	1672	104

Based upon the “big and easy” principle, the group chose to attempt the printing problem first. The printing calls were sub-categorized by “location” and then by “solution” since they had already been resolved.
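The Minutes/Call column in the table above is simply total minutes divided by the number of calls. A short check using the table's own figures (the variable names are illustrative):

```python
# (calls, total minutes) per category, taken from the table above.
categories = {"Log-in": (39, 2720), "Printing": (16, 1672)}

minutes_per_call = {
    name: minutes / calls for name, (calls, minutes) in categories.items()
}
worst_per_call = max(minutes_per_call, key=minutes_per_call.get)
print(minutes_per_call, worst_per_call)
```

Printing accounts for fewer total minutes than log-in, but each printing call costs more time.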

Seven of the 16 calls were from Location 1 and seven of the 16 calls had been solved using the same remedy – reinstalling the printer driver.

Step 3 (Phase B) – Finding the Root Cause: Why did the printer driver need frequent re-installation? The group brainstormed and generated 10 possible causes. A check sheet to collect data was designed. For the next two weeks, the engineers were asked to record, each time they attended to such a call, the reason the printer driver needed to be reinstalled.
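A check sheet is essentially a tally against a fixed list of candidate causes. A minimal sketch follows, with placeholder cause labels, since the article does not list the group's 10 candidates. The class and method names are invented for the example.

```python
from collections import Counter

class CheckSheet:
    """Tally observations against a fixed list of candidate causes."""

    def __init__(self, candidate_causes):
        self.candidates = set(candidate_causes)
        self.tally = Counter()

    def record(self, cause):
        # Rejecting unknown labels keeps the tally comparable across engineers.
        if cause not in self.candidates:
            raise ValueError(f"unknown cause: {cause!r}")
        self.tally[cause] += 1

    def ranked(self):
        """Observed causes, most frequent first."""
        return self.tally.most_common()

# Placeholder labels; the real sheet would carry the brainstormed causes.
sheet = CheckSheet([f"cause {i}" for i in range(1, 11)])
for observed in ["cause 3", "cause 7", "cause 3", "cause 3", "cause 1"]:
    sheet.record(observed)
ranking = sheet.ranked()
print(ranking)
```

Fixing the list of causes up front makes the engineers' entries directly countable. A real sheet would likely also need an "other" row for observations that fit none of the candidates.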

Figure 4: Control Chart for 30-Minute Calls (October)

When reviewed, the data surprised the group members. It clearly illustrated the superiority of data-based problem-solving over intuitive problem-solving. And it acted as a major mindset changer. The problem, the data showed, was that the printer was going off-line rather than its driver needing reinstallation.

Why was the printer going off-line? Brainstorming quickly produced the cause: The machines being used had three versions of the Windows operating system – 98, 2000 and XP. In the Windows 98 version there was a problem – if a user tried to print without logging-in, the printer would go offline and the next user would experience the problem. The cause was quickly confirmed as the root cause by one of the members trying to print without logging-in.

Step 4 (Phase B) – Generate and Implement Countermeasure Ideas: The group discussion produced the idea of adopting a software change to not allow a user to try printing without logging-in. All the machines using Windows 98 were identified and the change was implemented. Applying the standard operating procedure used in Phase A, the group was careful to inform all users of the change before implementing it.

Step 5 (Phase B) – Check the Results: The calls were monitored for another two weeks and the results amazed the group. The data showed a dramatic drop of the T30 value from 121 to 47 minutes (Figure 4). A total reduction of 80 percent had been obtained in the T30 value. The question arose as to why the reduction had been much more dramatic than the Pareto chart data would indicate. There are two reasons:

While the problem-solving method identified the vital problems using the calls that took a long time to resolve, there were undoubtedly many calls with the same problem and cause that were attended to within the standard time and therefore did not show in the analysis.
The system of daily control chart plotting and review with the engineers and the group raised the awareness of timeliness and thereby increased the urgency for a solution.
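The reductions reported for the two phases can be checked with simple arithmetic on the T30 values quoted in the article. The helper name is illustrative.

```python
def pct_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100 * (before - after) / before

phase_a = pct_reduction(239, 121)   # Phase A: 239 -> 121 minutes
phase_b = pct_reduction(121, 47)    # Phase B: 121 -> 47 minutes
overall = pct_reduction(239, 47)    # both phases combined
print(round(phase_a, 1), round(phase_b, 1), round(overall, 1))
```

The phase percentages do not add up to the overall figure because each phase is measured against its own starting point. The remaining fractions multiply instead: (121/239) × (47/121) = 47/239, which leaves about 20 percent of the original T30.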

Step 6 (Phase B) – Standardize the Results: A standard procedure was developed and circulated to all regions to implement the change at all locations.

Step 7 (Phase B) – Present a Quality Improvement Report: A quality improvement report was written and presented to the Steering Committee.

Future Work and Conclusions

The work of the group is continuing in the following directions:

The T30 calls are now being analyzed to further reduce the time. Two interesting solutions are emerging that promise to cut the downtime further.
T60 calls are now under study. The average + 3 sigma of closure time of this category has been measured at 369 minutes. Work is being done to reduce it to < 60 minutes.

This case study demonstrates several principles of TQM and Six Sigma:

What cannot be measured, cannot be improved. (Establishing service standards and the use of sigma and control charts for on-time delivery of services were essential in making improvements.)
It is important to develop customer-oriented metrics.
Mindset change is crucial to the success of any improvement effort.
Standardizing the improvement can take longer than the improvement itself. (It is still continuing in this application.)
There is value in step-by-step improvement and continuous improvement.

About the Author

Niraj Goyal
