THURSDAY, OCTOBER 19, 2017

Creating a More Accurate IT Availability Definition

Availability is one of the key metrics that demonstrates the overall performance of an information technology (IT) system. But defining and calculating the availability of an IT system from a business perspective is a challenging task. Most of the time, IT departments report availability values that are on the higher side (such as more than 99 percent availability), but business people may not believe them, especially when there are instances of outages for applications supporting critical business functions or outages during core business hours.

Although the availability numbers may be numerically correct, they may not be a true representation of the real business situation. This misrepresentation can be addressed using two concepts:

  1. The outside-in approach – Defining availability from a business perspective
  2. The business throughput approach – Calculating availability based on the resulting value as experienced by the business

The Problem With the Traditional Availability Definition

In most business environments, any business function is supported by several IT applications. Consider, for example, money collection in a credit card business. End users have several ways of making payments, such as a cash or check deposit, a wire transfer, online payment, or payment by phone. Different IT applications, such as secured login enabling online payment and voice-recognition applications enabling payments by phone, support these different ways of collecting money. Each of these applications has a different set of core business hours (e.g., websites may be available 24/7, whereas voice-recognition applications may only be used from 8 a.m. to 6 p.m.).

In the traditional IT availability calculation, service level agreements (SLAs) are set for application uptime, and application availability is calculated against those SLAs. In the money collection example, availability is calculated for every end-user application (websites, voice recognition, etc.). The calculation must be based on core business hours rather than total application uptime, because the latter leaves room to report better availability by counting uptime outside business hours. Many organizations therefore base their SLA definitions and availability calculations on core hours. Table 1 shows the availability values in the money collection example, including the amount of time that applications were unavailable due to outages. (Note: the impact of capacity issues on availability is not considered in this analysis; capacity is assumed to be a non-issue for availability.)

Table 1: Availability Values of Money Collection Options
Application or Component | SLA Minutes (based on core hours) | Outage Minutes | Availability Percent
Banking application | 8 a.m. to 6 p.m. = 600 minutes | 5 minutes | 99.17 percent
Third-party administration application | 8 a.m. to 6 p.m. = 600 minutes | 10 minutes | 98.33 percent
Voice recognition application | 8 a.m. to 6 p.m. = 600 minutes | 5 minutes | 99.17 percent
Website | 24 hours x 60 minutes = 1,440 minutes | 15 minutes | 98.96 percent
Payment system database | 24 hours x 60 minutes = 1,440 minutes | 6 minutes | 99.58 percent
Account information system database | 24 hours x 60 minutes = 1,440 minutes | 3 minutes | 99.79 percent
Overall | 6,120 minutes | 44 minutes | 99.28 percent
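
The arithmetic behind Table 1 can be sketched in a few lines of Python. This is only an illustration: the function names and the data layout are assumptions made for the example, and the figures are copied from Table 1.

```python
def availability_percent(sla_minutes, outage_minutes):
    """Percentage of the SLA minutes during which there was no outage."""
    if sla_minutes <= 0:
        raise ValueError("sla_minutes must be positive")
    return 100.0 * (sla_minutes - outage_minutes) / sla_minutes


# (component, SLA minutes, outage minutes), copied from Table 1
TABLE_1 = [
    ("Banking application", 600, 5),
    ("Third-party administration application", 600, 10),
    ("Voice recognition application", 600, 5),
    ("Website", 1440, 15),
    ("Payment system database", 1440, 6),
    ("Account information system database", 1440, 3),
]


def simple_aggregate(rows):
    """Traditional method: add up all SLA minutes and all outage minutes."""
    total_sla = sum(sla for _, sla, _ in rows)
    total_outage = sum(outage for _, _, outage in rows)
    return total_sla, total_outage, availability_percent(total_sla, total_outage)


if __name__ == "__main__":
    for name, sla, outage in TABLE_1:
        print(f"{name}: {availability_percent(sla, outage):.2f} percent")
    total_sla, total_outage, overall = simple_aggregate(TABLE_1)
    print(f"Overall: {total_sla} minutes, {total_outage} minutes, {overall:.2f} percent")
```

Note that the overall figure treats each component as an independent pool of minutes, which is the weakness the following sections address.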

Outside-In View

The data in Table 1 shows availability from the IT systems’ perspective. If the front-end applications are up all the time, but users are unable to complete transactions because of infrastructure failures or database issues, the system is still unavailable to users and the business. In such a case, claiming high availability values for the front-end applications is misleading. For the end user, the system is available only if the entire business process can be completed successfully. Because of this, businesses must use the outside-in view of the availability metric and define availability SLAs at the business-function level. Using this approach, in the process of collecting money through a website, the SLA should be considered met only if all components in the business process are up and running and users are able to execute the business process successfully.

Rolled Throughput Method

To meet the business-function-level SLA, all components need to meet their SLAs individually as well as collectively. If one component is down, the process cannot be completed, so the system must be treated as unavailable to users. That means businesses must use the rolled throughput method for calculating function-level availability, instead of simply aggregating all SLA minutes and outage minutes for all components. In the case of money collection through a website, the SLA is 24/7 availability. The website itself had 15 minutes of outages, and the two databases it depends on had a further 6 and 3 minutes. Table 2 illustrates the difference between the simple aggregation method and the rolled throughput method for calculating availability at the function level.

Table 2: Traditional and Rolled Throughput Availability Values
Calculation Method | Application or Component | SLA Minutes (based on core hours) | Outage Minutes | Availability Percent
Traditional – simple aggregation | Website | 24 hours x 60 minutes = 1,440 minutes | 15 minutes | 98.96 percent
Traditional – simple aggregation | Payment system database | 24 hours x 60 minutes = 1,440 minutes | 6 minutes | 99.58 percent
Traditional – simple aggregation | Account information system database | 24 hours x 60 minutes = 1,440 minutes | 3 minutes | 99.79 percent
Traditional – simple aggregation | Money collection process using website | 4,320 minutes | 24 minutes | 99.44 percent
Rolled throughput | Website | 24 hours x 60 minutes = 1,440 minutes | 15 minutes | 98.96 percent
Rolled throughput | Payment system database | 24 hours x 60 minutes = 1,440 minutes | 6 minutes | 99.58 percent
Rolled throughput | Account information system database | 24 hours x 60 minutes = 1,440 minutes | 3 minutes | 99.79 percent
Rolled throughput | Money collection process using website (no overlapping outages) | 1,440 minutes | 24 minutes | 98.33 percent
Rolled throughput | Money collection process using website (all outages overlapping) | 1,440 minutes | 15 minutes | 98.96 percent
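
The two rolled throughput rows in Table 2 are the extremes of one calculation: the function is down during any minute in which at least one of its components is down. A minimal Python sketch of that idea follows. Table 2 gives only outage durations, so the start and end times below are hypothetical values chosen to reproduce the two scenarios, and the function names are likewise assumptions made for the example.

```python
def merged_outage_minutes(intervals):
    """Total minutes covered by at least one (start, end) outage interval."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # A gap before this outage: close the previous merged window.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps (or touches) the current window: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def rolled_availability_percent(sla_minutes, intervals):
    """Function-level availability: down whenever any component is down."""
    down = merged_outage_minutes(intervals)
    return 100.0 * (sla_minutes - down) / sla_minutes


# Hypothetical outage windows, in minutes since midnight, for the website
# (15 min), payment database (6 min) and account database (3 min).
NO_OVERLAP = [(100, 115), (300, 306), (900, 903)]
ALL_OVERLAP = [(100, 115), (102, 108), (110, 113)]

if __name__ == "__main__":
    for label, windows in (("no overlap", NO_OVERLAP), ("all overlapping", ALL_OVERLAP)):
        minutes = merged_outage_minutes(windows)
        pct = rolled_availability_percent(1440, windows)
        print(f"{label}: {minutes} outage minutes, {pct:.2f} percent")
```

With real outage logs the result would normally fall somewhere between the two bounds, depending on how much the component outages overlap.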

Ultimately, the availability of the overall money collection process should be calculated by aggregating availability across all of the channels. This aggregation must be done using the weighted average method (Table 3), where each channel's weight reflects its importance to the business.

Table 3: Weighted Availability Values
Application or Component | Weight (1 to 5, 5 is max) | SLA Minutes (based on core hours) | Outage Minutes (no overlaps) | Availability Percent
Money collection using banking application | 4 | 600 minutes | 14 minutes | 97.66 percent
Money collection using third-party administration application | 2 | 600 minutes | 19 minutes | 96.83 percent
Money collection using voice recognition application | 4 | 600 minutes | 14 minutes | 97.66 percent
Money collection using website | 5 | 1,440 minutes | 24 minutes | 98.33 percent
Overall money collection function | [(4 x 97.66) + (2 x 96.83) + (4 x 97.66) + (5 x 98.33)] ÷ (4 + 2 + 4 + 5) = 1,466.59 ÷ 15 = 97.77 percent
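
The weighted average in Table 3 can be checked with a short Python sketch. The function name and the list layout are assumptions made for the example; the weights and availability percentages are those shown in Table 3.

```python
def weighted_availability(channels):
    """Weighted average of (weight, availability percent) pairs."""
    total_weight = sum(weight for weight, _ in channels)
    if total_weight == 0:
        raise ValueError("at least one channel must have a non-zero weight")
    return sum(weight * pct for weight, pct in channels) / total_weight


# (weight, availability percent) per channel, copied from Table 3
TABLE_3 = [
    (4, 97.66),  # banking application
    (2, 96.83),  # third-party administration application
    (4, 97.66),  # voice recognition application
    (5, 98.33),  # website
]

if __name__ == "__main__":
    print(f"Overall money collection function: {weighted_availability(TABLE_3):.2f} percent")
```

Changing a weight shows how sensitive the overall figure is to the business importance assigned to each channel.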

For the same number of outage minutes, the overall availability for the business function (97.77 percent) is much lower than the overall availability calculated by the traditional method (99.28 percent).

That explains why businesses’ experiences with and perceptions of availability numbers are not always in line with those of the IT department. By using the outside-in and rolled throughput concepts, however, measurement errors in availability calculations can be minimized and a method that is closer to the business and user experience can be devised.

Comments

Philip

Nice concept. However, is it practically possible to implement this? Is this practiced in your organization? I am very keen to know how this can be implemented, as this is an ideal situation.

Jeroen Weeda

Hi Philip, yes, measuring and reporting on IT chains from a business perspective is more than just a concept. It allows you to bridge the gap between business and IT. The technical implementation is not the hard part; rearranging your IT governance to embed managing IT from a service or even chain perspective rather than IT components is equally important but often forgotten or underestimated. An integral approach touching tooling, processes and organizations is required.

Best regards,

Jeroen Weeda

Jeff Dunn

I like your approach and the article has excellent content. I think there is an error in your calculations. In your example, you use 24 hours x 7 days = 1,440 minutes. The number of minutes in one week would be: 24 hours x 7 days x 60 minutes = 10,080 minutes per week. One day would represent 1,440 minutes (24 hours x 60 minutes).

Thank you again for the article!

Sincerely,

Jeff

Chris Seider

I’m sorry but your rolled throughput availability concept doesn’t make sense to me. If I had a production line based on 2 cells that were in series and both were down 50% of the available time and they didn’t overlap in lost time, I interpret your thought the availability would be 25% but in reality the line was never fully operating (assuming dependencies) and the availability is 0%.

Also, I can’t imagine COMPLICATING any efficiency with weighting factors of a process.

I’d just worry about improving uptime (availability) for the entire system without the complications inferred by your article, though thoughtfully presented.

Also, have you considered the defect rate of these transactions? That might be of interest also. Before you say the routines don’t have errors, consider one reason why customer service exists–because no process is error free.

My two cents.
