Creating a More Accurate IT Availability Definition

Availability is one of the key metrics that demonstrates the overall performance of an information technology (IT) system. But defining and calculating the availability of an IT system from a business perspective is a challenging task. Most of the time, IT departments report availability values that are on the higher side (such as more than 99 percent availability), but business people may not believe them, especially when there are instances of outages for applications supporting critical business functions or outages during core business hours.

Although the availability numbers may be numerically correct, they may not be a true representation of the real business situation. This misrepresentation can be addressed using two concepts:

The outside in approach – Defining availability from a business perspective
The business throughput approach – Availability calculation based on resultant value as experienced by the business

The Problem With Traditional Availability Definition

In most business environments, any business function is supported by several IT applications. Consider, for example, money collection in a credit card business. End users have several ways of making payments, such as a cash or check deposit, a wire transfer, online payment, or payment by phone. Different IT applications, such as secured login enabling online payment and voice-recognition applications enabling payments by phone, support these different ways of money collection. Each of these applications has its own core business hours (e.g., websites may be available 24/7, whereas voice-recognition applications may only be used from 8 a.m. to 6 p.m.).

In the traditional IT availability calculation, service level agreements (SLAs) are set for application uptime, and application availability is calculated against those SLAs. In the money collection example, the availability is calculated for all end-user applications (websites, voice recognition, etc.). The availability calculation must be based on core business hours rather than total application uptime; the latter provides leeway to show better availability by counting uptime outside business hours. Many organizations already base their SLA definitions and availability calculations on core hours. Table 1 shows the availability values in the money collection example, including the amount of time that applications were unavailable due to outage. (Note: the impact of capacity issues on availability is not considered in this analysis; capacity is assumed to be a non-issue for availability.)

Table 1: Availability Values of Money Collection Options
Application or Component	SLA Minutes (based on core hours)	Outage Minutes	Availability Percent
Banking application	8 a.m. to 6 p.m. = 600 minutes	5 minutes	99.17 percent
Third-party administration application	8 a.m. to 6 p.m. = 600 minutes	10 minutes	98.33 percent
Voice recognition application	8 a.m. to 6 p.m. = 600 minutes	5 minutes	99.17 percent
Website	24 hours (24/7 service) = 1,440 minutes	15 minutes	98.96 percent
Payment system database	24 hours (24/7 service) = 1,440 minutes	6 minutes	99.58 percent
Account information system database	24 hours (24/7 service) = 1,440 minutes	3 minutes	99.79 percent
Overall	6,120 minutes	44 minutes	99.28 percent
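
The arithmetic behind Table 1 can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed tool: the SLA and outage minutes are copied from the table, while the names `COMPONENTS`, `availability_percent`, and `simple_aggregate` are hypothetical and chosen only for this example.

```python
# Core-hour SLA minutes and outage minutes per component, copied from Table 1.
COMPONENTS = {
    "Banking application": (600, 5),
    "Third-party administration application": (600, 10),
    "Voice recognition application": (600, 5),
    "Website": (1440, 15),
    "Payment system database": (1440, 6),
    "Account information system database": (1440, 3),
}


def availability_percent(sla_minutes, outage_minutes):
    """Availability as a percentage of the SLA window."""
    return 100.0 * (sla_minutes - outage_minutes) / sla_minutes


def simple_aggregate(components):
    """Traditional method: pool all SLA minutes and all outage minutes."""
    total_sla = sum(sla for sla, _ in components.values())
    total_outage = sum(outage for _, outage in components.values())
    return total_sla, total_outage, availability_percent(total_sla, total_outage)


for name, (sla, outage) in COMPONENTS.items():
    print(f"{name}: {availability_percent(sla, outage):.2f} percent")

total_sla, total_outage, overall = simple_aggregate(COMPONENTS)
print(f"Overall: {total_sla} minutes, {total_outage} outage minutes, {overall:.2f} percent")
```

Because the pooled denominator grows with every component added, the same 44 outage minutes are spread over 6,120 SLA minutes, which is why the overall figure stays above 99 percent.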

Outside In View

The data in Table 1 is the availability from the IT systems’ perspective. If the front-end applications are up all the time, but users are unable to complete transactions because of infrastructure failures or database issues, the system is still unavailable to users and the business. In such a case, claiming high availability values for the front-end applications is misleading. For the end user, the system is available only if the entire business process is completed successfully. Because of this, businesses must use the outside in view of the availability metric to define availability SLAs at the business-function level. Using this approach, in the process of collecting money through a website, the SLA should be met only if all components in the business process are up and running and users are able to execute the business process successfully.
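
One way to picture the outside in view is to treat the business function as down during any minute in which at least one component in its chain is down. The sketch below assumes outage windows are recorded as (start, end) pairs in minutes since midnight; the specific windows are invented for illustration (only their lengths of 15, 6, and 3 minutes match Table 1), and the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) minute intervals into disjoint ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def function_outage_minutes(outages_by_component):
    """Minutes during which at least one component in the chain was down."""
    all_intervals = [iv for ivs in outages_by_component.values() for iv in ivs]
    return sum(end - start for start, end in merge_intervals(all_intervals))


# Hypothetical outage windows, in minutes since midnight (end is exclusive).
outages = {
    "Website": [(600, 615)],
    "Payment system database": [(610, 616)],
    "Account information system database": [(900, 903)],
}

down = function_outage_minutes(outages)
print(f"Function down for {down} of 1440 minutes")
print(f"Function availability: {100.0 * (1440 - down) / 1440:.2f} percent")
```

With these assumed windows, the website and payment database outages partly overlap, so the function is down for 19 minutes rather than the 24 obtained by adding the three outages together.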

Rolled Throughput Method

To meet the business-function-level SLA, all components need to meet their SLAs individually as well as collectively. If one component is down, the process cannot be completed, so the system must be treated as unavailable to users. That means businesses must use the rolled throughput method for calculating function-level availability, instead of simply aggregating all SLA minutes and outage minutes for all components. In the case of money collection through a website, the SLA is 24/7 availability, and the website itself had 15 minutes of outages; the two databases it depends on had a further 6 and 3 minutes. If none of these outages overlap, the function was unavailable for 15 + 6 + 3 = 24 of its 1,440 minutes; if they all overlap, it was unavailable for 15 minutes. Table 2 illustrates the difference between the simple aggregation method and the rolled throughput method for calculating availability at the function level.

Table 2: Traditional and Rolled Throughput Availability Values
Calculation Method	Application or Component	SLA Minutes (based on core hours)	Outage Minutes	Availability Percent
Traditional – simple aggregation	Website	24 hours (24/7 service) = 1,440 minutes	15 minutes	98.96 percent
Traditional – simple aggregation	Payment system database	24 hours (24/7 service) = 1,440 minutes	6 minutes	99.58 percent
Traditional – simple aggregation	Account information system database	24 hours (24/7 service) = 1,440 minutes	3 minutes	99.79 percent
Traditional – simple aggregation	Money collection process using website	4,320 minutes	24 minutes	99.44 percent
Rolled throughput	Website	24 hours (24/7 service) = 1,440 minutes	15 minutes	98.96 percent
Rolled throughput	Payment system database	24 hours (24/7 service) = 1,440 minutes	6 minutes	99.58 percent
Rolled throughput	Account information system database	24 hours (24/7 service) = 1,440 minutes	3 minutes	99.79 percent
Rolled throughput	Money collection process using website (no overlapping outages)	1,440 minutes	24 minutes	98.33 percent
Rolled throughput	Money collection process using website (all outages overlapping)	1,440 minutes	15 minutes	98.96 percent
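
The two rolled throughput rows in Table 2 are the worst and best cases for a chain of components that share one SLA window. The sketch below reproduces them alongside the simple aggregation figure, under the assumption that only total outage minutes per component are known; `WEBSITE_CHAIN`, `simple_aggregation`, and `rolled_throughput_bounds` are hypothetical names used for this illustration.

```python
# SLA minutes and outage minutes for the website payment chain (Table 2).
WEBSITE_CHAIN = {
    "Website": (1440, 15),
    "Payment system database": (1440, 6),
    "Account information system database": (1440, 3),
}


def availability_percent(sla_minutes, outage_minutes):
    """Availability as a percentage of the SLA window."""
    return 100.0 * (sla_minutes - outage_minutes) / sla_minutes


def simple_aggregation(chain):
    """Traditional method: sum SLA minutes and outage minutes across components."""
    total_sla = sum(sla for sla, _ in chain.values())
    total_outage = sum(outage for _, outage in chain.values())
    return availability_percent(total_sla, total_outage)


def rolled_throughput_bounds(chain):
    """Return (worst case, best case) function-level availability.

    Worst case: no outages overlap, so outage minutes add up.
    Best case: all outages overlap, so only the longest one counts.
    Assumes every component in the chain shares the same SLA window.
    """
    windows = {sla for sla, _ in chain.values()}
    if len(windows) != 1:
        raise ValueError("components must share one SLA window")
    window = windows.pop()
    outages = [outage for _, outage in chain.values()]
    worst = availability_percent(window, min(window, sum(outages)))
    best = availability_percent(window, max(outages))
    return worst, best


worst, best = rolled_throughput_bounds(WEBSITE_CHAIN)
print(f"Simple aggregation: {simple_aggregation(WEBSITE_CHAIN):.2f} percent")
print(f"Rolled throughput, no overlapping outages: {worst:.2f} percent")
print(f"Rolled throughput, all outages overlapping: {best:.2f} percent")
```

Run as written, this should reproduce the 99.44, 98.33, and 98.96 percent figures in Table 2. The key difference is the denominator: rolled throughput keeps the single 1,440-minute window the user experiences instead of tripling it.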

Finally, the availability of the overall money collection process should be calculated by aggregating availability across all channels. This aggregation must use a weighted average, with each channel weighted on a scale of 1 to 5 (Table 3).

Table 3: Weighted Availability Values
Application or Component	Weight (1 to 5, 5 is max)	SLA Minutes (based on core hours)	Outage Minutes (no overlaps)	Availability Percent
Money collection using banking application	4	600 minutes	14 minutes	97.66 percent
Money collection using third-party administration application	2	600 minutes	19 minutes	96.83 percent
Money collection using voice recognition application	4	600 minutes	14 minutes	97.66 percent
Money collection using website	5	1,440 minutes	24 minutes	98.33 percent
Overall money collection function	[(4 x 97.66) + (2 x 96.83) + (4 x 97.66) + (5 x 98.33)] ÷ (4 + 2 + 4 + 5) = 1,466.59 ÷ 15 = 97.77 percent
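
The weighted average in Table 3 can be checked with a short calculation. The sketch below takes the weights, SLA minutes, and outage minutes from the table; the names `CHANNELS` and `weighted_availability` are hypothetical. It works from unrounded channel percentages, so it yields roughly 97.78 percent, whereas the table's 97.77 percent comes from combining the channel percentages as printed to two decimal places.

```python
# (channel, weight, SLA minutes, outage minutes with no overlaps), from Table 3.
CHANNELS = [
    ("Banking application", 4, 600, 14),
    ("Third-party administration application", 2, 600, 19),
    ("Voice recognition application", 4, 600, 14),
    ("Website", 5, 1440, 24),
]


def availability_percent(sla_minutes, outage_minutes):
    """Availability as a percentage of the SLA window."""
    return 100.0 * (sla_minutes - outage_minutes) / sla_minutes


def weighted_availability(channels):
    """Weighted average of channel availability, using each channel's weight."""
    total_weight = sum(weight for _, weight, _, _ in channels)
    weighted_sum = sum(
        weight * availability_percent(sla, outage)
        for _, weight, sla, outage in channels
    )
    return weighted_sum / total_weight


for name, weight, sla, outage in CHANNELS:
    print(f"{name} (weight {weight}): {availability_percent(sla, outage):.2f} percent")
print(f"Overall money collection function: {weighted_availability(CHANNELS):.2f} percent")
```

Weighting by business importance means an outage in a heavily used channel, such as the website, pulls the overall figure down more than the same outage in a lightly used one.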

For the same number of outage minutes, the overall percentage of availability for a business function (97.77 percent) is much lower than the overall percentage of availability calculated by the traditional method (99.28 percent).

That explains why businesses’ experiences with and perceptions of availability numbers are not always in line with those reported by IT. By using the outside in and rolled throughput concepts, however, measurement errors in availability calculations can be minimized and a method that is closer to the business and user experience can be devised.

About the Author

Vishwajit Joshi
