iSixSigma

Friday, 26 February 2010 17:46

Tips for Recognizing and Transforming Non-normal Data

Written by Peter Sherman
Practitioners can benefit from an overview of normal and non-normal distributions, as well as familiarizing themselves with some simple tools to detect non-normality and techniques to accurately determine whether a process is in control and capable.

By Peter J. Sherman

Six Sigma professionals should be familiar with normally distributed processes: the characteristic bell-shaped curve that is symmetrical about the mean, with tails approaching plus and minus infinity (Figure 1).

Figure 1: Normally Distributed Data 

When data fits a normal distribution, practitioners can make statements about the population using common analytical techniques, including control charts and capability indices (such as sigma level, Cp, Cpk, defects per million opportunities and so on).

But what happens when a business process is not normally distributed? How do practitioners know the data is not normal? How should this type of data be treated?

Spotting Non-normal Data

There are some common ways to identify non-normal data:

  1. The histogram does not look bell shaped. Instead, it is skewed positively or negatively (Figure 2).

Figure 2: Positively and Negatively Skewed Data

  2. A natural process limit exists. Zero is often the natural process limit when describing cycle times and lead times. For example, when a restaurant promises to deliver a pizza in 30 minutes or less, zero minutes is the natural lower limit.
  3. A time series plot shows large shifts in data.
  4. There is known seasonal process data.
  5. Process data fluctuates (e.g., product mix changes).
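The first sign in the list, a skewed histogram, can also be checked numerically. The sketch below computes the moment coefficient of skewness using only the Python standard library. The function name and both data sets are hypothetical illustrations, not data from this article. A value well above zero suggests positive (right) skew, a value well below zero suggests negative skew, and a value near zero is consistent with a symmetrical distribution.

```python
def sample_skewness(values):
    """Return the moment coefficient of skewness (g1) of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical cycle times in hours (assumed for illustration only).
cycle_times = [0.5, 0.8, 1.0, 1.1, 1.3, 1.6, 2.0, 2.7, 4.1, 7.5]
# A perfectly symmetrical sample for comparison.
symmetric = [1, 2, 3, 4, 5, 6, 7]

print(f"cycle times skewness: {sample_skewness(cycle_times):.2f}")
print(f"symmetric skewness:   {sample_skewness(symmetric):.2f}")
```

A skewness figure is only a screening aid. It should be read alongside the histogram and the other signs in the list rather than in place of them.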

Transactional processes and most metrics that involve time measurements tend to follow non-normal distributions. Some examples:

  • Mean time to repair HVAC equipment
  • Admissions cycle time for college applicants
  • Days sales outstanding
  • Waiting times at a bank or physician's office
  • Time being treated in a hospital emergency room

Example: Time in a Hospital Emergency Room

A sample hospital's target time for processing, diagnosing and treating patients entering the ER is four hours or less. Historical data is shown in Figure 3.

Figure 3: Time Spent in ER

An Individuals chart shows several data points above the upper control limit (Figure 4). Based on control chart rules, these special causes indicate the process is not in control (i.e., not stable or predictable). But is this the correct conclusion?

Figure 4: Individuals Chart of Time Spent in ER

There are several signs that the data may not be normal. First, the histogram is skewed to the right (positively). Second, the control chart shows the lower control limit is less than the natural limit of zero. Third, there are a number of high points and no real low points. These tell-tale signs indicate the data may not be normally distributed enough for an Individuals control chart. When control charts are used with non-normal data, they can give false special-cause signals. One remedy is to transform the data so that it follows the normal distribution; standard control chart calculations can then be used on the transformed data.
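The second sign, a lower control limit below the natural limit of zero, is easy to check once the limits are calculated. The sketch below uses the conventional Individuals chart formula (mean plus or minus 2.66 times the average moving range). The function name and the ER times are hypothetical, chosen only to be right-skewed, and are not the data behind Figures 3 and 4.

```python
from statistics import mean


def individuals_chart_limits(values):
    """Return (LCL, center line, UCL) for an Individuals chart.

    Uses the conventional constant 2.66, which is 3 / d2 with d2 = 1.128
    for moving ranges of two consecutive points.
    """
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    half_width = 2.66 * mean(moving_ranges)
    return center - half_width, center, center + half_width


# Hypothetical ER times in hours (assumed, right-skewed).
er_hours = [1.2, 0.8, 2.5, 1.0, 3.9, 0.6, 1.7, 6.8, 1.1, 2.2, 0.9, 4.6]

lcl, center, ucl = individuals_chart_limits(er_hours)
print(f"LCL={lcl:.2f}  CL={center:.2f}  UCL={ucl:.2f}")
if lcl < 0:
    print("LCL is below the natural limit of zero; check for non-normality.")
```

A negative lower limit for a time measurement is not proof of non-normality, but it is a prompt to look at the histogram before trusting any out-of-control signals.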

A Closer Look at Non-normal Data

There are two types of non-normal data:

  • Type A: Data that exists in another distribution
  • Type B: Data that contains a mixture of multiple distributions or processes

Type A data - One way to properly analyze the data is to identify the appropriate distribution for it (e.g., lognormal, Weibull, exponential and so on). Some common distributions, data types and examples associated with these distributions are in Table 1.

Table 1: Distribution Types
Distribution | Type | Data Examples
Normal | Continuous | Useful when it is equally likely the readings will fall above or below the average
Lognormal | Continuous | Cycle or lead time data
Weibull | Continuous | Mean time-to-failure data, time to repair and material strength
Exponential | Continuous | Constant failure rate conditions of products
Poisson | Discrete | Number of events in a specific time period (defect counts per interval such as arrivals, failures or defects)
Binomial | Discrete | Proportion or number of defectives

A second way is to transform the data so that it follows the normal distribution. A common technique is the Box-Cox transformation. The Box-Cox is a power transformation because the data is transformed by raising the original measurements to a power lambda (λ). Some common lambda values, the transformation equation and the resulting transformed value assuming Y = 4 are in Table 2.

Table 2: Lambda Values and Their Transformation Equations and Values
Lambda (λ) | Transformation Equation | Transformed Value (Y = 4)
-2.0 | 1/Y² | 1/4² = 0.0625
-1.0 | 1/Y | 1/4 = 0.25
-0.5 | 1/√Y | 1/√4 = 0.5
0.0 | ln(Y) | ln(4) = 1.3863
0.5 | √Y | √4 = 2
1.0 | Y | 4
2.0 | Y² | 4² = 16

Note: ln is the natural logarithm, the logarithm having base e, where e is the constant equal to approximately 2.71828. The natural log of any positive number, n, is the exponent, x, to which e must be raised so that e^x = n. For example, 2.71828^x = 4 when x = 1.3863, so the natural log of 4 is 1.3863.
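The table values can be reproduced with a few lines of code. This is a minimal sketch of the simple power form shown in Table 2, with lambda = 0 treated as the natural log; the function name is hypothetical, and statistical packages may apply a scaled version of the same transformation.

```python
import math


def power_transform(y, lam):
    """Apply the simple power transformation from Table 2 to a positive value."""
    if y <= 0:
        raise ValueError("the transformation requires positive data")
    if lam == 0:
        return math.log(y)  # lambda = 0 is defined as the natural log
    return y ** lam


# Reproduce the Table 2 column for Y = 4.
for lam in (-2, -1, -0.5, 0, 0.5, 1, 2):
    print(f"lambda={lam:>4}: {power_transform(4, lam):.4f}")
```

The requirement for positive data matters in practice: cycle times of exactly zero have to be handled (for example, by adding a small offset) before a log or negative power can be applied.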

Type B data - If none of the distributions or transformations fit, the non-normal data may be "pollution" caused by a mixture of multiple distributions or processes. Examples of this type of pollution include complex work activities; multiple shifts, locations or customers; and seasonality. Practitioners can try stratifying the data, or breaking it down into categories, to make sense of it. For example, the cycle time required for attorneys to complete contract documents is generally not normally distributed, nor does it have a lognormal distribution. Stratifying the data can reveal that some contract documents, such as residential real estate closings, are much simpler to research, draft and execute than more complex contract documents. Hence, the complex contracts account for the longer times, while the simpler contracts have shorter times. Another approach is to convert all the process data to a common denominator, such as contract draft time per page. Afterward, all the data can be recombined and tested for a single distribution.
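Both approaches, stratifying by category and converting to a common denominator, can be sketched in a few lines. The contract records below are hypothetical and exist only to illustrate the mechanics; the field layout (type, drafting hours, pages) is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (contract type, drafting hours, pages).
contracts = [
    ("residential closing", 3.0, 10),
    ("residential closing", 4.5, 15),
    ("residential closing", 3.6, 12),
    ("commercial lease", 20.0, 60),
    ("commercial lease", 27.0, 90),
    ("commercial lease", 22.5, 75),
]

# Approach 1: stratify the cycle times by contract type.
hours_by_type = defaultdict(list)
for contract_type, hours, pages in contracts:
    hours_by_type[contract_type].append(hours)

for contract_type, hours in hours_by_type.items():
    print(f"{contract_type}: n={len(hours)}, mean hours={mean(hours):.1f}")

# Approach 2: convert to a common denominator (hours per page) and recombine.
hours_per_page = [hours / pages for _, hours, pages in contracts]
print([round(rate, 2) for rate in hours_per_page])
```

In this made-up data the raw hours fall into two separate clusters, while hours per page are nearly identical across contract types, which is the situation in which recombining the data and testing for a single distribution makes sense.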

Revisiting the Hospital Example

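Because the hospital ER data is non-normal, it can be transformed using the Box-Cox technique and statistical analysis software. The software identifies an optimum lambda value of 0.5, the value that minimizes the standard deviation of the transformed data (Figure 5). From Table 2, a lambda of 0.5 corresponds to a square-root transformation.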

Figure 5: Box-Cox Plot of Time Spent in ER
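The search for the optimum lambda can be sketched as a simple grid search. Standard deviations of differently transformed data are not directly comparable, so this sketch uses the scale-adjusted (standardized) form of the Box-Cox transformation, which divides by a power of the geometric mean. That choice, the candidate lambda values, the function names and the synthetic data are all assumptions made for illustration; the article does not describe how its software performs the search.

```python
import math
from statistics import NormalDist, stdev


def standardized_box_cox(values, lam):
    """Scale-adjusted Box-Cox transform, comparable across lambda values."""
    g = math.exp(sum(math.log(v) for v in values) / len(values))  # geometric mean
    if lam == 0:
        return [g * math.log(v) for v in values]
    return [(v ** lam - 1) / (lam * g ** (lam - 1)) for v in values]


def best_lambda(values, candidates=(-2, -1, -0.5, 0, 0.5, 1, 2)):
    """Return the candidate lambda whose standardized transform has the smallest SD."""
    return min(candidates, key=lambda lam: stdev(standardized_box_cox(values, lam)))


# Synthetic data (assumed): squares of evenly spaced normal quantiles,
# so the square root of the data is normal by construction.
normal = NormalDist(mu=3, sigma=1)
synthetic = [normal.inv_cdf((i + 0.5) / 200) ** 2 for i in range(200)]

print("best lambda:", best_lambda(synthetic))
```

Restricting the search to the round values in Table 2 keeps the chosen transformation easy to explain (square root, log, reciprocal) at the cost of a slightly less exact fit.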

Notice that the histogram of the transformed data (Figure 6) is much more normalized (bell-shaped, symmetrical) than the histogram in Figure 3.

Figure 6: ER Time Data after Transformation

An alternative to transforming the data is to find a non-normal distribution that does fit the data. Figure 7 shows probability plots for the ER waiting time using the normal, lognormal, exponential and Weibull distributions.

Figure 7: Various Distributions of Time in ER Data

Statistical software scales the x- and y-axes of each probability plot so the data points follow the blue, perfect-model line if that distribution is a good fit for the data. Looking at the various distributions, the exponential distribution appears to be a poor model for hospital ER times. In contrast, data points in the lognormal and Weibull probability plots follow the model line well. But which one is the better distribution?

The Anderson-Darling test can be used as an indicator of goodness-of-fit for each candidate distribution. It produces a p-value, which is a probability that is compared to the decision criterion, the alpha (α) risk. Assume α = 0.05, meaning there is a 5 percent risk of rejecting the null hypothesis when it is true. The hypothesis test for this example is:

Null (H0) = The data follows the specified distribution

Alternate (H1) = The data does not follow the specified distribution

If the p-value is equal to or less than alpha, there is evidence that the data does not follow the specified distribution. Conversely, a p-value greater than alpha means there is not enough evidence to rule that distribution out.

The p-value for the lognormal distribution is 0.058 while the p-value for the Weibull distribution is 0.162. Both are above the 0.05 alpha risk, so neither distribution is rejected, but the Weibull distribution is the better choice because its higher p-value indicates weaker evidence against the fit.
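For readers who want to see what the software is computing, the sketch below calculates the Anderson-Darling statistic for a normal fit using only the Python standard library. It covers the normal case only: testing a lognormal fit is equivalent to testing the logs of the data for normality, whereas a Weibull fit requires different tables and is not shown. Instead of a p-value, the sketch compares the adjusted statistic with 0.752, the commonly tabulated 5 percent critical value when the mean and standard deviation are estimated from the data. That critical value, the function name and the synthetic data are assumptions that do not come from this article.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(values):
    """Return the adjusted Anderson-Darling statistic for a normal fit.

    The mean and SD are estimated from the data, and the usual small-sample
    adjustment (1 + 0.75/n + 2.25/n^2) is applied.
    """
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    cdf = [fitted.cdf(v) for v in sorted(values)]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a_squared = -n - total / n
    return a_squared * (1 + 0.75 / n + 2.25 / n ** 2)


CRITICAL_5_PERCENT = 0.752  # assumed from standard tables (normal, parameters estimated)

# Synthetic lognormal-like times (assumed): exp of evenly spaced normal quantiles.
base = NormalDist(0, 1)
times = [math.exp(base.inv_cdf((i + 0.5) / 50)) for i in range(50)]

for label, sample in (("raw", times), ("log-transformed", [math.log(t) for t in times])):
    statistic = anderson_darling_normal(sample)
    verdict = "reject" if statistic > CRITICAL_5_PERCENT else "fail to reject"
    print(f"{label}: A2 = {statistic:.3f} -> {verdict} normality at alpha = 0.05")
```

A larger statistic corresponds to a smaller p-value, so comparing the statistic with the critical value leads to the same decision as comparing the p-value with alpha.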

Now the Weibull distribution can be used to construct the proper Individuals control chart (Figure 8). Notice all of the data points are within the control limits; hence, the process is stable and predictable.

Figure 8: Individuals Control Chart Using Weibull Distribution
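One common way to set control limits from a fitted non-normal distribution is to use the percentiles that match the tail areas of a normal chart's three-sigma limits (0.135 percent and 99.865 percent), with the median as the center line. The sketch below applies that idea to a two-parameter Weibull distribution. Whether the software behind Figure 8 uses this exact method is an assumption, and the shape and scale values are hypothetical, not the parameters fitted to the ER data.

```python
import math


def weibull_quantile(p, shape, scale):
    """Inverse CDF of a two-parameter Weibull distribution."""
    return scale * (-math.log(1 - p)) ** (1 / shape)


# Hypothetical fitted parameters (NOT taken from the article's ER data).
shape, scale = 1.5, 2.2

lcl = weibull_quantile(0.00135, shape, scale)     # matches the -3 sigma tail area
center = weibull_quantile(0.5, shape, scale)      # median as center line
ucl = weibull_quantile(0.99865, shape, scale)     # matches the +3 sigma tail area

print(f"LCL={lcl:.3f}  CL={center:.3f}  UCL={ucl:.3f}")
```

Because the Weibull distribution is bounded at zero, the lower limit from this approach can never fall below the natural limit, and the upper limit sits farther out to allow for the long right tail.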

Now that the process is in control, it can be assessed using indices such as Cpk (Figure 9). Overall, this is a predictable process with 8.85 percent of ER visits exceeding the four-hour specification.

Figure 9: Process Capability of Time in ER 

A similar assessment can be made with a probability plot, which shows this is a predictable process and that 91 percent of the ER waiting times are within four hours. Put another way, only 9 percent of the patients will take longer than the four-hour target to be processed, diagnosed and treated in the hospital ER. This is an explanation that management can readily understand.

Figure 10: Probability Plot of Time Spent in ER
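The percentage of visits beyond the target can be read directly from a fitted distribution. For a two-parameter Weibull distribution, the fraction above a limit is exp(-(limit/scale)^shape). The sketch below shows the calculation with hypothetical shape and scale values; they are not the parameters fitted to the ER data, so the printed figure should not be expected to match the article's result.

```python
import math


def weibull_fraction_above(limit, shape, scale):
    """Fraction of a two-parameter Weibull distribution that exceeds `limit`."""
    return math.exp(-((limit / scale) ** shape))


# Hypothetical fitted parameters (NOT taken from the article's ER data).
shape, scale = 1.5, 2.2
target_hours = 4.0

fraction_late = weibull_fraction_above(target_hours, shape, scale)
print(f"{fraction_late:.1%} of visits expected to exceed {target_hours:.0f} hours")
print(f"{1 - fraction_late:.1%} of visits expected within target")
```

The same function can be evaluated at other limits, for example to estimate how many visits would exceed a tighter three-hour target.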

Better Knowledge, Better Decisions

Non-normal data may be more common in business processes than many people think. When control charts are used with non-normal data, they can give false signals of special cause variation, leading to inaccurate conclusions and inappropriate business strategies. Given this reality, it is important to be able to identify the characteristics of non-normal data and know how to properly transform the data. In doing so, practitioners will make better decisions about their business and save time and resources in the process.

About the Author: Peter J. Sherman is a certified Lean Six Sigma Master Black Belt and an ASQ-certified Quality Engineer with 22 years of experience, including serving as senior Black Belt for AT&T's Product Development Group. He has a master's degree in engineering from the Massachusetts Institute of Technology (MIT) and an MBA from Georgia State University. As a visiting scholar to Japan while at MIT, he worked with quality expert W. Edwards Deming. Sherman is the lead instructor at Emory University's Six Sigma Certificate Program in Atlanta, and is a member of the American Society for Quality and the International Society of Six Sigma Professionals.
