Six Sigma professionals should be familiar with normally distributed processes: the characteristic bell-shaped curve that is symmetrical about the mean, with tails approaching plus and minus infinity (Figure 1).
When data fits a normal distribution, practitioners can make statements about the population using common analytical techniques, including control charts and capability indices (such as sigma level, C_{p}, C_{pk}, defects per million opportunities and so on).
But what happens when a business process is not normally distributed? How do practitioners know the data is not normal? How should this type of data be treated? Practitioners can benefit from an overview of normal and nonnormal distributions, as well as familiarizing themselves with some simple tools to detect nonnormality and techniques to accurately determine whether a process is in control and capable.
There are some common ways to identify nonnormal data:
1. The histogram does not look bell shaped. Instead, it is skewed positively or negatively (Figure 2).
2. A natural process limit exists. Zero is often the natural process limit when describing cycle times and lead times. For example, when a restaurant promises to deliver a pizza in 30 minutes or less, zero minutes is the natural lower limit.
3. A time series plot shows large shifts in data.
4. The process data is known to be seasonal.
5. Process data fluctuates (e.g., product mix changes).
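The first sign in the list above, a skewed histogram, can also be checked numerically. The following is a minimal sketch, assuming the moment-based sample skewness as the measure; the function name and the delivery times are hypothetical and used only for illustration. A value near zero suggests symmetry, while a clearly positive value suggests a long right tail.

```python
import statistics

def sample_skewness(values):
    """Return the moment-based sample skewness of a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical delivery times in minutes, bounded below by zero.
delivery_minutes = [12, 14, 15, 15, 16, 17, 18, 19, 22, 25, 31, 48]
print(round(sample_skewness(delivery_minutes), 2))
```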
Transactional processes and most metrics that involve time measurements tend to follow nonnormal distributions. Consider the following example:
A sample hospital’s target time for processing, diagnosing and treating patients entering the ER is four hours or less. Historical data is shown in Figure 3.
An individuals chart shows several data points above the upper control limit (Figure 4). Based on control chart rules, these special causes indicate the process is not in control (i.e., not stable or predictable). But is this the correct conclusion?
There are three signs that the data may not be normal. First, the histogram is skewed to the right (positively). Second, the control chart shows the lower control limit is less than the natural limit of zero. Third, the chart has many high points but no correspondingly low points. These telltale signs indicate the data may not be close enough to normal for an individuals control chart. When control charts are used with nonnormal data, they can give false special-cause signals. One remedy is to transform the data so that it follows the normal distribution. Once this is done, standard control chart calculations can be used on the transformed data.
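The second sign is easy to reproduce. The sketch below computes individuals chart limits with the usual moving-range estimate (center line plus or minus 2.66 times the average moving range) and checks whether the lower control limit drops below the natural limit of zero. The function name and the ER times are hypothetical; they are not the data behind the figures.

```python
import statistics

def individuals_chart_limits(values):
    """Return (lcl, center, ucl) for an individuals chart.

    Uses center +/- 2.66 * average moving range, where 2.66 = 3 / d2
    and d2 = 1.128 for moving ranges of two consecutive points.
    """
    center = statistics.fmean(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg_mr = statistics.fmean(moving_ranges)
    return center - 2.66 * avg_mr, center, center + 2.66 * avg_mr

# Hypothetical right-skewed ER times in hours (not the article's data).
er_hours = [1.2, 0.8, 2.5, 1.1, 0.6, 6.9, 1.4, 0.9, 3.8, 1.0, 0.7, 2.2]
lcl, center, ucl = individuals_chart_limits(er_hours)
below_natural_limit = lcl < 0  # a negative limit is impossible for time data
print(f"LCL={lcl:.2f}  center={center:.2f}  UCL={ucl:.2f}  LCL<0: {below_natural_limit}")
```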
There are two types of nonnormal data:
Type A data – One way to properly analyze the data is to identify it with the appropriate distribution (e.g., lognormal, Weibull, exponential and so on). Some common distributions, data types and examples associated with these distributions are in Table 1.
Table 1: Distribution Types  
Distribution  Data Type  Examples 
Normal  Continuous  Useful when it is equally likely the readings will fall above or below the average 
Lognormal  Continuous  Cycle or lead time data 
Weibull  Continuous  Mean time-to-failure data, time to repair and material strength 
Exponential  Continuous  Constant failure rate conditions of products 
Poisson  Discrete  Number of events in a specific time period (counts per interval, such as arrivals, failures or defects) 
Binomial  Discrete  Proportion or number of defectives 
A second way is to transform the data so that it follows the normal distribution. A common transformation technique is the Box-Cox. The Box-Cox is a power transformation because the data is transformed by raising the original measurements to a power lambda (λ). Some common lambda values, the transformation equation and resulting transformed value assuming Y = 4 are in Table 2.
Table 2: Lambda Values and Their Transformation Equations and Values  
Lambda (λ)  Transformation Equation  Transformed Value (Y = 4) 
-2.0  1/Y^{2}  1/4^{2} = 0.0625 
-1.0  1/Y  1/4 = 0.25 
-0.5  1/√Y  1/√4 = 0.5 
0.0  ln Y  ln 4 = 1.3863. The natural log (base e, where e is approximately 2.71828) of any positive number, n, is the exponent, x, to which e must be raised so that e^{x} = n. Here, 2.71828^{x} = 4 gives x = 1.3863. 
0.5  √Y  √4 = 2 
1.0  Y  4 
2.0  Y^{2}  4^{2} = 16 
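The values in Table 2 can be reproduced with a few lines of code. This is a minimal sketch of the simple power form shown in the table (Y raised to λ, with the natural log used when λ is 0); the function name is hypothetical, and statistical packages may apply a rescaled variant of the same transformation.

```python
import math

def power_transform(y, lam):
    """Apply the simple power form from Table 2: Y**lambda, or ln(Y) when lambda is 0."""
    if y <= 0:
        raise ValueError("Box-Cox transformations require positive data")
    if lam == 0:
        return math.log(y)
    return y ** lam

for lam in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"lambda={lam:+.1f}  transformed={power_transform(4, lam):.4f}")
```

Note that the negative lambda values correspond to the reciprocal transformations, and that the function rejects zero or negative measurements, for which the transformation is undefined.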
Type B data – If none of the distributions or transformations fit, the nonnormal data may be “pollution” caused by a mixture of multiple distributions or processes. Examples of this type of pollution include complex work activities; multiple shifts, locations, or customers; and seasonality. Practitioners can try stratifying or breaking down the data into categories to make sense of it. For example, the cycle time required for attorneys to complete contract documents is generally not normally distributed. Nor does it have a lognormal distribution. Stratifying the data can reveal that some contract documents, such as residential real estate closings, are much simpler to research, draft and execute than more complex contract documents. Hence, the complex contracts account for the longer times, while the simpler contracts have shorter times. Another approach is to convert all the process data into a common denominator, such as contract draft time per page. Afterward, all the data can be recombined and tested for a single distribution.
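Both approaches for Type B data, stratifying and converting to a common denominator, can be sketched briefly. The records, contract types and function names below are hypothetical and serve only to show the mechanics.

```python
from collections import defaultdict
import statistics

# Hypothetical records: (contract type, drafting hours, page count).
contracts = [
    ("residential closing", 2.0, 10),
    ("residential closing", 2.5, 12),
    ("residential closing", 1.8, 9),
    ("commercial lease", 12.0, 60),
    ("commercial lease", 15.0, 70),
    ("merger agreement", 40.0, 210),
]

def stratify(records):
    """Group drafting hours by contract type."""
    groups = defaultdict(list)
    for kind, hours, _pages in records:
        groups[kind].append(hours)
    return dict(groups)

def hours_per_page(records):
    """Convert every record to a common denominator: drafting hours per page."""
    return [hours / pages for _kind, hours, pages in records]

for kind, hours in stratify(contracts).items():
    print(kind, "median hours:", statistics.median(hours))
print("hours per page:", [round(r, 3) for r in hours_per_page(contracts)])
```

Each stratum, or the recombined per-page values, can then be tested against a single distribution.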
Because the hospital ER data is nonnormal, it can be transformed using the Box-Cox technique and statistical analysis software. The optimum lambda value of 0.5 minimizes the standard deviation (Figure 5).
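How software arrives at an optimum lambda can be sketched as a small search. The example below assumes the geometric-mean-standardized form of the Box-Cox transformation, which keeps standard deviations comparable across lambda values, and simply picks the candidate with the smallest standard deviation. The function names, the candidate grid and the sample times are assumptions for illustration; the article does not state which software or search method produced Figure 5.

```python
import math
import statistics

def standardized_box_cox(values, lam):
    """Box-Cox transform rescaled by the geometric mean so that standard
    deviations are comparable across different lambda values."""
    log_values = [math.log(v) for v in values]
    geo_mean = math.exp(statistics.fmean(log_values))
    if lam == 0:
        return [geo_mean * lv for lv in log_values]
    scale = lam * geo_mean ** (lam - 1)
    return [(v ** lam - 1) / scale for v in values]

def best_lambda(values, candidates):
    """Return the candidate whose standardized transform has the smallest standard deviation."""
    return min(
        candidates,
        key=lambda lam: statistics.stdev(standardized_box_cox(values, lam)),
    )

candidates = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
# Hypothetical positive, right-skewed times in hours (not the article's data).
times = [0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4, 2.2, 2.5, 3.8, 6.9]
print(best_lambda(times, candidates))
```

A coarse grid of rounded values is used here because rounded lambdas such as 0.5 or 0 are easier to explain than an exact optimum.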
Notice that the histogram of the transformed data (Figure 6) is much closer to normal (bell-shaped and symmetrical) than the histogram in Figure 3.
An alternative to transforming the data is to find a nonnormal distribution that does fit the data. Figure 7 shows probability plots for the ER waiting time using the normal, lognormal, exponential and Weibull distributions.
Statistical software calculated the x- and y-axes of each probability plot so the data points would follow the blue, perfect-model line if that distribution were a good fit of the data. Looking at the various distributions, the exponential distribution appears to be a poor model for hospital ER times. In contrast, data points in the lognormal and Weibull probability plots follow the model line well. But which one is the better distribution?
The Anderson-Darling test can be used as an indicator of goodness-of-fit. It produces a p-value, which is a probability that is compared to the decision criterion, the alpha (α) risk. Assume α = 0.05, meaning there is a 5 percent risk of rejecting the null when it is true. For the normal distribution, the hypothesis test is:
Null (H_{0}) = The data is normally distributed
Alternate (H_{1}) = The data is not normally distributed
If the p-value is equal to or less than alpha, there is evidence that the data does not follow a normal distribution. Conversely, a p-value greater than alpha means there is not enough evidence to reject normality. The same test can be applied to other candidate distributions by substituting the candidate for the normal distribution in both hypotheses.
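The test statistic behind this decision can be computed by hand. The sketch below covers only the normal case with the mean and standard deviation estimated from the sample. Because the Python standard library offers no Anderson-Darling p-value, it swaps the p-value comparison for the equivalent comparison against the published 5 percent critical value of 0.752; the function names and sample data are hypothetical.

```python
import math
import statistics

def anderson_darling_normal(values):
    """Return the small-sample-adjusted Anderson-Darling statistic for normality.

    The mean and standard deviation are estimated from the sample.
    """
    n = len(values)
    dist = statistics.NormalDist(statistics.fmean(values), statistics.stdev(values))
    cdf = [dist.cdf(x) for x in sorted(values)]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a_squared = -n - total / n
    return a_squared * (1 + 0.75 / n + 2.25 / n ** 2)

CRITICAL_5_PERCENT = 0.752  # critical value for alpha = 0.05, parameters estimated

def rejects_normality(values):
    """True when the statistic exceeds the 5 percent critical value."""
    return anderson_darling_normal(values) > CRITICAL_5_PERCENT

# Hypothetical samples: evenly spaced normal and exponential quantiles.
symmetric = [statistics.NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [-math.log(1 - (i + 0.5) / 50) for i in range(50)]
print(rejects_normality(symmetric), rejects_normality(skewed))
```

Exceeding the critical value corresponds to a p-value at or below alpha, so the decision rule is the same as the one described above.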
The p-value for the lognormal distribution is 0.058 while the p-value for the Weibull distribution is 0.162. While both are above the 0.05 alpha risk, the Weibull distribution is the better choice because its larger p-value indicates weaker evidence against the hypothesis that the data follows it.
Now the Weibull distribution can be used to construct the proper individuals control chart (Figure 8). Notice all of the data points are within the control limits; hence, the process is stable and predictable.
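The article does not say how its software derived the Weibull-based limits. One common approach, shown here purely as an assumption, is to use probability limits: the 0.135th and 99.865th percentiles of the fitted distribution, which cover the same tail areas as plus and minus three sigma on a normal chart, with the median as the center line. The shape and scale values are hypothetical, not fitted to the article's data.

```python
import math

def weibull_quantile(p, shape, scale):
    """Inverse CDF of a two-parameter Weibull distribution."""
    return scale * (-math.log(1 - p)) ** (1 / shape)

def probability_limits(shape, scale):
    """Return (lcl, center, ucl) from the 0.135th, 50th and 99.865th percentiles."""
    return (
        weibull_quantile(0.00135, shape, scale),
        weibull_quantile(0.5, shape, scale),
        weibull_quantile(0.99865, shape, scale),
    )

# Hypothetical fitted parameters (not taken from the article's data).
lcl, center, ucl = probability_limits(shape=1.5, scale=2.4)
print(f"LCL={lcl:.3f}  center={center:.3f}  UCL={ucl:.3f}")
```

Because the Weibull distribution is defined only for positive values, the lower limit from this approach can never fall below the natural limit of zero.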
Now that the process is in control, it can be assessed using indices such as C_{pk} (Figure 9). Overall, this is a predictable process with 8.85 percent of ER visit times out of specification.
A similar assessment can be made with a probability plot, which shows this is a predictable process and that 91 percent of the ER waiting times are within four hours. Put another way, only 9 percent of the patients will take longer than the four-hour target to be processed, diagnosed and treated in the hospital ER. This is an explanation that management can readily understand.
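The share of visits beyond a target follows directly from the fitted distribution. For a two-parameter Weibull distribution, the fraction above a limit t is exp(-(t/scale)^shape). The sketch below uses hypothetical shape and scale values, so its output is not expected to match the percentages quoted above.

```python
import math

def weibull_fraction_above(limit, shape, scale):
    """Fraction of a Weibull-distributed process that exceeds `limit`."""
    return math.exp(-((limit / scale) ** shape))

# Hypothetical fitted parameters (not taken from the article's data).
late = weibull_fraction_above(4.0, shape=1.5, scale=2.4)
print(f"{late:.1%} of visits exceed the four-hour target")
```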
Nonnormal data may be more common in business processes than many people think. When control charts are used with nonnormal data, they can give false signals of special cause variation, leading to inaccurate conclusions and inappropriate business strategies. Given this reality, it is important to be able to identify the characteristics of nonnormal data and know how to properly transform the data. In doing so, practitioners will make better decisions about their business and save time and resources in the process.


Comments
Excellent Post – Very Informative
This is a great post. What impact would we see if this was a short-term analysis, to the point where your “newer” data points have less variance simply because they are newer, whereas the “older” data has had more time to accrue substantial outliers? Is there a definitive way to address this, or is it just a matter of trimming the outliers off or simply waiting longer to analyze?
Hello, thanks for this post. One question, if you use a transformation on the data, how do you assess the error? E.g. with a regression analysis?
Great post Peter. I have one doubt here.
For large sample sizes, are capability analysis and control charts sensitive to the normality assumption?
By large sample size I mean greater than 100…
Yes; the answer is to be found early on:
When data fits a normal distribution, practitioners can make statements about the population using common analytical techniques, including control charts and capability indices …
If you are clear that the sample is representative of the population then the characteristics describing the shape should be identical for both sample and population. The normal distribution is, or should be, the shape of both the sample and the population. A larger sample size should, if randomly selected, be more representative of the population than a smaller one. HTH
You can use the Kolmogorov-Smirnov test for large sample sizes and Shapiro-Wilk for samples smaller than 2000.
null hypothesis is normality
Hi,
Very helpful! Was this completed in R? Is there any place I can find this code/dataset on the web?
Thanks!