Tips for Recognizing and Transforming Non-normal Data

Six Sigma professionals should be familiar with normally distributed processes: the characteristic bell-shaped curve that is symmetrical about the mean, with tails approaching plus and minus infinity (Figure 1).

Figure 1: Normally Distributed Data

When data fits a normal distribution, practitioners can make statements about the population using common analytical techniques, including control charts and capability indices (such as sigma level, Cp, Cpk, defects per million opportunities and so on).

But what happens when a business process is not normally distributed? How do practitioners know the data is not normal? How should this type of data be treated? Practitioners can benefit from an overview of normal and non-normal distributions, as well as familiarizing themselves with some simple tools to detect non-normality and techniques to accurately determine whether a process is in control and capable.

Spotting Non-normal Data

There are some common ways to identify non-normal data:

1. The histogram does not look bell shaped. Instead, it is skewed positively or negatively (Figure 2).

Figure 2: Positively and Negatively Skewed Data

2. A natural process limit exists. Zero is often the natural process limit when describing cycle times and lead times. For example, when a restaurant promises to deliver a pizza in 30 minutes or less, zero minutes is the natural lower limit.
3. A time series plot shows large shifts in data.
4. There is known seasonal process data.
5. Process data fluctuates (i.e., product mix changes).
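A first pass at checks 1 and 2 can be scripted. The sketch below (in Python with NumPy and SciPy — an assumption, since the article does not name its software) computes sample skewness and flags values far from zero, the kind of quick screen that would catch cycle-time data bounded below by zero:

```python
import numpy as np
from scipy.stats import skew

def skew_check(data, threshold=1.0):
    """Return sample skewness and a rough verdict.

    Skewness near 0 is consistent with a symmetric, bell-shaped
    histogram; values well beyond +/-1 suggest positive or negative
    skew. The 1.0 threshold is a rule of thumb, not a formal test.
    """
    g1 = skew(np.asarray(data, dtype=float))
    return g1, abs(g1) > threshold

# Cycle-time-like data: bounded below by zero, long right tail
rng = np.random.default_rng(42)
times = rng.exponential(scale=30.0, size=500)  # e.g., minutes
g1, flagged = skew_check(times)
```

A value of `g1` near 2 here, with `flagged` set, is the numeric counterpart of the right-skewed histogram in Figure 2.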

Transactional processes and most metrics that involve time measurements tend to follow non-normal distributions. Some examples:

  • Mean time to repair HVAC equipment
  • Admissions cycle time for college applicants
  • Days sales outstanding
  • Waiting times at a bank or physician’s office
  • Time being treated in a hospital emergency room

Example: Time in a Hospital Emergency Room

A sample hospital’s target time for processing, diagnosing and treating patients entering the ER is four hours or less. Historical data is shown in Figure 3.

Figure 3: Time Spent in ER

An individuals chart shows several data points above the upper control limit (Figure 4). Based on control chart rules, these special causes indicate the process is not in control (i.e., not stable or predictable). But is this the correct conclusion?

Figure 4: Individuals Chart of Time Spent in ER

There are a few ways to tell the data may not be normal. First, the histogram is skewed to the right (positively skewed). Second, the control chart shows a lower control limit below the natural limit of zero. Third, there are several unusually high points but no correspondingly low points. These tell-tale signs indicate the data may not be close enough to normal for an individuals control chart. When control charts are used with non-normal data, they can give false special-cause signals. Therefore, the data must be transformed to follow the normal distribution; once this is done, standard control chart calculations can be applied to the transformed data.
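The control limits behind an individuals chart can be computed directly. This sketch (Python with NumPy, an illustration rather than the article's actual software) uses the usual mean ± 2.66 × average moving range limits and shows how a skewed sample produces a lower control limit below the natural floor of zero — the second tell-tale sign above:

```python
import numpy as np

def i_chart_limits(x):
    """Individuals-chart limits: mean +/- 2.66 * average moving range."""
    x = np.asarray(x, dtype=float)
    mr = np.abs(np.diff(x))       # moving ranges between consecutive points
    center = x.mean()
    spread = 2.66 * mr.mean()     # 2.66 = 3 / d2 for subgroups of size 2
    return center - spread, center, center + spread

# Positively skewed "wait time" data with a hard floor at zero
rng = np.random.default_rng(7)
waits = rng.lognormal(mean=1.0, sigma=0.8, size=200)
lcl, center, ucl = i_chart_limits(waits)
```

For data like this, `lcl` comes out negative — an impossible waiting time — which is exactly the symptom the article describes.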

A Closer Look at Non-normal Data

There are two types of non-normal data:

  • Type A: Data that exists in another distribution
  • Type B: Data that contains a mixture of multiple distributions or processes

Type A data – One way to properly analyze the data is to identify the appropriate distribution it follows (i.e., lognormal, Weibull, exponential and so on). Some common distributions, along with their data types and examples, appear in Table 1.

Table 1: Distribution Types

| Distribution | Data Type | Examples |
| --- | --- | --- |
| Normal | Continuous | Useful when readings are equally likely to fall above or below the average |
| Lognormal | Continuous | Cycle time or lead time data |
| Weibull | Continuous | Mean time to failure, time to repair and material strength |
| Exponential | Continuous | Constant failure rate conditions of products |
| Poisson | Discrete | Number of events in a specific time period (counts per interval, such as arrivals, failures or defects) |
| Binomial | Discrete | Proportion or number of defectives |

A second way is to transform the data so that it follows the normal distribution. A common technique is the Box-Cox transformation, a power transformation in which the original measurements are raised to a power, lambda (λ). Some common lambda values, their transformation equations and the resulting transformed values (assuming Y = 4) are shown in Table 2.

Table 2: Lambda Values and Their Transformation Equations and Values

| Lambda (λ) | Transformation Equation | Transformed Value (Y = 4) |
| --- | --- | --- |
| -2.0 | 1/Y² | 1/4² = 0.0625 |
| -1.0 | 1/Y | 1/4 = 0.25 |
| -0.5 | 1/√Y | 1/√4 = 0.5 |
| 0.0 | ln(Y) | ln(4) = 1.3863 |
| 0.5 | √Y | √4 = 2 |
| 1.0 | Y | 4 |
| 2.0 | Y² | 4² = 16 |

For λ = 0, the transformation is the natural logarithm: the logarithm with base e, where e is the constant approximately equal to 2.71828. The natural log of any positive number n is the exponent x to which e must be raised so that e^x = n; for example, e^1.3863 ≈ 4, so the natural log of 4 is 1.3863.
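The Y = 4 column can be verified in a few lines. The sketch below (Python, an assumption) reproduces the raw power transforms, and also shows SciPy's `boxcox`, which uses the scaled form (Y^λ − 1)/λ rather than the raw power Y^λ — a shifted and rescaled version with the same normalizing effect:

```python
import math
from scipy.stats import boxcox

y = 4.0
# Raw power transforms, as tabulated (lambda = 0 means natural log)
table = {
    -2.0: 1 / y**2,          # 0.0625
    -1.0: 1 / y,             # 0.25
    -0.5: 1 / math.sqrt(y),  # 0.5
     0.0: math.log(y),       # ln 4 ~ 1.3863
     0.5: math.sqrt(y),      # 2.0
     1.0: y,                 # 4.0
     2.0: y**2,              # 16.0
}

# SciPy's Box-Cox applies (y**lam - 1)/lam for a fixed lambda
scaled = boxcox([4.0, 9.0, 16.0], lmbda=0.5)  # (sqrt(y) - 1) / 0.5
```

Called without `lmbda`, `boxcox` instead searches for the optimal lambda itself, which is how a plot like Figure 5 is produced.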

Type B data – If none of the distributions or transformations fit, the non-normal data may be "polluted" by a mixture of multiple distributions or processes. Examples of this type of pollution include complex work activities; multiple shifts, locations or customers; and seasonality. Practitioners can try stratifying, or breaking down, the data into categories to make sense of it. For example, the cycle time required for attorneys to complete contract documents is generally neither normally nor lognormally distributed. Stratifying the data may reveal that some contract documents, such as residential real estate closings, are much simpler to research, draft and execute than more complex contract documents. The complex contracts account for the longer times, while the simpler contracts account for the shorter times. Another approach is to convert all the process data to a common denominator, such as contract draft time per page; the data can then be recombined and tested against a single distribution.
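Both approaches can be sketched on hypothetical contract data (the categories, hours and page counts below are made up for illustration): stratify by contract type to summarize each stratum separately, or divide by pages to put everything on a common per-page scale.

```python
import statistics

# Hypothetical contract records: (category, draft_hours, pages)
contracts = [
    ("residential_closing", 3.0, 6),
    ("residential_closing", 4.5, 8),
    ("residential_closing", 2.5, 5),
    ("merger_agreement", 60.0, 120),
    ("merger_agreement", 45.0, 90),
]

# Stratify: summarize each category separately
by_type = {}
for kind, hours, pages in contracts:
    by_type.setdefault(kind, []).append(hours)
medians = {kind: statistics.median(v) for kind, v in by_type.items()}

# Common denominator: hours per page puts both strata on one scale
per_page = [hours / pages for _, hours, pages in contracts]
```

Raw hours form two widely separated clusters, while hours per page lands in a single narrow band — the recombined data can then be tested for a single distribution.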

Revisiting the Hospital Example

Because the hospital ER data is non-normal, it can be transformed using the Box-Cox technique and statistical analysis software. The optimum lambda value of 0.5 minimizes the standard deviation (Figure 5).

Figure 5: Box-Cox Plot of Time Spent in ER

Notice that the histogram of the transformed data (Figure 6) is much closer to normal (bell-shaped and symmetrical) than the histogram in Figure 3.

Figure 6: ER Time Data after Transformation

An alternative to transforming the data is to find a non-normal distribution that does fit the data. Figure 7 shows probability plots for the ER waiting time using the normal, lognormal, exponential and Weibull distributions.

Figure 7: Various Distributions of Time in ER Data

Statistical software calculated the x- and y-axes of each probability plot so that the data points would follow the blue, perfect-model line if that distribution were a good fit to the data. Looking at the plots, the exponential distribution appears to be a poor model for hospital ER times. In contrast, the data points in the lognormal and Weibull probability plots follow the model line well. But which is the better distribution?
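Probability plots of this kind can be generated with SciPy's `probplot` (a sketch on simulated data, since the article's ER dataset is not published). Alongside the plotting points, `probplot` returns the correlation coefficient of the fitted line, a quick numeric proxy for how closely the points track the straight reference line:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated ER-style times: lognormal, so lognormal should fit best
times = rng.lognormal(mean=0.5, sigma=0.6, size=300)

def fit_r(data, dist, shape=None):
    """Correlation from a probability plot: closer to 1 = better fit."""
    sparams = (shape,) if shape is not None else ()
    (_, _), (_, _, r) = stats.probplot(data, sparams=sparams, dist=dist)
    return r

r_normal = fit_r(times, "norm")
r_lognormal = fit_r(times, "lognorm", shape=0.6)
r_exponential = fit_r(times, "expon")
```

Here the lognormal plot's correlation beats both rivals, mirroring the visual comparison across the panels of Figure 7.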

The Anderson-Darling test can be used as an indicator of goodness of fit. It produces a p-value, which is compared to the decision criterion, the alpha (α) risk. Assume α = 0.05, meaning there is a 5 percent risk of rejecting the null hypothesis when it is true. The hypothesis test for each candidate distribution is:

Null (H0): The data follows the specified distribution

Alternate (H1): The data does not follow the specified distribution

If the p-value is less than or equal to alpha, there is evidence that the data does not follow that distribution. Conversely, a p-value greater than alpha suggests the distribution is an acceptable fit.
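SciPy exposes this test as `scipy.stats.anderson`, though for the normal case it reports the test statistic against tabulated critical values rather than a single p-value (packages such as Minitab compute the p-value directly). The sketch below shows the decision rule at the 5 percent level on deliberately normal-shaped data:

```python
import numpy as np
from scipy import stats

# Perfectly normal-shaped sample: evenly spaced normal quantiles
probs = np.arange(1, 100) / 100.0
sample = stats.norm.ppf(probs)

result = stats.anderson(sample, dist="norm")
# Critical values are tabulated at several significance levels;
# pick out the 5 percent one and compare the statistic against it.
crit_5pct = result.critical_values[list(result.significance_level).index(5.0)]
looks_normal = result.statistic < crit_5pct  # fail to reject H0
```

A statistic below the 5 percent critical value corresponds to a p-value above 0.05, i.e., no evidence against the candidate distribution.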

The p-value for the lognormal distribution is 0.058, while the p-value for the Weibull distribution is 0.162. Both exceed the 0.05 alpha risk, but the Weibull distribution is the better fit because its higher p-value indicates weaker evidence against it.

Now the Weibull distribution can be used to construct a proper individuals control chart (Figure 8). Notice that all of the data points fall within the control limits; hence, the process is stable and predictable.

Figure 8: Individuals Control Chart Using Weibull Distribution

Now that the process is in control, it can be assessed using indices such as Cpk (Figure 9). Overall, this is a predictable process with 8.85 percent of ER visit time out of specification.

Figure 9: Process Capability of Time in ER

A similar assessment can be made with a probability plot (Figure 10), which shows this is a predictable process in which 91 percent of ER waiting times fall within four hours. Put another way, only about 9 percent of patients will take longer than the four-hour target to be processed, diagnosed and treated in the hospital ER. This is an explanation that management can readily understand.
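The "percent within four hours" figure comes from evaluating the fitted distribution's CDF at the specification limit. The sketch below (Python/SciPy, with made-up Weibull parameters — the article does not report its fitted values) shows the calculation:

```python
from scipy.stats import weibull_min

# Hypothetical fitted Weibull parameters for ER time in hours
shape, scale = 1.5, 2.0   # illustrative only, NOT the article's fit
spec_limit = 4.0          # the four-hour target

within_spec = weibull_min.cdf(spec_limit, shape, scale=scale)
pct_beyond = 1.0 - within_spec  # fraction of visits over four hours
```

With the article's actual fitted parameters, `pct_beyond` would reproduce the roughly 9 percent out-of-specification figure quoted above.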

Figure 10: Probability Plot of Time Spent in ER

Better Knowledge, Better Decisions

Non-normal data may be more common in business processes than many people think. When control charts are used with non-normal data, they can give false signals of special cause variation, leading to inaccurate conclusions and inappropriate business strategies. Given this reality, it is important to be able to identify the characteristics of non-normal data and know how to properly transform the data. In doing so, practitioners will make better decisions about their business and save time and resources in the process.
