TUESDAY, SEPTEMBER 02, 2014

Dealing with Non-normal Data: Strategies and Tools

The role of normally distributed data is commonly misunderstood in Six Sigma. Some people believe that all data collected and used for analysis must be distributed normally. But normal distribution does not happen as often as people think, and it is not a main objective. Normal distribution is a means to an end, not the end itself.

Normally distributed data is needed to use a number of statistical tools, such as individuals control charts, Cp/Cpk analysis, t-tests and the analysis of variance (ANOVA). If a practitioner is not using such a specific tool, however, it is not important whether data is distributed normally. The distribution becomes an issue only when practitioners reach a point in a project where they want to use a statistical tool that requires normally distributed data and they do not have it.

The probability plot in Figure 1 is an example of this type of scenario. In this case, normality clearly cannot be assumed; the p-value is less than 0.05 and more than 5 percent of the data points are outside the 95 percent confidence interval.

Figure 1: Probability Plot of Cycle Time
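
As a rough illustration of the p-value criterion, the sketch below runs a normality test on simulated data. It uses the Jarque-Bera test, chosen here only because its p-value can be computed with the Python standard library; this is an assumption and not necessarily the test behind Figure 1. The data, the seed and the function name are hypothetical, and the p-value is an asymptotic approximation.

```python
import math
import random


def jarque_bera(data):
    """Return (JB statistic, asymptotic p-value) for a normality check."""
    n = len(data)
    mean = sum(data) / n
    # Central moments (population form, dividing by n)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Under normality JB is approximately chi-square with 2 degrees of
    # freedom, whose upper-tail probability is exp(-x / 2).
    p_value = math.exp(-jb / 2)
    return jb, p_value


random.seed(1)
# Hypothetical data: right-skewed cycle times and a symmetric comparison sample
cycle_times = [random.expovariate(1 / 20) for _ in range(200)]
symmetric = [random.gauss(20, 2) for _ in range(200)]

jb_skewed, p_skewed = jarque_bera(cycle_times)
jb_sym, p_sym = jarque_bera(symmetric)
print(f"skewed sample:    JB = {jb_skewed:.1f}, p = {p_skewed:.4f}")
print(f"symmetric sample: JB = {jb_sym:.1f}, p = {p_sym:.4f}")
```

A p-value below 0.05 is read the same way as in the probability plot: normality should not be assumed for that sample.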

What can be done? Basically, there are two options:

  1. Identify and, if possible, address reasons for non-normality or
  2. Use tools that do not require normality

Addressing Reasons for Non-normality

When data is not normally distributed, the cause of the non-normality should be determined and appropriate remedial action taken. Six causes are frequently to blame.

Reason 1: Extreme Values

Too many extreme values in a data set will result in a skewed distribution. Normality can often be restored by cleaning the data: identifying measurement errors, data-entry errors and outliers, and removing them only when there is a valid reason to do so.

It is important that outliers are identified as truly special causes before they are eliminated. Never forget: The nature of normally distributed data is that a small percentage of extreme values can be expected; not every outlier is caused by a special reason. Extreme values should only be explained and removed from the data if there are more of them than expected under normal conditions.
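
To put a number on "more than expected", the sketch below computes how many values a truly normal process would place beyond a given number of standard deviations. The function name and the sample sizes are illustrative assumptions.

```python
from statistics import NormalDist


def expected_extremes(n, k=3.0):
    """Expected count of points beyond +/- k standard deviations
    among n values from a normal distribution."""
    tail_probability = 2 * (1 - NormalDist().cdf(k))
    return n * tail_probability


for n in (100, 1000, 10000):
    print(f"n = {n:>5}: about {expected_extremes(n):.1f} values beyond 3 sigma")
```

About 2.7 in every 1,000 normally distributed values fall beyond 3 standard deviations, so a handful of such points in a large data set is not, on its own, evidence of a special cause.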

Reason 2: Overlap of Two or More Processes

Data may not be normally distributed because it actually comes from more than one process, operator or shift, or from a process that frequently shifts. If two or more data sets that would be normally distributed on their own are overlapped, the combined data may look bimodal or multimodal: its histogram will show two or more peaks.

The remedial action for these situations is to determine which X’s cause bimodal or multimodal distribution and then stratify the data. The data should be checked again for normality and afterward the stratified processes can be worked with separately.

An example: The histogram in Figure 2 shows a website’s non-normally distributed load times. After stratifying the load times by weekend versus working day data (Figure 3), both groups are normally distributed.

Figure 2: Website Load Time Data

Figure 3: Website Load Time Data After Stratification
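
Stratification itself is a simple grouping step. The sketch below uses simulated load times with hypothetical means and spreads for the two day types; it is not the data behind Figures 2 and 3. Each stratum would still need its own normality check afterward.

```python
import random
from collections import defaultdict
from statistics import mean, stdev

random.seed(2)
# Hypothetical load times in seconds, each tagged with the suspected X
records = [("working day", random.gauss(3.0, 0.3)) for _ in range(150)]
records += [("weekend", random.gauss(1.5, 0.2)) for _ in range(60)]

# Stratify: collect the load times separately for each level of the X
strata = defaultdict(list)
for day_type, load_time in records:
    strata[day_type].append(load_time)

all_times = [load_time for _, load_time in records]
print(f"combined: n = {len(all_times)}, mean = {mean(all_times):.2f}, "
      f"sd = {stdev(all_times):.2f}")
for day_type, times in strata.items():
    print(f"{day_type}: n = {len(times)}, mean = {mean(times):.2f}, "
          f"sd = {stdev(times):.2f}")
```

The combined data has a much larger spread than either stratum because it mixes two processes with different centers.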

Reason 3: Insufficient Data Discrimination

Round-off errors or measurement devices with poor resolution can make truly continuous and normally distributed data look discrete and not normal. Insufficient data discrimination, and therefore an insufficient number of different values, can be overcome by using measurement systems with finer resolution or by collecting more data.
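
A quick way to spot this problem is to count how many different values the data actually contains. The sketch below rounds the same simulated measurements to three hypothetical gauge resolutions; the measurement values and the function name are assumptions for illustration.

```python
import random

random.seed(3)
# Hypothetical continuous measurements, e.g. a diameter in mm
true_values = [random.gauss(10.0, 0.05) for _ in range(100)]


def distinct_values(values, resolution):
    """Count the distinct readings a gauge with the given resolution reports."""
    readings = {round(v / resolution) for v in values}
    return len(readings)


for resolution in (0.1, 0.01, 0.001):
    count = distinct_values(true_values, resolution)
    print(f"resolution {resolution}: {count} distinct values")
```

When only a few distinct values remain, the data will look discrete no matter how normal the underlying process is.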

Reason 4: Sorted Data

Collected data might not be normally distributed if it represents simply a subset of the total output a process produced. This can happen if data is collected and analyzed after sorting. The data in Figure 4 resulted from a process where the target was to produce bottles with a volume of 100 ml. The lower and upper specifications were 97.5 ml and 102.5 ml. Because all bottles outside of the specifications were already removed from the process, the data is not normally distributed – even if the original data would have been.

Figure 4: Sorted Bottle Volume Data
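
The effect of sorting can be estimated from the specification limits. The sketch below assumes the unsorted process is normal, centered on the 100 ml target, with a standard deviation of 1.5 ml; that standard deviation is a hypothetical value, not one given for Figure 4.

```python
from statistics import NormalDist

# Assumption: unsorted process is normal, on target, with sigma = 1.5 ml
process = NormalDist(mu=100.0, sigma=1.5)
lower_spec, upper_spec = 97.5, 102.5

# Share of bottles that sorting removes before the data is collected
removed_share = process.cdf(lower_spec) + (1 - process.cdf(upper_spec))
print(f"share removed by sorting: {removed_share:.1%}")

# Simulate the production run, then keep only the bottles inside the specs
produced = process.samples(1000, seed=4)
shipped = [v for v in produced if lower_spec <= v <= upper_spec]
print(f"produced: {len(produced)}, remaining after sorting: {len(shipped)}")
print(f"remaining range: {min(shipped):.2f} to {max(shipped):.2f} ml")
```

The remaining data ends abruptly at the specification limits, so both tails of the bell curve are missing.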

Reason 5: Values Close to Zero or a Natural Limit

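If a process has many values close to zero or a natural limit, the data distribution will skew to the right or left. In this case, a transformation, such as the Box-Cox power transformation, may help make data normal. In this method, all data is raised, or transformed, to a certain exponent, indicated by a Lambda value; a Lambda of 0 corresponds to taking the natural logarithm. The transformation requires strictly positive data. When comparing transformed data, everything under comparison must be transformed in the same way, so specification limits, for example, must be transformed with the same Lambda as the measurements.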
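
A minimal sketch of the Box-Cox transformation follows. It uses the common textbook form (x^Lambda - 1) / Lambda, which for a fixed Lambda differs from the plain power x^Lambda only by a shift and a scale; statistical packages may use either form. The data values are hypothetical, and Lambda is supplied by hand here rather than estimated from the data.

```python
import math


def box_cox(values, lam):
    """Box-Cox power transformation for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        # Lambda = 0 is defined as the natural logarithm
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


cycle_times = [2.0, 3.5, 4.1, 6.8, 9.3, 15.2, 31.0]  # hypothetical, right-skewed
print(box_cox(cycle_times, 0))    # natural logarithm
print(box_cox(cycle_times, 0.5))  # square-root type transformation
```

In practice, Lambda is usually chosen by software so that the transformed data is as close to normal as possible.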

The figures below illustrate an example of this concept. Figure 5 shows a set of cycle-time data; Figure 6 shows the same data transformed with the natural logarithm.

Figure 5: Cycle Time Data

Figure 6: Log Cycle Time Data

Take note: None of the transformation methods provide a guarantee of a normal distribution. Always check with a probability plot to determine whether normal distribution can be assumed after transformation.

Reason 6: Data Follows a Different Distribution

There are many data types that follow a non-normal distribution by nature. Examples include:

  • Weibull distribution, found with life data such as survival times of a product
  • Log-normal distribution, found with length data such as heights
  • Largest-extreme-value distribution, found with data such as the longest down-time each day
  • Exponential distribution, found with waiting-time data such as the time between failures
  • Poisson distribution, found with rare events such as number of accidents
  • Binomial distribution, found with “proportion” data such as percent defectives

If data follows one of these different distributions, it must be dealt with using the same tools as with data that cannot be “made” normal.

No Normality Required

Some statistical tools do not require normally distributed data. To help practitioners understand when and how these tools can be used, the table below shows a comparison of tools that do not require normal distribution with their normal-distribution equivalents.

Comparison of Statistical Analysis Tools for Normally and Non-Normally Distributed Data

Tools for Normally Distributed Data | Equivalent Tools for Non-Normally Distributed Data | Distribution Required
----------------------------------- | -------------------------------------------------- | ---------------------
T-test | Mann-Whitney test; Mood’s median test; Kruskal-Wallis test | Any
ANOVA | Mood’s median test; Kruskal-Wallis test | Any
Paired t-test | One-sample sign test | Any
F-test; Bartlett’s test | Levene’s test | Any
Individuals control chart | Run chart | Any
Cp/Cpk analysis | Cp/Cpk analysis | Weibull; log-normal; largest extreme value; Poisson; exponential; binomial
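
As one example from the table, the sign test used in place of a paired t-test can be sketched in a few lines. The function name and the before/after measurements are hypothetical; the p-value is the exact two-sided binomial probability, with tied pairs dropped.

```python
from math import comb


def sign_test(before, after):
    """Two-sided exact sign test on paired data.

    Returns (n_plus, n_minus, p_value)."""
    # Keep only the pairs that actually differ (ties carry no sign)
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    n_plus = sum(d > 0 for d in diffs)
    n_minus = n - n_plus
    k = min(n_plus, n_minus)
    # Probability of a split at least this uneven if + and - are equally likely
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n_plus, n_minus, min(1.0, 2 * tail)


# Hypothetical cycle times for the same ten items before and after a change
before = [12.1, 15.3, 9.8, 20.4, 11.0, 14.2, 18.7, 13.5, 16.1, 10.9]
after = [10.2, 14.1, 9.9, 17.8, 10.1, 12.6, 16.0, 12.9, 14.4, 10.0]

n_plus, n_minus, p_value = sign_test(before, after)
print(f"increases: {n_plus}, decreases: {n_minus}, p = {p_value:.4f}")
```

The test looks only at the direction of each paired difference, which is why it needs no assumption about the shape of the distribution.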

Comments

Bravo Al-Hamadani 23-09-2012, 09:32

I have a data set in which some variables (like age) are normally distributed and others (like height) are not. My question is about comparing these variables across a categorical variable (gender, for example).
Can I use an independent-samples t-test for age and a non-parametric alternative to the t-test for height? (Just for the record: this is for the sake of a publication, and I do not know whether these two tests can be used in one study.)
Thanks

-Bravo

MikeMac 02-05-2013, 15:21

I have an abnormal dataset I’m currently working with that seems to be different from the cases above. Analytical chemists like myself often find themselves testing a new method of analysis against an old one. In this particular case we have several different types of sample that have been analysed by both methods; in fact, it’s more of a spectrum of sample types. When testing the new method against the old, some sample types seem to be affected more by the new method than others. This results in hints of a multimodal distribution, but because of this spectrum of sample types and the relatively good agreement between methods, it has a normalish (haha) look to it.

An Anderson-Darling test (.05) confirms that it is not normal, and because the paired t-test would have been my natural choice had the distribution been normal, I’m a little lost as to what might be my next best option. At this point I’m looking at presenting my data in simple method 1 vs method 2 plot with a linear regression. y=x being perfect agreement, a slope and fit being close to 1 as being a good indication of good agreement between methods. Do you think this would be sufficient? Any other suggestions for me?

Marc Pilgaard 11-05-2013, 07:57

Hi there, me and my study group found this blog entry very helpful for our research and it gave us a lot of guidance on where to look for further information.

Andrea Moreno 26-07-2013, 05:33

Nice job! and easy to understand! just what I needed. Thanks!

Liza 24-10-2013, 18:37

Hey Arne, thanks for a great summary! One question – what is understood by “survival times of a product” (Weibull distribution) – can you give an example of a real industry case? THANK YOU! I am your fan!

oluwafemi 15-11-2013, 05:52

Please, I need a data set that is not normally distributed. I am having a problem getting one for my project.

SF Lau 23-12-2013, 18:11

Good knowledge sharing. Thanks.

Melissa 24-02-2014, 08:01

This article is just what I need to know. I’m currently working with kernel density estimation and some of my data are not normally distributed.

soujanya 13-03-2014, 23:02

Very very useful.
The information provided is apt.

Gustav 10-07-2014, 13:04

“Reason 6: Data Follows a Different Distribution: Log-normal distribution, found with length data such as heights”…

What kind of height is this for? That is, the height of a group of people or the height of a manufactured industrial product (such as a pharmaceutical tablet)?

regards
