
A Primer on Non-normal Data

The distribution of data can be categorized in two ways: normal and non-normal. If data is normally distributed, it can be expected to follow a pattern in which values cluster around a central value with no bias to the left or right (Figure 1). Non-normal data, on the other hand, does not tend toward a central value. It can be skewed left or right, or follow no particular pattern.

Figure 1: Normally Distributed Data

Non-normal data sounds more dire than it often is. The distribution becomes an issue only when practitioners reach a point in a project where they want to use a statistical tool that requires normally distributed data and they do not have it.

Non-normality is the result of either:

  1. Data that contains “pollution,” such as outliers, the overlap of two or more processes (Figure 2) or inaccurate measurements.
  2. Data that follows an alternative distribution, such as cycle time data, which has a natural limit of zero (Figure 3).

Figure 2: Website Load Time Data

Figure 3: Cycle Time Data

To move forward with analysis, the cause of the non-normality should be identified and addressed. For example, in the case of the website load time data in Figure 2, once the data was stratified by weekends versus working days, the result was two sets of normally distributed data (Figure 4). Each set of data can then be analyzed using statistical tools for normal data.
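
As a rough illustration of this stratify-then-test approach, the sketch below simulates hypothetical load-time data for working days and weekends and runs a Shapiro-Wilk normality test on the combined and stratified sets (the values, sample sizes and use of SciPy are illustrative assumptions, not the article's actual dataset):

```python
# Sketch: stratify load-time data by day type, then test each subset
# for normality. Values are simulated stand-ins for the article's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
weekday = rng.normal(loc=2.0, scale=0.3, size=200)  # load times (s), working days
weekend = rng.normal(loc=3.5, scale=0.4, size=80)   # load times (s), weekends
combined = np.concatenate([weekday, weekend])

for label, sample in [("combined", combined), ("weekday", weekday), ("weekend", weekend)]:
    stat, p = stats.shapiro(sample)
    print(f"{label:>8}: Shapiro-Wilk p = {p:.4f}")
# The blended (bimodal) data typically fails the normality test,
# while each stratified subset typically passes.
```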

Figure 4: Website Load Time Data After Stratification

If the data follows an alternative distribution (see the table below for common non-normal distribution types), transforming the data allows practitioners to still take advantage of the statistical analysis options available for normal data. The best transformation method depends on the particular situation, and it is unfortunately not always clear which method will work best. A common technique is the Box-Cox transformation.
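
A minimal sketch of a Box-Cox transformation, using simulated, strictly positive cycle-time data and SciPy (the parameters are illustrative assumptions):

```python
# Sketch: Box-Cox transformation of right-skewed, strictly positive data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
cycle_time = rng.lognormal(mean=1.0, sigma=0.8, size=300)  # simulated cycle times

transformed, lam = stats.boxcox(cycle_time)  # lambda chosen by maximum likelihood
print(f"estimated lambda: {lam:.3f}")
print(f"normality p-value before: {stats.shapiro(cycle_time)[1]:.4f}")
print(f"normality p-value after:  {stats.shapiro(transformed)[1]:.4f}")
# Conclusions drawn on the transformed scale must still be interpreted
# back on the original scale.
```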

Common Non-normal Distribution Types

Distribution | Type       | Data Examples
Lognormal    | Continuous | Cycle or lead time data
Weibull      | Continuous | Mean time to failure, time to repair and material strength
Exponential  | Continuous | Constant failure rate conditions of products
Poisson      | Discrete   | Number of events in a specific time period (defect counts per interval, such as arrivals, failures or defects)
Binomial     | Discrete   | Proportion or number of defectives

Another option is to use tools that do not require normally distributed data. Testing for statistical significance can be done with nonparametric tests such as the Mann-Whitney test, Mood’s median test and the Kruskal-Wallis test.
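
A minimal sketch of these three nonparametric tests applied to two skewed samples (the data are simulated and the use of SciPy is an assumption; the original article shows no code):

```python
# Sketch: nonparametric two-sample comparisons on skewed data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
before = rng.exponential(scale=5.0, size=60)  # e.g., cycle times before a change
after = rng.exponential(scale=3.5, size=60)   # e.g., cycle times after a change

u_stat, u_p = stats.mannwhitneyu(before, after, alternative="two-sided")
m_stat, m_p, grand_median, table = stats.median_test(before, after)  # Mood's median test
k_stat, k_p = stats.kruskal(before, after)

print(f"Mann-Whitney p   = {u_p:.4f}")
print(f"Mood's median p  = {m_p:.4f}")
print(f"Kruskal-Wallis p = {k_p:.4f}")
```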

To learn more about non-normal data and related topics, see the articles and discussions on iSixSigma.com.

Non-normal data is a typical subject in Green Belt training. To learn more about non-normal data and hypothesis testing, purchase the Six Sigma Green Belt Training Course available at the iSixSigma Marketplace.




Comments

Manoj Kumar Sharma 23-12-2013, 05:06

Very good article; easy description and good content.

Dr. Mikel Harry 23-12-2013, 06:43

This is a great short-and-sweet primer that nicely frames the issue of how to handle non-normal data. In this context, the article would fit nicely as a “teaser” for a larger introduction.

To the credit of the author, this article focuses on several carefully selected examples that are “clean and simple.” The examples are textbook-like problems and solutions, both of which are devoid of the ambiguous circumstances that usually accompany reality. Of course, this is where most such problems exist — in the grey zone of reality. When a typical practitioner parachutes into this zone, the advice is simple — get a subject-matter expert to assist with the problem.

Achieving the ultimate aims of this article can be quite difficult, even for the experienced practitioner. For example, one would likely want to consider the “robustness” of the statistic being used to analyze the non-normal data. This means giving full analytical consideration to the Type I error stability, as well as that of Type II errors, not to mention the mitigating effects related to degrees-of-freedom and delta sigma. If the statistic of choice proves to be reasonably robust, there is no need to transform the data. On the other hand, if the statistic of choice is not robust, then what? Herein lies the “grey zone” of the “circumstantially complex.”
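
As a rough sketch of the kind of robustness check described here, one could simulate the Type I error rate of a chosen statistic on non-normal data; the distribution, sample size and alpha below are illustrative assumptions:

```python
# Sketch: empirical Type I error of a two-sample t-test when both samples
# come from the same skewed (exponential) population, i.e., no true difference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
alpha, n, trials = 0.05, 30, 5000
false_positives = 0
for _ in range(trials):
    a = rng.exponential(scale=2.0, size=n)
    b = rng.exponential(scale=2.0, size=n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(f"observed Type I error rate: {false_positives / trials:.3f} (nominal {alpha})")
# If the observed rate stays close to the nominal alpha, the statistic is
# reasonably robust for this situation and transformation may be unnecessary.
```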

Far too often, practitioners attempt to transform non-normal data into a state of normality prior to conducting some type or form of analysis that is theoretically dependent on an underlying distribution that is normal. However, there are other types of distributions that can be used as the target of transformation, like the uniform or triangular distribution, just to mention a couple.

To illustrate, consider Figure 2 in the article. This graphic clearly displays the case in which the data are actually associated with two different categories but, for whatever reason, have been inappropriately combined (i.e., blended into a single distribution that appears to be non-normal). For the case provided in this article, the solution is quite simple and straightforward — just sort the data on the categorical variable and then analyze each distribution separately. However, many times such bimodal-looking distributions are entirely natural, like the family of extreme value distributions (that are known to naturally exist). The proper use of extreme value distributions is an entirely different matter, yet related to the topic at hand.

In summary, it’s one thing to form a panoramic view of the mountain tops, but quite another to farm the fertile soil at the base of that mountain, so to speak. The panoramic view is for sightseers, but the growing of produce is the job of an experienced farmer. Just because you might be able to use a set of binoculars does not mean you can drive a tractor.

Jeremiah Lewis 23-12-2013, 08:43

The article failed to mention the consequences of using methods such as those of Mann-Whitney and Kruskal-Wallis. Generally, nonparametric methods rely on assumptions that are just as unlikely and come with a loss of power.

As Dr. Harry has suggested it can be very complex. However, I would add that outliers can be very valuable when we consider what has pushed them to the extremities of our data set.

Dr A Burns 23-12-2013, 15:02

What rubbish. Read “Normality and the Process Behavior Chart” – Wheeler.

Chris Seider 23-12-2013, 17:08

I’m sorry, but the statement that “data contains pollution” is a reason for non-normality is absurd. Is it absurd to have a poor MSA, two overlapping distributions, or outliers with special causes?

Solve the causes for the various peaks and you’ll improve the process. Don’t bother using Box-Cox. A customer won’t say, “Hey, I’ll just Box-Cox your incoming process results and I’ll feel better.” You still have the same % defective in the process with a proper transformation.

I agree with Dr. MH’s advice not to transform just for the sake of transformation.

Holly 03-01-2014, 22:52

Nice!

Kicab 24-12-2013, 05:43

While the two reasons (although I would have preferred a word other than “pollution”) for non-normality are accurate, few people realize that the same two reasons occur FOR normality.

Did you know that a distribution may appear normal because it is a combination of two overlapping distributions? This occurs, for example, when two normal distributions with approximately the same standard deviation have means that are about one standard deviation different. If you have a small enough sample size, you may not recognize that there are actually two distributions.
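
A small simulation sketch of this point (the means, standard deviation and sample sizes below are illustrative assumptions):

```python
# Sketch: a 50/50 mixture of two normals whose means differ by about one
# standard deviation can easily pass a normality test at modest sample sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
a = rng.normal(loc=10.0, scale=2.0, size=50)
b = rng.normal(loc=12.0, scale=2.0, size=50)  # mean shifted by ~1 SD
mixture = np.concatenate([a, b])

stat, p = stats.shapiro(mixture)
print(f"Shapiro-Wilk p for the mixture: {p:.4f}")
# p is usually well above 0.05 here, so the two underlying
# processes go undetected.
```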

The author raises a critical issue: “The distribution becomes an issue only when practitioners reach a point in a project where they want to use a statistical tool that requires normally distributed data and they do not have it.” This is backward.

Why are practitioners wanting to use a particular tool rather than wanting to answer a specific question? Then, determine the best tool that would provide the answer. Unfortunately, too many practitioners in my experience are tool-oriented rather than purpose-, deliverable- and question-oriented.

I would suggest not transforming to get a normal distribution.

The problem with nonlinear transformations (e.g., Box-Cox) is that the characteristics of the distribution change. So, what is true of the transformed data may not be true of the original. And, supposedly, it is the original data that you are interested in describing or analyzing.
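
A quick sketch of this caveat, using a log transformation for simplicity (the data are simulated; the point carries over to Box-Cox and other nonlinear transformations):

```python
# Sketch: the mean of log-transformed data, back-transformed, is the
# geometric mean, which differs from the arithmetic mean of the original data.
import numpy as np

rng = np.random.default_rng(seed=6)
x = rng.lognormal(mean=1.0, sigma=0.9, size=10_000)

print(f"mean of original data:           {x.mean():.2f}")
print(f"back-transformed mean of log(x): {np.exp(np.log(x).mean()):.2f}")
# Conclusions reached on the transformed scale do not automatically
# carry over to the original scale.
```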

Darth 25-12-2013, 09:15

I don’t totally agree with your opening comments that non-normal data doesn’t tend toward a central value. Of course it does. You can calculate means, medians and modes of non-normal continuous data, which are basic descriptors of central tendency. But, depending on the degree of non-normality, the central tendency will not be symmetrically located, which is, I believe, the point you were trying to make.

As has been pointed out, while this is just an overly simplistic article, there are some statements that aren’t quite spot on. First, we cannot prove that data is normal; we can only state that it is statistically different or not different from normal. That is the value of the p-value and testing for normality. A small point of interpretation, but nevertheless important. I also agree that the first step is NOT to transform but first to understand the distribution and why it might be non-normal, or use the less powerful but useful nonparametrics. Transformation should be the last step since it involves an output that is difficult to understand due to the change in the values of the original data.

As Dr. Burns suggests, a quick read of the Wheeler booklet on the subject explains his position and, by association, that of Shewhart regarding normality as it applies to control charts. But normality is also a consideration in doing process capability or certain hypothesis tests. The good news is, as Dr. Harry points out, that tests which assume normality are often robust to departures from normal.

Finally, although it is probably way outside your intent in this primer, you tease us with some discrete distributions yet fail to mention the normal approximation to the Binomial/Poisson and the fact that large counts may also be treated as continuous data with certain caveats.
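
For what it is worth, a tiny sketch of the normal approximation mentioned here (the defect rate and sample size are illustrative assumptions):

```python
# Sketch: normal approximation to the binomial for large counts,
# with a continuity correction.
from scipy import stats

n, p = 500, 0.1                       # e.g., 500 units with a 10% defective rate
exact = stats.binom.cdf(60, n, p)     # P(X <= 60 defectives), exact
approx = stats.norm.cdf(60.5, loc=n * p, scale=(n * p * (1 - p)) ** 0.5)
print(f"exact binomial:       {exact:.4f}")
print(f"normal approximation: {approx:.4f}")
```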

Thanks for taking a shot at a tough and oft misunderstood topic.

Kicab 27-12-2013, 05:12

@Darth. Good points about measures of central tendency for normal and nonnormal distributions, robustness of the normality assumptions for some tests, and the normal approximations to various discrete distributions.

To elaborate on two of your comments (to make sure they are more “spot on”):

- “First, we cannot prove that data is normal.” True, simply because actual data IS never normal: the normal distribution contains an infinite number of values and actual datasets are finite. Testing for normality is not testing whether the data are normally distributed but whether they represent a sample from a (theoretical or hypothetical) normal distribution. That is a key distinction. Another way to look at distributional tests is to recognize that distributions (normal and the others listed) are models. So, we test whether the data have sufficiently similar characteristics to the model to use the model to make inferences.

- “Use the less powerful but useful nonparametrics.” I have always wondered why this misleading (false, in many cases) statement got into Six Sigma. Nonparametric tests are more powerful than parametric tests for many distributions and many (actually infinite) situations. Proofs can be found in peer-reviewed journals (e.g., Journal of the American Statistical Association) going back more than half a century.

Darth 27-12-2013, 11:11

Kicab,
I agree on the first of your further comments. I try to emphasize the Null and Alternate hypotheses to state that “the data is not different from normal” or “data is different from normal” rather than “data is normal” or “data is not normal”.

As to your second follow-up comment: nonparametrics are “less efficient,” or have less power than the parametric test, for the same sample size, if the data actually adhere to the assumed parametric distribution. Are you in agreement with this statement?

Kicab 30-12-2013, 07:16

@Darth. Not completely.
First, a parametric test may not be testing the same thing as a nonparametric test (e.g., means vs. medians). So, the means may be different while the medians are not, or vice versa. Thus, if one test shows a difference in means and the other doesn’t show a difference in medians, that does not prove that the former is more powerful.

Second, the factors that determine sample size are alpha, beta, delta (the difference to be detected) and sigma (the population standard deviation). If the first three are determined and the same for both types of tests, then the only difference is the standard deviation (SD). For the statement to be true, the SD for parametric tests must be smaller than for nonparametric tests. This is not always the case.

Third, does the statement claim that parametric tests are always more powerful than nonparametric tests (given the conditions you stated)? If not, then why make the general claim without specifying the conditions when it’s true and when it’s false?

Fourth, the published articles show that for some distributions nonparametric tests have greater efficiency (power) than parametric tests.
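
A rough power simulation sketch of this last point, comparing the two-sample t-test with the Mann-Whitney test on heavy-tailed (Laplace) data; the distribution, shift and sample sizes are illustrative assumptions:

```python
# Sketch: empirical power of the t-test vs. the Mann-Whitney test
# when the data are heavy-tailed and there is a real location shift.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
alpha, n, shift, trials = 0.05, 30, 1.0, 3000
t_hits = mw_hits = 0
for _ in range(trials):
    a = rng.laplace(loc=0.0, scale=1.0, size=n)
    b = rng.laplace(loc=shift, scale=1.0, size=n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        t_hits += 1
    if stats.mannwhitneyu(a, b, alternative="two-sided").pvalue < alpha:
        mw_hits += 1

print(f"t-test power:       {t_hits / trials:.3f}")
print(f"Mann-Whitney power: {mw_hits / trials:.3f}")
# For heavy-tailed data like this, the Mann-Whitney test often matches or
# exceeds the power of the t-test, consistent with the results cited above.
```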

Darth 30-12-2013, 08:06

Kicab,
I guess we won’t be resolving this issue this year. So, have a Happy Healthy New Year and we will speak again in a year.

Holly 03-01-2014, 22:49

Super primer!

salsaguy 23-01-2014, 11:12

Minitab’s Quality Trainer is a great resource to learn more about this topic and other statistical topics that are difficult to understand and interpret. It has great examples and walk-through lessons that help you work through an exercise yourself after you have learned the theory.

