Normally distributed data is a commonly misunderstood concept in Six Sigma. Some people believe that all data collected and used for analysis must be distributed normally. But normal distribution does not happen as often as people think, and it is not a main objective. Normal distribution is a means to an end, not the end itself.
Normally distributed data is needed to use a number of statistical tools, such as individuals control charts, C_{p}/C_{pk} analysis, t-tests and the analysis of variance (ANOVA). If a practitioner is not using such a specific tool, however, it is not important whether data is distributed normally. The distribution becomes an issue only when practitioners reach a point in a project where they want to use a statistical tool that requires normally distributed data and they do not have it.
The probability plot in Figure 1 is an example of this type of scenario. In this case, normality clearly cannot be assumed; the p-value is less than 0.05 and more than 5 percent of the data points are outside the 95 percent confidence interval.
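For readers without a statistics package at hand, the idea behind a probability plot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two samples are simulated, the median-rank plotting positions are one common choice among several, and the plot correlation is a rough stand-in that does not reproduce the p-value or the confidence bands shown in Figure 1.

```python
import random
from statistics import NormalDist, mean

def probability_plot_points(data):
    """Return (theoretical normal quantile, observed value) pairs."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # Median-rank plotting positions (i - 0.3) / (n + 0.4); an assumed, common choice.
    return [(std_normal.inv_cdf((i - 0.3) / (n + 0.4)), x)
            for i, x in enumerate(ordered, start=1)]

def plot_correlation(points):
    """Pearson correlation between theoretical quantiles and observed values.
    Values close to 1 mean the points fall close to a straight line."""
    qs = [q for q, _ in points]
    xs = [x for _, x in points]
    mq, mx = mean(qs), mean(xs)
    cov = sum((q - mq) * (x - mx) for q, x in points)
    var_q = sum((q - mq) ** 2 for q in qs)
    var_x = sum((x - mx) ** 2 for x in xs)
    return cov / (var_q * var_x) ** 0.5

rng = random.Random(1)
normal_sample = [rng.gauss(50, 5) for _ in range(200)]          # simulated, normal
skewed_sample = [rng.expovariate(1 / 50) for _ in range(200)]   # simulated, right-skewed

r_normal = plot_correlation(probability_plot_points(normal_sample))
r_skewed = plot_correlation(probability_plot_points(skewed_sample))
print(f"normal sample: r = {r_normal:.3f}")
print(f"skewed sample: r = {r_skewed:.3f}")
```

The skewed sample bends away from the straight line, so its correlation is noticeably lower than that of the normal sample.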
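What can be done? Basically, there are two options: identify and, if possible, address the reasons for non-normality, or use tools that do not require normality.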
When data is not normally distributed, the cause for non-normality should be determined and appropriate remedial actions should be taken. There are six reasons that are frequently to blame for non-normality.
Too many extreme values in a data set will result in a skewed distribution. Normality of data can be achieved by cleaning the data. This involves determining measurement errors, data-entry errors and outliers, and removing them from the data for valid reasons.
It is important that outliers are identified as truly special causes before they are eliminated. Never forget: The nature of normally distributed data is that a small percentage of extreme values can be expected; not every outlier is caused by a special reason. Extreme values should only be explained and removed from the data if there are more of them than expected under normal conditions.
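How many extreme values are "expected under normal conditions" can be estimated directly from the normal distribution. The sketch below uses three standard deviations as the cutoff; that threshold is an assumption for illustration, not a rule from this article.

```python
from statistics import NormalDist

def expected_beyond(n, k=3.0):
    """Expected number of points farther than k standard deviations
    from the mean in n draws from a normal distribution."""
    one_tail = 1.0 - NormalDist().cdf(k)
    return n * 2.0 * one_tail

for n in (100, 1000, 10000):
    print(f"n = {n:>5}: about {expected_beyond(n):.1f} points beyond 3 sigma")
```

For 1,000 normally distributed points, roughly 2.7 are expected beyond three standard deviations, so two or three such values are not in themselves evidence of a special cause.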
Data may not be normally distributed because it actually comes from more than one process, operator or shift, or from a process that frequently shifts. If two or more data sets that would be normally distributed on their own are overlapped, data may look bimodal or multimodal – it will have two or more most-frequent values.
The remedial action for these situations is to determine which X’s cause bimodal or multimodal distribution and then stratify the data. The data should be checked again for normality and afterward the stratified processes can be worked with separately.
An example: The histogram in Figure 2 shows a website’s non-normally distributed load times. After stratifying the load times by weekend versus working day data (Figure 3), both groups are normally distributed.
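Stratification of this kind can be sketched as follows. The load times, group means and spreads are simulated and purely hypothetical; they are not the data behind Figures 2 and 3. Skewness is used only as a rough indicator, and a probability plot per stratum remains the proper check.

```python
import random
from statistics import mean

def skewness(values):
    """Sample skewness: near 0 for symmetric data, positive for a long right tail."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(2)
# Hypothetical load times in seconds, each tagged with the suspected X (day type).
records = [("working day", rng.gauss(2.0, 0.3)) for _ in range(1000)]
records += [("weekend", rng.gauss(4.0, 0.4)) for _ in range(400)]

# Stratify: split the pooled data by day type.
strata = {}
for day_type, load_time in records:
    strata.setdefault(day_type, []).append(load_time)

pooled_skew = skewness([t for _, t in records])
strata_skew = {name: skewness(times) for name, times in strata.items()}

print(f"pooled data: skewness = {pooled_skew:.2f}")
for name, value in strata_skew.items():
    print(f"{name}: skewness = {value:.2f}")
```

The pooled data is clearly skewed because two processes overlap, while each stratum on its own is close to symmetric.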
Round-off errors or measurement devices with poor resolution can make truly continuous and normally distributed data look discrete and not normal. Insufficient data discrimination – and therefore an insufficient number of different values – can be overcome by using more accurate measurement systems or by collecting more data.
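The effect of poor resolution is easy to demonstrate. In the sketch below, the process values and both gauge resolutions are made up for illustration; the same underlying normal data is "measured" once coarsely and once finely.

```python
import random

def distinct_readings(values, resolution):
    """Count the different readings a gauge with the given resolution would report."""
    return len({round(v / resolution) for v in values})

rng = random.Random(3)
# Simulated, truly continuous and normal process values (hypothetical units).
true_values = [rng.gauss(10.0, 0.05) for _ in range(500)]

coarse = distinct_readings(true_values, 0.1)    # gauge reads in steps of 0.1
fine = distinct_readings(true_values, 0.001)    # gauge reads in steps of 0.001
print(f"resolution 0.1:   {coarse} distinct values")
print(f"resolution 0.001: {fine} distinct values")
```

With only a handful of distinct values, the coarse readings look discrete and will tend to fail a normality check even though the underlying process is normal.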
Collected data might not be normally distributed if it represents simply a subset of the total output a process produced. This can happen if data is collected and analyzed after sorting. The data in Figure 4 resulted from a process where the target was to produce bottles with a volume of 100 ml. The lower and upper specifications were 97.5 ml and 102.5 ml. Because all bottles outside of the specifications were already removed from the process, the data is not normally distributed – even if the original data would have been.
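The bottle example can be imitated with simulated data. The target and specification limits below come from the example; the process standard deviation of 1.5 ml is an assumption chosen only to make the effect visible.

```python
import random
from statistics import NormalDist, mean

LSL, USL = 97.5, 102.5  # specification limits from the example, in ml

def excess_kurtosis(values):
    """About 0 for normal data; clearly negative when the tails are missing."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(4)
# Assumed process: normal around the 100 ml target; sigma = 1.5 ml is made up.
produced = [rng.gauss(100.0, 1.5) for _ in range(10000)]
# Sorting step: everything outside the specifications is removed.
shipped = [v for v in produced if LSL <= v <= USL]

process = NormalDist(100.0, 1.5)
expected_kept = process.cdf(USL) - process.cdf(LSL)

print(f"share kept after sorting: {len(shipped) / len(produced):.3f} "
      f"(theory: {expected_kept:.3f})")
print(f"excess kurtosis before sorting: {excess_kurtosis(produced):.2f}")
print(f"excess kurtosis after sorting:  {excess_kurtosis(shipped):.2f}")
```

The sorted data has its tails cut off at the specification limits, which shows up as a clearly negative excess kurtosis even though the full output was normal.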
If a process has many values close to zero or a natural limit, the data distribution will skew to the right or left. In this case, a transformation, such as the Box-Cox power transformation, may help make data normal. In this method, all data is raised, or transformed, to a certain exponent, indicated by a Lambda value. When comparing transformed data, everything under comparison must be transformed in the same way.
The figures below illustrate an example of this concept. Figure 5 shows a set of cycle-time data; Figure 6 shows the same data transformed with the natural logarithm.
Take note: None of the transformation methods provide a guarantee of a normal distribution. Always check with a probability plot to determine whether normal distribution can be assumed after transformation.
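A minimal sketch of the Box-Cox transformation follows. The cycle times are simulated from a lognormal distribution (an assumption, not the data of Figures 5 and 6), and skewness is used only as a quick indicator. Statistical packages normally estimate Lambda by maximum likelihood; here three candidate values are simply compared. In the usual scaled form, Lambda = 0 corresponds to the natural logarithm.

```python
import math
import random
from statistics import mean

def box_cox(values, lam):
    """Box-Cox power transformation; requires strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(v) for v in values]          # limiting case: natural log
    return [(v ** lam - 1.0) / lam for v in values]   # scaled power transformation

def skewness(values):
    """Sample skewness: near 0 for symmetric data."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(5)
# Simulated right-skewed cycle times bounded below by zero (hypothetical).
cycle_times = [rng.lognormvariate(1.0, 0.6) for _ in range(1000)]

skew_by_lambda = {lam: skewness(box_cox(cycle_times, lam))
                  for lam in (1.0, 0.5, 0.0)}
for lam, s in skew_by_lambda.items():
    print(f"Lambda = {lam}: skewness = {s:.2f}")
```

Lambda = 1 leaves the shape of the data unchanged, Lambda = 0.5 (square root) reduces the skew, and Lambda = 0 (natural log) removes most of it for this simulated data. Real data may behave differently, which is why the probability plot check after transformation matters.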
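There are many data types that follow a non-normal distribution by nature. Examples include data that follow a Weibull, lognormal, largest-extreme-value, Poisson, exponential or binomial distribution.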
If data follows one of these different distributions, it must be dealt with using the same tools as with data that cannot be “made” normal.
Some statistical tools do not require normally distributed data. To help practitioners understand when and how these tools can be used, the table below shows a comparison of tools that do not require normal distribution with their normal-distribution equivalents.
Comparison of Statistical Analysis Tools for Normally and Non-Normally Distributed Data

| Tools for Normally Distributed Data | Equivalent Tools for Non-Normally Distributed Data | Distribution Required |
|---|---|---|
| T-test | Mann-Whitney test; Mood’s median test; Kruskal-Wallis test | Any |
| ANOVA | Mood’s median test; Kruskal-Wallis test | Any |
| Paired t-test | One-sample sign test | Any |
| F-test; Bartlett’s test | Levene’s test | Any |
| Individuals control chart | Run chart | Any |
| C_{p}/C_{pk} analysis | C_{p}/C_{pk} analysis | Weibull; lognormal; largest extreme value; Poisson; exponential; binomial |
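As an illustration of one row of the table, the one-sample sign test can stand in for the paired t-test by looking only at the signs of the paired differences. The sketch below is a plain exact two-sided version; the before and after measurements are invented for the example.

```python
from math import comb

def sign_test(differences):
    """Exact two-sided sign test of H0: the median difference is 0.
    Zero differences (ties) are dropped, as is conventional."""
    pos = sum(1 for d in differences if d > 0)
    neg = sum(1 for d in differences if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # Under H0 the number of positive signs follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical paired measurements (e.g. cycle time before and after a change).
before = [12.1, 15.3, 9.8, 20.4, 11.0, 30.2, 14.7, 13.5, 18.9, 25.0]
after = [10.9, 13.8, 9.9, 17.1, 10.2, 22.5, 13.9, 12.0, 16.4, 19.8]

differences = [b - a for b, a in zip(before, after)]
p_value = sign_test(differences)
print(f"p-value = {p_value:.4f}")  # 9 positive and 1 negative sign: 22/1024, about 0.0215
```

Because only the signs are used, the test makes no assumption about the shape of the distribution of the differences, at the price of lower power than a paired t-test when the data really is normal.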


Comments
I have a data set in which some variables (like age) are normally distributed and others (like height) are not. The question is about comparing these two variables across a categorical variable (gender, for example):
Can I use an independent-samples t-test for age and a non-parametric test for height? (Just for the record: this is for a publication, and I do not know whether these two tests can be used in one study.)
Thanks
Bravo
@Bravo: Yes, you can use both tests and most likely you should. Just be aware that their interpretation is different. Make sure you understand each test’s preconditions and how they apply to your data. On the other hand, if the departure from normality is not severe and your sample is reasonably large (100+), you can safely use t-tests with negligible side effects.
I have an abnormal dataset I’m currently working with that seems to be different from the cases above. Analytical chemists like myself often find themselves testing a new method of analysis against an old one. In this particular case we have several different types of sample that have been analysed by both methods; in fact it’s more of a spectrum of sample types. When testing the new method against the old, some sample types seem to be affected more by the new method than others. This results in hints of a multimodal distribution, but because of this spectrum of sample types and the relatively good agreement between methods, it has a normal-ish (haha) look to it.
An Anderson-Darling test (.05) confirms that it is not normal, and because the paired t-test would have been my natural choice had the distribution been normal, I’m a little lost as to what might be my next best option. At this point I’m looking at presenting my data in a simple method 1 vs. method 2 plot with a linear regression: y = x being perfect agreement, and a slope and fit close to 1 being a good indication of agreement between methods. Do you think this would be sufficient? Any other suggestions for me?
@MikeMac: Method comparison is a complicated issue and you shouldn’t base decisions on a single t-test. You need information from various tests, including the difference in means (t-test or Wilcoxon test) and correlation/concordance/agreement. I would suggest performing a Bland-Altman analysis, which is easy enough. If your data is not normal, you may try some transformations, but inherently non-parametric methods are also available in some software.
Hi there, me and my study group found this blog entry very helpful for our research and it gave us a lot of guidance on where to look for further information.
Nice job! and easy to understand! just what I needed. Thanks!
Hey Arne, thanks for a great summary! One question – what is understood by “survival times of a product” (Weibull distribution) – can you give an example of a real industry case? THANK YOU! I am your fan!
Please, I need a data set that is not normally distributed. I am having a problem getting one for my project.
Good knowledge sharing. Thanks.
This article is just what I need to know. I’m currently working with kernel density estimation and some of my data are not normally distributed.
Very very useful.
The information provided is apt.
“Reason 6: Data Follows a Different Distribution: Lognormal distribution, found with length data such as heights”…
What kind of height is this for? I.e., the height of a group of people or the height of a manufactured industrial product (a pharmaceutical tablet)?
regards
Extremely useful information on Non-Normality and Non-Parametric tests.
Hi, I read this topic and it is very helpful. But I don’t understand why we can use a t-test when the distribution is non-normal. Is this because of the size of the sample, for example 534? Thanks in advance.
Thanks for some guidance. I have 35 categories with two data points each. The differences between the two data points for each of the categories range from 0 to 20. I constructed a histogram and found that after removing outliers I have a distribution that is skewed positively. Is it appropriate to apply the 68-95-99.7 rule of thumb to determine normality? My understanding is that normal distributions meet these criteria. Is the converse also true, i.e. can any distribution which meets these criteria be considered normal?
Hi Arne, really nice article. It cleared up a lot of things. However the link to stratification does not go into details and I couldn’t find anything on google either. Can you direct me to some literature on stratification please?
I have a set of non-normal nutrient data vs. time. The graph produced is very scattered, so I tried log-transforming and square-root-transforming it. But after the transformation, the graph still looks the same. Any reason?
When I removed the outliers, the data changed from a non-normal to a normal distribution. Is it OK to remove those outliers?
This is really very informative and simple to understand.
Thank you so much. This is best among articles in this topic
This is simply wrong and has led to many six sigma practitioners doing bad stats.
A lot of common tests like the t-test and ANOVA work on the SAMPLING DISTRIBUTION, not the distribution of your sample. No data in real life is normal, and a lot of it is highly skewed, but that does NOT mean its sampling distribution is not normal. Many people do not know the difference, so here’s a link to an animated explanation:
http://onlinestatbook.com/stat_sim/sampling_dist/
Thanks to the central limit theorem, the sampling distribution for ANY population distribution will approach normality given a “big enough sample size”; per the animated explanation above, N = 10-15 already seems to approach normality. It is this normal sampling distribution that the t-test and ANOVA work off; they do NOT work on the distribution of your sample.
I have to explain this a lot to people, sometimes very seasoned six sigma black belts. I hope this will raise the statistical knowledge bar for everyone else.
I have non-normal data, namely the academic achievement of students in %. This was planned to be regressed on 5-point-scale feedback about their trust and school climate. Can I transform the data and look at the regression output? If the regression is not significant, should I recommend further study? Conclude there is no impact? Suspect a problem with the tools? The tools were actually piloted and found consistent.
How can I know whether data will be normally distributed or not?
Hi!!!
I found this article very interesting and I'm writing this email hoping you could shed some light on an analysis I'm performing regarding GLM.
I am trying to analyze some data about animal behaviour and would need some help or advice regarding which non-parametric test I should use.
The variables I have are:
Response variable: a continuous one (with both positive and negative values)
Explanatory variable: a factor with 6 levels
Random effect variable: as the same animal performing some behavioural task was measured more than once.
As I have a random effect variable, I chose a GLM model. Then, when checking the normality and homoscedasticity assumptions, the Shapiro-Wilk test showed there was no normality and Q-Q plots revealed there weren't patterns or outliers in my data. So the question would be: which non-parametric test would be optimal in this case, knowing that I would like to perform certain a posteriori comparisons (and not all-against-all comparisons)?
My database has lots of zero responses in some conditions. I've read that for Student's t-tests, when the lack of normality is due to lots of zeros, it's OK to turn a blind eye to it (Srivastava, 1958; Sullivan & D’agostino, 1992) … is there something similar with GLM?
Thank you so much in advance for any advice you could provide.
Kind regards,
Yair Barnatan
Ph.D. Student – Physiology and Molecular Biology Department
Faculty of Science
University of Buenos Aires