Normalizing NonNormal Data
 This topic has 9 replies, 10 voices, and was last updated 14 years, 4 months ago by Sean Shanley.


June 25, 2007 at 2:38 pm #47366
I’m hoping there is someone out here who can help me. For some reason the majority of the data that I have been taking from my processes is turning out to be NonNormal. Since I’m really interested in the capability of my process, I can’t use the typical Six Pack as I normally would because of the NonNormal data. I’ve been trying to transform the data, but I’m not sure what transformation to use. My data looks normal (i.e. it is not a Weibull, Log, Exponential, or any other glaringly obvious distribution) but the AD tests are all telling me that it is NonNormal. I’ve also noticed that if I take between 30 and 100 data points my data tends to be Normal, but beyond 100 data points my data begins to turn NonNormal. Is this typical? I thought that more data points would tend to Normalize the data. I’m not sure how much I can trust my capability results with the NonNormal data, but I don’t know what other options I have at the moment. Any help or insights are appreciated.
Peter
June 25, 2007 at 3:22 pm #157888
Hi Peter,
Please go through the link below to convert your nonnormal data into normal.
https://www.isixsigma.com/library/content/c020121a.asp
Though, I would recommend that you do not convert it into normal. There are nonparametric tests available in Minitab that do not assume normality. Use those.
June 25, 2007 at 6:18 pm #157892
There are a couple of options to use.
If you are interested in capability, why not use a DPU, RTY, or DPMO? All can be calculated with nonnormal data sets, thus preserving the original data. I like using DPU then converting to RTY and/or DPMO. It is easy to collect, understand, and present, and it is additive.
You can work from the nonparametric perspective to determine significant differences: DPMO, Mood’s median, Mann-Whitney, etc.
You can still use most inferential tools that use a comparison of means (ANOVA, t-test, DOE, regression, etc.) if your data set is not grossly skewed. They are largely robust to violations of normality and equal variance, but not of independence.
You can simply subgroup your data set (n=5, for example). This will invoke the CLT and move the subgroup averages closer to normal. If data is plentiful, this is an option.
Transform it. It has been a while, but using Box-Cox will provide you with an optimal lambda value (or a convenient one within its 95% CI) that you can use to determine the proper transform function (although I think Minitab will do that for you).
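For readers without Minitab, the Box-Cox step above can be sketched in plain Python. This is a minimal illustration, not Minitab's algorithm: it grid-searches lambda by profile log-likelihood and reports the lambdas whose log-likelihood is within 1.92 of the best (half the 95% chi-square quantile with 1 degree of freedom). The function names, the grid from -2 to 2, and the idealized lognormal sample are all assumptions made for the example.

```python
import math
from statistics import NormalDist


def boxcox_transform(x, lam):
    """Box-Cox transform of strictly positive data for a given lambda."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in x]  # lambda = 0 is the log transform
    return [(v ** lam - 1.0) / lam for v in x]


def boxcox_loglik(x, lam):
    """Profile log-likelihood of lambda, up to an additive constant."""
    n = len(x)
    y = boxcox_transform(x, lam)
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n  # maximum-likelihood variance
    return -0.5 * n * math.log(var_y) + (lam - 1.0) * sum(math.log(v) for v in x)


def boxcox_search(x, lo=-2.0, hi=2.0, steps=81):
    """Grid-search lambda; return (best_lambda, ci_low, ci_high)."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    scored = [(boxcox_loglik(x, lam), lam) for lam in grid]
    best_ll, best_lam = max(scored)
    # Approximate 95% CI: lambdas within chi-square(0.95, 1 df) / 2 = 1.92 of the peak.
    inside = [lam for ll, lam in scored if ll >= best_ll - 1.92]
    return best_lam, min(inside), max(inside)


# Hypothetical demo data: an idealized lognormal sample built from normal quantiles.
n = 200
sample = [math.exp(NormalDist().inv_cdf((i + 0.5) / n)) for i in range(n)]
best_lam, ci_low, ci_high = boxcox_search(sample)
print(f"best lambda = {best_lam:.2f}, approx. 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Because the demo data are lognormal by construction, the search should settle on a lambda at or very near 0, which corresponds to the natural-log transform. In practice a rounded lambda inside the interval (0, 0.5, 1, and so on) is usually easier to explain than the exact optimum.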
Good luck and verify the above info before moving forward.
June 25, 2007 at 7:41 pm #157905
Peter,
I need to be careful saying this, since I’m not a statistician, but is it possible your AD test is just overly sensitive to a large amount of data? No data population is actually normal and it could be that you just have enough data to “prove” the fact. How straight do the data points plot on a normal probability plot? If they look normal, you might just be able to analyze them as normal, no transformation needed. AD is just one test, you have to use visual cues as well.
OTOH, if you have glaring “tails” (these will be clear in an NPP), then those are worth investigating. Could be you have some flyers, or (at the other end) a physical constraint or measurement discrimination issue.
As for the comment “I thought more data points would tend to normalize the data”…this is true when you are plotting sample averages. But this won’t take 100 data points, more like 5 or 10.
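The point about sample averages can be checked with a small simulation. The sketch below assumes an exponential process (chosen only because it is strongly skewed), a fixed random seed, and subgroups of 5; none of these come from Peter's data. It compares the skewness of the individual readings with the skewness of the subgroup means.

```python
import random


def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    n = len(values)
    m = sum(values) / n
    s = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return sum(((v - m) / s) ** 3 for v in values) / n


def subgroup_means(values, size):
    """Average consecutive, non-overlapping subgroups of the given size."""
    usable = len(values) - len(values) % size
    return [sum(values[i:i + size]) / size for i in range(0, usable, size)]


rng = random.Random(2007)  # fixed seed so the run is repeatable
individuals = [rng.expovariate(1.0) for _ in range(50_000)]  # skewed raw data
means_of_5 = subgroup_means(individuals, 5)

print(f"skewness of individuals:     {skewness(individuals):.2f}")
print(f"skewness of subgroup means:  {skewness(means_of_5):.2f}")
```

In theory the skewness of a mean of n independent readings is the raw skewness divided by the square root of n, so subgroups of 5 cut it by a bit more than half rather than removing it. Collecting more individual readings does not change their skewness at all.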
Let me know what you come up with.
Good luck
BC
June 26, 2007 at 5:06 pm #157936
Robert Butler (Participant, @rbutler)
Assuming you’ve done all of the usual things to make sure that the nonnormality is just due to the process and it is reasonable to run computations for process capability, then the post below may be of some value.
https://www.isixsigma.com/forum/showmessage.asp?messageID=82998
June 26, 2007 at 7:32 pm #157938
Jim Shelor (Participant, @JimShelor)
Peter,
It sounds like you are using Minitab.
If so, use the distribution ID function to determine the distribution of your data.
Then run Capability Analysis Nonnormal and select the distribution identified.
Minitab will now run the analysis appropriate for the identified distribution.
Regards,
Jim Shelor
June 26, 2007 at 8:04 pm #157940
I tend to agree with BC. When attempting to run Monte Carlo simulations we often see perfectly bell-shaped distributions that fail AD tests every time. Why? Because this test is sensitive in the tails. So these “fliers” as BC calls them often cause all kinds of problems with the AD test. So if it looks normal it probably is. Also check to see if there are any trends, clusters, etc. You can do this in control charts and/or run charts. Also, you can try the “individual distribution identification” trick another poster recommended. Just read the help menu if you are not sure how to do this. You will find this in Stat > Quality Tools. I may blog on this topic soon if anyone is interested (http://lssacademy.com). Good luck and great question and answers by all! I love this forum.
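The large-sample sensitivity described in this post and in BC's can be illustrated with a hand-rolled Anderson-Darling statistic. This is a sketch, not Minitab's implementation: the function names are made up, the 0.752 cutoff is the commonly quoted approximate 5% critical value for the small-sample-adjusted statistic when the mean and standard deviation are estimated from the data, and the test data are idealized logistic quantiles (bell-shaped, with slightly heavier tails than a normal) rather than real process data.

```python
import math
from statistics import mean, stdev

CRITICAL_5_PERCENT = 0.752  # approximate cutoff for the adjusted statistic


def normal_cdf(z):
    """Standard normal CDF; erfc keeps precision in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def anderson_darling_normal(data):
    """Adjusted A-squared for normality with mean and sd estimated from data."""
    xs = sorted(data)
    n = len(xs)
    m, s = mean(xs), stdev(xs)
    z = [(x - m) / s for x in xs]
    total = 0.0
    for i in range(n):  # i is 0-based, so 2*i + 1 is the textbook (2i - 1)
        total += (2 * i + 1) * (
            math.log(normal_cdf(z[i])) + math.log(normal_cdf(-z[n - 1 - i]))
        )
    a2 = -n - total / n
    return a2 * (1.0 + 0.75 / n + 2.25 / n ** 2)


def logistic_quantiles(n):
    """Idealized bell-shaped, slightly heavy-tailed sample of size n."""
    return [math.log(p / (1.0 - p)) for p in ((i + 0.5) / n for i in range(n))]


for n in (50, 500, 5000):
    a2 = anderson_darling_normal(logistic_quantiles(n))
    verdict = "reject normality" if a2 > CRITICAL_5_PERCENT else "cannot reject"
    print(f"n = {n:5d}  adjusted A^2 = {a2:.3f}  -> {verdict}")
```

The shape of the data is the same at every n here; only the amount of data changes. The statistic grows roughly in proportion to n, which is why a departure too small to matter for capability work can still be flagged once enough points are collected.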
June 26, 2007 at 11:23 pm #157942
Step by step approach (Member, @Stepbystepapproach)
The first question that comes to mind is: did you run chart your data before you did your normality test? After all, you are dealing with time-related data. Secondly, did you run a histogram, and what does the histogram tell you? Thirdly, did you investigate skewness and kurtosis? With random samples of 100 or so it is not very likely that the Anderson-Darling test will give you a false signal, particularly if you get p-values < 0.05 with repeated sampling from the same data set. You're making inferences based on one statistical test. The additional "tools" may give you an indication of whether you are dealing with an issue of "power" or with an underlying system that generates the nonnormal distribution. A review of the descriptives of a distribution (graphically or statistically) should always precede any inferential test. If the review of the run chart, histogram, skewness and kurtosis points to normality, but the Anderson-Darling test shows nonnormality, then you have more confidence in the inference that you are dealing with "too much" power.
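The descriptive checks listed above (histogram, skewness, kurtosis) are easy to run before any normality test. The following is a minimal sketch with made-up helper names and an idealized exponential sample standing in for real measurements. It uses simple moment-based estimators, which may differ slightly from the bias-corrected values a statistics package reports.

```python
import math


def shape_summary(values):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0  # both are about 0 for normal data


def text_histogram(values, bins=10, width=40):
    """Print a rough text histogram and return the bin counts."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / step), bins - 1)] += 1
    for k, c in enumerate(counts):
        bar = "#" * round(width * c / max(counts))
        print(f"{lo + k * step:8.2f} | {bar}")
    return counts


# Hypothetical data: idealized exponential quantiles (a clearly skewed process).
n = 1000
data = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
counts = text_histogram(data)
skew, excess_kurt = shape_summary(data)
print(f"skewness = {skew:.2f}, excess kurtosis = {excess_kurt:.2f}")
```

If the histogram is roughly symmetric and both numbers are close to 0 while the Anderson-Darling test still rejects, that supports the "too much power" reading. A long tail in the histogram with large skewness or kurtosis, as in this demo, points to a genuinely nonnormal process instead.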
June 26, 2007 at 11:41 pm #157943
Stats guy (Member, @Statsguy)
BC,
“AD is just one test, you have to use visual cues as well”. This is absolutely correct!
Just as an fyi (no criticism intended!) you are saying “No data population is actually normal”. IQ test scores, for example, are normally distributed because they have been calibrated this way.
“‘I thought more data points would tend to normalize the data’…this is true when you are plotting sample averages. But this won’t take 100 data points, more like 5 or 10”. Just a clarification: what you are describing here is the result of drawing random samples (with or without replacement) from a known population, i.e. the averages of the individual samples tend to become normally distributed (independent of the underlying distribution of the population). The sample size of the repeated sampling, 100 vs. 5 or 10, impacts the confidence interval around that estimated or known mean. The phenomenon itself is independent of sample size. That is why simply taking more measurements from the same distribution will not change the distribution to normality. There is a difference between the underlying distribution of the population and the distribution of the means of continuous random sampling from that distribution with sample size n. Just a clarification.
August 29, 2007 at 11:16 am #160529
Sean Shanley (Member, @SeanShanley)
There are 7 reasons for failing normality:
1. shift occurred in the middle of the data
2. mixed populations
3. truncated data
4. rounding of data to smaller values
5. outliers
6. too much data
7. underlying distribution is not normal
For the too-much-data situation, Anderson-Darling is too sensitive. Fix the problem by taking 50 random data points.