
Normalizing Non-Normal Data


    #47366

    Ang
    Participant

    I’m hoping there is someone out here who can help me.  For some reason the majority of the data I have been taking from my processes is turning out to be non-normal.  Since I’m really interested in the capability of my process, I can’t use the typical Six Pack as I normally would because of the non-normal data.  I’ve been trying to transform the data, but I’m not sure which transformation to use.  My data looks normal (i.e., it is not a Weibull, lognormal, exponential, or any other glaringly obvious distribution), but the A-D tests are all telling me it is non-normal.  I’ve also noticed that if I take between 30 and 100 data points my data tends to test as normal, but beyond 100 data points it begins to test as non-normal.  Is this typical?  I thought more data points would tend to normalize the data.  I’m not sure how much I can trust my capability results with the non-normal data, but I don’t know what other options I have at the moment.  Any help or insights are appreciated.
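    To illustrate, here is roughly the pattern I’m describing, simulated in Python/SciPy rather than Minitab (the data below are made up, not my actual measurements):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Stand-in for process data: mostly normal with slightly heavy tails
    # (t-distribution, 8 degrees of freedom).
    data = rng.standard_t(df=8, size=1000)

    for n in (50, 100, 500, 1000):
        result = stats.anderson(data[:n], dist='norm')
        crit_5pct = result.critical_values[2]  # significance levels: 15, 10, 5, 2.5, 1
        verdict = 'looks normal' if result.statistic < crit_5pct else 'rejects normality'
        print(f'n={n:4d}  A^2={result.statistic:.3f}  5% crit={crit_5pct:.3f}  -> {verdict}')

    The small samples tend to pass and the large ones tend to fail, which matches what I’m seeing.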
    -Peter
     

    #157888

    mohit
    Participant

    Hi Peter,
    Please go through the link below on converting non-normal data to normal:
    https://www.isixsigma.com/library/content/c020121a.asp
    That said, I would recommend that you not convert it. There are tests available in Minitab for nonparametric data; use those.
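    For example, a one-sample Wilcoxon signed-rank test against a target median, sketched here in Python/SciPy with made-up numbers (Minitab has the equivalent under Stat > Nonparametrics):

    import numpy as np
    from scipy import stats

    # Made-up measurements; test whether the process median is off a
    # target of 10.0 without assuming normality.
    data = np.array([9.8, 10.4, 9.6, 10.1, 9.9, 10.7, 9.5, 10.2, 9.7, 10.3])
    stat, p = stats.wilcoxon(data - 10.0)
    print(f'Wilcoxon statistic = {stat:.1f}, p-value = {p:.3f}')
    # A small p-value (say < 0.05) would suggest the median is off target.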

    #157892

    annon
    Participant

    There are a couple of options to use.

    1. If you are interested in capability, why not use DPU, RTY, or DPMO?  All can be calculated with non-normal data sets, thus preserving the original data.  I like using DPU and then converting to RTY and/or DPMO; it is easy to collect, understand, and present, and it is additive.
    2. You can work from the nonparametric perspective to determine significant differences: Mood’s median test, Mann-Whitney, etc.
    3. You can still use most inferential tools that compare means (ANOVA, t-test, DOE, regression, etc.) if your data set is not grossly skewed; they are largely robust to violations of normality and equal variance, but not of independence.
    4. You can simply subgroup your data set (n = 5, for example).  This will invoke the CLT and move your data set closer to normal; if data is plentiful, this is an option.
    5. Transform it.  It has been a while, but Box-Cox will provide you with an optimal lambda value (the one in the 95% CI) that you can use to determine the proper transform function (although I think MTB will do that for you); see the sketch after this list.
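    A minimal Box-Cox sketch in Python/SciPy (simulated, right-skewed data; Minitab’s Box-Cox tool does the equivalent):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    # Box-Cox requires strictly positive data; lognormal gives a right skew.
    data = rng.lognormal(mean=1.0, sigma=0.5, size=200)

    transformed, lam = stats.boxcox(data)  # lam = optimal lambda (max likelihood)
    print(f'optimal lambda = {lam:.3f}')

    # Sanity check: the A-D statistic should drop after transforming.
    print('raw A^2         =', stats.anderson(data, 'norm').statistic)
    print('transformed A^2 =', stats.anderson(transformed, 'norm').statistic)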
    Good luck and verify the above info before moving forward. 

    #157905

    BC
    Participant

    Peter,
    I need to be careful saying this, since I’m not a statistician, but is it possible your A-D test is just overly sensitive to a large amount of data?  No data population is actually normal, and it could be that you simply have enough data to “prove” that fact.  How straight do the data points plot on a normal probability plot?  If they look normal, you might just be able to analyze them as normal, no transformation needed.  A-D is just one test; you have to use visual cues as well.
    OTOH, if you have glaring “tails” (these will be clear in an NPP), then those are worth investigating.  Could be you have some flyers, or (at the other end) a physical constraint or a measurement discrimination issue.
    As for the comment “I thought more data points would tend to normalize the data”: this is true when you are plotting sample averages, but it won’t take 100 data points, more like 5 or 10.
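    If you want to draw that probability plot outside Minitab, here’s a quick Python/SciPy sketch (simulated data):

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    data = rng.normal(loc=50, scale=2, size=200)  # made-up measurements

    # Straight line = consistent with normal; bent tails = flyers,
    # truncation, or discrimination issues worth a closer look.
    (osm, osr), (slope, intercept, r) = stats.probplot(data, dist='norm', plot=plt.gca())
    print(f'correlation of the fit: r = {r:.4f}')  # closer to 1 = straighter
    plt.show()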
    Let me know what you come up with.
    Good luck
    BC

    #157936

    Robert Butler
    Participant

    Assuming you’ve done all of the usual things to make sure that the non-normality is just due to the process, and that it is reasonable to run computations for process capability, then the post below may be of some value.
    https://www.isixsigma.com/forum/showmessage.asp?messageID=82998

    #157938

    Jim Shelor
    Participant

    Peter,
     
    It sounds like you are using Minitab.
    If so, use the distribution ID function to determine the distribution of your data.
    Then run Capability Analysis Non-normal and select the distribution identified.
    Minitab will now run the analysis appropriate for the identified distribution.
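    If you ever need to do this outside Minitab, here is a rough Python/SciPy analogue of the distribution ID step (simulated data; the candidate list is just an example):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    data = rng.weibull(1.5, size=300) * 10  # made-up positive-valued data

    # Fit several candidate distributions and rank their goodness of fit.
    for name in ('norm', 'lognorm', 'weibull_min', 'expon'):
        dist = getattr(stats, name)
        params = dist.fit(data)
        ks = stats.kstest(data, name, args=params)
        print(f'{name:12s} K-S = {ks.statistic:.3f}  p = {ks.pvalue:.3f}')
    # Note: K-S p-values are optimistic when the parameters were estimated
    # from the same data, so treat these as relative rankings only.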
    Regards,
    Jim Shelor

    #157940

    Ron
    Member

    I tend to agree with BC.  When attempting to run Monte Carlo simulations we often see perfectly bell-shaped distributions that fail A-D tests every time.  Why?  Because this test is sensitive in the tails, so those “fliers,” as BC calls them, often cause all kinds of problems with the A-D test.  If it looks normal, it probably is.  Also check to see whether there are any trends, clusters, etc.  You can do this with control charts and/or run charts.  You can also try the “individual distribution identification” trick another poster recommended; just read the Help menu if you are not sure how to do this.  You will find it under Stat > Quality Tools.  I may blog on this topic soon if anyone is interested (http://lssacademy.com).  Good luck, and great question and answers by all!  I love this forum.
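    For the trends/clusters check, even a bare-bones run chart will do.  Here is a sketch (Python/matplotlib, made-up data with a deliberate mean shift, the kind of mixed population that fails A-D):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)
    # Two populations glued together: a sneaky shift halfway through.
    data = np.concatenate([rng.normal(10.0, 1.0, 100), rng.normal(11.5, 1.0, 100)])

    plt.plot(data, marker='.')
    plt.axhline(np.median(data), color='red', label='median')
    plt.xlabel('observation order')
    plt.ylabel('measurement')
    plt.legend()
    plt.show()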

    #157942

    Step by step approach
    Member

    The first question that comes to mind is: did you run chart your data before you did your normality test?  After all, you are dealing with time-related data.  Secondly, did you run a histogram, and what does the histogram tell you?  Thirdly, did you investigate skewness and kurtosis?  With random samples of 100 or so it is not very likely that the Anderson-Darling test will give you a false signal, particularly if you get p-values < 0.05 with repeated sampling from the same data set.  You are making inferences based on one statistical test.  The additional “tools” may tell you whether you are dealing with an issue of “power” or with an underlying system that generates a non-normal distribution.  A review of the descriptives of a distribution (graphical or statistical) should always precede any inferential test.  If the review of the run chart, histogram, skewness, and kurtosis points to normality, but the Anderson-Darling test shows non-normality, then you have more confidence in the inference that you are dealing with “too much” power.
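    To make the skewness/kurtosis and repeated-sampling checks concrete, a small Python/SciPy sketch (simulated, mildly skewed data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    data = rng.gamma(shape=4.0, scale=1.0, size=500)  # made-up, mildly skewed

    print(f'skewness        = {stats.skew(data):.3f}')      # ~0 for normal
    print(f'excess kurtosis = {stats.kurtosis(data):.3f}')  # ~0 for normal

    # Repeated sampling from the same data set: consistently small p-values
    # point to a genuinely non-normal system rather than an over-powered test.
    for i in range(5):
        sample = rng.choice(data, size=100, replace=False)
        stat, p = stats.normaltest(sample)  # D'Agostino test, built on skew/kurtosis
        print(f'sample {i + 1}: p = {p:.3f}')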

    #157943

    Stats guy
    Member

    BC,
    “A-D is just one test; you have to use visual cues as well.”  This is absolutely correct!
    Just as an FYI (no criticism intended!), you said “No data population is actually normal.”  IQ test scores, for example, are normally distributed because they have been calibrated that way.
    On “‘I thought more data points would tend to normalize the data’…this is true when you are plotting sample averages.  But this won’t take 100 data points, more like 5 or 10”: just a clarification.  What you are describing is the result of drawing repeated random samples (with or without replacement) from a known population; the averages of those samples tend to become normally distributed, independent of the underlying distribution of the population.  The sample size of the repeated sampling, 100 vs. 5 or 10, affects the confidence interval around that estimated or known mean; the phenomenon itself is independent of sample size.  That is why simply taking more measurements from the same distribution will not change the distribution to normality.  There is a difference between the underlying distribution of the population and the distribution of the means of continued random sampling from that distribution with sample size n.  Just a clarification.
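    A quick simulation of the distinction (Python/NumPy; an exponential population is used purely for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    population = rng.exponential(scale=1.0, size=100_000)  # clearly non-normal

    # Raw data never normalize no matter how many points you take (n = 1),
    # but means of subgroups drift toward normal as subgroup size grows.
    for n in (1, 5, 25):
        means = rng.choice(population, size=(2000, n)).mean(axis=1)
        print(f'subgroup size {n:2d}: skewness of the means = {stats.skew(means):.2f}')
    # For an exponential population the skewness of the means shrinks
    # like 2/sqrt(n): roughly 2.0, 0.89, and 0.40 here.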

    #160529

    Sean Shanley
    Member

    There are seven reasons for failing normality:
    1. a shift occurred in the middle of the data
    2. mixed populations
    3. truncated data
    4. rounding of data to too few distinct values
    5. outliers
    6. too much data
    7. the underlying distribution is not normal

    For the too-much-data situation, Anderson-Darling is too sensitive.  Fix the problem by taking 50 random data points, as sketched below.
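    A sketch of that fix in Python/SciPy (a simulated stand-in for a large data set):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    data = rng.standard_t(df=8, size=2000)  # big, slightly heavy-tailed sample

    # Full data set: A-D usually has enough power to reject near-normality.
    full = stats.anderson(data, 'norm')
    print('full n=2000 : A^2 =', round(full.statistic, 3), ' 5% crit =', full.critical_values[2])

    # Random 50-point subsample, per the suggestion above.
    sub = stats.anderson(rng.choice(data, size=50, replace=False), 'norm')
    print('random n=50 : A^2 =', round(sub.statistic, 3), ' 5% crit =', sub.critical_values[2])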


The forum ‘General’ is closed to new topics and replies.