iSixSigma

Non-Normal Data


Viewing 24 posts - 1 through 24 (of 24 total)
  • Author
    Posts
  • #43322

    Wäsbom
    Participant

    Hey Folks,
    Need your help here!
    If my data turned out to be non-normal after a normality test, how can I transform it into normal data? I tried a Box-Cox transformation in Minitab, but it doesn’t work with negative numbers. Is it my version of Minitab that has the problem, or is the Box-Cox transformation simply not applicable to negative numbers?
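    The failure on negative values, and the usual shift-by-a-constant workaround discussed later in this thread, can be sketched with a hand-rolled Box-Cox (the data and the choice of offset here are illustrative, not from the original post):

    ```python
    import numpy as np

    def box_cox(x, lmbda):
        """Box-Cox transform; defined only for strictly positive data."""
        x = np.asarray(x, dtype=float)
        if np.any(x <= 0):
            raise ValueError("Box-Cox requires strictly positive data")
        if lmbda == 0:
            return np.log(x)                 # lambda = 0 is the natural log
        return (x ** lmbda - 1.0) / lmbda

    data = np.array([-2.0, -0.5, 1.0, 3.0, 7.0])   # hypothetical measurements

    # The direct transform fails: logs and fractional powers are
    # undefined for values <= 0, which is why Minitab reports an error.
    try:
        box_cox(data, 0.5)
    except ValueError as err:
        print(err)

    # Common workaround: shift every value by a constant so all are
    # positive, then transform (the same shift must be applied to the
    # spec limits before computing capability).
    shift = abs(data.min()) + 1.0
    transformed = box_cox(data + shift, 0.5)
    print(transformed)
    ```

    The shift is part of the transformation, so it has to be undone (or applied to the specs) before reporting results, as several replies below point out.
    
    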

    0
    #137323

    new guy
    Participant

    My understanding is that “transforming” is to be used only if the data are discrete.
    If your data are continuous, you can try “segmentation” or “sub-grouping” to normalize the data.

    0
    #137326

    Wäsbom
    Participant

    Thanks for the comment. Any other information from you guys? Thanks.

    0
    #137327

    Craig
    Participant

    Diego,
    Don’t automatically assume you can transform the data successfully. Have you looked for outliers with assignable causes? Have you looked for evidence of a multi-modal distribution? Sometimes there are simply too many sources of variation, and the data do not fit a normal distribution.

    0
    #137328

    Wäsbom
    Participant

    Thank you for the response. I understand what you mean, but what if I don’t have enough resources to measure a certain process? Assume a measurement can be taken only once or twice, and my data turned out to be non-normal after a normality test. I know the Box-Cox transformation in Minitab uses power functions (log, square root, square, inverse, etc.), but they don’t work with negative numbers. I tried running a capability analysis using Weibull process capability, but again, it doesn’t work with a negative data response. I tried a normal process capability analysis with the Box-Cox transformation option, but got the same error. Can anyone explain why, and which Minitab function I should use?

    0
    #137329

    Robert Butler
    Participant

    Three observations:
    1. The Box-Cox transform won’t work with negative numbers.
    2. The Box-Cox transform is for data from a skewed distribution.
    3. Transforms are used with both continuous and discrete data.
    Some comments:
      You said “…data turned out to be a non-normal after normality test”. Your post gives the impression that this is all you have done with respect to checking your data.  If this is the case you need to stop and start over.
    1. To HACL’s points – plot the data – first, last, and always plot the data. Plot it using a histogram, plot it on a normal probability plot, and plot it any other way that makes physical sense. Plots will allow you to address everything HACL mentioned. Most importantly, the plots will put your normality test in perspective and, if the data fails the test, allow you to make a decision concerning the importance of that failure. 
      There are many reasonably symmetric distributions which fail a normality test but which can be analyzed with the usual tools without fear of reaching incorrect conclusions.
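    The “plot it on a normal probability plot” advice can be sketched numerically: the correlation between the ordered data and the corresponding normal quantiles (the statistic behind a probability-plot correlation test) is near 1 for plausibly normal data and drops for skewed data. This is a minimal illustration using only NumPy and the standard library; the simulated data sets are hypothetical.

    ```python
    import numpy as np
    from statistics import NormalDist

    def normal_prob_plot_corr(x):
        """Correlation between ordered data and normal quantiles,
        i.e. how straight the normal probability plot would look."""
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        # Filliben-style plotting positions
        p = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
        q = np.array([NormalDist().inv_cdf(pi) for pi in p])
        return np.corrcoef(x, q)[0, 1]

    rng = np.random.default_rng(1)
    c_norm = normal_prob_plot_corr(rng.normal(10, 2, 200))   # close to 1
    c_skew = normal_prob_plot_corr(rng.exponential(2, 200))  # noticeably lower
    print(c_norm, c_skew)
    ```

    The point of the plot, as Robert says, is perspective: a formal test can flag a distribution as “non-normal” even when the plot shows it is symmetric enough for the usual tools.
    
    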

    0
    #137335

    Ernesto Garcia
    Participant

    Hi! Just to support Bob’s statements:
    1) Box-Cox does not work with data containing negative or zero values. If you have negative or zero values, add a constant to all your data points to make them positive.
    2) Box-Cox works for distributions that are uni-modal (one peak), and where the ratio between the maximum and the minimum value is greater than 2 or 3 (see Box, Hunter, and Hunter on this). Thus, when faced with a distribution that has many “peaks” (multi-modal), use your process experience and stratify.
    3) Minitab has another transformation available when Box-Cox does not work: the Johnson transformation. I do not know the nature of your data/process, but you might be able to use it.
    Best regards!
    Er

    0
    #137609

    Sathiya
    Member

    Hi Diego,
    If you want to compute a process capability value, group your data (group size should be at least 4), and then check for normality.
    If the data are still not normal, identify the distribution of the grouped data through Minitab:
    Quality Tools > Individual Distribution Identification
    A p-value above 0.05 indicates a good fit.
    If the data are still non-normal, increase the group size.
    Then use process capability for non-normal data, specify the distribution you identified, and compute Cp or Pp.
    Regards
    Sathiya

    0
    #137611

    Anonymous
    Guest

    Sathiya,
    Are you sure?
    Andy

    0
    #137612

    S.S.Choudhury
    Member

    Diego,
    First check your basic process stability for the presence of mixtures, clusters, trends, and oscillations. You may use Minitab’s run chart for this. Sometimes the shape of the normal probability plot indicates a lot of things: for example, an S-shaped curve indicates multiple populations. Only once you are sure that these factors are not present should you use a transformation.
    As regards negative values, the Box-Cox transformation normally does not take negative values, as it is intended for distributions bounded at zero. However, if required in extreme cases, you may add the absolute value of the most negative reading to all your readings and do the analysis, as this will not disturb the shape of the distribution. Be sure to verify the distribution of the transformed data before progressing, and to reverse the addition before publishing your results.
     
    Regards,
    S.S.Choudhury

    0
    #137613

    Yadav
    Participant

    Sathiya’s approach is correct. When data fail a normality test, individual distribution identification should be applied, and then capability analysis can be done for that distribution.

    0
    #137616

    Anonymous
    Guest

    I disagree. Try reading his post more carefully. He does not mention a distribution of individuals. On the contrary, he implies using the distribution of sampled averages, which is incorrect.
    I also disagree with ‘fitting a distribution’ to individual data, even though some others take that view. My reasoning for this position is that process capability is a Japanese invention, based on a study of Shewhart charts, not on some US academic’s musings.

    0
    #137633

    Paul Keller
    Participant

    A couple points that have been missed or even mis-applied in the thread I reviewed:
    1. You need to know the underlying shape of the process distribution to calculate a meaningful Process Capability index. The standard calculations apply only to a process whose observations are normally distributed. To properly calculate a capability index for non-normal data, you either need to transform the data to normal, or use special case calculations for non-normal processes.
    2. You should never do a transformation, or calculate Process Capability, until you have determined the process is in the state of statistical control. If the process is not in control, then it is not stable, and cannot be predicted using capability indices. Likewise, an out of control situation is evidence that multiple distributions are in place, so a single transformation for all the process data would be meaningless.
    3. Sub-grouping the data and analyzing the subgroup averages for normality will not tell you anything about the underlying distribution of the observations. It will merely demonstrate the Central Limit Theorem, and allow you to use X-Bar charts with confidence to test for process control.
    So how do you handle this data?
    1. First investigate process stability using a control chart. We could use an X-Bar chart with a subgroup size of 5. Why five? The CLT tells us the average of five observations from even fairly non-normal processes will tend to be normally distributed. You can run a normality test on these averages to verify. You might also go with a subgroup size of 3 if that works, which it often does. Another approach is to use an EWMA chart with your original subgroup size of one (a lambda of 0.4 works well). This chart should handle even non-normal data well.
    2. If the process is out of control, stop there and improve the process. Don’t bother with a capability analysis or with transformation, as they will be meaningless.
    3. If the process is in control, then you can estimate capability. You could either transform the data to Normal and use the standard calculations for capability applied to the normalized data, or fit a distribution to the data and calculate the capability using the percentiles of the distribution. The Johnson technique applies this latter approach.
    I hope this is helpful. There’s more information on each of these topics in quality america’s Knowledge Center (http://www.qualityamerica.com/knowledgecente/knowctrKnowledge_Center.htm).
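    The central-limit effect behind the subgroup-of-5 advice can be illustrated in a few lines: averaging subgroups of a skewed process shrinks the skewness by roughly the square root of the subgroup size, without telling you anything about the shape of the individuals. The simulated exponential process here is hypothetical.

    ```python
    import numpy as np

    def skewness(x):
        """Sample skewness, used as a rough yardstick of non-normality."""
        x = np.asarray(x, dtype=float)
        return float(np.mean(((x - x.mean()) / x.std()) ** 3))

    rng = np.random.default_rng(7)
    raw = rng.exponential(1.0, 5000)           # highly skewed individuals
    means5 = raw.reshape(-1, 5).mean(axis=1)   # subgroup averages, n = 5

    # Skewness of the averages is reduced by about sqrt(5) relative to
    # the individuals, so the averages look far more normal even though
    # the underlying observations are unchanged.
    print(skewness(raw), skewness(means5))
    ```

    This is exactly Paul’s point 3: the near-normal averages validate the X-Bar chart, but capability still depends on the distribution of the individuals.
    
    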

    0
    #137636

    Anonymous
    Guest

    The shape of the distribution is the distribution of sampled means. We all know what it is: it is normal.
    Process capability can be determined by ‘back-calculating’ the variance of individuals. This was the original method used by the Japanese engineers who invented Cp and Cpk.
    Just because a US academic decides to re-define a Japanese metric doesn’t make it right.
    In fact, it can be most useful to ‘back-calculate’ the individual variance, since it allows one to compare observed values with the back-calculated individual spread to check independence assumptions.
    You also claim the optimal subgroup size is five, but Shewhart states that the optimum subgroup size is between 3 and 5. In any case, the distribution of sample means will depend more on the number of subgroups, g, than on the difference between using 3, 4, or the 5 you suggest.
    There is also the other issue you raised, the question of stability, which is usually determined from a Shewhart chart. Therefore, the use of a Shewhart chart should be ‘central’ to any calculation of process capability, rather than the distribution of the individuals, since without it there is no way of knowing whether the data are independent and homogeneous.

    0
    #137639

    Paul Keller
    Participant

    I think Andy may have mis-interpreted some of my comments:
    The shape of the distribution of the means is predicted to be Normal by the Central Limit Theorem. In practice, it is often Normal, with larger subgroup sizes tending to be more closely Normal than smaller subgroup sizes. However, the distribution of the underlying observations is still unknown, and it is that distribution which is pertinent to the discussion of capability indices.
    I did not mean to suggest that a subgroup size of five is ‘optimal’. I mentioned the subgroup size of three may give similar results. The smaller subgroup size is often preferred for economic considerations. Yet, a subgroup of size five is more likely to be normally distributed if the observations are more non-normal, so is an acceptable place to start.
    Regarding the number of subgroups, the standard control limit calculations using the tabulated control chart “constants” are based on the assumption that there is a large number of subgroups. The “constants” A2, D4, etc. (or their more basic d2, d3 and c4) are really not constant, but depend on the number of subgroups (g). If you had a small number of subgroups, you would see differences in the distribution of the averages based on the number of subgroups at a fixed subgroup size, which is why it is important to have a sufficient number of subgroups to estimate your control limits.
    Andy made a statement that is unclear to me: “Process capability can be determined by ‘back-calculating’ the variance of individuals. This was the original method used by the Japanese engineers who invented Cp and Cpk.”
    I think he is referring to the fact that the standard deviation of the observations equals the standard deviation of the averages times the square root of n. Yes, this is part of the accepted method used to calculate process sigma based on the X-Bar chart’s control limits, and this value of process sigma is used in the capability calculation for a normal distribution. However, that doesn’t help to understand anything about the shape of the distribution of the observations. There are four quantities commonly used to characterize a distribution: mean, standard deviation, skewness, and kurtosis. We can mathematically “back-calculate” the mean and standard deviation of the observations from the mean and standard deviation of the averages, but we still need to know something about the shape of the distribution of the observations (i.e. skewness and kurtosis).
    In the standard calculation of process capability, we consider the plus and minus 3 sigma levels of the process. Why? Well, for the Normal distribution, this will account for 99.73% of the process variation. Yet in a Non-normal case, plus and minus 3 sigma may provide a completely different level of protection. While the plus three sigma value coincides with the 99.865 percentile of a Normal distribution, the percentile at plus 3 sigma for the non-normal distribution is dependent on the shape of the distribution. If you consider typical data from a highly-skewed process bounded at zero (such as Cycle Times, Wait Times, TIR, flatness, etc.), you’ll find that a minus three sigma is often a negative number. That’s a clue that the plus three sigma level is unlikely to be the 99.865 percentile. In capability analysis, it is the percentiles we care about, not that they occur at plus and minus three sigma.
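    The ±3 sigma argument can be made concrete with a simple worked case: take a hypothetical exponential process with mean 1 and standard deviation 1 (a common model for cycle or wait times bounded at zero). Its exact CDF makes the percentile arithmetic trivial.

    ```python
    import math

    # Exponential process bounded at zero: mean = sd = 1.
    mean, sd = 1.0, 1.0

    # Plus 3 sigma sits at x = 4, but for the exponential distribution
    # that point covers only about 98.2% of output, not the 99.865% that
    # +3 sigma covers for a normal distribution.
    upper = mean + 3 * sd
    coverage = 1 - math.exp(-upper)      # exponential CDF at x = 4
    print(coverage)                      # ~0.9817, not 0.99865

    # Minus 3 sigma is negative: an impossible value for a process
    # bounded at zero, which is the "clue" mentioned above.
    print(mean - 3 * sd)                 # -2.0
    ```

    So for skewed data, the ±3 sigma points and the 0.135/99.865 percentiles part company, which is why the percentile-based capability calculations are needed.
    
    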
    Does that make more sense?

    0
    #137642

    Robert Butler
    Participant

      Assuming you have done all of the things you should have done before trying to compute your process capability and assuming that after all of this effort your data is really non-normal then you don’t have to subgroup and you don’t have to transform.  What you need to do is the following:
    Take your data and plot it on normal probability paper. Identify the 0.135 and 99.865 percentile values (Z = ±3). The difference between these two values is the span for the middle 99.73% of the process output. This span is the equivalent 6-sigma spread. Use this estimate for the 6-sigma term in the process capability calculations. The capability goal is to have this spread equal to 0.75 × tolerance.
    This is the method recommended in Measuring Process Capability by Bothe. For additional details, read Chapter 8, “Measuring Capability for Non-Normal Variable Data.”
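    The percentile method above can be sketched directly from data, without any transform. The simulated skewed process and the spec limits here are hypothetical stand-ins, chosen only to show the mechanics:

    ```python
    import numpy as np

    # Simulate an in-control but skewed process (illustrative only).
    rng = np.random.default_rng(3)
    data = rng.gamma(shape=2.0, scale=1.5, size=100_000)

    # Bothe's percentile method: the 0.135 and 99.865 percentiles bound
    # the middle 99.73% of output; their difference is the equivalent
    # 6-sigma spread.
    lo, hi = np.percentile(data, [0.135, 99.865])
    spread = hi - lo

    # Compare the spread to the tolerance (hypothetical spec limits).
    LSL, USL = 0.0, 20.0
    pp_equiv = (USL - LSL) / spread
    print(spread, pp_equiv)
    ```

    In practice the percentiles would come from the fitted probability plot rather than raw order statistics, since real samples are far smaller than this simulated one.
    
    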

    0
    #137644

    Jonathon Andell
    Participant

    First of all, plot the data over time to see if they are statistically stable (“common cause”). Special-cause variation can create the appearance of non-normality.
    If the process appears stable, and if negative numbers are feasible, you may want to offset the data: add the same number to every single data point so that the entire set is positive. Then I’d suggest Minitab’s distribution ID utility. Box-Cox can make data “look” normal, but identifying the distribution is a source of new knowledge.

    0
    #149575

    Amar
    Participant

    Probably the best lambda is evaluated as 0 by Minitab, and in that case the Box-Cox transformation is simply the logarithm, which is not defined for negative values in the sample. This is why Minitab complains.

    0
    #149578

    Hal
    Participant

    One of Six Sigma’s many flaws is its obsession with normal distributions. This appears to stem from Six Sigma tables that require data to be normal.
    There is no need for data transforms. Control charts don’t need them, nor do histograms. Transformed data loses its meaning. Transforms also require inverse transforms that become too complex for practical purposes.

    0
    #149581

    Darth
    Participant

    Sorry Hal, the first paragraph of your response is wayyyyy offffff. SS doesn’t have an obsession with normality. t-tests, ANOVA, capability analysis, regression, and a host of other statistical tools have varying assumptions about normality. You can’t lay that on SS because it makes use of those stat tools. Those assumptions were there long before SS came into being.

    0
    #149598

    Chris Seider
    Participant

    Diego,
    Remember that Minitab iterates through a range of values to decide what power to raise your data to. To avoid imaginary numbers, lambdas such as 0.5 and 0 are not mathematically possible if your data are negative.
    Two things to try. First, make sure your max is at least twice your min, which helps. Second, add a fixed value to all of your data so every value is positive. Remember this offset is now part of your transformation, which you must also apply to your specs, etc.

    0
    #149599

    Chris Seider
    Participant

    Hal,
    It seems you have had a bad experience or are having a peering over the wall perspective.
    I don’t think the very good practitioners of Six Sigma have ever insisted on having a normal process before they can improve it. The better ones really understand the assumptions and wouldn’t want to misapply a t-test or F-test and potentially reach erroneous conclusions.
    I’d like to think of the need to apply normal distributions appropriately as like using a Phillips screwdriver on the appropriate screw head. I’d hate to use a slotted screwdriver on the wrong distribution of screws.

    0
    #149618

    Hal
    Participant

    With reference to processes, and not surveys of people and other such enumerative studies that are frequently quoted here, how many processes do you actually think are perfectly normally distributed?
    Have you ever used a Pearson lack-of-fit test?

    0
    #149628

    Chris Seider
    Participant

    I have read some interesting material arguing that few, if any, processes are actually normal.
    I find such arguments more academic than useful. If the bell curve fits closely, using my favorite test, the Anderson-Darling statistic, I will assume the data are normal. Any deviation from the perfect normal curve could stem from any number of causes, such as measurement error or too few samples to accurately characterize the population, and in my opinion that is not enough of a reason to stop driving business or personal results using the Six Sigma methodology.
    I’m curious where you are driving with your last question. Thanks.

    0

The forum ‘General’ is closed to new topics and replies.