Why not normal distribution?
Six Sigma – iSixSigma › Forums › Old Forums › General › Why not normal distribution?
 This topic has 17 replies, 12 voices, and was last updated 19 years, 4 months ago by DANG Dinh Cung.

AuthorPosts

January 8, 2003 at 10:54 am #31163
I selected 36 samples from the process to calculate the CpK. Unfortunately, the data was not normally distributed. But I found it will be normally distributed if I add one data point or remove one. How to explain this situation? And how to evaluate this process by CpK? I also want to know why the data is not normally distributed, and how to solve this problem?
January 8, 2003 at 12:20 pm #81980
Jackey,
A few comments/questions for you:
1. Given your process, is 36 samples sufficient to characterize the capability? How long did it take to gather 36 samples? In that amount of time, would the process have exhibited a characteristic amount of variation?
2. If the data is not normally distributed, you first have to ask why. Most processes do not exhibit normally distributed data. You may simply not have collected enough samples for the process to exhibit normality. Or, there may be some form of “special” cause variation occurring that is affecting your data. If so, you just found an “x.” It could be that your process will never exhibit normality. But either way, you need to rule out sample size and special cause variation as potential causes of non-normality first.
3. In terms of “solving” the problem, I would collect more data first and see if it approaches normality. If that doesn’t work, you may be able to subgroup the data and analyze the averages of the subgroups. You may also be able to transform the data with Box-Cox or a similar transform.
Good luck.
Neil

January 8, 2003 at 1:26 pm #81983
Were the 36 samples consecutive? Was the process adjusted during the sampling process? Nonconsecutive samples taken offline that may be subject to adjustment would be of limited value in calculating capability.
January 8, 2003 at 2:03 pm #81986
Hi Jackey,
The normal distribution is valid exactly when you take an infinite number of samples for an event with a constant probability. If you take fewer samples, there may be deviations from the theoretical curve. When you take almost no samples, your deviations will be very large.
You can prove it by a simple experiment yourself. Take many coins, say 10 coins. Throw them and count the number of heads. The theoretical distribution is binomial. The more coins you throw, the closer it will come to the normal distribution.
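The experiment can also be simulated instead of done with physical coins. A stdlib-only Python sketch (the coin count and seed are arbitrary choices for illustration):

```python
import random

def toss_heads(n_coins, rng):
    """Count heads in one throw of n_coins fair coins."""
    return sum(rng.random() < 0.5 for _ in range(n_coins))

def experiment(n_coins, n_throws, seed=1):
    """Throw n_coins coins n_throws times; return the head counts."""
    rng = random.Random(seed)
    return [toss_heads(n_coins, rng) for _ in range(n_throws)]

# Few throws give a ragged histogram; many throws approach the
# binomial (and hence near-normal) bell shape centred on 5 heads.
for n_throws in (3, 10, 36, 100):
    counts = experiment(10, n_throws)
    print(n_throws, sorted(counts))
```

Tallying the sorted counts for each run makes it easy to see how ragged the small-sample distributions are compared with the 100-throw run.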
Throw them 3 times, 10 times, 36 times, 100 times and record the results. Repeat. Display the distributions. Remove one sample, add one sample. Be amazed.

January 8, 2003 at 4:41 pm #81993
Chip Hewette
In a meeting, I once tossed a coin eight times and got heads every time. Did not succeed in demonstrating randomness!
January 8, 2003 at 5:37 pm #81999
Hi Chip,
Maybe you gave up too early?
Let’s recall, your chance to get x heads out of n tosses is
P(n,x) = n_over_x * p**x * q**(n-x)
where
n_over_x = n!/(x!*(n-x)!)
p: constant probability of each event
q = 1 - p
p, q >= 0
p + q = 1
Your chance to have 8 heads out of 8 tosses, while the coin’s probability is (p=0.5) for heads, is:
P(8,8) = (8!/(0!*8!)) * 0.5**8 * 0.5**0 = 3.9E-3 = 0.39% = 3906 ppm = 4.2 sigma (in 6sigma terms)
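That arithmetic can be checked in a few lines of Python (stdlib only):

```python
from math import comb

def binom_pmf(n, x, p):
    """P(x successes in n independent trials, success probability p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Chance of 8 heads in 8 tosses of a fair coin
print(binom_pmf(8, 8, 0.5))  # 0.00390625, i.e. about 0.39%
```

The same function reproduces the biased-coin figures quoted below: binom_pmf(8, 8, 0.69) is about 5% and binom_pmf(8, 8, 0.92) is about 51%.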
There are several possibilities:
a) Your coin was OK (p=0.5) and by chance you demonstrated the unlikely, but possible, event of a head series. Recall, each toss in itself is independent of the others, and each toss is executed with the same, constant probability, we assume.
b) For some reason your coin was not OK. When your coin prefers heads at (p=0.69), your chance is 5% to have 8 heads in 8 tosses. Had it been (p=0.92), your chance would have been 50%. Did you pre-check your coin?
c) You did not take enough samples. Had you thrown your coin 18 times, the chance to have 18 heads out of 18 tosses at (p=0.5) would have been 3.8E-6 – very unlikely, but possible.
d) The way you threw the coin resulted in a non-constant p. How did you do it?
e) The 8 tosses somehow were dependent on each other. How did you do it? Did you become upset or amazed?
f) Your possibility here.
Recently I found that some of our dice do not provide a uniform probability for each side (my assumption p=1/6 does not hold). The bouncing ground impacted the distribution. Violate the preconditions, and you’ll have unexpected results. Think that something is impossible and it will kill you – sometimes. Whatever the reason was in your case, it could have been a good chance to introduce stochastics, the binomial distribution, testing of hypotheses, or at least how to observe and execute experiments critically. Maybe even how to improve your experiment.
enjoy

January 8, 2003 at 7:18 pm #82002
Jackey,
It is perfectly normal, no pun intended, for samples as small as 36 to come out looking quite non-normal. How did you determine the non-normality? Histograms? If so, don’t be concerned based on the analysis of a histogram alone. You could specify a normal distribution within Minitab and have it generate multiple runs of data that, upon graphing, would look quite non-normal.
My recommendation would be to plot the data in a normal probability plot and assess it for linearity – or at least check that most of the data fall within the 95% CIs that Minitab can provide – to check the assumption of normality. Box-Cox could also provide that analysis if the 95% CI for lambda encompasses 1, i.e. no transform needed.
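The linearity check behind a normal probability plot can be approximated in plain Python: sort the data, pair it with theoretical normal quantiles, and see how close the relationship is to a straight line. A rough stdlib-only sketch (the plotting positions and cutoff are illustrative choices, not Minitab's exact method):

```python
from statistics import NormalDist

def npp_correlation(data):
    """Correlation of sorted data vs. theoretical normal quantiles.
    Values near 1 mean the normal probability plot is close to a
    straight line, supporting the normality assumption."""
    n = len(data)
    xs = sorted(data)
    # Filliben-style median-rank plotting positions
    qs = [NormalDist().inv_cdf((i - 0.3175) / (n + 0.365))
          for i in range(1, n + 1)]
    mx, mq = sum(xs) / n, sum(qs) / n
    sxq = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    sxx = sum((x - mx) ** 2 for x in xs)
    sqq = sum((q - mq) ** 2 for q in qs)
    return sxq / (sxx * sqq) ** 0.5
```

For around 36 points, a correlation well below roughly 0.97 would be grounds for doubting normality; the exact critical value depends on n and comes from published tables.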
The main thing you should be concerned about with the calculation of Cpk is whether the process you pulled the data from is exhibiting control, since that is where the sample estimate of the population variability, through Rbar/d2, is derived from. Beyond that, independence of the data should be an area of interest.
Regards,
Erik

January 8, 2003 at 8:43 pm #82004
Chip Hewette
Cpk is often used to give people a sense of ‘goodness’ about a production process. Calculating Cpk based on a single sample from a process is, in my view, an incorrect use of the Cpk calculation. If one is lucky, and all the parts are similar, the Cpk looks good. Later, when the process drifts, the customer calls and says “Why are you sending me bad parts with a Cpk of 1.44?” This is not an enjoyable conversation.
Cpk is derived from the estimate for the standard deviation, and in control chart methods is based on R bar. One must have enough subgroups to calculate R bar. 36 samples, of themselves, don’t really show the process owner all the sources of variation.
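The Rbar/d2 route described above can be sketched in Python. The d2 value for subgroups of 5 is the standard control-chart constant; the subgroup data and spec limits below are made up for illustration:

```python
import statistics

# d2 constant for subgroup size n=5 (standard control-chart table value)
D2_N5 = 2.326

def cpk_from_subgroups(subgroups, lsl, usl):
    """Cpk using the within-subgroup (Rbar/d2) estimate of sigma."""
    rbar = statistics.mean(max(g) - min(g) for g in subgroups)
    sigma_within = rbar / D2_N5
    xbar = statistics.mean(x for g in subgroups for x in g)
    return min(usl - xbar, xbar - lsl) / (3 * sigma_within)

# Hypothetical example: seven subgroups of five measurements each
groups = [[9.9, 10.0, 10.1, 10.0, 10.0]] * 7
print(cpk_from_subgroups(groups, lsl=9.0, usl=11.0))  # ≈ 3.88
```

Note that with only 36 individual readings you get at most seven subgroups of five, which is why the posts here keep stressing that more data is needed before the Rbar estimate means much.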
I suggest that you develop a control chart suitable for the process and key measure, and subgroup properly to calculate R bar. With enough time, you may see special causes that can be fixed prior to presenting a customer with a Cpk value based on a reasonable length production run.

January 9, 2003 at 8:21 am #82009
THOTHATHRI
If the 36 data points were taken consecutively, you can calculate Ppk, not Cpk.
Ppk indices tell you the process capability of meeting the specification in the current situation.
Cpk tells you the process capability continuously over time. If removing some data makes it normal, the abnormality may be due to measurement error. Such data can be screened using a box plot in Minitab.

January 9, 2003 at 8:37 am #82010
Some of the contributors here appear to have gotten too academic. Let me explain in simple terms.
It is quite normal not to get a normal distribution (in fact it is almost impossible to get one), particularly with such low sample sizes.
But Cpk theory is based on the assumption that the distribution is normal. So what do you do?
You make the distribution normal!
How?
By sampling in subgroups. In your case you can divide the 36 samples into 6 subgroups of 6.
You then take the mean of each group.
(If you plotted the mean of each subgroup, you would notice that the distribution is normal)
For your Cpk calculation, you use the means to calculate the sigma, and the mean of the subgroup means for the x-bar.
Your results should become more accurate.

January 9, 2003 at 9:19 am #82015
DANG Dinh Cung
Good morning and Happy New Year.
I agree with RAJ.
Cpk is calculated from means of small samples (5 to 10 units). I suggest the following procedure:
1. From your 36 data points, form a number of combinations (subgroups) of 6 units, for instance 50 of them.
2. Calculate the means of those combinations. Practically, those means will be distributed according to the normal law.
3. Calculate the mean and standard deviation of those means.
Best regards,
DANG DINH CUNG

January 9, 2003 at 9:24 am #82016
Maybe it’s not that simple.
Distribution of subgroup means would tend to follow a normal distribution. That is guaranteed by the Central Limit Theorem.
But you don’t use these subgroup means to compute the process mean and stdev. If you do, the Cpk value you get will be inflated by a factor of sqrt(n).
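That sqrt(n) inflation is easy to demonstrate numerically. A stdlib-only sketch with made-up data (mean 10, sigma 0.5, subgroup size 6):

```python
import random
import statistics

rng = random.Random(42)
data = [rng.gauss(10.0, 0.5) for _ in range(3600)]

# Standard deviation of the individual values...
sd_individuals = statistics.pstdev(data)

# ...versus the standard deviation of subgroup means (subgroup size 6).
means = [statistics.mean(data[i:i + 6]) for i in range(0, len(data), 6)]
sd_of_means = statistics.pstdev(means)

# The means are tighter by roughly sqrt(6) ≈ 2.45, so a Cpk computed
# from them would be inflated by about that factor.
print(sd_individuals / sd_of_means)
```

The printed ratio lands near 2.45, confirming that the spread of subgroup means understates the spread of individual parts.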
As suggested by other posters, a test for normality should be performed if there is doubt that the distribution is normal. If the results indicate non-normality, then one can transform the data in order to make it normal. Only then can Cpk be calculated (assuming that the process is in statistical control).

January 9, 2003 at 1:28 pm #82026
You cannot use subgroups to define Cpk. A sample size of 36 seems too small to calculate Cpk. If the data is non-normal, it can simply be because of too little data. I would suggest you investigate further and see if your process is stable. Only if it is should you proceed with Cpk calculations.
Also, Cpk and Ppk use different sigma calculations and mean quite different things.

January 9, 2003 at 1:45 pm #82028
Process capability is calculated from subgroups all the time. In addition, there is not enough information to conclude that insufficient data exist. I would argue that 36 consecutive samples are certainly sufficient to calculate the standard deviation. Once you know that you can make any number of informed decisions about the process.
January 9, 2003 at 5:13 pm #82046
It depends on what you want and how you want to use Cpk. Yes, you can use subgroups, but only if that is the way you want to sample in the future as well.
Since the data is not normal, it is possible that the cause of non-normality is the sample size. It has to be investigated.

January 10, 2003 at 4:22 am #82073
DrSeuss
I am going to weigh in on the side of not having enough data, just like everyone else. Here is a slightly different twist: how did you determine that your data set was not normal? Graphs? A statistical test – Anderson-Darling, Chi-Square, etc.? If you just collected your data and punched it into Minitab or some other software, you really have to understand how the software calculates normality. Most statistical packages need a minimum of 25 data points from a truly normal process with correct subgrouping. The farther your process is from a nominal normal distribution, the more data you will need. Typically you would need anywhere from 50 to 200 data points to really have any confidence that a process is truly normal. And these are only general guidelines. There are other factors that drive normality too, such as finding the correct rational subgroup, or making sure your data points are collected at sufficient time intervals for variability and common/special causes to work on your process. All the responses to your question are valid. Look at your process and its inherent variation and make sure your sampling will capture that variation. Otherwise, get some more data – i.e. 75 to 150 data points.
0January 11, 2003 at 5:19 am #82096At first I am very grateful to everyone’s poster. I got so much information from your side!
OK! Let me add more information about what I did. It was all for a First Article validation by a customer. He gave us an Excel sheet. We can collect 30-50 pcs of samples (the samples must be consecutively produced, so I think it is short-term CpK – PpK indeed) and measure the critical dimension of the products, then we put the measurement data into that Excel sheet. It gives us the result at once. Now our 36 pcs of sample data cannot pass the test for normality, and so the CpK cannot be calculated. The process was stable while gathering the samples. Shall I gather more samples and subgroup them to calculate CpK? I will try it out.
Jackey

January 13, 2003 at 2:39 pm #82121
DANG Dinh Cung
I assume your 36 data points are randomly distributed.
With this assumption, you have 36!/(5!*31!) = 376,992 combinations (or subgroups) of five units. So you can say that you have a population of 376,992 sets of five units, each of which has the same probability of appearing during your sampling operation.
From this great number of combinations you randomly choose a sample of at least 50. For each combination, you calculate the mean X-bar and the range R. Then you calculate the mean and standard deviation of X-bar and of R.
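This resampling procedure might be sketched as follows (the data values here are hypothetical stand-ins for the 36 measurements):

```python
import random
import statistics

def resample_subgroups(data, size=5, n_combos=50, seed=0):
    """Draw n_combos random subgroups of the given size and return
    their means (X-bar) and ranges (R)."""
    rng = random.Random(seed)
    xbars, ranges = [], []
    for _ in range(n_combos):
        combo = rng.sample(data, size)
        xbars.append(statistics.mean(combo))
        ranges.append(max(combo) - min(combo))
    return xbars, ranges

# Hypothetical stand-in for the 36 measurements
src = random.Random(1)
data = [src.gauss(10.0, 0.2) for _ in range(36)]
xbars, ranges = resample_subgroups(data)
print(statistics.mean(xbars), statistics.stdev(xbars))
```

Keep in mind the earlier caveat in this thread: the spread of these subgroup means is narrower than the spread of individuals, so they should not be plugged directly into a Cpk formula as if they were individual readings.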
The forum ‘General’ is closed to new topics and replies.