# Fitting a distribution conflict


Viewing 17 posts - 1 through 17 (of 17 total)
• #35189

Marty
Participant

While trying to fit a distribution I found something curious. Notice the Goodness-of-Fit results below.
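Goodness-of-Fit

| Distribution | Anderson-Darling | Pearson correlation coefficient |
|--------------|------------------|---------------------------------|
| Weibull      | 0.890            | 0.987                           |
| Normal       | 0.804            | 0.978                           |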
Based on the Anderson-Darling results I should use a normal distribution, since the smaller number indicates a better fit. However, based on the Pearson correlation coefficient I should use the Weibull distribution, because the larger the number, the better the fit. I found this confusing since they contradict each other. Questions: Would it be better to use the Anderson-Darling, since the delta is larger (0.890 - 0.804 vs. 0.987 - 0.978)? The sample size is n = 30, which I believe is sufficient for the test. The resulting Ppk really doesn’t change much, with normal = 1.49 and Weibull = 1.51, so this isn’t a make-or-break issue. I am just curious why there would be this contradiction, and whether there is a statistical reason to use one over the other in case I do run into a situation where it makes a difference.
Thanks,
Marty

#98314

Isabel
Participant

Can you share the 30 datapoints?

#98315

Marty
Participant

Isabel,
Sorry, I don’t know how to post it in an Excel file.  Here’s the data.  The USL is 0.0080 and has a lower boundary of 0.0000.  These are individual readings, not subgroup averages.

0.0062  0.0017  0.0035  0.0020  0.0027  0.0030
0.0030  0.0036  0.0032  0.0025  0.0035  0.0033
0.0047  0.0027  0.0034  0.0047  0.0032  0.0033
0.0026  0.0028  0.0022  0.0013  0.0040  0.0036
0.0035  0.0047  0.0020  0.0045  0.0017  0.0037

Regards,
marty

#98319

Tim Folkerts
Member

When I run the numbers in Minitab, I get much different results. First of all, a quick check using a histogram makes it look like the data set is not normal to begin with. There seems to be an unusual grouping near 0.0035 and again near 0.0045. The numbers I get are:

| Distribution | AD    | P     |
|--------------|-------|-------|
| Normal       | 0.366 | 0.413 |
| Weibull      | 0.341 | 0.480 |

These numbers don’t seem to support either conclusion, although normal is slightly better for both tests.
Tim F
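For anyone who wants to check the normal-fit numbers outside Minitab, here is a minimal sketch in plain Python. It computes the Anderson-Darling statistic against a normal distribution whose mean and standard deviation are estimated from the sample, then converts it to an approximate p-value using the small-sample adjustment and piecewise formulas commonly attributed to D'Agostino and Stephens. The function name `anderson_darling_normal` is made up for this example, and whether Minitab uses exactly this p-value approximation is an assumption.

```python
import math
from statistics import NormalDist, mean, stdev

# The 30 individual readings Marty posted in this thread.
DATA = [
    0.0062, 0.0017, 0.0035, 0.0020, 0.0027, 0.0030, 0.0030, 0.0036, 0.0032, 0.0025,
    0.0035, 0.0033, 0.0047, 0.0027, 0.0034, 0.0047, 0.0032, 0.0033, 0.0026, 0.0028,
    0.0022, 0.0013, 0.0040, 0.0036, 0.0035, 0.0047, 0.0020, 0.0045, 0.0017, 0.0037,
]


def anderson_darling_normal(xs):
    """Return (A2, approximate p-value) for a normal fit with estimated parameters."""
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # sample mean and n-1 standard deviation
    cdf = [fitted.cdf(x) for x in sorted(xs)]
    # A2 = -n - (1/n) * sum((2i - 1) * [ln F(x_i) + ln(1 - F(x_{n+1-i}))]), i = 1..n
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - total / n
    # Small-sample adjustment, then a piecewise p-value approximation (assumed, see text).
    a2_adj = a2 * (1 + 0.75 / n + 2.25 / n**2)
    if a2_adj >= 0.6:
        p = math.exp(1.2937 - 5.709 * a2_adj + 0.0186 * a2_adj**2)
    elif a2_adj >= 0.34:
        p = math.exp(0.9177 - 4.279 * a2_adj - 1.38 * a2_adj**2)
    elif a2_adj >= 0.2:
        p = 1 - math.exp(-8.318 + 42.796 * a2_adj - 59.938 * a2_adj**2)
    else:
        p = 1 - math.exp(-13.436 + 101.14 * a2_adj - 223.73 * a2_adj**2)
    return a2, p


a2, p = anderson_darling_normal(DATA)
print(f"A2 = {a2:.3f}, approximate p = {p:.3f}")
```

On the posted data this should land close to the normal-fit AD and P values in the table above, which suggests that table comes from a plain normality test rather than from the reliability menu.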

#98326

Scw
Member

You can not simply compare A-D values under different distributions as .89 under Weibull could have lower p-value than .804 under Normal does.
If you have Minitab 14.0, you should be able to get p-values for these A-D statistics. Reading too much into these numbers may not be a good idea. I would take a look at the probability plots and try to understand the nature of the process.

#98345

Marty
Participant

Tim,
Thanks for the reply. I re-ran the stats to confirm the results.  Just so we’re on the same page, I am using Minitab rel. 14.11.  I went to Stats – Reliability/Survival – Distribution Analysis (Right Censoring) – Distribution ID Plot.  Did you use a different method?  Also, I used the LSXY method.
Thanks,

Marty
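A rough sketch of what a least-squares (LSXY-style) probability-plot correlation measures, in plain Python. It ranks the data, assigns plotting positions with Benard's median-rank approximation, and correlates the ordered data with the theoretical scores for each distribution. The plotting-position formula and the helper names `pearson` and `plot_correlations` are assumptions for illustration; Minitab's internals may differ, so do not expect the values to match its session window exactly.

```python
import math
from statistics import NormalDist

# The 30 individual readings Marty posted in this thread.
DATA = [
    0.0062, 0.0017, 0.0035, 0.0020, 0.0027, 0.0030, 0.0030, 0.0036, 0.0032, 0.0025,
    0.0035, 0.0033, 0.0047, 0.0027, 0.0034, 0.0047, 0.0032, 0.0033, 0.0026, 0.0028,
    0.0022, 0.0013, 0.0040, 0.0036, 0.0035, 0.0047, 0.0020, 0.0045, 0.0017, 0.0037,
]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def plot_correlations(xs):
    """Return (r_normal, r_weibull) for the probability plots of xs."""
    n = len(xs)
    ordered = sorted(xs)
    # Benard's median-rank approximation for plotting positions (an assumption).
    ranks = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]
    # Normal plot: data against standard normal quantiles.
    normal_scores = [NormalDist().inv_cdf(p) for p in ranks]
    r_normal = pearson(ordered, normal_scores)
    # Weibull plot: ln(data) against ln(-ln(1 - p)), the linearized Weibull CDF.
    weibull_scores = [math.log(-math.log(1 - p)) for p in ranks]
    r_weibull = pearson([math.log(x) for x in ordered], weibull_scores)
    return r_normal, r_weibull


r_normal, r_weibull = plot_correlations(DATA)
print(f"normal r = {r_normal:.3f}, Weibull r = {r_weibull:.3f}")
```

The correlation only measures how straight the plotted points are, while Anderson-Darling weights the tails of the fitted CDF, so the two can rank a pair of close distributions differently.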

#98349

Marty
Participant

Scw, Could you explain what you mean by:
“You can not simply compare A-D values under different distributions as .89 under Weibull could have lower p-value than .804 under Normal does.”
Do you mean that I shouldn’t compare the AD under Weibull-.890 to the AD under Normal-.804 to choose a distribution fit?  I thought that this was the correct procedure.  If you don’t think so, I’d be interested in reading how you’d interpret this.
Actually, the normality test does give a good result, with a P value of 0.413.  It’s greater than 0.10, so I should assume normality.  However, when I look at the histogram, it really doesn’t look normal.  That’s why I decided to do a goodness-of-fit test.  Using Minitab rel 14.11, I used the output in the session window.  That’s when I noticed the conflict. Maybe I am reading too much into this, as it really didn’t make much difference when looking at Ppk.  However, what if it did?  I was hoping someone could tell me why I see the contradiction in results (AD vs. correlation coefficient) and if there is a good reason that I should use one over the other.
Thanks,
marty

#98350

Gabriel
Participant

Does not look normal? How did you do the histogram?
With only 30 values, you should use 5 bins. For simplicity, I used 6 bins with central values 0.001 to 0.006. Plotting the normal distribution bell curve with the same average and standard deviation as the sample, it looks pretty normal to me. I also plotted the cumulative frequency against the cumulative normal distribution for the same average and standard deviation as in the sample. Again, a close match.
If you want, post your e-mail and I’ll send you the graphs.

#98352

Karthik Subbiah
Participant

Hello There:
If your objective is to find the capability of the process along the dimension for which the observations were made, here are the steps that you need to follow:
1) Your data is one-sided, i.e., LSL = 0 and only the USL is defined.
2) The normal distribution assumption for this one-sided RV is wrong.
3) Typically, a lognormal distribution will work.
4) Convert your data to log base e.
5) In Minitab, go to capability analysis and use your transformed data.
6) Declare LSL as bounded and USL = ln(0.008).
7) Run the program.
8) I had a Ppk = 0.92.
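The steps above can be sketched in a few lines of plain Python. This is only an illustration: it uses the overall sample standard deviation and the upper specification only, so Ppk = (USL - mean) / (3 * sd), computed once on the raw readings and once on the natural-log readings with ln(USL). The helper name `upper_ppk` is hypothetical, and Minitab may apply small corrections that this sketch ignores.

```python
import math
from statistics import mean, stdev

# The 30 individual readings Marty posted in this thread.
DATA = [
    0.0062, 0.0017, 0.0035, 0.0020, 0.0027, 0.0030, 0.0030, 0.0036, 0.0032, 0.0025,
    0.0035, 0.0033, 0.0047, 0.0027, 0.0034, 0.0047, 0.0032, 0.0033, 0.0026, 0.0028,
    0.0022, 0.0013, 0.0040, 0.0036, 0.0035, 0.0047, 0.0020, 0.0045, 0.0017, 0.0037,
]
USL = 0.0080  # upper spec limit from Marty's post; the lower side is a boundary at 0


def upper_ppk(xs, usl):
    """One-sided Ppk against an upper spec, using the overall (n-1) standard deviation."""
    return (usl - mean(xs)) / (3 * stdev(xs))


# Normal assumption: work on the raw readings.
ppk_normal = upper_ppk(DATA, USL)

# Lognormal assumption: transform both the data and the spec limit with ln.
ppk_lognormal = upper_ppk([math.log(x) for x in DATA], math.log(USL))

print(f"Ppk (normal)    = {ppk_normal:.2f}")
print(f"Ppk (lognormal) = {ppk_lognormal:.2f}")
```

The two results should land near the 1.49 Marty reported for the normal fit and the 0.92 above, which shows how much the choice of distribution can matter for Ppk even when both fits look acceptable.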

#98362

Marty
Participant

Gabriel,
I agree that the frequency plot looks good.  Also, the high P value was a good indication of normality.  The histogram made me suspicious, so I did the goodness-of-fit test.  I didn’t even think about how the software was creating the histogram (insert forehead slap).  Using K=1+3.3 logN to calculate the number of cells, I now see the difference in the histogram, and it looks more normal, as you have indicated.  If I had thought about that in the first place, I never would have done the goodness-of-fit test.  Still curious why I see the conflicting results with the goodness-of-fit test, though.
Thanks for the reply. Good catch on the histogram.
marty
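A quick way to see the effect of the bin count is to apply the K = 1 + 3.3 log N rule directly. The sketch below, in plain Python, rounds K up to a whole number of bins (that rounding is an assumption, the thread does not say which way to round) and prints a text histogram with equal-width bins from the smallest to the largest reading. The function names are made up for this example.

```python
import math

# The 30 individual readings Marty posted in this thread.
DATA = [
    0.0062, 0.0017, 0.0035, 0.0020, 0.0027, 0.0030, 0.0030, 0.0036, 0.0032, 0.0025,
    0.0035, 0.0033, 0.0047, 0.0027, 0.0034, 0.0047, 0.0032, 0.0033, 0.0026, 0.0028,
    0.0022, 0.0013, 0.0040, 0.0036, 0.0035, 0.0047, 0.0020, 0.0045, 0.0017, 0.0037,
]


def sturges_bins(n):
    """Number of histogram cells from K = 1 + 3.3 * log10(N), rounded up."""
    return math.ceil(1 + 3.3 * math.log10(n))


def histogram(xs, k):
    """Count xs into k equal-width bins spanning min(xs)..max(xs)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    counts = [0] * k
    for x in xs:
        # The largest value would index one past the end, so clamp it into the last bin.
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1
    return lo, width, counts


k = sturges_bins(len(DATA))
lo, width, counts = histogram(DATA, k)
for i, count in enumerate(counts):
    left = lo + i * width
    print(f"{left:.4f} to {left + width:.4f} | {'#' * count}")
```

Changing `k` by hand to 10 or more shows how quickly a 30-point histogram turns ragged.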

#98363

Tim Folkerts
Member

Marty, interesting. I have Minitab 14.1. I first ran it just using the Graph, Probability Plot menu choice. It plots out the data with a fit to a distribution, along with some basic stats. Just now, I reran it with your method. The two results are indeed quite different. I also tried Stats, Basic Stats, Normality Test, and that gave the same results as my original Probability Plot method. I’m not sure why Minitab gives different answers to basically the same question.
Tim F

#98413

Sean
Member

Why does Minitab use 10 bins by default instead of using K=1+3.3logN?
This would likely solve many normality questions where the histogram in Minitab doesn’t look very normal (because of the number of bins) even though the P value indicates that the data is normal.
SP

#98421

Sean
Member

I have done more digging and answered my own question.  Minitab does calculate the number of bins required, but seems to use too many bins for small sample sizes (or at least more than I would).
I can’t determine what formula Minitab uses to calculate the bin size, but it does seem to work better for large samples than for small ones…
SP

#98424

Dillon
Participant

Sean,
You can e-mail or call Minitab to get your answer. They are generally pretty helpful when it comes to providing the formulas they use to calculate things.  And if you can provide them with ideas on how to do it better (e.g., a different calculation for smaller sample sizes), they will generally work it into future releases.

#98484

Scw
Member

Marty,
That is exactly what I meant. The right procedure is to use the p-value combined with the probability plot and your knowledge of the process.
Also, if you get a high p-value from the normality test, you CAN (not should) assume normality, as you might also get a high p-value from a Weibull test, in which case you can also assume a Weibull distribution. Which one should you use? Your knowledge about the process should play a key role here. A high p-value from the normality test only says you cannot reject the normal assumption, and that situation can change if your sample size increases.

#99077

Philip Whateley
Participant

You need to take care when interpreting distribution fitting tests (which are intended to test the validity of experimental design analysis such as ANOVA) when looking at the extreme tails.
The basic principle for fitting a normal distribution is to use chi-square to fit the middle of the distribution, and then check that there are 5 points remaining in each tail. This means that the goodness of fit is a function of the amount of data.
With 30 data points, you will be fitting the central 2/3 of your distribution, with 5 points being left for each tail. This means that you can characterise +/- 1sigma of your data.
In effect you will never have enough data to characterise any distribution out beyond about +/-3sigma, which is why the relationship between Z-score and proportion defective becomes meaningless beyond this point, as it can only ever be based on your assumption about what distribution your data follows.
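To put rough numbers on the "5 points in each tail" rule, here is a small sketch in plain Python. It assumes a perfectly normal process and asks how many readings are needed before the expected count beyond +k sigma (and, by symmetry, below -k sigma) reaches 5. Reading the rule as an expected count is an interpretation made for this example, and `points_needed` is a made-up helper name.

```python
import math
from statistics import NormalDist


def points_needed(k_sigma, per_tail=5):
    """Smallest n whose expected count beyond +k_sigma is at least per_tail (normal data)."""
    tail_prob = 1 - NormalDist().cdf(k_sigma)  # probability of landing in one tail
    return math.ceil(per_tail / tail_prob)


for k in (1, 2, 3):
    print(f"+/- {k} sigma: about {points_needed(k)} points")
```

The required sample size grows very quickly with k, which is the practical reason the tails beyond about 3 sigma rest on the assumed distribution rather than on data.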
Take a look at “Normality and the Process Behaviour chart”, and “Beyond Capability Confusion” by Don Wheeler (http://www.spcpress.com) for further information.
The important thing to remember is that distributions are theoretical abstractions, which are characterised by parameters. Data can be used to calculate statistics, which under certain circumstances can be used to estimate parameters. This estimation process effectively breaks down in the extreme tails of any distribution because the requirements are:

Exact stability. (impossible for the amount of data needed at the extreme tails)
Continuous data. (i.e. no granularity due to measurement resolution)
For a normal distribution the data must be able to extend to +/- infinity.
To give you an idea of the amount of data needed to fit a normal distribution:
+/- 1sigma

#99154

Jonathon L. Andell
Participant

Marty, I know you have gotten a lot of good inputs. I apologize if I come across as patronizing, but so far the discussion has not mentioned plain old plotting the data:

Some knowledge of the context is important here. What kind of process / product data are you evaluating? What kind of sampling and/or sub-grouping scheme are you using?
Have you plotted the results against time? Do you detect special cause in the pattern?
Have you looked at the actual probability plots? A series of straight, parallel lines on a probability chart may indicate a multi-modal distribution – there’s not an A-D, a P-Value, a K-S, or any other analysis that will reveal that. Nothing substitutes for just plotting the data and seeing what you have.
Other software packages offer the option of a K-S (two Russian names, the second is Smirnov) statistic instead of the A-D. A-D is stronger at detecting departures in the tails of the curve, while K-S is somewhat better at evaluating the middle of the data. But don’t even try it until you have plotted, plotted, and then plotted some more.
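For comparison with the A-D numbers earlier in the thread, the K-S statistic is simply the largest vertical gap between the empirical CDF of the data and the fitted CDF. A minimal sketch in plain Python for a normal fit follows. The function name is hypothetical, and note that when the mean and standard deviation are estimated from the same data, the standard K-S tables do not apply directly (a Lilliefors-type correction is needed), so this sketch reports only the statistic and no p-value.

```python
from statistics import NormalDist, mean, stdev

# The 30 individual readings Marty posted in this thread.
DATA = [
    0.0062, 0.0017, 0.0035, 0.0020, 0.0027, 0.0030, 0.0030, 0.0036, 0.0032, 0.0025,
    0.0035, 0.0033, 0.0047, 0.0027, 0.0034, 0.0047, 0.0032, 0.0033, 0.0026, 0.0028,
    0.0022, 0.0013, 0.0040, 0.0036, 0.0035, 0.0047, 0.0020, 0.0045, 0.0017, 0.0037,
]


def ks_statistic_normal(xs):
    """Largest gap between the empirical CDF and a normal CDF fitted to xs."""
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(sorted(xs)):
        f = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


print(f"K-S D = {ks_statistic_normal(DATA):.3f}")
```

Because D is a single maximum gap, it tends to be driven by the crowded middle of the data, whereas A-D sums weighted gaps and gives the tails more say.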
