Does My Data Follow a Normal Distribution?
 This topic has 11 replies, 7 voices, and was last updated 1 month, 2 weeks ago by Fausto Galetto.


July 10, 2020 at 9:47 pm #248859
Lucas Paes Pimenta, Participant (@lucaspaesp)
Hello everyone!
I wonder if somebody could help me. I’m doing a process analysis of a company for my final term.
The process is “assembly of books”. The measurement system consists of weighing each book and comparing the result with the “standard” weight, that is, how much the book should weigh. So, the data have positive and negative values. The scale has a resolution of only 0.005 kg, so, converting to grams, all my data fall on -10, -5, 0, 5, 10 grams (the specification is -15 to +15 grams). The data are also collected in subgroups of 5 parts, and then the mean is calculated and plotted on the control chart.
I’m running a normality test (Anderson-Darling) in Minitab. When I use all the data, the result is a non-normal distribution. But if I run the test on the subgroup means, it’s normal. Which one is correct?
Thanks!
July 11, 2020 at 10:19 am #248872
Robert Butler, Participant (@rbutler)
The central limit theorem applies to distributions of means, so the fact that the distribution of your means passes the Anderson-Darling test isn’t too surprising.
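A minimal sketch of why the subgroup means can behave differently from the individual readings, using only the Python standard library. The process spread of 6 g, the seed, and the helper name `round_to_scale` are assumptions for illustration; none of them come from the thread.

```python
import random

RESOLUTION_G = 5      # scale resolution in grams (from the thread)
ASSUMED_SD_G = 6.0    # assumed process spread; not given in the thread


def round_to_scale(x, step=RESOLUTION_G):
    """Round a true deviation to the nearest value the scale can report."""
    return step * round(x / step)


random.seed(1)  # arbitrary seed so the sketch is repeatable
readings = [round_to_scale(random.gauss(0, ASSUMED_SD_G)) for _ in range(150)]

# 30 subgroups of 5 consecutive readings, as described in the original post
subgroups = [readings[i:i + 5] for i in range(0, len(readings), 5)]
means = [sum(group) / len(group) for group in subgroups]

print("distinct individual values:", sorted(set(readings)))
print("distinct subgroup means:   ", sorted(set(means)))
```

The individual readings can only land on multiples of 5 g, while the mean of five such readings can land on any whole gram. Averaging therefore both smooths the shape and refines the grid, which is probably part of the reason the means pass a test that the raw values fail.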
I’m not trying to be nasty or mean spirited here but the bigger issue is a basic mistake you have made – your post suggests you didn’t take the time to really examine your data. As written, all you have done is taken some data, dumped it into a normality test, found the test indicates the data is non-normal, and decided that it is, in fact, non-normal to a degree that really matters with respect to what you are doing.
The Anderson-Darling test, along with the other tests, is extremely sensitive to any deviation from ideal normality. Indeed, it is possible for data taken from a generator of random numbers with an underlying normal distribution to fail one or more of these tests.
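To make that sensitivity concrete, here is a hedged sketch that computes the Anderson-Darling statistic by hand for an idealized normal sample and for the same sample after rounding to a 5 g scale. The 6 g spread, the use of idealized normal scores instead of real measurements, and the function name `anderson_darling` are assumptions. The small-sample adjustment and the 5% critical value of 0.752 are the commonly tabulated ones for the case where mean and standard deviation are estimated from the data; this is not necessarily how Minitab reports its result.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling(data):
    """Return the adjusted A^2 statistic for a normal fit with estimated mean and sd."""
    x = sorted(data)
    n = len(x)
    fitted = NormalDist(mean(x), stdev(x))
    cdf = [fitted.cdf(v) for v in x]
    # Sum over the ordered sample, pairing each point with its mirror from the other tail
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - total / n
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)


N = 150
ASSUMED_SD_G = 6.0     # assumed spread; not given in the thread
CRITICAL_5PCT = 0.752  # tabulated 5% critical value for the adjusted statistic

std_normal = NormalDist()
# An idealized, perfectly "normal-looking" sample built from expected normal scores
ideal = [ASSUMED_SD_G * std_normal.inv_cdf((i - 0.375) / (N + 0.25))
         for i in range(1, N + 1)]
# The same sample as a scale with 5 g resolution would report it
rounded = [5 * round(v / 5) for v in ideal]

print("A^2, ideal sample:  ", round(anderson_darling(ideal), 3))
print("A^2, rounded sample:", round(anderson_darling(rounded), 3))
```

Under these assumptions the only difference between the two samples is the rounding, yet the rounded one lands above the critical value. A failed test on coarse data says as much about the scale as about the process.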
Before you do an analysis you need to examine your data and that means the very first thing to do is plot the data in any way that is meaningful and see what you see. In this case, the minimum you should do is generate a histogram, generate a normality plot and a boxplot of the data. Given what you are doing I’d also run a time plot of the data to see if there is any obvious underlying trending over time.
Once you have the normality plot you should look at it to see if it deviates from the reference straight line and then apply what is termed “the fat pencil test”. This amounts to looking at the plot and checking to see if the plotted data can be covered with a fat pencil.
The various plots will tell you the following:
1. The histogram will give you an idea of the overall shape of the distribution of individual points. The questions you would want to investigate with this plot would be:
a. Is the plot unimodal? – If it is bimodal then you have work to do.
b. Does the overall plot provide a visual appearance of something that is approximately normal?
c. Is there any data that gives a visual impression of being outliers? Are there just a couple or are there a lot (yes, this is a judgement call)?
d. If there are just a few visual outliers, drop them from consideration and rerun your plots and your statistical tests. What do you see? Does everything change or do things stay much as they were?
2. The normal probability plot will give you a much better sense of the approximate normality of the data.
a. If the plot doesn’t approximate a fit to the reference straight line but veers off sharply at either the low or the high end, or exhibits clean breaks in the data with the subsets approximating straight lines that have large differences in slopes, then you have work to do.
b. On the other hand, if the data is randomly scattered about either side of the line or if it is staying in the relative vicinity of the line but randomly drifting above and below the straight line then it is safe to assume the data approximates normality to an acceptable degree.
3. The boxplot will not only give you a good sense of the distribution shape, it will also clearly characterize the behavior of the data in the tails as well as in the central part of the data distribution.
4. A time plot will tell you if your process is changing over time.
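For readers without a plotting package at hand, the coordinates of a normal probability plot can be computed with the standard library alone. This is a rough sketch: the counts used to build `readings` are made up, the plotting positions `(i - 0.375) / (n + 0.25)` are one common choice among several, and the correlation returned by the hypothetical helper `straightness` is only a numerical companion to looking at the plot, not a replacement for it.

```python
import math
from statistics import NormalDist


def normal_plot_points(data):
    """Pair each sorted observation with its expected standard-normal score."""
    x = sorted(data)
    n = len(x)
    std_normal = NormalDist()
    scores = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    return list(zip(scores, x))


def straightness(data):
    """Pearson correlation between the normal scores and the sorted data."""
    pts = normal_plot_points(data)
    n = len(pts)
    mz = sum(z for z, _ in pts) / n
    mx = sum(x for _, x in pts) / n
    szx = sum((z - mz) * (x - mx) for z, x in pts)
    szz = sum((z - mz) ** 2 for z, _ in pts)
    sxx = sum((x - mx) ** 2 for _, x in pts)
    return szx / math.sqrt(szz * sxx)


# Made-up counts on a 5 g grid (150 points); the thread does not give the real data
readings = [-15] * 3 + [-10] * 13 + [-5] * 35 + [0] * 48 + [5] * 35 + [10] * 13 + [15] * 3
# A deliberately lopsided sample for contrast
lopsided = [0] * 9 + [100]

print("granular but symmetric:", round(straightness(readings), 3))
print("one extreme value:     ", round(straightness(lopsided), 3))
```

Plotting the pairs from `normal_plot_points` with the score on one axis and the data on the other gives the usual probability plot. Granular data shows up as a staircase that still tracks the reference line, which is the situation the fat pencil test is meant to forgive.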
There are many individuals who view plotting data as something only a child should do. They find the idea of plotting data before running an analysis to be somewhat insulting and beneath their dignity. What they fail to understand is that a meaningful graph IS a statistical analysis.
I’m a statistician and I could bore you to tears with story after story about running an analysis that began and ended with basic plots of the data – in other words – once my engineer/doctor/line manager/scientist/division head/CEO/technician looked at the plots I had generated the project ended because the graphs told us everything we needed to know with respect to the source of the problem and its solution.
Some recommended reading:
In order to get some idea of how non-normal data generated using a random number generator with an underlying normal distribution can look, I would recommend you borrow a copy of Fitting Equations to Data – Daniel and Wood and look at the probability plots in the appendices of Chapter 3.
In order to get some idea of the power of real graphs I would recommend you borrow all four of Tufte’s books on graphs and graphical methods (The Visual Display of Quantitative Information, Visual Explanations, Envisioning Information and Beautiful Evidence) and “read” them. I put “read” in quotes because, while there is text in his books, the books are graphs, all kinds of graphs, and he provides the reader with a very clear visual understanding of graphical excellence and what a proper graph can do.
July 12, 2020 at 6:45 pm #248883
Lucas Paes Pimenta, Participant (@lucaspaesp)
Robert,
Thanks for your reply and explanation. There is a lot of information there that I didn’t know. I’ll for sure take a look at the reading recommendations.
One reason that made me post this here was the histogram. I generated it, and it looked “different” (it’s in the attachment). Since the weight scale has limitations on the values it can measure, it’s like the data can’t assume all the values within the permitted variation. (For example, it will never assume 1,2,3,4 or 5 grams.)
Can we say that the data isn’t continuous but discrete? Or that the measurement system isn’t appropriate? Or can the analysis still be done with this set of data?
July 12, 2020 at 11:06 pm #248889
Strayer, Participant (@Straydog)
It appears that your measurement system conflates discrete and continuous data. I’d suggest that you’re overcomplicating and “going down the garden path”. Use the continuous measurements without collecting into subgroups. Don’t use your subgroups, treating them as discrete. They are not. For a histogram, use proportional division of the continuous data rather than predetermined buckets.
July 13, 2020 at 8:16 am #248892
Robert Butler, Participant (@rbutler)
@Straydog is correct – no subgrouping here just the individual data points. Once you have a histogram built as he recommended (I’d also recommend plotting the data on a normal probability plot) post the graphs here and perhaps I or @Straydog or someone else may be able to offer additional thoughts.
July 13, 2020 at 8:43 pm #248909
Lucas Paes Pimenta, Participant (@lucaspaesp)
So, I think that is the problem. The histogram above is of all the individual data points (a total of 150 points – 30 subgroups of 5 parts). All the data points measured by the system are multiples of five grams. I’m sending a picture with an example.
Still, I managed to do the graphs again, and they’re in the attachment too. Both use all 150 values, not considering the subgroups or means.
What do you think?
July 13, 2020 at 10:05 pm #248913
Robert Butler, Participant (@rbutler)
It looks like your plotting routine for the histogram is running some kind of default binning which is giving you a false impression of your data. If you look at the linear plot of the data the histogram should have vertical bars at the same places -15, -10, -5, 0, 5, 10, 15 but it doesn’t. Rather the histogram looks like the bars are at (maybe) -15, -10, -5, 0(?), 6, 10, 14.
If you know the actual data counts for the measures at -15, -10, -5, 0, 5, 10, 15 just make a two column data set with -15, -10, -5, 0, 5, 10, 15 in one column and the associated counts in the other and make a histogram using that data set. Given that the data is very granular it still looks like the shape of the histogram should be acceptably normal.
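The two-column approach can be sketched in a few lines of Python, with one bar per value the scale can report so that no binning routine gets a chance to move the bars. The counts below are hypothetical placeholders, since the thread does not list the real ones, and `text_histogram` is an illustrative helper name.

```python
# Hypothetical counts for illustration only; the thread does not give the real ones.
counts = {-15: 3, -10: 13, -5: 35, 0: 48, 5: 35, 10: 13, 15: 3}


def text_histogram(counts, step=5, lo=-15, hi=15):
    """Return one text bar per value the scale can report, including empty ones."""
    lines = []
    for value in range(lo, hi + step, step):
        n = counts.get(value, 0)
        lines.append(f"{value:>4} g | {'#' * n} {n}")
    return lines


print("\n".join(text_histogram(counts)))
```

Starting from the raw readings instead of a count table, `collections.Counter(readings)` would produce the same kind of dictionary.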
After rereading your initial post I guess the question I have is what is it that you wanted to do with the data? If you are thinking about process control you could build an individuals control chart using the data but depending on what it is that you are trying to investigate that may not be what you really want to do.
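If an individuals chart does turn out to be what is wanted, its limits follow from the average moving range. A minimal sketch, assuming the conventional constant 2.66 (3 divided by d2 = 1.128 for moving ranges of two points); the readings are made up and `individuals_limits` is an illustrative name.

```python
def individuals_limits(values):
    """Return (LCL, center line, UCL) for an individuals control chart."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    center = sum(values) / len(values)
    half_width = 2.66 * mr_bar  # 2.66 = 3 / d2, with d2 = 1.128 for ranges of 2
    return center - half_width, center, center + half_width


# Made-up deviations from standard weight, in grams, in time order
readings = [0, 5, 0, -5, 0, 10, 5, 0, -5, -10, 0, 5]
lcl, cl, ucl = individuals_limits(readings)
print(f"LCL = {lcl:.1f} g, center = {cl:.1f} g, UCL = {ucl:.1f} g")
```

With readings that move in 5 g steps the moving ranges are themselves multiples of 5, so the limits will be coarse; that is a property of the scale, not of the chart.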
Your focus on differences from an ideal suggests your real concern might be one of checking for random vs nonrandom trending in the differences over time. Since everything is in increments of 5 it might be worth considering examining your data looking at runs above and below the median. The question you would answer with an analysis of this type is – are the differences above and below the median random over time or is there a pattern. If a pattern occurred it would tell you there is special cause variation present and you need to investigate why.
Most good basic statistics textbooks will have the details for assessing runs above and below the median and the topic in the textbook will have the same title.
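As a rough illustration of what such a textbook procedure looks like, here is a sketch of a runs test about the median using the large-sample normal approximation. Dropping points that equal the median is an assumed convention (textbooks differ on how to treat ties), the function name `runs_about_median` is hypothetical, and the example sequence is made up.

```python
import math
from statistics import NormalDist, median


def runs_about_median(values):
    """Runs test about the median using the large-sample normal approximation."""
    med = median(values)
    # Assumed convention: drop points equal to the median before counting runs.
    signs = [v > med for v in values if v != med]
    n_above = sum(signs)
    n_below = len(signs) - n_above
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    n = n_above + n_below
    expected = 2 * n_above * n_below / n + 1
    variance = (2 * n_above * n_below * (2 * n_above * n_below - n)
                / (n ** 2 * (n - 1)))
    z = (runs - expected) / math.sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return runs, expected, z, p_value


# A made-up sequence that drifts steadily upward over time
drifting = list(range(1, 21))
runs, expected, z, p_value = runs_about_median(drifting)
print(f"runs = {runs}, expected = {expected:.1f}, z = {z:.2f}, p = {p_value:.5f}")
```

Far fewer runs than expected points to drift or clustering, far more to alternation. With data in 5 g steps many readings may sit exactly on the median, so the tie convention can matter a good deal and is worth checking against whichever reference is used.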
July 23, 2020 at 2:27 pm #249062
Chris Seider, Participant (@cseider)
If you followed the DMAIC methodology you would have checked your measurement system. My educated guess is it won’t pass a good variable gage R&R. This is PART of the reason your normality tests are failing but don’t fret too much if it doesn’t pass… median shifts are just as valid for statistical confidence – yes I know how my friend R. Butler may be rolling his eyes towards me! :)
July 24, 2020 at 10:44 am #249079
Mike Carnell, Participant (@MikeCarnell)
@cseider Just for clarification, are you suggesting not checking or fixing the measurement system and just running nonparametric tests instead?
July 24, 2020 at 10:47 am #249080
Chris Seider, Participant (@cseider)
I’d say fix the measurement system.
Just saying that AFTER it is fixed, the Anderson-Darling test would probably come to another conclusion.
August 29, 2020 at 6:57 pm #249648
MBBinWI, Participant (@MBBinWI)
@lucaspaesp – a lot of good discussion here, and I hope you’ve learned some things. Fundamentally, you have a measurement system that doesn’t have sufficient precision. There are several techniques that I might suggest, but the easiest is to use the raw data instead of subgroups (as it seems you plotted in your graph #2).
September 1, 2020 at 4:36 am #249696
Fausto Galetto, Participant (@fausto.galetto)
IF YOU had provided the data one could have shown you how to deal with YOUR Control Chart…
DATA allow making analyses, Graphs sometimes are useful…