Sample size for normal distribution prediction
 This topic has 5 replies, 5 voices, and was last updated 18 years, 8 months ago by Robert Butler.


September 17, 2003 at 3:30 pm #33326
I’d like to know what the minimum sample size needs to be for me to predict that a particular sample was chosen from a normally distributed population.
October 3, 2003 at 3:58 am #90596
To predict a normal distribution, the sample size should be chosen so that all common-cause variation is included, but no special causes are present. It therefore depends on the process.
However, I'd personally suggest more than 100 data points, so that we are closer to the Z distribution.
vivek

October 3, 2003 at 6:55 am #90602
Yashwant M Joshi
As a thumb rule, the sample size should be 10%. However, this 10% has to be carefully selected to sufficiently cover the different subpopulations of the total population. For example, in a manufacturing process we should take a 10% sample from each shift, each process, each vendor supply, and even different seasons (since climatic conditions change drastically); i.e., the sample should represent all the variable parameters. In fact, we need to stratify the data by each variable parameter and then test the theories for each population sample.
Similarly, in the case of marketing, the data needs to be stratified by region, salesman, etc., and then a 10% sample taken for testing of theories. We can also use a chi-square test to verify the representativeness of the sample.

October 3, 2003 at 1:35 pm #90619
Vivek,
I disagree on the need to have a minimum number of data points. While in theory you'd ideally want a large sample size, in practice collecting such data can be prohibitively expensive. I think you need to make a judgement call based on your understanding of variation. If you are looking at sources of variation, it is easy to subgroup according to known sources of variation. But unless you can cover up to, say, 80% of that special-cause variation, there is little sense in collecting 100 points to begin with.

October 5, 2003 at 2:23 am #90647
marklamfu
If the data is from a normal distribution, I think the minimum sample size is 30; more than 50 is preferred.

October 5, 2003 at 4:53 pm #90657
Robert Butler
As phrased, your question does not give enough detail to permit a specific answer. As written, the answer to your question is 2. I realize this sounds like I'm trying to be "cute," but in fact you can attempt an estimate with only 2 points. Obviously, depending on what you are trying to do, such an estimate may either be adequate or of no use whatsoever.
If you are confronted with the task of testing the assumption of normality of a population, you can use the "eyeball" approach of a graphical analysis of your data. You can also use quantitative tests such as the Anderson-Darling, the Chi-Square, or the W test to check the assumption of normality.
If you plot your data on normal probability paper you can examine the plot to determine how well it approximates a straight line. Many computer packages will do this as well as run one of the above tests on a data set of any size to give you a sense of normality.
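[Editor's note: as an illustration of the plot-plus-test approach described here, a minimal sketch assuming Python with NumPy and SciPy is available. These tools, the seed, and the sample size are illustrative choices, not part of the original discussion.]

```python
# Sketch: normal probability plot plus Anderson-Darling test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=64)  # 64 points from a known normal

# Quantitative check: Anderson-Darling test for normality.
result = stats.anderson(data, dist='norm')
print("A-D statistic:", result.statistic)
print("Critical value at 5%:", result.critical_values[2])  # levels: 15, 10, 5, 2.5, 1 (%)

# Graphical check: probplot computes the points of a normal probability plot
# and the correlation r of the least-squares line (r near 1 => nearly straight).
(osm, osr), (slope, intercept, r) = stats.probplot(data, dist='norm')
print("Probability-plot correlation r:", round(r, 4))
```

If the Anderson-Darling statistic exceeds the critical value at the chosen level, normality is rejected at that level; the probability plot then shows where (tails or center) the departure occurs.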
If you have access to a book of graphical tables such as the Rand Corporation plots, you can visually compare your plot against the examples to get a sense of how well your data approximates a normal. In describing such plots, we have the following from Fitting Equations to Data by Daniel and Wood: "Our sample sizes (from a random normal distribution) range from 8 to 384. As might be expected, samples of 8 tell us almost nothing about normality, whereas samples of 384 seem very stable except for their few lowest and highest points. Sets of 16 show shocking wobbles; sets of 32 are visibly better behaved; sets of 64 nearly always appear straight in their central regions but fluctuate at their ends."
The above quote highlights the key problem associated with using quantitative tests without also examining a graphical representation of your data. Some tests are more sensitive to variation in the ends of the probability plot and others are sensitive to variation in the central region. Thus, even random data taken from a known normal distribution could fail a normality test if the test and the data were mismatched.
The Chi-Square test does not do well with small data sets, and there are additional issues surrounding the arbitrary arrangement of data into cells.
The W test (Shapiro-Wilk) is a very good test for normality when you have a data set with fewer than 50 points. If you don't have this option available in your statistical software, you can consult Hahn and Shapiro, Statistical Models in Engineering, pp. 294-297 of the first edition. You will also need Table IX (in that book) to complete the calculations. In the past, I have used a combination of graphical plots and the W test to investigate the normal properties of data sets with as few as 10 data points.
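[Editor's note: a small sketch of the W test on a 10-point sample, assuming Python with SciPy; the manual route via Hahn and Shapiro's Table IX is the one described in the post.]

```python
# Sketch: Shapiro-Wilk (W) test on a small data set (n = 10).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
small = rng.normal(loc=0.0, scale=1.0, size=10)  # as few as 10 points

w_stat, p_value = stats.shapiro(small)
# A large p-value means the data are consistent with normality;
# as noted in the thread, it can never prove the population is normal.
print(f"W = {w_stat:.3f}, p = {p_value:.3f}")
```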
As a final point, it should be remembered that statistical tests provide objective methods for testing whether or not an assumed distribution provides an adequate description of the observed data. They never allow one to prove that the assumed distribution is the correct one.
The forum ‘General’ is closed to new topics and replies.