Sample size…Why 30?
Six Sigma – iSixSigma › Forums › Old Forums › General › Sample size…Why 30?
- This topic has 102 replies, 86 voices, and was last updated 12 years, 3 months ago by Karim.
July 18, 2002 at 5:42 pm #29909
We recently concluded a GB training, and the question came up of why a sample size of 30 is suitable and where that number comes from. (What if my population is smaller than this, or the testing is destructive in nature?) I'm lost, as there is conflicting guidance. One reference says determining the sample size depends on (1) the level of confidence, (2) the margin of error tolerated, and (3) the variability in the population studied. Another says…
n = (z·s/E)². These do not take the population size into consideration, do they? If I am faced with a transactional example and have 100,000 accounts (the population) that are in default, and I want to sample them to determine how many have credit scores less than 615, what sample size would be reflective of the population without offsetting the benefit by the time spent gathering data?
When I pose this question (thinking 30 may not reflect my population) I remain unsatisfied. I'm told that it's due to large samples related to the central limit theorem, or that a typical run chart that is in control is stable after 30 data points and that's why 30 is used. I can't find this in any reference materials… can anyone help?? Please provide an example if needed… I need layman's terms, as I am not too statistically inclined.

July 18, 2002 at 7:09 pm #77364
Sambuddha (Member)
DT,
You ask a very good question. There are various responses based on situation, tool you are using, type of data.
One reference says determining the sample size depends on (1) the level of confidence (2) margin of error tolerated, and (3) variability in population studied.
The above reference is right in general. Formulae for sample size calculation vary depending on the test you are going to conduct.
The parameters/issues you need to address are:
Type of test, e.g. 2-sample t, Z, ANOVA, etc.
Standard Deviation (variability) of the process
Delta that is significant in distinguishing 2 or more effects
Alpha (level of significance of the test)
Power of the test (1-beta). Beta is the probability of type-II error
Number of levels (ANOVA), in case you know how many levels/effects you are aiming to distinguish.
Sample size
The interesting part is that Minitab (assuming you would use it) allows you to vary any 2 of Delta, Power and Sample size for any given number of levels. Try Stat > Power and Sample Size > ANOVA, or the tool you want to use.
That lets you know the error (or lack thereof, since you are measuring power) associated with each sample size and delta for any given setting. So you could make a trade-off study and see where your sweet spot lies. In cases where testing involves capital and consumables, this is a great tool. In your case you have the data, so it is not resource-intensive that way. Still, this is better than using 30 blindly.
There is a reason 30 is widely used. It is a result of simulation studies involving the Central Limit Theorem. If you are interested in the “history” or reason for the prevalence of 30 samples as guidance, I could give you a few pointers.
I have a project that is similar in tool usage. There are quite a few neat things one could do with power and sample size studies.
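The trade-off study described above can also be sketched without Minitab; here is a minimal Python version, assuming a 2-sample comparison of means and using the usual normal approximation (the function name and the illustrative deltas are my own, not from the post):

```python
# Rough power / sample-size trade-off for a 2-sample comparison of means,
# using the normal approximation. Minitab's exact t-based answers are
# slightly larger for small n; values below are illustrative only.
from scipy.stats import norm

def n_per_group(delta_in_sigmas, power=0.80, alpha=0.05):
    """Approximate n per group to detect a shift of delta (in units of the
    process standard deviation) with the given power and significance level."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * ((z_alpha + z_power) / delta_in_sigmas) ** 2

for delta in (0.25, 0.5, 1.0, 2.0):
    print(f"delta = {delta:4} sigma -> n per group ~ {n_per_group(delta):.1f}")
```

Fixing any two of delta, power, and n and solving for the third is exactly the trade-off the Minitab menu exposes; note how 30 is adequate for roughly a 0.75-sigma shift but nowhere near enough for a 0.25-sigma one.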
Good luck,
Sambuddha
July 18, 2002 at 7:19 pm #77365
I would be VERY interested in the pointers for the history or relevance of the 30 samples… if it's easier to email…
[email protected]

July 18, 2002 at 7:51 pm #77366
Sambuddha (Member)
DT,
Check your email. I have sent some information.
Hope that helps.
Best,
Sambuddha

July 18, 2002 at 8:50 pm #77368
Sambuddha:
Hi. I am very new to this forum.
Can you send me the information that you sent to DT on why a sample size of 30 is required? I too am curious.
Thanks.

July 18, 2002 at 9:17 pm #77369
Sambuddha (Member)
Hrishi,
No problem. Post your email address; either DT or I could email you.
The reason I cannot post it here is that it is a scanned picture attachment, and it is perhaps easier to email.
Best and welcome to this community,
Sam

July 19, 2002 at 11:25 am #77376
Gabriel (Participant)
Sambuddha
You can attach it here and share it with all the forum. It would be great!
Just click on the clip here at the right, where you read “Post/attach document”. It will lead you to send an email to iSixSigma with the attachment and they will post the attachment here!
Thanks for sharing!

July 19, 2002 at 11:52 am #77377
Sambuddha
Yes, as Gabriel says, you can post it on this site. I am curious too.
Thanks for sharing
DD

July 19, 2002 at 1:22 pm #77379
Sambuddha (Member)
Gabriel, DD
I thought of posting it here. Attaching was a small hassle. But it looks like I have a bigger problem: it is a scanned picture of some graphs, and, mea culpa, I cannot find the reference I took it from. I am buried amidst a bunch of books and can't find it.
I have no problem sharing it with you all individually through email. But I am afraid that if I post it publicly without credits, I might be in trouble for copyright violation, for public distribution of intellectual property.
The good news is that the following website illustrates the same thing.
http://www.statisticalengineering.com/central_limit_theorem.htm
Public domain is great, isn't it? Hopefully that will satisfy your curiosity.
The number 30 came as a result of simple sampling simulations from different parent populations (Uniform, Normal, Exponential, Triangular): by the time the sample sizes reached 30-32, the distribution of the means started looking normal. That is the reason for the rule of thumb.
I haven't seen any theoretical explanation for it yet, i.e. what is so special about 30 from an analytical point of view. I shall let you all know if I come across anything to that effect.
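Since the scanned source is unavailable, here is a small re-creation of the kind of simulation described (my own sketch; the seed, trial count and parent shapes are illustrative):

```python
# Re-creation of the sampling simulation described above: for several
# non-normal parent populations, the distribution of means of samples of
# size 30 is already close to normal (skewness near 0). A sketch only.
import numpy as np

rng = np.random.default_rng(1)
n, trials = 30, 20_000

parents = {
    "uniform":     lambda size: rng.uniform(0.0, 1.0, size),
    "exponential": lambda size: rng.exponential(1.0, size),
    "triangular":  lambda size: rng.triangular(0.0, 0.2, 1.0, size),
}

def skewness(x):
    c = x - x.mean()
    return (c**3).mean() / (c**2).mean() ** 1.5

for name, draw in parents.items():
    means = draw((trials, n)).mean(axis=1)
    # Skewness of the means shrinks roughly like parent skew / sqrt(n):
    # the exponential's skew of 2 drops to about 2/sqrt(30) ~ 0.37.
    print(f"{name:12s} skewness of sample means: {skewness(means):+.3f}")
```

Nothing magic happens exactly at 30; the skewness just fades smoothly with 1/√n, and 30 is where even quite skewed parents look "normal enough" for most purposes.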
Hope it still helps.
Best,
Sambuddha

July 19, 2002 at 7:14 pm #77399
Sambuddha,
Could I also have the info on the sample size of 30? I would appreciate you emailing it to me at
[email protected]
July 20, 2002 at 1:47 pm #77410
Rajanga Sivakumar (Participant)
Mr. Sambuddha,
Could you share the sample-size-of-30 material with me too? Thanks.
email to [email protected]
Rajanga
July 20, 2002 at 2:06 pm #77412
Setting aside the ability of most software to simply sort and count the numbers in the population of 100,000 records you wish to sample, there are two questions you have to ask: how many do I take, and what risk can I accept of drawing the wrong conclusion from the statistic?
A number of answers here address why 30 samples are needed to approximate a normal distribution, allowing estimation based on the probabilities of the normal curve. However, once the mean and std dev have been estimated and the cumulative probability found up to and including the critical limit you set, the second question comes into play: specifically, how sensitive are you to making an error in assigning that proportion to your population?
As an example, say 6% of your population is expected to fall below your cut-off: how sensitive are you to the true proportion actually being 7%, 8%, 10%, etc.? You would need to calculate the beta risk of assuming the proportion is 6% given your original sample size and the statistics you calculated. If you use Minitab (or perhaps other software) you can adjust the minimum sample size you need to take for the risk you choose. Under the Power and Sample Size tab, 1 Proportion test, you can enter both the calculated proportion (as a percentage) and the critical proportion, along with the level of risk (beta), and it will calculate the number of samples you need to take. Go back, resample to that level, run the calculation again to find the proportion defective (credit scores below your cut-off), and rerun the beta calculation with the new numbers. The process is iterative until you are satisfied with the number of samples vs the risk you are willing to assume. So you might start with a sample of 30, find that the beta risk is too high and have to take 400 samples, do so, recalculate, and find that you actually need 435, etc. Others here might have a better way to adjust for risk and sample size without all the iterations, but that's the only way I've found to consistently do it.
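A non-iterative starting point for the original 100,000-account question is the textbook large-sample formula for estimating a proportion, with a finite-population correction bolted on (a sketch; the 6% planning value and the ±2-point margin are my illustrative assumptions, not from the thread):

```python
# Sample size to estimate a proportion to within +/- margin, with an
# optional finite-population correction. The 6% planning value and the
# 2% margin below are illustrative assumptions, not from the thread.
from math import ceil
from scipy.stats import norm

def sample_size_proportion(p_guess, margin, confidence=0.95, population=None):
    z = norm.ppf(0.5 + confidence / 2.0)
    n0 = (z / margin) ** 2 * p_guess * (1 - p_guess)
    if population is not None:              # finite-population correction
        n0 = n0 / (1 + (n0 - 1) / population)
    return ceil(n0)

# ~6% of 100,000 defaulted accounts expected below a 615 credit score,
# estimated to +/- 2 percentage points at 95% confidence:
print(sample_size_proportion(0.06, 0.02, population=100_000))
```

Note that the population size barely matters here: with N = 100,000 the correction shaves only a handful of samples off the infinite-population answer, which makes precise the asker's intuition that the formula "does not take the population into consideration".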
My other question for you, however, is what you plan on using the data for. Be careful if the intent is to show that you get higher numbers of defaults with credit scores below a certain number while using those accounts already in default as your population for the hypothesis. Your choice of frame for the population would be wrong in that kind of test.
Hope that helps.

July 20, 2002 at 11:07 pm #77415
Hi Sambuddha,
Could you also share the sample size of 30 info with me as well? Please email [email protected]. Thanks!
Jay
July 22, 2002 at 5:48 am #77425
zhou (Participant)
Dear Sambuddha,
I only read about this discussion topic from the newsletter link today. Could you also send me the 30-sample-size information in a separate email? I appreciate it.
Best regards,
Lawrence.

July 22, 2002 at 7:42 am #77427
Glenn Gooding (Participant)
Sam,
Along with a great many of our colleagues, I would be interested in, and grateful for, a copy of the information about the rationale for the 30-piece sample size.
My e-mail address is:
[email protected]
regards
Glenn
July 22, 2002 at 9:42 am #77429
Dear Sambuddha,
Could you also send me the 30-sample-size information in a separate email? I appreciate it. [email protected]
Best regards,
JA.

July 22, 2002 at 11:43 am #77431
Nicholas L. Squeglia (Participant)
In layman's terms, if you were to prepare a graph from attribute data plotting sample size vs confidence, you would see that there is quite a difference from, for example, 2 to 30. This is not the optimum, but more of a minimum; 50 would perhaps be a better choice and is what Dorian Shainin used in his “lot plot” many years ago. The curve keeps climbing after 30/50, but at a much lower rate. The central limit theorem is somewhat different: it relies on taking averages of data to show a normal, Gaussian, distribution for control chart purposes even though the underlying data is non-normal.
Nicholas L. Squeglia, author, Zero Acceptance Number (c=0) Sampling Plans

July 22, 2002 at 11:55 am #77432
Sambuddha,
Could you please forward the sample size information to me as well? [email protected] Thank you.
Cedric

July 22, 2002 at 2:43 pm #77442
Janet Hunter (Participant)
I believe you are correct about the Central Limit Theorem, at least as I recall from my statistics classes a few years ago. You may want to contact the local college and speak to one of the professors in the mathematics department for further direction or confirmation.
July 22, 2002 at 2:59 pm #77444
Mike Carnell (Participant)
DT,
I have not read the entire string, so I apologize if some of this is redundant. Sam gave a good answer when he said it was different for different situations.
Assuming that everything works off of 30 is incorrect.
Frequently you will see variable control charts listed with a sample size of 25-30. They are typically speaking of 25-30 subgroups of 5, which makes it 125-150 actual samples. It is the subgrouping, giving you a distribution of averages (Central Limit Theorem), that makes it work.
When you are doing hypothesis testing and using ANOVA, the sensitivity of the test is extremely dependent on sample size. I was doing site support and found a guy who could not understand why his 2-sample t test was showing significance. He was sure it should not. His sample size was >400; the test was sensitive to less than a 0.1 sigma shift.
There are sample size implications with virtually every tool. 30 is not a catch-all, particularly if you are working with attribute data.
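The large-sample anecdote above can be illustrated with a quick calculation (my own sketch, using a normal approximation and illustrative n values): the mean shift that just reaches p < 0.05 in a 2-sample comparison shrinks like 1/√n.

```python
# Smallest standardized shift that a 2-sample comparison of means flags
# as significant at alpha = 0.05, versus per-group sample size (normal
# approximation; a sketch to illustrate the anecdote above).
from math import sqrt
from scipy.stats import norm

alpha = 0.05
z = norm.ppf(1 - alpha / 2)

for n in (30, 100, 400, 1600):
    d_crit = z * sqrt(2.0 / n)  # |mean1 - mean2| / sigma at the p = alpha boundary
    print(f"n = {n:4d} per group -> a {d_crit:.2f} sigma shift is already 'significant'")
```

Which is the point in reverse: with huge samples, statistical significance stops implying practical significance.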
Good luck.

July 22, 2002 at 3:48 pm #77447
Sam:
Can you also send me your e-mail response to the question dealing with a sample size of 30?

July 22, 2002 at 7:26 pm #77460
Don't confuse population sampling with process sampling; these are two very different animals. You need population variation, power, etc. when considering population sampling. When process sampling, the purpose of taking 30 samples is to establish these quantities with reasonable certainty and develop control limits. These limits remain constant unless significant changes are made to the process. It is common in SS training that the true essence of the mathematics behind these issues is lost. Process sampling also assumes a process that is in statistical control. If it is not, stop, fix the process, then proceed.
July 23, 2002 at 6:00 pm #77514
Dear Sambuddha,
Could I trouble you to send me the information also?
[email protected]
Thanks
Bahram

July 23, 2002 at 6:36 pm #77515
Dewayne (Participant)
Sambuddha,
I, too, would appreciate your sharing/sending the information on the sample size of 30. Thanks.
[email protected]

July 26, 2002 at 1:44 pm #77631
I have also not read all the previous replies on this subject, but from my experience in statistics the reason 30 or 31 has always been the magical number is that Student's t distribution approaches the normal z distribution at about 30 samples. Hope this helps.
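That convergence is easy to check numerically (a minimal sketch; the 97.5th percentile and the df values are illustrative):

```python
# Student's t critical values vs the normal z value: by ~30 degrees of
# freedom the two differ by only about 4% (a check of the claim above).
from scipy.stats import norm, t

z = norm.ppf(0.975)          # ~1.96
for df in (5, 10, 30, 100):
    tc = t.ppf(0.975, df)
    print(f"df = {df:3d}: t = {tc:.3f}, z = {z:.3f}, gap = {100 * (tc - z) / z:.1f}%")
```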
-NB

July 27, 2002 at 6:16 am #77649
H.Kirchhausen (Participant)
Hi all,
It would be great if you could also send me the information or an example regarding the magic sample size of 30!
Thanks in advance. Please send it to
[email protected]
July 28, 2002 at 11:39 am #77657
Hello Sambuddha,
I would appreciate it if you could email the 30-sample-size info to me, too.
My email address is: [email protected]

July 30, 2002 at 3:33 pm #77711
Ged Bryant (Participant)
Want to test the magic number 30? Take any group of 30 people; it makes a good party trick. Bet anyone present that two or more of the group will have the same birthday, month and date. The true odds with 30 people are about 71% (the often-quoted 98% actually needs closer to 55 people).
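The math behind the trick is the classic birthday problem, and the exact odds are easy to compute (a minimal sketch, assuming a 365-day year and uniformly distributed birthdays):

```python
# The classic birthday problem behind the party trick above: probability
# that at least two of n people share a birthday, assuming a 365-day year.
def p_shared_birthday(n):
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (365 - k) / 365.0
    return 1.0 - p_all_distinct

for n in (23, 30, 50):
    print(f"{n:2d} people: {p_shared_birthday(n):.1%} chance of a shared birthday")
```

With 30 people the bet wins about 71% of the time; it passes 50% already at 23 people, and takes about 50 to reach 97%.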
July 30, 2002 at 3:52 pm #77712
I did that game in a training class once. It worked!
But how does it work? I'd love to know so I can sound intelligent next time I do it :).

August 15, 2002 at 7:48 pm #78161
julio jaime (Participant)
Sambuddha:
Hi. I am very new to this forum.
Can you send me the information on why a sample size of 30 is required? I too am curious.
Thanks.

August 6, 2003 at 5:43 pm #88659
Allen Jacque (Participant)
I am interested in receiving the articles identified above, originated by Sambuddha, that address the sample-size-of-30 issue.
My email address is [email protected]

August 10, 2003 at 4:28 am #88752
Mark Chockalingam (Participant)
There are several interesting web references on the Central Limit Theorem and 30. When the sample size approaches 30, we don't have to worry about the distribution of the population, since it can be safely assumed to be normal for inference purposes. Here are some references:
http://www.mathwizz.com/statistics/help/help4.htm
http://www.statisticalengineering.com/central_limit_theorem.htm
Here is a little more technical article on normal distributions and central limit theorem.
http://www.itl.nist.gov/div898/handbook/index.htm

August 10, 2003 at 4:16 pm #88755
Thanachai (Member)
Mr. Sambuddha, may you please send me the pointers on the sample size of 30? I'm very curious to know.
Thanachai S, [email protected]
August 13, 2003 at 7:13 am #88837
Statistician (Member)
Mr. Sambuddha,
I am a statistician by profession, and as far as I know, sample size is determined by the margin of error allowed, the estimate of the population variance, the risk factors (level of confidence, power as a function of the OC curve, etc.), and most importantly, the assumed distribution of the population (or the estimable function) under study.
In my own experience, the magic number 30 is being used to approximate the normal distribution using the Central Limit Theorem, as used in regression analysis, factor analysis, etc., but not in sample size determination.
I am also curious about this article. Would you be kind enough to send me a copy, too? Also, would you happen to know of, or recommend, Six Sigma training centers in the Philippines?
Thanks,
Beryl
[email protected]
September 23, 2003 at 3:26 pm #90196
Hi Sambuddha,
You have been inundated with requests for this information on n=30; I am also a statistician and would really appreciate it! Thanks.
September 23, 2003 at 5:36 pm #90200
I would also like to receive the information / pointers. Can you email them to me at the address below?
[email protected]
Thanks!
PH

September 25, 2003 at 4:07 pm #90282
I have had the same question arise in the past. Mark L. Crossley wrote a good article titled “Size Matters: How Good Is Your Cpk Really?”, located at http://www.qualitydigest.com/may00/html/lastword.html, that seems to address your question quite well. When Mr. Crossley's equations are rearranged, you can look at plots of Cpk vs sample size, with lines of constant Cpk at specific confidence levels. For example, after generating the curves, one is able to directly determine the sample size required for a desired Cpk of 2.0 with 90% confidence. In playing with the equations, it was interesting to note the confidence level obtained for a 2.0 Cpk using a common sample size of 30.
I hope that helps.

October 6, 2003 at 3:24 am #90663
I would also like to receive the information/pointers about the sample size of 30. My email address is [email protected]
thanks!
Tom

October 13, 2003 at 4:13 am #90933
Hi Sam, please add me to the distribution list! This thread has lasted a long time, hasn't it? Thank you in advance!
[email protected]

December 2, 2003 at 3:36 pm #93129
Dear Sambuddha,
I too am interested in the “Why 30?” discussion. Please e-mail me at:
[email protected]
Thank you,
Haim

December 2, 2003 at 7:03 pm #93132
Rocky Firth (Member)
I would also like to see the information. I can post it to a web location for others as well.
December 3, 2003 at 2:28 am #93145
I don't believe 30 came from simulations involving the CLT. Please post some backup for this assertion. Sounds like Dr. Mikel's proof of the 1.5 shift.
By the way, what advice do you give on choosing sample size when you are interested in reducing sigma instead of moving the mean?

June 7, 2004 at 6:36 pm #101346
Please let me know why 30.
SK

July 27, 2004 at 6:23 am #104412
SATTHISH KUMAR (Member)
Dear Sambuddha,
Thank you for your reply to that query. Now I am interested in how variables like slope, linearity, bias, and uncertainty relate to instrument repeatability and reproducibility.
Can you send the same to my mail id [email protected]?
Expecting your reply.
REGARDS
R.L.SATTHISH KUMAR
August 11, 2004 at 4:02 am #105409
I would be VERY interested in the pointers for the history or relevance of the 30 samples.

September 17, 2004 at 6:28 am #107505
Pls email me the info for “why sample = 30”.

September 30, 2004 at 4:42 pm #108297
Surya Gade (Member)
Sambuddha:
Hi. I am very new to this forum.
Can you please send me the pointers or the links that you sent to DT on why a sample size of 30 is required? That's a question I have had for a long time.
Thanks.
Surya

September 30, 2004 at 4:44 pm #108298
Surya Gade (Member)
Sambuddha:
I forgot to mention my e-mail in my previous message…please e-mail to the following address…[email protected]
Thank you for sharing.
Surya

October 8, 2004 at 12:32 pm #108736
Mark Chockalingam (Participant)
Surya,
There are several interesting web references on the Central Limit Theorem and the sample size of 30. When the sample size approaches 30, we don't have to worry about the distribution of the population, since it can be safely assumed to be normal for inference purposes.
Remember for interval estimation, the standard error is computed from a Sampling distribution of the mean. When the sample size approaches 30, the sampling distribution approaches normality. Here are some references:
http://www.mathwizz.com/statistics/help/help4.htm
http://www.statisticalengineering.com/central_limit_theorem.htm
Here is a little more technical article on normal distributions and central limit theorem.
http://www.itl.nist.gov/div898/handbook/index.htm
Mark Chockalingam

October 8, 2004 at 1:00 pm #108739
Robert Butler (Participant)
Mark,
The central limit theorem applies to the mean, not to individuals: 30 samples from a lognormal distribution will not suddenly become normal. The distribution of averages of 30 data points from a lognormal distribution, however, will be approximately normal. To this end, the first citation you mentioned (and as quoted below) is in error. The second and third citations, however, are correct. (Note: some of the text from your citations doesn't copy over to the forum page, so I had to rewrite the equation in the first citation.) I also took the liberty of highlighting in order to emphasize the focus on distributions of means and not individuals.
#1 The Central Limit Theorem says that if you have a random sample and the sample size is large enough (usually bigger than 30), then
Z = (sample avg – pop avg)/(s/sqrt(n))
where Z follows the standard Normal distribution with μ = 0 and σ = 1. This comes in really handy when you haven't a clue what the distribution is, or it is a distribution you're not used to working with, like, for instance, the Gamma distribution.
#2 The distribution of an average tends to be Normal, even when the distribution from which the average is computed is decidedly non-Normal.
Thus, the Central Limit Theorem is the foundation for many statistical procedures, including quality control charts, because the distribution of the phenomenon under study does not have to be Normal; the distribution of its average will be.
#3 The central limit theorem basically states that as the sample size (N) becomes large, the following occur: the sampling distribution of the mean becomes approximately normal regardless of the distribution of the original variable.
The sampling distribution of the mean is centered at the population mean, μ, of the original variable. In addition, the standard deviation of the sampling distribution of the mean approaches σ/√N.

October 8, 2004 at 3:27 pm #108749
Mark Chockalingam (Participant)
Rob,
Thanks for copying and pasting from the source. However, I humbly submit that it is inappropriate to quote the content without acknowledging the source. I agree it reads easier on one page, but it is still important to name the source.
Now as to your point on the error, I don't see it. Maybe it is semantics. The CLT is a statement about the sampling distribution of the mean, NOT about the sample or the original population itself. When the sample size approaches 30, the sampling distribution approaches normality regardless of the original distribution.
Now, for interval estimation, the big leap made in practice is to assume that the sample standard deviation is a sufficient estimate of the population standard deviation. Is this what you disagree with in citation #1's formula for the standard normal deviate?
Good discussion.
thanks,
Mark

October 13, 2004 at 9:02 am #109016
Hi Sambuddha,
Kindly mail me the same at [email protected]
Thanks,
John C

November 21, 2004 at 8:36 pm #111071
I would be grateful if you could send on the background to the sample size of 30.
Thanks ..
PC

November 22, 2004 at 5:19 am #111097
Those are great URLs for any beginner to study.
One might use some “rules of thumb” based on practical experience, as well as the more rigorous statistical methods.
For example, if the DATA is to be from a MECHANICAL process for making discrete parts, then one should first try sampling the FAMILIES of possible variation, using sample size 2 for each family, per Shainin's recommendations: 2 sites on each of 2 parts, repeated every hour for 2 shifts perhaps, then graphed. Once it is clear which family of variation is the main problem, then SPC sampling (subgroups measured over time) can be used IF the problem is temporal. But if the problem is variation WITHIN the parts, then perhaps closer stratification of data is needed, or measuring more sites per part, or comparing that variation for all similar machines, or looking at tool-wear trends of this “within-part” spread over time. Means and ranges can both drift with tool wear. Sampling for mean data involves the famous sample size of 30 (or 15) for OOC determinations. Sampling for changes in variance requires much larger samples, so a wandering mean is not the same as a wandering variance. Think 1000 parts.
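The "think 1000 parts" remark can be checked with a quick chi-square power calculation (my own sketch, assuming a one-sided test for a reduction in sigma at alpha = 0.05 with 80% power; the sigma ratios are illustrative):

```python
# Smallest n for a one-sided chi-square test on the sample variance to
# detect that sigma has dropped to ratio * sigma0, with 80% power at
# alpha = 0.05. A sketch supporting the "think 1000 parts" remark above.
from scipy.stats import chi2

def n_to_detect_sigma_drop(ratio, alpha=0.05, power=0.80, n_max=3000):
    for n in range(3, n_max):
        df = n - 1
        crit = chi2.ppf(alpha, df)      # reject H0 if (n-1)s^2/sigma0^2 < crit
        if chi2.cdf(crit / ratio**2, df) >= power:
            return n
    raise ValueError("no n below n_max")

for r in (0.8, 0.9, 0.95):
    print(f"detect sigma -> {r} * sigma0: n ~ {n_to_detect_sigma_drop(r)}")
```

Detecting a 20% drop in sigma already takes on the order of 60-70 parts, and a 5% drop really does push n past a thousand, while a comparable shift in the mean needs far fewer samples: exactly the asymmetry between a wandering mean and a wandering variance described above.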
RULE OF THUMB for SPC chart startup is given as 15 to 30 subgroups, but if the process is non-stationary (drifts, has a wandering mean and unstable variance, for example) then other methods are needed. Box and Luceño's book (Amazon.com) talks at great length about modern issues with process monitoring and adjustment methods. Assessing non-subgrouped data is another issue.
RULE OF THUMB: Individual Charts are less sensitive, less powerful, (they give more false alarms for each rule added, for example) than X-bar charts.
RULE OF THUMB: Subgroup size of 2 to 4 is common and usually adequate. But for diagnostic reasons, many engineers use subgroups of 10 to 100 sites per part or parts per subgroup. ASQ had a paper recently on the effect of large subgroup sizes. In general, it invalidates the method most people use to calculate control limits, as many of the subgroup sites or parts are CORRELATED and so the control limits would be wrong. What is good for diagnostics is often not good for control, given various control models. With automated gages, more data is cheap. But how you use it depends on whether the data is really INDEPENDENT and RANDOMLY SAMPLED and IDENTICALLY DISTRIBUTED. Most important is that INDEPENDENCE. And if the data is AUTO-CORRELATED, it's also messy (wandering mean, showing predictability instead of randomness).
Then there is the data that comes from CHEMICAL processes, such as continuous refining. Read Svante Wold's or Dr. John MacGregor's work on PCA/PLS multivariate methods for sensor-based data, which is a HUGE stream of data. Rule of Thumb: Get help.
Central Limit Theorem: Only for subgrouped data!!!
Shewhart Charts: Only for stationary processes where samples are independent!!! (I am not a statistician, and those guys are still arguing about these issues. See the Journal of Quality Technology, Woodall's papers, for example.)
Don't forget the NIST online handbook: http://www.itl.nist.gov/div898/handbook/

January 13, 2005 at 6:31 am #113427
Hello Mr Sambuddha, can you also send me the article on the 30 sample size that you sent to DT? I am also interested to know why 30. My email address is [email protected]
Thanks,
Rechel

January 13, 2005 at 8:32 am #113430
Kevin Alderson (Participant)
Regarding sample size 30: it is a reasonable amount to measure and analyse, with roughly a 10% margin-of-error difference between a sample of 30 and one of 500.
Of course it would be better to take 500 for accuracy, but you must weigh the cost of going from 30 to 500 against what you are measuring. Keep that roughly 10% margin in mind and you should be fine.

January 14, 2005 at 12:06 am #113457
quality_ab (Participant)
Could somebody please email the link to me at quality_abyahoo.com
Thanks,
AB

January 22, 2005 at 2:58 pm #113868
Hi Mr. Sambuddha,
Could you also share the information about sample size 30 with me? I’m very much interested. Kindly email it to [email protected]
Thanks,
–Glo

January 22, 2005 at 7:31 pm #113877
DrSeuss (Participant)
DT, let me try to answer this from a practical-experience approach.
I have also asked this question and have never received a definitive academic answer. Here is what I have seen from analyzing real process data. Take a continuous process that produces normally distributed data (near normal is also good enough) and collect your data using a rational-subgroups approach. Use the Minitab Six Sigma Process Report to calculate short-term and long-term process capability. Look at report #4 or #5: it shows both Sigma ST and Sigma LT on a graph. Notice how their values stabilize toward a value as the number of subgroups increases. You will notice a flattening of the curves at about 10 subgroups; then around 20-25 subgroups the curves are almost horizontal. By the time you reach 30 subgroups the sigmas have stabilized, and adding any more subgroups will only change the sigmas in the 4th or later decimal places. If you are an Excel wizard, you can demonstrate this very easily as well. The idea is that after about 30 subgroups (30 data points) the variance of the data typically stabilizes.

April 28, 2005 at 12:10 am #118555
Dear Sambuddha,
Could you please send the reference to me? Thanks very much.
[email protected]
Leon

May 24, 2005 at 7:20 pm #120057
Dear Sambuddha,
I’m very interested in your projects.
Please send me mail.
Thank you very much.
Sincerely,
vee0May 24, 2005 at 7:23 pm #120058from vee
my email [email protected]
thank again0July 23, 2005 at 5:13 pm #123533
DEEPAK JAIN (Participant)
Sambuddha:
Please also email me; for the last few weeks I have been trying to find the answer.
Please reply to me at [email protected]
D. JAIN
9811564123

July 25, 2005 at 6:05 am #123564
Sambuddha/DT,
Please send me the information on sample size. I know this msg is 3 years too late, but would appreciate your or anyone’s help in getting this info to me.
Thanks
[email protected]

September 16, 2005 at 9:58 pm #127028
When the population size is greater than 100, the normality condition is commonly taken to be met when the sample size is greater than 30. Increase the sample size depending on the process being studied and the variability of the data produced. 30 is not a “magic” number applicable to all data sets and processes.
0September 17, 2005 at 2:12 am #127036
Ken Feldman (Participant, @Darth)
Dave, might I suggest that you check the dates on any post that you respond to. This is a really old one. OK, Nick, how did I do????
January 1, 2006 at 8:35 pm #131786
Hi Sambuddha,
I know this forum thread has been going on for quite some time now, and I am not sure whether you will receive this message, but I am requesting and hoping that you will be able to send the reference materials to me as well.
Here’s my email address: [email protected]
Kulanan (Participant, @Kulanan)
Dear Sambudha
I am trying to find the answer about sample size. Please kindly send me the information on why sample size = 30, because I have to use this information for my report; if you have more information, please reply. Thank you very much.
Best regards,
Kulanan
[email protected]
June 9, 2006 at 3:21 pm #138875
Hi Sambuddha,
I’m keen to know why 30 samples, too. Can you send me a copy as well?
Email: [email protected]
sue
0June 9, 2006 at 6:13 pm #138882
Heebeegeebee BB (Participant, @Heebeegeebee-BB)
Sue,
This is a FOUR YEAR OLD thread.
June 9, 2006 at 6:59 pm #138883
Ken Feldman (Participant, @Darth)
Heck, that trumps my measly 18-month one earlier this week.
0June 9, 2006 at 6:59 pm #138884
Mike Carnell (Participant, @Mike-Carnell)
Heebeegeebee,
…and unfortunately we have not seen Sambuddha post on here for a couple of years.
Regards
June 9, 2006 at 9:45 pm #138892
Heebeegeebee BB (Participant, @Heebeegeebee-BB)
Yeah,
Whatever happened to Sambuddah???
June 26, 2006 at 7:13 am #139587
Mahesh Kumar S (Participant, @Mahesh-Kumar-S)
Sambuddha:
Hi. I am very new to this forum.
Can you send me the information that you sent to DT on why a sample size of 30 is required? I too am curious.
[email protected]
Thanks.
June 26, 2006 at 5:18 pm #139614
Heebeegeebee BB (Participant, @Heebeegeebee-BB)
Mahesh,
Sambuddah’s last post under that nom de plume was in 2002.
It is unlikely that you will get a rise out of a 4-year-old thread.
We are still tied at 4 years, folks!
September 1, 2006 at 7:09 am #142636
Tan Li Ren (Member, @Tan-Li-Ren)
Dear Sambuddha,
Could you also send me the 30 sample size information as well? I’d appreciate it. [email protected]
Best regards, Li Ren
September 10, 2006 at 9:19 am #143031
Dear Mr Sambuddha,
I work in the research field and have an immense interest in knowing more about the sample size of 30. Would you also send me articles and reference materials on this topic by e-mail at:
[email protected]
Many thanks for your sharing.
Best regards,
Edmond
September 10, 2006 at 10:41 am #143032
DT,
I first came across the n = 30 rule of thumb during a lecture by Dorian Shainin (1983). Dorian was brought to Scotland by someone called Ted Williams, who was instrumental in bringing Dorian to Motorola in Phoenix some years before.
According to Dorian, if you plot the error of the estimate of sigma as a function of n, the curve becomes asymptotic at around n = 30, where sigma can be estimated with 95% confidence. As Stan has previously pointed out, Dorian always used a 95% confidence.
As Mike Carnell has also noted, typical X-bar and R charts use 30 subgroups of n = 3 or n = 5, which is a sample size of 90 or 150 – a far cry from n = 30.
Another issue lost on many is the use of multiple subgroups, which provides a pessimistic estimate of sigma, since both the data and the subgroup mean vary in small subgroups; so the entropy of multiple subgroups is larger than that of a single subgroup.
No one in their right mind would estimate process capability based on a single subgroup of n = 30.
Regards,
Andy
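Dorian’s asymptote can be sketched from the standard large-sample approximation for the standard error of the sample standard deviation from a normal population, SE(s) ≈ σ/√(2(n − 1)). The short Python sketch below is my own illustration of that approximation, not Shainin’s original calculation:

```python
import math

# Large-sample approximation for the relative standard error of the
# sample standard deviation from a normal population:
#   SE(s) / sigma ~= 1 / sqrt(2 * (n - 1))
def rel_se_sigma(n):
    return 1 / math.sqrt(2 * (n - 1))

# The relative error falls steeply up to about n = 30, then only slowly:
for n in (5, 10, 30, 100):
    print(n, round(rel_se_sigma(n), 3))
```

At n = 5 the estimate of sigma is uncertain by roughly 35%; by n = 30 this has dropped to about 13%, and further gains come slowly — which is the flattening curve described above.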
September 10, 2006 at 12:15 pm #143034
DT,
Avoid all of the complications of interpretations of interpretations, opinions of interpretations, and interpretations of opinions of interpretations, and review Gosset’s 1908 article in Biometrika, “On the probable error of the mean.” From there you can make your own informed judgement about how other statisticians incorporated and adapted his work into theirs. What is it that they say in lean? Go see for yourself :-). Regards.
November 27, 2006 at 2:57 pm #147999
Hi Sambuddha,
Please email me the information on the theory and history behind sample size being 30
[email protected]
Thanks,
Nitesh
November 27, 2006 at 8:20 pm #148033
It is very simple. Harry picked 30 because it gave 1.5 in his 2003 attempt. Other numbers give anything between 0 and 50+ for his “correction”.
0February 14, 2007 at 5:01 am #151969
Shon Stewart (Member, @Shon-Stewart)
Please forward me information about the history of the sample size of 30 as a rule of thumb. Your help will be very much appreciated.
0February 14, 2007 at 5:20 am #151970
Confusion about 2 papers (Participant, @Confusion-about-2-papers)
n = 25 has a truly statistical justification. At n = 25 the law of large numbers will start to show a pronounced symmetric/normal distribution of the sample means around the population mean. This normal distribution becomes more pronounced as n is increased.
n = 30 comes from a quote from Student (Gosset) in a 1908 Biometrika paper, “On the probable error of a correlation.” In this paper he reviews the error associated with drawing two independent samples from an infinitely large population and their correlation (not the individual errors of each sample relative to the sample mean and the population mean!). The text reviews different corrections to the correlation coefficient given various forms of the joint distribution. In a few sentences, Student says that at n = 30 (which is his own experience) the correction factors don’t make a big difference. Later, Fisher showed that the sample size for a correlation needs to be determined based on a z-transformation of the correlation. So, Student’s argument is only interesting historically. Also, Student published his introduction of the t-test in Biometrika during the same year (his prior article). Historically, the n = 30 discussed in his correlation paper has been confused with the t-test paper, which only introduced the t-statistic up to sample size 10.
In sum, n = 30 is a rule of thumb that accidentally works. But ironically, the n = 30 for sampling from a population has been confused with the n = 30 observation from correlations.
May 8, 2007 at 3:04 pm #155817
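The Fisher z-transformation mentioned above does give a concrete sample size rule for correlations: atanh(r) is approximately normal with standard error 1/√(n − 3). The sketch below is my own illustration of that rule; the defaults 1.96 and 0.84 correspond to 95% two-sided confidence and 80% power, which are conventional choices, not values from the post.

```python
import math

def n_for_correlation(r, z_alpha=1.96, z_beta=0.84):
    """Sample size to detect a correlation r (vs. 0) using Fisher's
    z-transformation: atanh(r) has standard error ~ 1/sqrt(n - 3)."""
    zr = math.atanh(r)
    return math.ceil(((z_alpha + z_beta) / zr) ** 2) + 3

# A strong correlation of 0.5 needs a sample of about 30; a weaker
# correlation of 0.3 needs nearly three times as many observations.
print(n_for_correlation(0.5))  # 29
print(n_for_correlation(0.3))  # 85
```

It is a nice historical coincidence that the r = 0.5 case lands right around 30 — but, as the post explains, the required n moves sharply with the effect size, so no single number covers all cases.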
Pramod Thomas John (Participant, @Pramod-Thomas-John)
Dear Sambudda,
Could you please mail this information to me (pointers for choosing the sample size as 30)? I recently had an interview where this question was asked and I drew a blank.
Thank you in advance.
Cheers
Pramod0September 4, 2007 at 8:23 am #160732
Phillip (Participant, @Phillip)
Hi, if Sam or anyone else has gotten this information about why the minimum sample size is 30, can you please forward it to me?
Thanks,
Phillip
[email protected]
0September 4, 2007 at 8:42 am #160733Did you read any of this thread?
See: https://www.isixsigma.com/forum/showmessage.asp?messageID=112385
September 27, 2007 at 6:35 am #161859
Hi Sambuddha,
Can you e-mail me that presentation as well at [email protected]?
October 2, 2007 at 2:46 am #162271
My background is in mathematical statistics, though from several years past. I tried to read through this thread, which is quite complicated, but the actual question does not seem to be answered. I am also new to Six Sigma, but I did a Google search on 30/sample size and this thread appeared to have a good discussion, so I will rephrase the original question and give my opinion. That question is:
1. Is 30 some magic number that can be used as an adequate sample size for “most” purposes?
I recognize that in use there are almost always assumptions about the underlying distributions and parameters, but the calculations of power and sample size are well worked out. I can understand the assumption that the distribution of the mean for a sample size of 30 should “look fairly normal for most distributions,” but the power/sample size strongly depends on the underlying variance, as well as on other variables that the field has not seemed to define. I suppose we can assume that with a sample size of 30 we have a sampling distribution that is normal, with the population mean as its mean and the population variance divided by 30 as its variance. If we assume that the allowable power is .8 (why not .9 or .95?), and that the allowable difference between the true mean and estimated mean is x% of the population standard deviation (again arbitrary), then with all these assumptions and the right x%, perhaps a sample size of 30 might arise as a reasonable sample size. However, we are usually more concerned with the absolute error between the sample mean and population mean, which would completely negate the possibility of any unique n satisfying an adequate sample size, since population variances have no bounds that I know of. If there is some consensus that we are making all these assumptions, it should be spelled out.
Reading several of the comments, I contend that the number 30 is just some number that has nestled into the literature without any true mathematical/statistical verification. It’s small enough to be practical, but it is an arbitrary number without true mathematical significance. What bothers me is that if we are talking about Six Sigma, accepting 30 as a magic number for sample size, rather than using the standard known statistical procedures to estimate the proper sample size, is anathema to the underlying concept of precision that I assume Six Sigma represents.
October 2, 2007 at 5:16 am #162275
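The point above — that no single n can serve every population — can be made concrete with the textbook formula n = (zσ/E)², the same one quoted at the top of this thread. A small sketch with illustrative numbers of my own choosing:

```python
import math

def sample_size_mean(z, sigma, margin):
    """n = (z * sigma / E)^2, rounded up: sample size needed to estimate
    a mean to within +/- margin at the confidence implied by z."""
    return math.ceil((z * sigma / margin) ** 2)

# The required n scales with the variance, so no single n fits all cases.
# Same 95% confidence (z = 1.96) and margin E = 0.5, different sigmas:
print(sample_size_mean(1.96, 1.0, 0.5))   # 16
print(sample_size_mean(1.96, 10.0, 0.5))  # 1537
```

A tenfold increase in sigma multiplies the required sample by a hundred, which is exactly why an absolute-error requirement rules out any universal n.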
Historiography (Participant, @Historiography)
I posted this response earlier. It is based on a review of Fisher’s early work.
Overall, rules of thumb were heavily introduced into statistics when it became commercialized and therefore entered the engineering field. The rules of thumb regarding estimation of parameters are only one example where classical statisticians gave up and gave way to the more pragmatically oriented statisticians. The history of the magical numbers 22, 25 and 30 is replicated below. But other rules of thumb emerged to make the science more usable.
n = 22 was proposed by Fisher in Statistical Methods for Research Workers, p. 44, when he reviewed how often a deviation exceeds the standard deviation (about once in every three trials) and twice the standard deviation (about once in 22 trials): “The value for which P = 0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.”
n = 25 has a truly statistical justification. At n = 25 the Law of Large numbers will start to show a pronounced symmetric/normal distribution of the sample means around the population mean. This normal distribution becomes more pronounced as n is increased.
n = 30 comes from a quote from Student (Gosset) in a 1908 Biometrika paper, “On the probable error of a correlation.” In this paper he reviews the error associated with drawing two independent samples from an infinitely large population and their correlation (not the individual errors of each sample relative to the sample mean and the population mean!). The text reviews different corrections to the correlation coefficient given various forms of the joint distribution. In a few sentences, Student says that at n = 30 (which is his own experience) the correction factors don’t make a big difference. Later, Fisher showed that the sample size for a correlation needs to be determined based on a z-transformation of the correlation. So, Student’s argument is only interesting historically. Also, Student published his introduction of the t-test in Biometrika during the same year (his prior article). Historically, the n = 30 discussed in his correlation paper has been confused with the t-test paper, which only introduced the t-statistic up to sample size 10.
In sum, n = 30 is a rule of thumb that accidentally works. But ironically, the n = 30 for sampling from a population has been confused with the n = 30 observation from correlations.
So, to your point, yes, there are historical reasons, but the true reason was the need for statistics to establish itself as a useful field. Now, rules of thumb have taken over critical thinking about statistics. Six Sigma accelerated this movement.
October 2, 2007 at 5:57 am #162276
Grasshopper (Participant, @Grasshopper)
Aren’t you clever… oh yes you are… now reread your post and update it with some additional builds to support your argument.
Grasshopper0October 2, 2007 at 6:09 am #162277
Statistician (Member, @Statistician)
You’re making progress; you can actually read now. Great accomplishment!
0October 7, 2007 at 4:52 am #162728So does the number 30 have significance in the use of an sample for an arbitrary population or is it just a number that “seems” to work because no one has actually tested it.
0October 7, 2007 at 8:27 am #162729Hello,
I would be interested to have a better understanding of how this sample size issue relates to SPC charts.
Usually, the sample size of an SPC chart is 5, but my understanding is that the sample size should be determined according to the ‘normality’ of the underlying distribution.
If the underlying distribution is ‘absolutely not normal’, the sample size required might be around 30; if the underlying data is normal, there is no need to use samples and individual data can be used.
Am I correct?
Thanks
Vincent
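On the question above, the usual argument for small SPC subgroup sizes is the central limit theorem: averages of even 5 observations are much closer to normal than the raw data. A quick simulation sketch (exponential data is my own choice, simply as an example of a clearly non-normal process):

```python
import random
import statistics

random.seed(1)  # fixed seed for reproducibility

def skewness(xs):
    """Simple moment-based skewness (0 for a symmetric distribution)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

# Raw data from a clearly non-normal (exponential) process...
raw = [random.expovariate(1.0) for _ in range(5000)]

# ...versus means of subgroups of size 5, as plotted on an X-bar chart.
means = [statistics.fmean(raw[i:i + 5]) for i in range(0, len(raw), 5)]

# Averaging within subgroups pulls the distribution toward normality,
# so the subgroup means are markedly less skewed than the raw data.
print(skewness(means) < skewness(raw))
```

This is why X-bar charts tolerate moderately non-normal data with subgroups of 4 or 5, while heavily non-normal individual values need larger subgroups (or a different chart).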
0April 22, 2008 at 2:43 pm #171381
Danny Carballo (Participant, @Danny-Carballo)
Can you also e-mail this attachment?
Thanks in advance.0April 23, 2008 at 9:08 am #171403
Tiffany Lian (Member, @Tiffany-Lian)
Hi, Sambuddha:
I am new to this forum and very curious about “why 30”. Could you please also send me the info? Thank you very much.
[email protected]
Tiffany Lian
0May 15, 2008 at 3:21 pm #171996Hey Sambuddha,again, I’m a newbie here, could you please send me the info about why 30 sample size… pleasee… thank you so much.Please send it to me to [email protected], as i will need it for my final paper.Thanks again!
0May 16, 2008 at 3:48 am #172011Hi Sambuddha,
I’m new to this forum and am interested to know more about the 30-pc sample size. Could you send me this project when you get a chance?
Sid
May 16, 2008 at 3:54 am #172012
Hi Sambuddha,
forgot to write my email id, [email protected]
appreciate your help!
Thanks
Syed
May 17, 2008 at 4:00 am #172048
If anyone in this group has this information, please do send it to me.
Thanks,
[email protected]0May 21, 2008 at 2:42 pm #172134
BelowTheBelt Certified (Participant, @BelowTheBelt-Certified)
Because the standard error of the mean improves as the sample size increases to 30.
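This claim can be quantified: the standard error of the mean is σ/√n, so it shrinks with the square root of the sample size, with sharply diminishing returns past a few dozen observations. A minimal sketch (σ = 10 is an arbitrary illustration):

```python
import math

# Standard error of the mean: SEM = sigma / sqrt(n).
def sem(sigma, n):
    return sigma / math.sqrt(n)

# With sigma = 10, going from n = 5 to n = 30 cuts the SEM by more than
# half; beyond that, each additional observation buys less and less.
print(round(sem(10, 5), 2))    # 4.47
print(round(sem(10, 30), 2))   # 1.83
print(round(sem(10, 100), 2))  # 1.0
```

Note the square-root law: quadrupling the sample only halves the standard error, which is why the improvement "flattens" rather than stopping at any particular n.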
The forum ‘General’ is closed to new topics and replies.