# Sample size…Why 30?

Six Sigma – iSixSigma Forums Old Forums General Sample size…Why 30?

#29909

DT
Participant

We recently concluded a GB training, and the question arose of why a sample size of 30 is suitable and where it came from.  (What if my population is smaller than this, or the testing is destructive in nature?)  I'm lost, as there is conflicting information.  One reference says determining the sample size depends on (1) the level of confidence, (2) the margin of error tolerated, and (3) the variability in the population studied.  Another says…
n = (z*s/E) squared. These do not take the population size into consideration, do they?  If I am faced with a transactional example and had 100,000 accounts (the population) that were in default, and wanted to sample them to determine how many had credit scores less than 615, what sample size would be reflective of the population without offsetting the benefit by the time spent gathering data?
When I pose this question (thinking 30 may not reflect my population) I remain unsatisfied.  I'm told that it's due to large samples and the central limit theorem, or that a typical run chart that is in control is stable after 30 data points and that's why 30 is used.  I can't find this in any reference materials… can anyone help??  Please provide an example if needed… I need layman's terms, as I am not too statistically inclined.
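As an aside, the textbook formula above can be extended with a finite-population correction; here is a minimal Python sketch (the 95% confidence, ±3% margin, and worst-case p = 0.5 are illustrative assumptions, not recommendations):

```python
import math

def sample_size_proportion(z=1.96, p=0.5, e=0.03, population=None):
    """n needed to estimate a proportion within +/-e at confidence z.

    p = 0.5 is the worst case; pass `population` to apply the
    finite-population correction for a finite frame.
    """
    n = (z ** 2) * p * (1 - p) / e ** 2       # infinite-population formula
    if population is not None:
        n = n / (1 + (n - 1) / population)    # finite-population correction
    return math.ceil(n)

# 100,000 defaulted accounts, 95% confidence, +/-3% margin
print(sample_size_proportion(population=100_000))   # 1056 -- far more than 30
```

Note that for a population this large the correction barely matters: the answer is driven by the margin of error and confidence level, not by the population size.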

0
#77364

Sambuddha
Member

DT,
You ask a very good question. There are various responses based on the situation, the tool you are using, and the type of data.
One reference says determining the sample size depends on (1) the level of confidence (2) margin of error tolerated, and (3) variability in population studied.
The above reference is right in general. Formulae for sample size calculation vary depending on the test you are going to conduct.
The parameters/issues you need to address are:

Type of test e.g. 2 sample T, Z, ANOVA etc
Standard Deviation (variability) of the process
Delta that is significant in distinguishing 2 or more effects
Alpha (level of significance of the test)
Power of the test (1-beta). Beta is the probability of type-II error
Number of levels (ANOVA), in case you know how many levels/effects you are aiming to distinguish.
Sample size
The interesting part is that Minitab (assuming you would use it) allows you to vary any 2 parameters from Delta, Power and Sample size for any given number of levels. Try Stat > Power and Sample Size > ANOVA, or the tool you want to use.
That lets you know the error (or lack thereof, since you are measuring power) associated with each sample size and delta for any given setting. So you could make a trade-off study and see where your sweet spot lies. In cases where testing involves capital & consumables, this is a great tool. In your case you already have the data, so it is not resource-intensive that way. Still, this is better than using 30 blindly.
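As a rough illustration of that trade-off (a normal-approximation sketch in Python, not Minitab's exact algorithm; the 0.5-sigma delta is an assumed example):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_sample(delta, sigma, n, z_alpha=1.96):
    """Approximate power of a two-sided two-sample z-test, n per group."""
    ncp = delta / (sigma * math.sqrt(2.0 / n))   # standardized shift
    return norm_cdf(ncp - z_alpha) + norm_cdf(-ncp - z_alpha)

# power vs. sample size for detecting an assumed 0.5-sigma shift
for n in (10, 30, 64, 100):
    print(n, round(power_two_sample(0.5, 1.0, n), 2))
```

With around 64 per group the power reaches roughly 0.8 for a half-sigma shift, while n = 30 gives only about 0.5, which is why 30 is nowhere near a universal answer.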
There is a reason 30 is widely used. It is a result of simulation studies involving the Central Limit Theorem. If you are interested in the “history” or reason for the prevalence of 30 samples as a guideline, I could give you a few pointers.
I have a project that is similar in tool usage. There are quite a few neat things one could do with power and sample size studies.
Good luck,
Sambuddha

0
#77365

DT
Participant

I would be VERY interested in the pointers on the history or relevance of the 30 samples… if it's easier to email…
[email protected]

0
#77366

Sambuddha
Member

DT,
Check your email. I have sent some information.
Hope that helps.
Best,
Sambuddha

0
#77368

Hrishi
Participant

Sambuddha:
Hi. I am very new to this forum.
Can you send me the information that you sent to DT on why a sample size of 30 is required ? I too am curious.
Thanks.

0
#77369

Sambuddha
Member

Hrishi,
The reason I cannot post it here is that it is a scanned picture attachment, and it is perhaps easier to email.
Best and welcome to this community,
Sam

0
#77376

Gabriel
Participant

Sambuddha
You can attach it here and share it with all the forum. It would be great!
Just click on the clip here at the right, where you read “Post/attach document”. It will lead you to send an email to iSixSigma with the attachment and they will post the attachment here!
Thanks for sharing!

0
#77377

DD
Participant

Sambuddha
Yes, as Gabriel says, you can post it on this site. I am curious too.
Thanks for sharing
DD

0
#77379

Sambuddha
Member

Gabriel, DD
I thought of posting it here. Attaching was a small hassle. But it looks like I have a bigger problem. It is a scanned picture of some graphs, and, mea culpa, I cannot find the reference I took it from. I am buried amidst a bunch of books and I can't find it.
I have no problem sharing it with you all individually through email. But I am afraid that if I post it publicly without credits, I might be in trouble for copyright violation for public distribution of intellectual property.
The good news is that the following website illustrates the same thing.
http://www.statisticalengineering.com/central_limit_theorem.htm
Public domain is great, isn't it? Hopefully that will satisfy your curiosity. The number 30 came as a result of simple sampling simulations from different parent populations (Uniform, Normal, Exponential, Triangular): by the time the sample sizes reached 30-32, the distribution of the means started looking normal. That is the reason for the rule of thumb.
I haven't seen any theoretical explanation yet for that, i.e., what is so special about 30 from an analytical point of view. I shall let you all know if I come across anything to that effect.
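The simulation idea is easy to reproduce; a minimal sketch (Python, seeded for repeatability) drawing sample means from a heavily skewed exponential parent:

```python
import random
import statistics

random.seed(1)

def sample_means(draw, n, reps=2000):
    """Means of `reps` samples of size n drawn from draw()."""
    return [statistics.fmean(draw() for _ in range(n)) for _ in range(reps)]

def skewness(xs):
    """Simple moment-based skewness estimate."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

def draw():
    return random.expovariate(1.0)   # exponential parent, skewness ~ 2

means30 = sample_means(draw, 30)
print(round(skewness(means30), 2))   # far less skewed than the parent
```

The skewness of the means shrinks roughly like the parent's skewness divided by √n, so by n ≈ 30 a parent skewness of 2 has dropped to about 0.37.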
Hope it still helps.
Best,
Sambuddha

0
#77399

aush
Participant

Sambudhha
Could you also share the info on the sample size of 30 with me? I would appreciate your emailing it to me at
[email protected]

0
#77410

Rajanga Sivakumar
Participant

Mr. Sambudhha,
Could you share the sample size 30 with me too? Thanks
email to [email protected]
Rajanga

0
#77412

Ted
Member

Assuming that, even with the ability of most software to sort and count the numbers, you still wish to sample your population of 100,000 records, there are two questions you have to ask: how many do I take, and what risk can I accept of drawing the wrong conclusion from the statistic?
A number of answers here address why 30 samples are needed to approximate a normal distribution, allowing for estimation based on the probabilities of the normal curve. However, once the mean and std dev have been estimated, and the cumulative probability found up to and including the critical limit you set, the second question comes into play: specifically, how sensitive are you to making an error in assigning that proportion to your population?
As an example, say 6% of your population was expected to fall below your cut-off; how sensitive are you to the possibility that the true proportion isn't 7% or 8% or 10%, etc.? You would need to calculate the Beta risk of assuming the proportion is 6% given your original sample size and the statistics you calculated. If you use Minitab (or perhaps other software) you can adjust the minimum sample size you need to take for the risk you choose. Under the Power and Sample Size tab, 1 Proportion test, you can enter both the calculated proportion (as a percentage) and the critical proportion, along with the level of risk (beta), and it will calculate the number of samples you need to take. Go back, resample to that level, run the calculation again to find the proportion defective (credit levels below your cut-off), and rerun the beta again with the new numbers. The process is iterative until you are satisfied with the number of samples vs. the risk you are willing to assume. You might therefore start out with a sample of 30, find that the beta risk is too high, have to take 400 samples, do so, recalculate, and find that you actually need 435, etc. Others here might have a better way to adjust for risk and sample size without all the iterations, but that's the only way I've found to consistently do it.
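This kind of calculation can also be roughed out by hand with the normal approximation (a sketch only, not necessarily Minitab's exact method; the 6%/8% figures are assumed example proportions):

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_one_proportion(p0, p1, alpha=0.05, power=0.90):
    """Approximate n for a one-sided one-proportion z-test (normal approx.).

    p0: hypothesized proportion; p1: alternative you must detect.
    """
    z = NormalDist().inv_cdf
    z_a, z_b = z(1 - alpha), z(power)
    num = (z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))) ** 2
    return ceil(num / (p1 - p0) ** 2)

# e.g. detect a true 8% rate when the hypothesized rate is 6%, 90% power
print(n_one_proportion(0.06, 0.08))   # far more than 30 samples
```

For small differences in proportions the required n runs into the thousands, which echoes the iteration described above.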
My other question for you, however, is what you plan to use the data for. Be careful if the intent is to show that you get higher numbers of defaults with credit scores below a certain number while using those accounts already in default as your population for the hypothesis. Your choice of frame for the population would be wrong in that kind of test.
hope that helps.

0
#77415

Picklyk
Participant

Hi Sambuddha,
Could you also share the sample size of 30 info with me as well?  Please email [email protected].  Thanks!
Jay

0
#77425

zhou
Participant

Dear Sambuddha,
Best regards,
Lawrence.

0
#77427

Glenn Gooding
Participant

Sam,

Along with a great many of our colleagues, I would be interested and grateful if you could let me have a copy of the information about the rationale for the sample size of 30.
[email protected]
regards
Glenn

0
#77429

Ja
Participant

Dear Sambuddha,
Could you also send me the 30-sample-size information in a separate email? I appreciate it. [email protected]
Best regards,
JA.

0
#77431

Nicholas L. Squeglia
Participant

In layman's terms, if you were to prepare a graph on the basis of attribute data, plotting sample size vs. confidence, you would see that there is quite a difference from, for example, 2 to 30. This is not the optimum, but more of a minimum. 50 would perhaps be a better choice and is what Dorian Shainin used in his “lot plot” many years ago. The slope of the curve increases after 30/50, but at a much lower rate. The central limit theorem is somewhat different, and relies on taking averages of data to show a normal, Gaussian, distribution for control chart purposes although the underlying data is non-normal.
Nicholas L. Squeglia, author, Zero Acceptance Number (c=0) Sampling Plans

0
#77432

CBetts
Participant

Sambuddha,

Could you please forward the Sample size information to me as well.  [email protected]   Thank you.
Cedric

0
#77442

Janet Hunter
Participant

I believe you are correct about the Central Limit Theorem, at least as I recall from my statistics classes a few years ago. You may want to contact the local college and speak to one of the professors in the mathematics department for further direction or confirmation.

0
#77444

Mike Carnell
Participant

DT,
I have not read the entire string, so if some of this is redundant I apologize. Sam gave a good answer when he said it is different for different situations.
Assuming that everything works off of 30 is incorrect.
Frequently you will see variable control charts listed as a sample size of 25-30. They are typically speaking of 25-30 subgroups of 5, which makes it 125-150 actual samples. It is the subgrouping, giving you a distribution of averages (Central Limit Theorem), that makes it work.
When you are doing hypothesis testing and using ANOVA, the sensitivity of the test is extremely dependent on sample size. I was doing site support and found a guy who could not understand why his 2-sample t-test was showing significance. He was sure it should not. His sample size was >400. The test was sensitive to less than a 0.1-sigma shift.
There are sample size implications with virtually every tool. 30 is not a catch-all particularly if you are working with attribute data.
Good luck.

0
#77447

Antero
Participant

Sam:

Can you also send me your e-mail response to the question dealing with a sample size of 30?

0
#77460

Ron
Member

Don't confuse population sampling with process sampling; these are two very different animals. You need population variation, power, etc., when considering population sampling. When process sampling, the purpose of taking 30 samples is to establish these issues with reasonable certainty and develop control limits. These limits remain constant unless significant changes are made to the process. It is common in SS training that the true essence of the mathematics behind these issues is lost. Process sampling also assumes a process that is in statistical control. If it is not, stop, fix the process, then proceed.

0
#77514

Bahram
Participant

Dear Sambuddha,
Could I trouble you to send me the information also?
[email protected]
Thanks
Bahram

0
#77515

Dewayne
Participant

Sambuddha,
I, too, would appreciate your sharing/sending the information on the sample size of 30. Thanks.
[email protected]

0
#77631

NB
Participant

I have also not read all the previous replies on this subject, but from my experience in statistics, the reason why 30 or 31 has always been the magical number is that Student's t distribution approaches the normal z distribution at 30 samples. Hope this helps.
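That convergence is easy to check numerically; a quick sketch assuming SciPy is available:

```python
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)                    # ~1.960
for df in (5, 29, 100):
    print(df, round(t.ppf(0.975, df), 3))   # t critical value shrinks toward z
# by df = 29 the gap to z is already under 0.1
```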
-NB

0
#77649

H.Kirchhausen
Participant

It would be great if you could also send me the information, or an example, of the magic sample size of 30!
[email protected]

0
#77657

sw
Member

Hello Sambuddha,
I'd appreciate it if you could email the 30-sample-size info to me, too.
My email add is: [email protected]

0
#77711

Ged Bryant
Participant

Want to test the magic number 30? Take any group of 30 people; it makes a good party trick. Bet anyone present that two or more of the group will have the same birthday, month and date. With 30 people, the odds of winning are about 70%.
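For the curious, the exact odds can be computed directly (a quick sketch; with 30 people the chance of a shared birthday is about 71%, and it passes 97% at around 50 people):

```python
def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_unique = 1.0
    for k in range(n):
        p_unique *= (days - k) / days   # k-th person avoids all earlier ones
    return 1.0 - p_unique

print(round(p_shared_birthday(30), 3))   # ~0.706 for a group of 30
```

So with 30 people the bet wins about 7 times in 10; odds of 98% need a group in the mid-50s.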

0
#77712

O’Connell
Participant

I did that game in a training class once. It worked!
But how does it work? I’d love to know so I can sound intelligent next time I do it :).

0
#78161

julio jaime
Participant

Sambuddha:
Hi. I am very new to this forum.
Can you send me the information on why a sample size of 30 is required? I too am curious.

Thks.

0
#88659

Allen Jacque
Participant

I am interested in receiving the articles identified and originating from Sambuddha that address the sample-size-of-30 issue.
My email address is [email protected]

0
#88752

Mark Chockalingam
Participant

There are several web references on the Central Limit theorem and 30 that are interesting.  When the sample size approaches 30, we don’t have to worry about the distribution of the population since it can be safely assumed to be normal for inference purposes.  Here are some references:
http://www.mathwizz.com/statistics/help/help4.htm
http://www.statisticalengineering.com/central_limit_theorem.htm
Here is a little more technical article on normal distributions and central limit theorem.
http://www.itl.nist.gov/div898/handbook/index.htm

0
#88755

Thanachai
Member

Mr. Sambuddha, may you please send me the pointers on the sample size of 30? I'm very curious to know.
Thanachai S, [email protected]

0
#88837

Statistician
Member

Mr. Sambuddha,
I am a statistician by profession and as far as I know, sample size is determined by margin of error allowed, the estimate for the population variance, the risk factors (level of confidence, power as a function of the OCC, etc.), and most importantly, the assumed distribution of the population (or the estimable function) in study.
In my own experience, the magic number 30 is being used to approximate the normal distribution using the Central Limit Theorem, as used in regression analysis, factor analysis, etc., but not in sample size determination.
I am also curious about this article. Would you be kind enough to send me a copy, too? Also, would you happen to know/recommend six sigma training centers in the Philippines?
Thanks,
Beryl
[email protected]

0
#90196

Vicki
Member

Hi Sambuddha,
You have been inundated with requests for this information on n=30, I am also a statistician and would really appreciate this information!  Thanks.

0
#90200

PH
Participant

I would also like to receive the information / pointers. Can you email them to me at the address below?
[email protected]
Thanks!
PH

0
#90282

Sinnicks
Participant

I have had the same question arise in the past. Mark L. Crossley wrote a good article titled “Size Matters: How Good Is Your Cpk Really?” located at http://www.qualitydigest.com/may00/html/lastword.html that seems to address your question quite well. When Mr. Crossley’s equations are rearranged you can look at plots of Cpk vs sample size with various lines of constant Cpk and specific confidence intervals. For example, after generating the curves, one is able to directly determine sample size required for a desired Cpk of 2.0 with 90% confidence. In playing with the equations it was interesting to note the confidence level obtained for a 2.0 Cpk using a common sample size of 30.
I hope that helps.

0
#90663

mcintosh
Participant

I would also like to receive the information/pointers about the sample size of 30. My email address is [email protected]
thanks!
Tom

0
#90933

Stella
Member

Hi Sam, add me to the distribution list please! This has gone on for too long a time, hasn't it? Thank you in advance!
[email protected]

0
#93129

Haim
Participant

Dear Sambudda,
I too am interested in the “Why 30?” discussion.  Please e-mail me at:
[email protected]
Thank you,
Haim

0
#93132

Rocky Firth
Member

I would also like to see the information. I can post it to a web location for others as well.

0
#93145

Mikel
Member

I don’t believe 30 came from simulations involving the CLT. Please post some backup to this assertion. Sounds like Dr. Mikel’s proof of the 1.5 shift.
By the way, what advice do you give on choosing sample size when you are interested in reducing sigma instead of moving the mean?

0
#101346

singh
Member

Please let me know why 30

SK

0
#104412

SATTHISH KUMAR
Member

Dear sambuddha
Thank you for your reply to that query. Now I am interested in knowing how variables like slope, linearity, bias, and uncertainty relate to instrument repeatability and reproducibility.
Can you send the same to my mail id [email protected]?

REGARDS

R.L.SATTHISH KUMAR

0
#105409

Simon Wei
Member

I would be VERY interested in the pointers for history or relevance of the 30 to samples

0
#107505

Sankar
Participant

Please email me the info on “why sample size = 30”.

0
#108297

Member

Sambuddha:

Hi. I am very new to this forum.
Can you please send me the pointers or the links that you sent to DT on why a sample size of 30 is required? That’s the same question I have for a long time.
Thanks.
Surya

0
#108298

Member

Sambudha:
I forgot to mention my e-mail in my previous message…please e-mail to the following address…[email protected]
Thank you for sharing.
Surya

0
#108736

Mark Chockalingam
Participant

Surya,
There are several web references on the Central Limit theorem and sample size of 30 that are interesting.  When the sample size approaches 30, we don't have to worry about the distribution of the population since it can be safely assumed to be normal for inference purposes.
Remember for interval estimation, the standard error is computed from a Sampling distribution of the mean.  When the sample size approaches 30, the sampling distribution approaches normality.  Here are some references:
http://www.mathwizz.com/statistics/help/help4.htm
http://www.statisticalengineering.com/central_limit_theorem.htm
Here is a little more technical article on normal distributions and central limit theorem.
http://www.itl.nist.gov/div898/handbook/index.htm
Mark Chockalingam

0
#108739

Robert Butler
Participant

Mark,
The central limit theorem applies to the mean, not to individuals: 30 samples from a lognormal distribution will not suddenly become normal. The distribution of 30 averages of data from a lognormal distribution, however, will be. To this end, the first citation you mentioned (and as quoted below) is in error. The second and third citations, however, are correct. (Note: some of the text from your citations doesn't copy over to the forum page, so I had to rewrite the equation in the first citation.) I also took the liberty of highlighting in order to emphasize the focus on distributions of means and not individuals.
#1 The Central Limit Theorem says that if you have a random sample and the sample size is large enough (usually bigger than 30), then
Z = (sample avg – pop avg)/(s/sqrt(n))
where Z follows the standard Normal distribution with μ = 0 and σ = 1. This comes in really handy when you haven't a clue what the distribution is, or it is a distribution you're not used to working with, like, for instance, the Gamma distribution.

#2 The distribution of an average tends to be Normal, even when the distribution from which the average is computed is decidedly non-Normal.
Thus, the Central Limit theorem is the foundation for many statistical procedures, including Quality Control Charts, because the distribution of the phenomenon under study does not have to be Normal, because its average will be.

#3 The central limit theorem basically states that as the sample size (N) becomes large, the following occur:

The sampling distribution of the mean becomes approximately normal regardless of the distribution of the original variable.
The sampling distribution of the mean is centered at the population mean, μ, of the original variable. In addition, the standard deviation of the sampling distribution of the mean approaches σ/√N.

0
#108749

Mark Chockalingam
Participant

Rob,
Thanks for copying and pasting from the source. However, I submit humbly that it is inappropriate to quote the content without acknowledging the source. I agree it reads easier on one page, but it is still important to insert the name of the source.
Now as to your point on the error, I don't see it. Maybe it is semantics. The CLT is a statement on the sampling distribution of the mean, NOT on the sample or the original population itself. When the sample size approaches 30, the sampling distribution approaches normality regardless of the original distribution.
Now for interval estimation, the big leap that is made in practice is to assume that the sample standard deviation is a sufficient estimate for the population standard deviation. Is this what you disagree with, given that the original citation #1 gives the formula for the standard normal deviate?
Good discussion.
thanks,
Mark

0
#109016

John C
Participant

Hi Sambuddha,
Kindly mail me the same at  [email protected]
Thanks,
John C

0
#111071

Paul C
Participant

I would be grateful if you could send on the background to the sample size of 30.

Thanks ..

PC

0
#111097

SemiMike
Member

Those are great URLs for any beginner to study.
One might use some “rules of thumb” based on practical experience, as well as the more rigorous statistical methods.
For example, if the DATA is to come from a MECHANICAL process for making discrete parts, then one should first try sampling the FAMILIES of possible variation, using a sample size of 2 for each family, per Shainin's recommendations: 2 sites on each of 2 parts, repeated every hour for 2 shifts perhaps, then graphed. Once it is clear which family of variation is the main problem, then SPC sampling (subgroups measured over time) can be used IF the problem is temporal. But if the problem is variation WITHIN the parts, then perhaps closer stratification of data is needed, or measuring more sites per part, or comparing that variation for all similar machines, or looking at tool-wear trends of this “within-part” spread over time. Means and ranges can both drift with tool wear. Sampling for mean data involves the famous sample size of 30 (or 15) for OOC determinations. Sampling for changes in variance requires much larger samples. So a wandering mean is not the same as a wandering variance. Think 1000 parts.
RULE OF THUMB for SPC chart startup is given as 15 to 30, but if the process is non-stationary (drifts, has a wandering mean and unstable variance, for example) then other methods are needed. Box and Luceno's book (Amazon.com) talks at great length about modern issues with process monitoring and adjustment methods. Assessing non-subgrouped data is another issue.
RULE OF THUMB: Individuals charts are less sensitive and less powerful (they give more false alarms for each rule added, for example) than X-bar charts.
RULE OF THUMB: A subgroup size of 2 to 4 is common and usually adequate. But for diagnostic reasons, many engineers use subgroups of 10 to 100 sites per part or parts per subgroup. ASQ had a paper recently on the effect of large subgroup sizes. In general, it invalidates the method used by most people to calculate control limits, as many of the subgroup sites or parts are CORRELATED, and so the control limits would be wrong. What is good for diagnostics is often not good for control, given various control models. With automated gages, more data is cheap. But how you use it depends on whether the data is really INDEPENDENT and RANDOMLY SAMPLED and IDENTICALLY DISTRIBUTED. Most important is that INDEPENDENCE. And if the data is AUTO-CORRELATED, it's also messy (wandering mean, showing predictability instead of randomness).
Then there is the data that comes from CHEMICAL processes, such as continuous refining. Read Svante Wold's or Dr. John McGregor's books on PCA/PLS multivariate methods for sensor-based data, which is a HUGE stream of data. Rule of thumb: get help.
Central Limit Theorem: only for subgrouped data!
Shewhart charts: only for stationary processes where samples are independent! (I am not a statistician, and those guys are still arguing about these issues. See the Journal of Quality Technology, Woodall's papers, for example.)
Don’t forget;  NIST online handbook.  http://www.itl.nist.gov/div898/handbook/

0
#113427

Chelle
Participant

Hello Mr. Sambuddha, can you also send me the article on the 30 sample size that you sent to DT? I am also interested to know: why 30? My email address is [email protected]
Thanks,
Rechel

0
#113430

Kevin Alderson
Participant

Regarding the sample size of 30: it is a reasonable amount to measure / analyse,
with approximately a 10% difference in margin of error between a sample of 30 and one of 500.
Of course it would be better to take 500 for accuracy, but you must take into account the cost difference between 30 and 500, depending on what you are measuring. Keep that roughly 10% margin in mind and you should be fine.
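To put rough numbers on that (a sketch assuming the worst-case proportion p = 0.5 at 95% confidence):

```python
import math

def margin_of_error(n, z=1.96, p=0.5):
    """Worst-case 95% margin of error for an estimated proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (30, 100, 500):
    print(n, round(margin_of_error(n), 3))
# n = 30 gives about +/-0.18, n = 500 about +/-0.04
```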

0
#113457

quality_ab
Participant

Thanks,
AB

0
#113868

Glo
Participant

Hi Mr. Sambudhha,
Could you also share the information about sample size 30 with me? I’m very much interested. Kindly email it to [email protected]
Thanks,
–Glo

0
#113877

DrSeuss
Participant

DT, let me try to answer this from a practical-experience approach.
I have also asked this question and have never received a definitive academic answer.  Here is what I have seen from analyzing real process data.  Take a continuous process that produces data that is normally distributed (near normal is also good enough) and collect your data using a rational-subgroups approach. Use the Minitab Six Sigma Process Report to calculate short-term and long-term process capability.  Look at report #4 or #5; it shows both Sigma ST and Sigma LT on a graph.  Notice how their values stabilize toward a value as the number of subgroups increases.  You will notice a flattening of the curves at about 10 subgroups; then, around 20-25 subgroups, the curves are almost horizontal.  By the time you reach 30 subgroups the sigmas have stabilized, and adding any more subgroups will only change the sigmas in the 4th or later decimal places.  If you are an Excel wizard, you can demonstrate this very easily as well.  The idea is that after about 30 subgroups (30 data points) the variance of the data typically stabilizes.
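The stabilization is easy to see in a simulation (a seeded Python sketch, not the Minitab report itself): track the running standard-deviation estimate as subgroups accumulate.

```python
import random
import statistics

random.seed(7)
data = [random.gauss(100, 5) for _ in range(150)]   # 30 subgroups of 5

# running standard-deviation estimate after each block of subgroups
for k in (2, 10, 20, 30):
    est = statistics.stdev(data[: 5 * k])
    print(k, round(est, 2))   # settles near the true sigma of 5
```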

0
#118555

Leon
Participant

Dear Sambuddha
Could you please send the reference to me? Thanks very much.
[email protected]
Leon

0
#120057

vee
Member

Dear Sambuddha – –
I’m very interested in your projects.
Thank you very much.
Sincerely,
vee

0
#120058

vee
Member

from vee
my email  [email protected]
thanks again

0
#123533

DEEPAK JAIN
Participant

Sambudha:
Please also email me, because for the last few weeks I have been searching for the answer.

D.JAIN
9811564123

0
#123564

Manav
Participant

Sambuddha/DT,
Please send me the information on sample size. I know this msg is 3 years too late, but would appreciate your or anyone’s help in getting this info to me.
Thanks
[email protected]

0
#127028

Ropp
Participant

When the population size is greater than 100, the normality condition is met when the sample size is greater than 30.  Increase sample size depending on the process being studied and the variability of the data produced.  30 is not a “magic” number applicable to all data sets and processes.

0
#127036

Darth
Participant

Dave, might I suggest that you check the dates on any post that you respond to.  This is a really old one.  OK, Nick, how did I do????

0
#131786

Rhex
Member

hi sambuddha,

I know that the forum thread has been going on for quite some time now, and I am not sure if you will receive this message, but I'm requesting and hoping that you will be able to send the reference materials to me as well.

Here’s my email add: [email protected]

0
#132163

Kulanan
Participant

Dear Sambudha
I am searching for the answer about sample size. Please kindly send me the information on why sample size = 30. I have to use this information for my report, and if you have more information please send that too. Thank you very much.
Best regards,
Kulanan ([email protected])

0
#138875

sue
Member

Hi Sambuddha,
I'm keen to know why 30 samples too. Can you send me a copy too?
Email: [email protected]
sue

0
#138882

Heebeegeebee BB
Participant

Sue,
This is a FOUR YEAR OLD thread.

0
#138883

Darth
Participant

Heck, that trumps my measly 18 month one earlier this week.

0
#138884

Mike Carnell
Participant

Heebeegeebee,
…and unfortunately we have not seen Sambuddha post on here for a couple of years.
Regards

0
#138892

Heebeegeebee BB
Participant

Yeah,
Whatever happened to Sambuddah???

0
#139587

Mahesh Kumar S
Participant

Sambuddha:
Hi. I am very new to this forum.
Can you send me the information that you sent to DT on why a sample size of 30 is required ? I too am curious.
[email protected]
Thanks.

0
#139614

Heebeegeebee BB
Participant

Mahesh,
Sambuddha's last post under that nom de plume was in 2002.
It is unlikely that you will get a rise out of a 4-year-old thread.
We are still tied at 4 years, folks!

0
#142636

Tan Li Ren
Member

Dear Sambuddha,
Could you also send me the 30-sample-size information as well? I appreciate it. [email protected]
Best regards, Li Ren

0
#143031

Edmond
Participant

Dear Mr Sambuddha,
I work in the research field and have immense interest in knowing more about the sample size of 30, would you also send me articles and reference materials on this topic by e-mail at:
[email protected]
Best regards,
Edmond

0
#143032

Anonymous
Guest

DT,
I first came across the n = 30 rule of thumb during a lecture by Dorian Shainin (1983). Dorian was brought to Scotland by someone called Ted Williams, who was instrumental in bringing Dorian to Motorola in Phoenix some years before.
According to Dorian, if you plot the error of the estimate of sigma as a function of n, the curve becomes asymptotic at around n = 30, where sigma can be estimated with 95% confidence. As Stan has previously pointed out, Dorian always used 95% confidence.
As Mike Carnell has also noted, typical X-bar and R charts use 30 subgroups of n = 3 or n = 5, which is a sample size of 90 or 150 – a far cry from n = 30.
Another issue lost on many is the use of multiple subgroups, which provide a pessimistic estimate of sigma, since both the data and the subgroup mean vary in small subgroups; so the entropy of multiple subgroups is larger than that of a single subgroup.
No one in their right mind would estimate process capability based on a single subgroup of n = 30.
Regards,
Andy

0
#143034

Hans
Participant

DT,
Avoid all of the complications of interpretations of interpretations and opinions of interpretations and interpretations of opinions of interpretations and review Gosset’s 1908 article in Biometrika: “On the probable error of the mean”. From there you can make your own informed judgement about how other statisticians incorporated and adapted his work into theirs. What is it that they say in lean: Go see for yourself :-). Regards.

0
#147999

Nitesh
Participant

Hi Sambuddha,
Please email me the information on the theory and history behind sample size being 30
[email protected]
Thanks,
Nitesh

0
#148033

Ashman
Member

It is very simple.  Harry picked 30 because it gave 1.5 in his 2003 attempt.  Other numbers give anything between 0 and 50+ for his “correction”.

0
#151969

Shon Stewart
Member

Please forward me information about the history on a sample size of 30 as a rule of thumb.  Your help will be very appreciated.

0
#151970

Participant

n = 25 has a truly statistical justification. At around n = 25 the Central Limit Theorem starts to show a pronounced symmetric/normal distribution of the sample means around the population mean. This normality becomes more pronounced as n is increased.

n = 30 comes from a quote from Student (Gosset) in a 1908 Biometrika paper, “Probable error of a correlation coefficient”. In this paper he reviews the error associated with drawing two independent samples from an infinitely large population and computing their correlation (not the individual errors of each sample relative to the sample mean and the population mean!). The text reviews different corrections to the correlation coefficient given various forms of the joint distribution. In a few sentences, Student says that at n = 30 (which matches his own experience) the correction factors don’t make a big difference. Later, Fisher showed that the sample size for a correlation needs to be determined based on a z-transformation of the correlation, so Student’s argument is only of historical interest. Also, Student published his introduction of the t-test in Biometrika during the same year (his prior article). Historically, the n = 30 discussed in his correlation paper has been confused with the t-test paper, which only tabulated the t-statistic up to sample size 10.

In sum, the n = 30 is a rule of thumb that accidentally works. But ironically the n = 30 for sampling from population was confused with the n = 30 observation from correlations.
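To see why the rule of thumb "accidentally works", here is a small stdlib-Python sketch of my own (not from any of the papers cited): draw repeated samples from a heavily skewed exponential population and watch the skewness of the sample means shrink toward zero (i.e. toward normality) as n approaches 25–30.

```python
import random
import statistics

def skewness_of_means(n, trials=5000, seed=1):
    """Simulate the sample skewness of the distribution of sample means,
    for samples of size n drawn from an exponential population
    (population skewness = 2; the mean of n has skewness ~ 2/sqrt(n))."""
    rng = random.Random(seed)
    means = [statistics.mean(rng.expovariate(1.0) for _ in range(n))
             for _ in range(trials)]
    m = statistics.mean(means)
    s = statistics.stdev(means)
    return statistics.mean(((x - m) / s) ** 3 for x in means)

# Skewness of the sampling distribution drops toward 0 as n grows
for n in (1, 5, 25, 30):
    print(n, round(skewness_of_means(n), 2))
```

For a population this skewed, n = 25 or 30 is enough to make the sampling distribution of the mean look roughly symmetric, which is the practical content of the rule of thumb.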

0
#155817

Pramod Thomas John
Participant

Dear Sambudda,
Could you please mail this information to me (pointers for choosing sample size as 30). I recently had an interview when this question was asked and I drew a blank.
Cheers
Pramod

0
#160732

Phillip
Participant

Hi Sam or anyone has gotten his information about why minimum sample size is 30, can you pls forward it to me?
Thanks,
Phillip
[email protected]

0
#160733

Trev
Member
#161859

aparna
Participant

hi sambuddha,
can u e mail me that presentation as well at [email protected]

0
#162271

Robin
Member

My background is in mathematical statistics, though from several years past, and I am new to Six Sigma. I tried to read through this thread, which is quite complicated, but the actual question does not seem to be answered. I did a Google search on 30/sample size, and this thread appeared to have a good discussion, so I will rephrase the original question and give my opinion. The question is:
1.  Is 30 some magic number that can be used as an adequate sample size for “most” purposes?
I recognize that in practice there are almost always assumptions about the underlying distributions and parameters, but the calculation of power and sample size is well worked out. I can understand the assumption that the distribution of the mean for a sample size of 30 should “look fairly normal for most distributions”, but the power/sample size calculation depends strongly on the underlying variance as well as on other variables that the field does not seem to have defined.
Suppose we assume that with a sample size of 30 the sampling distribution of the mean is normal, with the population mean as its mean and the population variance divided by 30 as its variance. Suppose we further assume that the required power is 0.8 (why not 0.9 or 0.95?) and that the allowable difference between the true mean and the estimated mean is x% of the population standard deviation (again arbitrary). Then, with all these assumptions and the right x%, perhaps a sample size of 30 might arise as reasonable. However, we are usually more concerned with the absolute error between the sample mean and the population mean, which completely negates the possibility of any unique n satisfying an adequate sample size, since population variances have no bounds that I know of. If there is some consensus that we are making all these assumptions, it should be spelled out.
Reading several of the comments, I contend that the number 30 is just a number that has nestled into the literature without any true mathematical/statistical verification. It is small enough to be practical, but it is an arbitrary number without true mathematical significance. What bothers me is that, if we are talking about Six Sigma, accepting 30 as a magic sample size rather than using standard statistical procedures to estimate the proper sample size is anathema to the underlying concept of precision that I assume Six Sigma represents.
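For concreteness, the standard procedure I mean is the n = (z·s/E)² formula quoted in the opening post. As a plain-Python sketch (z is the confidence multiplier, s an assumed population standard deviation, E the tolerated margin of error; the example numbers are illustrative):

```python
import math

def sample_size(z, s, E):
    """Sample size needed to estimate a mean to within margin E,
    at the confidence level implied by z, given population sd s:
    n = (z * s / E) ** 2, rounded up."""
    return math.ceil((z * s / E) ** 2)

# 95% confidence (z = 1.96), sd = 10, margin of error = 2
print(sample_size(1.96, 10, 2))  # -> 97
```

Note that the answer depends entirely on s and E, which is exactly why no single n can serve all situations.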

0
#162275

Historiography
Participant

I posted this response earlier. It is based on a review of Fisher’s early work.
Overall, rules of thumb were heavily introduced into statistics when it became commercialized and therefore entered the engineering field. The rules of thumb regarding estimation of parameters are only one example where classical statisticians gave up and gave way to the more pragmatically oriented statisticians. The history of the magical numbers 22, 25 and 30 is replicated below. But other rules of thumb emerged to make the science more usable.
n = 22 was proposed by Fisher in Statistical Methods for Research Workers, p. 44, when he reviewed how often a deviation exceeds the standard deviation (about once in every three trials). Twice the standard deviation is exceeded about once in 22 trials: “For p-value = 0.05, or 1 in 20, and 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.”
n = 25 has a truly statistical justification. At around n = 25 the Central Limit Theorem starts to show a pronounced symmetric/normal distribution of the sample means around the population mean. This normality becomes more pronounced as n is increased.

n = 30 comes from a quote from Student (Gosset) in a 1908 Biometrika paper, “Probable error of a correlation coefficient”. In this paper he reviews the error associated with drawing two independent samples from an infinitely large population and computing their correlation (not the individual errors of each sample relative to the sample mean and the population mean!). The text reviews different corrections to the correlation coefficient given various forms of the joint distribution. In a few sentences, Student says that at n = 30 (which matches his own experience) the correction factors don’t make a big difference. Later, Fisher showed that the sample size for a correlation needs to be determined based on a z-transformation of the correlation, so Student’s argument is only of historical interest. Also, Student published his introduction of the t-test in Biometrika during the same year (his prior article). Historically, the n = 30 discussed in his correlation paper has been confused with the t-test paper, which only tabulated the t-statistic up to sample size 10.

In sum, the n = 30 is a rule of thumb that accidentally works. But ironically the n = 30 for sampling from population was confused with the n = 30 observation from correlations.

So, to your point: yes, there are historical reasons, but the true reason is the need for statistics to establish itself as a useful field. Now rules of thumb have taken over critical thinking about statistics. Six Sigma accelerated this movement.

0
#162276

Grasshopper
Participant

Grasshopper

0
#162277

Statistician
Member

you’re making progress, you can actually read now. great accomplishment!

0
#162728

Robin
Member

So does the number 30 have significance in the use of a sample for an arbitrary population, or is it just a number that “seems” to work because no one has actually tested it?

0
#162729

Quainoo
Member

Hello,
I would be interested to have a better understanding of how this sample size issue relates to SPC charts.

Usually the sample (subgroup) size on an SPC chart is 5, but my understanding is that the sample size should be determined according to the ‘normality’ of the underlying distribution.
If the underlying distribution is ‘absolutely not normal’, the required sample size might be around 30; if the underlying data are normal, there is no need for subgroups and individual data can be used.
Am I correct?
Thanks

Vincent

0
#171381

Danny Carballo
Participant

Can you also e-mail this attachment?

0
#171403

Tiffany Lian
Member

Hi, Sambuddha:
I am new to this forum, & very curious “why 30”? Could you please also send me the info, thank you very much.
[email protected]
Tiffany Lian

0
#171996

Devie
Participant

Hey Sambuddha, again, I’m a newbie here. Could you please send me the info about the 30 sample size... please... thank you so much. Please send it to [email protected], as I will need it for my final paper. Thanks again!

0
#172011

J
Member

Hi Sambuddha,
I’m new to this forum and was interested to know more about 30pc. Could you send me this project when you get a chance?
Sid

0
#172012

J
Member

Hi sambuddha,
forgot to write my email id, [email protected]
Thanks
Syed

0
#172048

J
Member

If any one in this group has this information please do send it to me…..
Thanks,
[email protected]

0
#172134

BelowTheBelt Certified
Participant

Because the standard error of the mean improves as the sample size increases toward 30.
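To be precise, the standard error of the mean is s/√n, and it keeps improving past 30 – it just improves more slowly. A plain-Python sketch (my own illustration, with an assumed sample sd of 10):

```python
import math

def sem(s, n):
    # Standard error of the mean: sample sd divided by sqrt(n)
    return s / math.sqrt(n)

# The SEM shrinks with n, but with diminishing returns beyond ~30
for n in (5, 10, 30, 100):
    print(n, round(sem(10, n), 2))  # 4.47, 3.16, 1.83, 1.0
```

There is no threshold at 30; halving the SEM always requires quadrupling the sample size.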

0
Viewing 100 posts - 1 through 100 (of 103 total)

The forum ‘General’ is closed to new topics and replies.