Why Is Sample Size N>=30 Sufficient?




    I’m hoping someone can clarify this for me. Maybe I’m not understanding something here, but I keep reading that a “large sample size” is one that is greater than or equal to 30, and that with a sample size of n>=30 you are able to make inferences about the population from the sample. If this is correct and you only need a minimum of 30, then why do we bother taking samples larger than 30? Why did I learn how to calculate sample sizes if I only need 30? Why bother using sample size calculators? It can’t be that simple, so if someone can please explain this to me, I would greatly appreciate it.



    Essentially, the caveat is that the more samples you take, the more closely your results will represent the overall population.

    More always = Better in sampling. The only reason to sample at all is to save time or cost. The risk in sampling is that the data points you pick miss something or give skewed results, which can lead you to inaccurate or bad decisions. Once again, more samples should give you more accurate results.

    30 is essentially just an arbitrary number used to define the minimum. I used to know why 30, but I don’t recall anymore off the top of my head; I am sure you can Google it. It is statistically supported and standard in the Six Sigma guides.


    Robert Butler

    What you have been told is boilerplate to make sure you won’t go wrong with great assurance given that you have little or no understanding of statistical analysis. Unfortunately, the boilerplate is overly restrictive. It also presumes the cost per sample is very low, the effort needed to acquire the sample is minimal and there really isn’t any issue with respect to time needed to get the sample(s).

    The fact remains the minimum number of samples needed to make inferences about a population is 2. With 2 samples you have an estimate of the mean and the standard deviation and you can use those results to test for differences between your sample mean/variation and another sample mean/variation or a target mean/variation.

    If indeed you have a claim which states a sample of 30 or greater MUST be gathered before you can assess your population then that claim is false. The proof is very simple – go to the back of any basic statistics text and look at the t-table – the minimum sample size is 2. The whole point of Gosset’s 1908 effort with respect to the development of the t-distribution and the t-test was to permit accurate assessments of population parameters and differences between populations with as few samples as possible.


    Robert Butler

    To answer the second part of your question about sample size calculators. As I said previously, the minimum number of data points needed to make statements about a population is 2.

    When you add in the question of sample size and sample size calculators you are now asking about the number of samples needed to make statements about population differences with respect to a degree of certainty that

    1. You have seen a statistically significant difference
    2. If you were to go back and run the experiment again you could observe that same statistically significant difference.

    The statistic associated with #1 is alpha, the significance level against which the p-value is compared, and the statistic associated with #2 is beta. The value (1-beta) is called the power and it expresses the probability that you will again reach significance at a given alpha should you repeat the same experiment with the same number of samples, assuming the difference is real.

    Sample size calculators allow you to specify the level of certainty you would like to have that a difference exists (alpha), the degree to which you would like to be certain that you could repeat this finding (1-beta) and it requires that you define the size of a difference you want to detect given your prior definition of alpha and beta.

    The two most common ways of defining a difference are by defining population means and standard deviations or changes in observed percents.
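To make the roles of alpha, power and the size of the difference concrete, here is a minimal sketch of the textbook normal-approximation formula for comparing two means. It is not what SAS or any particular calculator does internally, and the function name `n_per_group` and its defaults are my own assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate samples per group for a two-sided comparison of two means.

    delta: smallest difference in means you want to detect
    sigma: standard deviation assumed common to both populations
    Uses the normal approximation n = 2 * ((z_alpha/2 + z_beta) * sigma / delta)^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = z.inv_cdf(power)           # power = 1 - beta
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Halving the difference you want to detect roughly quadruples the sample size.
for delta in (2.0, 1.0, 0.5):
    print(delta, n_per_group(delta, sigma=1.0))
```

With sigma = 1, detecting a difference of one standard deviation needs about 16 per group, while half a standard deviation needs about 63. The required number depends entirely on what you ask for, which is why no single figure such as 30 can be right in general.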

    If the sample size calculator is any good it will also allow you to determine your power (1-beta) given that you have already run an experiment (thus fixing the sample size and the measures of either means/spreads or percentage changes) and have found a statistically significant difference (alpha at some level of significance which you previously chose to accept).

    The usual procedure, particularly with respect to initial assessments of a problem, is to take however many samples time/money/effort will permit, run the statistical test needed to answer the question concerning the existence of a statistically significant difference (the level of alpha), and then run a post-hoc power test (using a sample size calculator) to determine the odds of repeating the find.

    There are caveats and yeah-but’s too numerous to mention with respect to the above paragraph but two that are worth remembering are:

    1. If you have something that is statistically significant you, as the investigator, must make the judgement call as to whether that difference has any real physical meaning/value. Given the degree of precision of many measurement instruments it is quite easy to get statistically significant results of no consequence. In this situation what your analysis has told you is your measurement system is really good at detecting differences.

    2. If you have something that is statistically not significant but the difference is such that, if it were true, it would have physical meaning/value then running a post-hoc power analysis will tell you the number of samples needed to demonstrate statistical significance with adequate power.

    This situation often arises in medicine due to sample size restrictions driven by time/money/effort. I’ve run numerous studies where the trend was in a physically meaningful direction but the alpha level was not less than the pre-determined cut point and the power was low. In these cases the existing data will be presented along with a request for increased funding to check out the possibility of the existence of something of medical value.



    Mr. Butler,
    Thank you very much for your detailed explanations. I greatly appreciate the time and effort you made to articulate your responses. If I may, I’d like to continue this discussion and request your further reply.

    You stated previously, “The fact remains the minimum number of samples needed to make inferences about a population is 2. With 2 samples you have an estimate of the mean and the standard deviation and you can use those results to test for differences between your sample mean/variation and another sample mean/variation or a target mean/variation.” I’d like to validate my understanding of this, please, using an example? In my line of work, my company does bill review for other companies. I’m currently working on a project now aimed at reducing the number of bills that were reviewed or processed incorrectly (which we’ll refer to as a defect). My team is attempting to measure the proportion of bills that have been processed incorrectly out of a population of bills. Say, in a 6-month period, we processed 10,000 bills. I’m trying to understand how many bills we would need to audit to estimate, with some certainty, the total proportion of bills in the population that were processed incorrectly. Clearly, we cannot audit all 10,000. So, what’s the minimum number we have to audit?

    According to an Excel spreadsheet I have that calculates the confidence intervals for proportions, I would need to audit 385 bills to have a 95% confidence level with a 5% MOE (since we have no previous data to go on, I also entered .5 for both the proportion for successes and failures). Am I to understand from your previous comment that instead of auditing 385, I could audit 2 and the results of those 2 would be sufficient for me to make inferences about the 10,000 in the population?

    If not, would you kindly explain to me where my understanding has gone awry?


    Chris Seider

    Keep in mind the rule of thumb for 30 observations is for continuous data, not proportions.


    Robert Butler

    The problem is the question you are asking in your second post is not the same question you asked in your first post.

    In your first post you essentially asked: how many samples do I need in order to make inferences about a population? The answer, as I stated, is a minimum of 2.

    Example 1: The view from the perspective of Means and Standard Deviations. I have two populations. From the first population my measurements for a particular property are 2 and 3. From the second population my measurements for the same property are 8 and 11.
    Population #1: Mean = 2.5, Standard Deviation = .71.
    Population #2: Mean = 9.5, Standard Deviation = 2.1
    The t-value for the means test (pooled variance, 2 degrees of freedom) is -4.43 and the associated p-value = .047. If a cutoff of p<.05 had been my choice prior to gathering the samples and running the test then I could say the difference in the population means was significant.
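For anyone who wants to check Example 1 by hand, here is a small sketch using only the Python standard library. It assumes the pooled-variance (equal variance) two-sample t-test, which is the version that reproduces the quoted p-value. The helper names are hypothetical, and the p-value shortcut is a closed form that holds only for 2 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))
    return t, df


def two_sided_p_df2(t):
    """Two-sided p-value for a t statistic with exactly 2 degrees of freedom.

    For df = 2 the t CDF is 1/2 + t / (2 * sqrt(t^2 + 2)), so no tables are needed.
    """
    return 1 - abs(t) / sqrt(t * t + 2)


t, df = pooled_t([2, 3], [8, 11])
p = two_sided_p_df2(t)
print(f"t = {t:.2f}, df = {df}, p = {p:.3f}")
```

Under these assumptions the sketch gives t of about -4.43 on 2 degrees of freedom and p of about .047, so two observations per population are enough to clear a p<.05 cutoff when the separation is this large relative to the spread.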

    Example 2: The view from the perspective of percentages of defects: Again I have two populations and for each of the two populations I take two samples and determine if the samples are defective or not. For population #1 I have 2 successes and 0 defects. For population #2 I have 0 successes and 2 defects. If I test these proportions using Fisher’s exact test I get the Fisher statistic of .1667 with a p-value = .33. I have a measure of the proportions defective for the two populations and, with the sample size I have, I cannot say there is any difference in these proportions. (In case you are wondering, you would need a minimum of 4 samples per population with a (4,0), (0,4) split in successes and failures to get statistical significance (P = .03).)
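The Fisher numbers in Example 2 can be checked with a few lines of standard-library Python. This sketch covers only the special case discussed here, a complete (n, 0) versus (0, n) split. In that case the observed table and its mirror image are equally likely and are the two most extreme tables, so the two-sided p-value is twice the probability of the observed table. The function name is hypothetical.

```python
from math import comb


def fisher_extreme_split(n):
    """Fisher exact test for the 2x2 table (n, 0) vs (0, n).

    Returns (probability of the observed table, two-sided p-value).
    With all margins equal to n, the observed table has hypergeometric
    probability 1 / C(2n, n), and its mirror image is equally likely.
    """
    table_prob = 1 / comb(2 * n, n)
    return table_prob, 2 * table_prob


for n in (2, 3, 4):
    prob, p = fisher_extreme_split(n)
    print(f"n = {n}: table probability = {prob:.4f}, two-sided p = {p:.3f}")
```

For n = 2 the table probability is 1/6 (the .1667 quoted above) and p is .33. For n = 3, p is .10, and n = 4 is the first size where p drops below .05.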

    The question you are asking in your second post is not about drawing inferences about a population rather it is about drawing inferences about the detectable difference between measures of two populations predicated on characteristics of those populations.

    That question takes you into the realm of my second post to this thread. Now the question is: What kind of a sample, comprised of successes and failures, must I draw in order to have a specified degree of probability that the sample represents a given level of failure that differs from a null proportion by some amount?

    So the short answer is that the approach you have taken is correct and in that case you will need a sample of 385.

    There is one problem and that is I can’t duplicate your calculation. My guess is I’m not understanding what you are doing.

    The way I read what you have told your program is that you have a null proportion defective of .5 and you want to find the number of samples needed to detect a shift of .05 with an alpha of .05 and a power of .8. If I run that combination I get a sample size (I’m using SAS) of 636 for a one sided test and 803 for a two sided test.

    I’d like to try to duplicate your results so could you re-state what you are doing – in particular why are you saying you are assuming BOTH null proportions are .5 and what does the acronym MOE mean?



    Mr. Butler,

    Allow me to clarify the situation for you, so you’ll better understand what I’m asking and why I’m asking it. My apologies for not being clear previously. To quote Julie Andrews in the Sound of Music, “Let’s start at the very beginning, a very good place to start.”

    I’m coaching a colleague on a project she’s leading. She approached me and said that based on some data she received, she believes there is an issue with bills my company is processing; namely, that bills are being processed incorrectly, which is causing rework for us and our customers. I asked her what the magnitude of the problem is. In other words, we process literally millions of bills annually. How many of these bills that we annually process are defective, based on the specific issue she’s referring to? 100? 1,000? 10,000? She didn’t know. So, I suggested we estimate the proportion of bills that are defective out of the total population of bills we process annually. Since we cannot audit the millions of bills we process annually, we could pull a sample, and based on the proportion of those bills that are defective in the sample, we could create a confidence interval for the population proportion.

    We began by calculating the sample size needed to give us a 95% confidence interval with a 5% margin of error (MOE, that’s the acronym I used previously). I used the Excel spreadsheet attached to calculate this. This is a spreadsheet I found online somewhere (I don’t recall where). Since this is a study we’ve never conducted before, I entered .5 in the proportion of success and failures fields. It showed she needed to pull 385 samples.
    She pulls the samples, audits them, and tells me that out of the 385 audits she conducted, 93 (or 24.16%) met her criteria as a defect. Confidence interval= 19.88% to 28.43%. Boom. Done.
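For reference, both numbers above can be reproduced with the standard normal-approximation formulas. I don't know what the spreadsheet does internally, so treat this as a sketch under that assumption; the function names are my own.

```python
from math import ceil, sqrt
from statistics import NormalDist

# z multiplier for a two-sided 95% confidence level (about 1.96)
Z95 = NormalDist().inv_cdf(0.975)


def n_for_margin(p, moe, z=Z95):
    """Sample size so that the interval half-width is at most `moe`.

    Normal approximation, no finite population correction:
    n = z^2 * p * (1 - p) / moe^2, rounded up.
    """
    return ceil(z * z * p * (1 - p) / moe ** 2)


def wald_interval(defects, n, z=Z95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = defects / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


print(n_for_margin(0.5, 0.05))        # planning value p = .5, MOE = 5%
low, high = wald_interval(93, 385)    # 93 defects found in 385 audits
print(f"{low:.2%} to {high:.2%}")
```

Entering .5 as the planning proportion is the most conservative choice because p(1-p) is largest there, which is why it is the usual default when there is no prior data.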

    A day or so later, I’m having a conversation with one of my managers and I’m telling her about this project. She says to me, “You didn’t need to have her audit 385 samples. That was a waste. She only needed to do 30.” Huh? 30? “Why 30?” I asked. I don’t recall the exact response I got, but whatever it was made no sense to me. But I do vaguely recall learning in “Green Belt” school that all you needed was 30 samples (because it was a “large” sample) to do certain tests. I didn’t remember much about it, so I Googled it to learn more. Needless to say, that only made the problem worse. I searched websites, watched YouTube videos, read white papers, black papers, and every other colored paper you can imagine. While people attempted to explain it, I couldn’t understand it. Some places said the “30 sample” rule is a myth. Others claimed it’s true. Contradictions everywhere. Nowhere could I get a simple, easy-to-understand-in-layman’s-terms explanation of this “30 sample rule”. But I know I’ve heard people say this before now. This was not the first time. I’ve just never questioned it until now.

    Hence, my friend, why I posed the question here. So, again, if 30 is a sufficient number of samples that I can use to create a confidence interval for a population proportion, then why is this calculator telling me I need 385? If 30 is sufficient as I’ve heard people say, why do we have sample size calculations or calculators? And, by the way, when I do this same calculation in Minitab v.17, it tells me I need 402 samples! <banging head on desk>

    This is one site I found. I think the answer is in the first paragraph, but I’m not sure.



    @chebetz 30 is a rule of thumb. Like all rules of thumb it only says something about reasonableness. Your optimum sample size depends on the size of the population and your desired confidence level. You only need 2 samples for statistical analysis. But that may be far too small to give reasonable confidence. For 100% confidence you need to use the entire population, which is usually impractical. We typically choose 95% confidence, which is also a rule of thumb. Sample size formulae (calculators) tell us how large a sample size we need given the population size and desired confidence level. The result may be impractical. If so, you may need to reduce the sample size and the confidence level.


    Robert Butler

    Fair enough – let’s go with Ms. Andrews and start at the beginning.

    *Key points*

    1. The bald statement: “You only need a sample size of X to do certain tests”, without any reference to the context of the problem (that is whatever it is that you are trying to do) is rubbish.

    2. The actual sample size needed for any given effort will be driven by things such as the time needed to obtain a given sample, the cost of that sample, the effort needed to get the sample, the size of the effect you want to observe with a given sample size, the degree of certainty you wish to have with respect to any claims you might want to make concerning the observed effect size, etc.

    3. What constitutes an effect size will depend on the question you are asking. For example:

    a. If the focus is on some continuous population measure and you are interested in how the two populations may differ with respect to that measure then the effect size will often be expressed in terms of the minimum difference in the measurement mean values between the two populations that result in a significant difference with some degree of certainty.
    b. If the focus is on a difference of proportions of the occurrence of something such as a defect count (yes/no) then the effect size will often be expressed as the minimum difference in percentage of occurrence that result in a significant difference with some degree of certainty.
    c. If the focus is on determining the confidence bounds of a measurement of a mean or of a percentage then the effect size will be the degree of confidence associated with that measure and the degree of confidence you wish to have with respect to your assessment of the confidence bounds around that target mean or percentage.

    **The claim that 30 samples is sufficient**

    Let’s turn to your particular situation and see how this holds up.

    In your situation you have the following:

    1. You have millions of bills which are processed annually.
    2. You have no idea of the proportion of these bills which are incorrect.
    3. Your guesstimates range from 100 to 10,000 incorrect billings (yes, I know, it could be more, it could be less, but we need to start somewhere). In other words you believe the proportions of incorrect billings are very small.

    For purposes of estimation let’s assume you process exactly 1,000,000 bills annually and the error rates range from 100 to 10,000 as previously stated. This would mean your guesstimate of defect rate is between .01% and 1%.

    If we take a random sample size of 30 bills, the smallest non-zero defect we can detect would be a single error. This translates into 1/30 = .033 for a defect rate of 3.3%. This, in turn, means the guesstimated defect rates (.01% and 1%) range from over 300 to 3.3 times SMALLER than the smallest possible non-zero defect rate you could detect with 30 samples – in other words – a sample size of 30 will provide 0 information about whether your guesstimates are correct or incorrect.
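Another way to see the same point is to ask how often a random sample of 30 would contain no defects at all. The sketch below assumes independent draws at a fixed defect rate (a simple binomial model, which is my assumption and reasonable when the population is in the millions). The function name is hypothetical.

```python
def prob_zero_defects(rate, n=30):
    """Probability that n independent draws contain no defects at all."""
    return (1 - rate) ** n


smallest_detectable = 1 / 30  # one defect in 30 samples, about 3.3%

for rate in (0.0001, 0.01):   # the guesstimated .01% and 1% defect rates
    ratio = smallest_detectable / rate
    print(f"rate {rate:.2%}: smallest detectable rate is {ratio:.1f}x larger, "
          f"P(no defects in 30) = {prob_zero_defects(rate):.3f}")
```

Under this model a true rate of 1% gives an all-clean sample of 30 about 74% of the time, and at .01% it is over 99%. In most samples of 30 the estimate would simply be zero defects, whichever guesstimate is correct.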

    What the above illustrates is there are many situations where a sample size of 30 is grossly insufficient. In my earlier posts I illustrated situations where a sample size of 30 was gross overkill. Taken together they illustrate that problem context is everything. Without context claims concerning either the necessity or sufficiency of sample sizes of 30, or any other number for that matter, are of no value.

    As to where the 30 sample size estimate came from, my answer as a statistician is I don’t know and I don’t care – given the work I do and have done for many years it is an estimate of no value.

    **What you have done with the online sample size calculator**

    I found a couple of sample size calculators which gave me the same number you generated – 384/385.

    The one I used can be found here:

    What you have done is the following:
    You are assuming you have an error rate of 50%. You told the machine you wanted to be 95% certain that this estimate is good to within +- 5%. As you can see this particular online site only allows a maximum population size of 100,000 and, as you can also see, the sample size estimate is in agreement with what you stated earlier.

    If you want a sample size estimate to test to see if your error rate is .01% and if you want that to be precise within 5% then the numbers you would enter in the above online calculator would be an estimated true proportion of .0001 and a desired precision of .000005 (5% of .0001). For these values the sample size would be 99,354 – basically your entire population of 100,000.

    For the case of 1% the values would be .01 and .0005 respectively and the sample size for a population of 100,000 would be 60,337.
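These figures are consistent with the usual normal-approximation sample size combined with a finite population correction. I can't confirm that the online calculator uses exactly this formula, so the sketch below is an assumption that happens to reproduce its output; the function name is my own.

```python
from math import ceil
from statistics import NormalDist

# z multiplier for a two-sided 95% confidence level (about 1.96)
Z95 = NormalDist().inv_cdf(0.975)


def n_finite(p, precision, population, z=Z95):
    """Sample size for estimating a proportion in a finite population.

    n0 is the infinite-population size z^2 * p * (1 - p) / precision^2;
    the finite population correction then shrinks it to
    n0 / (1 + (n0 - 1) / population), rounded up.
    """
    n0 = z * z * p * (1 - p) / precision ** 2
    return ceil(n0 / (1 + (n0 - 1) / population))


print(n_finite(0.0001, 0.000005, 100_000))  # .01% error rate, precise to 5% of that
print(n_finite(0.01, 0.0005, 100_000))      # 1% error rate, precise to 5% of that
```

The uncorrected n0 for the .01% case is over 15 million, far more than the population itself, which is why the corrected answer is nearly every bill. The precision you ask for relative to the size of the proportion drives the sample size, not any fixed rule.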

    The fact of large sample sizes for small defect estimates with a high degree of precision is to be expected and it was numbers of this magnitude I was getting when I tried to generate sample sizes based on my understanding of your earlier posts.

    I know very little about billing procedures and billing record keeping but I find it hard to believe that someone, somewhere in your organization doesn’t keep track of incorrect billing as well as number of bills processed. If not the exact numbers then possibly other numbers that could stand in as a surrogate for defects and totals and could be used to generate an estimate of proportion defective.

    Regardless of what you might be able to use for generating an estimate of percent defective just calculating that estimate would give you a sense of the magnitude of the problem. Depending on what you found I would think the next question one should ask is this: Given our crude estimate of an error rate – what is the cost of that error rate to our business and is there any benefit with respect to dollars to the bottom line to try to further reduce the error rate?

    If it turns out your error rate is very low and reducing it further would really be of benefit then you will have to explain the sample size situation to everyone and, in that case, I hope what I have provided will be of some value.
