# Categorization Sample Size Calculation

Six Sigma – iSixSigma Forums Old Forums General Categorization Sample Size Calculation

Viewing 17 posts - 1 through 17 (of 17 total)
• #33531

lin
Participant

My company receives about 25,000 calls per day and wants to categorize the calls by CALL TYPE. (For example, 55% of the calls are web tech calls, 10% are complaints, 20% are account requests, and 15% are personal.) However, we cannot categorize all 25,000 calls per day and would like to take a monthly sample to estimate the “category type distribution” to within +/- 5%.
How do I calculate the sample size that ensures the sample “category type distribution” is within 5% of the population “category type distribution” (with 95% confidence)?
I have no problem finding a sample size formula for discrete attribute data when there are only two types of occurrences (good/bad). However, I have 11 types of calls, and therefore more than two types of discrete attribute occurrences, and I am having trouble finding an applicable formula.
Any direction to a formula or a table is very much appreciated.
Thanks.  –Bill

#90840

Gabriel
Participant

I never thought of that, but I will give it a try.
In a binomial distribution you have 1 parameter to estimate: p (proportion). The problem is to find a sample size that lets you estimate p with a given error (width of the CI) and with a given confidence. However, in this case you have 11 parameters to estimate: p1, p2, …, p11 (one proportion for each category). So I will replace your “ensure that the sample category type distribution is within 5% of the population category type distribution (w/ 95% CI)” with “ensure that the estimated proportion of each category is within 5% of the actual proportion of that category in the population, with 95% confidence”.
You can treat each category as a binomial distribution (a call is either web tech or not, it is either a complaint or not, and so on), so you will have 11 binomial distributions (i.e. with two types of occurrences), for which you already know how to calculate the sample size. You can calculate the sample size for each category and then take the largest. That category will meet your requirement of precision and confidence, and all the other categories will exceed it.
(I know that there is a “multinomial distribution” that addresses this type of problem, but that’s all I know about it.)
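
The per-category approach above can be sketched as follows. This is only an illustration: it assumes the usual normal-approximation formula n = z^2 * p * (1 - p) / E^2, and the proportions are the four example figures from the original question rather than the real 11 categories. The function name `binomial_sample_size` is made up for the example.

```python
import math
from statistics import NormalDist

def binomial_sample_size(p, margin=0.05, confidence=0.95):
    """Normal-approximation sample size to estimate a proportion p within +/- margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Illustrative expected proportions (the example figures from the question).
expected = {
    "web tech": 0.55,
    "complaints": 0.10,
    "account requests": 0.20,
    "personal": 0.15,
}

# One binomial sample size per category, then keep the largest.
sizes = {name: binomial_sample_size(p) for name, p in expected.items()}
required_n = max(sizes.values())

for name, n in sizes.items():
    print(f"{name}: n = {n}")
print("sample size to use:", required_n)
```

If the expected proportions are unknown, calling `binomial_sample_size(0.5)` gives the most conservative figure under the same assumptions.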

#90843

Mikel
Member

This is correct and use the smallest percentage you expect to calculate the sample size – it will require the largest sample.

#90847

John H.
Participant

Bill
Stratified Random Sampling is often used in situations such as the one you described, i.e. the population is divided into strata (categories) which are as homogeneous as possible, and a random sample is drawn from each stratum. The number of observations taken at random from a given stratum should be proportional to the statistical weight and the standard deviation of the stratum, and inversely proportional to the square root of the cost per observation from the stratum.
If you search for “Stratified Random Sampling” at http://www.google.com you will find a considerable amount of information on this subject.
I hope this helps
John H.
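
As a rough sketch of the allocation rule described above (share proportional to weight times standard deviation, divided by the square root of the unit cost): the strata weights, standard deviations and costs below are hypothetical numbers chosen only to show the arithmetic, and `optimal_allocation` is an illustrative name.

```python
import math

def optimal_allocation(total_n, weights, std_devs, unit_costs):
    """Split total_n across strata in proportion to weight * sd / sqrt(cost)."""
    scores = [w * s / math.sqrt(c)
              for w, s, c in zip(weights, std_devs, unit_costs)]
    total = sum(scores)
    return [total_n * score / total for score in scores]

# Hypothetical strata: population weights, within-stratum standard deviations,
# and cost per observation (not taken from the thread).
weights = [0.5, 0.3, 0.2]
std_devs = [0.50, 0.40, 0.30]
unit_costs = [1.0, 1.0, 4.0]

allocation = optimal_allocation(400, weights, std_devs, unit_costs)
for i, n_h in enumerate(allocation, start=1):
    print(f"stratum {i}: about {n_h:.0f} observations")
```

The function returns fractional shares; rounding them to whole observations is left to the reader.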

#90849

lin
Participant

Thanks John…but I think my situation is a little different from the one you described.  I do not know the strata “distribution” prior to the sample…therefore, I cannot perform stratified random sampling.  I’m actually sampling to determine the strata and their distribution/proportions.
Does this make sense?

#90866

Gabriel
Participant

Stan,
What do you mean? Tell me if the following is wrong:
The binomial distribution has a standard deviation (per observation) of sqrt[p(1-p)], which is at its maximum for p=0.5 (50%). Since the sample size needed for a given confidence and a given CI width increases with the standard deviation, the proportion that requires the largest sample is the one closest to 50%, on either side. In fact, if you calculate the sample size for 50%, any other proportion will give you a narrower CI for the same confidence (or better confidence for the same width of the CI), so using 50% to calculate the sample size is always conservative and, furthermore, not very inefficient unless the actual proportion is close to 0% or 100%. Why? The standard deviation is 0.5 for 50% and 0.3 (not much smaller than 0.5) for both 10% and 90%, so 10% or 90% would not require a sample size much smaller than 50%. But it is about 0.1 (5 times smaller than 0.5) for 1% and 99%, and since the sample size scales with the square of the standard deviation, a sample size calculated for 50% would be about 25 times larger than needed, which is very inefficient.
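
A quick numeric check of this argument, assuming the per-observation standard deviation sqrt(p(1-p)) and a sample size proportional to p(1-p) for a fixed margin and confidence (the helper names are illustrative only):

```python
import math

def bernoulli_sd(p):
    """Standard deviation of a single yes/no observation with proportion p."""
    return math.sqrt(p * (1 - p))

def size_relative_to_half(p):
    """Required sample size at p as a fraction of the size required at p = 0.5."""
    return (p * (1 - p)) / 0.25

for p in (0.5, 0.1, 0.01):
    print(f"p = {p}: sd = {bernoulli_sd(p):.4f}, "
          f"n relative to p = 0.5: {size_relative_to_half(p):.3f}")
```

The ratio stays the same for p and 1 - p, so 10% and 90% (or 1% and 99%) give identical results.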

#90890

Ronald
Participant

Try this white paper and see if it helps.  I browsed through it and thought it might apply. http://www.nawrs.org/ClevelandPDF/chakra2.PDF

#90894

Mikel
Member

You are right – my bad. The largest sample will be required for the expected proportion that is closest to .5.

#90907

John H.
Participant

Bill
Thanks for the clarification. It has been quite a few years since I encountered this type of problem, and if I find a reasonable suggestion/solution I will post it.
-John H.

#90912

Hemanth
Participant

What is the characteristic you wish to measure?
My assumption (based on your post) is that you get 25,000 calls/day and you wish to do a category-wise analysis. But your problem is that you cannot categorise all 25,000 calls (it seems cumbersome) and hence wish to do a sample study.
I would say that if this is the case, then take at least 25% (no specific reason) of your total calls per day and do this study over a period of time (a week or two). Another guideline can be, as suggested previously, to collect data until you have at least 5 calls in all categories..(remember, np = 5..??).

#90927

lin
Participant

Thanks John…I appreciate the help.

#90940

Gabriel
Participant

Hemanth,
Remember the sampling plan design input given by Bill: I want +/-5% with 95% of confidence.
Using 25% could be too much or not enough.
“np=5 at least” is a guideline for using the normal approximation to the binomial. Since the control limits in the p chart are calculated at +/-3 sigmas, as is done for a normal distribution, np=5 is a good guideline for using control charts to detect out-of-control situations. But it will hardly give you a sample size large enough to estimate p from one single sample with the desired error and confidence. That’s why, in the p chart, you calculate pbar (the estimator of the population’s p) only after about 20 samples have been taken (which gives a total sample size of 20*n).
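
To illustrate the point with numbers, here is a small sketch that takes the smallest n satisfying np = 5 and computes the resulting 95% half-width under the normal approximation. The proportions are illustrative and the function names are made up for the example.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile

def n_from_np_rule(p, min_count=5):
    """Smallest sample size for which the expected count n*p reaches min_count."""
    return math.ceil(min_count / p)

def half_width(p, n, z=Z_95):
    """Normal-approximation half-width of the confidence interval for p."""
    return z * math.sqrt(p * (1 - p) / n)

for p in (0.5, 0.1):
    n = n_from_np_rule(p)
    print(f"p = {p}: np = 5 gives n = {n}, "
          f"half-width = +/-{half_width(p, n):.3f}")
```

For these proportions the half-width comes out well above the +/- 0.05 target, which is the reason a single sample sized by np = 5 is not enough for this purpose.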

#90988

Hemanth
Participant

Hi Gabriel
Thanks for refreshing me..you are right and I checked some calculations here..request you to read my next reply to Bill.
Hemanth

#90990

Hemanth
Participant

Hi Bill,
I am attempting to answer the query again; do let me know whether you agree with it. I have to admit that my previous reply may not have been very good, but nevertheless I am attempting again.
I am assuming the same scenario, the only difference being that you wish to have a confidence band of +/- 5% at a 95% confidence level around the proportion of calls for each category. For example: 55% +/- 5% of the calls are web tech calls, with 95% confidence (right..??)
So this is what I tried: I ran the 1 proportion test in Minitab and got the 95% confidence limits for different sample sizes, assuming I get 55% defectives. Shown below are some of the iterations..

Test and Confidence Interval for One Proportion (exact method)
Test of p = 0.5 vs p not = 0.5

| X | N | Sample p | 95.0 % CI | P-Value |
|---|---|---|---|---|
| 11 | 20 | 0.550000 | (0.315278, 0.769422) | 0.824 |
| 28 | 50 | 0.560000 | (0.412544, 0.700093) | 0.480 |
| 55 | 100 | 0.550000 | (0.447280, 0.649680) | 0.368 |
| 110 | 200 | 0.550000 | (0.478249, 0.620246) | 0.179 |
| 220 | 400 | 0.550000 | (0.499779, 0.599475) | 0.051 |

So if you have a sample p value of 0.55, then you would need to collect data on about 400 calls: at N = 400 the 95% CI is roughly 0.55 +/- 0.05.
Hemanth
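
For readers without Minitab, the iterations above can be approximated with the sketch below. It assumes the "Exact" intervals are Clopper-Pearson intervals and solves for the limits by bisection on the binomial CDF using only the standard library; the function names are illustrative.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1, summed via log-pmf terms."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def solve_increasing(f, iterations=50):
    """Bisection root of a function that goes from negative to positive on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, confidence=0.95):
    """Exact (Clopper-Pearson) confidence interval for x successes in n trials."""
    alpha = 1 - confidence
    # Lower limit: the p at which P(X >= x) equals alpha/2 (0 when x == 0).
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    # Upper limit: the p at which P(X <= x) equals alpha/2 (1 when x == n).
    upper = 1.0 if x == n else solve_increasing(
        lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper

# The (X, N) pairs from the iterations in this post.
for x, n in [(11, 20), (28, 50), (55, 100), (110, 200), (220, 400)]:
    lower, upper = clopper_pearson(x, n)
    print(f"X = {x}, N = {n}: ({lower:.6f}, {upper:.6f}), "
          f"half-width about {(upper - lower) / 2:.3f}")
```

The half-width column makes it easy to see at which N the interval first narrows to about +/- 0.05.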

#91002

Cormac
Participant

Bill,
Maybe I am missing something here — first here are the assumptions that I am making about your post:
–  You have one hunt-group and the percentages you guesstimate are produced from a sample of that single hunt group
–  You need to sample each call not knowing what it is and as part of that action you put it in the sample category
If these assumptions are correct then you just need to work out the confidence level for the lowest population category (in your case complaints).  So you work out the sample size for your required confidence level for complaints, assuming they represent 10% of your group. If your guess is wrong and you find another category showing up with a lower percentage during your sampling, then switch and use that instead.  You are then assured that all the others have a better confidence level since they all have a better sample size.
This means that you have way oversampled for the likely categories, but since you don’t know what the call is until you sample, you have no way out of that dilemma.
Alternatively, you could put wrap codes on the phones and just run a complete report! (But then you need to trust your agents’ categorisation.)

#91005

Gabriel
Participant

Please read the messages in this thread from Stan (Oct 9), Gabriel (Oct 10) and Stan (Oct 10).
You are the second one in this thread to say that the largest sample size would be defined by the category with the smallest proportion. I think that’s wrong, as I explain in one of those messages, because the binomial distribution has its largest standard deviation for p=0.5, and the needed sample size increases when the standard deviation increases. So the category closest to 50% will define the sample size.

#91014

Mikel
Member

The confusion, I think, is that at small proportions we are trying to be more precise (difference between 100 and 200 ppm for example). Gabriel is correct, that for a set level of precision, the proportions at around .5 will require the largest samples.
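
The distinction can be made concrete with a small sketch, assuming the normal-approximation formula and a 95% confidence level. An absolute margin is a fixed number of percentage points (for example +/- 0.05), while a relative margin is a fraction of p itself (for example +/- 10% of p); the function names and the chosen proportions are illustrative.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile

def n_absolute(p, margin):
    """Sample size to estimate p within +/- margin (in absolute terms)."""
    return math.ceil(Z_95 ** 2 * p * (1 - p) / margin ** 2)

def n_relative(p, rel):
    """Sample size to estimate p within +/- rel * p (relative to p itself)."""
    return n_absolute(p, rel * p)

for p in (0.5, 0.1, 0.01):
    print(f"p = {p}: fixed +/-0.05 needs n = {n_absolute(p, 0.05)}, "
          f"+/-10% of p needs n = {n_relative(p, 0.10)}")
```

With a fixed absolute margin the requirement peaks at p = 0.5, while with a relative margin it grows as p gets smaller, which matches both sides of the discussion in this thread.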

