# Categorization Sample Size Calculation


This topic contains 16 replies, has 7 voices, and was last updated by Mikel 16 years, 1 month ago.

- October 9, 2003 at 4:05 pm #33531
My company receives about 25,000 calls per day and wants to categorize the calls by CALL TYPE. (For example, 55% of the calls are web tech calls, 10% are complaints, 20% are account requests, and 15% are personal.) However, we cannot categorize all 25,000 calls per day and would like to take a monthly sample to estimate the “category type distribution” with +/- 5% precision.

How do I calculate the sample size to ensure that the sample “category type distribution” is within 5% of the population “category type distribution” (with a 95% CI)?

I have no problem finding a sample size formula for discrete attribute data when there are only two types of occurrences (good/bad). However, I have 11 types of calls and therefore more than two types of occurrences, and I am having trouble finding an applicable formula.

Any direction to a formula or a table is very much appreciated.

Thanks. –Bill

October 9, 2003 at 10:03 pm #90840

Gabriel, Participant (@Gabriel)

I never thought of that, but I will give it a try.

In a binomial distribution you have one parameter to estimate: p (the proportion). The problem is to find a sample size that lets you estimate p with a given error (width of the CI) and a given confidence. However, in this case you have 11 parameters to estimate: p1, p2, …, p11 (one proportion for each category). So I will replace your “ensure that the sample category type distribution is within 5% of the population category type distribution (w/ 95% CI)” with “ensure that the estimated proportion of each category is within 5% of the actual proportion of that category in the population, with 95% confidence”.

You can treat each parameter as a binomial distribution (a call is either a web tech call or not, it is either a complaint or not, and so on), and you will have 11 binomial distributions (i.e. with two types of occurrences) for which you already know how to calculate the sample size. You can calculate the sample size for each category and then take the largest. That category will meet your requirement of precision and confidence, and all the other categories will exceed it.
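As a rough sketch of the "one binomial per category, then take the largest" idea, the snippet below uses the usual normal-approximation formula n = z² p(1-p) / E². The function name, the category labels, and the use of only the four example proportions from the original question are assumptions for illustration. The confidence level applies to each category separately, not to all 11 intervals at once.

```python
import math
from statistics import NormalDist


def binomial_sample_size(p, margin=0.05, confidence=0.95):
    """Smallest n for which the normal-approximation CI half-width is <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Example proportions quoted in the question (only 4 of the 11 call types).
expected = {
    "web tech": 0.55,
    "complaints": 0.10,
    "account requests": 0.20,
    "personal": 0.15,
}

# One binomial calculation per category, then keep the largest requirement.
per_category = {name: binomial_sample_size(p) for name, p in expected.items()}
required = max(per_category.values())

for name, n in per_category.items():
    print(f"{name:17s} n = {n}")
print("sample size to use:", required)
```

If the expected proportions are unknown, calling `binomial_sample_size(0.5)` gives the conservative worst case.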

(I know that there is a “multinomial distribution” that addresses this type of problem, but that’s all I know about it.)

October 10, 2003 at 12:03 am #90843

This is correct, and use the smallest percentage you expect to calculate the sample size – it will require the largest sample.

October 10, 2003 at 2:22 am #90847

John H., Participant (@John-H.)

Bill

Stratified Random Sampling is often used in situations such as the one you described, i.e. the population is divided into strata (categories) which are as homogeneous as possible, and a random sample is drawn from each stratum. The number of observations taken at random from a given stratum should be proportional to the statistical weight and the standard deviation of the stratum, and inversely proportional to the square root of the cost per observation from the stratum.
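The allocation rule described here (observations proportional to weight times standard deviation, divided by the square root of cost) could be sketched as follows. The strata names, weights, standard deviations, and costs are made-up numbers, and the function name is hypothetical.

```python
import math


def optimum_allocation(total_n, strata):
    """Split total_n across strata in proportion to W * S / sqrt(cost).

    strata maps a stratum name to (weight, standard deviation, cost per observation).
    """
    scores = {
        name: weight * sd / math.sqrt(cost)
        for name, (weight, sd, cost) in strata.items()
    }
    total_score = sum(scores.values())
    return {name: total_n * score / total_score for name, score in scores.items()}


# Hypothetical strata: (weight, standard deviation, cost per observation)
strata = {
    "A": (0.5, 2.0, 1.0),
    "B": (0.3, 4.0, 4.0),
    "C": (0.2, 1.0, 1.0),
}

allocation = optimum_allocation(400, strata)
for name, n in allocation.items():
    print(f"stratum {name}: about {n:.1f} observations")
```

The results are fractional and would need rounding in practice. The rule also assumes that the stratum weights and standard deviations are known before sampling.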

If you enter Stratified Random Sampling for the search in http://www.google.com you will find a considerable amount of information on this subject.

I hope this helps

John H.

October 10, 2003 at 2:54 am #90849

Thanks John… but I think my situation is a little different from the one you described. I do not know the strata “distribution” prior to the sample, and therefore I cannot perform stratified random sampling. I’m actually sampling to determine the strata and their distribution/proportions.

Does this make sense?

October 10, 2003 at 1:25 pm #90866

Gabriel, Participant (@Gabriel)

Stan,

What do you mean? Tell me if the following is wrong:

The binomial distribution (for a single trial) has a standard deviation of sqrt[p(1-p)], which has its maximum at p=0.5 (50%). Since the sample size needed for a given confidence and a given CI width increases with the standard deviation, the proportion that requires the largest sample is the one closest to 50%, on either side. In fact, if you calculate the sample size for 50%, any other proportion will give you a narrower CI for the same confidence (or better confidence for the same width of the CI), so using 50% to calculate the sample size is always conservative and, furthermore, not very inefficient unless the actual proportion is close to 0% or 100%. Why? The standard deviation of the binomial distribution is 0.5 for 50% and 0.3 (not that much smaller than 0.5) for both 10% and 90%, meaning that 10% or 90% would not require a sample size much smaller than 50%. But it is about 0.1 (5 times smaller than 0.5) for 1% and 99%, and because the sample size scales with the square of the standard deviation, using a sample size calculated for 50% would then be very inefficient.

October 10, 2003 at 8:07 pm #90890

Try this white paper and see if it helps. I browsed through it and thought it might apply. http://www.nawrs.org/ClevelandPDF/chakra2.PDF

October 10, 2003 at 8:23 pm #90894

You are right – my bad. The largest sample will be required for the expected proportion that is closest to 0.5.
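The standard deviation argument in this exchange is easy to check numerically. The sketch below assumes the single-trial standard deviation sqrt(p(1-p)) and that, for a fixed margin and confidence, the required sample size is proportional to its square. The helper name is illustrative only.

```python
import math


def binomial_sd(p):
    """Standard deviation of a single Bernoulli trial with success probability p."""
    return math.sqrt(p * (1 - p))


reference = binomial_sd(0.5)  # the maximum, reached at p = 0.5

for p in (0.50, 0.10, 0.90, 0.01, 0.99):
    sd = binomial_sd(p)
    # For a fixed margin and confidence, n is proportional to sd squared.
    relative_n = (sd / reference) ** 2
    print(f"p = {p:.2f}   sd = {sd:.4f}   n relative to p = 0.5: {relative_n:.3f}")
```

The standard deviation is 0.5 at 50%, 0.3 at 10% or 90%, and roughly 0.1 at 1% or 99%, so the sample size calculated for 50% only becomes badly oversized when the true proportion is near 0% or 100%.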

October 11, 2003 at 12:44 am #90907

John H., Participant (@John-H.)

Bill

Thanks for the clarification. It has been quite a few years since I encountered this type of problem, and if I find a reasonable suggestion/solution I will post it.

-John H.

October 11, 2003 at 8:03 am #90912

Hemanth, Participant (@Hemanth)

What is the characteristic you wish to measure?

My assumption (based on your post) is that you get 25,000 calls/day and you wish to do a category-wise analysis. But your problem is that you cannot categorise all 25,000 calls (it seems cumbersome) and hence wish to do a sample study.

I would say that if this is the case, then take at least 25% (no specific reason) of your total calls per day and do this study over a period of time (a week or two). Another guideline, as suggested previously, is to collect data until you have at least 5 calls in all categories (remember, np = 5?).

Hope this was helpful.

October 12, 2003 at 11:36 pm #90927

Thanks John… I appreciate the help.

October 13, 2003 at 11:28 am #90940

Gabriel, Participant (@Gabriel)

Hemanth,

Remember the sampling plan design input given by Bill: he wants +/-5% with 95% confidence.

Using 25% could be too much or not enough.
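To put numbers on the two rules of thumb raised in the previous post (25% of a day's calls, and np = 5), here is a small sketch that computes the normal-approximation margin of error for a given sample size. The function name is an assumption, and the 10% proportion used for the np = 5 case is borrowed from the example figures in the original question.

```python
import math
from statistics import NormalDist


def margin_of_error(n, p=0.5, confidence=0.95):
    """Half-width of the normal-approximation CI for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)


daily_calls = 25_000
quarter_sample = daily_calls // 4  # 25% of one day's calls = 6,250

# Worst case p = 0.5 with 6,250 calls: far tighter than +/-5%.
moe_quarter = margin_of_error(quarter_sample)

# np = 5 with p = 0.10 means n = 50: wider than +/-5%.
moe_np5 = margin_of_error(50, p=0.10)

print(f"n = {quarter_sample}: +/-{moe_quarter:.3f}")
print(f"n = 50 (np = 5 at p = 0.10): +/-{moe_np5:.3f}")
```

Under these assumptions the 25% sample is much larger than the stated precision requires, while the np = 5 sample is too small for it.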

“np=5 at least” is a guideline for using the normal approximation to the binomial. Since the control limits in the p chart are calculated at +/-3 sigmas, as is done for a normal distribution, np=5 is a good guideline for using control charts to detect out-of-control situations. But it will hardly give you a sample size large enough to estimate p within one single sample with the desired error and confidence. That’s why, in the p chart, you calculate pbar (the estimator of the population’s p) only after about 20 samples have been taken (which would be a total sample size of 20*n).

October 14, 2003 at 12:29 pm #90988

Hemanth, Participant (@Hemanth)

Hi Gabriel

Thanks for the refresher. You are right, and I checked some calculations here. Please read my next reply to Bill.

Hemanth

October 14, 2003 at 12:53 pm #90990

Hemanth, Participant (@Hemanth)

Hi Bill,

I am attempting to answer the query again; do let me know whether you agree with it or not. I have to admit that my previous reply may not have been very good, but nevertheless I am trying again.

I am assuming the same scenario, the only difference being that you wish to have a confidence band of +/- 5% at the 95% confidence level around the proportion of calls for each category. For example: 55% +/- 5% of the calls are web tech calls at the 95% confidence level (right?).

So this is what I tried: I ran a 1 proportion test in Minitab and got the 95% confidence limits for different sample sizes, assuming I get 55% defectives. Some of the iterations are shown below.

Test and Confidence Interval for One Proportion

Test of p = 0.5 vs p not = 0.5

Exact

| X | N | Sample p | 95.0 % CI | P-Value |
|---|---|---|---|---|
| 11 | 20 | 0.550000 | (0.315278, 0.769422) | 0.824 |
| 28 | 50 | 0.560000 | (0.412544, 0.700093) | 0.480 |
| 55 | 100 | 0.550000 | (0.447280, 0.649680) | 0.368 |
| 110 | 200 | 0.550000 | (0.478249, 0.620246) | 0.179 |
| 220 | 400 | 0.550000 | (0.499779, 0.599475) | 0.051 |

So if you have a sample p of 0.55, you would need to collect data on about 400 calls to get a 95% CI of roughly +/- 5%.
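For readers without Minitab, the same kind of iteration can be sketched with the standard library. This assumes that the "Exact" interval in the output is the Clopper-Pearson interval, and it finds the limits by bisection on the binomial CDF. All function names here are illustrative.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    total = 0.0
    for i in range(k + 1):
        # Work in log space so large n does not overflow.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def solve(f, target, increasing):
    """Bisection on (0, 1) for f(p) == target, where f is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, confidence=0.95):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    alpha = 1 - confidence
    # Lower limit: P(X >= x | p) = alpha / 2, which increases with p.
    lower = 0.0 if x == 0 else solve(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2, increasing=True)
    # Upper limit: P(X <= x | p) = alpha / 2, which decreases with p.
    upper = 1.0 if x == n else solve(
        lambda p: binom_cdf(x, n, p), alpha / 2, increasing=False)
    return lower, upper


for x, n in [(11, 20), (28, 50), (55, 100), (110, 200), (220, 400)]:
    low, high = clopper_pearson(x, n)
    print(f"X = {x:3d}  N = {n:3d}  95% CI = ({low:.6f}, {high:.6f})  "
          f"half-width = {(high - low) / 2:.4f}")
```

Increasing N until the half-width drops to about 0.05 reproduces the trial-and-error search described in this post.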

I don’t know if this answers your query, but it is an attempt. I look forward to your reply.

Hemanth

October 14, 2003 at 2:31 pm #91002

Bill,

Maybe I am missing something here. First, here are the assumptions that I am making about your post:

– You have one hunt-group and the percentages you guesstimate are produced from a sample of that single hunt group

– You need to sample each call not knowing what it is and as part of that action you put it in the sample category

If these assumptions are correct, then you just need to work out the confidence level for the lowest population category (in your case, complaints). So you work out the sample size for your required confidence level for complaints, assuming they represent 10% of your group. If your guess is wrong and you find another category showing up with a lower percentage during your sampling, then switch and use that instead. You are then assured that all the others have a better confidence level, since they all have a better sample size.

This means that you have way oversampled for the likely categories, but since you don’t know what the call is until you sample, you have no way out of that dilemma.

Alternatively, you could put wrap codes on the phones and just run a complete report! (But then you need to trust your agents’ categorisation.)

October 14, 2003 at 3:08 pm #91005

Gabriel, Participant (@Gabriel)

Please read in this thread the messages from Stan (Oct 9), Gabriel (Oct 10) and Stan (Oct 10).

You are the second one in this thread to say that the largest sample size would be defined by the category with the smallest proportion. I think that’s wrong, as I explain in one of those messages, because the binomial distribution has its largest standard deviation at p=0.5, and the needed sample size increases when the standard deviation increases. So the category closest to 50% will define the sample size.

October 14, 2003 at 3:52 pm #91014

The confusion, I think, is that at small proportions we are trying to be more precise (difference between 100 and 200 ppm for example). Gabriel is correct that, for a set level of precision, the proportions around 0.5 will require the largest samples.
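One way to see both sides of this disagreement is to compare a fixed absolute margin with a margin that shrinks along with the proportion. The sketch below assumes that "more precise at small proportions" means a relative margin (for example +/-10% of p), which is an interpretation rather than something stated in the thread. It uses the normal approximation throughout, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # two-sided 95% confidence


def n_absolute(p, margin):
    """Sample size for a fixed absolute margin, e.g. +/-0.05."""
    return math.ceil(Z ** 2 * p * (1 - p) / margin ** 2)


def n_relative(p, rel):
    """Sample size when the margin is a fraction of p, e.g. +/-10% of p."""
    return n_absolute(p, rel * p)


for p in (0.50, 0.10, 0.01, 0.0001):  # 0.0001 is 100 ppm
    print(f"p = {p:<7}  absolute +/-0.05: n = {n_absolute(p, 0.05):>4}   "
          f"relative +/-10% of p: n = {n_relative(p, 0.10):>8}")
```

With a fixed absolute margin the requirement peaks at p = 0.5. With a relative margin it grows as p gets smaller, which may be the source of the "smallest category needs the largest sample" intuition.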

