iSixSigma

Categorization Sample Size Calculation


This topic contains 16 replies, has 7 voices, and was last updated by  Mikel 15 years, 9 months ago.

Viewing 17 posts - 1 through 17 (of 17 total)
  • #33531

    lin
    Participant

    My company receives about 25,000 calls per day and wants to categorize the calls by CALL TYPE. (For example, 55% of the calls are web tech calls, 10% are complaints, 20% are account requests, and 15% are personal.) However, we cannot categorize all 25,000 calls per day and would like to take a monthly sample to gain an estimated understanding (+/- 5% precision) of the “category type distribution”.
    How do I calculate the sample size to ensure that the sample “category type distribution” is within 5% of the population “category type distribution” (with a 95% CI)?
    I am not having a problem finding a sample size formula for discrete-attribute data provided there are only two types of occurrences (good/bad). However, I have 11 types of calls and therefore more than two types of discrete-attribute occurrences. I am having trouble finding an applicable formula.
    Any direction to a formula or a table is very much appreciated. 
    Thanks.  –Bill

    #90840

    Gabriel
    Participant

    I never thought of that, but I will make a try.
    In a binomial distribution you have 1 parameter to estimate: p (the proportion). The problem is to find a sample size that lets you estimate p with a given error (width of the CI) and with a given confidence. However, in this case you have 11 parameters to estimate: p1, p2, …, p11 (one proportion for each category). So I will replace your “ensure that the sample category type distribution is within 5% of the population category type distribution (w/ 95% CI)” with “ensure that the estimated proportion of each category is within 5% of the actual proportion of that category in the population, with 95% confidence”.
    You can treat each category as a binomial distribution (a call is either web tech or not, it is either a complaint or not, and so on), and you will have 11 binomial distributions (i.e. with two types of occurrences) for which you already know how to calculate the sample size. You can calculate the sample size for each category and then take the largest. That category will meet your requirement of precision and confidence, and all the other categories will exceed it.
    (I know that there is a “multinomial distribution” that addresses this type of problem, but that’s all I know about it.)
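
    A minimal sketch of the per-category calculation described above. It assumes the usual normal-approximation formula n = z² p(1-p) / E² (not stated in the thread), uses the four example percentages from the original post as guessed proportions, and ignores any finite-population correction. The names `sample_size` and `expected` are illustrative only.

```python
import math

def sample_size(p, margin=0.05, z=1.96):
    """Normal-approximation sample size to estimate one proportion p
    within +/- margin at the confidence level implied by z (1.96 ~ 95%)."""
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))

# Guessed proportions taken from the example in the original post.
expected = {
    "web tech": 0.55,
    "complaints": 0.10,
    "account requests": 0.20,
    "personal": 0.15,
}

# One binomial problem per category ("this type" vs. "not this type").
sizes = {name: sample_size(p) for name, p in expected.items()}

# The overall sample must satisfy the most demanding category.
required = max(sizes.values())

for name, n in sizes.items():
    print(f"{name:<17} n = {n}")
print(f"required sample size = {required}")
```

    If the proportions are not known in advance, calling `sample_size(0.5)` gives a single figure that is large enough for any category.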

    #90843

    Mikel
    Member

    This is correct and use the smallest percentage you expect to calculate the sample size – it will require the largest sample.

    #90847

    John H.
    Participant

    Bill
    Stratified Random Sampling is often used in situations such as the one you described, i.e. the population is divided into strata (categories) that are as homogeneous as possible, and a random sample is drawn from each stratum. The number of observations taken at random from a given stratum should be proportional to the statistical weight and the standard deviation of the stratum, and inversely proportional to the square root of the cost per observation from the stratum.
     If you enter Stratified Random Sampling for the search in http://www.google.com you will find a considerable amount of information on this subject.
    I hope this helps
    John H.
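
    The allocation rule in this post can be written as a formula. This is a sketch of the standard cost-weighted (optimal) allocation; the symbols are notation introduced here, not taken from the post: n is the total sample size, and for stratum h, W_h is its weight (share of the population), S_h its standard deviation, and c_h the cost per observation.

```latex
n_h \;=\; n \cdot
  \frac{W_h S_h / \sqrt{c_h}}
       {\sum_{k} W_k S_k / \sqrt{c_k}}
```

    When the cost per observation is the same in every stratum, the c terms cancel and the allocation is proportional to W_h S_h alone.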
     

    #90849

    lin
    Participant

    Thanks John…but I think my situation is a little different from the one you described. I do not know the strata “distribution” prior to the sample…therefore, I cannot perform stratified random sampling. I’m actually sampling to determine the strata and their distribution/proportions.
    Does this make sense?

    #90866

    Gabriel
    Participant

    Stan,
     What do you mean? Tell me if the following is wrong:
    The binomial distribution has a per-observation standard deviation of sqrt[p(1-p)], which is at its maximum for p=0.5 (50%). Since the sample size needed for a given confidence and a given CI width increases with the standard deviation, the proportion that requires the largest sample is the one closest to 50%, on either side. In fact, if you calculate the sample size for 50%, any other proportion will give you a narrower CI for the same confidence (or better confidence for the same width of the CI), so using 50% to calculate the sample size is always conservative and, furthermore, not very inefficient unless the actual proportion is close to 0% or 100%. Why? The standard deviation is 0.5 for 50% and 0.3 (not much smaller than 0.5) for both 10% and 90%, so 10% or 90% would not require a sample size much smaller than 50% would. But it is about 0.1 (5 times smaller than 0.5) for 1% and 99%, and since the required sample size grows with the square of the standard deviation, a sample size calculated for 50% would be about 25 times larger than needed, which is very inefficient.
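
    The argument above can be made concrete with the usual normal-approximation sample size formula. The formula and the symbols z (normal quantile, 1.96 for 95%) and E (half-width of the CI, 0.05 here) are assumptions introduced for this sketch; they are not written out in the thread.

```latex
n(p) = \frac{z^{2}\, p(1-p)}{E^{2}},
\qquad
\frac{d}{dp}\bigl[p(1-p)\bigr] = 1 - 2p = 0 \;\Rightarrow\; p = \tfrac{1}{2}

n\!\left(\tfrac{1}{2}\right) = \frac{1.96^{2} \times 0.25}{0.05^{2}} = 384.16
\;\Rightarrow\; n = 385

\frac{n(p)}{n(1/2)} = 4p(1-p):
\quad 0.36 \text{ at } p = 0.10,
\quad 0.0396 \text{ at } p = 0.01
```

    So a sample sized for 50% is conservative for every category, and it only becomes badly oversized when the true proportion is within a few percent of 0% or 100%.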

    #90890

    Ronald
    Participant

    Try this white paper and see if it helps.  I browsed through it and thought it might apply. http://www.nawrs.org/ClevelandPDF/chakra2.PDF

    #90894

    Mikel
    Member

    You are right – my bad. The largest sample will be required for the expected proportion that is closest to .5.

    #90907

    John H.
    Participant

    Bill
    Thanks for the clarification. It has been quite a few years since I encountered this type of problem, and if I find a reasonable suggestion/solution I will post it.
    -John H.  

    #90912

    Hemanth
    Participant

    What is the characteristic you wish to measure?
    My assumption (based on your post) is that you get 25,000 calls/day and you wish to do a category-wise analysis. But your problem is that you cannot categorise all 25,000 calls (it seems cumbersome) and hence wish to do a sample study.
    I would say that if this is the case, then take at least 25% (no specific reason) of your total calls per day and do this study over a period of time (a week or two). Another guideline, as suggested previously, is to collect data until you have at least 5 calls in all categories (remember np = 5?).
    Hope this was helpful.
     

    #90927

    lin
    Participant

    Thanks John…I appreciate the help.

    #90940

    Gabriel
    Participant

    Hemanth,
    Remember the sampling plan design input given by Bill: I want +/-5% with 95% of confidence.
    Using 25% could be too much or not enough.
    “np=5 at least” is a guideline for using the normal approximation to the binomial. Since the control limits in the p chart are calculated at +/-3 sigmas, as is done for a normal distribution, np=5 is a good guideline for using control charts to detect out-of-control situations. But it will hardly give you a sample size large enough to estimate p within one single sample with the desired error and confidence. That’s why, in the p chart, you calculate pbar (the estimator of the population’s p) only after about 20 samples have been taken (which would be a total sample size of 20*n).
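
    A small sketch of why np=5 falls short of the +/-5% target. It assumes the normal-approximation margin of error z*sqrt(p(1-p)/n) and uses the 10% complaints category from the original post as the example; `margin_of_error` is an illustrative helper, not something from the thread.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation CI for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)

p = 0.10       # example category: complaints at about 10%
n_rule = 50    # smallest n with n * p >= 5 when p = 0.10

moe_single = margin_of_error(p, n_rule)        # one sample of size n
moe_pooled = margin_of_error(p, 20 * n_rule)   # 20 samples pooled, as for pbar

print(f"one sample of {n_rule}:  +/- {moe_single:.3f}")
print(f"20 samples pooled:  +/- {moe_pooled:.3f}")
```

    Under these assumptions a single np=5 sample gives a margin wider than 0.05, while the pooled 20 samples bring it well inside.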

    #90988

    Hemanth
    Participant

    Hi Gabriel
    Thanks for refreshing my memory. You are right, and I checked some calculations here. Please read my next reply to Bill.
    Hemanth

    #90990

    Hemanth
    Participant

    Hi Bill,
    I am attempting to answer the query again; do let me know whether you agree with it. I have to admit that my previous reply may not have been very good, but nevertheless I am attempting again.
    I am assuming the same scenario, the only difference being that you wish to have a confidence band of +/- 5%, at 95% confidence, around the proportion of calls for each category. For example: 55% +/- 5% of the calls are web tech calls, with 95% confidence (right?).
    So this is what I tried: I ran the 1 Proportion test in Minitab and got the 95% confidence limits for different sample sizes, assuming I get 55% defectives. Shown below are some of the iterations.
    Test and Confidence Interval for One Proportion
    Test of p = 0.5 vs p not = 0.5 (Exact), one row per run

      X     N    Sample p    95.0 % CI               P-Value
     11    20    0.550000    (0.315278, 0.769422)    0.824
     28    50    0.560000    (0.412544, 0.700093)    0.480
     55   100    0.550000    (0.447280, 0.649680)    0.368
    110   200    0.550000    (0.478249, 0.620246)    0.179
    220   400    0.550000    (0.499779, 0.599475)    0.051

    So if you have a sample proportion of 0.55, you would need to collect data on about 400 calls to get a 95% CI of roughly +/- 5%.
    I don’t know if it answers your query, but it is an attempt. I look forward to your reply.
    Hemanth
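
    For readers without Minitab, the same kind of exact interval can be sketched with the Python standard library. This assumes Minitab's “Exact” method is the Clopper-Pearson interval; the functions `binom_cdf`, `bisect_increasing` and `exact_ci` are illustrative helpers written for this sketch, not part of any library.

```python
import math

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), with 0 < p < 1, summed via log terms."""
    if x < 0:
        return 0.0
    if x >= n:
        return 1.0
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for k in range(x + 1):
        log_coef = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        total += math.exp(log_coef + k * log_p + (n - k) * log_q)
    return min(total, 1.0)

def bisect_increasing(f, iters=60):
    """Root of an increasing function f on (0, 1), found by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def exact_ci(x, n, conf=0.95):
    """Clopper-Pearson interval for x successes in n trials."""
    half_alpha = (1.0 - conf) / 2.0
    # Lower limit: the p at which P(X >= x) equals alpha/2.
    lower = 0.0 if x == 0 else bisect_increasing(
        lambda p: (1.0 - binom_cdf(x - 1, n, p)) - half_alpha)
    # Upper limit: the p at which P(X <= x) equals alpha/2.
    upper = 1.0 if x == n else bisect_increasing(
        lambda p: half_alpha - binom_cdf(x, n, p))
    return lower, upper

# Same (X, N) pairs as the runs quoted in the post.
for x, n in [(11, 20), (28, 50), (55, 100), (110, 200), (220, 400)]:
    lower, upper = exact_ci(x, n)
    half_width = (upper - lower) / 2.0
    print(f"{x:>3}/{n:<3}  CI = ({lower:.6f}, {upper:.6f})  half-width = {half_width:.4f}")
```

    The half-width column makes it easy to see at which sample size the interval first shrinks to about +/- 5%.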
     
     
     
     

    #91002

    Cormac
    Participant

    Bill,
    Maybe I am missing something here — first here are the assumptions that I am making about your post:
    –  You have one hunt-group and the percentages you guesstimate are produced from a sample of that single hunt group
    –  You need to sample each call not knowing what it is and as part of that action you put it in the sample category
    If these assumptions are correct then you just need to work out the confidence level for the lowest population category (in your case complaints).  So you work out the sample size for your required confidence level for complaints assuming they represent 10% of your group — If your guess is wrong and you find another category showing up with a lower percentage during your sampling, then switch and use that instead.  You are then assured that all the others have a better confidence level since they all have a better sample size.
    This means that you have way oversampled for the likely categories, but since you don’t know what the call is until you sample, you have no way out of that dilemma.
    Alternatively…. You could put wrap codes on the phones and just run a complete report!! (but then you need to trust your agents categorisation)

    #91005

    Gabriel
    Participant

    Please read in this thread the messages from Stan (Oct 9), Gabriel (Oct 10) and Stan (Oct 10).
    You are the second one in this thread to say that the largest sample size would be defined by the category with the smallest proportion. I think that is wrong, as I explain in one of those messages, because the binomial distribution has its largest standard deviation at p=0.5, and the needed sample size increases when the standard deviation increases. So the category closest to 50% will define the sample size.

    #91014

    Mikel
    Member

    The confusion, I think, is that at small proportions we are trying to be more precise (difference between 100 and 200 ppm for example). Gabriel is correct, that for a set level of precision, the proportions at around .5 will require the largest samples.
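
    The distinction can be sketched numerically. The code below assumes the normal-approximation formula and contrasts a fixed absolute margin (+/- 5 percentage points, as in the original question) with a relative margin (+/- 10% of p, a figure chosen here only for illustration). The names `n_absolute` and `n_relative` are hypothetical helpers.

```python
import math

Z = 1.96  # normal quantile for about 95% confidence

def n_absolute(p, margin=0.05):
    """Sample size for p +/- margin (margin in absolute proportion units)."""
    return math.ceil(Z * Z * p * (1.0 - p) / (margin * margin))

def n_relative(p, rel=0.10):
    """Sample size for p +/- rel*p (margin expressed as a fraction of p)."""
    return math.ceil(Z * Z * (1.0 - p) / (rel * rel * p))

for p in (0.01, 0.10, 0.50):
    print(f"p = {p:.2f}  absolute: {n_absolute(p):>5}  relative: {n_relative(p):>6}")
```

    With a fixed absolute margin the requirement peaks at p = 0.5; with a relative margin it keeps growing as p gets smaller, which is the situation with ppm-level defect rates.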

