iSixSigma

Is sampling necessary?


Viewing 45 posts - 1 through 45 (of 45 total)
  • Author
    Posts
  • #40615

    ritu
    Member

    Hi All,
    In a situation where you can get data for the process easily (system generated), is it still necessary to sample? If so, what are the disadvantages of analyzing the entire population?
    Ritu

    0
    #126455

    indresh
    Participant

    Hmm, well of course there is no harm.
    Sampling is done to reach approximately the same findings, with approximately the same accuracy, as the entire data set would have produced, but in less time and in the most cost-effective manner.
    When you take the entire data set, handling can become difficult because of constraints in the data analysis software you are using.
    Rgds

    0
    #126457

    Mev
    Participant

    I guess if you can collect data easily, you might as well analyze everything without sampling. This will give you a correct analysis, whereas sampling can go wrong if not done properly.
    Cheers,
    Mev

    0
    #126458

    Helen
    Participant

    On a similar subject, I would like to know whether it is still necessary to prove there are differences in central tendency or variation with hypothesis tests if you have analysed all the data rather than a sample. For example, if I have a month’s worth of transactions, any differences in variation are true for that month, but do they need to be tested to assume that this will be representative of a year?

    0
    #126463

    BTDT
    Participant

    Everyone:
    The population is all instances of a process: yesterday, today, tomorrow, and forever until the end of time. You NEVER have the population. Whatever data you have gathered is a sample, even if it is a really big sample.
    When you have all the data for the transactions for last month, then you can say with no doubt what the average, standard deviation, etc. were for that month. This is not very useful for making any kind of prediction, and the prediction is what you should be interested in.
    I think what people are really asking is, “Is there an advantage to having a small sample size if the cost of collecting the data is very cheap?”
    Collect a sample as large and as representative of your process as possible. If data collection is expensive, then use parameters from your project goal statement to calculate the sample size, using the ‘sample size’ tools appropriate for each statistical test.
    Cheers, BTDT

    0
    #126471

    thevillageidiot
    Member

    Time and cost are the downsides of larger samples. If data is easily obtained, more is always better. You want to ensure two things when sampling:

    Representation – In Six Sigma this can be accomplished via rational subgrouping over a length of time, although any valid method is fine.
    Minimum sample size – This is determined by the type of inferential test you plan to use (t test, ANOVA, FF, etc.), the sample variation, the confidence level, how precise you want to be (i.e., the minimum difference you want to detect), and the level of significance and power you want to have.
    Validate this advice with a third party or a text, as I am new at this myself.
    If you have MTB, it makes this quite simple; a sketch of the same calculation follows below.
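    As an illustration of the kind of sample-size calculation Minitab performs, here is a minimal sketch in Python using statsmodels; the minimum difference, standard deviation, alpha, and power values are illustrative assumptions, not numbers from this thread.

```python
# Sample-size calculation for a two-sample t test.
# The minimum difference, standard deviation, alpha, and power below
# are illustrative assumptions, not values from this thread.
from statsmodels.stats.power import TTestIndPower

min_difference = 5.0   # smallest shift worth detecting (process units)
std_dev = 10.0         # estimated process standard deviation
effect_size = min_difference / std_dev  # shift expressed in sigmas

n_per_group = TTestIndPower().solve_power(effect_size=effect_size,
                                          alpha=0.05, power=0.90)
print(f"Need about {n_per_group:.0f} observations per group")
```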

    0
    #126472

    “Ken”
    Participant

    BTDT,
    Great response…  Agree with all, and would like to add a layer of paint–hopefully, not too thick!
    A representative sample should not be confused with a large sample. A representative sample addresses the frame under which the sample is taken, to ensure the sample accurately reflects the population; the size addresses the desired level of confidence, to ensure estimation precision.
    The reason I highlight the difference is the tendency of some experienced and inexperienced members of the forum to casually suggest the two are somehow related. This is not the case.
    A simple example should suffice: suppose you are measuring a mixture known to be homogeneous. This could be a mixture of chemicals, pharmaceuticals, or waste water with impurities; it could also be a solid mixture of plastic resin. Assuming a thorough mixture, how many samples are required to measure the characteristics of the mixture, be it concentration, potency, or melt index? Typically, the answer is one sample. Why?
    What would we observe if we evaluated repeated measures of a homogeneous mixture? 
    Ken

    0
    #126473

    Jim H
    Participant

    Not correct – the population is the items contained within the defined boundaries of interest. The population of Michigan as of 9/6/2005 is a defined population that can be sampled or counted in total. A population can be all the items loaded on a truck: you can weigh every box in that population, or you can sample it.

    0
    #126475

    Mike Carnell
    Participant

    TVI,
    I don’t agree with the idea that more is always better, particularly if you are involved in something like a t test or ANOVA. The sample size affects the sensitivity of the test: large samples will cause the test to show a significant difference for small changes. That may or may not be what you want.
    You are correct Minitab will make this easy.
    Just my opinion.
    Good Luck

    0
    #126476

    thevillageidiot
    Member

    I should have known I was being careless when I unleashed the dreaded “always”….Thanks for the correction Mike.

    0
    #126477

    thevillageidiot
    Member

    With complete homogeneity, one sample is representative of the population.

    0
    #126478

    Ritz
    Member

    Ritu,
    I do not believe it is necessary to sample from data where you are confident that you have appropriately characterized the population.  Since the researcher defines the population set, it is more important to understand the boundaries inherent in the population definition, and the subsequent risks to making inferences beyond the defined boundaries.  You must also consider how the boundaries you have defined can change over time. 
    While the theoretical population would include all past, present, and future observations, the practical researcher acknowledges these limitations and adjusts their risks as appropriate for the situation.  This is one reason why we state that we “fail to reject Ho” instead of accepting Ho, and why we bound the acceptance of Ha with a confidence interval.
    As Carnell mentions, be careful how you test for differences when using large samples. Results must also be interpreted carefully in order to draw the correct conclusions.
    Hope this helps.
    Ritz

    0
    #126480

    Mike Carnell
    Participant

    TVI,
    Sorry, I don’t mean to be picky on this; I was just concerned someone would be running a t test with a couple hundred samples.
    Regards

    0
    #126482

    HF Chris
    Participant

    Mike,
    You have confused me. In my experience, large samples tend to give more power when homogeneity and heterogeneity assumptions have been met. However, depending on the sample spread, small changes can become masked in a larger sample; it then takes larger changes to indicate a difference. Of course, this still does not eliminate the fact that most people do not understand the difference between a representative sample and a population.
    Chris

    0
    #126484

    Mike Carnell
    Participant

    HFC,
    If you look at a sample size chart, the ability to detect smaller differences increases with sample size. I don’t have my chart with me, but if I use a sample size of around 300 in a t test, the test becomes sensitive to a shift of less than 1 sigma. You can force significance in the test by increasing the sample size.
    Regards
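    To make Mike's point concrete, here is a minimal sketch in Python using statsmodels that solves for the smallest detectable shift at each sample size; the alpha of 0.05 and power of 0.90 are assumptions, since the chart he refers to is not specified.

```python
# Smallest detectable shift (in sigmas) for a two-sample t test
# as sample size grows. Alpha = 0.05 and power = 0.90 are assumed;
# the sample size chart in the thread may use different values.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
for n in (10, 30, 100, 300):
    shift = power_calc.solve_power(nobs1=n, alpha=0.05, power=0.90)
    print(f"n = {n:4d} per group -> detectable shift ~ {shift:.2f} sigma")
```

    Under these assumptions, n = 300 per group detects a shift of roughly a quarter sigma, which is how a large enough sample can force statistical significance for a practically trivial change.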

    0
    #126485

    thevillageidiot
    Member

    No, when you’re right, you’re right. Thanks again for the correction; nothing worse than giving bad advice.

    0
    #126486

    Mikel
    Member

    You can get 1 sigma with a sample of 17 (a rule of thumb used by Air Academy). With 300, I’m pretty sure you can get 1 RCH.
    Rule of thumb with large samples: it’s not statistical significance you are looking for anymore, it’s practical significance.

    0
    #126487

    Mikel
    Member

    There are many things worse than giving bad advice. Giving much more information than what is asked for is one.
    BTW, I like your name, but your responses don’t match. I like your contribution.

    0
    #126488

    HF Chris
    Participant

    Mike,
    Correct me if I’m wrong, but an increase in sample size increases the power for a given difference. As sample size increases, smaller differences gain more power, not the ability to detect the differences. The ability to detect differences can vary no matter what the sample size.
    Chris

    0
    #126498

    ritu
    Member

    Thanks for the responses, but I think I got lost somewhere. My problem is not sample size.
    Let me give some context. I am trying to analyze two months’ worth of cycle-time data for a transaction process. I have complete data for these two months. No changes were made to the process during this period, but performance went down in the second month. I am trying to check whether this happened just by ‘chance’.
    As per BTDT’s response, ‘When you have all the data for the transactions for last month, then you can say with no doubt what the average, standard deviation, etc. was for that month. This is not very useful for making any kind of prediction. The prediction is what you should be interested in.’ Now, I have the average, SD, etc. for the data, but can’t I predict anything? Do I still need to sample from these two months? (Note: there are over 100 data points per day.)
    Another doubt: regarding Helen’s question, one month’s data cannot represent a year’s population. Would it suffice to take a sample from the past three months’ data?
    Please advise. Thanks again.
    Ritu

    0
    #126506

    HF Chris
    Participant

    Ritu,
    Do you understand the process you are observing? Have you placed both sets of data on the same run chart? Is this common-cause variation or special-cause? You are stuck in descriptive statistics and have not entered inferential statistics. Just looking at a change is just that: looking at a change. Is it operator variability, did your demand change, etc.? The sampling question/deviation/detour is important for you, because you have just realized that, as with flipping a coin, you never have all the responses.
    Chris

    0
    #126514

    Ken Feldman
    Participant

    “…but the performance has gone down in the second month. I am trying to check if this has happened just by ‘chance’.”
    Almost sounds like you want to do a two-sample t test to see whether the decline from month one to month two is really a significant change or just “by chance”. Compare the two averages and standard deviations to see if there is any difference. If you have the complete months’ worth of data, this should be easy enough.
    You can only predict to the year if you have evidence that the process is stable. And you can only predict within some confidence interval, certainly not with a point estimate.
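    A minimal sketch of that two-sample comparison in Python with scipy; the file and column names are hypothetical placeholders, and Welch’s unequal-variance form is used to avoid assuming the two months have equal spread.

```python
# Two-sample (Welch) t test comparing cycle times for two months.
# The file and column names are hypothetical placeholders.
import pandas as pd
from scipy import stats

data = pd.read_csv("cycle_times.csv")  # columns: month, cycle_time
month1 = data.loc[data["month"] == 1, "cycle_time"]
month2 = data.loc[data["month"] == 2, "cycle_time"]

t_stat, p_value = stats.ttest_ind(month1, month2, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p suggests the month-to-month difference is unlikely to be
# chance, but with thousands of points check practical significance too.
```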

    0
    #126517

    ritu
    Member

    Yes Darth, I was intending to do a two-sample t test. So I can go ahead with the entire months’ data and need not sample.
    Thanks for your response.
    Ritu

    0
    #126522

    mZ
    Participant

    I agree with BTDT.
    “Population” is a concept so many people get confused about without realizing it. If half of the population on this board truly understood its meaning, many of the problems you see here would be resolved easily.
    mz

    0
    #126528

    thevillageidiot
    Member

    High praise indeed… Thanks for sharing your expertise and acerbic wit in the forum… I think Joe is still in the fetal position on his therapist’s couch… well played.

    0
    #126533

    BTDT
    Participant

    Ritu:
    Now I see your question. It is less about population versus samples and more about the effect of sample size on random versus systematic samples. Larger samples will not correct for a systematic bias.
    Put yourself in the place of one of your stakeholders; let us call him/her ‘Pat.’ Pat is widely described as a “not very nice word.” Pat always tries to put holes in any argument if it looks like it might not agree with the Pat-centric view of the universe. You can even do a brainstorming session with the defect defined as, “You know what your problem is with this data set, don’t you? (Pat never pauses after asking questions.) You haven’t taken into account the…” and list all the possible factors.
    Every project SHOULD have someone like this on the team. A way to translate this statement is, “You may have failed to take into account the systematic effect of…” The idea is that your dataset is no longer ‘typical’ and anything you say about it will not be representative of the problem at hand; it will be biased. This is a serious problem and cannot be corrected with an increase in sample size. It will derail your project at the Analyze report-out.
    During the Measure phase, I would like you to complete your list of possible factors (Xs) with your team and make sure that all factors are represented in the sample you have. Whether a sample has half as many data points has much less of an effect than neglecting all international orders. When you are in the final stages of your data collection plan, take it to Pat and ask, “Do you think there is anything we have left out?” You are not asking whether the data is easily collected or available, but ensuring proactive buy-in for your data. Be wary of any hint that you are going on a witch hunt. Include all those ‘unusual or special orders’ and ‘exceptions’; they are usually the reason for the project in the first place. Discuss this at length BEFORE you do any analysis.
    Now let’s go back to your data collection. There may be a Pat on your team who may feel that the portion of the data you are not collecting is going to bias the results in some way. If the half set of data you are planning to collect is only from the regular day crew and leaves out the night shift, then she may have a valid point. You have a choice:
    1) collect all the data, or
    2) ensure half of your data is from the day shift and the other half is from the night shift.
    It may be easier to just collect all of it. The confidence intervals on the parameters will be narrower by about 30%, but more importantly, you have prevented the effect of a possible bias and a lack of confidence in your analysis.
    You suspect that there may be a difference from one month to the next. When you do your run chart of the data (always the first step), you may see a trend or a sudden monthly shift. The latter could be tested using any one of a number of two-sample tests. Robert Butler always advises plotting your raw data as a first step; sage advice. These tests will tell you whether the difference you see is ‘by chance,’ i.e., random. If it is NOT random, then dig into the process until you can find what may have caused it. This will also be a potential X that should be incorporated into your ongoing data collection plan.
    You are at Measure, and your set of data must be able to show the baseline performance of the process. For example, it should be collected over a long enough period of time to satisfy the stakeholders that you are not reacting to an end-of-year effect. The data should be collected over enough different values of the potential vital Xs that it can be considered representative of the baseline business process. It should be rich enough in the variety of Xs that the data can be subgrouped for subsequent analysis.
    With respect to Helen’s question about data from one month versus three months: if you can satisfy me and everyone else on the team that there is no change from month to month, then one month is OK. If you think that ‘Pat’ is going to immediately say, “August was a weird month because a lot of people were on holiday,” then this will not do.
    Knowing what I know about seasonal effects, I would be more satisfied with a random sample of data from three months than all the data from one month. It also covers at least one complete quarterly business cycle.
    Another problem that comes up with sampling is how to sample infrequent events. Assume your company’s orders are roughly 2% international and 98% domestic. In order to see the difference between the cycle times for the two groups, I would like a sample of about 50:50 international:domestic. In order to assess the average cycle time for an order, I would like to see a random selection of all orders and expect to see about 2:98 international:domestic.
    Before you go off and do a sample size calculation, go to your goal statement and see how much of an improvement in the Y you are required to be able to detect. If you are improving the credit risk of consumer loans, a fraction of a percentage point can be significant. If you are improving the cycle time of those same consumer loans, a 90% decrease in cycle time should be expected.
    You may split your data collection plan into two phases:
    1) Collect a fairly small sample (50-100 continuous, 1000-1500 discrete) of only the Y data to do a stability check (run chart) and process capability calculation. You may even do this at the tail end of Define to show the magnitude of the problem. You may be able to combine this with a Gauge R&R study.
    2) A larger, planned data collection of Y and Xs for use in analysis during Measure. The number of data points in each potential subgroup should be large enough to be able to do statistical tests (ANOVA, etc.).
    Hope this helps; the reply became longer to address issues from others following your initial post.
    BTDT
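    As a minimal sketch of that “plot your raw data first” step in Python with pandas and matplotlib; the file and column names are hypothetical placeholders.

```python
# Run chart: plot the raw cycle-time data in time order and look for
# trends or a sudden monthly shift before any formal testing.
# The file and column names are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("cycle_times.csv", parse_dates=["date"]).sort_values("date")

plt.plot(data["date"], data["cycle_time"], marker=".", linewidth=0.8)
plt.axhline(data["cycle_time"].median(), color="red", label="median")
plt.xlabel("Date")
plt.ylabel("Cycle time")
plt.title("Run chart of raw data")
plt.legend()
plt.show()
```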

    0
    #126570

    ritu
    Member

    Thanks a ton, BTDT. It was a great response.
    I have done a run chart to check whether there were any special causes for a sudden drift in performance. Also, as an operations subject matter expert, I do not see any major changes in the process that would impact performance to this extent. There are about 2,800 data points for each month, and the problem I am working on is the dip in accuracy levels in the second month compared to the first. The average accuracy level is usually 97%, and this month we see that it has gone down to 92%. This level is also unacceptable to the clients.
    Can you suggest any specific tool to understand what went wrong? I am not doing this as a GB or BB project; it is a business analysis for senior management. If I say at this point that there is no apparent special cause for the drift, how long do I need to watch to confirm it is common cause? The process is about 18 months old, and this is the first time our accuracy levels have gone so low.
    Thanks again for your help.
    Ritu

    0
    #126576

    jimmie65
    Participant

    Mike –
    I’m not sure I understand. Do you have an example of when a smaller sample size would be better for this reason?
    I may be too focused on the processes I’m involved in. But for example, I have a machine running a fairly stable process. I want to test the effects of a new material on cycle time, part weight and thickness, etc. I’ll have to limit my sample size due to expense, of course. But other than that, wouldn’t I want as large a sample size as possible for my t test or ANOVA?

    0
    #126579

    Ken Feldman
    Participant

    Ritu, I did a quick two-proportion test and the difference is significant. Certainly you are in a better position to assess why this occurred than anyone on the Forum. Could it be that one-in-a-thousand random event? Has there been any change in pattern since you last collected the data? Have you attempted to use a multi-vari chart to see if you can identify a cause? I assume you are tracking daily and can continue plotting on a daily basis. Give it a couple of weeks to see if the old accuracy starts to return. Or you can just tell your client “Sxxt Happens”.
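    For reference, a minimal sketch of that quick two-proportion test in Python with statsmodels, using the figures from the thread (about 2,800 transactions per month, with accuracy dropping from 97% to 92%):

```python
# Two-proportion z test: is the drop from 97% to 92% accuracy
# (about 2,800 transactions each month) more than chance?
from statsmodels.stats.proportion import proportions_ztest

accurate = [int(0.97 * 2800), int(0.92 * 2800)]  # accurate transactions
totals = [2800, 2800]

z_stat, p_value = proportions_ztest(accurate, totals)
print(f"z = {z_stat:.2f}, p = {p_value:.2g}")
# With samples this large, the five-point drop is overwhelmingly
# significant, matching Darth's quick check.
```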

    0
    #126580

    sharma
    Participant

    Dear BTDT…..
    Let’s make it simple, so a layman can understand. Talking like a consultant only adds a flavor of confusion to the forum. You have written so much that your sentences act as a lurking factor, suppressing the essence of the whole story.
    Now folks, population and sample. Let’s talk.
    The population is like God: you can’t see him, and the more you search, the more there still is to find.
    So what do we do? We build temples, and assume that God lies in the temple. We also believe that the bigger the temple, the nearer we are to God.
    Here we can relate the temple to the sample: if we can characterize the sample, we can predict the population (temple —> GOD).
    Note: a big temple corresponds to a large sample size.
    And just as, when you go to a temple, you always validate that the temple is really true, in the same way you validate the sample, by checking its stability and normality, before making other assumptions.

    0
    #126581

    sharma
    Participant

    I’ll get back to this; right now I’m going.

    0
    #126583

    Mikel
    Member

    “The population is like God: you can’t see him”????
    You need to switch to a better brand of tequila or mushrooms – right, Darth?

    0
    #126585

    Ken Feldman
    Participant

    Whew!!!  I thought it was just me.  Boy, that explanation made Ken’s seem understandable :-).

    0
    #126589

    HF Chris
    Participant

    With a bigger temple, you get more power for smaller differences.

    0
    #126592

    ritu
    Member

    Multi-vari chart!!! That strikes a thought. Let me do that before I tell our client “Sxxt Happens”. I will keep you posted if I get some inference from this.
    Thanks Darth :-)
    Ritu

    0
    #126601

    Ken Feldman
    Participant

    Now aren’t we sorry we taught Chris how to copy and paste pics as well?  Love the work you do.

    0
    #126603

    HF Chris
    Participant

    Darth,
    Thank you. Now that you like my temple, come in and see if you agree with this last post. I never got an answer: https://www.isixsigma.com/forum/showmessage.asp?messageID=78959

    0
    #126605

    Ken Feldman
    Participant

    Not sure what you are driving at here. Power is linked to the probability of making a Type II error; in other words, power is the probability of making the correct decision when the null should be rejected. In a sense, sample size affects the confidence interval around whatever we are trying to make a statement about. If we increase the sample size, we narrow the confidence interval and thus make the analysis more sensitive, or more “powerful” if you want. So we can say that we have more “power” to see differences if they exist. I’m not sure how what you are saying differs from the classic definition of power as the ability to spot a difference should one exist.

    0
    #126606

    AB
    Participant

    Anu – How wonderfully creative… BTDT’s explanation was so unclear, and you illustrated such a complex concept in such lucid terms. Way to go… BTW, have you been nominated for the Nobel Prize before? What the heck, I just remembered you can be nominated multiple times, so yes, please nominate yourself, and may God be with you (the entire population, not just a sample).
    Thanks
    AB

    0
    #126608

    HF Chris
    Participant

    Darth,
    This was a follow-up to a statement that smaller differences can be detected with larger samples. I think someone was looking at a sample size chart and misunderstood what it was really saying in reference to power differences.
    Chris

    0
    #126642

    anu anurag
    Participant

    Yep, I’ll meet you some day.

    0
    #126696

    Mike Carnell
    Participant

    HFC,
    I am not sure what you think the power of the test is all about, but as Darth points out, it is about decision making. When you set up a hypothesis test, there is a point where you need to decide how much of a difference it takes before there is a significant difference. That decision is normally made in terms of sigma, i.e., how much the mean must shift before the test declares a difference. As my sample size increases, the test will declare a difference with less of a shift; i.e., with an alpha of 5% and a beta of 10%, I can test for a 3 sigma shift with 4 samples, but if I want to test for a 1 sigma shift I need 23.
    From the Minitab help menu:
    “A prospective study is used before collecting data to consider design sensitivity. You want to be sure that you have enough power to detect differences (effects) that you have determined to be important. For example, you can increase the design sensitivity by increasing the sample size or by taking measures to decrease the error variance.”
    This basically means that you don’t just go gather data, slam it into Minitab, and decide what happened (which is what most do). You start with some data, you understand the process standard deviation, you figure out how far the mean needs to shift before it makes a difference, divide that difference by the standard deviation, and now you understand, in terms of sigma, how far it needs to shift.
    If we go back to the original post, the only point I was making was that if you load something like a t test up with a large number of samples, everything looks significant only because you have driven the sensitivity of the test up, not because there is an actual difference.
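    Mike’s figures can be checked with a quick prospective power calculation; here is a minimal sketch in Python using statsmodels, which should approximately reproduce the 4- and 23-sample numbers at an alpha of 5% and a beta of 10%.

```python
# Sample size per group for a two-sample t test at alpha = 0.05
# and power = 0.90 (beta = 0.10), for 3 sigma and 1 sigma shifts.
import math
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
for shift in (3.0, 1.0):  # shift expressed in sigmas (effect size)
    n = power_calc.solve_power(effect_size=shift, alpha=0.05, power=0.90)
    print(f"{shift} sigma shift -> {math.ceil(n)} samples per group")
# Expect roughly 4 and 23, in line with the figures quoted above.
```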
     

    0
    #127096

    Mike Carnell
    Participant

    jimmie65,
    Sorry for not picking this up sooner; last week was a little frantic.
    The short answer is no. Basically, you will decide whether there was a significant change based on the results of either the t test or the ANOVA. Given a large enough sample size, the mean could shift as little as 0.1 sigma and the test would tell you there was a significant difference. You need to understand how far the mean needs to move in order for it to have created a practical difference. See the sketch below.
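    To illustrate the statistical-versus-practical point, here is a minimal simulation sketch in Python; the 0.1 sigma shift and the sample size are illustrative assumptions.

```python
# With a huge sample, a practically trivial 0.1 sigma shift still
# shows up as "statistically significant" in a t test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
baseline = rng.normal(loc=100.0, scale=10.0, size=5000)
shifted = rng.normal(loc=101.0, scale=10.0, size=5000)  # 0.1 sigma shift

t_stat, p_value = stats.ttest_ind(baseline, shifted)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p is typically below 0.05
# Statistically significant, yet a 1-unit shift on a scale of 100
# may be of no practical consequence.
```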
    If you email me at [email protected] I can send you something that may help.
    Regards

    0
    #127101

    Ben Royal
    Participant

    Sir,
    Would it be possible for you to post your response to jimmie65 as an article on this site? This concept is usually overlooked in Green Belt and Black Belt training.
    Thank you

    0
    #127129

    Mike Carnell
    Participant

    Ben,
    Posting an article isn’t quite as easy as it may seem. There are several months’ worth of queued-up articles, and the magazine is even tougher.
    Send me an email at [email protected] and I will send you what I sent Jimmie.
    Regards

    0

The forum ‘General’ is closed to new topics and replies.