Central Limit Theorem

Six Sigma – iSixSigma Forums › Old Forums › General › Central Limit Theorem

Viewing 9 posts - 1 through 9 (of 9 total)
  • #53299


    Hello everyone, I have a small question. I have a population (250) for which I know the standard deviation, skewness, kurtosis and mean. In this population there is a certain 'mistake' rate which is unknown to me. I have good reason to assume that this mistake rate has a similar distribution (mean, deviation and kurtosis), which I will test with a modified Jarque-Bera test. I will draw a random sample to obtain the mistake rate and construct a confidence interval.

    1. Since I know the population standard deviation, and I believe that this is the same for the mistakes, do I still need to divide the standard deviation by the square root of n when constructing the confidence interval, or could I just use the population standard deviation?

    2. My second problem is the following. My population is wildly non-normally distributed; however, it has features of the F-distribution. Would it be (very) inappropriate to use this for significance testing, and for constructing an (asymmetric) confidence interval? Concerning the CLT: under the asymptotic assumption, could the sample mean be normally distributed when the population clearly is not?

    If someone could help me out with these questions or refer me to the corresponding literature I would be very grateful. Thanks,
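On the last question, the CLT statement can be checked numerically: even for a heavily skewed population, the sample mean is approximately normally distributed with spread σ/√n, so the division by √n is still needed whether or not σ is known. A minimal sketch, with a simulated lognormal population standing in for the real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavily right-skewed "population"; a lognormal stands in for the real data.
population = rng.lognormal(mean=5.0, sigma=1.0, size=100_000)
sigma = population.std()

n = 50  # sample size
# Draw many random samples and record each sample's mean.
means = rng.choice(population, size=(5_000, n)).mean(axis=1)

# Per the CLT, the sample means cluster around the population mean with
# spread sigma / sqrt(n), even though the population is far from normal.
print(f"population sd      : {sigma:.1f}")
print(f"sd of sample means : {means.std():.1f}")
print(f"sigma / sqrt(n)    : {sigma / np.sqrt(n):.1f}")
```

Note that σ/√n is the spread of the sample mean, not of individual observations; knowing the population σ only removes the need to estimate it from the sample, not the √n factor.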



    Do your own homework. You'll be amazed what can be learned when using your own brain.



    Thanks Stan, I am happy to find out that my question is more obvious than I thought it to be (I actually mean it). However, after consulting:
    Sydsæter & Hammond 2002
    Wooldridge 2006
    Bowerman 2003
    Kallenberg 1997
    Remi Baxter, Stochastic Calculus
    and many dodgy internet sources, I have not found a single example of the theorem with known population variables, nor an explicit explanation for this specific case. My intuition tells me that it does make sense that the population variables should overrule the theorem.

    Furthermore, I have never seen (in the social sciences and economics) a non-linear confidence interval with the F statistic. I assume, after your reaction, that the answer is rather obvious, so did I miss the point in the previously mentioned sources, or should I consult the more exact sciences? If so, could you give me a source?


    Robert Butler

        I’m not sure I understand the point of all of the questions you are asking.  I’d like to ask a few in turn which, hopefully, will help me understand what it is you are trying to accomplish.
      You said you have a population of 250. You have computed various statistics of that population and you know there is an unknown mistake rate and that you “have good reason to assume this mistake rate has a similar distribution which I will test with a modified Jarque-Bera test.”
      I had never heard of the Jarque-Bera test but a quick check on Google indicated that it is a goodness of fit measure of departure from normality based on skewness and kurtosis. 
      Question 1: Why are you running a goodness of fit test?  As written your post gives the impression that a goodness of fit test is a measure of equivalence of two random distributions to one another – this isn’t the case. As mentioned above, the test you cited is a test for normality. You could, of course, run the test but later in your post you indicated the population is “wildly non-normally distributed” so it sounds like you already know what the results of the test will be.
      Question 2: You indicated you have 250 measures and presumably you have a way to identify a mistake, so rather than taking some sample of the data, why not just separate the 250 into good and mistake and see what you get?
     Your final comments suggest you are trying to determine some kind of confidence interval but it is unclear if you are interested in a confidence interval of a mean (or means) or if you are interested in a confidence interval on individuals from a population or perhaps something else entirely.
      If you could answer these questions and elaborate some more as to your aims concerning the confidence interval perhaps I or someone else may be able to offer additional thoughts.
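For reference, the Jarque-Bera test Robert describes is available in SciPy. A quick sketch on simulated data (a normal and a lognormal sample, not the poster's bills) shows how it flags departure from normality via skewness and kurtosis:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(size=500)
skewed_sample = rng.lognormal(size=500)

# Jarque-Bera combines sample skewness and excess kurtosis into one
# statistic; a small p-value is evidence against normality.
jb_normal = stats.jarque_bera(normal_sample)
jb_skewed = stats.jarque_bera(skewed_sample)

print(f"normal sample: JB = {jb_normal.statistic:.2f}, p = {jb_normal.pvalue:.3f}")
print(f"skewed sample: JB = {jb_skewed.statistic:.2f}, p = {jb_skewed.pvalue:.2e}")
```

As Robert notes, this only measures departure from normality; it does not test whether two samples share the same distribution.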


    Venerable Bede

    Maybe I do not understand your question, but it seems to me that if you have the population, then forget about sampling and use the population. Then the mean and standard deviation you calculate are parameters, not statistics, so the estimation of confidence intervals does not apply, since you are not using the data to estimate or predict anything beyond the population that you have defined.
    And then I have to wonder: since you have the population, what sort of testing are you trying to accomplish? What benefit do you derive from calculating a mean and standard deviation on such a skewed distribution?



    What is it that you are measuring with your “population” of 250? You said you have a mean, standard deviation, etc. What’s the metric? Cycle time?
    If you are going to assume that you have the population, I agree with the follow-on comments that suggest just pulling the whole population and determining which ones are defective. From a statistical-tool and statistical-equation standpoint, these things were primarily designed to infer characteristics of the population based on samples, so most statistical tools and equations become confusing when we assume we know the population. Could you step back, assume that you have a sample of 250 (since you will likely want to make recommendations about future process behavior), and reexamine your equations under that assumption? Unless I had all the data that was, all the data that currently is, and all the data that will be in the future, I would personally hesitate to call something a population, not because of some definition, but because analyzing population data either becomes really easy (as recommended by the other contributors) or incredibly confusing, as you've found.
    If you reply to this post with your metric and what you are trying to show/determine, that would help me provide more information on how best to determine your solution.
    “Data is like garbage. You’d better know what you are going to do with it before you collect it.” – Mark Twain.



    Hello everyone,

    Thanks for the replies. I am happy to hear that my initial intuition is similar. However, I think my question was slightly ambiguous, so I will elaborate.

    I am currently rewriting an audit trail. I have a large set of companies that have applied for a subsidy. My population is a single declaration of one company, which needs to be checked for mistakes. This population is a set of (around 250) bills (between $1 and $10,000). The population is wildly non-normally distributed; the distribution ranges from heavily positively skewed to uniform.

    In the bills there is a certain mistake rate (which differs per company/population). I would like to determine the mistake rate (in the form of a confidence interval) by random sampling and extrapolation.

    Since I know the population, and we have good reason to believe (and have tested) that the mistakes are distributed similarly to the bills of the company, this has the advantage that I can use its variables (I think), such as the standard deviation and distribution. Since the standard deviation is the population deviation and not the sample deviation, I do not have s/n^0.5, and I cannot determine the random sample size n by simply rewriting the 95% confidence interval to n = t·s²/B², where B is the maximum error of the average. This method would give such a high n that it would often be higher than N (the population size). So the method of using the population variables and distribution would, in my view, give a better result. (The only disadvantage is that I can no longer determine the sample size before testing, which would not be a major problem.)

    My question: can I simply take the population variables, or does the Central Limit Theorem imply that it is better to take the sample variables of the mistakes themselves (which do not significantly differ from the population variables) and use the normal distribution?

    Furthermore, my second problem is that I need to determine a maximum possible error (with 95% certainty) in the determined average mistake. Since the population is positively skewed, I thought it would be appropriate to use this different distribution in the confidence interval (instead of the ±1.96 from the normal approximation). Using the skewed distribution significantly reduces the error on the left side (which is our aim), which is very convenient. However, is this inappropriate? I have so far not encountered a similar method in the academic literature.

    I hope you can help me out. Thanks again.
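On the sample-size point: when the required n comes out close to N, the usual n₀ = z²s²/B² formula can be adjusted with the finite population correction, n = n₀ / (1 + n₀/N), which always yields n ≤ N. A sketch with hypothetical figures (the sd of $2,000 and half-width B of $300 are made up, not the poster's numbers):

```python
import math

def sample_size(s, B, N, z=1.96):
    """n for estimating a mean to within +/- B at ~95% confidence,
    with the finite population correction for a population of size N."""
    n0 = (z * s / B) ** 2                # infinite-population sample size
    return math.ceil(n0 / (1 + n0 / N))  # finite population correction

# Hypothetical figures: 250 bills, sd of $2,000, desired half-width B of $300.
N, s, B = 250, 2000.0, 300.0
n0 = (1.96 * s / B) ** 2
print(f"without correction: n = {math.ceil(n0)}")         # 171, most of the population
print(f"with correction   : n = {sample_size(s, B, N)}")  # 102
```

The same correction applies to the standard error itself: for sampling without replacement from a finite population, σ/√n is multiplied by √((N−n)/(N−1)), which shrinks the interval as n approaches N.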



    Hello everyone, I accidentally replied to my own message instead of yours; since I don't know whether that matters for the notification system, I am writing this message. My elaboration is in "Central Limit Theorem Question". Thanks



    Let’s take a step back. The way I understand it: The Y (output) you are trying to control (or at least baseline) is the defect rate (mistakes/bill) for which you have no current data and are asking for a sampling plan. (It sounds like) the data you do have is data on the company and the amount of the bill.
    Focus on the Y (mistakes/bill) and assume you do not have the population: I would recommend simplifying things further by not calling the 250 bills a population, unless you are only interested in these 250 bills and have no interest in describing the behavior of the current system which includes past, present, and future behavior. If you are going to limit yourself to just these 250 bills, then your conclusions that you present need to be worded to that effect. I’m going to assume that we want to include the total population which is unknown.
    If Y is a function of X, normalize or stratify: If you are confident that the mistakes/bill is influenced by the amount the bill is for, then you can either normalize your metric (defects per dollar requested) or stratify your sampling plan to the distribution of dollar amounts. Normalizing your metric should give you more normal data, and you can begin to see the benefits of the Central Limit Theorem.
    Consider medians: Whichever metric you go with, if it is continuous data and non-normal, then the median would be a more representative measure of central tendency than the mean. You'll have to make some conservative estimates for the statistics you don't know. You can't use the standard deviation from the cost data if what you are trying to measure is defects/bill. There are various equations and lookup tables out there that can guide you to an effective sampling plan. If absolutely stuck, start with a small number of samples, say 35 (why 35? No particular reason; this is just to get a feel for some of the statistics and determine how many additional samples we need. I would have said 30, but that's the same number as the starting point with continuous data, and I didn't want replies saying I was wrong because my data was attribute, when I really just took a stab at the number to begin with). So, start with 35, run some statistics, and see what the benefit of gathering more data would be. (For example, I once conducted a survey with limited time and budget. I collected 50 surveys, ran the confidence intervals, and found that I would only get my confidence interval 2% tighter if I collected another 50 surveys. So instead of doubling my work for 2%, I accepted the wide confidence intervals.)
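The diminishing return described in the survey anecdote follows directly from the 1/√n scaling of the standard error; a quick sketch with an assumed sd of 10 (illustrative only):

```python
import math

s, z = 10.0, 1.96   # assumed sample sd and ~95% z value (illustrative)

def half_width(n):
    """Half-width of a ~95% confidence interval for a mean."""
    return z * s / math.sqrt(n)

for n in (35, 50, 100, 200):
    print(f"n = {n:3d}: half-width = {half_width(n):.2f}")

# Doubling n only shrinks the half-width by 1 - 1/sqrt(2), about 29%,
# which is why extra samples buy less and less precision.
shrink = 1 - half_width(100) / half_width(50)
print(f"shrink from n=50 to n=100: {shrink:.0%}")  # 29%
```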
    Maximum defect rate (w/ 95% confidence): This is probably the easiest part once you get the data. This is where probability distributions come into play. Examine your data and find the most appropriate distribution (be sure to check the assumptions for the distribution; if you use MINITAB, the 4-in-1 graphs help you do that). Then you can use that distribution with those statistics to calculate what the defect rate would be at 95%. If you use MINITAB, they have a pretty good explanation of what I'm talking about, with examples: go to the Probability Distributions section, pull up one of those distributions, and bring up the help screen.
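The fit-then-take-a-percentile step can also be sketched with SciPy in place of MINITAB. A lognormal is used here purely as an example family, on simulated stand-in data; the right choice depends on the actual data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Stand-in data: 250 skewed per-bill error amounts (not real audit data).
errors = rng.lognormal(mean=2.0, sigma=0.8, size=250)

# Fit a candidate distribution; floc=0 pins the location parameter at zero.
shape, loc, scale = stats.lognorm.fit(errors, floc=0)

# 95th percentile of the fitted model: a single error falls at or below
# this value with 95% probability, according to the fitted distribution.
p95 = stats.lognorm.ppf(0.95, shape, loc=loc, scale=scale)
print(f"fitted 95th percentile   : {p95:.2f}")
print(f"empirical 95th percentile: {np.percentile(errors, 95):.2f}")
```

Checking the fitted percentile against the empirical one is a cheap sanity check on whether the chosen family is reasonable.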
    You are trying to find answers to some pretty simple questions; however, the challenge comes in selecting the tool that is designed to give you that answer and avoiding assumptions that will take you down the wrong path. Some of the assumptions I'm concerned about (and note, I don't know anything about your data, your project, or your objectives):
    • Assuming you have the population: The statistical tools make more sense to me when I assume the population is always unknown.
    • Mixing statistics: If interested in defect rate, be careful not to use means and standard deviations from cost data.
    • Central Limit Theorem: It makes life easier; however, I always double-check the work and the assumptions when one of my Black Belts comes back to me using it as a stand-alone analysis. They often need something from the central limit theorem and charge in with the wrong original data (data they have instead of data they need) and quickly get stuck and confused, or worse, perform an incorrect analysis.
    Take a step back, take the analysis one piece at a time, and format things so they look and feel like they did while in training. Good luck.

