Central Limit Theorem
 This topic has 8 replies, 5 voices, and was last updated 10 years, 9 months ago by Nik.


February 18, 2010 at 6:11 pm #53299
Hello everyone,
I have a small question. I have a population (250) for which I know the standard deviation, skewness, kurtosis and mean. In this population there is a certain "mistake" rate which is unknown to me. I have good reason to assume that this mistake rate has a similar distribution (mean, deviation and kurtosis), which I will test with a modified Jarque-Bera test. I will draw a random sample to obtain the mistake rate and construct a confidence interval.
1. I know the population standard deviation and I believe that it is the same for the mistakes. Do I still need to divide the standard deviation by the square root of n when constructing the confidence interval, or could I just use the population standard deviation?
2. My second problem is the following. My population is wildly non-normally distributed. However, it has features of the F distribution. Would it be (very) inappropriate to use this for significance testing and for constructing an (asymmetric) confidence interval?
Concerning the CLT: under the asymptotic assumption, could the sample be normally distributed when the population clearly is not?
If someone could help me out with these questions or refer me to the corresponding literature I would be very grateful.
Thanks,
Wouter

February 18, 2010 at 6:14 pm #189513
Do your own homework. You’ll be amazed what can be learned when using your own brain.

February 18, 2010 at 7:31 pm #189519
Thanks Stan, I am happy to find out that my question is more obvious than I thought it to be (I actually mean it). However, after consulting:
Sydsaeter & Hammond 2002
Wooldridge 2006
Bowerman 2003
Kallenberg 1997
Remi Baxter Stochastic calculus
Bjork
and many dodgy internet sources, I have not found a single example of the theorem with known population variables, nor an explicit explanation for this specific case. My intuition tells me that it does make sense that the population variables should overrule the theorem.
Furthermore, I have never seen (in social sciences and economics) a non-linear confidence interval with the F statistic.
I assume from your reaction that the answer is rather obvious, so did I miss the point in the previously mentioned sources? Or should I consult more exact sciences? If so, could you give me any source?

February 19, 2010 at 1:34 pm #189538
Robert Butler
I’m not sure I understand the point of all of the questions you are asking. I’d like to ask a few in turn which, hopefully, will help me understand what it is you are trying to accomplish.
You said you have a population of 250. You have computed various statistics of that population and you know there is an unknown mistake rate and that you “have good reason to assume this mistake rate has a similar distribution which I will test with a modified Jarque-Bera test.”
I had never heard of the Jarque-Bera test, but a quick check on Google indicated that it is a goodness-of-fit measure of departure from normality based on skewness and kurtosis.
Question 1: Why are you running a goodness-of-fit test? As written, your post gives the impression that a goodness-of-fit test is a measure of equivalence of two random distributions to one another. This isn’t the case. As mentioned above, the test you cited is a test for normality. You could, of course, run the test, but later in your post you indicated the population is wildly non-normal, so it sounds like you already know what the results of the test will be.
Question 2: You indicated you have 250 measures and presumably you have a way to identify a mistake. So, rather than taking some sample of the data, why not just separate the 250 into good and mistake and see what you get?
Your final comments suggest you are trying to determine some kind of confidence interval, but it is unclear whether you are interested in a confidence interval of a mean (or means), a confidence interval on individuals from a population, or perhaps something else entirely.
If you could answer these questions and elaborate some more on your aims concerning the confidence interval, perhaps I or someone else may be able to offer additional thoughts.

February 19, 2010 at 2:33 pm #189540
Venerable Bede
Wouter:
Maybe I do not understand your question, but it seems to me that if you have the population, then forget about sampling and use the population. Then the mean and standard deviation you calculate are parameters, not statistics, so the estimation of confidence intervals does not apply, since you are not using the data to estimate or predict anything beyond the population that you have defined.
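To make the parameters-versus-statistics point concrete, here is a minimal Python sketch of the standard error of a sample mean when sampling without replacement from a finite population. It assumes simple random sampling and a known population standard deviation. The function name and the example numbers (a standard deviation of 1500 and N = 250) are illustrative assumptions, not values taken from the thread.

```python
import math

def standard_error_of_mean(sigma, n, population_size):
    """Standard error of the mean for a simple random sample drawn
    without replacement from a finite population.

    sigma           -- population standard deviation (assumed known)
    n               -- sample size
    population_size -- N, the number of units in the population
    """
    if not 1 <= n <= population_size:
        raise ValueError("n must be between 1 and the population size")
    if population_size == 1:
        return 0.0
    # Finite population correction: sqrt((N - n) / (N - 1)).
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return sigma / math.sqrt(n) * fpc

# Hypothetical numbers, for illustration only.
SIGMA = 1500.0   # assumed population standard deviation of bill amounts
N = 250          # population size mentioned in the thread

for n in (30, 100, 250):
    se = standard_error_of_mean(SIGMA, n, N)
    print(f"n = {n:3d}: standard error = {se:8.2f}")
```

For any partial sample the division by the square root of n still applies, even when the population standard deviation is known. The correction factor drives the standard error to zero once all 250 units have been inspected, which is the case where no interval is needed.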
And then I have to wonder, since you have the population, what sort of testing are you trying to accomplish? What benefit do you derive from calculating a mean and standard deviation on such a skewed distribution?

February 19, 2010 at 3:15 pm #189542
What is it that you are measuring with your population of 250? You said you have a mean, standard deviation, etc. What’s the metric? Cycle time?
If you are going to assume that you have the population, I agree with the follow-on comments that suggest you just pull the whole population and determine which items are defective. Statistical tools and equations were primarily designed to infer characteristics of the population from samples, so most of them become confusing when we assume we know the population. Could you step back and assume that you have a sample of 250 (since you will likely want to make recommendations about future process behavior) and reexamine your equations under that assumption? Unless I had all the data that was, all the data that currently is, and all the data that will be in the future, I personally hesitate to call something a population. That is not because of some definition, but because analyzing population data either becomes really easy (as recommended by the other contributors) or incredibly confusing, as you’ve found.
If you reply to this post with your metric and what you are trying to show/determine, that would help me provide more information on how best to determine your solution.
“Data is like garbage. You’d better know what you are going to do with it before you collect it.” Mark Twain.

February 20, 2010 at 6:42 pm #189566
Hello everyone,
Thanks for the replies.
I am happy to hear that my initial intuition is
similar. However, I think that my question was slightly ambiguous. I will elaborate:
I am currently rewriting an audit trail. I have a large set of companies that have applied for a subsidy. My population is a single declaration of one company, which needs to be checked for mistakes.
This population is a set of (around 250) bills (between $1 and $10,000). The population is wildly non-normally distributed. The distribution varies from heavily positively skewed to uniform.
In the bills, there is a certain mistake rate (which differs per company/population). I would like to determine the mistake rate (in the form of a confidence interval) by random sampling and extrapolation.
Since I know the population, and we have good reason to believe (and have tested this) that the mistakes are distributed similarly to the bills of the company, this has the advantage that I can use their variables (I think), such as the standard deviation
and distribution.
Since the standard deviation is the population deviation and not the sample deviation, I do not have s/n^0.5. And I cannot determine n, the random sample size, by simply rewriting the 95% confidence interval as
n = t^2 s^2 / B^2
B = maximum error of the average.
This method would have such a high n that it would
often be higher than N (the population size). So the method of using the population variables and distribution would, in my view, give a better result. (The only disadvantage is that I can no longer determine the sample size before testing, which would not be a major problem.)
My question: Can I simply take the population variables, or does the Central Limit Theorem prove that it is better to take the sample variables of the mistakes themselves (which do not differ significantly from the population variables) and use the normal distribution?
Furthermore, my second problem is that I need to determine a maximum possible error (with 95% certainty) in the determined average mistake. Since the population has a positively skewed distribution, I thought it would be appropriate to use this different distribution in the confidence interval (instead of 1.96 from the t distribution). Using the skewed distribution significantly reduces the error on the left side (which is our aim), which is very convenient.
However, is this inappropriate? I have so far not encountered a similar method in the academic literature.
I hope you can help me out.
Thanks again.

February 20, 2010 at 6:46 pm #189567
Hello everyone,
I accidentally replied to my own message instead of yours. Since I don’t know whether that matters for the notification system, I wrote this message. My elaboration is in Central Limit Theorem Question.
Thanks

February 23, 2010 at 3:02 pm #189618
Wouter,
Let’s take a step back. The way I understand it: the Y (output) you are trying to control (or at least baseline) is the defect rate (mistakes/bill), for which you have no current data and are asking for a sampling plan. (It sounds like) the data you do have is data on the company and the amount of the bill.
Focus on the Y (mistakes/bill) and assume you do not have the population: I would recommend simplifying things further by not calling the 250 bills a population, unless you are only interested in these 250 bills and have no interest in describing the behavior of the current system, which includes past, present, and future behavior. If you are going to limit yourself to just these 250 bills, then the conclusions you present need to be worded to that effect. I’m going to assume that we want to include the total population, which is unknown.
If Y is a function of X, normalize or stratify: If you are confident that the mistakes/bill is influenced by the amount the bill is for, then you can either normalize your metric (defects/dollar requested) or stratify your sampling plan to the distribution of dollar amounts. Normalizing your metric should give you more normal data and you can begin to see the benefits of the Central Limit Theorem.
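A note on what the Central Limit Theorem does and does not promise: it concerns the distribution of sample means, not the shape of an individual sample, which stays as skewed as the data it came from. Below is a minimal simulation sketch. It assumes a made-up lognormal set of 250 amounts as a stand-in for skewed bill data. The seed, the distribution and the sample size of 30 are illustrative choices, not values from the thread.

```python
import random
import statistics

def skewness(values):
    """Population skewness: third central moment over sd cubed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical skewed population of 250 amounts (illustration only).
population = [rng.lognormvariate(6.0, 1.0) for _ in range(250)]

# Draw many simple random samples of 30 and record each sample mean.
sample_means = [
    statistics.fmean(rng.sample(population, 30)) for _ in range(2000)
]

pop_skew = skewness(population)
means_skew = skewness(sample_means)
print(f"skewness of the population:   {pop_skew:.2f}")
print(f"skewness of the sample means: {means_skew:.2f}")
```

The sample means are typically much closer to symmetric than the population itself. With a strongly skewed population and a small n some skew remains, so a normal-based interval is only approximate.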
Consider Medians: Whichever metric you go with, if it is continuous data and non-normal then the median would be a more representative measure of central tendency than the mean. You’ll have to make some conservative estimates for the statistics you don’t know. You can’t use the standard deviation from the cost data if what you are trying to measure is defects/bill. There are various equations and lookup tables out there that can guide you to an effective sampling plan. If absolutely stuck, start with a small number of samples like 35. (Why 35? No particular reason; this is just to get a feel for some of the statistics and determine how many additional samples we need. I would have said 30, but that’s the same number as the starting point with continuous data, and I didn’t want to get replies saying I was wrong because my data was attribute, when I really just took a stab at the number to begin with.) So, start with 35, run some statistics and see what the benefit of gathering more data would be. (For example, I once conducted a survey with limited time and budget. I collected 50 surveys, ran the confidence intervals and found that my confidence interval would get only 2% tighter if I collected another 50 surveys. So instead of doubling my work for 2%, I accepted the wide confidence intervals.)
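The diminishing return from extra samples can be checked before collecting them. Here is a minimal sketch, assuming the metric is a simple proportion (bill defective or not) and using the normal-approximation half-width z * sqrt(p(1 - p)/n) with the worst case p = 0.5. The function name and the candidate sample sizes are illustrative, and the numbers are not meant to reproduce the survey figures mentioned above.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value of the standard normal

def half_width(n, p=0.5, z=Z_95):
    """Normal-approximation half-width of a confidence interval for a
    proportion p estimated from n independent observations."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)

previous = None
for n in (35, 50, 100, 200):
    hw = half_width(n)
    note = "" if previous is None else f" (gain: {previous - hw:.3f})"
    print(f"n = {n:3d}: half-width = {hw:.3f}{note}")
    previous = hw
```

Because the half-width shrinks with the square root of n, quadrupling the sample only halves the interval.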
Maximum defect rate (w/ 95% confidence): This is probably the easiest part once you get the data. This is where probability distributions come into play. Examine your data and find the most appropriate distribution (be sure to check the assumptions for the distribution; if you use MINITAB, the 4-in-1 graphs help you do that). Then you can use that distribution with those statistics to calculate what the defect rate would be at 95%. If you use MINITAB, they have a pretty good explanation of what I’m talking about, with examples. Go to the Probability Distributions section, pull up one of those distributions and bring up the help screen.
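For a defective/not-defective count, one common way to state a maximum defect rate with 95% confidence is an exact (Clopper-Pearson style) one-sided upper bound from the binomial distribution. This is a swapped-in illustration rather than the MINITAB procedure described above. It assumes independent draws (for a sample that is a large fraction of 250 bills, a hypergeometric model would be tighter), and the counts used (3 mistakes in 35 bills) are made up.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i)
               for i in range(k + 1))

def upper_bound(defects, n, confidence=0.95):
    """Exact one-sided upper confidence bound for a proportion:
    the p at which P(X <= defects) falls to 1 - confidence."""
    if not 0 <= defects <= n:
        raise ValueError("defects must be between 0 and n")
    if defects == n:
        return 1.0
    alpha = 1.0 - confidence
    lo, hi = 0.0, 1.0
    # binom_cdf decreases as p grows, so bisect on p.
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if binom_cdf(defects, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical audit sample: 3 mistakes found in 35 bills.
bound = upper_bound(3, 35)
print(f"95% upper bound on the defect rate: {bound:.3f}")
```

The bound always sits above the observed rate, and the gap narrows as the sample grows.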
You are trying to find answers to some pretty simple questions; however, the challenge comes in selecting the tool that is designed to give you that answer and avoiding assumptions that will take you down the wrong path. Some of the assumptions I’m concerned about (and note, I don’t know anything about your data, your project, or your objectives):
· Assuming you have the population: The statistical tools make more sense to me when I assume the population is always unknown.
· Mixing statistics: If interested in defect rate, be careful not to use means and standard deviations from cost data.
· Central Limit Theorem: It makes life easier; however, I always double-check the work and the assumptions when one of my Black Belts comes back to me using it as a stand-alone analysis. They often need something from the central limit theorem and charge in with the wrong original data (data they have instead of data they need) and quickly get stuck and confused or, worse, perform incorrect analysis.
Take a step back, take the analysis one piece at a time, and format things so they look and feel like they did while in training. Good luck.