Confidence Intervals for Non-Normal data

Six Sigma – iSixSigma Forums Old Forums General Confidence Intervals for Non-Normal data

Viewing 54 posts - 1 through 54 (of 54 total)
#49282

Bob Hubbard
Participant

I am working with time data relating to the Time Required to Resolve Problems.  Most problems are solved in less than an hour, skewing the distribution to the left.  Some incidents require many hours to repair, which is why this data is not well modeled by a Normal distribution. (Even after accounting for Special Cause, I am left with Non-Normal data.)
How do I determine Confidence Intervals on this Non-Normal data without relying on the Central Limit Theorem?
Bob [email protected]

0
#168213

SixSigmaGuy
Participant

Do you mean the confidence interval around the mean?  I, too, frequently have the same problem and am really glad you asked this question.  When I plot a histogram of my data, it typically looks like a Chi-Square distribution with a low number of degrees of freedom.  But I don’t know how to calculate a confidence interval around the mean for a Chi-Square distribution.  Hopefully, someone here does and can solve this problem for us.

0
#168254

Scooter
Member

BH,
Have you tried the following:
As you do not seem to have Normal data, have you tried using the median instead of the mean, as they do in the housing industry?
Have you looked at converting your Non-Normal data using a Box-Cox transformation?
Scooter

0
#168256

Gastala
Participant

I’m no expert, but have you considered non-parametric tests – e.g., Mann-Whitney?  This basically compares the cumulative distributions with each other – regardless of their fit to any standard distribution.

0
#168257

Gastala
Participant

Disregard last post – wrong topic!

0
#168260

Dennis Craggs
Participant

What distribution describes the data? Have you tried Weibull or LogNormal?

0
#168263

Mikel
Member

A minor point, but this type of data is referred to as skewed right, not skewed left. The direction refers to the long tail.
If your data is continuous and relatively smooth, you can discover the correct distribution using Minitab allowing you to adequately describe the confidence interval.
If it is continuous and relatively smooth you can also transform it successfully if you just have a need to talk of it in terms of a normal distribution (NVA).

0
#168265

SixSigmaGuy
Participant

Wow!!  Live and learn.  I’ve been doing this for many years and have always interpreted Skewed Left and Skewed Right based on the peak of the distribution, not the tail.  When I first read your post, I thought, what is wrong with this person?  But then I looked it up in a few statistics books and, OMG, you are right.  I’ve even taught it wrong in classes; how humiliating. :-(  A more precise definition (as I found in one of my stat books) is: if the mean and median are to the left of the mode, the distribution is Skewed Left (also called “Negatively Skewed”), and if the mean and median are to the right of the mode, the distribution is Skewed Right (also called “Positively Skewed”).
Thank you for pointing out the error in my ways. :-)

0
#168267

Mike Carnell
Participant

SSG,
Simple way to remember:
Make a fist with your thumb sticking out. Have your thumb point in the direction of the tail. If your thumb is pointing right, it is skewed right. I’ll let you figure out what it is if your thumb points left.
Good luck

0
#168270

SixSigmaGuy
Participant

Not sure what Bob’s requirements are, but in our case, we are looking for something more rigorous than the Mann-Whitney test (or its equivalent, the Wilcoxon Rank-Sum test); something that will increase our likelihood of rejection.

0
#168271

SixSigmaGuy
Participant

:-)

0
#168272

SixSigmaGuy
Participant

Not sure what Bob’s distribution is, but from his description it sounds the same as mine.  Mine looks most like a Chi-Square distribution when the DF is low.

0
#168286

Bob H
Participant

I’m looking for a way to say, “we can say with 95% confidence that our 2007 numbers are a statistical improvement over our 2006 numbers.” I haven’t tried Mann-Whitney yet, because I’m looking for something simple to execute and something I can explain to non-statisticians.

0
#168287

Bob H
Participant

Thanks for the clarification on the direction of my skew. ;-)
My data is continuous, but there are tons of data points below the 60-minute mark. (Probably because there is an informal Service Level Agreement at the 60-minute mark.) Most of these problems are solved in an acceptable amount of time (the mean is around 120 minutes over 2,500+ events), but at some point, everyone agrees that we should take “as long as it takes” to fix the problem. My gut tells me that this becomes a different process at that point, but I’m reluctant to toss the data points. If I make some broad assumptions and remove 20 or 30 data points, my data fits Weibull and Log-normal distributions.

0
#168288

Bob H
Participant

SixSigmaGuy is correct, my distribution looks roughly like a Chi-Square distribution with 9 or so degrees of freedom.

0
#168289

Bob H
Participant

I did do a Box-Cox transformation, but it looked a little too “scrubbed”. I like the idea of using the median, rather than the mean. How do I calculate the CIs, and is hypothesis testing the same with the median as with the mean?

0
#168290

Bob H
Participant

Love the fist analogy. Will definitely use that in my next Green Belt class. THANKS!

0
#168292

HF Chris
Participant

Why would you remove 20 to 30 points to make things fit better? Can you really say you have 95% confidence with that process? My suggestion is to run a goodness-of-fit test against multiple types of distributions. If there is no fit, re-examine your process. There are transformations that may work for a fit test, but what is that really going to do for you if your goal is to make the statistics look simple to your audience? Show your people the real picture and build a business plan to go fix it.
Chris

0
#168293

SixSigmaGuy
Participant

I’ve been doing some reading on the subject, and have an idea I’d like to propose as a solution to this problem.
Assume that the data does fit a Chi-Square distribution and that I want to build a 95% confidence interval around the mean of my sample.  Using both tails of the Chi-Square distribution, I can calculate the (1.0 – 0.95)/2 = 0.025 critical values separately for each tail, the same way I calculate the critical values when I’m trying to estimate the standard deviation of a normal distribution. I’ll call the left critical value a(L) and the right critical value a(R).
The properties of a Chi-Square distribution are (from http://stattrek.com/Lesson3/ChiSquare.aspx?Tutorial=AP):
m = mean = degrees of freedom
s = standard deviation = SQRT(2 * m)
Thus, using the same procedure as used for the z and t distributions, I can calculate the margin of error as two values:
E(L) = a(L) * s = a(L) * SQRT(2 * m)
E(R) = a(R) * s = a(R) * SQRT(2 * m)
Finally, the confidence interval would be [x-bar – E(L)] to [x-bar + E(R)], where x-bar is calculated the same way as for a normal distribution.
Does that sound like a valid approach?  Does my math make sense?

0
#168306

SiggySig
Member

I’m confused as to why you would consider a chi-square when working with continuous cycle time data? I’ve only ever used chi-square to handle proportion data. I have done several cycle time analyses before, and have found that Mood’s Median test is the hypothesis test to use. No need to transform, etc.

0
#168309

Dennis Craggs
Participant

Actually the type of distribution is very important. Since the data is skewed it suggests that Weibull, LogNormal, or Exponential will provide a reasonable fit to the data. Once a distribution is fit to the data, then confidence limits on the parameters of the distribution can be established. Also, the population % that will fall into intervals can be predicted. The distribution may allow the data (and specifications) to be normalized. This will allow the calculation of indices like Ppk or Cpk.
Minitab can be used to establish the type of distribution. Weibull and LogNormal were referenced since they usually provide reasonable fits to the data.
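If a lognormal does turn out to fit, one simple route is to take logs, build a normal-theory interval on the log scale, and back-transform. This is only a sketch of that idea, not anything from the thread: the repair-time values are invented, the z = 1.96 multiplier is a large-sample approximation (a t critical value would be more careful for small n), and note that the back-transformed interval is a confidence interval for the lognormal's *median*, not its mean.

```python
import math
import statistics

def lognormal_median_ci(data, z=1.96):
    """Fit a lognormal by taking logs, build a normal-theory CI on the
    log scale, then exponentiate. The result brackets the distribution's
    median (exp of the log-scale mean), not its mean."""
    logs = [math.log(x) for x in data]
    mu = statistics.fmean(logs)
    se = statistics.stdev(logs) / math.sqrt(len(logs))
    return math.exp(mu - z * se), math.exp(mu + z * se)

# Hypothetical right-skewed repair times in minutes (illustrative only).
times = [12, 18, 25, 30, 35, 40, 45, 55, 60, 75, 90, 120, 240, 480]
lo, hi = lognormal_median_ci(times)
print(f"approx. 95% CI for the median repair time: [{lo:.1f}, {hi:.1f}]")
```

This sidesteps the skewness entirely, at the cost of answering a question about the median rather than the mean.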

0
#168327

SixSigmaGuy
Participant

If I understand Bob’s project correctly, the goal of the project is to lower the mean; thus, he should be performing hypothesis tests around the mean, not the median.  At least that’s the case with my project and it sounds similar to Bob’s.
The reason for considering the chi-square distribution, in my case, is because the histogram for my data resembles a chi-square distribution.  Also,

My distribution is not symmetric, but skewed right.
There are no negative values in my data; thus the values are bounded by zero.
These are both characteristics of the Chi-Square distribution.
It’s true that my distribution might not have the same shape as the Chi-Square distribution, but, if it doesn’t, I need to determine what distribution it does match.  Then, I’ll have the same problem of how to calculate a confidence interval for that distribution.

0
#168332

Snow
Participant

Is your objective simply to determine a reliable estimate of central tendency and a corresponding CI for a given data set that doesn’t approximate normal?

0
#168333

SixSigmaGuy
Participant

Yes, that’s a good way to put it, although the central tendency metric needs to be the mean, x-bar.

0
#168334

Snow
Participant

Forgive my ignorance, but what is so important about using the mean?

0
#168335

SixSigmaGuy
Participant

It’s the metric we are trying to improve; that’s the only reason.  We have a metric named MTTC, Mean Time To Close, that we use for tracking support calls.  Currently our MTTC is too high and we want to reduce it.  Thus, we are doing a Six Sigma project to accomplish this.  I could use other metrics, such as the Median or Variance, but the results wouldn’t mean much to our sponsors because they don’t tell them that the mean has been reduced.

0
#168338

HF Chris
Participant

Reminds me of the joke about the two statisticians who went hunting. The first hunter shot 2 inches to the left of the deer.  The second hunter shot 2 inches to the right of the deer, and both of them were happy.  Why, you ask? Because by averaging the distance of both shots, they got the deer.
Point is, if your managers only understand changes in the mean they don’t understand their true process problem.
HF Chris

0
#168339

Brandon
Participant

HF, that’s a lot like what we used to call the difference between a stats guy and an engineer…
If a naked lady stood against a wall and you could only move towards her 50% of the distance with each step... the stats guy says, “Forget it... you’ll never get there!” The engineer says, “I’ll get close enough... move over!”

0
#168340

Taylor
Participant

Brandon, someone once said close only counts in horseshoes and hand grenades, but I think you added to the list.

0
#168341

SixSigmaGuy
Participant

Whoever said my managers only understand changes in the mean?  My managers understand statistics very well.  The problem is the metric we are trying to improve is the mean.  If we were trying to improve the median or the variability, we’d be after a different metric; but, in this case, we are trying to improve the mean.
No one yet has tried to answer my question.  All I’ve heard are alternative approaches and/or workarounds.  I’m not looking for alternative approaches or workarounds; I’m looking for a way to calculate the CI around the mean of a non-normal distribution. Doesn’t anyone know how to do this?  I would have expected it to be a very common problem, especially for Six Sigma practitioners who are trying to reduce cycle times.  Hasn’t anyone had to deal with this problem before?

0
#168342

SiggySig
Member

0
#168343

SixSigmaGuy
Participant

Minitab won’t help if I don’t know what method to use.  As far as I can tell, Minitab doesn’t deal with my issue.  I would love it if someone could point me to something in Minitab that will help me with my problem.

0
#168345

SiggySig
Member

Just use Graphical Analysis – it will take your data and give you confidence intervals for mean, median and std dev.

0
#168347

Snow
Participant

Forgive me if this sounds pedantic, but I am simply trying to explain what I think is the proper response to your dilemma.
Any stable process will approximate a known distribution. Based on earlier comments, is it possible that your data set(s) is/are not stable, hence the difficulty in determining distribution type (i.e., throwing out data points, etc.)?  Have you checked?  You realize that if your data set indicates instability, then a CI for any descriptor (i.e., the mean) is not valid?
And distribution type is relevant in that it allows the investigator to choose the descriptor that best represents their data set (in this case, central tendency).  Project Ys that focus on time are generally bounded at zero and thus skewed to the right... this is what you would expect.  Hence, the median is a better descriptor for this data set, and thus a non-parametric test is an option here.
If your client is using a time-based performance metric from a stable process that plots as expected (positively skewed), then using the mean and not the median is simply not representative of the process.  Verify this statement through your statistical SME and bring this to the attention of your client and have them explain or defend this practice... you might even increase your credibility with them.
Now the good news... Practically, if what you require is a test of means (i.e., ANOVA, paired t, one-sample t/z, two-sample t/z, etc.), these tests are largely robust to a departure from normality, and using them would be appropriate for what you have described (Bob H). There is a difference between detecting non-normality (you have) and being sensitive to it (tests of means are not).
In addition, you can invoke normality in a data set (but ask yourself why) through subgrouping (i.e., exploiting the CLT) or transformation, once the distribution type and the resulting optimal lambda are determined.
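The subgrouping idea above can be illustrated with a quick simulation. This is my own sketch, not from the thread: the exponential “cycle times” are invented, but they show how averaging subgroups pulls a strongly right-skewed distribution toward symmetry, exactly as the CLT predicts.

```python
import random
import statistics

def skewness(xs):
    """Population skewness: the mean of cubed z-scores."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

rng = random.Random(0)
# Strongly right-skewed simulated "cycle times" (exponential, mean 60 min).
raw = [rng.expovariate(1 / 60) for _ in range(2000)]

# Means of subgroups of size 10 -- the CLT pulls these toward normality.
subgroup_means = [statistics.fmean(raw[i:i + 10]) for i in range(0, 2000, 10)]

print(f"raw skewness:           {skewness(raw):.2f}")
print(f"subgroup-mean skewness: {skewness(subgroup_means):.2f}")
```

The subgroup means come out far less skewed than the raw observations, which is why normal-theory intervals on subgroup means can be defensible even when the raw data are not normal.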
Hope it helps.  Trust but verify!

0
#168349

HF Chris
Participant

Sorry if this posts twice….my first one seems to have got lost in space.
SixSigmaGuy,
In an effort not to aggravate Stevo with “in my last company”, here it goes. I was in your position, needing to reduce cycle time. I did deal with managers who reduced the product production schedule by 18 days to meet new part orders. I did have managers whose metric goal was to reduce the mean despite the spread and variation. In other words, the current “late” process was in controlled “chaos”. The bosses’ only concern was to remove 18 days on average, with confidence.
Yes, you can run an ANOVA with conservative measures that take into account your “variation”. When you run an ANOVA, you need to make sure you understand the assumptions (found in a basic stats book). Yes, you can see possible improvement in the mean... But I hope you’re not missing my point. There are rules on how to handle outlier points (removing them completely is not the way). Also, what happens to the products that don’t “fit” your model? Do they also get removed from service? If your average metric gets better, will the customer be any happier if their product is not delivered at the prescribed time?
Run a goodness-of-fit test and look for the best transformation of your original data. Not knowing your data, a basic stats book has plots to guide your decision. If comparing two data sets, use the same transformation (apples compared to apples). Keep in mind that your products will continue to run with variation, which I assume has an impact on another part of your schedule.
HF Chris

0
#168350

Severino
Participant

Let me start by saying that I am no statistician, so you may call me an idiot by the time you’re done reading this.  My ego can accept that.  While I find the question intriguing and am also interested in the answer, is it possible that you are too caught up in the minutiae of the calculations instead of focusing on the improvement?
If your goal is to reduce your mean time to close and to show that you have made a significant, robust, sustainable change to your process why not just begin by generating a control chart?  Their whole purpose is to detect shifts in the mean and due to Chebyshev’s inequality they are relatively insensitive to the actual underlying distribution when three sigma limits are chosen.
Although it sounds strange, you might be able to use the Winsorized mean to calculate your control limits so that they are sufficiently narrow and then use those limits to monitor your actual mean.  If you can get your mean to fall below the LCL then you know you’ve made a significant improvement in your process.
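A Winsorized mean can be sketched in a few lines. This is my own illustration, not a standard library call, and the trim proportion is an arbitrary choice that would need justifying for real data: instead of deleting the extreme values, they are clamped to the nearest retained order statistic before averaging.

```python
def winsorized_mean(data, proportion=0.1):
    """Winsorized mean: clamp the most extreme values at each end to the
    nearest retained order statistic, then average.
    `proportion` is the fraction clamped at EACH end."""
    xs = sorted(data)
    n = len(xs)
    k = int(n * proportion)
    clamped = [xs[k]] * k + xs[k:n - k] + [xs[n - k - 1]] * k
    return sum(clamped) / n

# One extreme repair time drags the ordinary mean far to the right;
# the Winsorized mean is much less sensitive to it.
times = [10, 20, 30, 40, 600]
print(sum(times) / len(times))      # ordinary mean: 140.0
print(winsorized_mean(times, 0.2))  # Winsorized mean: 30.0
```

Unlike trimming, every observation still contributes to the count, which keeps the estimator's interpretation closer to a mean than a median.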
Keep in mind also that, as my understanding goes, using the mean as a measure of central tendency for a distribution that is skewed can be dangerous.  For example, the way to gain the most significant shift in the mean is to manage the data which shows up in the long tail, but by doing that you would be improving your process for only a very small proportion of the population rather than making a significant improvement for the majority of the population.  Therefore, in my opinion, it would be very advisable to disqualify the outliers.
Note that I said disqualify rather than ignore.  For example, if this is a call center, you can develop criteria along with management for not including the data from a call in the metric.  If this is done beforehand and you actually develop a new baseline with such a system in place, you may actually find your data more closely approximated by a distribution whose properties are well understood.  You could appease the management team by developing a separate metric for those data points which were disqualified so that they may be improved at a later date.
Although none of this is as sexy as being able to calculate a confidence interval and perform a hypothesis test to show that you have made a significant shift in the mean, it does allow you to focus more of your efforts on the improvement activity rather than getting caught up in the details of calculation.  Anyway, it’s probably a stupid suggestion…

0
#168356

DaveS
Participant

SixSigmaGuy,
Get a copy of “Statistical Intervals: A Guide for Practitioners” (Wiley Series in Probability and Statistics) by Hahn and Meeker. They most likely cover this. I don’t have my copy here, but I recall they do discuss this.
Another way is to bootstrap it. Resampling techniques like the bootstrap and jackknife are distribution independent.
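The bootstrap suggestion is easy to sketch with nothing but the standard library. The data below are invented right-skewed repair times and the seed is fixed for reproducibility; in practice you would resample your own measurements, and 10,000 resamples is just a common rule of thumb.

```python
import random
import statistics

def bootstrap_ci_mean(data, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for the mean.
    Makes no assumption about the shape of the underlying distribution:
    it resamples the observed data with replacement and reads the CI
    off the empirical distribution of resampled means."""
    rng = random.Random(seed)
    n = len(data)
    boot_means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)
    )
    low = boot_means[int((alpha / 2) * n_boot)]
    high = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

# Hypothetical right-skewed repair times (minutes), like the thread's data.
times = [12, 18, 25, 30, 35, 40, 45, 55, 60, 75, 90, 120, 240, 480]
low, high = bootstrap_ci_mean(times)
print(f"95% bootstrap CI for the mean: [{low:.1f}, {high:.1f}]")
```

Because the resampled means inherit the skew of the data, the resulting interval is asymmetric around the sample mean, which is exactly the behavior a normal-theory interval cannot give.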

0
#168363

Deanb
Participant

Great example of statistical wrong thinking. I’d like to add that the key to getting the deer is not statistical, but management of relevant Xs, including people factors. People-wise, the hunter must manage his emotions (by avoiding buck fever) and cultivate a passion for both continuous improvement and for bagging the deer. Technically, the hunter needs to maintain consistent technique in aiming and shooting, estimate windage, range, and deer movement, know what his rifle and ammo can and cannot do, protect himself from the environment so he can stay in the field long enough to get opportunities, minimize scent and noise, read sign and apply this intelligence, skillfully choose a worthy target in the first place, and be capable of taking a fast corrective-action second shot if needed. Sounds a lot like SS.

0
#168370

Bob Hubbard
Participant

Dude, thanks for the insight. It was most helpful.

I have been digging deeper since my original posting, and here’s what I’ve found. The data points far out in the tail seem to fall into three categories: 1. The problem repair team was not aware that something was broken, so could not begin fixing it. 2. The problem repair team and the client agreed to limp along until the problem could be fixed. 3. Special cause variation is to blame. Armed with this information, I’m confident I can address some of the data points most distant from the mean. At that point I will be more confident that this process is stable.

We considered using the median, but in this case, the key performance metric is the average time to fix these problems, so we’d really like to have a way to attack this from that standpoint.

I’m plowing through the sub-grouping now, and I really, really appreciate your suggested solutions!

Bob H

0
#168377

Brandon
Participant

Or, as Ron White says, just hit the deer with your pickup at 55 mph. A lot easier…..and you don’t even have to set your beer down.

0
#168382

BC
Participant

An Electrical Engineering joke:
If your thumb is pointing in the direction of the current, and the fingers wrap around in the direction of the magnetic field, then you know it’s your right hand.

0
#168385

SixSigmaGuy
Participant

But those metrics will be calculated assuming the data is normally distributed; my data is far from normally distributed.  Check out this paper on non-normal distributions here on ISixSigma: https://www.isixsigma.com/library/content/c020121a.asp.

0
#168387

SixSigmaGuy
Participant

I don’t want to do a transformation.  The whole point of this thread is to find out how to calculate CIs without having to do any sort of transformation.

0
#168388

SixSigmaGuy
Participant

Thanks!  I’ll check out that book.
Bootstrapping was one of the first things I tried.  Still failed to reject.  Nonetheless, the purpose of my question in this thread is to find out if there is a way to calculate confidence intervals for non-normal data directly from the data; similar to how it’s done for Normal data.  Hopefully the book will answer the question for me.

0
#168389

melvin
Participant

Chebychev and Empirical Rules
I have used Chebychev’s rule to conservatively estimate CIs for data that is not normal.  See below copy-pasted info from http://www.stat.tamu.edu/stat30x/notes/node33.html
Hope this helps a little.  bob
Knowing the mean and standard deviation of a sample or a population gives us a good idea of where most of the data values are because of the following two rules:

Chebychev’s Rule: The proportion of observations within k standard deviations of the mean, where k > 1, is at least 1 - 1/k^2, i.e., at least 75%, 89%, and 94% of the data are within 2, 3, and 4 standard deviations of the mean, respectively.
Empirical Rule If data follow a bell-shaped curve, then approximately 68%, 95%, and 99.7% of the data are within 1, 2, and 3 standard deviations of the mean, respectively.
EXAMPLE: A pharmaceutical company manufactures vitamin pills which contain an average of 507 grams of vitamin C with a standard deviation of 3 grams. Using Chebychev’s rule, we know that at least 1 - 1/2^2 = 3/4, or 75%, of the vitamin pills are within k = 2 standard deviations of the mean. That is, at least 75% of the vitamin pills will have between 501 and 513 grams of vitamin C, i.e., in the interval [507 - 2(3), 507 + 2(3)] = [501, 513].
EXAMPLE: If the distribution of vitamin C amounts in the previous example is bell shaped, then we can get even more precise results by using the empirical rule. Under these conditions, approximately 68% of the vitamin pills have a vitamin C content in the interval [507-3,507+3]=[504,510], 95% are in the interval [507-2(3),507+2(3)]=[501,513], and 99.7% are in the interval [507-3(3),507+3(3)]=[498,516].
NOTE: Chebychev’s rule gives only a minimum proportion of observations which lie within k standard deviations of the mean.
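The rule above translates directly into code. This small sketch just mechanizes the arithmetic; the vitamin numbers are the post's own example.

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of observations within k standard deviations
    of the mean, for ANY distribution (requires k > 1)."""
    if k <= 1:
        raise ValueError("Chebychev's rule requires k > 1")
    return 1 - 1 / k**2

def chebyshev_interval(mean, sd, k):
    """Interval guaranteed to contain at least 1 - 1/k^2 of the data."""
    return mean - k * sd, mean + k * sd

# The vitamin C example from the post: mean 507, sd 3, k = 2.
lo, hi = chebyshev_interval(507, 3, 2)
print(chebyshev_min_proportion(2), lo, hi)  # 0.75 501 513
```

As the note says, these are only minimum proportions, which is why Chebyshev-based intervals come out much wider than distribution-specific ones.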

0
#168390

Snow
Participant

You bet!  Best of luck!

0
#168391

SixSigmaGuy
Participant

Thanks for sharing!  I gave it a try, but unfortunately, it widens the CI over what we get with a t-distribution.  So, the likelihood of rejection is reduced.
I assume that for k=1, the result comes out to 0%, so it’s a pretty safe bet that the % is greater than 0%. :-)
This will certainly be handy in other situations I deal with, though.  Thanks!

0
#168392

SixSigmaGuy
Participant

I do, very much, appreciate your help. But my goal here is to find a way to calculate a confidence interval on non-normal data, regardless of whether it’s the right approach or not.  I recall doing this in a statistics course I took in graduate school, so I know it can be done; I just don’t remember how and I can’t find anything that tells me how.
The data is stable according to general inspection and any statistical analysis I’ve done to verify that the process is in control.
That said, your post has some very good information that I will find helpful on future projects.
Thanks!

0
#168393

Snow
Participant

Roger that.  Good luck!

0
#168394

SixSigmaGuy
Participant

Glad you’re back Bob H.  Sounds like our problems are very similar.  If you come up with a solution that’s not posted here, would you mind sharing it?  I got your email address from your first post; I’ll send you an email and we can discuss offline.

0
#168395

HF Chris
Participant

SixSigmaGuy,
Here is a walk-through example using chi-square. I don’t use this often because of my own personal biases: 1. group data per their homogeneity if possible and compare like samples; 2. transform data if close; 3. understand what in the system is causing this variance, because although all data may fit within their upper and lower limits, the process for business is unacceptable. However, in response to your request, look at this link for non-normal data and the calculations.
/resourcesquality/wqachapter10.pdf

0
#168419

melvin
Participant

Actually, k must be greater than 1 for this to apply.  bob

0
#168430

melvin
Participant

I also seem to remember that in order to use the t-distribution, one of the assumptions for small sample size is that the parent population is normal so be careful there.  bob

0
#168463

SixSigmaGuy
Participant

Yes, that’s right.  My statistics reference states that if the sample size is <= 30, then the population distribution has to be normal.  Of course, as the sample size increases, so do the degrees of freedom; and as they increase, the t-distribution narrows, approaching the normal distribution.

0

The forum ‘General’ is closed to new topics and replies.