Need Help Choosing Statistical Tool for Data
This topic contains 14 replies, has 5 voices, and was last updated by Mike Carnell 7 months, 1 week ago.
Hi Everyone,
I’m currently a Greenbelt candidate and working on my project. I’ve identified my primary metric, which is made up of discrete proportions that can take on 9 values (0, 0.125, 0.250, 0.375, …, 1.000). Most of my potential X’s are also discrete. I have about 10000 observations of my primary metric, and I’m not sure how to represent the data statistically. If I drew a histogram, it would have a long left tail with the mode at 1.000, although I know that a histogram shouldn’t be used on discrete data. After my project is through, I’m expecting an even longer left tail with even more observations at 1.000.
My questions are:
1) How do I represent/display the data statistically in the measure phase?
2) What statistical methods would you suggest in testing whether the improvements to be put in place make a difference statistically?
Dot plots and box plots can be useful.
I would work on getting better precision for your measurement. Why are you only able to get 9 values for a primary metric? Don’t confuse a primary metric with something you measure. Why not use ppm or % defective? Be aware there are lots of potential questions.
@teg153 – are you sure that you only have those values available? These look more like a measurement was taken and then attributed to the closest 1/8 bucket. I’d look deeper to the actual measurement, not the bucket it was attributed to.
What is your objective? That will help to guide how to evaluate the data.
Thanks Chris and MBBinWI for your replies!…here’s some background to give an idea why I only have 9 values. The goal of the project is to make better use of our medical operatories. So if we look at one operatory, we have the amount of time a provider was scheduled in that operatory out of an 8 hour day. Because their schedules are not continuous, more like 1 hour increments, I only have so many values. For example, an operatory that has a provider scheduled in it 4 hours out of an 8 hour day would have a usage of 0.500. One out of 8 hours is 0.125. I have several thousand of these observations available at this granularity in our database. Thanks in advance for your suggestions…
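To make the metric above concrete, here is a minimal sketch of how the 9 usage values arise and how they could be tabulated instead of forced into a histogram. The sample data and the names `usage` and `frequency_table` are made up for illustration; only the 8-hour day and the 1-hour scheduling increments come from the post.

```python
from collections import Counter

HOURS_PER_DAY = 8  # from the post: usage is scheduled hours out of an 8-hour day


def usage(scheduled_hours):
    """Usage proportion for one operatory-day, e.g. 4 hours -> 0.500."""
    return scheduled_hours / HOURS_PER_DAY


def frequency_table(scheduled):
    """Return (usage value, count, share) for each of the 9 possible values."""
    counts = Counter(scheduled)
    n = len(scheduled)
    return [
        (usage(h), counts.get(h, 0), counts.get(h, 0) / n)
        for h in range(HOURS_PER_DAY + 1)
    ]


# Hypothetical observations (scheduled hours per operatory-day), not real data.
sample_hours = [8, 8, 8, 7, 8, 6, 8, 4, 8, 7, 8, 1, 8, 5, 8, 8]

# A text dot plot: one star per observation at each usage value.
for value, count, share in frequency_table(sample_hours):
    print(f"{value:.3f} | {'*' * count} ({share:.1%})")
```

With about 10000 real observations, the same table could feed a bar chart or dot plot, which keeps the nine values separate rather than binning them.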
I don’t know the context but I highly suggest you redefine the problem. Use of an asset (medical operatories) isn’t a problem…it’s a symptom of something else.
Your metric ought to be % of hours used IF you insist on proceeding with this project.
Also, with my tongue in cheek, I can easily increase the metric you’ve suggested by having the medical operatories used inefficiently…heck, slow down the effectiveness in half and you’ll increase your % of hours used. This is another reason, I suggest redefining the project to a defect or problem in the process.
@teg153 @MBBinWI gave you good advice. I liked @cseider’s answer as well but then I always prefer to see my data in a dot plot at some point because it seems easier to see very odd values.
You need to understand the problem you have been given to solve and that should always begin with the data that caused someone to think it was a problem. You also need to understand the measurement that created the data (that would be why MSA is in the Measure Phase).
Everything, at least that I am aware of, at some point is discrete, depending on the resolution of the measurement system. Don’t let rules get in your way. If you want to see your discrete data in a histogram then do it. Nobody goes to statistical jail. I just hope you don’t work for the “tool overkill person” – it might be a capital offense for them.
You seem to be approaching this a little backwards. You will probably struggle a little less with what the proper tool is by asking two questions before you start worrying about tools. First “What do I want to know?” and second “How do I want to see it?” Don’t just do it in your head. Write it down and then start sorting through tools to figure out what fits and what does not fit. It will help you to learn all of the tools.
If I am dealing with something I know nothing about (which is frequently) I can print up a bunch of graphs. Don’t even do logic, just graph everything any way you can. Don’t read them but print them all. Spread them out on the floor. Stand up so all you can really see is the pattern. Pick up the ones that look like they are telling you something and spend some time figuring out if they really mean anything. Don’t throw the others away until you have figured out for sure if they are useless. Don’t want to use any more paper than we have to.
You are putting too much pressure on yourself to know too much. At the beginning of a project you aren’t supposed to know anything. If you don’t know anything I am willing to bet you will have a better solution than the person who knows all about it, unless of course this is something involving quantum physics, predicting prime numbers, or something like that.
Good luck.
@teg153 You have committed one of the “sins” of data collection. You took “time” which is a continuous variable and converted it into a much less robust discrete value. I am sure somewhere along the line you captured the actual time spent in the OR. If not, you have lost a good deal of valuable information. I am still not sure what the underlying problem is. It seems that possibly you might be concerned with the forecast of how long a procedure should take rather than what percentage do the surgeons hit the target time window. Do you have a “goal” of what percent of the time do you wish to meet the forecast? Since the mode is 1.00 it appears that they don’t miss by too much. If you had a goal of say 95%, what percent of your procedures are worse than that? As others have indicated, be real clear as to your problem statement and what you want to learn from the data.
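The "what percent are worse than the goal" question above is a one-line calculation once the observations are in a list. A minimal sketch, assuming made-up usage values and a hypothetical helper `share_below`; the 95% goal is the example figure from the post, not a confirmed target.

```python
def share_below(values, goal):
    """Fraction of observations that fall short of the goal."""
    return sum(v < goal for v in values) / len(values)


# Hypothetical usage proportions, not real data.
sample_usage = [1.0, 1.0, 0.875, 1.0, 0.5, 1.0, 0.75, 1.0]

print(f"{share_below(sample_usage, goal=0.95):.1%} of observations miss the goal")
```

Note that with only nine possible values, a 95% goal is missed by every observation that is not exactly 1.000, which is one practical cost of the coarse buckets.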
@Darth At the risk of turning this into a debate that was not the intent of the original question, you stated that time is a continuous variable. Somewhere along the line someone decided to put this data in large buckets (I do agree there is probably some system in the background that has the actual time). At what point does time become continuous? Hours, minutes, seconds? In electronics a microsecond used to be fast, so does that make seconds discrete? We frequently see nanoseconds (defined by the amount of time between the time you graduate college and the time the alumni association asks for a donation), so now are microseconds discrete?
Unfortunately everything is really discrete at some level but is typically defined as that in terms of measurement capability and probably more so by our ability to calibrate the measurement system.
Just my opinion.
@mike-carnell I define the difference between discrete and continuous as the potential ability to subdivide the measurement and whether it is counted or measured by some device. Time can be divided quite a bit assuming, as you say, the measurement system has the capability to do it and that subdivision is usually done with some sort of measurement device. Counting the cases on a pallet or the people in a room doesn’t really have the potential ability to be subdivided into smaller units that make sense nor is there a “measurement” device other than the eyeball and the finger to point and count them. Then there is always the “gray” area where large counts can take on the characteristics and behavior of continuous data so you end up with “pseudo” continuous data. In the end, don’t get too crazy but recognize that your assumptions will have an impact on the tools you choose for analysis and thus the results that you have to interpret and use to answer your research questions. BTW, what happened to “Just my Humble Opinion” as your signoff? :-).
It’s always good to have these detailed discussions but I’ll say that I’ve seen others try to get as exact as yourself in front of new green/black belts and it’s caused much confusion.
My mentoring has encouraged them to treat much data as continuous since the end goal is often getting more pallets of material out the door.
I get really saddened with my newer belts who say “But I was told the % is discrete since it’s made of counts in the numerator and denominator”. I say, let’s do a distribution of your %’s and you’ll see it’s a normal-looking curve and we can easily apply mean statistics to show we’ve increased the cases/hr output. They look relieved and I say, “Hey, see you’ve accomplished your goal of 30% increase with statistical confidence.”
As one of my mentors has emphasized…it’s my two cents.
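As an illustration of the "mean statistics" idea in the post above, here is a minimal sketch of a before/after comparison that treats the metric as continuous. It uses Welch's t statistic, which is one common choice and not necessarily what the poster uses; the data and the name `welch_t` are hypothetical. Only the statistic and its degrees of freedom are computed; the p-value would come from a t table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(before, after):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(before), len(after)
    # Squared standard error of each sample mean.
    v1 = variance(before) / n1
    v2 = variance(after) / n2
    t = (mean(after) - mean(before)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical cases/hr readings before and after an improvement.
before = [10, 12, 11, 13, 14]
after = [15, 17, 16, 18, 19]

t_stat, dof = welch_t(before, after)
print(f"t = {t_stat:.2f} on {dof:.1f} degrees of freedom")
```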
@cseider While I agree that dogmatic dedication and anal retention is not always a desirable attribute when analyzing data, it is my opinion that given a choice, do it the right way. When all else fails, common sense should be utilized rather than a strict adherence to rules. Taking a shot at the data from different angles also helps. For example, if one is trying to determine the difference between three means the correct approach might be to use ANOVA. But what if the data fails the two major assumptions of normality and equal variance? Then we might be tempted to do a Mood’s median test. What I tell folks to do is try both. If you get the same directional answer then what the heck, sort of ignore the assumptions. They are pretty robust anyway. But, what if they give you different results? I would be inclined to go with the test that fulfilled the assumptions. I hope you aren’t saying you would “round off” and go with the ANOVA because it is cooler although it might not be correct. For the same effort, you can use the median test and be confident under interrogation that the correct tool was used and that you can stand behind your analysis with greater certainty than saying you blew off the analysis and went with something easier.
Oh, I’d always say do a Mood’s median test with non-normal data since 1. it’s statistically more correct and 2. it’s so similar in looks.
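The "try both" advice above can be sketched as follows. This is a minimal standard-library illustration with made-up groups and hypothetical function names; it computes only the one-way ANOVA F statistic and the chi-square statistic for Mood's median test, with their degrees of freedom, and leaves p-values to a table or a statistics package (SciPy, for instance, provides `scipy.stats.f_oneway` and `scipy.stats.median_test`).

```python
from statistics import mean, median


def anova_f(groups):
    """One-way ANOVA: return (F, df_between, df_within)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k)), k - 1, n - k


def moods_median_chi2(groups):
    """Mood's median test: return (chi-square, df).

    Values equal to the grand median are counted as "not above" (an assumed
    tie convention; other conventions exist).
    """
    grand_median = median(x for g in groups for x in g)
    above = [sum(x > grand_median for x in g) for g in groups]
    not_above = [len(g) - a for g, a in zip(groups, above)]
    n = sum(len(g) for g in groups)
    chi2 = 0.0
    for row in (above, not_above):
        row_total = sum(row)
        if row_total == 0:
            # Happens when no value exceeds the grand median, e.g. when more
            # than half the data sit at the maximum value.
            raise ValueError("median test is degenerate for this data")
        for g, observed in zip(groups, row):
            expected = row_total * len(g) / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, len(groups) - 1


# Hypothetical groups, not real data.
groups = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [6, 7, 8, 9, 10]]
print("ANOVA F, dfs:", anova_f(groups))
print("Mood's median chi-square, df:", moods_median_chi2(groups))
```

One caution for data like the original poster's: if more than half of the observations equal 1.000, the grand median is 1.000, nothing lies above it, and the median test in this form cannot be computed, which is why the sketch raises an error in that case.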
But when people talk discrete vs continuous, then the tool selection becomes much more confusing.
@cseider Then let me make it easier for you to understand. If I measure the number of ounces of tequila I have consumed, then it is likely to be continuous since I am “measuring” it and, depending on the measurement device, the quantity can be subdivided to the degree the instrument is capable of measuring. If I am interested in knowing how many shots I had, it would be “countable” and thus discrete. Unless you are the kind of guy that doesn’t finish the shot. I suggest that we both attempt to validate this methodology right now. Have a good weekend Chris and nice chatting with you. I sense that Mike has already started chatting with Jack and his friend Coke.
LOL….who uses a shot glass? Have a good weekend…. :)
@Darth @cseider I don’t drink that nasty crap. If it is straight rum we do Havana Club (when we can get it) and if you mix it with Coke it doesn’t matter much. Now if you want a boat drink you mix Captain Morgan Private Stock and Squirt. That is nice to just sit on the back porch, watch the sun go down and nurse it for a while. About 7pm Consuelo and I are slippin on down to Gruene Hall to catch Chris Isaaks and maybe dance a bit.
Have a great weekend.
© Copyright iSixSigma 2000-2014. Any reproduction or other use of content without the express written consent of iSixSigma is prohibited.