Working with Non-Normal Data
- This topic has 14 replies, 12 voices, and was last updated 14 years, 11 months ago by
annon.
August 20, 2007 at 4:35 pm #47883
Hi,
I am working on a project to reduce the number of trouble tickets in a specific category. I have a sample of trouble tickets (TTs) per day for a 30-day period. When I plotted my histogram, the data turned out to be non-normal. I transformed the data using a Box-Cox transformation and the data still turned out to be non-normal (p-value < 0.05). I think this is being caused by an outlier, but I am not sure. I'm not really sure how to analyze the data now. Can anyone offer any suggestions?
Also, what type of control chart would I use?
Thanks
August 20, 2007 at 7:02 pm #160180
P chart for defectives, U chart for defects.
Should the data be non-normal? If so, there are a couple of options:
1) try to fit a curve using Minitab
2) use non-parametrics to do all of your testing
If there is reason to believe that the data should be normally distributed, then you need to understand why your sample isn’t so.
August 23, 2007 at 5:39 am #160313
hitesh chopra, Participant (@hitesh-chopra)
Hi,
A very basic question I have is how you were able to plot a histogram for attribute data and hence check normality.
Attribute data do not follow a normal distribution curve; they follow either a Poisson or a binomial distribution.
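One rough way to see whether daily ticket counts look Poisson-like is the index of dispersion (sample variance divided by sample mean), which is close to 1 for Poisson data. The sketch below is only an illustration: the function name `dispersion_index` and the sample counts are made up for this example and are not data from the thread.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return sample variance / sample mean for a list of counts.

    Values near 1 are consistent with a Poisson model; values well
    above 1 suggest overdispersion (e.g. special causes or mixed days).
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; the index is undefined")
    return variance(counts) / m


# Hypothetical daily ticket counts (illustrative only, not from the thread)
daily_tickets = [12, 9, 14, 11, 10, 13, 8, 15, 12, 11]
print(round(dispersion_index(daily_tickets), 2))
```

A value far from 1 does not say what the right model is; it is only a hint that a plain Poisson assumption deserves a closer look.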
August 23, 2007 at 8:40 am #160316
As your data is ‘Number of tickets’, it is discrete data and will definitely turn out to be non-normal. Treat it like discrete data and use the tools available for discrete data. You don’t need to check normality in this case.
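For counts per day, one of the standard discrete-data tools is a c chart, whose limits are the mean count plus or minus three times the square root of the mean. The sketch below assumes the opportunity for tickets (for example, the number of users) is roughly constant from day to day; if it varies, a u chart would be the more usual choice. The function names are hypothetical and chosen only for this example.

```python
from math import sqrt


def c_chart_limits(counts):
    """Return (LCL, center line, UCL) for a c chart of per-day counts."""
    c_bar = sum(counts) / len(counts)
    sigma = sqrt(c_bar)                # Poisson assumption: variance equals mean
    lcl = max(0.0, c_bar - 3 * sigma)  # a count cannot fall below zero
    ucl = c_bar + 3 * sigma
    return lcl, c_bar, ucl


def out_of_control(counts):
    """Return the indices of days whose count falls outside the limits."""
    lcl, _, ucl = c_chart_limits(counts)
    return [i for i, c in enumerate(counts) if c < lcl or c > ucl]
```

Days flagged this way are candidates for a special-cause investigation, not values to delete automatically.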
August 23, 2007 at 8:56 am #160317
Prasoon, Participant (@Prasoon)
How can we say that the number of tickets is discrete data? The number of tickets will be an integer.
I think it is variable data and should not be treated as discrete.
Try to remove the outliers from the data and then check for normality. Try to find the special causes for these outliers.
I hope it will work in your case.
Thanks,
Prasoon
August 23, 2007 at 10:28 am #160319
Six Sigma guy, Member (@Six-Sigma-guy)
Why are you checking for normality? I assume trouble tickets are like defective tickets?
August 23, 2007 at 11:05 am #160321
Arturo Ruiz Falcó, Participant (@Arturo-Ruiz-Falcó)
Lou:
First of all, what kind of data are in the tickets? I assume you are analyzing continuous data (e.g. length values).
Second, if you are picking only the trouble tickets, you are not taking a random sample of the process. It is likely that you are picking the tails of the parent population. Even if the parent population is normally distributed, you cannot expect such a sample to show normality.
Third, why do you need to transform this data to be normally distributed? It would be justified if you wanted to apply a statistical test or chart that requires normality. Is this the case? I assume that your real objective is to discover why parts are out of spec and therefore trouble tickets are raised. In that case I would suggest making scatter plots of the final quality variable vs. the process parameters (e.g. CTQ vs. CTP) and then looking for a pattern in the trouble tickets.
I hope it helps. Good luck,
Arturo
August 23, 2007 at 1:01 pm #160324
Hi Prasoon
What is being measured here is the number of tickets raised, and that is a count of the tickets. A count is by any means discrete. To explain further: can there be any day when 10.5 tickets were raised? It will be either 10 or 11, hence it is discrete data.
Anand
August 23, 2007 at 2:30 pm #160336
Matthieu, Participant (@Matthieu)
Lou,
First, do you trust your data?
Have you done an MSA on the data you deal with?
Have you taken into account the “Saturday” and “Sunday”?
Do you have a benchmark to compare your baseline against?
If you suspect an outlier is the cause of the non-normal distribution, find it, understand it, isolate (remove) it from your dataset, and re-run your chart.
This will help you in obtaining a normal distribution.
This will not give you any indication of why these tickets were created or how to prevent them.
Prasoon, Anand,
The “Discrete Data” definition below can be found at: https://www.isixsigma.com/dictionary/Discrete_Data-226.htm
Discrete Data
Discrete data is information that can be categorized into a classification. Discrete data is based on counts. Only a finite number of values is possible, and the values cannot be subdivided meaningfully. For example, the number of parts damaged in shipment.
Attribute data (aka discrete data) is data that can’t be broken down into a smaller unit and add additional meaning. It is typically things counted in whole numbers. There is no such thing as ‘half a defect.’ Population data is attribute because you are generally counting people and putting them into various categories (i.e. you are counting their ‘attributes’). I know, you were about to ask about the ‘2.4 kids’ statistic when they talk about average households. But that actually illustrates my point. Whoever heard of .4 of a person? It doesn’t really add additional ‘meaning’ to the description.
See Continuous Data for alternative data type.
Observations made by categorizing subjects so that there is a distinct interval between any two possible values. “Good or Bad” and “Tall or Short”
Matthieu
August 23, 2007 at 7:12 pm #160368
Six Sigma BBIT Linda, Member (@Six-Sigma-BBIT-Linda)
Hi Lou:
I am also working on a project to reduce ticket count. My mentor has taught me to do the following:
1. Collect data
2. Run the graphical summary in Minitab to determine if the data is normal
3. If the data is normal, create a control chart (in this case you would use a P chart since your data is discrete)
4. If your data is not normal, you will need to investigate, as the previous messages have stated. If an outlier is causing this and you determine the reason, you can eliminate it and rerun. When you run the graphical summary in Minitab you will be able to identify whether there are outliers: the output gives you a histogram and a boxplot, and you can identify the outliers on the boxplot. Use the paintbrush to highlight the outliers and Minitab will provide you with the row/location of each outlier on your spreadsheet.
5. Once your data is normal, run the P chart, and if the process is stable you can run process capability using the binomial distribution
Hope this helps.
August 23, 2007 at 8:15 pm #160371
Robert Butler, Participant (@rbutler)
Let’s back up a bit here.
You said, “I am working on a project to reduce the # of trouble tickets in a specific category. I have a sample of trouble tickets (TTs) per day for a 30 day period.”
What we need are more details.
1. Give us a definition of a trouble ticket – better yet give us an example. Is this some kind of pass/fail or is it some kind of check list or?
2. Does this specific category have only a single thing that will result in the issuing of a trouble ticket or does it have multiple things that could result in a trouble ticket?
3. If there are multiple things within this specific category that could generate a trouble ticket – where’s your bean count by thing (the Pareto chart)? What does it look like?
4. Since you are working to reduce the number of trouble tickets one would assume you would want to look at frequency of trouble tickets against things like time of day, day of the week, shift changes over time, production lines across time, raw material changes over time, etc. In short – what does this data look like when plotted against things that might show trending or clustering which in turn might suggest possible cause and effect?
Histograms are interesting but, based on what you have posted so far, I don’t see why you would even care about them or any of their characteristics at this point.
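The “bean count by thing” and the look at day of the week can both be done with simple tallies. The sketch below is a minimal illustration under assumed inputs: a list of category labels (one per ticket) and a list of opening dates. The function names and the input layout are hypothetical and do not describe any particular ticketing system.

```python
from collections import Counter
from datetime import date


def pareto_table(categories):
    """Return rows of (category, count, cumulative percent), largest first."""
    counts = Counter(categories).most_common()
    total = sum(n for _, n in counts)
    rows, running = [], 0
    for name, n in counts:
        running += n
        rows.append((name, n, round(100 * running / total, 1)))
    return rows


def tickets_by_weekday(open_dates):
    """Count tickets per weekday to look for clustering over the week."""
    names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
    tally = Counter(d.weekday() for d in open_dates)  # Monday is 0
    return {names[i]: tally.get(i, 0) for i in range(7)}
```

The same tallying idea extends to time of day, shift, or any other factor recorded on the ticket.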
August 23, 2007 at 8:56 pm #160372
Linda,
Thanks for your response. When I ran the graphical summary, the data were not normal. I tried transforming the data (my thought process being that I need normal data in order to do statistical analysis), but the data were still not normal. As folks have already posted, since I am working with attribute data, I should not have expected my data to be normal.
I guess my question then is, if I have attribute data (that does not fit a normal distribution) how can I analyze the data? I wouldn’t be able to construct a control chart to monitor the process, correct?
Thanks again for your response. I’d be curious to know how your project turns out.
August 23, 2007 at 9:11 pm #160374
Robert,
Thanks for your feedback. Being new to SS, I guess I was hell-bent on having normal data to work with, not realizing that there will be times when I will have non-normal data. To answer some of your questions:
1. Give us a definition of a trouble ticket – better yet give us an example. Is this some kind of pass/fail or is it some kind of check list or?
In this case, the category of trouble tickets is for MS Outlook. Some examples: I can’t access my email, I am missing emails, etc.
2. Does this specific category have only a single thing that will result in the issuing of a trouble ticket or does it have multiple things that could result in a trouble ticket?
There could be a number of things wrong with Outlook that would cause a customer to open a TT.
3. If there are multiple things within this specific category that could generate a trouble ticket – where’s your bean count by thing (the Pareto chart)? What does it look like?
When I created the Pareto chart, it clearly highlighted the major problem areas causing TTs to be opened. In this case there were about 5 areas that made up the 80%. We’re beginning to focus on identifying probable causes for each of those areas.
4. Since you are working to reduce the number of trouble tickets one would assume you would want to look at frequency of trouble tickets against things like time of day, day of the week, shift changes over time, production lines across time, raw material changes over time, etc. In short – what does this data look like when plotted against things that might show trending or clustering which in turn might suggest possible cause and effect?
That’s a great question. When I collected my data I was simply concentrating on how many TTs had been opened in a 30-day period. I had a breakdown of TTs per day for that time frame, but didn’t take into consideration factors such as time of day.
I really appreciate your comments. I am beginning to realize I was probably making this much more difficult than I should have.
August 24, 2007 at 6:30 pm #160411
Robert Butler, Participant (@rbutler)
Based on your reply I’d say at this stage of the effort there is no need to even think about distributions. I’d recommend going back to the main areas identified by the Pareto and start tearing that data apart. Some possible areas of investigation could be:
1. Is there a connection between a type and frequency of a particular trouble ticket and things like – level of worker training, level of worker experience, location of worker on a network, location of a worker on a server, etc.?
2. Is there any connection between type of ticket and worker occupation? There is a good chance that some occupations will use various aspects of Outlook more than others.
3. How uniform are the basic programs? (i.e. are some sections still running Windows ME whereas other sections are running Windows XP?). If this kind of split exists is there any connection between this split and the frequency of occurrence of a particular trouble ticket?
...and so on.
August 26, 2007 at 5:34 pm #160434
A good data collection plan and drilldown will take you from the corporate goal or objective that the project supports all the way through to the process, the process output, a specific characteristic of the output, its operational definition, unit and metric, data type, spec, standard, and DO per unit.
From here, you know precisely what questions need to be answered and thus, what data should be gathered and what tools (generally) you will be using to analyze the data set.
Good luck. And do the work up front in Define and Measure. Makes things a lot easier.