Distribution of the Population and Histograms


This topic contains 13 replies, has 4 voices, and was last updated by  Chris Seider 6 months, 2 weeks ago.




    I work in a paint manufacturing plant and I’m trying to reduce the number of colour additions a batch of paint needs to reach the colour standard. I have chosen a specific line and worked out (via Minitab) the N (187), the mean (4.037) and the standard deviation (2.017). The target average is 2.70 colour additions. Spreadsheet attached. I am trying to move towards Green Belt level but I don’t know where to go from here. Can you please help me? How can I show the bell curve, the confidence level and the type of distribution I have, and whatever else I should be doing to show I’m doing the right thing?


    Robert Butler

    Your target average may be 2.7 but, as you noted, the average of your sample is 4.04 and your median is 4. The histogram of all of the data indicates quite a skew towards the high side in terms of counts of color additives per batch. In your spreadsheet you have numbers highlighted in yellow along with summary statistics. I’m assuming this is to indicate some kind of lot identifier (so many batches/lot?).

    In any event an overall histogram illustrates the fact that your process is a long way from any kind of average of 2.7 color additions. A check of boxplots by lot number reveals the same thing. In addition to having to move the mean you will also need to get feedback from management concerning the amount of variation they are willing to tolerate around a target mean of 2.7. Without that piece of information you could have a situation where your grand average is 2.7 but the variability is such that your changes to the process don’t matter.

    Since you have a physical lower bound of 0 color additives and you have a target very close to 0, the histogram of your process will be non-normal, and the issue for you is that of determining the causes for the need for large numbers of additives per batch. As noted, for the entire data set the median is 4 and a check of the lower and upper quartiles indicates values of 3 and 5 respectively. In other words 75% of your product requires more than 3 additives and 25% requires more than 5.

    If you think your data is representative then I would say the next task is to understand the need for so many additives to so much of your product. From the standpoint of the 80/20 rule you might want to consider first looking at those cases where the color count is greater than 5 (the upper 25%).

    If you can answer that question and find a solution such that the maximum additive count becomes 5 then, based on your data, you would have a situation where your lower quartile is 3, your upper quartile is 4, your median would be 3 (a whole number color count additive) and your mean would be 3.3 with a standard deviation of 1.05. Eliminating the group above 5 additives wouldn’t get you to 2.7 but it would be a large improvement over your current situation.

    I’ve attached two graphs, a histogram and a lot (?) level boxplot, to help you think about what it is that your data is telling you.


    Chris Seider

    Not being a smart alec….promise. But consider following basic tools of the DMAIC process such as a process map and gather data on X’s also.



    Thank you Robert, that is very helpful. I’m required to show a bell curve with 95% confidence level. Can you help here?
    The reason I am not so familiar with the topic is the speed of the training given by the master black belt training. It was very fast and the time to assimilate and comprehend was rather limited. Hence why I seek support. I do appreciate any help you can give me throughout this project.



    Sorry, I meant to say Black Belt trainer not training.


    Robert Butler

    That may be the requirement but the basic physics of your situation will guarantee it won’t happen. As I mentioned in my first post you have an actual physical lower bound of 0 additives and you are trying to work very close to that lower bound which means you are going to have an asymmetric distribution. Just look at the histogram of your process as it is currently running – it is already skewed. If you manage to eliminate the upper 25% of that distribution the data will still be skewed just with a shorter tail.

    What you can do is this. If you can identify and eliminate the occurrence of the upper 25% of the current distribution (or drastically reduce the odds of anything exceeding 5 additives) then you could take the data for your new process, build a normal probability plot of the data, pick off the values corresponding to the 2.5 and 97.5 percentiles and you will have an equivalent two sigma spread around your process mean.

    By way of showing improvement in the process you would compute the equivalent 95% bounds of your process as it is now running and then show the difference after you have improved the process.
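A minimal sketch of the before/after comparison described above. It takes the 2.5 and 97.5 percentiles straight from the data rather than reading them off a normal probability plot, which is a simplification. Both samples and the function name `equivalent_95_spread` are made up for illustration and are not the poster's data.

```python
from statistics import quantiles


def equivalent_95_spread(counts):
    """Return the empirical 2.5th and 97.5th percentiles of the data."""
    # n=40 gives 39 cut points at steps of 2.5%, so the first cut
    # is the 2.5th percentile and the last is the 97.5th.
    cuts = quantiles(counts, n=40, method="inclusive")
    return cuts[0], cuts[-1]


# Hypothetical colour-addition counts per batch (illustration only).
before = [1, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 7, 9, 11]
after = [1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5]

lo_b, hi_b = equivalent_95_spread(before)
lo_a, hi_a = equivalent_95_spread(after)

print(f"before: {lo_b:.2f} to {hi_b:.2f} (width {hi_b - lo_b:.2f})")
print(f"after:  {lo_a:.2f} to {hi_a:.2f} (width {hi_a - lo_a:.2f})")
```

A narrower interval after the change is the evidence of improvement. No bell curve is assumed anywhere in the calculation.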

    If you want to do this you will need justification for the approach. Try borrowing the book Measuring Process Capability by Bothe through inter-library loan and read Chapter 8 Measuring Capability for Non-Normal Variable Data. It covers the details of this method for determining equivalent spread.



    Why do you want to know the distribution for your process? Control charts work with any distribution, without normalization. They can also be used for short runs such as yours. I suggest buying “Short Run SPC” – Dr Wheeler.


    Robert Butler

    I don’t think the OP was concerned about process control nor about control chart issues. My take on his post was that he was given the task of reducing the average number of additives per paint lot. To that end I focused on what his current set of data was telling him about his process and I suggested a possible way of looking at his process that, if successful, would result in reducing the odds of having to add more than 5 additives to a lot of material.

    I realize that someone in his organization said they wanted a normal distribution but I didn’t get the impression this had anything to do with control charts either.


    Chris Seider

    @rbutler Agreed. Solve the problem.

    He didn’t get “my input” about getting data on X’s, etc which might get a solution sooner.



    Hi Chris

    I did gather my X’s Data that impact on the primary Metric i.e., colour moves. Here is my list: Tinter Strength, Tinter Additions, Dispersion time, Machine speed, Raw materials,
    Skill levels, Scales/weight, Specification, Spindle type & Blade type & size.

    I will set up a DoE from 3 of the factors above probably using a variable speed drive for machine speed, tinter addition rate (the rate at which the tinter is added to the product), time to disperse.

    I am open to suggestions and guidance if you think my choices are weak.


    Robert Butler

    Not so fast! :-) If your most recent post is an accurate summary of all you have done then you still have some work to do before thinking about a design.

    If you are going to try to run the analysis as I mentioned above then you should first split your data into the group with more than 5 additives and the group with 5 or fewer, and run either simple bean counts or means tests of all of the factors you have listed: Tinter Strength, Tinter Additions, Dispersion time, Machine speed, Raw materials, Skill levels, Scales/weight, Specification, Spindle type & Blade type & size.

    I’m not sure just how you would characterize some of the variables such as skill levels, specification and scales/weight but if you have some way of classifying these things then you should look at a summary of the differences in the counts/statistics of all of these properties split on the basis of the two groups.

    1. To see if there are any significant differences – for the continuous measures you could use a two sample t-test. At a guess these would be tinter strength and additions, dispersion time and machine speed. Depending on how scales/weight, specifications, and skill level are characterized (either continuous or ordinal) they might be included in this list as well.

    For those variables that are most likely nominal – Raw materials (suppliers?),Spindle type & Blade type & size you would want to run a bean count to see if there are obvious splits in the counts of these things between the two groups.

    2. Given the above information you will be in a much better position to decide which of these variables might be worth checking at the level of a DOE.



    Hi Robert,

    Thanks again for keeping me on the right track.

    I have a couple of questions I’d like to ask, although they might sound silly, here goes..
    1. How did you calculate that 75% of your product requires more than 3 additives and 25% requires more than 5?
    2. What do you mean by run either simple bean counts or means tests? How do you do this?
    3. What do you mean by “splits” in the counts of these things between the two groups?



    Robert Butler

    1. If you have some kind of software that allows you to identify the quartiles of your data you will find that the lower quartile (25%) is 3, the median (the 50% point) is 4, and the upper quartile (75%) is 5. Therefore 25% of your data has a color count >5.

    If you don’t have access to any statistical software you can approximate this process by sorting the entries for color count in the Excel sheet in ascending order. The actual count of real entries is 187.

    25% of 187 is 46.75 – rounded up to 47, 50% of 187 is 93.5 rounded up to 94, and 75% of 187 is 140.25 rounded down to 140.

    If you count down the sorted entries to the 47th entry the color count value is 3, down to the 94th entry the color count value is 4, and down to the 140th entry the color count value is 5. Therefore 25% of the color count values are greater than 5.

    The only problem with doing the calculation this way is that you are running the analysis on the sample so you will not get perfect agreement with a statistics package running the same analysis.

    In particular you will see that, for this sample, for each of the cut points there are still values of 3 after the 47th entry, values of 4 after the 94th and values of 5 after the 140th. For this particular sample 16% of the entries are greater than 5 and 27% of the entries have a value of 5 or more.
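The sort-and-count approach above can be sketched as follows. The helper names and the short `sample` list are hypothetical. Only the count of 187 and the cut positions 47, 94 and 140 come from the post. The second helper counts the entries strictly above a threshold, which is how ties at the cut point make the actual share differ from the nominal 25%.

```python
import math


def cut_position(n, fraction):
    """1-based position in the sorted list, rounding half up."""
    return max(1, math.floor(fraction * n + 0.5))


def value_at_fraction(values, fraction):
    """Value found by counting down the sorted entries to the cut position."""
    ordered = sorted(values)
    return ordered[cut_position(len(ordered), fraction) - 1]


def share_above(values, threshold):
    """Fraction of entries strictly greater than the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)


# Cut positions for the 187 entries mentioned in the post.
positions = [cut_position(187, f) for f in (0.25, 0.50, 0.75)]
print(positions)

# Hypothetical colour counts (illustration only, not the real sheet).
sample = [1, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 8]
q1, med, q3 = (value_at_fraction(sample, f) for f in (0.25, 0.50, 0.75))
print(q1, med, q3, share_above(sample, q3))
```

Running `share_above` on the real column with a threshold of 5 gives the actual fraction of batches that need more than 5 additions.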

    2. What I mean by simple bean counts or means is that you take all of the variables you listed as possibly being related to the total count of color additives and you class their values according to whether or not they are associated with color counts less than or equal to 5 or greater than 5.

    Example: Bean count using the actual color count values from the sample

    Color count less than or equal to 5 = 156, color count greater than 5 = 31

    Variable:Blade Types 1,2,3,4

    Blade Type CC<=5 : Type 1: 106, Type 2: 35, Type 3: 10, Type 4: 5 Total count 156
    Blade Type CC>5: Type 1: 1, Type 2: 4, Type 3: 7, Type 4: 19 Total count 31

    Therefore color counts <= 5 are most likely associated with blade types 1 and 2 (141 out of 156) whereas color counts >5 are most likely associated with Type 4 with Type 3 running a distant second (19 out of 31 or, combining type 3 and 4, 26 out of 31).
    This would suggest blade type might be associated in some way with larger or smaller numbers of color addition. In this case you would want to look at the various physical aspects of the blades to see if there might be some connection between some of the physical properties and number of colors added.
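As a rough sketch, the bean count above can be tabulated with a few lines of code. The counts are the example numbers from this post, and the variable and function names are just for illustration.

```python
# Example blade-type counts from the post, split by colour count (CC) group.
blade_counts = {
    "CC<=5": {"Type 1": 106, "Type 2": 35, "Type 3": 10, "Type 4": 5},
    "CC>5": {"Type 1": 1, "Type 2": 4, "Type 3": 7, "Type 4": 19},
}


def shares(row):
    """Convert raw counts for one group into fractions of that group's total."""
    total = sum(row.values())
    return {blade: count / total for blade, count in row.items()}


for group, row in blade_counts.items():
    total = sum(row.values())
    parts = ", ".join(f"{b}: {s:.0%}" for b, s in shares(row).items())
    print(f"{group} (N = {total}): {parts}")
```

Looking at the fractions within each group, rather than the raw counts, makes the comparison fair even though the two groups are very different in size.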

    Example: Means using the total count of 187

    Variable: Dispersion time

    Dispersion times for CC<=5 : Mean dispersion time is 60 seconds with a standard deviation of +-2 (N = 156)
    Dispersion times for CC>5: Mean dispersion time is 15 seconds with a standard deviation of +-3 (N = 31)

    Run a two sample t-test on the means

    t = (60 - 15)/sqrt[{(2*2)/156} + {(3*3)/31}] = 45/0.5621 = 80 (rounded). Therefore there is a significant difference between the mean dispersion times for color counts <=5 and color counts >5. This would suggest there is something about dispersion time that impacts the number of colors added to a given batch.
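The same arithmetic as a small sketch, using the example means, standard deviations and group sizes from this post. The function name `two_sample_t` is made up for illustration. It uses the unpooled standard error, as in the hand calculation, and returns only the t statistic, not a p-value.

```python
import math


def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for two independent groups using the unpooled standard error."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error


# Example dispersion-time summaries from the post (seconds).
t = two_sample_t(mean1=60, sd1=2, n1=156, mean2=15, sd2=3, n2=31)
print(f"t = {t:.2f}")
```

With one such call per continuous factor, the whole list of X's can be screened from a table of group means, standard deviations and counts.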

    You can run these calculations manually but it will be very labor intensive and time consuming. Given the number of comparisons you will need to make you will really need to have access to some kind of statistical software and you will need to have an understanding of what that software is doing. For a basic level of understanding of the statistics I would recommend getting a copy of The Cartoon Guide to Statistics by Gonick and Smith. You should note that the section on the t-test makes reference to the “need” to have normally distributed data – this is an unneeded constraint. The t-test does quite well with non-normal data. If you need a reference check pages 52-53 of The Design and Analysis of Industrial Experiments 2nd Edition by Owen Davies.

    3. The splits are just the act of grouping the various measures according to whether they correspond to a color count less than or equal to 5 or color counts greater than 5.


    Chris Seider

    Yes, consider the tons of graphical and statistical tools available to get a glimpse of the impacting factors, as @rbutler laid out in his follow-up to you.
