It’s data Jim, but not as we know it

During my black belt training I was introduced to the true scope of the normal distribution. Here was a distribution found extensively throughout nature and industry. It is a principal building block of the Six Sigma methodology, and a number of our essential statistical tools are based on it, including all of the t-tests. We discovered that if our customer specifications were more than six standard deviations from the mean we would be rocking.


It was with great expectation that I got back to the office and started looking at our data. I followed the basic approach of getting familiar with the data through a histogram, a time-series plot and a normality test. I was surprised to find during my first project that none of the data was normally distributed, and this has continued through my projects. I can count on the fingers of one hand the number of normal distributions I have found.
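As a minimal sketch of the normality-test step, assuming Python with only the standard library and simulated cycle times rather than real project data, the Jarque-Bera test checks whether a sample's skewness and excess kurtosis are compatible with a normal distribution. The function name `jarque_bera`, the sample names and all parameter values are illustrative assumptions, and this is not necessarily the test a given statistics package runs by default.

```python
import math
import random
import statistics


def jarque_bera(data):
    """Return (skewness, excess kurtosis, JB statistic, p-value)."""
    n = len(data)
    mean = statistics.fmean(data)
    # Central moments of the sample (population form, divided by n).
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    # JB is asymptotically chi-square with 2 degrees of freedom,
    # whose survival function is exactly exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return skew, excess_kurt, jb, p_value


random.seed(1)
# Hypothetical transactional cycle times in minutes (assumed exponential).
cycle_times = [random.expovariate(1 / 30) for _ in range(2000)]
# A genuinely normal sample for comparison.
normal_ref = [random.gauss(30, 5) for _ in range(2000)]

for label, sample in (("cycle times", cycle_times), ("normal reference", normal_ref)):
    skew, kurt, jb, p = jarque_bera(sample)
    print(f"{label}: skew={skew:.2f} excess_kurt={kurt:.2f} JB={jb:.1f} p={p:.3g}")
```

With millions of records, a test like this tends to reject for even small departures from normality, so the size of the skewness and kurtosis is often more informative than the p-value alone.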


So what could be wrong? Admittedly I work in a heavily transactional environment, looking at business processes and data drawn mainly from database-driven business applications. Could I be looking at too small a sample? Not likely, as I extract millions of records at a time for analysis. Could it be special causes? Possibly, but I have tried stripping out the outliers and stratifying across many dimensions, and still the distribution plots match none of the usual suspects. My “Individual Distribution Identification” does sometimes get a hit on 3-Parameter Weibull, but that does not help much. Could I be looking at the wrong type of data? Possibly, but a lot of it has been time-based and financial.
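To illustrate why stratifying is worth trying at all, here is a small sketch, again assuming Python's standard library and simulated data. Two hypothetical strata (the channel names `online` and `branch` and their parameters are invented for illustration) are each normal, yet the pooled sample is clearly not: its excess kurtosis sits far from the value of zero expected for a normal distribution.

```python
import random
import statistics


def excess_kurtosis(data):
    """Sample excess kurtosis; close to 0 for normally distributed data."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3.0


random.seed(2)
# Hypothetical strata: each one is normal, but with a different mean.
strata = {
    "online": [random.gauss(10, 2) for _ in range(5000)],
    "branch": [random.gauss(20, 2) for _ in range(5000)],
}
# Pooling the strata gives a bimodal, non-normal sample.
pooled = [x for values in strata.values() for x in values]

results = {name: excess_kurtosis(values) for name, values in strata.items()}
results["pooled"] = excess_kurtosis(pooled)

for name, value in results.items():
    print(f"{name}: excess kurtosis = {value:.2f}")
```

This only shows the mechanism. It does not imply that any particular business data set will split into normal subgroups, and the right stratification variable may simply not be recorded.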



Either way, I have found I mainly use the tools associated with the binomial distribution and the non-parametric tests, and have missed out on the rich set of tools built around the normal distribution. As my understanding progresses I hope this will change (e.g. by starting to look at data transformations).
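As a rough sketch of what a data transformation can do, assuming Python's standard library and simulated, strictly positive, right-skewed amounts (a lognormal shape is assumed here purely for illustration), taking the logarithm can pull the skewness back towards zero. The names `amounts` and `log_amounts` and the parameter values are hypothetical.

```python
import math
import random
import statistics


def skewness(data):
    """Sample skewness; close to 0 for symmetric data such as a normal sample."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


random.seed(3)
# Hypothetical transaction amounts, assumed lognormal (always positive).
amounts = [random.lognormvariate(4.0, 0.8) for _ in range(5000)]
# The log of a lognormal variable is normal, so the skew should largely vanish.
log_amounts = [math.log(x) for x in amounts]

print(f"raw skewness: {skewness(amounts):.2f}")
print(f"log skewness: {skewness(log_amounts):.2f}")
```

A log transform only applies to positive data and only helps when the skew is of this kind. Any specification limits would need to be transformed in the same way before being compared with the transformed data.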


So with some positive spin I could say I have discovered the Business Data Probability Distribution, but it wouldn’t wash with this audience. It may be better to paraphrase our hero of fact-based decision making and say “It’s data Jim, but not as we know it”.



Comments 3

  1. Meikah

    It’s true! Managing data can really be tricky. But I always remember Pareto’s 80-20 Rule: 80% of consequences stem from 20% of the causes. So for example, if we apply this to sales figures/data, we can say that 20% of clients are responsible for 80% of sales volume. Therefore there is danger in handling data. For all we know, we are studying all 100% of it when we only need to look at 20% to know how it’s affecting the entire data set. Peter Drucker once said that there’s nothing more useless than doing a task so efficiently yet so wrongly. At any rate, I wish you all the best in your projects.

  2. Ladi Olaoye

    Like you, I’m in the transactional world and often find non-normally distributed continuous data. A possible explanation (having not done any extensive research on it), based on the few projects I’ve seen and have been involved with, is that the processes were not specifically designed to consistently produce the output we are measuring (unlike in manufacturing) and so fail to produce a central tendency and normal dispersion about that central tendency. Also, a large sample may well contain numerous subsets which are within themselves normally distributed but whose identity (the subsets’, that is) is not known because it is not by any design nor recognisable through any previous experience.

    If the Six Sigma practitioner manages to improve a process and introduces controls that make the (continuous) output consistent within specification limits, one may well find normality in the post-improvement data where there wasn’t any at the start of the project, meaning a different set of tools would be needed to measure the improvements. mmmnn!

    I like to think a lot of data about nature is normally distributed by design. To avoid going down the evolutionism vs. creationism path, perhaps better to say: there is a significant enough degree of "order" in nature, however it came to be that way, and we quite often seem to know how to group it to reveal its normality.

  3. Robin Barnwell

    Sorry for the delay in responding.

    Meikah, thanks for the feedback. Most of our processes are business to consumer (B2C) and as such we have millions of transactions per year with different people, rather than having 80% of our business with a few customers.

    Ladi, thanks for the advice. The data does show a strong central tendency. I have been considering the tools I could use to better stratify the data to get relevant groups and am looking at using transformations, chi-analysis and data mining (entropy values) to find and investigate specific groups. I will report back if I get any success.
