During my black belt training I was introduced to the true scope of the normal distribution. Here was a distribution found extensively throughout nature and industry. It is a principal building block of the Six Sigma methodology, and a number of our essential statistical tools are based on it, including all of the t-tests. We discovered that if our customer specifications were more than six standard deviations from the mean we would be rocking.
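
To put a rough number on "six standard deviations from the mean", here is a minimal sketch of the tail arithmetic. It assumes a perfectly normal, perfectly centred process with symmetric specification limits, and it ignores the 1.5 sigma shift often used in Six Sigma tables; the function name `fraction_outside` is made up for illustration.

```python
import math


def fraction_outside(k_sigma: float) -> float:
    """Two-sided probability that a normal value falls more than
    k_sigma standard deviations away from the mean."""
    return math.erfc(k_sigma / math.sqrt(2))


# Compare specification limits at 3 and 6 standard deviations.
for k in (3, 6):
    per_million = fraction_outside(k) * 1e6
    print(f"{k} sigma: about {per_million:.4f} out-of-spec results per million")
```

Under these assumptions, limits at three standard deviations leave a few thousand results per million outside specification, while limits at six leave only a few per billion. All of this depends on the data actually being normal.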

 

It was with great expectation that I got back to the office and started looking at our data, following the basic approach of getting familiar with it through a histogram, a time-series plot and a normality test. I was surprised to find during my first project that none of the data was normally distributed, and this has continued through my later projects. I can count on the fingers of one hand the number of normal distributions I have found.

 

So what could be wrong? Admittedly, I work in a heavily transactional environment, looking at business processes and data drawn mainly from database-driven business applications. Could I be looking at too small a sample? Not likely, as I extract millions of records at a time for analysis. Could it be special causes? Possibly, but I have tried stripping out the outliers and stratifying across many dimensions, and still the distribution plots match none of the usual suspects. My “Individual Distribution Identification” does sometimes get a hit on the 3-Parameter Weibull, but that does not help much. Could I be looking at the wrong type of data? Possibly, but a lot of it has been time-based and financial.
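
The basic question behind all of this (is a sample plausibly normal?) can be sketched in a few lines. This is a minimal sketch and not the test from any particular statistics package: it swaps in the Jarque-Bera test, which is built from sample skewness and kurtosis, because it needs only the Python standard library. The function names and the simulated "processing time" data are hypothetical.

```python
import math
import random


def skewness_and_kurtosis(data):
    """Return (skewness, kurtosis) computed from population central moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def jarque_bera(data):
    """Return (statistic, p_value) for the Jarque-Bera normality test.

    Under normality the statistic is approximately chi-square with
    2 degrees of freedom, whose upper tail is exactly exp(-x / 2).
    """
    n = len(data)
    skew, kurt = skewness_and_kurtosis(data)
    statistic = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    return statistic, math.exp(-statistic / 2)


# Hypothetical data: right-skewed "processing times" in minutes.
rng = random.Random(42)
processing_times = [rng.expovariate(1 / 30) for _ in range(5000)]

statistic, p_value = jarque_bera(processing_times)
print(f"JB statistic = {statistic:.1f}, p-value = {p_value:.4g}")
print("Looks normal" if p_value >= 0.05 else "Not normal at the 5% level")
```

A normal sample has skewness near 0 and kurtosis near 3, so the statistic stays small. Strongly skewed data, such as the simulated times here, pushes it up and the p-value towards zero.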

 

Either way, I have found that I mainly use the tools associated with the binomial distribution and the non-parametric tests, and have missed out on the rich set of tools built around the normal distribution. As my understanding progresses I hope this will change (for example, once I start looking at data transformations).
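
As a hedged illustration of what a data transformation can do, the sketch below applies a natural-log transform to simulated right-skewed "invoice amounts" and compares the skewness before and after. The log transform is only one common choice, and the data here is lognormal by construction, so it behaves far better than real business data is likely to. The names `skewness`, `invoice_amounts` and `log_amounts` are hypothetical.

```python
import math
import random


def skewness(data):
    """Sample skewness from population central moments (0 for symmetric data)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


# Hypothetical data: strictly positive, right-skewed invoice amounts.
rng = random.Random(7)
invoice_amounts = [rng.lognormvariate(5.0, 1.0) for _ in range(5000)]

# The log transform is only defined for values greater than zero.
log_amounts = [math.log(x) for x in invoice_amounts]

print(f"skewness before: {skewness(invoice_amounts):.2f}")
print(f"skewness after:  {skewness(log_amounts):.2f}")
```

If the transformed values pass a normality check, the normal-based tools can be applied on the transformed scale, with results converted back to the original units for reporting.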

 

So, with some positive spin, I could say I have discovered the Business Data Probability Distribution, but it wouldn’t wash with this audience. It may be better to paraphrase our hero of fact-based decision making and say, “It’s data, Jim, but not as we know it”.

 

 

About the Author