# Use Historical Data for DOE?


Viewing 9 posts - 1 through 9 (of 9 total)
#55796

Aaron Chambers
Participant

Can historical data be used in design of experiments? If so, how does that work since you can’t deliberately manipulate the data? I’m using Minitab 17.

#201761

Robert Butler
Participant

Yes, you can as long as you are a firm believer in the “garbage in gospel out” approach to scientific investigation.

The issue is this – the point of a design is that you choose a specific set of experiments and you run them in as random an order as possible. This gives you two things:
1. The matrix of the design guarantees the variables of interest will be independent of one another.
2. By randomizing the run order you are also guaranteeing that the effect of any variable that is unknown, uncontrolled, or both will manifest itself only as a contribution to the error term of the model.
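To make the two points concrete, here is a minimal sketch of a 2^3 full factorial in coded units. The factor names `A`, `B`, `C` and the seed are placeholders chosen for illustration, not anything from the original question. It checks that every pair of factor columns has a zero dot product (the independence that the design matrix provides) and then shuffles the run order.

```python
import itertools
import random

# Full 2^3 factorial in coded units (-1 = low, +1 = high).
# Factor names are hypothetical placeholders.
factors = ["A", "B", "C"]
design = [dict(zip(factors, levels))
          for levels in itertools.product([-1, 1], repeat=len(factors))]

def column_dot(runs, f1, f2):
    """Dot product of two design columns; 0 means the columns are orthogonal."""
    return sum(run[f1] * run[f2] for run in runs)

# Point 1: every pair of distinct factor columns is orthogonal.
pair_dots = {(f1, f2): column_dot(design, f1, f2)
             for f1, f2 in itertools.combinations(factors, 2)}

# Point 2: randomize the run order so that drift or unknown variables
# do not line up systematically with any one factor.
rng = random.Random(17)  # arbitrary seed, only for repeatability
run_order = list(range(len(design)))
rng.shuffle(run_order)
randomized_design = [design[i] for i in run_order]

print(pair_dots)
for run_number, run in enumerate(randomized_design, start=1):
    print(run_number, run)
```

Historical data rarely offers either property: the X columns are usually correlated, and the "run order" is whatever the process happened to do.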

Suppose you take happenstance/historical data and try to come up with a list of experiments that approximate, to some degree, the experiments in the design space (in my experience this is next to impossible if the variable list is 5 or more). What you don’t know and cannot know is the confounding effect of the various unknown/uncontrolled variables associated with each of the historical data runs. This, in turn, means you won’t know whether the effect attributed to a particular variable is really due to that variable and not to some other variable or combination of variables, all of which are unknown and uncontrolled.

What you can do with historical data is the following: Take the block of data of interest, scale all of the X variables to a -1 to 1 range and analyze the matrix of the X variables using the approach of eigenvalues and condition indices. It is my understanding that Minitab does not have this capability; however, it is also my understanding that there are R programs which will do this.
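For readers without such a tool, the scaling and condition-index calculation can be sketched in Python with NumPy. This is a sketch under stated assumptions, not the poster's own procedure: the data are synthetic, the function names are made up for illustration, and the choice to include an intercept column and scale every column to unit length follows the convention described in Belsley, Kuh, and Welsch. Each condition index is the largest singular value of the scaled matrix divided by one of its singular values.

```python
import numpy as np

def scale_to_unit_range(X):
    """Linearly map each column of X onto [-1, 1] (assumes no constant columns)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - lo) / (hi - lo) - 1.0

def condition_indices(X):
    """Condition indices of [intercept, X] after scaling columns to unit length."""
    design = np.column_stack([np.ones(len(X)), X])
    design = design / np.linalg.norm(design, axis=0)
    sv = np.linalg.svd(design, compute_uv=False)  # singular values, descending
    return sv.max() / sv

# Synthetic "historical" block: x3 is almost a linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.uniform(50, 100, 150)
x2 = rng.uniform(1, 5, 150)
x3 = x1 + 12 * x2 + rng.normal(0, 0.5, 150)

X_scaled = scale_to_unit_range(np.column_stack([x1, x2, x3]))
indices = condition_indices(X_scaled)
print(np.round(indices, 1))  # a value above 10 flags a near-dependency
```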

Once you have the ability to run an eigenvalue analysis you should assign a count variable (a dummy Y variable for convenience purposes only) to the list of the experiments you want to include in the analysis and then run a regression of all of the X variables of interest against the count variable (so, if you have 150 data points the count variable will range from 1-150). After the first run examine the condition indices, find the largest condition index and, if it is >10, then delete the variable with the largest eigenvalue associated with that condition index. Run the analysis again without the deleted variable and repeat the above until the largest condition index is <= 10.
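The loop described above might look like the following sketch. Several things here are assumptions rather than the poster's exact recipe. First, condition indices depend only on the X matrix, so the code skips the dummy count response, which is only needed to get regression software to print the diagnostics. Second, "the variable with the largest eigenvalue associated with that condition index" is read here as the variable with the largest variance-decomposition proportion on that index, which is how Belsley, Kuh, and Welsch identify the variables involved. Third, the data and function names are hypothetical.

```python
import numpy as np

def collinearity_diagnostics(design):
    """Return (condition indices, variance-decomposition proportions).

    proportions[j, k] is the share of var(b_k) tied to singular value j.
    """
    scaled = design / np.linalg.norm(design, axis=0)
    _, sv, vt = np.linalg.svd(scaled, full_matrices=False)
    phi = (vt / sv[:, None]) ** 2          # phi[j, k] = v_kj^2 / mu_j^2
    return sv.max() / sv, phi / phi.sum(axis=0)

def prune_collinear(X, names, threshold=10.0):
    """Drop one variable at a time until the largest condition index <= threshold."""
    X, names, dropped = np.asarray(X, dtype=float), list(names), []
    while X.shape[1] > 1:
        design = np.column_stack([np.ones(len(X)), X])
        ci, props = collinearity_diagnostics(design)
        worst = int(np.argmax(ci))
        if ci[worst] <= threshold:
            break
        k = int(np.argmax(props[worst, 1:]))  # skip column 0, the intercept
        dropped.append(names.pop(k))
        X = np.delete(X, k, axis=1)
    return X, names, dropped

# Synthetic block of 150 runs, scaled to [-1, 1]; x3 nearly equals x1 + 12*x2.
rng = np.random.default_rng(0)
x1 = rng.uniform(50, 100, 150)
x2 = rng.uniform(1, 5, 150)
x3 = x1 + 12 * x2 + rng.normal(0, 0.5, 150)
raw = np.column_stack([x1, x2, x3])
X_scaled = 2 * (raw - raw.min(axis=0)) / (raw.max(axis=0) - raw.min(axis=0)) - 1

X_kept, kept, dropped = prune_collinear(X_scaled, ["x1", "x2", "x3"])
final_ci, _ = collinearity_diagnostics(np.column_stack([np.ones(len(X_kept)), X_kept]))
print("kept:", kept, "dropped:", dropped, "max condition index:", round(final_ci.max(), 2))
```

Which of the entangled variables gets dropped is a property of this particular block of data; subject-matter judgment can reasonably override the automatic choice.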

Take the remaining variables and use those variables to construct a multivariable model. Run backward elimination regression using the reduced set of the X variables and regress the Y against these variables.
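A bare-bones version of backward elimination is sketched below, with NumPy only. It is a simplification: instead of a p-value cutoff it removes the term with the smallest |t| until every remaining |t| is at least 2, which is only a rough stand-in for a 5% significance level. The data, the names, and the `t_min` parameter are all assumptions made for the example.

```python
import numpy as np

def ols_t_values(X, y):
    """Least-squares fit with an intercept; return (coefficients, t statistics)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(A.T @ A)
    return beta, beta / np.sqrt(np.diag(cov))

def backward_eliminate(X, y, names, t_min=2.0):
    """Repeatedly drop the weakest term until all remaining |t| >= t_min."""
    X, names = np.asarray(X, dtype=float), list(names)
    while names:
        _, t = ols_t_values(X, y)
        t_abs = np.abs(t[1:])              # the intercept is always kept
        weakest = int(np.argmin(t_abs))
        if t_abs[weakest] >= t_min:
            break
        names.pop(weakest)
        X = np.delete(X, weakest, axis=1)
    return names

# Synthetic example: y depends on x1 and x3 but not on x2.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(150, 3))
y = 5 + 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(0, 1, 150)

kept = backward_eliminate(X, y, ["x1", "x2", "x3"])
print("terms retained:", kept)
```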

What you will have when you are finished is a model expressing Y as a function of a group of X’s for a given block of data. You will be guaranteed that, for the block of data you are using, the terms in the final model are independent of one another but you will not know and cannot know the confounding structure of the X’s in the model with respect to the unknown/uncontrolled variables that were impacting the process at the time you were taking measurements of the X variables of interest.

This approach, while lengthy, will at least allow you to make some use of the historical data and it will suggest the possibilities of significant correlates with the Y response of interest. I’ve used this method to examine historical data in industry and it is my method of choice when dealing with the analysis of retrospective data in the health industry.

Please note that this approach DOES NOT get around the issue of the influence of unknown/uncontrolled variables and their possible confounding with the X variables of interest. All it does is allow one to explore historical data with an eye to something more than univariate analysis and suggest a list of possible X variables that one would want to include in an actual experimental design.

The book Regression Diagnostics by Belsley, Kuh, and Welsch has a good description of the condition index approach to determining the independence of multiple X variables.

#201762

Chris Seider
Participant

@rbutler I always find this question of interest. You give them great guidance to ATTEMPT with the caveats you mentioned.

I’m not sure why folks don’t analyze the process data they have with the X’s and Y’s with the statistical tools and THEN do a controlled experiment to minimize outside forces that may impact confidence in understanding interactions, strength of impact by the main factors, etc.

I wonder where this comes from–is someone or some organization pushing such an idea to do DOE’s with past data? I also find it tough to think they could find perfect data for the X’s for a factorial design to begin with.

#201767

Robert Butler
Participant

@cseider I wonder about this focus myself. There have been situations where I have asked this question and the answers I’ve received suggest that the only knowledge the questioner has with respect to DOE and historical data is that they know the design is comprised of a bunch of experiments and they believe they have enough historical data to populate the design matrix. What is missing is any understanding of the whys and wherefores of DOE construction and analysis, and any grasp of the concept of unknown/uncontrolled variables and their impact on the experiments in the historical data set.

If there is someone/some organization pushing this idea I haven’t heard of them/it.

#201769

Sergey
Participant

Good discussion!
DOE for historical data sounds like nonsense. DOE itself is about proactively controlling and setting factors to get a certain output, and then analyzing whether there is a relationship. With historical data we just take what we have. It seems naive to arrange it in the right order and assume all the measurements were obtained under controlled conditions.

If the idea is to run a regression with potentially dependent variables, use the variance inflation factor (VIF) as an indicator of multicollinearity. Checking it in Minitab can help you decide whether the data set needs improving.
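For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it with NumPy on synthetic data; the function name and the data are assumptions for illustration, and the often-quoted cutoffs of 5 or 10 are rules of thumb rather than anything stated in this thread.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        centered = X[:, j] - X[:, j].mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic predictors: x3 is nearly x1 + x2, while x4 is unrelated to the rest.
rng = np.random.default_rng(2)
x1, x2, x4 = rng.normal(size=(3, 200))
x3 = x1 + x2 + 0.1 * rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3, x4]))
print(np.round(vifs, 1))  # large values for x1, x2, x3; close to 1 for x4
```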

#202814

Mason
Guest

Late to the party here but I’m going to add my 2¢.

Money and politics are “good” reasons for this. When you aren’t the manufacturer and don’t have control over the process, running any designed experiments gets prohibitively expensive. If you’re trying to compare different manufacturers – frequently competitors and some you’ve been doing business with for some time – there will definitely be some pushback when they realize what you’re doing. If one were instead able to acquire off-the-shelf samples for testing, it would make for some cheaper preliminary tests while rustling no jimmies.

#202815

Robert Butler
Participant

Running a DOE to compare suppliers who happen to be competitors is done all the time. I’m not sure I understand what you mean by “don’t have any control over the process.” If you don’t control anything, then what exactly are you proposing to investigate? As for telling your suppliers you are running a DOE with their product – why would you bother to do this?

#202817

Chris Seider
Participant

@rbutler good question.

#202835

No! If all you have is historical data, use a regression technique. A DOE is a specific, ordered set of closely controlled experiments.
