Difference Between Regression and DOE?


Viewing 10 posts - 1 through 10 (of 10 total)

    Amit Kumar Ojha

    Hi All,

    It would be great if someone could throw some light on the difference between Regression and Design of Experiments (DOE) with respect to their application, along with some real-world experiences where these have been used in Six Sigma projects.


    Robert Butler

    That is a very tall order (request). The difference is between data gathering (DOE) and the analysis of the data (regression). Any design is just a recipe for data collection. The design stipulates the MINIMUM number of experiments one would need to run so that when the time comes to analyze the data the researcher can rest assured that the variables that were studied acted independently of one another.

    Thus, any description you might have of independent variable behavior (the regression equation) with respect to the outcome is guaranteed to be one in which the relationship between any given independent variable and the outcome is independent of all of the other variables included in the design.
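    A minimal sketch of what "the variables acted independently" means in practice (the factor names are placeholders, not from this discussion): in the eight runs of a 2^3 full factorial in coded -1/+1 units, every pair of factor columns is orthogonal, so each factor's effect can be estimated free of the others.

```python
# Sketch: orthogonality of a 2^3 full-factorial design in coded units.
# Factors are generic placeholders; any three process variables would do.
from itertools import product

import numpy as np

# Each row is one experiment; each column is one factor's coded setting.
design = np.array(list(product([-1, 1], repeat=3)))  # 8 runs x 3 factors

# Gram matrix of the design: off-diagonal zeros mean every pair of distinct
# factor columns has zero dot product, i.e. the factors vary independently.
gram = design.T @ design
```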

    As for examples, I hardly know where to begin. I have posted a number of examples of this kind of work on this forum in the past but, as far as I know, you can no longer access posts that go back more than a couple of years. The good news is that from time to time I save an exchange that I think might be useful to me elsewhere. Below is an example from an exchange back in 2004.

    To pick one at random: we had a situation where we had to come up with a new surface coating. The coating had to withstand friction-induced heating which would result in overall material degradation and ultimately material failure. The old coating was very good but it was based on organic solvents, and there were a number of economic and environmental factors which indicated a non-solvent-based coating was desirable. After a lot of discussion with the customer we identified about a dozen different factors we wanted to check. Some had to do with our coating formulation and some with the customer’s process of application.

    We built an initial screen of 20 experiments. Because of restrictions, and because there were a couple of factors whose behavior was suspected to be curvilinear and one or two interactions of interest, I used D-optimal methods to build the design.

    It was obvious from the resultant models that the optimum lay outside of our initial region, but these models indicated we could build an acceptable, although sub-optimal, coating material, and they pointed the way to further improvements. They also identified the changes in process variables (quantity and direction) needed to guarantee product uniformity.

    We built the coating material according to model predictions and we also used the models for control of the line. The formulation worked as predicted. We were able to vary the line parameters in the direction and magnitude indicated by the models and see changes in product performance in agreement with the predictions.

    Since there was an immediate need for the product we went ahead and used the models for guidance with respect to current manufacture and we used the hints from the model, with respect to further improvements, to build a second, smaller design with levels and settings outside the original region but in the directions indicated by the first.

    We took one of the customer lines off of regular production for two days and ran the new design. The results of that run, when combined with the results of the first design permitted the construction of an improved process and an improved formulation model which we used to run the process. When their marketing discovered a niche market with different coating performance requirements we used the existing models to identify the changes necessary to the formulation and the line to meet these requirements. The actual performance of the new coating was within the error bars around our predictions and proved to be acceptable to the niche market.


    Norbert Feher

    My answer is simpler, I think:

    Think about interactions. With a DOE you can fully explore them, but with regression analysis on happenstance data, not really.

    I used regression analysis for trend analysis of customer forecast data, and DOE on a moulding process to see the interaction between time, pressure and temperature.
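    A small numeric sketch of that idea (the numbers are invented, and only two of the three factors are shown for brevity): in a 2^2 factorial, the interaction coefficient is directly estimable by least squares because the interaction column is orthogonal to the main effects.

```python
# Hypothetical sketch: recovering a two-factor interaction from a 2^2
# factorial by least squares. Settings and coefficients are invented;
# x1/x2 stand in for, say, time and pressure in a moulding study.
import numpy as np

x1 = np.array([-1, -1, 1, 1], dtype=float)   # coded factor 1
x2 = np.array([-1, 1, -1, 1], dtype=float)   # coded factor 2
y = 10 + 2 * x1 + 3 * x2 + 1.5 * x1 * x2     # noiseless response

# Model matrix: intercept, two main effects, and the interaction term.
X = np.column_stack([np.ones_like(y), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # [b0, b1, b2, b12]
```

    Because the four runs cover every combination of the two levels, the fitted interaction coefficient matches the one used to generate the data; with only happenstance data, the interaction column may be confounded with the main effects and no such clean estimate exists.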

    I think Thomas Pyzdek’s Six Sigma Handbook gives a very good explanation of the topic.

    I hope I could help you…




    Shelby Jarvis


    Robert Butler shared some great wisdom. As you read it, reflect on the key concepts in his statements. DOE is proactive, whereas regression is reactive. The proactive approach allows you to specifically target the VOC when designing products and processes to deliver what the customer needs. Regression is a great tool and should always be considered.

    Below are a few strengths and weaknesses. These +/- are relative to DOE/regression and not relative to other statistical tools. In the right situation, I am a fan of regression as a means to understand the process.

    DOE:
    + fastest way to test the entire space of the project, because you select the tests yourself
    + can focus on delivering customer-facing requirements
    – because it takes the process to the edge of capability, take care to consider safety
    +/- can be the cheapest option in that you run a minimum number of tests, but is likely expensive due to the nature of processing at the edge of capability

    Regression:
    + will typically yield product at the current known rate, so you can sell the product
    – the time it takes to build a measurement plan (assuming you decide to do a formal one) is as long as or longer than a DOE
    – you are passively collecting data and may not see every combination
    – the passive nature means you are looking back, as opposed to the DOE specifically looking at chosen settings and conditions

    I suggest evaluating each situation prior to building a DOE. I typically consider safety, impact, VOC, etc.

    At one point in my career, I worked in an automotive ceramics factory. The kiln firing cycle was more than a day. On one hand, the cost of running a DOE and scrapping a few cycles of product seemed high, but having a failure rate in the double digits posed a higher long-term cost as well as a greater risk to the customer.

    As a team, we selected specific product attributes which were failing, then mapped these attributes to the specific process conditions perceived to be driving these characteristics. As you may expect, we learned that factors which we were not predicting as important were actually controlling the quality. Regression might never have helped us reach this conclusion, as the controlling factor was neither managed nor tracked in a typical production cycle.

    I had a separate situation in which our product was failing in our customer’s process. All parties involved agreed that our product was apparently meeting the specifications, yet it was simply not working functionally for our customer. The two companies worked together to identify factors in each plant.

    The DOE allowed us to rapidly learn the effect or non-effect of our product characteristics vs. our customer’s process parameters. Without this joint effort, it would have taken months to determine the root cause.


    Robert Butler

    In regards to one of the other posts to this thread: there are a few things to consider as far as interaction terms in a regression model are concerned. If the data is designed data, then you know exactly which linear, curvilinear, and interaction terms you can investigate, because you designed a sampling plan (the DOE) which defines them before you begin to analyze the DOE data using regression methods. If you have non-designed data – process data, patient sample data, etc. – then before you can begin the analysis you first have to find out which terms can be supported by the data set you have (which terms are independent enough of one another in the block of data you are using for analysis).

    Thus the first thing you have to do is put the data through a co-linearity wringer to check for term independence. The usual procedure is to look at VIFs, eigenvalues, and condition indices for all of the terms you would LIKE to include in a model, and then run one-at-a-time elimination to drop terms which have VIFs and/or condition indices > 10 (or, if you want to be very conservative, > 5).

    Once you have the refined list of variables that are “independent enough” from one another you can proceed with your model building efforts. The key point is that there is no reason the refined list cannot include curvilinear and interaction terms which met the acceptance criteria for independence.
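    A sketch of one step of that "co-linearity wringer" in plain numpy (the data below is invented; in practice you would run this on your own candidate terms): each term's VIF is 1/(1 - R²) from regressing it on the remaining terms, and the nearly-duplicated column blows past the usual cutoff of 10 while the independent one stays near 1.

```python
# Sketch: VIF screening with invented data. x2 is built to be nearly
# collinear with x1, so both should fail the VIF > 10 cutoff; x3 is
# independent and should pass.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)  # nearly a copy of x1
x3 = rng.normal(size=50)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j = 1 / (1 - R^2) from regressing it on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
```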

    If you are interested in additional information concerning this approach I would recommend borrowing a copy of Regression Diagnostics by Belsley, Kuh, and Welsch from the library and read the section concerning these measures of co-linearity.




    Very nice… I was trying to think of a clear way to say the same thing regarding the ability to explore interactions with regression, but I’ve not had a lot of experience with it, so I kept quiet. :-) Cheers.


    Amit Kumar Ojha

    Thanks Robert, Norbert, Shelby and JB.

    I read your valuable comments. Although I have never used DOE, after carefully reading the inputs I came to the following conclusions. (Please correct me if my understanding is wrong.)

    1. DOE is an active method wherein the factors are manipulated to see their impact on the dependent variable, whereas regression is passive, wherein the available data is analysed to determine the variables which have a significant impact on Y.
    2. Robert, what I understood from your post is that you used DOE to find the optimum values for your inputs. (One small doubt here, which may sound a bit silly: can we not use some mathematical technique such as Linear Programming for optimization, as these are relatively easier to use given the constraints and factors?)

    All the best!


    Robert Butler

    Well, sort of.

    1. By itself the DOE isn’t going to tell you much of anything other than the fact that the responses from experiment-to-experiment vary. As I noted it is a recipe for data gathering. It is active in the sense that you are constructing and running specified combinations of variables but you need to use regression methods to analyze the data and then use the resulting equations for the various responses of interest to identify a family of independent variable settings that will provide you with various and sundry trade-off optimums for all of the responses at the same time.

    You could try (and I have seen this done more than once) to just eyeball the DOE results and see if you can find a combination that will provide some kind of an optimum. The problems with this approach are manifold. A couple of the biggies are:
    a. You are assuming the optimum exists within the design space.
    b. You are assuming you have near superhuman powers to simultaneously identify the optimum settings for all of the response variables of interest and the associated tradeoffs just by looking at the numbers.

    I can tell you that in each instance where I have been presented with an “eyeball analysis” I was able to show, after properly analyzing the data using regression methods, that the eyeballs were blind.
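    An invented one-factor illustration of the point: with five runs over a single coded factor, the fitted quadratic pinpoints an optimum that sits between the tested settings, where eyeballing the raw numbers would only ever find the best run actually performed.

```python
# Sketch with invented settings and a noiseless response: the regression
# step locates an optimum at x = 1/3, which is not one of the tested runs.
import numpy as np

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # coded factor settings
y = 5 + 2 * x - 3 * x ** 2                 # response surface (invented)

b2, b1, b0 = np.polyfit(x, y, 2)           # fit y = b0 + b1*x + b2*x^2
x_opt = -b1 / (2 * b2)                     # stationary point of the fit
```

    Here the best tested run (x = 0.5) gives a lower response than the fitted optimum, which is exactly the kind of thing an eyeball analysis misses.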

    2. You can use various programming methods to rummage through happenstance/production data and, just by virtue of the quantity of data, you are bound to find something. The problems with this approach are many.
    1. You have no idea of data confounding with respect to unknown/uncontrolled variables. Thus you have no guarantees that the effect (or lack of effect) you are seeing is actually correlated with the variable of interest.
    2. You have no guarantees that variables critical to a process will have been given enough latitude to permit their effects on the responses of interest to be identified. Unless you have data which includes process upsets, it is highly unlikely that the production data will exhibit any significant trending with the response (this, by the way, is a good thing – it’s why you run control charts and other such aids to check process performance).
    3. You are assuming (generally) an interest in the optimization of a single response.
    4. You are assuming the data is really representative of the process.
    In my experience, the best you can do is to run an analysis on the available data and use the findings to guide your thinking with respect to the construction of a design.

    There is much more to the DOE experience besides the identification of optimum settings for process inputs. While optimization is important an equally important facet of design is that of being able to define, with a high degree of certainty, the effect OR LACK THEREOF, of any of the variables included in the design. Another aspect of a design is the ability to assess the impact of variable settings on the associated variation of the responses. I have never seen an assessment of happenstance data that had much success with respect to these two points.


    Amit Kumar Ojha

    Thanks Robert for your valuable insights.
    This discussion is very useful for people who want to know the basics of DOE…



    I would say DOE builds a regression equation.

    All historical data can be analysed using regression techniques – failure analysis, etc.

    New designs and changes go well with DOE, which ends with a regression equation.

