# DOE or Logistic Regression


- This topic has 2 replies, 3 voices, and was last updated 1 year, 4 months ago by Chris Seider.

- March 13, 2019 at 10:48 pm #237266

easoccer (Participant, @easoccer)

I need help determining whether to use DOE or logistic regression.

Here is a bit of background.

I am doing a project to see which factors (see below), or which combination of factors, has the most significant impact on our dependent variable, the scheduling rate.

A. $

B. Name

C. Social proof (others have said the same thing but here’s what they found…)

D. Regular script

Y = mentioned during call

N = was not mentioned during call

My first question: I am trying to determine whether I need to use DOE, because I want to find which single factor or combination of the 4 factors has the most impact on producing the best results (scheduling rate). But because design of experiments is primarily based on continuous values and not binary ones, I feel like I should be using logistic regression.

Do I use logistic regression?

If yes, then what is the sample size needed for each permutation so that the logistic regression test is statistically significant?

March 14, 2019 at 8:18 am #237270

Robert Butler (Participant, @rbutler)

DOE is a method for gathering data. Regression (any kind of regression) is a way to analyze that data. As a result, it is not a matter of using DOE or regression.

Experimental designs often have ordinal or yes/no variables as part of the X matrix. The main issue with a design is that you need to be able to control the variable levels in the experimental combinations that make up the design. In your case you said you have the following:

A. $

B. Name

C. Social proof (others have said the same thing but here’s what they found…)

D. Regular script

Before anyone could offer anything with respect to the possibility of including these variables in a design, you will need to explain what they are. You will also need to explain the scheduling rate measure: is this a patient scheduled yes/no, or is it something else?

However, in the interim, if we assume all of these variables are just yes/no responses, then you could code them as -1/1 (no/yes). The problem is that you cannot build patients. Rather, you have to take patients as they come into your hospital(?). What this means is that you have no way of randomizing patients to the combinations of these factors. This, in turn, means you have no way of knowing what variables, external to the study, are confounded with the 4 variables of interest. As a result, you really won’t know if the significant variables in the final model are really those variables or a mix of unknown lurking variables that are confounded with your 4 chosen model terms.

What you can do is the following: Set up your 8 point design and see if you have enough patients to populate each of the combinations. If that is the case, you can go ahead and build a model and, if the response is scheduled yes/no, the method of choice for data analysis would be logistic regression. The block of data you use will guarantee, for that block of material, that your 4 variables of interest are sufficiently independent of one another. What this approach WILL NOT AND CANNOT GUARANTEE is that the 4 variables of interest actually reflect the effect of just those 4 variables and are not an expression of a correlation with a variety of lurking and unknown variables.
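As a rough sketch of the "set up the design and see if each combination is populated" step: the code below assumes the 8 point design is the usual half fraction of a 2^4 factorial with generator D = ABC (the post does not say which 8 points are meant), and the factor names and record layout are hypothetical.

```python
from itertools import product
from collections import Counter

# Hypothetical field names for factors A-D; not taken from any real data set.
FACTORS = ("dollar", "name", "social_proof", "regular_script")

def half_fraction_design():
    """8-run 2^(4-1) design: full factorial in A, B, C with D = A*B*C (assumed generator)."""
    return [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)]

def code_call(call):
    """Map one call record with 'Y'/'N' values onto the -1/1 (no/yes) coding."""
    return tuple(1 if call[f] == "Y" else -1 for f in FACTORS)

def cell_counts(calls):
    """Count how many observed calls fall on each of the 8 design points."""
    observed = Counter(code_call(c) for c in calls)
    return {run: observed.get(run, 0) for run in half_fraction_design()}
```

Any design point with a count of zero (or only a handful of calls) means the existing data cannot populate the design, and calls that land off the 8 chosen points are simply not counted here.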

Another thing to remember is that the coefficients for the variables in the final reduced logistic model are on the log-odds scale (exponentiating them gives odds ratios), so they are not the same as coefficients in a regular regression model and cannot be viewed in the same way.
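To make the odds-ratio point concrete, here is a small sketch for a single yes/no factor. The counts are made up for illustration only. For a model with just one binary factor, the fitted coefficient is on the log-odds scale, and its size depends on whether the factor is coded 0/1 or -1/1.

```python
import math

def odds_ratio(yes_with, no_with, yes_without, no_without):
    """Odds ratio of scheduling when a factor is mentioned vs. not mentioned."""
    odds_with = yes_with / no_with
    odds_without = yes_without / no_without
    return odds_with / odds_without

# Made-up counts (not data from this thread):
# factor mentioned: 30 scheduled, 70 not; factor not mentioned: 20 scheduled, 80 not.
or_hat = odds_ratio(30, 70, 20, 80)

# Single-factor logistic model: logit(p) = b0 + b1 * x.
b1_dummy = math.log(or_hat)       # x coded 0/1: the two levels are 1 unit apart
b1_effect = math.log(or_hat) / 2  # x coded -1/1: the two levels are 2 units apart
```

So with -1/1 coding, the yes-versus-no odds ratio is exp(2 * b1), not exp(b1), which is one more reason the coefficients should not be read like ordinary regression slopes.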

If you don’t have enough patients to populate the combinations of an 8 point design you will have to take the block of data you do have and run regression diagnostics on the data to see just how many of the 4 variables of interest exhibit enough independence from one another to permit their inclusion in a multivariable model. All of the caveats listed above will still apply.

What all of this will buy you is an assessment of the importance of the 4 variables to scheduling with the caveat that you cannot assume the described relationships mirror the relation between the chosen X variables and the response to the exclusion of all other possible X variables that could also impact the response. While not as definitive as a real DOE this approach will allow you to do more than build a series of univariable models for each X and the measured response.

This entire process is not trivial and if, as I suspect, you are working in a hospital environment I would strongly recommend you talk this over with a biostatistician who has an understanding of experimental design and the diagnostic methods of Variance Inflation Factors and eigenvalue/condition indices.
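For readers unfamiliar with the two diagnostics named above, a minimal sketch using NumPy follows. It assumes the 4 factors are coded -1/1 and stored as columns of a matrix; the VIFs are taken from the diagonal of the inverse correlation matrix, and the condition indices from the singular values of the unit-length-scaled matrix with an intercept column. This is one common way to compute them, not necessarily the procedure a biostatistician would follow.

```python
from itertools import product
import numpy as np

def vif(X):
    """Variance inflation factor per column: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def condition_indices(X):
    """Condition indices: largest singular value over each singular value,
    computed on [intercept, X] with every column scaled to unit length."""
    Z = np.column_stack([np.ones(len(X)), X])
    Z = Z / np.linalg.norm(Z, axis=0)
    s = np.linalg.svd(Z, compute_uv=False)
    return s.max() / s

# A balanced 8-run half fraction (assumed generator D = A*B*C) ...
design = np.array(
    [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)], dtype=float
)
# ... and the same block with the runs where A = yes and B = no missing.
unbalanced = design[~((design[:, 0] == 1) & (design[:, 1] == -1))]
```

For the balanced design every VIF and condition index is 1; for the unbalanced block they rise above 1 because the columns are no longer orthogonal. Commonly quoted rules of thumb flag VIFs above about 10 and condition indices above about 30, but those cutoffs are conventions, not guarantees.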

March 15, 2019 at 12:29 am #237280

Chris Seider (Participant, @cseider)

@rbutler, the first paragraph was perfect.
