Which Output(s) Y, or Their Interactions Are Affected by a Single Factor?
This topic contains 9 replies, has 3 voices, and was last updated by Robert Butler 2 months ago.
Reverse Engineering Case
Objective:
To find out which output(s) Y and/or their interactions are significantly affected by a single factor X.
Data available (see attached excel file):
X (attribute data): OK or Not OK
Y (variable data): 16 measurable outputs
Method that has been tried:
To find out the significance of each output by itself: 2-sample t-test
To find out the significance of output interactions: ???
Theoretically, I can manually calculate and test the interactions one by one. However, this would be a very tedious process, considering there are 65,536 combinations that would need to be calculated and tested.
Is there a better way to find out whether any of the output interactions is significantly affected by a single factor?
Attached is an example of data available.
An interaction is usually expressed as the product of two factors. In this case you would generate a series of 16 regression equations of the form
Y = b0 + b1*X
Given this series of linear equations you would probably find that not all of the Y’s are significantly correlated with the single X. Thus you would have a subset of equations of the form
Yi = b0i + b1i*X
The interaction of any two Y’s would be expressed as
Yi*Yj = (b0i + b1i*X)*(b0j + b1j*X)
This would reduce to b0i*b0j + b0i*b1j*X + b1i*b0j*X + b1i*b1j*X*X
If X is coded -1, 1 then X*X = 1 and this expression reduces to
Yi*Yj = b0i*b0j + b1i*b1j + X*(b0i*b1j + b1i*b0j)
So, the first order of business is to identify the significant correlations between each of the 16 Y’s and the single X. Take the coefficients associated with the significant correlations only, set up a do-loop using the above equation, and crank out the combinations.
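The do-loop described above might look something like the following Python sketch. It assumes X is coded -1/1 and that the intercepts and slopes of the significant regressions have already been collected. The function name and the coefficient values are made up for illustration and are not taken from the attached data.

```python
from itertools import combinations

def pairwise_product_terms(coefs):
    """For each pair (i, j) return (constant, x_coef) such that
    Yi*Yj = constant + x_coef*X when X is coded -1 / 1.

    coefs maps an output name to its (b0, b1) from Y = b0 + b1*X.
    """
    terms = {}
    for (name_i, (b0i, b1i)), (name_j, (b0j, b1j)) in combinations(coefs.items(), 2):
        constant = b0i * b0j + b1i * b1j   # uses X*X = 1
        x_coef = b0i * b1j + b1i * b0j
        terms[(name_i, name_j)] = (constant, x_coef)
    return terms

# Hypothetical coefficients for three outputs that passed the significance check.
significant = {"a": (2.0, 1.0), "b": (3.0, -1.0), "g": (0.5, 0.25)}

for pair, (constant, x_coef) in pairwise_product_terms(significant).items():
    print(pair, "constant =", constant, "X coefficient =", x_coef)
```

With five significant outputs the loop would produce ten pairs, so the output stays small enough to inspect by hand.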
@dando – not sure exactly what you are trying to evaluate here. Typically, interactions are caused by the inputs, not a result of the outputs.
In trying to understand your query, I took your data and did some sample graphical analysis. I took what you identified as the X value (input) and some of the Y (outputs) and graphed them. I used two Y’s in a scatterplot and used the X value as a group variable. Two resultant graphs are attached. You will see that the Y’s a and b with the X input seem to change based on whether X is high or low. This would indicate that X high causes a different response in outputs a and b than it does when low.
In the second graph, I used a and d. Here the response is nearly parallel, with the average response being lower but with a similar slope for X being low.
However, your data has significantly more High values than Low, so this might be a matter of data overload of the high values.
Not sure if this helps or not.
Like MBBinWI I was wondering about the idea of Y’s interacting as well. I couldn’t get your data set to open on my computer so it is good to see that MBBinWI was able to take a look.
I do know of cases where you will have an intermediate Y which is dependent on an X which along with that X can drive a change in a second Y but situations like these are not a matter of interaction.
One additional item, if we assume you are really looking at Y’s interacting then the only way I can see that you can get 65 thousand plus interactions is if you assume all way interactions from 2-16.
Physically, I don’t see why there would be any interest in anything past a 2 way interaction. In that case 16 items taken 2 at a time would be a maximum of 120 and that figure would drop with every instance where the correlation between X and the Y of interest is not significant. Thus, if only say 5 of the correlations were significant you would have a maximum of 10 two way interactions to consider.
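As a quick check on the counts in this discussion, the short Python snippet below tallies the combinations. It only uses the standard library, and the variable names are illustrative.

```python
from math import comb

n_outputs = 16

# 2**16 counts every subset of the 16 outputs, including the empty set
# and the 16 single outputs, which are not interactions.
all_subsets = 2 ** n_outputs

# Interactions of every order from 2-way up to 16-way.
interactions_2_to_16 = sum(comb(n_outputs, k) for k in range(2, n_outputs + 1))

# Two-way interactions only: all 16 outputs, and a case with 5 significant ones.
two_way_all = comb(n_outputs, 2)
two_way_five = comb(5, 2)

print(all_subsets, interactions_2_to_16, two_way_all, two_way_five)
```

So the 65,536 figure is the count of all subsets, the 2-way through 16-way interactions number slightly fewer, and restricting to 2-way terms brings the count down to 120 at most.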
First, thank you to Robert and MBBinWI for your quick responses.
I’m sorry for not explaining the case very clearly.
Let me try it:
We have a case where we need to recall millions of our products due to a safety issue.
Because the real root cause is unknown, the front end can’t be determined.
What we know is that a part inside a welded assembly measured low.
Measuring this part inside a welded assembly is not technically possible unless a high-resolution CT scan is used (which can’t be justified financially).
The only hope right now is to use the available function tester data to identify the bad parts. Even though the function testers were not designed to identify this bad part in the first place, we’re hoping that a combination of the outputs can signal a bad part in the welded assembly.
Based on our current findings, we marked each set of function data that we know for sure has this bad part inside the assembly. Those are marked “low” in the spreadsheet.
The ones marked “high” were produced in the same batch but found to be OK.
I hope this helps clarify the case.
Again, I really appreciate any ideas and suggestions that can help us accelerate the recall process.
Ok, things just got a lot simpler. What you actually have are 16 X’s and one Y. The issue becomes that of a backward elimination logistic regression where you take the 16 output measures and run an analysis to identify those Y’s that are significantly correlated with the odds of a defect.
There are some things you will need to do. You will need to check the co-linearity of the Y’s to make sure they can be included in a multi-variable model. You will need to drop measures with no variation (like output_l in your sample data set) and you will need to look over what you have to make sure the odds ratios and their corresponding Wald confidence limits are reasonable.
Hi Robert,
Please let me know if I’m following you correctly or not:
1. Delete the “no signal output” such as “output_L”
2. Run binary logistic regression with “low / high” as response and “output a – q” as continuous predictors.
3. Select all possible terms (max. in minitab 17: 3 way interaction)
4. Select “backward elimination” as stepwise elimination method. (alpha to remove: 0.1)
This is the result that I got from Minitab:
* ERROR * The model could not be fit. Maximum likelihood estimates of parameters may not
exist due to quasi-complete separation of data points. Please refer to help or
StatGuide for more information about quasi-complete separation.
Anything that I didn’t do correctly?
You overdid it. All you want to do is set up the regression equation in the form of
Defect = fn(y1, y2, y3,….y16)
In other words a simple linear model. Before you do this you will need to check the matrix of your Y’s to make sure they are independent enough of one another. I doubt you will be able to test for condition indices (most packages do not have this capability) but you should be able to check the co-linearity of the Y’s using Variance Inflation Factors (VIF). The usual rule is that you will need to drop any factor with a VIF >10.
In order to use the VIF you will need to scale all of your Y’s to a range of -1 to 1 because the VIF is sensitive to magnitude differences.
To do this you will need to find the minimum and maximum values for each Y and compute the following:
A = (maxY + minY)/2, B = (maxY - minY)/2
then
Yscaled = (Y - A)/B
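The scaling and the VIF check described above could be sketched in Python as follows. This is only an illustration: it assumes NumPy is available, the function names are hypothetical, and the VIFs are computed as the diagonal of the inverse correlation matrix, which is one standard way to obtain them. The demo data are randomly generated and are not the attached data set.

```python
import numpy as np

def scale_to_unit_range(y):
    """Scale each column to [-1, 1] using A = (max + min)/2 and B = (max - min)/2."""
    y = np.asarray(y, dtype=float)
    a = (y.max(axis=0) + y.min(axis=0)) / 2.0
    b = (y.max(axis=0) - y.min(axis=0)) / 2.0
    if np.any(b == 0):
        # A column with no variation (like output_l) must be dropped first.
        raise ValueError("constant column found: drop it before scaling")
    return (y - a) / b

def vif(y):
    """Variance Inflation Factor of each column (diagonal of the inverse correlation matrix)."""
    corr = np.corrcoef(y, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Made-up demo: the third column is nearly the sum of the first two,
# the fourth column is unrelated to the others.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + 0.1 * rng.normal(size=500)
x4 = rng.normal(size=500)
demo = np.column_stack([x1, x2, x3, x4])

scaled = scale_to_unit_range(demo)
for name, value in zip(["x1", "x2", "x3", "x4"], vif(scaled)):
    print(name, "VIF =", round(float(value), 2), "(drop candidate)" if value > 10 else "")
```

When several columns exceed the cutoff together, as they do in this demo, it is common to drop one, recompute, and repeat rather than dropping them all at once.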
Even after you check the VIF’s you may still get the warning of quasi-complete separation. If you get that warning you will have to manually insert one Y at a time into the regression equation (no particular order here; just start with Output a and work down) and run the analysis to see if the term can remain. After the first analysis, add a second factor and run the analysis on both factors. Keep adding factors and testing them in this manner. When you get to a factor that results in the error warning, drop it and keep on adding terms, dropping any others which generate the warning.
Once you have the sub-group of acceptable variables – put them into the regression equation and have your package run backward elimination to identify those terms which remain statistically significant (P<.05 is the usual choice for cutoff).
You will need to look over the final model terms to see if the odds ratios and the corresponding Wald confidence intervals are “acceptable”. By this I mean you don’t want terms with huge C.I. spans, nor do you want terms whose odds ratios are either minuscule (<.001 for example) or huge (>999.99).
The end result will be an equation that expresses the log odds of a defect as a function of a subset of your Y’s.
I ran a quick and dirty check (no scaling, no checking for co-linearity) on your sample data set. The backward elimination resulted in some terms with odds ratios >1, which means that for each unit increase in that particular Y the odds of a defect increase, and others with odds ratios <1, which means that for each unit increase in the associated Y the odds of a defect decrease.
The way you check the adequacy of the model is to run sensitivity and specificity checks on the output.
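A minimal Python sketch of a sensitivity and specificity check is shown below. It assumes you already have, for each part, the known status (1 = defect, 0 = good) and the probability predicted by the model. The function name, the 0.5 cutoff, and the toy numbers are assumptions for illustration only.

```python
def sensitivity_specificity(actual, predicted_prob, threshold=0.5):
    """Return (sensitivity, specificity) when parts with probability >= threshold are flagged."""
    tp = fn = tn = fp = 0
    for is_defect, prob in zip(actual, predicted_prob):
        flagged = prob >= threshold
        if is_defect and flagged:
            tp += 1          # defect correctly flagged
        elif is_defect:
            fn += 1          # defect missed
        elif flagged:
            fp += 1          # good part wrongly flagged
        else:
            tn += 1          # good part correctly passed
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: 3 known defects and 5 known good parts (made-up probabilities).
actual = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.2, 0.4, 0.1, 0.05, 0.7, 0.3]

sens, spec = sensitivity_specificity(actual, probs)
print("sensitivity =", round(sens, 3), "specificity =", round(spec, 3))
```

When defects are rare compared with good parts, the predicted probabilities tend to be small, so it may be worth trying several cutoffs rather than relying on 0.5.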
If you have never worked with logistic models you will need to do some reading. I would recommend getting Regression Methods in Biostatistics (Vittinghoff, Glidden, Shiboski and McCulloch) via inter-library loan and reading the chapter on logistic regression. It is the single best short summary of the issue that I know of.
Thank you so much Robert!
Your thorough explanation helps me a lot.
This is the first time I have used logistic regression. Thanks for recommending the book. I’ll get it for sure.
After some reading and trials, below is the best model I could get (with my limited knowledge of logistic regression):
Binary Logistic Regression:
Part versus Output a, Output b, Output c, Output d, Output e, …
Method
Link function Logit
Rows used 41772
Backward Elimination of Terms
α to remove = 0.1
Response Information
Variable  Value    Count
Part      suspect    229  (Event)
          good     41543
          Total    41772
Deviance Table
Source         DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression      9   171.73   19.0808      171.73    0.000
  Output a      1     3.27    3.2682        3.27    0.071
  Output b      1    12.88   12.8847       12.88    0.000
  Output c      1    20.05   20.0459       20.05    0.000
  Output d      1     6.04    6.0446        6.04    0.014
  Output e      1     2.89    2.8869        2.89    0.089
  Output f      1     5.91    5.9066        5.91    0.015
  Output g      1    33.50   33.4975       33.50    0.000
  Output n      1     9.38    9.3800        9.38    0.002
  Output q      1     7.90    7.9024        7.90    0.005
Error       41762  2669.48    0.0639
Total       41771  2841.21
Model Summary
Deviance R-Sq  Deviance R-Sq(adj)      AIC
        6.04%               5.73%  2689.48
Coefficients
Term        Coef  SE Coef    VIF
Constant  -6.903    0.312
Output a   0.878    0.499  14.75
Output b  -1.028    0.286   1.41
Output c  -2.529    0.586   2.97
Output d  -1.489    0.605   3.51
Output e   0.652    0.385  27.76
Output f  -0.715    0.295   1.41
Output g   2.239    0.367   1.39
Output n   0.357    0.119   1.74
Output q  -0.463    0.154  18.68
Odds Ratios for Continuous Predictors
          Odds Ratio  95% CI
Output a      2.4068  (0.9047, 6.4030)
Output b      0.3577  (0.2042, 0.6263)
Output c      0.0798  (0.0253, 0.2517)
Output d      0.2255  (0.0688, 0.7389)
Output e      1.9202  (0.9031, 4.0827)
Output f      0.4894  (0.2747, 0.8718)
Output g      9.3854  (4.5707, 19.2718)
Output n      1.4285  (1.1321, 1.8025)
Output q      0.6296  (0.4656, 0.8514)
Regression Equation
P(suspect) = exp(Y’)/(1 + exp(Y’))
Y’ = -6.903 + 0.878 Output a – 1.028 Output b – 2.529 Output c – 1.489 Output d
+ 0.652 Output e – 0.715 Output f + 2.239 Output g + 0.357 Output n – 0.463 Output q
Goodness-of-Fit Tests
Test                DF  Chi-Square  P-Value
Deviance         41762     2669.48    1.000
Pearson          41762    40717.71    1.000
Hosmer-Lemeshow      8        5.32    0.723
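For anyone reading the output above for the first time, the odds ratios and their 95% Wald intervals can be reproduced from the Coefficients table, and the regression equation can be evaluated directly. The Python sketch below assumes the usual Wald form exp(coef ± 1.96*SE). The function names are illustrative, and small differences from the printed values are expected because the coefficients are shown to only three decimals.

```python
import math

INTERCEPT = -6.903

# (coefficient, standard error) copied from the Coefficients table above.
COEFS = {
    "Output a": (0.878, 0.499),
    "Output b": (-1.028, 0.286),
    "Output c": (-2.529, 0.586),
    "Output d": (-1.489, 0.605),
    "Output e": (0.652, 0.385),
    "Output f": (-0.715, 0.295),
    "Output g": (2.239, 0.367),
    "Output n": (0.357, 0.119),
    "Output q": (-0.463, 0.154),
}

def odds_ratio_with_wald_ci(coef, se, z=1.96):
    """Odds ratio and approximate 95% Wald interval for one continuous predictor."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)

def p_suspect(outputs):
    """P(suspect) = exp(Y')/(1 + exp(Y')); outputs maps a term name to its measured value."""
    y_prime = INTERCEPT + sum(COEFS[name][0] * value for name, value in outputs.items())
    return 1.0 / (1.0 + math.exp(-y_prime))

for name, (coef, se) in COEFS.items():
    odds_ratio, lower, upper = odds_ratio_with_wald_ci(coef, se)
    print(f"{name}: OR = {odds_ratio:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

An interval that includes 1, as for Output a and Output e, corresponds to a term that is not significant at the 0.05 level.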
The goodness-of-fit results are the best of the models I tried. However, this is the model before any factors were removed due to high VIF.
Is this model valid, considering it contains three factors with VIF>10?
The R-Sq(adj) seems very poor to me. However, this is still better than the other models.
Does the model need to be improved before being used? What would you recommend?
After determining the best model, do I have to run some tests on the residuals, fits, delta chi-square, and delta deviance?
If you have terms that violate the VIF criteria then they should not be included in the model. If you include them then there is no way to be sure that the terms in the model are actually measuring something that is independent from the other terms.
I would recommend re-running the model and also setting alpha to .05. In my experience .1 is too generous. Once you have the model you will want to run sensitivity and specificity tests on the output – this is the way you test to see if the model has any value with respect to discrimination between an acceptable and an unacceptable part.
© Copyright iSixSigma 2000-2017.