Binary Logistic Regression gives wrong analysis
 This topic has 14 replies, 6 voices, and was last updated 16 years, 5 months ago by BTDT.

AuthorPosts

June 27, 2005 at 9:53 am #39837
I’m doing an analysis to find out which factors cause the outliers in the process output. The data is attached. I transform the output data into 0 (data within control limits) and 1 (outlier data), then run Binary Logistic Regression of the transformed Y on all the factors (Xs). MINITAB gives the wrong report. In the attached example, factor X1 clearly has a significant impact on Y, yet X1’s p-value is near 1, which would mean little impact on Y. Can anyone explain this? Or have I selected the wrong analysis method? https://www.isixsigma.com/library/downloads/ASKiSixSigma.xls

June 27, 2005 at 9:59 am #122206
The raw data is given below.
X1    X2    X3    Y      Outlier_Y
122   63    37    68     0
87    62    74    83     0
750   102   98    1000   1
123   77    132   90     0
84    85    138   78     0
104   99    94    126    0
109   110   75    70     0
98    110   109   84     0
120   89    104   76     0
129   126   138   136    0
83    92    114   102    0
128   134   161   101    0
86    78    108   67     0
980   89    48    1500   1
91    108   109   66     0
95    103   129   99     0
69    109   81    85     0
88    112   63    120    0
48    105   111   52     0
75    127   100   117    0
109   78    85    134    0
120   105   90    125    0
153   102   96    119    0
81    82    63    85     0
60    91    123   70     0
81    101   122   145    0
98    95    80    137    0
145   99    91    128    0
86    105   95    124    0
106   63    101   54     0

June 27, 2005 at 10:01 am #122207
As far as I know, binary logistic regression is used to identify the relationship between variable inputs and an attribute output. I am not sure how you are using it to identify the outliers.
I hope that a simple box plot and matrix plot will be adequate.
I hope the doctors of this forum will help us.

June 27, 2005 at 10:10 am #122208
Dear Anbu,
Let me give some background to my post.
The purpose of my BLR analysis is as follows:
1. We already know the output outliers through SPC.
2. We are also monitoring some other input factors.
3. We are doing the BLR analysis to find out which factors’ changes are the more significant causes of the output outliers.

June 27, 2005 at 10:27 am #122209
How can you identify outliers with binary logistic regression? I think you have SPC for that purpose.
Let’s see what the experts have to say.

June 27, 2005 at 12:48 pm #122217
Robert Butler (@rbutler)
I’d say you are entering something wrong in Minitab. A quick check of your data – just simple plots – shows the relation between X1 and Y to be very strong, courtesy of the two extreme values of Y (which also correspond to the two extreme values of X1). A very quick check of regressing X1, X2 and X3 against Y – without bothering with normalizing data, checking for collinearity, etc. – gives p for X1 to be <.0001. If you drop the two extreme X points you still get a p of .07 for X1, and your p-value for X2 goes to .03.
The graphs really tell the whole story – there is a very strong connection between your extreme X1 values and your extreme Y values.

June 27, 2005 at 1:27 pm #122221
Sir,
What Kiloli is trying to do is find which of the X’s are causing outliers in the process output, using a binary response coded as 0 and 1. Can we do that? I have a big doubt here.
Need your clarification, sir!

June 27, 2005 at 1:29 pm #122222
Sir,
What Kiloli is trying to do is find which of the X’s are causing outliers in the process output, using a binary response coded as 0 and 1. Can we do that? I have a big doubt here.
Need your clarification, sir!

June 27, 2005 at 2:05 pm #122226
Robert Butler (@rbutler)
I could be missing something here, but as I understand it Kiloli is trying to find a relationship between some X’s and his/her Y, and in particular the extreme values of the Y response. If all he/she had was extreme response = yes/no, then he/she would be stuck with attempting a binary regression. In fact, the Y responses look pretty continuous (I didn’t do anything except the most cursory of inspections of the data) and, as I mentioned, a simple plot suggests the cause of the extreme Y values is the extreme X1 value. The simplest check would be to rerun either one of the X1, X2, X3 combinations corresponding to one of the extreme Y responses, and then repeat the experiment with X1 reduced to a “typical” value (the median = 98 would be a good choice) and see what happens to Y.
In short, I don’t see any need to convert the Y responses to binary in order to investigate the relationships between Y and X1, X2, and X3.

June 27, 2005 at 3:25 pm #122237
Dear Robert,
Do you mean that Minitab has some defect?
The reason I’m doing this is:
In one sentence, can we predict a Y outlier from an X value? Given one X value, we could predict whether the Y outlier will be “true” or “false”.
My other considerations, which may be wrong:
1. SPC can identify the process output outliers.
2. With BLR, we are trying to identify which input causes the process output outliers.
3. It is doubtful that linear regression can satisfy the above goal. That is, could we find that some factor has a strong correlation with Y, yet that factor does not have a significant influence on the process outliers?
4. I believe some Xs are like “chemical activators”, and Y is very sensitive to X once X is above some threshold. That is why I’m trying to use BLR.

June 27, 2005 at 4:40 pm #122239
Robert Butler (@rbutler)
No, I don’t think Minitab has a defect. I think you are using it incorrectly. If you plot your data (Y vs. X1, X2, and X3) you will see that X1 is highly correlated with Y. Most importantly, the extreme values of X1 are very strongly correlated with Y. If you plot Y against the other two (X2 and X3) you will see no such relation.
If you develop a regression equation and deliberately force all three X’s into the equation you will get the following:
Y = -74.72 + 1.49*X1 + .61*X2 - .34*X3
The p-values associated with the three are
X1 <.0001
X2 .2131
X3 .3062
The model has an r2 of .98 and the root mean square error is 43.
X1 is highly correlated with Y and you can use the correlation equation to predict responses of Y to changes in X1.
The problem you have is that the two extreme values of X1 are so extreme that, for all intents and purposes you only have three values of X1 – 98 (the median of the X1’s) and 750 and 980.
An examination of the residuals indicates that, if you insist on including the two extreme X values, you have a lurking linear variable. If you drop the two extreme X1 values and just force in X1, X2 and X3 you get
Y = 14.9 + .36*X1 + .63*X2 - .147*X3
where the p-values are
X1 .07
X2 .03
X3 .46
The residual plot looks good and the r2 = .26 with a root mean square error of 25.
So, based on what you have provided it would appear X1 may have a major impact on your Y. For preliminary work you could use either one of the equations to get some sense of how X1 impacts Y since either one will “predict” Y given X1.
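
As a rough way to check the two fits quoted above without Minitab, the sketch below refits Y on X1, X2 and X3 by ordinary least squares, once on all 30 rows and once with the two extreme X1 rows dropped. It is only an illustration: the data is transcribed from the table earlier in the thread, the helper names (`solve`, `ols`) and the X1 < 200 cut used to drop the extremes are assumptions of this sketch, and it reports coefficients only, not p-values.

```python
# (X1, X2, X3, Y) for the 30 runs, transcribed from the thread.
ROWS = [
    (122, 63, 37, 68), (87, 62, 74, 83), (750, 102, 98, 1000),
    (123, 77, 132, 90), (84, 85, 138, 78), (104, 99, 94, 126),
    (109, 110, 75, 70), (98, 110, 109, 84), (120, 89, 104, 76),
    (129, 126, 138, 136), (83, 92, 114, 102), (128, 134, 161, 101),
    (86, 78, 108, 67), (980, 89, 48, 1500), (91, 108, 109, 66),
    (95, 103, 129, 99), (69, 109, 81, 85), (88, 112, 63, 120),
    (48, 105, 111, 52), (75, 127, 100, 117), (109, 78, 85, 134),
    (120, 105, 90, 125), (153, 102, 96, 119), (81, 82, 63, 85),
    (60, 91, 123, 70), (81, 101, 122, 145), (98, 95, 80, 137),
    (145, 99, 91, 128), (86, 105, 95, 124), (106, 63, 101, 54),
]


def solve(a, b):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows):
    """Least-squares fit of Y on an intercept plus X1, X2, X3 (normal equations)."""
    design = [[1.0, x1, x2, x3] for x1, x2, x3, _ in rows]
    y = [r[3] for r in rows]
    k = 4
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)  # [intercept, b_X1, b_X2, b_X3]


full = ols(ROWS)                                # all 30 rows
trimmed = ols([r for r in ROWS if r[0] < 200])  # assumed cut: drops X1 = 750 and 980
print("all 30 rows:     ", [round(b, 3) for b in full])
print("extremes dropped:", [round(b, 3) for b in trimmed])
```

Comparing the two printed coefficient lists shows how much the X1 slope depends on the two extreme rows, which is the “tail wagging a dog” concern raised in this post.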
So, to answer your questions
1: In one sentence, can we predict Y outlier from X value? – Yes. Either one of the equations gives you this capability; however, both equations have problems which need to be addressed before you can decide if either of them is really describing your experimental space.
Give us one X value, we can predict Y outlier will be “true” or “false”. – Given the equation you can predict what Y will be, and from that numerical value you can decide whether Y is “true” or “false”.
The big concern is the unbalanced nature of your X matrix. Your extreme values of X1 are a “tail wagging a dog”. The Y response to these extreme settings does suggest X1 may be critical but you will need to do more work before you can make that claim.
1. I’m sure SPC can identify process outliers but it doesn’t have much bearing on what you are trying to do which is develop a predictive equation that will identify levels of the X’s which will result in unacceptable Y responses.
2. I think you are wasting your time with BLR – it really isn’t applicable in this case.
3. Linear regression – see above – will meet your goal once you determined its final form.
4. The preliminary linear regressions developed above both indicate a sensitivity of Y to X1. Further work should help you quantify this relationship – BLR is not the way to go here.

June 27, 2005 at 4:46 pm #122240
KiloLi,
If I read your question correctly, you want to determine which of the X’s are causing outliers to be present. I think you are overcomplicating the situation by introducing Binary Logistic Regression. I would create Box Plots or Main Effect Charts which have the outlier designation (0 and 1) on the x-axis and the inputs on the y-axis. This quick analysis confirms what Robert said (X1 appears to be the key driver). Follow the graphical analysis up with linear regression (as Robert recommends) and confirmation trials.
Binary logistic regression is not the right tool for this analysis. Remember that this regression technique is based on the Binomial distribution, which deals with ratios, not just 1’s and 0’s. An example of how this would be applicable in your data set: at some combination of x’s you gather 10 samples, and of those 10 samples you get 2 outliers, or 20% outliers. At another combination of x’s you get 8 outliers out of 10 samples.
Hope this helps.
Kirk

June 27, 2005 at 5:09 pm #122241
Ignore my last paragraph. BLR could be used in this type of situation; however, I believe you may still be overcomplicating things.
Kirk

June 27, 2005 at 8:26 pm #122250
KiloLi:
You have revealed a quirk in the way the standard error is estimated as a consequence of an assumption in binary logistic regression (BLR). More about this later, but in the bigger picture, let us look at the problem itself.
There are three different ways you could have approached this problem.
Along with run charts and such, I see that the distributions of X1 and Y values are very nearly normal with the exception of data points #3 and #14. X2 and X3 are normally distributed. This has no real influence on the subsequent analyses, but it is interesting to see that the aberrant values in the distributions of X1 and Y come from the same data points. A scatter plot of X1 versus Y verifies this.
I have left out a bunch of stuff involving X2 and X3, but these are not that relevant to the rest of the discussion.
What I will do depends on the assumptions I want to make about X1 and Y.

ContX – ContY (linear regression)
Use regression to find that the X1 value has the largest influence on the size of Y. Even though the two largest values of Y are caused by the two largest values of X1, the residual analysis shows there is a normal variation in the error regardless of the size of the prediction. There looks to be no reason to transform the data in any way. The p-value shows this is not random: the magnitude of X is useful in predicting the magnitude of Y.

DiscreteX – DiscreteY (Fisher’s exact test)
Flag the largest values of X1, those greater than about 200, and call them “Outlier_X”. Do a cross-tabulation of Outlier_Y and Outlier_X and select the option to do a Fisher’s exact test on the 2×2 table. The p-value shows this is not random: flagging large values of X is useful in identifying large values of Y.

DiscreteY – ContX (binary logistic regression)
For all data points except #3 and #14, the magnitude of X1 makes no difference to whether a Y point is an outlier or not. In fact ALL other data points are NOT outliers.
In other words, for the majority of your data points, the fact that the X1 value is 80, 85, or 90 makes NO difference to the proportion of outliers in that bin (0%). The assumption for BLR is that as you slowly increase the value of X1, the proportion of outliers starts off at zero and traces a lazy “Z” shape as the proportion increases to 25%, then to 50%, beyond 75%, and finally reaches 100%. Your graph of X1 value versus proportion of outliers is a step function, with infinite slope. The response function is the percentage of points expected to have the high (or low) value in a group. In cases like this, where the proportion within a subgroup changes from 0% straight to 100%, the formula for estimating the standard error of the coefficients gives a VERY large value, and hence a large p-value. If you wish to see the effect of this, flag data point #26 (Y=145) as an outlier and rerun the binary logistic regression. The standard error is a more reasonable value, and the p-value of 0.002 agrees with the previous analyses.

When going through the methodology, I would choose the analysis that uses the richest type of information (cont X, cont Y) and gives the most satisfactory results – ordinary linear regression.
Kirk’s and Robert’s posts are pointing you in the same direction.
Cheers, BTDT
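
The step-function effect described in this post can be illustrated with a short sketch. It is not Minitab’s algorithm: it simply evaluates the logistic log-likelihood, and a Wald-style standard error for the slope, along the family logit(p) = k * (X1 - CUT) / 100 for increasing steepness k. The cut point of 450 and the scaling by 100 are assumptions chosen for illustration; any cut between 153 and 750 separates the two groups.

```python
from math import exp, log1p, sqrt

# X1 and Outlier_Y transcribed from the thread; rows #3 and #14 are the outliers.
X1 = [122, 87, 750, 123, 84, 104, 109, 98, 120, 129, 83, 128, 86, 980, 91,
      95, 69, 88, 48, 75, 109, 120, 153, 81, 60, 81, 98, 145, 86, 106]
OUTLIER_Y = [1 if i in (2, 13) else 0 for i in range(len(X1))]

CUT = 450.0  # assumed cut point; every non-outlier lies below it, both outliers above


def loglik_and_slope_se(k):
    """Log-likelihood and Wald-style slope SE for logit(p) = k * (x - CUT) / 100."""
    ll = 0.0
    s0 = s1 = s2 = 0.0  # weighted sums forming the 2x2 information matrix
    for x, y in zip(X1, OUTLIER_Y):
        z = (x - CUT) / 100.0
        eta = k * z
        # log(p) when y == 1, log(1 - p) when y == 0
        ll -= log1p(exp(-eta if y == 1 else eta))
        e = exp(-abs(eta))
        w = e / (1.0 + e) ** 2  # p * (1 - p), computed without cancellation
        s0 += w
        s1 += w * z
        s2 += w * z * z
    det = s0 * s2 - s1 * s1
    return ll, sqrt(s0 / det)  # SE of the slope = sqrt of its inverse-information entry


for k in (1, 2, 4, 8):
    ll, se = loglik_and_slope_se(k)
    print(f"k={k}: log-likelihood={ll:.3e}, slope SE={se:.3e}")
```

Because every point is already on the correct side of the cut, making the curve steeper always improves the log-likelihood, so no finite slope is the best fit, while the information about the slope shrinks and its standard error grows. That is the mechanism behind the very large standard error and the p-value near 1.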

June 27, 2005 at 8:28 pm #122251
I mean lazy “S” shape.
BTDT
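
For readers who want to try the discrete-X, discrete-Y check BTDT describes (flag X1 values above about 200 as Outlier_X, cross-tabulate against Outlier_Y, run Fisher’s exact test), here is a minimal sketch. The function name and the two-sided convention (summing all tables no more probable than the observed one) are assumptions of this sketch rather than anything stated in the thread.

```python
from math import comb

# X1 and Outlier_Y transcribed from the thread; rows #3 and #14 are the outliers.
X1 = [122, 87, 750, 123, 84, 104, 109, 98, 120, 129, 83, 128, 86, 980, 91,
      95, 69, 88, 48, 75, 109, 120, 153, 81, 60, 81, 98, 145, 86, 106]
OUTLIER_Y = [1 if i in (2, 13) else 0 for i in range(len(X1))]
OUTLIER_X = [1 if x > 200 else 0 for x in X1]  # threshold of about 200, per the post


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# Cross-tabulate Outlier_X (rows) against Outlier_Y (columns).
pairs = list(zip(OUTLIER_X, OUTLIER_Y))
a = pairs.count((1, 1))  # X flagged, Y outlier
b = pairs.count((1, 0))  # X flagged, Y in control
c = pairs.count((0, 1))  # X not flagged, Y outlier
d = pairs.count((0, 0))  # X not flagged, Y in control
p_value = fisher_exact_two_sided(a, b, c, d)
print((a, b, c, d), p_value)
```

With only two flagged points the table is as lopsided as it can be, so the test has a single more-extreme-or-equal table to count: the observed one.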
The forum ‘General’ is closed to new topics and replies.