Issue in Regression Analysis
 This topic has 14 replies, 8 voices, and was last updated 16 years, 4 months ago by McD.

July 21, 2006 at 7:14 am #44085
Hi,
I am doing a project and used multiple regression. When I ran a simple linear regression on each factor individually, each factor showed a significant impact on the output (p-value less than 0.05). But when I run a multiple regression with all the X’s, some X’s do not show a significant impact on Y. Is there an issue?
I thought it might be a multicollinearity issue among the X’s. But after checking the VIFs, they are not that high.
Any suggestion/help on this would be appreciated.
Thanks

July 21, 2006 at 12:28 pm #140788
Robert Butler (@rbutler)
Even when collinearity isn’t an issue, things change when you go from simple one-at-a-time regression to multiple regression. There are a number of issues, all of which can best be summarized by recognizing the following:
If I regress a Y against a single X (let’s call it X1) then X1 is going to account for a certain percent of the total observed variation in Y. If I next add another X (let’s call this one X2) and regress Y against it while simultaneously allowing X1 to remain then the amount of variation in Y that X2 will be permitted to explain will be some amount of that portion of variation not accounted for by X1.
With the one-X-at-a-time method, each X is being allowed to test itself against all of the observed variation in Y, not just the leftover portion. If your X’s are nice and orthogonal, the order of entry of the X’s into the regression probably won’t matter. But if they are not orthogonal, yet also not sufficiently collinear to cause problems, then order of entry/order of removal is important and can result in precisely the situation you have observed.
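The effect described above can be reproduced on made-up data. The following is only a minimal sketch, assuming numpy and a synthetic data set in which X2 is moderately correlated with X1 (the coefficients, sample size, and the helper name `r_squared` are all illustrative assumptions, not anything from the original poster's project). It compares the share of variation X2 explains on its own with the extra share it explains once X1 is already in the model.

```python
import numpy as np

def r_squared(y, *columns):
    """R^2 of an ordinary least-squares fit of y on the given columns plus an intercept."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)

# Synthetic data (an assumption for illustration only):
# X2 shares part of its variation with X1 but is not strongly collinear.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + 0.7 * rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x2 + 2.0 * rng.normal(size=n)

r2_x1 = r_squared(y, x1)          # X1 tested against all variation in Y
r2_x2 = r_squared(y, x2)          # X2 tested against all variation in Y
r2_both = r_squared(y, x1, x2)    # both X's in the model together

# With only two predictors, the VIF of either one is 1 / (1 - r^2),
# where r is the correlation between X1 and X2.
vif = 1.0 / (1.0 - np.corrcoef(x1, x2)[0, 1] ** 2)

print(f"R^2, X1 alone:            {r2_x1:.3f}")
print(f"R^2, X2 alone:            {r2_x2:.3f}")
print(f"R^2, X1 and X2 together:  {r2_both:.3f}")
print(f"Extra R^2 from X2 after X1: {r2_both - r2_x1:.3f}")
print(f"VIF: {vif:.2f}")
```

In this setup the VIF stays modest, yet the portion of variation left for X2 after X1 has entered is much smaller than what X2 appears to explain when regressed alone.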
Regardless of the condition of the X matrix, you should always develop multiple regression models using both stepwise backward elimination and stepwise forward selection with replacement. If these two methods result in different models, you will need to do an exhaustive comparison of their respective regression diagnostics (residual analysis, lack of fit, etc.) and you will also need to give some serious thought to the physical meaning of the equations and their respective terms before deciding which model best describes your system.

July 24, 2006 at 7:07 am #140884
Thanks Robert.
So should I remove those variables which are not showing a significant impact on Y in the multiple regression?
But if I add these variables, I get a higher R-square value, though it is not significantly different. Also, the equation fits relatively better (meaning less deviation from the actual values) when I add those variables.

July 24, 2006 at 9:36 am #140887
The R-sq. value will increase (or at least never decrease) every time you add an X to your regression, even if that X does not have a significant impact on Y. Remove the X’s which are not significant and then run the regression again.

July 24, 2006 at 12:27 pm #140894
Robert Butler (@rbutler)
As was noted, add a term, regardless of its significance, and the R2 rises while the residual sum of squares drops. Since it sounds like you don’t know the structure of the X matrix, you should “remove” the variables in the way I suggested in my first post: run stepwise regression, both backward elimination and forward selection with replacement, and see if the two techniques give the same model.
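For readers without a statistics package at hand, the two stepwise procedures can be sketched in a few lines. This is a rough illustration under stated assumptions, not the algorithm of Minitab or any other package: it uses numpy only, judges each term by its partial F statistic, and uses an F-to-enter and F-to-remove of 4.0 (roughly t = 2 squared) as an illustrative threshold. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def rss(y, X, cols):
    """Residual sum of squares for y regressed on an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def partial_f(y, X, cols, j):
    """Partial F statistic for term j, given the other terms listed in cols."""
    full = sorted(set(cols) | {j})
    reduced = [c for c in full if c != j]
    rss_full = rss(y, X, full)
    df_resid = len(y) - len(full) - 1
    return (rss(y, X, reduced) - rss_full) / (rss_full / df_resid)

def backward_elimination(y, X, f_remove=4.0):
    """Start with every term; repeatedly drop the weakest term while its F is below f_remove."""
    cols = list(range(X.shape[1]))
    while cols:
        f_vals = {j: partial_f(y, X, cols, j) for j in cols}
        worst = min(f_vals, key=f_vals.get)
        if f_vals[worst] >= f_remove:
            break
        cols.remove(worst)
    return sorted(cols)

def forward_selection_with_replacement(y, X, f_enter=4.0, f_remove=4.0):
    """Add the strongest candidate each step, then re-test the terms already in the model."""
    cols = []
    for _ in range(2 * X.shape[1]):  # step cap is a safety guard against cycling
        candidates = [j for j in range(X.shape[1]) if j not in cols]
        if not candidates:
            break
        f_in = {j: partial_f(y, X, cols, j) for j in candidates}
        best = max(f_in, key=f_in.get)
        if f_in[best] < f_enter:
            break
        cols.append(best)
        # "With replacement": earlier terms may be dropped once a new term has entered.
        while len(cols) > 1:
            f_out = {c: partial_f(y, X, cols, c) for c in cols if c != best}
            worst = min(f_out, key=f_out.get)
            if f_out[worst] >= f_remove:
                break
            cols.remove(worst)
    return sorted(cols)

# Synthetic data (assumption): columns 0 and 1 drive y, column 2 is
# correlated with column 0 but has no direct effect, column 3 is pure noise.
rng = np.random.default_rng(3)
n = 200
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = 0.7 * x0 + 0.7 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3])
y = 3.0 * x0 + 2.0 * x1 + rng.normal(size=n)

backward_model = backward_elimination(y, X)
forward_model = forward_selection_with_replacement(y, X)
print("backward elimination keeps columns:", backward_model)
print("forward selection keeps columns:   ", forward_model)
```

If the two lists differ on real data, that is the cue to compare residual diagnostics and the physical meaning of each candidate model rather than to pick one mechanically.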

July 24, 2006 at 2:51 pm #140904
If you have Minitab, run the analysis in “Best Subset” first. Then run the analysis as a simple multiple regression. In any case, rather than looking at R squared you should be looking at adjusted R squared. If adjusted R squared is much lower than R squared, you have an indication that one or more of the variables are inflating your R squared. Remove the non-significant x’s unless they are part of an interaction effect. In that case, the interaction effect and the two variables involved in that interaction should all be included in your model.
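The difference between R squared and adjusted R squared is easy to see on synthetic data. The sketch below assumes numpy and a made-up data set with one real predictor and five columns of pure noise; the helper name `fit_stats` is hypothetical. It uses the usual definition, adjusted R squared = 1 - (1 - R squared)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors.

```python
import numpy as np

def fit_stats(y, X):
    """Return (R^2, adjusted R^2) for an OLS fit of y on the columns of X plus an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj_r2

# Synthetic data (assumption): one real predictor, five unrelated noise columns.
rng = np.random.default_rng(1)
n = 30
x_real = rng.normal(size=(n, 1))
noise_cols = rng.normal(size=(n, 5))
y = 1.5 * x_real[:, 0] + rng.normal(size=n)

r2_small, adj_small = fit_stats(y, x_real)
r2_big, adj_big = fit_stats(y, np.column_stack([x_real, noise_cols]))

print(f"1 predictor:  R^2 = {r2_small:.3f}, adjusted R^2 = {adj_small:.3f}")
print(f"6 predictors: R^2 = {r2_big:.3f}, adjusted R^2 = {adj_big:.3f}")
```

R squared cannot go down when the noise columns are added, while the gap between R squared and adjusted R squared widens, which is the warning sign described in the post above.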

July 24, 2006 at 4:55 pm #140910
I would suggest that you use DOE instead, because multiple or simple linear regression focuses only on the relationship between each X and Y, while DOE looks at how different combinations of X’s interact in their effect on Y and then gives you an optimal setting later on. Interactions are very critical, not only the main effects that multiple regression looks at.

July 24, 2006 at 5:50 pm #140912
You can run interactions in multiple regression just as easily. The way to do this is to create a new variable that multiplies the two x’s. Use the “Calculate” function in Minitab and create a new variable by multiplying x1*x2. Then include the two variables (x1 and x2) and the interaction (the new x1*x2) in the stepwise regression function. If this additional term is significant, then you still need to include the other two terms in your prediction equation even if they are not significant. Again, it is easiest to use the Best Subset function to see what combinations of variables and interactions give you the highest adjusted R squared. If you have observational data, you cannot use the DOE function easily. But you can still run interaction effects using the method described above.
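Outside Minitab the same trick is just one extra column. Here is a minimal sketch, assuming numpy and synthetic data with a known interaction built in (the true coefficients and variable names are assumptions chosen for illustration): the product x1*x2 is computed as a new variable and fitted alongside x1 and x2.

```python
import numpy as np

# Synthetic data (assumption): y depends on x1, x2 and their product.
rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(-1, 1, size=n)
x2 = rng.uniform(-1, 1, size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 2.0 * x1 * x2 + 0.1 * rng.normal(size=n)

# The interaction is simply a new variable: the product of the two x's.
x1x2 = x1 * x2

# Fit y on an intercept, both main effects, and the interaction term.
A = np.column_stack([np.ones(n), x1, x2, x1x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, b1, b2, b12 = coef

print(f"intercept = {intercept:.3f}")
print(f"x1        = {b1:.3f}")
print(f"x2        = {b2:.3f}")
print(f"x1*x2     = {b12:.3f}")
```

The main effects are kept in the model together with the product term, following the advice in the post above; whether that is always required is debated later in this thread.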

July 24, 2006 at 6:16 pm #140913
Robert Butler (@rbutler)
I wouldn’t recommend using best subset multiple regression for anything, for the simple reason that the method is just a very long and potentially incorrect way of doing what backward elimination and forward selection with replacement stepwise regression do, without all of the additional assumptions about your X matrix.
We had a discussion about the issues surrounding this some time back. The link below is to the initial post which framed the discussion. The entire discussion should provide you with a better understanding of the issues surrounding these regression techniques.
https://www.isixsigma.com/forum/showmessage.asp?messageID=51625
As for the issue concerning the inclusion or exclusion of main effects when their respective interaction is significant we had quite a discussion about that as well. Both sides marshalled papers in defense of include or exclude and, to the best of my knowledge, there was nothing presented that gave a clear advantage to either side.

July 24, 2006 at 8:07 pm #140919
The sequence and type of regression approach to take depends on whether your goal is:
1. Hypothesis Testing or
2. Exploratory analysis
Best subsets are only appropriate in the context of data exploration and need to be followed up with a multiple regression algorithm. Backward and forward procedures do not always give identical results because they use different criteria for excluding/including variables. So you need to understand what each of the two procedures does. Finally, technically the whole analysis needs to be supported by a solid residual analysis. So it’s not as easy as never using best subsets especially when backward elimination and forward selection may not give identical results.
In the context of six sigma projects I recommend a highly pragmatic approach. Run a best subset, follow it up by stepwise procedure, see if the results make sense, see if you can use the results and solve the problem that you are targeted with, take action.
When writing a dissertation, minute differences between procedures become important. However, in the context of science, regression analysis has been relegated to the more simplistic tools anyway and has been replaced by more complex tools such as structural equation modeling.

July 24, 2006 at 8:50 pm #140920
On the other Hans… (@OntheotherHans...)
“When writing a dissertation, minute differences between procedures become important.”
Hans, do you speak from experience or do you just speak?

July 24, 2006 at 9:47 pm #140924
I went through the process twice and have been advising many students on their dissertations. I guess that counts for some experience.
My key point is that there are grey zones in statistics and that the usage of a procedure needs to be driven and justified by what the user wants to do with the results. Thus, if your goal is to explore the data, best subset is adequate in many situations. Of course you are paying a price if you only look for maximum R squared and the equation is used for prediction. In this case, the “best subset” may not be optimal because your goal in prediction is to reduce your error, which is driven by the MSE (mean squared error). A high R squared (even a very high R squared or adjusted R squared) doesn’t necessarily translate into a usable prediction equation. Robert Butler pointed that out very succinctly in the other discussion thread.
Where we disagree is on the conclusions we draw from the same sources. Some of the different approaches to regression, or to the use of statistical tools in general, cannot be justified based on mathematical considerations only. They are driven by issues of research design, the problem field, and existing knowledge in that problem field. In this respect I wholeheartedly follow R.A. Fisher’s critique of Neyman’s and Wald’s decision-theoretic turn in statistics. Applied statisticians do not necessarily agree with mathematical statisticians. And some of the controversies around different preferences for approaches to regression analysis stem from this divide in opinion about the role and nature of statistics in general.

July 24, 2006 at 11:12 pm #140927
Two dissertations? Crap, you trumped me there Hans.
I only went through it once, but it was really, really rigorous. And I’ve only advised a few students on their dissertations... but I did a postdoc under an egomaniacal, textbook-writing thought leader in his field before I escaped to industry. How about that???? Was your postdoc really, really hard subservient slave labor, and did it contribute to a great deal of successful grant writing and many publications for a frizzy-haired, crazed, manipulative, psychotic mathematician, sort of like the Doc on House but with a lousy attitude and a desire to inflict pain?
I’d poke at your statistics, logic, rationale and perspectives but, so far, I think you’re a good read. Not to say that I and many others on the forum won’t go for your jugular at the slightest indication of a fatal flaw (or even a mental lapse) in your postings.

July 25, 2006 at 12:02 am #140928
Trumped,
I am really flattered that you would poke at my “statistics, logic, rationale and perspectives but so far (…) it’s been a good read”. That’s what matters the most to me.
As for your believing my two dissertations and the pain to get through them, you either believe me or you believe that I am full of “crap”. I don’t take it personally if you take a critical attitude to such statements. After all, this is a cold medium of communication in a black box environment. I was asked an honest question and answered it honestly in return. “Der Rest ist Schweigen” (“the rest is silence”) …
In any case, thanks for your compliment and I am glad to see that there are critical minds on this site who vivisect every argument meticulously.

July 25, 2006 at 5:35 pm #140955
“you will also need to give some serious thought to the physical meaning of the equations”
I would argue that this is the thing to do first. At its best, regression is no more than an educated guess, supported by some observation. If you have a first-principles reason to suspect some relationship, accommodate that first.
For some reason, people operating any process always believe that their process is somehow special. But the reality is that all processes must follow the laws of physics. Take into account the things you know first. There really isn’t much to be gained by rediscovering ancient laws. Once you have removed the known effects, surprises in the data are no longer masked by potentially large influences of the basic science of the problem.
Now that you have a model that describes how you are different from what you should expect, once again follow Robert’s advice and understand what the model is telling you. If something is very counterintuitive, it probably indicates that there is something going on you don’t understand. Approach it with an open mind, but don’t be willing to easily be led off into the weeds. Understand what it means, and dig into the mechanism.
Remember, regression can give us insight into the problem, and can point in the direction of causes. But it can never prove a cause by itself. It can only show relationships. If a relationship is suspect, suspect it!
–McD