Regression Confusion
 This topic has 11 replies, 6 voices, and was last updated 15 years, 1 month ago by Robert Butler.


June 7, 2007 at 6:32 am #47196
Hi
I’m having some trouble running a multiple regression in Minitab. The results are coming back as follows:
Regression Analysis: Duration versus Shift Code, Crew Code, Reason Code

The regression equation is
Duration = 0.693 - 0.213 Shift Code - 0.0628 Crew Code + 0.149 Reason Code

Predictor        Coef      SE Coef      T        P      VIF
Constant         0.6931    0.1735       3.99     0.000
Shift Code      -0.21324   0.08554     -2.49     0.013  1.0
Crew Code       -0.06276   0.03729     -1.68     0.094  1.1
Reason Code      0.14911   0.03870      3.85     0.000  1.0

S = 0.636708   R-Sq = 9.0%   R-Sq(adj) = 7.8%

Analysis of Variance
Source           DF    SS        MS       F      P
Regression         3    9.0070   3.0023   7.41   0.000
Residual Error   224   90.8091   0.4054
Total            227   99.8161
The problem I have with this is that the R-Sq value of 9.0%, which denotes the strength of the relationship between the Y (Duration) and the X’s, is not very high. I understand that the data was not collected in a controlled environment, but I expected a stronger relationship than the regression suggests. Am I doing something wrong? Are there factors such as sample size and sample distribution that I should be taking into consideration?
June 7, 2007 at 1:24 pm #157128
It simply means that your factors do not explain much of the variation in “Y”. For kicks, generate a new column in your data set using a random number generator. Call this Y2 and run the regression again, but on Y2. See if you can beat 9%!
I don’t think you did anything wrong per se (other than picking the wrong X factors to study). It’s like taking a sample of women’s heights and doing a regression of height versus purse color, perfume type, hair color, and hair length.
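The random-column experiment can be tried outside Minitab as well. The following is a minimal sketch in plain Python, assuming 228 rows (to match the Total DF of 227) and made-up integer codes standing in for the three X columns; the helper names `solve` and `r_squared` are hypothetical. When Y2 is pure noise, the expected R-Sq is roughly k/(n-1), about 1.3% for three predictors and 228 rows.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(rows, y):
    """R-squared of a least squares fit of y on the columns in rows (intercept added)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

random.seed(1)
n = 228  # assumption: same row count as the original analysis
# Made-up stand-ins for Shift Code (1-2), Crew Code (1-4), Reason Code (1-4)
xs = [(random.randint(1, 2), random.randint(1, 4), random.randint(1, 4))
      for _ in range(n)]
y2 = [random.gauss(0, 1) for _ in range(n)]  # the random "Y2" column
r2_random = r_squared(xs, y2)
print(f"R-Sq for pure-noise Y2: {100 * r2_random:.1f}%")
```

Rerunning with different seeds gives a feel for how much R-Sq a model picks up from chance alone.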
June 7, 2007 at 1:26 pm #157129
Robert Butler (@rbutler)
You have regressed shift code, crew code, and reason code against a variable called duration.
What are these codes?
If they are just arbitrary numerical assignments then you have done your regression incorrectly.
If they are not arbitrary but actually have numeric significance what is the significance and what is the relationship between these codes?
If they are highly confounded then you will need to revisit your approach.
June 7, 2007 at 1:38 pm #157131
residuals (@residuals)
I had hoped that Robert would start at the beginning of the diagnostics, the residual analysis, and work it from there.
June 7, 2007 at 1:39 pm #157130
It doesn’t look like there is any multicollinearity to worry about, since the VIFs are so low (the X’s do not depend linearly on each other).
Robert, are you suggesting that an interaction term be added to the model?
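For reference, a VIF for predictor j is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below is a simplified illustration for the two-predictor special case only, where that R² is just the squared Pearson correlation; the data and function names are made up. Note that a VIF computed on arbitrary numeric codes only measures linear association between the code numbers themselves.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Made-up balanced layout: every crew works both shifts equally often
shift_balanced = [1, 1, 2, 2, 1, 1, 2, 2]
crew_balanced = [1, 2, 1, 2, 3, 4, 3, 4]

# Made-up layout where crews 1-2 always work shift 1 and crews 3-4 shift 2
shift_fixed = [1, 1, 2, 2, 1, 1, 2, 2]
crew_fixed = [1, 2, 3, 4, 1, 2, 3, 4]

vif_balanced = vif_two_predictors(shift_balanced, crew_balanced)
vif_fixed = vif_two_predictors(shift_fixed, crew_fixed)
print(f"balanced: VIF = {vif_balanced:.2f}, fixed crews: VIF = {vif_fixed:.2f}")
```

With three or more predictors the same formula applies, but R² has to come from a multiple regression of each X on the remaining X's.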
June 7, 2007 at 1:46 pm #157132
Mary:
From reading your post and Robert Butler’s response, I think we are looking at a situation where your “Y” is continuous (Duration) and the “Xs” are discrete, fixed values (Shift_Code, Crew_Code, Reason_Code).
If this is true, then I would advise you to:
1) Run chart on “Duration”: is the process stable?
2) Test of Homogeneity for “Duration” for each “X” factor. Does any particular factor cause a large variation in “Y” (Duration)?
3) Test of Means for “Duration” for each “X” factor in (2) above. ANOVA or t-test depending on the number of possible values for each factor. Does any particular value of “X” change the average Duration?
4) Make sure you have a good cross section of individual “X” factors by doing a Cross Tabulation of all “X” factors. For example, check that Shift_Code and Crew_Code are independent.
Cheers, BTDT
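As an illustration of the test of means in step 3, here is a minimal sketch of a one-way ANOVA F statistic in plain Python. The durations and their grouping by reason code are made up, and the function name is hypothetical; Minitab's one-way ANOVA reports the same F along with a p-value.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA on lists of values."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Made-up durations grouped by reason code
by_reason = {1: [1.0, 2.0, 3.0], 2: [2.0, 3.0, 4.0], 3: [5.0, 6.0, 7.0]}
f_stat, df_b, df_w = one_way_anova_f(list(by_reason.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The F value would then be compared against an F table (or a statistics package) for the given degrees of freedom.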
June 7, 2007 at 1:53 pm #157133
Robert Butler (@rbutler)
If the codes are actually interval variables then, as was noted, the next thing to do would be a residual analysis. (Actually, the first thing that should have been done, regardless of the codes, was to plot the response against the variables of interest.)
However, before doing anything we need to know the story about the codes. If the codes are, as I suspect, just arbitrary numbers (and if there are more than two codes per category) then they need to be recast in a proper form for regression.
After recasting we would want to look at the regression diagnostics of the recast codes to make sure there isn’t a problem with collinearity.
June 7, 2007 at 2:03 pm #157135
Mary:
Yes, Butler reminds us once again about the value of looking at the raw data.
In addition to the run chart, in Minitab go to “Graph”, “dot plot”, “One Y – with groups”.
When you are conducting the “test for equal variances,” “t-test,” and “ANOVA,” always choose the option of graphing the results.
Cheers, BTDT
June 7, 2007 at 3:51 pm #157143
My assumption about the codes is that they are predefined alpha or alphanumeric codes for the designated categorical factors (shift code would be something like A, B, C; reason code is probably an alpha code like MP for missing part, etc.).
These are also known as “Indicator Variables”. If I am correct in my assumptions above, I would like to know how many levels of each factor there were in the study. For example, are there 100 distinct reason codes, 3 shift codes, etc.?
June 11, 2007 at 10:49 am #157245
The Force (@TheForce)
For multiple regression, you need to look more at R-Sq(adj), and since your value is very small, you may want to revisit your approach and sampling plan. Take a look at your residuals and lack of fit as well.
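For anyone checking the numbers, both R-Sq and R-Sq(adj) can be reproduced from the Analysis of Variance table in the original post. The short sketch below uses only the SS and DF values reported there; the variable names are arbitrary.

```python
# Values copied from the Analysis of Variance table in the original post
ss_regression, ss_error, ss_total = 9.0070, 90.8091, 99.8161
df_error, df_total = 224, 227

# R-Sq: share of the total sum of squares explained by the regression
r_sq = ss_regression / ss_total

# R-Sq(adj): same idea, but each sum of squares is divided by its degrees of freedom
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)

print(f"R-Sq = {100 * r_sq:.1f}%   R-Sq(adj) = {100 * r_sq_adj:.1f}%")
```

The adjustment penalizes each extra predictor, which is why R-Sq(adj) is the more useful figure when comparing models with different numbers of terms.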
June 11, 2007 at 8:56 pm #157273
Thank you everyone, your replies were extremely helpful. I’m going to go back and have another look at the coding of the data. By the way, the coding is broken down into 4 crews ABCD (1234), Dayshift and Nightshift (1,2) and reason code (1234), hence the going back and recoding. Once again thanks for your help.
Cheers Mary
June 12, 2007 at 12:57 am #157283
Robert Butler (@rbutler)
So does this mean you have 4 crews on during the daytime and 4 at night, or does this mean you have 4 crews, two on during the day and two on during the night? Also, do the crews change times, i.e. sometimes crews 1 and 2 have the day shift and crews 3 and 4 have the night shift, or is it a case of the same crews at the same time every day? If it is the latter, then crews and shift will be confounded and you won’t be able to use both of these terms in the regression.
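One way to check for this kind of confounding is a cross tabulation of crew against shift. The following is a rough sketch with made-up schedules and hypothetical helper names: if every crew only ever appears on one shift, shift is completely determined by crew and the two terms cannot be separated.

```python
from collections import Counter

def cross_tab(rows, cols):
    """Count how often each (row level, column level) pair occurs."""
    return Counter(zip(rows, cols))

def is_confounded(factor, other):
    """True if every level of `factor` appears with only one level of `other`."""
    seen = {}
    for f, o in zip(factor, other):
        if seen.setdefault(f, o) != o:
            return False
    return True

# Made-up schedules: crew 1-4, shift 1 = day, 2 = night
crew = [1, 2, 3, 4, 1, 2, 3, 4]
fixed_shift = [1, 1, 2, 2, 1, 1, 2, 2]     # same crews at the same time every day
rotating_shift = [1, 1, 2, 2, 2, 2, 1, 1]  # crews swap between day and night

for (c, s), count in sorted(cross_tab(crew, fixed_shift).items()):
    print(f"crew {c}, shift {s}: {count}")

fixed_result = is_confounded(crew, fixed_shift)
rotating_result = is_confounded(crew, rotating_shift)
print(fixed_result, rotating_result)
```

Empty cells in the cross tabulation (crew and shift combinations that never occur) are the warning sign to look for in the real data.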
Below is an example of the way the coding should be done:
Reason Code (RC) 1, 2, 3, 4
if RC = 1 then do r1 = 1, r2 = 0, r3 = 0
if RC = 2 then do r1 = 0, r2 = 1, r3 = 0
if RC = 3 then do r1 = 0, r2 = 0, r3 = 1
if RC = 4 then do r1 = 0, r2 = 0, r3 = 0
You will run your regression against the dummy variables r1, r2, and r3. Draper and Smith, Applied Regression Analysis, 2nd Edition, has a very good discussion of the use of dummy variables starting on p. 241 of that text.
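The same recoding can be sketched in a few lines of Python. This is only an illustration of the scheme above; the function name `dummy_code` is hypothetical, and the last level (reason code 4) is treated as the reference level that gets all zeros, as in the example.

```python
def dummy_code(level, levels):
    """Return 0/1 indicators for every level except the last (the reference level)."""
    return [1 if level == lv else 0 for lv in levels[:-1]]

REASON_LEVELS = [1, 2, 3, 4]

for rc in REASON_LEVELS:
    r1, r2, r3 = dummy_code(rc, REASON_LEVELS)
    print(f"RC = {rc}: r1 = {r1}, r2 = {r2}, r3 = {r3}")
```

A factor with k levels needs k - 1 dummy variables, so the 4 crews would also need three columns, while the two-level shift needs only one.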