# Regression Confusion


Viewing 12 posts - 1 through 12 (of 12 total)
#47196

McNabb
Participant

Hi
I’m having some trouble conducting a multiple regression in Minitab. The results are coming back as follows:
Regression Analysis: Duration versus Shift Code, Crew Code, Reason Code
The regression equation is
Duration = 0.693 - 0.213 Shift Code - 0.0628 Crew Code + 0.149 Reason Code

| Predictor | Coef | SE Coef | T | P | VIF |
|---|---|---|---|---|---|
| Constant | 0.6931 | 0.1735 | 3.99 | 0.000 | |
| Shift Code | -0.21324 | 0.08554 | -2.49 | 0.013 | 1.0 |
| Crew Code | -0.06276 | 0.03729 | -1.68 | 0.094 | 1.1 |
| Reason Code | 0.14911 | 0.03870 | 3.85 | 0.000 | 1.0 |

S = 0.636708     R-Sq = 9.0%       R-Sq(adj) = 7.8%

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 3 | 9.0070 | 3.0023 | 7.41 | 0.000 |
| Residual Error | 224 | 90.8091 | 0.4054 | | |
| Total | 227 | 99.8161 | | | |

The problem I have with this is that the R-Sq value of 9.0%, which denotes the strength of the relationship between the Y (Duration) and the X’s, is not very high. I understand that the data was not collected in a controlled environment, but I expected a stronger relationship than the regression suggests. Am I doing something wrong? Are there factors such as sample size and sample distribution that I should be taking into consideration?
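As a cross-check on the output above, the summary statistics can be recomputed from the Analysis of Variance rows alone. The sketch below uses only the numbers printed by Minitab; the variable names are my own.

```python
import math

# Values copied from the Analysis of Variance table in the post above.
ss_regression, df_regression = 9.0070, 3
ss_error, df_error = 90.8091, 224
ss_total, df_total = 99.8161, 227

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

# R-Sq: share of the total sum of squares explained by the regression.
r_sq = ss_regression / ss_total
# R-Sq(adj): the same idea using mean squares, which penalizes extra terms.
r_sq_adj = 1 - ms_error / (ss_total / df_total)
# S: standard deviation of the residuals.
s = math.sqrt(ms_error)
# F: regression mean square relative to residual mean square.
f_stat = ms_regression / ms_error

print(f"R-Sq      = {100 * r_sq:.1f}%")
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"S         = {s:.4f}")
print(f"F         = {f_stat:.2f}")
```

The recomputed values agree with the printed R-Sq, R-Sq(adj), S and F, so the output is internally consistent. A small P for the F test together with a small R-Sq says the fitted terms are unlikely to be pure noise, yet they account for little of the variation in Duration.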

#157128

Craig
Participant

It simply means that your factors do not explain much of the variation in “Y”. For kicks, generate a new column in your data set using a random number generator. Call this Y2 and do the regression again, but on Y2. See if you can beat 9%!
I don’t think you did anything wrong per se (other than picking the wrong X factors to study). It’s like taking a sample of women’s heights and doing a regression of height versus purse color, perfume type, hair color, and hair length.
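The random-Y2 experiment can be sketched outside Minitab as well. The example below assumes NumPy is available, uses made-up integer codes as stand-in predictors (not Mary's data), and picks an arbitrary seed; `r_squared` is a helper written for this sketch. The sample size of 228 is taken from the 227 total degrees of freedom in the output.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least squares fit of y on X, with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
n = 228                         # 227 total DF + 1

# Hypothetical stand-in predictors: three columns of integer codes.
X = np.column_stack([
    rng.integers(1, 3, n),  # two levels
    rng.integers(1, 5, n),  # four levels
    rng.integers(1, 5, n),  # four levels
])

# Y2 is pure noise, regenerated 1000 times and regressed on the same X.
r2_noise = np.array([r_squared(X, rng.normal(size=n)) for _ in range(1000)])

print(f"mean R-Sq from noise:      {r2_noise.mean():.2%}")
print(f"95th percentile:           {np.percentile(r2_noise, 95):.2%}")
print(f"share of runs reaching 9%: {np.mean(r2_noise >= 0.09):.2%}")
```

With three predictors and 228 rows, a noise response typically gives an R-Sq of roughly 3/227, a little over 1%. This gives a baseline for judging how far 9% sits above what unrelated predictors would produce.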

#157129

Robert Butler
Participant

You have regressed shift code, crew code, and reason code against a variable called duration.
What are these codes?
If they are just arbitrary numerical assignments then you have done your regression incorrectly.
If they are not arbitrary but actually have numeric significance what is the significance and what is the relationship between these codes?
If they are highly confounded then you will need to revisit your approach.
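Why arbitrary numerical assignments matter can be seen in a small simulation. The sketch below assumes NumPy, a single four-level factor, and made-up group means; none of it is Mary's data. It relabels the four levels in every possible order and fits the response two ways: with the code treated as a number, and with 0/1 indicator columns.

```python
from itertools import permutations

import numpy as np

def r_squared(A, y):
    """R-squared of a least squares fit of y on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def numeric_design(codes):
    """Intercept plus the code treated as a number."""
    return np.column_stack([np.ones(len(codes)), codes])

def dummy_design(codes):
    """Intercept plus one 0/1 column per level, dropping the last level."""
    levels = np.unique(codes)
    dummies = [(codes == level).astype(float) for level in levels[:-1]]
    return np.column_stack([np.ones(len(codes))] + dummies)

rng = np.random.default_rng(1)  # arbitrary seed
n = 200
reason = rng.integers(1, 5, n)                 # hypothetical codes 1..4
true_mean = {1: 1.0, 2: 0.2, 3: 0.9, 4: 0.3}   # made-up means, unrelated to code order
duration = np.array([true_mean[int(c)] for c in reason]) + rng.normal(0.0, 0.3, n)

numeric_r2, dummy_r2 = [], []
for new_labels in permutations([1, 2, 3, 4]):
    relabel = dict(zip([1, 2, 3, 4], new_labels))
    codes = np.array([relabel[int(c)] for c in reason])
    numeric_r2.append(r_squared(numeric_design(codes), duration))
    dummy_r2.append(r_squared(dummy_design(codes), duration))

print(f"numeric coding: R-Sq from {min(numeric_r2):.3f} to {max(numeric_r2):.3f}")
print(f"dummy coding:   R-Sq from {min(dummy_r2):.3f} to {max(dummy_r2):.3f}")
```

The numeric fit changes with the labeling because it forces a straight line through the code values, while the indicator fit is the same for every labeling. If the codes in the original regression are arbitrary labels, its R-Sq reflects the labeling as much as the data.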

#157131

residuals
Participant

I had hoped that Robert would start at the beginning of the diagnostics, the residual analysis, and work it from there.

#157130

Craig
Participant

It doesn’t look like there is any multicollinearity to worry about since the VIFs are so low. (The X’s do not depend on each other).
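For reference, the VIF for a predictor is 1 / (1 - R_j²), where R_j² comes from regressing that predictor on the others, so a VIF of 1.1 corresponds to an R_j² of about 9%. A minimal sketch of the calculation follows, assuming NumPy and simulated columns; the `vif` helper is written for this example and is not Minitab's routine.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(2)  # arbitrary seed
n = 228

# Three unrelated columns: every VIF should sit close to 1.
independent = rng.normal(size=(n, 3))

# Make the third column nearly a copy of the first: both VIFs blow up.
collinear = independent.copy()
collinear[:, 2] = collinear[:, 0] + 0.1 * rng.normal(size=n)

print("independent columns:", np.round(vif(independent), 2))
print("near-duplicate pair:", np.round(vif(collinear), 2))
```

Note that VIF only measures linear dependence among the columns as they were entered. If the codes are arbitrary labels, low VIFs on the numeric codes do not by themselves rule out an association between the underlying categorical factors.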
Robert, are you suggesting that an interaction term be added to the model?

#157132

BTDT
Participant

Mary:

From reading your post and Robert Butler’s response, I think we are looking at a situation where your “Y” is continuous (Duration) and the “Xs” are discrete, fixed values (Shift_Code, Crew_Code, Reason_Code). If this is true, then I would advise you to:

1) Run chart on “Duration”: is the process stable?
2) Test of Homogeneity for “Duration” for each “X” factor. Does any particular factor cause a large variation in “Y” (Duration)?
3) Test of Means for “Duration” for each “X” factor in (2) above. Use ANOVA or a t-test depending on the number of possible values for each factor. Does any particular value of “X” change the average Duration?
4) Make sure you have a good cross section of individual “X” factors by doing a Cross Tabulation of all “X” factors. For example, check that Shift_Code and Crew_Code are independent.

Cheers, BTDT
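The cross tabulation in step 4 can be sketched with the Python standard library. The records below are invented for illustration, and `cross_tab` is a helper written for this example. Empty cells in the table are the warning sign that two factors never vary independently.

```python
from collections import Counter

# Hypothetical (shift_code, crew_code) pairs, made up for illustration only.
records = [
    (1, 1), (1, 2), (1, 1), (2, 3), (2, 4),
    (1, 2), (2, 3), (2, 4), (1, 1), (2, 4),
]

def cross_tab(pairs):
    """Count how often each (row level, column level) combination occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

rows, cols, table = cross_tab(records)

print("shift \\ crew", *cols)
for shift, row in zip(rows, table):
    print(f"{shift:>12}", *row)

# Combinations that never occur in the data.
empty_cells = [
    (shift, crew)
    for shift, row in zip(rows, table)
    for crew, count in zip(cols, row)
    if count == 0
]
print("combinations never observed:", empty_cells)
```

In this made-up data, crews 1 and 2 only ever appear on shift 1 and crews 3 and 4 only on shift 2, so half the cells are empty and the shift effect cannot be separated from the crew effect.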

#157133

Robert Butler
Participant

If the codes are actually interval variables then, as was noted, the next thing to do would be a residual analysis.  (Actually, the first thing that should have been done, regardless of the codes, was to plot the response against the variables of interest.)
However, before doing anything we need to know the story about the codes.  If the codes are, as I suspect, just arbitrary numbers (and if there are more than two codes per category) then they need to be recast in a proper form for regression.
After recasting we would want to look at the regression diagnostics of the recast codes to make sure there isn’t a problem with collinearity.

#157135

BTDT
Participant

Mary:

Yes, Butler reminds us once again about the value of looking at the raw data. In addition to the run chart, in Minitab go to “Graph”, “dot plot”, “One Y – with groups”. When you are conducting the “test for equal variances,” “t-test,” and “ANOVA,” always choose the option of graphing the results.

Cheers, BTDT

#157143

Craig
Participant

My assumption about the codes is that they are pre-defined alpha or alphanumeric codes for the designated categorical factors. (Shift code would be something like A, B, C; reason code is probably an alpha code like MP for missing part, etc.)
These are also known as “Indicator Variables”. If I am correct in my assumptions above, I would like to know how many levels of each factor there were in the study. For example, are there 100 distinct reason codes, 3 shift codes, etc.?

#157245

The Force
Member

For multiple regression, you should look more at R-Sq(adj), and since your value is very small, you may want to revisit your approach and sampling plan. Take a look at your residuals and lack of fit as well.

#157273

McNabb
Participant

Thank you everyone, your replies were extremely helpful. I’m going to go back and have another look at the coding of the data. By the way, the coding is broken down as follows: 4 crews A, B, C, D (coded 1, 2, 3, 4), day shift and night shift (coded 1, 2), and reason code (1, 2, 3, 4), hence the going back and recoding. Once again, thanks for your help.
Cheers Mary

#157283

Robert Butler
Participant

So does this mean you have 4 crews on during the daytime and 4 at night, or does this mean you have 4 crews, two on during the day and two on during the night? Also, do the crews change times, i.e. sometimes crews 1 and 2 have the day shift and crews 3 and 4 have the night shift, or is it a case of the same crews at the same time every day? If it is the latter, then crews and shift will be confounded and you won’t be able to use both of these terms in the regression.
Below is an example of the way the coding should be done:
Reason Code (RC) = 1, 2, 3, 4
If RC = 1 then do r1 = 1, r2 = 0, r3 = 0
If RC = 2 then do r1 = 0, r2 = 1, r3 = 0
If RC = 3 then do r1 = 0, r2 = 0, r3 = 1
If RC = 4 then do r1 = 0, r2 = 0, r3 = 0
You will then run your regression against the dummy variables r1, r2, and r3. Draper and Smith – Applied Regression Analysis, 2nd Edition, has a very good discussion of the use of dummy variables starting on p. 241 of that text.
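The coding scheme above can be written as a small function. This is a sketch in Python rather than Minitab; `dummy_code` and the level tuples are names chosen for this example, and the last level is treated as the reference level, matching the RC = 4 row above.

```python
def dummy_code(value, levels):
    """Return 0/1 indicator values for `value`, using the last level as reference.

    A factor with k levels yields k - 1 indicators; the reference level
    is the one coded as all zeros.
    """
    if value not in levels:
        raise ValueError(f"{value!r} is not one of the levels {levels!r}")
    return [1 if value == level else 0 for level in levels[:-1]]

REASON_LEVELS = (1, 2, 3, 4)  # reason codes as described in the thread
SHIFT_LEVELS = (1, 2)         # day shift and night shift

for rc in REASON_LEVELS:
    r1, r2, r3 = dummy_code(rc, REASON_LEVELS)
    print(f"RC = {rc}: r1 = {r1}, r2 = {r2}, r3 = {r3}")

# A two-level factor needs only one indicator column.
for shift in SHIFT_LEVELS:
    print(f"Shift = {shift}: s1 = {dummy_code(shift, SHIFT_LEVELS)[0]}")
```

The same function covers the four crews (three indicators) and the two shifts (one indicator). When a single factor is coded this way and fitted on its own, the constant estimates the mean of the reference level and each coefficient estimates the difference between its level and the reference.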


The forum ‘General’ is closed to new topics and replies.