Multiple regression
 This topic has 21 replies, 11 voices, and was last updated 14 years, 5 months ago by Bob Kane.


November 19, 2007 at 4:10 pm #48712
sixsigmahack (Member, @sixsigmahack)
I have a question about multiple regression. I have a model to calculate cycle time based on 10 months of data (i.e., 10 data points). The regression equation is:
Cycle Time (in days) = 75 - 0.0342 Applications completed - 0.00409 Inventory (1 month prior)
The R-square value is 70% and both Xs (applications completed, inventory) are significant. My questions are:
1. Intuitively I would have assumed that as inventory increases, cycle time would also increase. However, the negative coefficient for inventory (-0.00409) suggests that if we kept increasing inventory, cycle time would go down. This is puzzling.
2. Can I trust a regression model with just 10 data points? I don't think I have enough data points to satisfy the assumptions of normality, though I do not see any heteroscedasticity. Should I wait until more data becomes available?
November 19, 2007 at 5:02 pm #165068
Robert Butler (Participant, @rbutler)
Don't sweat the normality. It only matters for the residuals, and with only 10 data points it would be very easy for a 10-point draw from a perfectly normal distribution to fail the test.
To answer your question: first, plot the data. Specifically, plot applications vs. inventory and then plot cycle time against both. What do you see? If applications looks like it is correlated with inventory, run a regression to see if the relationship is significant. If it is, then applications and inventory aren't independent and you are building a model using a data set that cannot support the inclusion of both terms.
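A minimal sketch of this predictor-vs-predictor check, using the ten rows sixsigmahack posts later in this thread. It is a pure-Python illustration rather than the procedure any poster actually ran; the variable names and the hard-coded critical value are choices made here for the example.

```python
import math

# Data as posted later in the thread (10 monthly rows).
inventory = [2571, 2598, 2572, 2477, 2279, 1978, 1969, 2173, 1944, 1753]
decisions = [999, 928, 966, 1142, 1184, 1178, 1219, 956, 995, 1141]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


n = len(inventory)
r = pearson_r(decisions, inventory)
r_squared = r ** 2

# t statistic for H0 "no linear relationship", with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)

# Two-sided 5% critical value of Student's t with 8 degrees of freedom.
T_CRIT_5PCT_DF8 = 2.306

print(f"r = {r:.3f}, R^2 = {r_squared:.3f}, t = {t_stat:.2f}")
print("significant at the 5% level:", abs(t_stat) > T_CRIT_5PCT_DF8)
```

Comparing |t| with the tabulated critical value avoids needing a t-distribution CDF, which the Python standard library does not provide. A statistics package would report the p-value directly.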
If they don't look or test as significantly correlated, look at the plots of cycle time against the two. What do you see? Do any points look like a tail wagging the dog? Are there any trends that look like they should be something other than linear? If you have such a point, rerun the regression without it and see if that changes things.
If you want, with only 10 points, post the matrix and I'd be willing to look at it for you.
November 19, 2007 at 6:46 pm #165083
sixsigmahack (Member, @sixsigmahack)
Hi Robert, when I run the regression for decisions (X) versus inventory (Y), I get an R-square value of 26%. However, the p-value is .129, which suggests that decisions is not significant in determining inventory levels. Does this make sense? Intuitively, I think inventory might be determined more by incoming volumes (demand). The regression supports this assumption, as a high R-square relationship exists between inventory and incoming volumes.
When I regress cycle time on inventory I see only a 3% R-square value, but when I regress cycle time on decisions I see a 65% R-square value. The two of them together give me a stronger model but, as I said in my first post, the result seems a bit counterintuitive. Here's the data:
Inventory (1 month prior)   Incoming volumes (2 months prior)   Decisions   Cycle Time (Actual)
2571                        1619                                999         61.85
2598                        1467                                928         61.52
2572                        1510                                966         58.80
2477                        1339                                1142        55.95
2279                        1426                                1184        57.53
1978                        1282                                1178        53.83
1969                        1613                                1219        53.03
2173                        1455                                956         62.93
1944                        1740                                995         64.07
1753                        1025                                1141        59.61
November 19, 2007 at 7:26 pm #165090
sixsigmahack,
It appears you have 4 columns of data. I am assuming the cycle time is the last column and the first two are inventory and incoming volumes. What is the third?
November 19, 2007 at 7:30 pm #165091
sixsigmahack (Member, @sixsigmahack)
The third is the number of applications decided in that particular month, a metric for the productivity of this business unit. Intuitively, the greater the number of applications decided, the shorter the cycle time should be. This assumption holds true when I run the model.
November 19, 2007 at 7:35 pm #165094
Chris Seider (Participant, @cseider)
It is scary... I have seen some material implying you can't do a regression unless the data are normally distributed!
November 19, 2007 at 7:37 pm #165096
sixsigmahack (Member, @sixsigmahack)
C. Seider, without many data points I think one value may significantly influence the results. With more points it is easier to remove such an outlier. However, I think one should still be able to fit the regression model with fewer data points.
November 19, 2007 at 7:52 pm #165100
A correlation having an r-square less than 1 is already suspicious, especially when it is 70%. May I offer a suggestion: do a calculation to find the minimum number of samples needed for a 95% degree of confidence (alpha = 0.05). This will tell you whether 10 samples is enough to obtain a robust correlation. From a personal perspective, I don't believe it is.
November 19, 2007 at 7:56 pm #165101
Rob,
I am not sure what you are referring to. The R-square will always range between 0 and 1.
Pete
November 19, 2007 at 7:59 pm #165102
Chris Seider (Participant, @cseider)
Keep in mind the following practical, real-life example...
What if your inventory measurement system had precision or accuracy issues? Just that noise in the input, the output, or both would cause the correlation coefficient to drop below one.
November 19, 2007 at 8:04 pm #165103
sixsigmahack (Member, @sixsigmahack)
C. Seider, the inventory figures are pulled straight off a system, so they are pretty straightforward. I can see that inventory may be correlated with productivity; however, my regression model suggests it is not. Incoming volumes, however, should be completely independent. When I run the model with incoming volumes, decisions and cycle time, incoming volumes becomes insignificant (p-value greater than .05), which means my best bet is just to run a single-variable regression with cycle time as my Y and decisions as my X. I am hoping to get something that ties volumes (inventory or incoming volumes) to productivity (number of decisions) and cycle time because (a) I think there must be a relationship (hence my question about whether I have enough data points) and (b) it is much harder to explain to the business that incoming volumes are not a significant reason why cycle times would be high.
November 19, 2007 at 8:26 pm #165108
If the r-square is less than 0.80 (that is, between 0 and 0.80), your correlation could have an error greater than 20%. In a few words, the chances of your calculated values matching the real ones may be 1 in 5. However, in natural systems r-squares are typically found between 50% and 90%. This is why it was suggested that you do a simple Student's t evaluation to see whether the sample size is large enough to give the degree of confidence you are seeking.
November 19, 2007 at 9:05 pm #165119
Clarification (Participant, @Clarification)
Just to clarify: regression lines are only attenuated by measurement error in the independent variable. By contrast, correlations are attenuated by measurement error in both the dependent and independent variables. In essence, measurement error will reduce the slope and increase the observed intercept of a regression line. Also, if the extent of the measurement error in the independent variable is known, it can be corrected for using a "correction for attenuation" formula.
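For reference, the usual textbook form of this result can be written out as below. It assumes the classical errors-in-variables setup, which is an assumption added here and not something stated in the thread: the recorded predictor is the true value plus random noise that is independent of the true value and of the regression error. The symbols (lambda for the reliability ratio, rho for the reliabilities of X and Y) are notation chosen for this sketch.

```latex
% Observed predictor = true predictor + independent measurement noise
x^{*} = x + u, \qquad y = \alpha + \beta x + \varepsilon

% The slope fitted on x^{*} is shrunk toward zero by the reliability ratio
\hat{\beta}_{\mathrm{obs}} \;\longrightarrow\; \lambda \beta,
\qquad
\lambda = \frac{\sigma_x^{2}}{\sigma_x^{2} + \sigma_u^{2}} \le 1

% If \lambda is known, the slope can be corrected
\hat{\beta}_{\mathrm{corrected}} = \frac{\hat{\beta}_{\mathrm{obs}}}{\lambda}

% Correlations are attenuated by unreliability in both variables
r_{\mathrm{obs}} = r_{\mathrm{true}} \sqrt{\rho_{xx}\,\rho_{yy}}
```

Under these assumptions, random noise in Y alone leaves the expected slope unchanged and only widens the scatter, while noise in X biases the slope itself.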
November 19, 2007 at 9:39 pm #165139
Robert Butler (Participant, @rbutler)
I don't see where you are getting both decisions and inventory as significant. If the cut is P < .05 then only decisions is significant. This is true even if the cut is .15.
Decisions (p = .0021), Inventory (p = .154)
What you have here is a very nice illustration of why you should normalize your X's (center and scale them so that they range from -1 to 1) instead of running your regression against the raw variables.
If you center and scale you will find that inventory and decisions are orthogonal enough to allow their inclusion in the model. The assessment of the raw values says otherwise.
The normalized model is
cycle time = 58.8 - 4.1 * Ndecisions
where
Ndecisions = (decisions - M1) / M2
M1 = (max decisions + min decisions) / 2
M2 = (max decisions - min decisions) / 2
Your R2 is .68 and your RMSE is 2.2, which, roughly speaking, means the 95% CI around your cycle time predictions is plus or minus 4.4. Your grand mean cycle time is 58.9.
An examination of the residuals indicates nothing is amiss. So it would appear you have a simple linear model relating cycle time and applications completed which explains 68% of the variability observed in your process. (By the way, the X matrix will permit an investigation of the interaction as well. It wasn't significant either.)
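The centering, scaling and single-predictor fit described above can be sketched in pure Python using the data posted earlier in the thread. This is an illustration under the usual ordinary-least-squares formulas, not the software Robert Butler used. The helper name `simple_ols` and the variable names are made up for the example.

```python
import math

# Decisions and cycle time as posted earlier in the thread.
decisions = [999, 928, 966, 1142, 1184, 1178, 1219, 956, 995, 1141]
cycle_time = [61.85, 61.52, 58.80, 55.95, 57.53, 53.83, 53.03, 62.93, 64.07, 59.61]

# Center and scale so the predictor runs from -1 to +1.
m1 = (max(decisions) + min(decisions)) / 2
m2 = (max(decisions) - min(decisions)) / 2
n_decisions = [(d - m1) / m2 for d in decisions]


def simple_ols(x, y):
    """Return (intercept, slope, r_squared, rmse) for the fit y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = syy - slope * sxy          # residual sum of squares
    rmse = math.sqrt(sse / (n - 2))  # two fitted parameters
    return intercept, slope, 1 - sse / syy, rmse


intercept, slope, r2, rmse = simple_ols(n_decisions, cycle_time)
print(f"cycle time = {intercept:.1f} + ({slope:.1f}) * Ndecisions")
print(f"R^2 = {r2:.2f}, RMSE = {rmse:.1f}, rough 95% band = +/- {2 * rmse:.1f}")
```

The "plus or minus 4.4" figure corresponds to about two RMSEs, which is the rough rule of thumb used here rather than an exact prediction interval.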
...and now to some of the other issues:
Normality does not apply to X's or Y's; it applies only to residuals. See p. 23 of Applied Regression Analysis, 2nd Edition, by Draper and Smith for details.
An R2 of 68% just says that, all other things being equal, you have a single-term model which explains 68% of the observed variation. Thus the correlation suggests you would be very unwise to ignore applications completed when trying to impact your cycle time.
As for the utility of the regression: it depends. If you confirm it with additional runs you may find it is more than adequate for your needs. You can find claims in the literature to the effect that a regression must have an R2 of (fill in the blank) before it is useful. These guidelines, without any additional information, are of little value.
R2, by itself, isn't very useful. It is an easily manipulated statistic. You should never judge the utility of a regression equation on the basis of it alone (or on the basis of any other single statistic, for that matter). The only way to use it is in conjunction with other statistics and with a residual analysis.
November 20, 2007 at 8:50 pm #165188
Chris Seider (Participant, @cseider)
If r is equal to the square root of R-squared, how can measurement error impact regression any differently than correlation? It cannot...
I do not know how one can have a "correction for attenuation" formula for imprecision in your measurements (both X and Y). At best you could have one for a constant bias.
November 21, 2007 at 1:45 am #165196
Correlation doesn't guarantee causation, so I wouldn't believe that increasing inventory will reduce cycle time. I used forward stepwise regression and obtained the model below. Note the low VIF values, which mean you don't have much correlation among the X variables (multicollinearity). I don't know why I get different coefficients than you do!?
In any case, it seems like your number of decisions is really a second Y, and you should be looking for other X variables to explain the cycle time and the number of decisions. At least this is what I gather from a cursory look (after one beer, mind you! :) )
The regression equation is
ct = 105 - 0.00409 inv1 - 0.0342 no apps
Predictor   Coef        SE Coef    T       P       VIF
Constant    104.63      11.72      8.93    0.000
inv1        -0.004094   0.002563   -1.60   0.154   1.4
no apps     -0.034161   0.007179   -4.76   0.002   1.4
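The table above can be reproduced from the posted data with a short pure-Python sketch of two-predictor least squares. It solves the 2x2 normal equations for the centered predictors by Cramer's rule. This is a textbook approach chosen for the illustration, not the stepwise routine the poster ran, and the helper names `mean` and `cross` are made up here.

```python
import math

# Data as posted earlier in the thread.
inventory = [2571, 2598, 2572, 2477, 2279, 1978, 1969, 2173, 1944, 1753]
decisions = [999, 928, 966, 1142, 1184, 1178, 1219, 956, 995, 1141]
cycle_time = [61.85, 61.52, 58.80, 55.95, 57.53, 53.83, 53.03, 62.93, 64.07, 59.61]


def mean(v):
    return sum(v) / len(v)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


n = len(cycle_time)
s11 = cross(inventory, inventory)
s22 = cross(decisions, decisions)
s12 = cross(inventory, decisions)
s1y = cross(inventory, cycle_time)
s2y = cross(decisions, cycle_time)
syy = cross(cycle_time, cycle_time)

# Solve the 2x2 normal equations for the slopes (Cramer's rule).
det = s11 * s22 - s12 ** 2
b_inv = (s1y * s22 - s12 * s2y) / det
b_dec = (s11 * s2y - s12 * s1y) / det
b0 = mean(cycle_time) - b_inv * mean(inventory) - b_dec * mean(decisions)

# Residual variance with n - 3 degrees of freedom (three fitted parameters).
sse = syy - b_inv * s1y - b_dec * s2y
mse = sse / (n - 3)
se_inv = math.sqrt(mse * s22 / det)
se_dec = math.sqrt(mse * s11 / det)

# With exactly two predictors, both VIFs equal 1 / (1 - r12^2).
vif = 1 / (1 - s12 ** 2 / (s11 * s22))

print(f"ct = {b0:.2f} + ({b_inv:.6f}) inv1 + ({b_dec:.6f}) no_apps")
print(f"inv1:    SE = {se_inv:.6f}, t = {b_inv / se_inv:.2f}")
print(f"no apps: SE = {se_dec:.6f}, t = {b_dec / se_dec:.2f}")
print(f"VIF = {vif:.1f}")
```

The fitted constant comes out near 105 rather than the 75 quoted in the first post, while the two slopes agree with the original equation.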
November 21, 2007 at 2:00 pm #165212
Robert Butler (Participant, @rbutler)
After HACL's post last night I went back to your data set for another check. I reran the analysis without bothering to normalize the X's, and I ran the data set through a couple of different programs.
1. Without normalization I get the same thing as HACL. What both of our regressions are telling you is that, for the data provided, inventory isn't significant.
2. There are a number of reasons why this may be the case. If you plot cycle time vs. inventory and cycle time vs. decisions, and then do a 3D scatterplot of cycle time against the two, you can see the following:
a. Cycle time increases as inventory increases, and cycle time decreases as decisions increase. Regressions of cycle time against the two separately give equations with signs in the direction you expect; however, the correlation between cycle time and inventory is not significant.
b. For your data, cycle time vs. the two shows that you have a large amount of data where inventory is high, decisions are low, and cycle times are high, and you have some where decisions are high, inventory is low, and cycle time is low. What you don't have is much data where inventory is low and decisions are low, or where inventory is high and decisions are high.
c. If you look at the scatter plots of cycle time vs. inventory and decisions separately, it is easy to see that decisions vs. cycle time is a much tighter plot than cycle time vs. inventory (indeed, it is so clustered that it might as well be two large fuzzy data points).
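The sign behaviour described in (a) can be checked numerically. The sketch below, a pure-Python illustration using the posted data, compares the slope for inventory when it is the only predictor with its coefficient when decisions is also in the model. The function names are made up for the example.

```python
# Data as posted earlier in the thread.
inventory = [2571, 2598, 2572, 2477, 2279, 1978, 1969, 2173, 1944, 1753]
decisions = [999, 928, 966, 1142, 1184, 1178, 1219, 956, 995, 1141]
cycle_time = [61.85, 61.52, 58.80, 55.95, 57.53, 53.83, 53.03, 62.93, 64.07, 59.61]


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def simple_slope(x, y):
    """Slope of y on x when x is the only predictor."""
    return cross(x, y) / cross(x, x)


def joint_slopes(x1, x2, y):
    """Slopes of y on x1 and x2 fitted together (2x2 normal equations)."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


alone_inv = simple_slope(inventory, cycle_time)
alone_dec = simple_slope(decisions, cycle_time)
joint_inv, joint_dec = joint_slopes(inventory, decisions, cycle_time)

# Share of cycle-time variation explained by inventory on its own.
r2_inv = cross(inventory, cycle_time) ** 2 / (
    cross(inventory, inventory) * cross(cycle_time, cycle_time)
)

print(f"inventory alone: slope = {alone_inv:+.5f}, R^2 = {r2_inv:.3f}")
print(f"decisions alone: slope = {alone_dec:+.5f}")
print(f"both together:   inventory = {joint_inv:+.5f}, decisions = {joint_dec:+.5f}")
```

A coefficient that changes sign when a second predictor is added is what one would expect from a weak, non-significant term fitted alongside a correlated predictor.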
What this suggests is that, for the data provided, when controlling for decisions, the effect of inventory is in doubt. One possible interpretation of the results is that your process is relatively insensitive to changes in inventory as long as those changes are within the bounds of your data. A way to check this would be to "fill in the blanks", namely get data with the decision/inventory combinations not represented and/or get data outside the inventory/decision limits of your current data set.
December 6, 2007 at 7:00 am #165772
Andrew Kennett (Participant, @AndrewKennett)
This has been a very good discussion and I thought I'd throw in a couple of comments.
1. Correlations drawn from historical or observational data such as this are necessarily constrained by a lack of range in the data (you are trying to keep your plant running, after all), so you can't make predictions with confidence outside the range of the independent variables used in the analysis. Robert Butler succinctly observed that what you don't have is much data where inventory is low and decisions are low, or where inventory is high and decisions are high. To overcome this problem you can try to run a designed experiment (DOE) with larger ranges of your independent variables. Of course, that isn't always possible in the real world.
2. While it is true that the sign in a simple linear regression with one independent variable (i.e., y = a + bx, where the sign of b is plus in this case) tells you the direction y will move when x changes (in this case y grows when x grows), this is not true with more than one independent variable. So y = a + bx1 - cx2 does NOT mean that y necessarily falls when x2 grows, because in the real world x1 is never constant while x2 is being changed. You could calculate y for a set of frequently occurring (x1, x2) pairs, use that as a ready guide for the expected y, and then use it to explain your observations and see what values of x1 and x2 give you desirable values of y.
December 6, 2007 at 1:42 pm #165779
sixsigmahack: I have frequently seen inventory factors behave counterintuitively in productivity analysis, so this is not abnormal. It can also represent an opportunity for improvement, as inventory is not one factor but usually a mix of many factors that warrant investigation. In one case, regression analysis showed that batches using company inventory (bought cheaper and readily on hand for use by the producers) correlated with LOWER craft productivity than costlier short-order material that had to be unloaded, piled and sorted by the producer craft (not the inventory handlers). Imagine: old-fashioned job-shop purchasing and craft sorting was more process-efficient than a high-tech inventory system using low-cost handlers.
Upon deeper investigation we found that the internal inventory system tended to be missing documentation or be short of something, which consumed entire crews' craft time to find missing items. We also found that the craft could often solve their problems more efficiently working with trusted external suppliers than with their internal customers in the inventory system. Naturally this finding led to many targeted process improvements and changed the way we thought about our costs. Good luck.
December 6, 2007 at 4:03 pm #165790
The R-squared is too low to have captured the significant factors in this situation. Perhaps include more factors.
I use 80% as the lowest level of R-square to accept a model.
December 6, 2007 at 5:04 pm #165794
I agree with HACL that Decisions might represent another "Y", but I have also found that low R-squareds, when paired with p-values of .05 or lower, can represent potentially significant economic relationships warranting deeper investigation. You are not merely analyzing physicals, but are indirectly investigating economics too. Why not be more direct about it!
When doing productivity or any kind of economic analysis I have found it useful to create parallel regressions (multiple views) using some representative dollar factor as a "Y", such as "billable shipments" or, better yet, "gross profits billable", where the latter is a measure of economic value added, or profit generated by the process. Your cost accountant can probably give you this data to correspond with your other period data.
By looking at dollars/profits you might find even more counterintuitive realities, such as slower cycle times correlating positively with profit under certain circumstances, or certain decision inflows relating negatively to profit. Knowing and acting on these distinctions can drive profits north in a hurry. Every operation has a "sweet spot" economically. Part of this project's natural objective may be to find this spot!
December 10, 2007 at 2:58 pm #165943
Bob Kane (Participant, @BobKane)
An alternative analysis, or one that can shed some light on your regression, is the use of Little's Law, which states that:
Cycle Time = WIP / Exit Rate
where WIP takes into account your inventory and Exit Rate is the number of completed applications. A transformation of your data using the inverse of applications might provide a better model.
It would be nice to have more than 10 data points.
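A rough sketch of the two transformations this suggests, using the posted data: regressing cycle time on the inverse of decisions, and on the Little's Law ratio of inventory to decisions. It assumes the posted inventory column, which is lagged one month, is an acceptable stand-in for WIP. That assumption is made here for the illustration and the thread does not confirm it. The helper name `fit_line` is also made up.

```python
# Data as posted earlier in the thread.
inventory = [2571, 2598, 2572, 2477, 2279, 1978, 1969, 2173, 1944, 1753]
decisions = [999, 928, 966, 1142, 1184, 1178, 1219, 956, 995, 1141]
cycle_time = [61.85, 61.52, 58.80, 55.95, 57.53, 53.83, 53.03, 62.93, 64.07, 59.61]


def fit_line(x, y):
    """Return (intercept, slope, r_squared) for the fit y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    return my - slope * mx, slope, sxy ** 2 / (sxx * syy)


# Transformation 1: inverse of the exit rate (applications decided per month).
inverse_decisions = [1 / d for d in decisions]

# Transformation 2: Little's Law ratio, WIP / exit rate, in months of work on hand.
# ASSUMPTION: lagged inventory is used as a proxy for WIP.
wip_over_exit = [inv / d for inv, d in zip(inventory, decisions)]

fit_inverse = fit_line(inverse_decisions, cycle_time)
fit_ratio = fit_line(wip_over_exit, cycle_time)
fit_raw = fit_line(decisions, cycle_time)

for label, (a, b, r2) in (
    ("decisions (raw)", fit_raw),
    ("1 / decisions", fit_inverse),
    ("inventory / decisions", fit_ratio),
):
    print(f"{label:>22}: intercept = {a:.2f}, slope = {b:.4g}, R^2 = {r2:.3f}")
```

Comparing the three R-square values side by side shows whether either transformation actually fits these ten points better than the raw decisions count. With so few observations the comparison is indicative at best.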
The forum ‘General’ is closed to new topics and replies.