# R sq


- This topic has 18 replies, 5 voices, and was last updated 12 years, 8 months ago by Craig.

June 14, 2007 at 11:20 pm #47274
What are considered the good, bad, and the ugly values for r sq values in regression? I know, I know….it depends, but what are the rules of thumb, based on the industry?

Everything I read says `high` or `low`…..and that is?

Thanks!!

June 14, 2007 at 11:37 pm #157467

Practical application (@Practical-application)

R squared is an indicator of explained variance and, in the social sciences, is evaluated using Cohen’s tables: R squared below .10 is low, between .11 and .50 is mid-sized, and greater than .50 is large.

In engineering applications, the more important indicator is the prediction interval and whether that interval is useful for the purposes of the prediction. Even an R squared of .98 can result in a prediction interval that is not useful, particularly if you are targeting a specific prediction as opposed to an average prediction.

June 14, 2007 at 11:46 pm #157470

Thanks… now that begs the question: what is a prediction interval? Is that something I will find in the sigma skill set, or am I off to Google it?

June 15, 2007 at 3:00 am #157474

Practical application (@Practical-application)

Minitab gives you the average prediction interval. Under Regression, click on Options and you will find “prediction intervals for new observations”. Put in the value(s) for which you want the prediction, then click on “prediction limits”. You can then see whether the prediction limits are too wide in the context of the decision that you want to make. This is much more informative from a decision-making point of view than the very general R-squared statistic. The prediction interval is also more meaningful than the more general confidence interval when the equation is used in decision-making processes.

The formula is documented in Draper and Smith or Kutner, Nachtsheim, Neter and Li. The Minitab handbook covers the function in section 2-9. Make sure that the prediction for the new observation is within the range of the values used to develop the regression equation.

June 15, 2007 at 3:25 am #157476

A prediction interval relates to the confidence of predicting an individual observation. This differs from the confidence interval in regression, which applies to predicting the average response at a certain condition. The prediction interval will always be wider.

You are not going to find an absolute answer for the R-square question. There is much more to it than that. You can easily over-fit the model and have a great R-square, but many terms will have poor p-values. The best bet is to look at all aspects of model adequacy: R-squared, Mallows’ Cp, VIF, etc. Once you have arrived at the best model from your data, it is what it is!

June 15, 2007 at 11:55 am #157485

Practical application (@Practical-application)

hacl, there is a difference between a point estimate in prediction and an average prediction. Minitab only provides the average prediction. The formulas are related.

June 15, 2007 at 1:51 pm #157493

Practical application,

Thanks for the reply, but I was saying the same thing. A prediction interval applies to individual observations. (With 95% certainty, where will a single observation fall at a given condition?). Confidence intervals apply to the mean response.

Actually Minitab provides both:

| 95.0% CI | 95.0% PI |
| --- | --- |
| (3.997, 5.187) | (2.833, 6.351) |

The stat guide in Minitab shows the above example. Note how the 95% PI is wider than the 95% CI.
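A minimal sketch of the CI-versus-PI distinction for simple linear regression, using only the standard library. The data are made up for illustration, the function names `fit_line` and `intervals` are hypothetical, and the t critical value for 8 degrees of freedom is hard-coded rather than computed. It is not a reproduction of the Minitab example.

```python
import math

def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns pieces needed for intervals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)  # two estimated parameters
    return b0, b1, mse, xbar, sxx

def intervals(x, y, xh, t_crit):
    """CI for the mean response and PI for one new observation at xh."""
    n = len(x)
    b0, b1, mse, xbar, sxx = fit_line(x, y)
    yhat = b0 + b1 * xh
    spread = 1 / n + (xh - xbar) ** 2 / sxx
    se_mean = math.sqrt(mse * spread)        # mean response at xh
    se_new = math.sqrt(mse * (1 + spread))   # single new observation at xh
    ci = (yhat - t_crit * se_mean, yhat + t_crit * se_mean)
    pi = (yhat - t_crit * se_new, yhat + t_crit * se_new)
    return yhat, ci, pi

# Illustrative data only (assumption, not from the thread).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
T_975_DF8 = 2.306  # two-sided 95% t value for n - 2 = 8 degrees of freedom

yhat, ci, pi = intervals(x, y, xh=5.5, t_crit=T_975_DF8)
print(f"fit = {yhat:.3f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% PI = ({pi[0]:.3f}, {pi[1]:.3f})")
```

The only difference between the two standard errors is the extra `1 +` term, which is why the PI is always the wider of the two.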

June 15, 2007 at 2:03 pm #157494

Practical application (@Practical-application)

My point is that there are two types of PI formula and one CI formula. It makes a difference whether you predict the average Y at x or a specific Y at x. The formula for the latter gives a wider interval, and in some applications that interval may be too large for prediction purposes. Minitab calculates the interval for the average Y at x. The other PI formula is documented in Draper etc.

June 15, 2007 at 2:14 pm #157496

I think we are on the same page for the most part.

Predicting an average Y at x is a CI

Predicting a specific Y at x is a PI (Wider than a CI)

Similar to the central limit theorem, standard error of the mean is smaller than the standard deviation of the distribution for individuals. Sort of the same concept as CI vs PI.

I didn’t know there were 2 formulas for prediction interval. Does this hold true for simple linear regression as well as multiple regression?

Also, Minitab calculates both (PI and CI).

June 15, 2007 at 3:48 pm #157500

To get back to the original question, it depends on what type of process you are performing the regression on. With the type of product we manufacture at our plant, anything over 70% R-Sq adj is good. But if it was a life or death situation it would be significantly higher. Too many people try to outwit each other on here and lose the whole point of the original question.
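For readers unfamiliar with "R-Sq adj", a short sketch of the usual adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors excluding the intercept. The function name and the sample numbers are assumptions chosen for illustration.

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors (intercept not counted)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than fitted parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Same raw R-squared, different model sizes (illustrative numbers only).
r2 = 0.70
few_terms = adjusted_r_squared(r2, n=20, p=1)
many_terms = adjusted_r_squared(r2, n=20, p=5)
print(f"p=1: {few_terms:.3f}")
print(f"p=5: {many_terms:.3f}")
```

The same raw R-squared of .70 is penalized more heavily when it took five predictors to get there, which is one reason a single cutoff on R-squared says little without knowing the model size.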

June 15, 2007 at 4:02 pm #157502

Robert Butler (@rbutler)

To add to what ADB said – I hope that all of your concern with R2 comes after you have actually done a regression analysis. This means plotting the data and looking at trends, data clustering, etc. Plot the residuals against anything that makes sense (predicted, X’s, time, etc.) and examine these plots for adequacy of fit, absence of significant trends and patterns, etc. If you have not done these things before thinking about R2, then all of the “rules of thumb” are worthless. You cannot evaluate a regression on the basis of a single statistic such as R2.
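A small sketch of why the residual check matters, assuming a deliberately mis-specified model: a straight line fitted to data that are exactly quadratic. The data and the helper name `fit_line` are made up for illustration. R squared comes out high, yet the residual signs show an obvious pattern that a plot would reveal immediately.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Curved data (y = x^2) fitted with a straight line: illustrative only.
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

ybar = sum(y) / len(y)
sse = sum(e ** 2 for e in residuals)
sst = sum((yi - ybar) ** 2 for yi in y)
r2 = 1 - sse / sst

# Sign of each residual in x order; a run pattern signals lack of fit.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(f"R^2 = {r2:.3f}")   # R^2 = 0.950
print(signs)               # ++------++
```

An R squared of about .95 would pass most rules of thumb, but the positive-negative-positive run in the residuals says the straight-line model is the wrong shape.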

The post below has more details and an example:

https://www.isixsigma.com/forum/showmessage.asp?messageID=43683

June 15, 2007 at 4:18 pm #157505

Practical application (@Practical-application)

I don’t want to be picky, but they are three interrelated formulas. Look them up in Kutner, around page 100 or so. Good luck.

June 15, 2007 at 4:20 pm #157507

Practical application (@Practical-application)

adb,

Great comment, but it shows that you don’t quite understand what the various formulas stand for and how they are related to each other. This is not a question of nitpicking in a life and death type of situation, as you stated yourself.

June 15, 2007 at 4:41 pm #157509

Practical Application,

If the formulas are short, it might help if you showed them. If I can grab a copy of the text you referenced, I will do so. (Doubtful though since I have too many as it is!!) Such a nerd!!

I think the poster should just realize that R-squared depends on many things, and it represents the amount of variation explained by the model. Building a model is a science and involves many diagnostic checks such as Mallows’ Cp, VIF, R-sq, adjusted R-sq, etc. Once the best possible model is fitted to the data, the experimenter can look at the CI and PI and see the predictability of the model.

As an example, say I build a model for gas mileage based on engine size, car weight, octane level, driving habits, and speed, and I get a model with an R-square of .70. My average gas mileage is 22 MPG for the current combination (I am going 55 MPH, I have 91 octane in my tank, my car weighs 2 tons, I am in the “grandma” driving mode, and I have a V8 engine).

I have 1 gallon left in my tank [and I did my RR study already :-) ]

Since I care about my next observation, and not the average at these conditions, I look at my prediction interval while weaving down the highway. My lower PI is 16 MPG and my upper PI is 28 MPG. The signs on the highway say “Gas at exit A in 17 miles” and “Gas at exit B in 24 miles”. Which exit do you take?

June 15, 2007 at 4:50 pm #157512

Practical application (@Practical-application)

Robert,

Thanks for clarifying this and for emphasizing how important it is not only to understand what the formulas tell you and don’t tell you, but also how to use additional information (the residuals) to troubleshoot the model.

June 15, 2007 at 5:06 pm #157515

Practical application (@Practical-application)

hacl,

Great post. I’ll post the formulas later tonight … unless Robert does me the favor :-).

June 15, 2007 at 6:56 pm #157529

I am getting a little better with analogies in my old age!

Thanks for the positive feedback.

HACL

June 15, 2007 at 10:20 pm #157543

Practical application (@Practical-application)

hacl,

As you correctly pointed out, the formula for a prediction interval has wider bands than that of a confidence interval. The prediction intervals take into account two main components:

1. mean squared error, i.e. the sum of squared errors divided by its degrees of freedom (this is why you want to do the residual analysis: to understand whether sources of unexplained variation should be included);

2. the distance of the x value at which you predict (xh) from the mean of the x values (xbar). The prediction is less precise the farther xh is from the mean (this is one of the reasons why statisticians caution against using R squared alone as a measure of the utility of the regression equation; the precision of a prediction does not depend on the mean squared error only).

There are two scenarios:

A prediction interval for a single new observation Yh(new) when the parameters are unknown, or a prediction of the mean of m new observations at a given xh. In the first case you would predict, for example, the man-hours used for the next production run; in the second, the mean man-hours used over, say, three production runs. The two formulas differ as follows:

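$$\mathrm{MSE}\left[1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}\right]$$ (prediction interval when parameters are unknown)

$$\mathrm{MSE}\left[\frac{1}{m} + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}\right]$$ (prediction of mean of m, in this case m = 3, new observations for given xh)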

In certain financial calculations this may make a difference in assessing, for example, the risk associated with committing resources to an external client. (… and yes, statistics can be hair-splitting … you just have to know when to split hairs, how, and why.)

See Kutner et al., pp. 57–63, or Draper and Smith, pp. 60–63; Kutner uses m, Draper uses q instead. I hope this helps.
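A minimal sketch of the two expressions above as a single function, reading them as the estimated variance of the prediction error. The function name `prediction_variance` and every number in the example are assumptions chosen for illustration. Setting m = 1 gives the single-observation case, and larger m shrinks only the first term.

```python
def prediction_variance(mse, n, xh, xbar, sxx, m=1):
    """MSE * [1/m + 1/n + (xh - xbar)^2 / Sxx].

    m = 1 is the case of one new observation at xh;
    m > 1 is the mean of m new observations at xh.
    sxx is the sum of squared deviations of the original x values.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    return mse * (1.0 / m + 1.0 / n + (xh - xbar) ** 2 / sxx)

# Illustrative inputs only (assumptions, not from the thread).
mse, n, xh, xbar, sxx = 4.0, 25, 12.0, 10.0, 200.0

var_single = prediction_variance(mse, n, xh, xbar, sxx, m=1)
var_mean_of_3 = prediction_variance(mse, n, xh, xbar, sxx, m=3)
# Variance of the fitted mean response at xh, for comparison (no 1/m term).
var_mean_response = mse * (1.0 / n + (xh - xbar) ** 2 / sxx)

print(f"single new observation: {var_single:.4f}")
print(f"mean of 3 new runs:     {var_mean_of_3:.4f}")
print(f"mean response (CI):     {var_mean_response:.4f}")
```

The interval half-width is the appropriate t value times the square root of these quantities, so committing to three production runs gives a noticeably tighter interval than committing to one, and both stay wider than the CI for the mean response.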

June 15, 2007 at 11:25 pm #157547

Very informative and you saved me the cost of a textbook!

Thanks!
