Comments on: Linear Regression: Making Sense of a Six Sigma Tool
Six Sigma Quality Resources for Achieving Six Sigma Results
Thu, 21 Mar 2019 01:03:28 +0000

By: Jim Mon, 12 Jun 2017 18:28:58 +0000 For regression analysis, always use R-Sq (adj) instead of R-Sq to determine the goodness of fit, assuming the model has already passed the p-value test.

Three reasons to use R-Sq (adj) instead:
1. it takes into account the number of data points used in the regression model;
2. it takes into account the number of terms in the regression equation;
3. it is more conservative than R-Sq.

However, it is a good practice to compare R-Sq and R-Sq (adj) to be sure they are close in value, as a quick cross-check.
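A minimal sketch of the adjustment behind points 1 and 2 (the standard formula; the R-Sq, n, and p values below are made-up examples, where n is the number of data points and p the number of terms):

```python
def r_sq_adj(r_sq, n, p):
    """Adjusted R-Sq for n data points and p terms in the model.

    Unlike R-Sq, it can decrease when a useless term is added,
    which is why it is the more conservative measure.
    """
    return 1 - (1 - r_sq) * (n - 1) / (n - p - 1)

# The same R-Sq of 0.90 from 10 points: the penalty grows with
# the number of terms in the regression equation.
print(r_sq_adj(0.90, n=10, p=1))   # about 0.8875
print(r_sq_adj(0.90, n=10, p=5))   # about 0.775
```

This also shows why the quick cross-check works: when n is large relative to p, the two values are close, and a big gap signals an over-fitted equation.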

By: Chris Seider Tue, 05 May 2015 15:30:03 +0000 I always tell my folks to look at the p-value and R-sq. If the p-value is less than their typical 5%, R-sq tells them what percent of the variation is explained, but I always tell them to look at the RESIDUALS before getting excited. Make sure there’s no pattern, no big outliers, and that a few other items check out.

Residual analysis is critical to successfully applying the tool.
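A quick sketch of that residual check, using made-up curved data to show the kind of pattern that should stop you from trusting a high R-sq:

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

xs = [1, 2, 3, 4, 5]
ys = [x * x for x in xs]          # curved (quadratic) data
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
signs = ['+' if r > 0 else '-' for r in residuals]
print(signs)                      # → ['+', '-', '-', '-', '+']
```

The sign sequence is U-shaped rather than randomly scattered, so a straight line is the wrong model here even though the fit statistics may look respectable.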

By: Kicab Tue, 05 May 2015 11:59:20 +0000 “if one squares the difference before they are added, two things are achieved:

It cancels the effect of having both positive and negative values
It magnifies (penalizes) the larger errors.”

You haven’t explained why it is necessary to penalize larger errors to find a line that “fits” or explains the data. You could penalize larger errors in many ways: multiply only the large errors by a factor, use any even power (e.g., the fourth, not just the second), or use only errors that are greater than a certain percentage. In addition, by “penalizing” larger errors, those values become substantially more influential in determining the line. The result is that a single point can make the least squares line substantially misleading for most points.

The better explanation for squaring is that it provides additional and tractable statistical benefits: hypothesis testing and confidence intervals using unbiased estimates of the variances. However, this does not mean that the least squares line is the most useful one.

If your purpose is to find a line that captures either the averages or a given proportion of individual values within a predefined limit, other ways of determining that line are better. First, determine why you want to fit a line to the data, and then determine which method(s) will serve that purpose better. You might choose the absolute-deviation approach, or other methods found in the literature.
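To make the single-outlier point concrete, here is a small sketch (illustrative data and a coarse brute-force grid search, not a production fitting routine) comparing the least-squares line with a least-absolute-deviation line:

```python
def fit_by_loss(pts, loss):
    """Brute-force line fit: grid-search slope and intercept
    (step 0.1 over [-4, 4]) minimizing the total loss."""
    grid = [i / 10 for i in range(-40, 41)]
    return min(((b, a) for b in grid for a in grid),
               key=lambda ba: sum(loss(y - (ba[1] + ba[0] * x))
                                  for x, y in pts))

# Five points exactly on y = x, plus one vertical outlier at (5, 20).
pts = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 20)]

ls_slope, _ = fit_by_loss(pts, lambda e: e * e)    # least squares
lad_slope, _ = fit_by_loss(pts, lambda e: abs(e))  # least absolute deviation
print(ls_slope, lad_slope)
```

The squared-error fit is dragged far from the five well-behaved points by the single outlier (slope near 3), while the absolute-deviation fit stays at slope 1, which is exactly the "substantially misleading for most points" behavior described above.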

“Therefore a much more important indicator of the validity of the model is – as always – the p-value.” The p-value is probably the least important indicator of model validity.

Again, depending on the purpose, other criteria that evaluate the extent to which the purpose is met are better indicators. No one (I hope) fits a line to data with the goal of making the p-value significant. Rather, the purpose is most often to predict. Hence, the accuracy and frequency (probability) of correct predictions are better indicators for the prediction purpose.

Check Minitab’s definition of influential points. You will see that one type is a point far from the fitted line in the vertical (Y) direction; its influence is exaggerated by least squares. The other type is a point far from the others in the horizontal (X) direction; that kind will increase R-square and lead to mistakenly significant p-values.
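A quick sketch of the horizontal (X-direction) case: five made-up points with no X–Y relationship at all, plus one far-away X value, produce a very high R-square from the usual least-squares formula:

```python
def r_squared(pts):
    """R-Sq for a simple least-squares line: sxy^2 / (sxx * syy)."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    return sxy ** 2 / (sxx * syy)

flat = [(1, 2), (2, 1), (3, 3), (4, 1), (5, 2)]   # no X-Y relationship
print(r_squared(flat))                 # essentially 0
print(r_squared(flat + [(50, 30)]))    # jumps above 0.99
```

One high-leverage point manufactures an apparently excellent linear fit out of pure noise, which is exactly why influential-point diagnostics matter alongside R-square and the p-value.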

By: Rafael Espinosa Thu, 18 Nov 2010 23:13:51 +0000 Sorry, it should say:

1.5 divided by 6 does not equal 0.75, it equals 0.25.

By: Rafael Espinosa Thu, 18 Nov 2010 23:10:52 +0000 For the R2 equation, shouldn’t you subtract the result of that equation from 1? The way you have it, it doesn’t work out: 1.5 divided by 6 does not equal 0.75, it equals 0.25.

By: silerioj Fri, 23 Apr 2010 07:24:46 +0000 The formula above for R-sq confuses me. I thought R-sq is the ratio of SSR over SST, which simply translates to the percent of the total variation due to the linear relationship between the dependent and independent variables.
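For what it’s worth, the two forms agree, because SST = SSR + SSE. Using the SSE = 1.5 and SST = 6 figures quoted in the comments above:

```python
sst, sse = 6.0, 1.5    # total and error sums of squares from the comments
ssr = sst - sse        # regression sum of squares, since SST = SSR + SSE
print(ssr / sst)       # 0.75
print(1 - sse / sst)   # 0.75: the two forms give the same R-sq
```

So R-sq = SSR/SST and R-sq = 1 − SSE/SST are the same quantity, and both give 0.75 here, not 0.25; the confusion in the article came from omitting the subtraction from 1.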