Linear regression



    Joaquin Camacho

    I have the following situation:
        I have been trying to run a simple linear regression on about 4,000 data points. What I get is a big cloud of points, a regression line, and a very low R sq. To minimize this effect I created sub-groups along the range, and when I plot their averages I get a nice line with a high R sq. The question is: is it valid to sub-group the raw data for both X and Y and run the regression on the averages of these variable-size sub-groups?
    Any thoughts?


    Robert Butler

      You can regress anything against anything so from a strictly mechanical standpoint you can run a regression against averages.  The more important question is what is the physical justification for doing this?  A reading of your post gives the impression that your primary concern is the R2 of the final regression model and not the utility of the model itself.  If this is indeed your concern then it would be a mistake to build a regression model on mean responses.
      A regression is nothing more than a geometric fit of a line through a cloud of data points.  Least squares regression generates a line that minimizes the sum of squares of all of the vertical differences between the data points and the regression line.  In short, a least squares regression line is nothing more than a map of the location of the mean values of Y’s associated with the various values of the X’s.  If you take averages of your Y values at each X and then run the regression, you will automatically see a jump in R2 and in all of the other statistics associated with the fit of the regression line, for the simple reason that you will have hidden most of the vertical differences between the individual data points and the regression line.  In other words, you will have hidden your measurement uncertainty from your regression package.
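A minimal Python sketch of this effect, on synthetic data only: the slope, noise level, number of bins, and the helper `fit_line` are all assumptions chosen for illustration, not the original poster's data. It fits a line to 4,000 noisy points, then to the averages of 20 sub-groups taken along the X range.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b, sxy ** 2 / (sxx * syy)

# Synthetic stand-in for the 4,000 points: weak trend, large scatter (assumed values).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(4000)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 5) for x in xs]

_, slope_raw, r2_raw = fit_line(xs, ys)

# Sub-group along the X range into 20 equal-width bins, then average X and Y per bin.
n_bins = 20
bins = [[] for _ in range(n_bins)]
for x, y in zip(xs, ys):
    bins[min(int(x / 10 * n_bins), n_bins - 1)].append((x, y))
mean_x = [sum(p[0] for p in b) / len(b) for b in bins if b]
mean_y = [sum(p[1] for p in b) / len(b) for b in bins if b]

_, slope_avg, r2_avg = fit_line(mean_x, mean_y)

print(f"raw data : slope={slope_raw:.3f}  R2={r2_raw:.3f}")
print(f"bin means: slope={slope_avg:.3f}  R2={r2_avg:.3f}")
```

In a run like this the two slopes should come out close to each other while the R2 on the bin means is far higher. The process scatter has not changed; averaging has only removed it from what the regression sees.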
     To see just how dangerous this can be, try this extreme approach: split the X values into two groups, X(high) and X(low).  Assign the arbitrary values of -1 for X(low) and 1 for X(high).  Take the average of the Y’s associated with the X’s in each group.  You now have exactly two data points, (Xlow, Y1Avg) and (Xhigh, Y2Avg).  A linear regression against these two points will give you an R2 of 1 with zero error.
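The two-group extreme can be tried directly. This is a sketch on synthetic data (the trend, noise level, and the median split point are assumptions for illustration): X is split into a low and a high half, coded -1 and 1, and a line is fitted through the two group averages.

```python
import random
import statistics

# Synthetic noisy data with a weak trend (assumed values, illustration only).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(4000)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 5) for x in xs]

# Split at the median of X and average the Y's in each half.
cut = statistics.median(xs)
y_low = statistics.mean([y for x, y in zip(xs, ys) if x <= cut])
y_high = statistics.mean([y for x, y in zip(xs, ys) if x > cut])

# Two coded data points: (-1, Y1Avg) and (1, Y2Avg).
coded_x = [-1.0, 1.0]
mean_y = [y_low, y_high]

# Ordinary least squares on the two points.
mx = statistics.mean(coded_x)
my = statistics.mean(mean_y)
slope = (sum((x - mx) * (y - my) for x, y in zip(coded_x, mean_y))
         / sum((x - mx) ** 2 for x in coded_x))
intercept = my - slope * mx

# Residual and total sums of squares, then R2 = 1 - SSE/SST.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(coded_x, mean_y))
sst = sum((y - my) ** 2 for y in mean_y)
r2 = 1 - sse / sst

print(f"SSE={sse:.2e}  R2={r2:.6f}")
```

A line through two points always fits exactly, so the R2 of 1 says nothing about how well X predicts an individual Y.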
      If you actually have a situation where you will be making decisions on average values of your response (Y), and where your uncertainty concerns average responses rather than individual Y measurements, then it would be appropriate to develop a regression model using individual X’s and average Y’s.  To do this you will want to make sure that you have about the same number of Y responses at each X in question.  You will also want to check the variability of the Y’s at each X.  Note that there is no averaging of the X values.
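As a rough sketch of those two checks, assuming a hypothetical data set with five fixed X settings and eight replicate Y readings at each (all values synthetic), one could tabulate the count and standard deviation of Y at each X before fitting the line to the average Y's:

```python
import random
import statistics
from collections import defaultdict

# Hypothetical data: 5 fixed X settings, 8 replicate Y readings at each (synthetic).
rng = random.Random(1)
x_levels = [10, 20, 30, 40, 50]
data = [(x, 3.0 + 0.2 * x + rng.gauss(0, 1.5)) for x in x_levels for _ in range(8)]

# Collect the Y's observed at each individual X (the X's are NOT averaged).
groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)

# Per-X summary: count, mean, and standard deviation of Y.
summary = [(x, len(ys), statistics.mean(ys), statistics.stdev(ys))
           for x, ys in sorted(groups.items())]
for x, n, ybar, sd in summary:
    print(f"X={x:>3}  n={n}  mean Y={ybar:6.2f}  SD={sd:5.2f}")

# Rough balance checks: ratios near 1 mean similar counts and similar spread.
counts = [n for _, n, _, _ in summary]
sds = [sd for _, _, _, sd in summary]
count_ratio = max(counts) / min(counts)
sd_ratio = max(sds) / min(sds)

# Least-squares line through (X, average Y).
xs = [x for x, _, _, _ in summary]
ybars = [ybar for _, _, ybar, _ in summary]
mx, my = statistics.mean(xs), statistics.mean(ybars)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ybars))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx
print(f"count ratio={count_ratio:.2f}  SD ratio={sd_ratio:.2f}  "
      f"slope={slope:.3f}  intercept={intercept:.3f}")
```

What counts as "about the same" for the counts and the spread is a judgment call; the ratios here are only a quick screen, not a formal test.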


    Chip Hewette

    First, think through the 4,000 data points.  That’s a lot of info!  Chances are very high that you have many contributing causes to the measured response.  Assigning all 4,000 responses to one proposed factor is inappropriate in my view.
    Your thought to simplify the data structure is mathematically suspect, as related by the earlier response to your question.  However, subdividing the data is indeed where I would go…except I would subdivide by additional elements of interest.  What other things are affecting the response?  Day of the week?  Outside temperature?  Shift of the factory?  Operator?
    I suggest a team approach to identifying likely causes and a historical data exploration along those lines first.  Based on data type, your analyses could be multiple linear regression, ANOVA, chi-square, or logistic regression.
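    To illustrate why an unidentified factor can leave a simple regression with a low R sq, here is a sketch on synthetic data. The factor `shift`, its offsets, and the noise level are all hypothetical. As a simple stand-in for a multiple regression, it fits a separate least-squares line within each shift and compares the share of total variation explained.

```python
import random

# Synthetic data: Y depends on X and on a hypothetical categorical factor "shift".
rng = random.Random(0)
shift_offset = {"day": 0.0, "evening": 20.0, "night": 40.0}  # assumed offsets
rows = []
for _ in range(3000):
    shift = rng.choice(list(shift_offset))
    x = rng.uniform(0, 10)
    y = shift_offset[shift] + 1.0 * x + rng.gauss(0, 1)
    rows.append((shift, x, y))

def sse_of_line(points):
    """Residual sum of squares of the least-squares line through (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in points)

all_points = [(x, y) for _, x, y in rows]
grand_mean = sum(y for _, y in all_points) / len(all_points)
sst = sum((y - grand_mean) ** 2 for _, y in all_points)

# One line for everything: the shift effect ends up in the residuals.
r2_x_only = 1 - sse_of_line(all_points) / sst

# One line per shift: the shift effect is accounted for.
sse_by_shift = sum(
    sse_of_line([(x, y) for s, x, y in rows if s == shift])
    for shift in shift_offset
)
r2_with_shift = 1 - sse_by_shift / sst

print(f"X only       : R2={r2_x_only:.3f}")
print(f"X within shift: R2={r2_with_shift:.3f}")
```

    Unlike averaging, this raises R sq by explaining variation in the individual measurements, so the scatter that remains is still visible to the model.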
    If there is any interest in performing the simple linear regression on the 4,000 data points, one can simply take a random sample of that dataset and look at the scatter plot.  Take 10% of this dataset randomly and see what you see.  Chances are high that the results will be no different, but the plot will be easier to display to your audience.
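    A quick way to try the 10% idea, sketched on synthetic data (the trend, noise level, and the helper `fit_line` are assumptions for illustration):

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b, sxy ** 2 / (sxx * syy)

# Synthetic stand-in for the full 4,000-point data set (assumed values).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(4000)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 3) for x in xs]

# Draw a simple random sample of 10% of the rows, without replacement.
idx = rng.sample(range(len(xs)), k=len(xs) // 10)
sample_x = [xs[i] for i in idx]
sample_y = [ys[i] for i in idx]

_, slope_full, r2_full = fit_line(xs, ys)
_, slope_sample, r2_sample = fit_line(sample_x, sample_y)

print(f"full data : n={len(xs)}  slope={slope_full:.3f}  R2={r2_full:.3f}")
print(f"10% sample: n={len(idx)}  slope={slope_sample:.3f}  R2={r2_sample:.3f}")
```

    Because the sample keeps individual points rather than averages, the scatter is preserved, so the slope and R sq should land near the full-data values.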



          Chip & Robert, I think your comments hit the nail on the head.



    No.  You are fooling yourself using your method. The truth is in the original data. The low R-square means you haven’t yet identified what is creating variation in the response variable (the Y).
    The regression line using the averaged data may be somewhat close to that using the full data, but you certainly can’t ignore the fact that the values you are using are means, and not individual measurements.
