Transformation in Regression

Six Sigma – iSixSigma Forums Old Forums General Transformation in Regression

Viewing 12 posts - 1 through 12 (of 12 total)


    I am studying some info on regression modeling and the need to validate the underlying assumptions to ensure the precision and accuracy of the parameter estimates.  But I don't get this part…..
    If I was fitting a regression model and the residual analysis indicated the need for a possible transformation, how do I know which variable to transform (I understand one can transform the Y and/or the Xs)?  In addition, once the variable for transformation is determined, do I then simply save the transformed data in a separate column, plug it into the analysis in place of the original data, and rerun and recheck?


    Robert Butler

      As was stated in another recent thread – it depends.  The two basic residual shapes that suggest a need for a transformation are the < and > shapes (residuals fanning out or funneling in as the fitted values increase).  In the case of < the first choice is to take the log of Y.  After logging Y you need to re-run the regression on the log values, and if the log solves the problem then the final regression model will be on the logged values.
      The first choice for transforms when the residual pattern resembles > is some form of an inverse Y, i.e. 1/Y.  Usually, you will look at the residuals and identify the fitted Y value associated with the tip of the > shape.  If it is, say, 100 then the transform would be 1/(100-Yi)**2 where Yi are the individual Y values.
      If these simple approaches don’t work then you have your work cut out for you and it will usually entail a lot of plotting of residuals against X’s and against time order to see what kinds of patterns emerge.  You may find that the final version requires a transform of both the Y and some of the X’s.  To the best of my knowledge there aren’t any simple rules if you have to move to this level of investigation.
      There is one other pattern that does show up from time to time.  When the residuals are plotted against the fitted Y's you can get a plot that looks like a series of lines resembling rock strata.  This occurs when the Y response is constrained to a finite number of distinct categories and, in the course of taking measurements, these values are repeated in the data.  The parallel lines will have a slope of -1.



    That was very helpful, thank you!  So a shape on the residual vs. fitted value plot indicates a potential need to transform the Y using log Y or 1/Y (or another power transformation based on the optimal lambda – I assume?).
    In addition, I have a colleague who keeps putting the different predictors and/or response through various arbitrary transformations during a step-wise regression exercise, looking for a higher Rsq.  Is this an acceptable approach — to simply start manipulating terms arbitrarily?  This doesn't sound like a sound strategy…



    Newbie: First, a little review – the assumption of linear regression is that the results are adequate when the RESIDUALS are normally distributed. Deviations from normality are only seen in the residual plots AFTER attempting to refine a regression model.
    There may be a non-linear relationship between X and Y. This can be addressed by the careful construction of a non-linear model, for example Y = m*log(X) + b. In this case neither X nor Y needs to be transformed before conducting the regression. An alternate method would be to take the logarithm of X to construct another variable, i.e. L = log(X), then do the regression using the equation Y = m*L + b.
    Some response variables are non-normally distributed, and entirely new methods are used to construct and refine a model. Binary logistic regression is one such method, where the response is either 0/1 or pass/fail. Some people make the mistake of converting the data to a series of percentages and conducting regression on those data.
    There are a number of well-known non-normal distributions that arise under different circumstances, and I would expect the response (Y) to show such a non-normal distribution. In these cases, in order to use linear regression, one or more of the variables should be mathematically transformed. Some examples I have seen involve the number of defects per unit area, airborne contaminant concentration, non-centricity of drilled holes, lengths of cut metal, fill height of filled containers, deviations from planarity, most financially based data, computer memory usage, and disk storage usage.
    I like to use the Box-Cox transformation and look at the graph of the effect of different lambda values. I look to see if the optimal values are close to those involving specific transformations.






    Lambda = 2: square
    Lambda = 1: no transformation
    Lambda = 0.5: square root
    Lambda = 0: logarithm
    Lambda = -0.5: reciprocal square root
    Lambda = -1: reciprocal
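A minimal grid search for the Box-Cox lambda, along the lines BTDT describes, might look like the following Python sketch. The profile log-likelihood is the standard normal-model formula, and the lognormal test data is invented for illustration, so the winning lambda should land near 0 (the log transform):

```python
import math
import random

random.seed(2)

def boxcox(y, lam):
    """Box-Cox transform of a single positive observation."""
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam

def boxcox_loglik(ys, lam):
    """Profile log-likelihood of lambda under a normal model (MLE variance)."""
    n = len(ys)
    t = [boxcox(v, lam) for v in ys]
    m = sum(t) / n
    var = sum((v - m) ** 2 for v in t) / n
    return -n / 2 * math.log(var) + (lam - 1) * sum(math.log(v) for v in ys)

# Invented lognormal data: log(Y) is normal, so the optimal lambda
# should come out close to 0, i.e. the log transform.
y = [math.exp(random.gauss(0, 1)) for _ in range(200)]

grid = [l / 10 for l in range(-20, 21)]   # lambda from -2.0 to 2.0 in steps of 0.1
best = max(grid, key=lambda l: boxcox_loglik(y, l))
print(f"optimal lambda on the grid: {best}")
```

In practice one would then check, as BTDT suggests, whether the optimum sits near one of the interpretable values in the table above rather than using the raw numerical optimum.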



     I would caution anyone who would start transforming data to get a better mathematical fit for a regression model without having a reason to do so. Conduct a few experiments on the system to get a feeling for the factors influencing the response. When using historical, happenstance data there are usually many other reasons for non-linear or non-normal behaviour.
    Cheers, Alastair



    Thanks for the feedback, and I would ask for your patience as I summarize my understanding of your and Robert B's counsel:
    So the stepwise or best-subsets techniques are used to determine the desired subgroup of predictors for the future model (via Mallows Cp, R^2, S, etc.), which are then taken forward for regression, which in turn is determined by the data type of the terms and one's understanding of the process.  Once the model has been fitted, it is assessed for statistical significance (ANOVA and P-values for individual predictors), usability (R^2, R^2 adj, S, magnitude and direction of coefficients, etc.), multicollinearity (VIF), and finally, validity of the underlying assumptions (normality, independence, and equal variance or identical distribution of residuals).
    Non-random patterns in the residual plots will indicate when and where the model should be adjusted to improve the accuracy and predictability of the coefficients, including: a need to transform the Y (and/or X) or a need to change the predictors (i.e. add another single or higher-order term)….Yes?
    And finally:

    I read regression does not require the Y to be normal, but rather the residuals to approximate normality….does this mean you can have a Y data set that is non-normal and still approximate normality in the residuals?
    Do curvilinear effects appear in the residual analysis and, if so, what should be plotted to reveal their presence?
    Thank you, thank you, thank you!


    Robert Butler

      Blindly transforming data to push the value of R2 around is a waste of time. Transformations are used to impact the distribution of the residuals and as BTDT noted
      “the assumption of linear regression is that the results are adequate when the RESIDUALS are normally distributed. Deviations from normality are only seen in the residual plots AFTER attempting to refine a regression model.”
      …and the reason you want this is so that you can bring all of the standard tests of significance to bear when you examine the results of your attempt to regress Y on one or more X's.
      If your only objective is to push R2 as high as possible, a far simpler solution is to just run a gross polynomial overfit.  This approach makes as much sense as blind transformations and it is a lot simpler.  In addition, if there are no repeat points, you can build an equation with an R2 = 1.


    Robert Butler

      This is the approach I’d recommend:
      1. Don’t jump in and start generating regression statistics.  First – plot the data – Y against all of the X’s of choice so that you have some idea of what the data looks like and thus what you have to work with.
      2. If it is possible, check your X matrix for collinearity – at a minimum run VIFs and plot the X’s against one another – I prefer 3D to 2D in this case since you can see the relationship between 4 X’s (one X on each of the X, Y, Z axes, then color-code the points with the values of the 4th X).  Run the stepwise regression with the subset of X’s that are sufficiently independent of one another.
      3. Run both stepwise and backward elimination on the data to check for consistency with respect to convergence to a type of model.
      4. Look at the models – and look at the progression of the statistics in the development of the models (simultaneous changes in R2, Sy.x, and Mallows Cp).  You may find an earlier model, and not the final model, is the better fit to the data.
      5. Examine the model(s) – residual plots, lack of fit (lof), etc.
      6. If there are issues with the residual plots or other aspects of the regression analysis, make the appropriate changes (i.e. transforms, runs with and without apparent influential points, etc.) and try again.
      To your final points
       Yes, normality applies to the residuals not to X or Y.
       If there is a curvilinear/linear effect that is missing (unaccounted for by the X’s you used) it will most likely show up in the plot of the residuals against the fitted Y’s in the form of a curve/straight line pattern in the scatterplot.



    Newbie: It looks like you have done a fair amount of reading.  Keep it up – fitting regression and other models to data is an iterative and involved process.  As you keep selecting and removing factors, changing the equation terms and assessing progress, keep looking at the residual plots for any non-random patterns in the residual error of the model.  The ideal case is one where the errors in the model are smoothly distributed over the prediction interval in time and response.
    Curvilinear effects will be seen in the fits vs. residuals plot.  For example, all the residuals are high for large and small values of Y, but low for intermediate values.  In other words, the plot will look like a boomerang.
    Minitab has an option to generate all residual plots each time you run the regression.  All of them should look random.
    As Robert says, you can always increase your R^2 by adding more factors.  Don’t make a model more complex without a decent reason to do so.  The objective is to fit the simplest model that adequately fits your data.
    Cheers, Alastair



    BTDT and Robert,
    Thank you so much!  That was tremendously helpful.  I will take all your advice moving forward and again, thank you for taking the time to answer a bunch of questions!



    Hey Alastair,
    I was re-reading your response and realized I didn't fully understand the last point….keeping a model as simple as possible (i.e. minimal number of terms and straightforward mathematics, I am assuming) while still having it prove useful….sooooo, where is that line between maxing out the R^2 value and keeping the number of terms to a reasonable level?  Is this where the Mallows Cp comes in?  Thanks again!


    Robert Butler

      If you are running stepwise regression there will be times, especially with non-designed data, when the machine will grind on adding terms and providing you with a model that is an overfit of the data. 
      One way to guard against this is to watch R2, Sy.x, and Mallows Cp and see how they behave as a unit.  The “symptoms” of an overfit can be many and varied but the two I’ve seen most often are as follows:
    1. For each term up to the Nth term Sy.x sees a substantial (yes, it is a judgment call) decrease as each term is added, R2 increases, and the Cp will either decline with each addition, or decline but at a more gradual pace.  At the Nth+1 term R2 will again increase but the reduction in Sy.x will be much less than was seen in previous steps.  Mallows Cp will continue to drop.  With each addition of another significant term past the Nth you will see the same thing – the key being the changes in Sy.x are much smaller than before until the stepwise process finally grinds to a halt.
    2. Same as above but now it is the Cp which seems to be running things.  After the Nth term both Sy.x and R2 see very small decreases and increases however Cp continues to see big changes with each new term past the Nth term.
      When it comes to regression the issue is the amount of variation in the data that is explained by the model.  Consequently, I watch the above statistics in the order of Sy.x, R2, Cp.
      When either 1 or 2 happen I will go back to the model where the change in reduction of Sy.x shifted from major to minor and I will take that model and the final model and run a regression analysis on both.  I will compare their respective residual plots – looking for all of the usual suspects – trends, influential data points, etc., lof – in general and with respect to subsets of the data that might be of interest to me, and predictive ability across the range of the data. 
      If my regression analysis turns up things like data points that appear to be highly influential, independent variables whose missing data change the structure of the data, etc., I will take appropriate actions and re-run everything.
      At the end of the effort, if I still have situations like 1 or 2 and if I can’t find any major difference between the two models I will opt for the simpler version.
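The statistics Robert watches can be computed side by side for a set of nested models. In this Python sketch the data is invented – y truly depends only on x1, while x2 and x3 are junk predictors – so R2 creeps up as terms are added while Sy.x barely improves; Cp is computed against the full-model error variance:

```python
import random

random.seed(5)

def ols_sse(X, y):
    """Residual sum of squares from least squares via the normal equations
    (X already includes the intercept column); Gaussian elimination."""
    n, k = len(X), len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         for a in range(k)]
    c = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    for col in range(k):                       # forward elimination, partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for j in range(col, k):
                A[r][j] -= f * A[col][j]
            c[r] -= f * c[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):             # back substitution
        beta[r] = (c[r] - sum(A[r][j] * beta[j] for j in range(r + 1, k))) / A[r][r]
    return sum((y[i] - sum(X[i][j] * beta[j] for j in range(k))) ** 2
               for i in range(n))

# y really depends only on x1; x2 and x3 are junk predictors.
n = 60
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 2 * x1[i] + random.gauss(0, 1) for i in range(n)]

full_X = [[1.0, x1[i], x2[i], x3[i]] for i in range(n)]
s2_full = ols_sse(full_X, y) / (n - 4)         # error variance from the full model

ybar = sum(y) / n
ss_tot = sum((yi - ybar) ** 2 for yi in y)
results = {}
for name, cols in [("x1", [x1]), ("x1+x2", [x1, x2]), ("x1+x2+x3", [x1, x2, x3])]:
    p = len(cols) + 1                          # parameters, incl. intercept
    X = [[1.0] + [col[i] for col in cols] for i in range(n)]
    sse = ols_sse(X, y)
    r2 = 1 - sse / ss_tot                      # can only go up as terms are added
    s = (sse / (n - p)) ** 0.5                 # Sy.x, the residual standard error
    cp = sse / s2_full + 2 * p - n             # Mallows Cp
    results[name] = (r2, s, cp)
    print(f"{name:10s} R2={r2:.4f}  Sy.x={s:.3f}  Cp={cp:.2f}")
```

Watching the three columns together, rather than R2 alone, is exactly the guard against overfitting described above: the junk terms nudge R2 up, but they do not buy a worthwhile drop in Sy.x.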



    You were reading my mind – how do you choose between simplicity and model effectiveness?  But you explained it very well, and walking through how you balance those various measures while still keeping an eye on simplicity was very helpful.  Thanks guys!!


The forum ‘General’ is closed to new topics and replies.