What is a standardized residual? How do I calculate it? How do I use it and interpret it? What are its benefits? The answers to these questions and more can be found below.
Overview: What is a standardized residual?
Before you can understand what a standardized residual is, you must first understand what a residual is.
A common use of a residual, also known as a raw residual, is in the context of regression analysis. A residual is the vertical distance from a data point to the prediction line. It is an indication of the error of the prediction. The prediction line is essentially the average of the data points and the residuals are how far each point is away from that average.
In other words, it is the mathematical difference between the observed value of y given the paired x value, and the predicted value of y given that same x value. Large residuals indicate large errors, whereas small residuals indicate that your prediction line is a good fit to your data.
Image source: Stack Overflow.
Notice that some residuals will be negative and some will be positive. A graphical plot of your residuals is an important diagnostic of how well your data fits your predictive model. Your regression may have a high r-square value, or high correlation, but the residual plot may suggest something else.
Now let’s define a standardized residual. A standardized residual is the raw residual divided by an estimate of the standard deviation of the residuals. It’s a measure of the strength of the difference between observed and expected values. Here’s how you calculate the standard deviation of the residuals for a simple linear equation.
The standardized residual is then the ratio of the individual raw residual divided by the standard deviation.
The standardized residual can help you detect outliers. Standardizing residuals is useful because raw residuals might hinder your ability to identify outliers if they have non-constant variance. If residuals of observed values farther from your prediction line have greater variance than residuals with observed values closer to your prediction line, then outliers are harder to detect. By adjusting for this nonconstant variance, all standardized residuals will have the same standard deviation.
As mentioned earlier, your plot of residuals, whether they be raw or standardized, are important indicators of the validity of your regression model. Here is an example of what the typical standardized residual plots look like.
First, look to see if the normal probability plot looks normal. Then, see if the histogram looks normal as well. The other two plots should look random (Versus Fits and Versus Order). If they are, you can have faith that your regression is valid.
Most statistical software will also identify residuals that are 2 or 3 standard deviations away from the prediction line. These are referred to as outliers and represent an unexpected value.
Image source: Pinzhi.org.
Here’s another set of standardized residual plots for some data that seems to have a little problem. What do you see?
The first thing you should see here is that the normal probability plot doesn’t appear to be normally distributed (a straight line), which is a requirement. The histogram confirms that since it does not appear symmetrical around a center value. And finally…bunnies in the clouds? Those graphs should appear as a random pattern.
Your conclusion should be that there is a problem with the data, and although you might have calculated a strong correlation, the predictive model is not valid.
Finally, let’s consider studentized residuals. Here is a great explanation from a Penn State University online course, which says this about studentized residuals:
When trying to identify outliers, one problem that can arise is when there is a potential outlier that influences the regression model to such an extent that the estimated regression function is “pulled” towards the potential outlier, so that it isn’t flagged as an outlier using the standardized residual criterion. To address this issue, studentized residuals offer an alternative criterion for identifying outliers. The basic idea is to delete the observations one at a time, each time refitting the regression model on the remaining n-1 observations. Then, we compare the observed response values to their fitted values based on the models with the ith observation deleted. This produces deleted residuals. Standardizing the deleted residuals produces studentized residuals.
3 benefits of standardized residuals
Now that we’ve explored a number of ways to use residuals, let’s look at the benefits of standardized residuals.
1. Offers a generic measure
One of the problems with using raw residuals was the potential problem of differing variances. By dividing the raw residual by a generic standard deviation, you now have a simple, standardized way to assess your residuals and your regression model.
2. Identifies potential outliers
A statistical analysis of the standardized residuals will help identify potential outliers, which often have an undesired impact on the error and quality of your prediction.
3. Confirms the validity of your prediction model
Even though you might show a strong correlation in your regression model, it is the evaluation of the standardized residual plots that will tell you whether your correlation and model are valid.
Why are standardized residuals important to understand?
An evaluation of any type of residual is important to understand since it’s a measure of the strength of the relationship between your observed values and predicted values.
Measure of error
Residuals are the difference between observed values and predicted values. This error is important to understand so you can evaluate whether your predicted values will provide you with the information you need to assess the usefulness of your predictions.
Least squares regression
The least squares regression method is generally used for developing your prediction line and thus your prediction model. The calculation involves squaring the raw residuals and then moving the line around until the location of the line minimizes the sum of the squared residuals.
Interpretation of the residual plots
“It ain’t over til the residual plots says it’s over” is a good way to describe the importance of understanding the residual plots. There are three graphical plots that are used for the standardized residuals. They are:
- Normality plot (residuals should be normal and have a mean of zero and a standard deviation of 1)
- Plot of the standardized residuals versus predicted values (random)
- Plot of the standardized residual versus the order of the data (random)
An industry example of a standardized residual
A pharmaceutical company was investigating the relationship between time and the continued efficacy of one of their liquid oncology products. Over the course of time, they collected samples and tested the effectiveness of the product to determine the practical shelf life for the product.
Below is the final statistical output and the standardized residual plots. It appears that the prediction model is valid. The next step was for the company to figure out how to extend the shelf life to reduce the amount of product that has to be disposed of because of declining efficacy.
3 best practices when thinking about standardized residuals
The standardized residuals are just a mathematical calculation. Properly collecting data and doing your regression is the part you should be concerned with.
1. Paired data
To do a regression analysis requires that there be a paired set of data. That is, for each x value, there needs to be a corresponding y value.
2. Use statistical software
Doing the calculations of the standard deviation, least squares regression, standardized residuals, and all of the residual plots by hand is not recommended. Fun for some, but not recommended for accuracy.
An unusual point will result in a high residual and can pull your prediction line away from the true center of the data. This could have a negative impact on your prediction.
Most statistical software will highlight any standardized residual that is 2 or 3 standard deviations away from the prediction line. You will want to investigate and decide how to address that data point. A statistical calculation called Cook’s Distance will help you identify statistically significant outliers.
Frequently Asked Questions (FAQ) about standardized residuals
1. What is the difference between a raw residual and a standardized residual?
A raw residual is the mathematical difference between an observed data point and a calculated predicted value for that point. A standardized residual takes that raw residual and divides it by the standard deviation of the total set of residuals.
2. Why is a standardized residual more than 3 standard deviations from the prediction line considered to be an outlier?
Recall that the shape of the standardized residuals should be normally distributed. The probability of a point being 3 standard deviations away or more in a normal distribution is only about 0.3%. Therefore, if you see a standardized residual that far away, you will want to check it out.
3. How do you interpret the graphical plots of standardized residuals?
In the case of simple linear regression, you know that the least squares regression technique strives to draw a line in the center of the data to minimize the squares of the residuals. That’s why the normality plot should show the residuals to be normally distributed with a mean of zero and a standard deviation of 1.
Furthermore, the residuals should be randomly distributed around the prediction line. That’s why the plot of residuals versus fits should appear to be random with no discernable pattern.
Finally, the residuals should be consistent over time and not vary as a result of the order of the data. The residual plot of residuals versus order should look random.
To sum up standardized residuals
A raw residual is the difference between an observed value and a predicted value in a regression or other relevant statistical tool. A standardized residual is the raw residuals divided by an overall standard deviation of the raw residuals. This provides a consistent measure of the error of your prediction.
Before deciding on the value of your prediction model, confirm the validity of your model with the use of residual plots. There are some assumptions about the nature of residuals that must be met. They are: residuals need to be normally distributed with a mean of zero and a standard deviation of 1. They also need to be random with respect to your predicted values and with the order of your data.