What Should I Do When Error Residuals Are Not Normally Distributed in GLM?

    #55966

    yair

    I’m trying to analyze some experimental data on animal behaviour and need some help or advice regarding which non-parametric test I should use.

    The variables I have are:
    – Response variable: continuous (both positive and negative values)
    – Explanatory variable: a factor with 6 levels
    – Random-effect variable: the same animal performing a behavioural task was measured more than once.

    As I have a random-effect variable, I chose a GLM model. Then, when checking the normality and homoscedasticity assumptions, the Shapiro-Wilk test showed there was no normality, while QQ plots revealed neither patterns nor outliers in my data. So the question is: which non-parametric test would be optimal in this case, knowing that I would like to perform certain a posteriori comparisons (and not all-against-all comparisons): red vs grey; red vs black; red vs light blue; black vs grey?

    My database has lots of zero responses in some conditions. I’ve read that for Student’s t-tests lacking normality due to lots of zeros it’s OK to turn a blind eye to the lack of normality (Srivastava, 1958; Sullivan & D’Agostino, 1992) … is there something similar with GLM?
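
    For reference, once a suitable model has been fitted, comparisons like these can be specified as custom contrasts rather than all-against-all tests. The following is a hypothetical R sketch using the emmeans package; the model, the data frame d, the column names, and the factor-level order are all illustrative assumptions, not the poster’s actual code.

    ```r
    ## Hypothetical sketch of the four a posteriori comparisons as custom
    ## contrasts. All names and the level order are assumptions.
    library(nlme)     # for a model that respects the grouping by animal
    library(emmeans)  # for estimated marginal means and custom contrasts

    ## Assumed data frame `d` with columns response, condition (6-level
    ## factor), and animal; a random intercept per animal is one option.
    fit <- lme(response ~ condition, random = ~ 1 | animal, data = d)

    emm <- emmeans(fit, ~ condition)

    ## Each contrast vector must follow the factor's level order; here the
    ## levels are ASSUMED to be: black, grey, lightblue, red, other1, other2.
    contrast(emm, method = list(
      "red vs grey"       = c( 0, -1,  0, 1, 0, 0),
      "red vs black"      = c(-1,  0,  0, 1, 0, 0),
      "red vs light blue" = c( 0,  0, -1, 1, 0, 0),
      "black vs grey"     = c( 1, -1,  0, 0, 0, 0)
    ), adjust = "holm")  # multiplicity adjustment for the four tests
    ```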

    #202395

    Robert Butler
    Participant

    Your post suggests you have run a statistical test and then, for whatever reason, a QQ plot. What you need to do is run a residual analysis. The heart and soul of a residual analysis is a plot of the residuals against the predicted values and a plot of the residuals on a normal probability plot. You should also look at a histogram of the residuals. If the histogram gives the impression of approximate normality, if the plot of the residuals on normal probability paper passes the “fat pencil test”, and if the plot of the residuals against the predicted values resembles a shotgun blast and does not have any obvious linear, curvilinear, or > or < shape, then it is reasonable to assume that your residuals are “normal enough” and that the results of term significance are acceptable.
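
    As an illustration, since the poster is working in R, a minimal sketch of these three plots might look like the following, assuming a fitted model object named fit (the object name is an assumption):

    ```r
    ## Minimal residual-analysis sketch; `fit` is an assumed fitted model
    ## (e.g. from lm() or nlme::lme()).
    r <- resid(fit)

    par(mfrow = c(1, 3))  # three panels side by side

    ## 1. Histogram: look for the impression of approximate normality.
    hist(r, main = "Histogram of residuals", xlab = "Residual")

    ## 2. Normal probability plot (base R draws the one-sample version
    ##    with qqnorm()): apply the "fat pencil test" -- the points
    ##    should lie close to a straight line.
    qqnorm(r, main = "Residuals on normal probability plot")
    qqline(r)

    ## 3. Residuals vs predicted: should resemble a shotgun blast with no
    ##    linear, curvilinear, or < / > (funnel) shape.
    plot(fitted(fit), r, xlab = "Predicted value", ylab = "Residual",
         main = "Residuals vs predicted")
    abline(h = 0, lty = 2)
    ```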

    If the above approach is unfamiliar to you, I would recommend getting a good book on regression through inter-library loan and reading the chapter on residual analysis.

    There are a couple of other aspects of your post that need some discussion. You state that you have a “Random-effect variable: the same animal performing a behavioural task was measured more than once.”

    That is not a random-effect variable; that is a repeated measure. If, as your post would suggest, you are running an analysis on repeated measures, then basic GLM is not the correct approach, because the program will take the variability associated with the repeated measures and use it as an estimate of the ordinary variation in the data. The end result will be variable significance where none exists. With repeated measures you will need a program that allows you to identify the sources of the repeated measures so that the machine will not incorrectly use the within-animal variability.
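
    As a sketch of what “letting the machine know” can look like in practice, in R’s nlme package, with the data frame d and the column names response, condition, and animal all assumed for illustration:

    ```r
    ## Sketch: a model that identifies the animal as the unit of
    ## independence, so within-animal variability is not mistaken for
    ## the ordinary (between-animal) variation. Names are illustrative.
    library(nlme)

    fit <- lme(fixed  = response ~ condition,
               random = ~ 1 | animal,   # repeated measures grouped by animal
               data   = d)

    summary(fit)  # fixed-effect estimates with the grouping accounted for
    anova(fit)    # test of the 6-level condition factor
    ```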

    You also stated you have an “Explanatory variable: a factor with 6 levels”. Are these levels ordinal or nominal? If they are ordinal – no problem; but if they are nominal, then you will need to build dummy variables and run the analysis on the dummy variables and not on the 6 levels themselves.
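
    For illustration, in R the dummy variables behind a nominal factor can be built or inspected with model.matrix(); the data frame and column names below are assumptions:

    ```r
    ## Sketch: dummy coding of an assumed 6-level nominal factor.
    d$condition <- factor(d$condition)        # declare the variable nominal

    ## One reference level plus 5 dummy (0/1) columns -- these, not the
    ## raw 6 levels, are what the analysis actually runs on.
    head(model.matrix(~ condition, data = d))
    ```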

    #202396

    Yair

    Dear Robert, firstly, thanks for your reply.

    I am testing the normality assumption using a QQ plot of the residuals, not of the dependent variable. I’m sorry if that misled you. I can’t upload images here, so I post the link to another forum where I’ve uploaded them as well as my question; maybe with them, the case becomes clearer.

    https://stats.stackexchange.com/questions/336931/what-should-i-do-when-error-residuals-are-not-normally-distributed-in-glm?noredirect=1#comment636865_336931

    About the random-effect variable: this is not repeated measures, as there is no correlation variable such as time or space. I’m not measuring the same individual at t1, t2, t3, …, tn nor at x1, x2, x3, …, xn (“t” and “x” being time and space). It’s a random-effect variable because I use one individual more than once, and the levels of the one factor I have are randomized; hence, not repeated measures.

    Regarding nominal or ordinal: the explanatory variable is nominal, and I specify that it’s a factor when running the analysis (I’m using R).

    #202397

    Robert Butler
    Participant

    If you are using one individual more than once, then the smallest unit of independence is that individual. Therefore the measures on that individual are repeated. Repeated measures do not have to be across time or in any time order; they only have to be measures within the same unit. In the event that you are not measuring across time, a good guess with respect to the structure of the repeated measures would be that of compound symmetry.
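
    One way to impose a compound-symmetry structure explicitly is nlme::gls() with corCompSymm(); this is a sketch with assumed names. Note that a per-animal random intercept, as in the earlier sketch, implies the same constant within-animal correlation.

    ```r
    ## Sketch: explicit compound symmetry -- a single common correlation
    ## among all measures taken on the same animal. Names are assumed.
    library(nlme)

    fit_cs <- gls(response ~ condition,
                  correlation = corCompSymm(form = ~ 1 | animal),
                  data = d)

    summary(fit_cs)  # reports the estimated within-animal correlation (Rho)
    ```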

    To give you an example of non-time-ordered repeated measures: let’s say you have an experimental design, you will run each of the experiments of that design on the same animal, and you will repeat this process across a series of animals. The experiments guarantee that the factors within the experiments are independent of one another, but the measures across the entire design within a given animal have to be treated as repeated measures. In those instances when I’ve done this, we make sure we randomize the sequence of experiments within each animal, but when we analyze the data we have to make sure the machine knows the measures are repeated and not independent.

    I don’t know R so I can’t comment on the ability of R to discriminate between an ordered variable and one without order. I would assume there has to be some kind of command in R that lets it know the variable is nominal, and perhaps that command constructs dummy variables in the background and runs the analysis on those variables without anything more needed on your end. If you are not sure, then I’d recommend checking to make sure R is handling your nominal variable correctly.
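
    By way of illustration, such a check in R could be as simple as the following sketch (variable names assumed):

    ```r
    ## Sketch: confirm R treats the explanatory variable as nominal and
    ## dummy-codes it in the background.
    is.factor(d$condition)   # TRUE if R sees the variable as a factor
    levels(d$condition)      # the 6 levels, in the order R will use
    contrasts(d$condition)   # default "treatment" contrasts = dummy coding
    ```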

    I understand you are using the QQ plot to check residuals, and that is precisely why I’m questioning the choice. My understanding of QQ plots is that they are used to check whether data from two different sources can be treated as having a common distribution. This isn’t the point of a residual analysis.

    As for testing normality of the dependent variable – there is no need for that, just as there is no need to test the normality of the independent variables. Contrary to what you can find on the web and in blogs from here to eternity, the issue of normality only applies to the residuals. It should also be noted that the issue of residual normality is not one of perfection – it is one of approximation, which is the reason that the key aspects of residual analysis are graphical – specifically, the graphs I mentioned in my initial post.

    #202398

    Robert Butler
    Participant

    Sorry, typo – the sentence “The experiments guarantee that the factors within the experiments are independent of one another but the measures across the entire design within a given animal have to be treated as repeated measures.”

    Should read “The design guarantees that the factors within the experimental design are independent of one another but the measures across the entire design within a given animal have to be treated as repeated measures.”

    #202400

    Robert Butler
    Participant

    For whatever reason I couldn’t get the link to your post on the other site to work last night; however, it seems to be working this morning, so I took a look. The histogram of the residuals gives the impression of approximate normality, and if I had nothing else to go on I’d continue with the analysis. However, since you do have other means at your disposal, I would want to see the results of the graphs I mentioned in my first post.

    In addition to this, you will still need to consider the issue of repeated measures and how to correctly handle them.

    #202401

    Yair

    Dear Robert, thanks again. It helps a lot to have this exchange of ideas. I will address each point separately.

    * I understand the experimental design could be reminiscent of a repeated-measures one, but at the least there would have to be a covariance/correlation matrix. In this case you proposed compound symmetry, which I would use for a randomized block design. Nevertheless, in my case each animal was tested in 6 different conditions of vision, and those conditions were selected randomly before the experiment began. Maybe I’m wrong, and I’m open to suggestions, but I can’t see the repeated measures here. (I would use repeated measures if I tested the same animal, or the same group of animals, maintaining the order of the conditions of vision and not randomizing them.)

    * Random-effect variable: linked to the previous bullet, I included it in the analysis because I have to account somehow for the lack of independence of the data points. Also, I may perform some analysis with the intraclass correlation coefficient.

    * Regarding the analysis of the assumptions itself, before looking at p-values and drawing conclusions: as you can see on the other website, I’ve already done what you suggested (at least partially) – both the histogram of the residuals and the residuals vs. predicted plot – and they look reasonably good.

    But the QQ plot and the Shapiro-Wilk test do not, hence my question. When you suggested doing “a plot of the residuals on a normal probability plot”, did you mean a residuals vs. predicted plot? If not, does it have a specific name?

    #202402

    Robert Butler
    Participant

    1. You said, “in my case each animal was tested in 6 different conditions of vision and those conditions were selected randomly before the experiment began.” In other words, you are using the same animal and testing six different vision conditions on that same animal. This is the same thing I described when I mentioned running an entire experimental design on a single animal, and the issue remains exactly the same – the smallest unit of independence is the animal, which means the measurements within the animal are repeated – not independent.

    Perhaps checking a reference might help. Your situation – with vision conditions substituted for pills, and whatever your measured outcome is substituted for fecal fat – is the one described in the repeated-measures problem on pages 254-259 of Regression Methods in Biostatistics (Vittinghoff, Glidden, Shiboski, and McCulloch).

    There really isn’t much I can add to your comments about the QQ plot – it’s the wrong plot, and you need to plot your residuals on a normal probability plot. I did some checking and can’t find the normal probability plot called by any other name. The normal probability plot is used to evaluate the normality of the distribution of a variable, that is, whether and to what extent the distribution of the variable follows the normal distribution. The selected variable is plotted in a scatterplot against the values expected from the normal distribution.
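
    To make that description concrete, a normal probability plot can be constructed by hand as in this sketch – ordered residuals plotted against the values expected under normality (fit is an assumed fitted model):

    ```r
    ## Sketch: normal probability plot built from first principles.
    r <- resid(fit)                # residuals from an assumed fitted model
    n <- length(r)

    expected <- qnorm(ppoints(n))  # quantiles expected under normality

    plot(expected, sort(r),
         xlab = "Expected normal quantile",
         ylab = "Ordered residual",
         main = "Normal probability plot of residuals")

    ## Rough reference line; points hugging it pass the "fat pencil test".
    abline(a = mean(r), b = sd(r), lty = 2)
    ```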

    As for the failure of the Shapiro-Wilk test – that’s not surprising, nor would it be surprising if the data failed one of the other tests, such as the Anderson-Darling. These tests are extremely sensitive to any deviation from perfect normality. Indeed, it is quite possible to take data generated by a random number generator with an underlying normal distribution and have that data fail one or more of these tests.
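
    A quick simulation illustrates the point; this sketch simply counts how often Shapiro-Wilk rejects samples drawn from a true normal distribution:

    ```r
    ## Sketch: perfectly normal, RNG-generated samples still "fail"
    ## Shapiro-Wilk at roughly the nominal 5% rate.
    set.seed(1)  # reproducibility

    p <- replicate(2000, shapiro.test(rnorm(50))$p.value)
    mean(p < 0.05)  # approximately 0.05: about 1 in 20 perfect samples fails
    ```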

    This extreme sensitivity is the reason that the focus of residual analysis is the visual assessment of the graphical representations of the residuals. If you fail to run the residual analysis the way it should be run and instead rely on summary statistics such as the Shapiro-Wilk test, you will waste a great deal of time worrying about nothing of any consequence.

    Again, as a possible reference to help you better visualize the situation, you should look at pages 33-44 of Fitting Equations to Data (Daniel and Wood, 2nd edition). Appendix 3A provides normal probability plots of random normal deviates for various sample sizes. It clearly illustrates just how non-normal perfectly normal data can look.

    I would recommend that you check out these books. Hopefully, you are in a position to borrow both of the referenced books through inter-library loan.
