Viewing 5 posts - 1 through 5 (of 5 total)


    I understand that before checking for normality we have to identify any outliers using a box plot and eliminate them. My question is:
    If the box plot shows one outlier and I eliminate it from the data, should I check for outliers again using the box plot? If there are outliers again, should these also be eliminated? If so, should this be repeated until the box plot shows no outliers?


    Robert Butler

      You don’t check for normality, or lack thereof, by using a box plot and you don’t just arbitrarily discard points labeled as outliers on a box plot or on any other kind of plot. A box plot is a visual way to summarize the critical measures of a distribution so that they can be easily viewed and understood in relation to one another.
      An outlier in a boxplot is a data point that meets certain criteria with respect to its distance from the median relative to the IQR (or distance from the mean relative to the SE). If you discard those points labeled as outliers and replot, you will probably find another group of data points that have taken their place. In theory you could go on discarding in this fashion until you had nothing left at all.
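      To make the "discard and replot" effect concrete, here is a minimal Python sketch. It assumes the common box plot convention of flagging points more than 1.5 times the IQR beyond the quartiles, which may differ from the rule your software uses. The function names and the sample data are made up for illustration.

```python
import statistics

def boxplot_outliers(data, k=1.5):
    """Return the points lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

def trim_rounds(data, k=1.5):
    """Repeatedly discard flagged points; return (points dropped per round, what is left)."""
    remaining = list(data)
    history = []
    while len(remaining) >= 4:
        flagged = boxplot_outliers(remaining, k)
        if not flagged:
            break
        history.append(flagged)
        remaining = [x for x in remaining if x not in flagged]
    return history, remaining

# Hypothetical sample: 17 looks fine until 100 has been removed.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 100]
history, remaining = trim_rounds(sample)
print(history)
print(remaining)
```

      With the quartile method that `statistics.quantiles` uses by default, the first pass flags only 100. Once it is gone the fences tighten and 17 is flagged in its place, which is the replacement effect described above.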
      If you wish to visually check for normality and visually identify points that may or may not be suspect, use a normal probability plot.  If you want to test the impact of a particular data point on your normality assumptions run Anderson-Darling or Shapiro-Wilk.  The Anderson-Darling test will be more sensitive to issues in the tails of the distribution.  Only after formal testing and actual investigation of the data point should you attempt to address the issue of discarding data.
      I never discard a data point unless I have actual physical proof that it was in error.  The fact that it was odd or seems to be far away from my distribution is interesting but by itself it is not grounds for elimination.  I do pay attention to the effect such data is having on my analysis and, if necessary, I run the analysis with and without the data in question.
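      As a rough illustration of running the analysis with and without a suspect point, the sketch below computes the Anderson-Darling statistic for normality using only the Python standard library. The data are invented, and the 5% cutoff of 0.752 for the small-sample adjusted statistic is an assumption taken from commonly published tables for the case where the mean and standard deviation are estimated from the sample. Check it against your own reference or software before relying on it.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(data):
    """Adjusted Anderson-Darling statistic for normality (mean and SD estimated)."""
    x = sorted(data)
    n = len(x)
    dist = NormalDist(mean(x), stdev(x))
    cdf = [dist.cdf(v) for v in x]
    # Pair the i-th smallest value with the i-th largest; the logs weight the tails heavily.
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - total / n
    # Small-sample adjustment, so one tabulated cutoff can be used for any n.
    return a2 * (1 + 0.75 / n + 2.25 / n**2)

CUTOFF_5PCT = 0.752  # assumed tabulated value, see the note above

# Hypothetical measurements; 14.0 is the point under suspicion.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 14.0]
ad_with = anderson_darling_normal(data)
ad_without = anderson_darling_normal([v for v in data if v != 14.0])
print(f"with suspect point:    {ad_with:.3f}")
print(f"without suspect point: {ad_without:.3f}")
```

      A large drop in the statistic when the point is left out shows how much that one value drives the normality result. It does not by itself justify discarding the point.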



    I do agree with Robert: only remove an outlier when you are absolutely sure why it actually is an outlier and that the reason for being an outlier has nothing to do with the issue itself. In my experience, outliers can tell me more about a solution than the normal data does. When you have removed an outlier, you should rerun the normal probability analysis to see if “new” outliers are shown. The Anderson-Darling (A.D.) test helps you with this. At a later stage you might want, or have, to repeat this based on outliers found during a residual analysis.



    There are many ‘sources’ of outliers. Eliminating extreme values from a normal distribution is not a good idea because, as Robert mentions, one data point might well be replaced with another, ad infinitum.
    However, in my opinion it’s worth checking for any defective data before checking for normality. By defective I mean software default values due to ‘overflow’, or zeroes caused by things such as non-working days, electronic shorts or no contact, or anything botched!
    Failing to take such precautions within the context of your study may well lead you up the garden path, for example by omitting to delete shorts and then not meeting the requirement of homogeneity in a t-test or ANOVA study.
    Accordingly, I always check my data using boxplots and stem-and-leaf plots to look for zeroes and other non-random numbers.
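    A quick programmatic screen can complement the plots. The following is only a sketch: the sentinel value of 0.0, the 20% repeat threshold, and the readings themselves are all hypothetical and would need to be chosen for your own measurement system. It only lists candidates for investigation and deletes nothing.

```python
from collections import Counter

def screen_defective(values, sentinels=(0.0,), max_repeat_share=0.2):
    """Return (sentinel_hits, repeated) so the points can be reviewed by hand."""
    # Positions of known placeholder values such as zeroes.
    sentinel_hits = [(i, v) for i, v in enumerate(values) if v in sentinels]
    # Values that repeat suspiciously often, e.g. an overflow default.
    counts = Counter(values)
    limit = max_repeat_share * len(values)
    repeated = {v: c for v, c in counts.items() if c > limit and v not in sentinels}
    return sentinel_hits, repeated

# Hypothetical readings: two zeroes and a repeated overflow default of 9999.0.
readings = [12.1, 11.8, 0.0, 12.4, 9999.0, 9999.0, 9999.0, 12.0, 11.9, 0.0]
hits, repeated = screen_defective(readings)
print(hits)
print(repeated)
```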
    By the way, consider modelling some data using Minitab because this, in my opinion, is by far the best way to learn more about statistics and statistical assumptions, and it might well allow you to ‘validate’ your approach ‘a priori’ by typing in some ‘what if’ data. In the past, I’ve found this most helpful.



    Graphical techniques are great for determining whether a possible outlier exists for investigation; however, there are other methods to determine whether a data point is a true outlier or not. Grubbs' test uses the difference between a point and the mean and establishes a cutoff which, like all hypothesis tests, returns an answer at a specific level of confidence. The null hypothesis is that there are no outliers in the data, versus the alternative that at least one point is a true outlier (assuming the data is normally distributed).
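    The statistic is:
    $$G = \frac{\max_i \lvert Y_i - \bar{Y} \rvert}{s}$$
    and the critical value it is compared against is generated with the following (reject the null when G exceeds it):
    $$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t_{\alpha/(2N),\,N-2}^{2}}{N - 2 + t_{\alpha/(2N),\,N-2}^{2}}}$$
    Where $\bar{Y}$ is the sample mean, $s$ is the sample standard deviation, N is the sample size, and t is the t distribution with the appropriate risk and degrees of freedom.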
    Hope this helps, as well as all the other (great) advice.
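    For anyone who wants to try Grubbs' test by hand, here is a minimal Python sketch of the statistic and critical value described in this post. The Python standard library has no t distribution, so the t quantile is passed in as a number. The value 3.833 used below is an assumption read from a standard t table (upper tail 0.05/20 = 0.0025, 8 degrees of freedom), and the measurements are invented.

```python
import math
import statistics

def grubbs_statistic(data):
    """Return (G, suspect): the largest absolute deviation from the mean, in SD units."""
    m = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (N - 1 denominator)
    suspect = max(data, key=lambda y: abs(y - m))
    return abs(suspect - m) / s, suspect

def grubbs_critical(n, t_crit):
    """Critical value for sample size n, where t_crit is t(alpha/(2N), N-2)."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))

# Hypothetical measurements, N = 10.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 14.0]
T_CRIT = 3.833  # assumed table value for alpha = 0.05, N = 10

g, suspect = grubbs_statistic(data)
g_crit = grubbs_critical(len(data), T_CRIT)
print(f"G = {g:.3f}, critical value = {g_crit:.3f}, suspect = {suspect}")
if g > g_crit:
    print("Null rejected: investigate the suspect point before deciding anything.")
```

    The test only tells you the point is statistically unusual under a normal model. Whether to keep it is still a matter for the physical investigation discussed earlier in the thread.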

