iSixSigma

Correlation

Six Sigma – iSixSigma Forums Old Forums General Correlation

Viewing 24 posts - 1 through 24 (of 24 total)
  • Author
    Posts
  • #50321

    Sachin Dhavle
    Member

    To find the correlation between two variables, we calculate the correlation coefficient. The value showed that there is weak correlation between the two variables. Then we found that there are outliers in the dataset. After removing the outliers we recalculated the correlation coefficient, and the recalculated value was less than the previous one. That means that after removing the outliers, the correlation got weaker.
    Is there a problem in the data, or is there a statistical problem?

    0
    #172936

    Kannan
    Participant

    Hi there. This is an example of negative correlation: as you keep removing the extreme values, the coefficient decreases. It could also be because of non-linearity in the x and y data sets. I would not be surprised if the coefficient gets near 0 and then starts increasing again after you remove many data points.

    0
    #172937

    Mikel
    Member

    Say what?
    Negative correlation refers to the slope of the line.
    I suspect what he is calling an outlier isn’t.
    How did you decide it was an outlier?

    0
    #172938

    Kannan
    Participant

    I guess he means the extreme values. Stan, then what is the reason for the coefficient to decrease, assuming that he is removing the extreme values?
     

    0
    #172941

    Sachin Dhavle
    Member

    Outliers mean extreme points. These are the points which fall outside the 3-sigma limits on a control chart.

    0
    #172946

    Clive
    Participant

    What happens if you keep just the outliers?
    Is it that there is a relationship, but only at the extreme values?
    Without seeing the data and knowing how many data points you have, it sounds as though you don’t have any correlation, or if you do, it’s an odd shape.

    0
    #172948

    Mikel
    Member

    What is an outlier in correlation? His definition of beyond 3 sigma is wrong.

    0
    #172949

    Chris Seider
    Participant

    I know there are those who advocate removing points with standardized residuals beyond +/- 3, but that is using the regression tool, which I suspect my friend, colleague, and mentor Stan is trying to get you to realize.
    However, I don’t subscribe to this approach of throwing out points just to get a better regression. Points shouldn’t be thrown out unless the special cause is known. I especially would not throw out points just to get a statistically significant regression, because the slope and intercept become suspect if residuals are continually thrown out until a “good” regression appears.
    Have you considered the first thing you should be doing: creating scatterplots of your input vs. your output to understand the relationship?
    Good luck….

    0
    #172950

    Taylor
    Participant

    Good Lord, Help me keep from screaming…….
    Sachin
    Correlation does not prove relationships; it only shows the strength of the relationship between two variables. The stronger the relationship, the greater the likelihood that a change in one variable will accompany a change in the other. So to answer your question, there is a strong(er) relationship between your outlier points and the rest of your data points, and just because they are outside 3 sigma, or are extreme points, does not exclude them from the correlation.
    A negative r implies that as one variable (X) increases, the other variable (Y) decreases. A positive r implies that as one variable (X) increases, the other variable (Y) also increases.
    If your r value is close to zero, then you have a Non-Linear Relationship. Your data points on a scatterplot will appear to be in a U shape or vice versa.
    When interpreting your results, keep in mind that if you remove the “outliers” you must have an assignable cause and know that these data points are bad or do not represent normal operation of your process. If this is the case, then your data are pointing to one variable (X) having little or no effect on the other variable (Y).
    Hope this helps
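    As a quick illustration (my own sketch, not from the poster’s data): a symmetric U-shaped relationship such as y = x² gives a Pearson r of zero even though the relationship is strong, because r only measures the linear part of the association.

```python
# Illustrative sketch: strong non-linear relationship, zero Pearson r.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi * xi for xi in x]  # y = x^2: a perfect (but non-linear) relationship

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sx = sum((a - mean_x) ** 2 for a in x) ** 0.5
sy = sum((b - mean_y) ** 2 for b in y) ** 0.5
r = cov / (sx * sy)
print(r)  # 0.0 -- the symmetry cancels the linear component entirely
```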

    0
    #172955

    DaveS
    Participant

    Chad,
    Your statement, “If your r value is close to zero, then you have a Non-Linear Relationship. Your data points on a scatterplot will appear to be in a U shape or vice versa.” is absolutely incorrect. Better scream some more. Otherwise a good summary.

    0
    #172956

    Taylor
    Participant

    DaveS, point taken. Circle-triangle conversation. As I should have stated in my last post, a strong relationship other than linear can exist, yet r can be close to zero. If you understood non-linear relationships then you would have known the point I was trying to make.

    0
    #172959

    DaveS
    Participant

    Chad, do a little more study. You are still off track. Hint: zero correlation is inherent in the definition of independent variables. A zero correlation coefficient means that no linear relationship exists. It does not imply that ANY other relationship exists whatsoever. It may be that a non-linear relationship does exist and happens to have a zero coefficient, as you suggest, or the variables may be independent. Your post about the data necessarily looking U-shaped is absurd.

    0
    #172996

    Taylor
    Participant

    DaveS,
    I agree with you. However, assuming the variables are normally distributed is misleading, as a strong non-linear relationship can exist while yielding an r value close to zero. The original poster made this assumption and attempted to remove extreme data points.
    As for the resulting plot, it can be U-shaped, curved, whatever; anything other than linear.

    0
    #173002

    Mikel
    Member

    Chad,
    With all due respect, I think you missed the original poster’s point. They think they removed outliers and the correlation got worse. The only reason that would happen is they were not really outliers.
    The real question is why did they think they were outliers?
     

    0
    #173003

    DaveS
    Participant

    Stan,
    With all due respect, your statement:
    “They think they removed outliers and the correlation got worse. The only reason that would happen is they were not really outliers.” ,
    is not true.
    They may have been outliers, and also be high influence points.
    Model two columns of random integers. Get the correlation coefficient. Now add one more point that is an outlier to that set and much larger than the previous values. The new correlation coefficient will be better.
    I think the real point is that without the data we are all just guessing.
    Pearson’s correlation coefficient (the OP does not say that is what he is using, but in common usage that is what most mean) is notoriously subject to influence from outliers. It also is not robust to non-normal data; data normality is an assumption for this statistic. Spearman’s and others are better for non-normal data.
    A regression analysis with proper residual analysis could settle the issue.
    I do agree that calling something outside control chart limits an outlier is not the best approach.
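    To see this concretely, a sketch (random numbers of my own, not the OP’s data):

```python
# Illustrative sketch: one far-away, high-influence point can push the
# Pearson correlation of otherwise unrelated data close to 1.
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(1)
x = [random.randint(0, 10) for _ in range(20)]
y = [random.randint(0, 10) for _ in range(20)]
r_before = pearson(x, y)                 # near zero: x and y are unrelated

# add one point far from the cloud -- an outlier AND a high-influence point
r_after = pearson(x + [100], y + [100])  # jumps toward 1

print(r_before, r_after)
```

    Removing that one point would make the correlation “worse,” which is exactly the effect the OP described.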

    0
    #173008

    e
    Participant

    If a Pearson correlation analysis assumes normality of the input and output, then why does a regression analysis not require this assumption? If r equals the square root of R-squared, then one would think the analyses are similar.
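    For what it’s worth, a quick numerical check (my own toy numbers) that r² equals R² for a simple linear fit, which is why I would expect the two analyses to be similar:

```python
# Toy check: for SIMPLE linear regression, R-squared = (Pearson r)^2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / (sxx * syy) ** 0.5           # Pearson correlation

b1 = sxy / sxx                         # least-squares slope
b0 = my - b1 * mx                      # least-squares intercept
ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
r_squared = 1 - ss_res / syy           # regression R-squared

print(abs(r * r - r_squared) < 1e-12)  # True
```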

    0
    #173012

    Stevo
    Member

    Sachin,
     
    You have created a perfect storm for yourself. Post the data and let these guys/girls do your analysis. (They will; the egos around here are hilarious.)
     
    I take more of a PGA approach.
     
    Practically – Does your process suggest that there is a correlation? (If clear, stop; if not, proceed.)
    Graphically – Do a scatter plot and interpret the results. (If clear, stop; if not, proceed.)
    Analytically – Pick the right tool for the data and interpret those results. (If clear, stop; if not, make up something, because no one will be able, or want, to disprove your findings.)
     
    Stevo

    0
    #173013

    Vallee
    Participant

    E,
    A regression formula is looking for a point-by-point match: as one axis increases or decreases, so should the other. Don’t assume a cause-and-effect relationship. You are looking at two samples, not the population, so while it may be nice to have, you will not always have normality.
    Regression plots with a lot of inconsistency can follow an equation’s plot (line, U-shape, etc.) but will have a lot of dispersion about the equation. The question, once you determine a possible equation, is how predictable and accurate it will be when used. If the sample taken does not match the population you take from next, the equation will not work.
    The closest match to these is a t-test, because you are looking for a change and a match before and after. A test, however, must meet the assumptions you asked about earlier.
    HF Chris Vallee

    0
    #173018

    Brandon
    Participant

    Oh Stevo, there you go again…making something make sense when keeping it confusing brings in mucho dinero!!

    0
    #173047

    Sachin Dhavle
    Member

    Let me explain in detail what we are doing. We have a response variable Y and independent variables X1, X2, X3, …, X10. We assume that these X’s affect Y, but we want to prove the relationship between the X’s and Y statistically, using the correlation coefficient and scatter plots. Those X variables that show a statistically significant correlation with Y should be included in a multiple linear regression equation (prediction model). Using this regression equation (prediction model) we can predict future values of Y. Before calculating the prediction model we check whether these X’s and Y are statistically under control, using control charts (process control).
         1) Suppose Y is our number of defects per unit and we want to minimize these defects. We can control the process using control charts. Points outside the 3-sigma limits are special causes (or outliers); removing these outliers, we try to find the relationship between Y and the X’s and then the regression model (prediction model). This is our definition of an outlier. Is this correct?
     
         2) Now we have ten data sets, and in each data set there are only 6 to 15 data points. In most of the data sets, most of the X’s show no correlation with Y.
            In one data set, X3 shows a positive correlation with Y, and in another data set a negative correlation with Y.
          Why does this happen?
         3) Is it because the data points are too few in number? OR
         4) Is there no correlation between X3 and Y?
         5) We are not checking the assumption of normality. Is the assumption of normality required for correlation?
         6) Sometimes the correlation coefficient shows R² = 0.52, but the scatter plot does not show the points scattered around a diagonal line. What should our interpretation be then?
     
         7) Can we calculate a regression equation between Y and the X’s without knowing the relationship between Y and the X’s, and then remove X’s step by step using the p-value for each individual X?
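     Regarding questions 2) to 4), here is a small simulation I tried (random numbers, not our process data): with only 6 to 15 points per data set, the sample correlation of two completely unrelated variables swings widely and can easily come out positive in one data set and negative in another.

```python
# Illustrative simulation: small samples give wildly varying correlations
# even when x and y are truly independent.
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(7)
rs = []
for _ in range(20):                           # twenty small "data sets"
    x = [random.random() for _ in range(8)]   # 8 points each
    y = [random.random() for _ in range(8)]   # independent of x
    rs.append(pearson(x, y))

print(min(rs), max(rs))  # expect a wide spread
```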
         

    0
    #173048

    e
    Participant

    I am asking a simple question that has not been answered. I have read differing requirements online for normality of the input or output for a correlation analysis. I have read that either one or both of the variables in a Pearson correlation must be normally distributed.
    Is it both, one, or neither that must be normally distributed?
    I am suspicious about the requirement for normality, since there is no requirement for normality of the input or output for a simple linear regression; only the residuals must be normally distributed for a valid regression.

    0
    #173052

    Sachin Dhavle
    Member

    The normality assumption on the data is not required for regression. But one of the assumptions for the regression equation is that the residuals should follow a normal distribution.
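    In other words (a minimal sketch with made-up numbers): fit the line first, then check the residuals rather than the raw X or Y values.

```python
# Minimal sketch: regression makes its normality assumption on the
# residuals, so compute them and then examine their distribution
# (e.g. with a histogram or normal probability plot).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]

# least squares forces the residuals to sum to (numerically) zero;
# it is their SHAPE that must look normal for valid t-tests/p-values
print(sum(residuals))
```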
     

    0
    #173053

    Mikel
    Member

    Please share your data, I am sure someone will help make sense of it.

    0
    #173056

    Vallee
    Participant

    Forget about the regression test for now. Normality does not affect your regression formula in itself, but the formula is useless if the data points show up off target with a large spread as the norm. I could create a regression formula on what my weatherman does and says to determine if it will rain, but I would never plan a picnic with the output, because I don’t know when he will say the same thing again. Perform an ANOVA and figure out what you really have first. Some people will rebel at this, but you have no true understanding of the between or within interactions for main effects or their interactions. Get control of your system so you have stable data with less variation, and then go down the regression road. Otherwise you are only digging a hole that will create a monster that does nothing for you in the end.

    0

The forum ‘General’ is closed to new topics and replies.