Correlation
Six Sigma – iSixSigma › Forums › Old Forums › General › Correlation
 This topic has 23 replies, 11 voices, and was last updated 13 years, 11 months ago by Vallee.


June 18, 2008 at 7:12 am #50321
Sachin Dhavle
To find the correlation between two variables, we calculate the correlation coefficient. The value shows a weak correlation between the two variables. Then we find that there are outliers in the dataset. After removing the outliers we recalculate the correlation coefficient, and the recalculated value is lower than the previous one. That is, after removing the outliers the correlation got weaker.
Is there a problem with the data, or is there a statistical explanation?

June 18, 2008 at 11:19 am #172936
Hi there. This is an example of negative correlation: as you keep removing the extreme values, the coefficient drops. It could also be because of non-linearity in the x and y data sets. I would not be surprised if the coefficient got near 0 and then started increasing again after you removed many more data points.
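A quick numerical sketch of the effect being described, using made-up data (nothing here comes from the original poster's dataset): one extreme high-leverage point can prop up Pearson's r, so removing it makes the coefficient drop.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 15 points with no real relationship in the bulk...
x = rng.normal(size=15)
y = rng.normal(size=15)

# ...plus one extreme point that is large in both x and y.
x_full = np.append(x, 12.0)
y_full = np.append(y, 12.0)

r_full = np.corrcoef(x_full, y_full)[0, 1]  # propped up by the extreme point
r_trim = np.corrcoef(x, y)[0, 1]            # "outlier" removed: r drops

print(r_full, r_trim)
```

With the extreme point included, r is driven mostly by that single point; removing it exposes the weak correlation of the remaining bulk, which is exactly the behavior the original poster saw.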
June 18, 2008 at 11:47 am #172937
Say what?
Negative correlation refers to the slope of the line.
I suspect what he is calling an outlier isn't.
How did you decide it was an outlier?

June 18, 2008 at 11:58 am #172938
I guess he means the extreme values. Stan, then what is the reason for the coefficient to decrease, assuming he is removing the extreme values?
June 18, 2008 at 12:18 pm #172941
Sachin Dhavle
By outliers I mean extreme points: the points that fall outside the 3-sigma limits on a control chart.
June 18, 2008 at 2:16 pm #172946
What happens if you keep just the outliers?
Is there a relationship, but only at the extreme values?
Without seeing the data and knowing how many data points you have, it sounds as though you don't have any correlation, or if you do, it's an odd shape.

June 18, 2008 at 2:52 pm #172948
What is an outlier in correlation? His definition of beyond 3 sigma is wrong.
June 18, 2008 at 3:58 pm #172949
Chris Seider
I know there are those who advocate removing points with larger than +/- 3 standardized residuals, but that uses the regression tool, which I suspect my friend, colleague, and mentor Stan is trying to get you to realize.
However, I don't subscribe to this approach of throwing out residuals just to get a better regression. Residuals shouldn't be thrown out unless the special cause is known. I especially would not throw out residuals just to get a statistically significant regression, because then the slope and intercept become suspect if residuals are continually discarded until a "good" regression appears.
Have you considered doing the first thing you should do: creating scatterplots of your inputs vs. your output to understand the relationship?
Good luck….

June 18, 2008 at 4:08 pm #172950
Taylor
Good Lord, help me keep from screaming…….
Sachin,
Correlation does not predict relationships; it only shows the strength of the relationship between two variables. The stronger the relationship, the greater the likelihood that a change in one variable will be accompanied by a change in the other. So to answer your question: there is a strong(er) relationship between your outlier points and the rest of your data, and the fact that they are outside 3 sigma, or are extreme points, does not exclude them from the correlation.
A negative r implies that as one variable (X) increases, the other variable (Y) decreases. A positive r implies that as one variable (X) increases, the other variable (Y) also increases.
If your r value is close to zero, then you have a NonLinear Relationship. Your data points on a scatterplot will appear to be in a U shape or vice versa.
When interpreting your results, keep in mind that if you remove the "outliers" you must have an assigned cause and know that these data points are no good or do not represent normal operation of your process. If that is the case, then your data are indicating that one variable (X) has little or no effect on the other variable (Y).
Hope this helps

June 18, 2008 at 9:00 pm #172955
Chad,
Your statement, "If your r value is close to zero, then you have a NonLinear Relationship. Your data points on a scatterplot will appear to be in a U shape or vice versa." is absolutely incorrect. Better scream some more. Otherwise a good summary.

June 18, 2008 at 9:12 pm #172956
Taylor
Daves, point taken. Circle-versus-triangle conversation. As I should have stated in my last post, a strong relationship other than linear can exist, yet r can be close to zero. If you understood "non-linear relationship" you would have known the point I was trying to make.
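That concession is easy to verify numerically. A sketch with an assumed perfectly U-shaped (quadratic) relationship: y is completely determined by x, yet Pearson's r is essentially zero. (The converse, of course, does not hold: r near zero can also simply mean the variables are independent.)

```python
import numpy as np

# Toy data: a perfect U-shaped relationship, y = x^2 on a symmetric range.
x = np.linspace(-3, 3, 61)
y = x ** 2

# Pearson's r measures only *linear* association, so it comes out ~0 here
# even though y is a deterministic function of x.
r = np.corrcoef(x, y)[0, 1]
print(r)
```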
June 19, 2008 at 12:00 am #172959
Chad, do a little more study. You are still off track. Hint: zero correlation is inherent in the definition of independent variables. A zero correlation coefficient means that no linear relationship exists. It does not imply that ANY other relationship exists whatsoever. A non-linear relationship may exist that happens to have a zero coefficient, as you suggest, or the variables may simply be independent. Your post claiming the data must necessarily look U-shaped is absurd.
June 19, 2008 at 2:51 pm #172996
Taylor
Daves,
I agree with you. However, assuming the variables are normally distributed is misleading, since a strong non-linear relationship can exist while still yielding an r value close to zero. The original poster made this assumption and attempted to remove extreme data points.
As for the resulting plot, it can be U-shaped, curved, whatever: anything other than linear.

June 19, 2008 at 3:47 pm #173002
Chad,
With all due respect, I think you missed the original poster's point. They think they removed outliers and the correlation got worse. The only reason that would happen is that they were not really outliers.
The real question is: why did they think they were outliers?
June 19, 2008 at 4:23 pm #173003
Stan,
With all due respect, your statement:
"They think they removed outliers and the correlation got worse. The only reason that would happen is they were not really outliers."
is not true.
They may have been outliers, and also high-influence points.
Model two columns of random integers and get the correlation coefficient. Now add one more point that is an outlier to that set and much larger than the rest. The new correlation coefficient will be better.
I think the real point is that without the data we are all just guessing.
Pearson's correlation coefficient (the OP does not say that is what he is using, but in common usage that is what most people mean) is notoriously subject to influence from outliers. It is also not robust to non-normal data; normality is an assumption for this statistic. Spearman's and others are better for non-normal data.
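A sketch of that sensitivity with assumed toy data: one extreme point inflates Pearson's r, while Spearman's rank correlation, which only sees the point as one more rank, stays modest.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy data: 20 points with essentially no relationship...
x = rng.normal(size=20)
y = rng.normal(size=20)

# ...plus one extreme high-influence point.
x_out = np.append(x, 15.0)
y_out = np.append(y, 15.0)

pearson_before, _ = stats.pearsonr(x, y)
pearson_after, _ = stats.pearsonr(x_out, y_out)
spearman_after, _ = stats.spearmanr(x_out, y_out)

# Pearson jumps because of the single extreme point; Spearman, being
# rank-based, treats it as just the largest rank.
print(pearson_before, pearson_after, spearman_after)
```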
A regression analysis with proper residual analysis could settle the issue.
I do agree that calling something outside control-chart limits an outlier is not the best approach.

June 19, 2008 at 4:36 pm #173008
If a Pearson correlation analysis assumes normality of the input and output, then why does a regression analysis not require this assumption? If r equals the square root of R-squared, one would think the analyses are similar.
June 19, 2008 at 5:23 pm #173012
Sachin,
You have created a perfect storm for yourself. Post the data and let these guys/girls do your analysis. (They will; the egos around here are hilarious.)
I take more of a PGA approach:
Practically: does your process suggest that there is a correlation? (If clear, stop; if not, proceed.)
Graphically: do a scatter plot and interpret the results. (If clear, stop; if not, proceed.)
Analytically: pick the right tool for the data and interpret those results. (If clear, stop; if not, make something up, because no one will be able, or want, to disprove your findings.)
Stevo

June 19, 2008 at 5:28 pm #173013
Vallee
E, a regression formula is looking for a point-by-point match: as one axis increases or decreases, so should the other. Don't assume a cause-and-effect relationship. You are looking at two samples, not the population, so while normality may be nice to have, you will not always have it. Regression plots with a lot of inconsistency can follow an equation's plot (line, U-shape, etc.) but will have a lot of dispersion about the equation. The question, once you determine a possible equation, is how predictable and accurate it will be when used. If the sample taken does not match the population you sample from next, the equation will not work. The closest match to these is a t-test, because you are looking for a change and a match before and after. A t-test, however, must meet the assumptions you asked about earlier.
HF Chris Vallee
June 19, 2008 at 5:53 pm #173018
Brandon
Oh Stevo, there you go again… making something make sense when keeping it confusing brings in mucho dinero!!
June 20, 2008 at 7:11 am #173047
Sachin Dhavle
Let me explain in detail what we are doing. We have a response variable Y and independent variables X1, X2, X3, …, X10. We assume that these Xs affect Y, but we want to establish the relationship between the Xs and Y statistically, using the correlation coefficient and scatter plots. The X variables that show a statistically significant correlation with Y are to be included in a multiple linear regression equation (prediction model). Using this regression equation we can predict future values of Y. Before building the prediction model, we check whether the Xs and Y are statistically under control, using control charts (process control).
1) Suppose Y is our number of defects per unit and we want to minimize these defects. We can control the process using control charts. Points outside the 3-sigma limits are special causes (or outliers); after removing these outliers we try to find the relationship between Y and the Xs and then the regression model (prediction model). This is our definition of outliers. Is it correct?
2) We now have ten data sets, each with only 6 to 15 data points. In most of them, the Xs show no correlation with Y.
In two data sets, X3 shows a positive correlation with Y, and in another data set a negative correlation with Y.
Why does this happen?
3) Is it because the number of data points is small? Or
4) Is there simply no correlation between X3 and Y?
5) We are not checking the assumption of normality. Is normality required for correlation?
6) Sometimes the correlation coefficient gives R² = 0.52, but the scatter plot does not show the points scattered around a diagonal line. What should our interpretation be then?
7) Can we calculate a regression equation between Y and the Xs without knowing the relationship between Y and the Xs, and then remove Xs step by step using the p-value for each individual X?
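Question 7 describes backward elimination. A minimal sketch of the idea with simulated data and a hand-rolled OLS (the cutoff alpha=0.05, the data, and all names here are assumptions, not the poster's): fit with all Xs, then repeatedly drop the X with the largest p-value until every remaining p-value is below the cutoff. As others note in this thread, this kind of mechanical pruning can leave a suspect model if the special causes are not understood.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Fit OLS with an intercept; return the p-value of each coefficient
    (intercept first, then one per column of X)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma2 = resid @ resid / dof                 # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)        # covariance of coefficients
    t = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t), dof)        # two-sided p-values

def backward_eliminate(X, y, alpha=0.05):
    """Repeatedly drop the predictor with the largest p-value until
    every remaining predictor is significant at level alpha."""
    cols = list(range(X.shape[1]))
    while cols:
        p = ols_pvalues(X[:, cols], y)[1:]       # skip the intercept's p-value
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break
        cols.pop(worst)
    return cols

# Simulated data (hypothetical): only columns 0 and 2 actually drive Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=60)

kept = backward_eliminate(X, y)
print(kept)  # surviving column indices; 0 and 2 carry the real signal
```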
June 20, 2008 at 7:14 am #173048
I am asking a simple question that has not been answered. I have read differing requirements online for normality of the input or output in a correlation analysis: some sources say one, others say both, of the variables in a Pearson correlation must be normally distributed.
Is it both, one, or neither that must be normally distributed?
I am suspicious of the normality requirement, since there is no such requirement on the input or output of a simple linear regression; only the residuals must be normally distributed for a valid regression.

June 20, 2008 at 10:24 am #173052
Sachin Dhavle
A normality assumption on the data is not required for regression. But one of the assumptions of the regression equation is that the residuals should follow a normal distribution.
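A small sketch of checking that assumption on simulated data (the Shapiro-Wilk test is one common choice, an assumption of mine rather than anything the posters named): fit a simple line, then test the residuals, not X or Y, for normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated data (hypothetical): a linear process with normal noise.
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.5, size=50)

# Fit the simple linear regression, then examine the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Shapiro-Wilk: a small p-value would be evidence the residuals are
# NOT normal; note the test is applied to the residuals, not to x or y.
stat, p_value = stats.shapiro(residuals)
print(slope, p_value)
```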
June 20, 2008 at 12:05 pm #173053
Please share your data; I am sure someone will help make sense of it.
June 20, 2008 at 12:52 pm #173056
Vallee
Forget about the regression test for now. Normality does not affect your regression formula in itself, but the formula is useless if the data points show up off target with a large spread as the norm. I could build a regression formula on what my weatherman does and says to determine whether it will rain, but I would never plan a picnic on its output, because I don't know when he will say the same thing again. Perform an ANOVA and figure out what you really have first. Some people will rebel at this, but you have no true understanding of the between or within variation for main effects or their interactions. Get control of your system so you have stable data with less variation, and then go down the regression road. Otherwise you are only digging a hole and creating a monster that does nothing for you in the end.
The forum ‘General’ is closed to new topics and replies.