iSixSigma

Difference between Correlation and Regression


  • This topic has 27 replies, 20 voices, and was last updated 12 years ago by S.
  • #44976

    Kiran Varri
    Participant

    Hi,
    I have recently completed Six Sigma Green Belt certification. I would like to know, with examples if possible, the best way to explain the difference between correlation and regression. I am sorry if I sound dumb, but I am still a learner. Your kind responses will be appreciated.

    #145414

    Quainoo
    Member

    Kiran,
    I do not consider myself a specialist, but below is my attempt to answer your question:
    Correlation:
    If X and Y are correlated, there will be a relationship between the increase (or decrease) of the Y and the increase (or decrease) of the X.
    Minitab provides two pieces of information:

    The strength of the relationship (the correlation coefficient r, which runs from -1 to +1)
    A ‘p value’ which tells you if the two variables are ‘statistically’ correlated (if p is less than 0.05, the correlation is significant)
    Regression:
    Regression analysis will calculate a mathematical relationship between the X (or X’s) and the Y.
    By doing so, it will:

    Tell you if the X is statistically significant, that is, if the factor should be taken into account in the mathematical equation (if p is less than 0.05, the term is significant)
    Give you the value associated with the X to estimate the Y (example: Y = 3X)
    Give you the ‘accuracy’ of the mathematical relationship by providing an ‘R-Sq value’ (if R-Sq is 0.95, your regression equation explains 95% of the variation in the ‘real’ data)
    As far as I can see, if there is a strong correlation between two variables, it is very likely that the regression term will be considered significant, because both use the same sort of ‘variance analysis’ to perform the calculations.
    Please note that regression provides other information, but I have tried to keep my answer simple.
    PS: Concerning correlation, I think you have to be cautious about the possible difference between ‘correlation’ and ‘causation’. A correlation might exist just ‘by chance’ even though no real link exists between the X and the Y.
    Vincent
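    To make Vincent’s two outputs concrete, here is a minimal sketch in Python with numpy/scipy rather than Minitab; the data, seed, and numbers are made up for illustration.

```python
# Sketch: the two pieces of correlation output (r and its p-value) and the
# regression outputs (fitted equation, significance, R-Sq). Data are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(50, 10, size=30)           # hypothetical X measurements
y = 3 * x + rng.normal(0, 15, size=30)    # Y roughly follows Y = 3X plus noise

# Correlation: strength of the linear relationship and its p-value
r, p_corr = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_corr:.4f}")   # p < 0.05: statistically correlated

# Regression: the fitted equation, its p-value, and R-Sq (r squared here)
fit = stats.linregress(x, y)
print(f"Y = {fit.slope:.2f}*X + {fit.intercept:.2f}")
print(f"p = {fit.pvalue:.4f}, R-Sq = {fit.rvalue**2:.3f}")
```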

    #145431

    Brit
    Participant

    A good point about correlation and causation. For example, you might take data concerning washing your car and it raining the next day. You might have a good correlation coefficient, but one really doesn’t cause the other to happen.
    One way I like to check causation is asking the question backwards.
    First hypothesis:  Every time I wash my car, it rains.
    Second confirmation hypothesis:  Every time it rains, I wash my car…
    Obviously, washing your car has no effect on the weather.
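    Brit’s car-washing example can even be simulated; a sketch in Python (scipy), with purely invented random data, shows that roughly 5% of completely unrelated variable pairs will look ‘significantly’ correlated at the 0.05 level just by chance.

```python
# Sketch: how often two INDEPENDENT variables pass the p < 0.05 correlation
# test anyway. All data are random noise; any "correlation" is pure chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trials, false_alarms = 1000, 0
for _ in range(trials):
    car_washes = rng.normal(size=20)      # hypothetical, unrelated series
    rainfall = rng.normal(size=20)        # independent of car_washes
    if stats.pearsonr(car_washes, rainfall)[1] < 0.05:
        false_alarms += 1
print(f"{false_alarms / trials:.1%} of independent pairs looked correlated")
```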

    #145529

    Kiran Varri
    Participant

    Thanks Vincent and Brit…..the responses were very helpful and easy to understand….Kiran

    #146133

    Guido S
    Participant

    Hi Kiran,
    the R-Sq from regression is r*r from correlation! For linear relationships, R-Sq and r carry the same information.
    For quadratic and higher-order relationships there is a difference: correlation only measures linear relationships!
    Best regards, Guido
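    Guido’s caveat is easy to demonstrate; a sketch in Python (numpy/scipy), with an invented exact quadratic data set, gives a Pearson r near 0 even though a quadratic regression explains the data completely.

```python
# Sketch: Pearson correlation only detects LINEAR association. A perfect
# quadratic relationship yields r close to 0 but quadratic R-Sq close to 1.
import numpy as np
from scipy import stats

x = np.linspace(-3, 3, 61)
y = x**2                                   # exact quadratic relationship

r, _ = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}")              # about 0: no linear trend

coeffs = np.polyfit(x, y, deg=2)           # fit Y = a*X^2 + b*X + c
y_hat = np.polyval(coeffs, x)
r_sq = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
print(f"Quadratic R-Sq = {r_sq:.3f}")      # essentially 1.0
```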

    #146140

    Jonathon Andell
    Participant

    Maybe it’s just semantics here, but regression is just an analytical tool. Correlation is one of several bits of information that can come from regression. Causation may be the issue here. I heard that somebody was able to correlate birth rates to the number of stork nests, which spawned the belief that storks bring babies. (I don’t know whether that one is true, but it makes a good point.) The point is, two factors can be correlated without one causing the other. Sometimes it’s because both correlated factors are caused by yet another factor. Occasionally it’s due to an as-yet unexplained factor. I hope this helps a bit.

    #167823

    amir saeed khan
    Participant

    Hello, how are you? I am a student of BS Bioinformatics in Pakistan. I used to have a lot of problems with statistics terms like correlation and regression, but now I don’t. You are great. Thanks a lot.

    #167824

    Kim Niles
    Participant

    Kiran:
    Good question. I had a statistics teacher answer this once by telling me that correlation analysis is always through visual interpretation and regression analysis is always mathematical. I’ve stuck with that definition even though I’m not very comfortable with it. I hope to read more posts in this thread.
    KN  – http://www.KimNiles.com

    #167827

    Mikel
    Member

    Yes.
    Correct.
    What is difficult in that? It is common sense.

    #168330

    Viti
    Participant

    Hi all,
    Stan, could you please explain a bit more?
    Thanks, Jane

    #172139

    Jude Ogunade
    Participant

    First, keep in mind that both terms refer to a relationship. Secondly, know that this relationship is between two variables. Thirdly, know that while correlation is concerned only with the variation of this relationship (another name for correlation could be covariation), regression seeks to find the best fit for the variation between these two variables, i.e. finding where the two variations come closest to being the same thing.

    #173737

    waqar
    Member

    Thanks. Would you like to tell me your mailing address, please?

    #173853

    Jude Ogunade
    Participant
    #174197

    Alessio Toraldo
    Participant

    Sorry to say this, but many of the above replies are, strictly speaking, inaccurate.
    Statistics does not know anything about causal relationships – neither for correlation (C) nor for regression (R).
    The real difference is the statistical model assumed by the person applying the analysis:
    C: a bivariate gaussian is assumed
    R: does not assume a bivariate gaussian
    To say the same thing in another way, the underlying model is Y = aX + b + E, where a and b are constants and E is random noise, normally distributed with mean = 0.
    And here comes the difference: correlation assumes that X is normally distributed; regression does NOT assume so.
    Perhaps less intuitive, but correct ;-)
    Cheers
    AT

    #174215

    Robert Butler
    Participant

      From p. 33 of Applied Regression Analysis – Draper and Smith – First Edition we have the following section:
    1.6 The Correlation between X and Y
       If X and Y were both random variables following some (unknown) bivariate distribution then we could define the correlation coefficient between X and Y as
    rhoXY = covariance(X,Y)/sqrt[V(X)*V(Y)]
    On pp.34-35 of the same book we have the section
    Correlation and Regression which states:
      (If we have a simple linear model Y = b0 + b1*X + e) …b1 is a scaled version of the correlation coefficient. (The two) are closely related but provide different interpretations. The correlation coefficient measures association between X and Y while b1 measures the size of the change in Y, which can be predicted when a unit change is made in X.
      This is the difference between the two. Neither correlation nor regression assumes a bivariate gaussian with respect to either X or Y.
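    The Draper and Smith relationship quoted above (“b1 is a scaled version of the correlation coefficient”, i.e. b1 = r * sd(Y)/sd(X)) can be checked numerically; this sketch in Python (numpy/scipy) uses invented data.

```python
# Sketch: verify rhoXY = cov(X,Y)/sqrt(V(X)*V(Y)) and b1 = r * sd(Y)/sd(X).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(10, 2, size=50)
y = 1.5 * x + rng.normal(0, 3, size=50)    # invented linear data with noise

# Correlation coefficient straight from its definition
r = np.cov(x, y)[0, 1] / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))
b1 = stats.linregress(x, y).slope

print(f"r  = {r:.4f}")
print(f"b1 = {b1:.4f}")
print(f"r * sd(y)/sd(x) = {r * np.std(y, ddof=1) / np.std(x, ddof=1):.4f}")  # matches b1
```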

    #174521

    Alessio Toraldo
    Participant

    There is confusion between “correlation” and “regression” as statistical inference techniques, and “r” and “b” (coefficients) as simple mathematical indices. Of course the two indices, b and r, can always be computed from any sample, irrespective of the shape of the underlying distribution: they are pure mathematical objects, and nothing prevents us from computing them, much in the same way as nobody prevents us from, say, adding up 15 metres and 30 secs and getting 45…? The issue here is NOT the COMPUTABILITY of b and r – they are, of course, always computable. The issue is their interpretation – i.e. their use in statistical inference.
    Now – to draw a valid inference on the slope coefficient (b) in linear regression, you have to satisfy four basic assumptions: (i) that the function relating X to the expected value of Y (E(Y|X)) is linear; (ii) that residuals from the regression line are normally distributed, with 0 mean, for each X value; (iii) that the variance of the residuals is equal for all values of X (homoscedasticity); (iv) that residuals on different Xs are independent of each other.
    These conditions need to be satisfied in order to have a valid statistical test on the regression slope coefficient b. Instead, correct statistical interpretation of Pearson’s r requires the distribution to be a bivariate gaussian (a different model from the minimal one required by regression, which – see above – only requires the residuals to be normally distributed). It is very misleading to say that neither index assumes normality, because it draws attention to an irrelevant fact – the mere “computability” of the two indices in every possible situation – and takes attention away from what is really important for practical use, i.e. that r does not have a clear meaning if the distribution is not a bivariate gaussian, and that b does not have a clear meaning either if the residuals are not normally distributed and independent.
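    For readers who want to try this, here is a minimal sketch in Python (scipy), on invented data, of checking two of the four regression assumptions Alessio lists: normality of the residuals and (roughly) equal variance across X.

```python
# Sketch: fit a line, then inspect the residuals, since assumptions (ii) and
# (iii) are about residuals, not about X or Y themselves.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=80)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=80)   # invented data

fit = stats.linregress(x, y)
residuals = y - (fit.slope * x + fit.intercept)

# (ii) residuals normally distributed: Shapiro-Wilk test on the residuals
_, p_norm = stats.shapiro(residuals)
print(f"Shapiro-Wilk p = {p_norm:.3f}")   # large p: no evidence against normality

# (iii) homoscedasticity, crudely: residual spread at low vs high X
low, high = residuals[x < 5], residuals[x >= 5]
print(f"sd(residuals | low X) = {low.std(ddof=1):.2f}, "
      f"sd(residuals | high X) = {high.std(ddof=1):.2f}")
```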

    #174523

    Michael Mead
    Participant

    That was a great answer.

    #174569

    Robert Butler
    Participant

    The original question posted back in 2006 was the following: “what is the best way to explain the difference between Correlation and Regression?”
     
    The answer to that question is:
     
    The correlation coefficient measures association between X and Y while b1 measures the size of the change in Y, which can be predicted when a unit change is made in X.
    As stated, the question was one concerning the explanation of a concept. The answer that was provided did just that – it explained and (hopefully) clarified. I suppose one could view this kind of answer as misleading and irrelevant, with a focus on “mere computability” and thus of no practical value, but I disagree.
    When someone asks me to explain regression, my choice of explanation focuses on what the machine is doing when it is told to regress Y on X. My explanation will indeed focus on the mere computability – that is, a clear explanation of what least squares regression does – because, in my experience, that is precisely what the individual posing the question wishes to know.
    After having provided a clear explanation of the basic concept, I can build on that understanding to go in any direction I or the individual asking the question may choose.
    If the focus shifts to hypothesis testing and the difference between tests concerning r and b1, one can then begin by discussing the issues of normality that have been raised.
    As noted, for most tests concerning r both X and Y have to be normal. However, if you just leave it at that, your statement is going to be a major call for inaction since, in my experience, people are going to spend an inordinate amount of time worrying about normality instead of getting on with the work. A better answer would be the following:
    “(For most tests concerning r, both X and Y have to be normal.) Often a bivariate population is far from normal. In some cases a transformation of the variables X and Y brings their joint distribution close to the bivariate normal, making it possible to estimate r in the new scale…(In spite of the non-normality) we may still want to examine whether two variables are independent or whether they vary in the same or in opposite directions. For a test of the null hypothesis that there is no correlation, r may be used provided that one of the variables is normal. When neither variable seems normal, the best-known procedure is that in which X and Y are both converted to rankings. The rank correlation coefficient, due to Spearman, is the ordinary correlation coefficient r between ranked values of X and Y.”
    Statistical Methods, 7th Edition, Snedecor and Cochran, pp. 191-192
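    The Spearman fallback described in that passage is a one-liner in most packages; this sketch in Python (scipy), on invented, heavily skewed data, also confirms that Spearman’s coefficient is just Pearson’s r computed on the ranks.

```python
# Sketch: Pearson vs Spearman on skewed (lognormal) data, and a check that
# Spearman's rho equals Pearson's r applied to the ranked values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.lognormal(0, 1, size=40)          # heavily skewed X
y = x + rng.lognormal(0, 1, size=40)      # related to X, also skewed

rho_s, p_s = stats.spearmanr(x, y)
r_p, p_p = stats.pearsonr(x, y)
print(f"Spearman rho = {rho_s:.3f} (p = {p_s:.4f})")
print(f"Pearson  r   = {r_p:.3f} (p = {p_p:.4f})")

# Spearman is the ordinary correlation coefficient between ranked values
print(stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0])  # equals rho_s
```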
      What I don’t have, and would be very interested in reading, is a clear discussion of the word “close” – in particular, just how robust tests concerning the correlation coefficient are to non-normal behavior in X and Y. The text cited above says only that “methods of expressing the amount of correlation in nonnormal data by means of a parameter like r have not proceeded very far.”
    Given that this book has a copyright of 1980, it’s probably safe to assume that things have proceeded farther, and it would be interesting to know just how far that is.

    #174957

    Alessio Toraldo
    Participant

    I agree entirely with the final comment about how “serious” departures from normality are. Normality is more the exception than the rule, so tests should be used provided that violations are not huge and the test is robust (incidentally, I recall a paper which directly showed that even largely skewed distributions of X and Y have relatively little impact on the distribution of r, especially if Rho = 0 under the null hypothesis).
    I suggested neither that the normality caveats should paralyze a researcher, nor that r and b can be computed with no regard for their underlying distribution.
    I just mentioned what the assumptions are, because this important point had not been addressed yet, and it is indeed more relevant than other, typically misleading issues, like that of causality.
    Replying to that statement by saying that “neither r nor b assumes normality” is very detrimental to comprehension by other readers – it is like saying “X and Y are not eigenvalues” to spectators who do not have a background in mathematics.

    #174958

    Alessio Toraldo
    Participant

    PS Incidentally, the discussion DID shift to hypothesis testing. I did the shift :-)

    #174959

    Alessio Toraldo
    Participant

    Ok – I posted a reply, but only the PS came out.
    Well, briefly, I wrote that I agree entirely on the robustness issue – no one should be paralyzed by the suspicion of non-normality. I know of a paper addressing the robustness of r, but it is even older (1977) than the one cited by Robert – see the reference below. They say the test is very robust with skewed and leptokurtic distributions. Much more recent work (by Chin-Diew Lai of Massey University, New Zealand, and colleagues) claimed that when the original distributions are lognormal – very clearly skewed – r is a biased estimator of the parameter Rho (the true correlation of the population from which the sample was extracted); the bias can be huge and nullified only with millions (!) of data points. Clearly, the positions are different.

    Havlicek, Larry L.; Peterson, Nancy L. Effect of the violation of assumptions upon significance levels of the Pearson r. Psychological Bulletin, 1977 Mar, Vol 84(2), 373-377.

    Chin-Diew Lai, Department of Statistics, Massey University, New Zealand; John C. W. Rayner, School of Mathematics and Applied Statistics, University of Wollongong, Australia; T. P. Hutchinson, School of Behavioural Sciences, Macquarie University, Australia:
    “Most statistics students know that the sample correlation coefficient R is used to estimate the population correlation coefficient rho. If the pair (X, Y) has a bivariate normal distribution, this would not cause any trouble. However, if the marginals are non-normal, particularly if they have high skewness and kurtosis, the estimated value from a sample may be quite different from the population correlation coefficient rho. Our simulation analysis indicates that for the bivariate lognormal, the bias in estimating rho can be very large and can be substantially reduced only after a large number (3-4 million) of observations. This example could serve as an exercise for statistics students to realise some of the pitfalls in using the sample correlation coefficient to estimate rho.”
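    The flavor of that simulation analysis can be reproduced in a few lines; this hedged sketch in Python (numpy) draws bivariate lognormal samples and shows how slowly the sample r settles down as n grows. The parameters are invented, not those of the paper.

```python
# Sketch: sample correlation r on bivariate lognormal data, at several sample
# sizes. The spread across repeats shrinks only slowly as n grows.
import numpy as np

rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]            # correlation 0.8 on the normal scale

def sample_r(n):
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    x, y = np.exp(z[:, 0]), np.exp(z[:, 1])   # bivariate lognormal pair
    return np.corrcoef(x, y)[0, 1]

for n in (100, 10_000, 1_000_000):
    rs = [sample_r(n) for _ in range(10)]
    print(f"n = {n:>9}: mean r = {np.mean(rs):.3f}, sd across repeats = {np.std(rs):.3f}")
```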

    #178164

    RABIYA
    Participant

    Hi,
    I am from Pakistan and doing my masters. I have an assignment about this topic, and all this information helped me a lot. Thank you all. Bye.

    #178277

    zeeshan ahmad
    Member

    Correlation: it tells us about the relation between two variables.
    It is a two-way relation; it is symmetric.
    Example: smoking is related to cancer.
    Regression: it tells us about cause and effect.
    It is a one-way relation; it is asymmetric.
    Example: smoking causes cancer.
    Here you cannot say cancer causes smoking, so it is a one-way relation.

    #178279

    Ryan
    Member

    This explanation is inaccurate. Regression does not tell us “cause and effect.” Such a conclusion would be based on the design, not the analysis. -Ryan

    #183601

    anand singh
    Participant

    Thank you.

    #183604

    Ron
    Member

    Linear regression investigates and models the linear relationship between a response (Y) and predictor(s) (X). Both the response and predictors are continuous variables.
    A Pearson correlation coefficient measures the extent to which two continuous variables are linearly related.
    One models the relationship; the other quantifies it.
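    Ron’s one-line summary has a neat numerical counterpart; in this sketch in Python (scipy), on invented data, the correlation is the same whichever variable comes first, while the regression slope depends on which variable is being modeled.

```python
# Sketch: correlation is symmetric in X and Y; regression is a directed model,
# so regressing Y on X and X on Y give different slopes but the same r.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.normal(size=100)
y = 0.6 * x + rng.normal(size=100)        # invented data

print(stats.pearsonr(x, y)[0])            # r(X, Y)
print(stats.pearsonr(y, x)[0])            # identical: r(Y, X)
print(stats.linregress(x, y).slope)       # slope of Y regressed on X
print(stats.linregress(y, x).slope)       # different: slope of X regressed on Y
```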
     
     

    #185655

    Alessio Toraldo
    Participant

    I fully agree with Ryan. Zeeshan Ahmad: read the previous posts before repeating (wrong) solutions. A discussion with as many as 20 posts might well have provided better insights into the solution.
    So, I’d better repeat what the real difference between regression and correlation is. Regression pays attention to the change in Y as a function of a one-step change in X. The question it poses and investigates is in scalar units, e.g., one might wonder by how many centimeters (Y) children grow in one year of age (X). Correlation, instead, does not care about units of measurement. It is a pure number (no units) telling you how closely two variables X and Y match – to what extent they carry the same information. For instance, the age of children and their education (number of years at school) essentially measure the same thing. Indeed, (almost) all first-year children are 6 years old, (almost) all second-year children are 7 years old, and so on. The correlation here is close to +1, meaning that the degree of redundancy of the two variables is extreme. By contrast, age and height in a set of adults are completely uncorrelated – knowing how tall a given adult is does not tell you anything at all about his age. The correlation here is 0.
    Another, seemingly more technical but crucial point: regression assumes that the errors in Y are normally distributed, and nothing else; correlation assumes, much more strictly, that both X and Y are normally distributed. If these assumptions are not – at least vaguely – met, then hypothesis testing on regression (the slope, usually referred to as Beta) and on correlation (the Rho coefficient) cannot be properly carried out.
    Do not listen to people telling you that normality assumptions are irrelevant for regression or correlation. These people are referring to the mere computability of the indices. They do not understand that the mere computability of something is useful only to experts – if one has to explain to a learner what the essence of regression and correlation is, s/he should stress what is important to know for practice, not for theoretical, abstract (didactically empty) reasoning.
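    Alessio’s units point can be seen directly; this sketch in Python (scipy) uses invented child growth data, and rescaling X from years to months changes the slope by a factor of 12 while leaving r untouched.

```python
# Sketch: the regression slope carries units (cm per year vs cm per month),
# while the correlation coefficient is a pure, unit-free number.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
age_years = rng.uniform(2, 12, size=60)
height_cm = 80 + 6.0 * age_years + rng.normal(0, 4, size=60)  # ~6 cm per year

age_months = age_years * 12
print(stats.linregress(age_years, height_cm).slope)    # about 6.0 (cm/year)
print(stats.linregress(age_months, height_cm).slope)   # about 0.5 (cm/month)
print(stats.pearsonr(age_years, height_cm)[0])         # same r either way
print(stats.pearsonr(age_months, height_cm)[0])
```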

    #185934

    S
    Participant

    I don’t know.

