Binary Correlation

This topic contains 7 replies, has 2 voices, and was last updated by Kristen Hill 4 months, 2 weeks ago.

Viewing 8 posts - 1 through 8 (of 8 total)


    I have class attendance data and I need to find out whether there is a relationship between attendance at the most recent class and attendance at previous classes. I have coded 1 as attended and 0 as did not attend.

    Name, Class 1, Class 2, Class 3
    A, 1, 1, 1
    B, 0, 1, 0

    I have tried correlating class 3 with classes 1 and 2 separately. I have also added classes 1 and 2 together to get the number of previous classes attended and correlated that total with class 3.

    All the correlations are low and I’m wondering if correlation is the wrong method as I’m working with mainly binary data. What do you think?


    Robert Butler

    Based on your description of the problem I think your best bet would be to run a logistic regression with attendance at class 3 being the Y variable and attendance at classes 1 and 2 being the two X variables.

    The output would be in the form of odds ratios: one odds ratio relating attendance at class 1 to attendance at class 3, and another for class 2. One problem you are going to have with this approach is whether attendance at class 1 is independent of attendance at class 2. If the two are not independent enough, you will get huge odds ratios with very wide Wald confidence limits and the analysis won’t be worth much.

    One thing you need to remember with binary data: to see significance you need a lot more data than you would for an analysis of continuous data, so even if the measures exhibit sufficient independence you may not have enough data to detect a significant difference.
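
As a rough illustration of the logistic regression described above, here is a minimal sketch in plain Python. The counts are hypothetical and invented purely for illustration (they are not the poster's data), and `solve` and `fit_logistic` are helper names made up for this example. It fits class 3 attendance on class 1 and class 2 attendance by Newton-Raphson and reports each odds ratio with a 95% Wald interval. In practice a statistics package would normally do this fitting for you.

```python
import math

# Hypothetical counts, invented for illustration only (not the poster's data):
# (class1, class2, class3) -> number of students
counts = {
    (1, 1, 1): 30, (1, 1, 0): 5,
    (1, 0, 1): 10, (1, 0, 0): 10,
    (0, 1, 1): 12, (0, 1, 0): 8,
    (0, 0, 1): 5,  (0, 0, 0): 20,
}

# Expand to one row per student: intercept, class 1, class 2; y is class 3.
X, y = [], []
for (c1, c2, c3), n_students in counts.items():
    X.extend([[1.0, float(c1), float(c2)]] * n_students)
    y.extend([c3] * n_students)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_logistic(X, y, iterations=25, tol=1e-10):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for r in range(k):
                grad[r] += xi[r] * (yi - p)
                for c in range(k):
                    info[r][c] += p * (1.0 - p) * xi[r] * xi[c]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    # Wald standard errors: square roots of the diagonal of the inverse
    # information matrix (taken from the final iteration, i.e. at convergence).
    se = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        se.append(math.sqrt(solve(info, unit)[j]))
    return beta, se


beta, se = fit_logistic(X, y)
for name, b, s in zip(["class 1", "class 2"], beta[1:], se[1:]):
    lo, hi = math.exp(b - 1.96 * s), math.exp(b + 1.96 * s)
    print(f"{name}: odds ratio {math.exp(b):.2f}, 95% Wald CI ({lo:.2f}, {hi:.2f})")
```

If attendance at class 1 and class 2 were nearly identical for every student, the information matrix would be close to singular and the standard errors, and therefore the Wald limits, would blow up, which is the independence problem mentioned above.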



    Thank you for replying so quickly. I have data for around 100 students which should help.

    I will try a logistic regression. Do you think doing a correlation would be meaningless?


    Robert Butler

    I’m not really sure what you mean by correlation. Since you are online, post the data and I’ll take a look at it: just 3 columns, one for each class, coded yes/no (1/0), with an indication of which class is which.



    Sorry, I only just saw this message.

    The data is as you described: a column for each class containing 0s and 1s. I’m wondering whether correlating class 3 with class 2 (standard Pearson/Spearman) would give any meaningful information about whether there’s a relationship between attendance at classes 2 and 3.


    Robert Butler

    Not really. The usual significance test for the Pearson correlation coefficient assumes normality (or approximate normality) of the variables, and binary data is not normal.

    If it were just a case of looking at attendance at class 2 versus attendance at class 3, you could run a 2×2 chi-square analysis, which would give you a measure of association. However, given your problem description, I still think logistic regression would be the better bet. It would tell you the odds of attending class 3 given attendance (or lack thereof) at class 2, and it would also tell you whether or not the odds ratio was significant.
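
For the two-class case, the 2×2 chi-square analysis mentioned above can be sketched with nothing but the standard library. The four counts below are hypothetical, invented only to show the arithmetic, and the variable names are my own. The sketch computes the Pearson chi-square statistic (without continuity correction), its p-value for 1 degree of freedom, the phi coefficient as a measure of association, and the odds ratio with a 95% Wald interval.

```python
import math

# Hypothetical 2x2 counts, invented for illustration only (not the poster's data).
# Rows: attended class 2 (yes/no); columns: attended class 3 (yes/no).
a, b = 40, 10   # attended class 2: attended / missed class 3
c, d = 20, 30   # missed class 2:   attended / missed class 3
n = a + b + c + d

# Product of the four marginal totals, shared by chi-square and phi.
margins = (a + b) * (c + d) * (a + c) * (b + d)

# Pearson chi-square statistic for a 2x2 table (no continuity correction).
chi2 = n * (a * d - b * c) ** 2 / margins

# With 1 degree of freedom the upper-tail p-value has a closed form.
p_value = math.erfc(math.sqrt(chi2 / 2))

# Phi coefficient: a signed measure of association, with chi2 = n * phi**2.
phi = (a * d - b * c) / math.sqrt(margins)

# Odds ratio of attending class 3 given attendance at class 2,
# with a 95% Wald interval computed on the log scale.
odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"chi-square = {chi2:.2f}, p = {p_value:.2g}, phi = {phi:.2f}")
print(f"odds ratio = {odds_ratio:.2f}, 95% Wald CI ({ci_low:.2f}, {ci_high:.2f})")
```

With a single binary predictor, this odds ratio is the same one a logistic regression of class 3 on class 2 would report. If any cell count is zero or very small, the Wald interval is unreliable and an exact test is the usual alternative.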



    OK, that makes sense, thank you.


    Kristen Hill

    You can run a chi-square test very quickly to look for a relationship, but a logistic regression is correct as well. The chi-square test gives you a quick and easy view and can be done in Excel.
