# Binary Correlation


This topic contains 7 replies, has 2 voices, and was last updated by Kristen Hill 4 months, 2 weeks ago.

July 29, 2018 at 9:05 am #56053

Annie: I have data on attendance at classes, and I need to find out whether there is a relationship between attendance at the most recent class and attendance at previous classes. I have coded 1 to mean attended and 0 to mean did not attend.

| Name | Class 1 | Class 2 | Class 3 |
|------|---------|---------|---------|
| A    | 1       | 1       | 1       |
| B    | 0       | 1       | 0       |

I have tried correlating class 3 with classes 1 and 2, and also adding classes 1 and 2 together to get the number of previous classes attended and correlating that with class 3.

All the correlations are low, and I’m wondering whether correlation is the wrong method, as I’m working mainly with binary data. What do you think?

July 29, 2018 at 9:32 am #202862

Based on your description of the problem, I think your best bet would be to run a logistic regression with attendance at class 3 as the Y variable and attendance at classes 1 and 2 as the two X variables.

The output would be in the form of odds ratios. That is, you would have an odds ratio relating attendance at class 1 to attendance at class 3, and similarly for class 2. One problem you are going to have with this approach is the question of whether attendance at class 1 is independent of attendance at class 2. If the two are not independent enough, you will get huge odds ratios with very wide Wald confidence limits, and the analysis won’t be worth much.

One thing you need to remember with binary data: to see significance, you need a lot more data than you would for an analysis of continuous data. So even if the measures exhibit sufficient independence, you may not have enough data to detect a significant difference.
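To make the suggestion above concrete, here is a minimal sketch of a logistic regression with class 3 as Y and classes 1 and 2 as X, reporting odds ratios with Wald 95% confidence limits. It assumes plain Python with no statistics package, so the fit is a hand-rolled Newton-Raphson loop; the `counts` table is invented purely for illustration and is not Annie's data, and the names `fit_logistic`, `score_and_information` and `invert` are hypothetical helpers, not part of any library.

```python
import math

# HYPOTHETICAL counts for 100 students, keyed by (class1, class2, class3).
# Invented for illustration only; these are not the poster's data.
counts = {
    (1, 1, 1): 30, (1, 1, 0): 10,
    (1, 0, 1): 12, (1, 0, 0): 13,
    (0, 1, 1): 10, (0, 1, 0): 10,
    (0, 0, 1): 5,  (0, 0, 0): 10,
}
students = [row for row, n in counts.items() for _ in range(n)]


def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        d = a[c][c]
        a[c] = [v / d for v in a[c]]
        for r in range(n):
            if r != c:
                f = a[r][c]
                a[r] = [v - f * u for v, u in zip(a[r], a[c])]
    return [row[n:] for row in a]


def score_and_information(rows, y, beta):
    """Gradient of the log-likelihood and the information matrix at beta."""
    k = len(beta)
    grad = [0.0] * k
    info = [[0.0] * k for _ in range(k)]
    for r, yi in zip(rows, y):
        p = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, r))))
        for i in range(k):
            grad[i] += (yi - p) * r[i]
            for j in range(k):
                info[i][j] += p * (1.0 - p) * r[i] * r[j]
    return grad, info


def fit_logistic(X, y, iterations=25):
    """Fit by Newton-Raphson; return one result dict per predictor column."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept
    beta = [0.0] * len(rows[0])
    for _ in range(iterations):
        grad, info = score_and_information(rows, y, beta)
        cov = invert(info)
        beta = [b + sum(c * g for c, g in zip(cov_row, grad))
                for b, cov_row in zip(beta, cov)]
    _, info = score_and_information(rows, y, beta)
    cov = invert(info)  # estimated covariance of the coefficients
    out = []
    for i in range(1, len(beta)):  # skip the intercept
        se = math.sqrt(cov[i][i])
        z = beta[i] / se
        out.append({
            "odds_ratio": math.exp(beta[i]),
            "ci_low": math.exp(beta[i] - 1.96 * se),
            "ci_high": math.exp(beta[i] + 1.96 * se),
            "p_value": math.erfc(abs(z) / math.sqrt(2.0)),  # two-sided Wald
        })
    return out


X = [(c1, c2) for c1, c2, _ in students]
y = [c3 for _, _, c3 in students]
results = fit_logistic(X, y)
for name, r in zip(("class 1", "class 2"), results):
    print(f"{name}: OR={r['odds_ratio']:.2f} "
          f"95% CI=({r['ci_low']:.2f}, {r['ci_high']:.2f}) "
          f"p={r['p_value']:.3f}")
```

In practice a statistics package would do the fitting; the point of the sketch is only to show where the odds ratios and Wald limits come from. If classes 1 and 2 were nearly redundant, the diagonal of `cov` would grow and the confidence limits would widen accordingly.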

July 29, 2018 at 9:37 am #202863

Annie: Thank you for replying so quickly. I have data for around 100 students, which should help.

I will try a logistic regression. Do you think doing a correlation would be meaningless?

July 29, 2018 at 9:56 am #202864

I’m not really sure what you mean by correlation. Since you are online, post the data and I’ll take a look at it: just 3 columns, one for each class, yes/no (1/0), and an indication as to which class is which.

July 29, 2018 at 2:29 pm #202865

Annie: Sorry, I only just saw this message.

The data is as you described it: a column for each class containing 0s and 1s. I’m wondering whether correlating class 3 with class 2 (standard Pearson/Spearman) would give any meaningful information about whether there’s a relationship between attendance at points 2 and 3.

July 29, 2018 at 3:44 pm #202866

Not really. One of the key assumptions of the Pearson Correlation Coefficient is normality (or approximate normality) of the variables, and binary data is not normal.

If it were just a case of looking at class attendance in 2 vs. class attendance in 3, you could run a 2×2 chi-square analysis, which would give you a measure of association. However, given your problem description, I still think logistic regression would be the better bet. It would tell you the odds of attending class 3 given attendance (or lack thereof) in class 2, and it would also tell you whether or not the odds ratio was significant.
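As a rough illustration of the 2×2 route, the sketch below computes the Pearson chi-square (no continuity correction), its p-value on 1 degree of freedom, and the odds ratio with Wald 95% limits for class 2 vs. class 3. The cell counts are made up for the example and are not Annie's data, and `two_by_two` is a hypothetical helper name. It also returns the phi coefficient, which is what the Pearson correlation of two 0/1 columns works out to and which satisfies phi² = chi-square / n.

```python
import math


def two_by_two(a, b, c, d):
    """Association measures for a 2x2 table of class 2 vs. class 3.

    a: attended both      b: attended class 2 only
    c: attended class 3 only      d: attended neither
    """
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * (a * d - b * c) ** 2 / margins
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # Phi equals the Pearson correlation of the two 0/1 columns.
    phi = (a * d - b * c) / math.sqrt(margins)
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return {
        "n": n,
        "chi2": chi2,
        "p_value": p_value,
        "phi": phi,
        "odds_ratio": odds_ratio,
        "ci_low": math.exp(math.log(odds_ratio) - 1.96 * se_log_or),
        "ci_high": math.exp(math.log(odds_ratio) + 1.96 * se_log_or),
    }


# HYPOTHETICAL counts for 100 students (illustration only).
table = two_by_two(a=40, b=20, c=17, d=23)
for key, value in table.items():
    print(f"{key}: {value:.4f}")
```

With a single binary predictor, the odds ratio from this table is the same number a logistic regression of class 3 on class 2 would report, so the two approaches agree in the two-class case and differ only once class 1 is added as a second X.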

July 30, 2018 at 1:07 pm #202875

Annie: OK, that makes sense, thank you.

July 31, 2018 at 6:12 am #202877

You can run a chi-square very quickly to check for a relationship, but a logistic regression is correct as well. The chi-square gives you a quick and easy view and can be done in Excel.

