I have the standard deviations and means of 4 samples of unequal but known sizes. How can I calculate the standard deviation of all 4 together?
Data:
pop A n=14, mean=28.9, sd=7.5
pop B n=12, mean=31.3, sd=6.6
pop C n=4, mean=26.0, sd=8.0
pop D n=8, mean=34.0, sd=3.9
You can’t add std. deviations, you add variances. Read this thread.
Thanks Darth,
I looked through this thread, but no one seems to address the case of unequal sample sizes. I tried weighting each variance by its sample size, adding them up, and taking the square root of the result. It seems to give a good approximation of the grand SD in most cases, but is sometimes dramatically wrong, as I have found from an Excel sheet.
So if anyone can shed any light, it would be much appreciated.
Z
Do a little research on the pooled standard deviation. That is essentially an average s.d. but considers sample size.
OK, thanks. It seems like it only provides an estimate, though, so I guess there is no way to calculate the exact SD…
Zebedee:
There are two reasons why your numbers are sometimes close and sometimes very different.
(1) When Excel calculates the sample standard deviation for each subgroup, the denominator in the calculation has an (n-1) term. The calculation of the ‘average’ or pooled standard deviation takes a weighted average of the individual variances, multiplying each by (n-1) for its subgroup to place the variances back on common ground, and then takes the square root. The result is a number that reflects the ‘average’ standard deviation within each subgroup.
(2) When you calculate a standard deviation for the entire dataset (n=38), you get a standard deviation for the entire amount of variation. This includes the variation within each group, but will generally be higher than the result in (1) above because the means of the individual groups are not the same. This is the key point in performing an ANOVA to detect this difference in means of the subgroups.
I suggest you set up your 38 data points in one Minitab sheet, with the first column identifying the subgroup (A, B, C, and D) and the second column containing the raw data. Then go to Stat, ANOVA, OneWay, and choose C2 (raw data) as the response and C1 (subgroup) as the factor. You should see the calculation of the standard deviation within each group, the pooled standard deviation (think of this as the weighted average standard deviation of the subgroups), and the total standard deviation.
It will take you a while to duplicate the numbers from the ANOVA using Excel, but in the end you should see how and why this works.
Failing that, post your raw data and we can all have a bash at it.
Cheers, BTDT
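To illustrate the two points above, here is a minimal Python sketch. The data are made up (two hypothetical subgroups, not Zebedee's samples), and the function names `pooled_sd` and `total_sd` are chosen only for this example. It shows that the pooled (within-group) standard deviation ignores the gap between the group means, while the standard deviation of the combined data includes it.

```python
import math
import statistics

# Hypothetical raw data for two subgroups (illustration only).
groups = {
    "A": [1.0, 2.0, 3.0],
    "B": [11.0, 12.0, 13.0],
}


def pooled_sd(samples):
    """Pooled SD: sample variances weighted by (n - 1), then square-rooted."""
    ss_within = sum((len(x) - 1) * statistics.variance(x) for x in samples)
    df_within = sum(len(x) - 1 for x in samples)
    return math.sqrt(ss_within / df_within)


def total_sd(samples):
    """Sample SD of all observations lumped into one dataset."""
    combined = [value for x in samples for value in x]
    return statistics.stdev(combined)


samples = list(groups.values())
within = pooled_sd(samples)
overall = total_sd(samples)
print(f"pooled (within-group) SD:  {within:.2f}")
print(f"total SD of combined data: {overall:.2f}")
```

Each subgroup here has a standard deviation of 1, so the pooled value is also 1, but the combined data have a much larger spread because the two group means are 10 units apart.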
Thanks for this wonderfully comprehensive message, which I will have to re-read to get my head around.
Unfortunately I don’t have the raw data. I am trying to work out the population SD from the SDs of subgroups. But it seems like this formula for pooled SD is what I have been looking for. Is that right, BTDT?
Best
Z:
You are getting close to the nuts and bolts, keep reading. I will have a look at it more today and get back to you.
Cheers, BTDT
Z:
I can’t post the table of calculations (too big). Send me an email at 6SigmaGuru(at)gmail(dot)com and I’ll send the spreadsheet showing the rollup for your data.
Cheers, BTDT
Your question is ambiguous and has two possible interpretations.
1. You have taken samples and you would like to view their respective standard deviations as independent estimates of the true population standard deviation when the process is “doing the best it can” (stable).
In this instance, to get a better estimate of that variance you pool the variances. For unequal sample sizes the pooled variance is
s^2(pooled) = [(n1-1)*s1^2 + (n2-1)*s2^2 + (n3-1)*s3^2 + …]/(n1 + n2 + n3 + … - k)
where k = the number of sample standard deviations (in your case 4).
The pooled estimate for the standard deviation in this case would be 6.6
2. In #1 above you are ignoring the variance associated with the shifts in the means of the population samples and you are assuming the variance of each independent sample (within sample variance) is somehow representative of the overall population.
If you change your point of view and wish to think of your samples as being grab samples, none of which can be viewed as representing the overall population then you will need to have some way of including the between population variance.
As far as I know, if all you have are the numbers in your original post, there isn’t any way to get an estimate of the population standard deviation with the degree of precision of the estimate when you pooled within sample variances.
An estimate of this kind of population standard deviation could be developed by taking the lowest and highest means with their associated standard deviations, computing the lower 95% (or 99%) limit for the lowest mean and the upper limit for the highest mean, and treating the difference between these two values as an estimate of the range. A rough estimate of the standard deviation of a population is then the range/6.
In your case this would amount to the following:
lowest “value” = 26 - 3.18*8 = 0.56
highest “value” = 34 + 2.36*3.9 = 43.2
(43.2 - 0.56)/6 = 7.1
In this case there isn’t a great deal of difference between the two because there aren’t any significant differences between the population means you have listed. However, if there had been a significant shift in one of the population means (say, instead of a mean of 34 it was a mean of 44, with the same standard deviation of 3.9 and the same count of 8), then the range estimate for the overall standard deviation would be 8.7 while the pooled estimate would remain at 6.6.
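The two estimates described in this post can be sketched in Python as follows. This is only an illustration: the function names `pooled_sd` and `range_sd` are invented for the example, and the multipliers 3.18 and 2.36 are copied from the arithmetic above (they appear to be the two-sided 95% t-values for 3 and 7 degrees of freedom, which is an assumption about how they were chosen).

```python
import math

# (n, mean, sd) for each sample, from the original post.
samples = [
    (14, 28.9, 7.5),  # A
    (12, 31.3, 6.6),  # B
    (4, 26.0, 8.0),   # C (lowest mean)
    (8, 34.0, 3.9),   # D (highest mean)
]


def pooled_sd(data):
    """Pooled SD: sum of (n - 1) * s^2 divided by (total n - k)."""
    k = len(data)
    numerator = sum((n - 1) * sd ** 2 for n, _, sd in data)
    denominator = sum(n for n, _, _ in data) - k
    return math.sqrt(numerator / denominator)


def range_sd(low, t_low, high, t_high):
    """Rough SD estimate: (highest 'value' - lowest 'value') / 6."""
    _, mean_lo, sd_lo = low
    _, mean_hi, sd_hi = high
    lowest = mean_lo - t_low * sd_lo
    highest = mean_hi + t_high * sd_hi
    return (highest - lowest) / 6


pooled = pooled_sd(samples)
rough = range_sd(samples[2], 3.18, samples[3], 2.36)

# Same comparison with sample D's mean shifted from 34 to 44.
shifted_d = (8, 44.0, 3.9)
rough_shifted = range_sd(samples[2], 3.18, shifted_d, 2.36)
pooled_shifted = pooled_sd(samples[:3] + [shifted_d])

print(f"pooled SD: {pooled:.2f}, range-based SD: {rough:.2f}")
print(f"after shift: pooled SD: {pooled_shifted:.2f}, range-based SD: {rough_shifted:.2f}")
```

Because the pooled formula never uses the means, shifting one mean leaves the pooled value unchanged, while the range-based estimate grows.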
Thanks so much for this reply.
I’ve just realised where some ambiguity might be arising, and I should be more precise about what I meant in the question. When I said I wanted to calculate the overall SD, I meant only in the actual dataset that the subgroup SDs represent, not in the wider set of possible measurements.
Thanks again
Robert Butler:
I sent you an email asking a regression question a few days back. I presume you haven’t checked your email for a while or might be really busy. If you can give me an answer by the end of this week, it would be a big help for me. As always, thanks a million for the help.
Deep
Here is a table representing the Excel sheet to calculate what you need.
The formulae are to the right of each cell, except for cell C7.
The workflow was to use the SDs of each group to calculate SS(within), then use the weighted average to get the SS(between), and add the two to get SS(total). You only have to divide each SS by the DF to get the variances, and root those to get the standard deviations.
The overall standard deviation of the data (if you had it) would have been 6.84. The best number for the ‘average’ standard deviation within each group, the pooled standard deviation, is 6.66.
Cheers, BTDT

Subgroup  N   Mean   SD    SS(within)  Formula        SS(between)  Formula
A         14  28.90  7.50  731.25      =D2^2*(B2-1)   32.61        =(C2-C$7)^2*B2
B         12  31.30  6.60  479.16      =D3^2*(B3-1)   9.16         =(C3-C$7)^2*B3
C         4   26.00  8.00  192.00      =D4^2*(B4-1)   78.37        =(C4-C$7)^2*B4
D         8   34.00  3.90  106.47      =D5^2*(B5-1)   102.17       =(C5-C$7)^2*B5
Totals    38  30.43        1508.88     =SUM(E2:E5)    222.31       =SUM(G2:G5)

Source     DF  SS       Formula         SD    Formula
Subgroups  3   222.31   =G7
Error      34  1508.88  =E7             6.66  =SQRT(C16/B16)
Total      37  1731.19  =SUM(C15:C16)   6.84  =SQRT(C17/B17)
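The same roll-up can be sketched outside Excel. The Python below is a minimal illustration, assuming only the n, mean, and SD of each subgroup are available; the function name `combine` and the dictionary keys are invented for the example. It rebuilds SS(within) from the subgroup SDs, SS(between) from the weighted grand mean, and takes each standard deviation as the square root of SS divided by its DF.

```python
import math

# (n, mean, sd) for subgroups A, B, C, D from the original post.
groups = [
    (14, 28.9, 7.5),
    (12, 31.3, 6.6),
    (4, 26.0, 8.0),
    (8, 34.0, 3.9),
]


def combine(data):
    """Roll subgroup summaries up into within, between, and total terms."""
    n_total = sum(n for n, _, _ in data)
    # Grand mean is the mean of the subgroup means weighted by n.
    grand_mean = sum(n * mean for n, mean, _ in data) / n_total
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in data)
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in data)
    ss_total = ss_within + ss_between
    df_within = n_total - len(data)
    df_total = n_total - 1
    return {
        "grand_mean": grand_mean,
        "ss_within": ss_within,
        "ss_between": ss_between,
        "ss_total": ss_total,
        "pooled_sd": math.sqrt(ss_within / df_within),
        "total_sd": math.sqrt(ss_total / df_total),
    }


result = combine(groups)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The sums of squares printed by this sketch should match the Totals row of the table (1508.88 and 222.31, adding to 1731.19). Since no raw data are needed, this gives the standard deviation of the actual combined dataset, up to rounding in the reported means and SDs.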