Addition and Subtraction of Distributions
This topic contains 8 replies, has 4 voices, and was last updated by Robert Butler 1 week, 2 days ago.
I collected three individual testing data sets for variables X, Y, and Z, and generated three individual distributions from these data sets: f1(X), f2(Y), and f3(Z). The question is how to calculate the distribution f(Z-X-Y) based on these three individual distributions. Note that I cannot make a normality assumption for any of f1(X), f2(Y), or f3(Z) because they all have skewed distributions.
Thanks all for your suggestions in advance.
Zeng
Do you think the distributions are independent? If so, create a third column combining the three values (Z-X-Y) to get an idea of your “thesis”.
If they aren’t independent (not uncommon), then gather data on all three variables simultaneously and combine those three results if Z-X-Y is important to your effort.
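Under the independence assumption, the column-combining idea above can be sketched as a resampling exercise: draw with replacement from each of the three measured data sets and form Z-X-Y many times. The data below is purely illustrative (lognormal stand-ins for the real skewed measurements); only the resampling step is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed data sets standing in for the real X, Y, Z measurements.
x = rng.lognormal(mean=1.0, sigma=0.4, size=500)
y = rng.lognormal(mean=0.5, sigma=0.6, size=500)
z = rng.lognormal(mean=2.0, sigma=0.3, size=500)

# If X, Y, and Z are independent, resample each data set with replacement
# and form the combination of interest, Z - X - Y, many times.
n = 100_000
d = (rng.choice(z, n, replace=True)
     - rng.choice(x, n, replace=True)
     - rng.choice(y, n, replace=True))

# The empirical distribution of d approximates f(Z - X - Y); summarize it
# with quantiles rather than normal-theory limits, since nothing here
# requires normality.
print(np.percentile(d, [2.5, 50, 97.5]))
```

No distributional assumption is needed beyond independence; if the variables are dependent, the rows must instead come from simultaneous measurements, as noted above.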
My understanding of your post is that you have some kind of response and you went out and first ran a series of experiments where you varied X while holding Y and Z at some fixed level, and then did the same thing for Y (holding X and Z constant) and for Z (holding X and Y constant). If this is the case then you can build a simple main effects linear regression model that will predict the response as a function of X, Y, and Z.
As for the issue of distribution normality – this is not an issue. It is quite true that there are numerous web sites, peer reviewed articles, and even some textbooks that will insist Y and/or X need to be normally distributed before one can employ linear regression – all of them are wrong.
The issue of normality applies ONLY to the residuals of the regression and this requirement really should be stated as approximately normal. The reason for the focus on distribution of the residuals is because it is the residuals that are used by the t and the F tests to assess term significance.
The best discussion/proof of the above that I’m aware of are sections 1.2-1.4 (pages 9-33) of Applied Regression Analysis, 2nd Edition by Draper and Smith. The most recent version of this book is the 3rd edition. I don’t have a copy of that text but I’m sure the sections I cited in the 2nd edition are contained in the 3rd edition. If you are interested I’d recommend getting a copy of this book through inter-library loan and reading/copying the sections I cited.
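A minimal sketch of the kind of main-effects model described above, using made-up data (all values and coefficients below are illustrative, not from the thread). The point is that the normality check belongs to the residuals, not to X, Y, Z, or the response:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical experimental data: a response measured while varying
# X, Y, and Z. Names and values are purely illustrative.
X = rng.uniform(0, 10, 60)
Y = rng.uniform(0, 5, 60)
Z = rng.uniform(0, 2, 60)
response = 3.0 + 1.5 * X - 2.0 * Y + 4.0 * Z + rng.normal(0, 1, 60)

# Main-effects linear model: response = b0 + b1*X + b2*Y + b3*Z
A = np.column_stack([np.ones_like(X), X, Y, Z])
coef, *_ = np.linalg.lstsq(A, response, rcond=None)

# Normality matters only for these residuals, which feed the t and F
# tests of term significance -- not for the raw X, Y, Z, or response.
residuals = response - A @ coef
print("coefficients:", coef)
print("residual mean:", residuals.mean())
```

In practice one would plot a histogram or normal probability plot of `residuals` and ask only for approximate normality, as the post above states.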
Hi,
It is not important whether your data is normal or not. What is important is whether you have data from a process that generates a normal distribution, because the subset of data you collected may have been affected by special causes at the time of data collection. This will make the distribution skewed.
As pointed out, this data can be considered as coming from a normal distribution, and regression can be applied.
It would be better if you could explain more about what your Xs and Y are.
@Suraj.singh, I’m having difficulty understanding the point of your post. My interpretation of your statement is that you seem to think that normally distributed response data is somehow a guarantee that the measured process is not being impacted by special causes. If this is what you are saying, it is incorrect. There is no connection between the presence or absence of special causes and the distribution of the measured response. This is true regardless of the output distribution.
My understanding of your second statement is that you believe the distribution of the X’s and/or the Y’s must be normal before employing linear regression. If this interpretation is correct, you are again in error. A careful reading of the mechanics and basic assumptions of linear regression will show that the issue of distribution applies only to the residuals – the distributions of the X’s and the Y’s have no bearing on whether or not the use of linear regression is appropriate.
I’ve seen others make that statement about regressions needing normal data….where/when did that horrible impression creep into people’s minds?
I have a theory…
@cseider it probably comes from the same people who have provided us with other nifty alt-facts of statistics such as:
1. You should try to have sample sizes of 30 because the central limit theorem states that samples of this size will be normally distributed.
2. In order to use the t-test the data must be normal.
2a. In order to use the t-test the sample sizes must be equal.
2b. In order to use the t-test the variances of the samples must be equal.
2c. (…and the top whopper) In order to use the t-test the sample size must be at least X (where X seems to float between 6 and 30).
3. If the data is normally distributed your process is under control and no special causes are present.
4. The data must be normal before one can use a control chart.
5. Data that is not normally distributed guarantees the presence of special causes.
6. Linear regression only applies to those situations where the relationship between the X and the Y is a straight line.
…and so on and so forth.
…and just in case someone new to the world of six sigma and/or statistics reads this post and fails to note the opening comment concerning alt-facts: every point listed above is false, and that is false as in wrong, wrong, wrong.
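Point 6 in particular can be demonstrated in a few lines: “linear” regression means linear in the coefficients, not a straight-line relationship between X and Y. A quadratic fit is still an ordinary linear regression problem. The data below is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# A deliberately curved relationship: y = 1.0 + 0.5*x + 2.0*x^2 + noise.
x = np.linspace(0, 4, 50)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 0.5, 50)

# The design matrix contains x and x**2, but the model is still linear
# in the coefficients, so ordinary least squares applies unchanged.
A = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("fitted coefficients:", coef)  # should roughly recover 1.0, 0.5, 2.0
```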
@rbutler Nicely done. ;)
We do know that traditionally t-tests are considered for normal distributions. I know I’ve read your comments and other research saying they are robust enough with non-normal data, but IF one has to “pick” the test to get the probability wanted, it might not be a root cause being tested ;) I don’t mind advising folks to use the other tests if they find their data is non-normal (and can’t find out-of-control reasons, for example, for the non-normality). It keeps things clean and helps with flow diagrams (don’t blast me too much for being conservative on this) :)
A lot of your other points are nuances that can’t be learned in a class. Remember those days when one felt like they were drinking from a fire hose during the Analyze week of typical six sigma training? Those lessons you properly jest about MIGHT be learned through good mentoring from an experienced mentor.
#3 and #4 are examples where an inexperienced mentor would wonder why you listed them. :) I do chuckle at the whole list.
Stay cool (literally) and not metaphorically, LOL.
@cseider Contrary to how some of my posts about the t-test sound, I too am not averse to using other tests to check for population differences. The fact is that the Wilcoxon-Mann-Whitney test can be used anywhere the t-test is used, and it can handle crazy non-normal data and find a significant difference where a t-test with the same data would not detect a difference.
In those cases where I’m teaching people about the t-test or where I’m helping someone use their handy-dandy stat package I tell them that since, in most cases, we are talking a couple of mouse clicks, they should run both the t-test and the Wilcoxon-Mann-Whitney on the same set of data and see how they compare. I also tell them to run histograms of the two populations and keep the test and histogram results in a summary file.
When they finally do get a case where the two tests disagree, they will have the results of the two tests and the histograms of the two populations. When they compare the histograms of the populations where the two tests disagreed with the histograms of the cases where they didn’t, they will have an eye-calibration chart which will give them an excellent understanding of just how wild things have to get before the t-test fails to detect a significant difference. They will also have developed an understanding of, and a level of confidence in, the two tests that they cannot get by any other means.
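The run-both-tests habit described above really is a couple of lines in most stat packages. A sketch using SciPy, with invented skewed samples (the exponential data and the 0.8 shift are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Two illustrative skewed samples: same shape, one shifted upward.
a = rng.exponential(scale=1.0, size=40)
b = rng.exponential(scale=1.0, size=40) + 0.8

# Run both tests on the same data and compare the p-values.
t_stat, t_p = stats.ttest_ind(a, b)
u_stat, u_p = stats.mannwhitneyu(a, b, alternative="two-sided")
print(f"t-test p = {t_p:.4f}, Wilcoxon-Mann-Whitney p = {u_p:.4f}")

# Keep histograms of both samples alongside the test results, as the
# post suggests, to build the "eye-calibration" file over time.
hist_a, edges_a = np.histogram(a, bins=10)
hist_b, edges_b = np.histogram(b, bins=10)
```

Saving the two p-values and the two histograms per comparison is all the summary file needs.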