# Please advise: Statistical Significance


This topic contains 9 replies, has 9 voices, and was last updated by Evaluation of gain scores 12 years, 8 months ago.

- October 8, 2006 at 3:57 am #44825

**Forrest** (Participant)

I've been reading this forum on the topic of statistical significance and p-values, but I need additional help understanding and articulating the concepts.

SOME CONTEXT:

I recently finished a project to increase staff productivity. The result was a practical increase of 25%, and a paired t-test returned a p-value of 0.000.

Financial benefits were estimated based on the increase in exams performed. My sponsor insists that the financial benefits must be validated by "statistical significance." Although the number of exams performed increased, a paired t-test on that measure resulted in a p-value of 0.327.
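For reference, a paired t-test like the ones described can be sketched in a few lines. The before/after exam counts below are made up for illustration, not the project's data; the critical value is the standard two-sided t-table entry for df = 7 at alpha = 0.05.

```python
import math
import statistics

# Made-up exam counts per staff member, before and after the change
before = [40, 35, 50, 45, 38, 42, 47, 36]
after = [52, 44, 61, 55, 50, 53, 58, 46]

# Paired t-test: test whether the mean per-person difference is zero
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Two-sided critical value for df = 7 at alpha = 0.05 is about 2.365
print(f"t = {t_stat:.2f}, reject H0: {abs(t_stat) > 2.365}")
```

With real data you would also check the p-value directly (e.g. from statistical software), but the comparison of |t| against the tabled critical value is the same decision.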

MY QUESTIONS:

In terms of projects and results, when is it appropriate to test for statistical significance?

Maybe better said, when is it unnecessary?

How do I articulate why it is or isn't necessary to conduct the paired t-tests to validate results?

Your advice is appreciated.

- October 8, 2006 at 7:20 am #144393

Hi Forrest,

I encounter the same problem you are facing; it is quite common in Six Sigma projects. You usually end up in this situation because you applied the wrong metric and set the wrong target. When setting the target, ask yourself:

What is the minimum performance improvement required for a significant impact on the financial benefit? In theory, the answer should be your new target. In practice, however, such a target is not always achievable, so you shouldn't restrict yourself to the term "statistical significance." If you are satisfied with the improved situation and you are able to sustain it, it is all right to close the project. Revisit the project after six months and set a new target for breakthrough improvement. Good luck.

- October 16, 2006 at 1:13 pm #144872

I may not fully understand the meaning of your data, but I tend to agree with your sponsor. If it appears that there is a practical difference but the p-value does not support this assumption, then you cannot reject the null hypothesis. It is possible that the change you see is just normal process variation. You could be sampling from a different part of the population, giving you a sense of improvement where none really exists. My recommendation is to put the data into a process behavior chart and see what is really going on.

- October 16, 2006 at 2:32 pm #144885

**Dr. Mikel Harry** (Member)

You will use a paired t-test on continuous data with the same sample.

Now, for the test of significance, you want to show that something has changed, i.e. that "after" is better than "before." This is reflected in the alternative hypothesis, for which you would expect a p-value of p < 0.05.

- October 16, 2006 at 2:41 pm #144888

A quick primer on statistical significance:

Unfortunately "statistical significance" is not a good term; "statistical detectability" would be better. The fact that something is statistically "significant" does not mean that it has practical significance.

The meaning of "p" is often misunderstood. p is the probability of exceedance: the probability that you will get a result at least as "impressive" as the result you actually got, given that (or "conditioned on") the null hypothesis is true. There are several pitfalls in this. The null hypothesis is not just the hypothesis of no difference; it also includes conformity to various conditions, usually that the data are normally, independently, and identically distributed (NIID). Most tests (e.g., t-tests, ANOVA) are fairly robust to the normality assumption (although the F-test is more sensitive). However, the critical condition is "identically distributed": any special-cause variation invalidates the idea that there is a single distribution at all, and prevents any valid conclusions being drawn.

The logic is often reversed: the probability that you will get or exceed a result given that the null hypothesis is true IS NOT the same as the probability that the null hypothesis is true given the result. (To confuse the two is a logical fallacy known as "transposition of conditioning.") The two ARE related through Bayes' theorem, but this highlights the subjectivity of statistical inference, which is a controversial subject.
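A toy numeric illustration of that transposition, with all probabilities invented for the example: even when the result has a "p-value-like" probability of 0.05 under the null, the probability of the null given the result can be quite different once a prior and an alternative are assumed.

```python
# Toy illustration: P(result | H0) is not P(H0 | result).
# All three input probabilities below are invented assumptions.
p_h0 = 0.5                # prior probability that the null is true
p_data_given_h0 = 0.05    # probability of a result this extreme under H0
p_data_given_h1 = 0.30    # probability of such a result under the alternative

# Bayes' theorem: P(H0 | data) = P(data | H0) * P(H0) / P(data)
p_data = p_data_given_h0 * p_h0 + p_data_given_h1 * (1 - p_h0)
p_h0_given_data = p_data_given_h0 * p_h0 / p_data

print(round(p_h0_given_data, 3))  # about 0.143, not 0.05
```

The point is only that the two conditional probabilities differ; the posterior depends entirely on the assumed prior and alternative, which is the subjectivity mentioned above.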

The fact that your data may not be NIID (except possibly in a designed experiment) is the reason a control chart is a better method: special-cause variation other than that between your two groups will invalidate your conclusions.

- October 16, 2006 at 3:03 pm #144890

**Ovidiu Contras** (Participant)

Forrest,

I hope I understand your question, so:

IF:

1. A 25% increase in productivity was achieved and proved by the 2-sample t-test for the before and after intervals (assuming productivity is measured in some form of continuous data)

2. The 2-sample t-test on the financial results showed, for the same before and after intervals, that they are not statistically different

THEN:

The first thing I'd question is the relationship between the productivity increase and the financial results.

As per your questions: you'll always have to prove the difference between the before and after situations. If the improvement is obvious and you can see the difference just by plotting the distributions (they don't overlap), there's no real need to perform the test.

Hope this helps.

- October 16, 2006 at 3:07 pm #144891

Ovidiu Contras brings up a great point. If there are other factors that could change the result, then your metric isn't appropriate. Since you have detected what you think is a practical difference, this may not be the case, but it is something to think about.

Is your data normal?

- October 16, 2006 at 3:19 pm #144893

**Training Evaluations** (Member)

Forrest,

There are two models for evaluating the impact of training on human performance: the Kirkpatrick model and the Glass model. Both models use a logic that ties performance improvement to training interventions via 4 (Kirkpatrick) or 5 (Glass) logically related steps:

Step 1: Satisfaction with training: This is assumed to be a prerequisite for effective information processing for skills and knowledge acquisition

Step 2: End-of-class evaluations: This ensures that learning occurred. The most frequently used method is a paired t-test on pre-training scores and post-training scores using the same or an equivalent educational test.

Step 3: Transfer of knowledge/skill to the workplace: This is done through interviews, observations, etc.

Step 4: Measurement of performance: Increase vs. decrease in terms of what the critical outcome measurement is

Step 5 (Glass only): Translation into dollar figure.

Regarding the significance of the tests: if you have small sample sizes, you will have little power. So what you will have to do is identify what difference in scores is practically relevant and then collect data until you have the power to test the hypothesis. Also, an improvement in test score by itself does not prove that the intervention was successful. Glass (Return on Investment in Training or Training Evaluation Handbook) has a detailed discussion of the issues you face. Good luck.

- October 16, 2006 at 7:06 pm #144909

**Jonathon Andell** (Participant)

You might consider displaying the scores in a control chart format. With paired t-tests, there are several options:

1. Alternate each person's before/after scores and observe the "zig-zag" pattern.
2. Display all "before" scores and all "after" scores in the same sequence, and observe whether the pattern repeats at a different mean.
3. Display only the difference (either absolute scores or percentages) between before and after.

If one of those displays shows an unmistakable change, the hypothesis test becomes a formality. If the change is not blatantly obvious, the hypothesis test tells us "how sure" we are that the difference is "real." Crudely, the p-value can be interpreted as the probability that the difference was just due to a random draw of data; therefore, 1 − p is more or less the probability that the change was "real." Whether the magnitude of the change meets the business needs is a non-statistical question, which you have to answer separately.

PS: don't forget to test whether the amount of variation changed, too!
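The third display option above can be sketched as an individuals chart of the per-person differences. The scores below are hypothetical; 2.66 is the standard individuals-chart constant applied to the average moving range.

```python
import statistics

# Hypothetical before/after scores for eight people
before = [62, 58, 65, 60, 63, 59, 61, 64]
after = [70, 66, 71, 68, 72, 65, 69, 73]
diffs = [a - b for a, b in zip(after, before)]

# Individuals-chart limits on the differences: center line plus/minus
# 2.66 times the average moving range between consecutive points
center = statistics.mean(diffs)
moving_ranges = [abs(diffs[i] - diffs[i - 1]) for i in range(1, len(diffs))]
mr_bar = statistics.mean(moving_ranges)
ucl = center + 2.66 * mr_bar
lcl = center - 2.66 * mr_bar

print(f"center = {center:.2f}, LCL = {lcl:.2f}, UCL = {ucl:.2f}")
```

Plotting `diffs` against these limits shows at a glance whether the before/after change is stable or mixed with special-cause variation.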

- October 16, 2006 at 8:34 pm #144925

**Evaluation of gain scores** (Participant)

I have attached a typical scenario you may face when evaluating a "gain score," i.e. a training or performance score before and after an intervention. This is an idealized case to help you run your own diagnostics:

Raw data:

| Before | After | Tenure | No tenure | High tenure | Gain score |
| ------ | ----- | ------ | --------- | ----------- | ---------- |
| 30 | 37.5 | 0 | 1 | 0 | 7.5 |
| 20 | 25 | 0 | 1 | 0 | 5 |
| 35 | 43.75 | 0 | 1 | 0 | 8.75 |
| 99 | 99 | 1 | 0 | 1 | 0 |
| 90 | 90 | 1 | 0 | 1 | 0 |
| 99 | 99 | 1 | 0 | 1 | 0 |

The essence of the table: the paired t-test is not significant. However, low-tenured employees show an increase of 25%, while high-tenured employees show a 0% increase. You often find that training interventions have a higher impact on newer employees than on experienced ones. There might be other moderating variables such as location, level of training, etc. The point I want to get across is that overall there may not be a significant difference, but once you dig deeper you'll find why, on the aggregate, you cannot find that difference:

Problem 1: Low sample size. Just run a paired t-test on the six data points, and then double the sample size by copying and pasting the first two columns. The p-value of the t-test is very much affected by sample size even though the gain scores may be identical. (Statisticians say that the null hypothesis is always wrong. :-)
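To see Problem 1 concretely, the sketch below computes the paired-t statistic by hand on the six rows from the table above and again on the same rows duplicated. The mean gain is unchanged, but |t| grows, so the p-value shrinks.

```python
import math
import statistics

def paired_t(before, after):
    """t statistic of a paired t-test on the per-row differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# The six rows from the table above
before = [30, 20, 35, 99, 90, 99]
after = [37.5, 25, 43.75, 99, 90, 99]

t_once = paired_t(before, after)
t_doubled = paired_t(before * 2, after * 2)  # same values, "sample" doubled

print(round(t_once, 2), round(t_doubled, 2))  # larger |t| => smaller p-value
```

With enough copy-pasting, any nonzero mean difference eventually becomes "significant," which is exactly why practical relevance has to be judged separately.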

Problem 2: Analysis of gain scores. The gain score = after − before. Simply calculate the gain score and review anything you find noteworthy. The previous responses regarding control charts or even run charts may help you identify certain patterns in your data. They can help you pinpoint the "lurking" variable.

Problem 3: Analysis of the correlation between before and after scores. Before you run your paired t-test, regress the after variable on the before variable (after = response; before = factor) and review the residuals. The residuals may tell you that you are dealing with multiple groups, which may lead you to incorporate group membership into your analysis. If you see a low correlation between the scores, you may have had problems with the administration of the test. (In this case you will see that the low-tenured employees have higher variability in their residuals than the high-tenured employees.)
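A minimal sketch of this diagnostic, using the six rows from the table above: fit a plain least-squares line of after on before, then compare the residual spread within each tenure group.

```python
import statistics

# Data copied from the table above
before = [30, 20, 35, 99, 90, 99]
after = [37.5, 25, 43.75, 99, 90, 99]
tenure = [0, 0, 0, 1, 1, 1]  # 0 = low tenure, 1 = high tenure

# Least-squares regression of "after" (response) on "before" (factor)
xbar, ybar = statistics.mean(before), statistics.mean(after)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(before, after))
sxx = sum((x - xbar) ** 2 for x in before)
slope = sxy / sxx
intercept = ybar - slope * xbar

# Residual spread by tenure group
residuals = [y - (intercept + slope * x) for x, y in zip(before, after)]
spreads = {
    grp: statistics.pstdev([r for r, t in zip(residuals, tenure) if t == grp])
    for grp in (0, 1)
}
print(spreads)  # low-tenure residuals vary more than high-tenure ones
```

The grouping of residuals is what reveals the moderating variable that the aggregate paired t-test hides.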

Problem 4: Make sure that you have enough variability in the criterion measure. Often, supervisor ratings show little variability and lack reliability; as a result, you cannot identify differences that truly exist. An MSA on the criterion variable (the y variable) is critical in your project, as you are dealing with people and performance measurements.

Now go to work and dig into your data!

The forum ‘General’ is closed to new topics and replies.