Hypothesis Testing Concept of Pvalue

Six Sigma – iSixSigma Forums Old Forums General Hypothesis Testing Concept of Pvalue

Viewing 14 posts - 1 through 14 (of 14 total)

    bach huss

    Hi all,
    Just a question on the P-value concept used to reject/accept Null Hypothesis.
    What I understand is that the P-value shows the probability that the null hypothesis is true. Why are we using 0.05 as a guideline and not a bigger number? It seems that 0.05 is too small.
    Thanks in advance,



    Your p-value represents the degree of evidence against your null hypothesis, so a lower p-value such as .001 provides more evidence against the null hypothesis than does a higher p-value such as 0.10, which gives you little evidence against the null hypothesis.  The p-value of 0.05 gives a generally accepted cut-off point for comfortably (in most cases) rejecting the null hypothesis.  P-values represent one of the few validated cases where bigger is truly not better. 


    Chris Butterworth

    Hi Bach,
    Say you have batteries with known capacities (5 hours mean with 0.2 hrs std dev) and a new battery design is discharged and has a capacity of 5.6 hours (you don’t usually get a lot of these). It is possible that this new design is no different from the old, and that you selected this one battery at random and it just happened to be on the very high end of the distribution of the old battery. This is the null hypothesis (no difference), and the p-value is the probability of testing one battery from the old population and getting a discharge time of 5.6 hours or more. If that p-value is small, you would tend to be uncomfortable accepting the result as a random selection from the same population, and so you would conclude that the new battery design is indeed different (better). That is, you would reject the null hypothesis.
    The 0.05 value you mention is the alpha risk. This is the risk of rejecting the null hypothesis when it is true, and you set this value before the test is run. In any test, there are two correct decisions (accept the null when it is true and reject the null when it is false) and two incorrect decisions (reject the null when it is true and accept the null when it is false). You want the probabilities of the two incorrect decisions to be small.
    Hope this helps
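    The arithmetic behind the battery example can be sketched in a few lines of Python. The 5-hour mean, 0.2-hour standard deviation, and 5.6-hour observation are the post’s illustrative numbers, and a normal distribution for the old design is assumed:

```python
import math

# Battery example: old design assumed ~ Normal(mean=5.0 h, sd=0.2 h);
# one new battery lasts 5.6 h. Under the null ("no difference") the
# p-value is the chance of seeing 5.6 h or more from the old population.
mu, sigma = 5.0, 0.2
observed = 5.6

# z-score of the observation under the null hypothesis
z = (observed - mu) / sigma

# one-sided p-value: standard normal upper tail, via the
# complementary error function
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(f"z = {z:.2f}, p-value = {p_value:.5f}")  # z = 3.00, p-value = 0.00135
```

    A p-value this small (about 0.1%) is well below the usual 0.05 cut-off, which is why you would reject the null hypothesis and call the new design better.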


    bach huss

    Thank you for your replies. I think I understand now. Just to check my understanding: the p-value is not the probability that Ho is true, but rather the risk (alpha) of rejecting the null hypothesis when it is actually true.



    I’m sure the veterans on this site have seen this before but for the rest … an interesting little game that you can do in a workshop that shows the comfort level associated with an alpha risk of 0.05.
    Ahead of time, take two decks of cards and sort them so that you combine all the black cards into one “deck”. In the workshop, produce the deck and ask someone to pull a card and show it to the group. Keep repeating this, and each time tell the group the probability of the next card also being black. Keep in mind the class doesn’t know you have rigged the deck; they expect that on each draw there is a 50/50 chance of producing a red card (this is their null hypothesis).
    After the third black card in a row you might have a couple of raised eyebrows, and at the fourth some mutterings might be heard, but by the fifth black card in a row most of the class will probably be making accusations that something isn’t right. Surprise, surprise: you have just dropped below a 5% probability of drawing that many consecutive black cards.
    This is the amount of chance/risk most people are comfortable accepting before they reject their initial hypothesis that there was a 50/50 chance of drawing a red card.
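    A quick check of that arithmetic, assuming draws without replacement from a fair 52-card deck (26 black); the class’s naive 50/50-per-draw model gives similar numbers:

```python
from fractions import Fraction

# Chance of k black cards in a row from a FAIR deck, drawn without
# replacement: (26/52)(25/51)...((26-k+1)/(52-k+1)).
def p_all_black(k):
    p = Fraction(1)
    for i in range(k):
        p *= Fraction(26 - i, 52 - i)
    return p

for k in range(1, 6):
    print(f"{k} black in a row: {float(p_all_black(k)):.4f}")
# The fourth draw is still above 0.05; the fifth dips below it,
# matching the point in the thread where the class cries foul.
```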



    You are right. The p-value is the probability of making a Type I error (the alpha risk), that is, of rejecting the null hypothesis when it is true.
    Naturally you would like it to be as small as possible. Most often the cut-off is chosen as 0.05, but you can change that if you want.



    To say it in terms of probability, the p-value is the probability of getting a result at least that far from Ho if Ho were true.
    For example, let’s say that you have a coin you suspect is not fair and delivers more heads than tails. In that case Ho would be heads = 50% and H1 heads > 50%. So you decide to make an experiment: you will toss the coin 10 times and see if the number of heads is “significantly” greater than 50%. But what would “significantly” be?
    Let’s assume Ho is true. If the chance of heads really were 50%, these would be the chances to get that number of heads or more in 10 tosses:

    The probability to get 0 or more heads would be 1.000
    The probability to get 1 or more heads would be 0.999
    The probability to get 2 or more heads would be 0.989
    The probability to get 3 or more heads would be 0.945
    The probability to get 4 or more heads would be 0.828
    The probability to get 5 or more heads would be 0.623
    The probability to get 6 or more heads would be 0.377
    The probability to get 7 or more heads would be 0.172
    The probability to get 8 or more heads would be 0.055
    The probability to get 9 or more heads would be 0.011
    The probability to get 10 heads would be 0.001

    These are the p-values. At some point one says “the result of the experiment would be very unlikely if the null hypothesis were true, so I will not sustain the assumption that it is true and will look somewhere else (other than chance) for the cause of this result”.
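    Those tail probabilities follow directly from the binomial distribution with n = 10 and p = 0.5, and can be reproduced in a few lines of Python:

```python
from math import comb

# One-sided p-value for the coin example: P(k or more heads in
# 10 tosses of a fair coin), summing binomial probabilities.
def p_at_least(k, n=10):
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

for k in range(11):
    print(f"{k:2d} or more heads: {p_at_least(k):.3f}")
# 8 or more heads gives p = 0.055, just above the usual 0.05 cut-off;
# 9 or more heads gives p = 0.011, comfortably below it.
```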



    The Ha would be that:
    [ Heads (significantly) > or (significantly) < Tails ] (not uni-directional)
    The Ho would be that:
    [ Heads = Tails ] (not [ Heads = 50% ]; imagine if it were possible to land on the edge 33% of the time, then the Ho would be [ Heads = Tails = Edge ])
    Hypothesis testing looks for differences in samples, and the p-value gives you the probability of seeing a difference that large if they are in fact not different.



    You are wrong about Ha. The alternate can be either > (greater than…), < (smaller than…), or ≠ (not equal to…). This is the difference between the one-tailed and the two-tailed tests.
    About Ho, I assumed that P(landing on the edge) = 0, so “Heads = Tails” = “Heads = 50%”. But I take your point that a coin could actually land on its edge.



    Agreed, Gabriel,
    I was merely trying to illustrate the fact that the Ha is that the samples are NOT equal, not necessarily that Heads > Tails. If Heads < Tails, the samples still differ.
    As far as the edge possibility, if you are testing all scenarios (i.e. P(Heads) = P(Tails) = P(Edge)), a low p-value would result unless P(each) ≈ 33%.
    I hope this helps you make “Heads or Tails” out of my attempt to clarify!    ;-)
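    The three-outcome version of the test can be sketched as a chi-square goodness-of-fit check of Ho: P(heads) = P(tails) = P(edge) = 1/3. The counts below are made up purely for illustration:

```python
# Hypothetical observed counts for 100 tosses of a thick coin that
# can land on its edge (illustrative numbers, not from the thread).
observed = {"heads": 40, "tails": 35, "edge": 25}

n = sum(observed.values())
expected = n / 3  # equal probabilities under Ho

# chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((obs - expected) ** 2 / expected for obs in observed.values())

# Critical value for alpha = 0.05 with df = 2 (three categories - 1)
CRITICAL_5PCT_DF2 = 5.991
print(f"chi2 = {chi2:.3f}, reject Ho: {chi2 > CRITICAL_5PCT_DF2}")
```

    With these counts the statistic (3.500) stays below the 5.991 critical value, so this sample alone would not let you reject equal probabilities at the 5% level.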



    We still don’t agree about Ha.
    If you set Ha: Heads > Tails (one-tailed test), then a too low number of heads in the sample will not give you a small p-value.
    If you set Ha: Heads ≠ Tails (two-tailed test), then both a too high and a too low number of heads in the sample will give you a small p-value. But you will need more heads than in the one-tailed test to get the same low p-value on the “too many heads” side, because an alpha = 0.05 test leaves all 5% at one tail when it is a one-tailed test, but leaves 2.5% on each tail when it is a two-tailed test.
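    The one- vs two-tailed difference can be illustrated with the same 10-toss coin experiment; 8 heads is an assumed outcome chosen for illustration:

```python
from math import comb

# P(k or more heads in n tosses) under Ho: fair coin.
def tail(k, n=10):
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

heads = 8
one_tailed = tail(heads)      # Ha: heads > tails
two_tailed = 2 * tail(heads)  # Ha: heads != tails; the binomial(0.5)
                              # distribution is symmetric, so the
                              # two-tailed p doubles the one-tailed p

print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
# one-tailed p = 0.0547, two-tailed p = 0.1094
```

    The same 8-heads result that is borderline in a one-tailed test is clearly non-significant in a two-tailed test, which is exactly the point about splitting alpha between the tails.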



    Ah, yes!
    I forgot about the “less than, not equal, greater than” test option…  Thank you for refreshing my memory!


    Marc Richardson

    The decision to act or not act based on a p-value of 0.05 is a convention. Basically it comes down to this: are you willing to make a million-dollar process change on a probability where the p-value is 0.10? I might make a $1,000 change on that basis, but not a $1,000,000 change.
    Marc Richardson
    Quality Assurance Manager



    Bach Huss,
    This is a longer response, but I hope after reading it, you will be able to make your own decisions as to how to treat a “significant result” in a hypothesis test.
    1. The origins of the 5% p-value
    The p-value of 5% was developed by Fisher and popularized in his Statistical Methods for Research Workers (1925). He chose the p-value of 5% because he considered it reasonable to reject the null hypothesis if the “probability” of it being “true” is only 1/20 = 5%. There is no other reason than the acceptance of Fisher’s criterion by his students and the scientific community in general (during the 1940s and 1950s).
    His famous initial remark is as follows: “If therefore we know the variance of a population, we can calculate the variance of the mean of a random sample of any size and so test whether or not it differs significantly from any fixed value. If the difference is many times greater than the standard error, it is certainly significant, and it is a convenient convention to take twice the standard error as the limit of significance; this is roughly equivalent to the corresponding limit p = 0.05 already used for the chi-square distribution” (Fisher 1925, paragraph 2).
    He reasserted this position in his “The Design of Experiments” from 1935 (p. 13): “It is usual and convenient for experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results”.  But he also writes: “It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result” (same page).
    2. Alpha risk, truth and rejection of a hypothesis: scientific inference vs. acceptance procedure
    The term alpha risk was introduced by Neyman and Pearson in their famous 1933 paper. It introduced into the logic of scientific inference in experimental designs the argument that scientific inferences can be logically and statistically dealt with like acceptance procedures in business. Acceptance sampling was developed by Dodge and Romig (1928) at the Bell laboratories. Dodge and Romig introduced the terms consumer risk vs. producer risk, which Neyman and Pearson then translated into alpha risk and beta risk. Thus, Neyman and Pearson modeled the practice of scientific inference making in experimental designs on the logic of the acceptance sampling model of business decision making. And here is where the confusion about the “truth” of the “rejection of a hypothesis” based on a predetermined “risk of acceptance” comes from. There are two schools of thought:
    Neyman and Pearson’s position is that statistical inference making is a rational decision that is governed by the mathematical models of decision making, with the ultimate goal to “accept” or “reject” a hypothesis. Two errors can be made (alpha and beta). But ultimately, they make the assumption that the acceptance or rejection of a hypothesis is an either/or decision, just like a business accepts or rejects a “lot” from its suppliers.
    Fisher starts from the premise that “Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis” (The Design of Experiments, p. 16).  He further goes on as follows (p. 25-26): “… in learning by experience, or by planned chains of experimentation, conclusions are always provisional embodying the evidence so far accrued. Convenient as it is to note that a hypothesis is contradicted at some familiar level of significance such as 5% or 2% or 1% we do not, in Inductive Inference, ever need to lose sight of the exact strength which the evidence has in fact reached, or to ignore the fact that with further trial it might come to be stronger or weaker. The situation is quite different in the field of Acceptance Procedures, in which irreversible action may have to be taken, and in which, whichever decision is arrived at, it is quite immaterial whether it is arrived at on strong evidence or weak (…) In the field of pure research no assessment of the cost of wrong conclusions can conceivably be more than a pretence. (…) Such differences between the logical situations should be borne in mind whenever we see tests of significance spoken of as ‘Rules of Action’. A good deal of confusion has been caused by the attempt to formalize the exposition of tests of significance in a logical framework different from that for which they were in fact first developed.”
    3. Confidence and confidence intervals
    The discussion, of course, also extends to “confidence intervals”. For Neyman and Pearson they are defined purely mathematically and statistically using the relevant formula. Fisher’s position is, not surprisingly, somewhat different. In his 1925 book (p. 126) he writes: “The confidence to be placed in a result depends not only on the magnitude of the mean value obtained, but equally on the agreement between parallel experiments”. In a sense, Neyman and Pearson are situated in the tradition of the “critical experiment”, while Fisher is situated in the tradition of accumulated knowledge.
    I hope this will help you make your own intelligent decisions as to how to treat a “significant p-value” in any of your projects.

