Medicine has made remarkable progress within the lifetime of the oldest members of our society. Evidence from trials has come to replace expert opinion as the arbiter of treatment effectiveness. Following the work of Fisher (1925) and Neyman and Pearson (1933), null hypothesis significance testing (NHST) and the p value have become the cornerstones of clinical research. The attractions of a rule-based ‘algorithm’ approach are that it is easy to implement, permits binary decisions to be made and makes it simple for investigators, editors, readers and funding bodies to count or discount the work. But does that make it reliable? Even in the early days, this approach was controversial. Worse still, as these ideas became broadly adopted, fundamental misinterpretations were embedded in the literature and practice of biomedical research.

Fisher originally proposed using the exact p value for a single trial as an indication of the credibility of the null hypothesis when considered together with all the available evidence.

For several decades, many authors have failed to recognise that all of these probabilities (p, α, β, type I error, type II error) are conditional.
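To make the conditioning concrete, the following is a minimal simulation sketch rather than a definitive analysis. It assumes a two-arm experiment with normally distributed outcomes, a known standard deviation and no true difference between the arms; the function names, sample sizes and seed are illustrative assumptions and are not taken from the article. It estimates how often p falls below 0.05 given that the null hypothesis is true, which is all that α describes.

```python
import random
from statistics import NormalDist, fmean


def two_sided_z_test_p(sample_a, sample_b, sigma=1.0):
    """Two-sided p value for a difference in means, assuming a known sigma."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sigma * (1.0 / n_a + 1.0 / n_b) ** 0.5
    z = (fmean(sample_a) - fmean(sample_b)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def significant_fraction_under_null(n_trials=5000, n_per_arm=20, alpha=0.05, seed=0):
    """Fraction of simulated trials with p < alpha when the null is true.

    Both arms are drawn from the same N(0, 1) distribution, so every
    'significant' result counted here is a false positive by construction.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        arm_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        arm_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        if two_sided_z_test_p(arm_a, arm_b) < alpha:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    fraction = significant_fraction_under_null()
    print(f"P(p < 0.05 | null true) is approximately {fraction:.3f}")
```

The estimate should sit close to 0.05. That figure is a statement about the data given the null hypothesis; it says nothing, on its own, about the probability of the null hypothesis given the data.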

For example, the standard definition of the p value is ‘the probability of having observed our data (or something more extreme)

This error is an example of the fallacy of the transposed conditional or the prosecutor’s fallacy, so called because it may be used by the prosecution to exaggerate the weight of evidence against a defendant in a criminal trial. Similarly, this common misinterpretation of the p value exaggerates the weight of evidence against the null hypothesis. What we actually need is the false discovery rate (FDR), which is the proportion of reported discoveries that are false positives. Some authors prefer the term false positive risk to emphasise that this is the risk that, having observed a significant p value from a single experiment, it is a false positive.
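The gap between the significance threshold and the false positive risk can be sketched with Bayes' theorem. The example below is illustrative only: the prior probability that a tested effect is real and the power of 0.8 are assumed values, not figures from the article, and the function name is hypothetical. It counts every result with p below the threshold as a 'discovery'.

```python
def false_positive_risk(prior: float, alpha: float = 0.05, power: float = 0.8) -> float:
    """P(null is true | result is significant), by Bayes' theorem.

    prior -- assumed probability that a tested effect is real (illustrative input)
    alpha -- P(significant | null true), the type I error rate
    power -- P(significant | effect real), i.e. 1 - beta
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    false_positives = alpha * (1.0 - prior)  # nulls that reach significance
    true_positives = power * prior           # real effects that reach significance
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    for prior in (0.5, 0.1):
        risk = false_positive_risk(prior)
        print(f"prior = {prior}: false positive risk = {risk:.2f}")
```

Under these assumptions, a prior of 0.5 gives a false positive risk of about 0.06, whereas a prior of 0.1 gives 0.36, even though the threshold α is 0.05 in both cases. The answer depends heavily on the prior, which the p value alone does not supply.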

Another example may help to clarify the problem of the transposed conditional. Consider the difference between the probability of having spots

In the last decade, attempts have been made in social science, psychology, medicine and pharmacology to replicate important experiments with very disappointing results. This has led to increasing concern about the validity of much scientific work. Attention has inevitably turned to the statistical techniques and our interpretation of them.

In biomedical research, the effect size is often unknown and frequently small, sampling is difficult and the variance is often large and poorly known. Crucially, we usually do the experiment only once and so have only a single point estimate of the p value. Additionally, our theories are only weakly predictive and do not generate precise numerical quantities that can be checked in quantitative experiments, as is possible in the physical sciences. NHST does not perform well under these circumstances. Added to this are the frequent misunderstandings about the meaning of the p value. To reiterate, the p value is strictly the probability of obtaining data as extreme or more so,

To investigate how common this misunderstanding is in medical textbooks, we audited the definition of the term p value in books held in the medical library of the University of Exeter Medical School at the Royal Devon and Exeter Foundation Trust (

Fraction of correct definitions of the p value found in medical textbooks by library subject classification

| Subject area | Fraction of correct and unambiguous definitions for the p value |
| --- | --- |
| Evidence-based medicine | 3/15 |
| Exam revision | 3/17 |
| Research | 8/13 |
| Statistics | 4/4 |
| Total (all subjects) | 18/49 |

The most common error was to claim that the p value is the probability that the data were generated by chance alone. This definition has been frequently and vigorously refuted in the statistics literature but persists in medical textbooks and journals.

Beyond misinterpretations of p values, there are also widespread problems with multiple testing, sometimes inadvertent, which grossly inflates the proportion of false positive results. This is known as ‘p-hacking’ or ‘data dredging’ and allows researchers to selectively report spurious results as significant.
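The scale of the inflation from multiple testing can be sketched with a short calculation. It assumes that the tests are independent and that every null hypothesis is true, which is a simplification; real analyses usually involve correlated outcomes. The function name is illustrative.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n independent tests.

    Assumes every null hypothesis is true and each test uses threshold alpha,
    so each test independently avoids a false positive with probability 1 - alpha.
    """
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n_tests in (1, 5, 20):
        rate = familywise_error_rate(n_tests)
        print(f"{n_tests:2d} tests: P(at least one false positive) = {rate:.2f}")
```

With a threshold of 0.05, the chance of at least one spurious 'significant' finding rises from 0.05 for a single test to about 0.23 for 5 tests and about 0.64 for 20 tests. Reporting only the significant results from such a set hides this inflation from the reader.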

Given the scale of this problem, what should be done? There are two main areas to address. First of all, we need to teach the correct statistical interpretation of NHST because of the huge volume of trials already published. This has already been attempted without success for at least the last 40 years. Second, we need to move to statistical models that are better suited to current research problems and address some of the shortcomings of NHST. Both of these issues will need the entire research community to change. Research funding agencies, universities and journals must recognise that they have played a key role in promoting a culture where the p value has had primacy over reason. Researchers must resist redefining statistical quantities to suit their own arguments, because it is mathematically wrong to do so. Possible statistical approaches include the use of effect size estimation accompanied by 95% CIs, a reduction in the p value threshold for significance,

We urge authors and editors to demote the prominence of p values in journal articles, have the actual null hypothesis formally stated at the beginning of the article and promote the use of the more intuitive (but harder to calculate) Bayesian statistics.
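As an illustration of reporting an effect size with a 95% CI instead of a bare p value, the sketch below computes a risk difference between two trial arms using the normal (Wald) approximation. The event counts are invented for the example, the function name is hypothetical, and the Wald interval is only one of several possible methods; it behaves poorly for small samples or proportions near 0 or 1.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(events_t, n_t, events_c, n_c, level=0.95):
    """Risk difference (treatment minus control) with a Wald confidence interval.

    Returns a tuple (difference, lower bound, upper bound).
    """
    p_t = events_t / n_t
    p_c = events_c / n_c
    difference = p_t - p_c
    standard_error = sqrt(p_t * (1.0 - p_t) / n_t + p_c * (1.0 - p_c) / n_c)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    return difference, difference - z * standard_error, difference + z * standard_error


if __name__ == "__main__":
    # Hypothetical data: 30/100 events on treatment, 45/100 on control.
    diff, lower, upper = risk_difference_ci(30, 100, 45, 100)
    print(f"risk difference = {diff:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

For these invented counts, the estimated risk difference is -0.15 with a 95% CI of roughly -0.28 to -0.02. The interval conveys both the size of the effect and the uncertainty around it, which a statement such as 'p < 0.05' does not.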

The p value in null hypothesis significance testing is conditioned on the null hypothesis being true.

This means that a p value of 0.05

In fact, the chance of us mistakenly rejecting the null hypothesis and concluding we have a successful treatment is more in the region of 30%–60%.

Scientific journals and textbooks need to be explicit on how p values are used and defined.

Use of the more intuitive Bayesian statistics should become more widespread.

This article has been corrected since it was published Online First. This paper is now Open Access.

RP and RB conceived the idea of writing the article. RP was the primary author and conducted the audit; RB and LM contributed to drafting and revising the article, in particular helping to clarify the text and make it suitable for a wider medical audience.

None declared.

Not required.

Not commissioned; externally peer reviewed.