Problem with p values: why p values do not tell you if your treatment is likely to work
Robert Price (1), Rob Bethune (2, 3), Lisa Massey (2)
  1. Anaesthetics, Royal Devon and Exeter NHS Foundation Trust, Exeter, UK
  2. Surgery, Royal Devon and Exeter NHS Foundation Trust, Exeter, UK
  3. South West Academic Health Science Network, Exeter, UK
Correspondence to Dr Robert Price, Anaesthetics, Exeter, Devon, UK; robert.price1{at}nhs.net

Introduction

Medicine has made remarkable progress within the lifetime of the oldest members of our society. Evidence from trials has come to replace expert opinion as the arbiter of treatment effectiveness. Following the work of Fisher (1925) and Neyman and Pearson (1933), null hypothesis significance testing (NHST) and the p value have become the cornerstones of clinical research. The attractions of a rule-based ‘algorithm’ approach are that it is easy to implement, permits binary decisions to be made and makes it simple for investigators, editors, readers and funding bodies to count or discount the work. But does that make it reliable? Even in the early days, this approach was controversial. Worse still, as these ideas became broadly adopted, fundamental misinterpretations were embedded in the literature and practice of biomedical research.1

Fisher originally proposed using the exact p value for a single trial as an indication of the credibility of the null hypothesis when considered together with all the available evidence.2 It is worth noting that the null hypothesis does not necessarily mean no difference between groups; that special case is the nil null hypothesis.3 Rather, it is the hypothesis we aim to nullify by experiment, which can provide more powerful evidence if it includes a quantitative prediction of the expected difference. In the following decade, Neyman and Pearson developed the concepts of the alternative hypothesis, α, power, β, type I error and type II error as a formal decision-making procedure to automate industrial production quality control. This method requires multiple samples and repeated analyses to control the long-term error rates, and uses the p value only as a binary decision-making threshold. Despite the inherent conflict between these two approaches, they have been fused into NHST for single trials, where a p value threshold is used to accept or reject the null …
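
The long-run character of the Neyman and Pearson error rates can be illustrated with a small simulation. The sketch below is not taken from the article: it assumes a two-sample z test with a known common standard deviation, 50 patients per arm, a threshold of α = 0.05 and a true difference of 0.5 standard deviations under the alternative, all of which are arbitrary choices made for illustration. The function names `two_sided_p` and `rejection_rate` are hypothetical.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(sample_a, sample_b, sigma=1.0):
    """Two-sided p value from a two-sample z test (common SD assumed known)."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_diff, n=50, alpha=0.05, trials=4000, seed=1):
    """Fraction of simulated trials in which the nil null is rejected."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    rejections = 0
    for _ in range(trials):
        treated = [rng.gauss(true_diff, 1.0) for _ in range(n)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sided_p(treated, control) < alpha:
            rejections += 1
    return rejections / trials


# Nil null true: every rejection is a type I error, so the rate estimates alpha.
type_i_rate = rejection_rate(true_diff=0.0)

# Alternative true (assumed effect of 0.5 SD): the rejection rate estimates
# power, and 1 - power estimates beta, the type II error rate.
power = rejection_rate(true_diff=0.5)

print(f"long-run type I error rate: {type_i_rate:.3f}")
print(f"long-run power: {power:.3f} (beta = {1 - power:.3f})")
```

Under these assumptions the first rate should land close to 0.05. Both figures describe the behaviour of the procedure over thousands of repeated trials; neither says how likely it is that the treatment works in any single trial.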
