Objective To show that overpowered trials can claim statistical significance while bypassing clinical relevance, and that a superiority margin is needed to avoid such misinterpretation.
Design Selective review of articles published in the New England Journal of Medicine between 1 January 2015 and 31 December 2018 and meta-analysis following Preferred Reporting Items for Systematic Reviews and Meta-Analyses checklist.
Eligibility criteria for selecting studies and methods Published superiority trials evaluating cardiovascular diseases and diabetes mellitus with a positive efficacy outcome were eligible. Fixed effects meta-analysis was performed using RevMan V.5.3 to calculate the overall effect estimate (pooled HR), which was compared with the mean clinically significant difference.
Results Thirteen eligible trials with 164 721 participants provided the quantitative data for this review. Largely, the primary efficacy endpoint in these trials was the composite of cardiovascular death, non-fatal myocardial infarction, unstable angina requiring rehospitalisation, coronary revascularisation and fatal or non-fatal stroke. The pooled HR was 0.86 (95% CI 0.84 to 0.89, I²=45%), a relative risk reduction of about 14%, which was smaller than the mean clinically significant difference of 19.6% (range: 9.375% to 35%) used in these studies. The 95% CIs of the individual studies spanned 0.56 to 0.99. The upper margin of the CI in most of the studies was close to the line of no difference. Absolute risk reduction was small (1.19% to 2.3%), translating to a high median number needed to treat of 63 (range: 43 to 84) over a follow-up duration of 2.95 years.
Conclusions The results of this meta-analysis indicate that overpowered trials give statistically significant results undermining clinical relevance. To avoid such misuse of current statistical tools, there is a need to derive superiority margin. We hope to generate debate on considering clinically significant difference, used to calculate sample size, as superiority margin.
- clinical pharmacology
- clinical trials
- overpowered trials
Superiority trials test whether one drug is better than another. This information is mostly used by busy clinicians who want quick information and are in search of new treatments for the unmet needs of their patients.1 The design of such trials facilitates determination of the smallest difference in the primary endpoint between the study arms. Traditional null hypothesis statistical testing is used to decide the statistical significance of two-sided tests using the p value and CI. Progress towards such informative trial design and analysis has been slow, and this method of testing carries a high chance of misinterpretation and misuse.1–3 The smallest difference to be detected is usually specified a priori and used to calculate the sample size. Currently, there is no consensus or scientific formula for choosing the smallest difference that is clinically meaningful and important for public health decisions.4 Over the years, the smallest difference has come to be taken as a realistic difference that can be identified with the available sample size,5 which raises two issues: a smaller than ideal sample size increases type II error, while a larger than required sample size increases the power of the study (‘overpowered trials’) and decreases the magnitude of the smallest difference required to produce statistical significance.
Criticisms have been raised by us3 6 7 and other authors2 4 8 regarding the conduct and interpretation of superiority trials. We have raised our concerns about the results of trials such as EMPA-REG3 and CANTOS,7 the HRs of which are 0.86 and 0.85, respectively. Their 95% CIs are 0.74 to 0.99 and 0.74 to 0.98, and the clinically meaningful differences used to calculate their sample sizes are 0.77 and 0.80 (a 20% risk difference translates to an HR of 0.80), respectively. Since the upper limits of the 95% CIs of these studies are close to the line of no difference, the improvement over the control arms is small. Had the authors considered the clinically meaningful difference while analysing their study results, both studies would have given statistically non-significant results. Such interpretation of results carries the potential for misuse of data, as the drugs may not be sufficiently beneficial to obtain regulatory approval. A few methodologies have been put forth to evaluate statistical significance, such as lowering the p value threshold to 0.005,9 Cohen’s rule of thumb10 and number needed to treat (NNT),11 but none of them works as well as the non-inferiority (NI) margin does for NI trials in justifying both statistical and clinical significance.
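The comparison described above can be sketched in a few lines. This is a minimal illustration, not the analysis performed by the trial investigators: it applies the conventional rule (upper 95% CI limit below 1.0) and a hypothetical superiority-margin rule (upper limit below the HR assumed in the sample size calculation) to the values quoted in this section. The function name `is_superior` and the dictionary layout are assumptions made for the example.

```python
def is_superior(ci_upper: float, margin: float = 1.0) -> bool:
    """Return True if the whole 95% CI for the HR lies below the margin.

    margin=1.0 is the conventional line of no difference; a smaller
    margin plays the role of a (hypothetical) superiority margin.
    """
    return ci_upper < margin


# Upper 95% CI limits and design HRs as quoted in the text above.
trials = {
    "EMPA-REG": {"ci_upper": 0.99, "design_hr": 0.77},
    "CANTOS": {"ci_upper": 0.98, "design_hr": 0.80},
}

for name, t in trials.items():
    conventional = is_superior(t["ci_upper"])
    with_margin = is_superior(t["ci_upper"], t["design_hr"])
    print(f"{name}: conventional={conventional}, with margin={with_margin}")
```

Under these assumptions both trials pass the conventional rule but fail the margin-based rule, which is the point made in the paragraph above.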
In this meta-analysis, we consolidate, with some evidence, our suggestion of a superiority margin for superiority trials akin to the NI margin for NI trials. We have analysed superiority trials with significant outcomes and have assessed the clinical relevance of their results. The analysis made in this study may influence how superiority trials are conducted and how their results are interpreted.
This is a selective review and meta-analysis of all the superiority trials published in the New England Journal of Medicine (NEJM) between 1 January 2015 and 31 December 2018.
We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses consensus statement for the design, implementation and reporting of this study.12 All trials published in NEJM between 2015 and 2018 were hand-searched for the type of study design. Articles were selected based on our inclusion criteria.
Study selection, inclusion and exclusion criteria
We included all blinded or open-label superiority trials with a pharmacological intervention conducted in adults that had a positive primary outcome with a continuous or categorical variable. We set a 4-year search limit for studies published in NEJM because we considered this sufficient to show the trend in the conduct and reporting of superiority trials as practised worldwide by eminent researchers. We chose to include studies conducted in populations with cardiovascular diseases and diabetes mellitus because these conditions involve trials with large sample sizes, owing to high prevalence and chronicity, and also because several drugs in this area receive Food and Drug Administration (FDA) approval every year claiming superiority over existing drugs. Studies conducted in paediatric populations, or in populations with venous thromboembolism, bleeding, cardiomyopathy and shock, were excluded. We also excluded trials involving any non-pharmacological intervention such as (but not limited to) lifestyle modification, percutaneous coronary intervention, coronary artery bypass grafting and pacemaker implantation. Studies which expressed their primary endpoint as mean difference,13–23 OR24 or risk ratio25 26 and studies which had a non-significant primary efficacy outcome were also excluded. These eligibility criteria largely yielded a uniform primary outcome, that is, the composite of cardiovascular death, non-fatal myocardial infarction (MI), unstable angina requiring rehospitalisation, coronary revascularisation, fatal or non-fatal stroke,24–39 except for one study40 which had a safety endpoint as its primary outcome. Although it met our eligibility criteria, we excluded this study to maintain uniformity and generalisability of our results.
Individual journal volumes and issues were hand-searched for superiority trials. All the selected articles and their supplementary materials were searched for clinically meaningful difference or effect size used to calculate the sample size.
NG did the search and independently screened the titles and abstracts yielded by the search against the inclusion criteria. Full reports for all titles and abstracts that appeared to meet the inclusion criteria or where there was any uncertainty were obtained. NG and SM then screened the full-text reports and decided whether these met the inclusion criteria. We resolved disagreements through discussion with our third author NS. Reasons for excluding trials were recorded. The review authors extracted the data on study characteristics such as (1) trial design, population and sample size; (2) intervention details; (3) duration of treatment and median follow-up; (4) outcome measures; and (5) study results for the primary efficacy outcome.
Data collection process
Data abstracted included endpoints for decision-making, that is, HR in the primary efficacy endpoint along with their CI, p value and NI margin or clinically significant difference, if present.
Risk of bias within and across studies
Two authors (NG and SM) independently assessed the risk of bias of each included trial. If risk of bias was not clear from the main manuscript, we obtained records from the articles’ supplementary materials. Cochrane’s risk of bias tool was used to assess risk of bias among included studies. We assessed overall risk based on the description of randomisation methods, allocation concealment, blinding of participants, investigators and statistician and attrition and reporting biases. As per tool input, biases were recorded as low risk, high risk or unclear (either lack of information or uncertainty over the potential for bias). Any disagreements in the risk of bias were discussed and resolved by consensus of all the authors.
Results for all primary efficacy outcomes are expressed as HR with 95% CIs, calculated from end of treatment values. For data expressed in person-years, we calculated the incidence rate for the total duration of the study. The overall effect estimate of the meta-analysis was calculated using RevMan V.5.3 with a fixed effects model and the Observed-Expected (O–E) and variance statistic, as per the Cochrane Handbook for Systematic Reviews of Interventions41 and Tierney et al.42 Two studies34 35 had two coprimary efficacy outcomes. In the fixed effects model, only the HR of the first coprimary outcome was used, as the HR and CI for the second coprimary outcome were almost the same. For the same reason, we used only the single-dose analysis values from the study by Bonaca et al39 for the final assessment. Sensitivity analysis including these values did not differ significantly from the reported result. Further, descriptive statistics were used to describe sample size, duration of the studies and overall clinically significant difference. Absolute risk reduction (ARR) and NNT were also calculated from the end of treatment values.
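For readers unfamiliar with fixed effects pooling, the following is a minimal sketch of the generic inverse-variance approach on the log HR scale, with the standard error recovered from the reported 95% CI. It is not the O–E and variance calculation performed in RevMan for this review, and the two input rows are the EMPA-REG and CANTOS values quoted in the Introduction, used purely as illustrative inputs rather than the 13-trial dataset. The function names are assumptions made for the example.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% CI


def log_hr_and_se(hr, lower, upper):
    """Log HR and its standard error, recovered from a 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    return math.log(hr), se


def pool_fixed(studies):
    """Inverse-variance fixed effects pooling of (hr, lower, upper) tuples."""
    ys, ws = [], []
    for hr, lower, upper in studies:
        y, se = log_hr_and_se(hr, lower, upper)
        ys.append(y)
        ws.append(1.0 / se**2)  # weight = 1 / variance
    total_w = sum(ws)
    pooled = sum(w * y for w, y in zip(ws, ys)) / total_w
    se_pooled = math.sqrt(1.0 / total_w)
    # Cochran's Q and the I-squared statistic (percentage, floored at 0).
    q = sum(w * (y - pooled) ** 2 for w, y in zip(ws, ys))
    df = len(studies) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "hr": math.exp(pooled),
        "lower": math.exp(pooled - Z_95 * se_pooled),
        "upper": math.exp(pooled + Z_95 * se_pooled),
        "i2": i2,
    }


# Illustrative inputs only: (HR, lower 95% limit, upper 95% limit).
example = [(0.86, 0.74, 0.99), (0.85, 0.74, 0.98)]
result = pool_fixed(example)
print({k: round(v, 3) for k, v in result.items()})
```

Because each study is weighted by the inverse of its variance, large trials with narrow CIs dominate the pooled estimate, which is why the pooled CI is narrower than that of any single input.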
During the search period, a total of 840 original articles were published in NEJM. The selection process is outlined in figure 1. Studies reporting results of subgroup analysis or combined analysis were excluded as duplicates. Filtering the articles by disease-specific criteria yielded 171 potentially relevant articles. A further 113 studies were excluded because they were either not superiority trials or involved a non-pharmacological intervention. Full texts of the remaining studies were retrieved and assessed. In total, 16 superiority trials were eligible for the final analysis. Of these, three trials26–28 had to be excluded as they reported results as OR or risk ratio (RR). Since OR and RR, in contrast to HR, do not account for the timing of events, it was considered inappropriate to pool trials reporting OR or RR with those reporting HR. However, a sensitivity analysis including these three trials yielded a similar pooled estimate but with increased heterogeneity (result not shown).
Overall, 164 721 participants were included in the 13 trials that provided the quantitative data for this review (for individual study details, see table 1). Largely, the primary efficacy endpoint in these trials was composite of cardiovascular death, non-fatal MI, unstable angina requiring rehospitalisation, coronary revascularisation and fatal or non-fatal stroke. Two studies had two coprimary outcomes: composite of death from cardiovascular causes, non-fatal MI or non-fatal stroke (first coprimary outcome) and the composite of these events plus resuscitated cardiac arrest, heart failure or revascularisation (second coprimary outcome). HRs for each coprimary outcomes were similar and hence for pooled analysis, HR and CI of first coprimary outcome were used. Sensitivity analysis including both the primary outcomes did not vary the final effect estimate considerably.
Of the 13 studies, only 9 (69.2%) mentioned the clinically significant difference used to calculate their sample size. These clinically significant differences ranged from 9.375% to 35%, with a mean of 19.6%. Studies which used the annual rate of the primary event or an NI margin to calculate their sample size were excluded from the calculation of this mean. Further, the primary hypothesis of four studies was to test NI using an NI margin and to conclude superiority when the upper boundary of the 95% CI was <1.0.30 32 33 36 Overall, the lower boundary of the 95% CI of the included studies ranged from 0.56 to 0.89, while the upper boundary ranged from 0.86 to 0.99. Incidence of the primary event was described as HR in all 13 studies.
Synthesis of results
The overall effect estimate (pooled HR) was 0.86 with a 95% CI of 0.84 to 0.89, p<0.00001. The heterogeneity statistic (I²) in our meta-analysis was 45%, indicating moderate heterogeneity or inconsistency among the included studies. This conflicts with the fixed effects assumption that each study included in the meta-analysis estimates the same intervention effect41 (figure 2).
On average, the sample size in each included study of our meta-analysis was 12 671 (range 3297–27 395). These participants experienced an ARR of 1.19% to 2.3% in the primary outcome, which translated to a median NNT of 63 (range: 43–84) over a follow-up duration of 2.95 years. The median NNT of 63 corresponded to the studies by Schwartz et al25 and Zinman et al,36 which had median follow-up durations of 2.8 years and 3.1 years, respectively (table 1); the midpoint of these two values, 2.95 years, was taken as the follow-up duration for this NNT.
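The relationship between ARR and NNT used above is simply NNT = 1/ARR. A minimal sketch follows; the function name is an assumption, and the value is returned unrounded so that the rounding convention is left explicit.

```python
def number_needed_to_treat(arr: float) -> float:
    """NNT as the reciprocal of the absolute risk reduction (a proportion)."""
    if not 0 < arr <= 1:
        raise ValueError("ARR must be a proportion in (0, 1]")
    return 1.0 / arr


# The extremes of the ARR range reported above (1.19% and 2.3%).
for arr in (0.0119, 0.023):
    nnt = number_needed_to_treat(arr)
    print(f"ARR {arr:.2%} -> NNT {nnt:.1f}")
```

The two extremes give NNTs of roughly 84 and 43, in line with the reported range. Note that an NNT always refers to a specific follow-up period, here about 3 years.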
Risk of bias
Full details of the risk of bias are outlined in figure 3. All the included studies described their randomisation methods and were considered as low risk. Information about allocation concealment was unclear for two trials; the rest used an interactive voice response system, a web response system or a generated unique number to conceal allocation. All the included trials were double-blinded except for two, of which one was single-blinded30 and the other was open label.37 In only one trial39 were the biostatisticians also blinded until analysis. Seven studies clearly mentioned that statisticians were not blinded, while this was unclear in the rest of the studies. We did not look for incomplete outcome data and selective reporting, as our study objective was to examine the primary outcome of these studies.
The main objective of a superiority trial is to determine the smallest difference in the primary outcome between the study arms.43 Such trials are therefore designed to have enough power to detect that difference. In this meta-analysis, however, we argue for assigning meaning to the smallest difference, making it clinically significant and more specific. Currently, the clinically significant difference is used to calculate the sample size but is not considered when the findings are analysed.
In this study, the pooled effect estimate of the included studies is 0.86 with 95% CI 0.84 to 0.89. This means that there is a ~14% decrease in the risk of having the primary outcome if the patient takes the experimental drug. This 14% may not be clinically significant, as the estimated mean clinically meaningful difference of these studies is 19.6%. Nine (69.2%) out of 13 included studies did not meet their own respective marks for clinical significance, despite obtaining statistical significance (the clinically meaningful difference was not mentioned for the other four studies). Either their HR did not reach their own clinically meaningful difference25 29 31 34 35 38 39 or their 95% CI27 37 included this value, which means that, had the authors considered the clinically meaningful difference during their analysis, the null hypothesis would not have been rejected and superiority of the experimental drug would not have been shown. This leads us to ask whether a statistically significant result is clinically relevant. In one of the included trials,34 the investigators intended to detect a 35% lower risk with the experimental drug.
Further, the effect estimate translates to a median NNT of 63 which means 63 patients need to be treated for a median duration of 2.95 years to prevent one extra outcome from happening. It is debatable whether such high NNT for the said time period is acceptable for a new, rather expensive medication with potential and unknown adverse effect profile, even more so because trial-based NNT often underestimates NNT in clinical practice.44
Critics have also highlighted that, with the existing design of superiority trials, even a borderline effect becomes statistically significant if tested with a large sample size.45 In our study, the average sample size was ~12 600. In 11 of 13 included studies, the upper bound of the 95% CI exceeded 0.90, and in two of these 11 studies the upper bound was 0.99,36 38 which is very close to the line of no difference. We believe that the large sample size has given enough power to reject the null hypothesis in these studies. Such ‘overpowered or oversized trials’ have been criticised earlier by many researchers.2 46 47
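The effect of size on what counts as statistically significant can be illustrated with a rough sketch. It assumes 1:1 allocation and the common approximation that the variance of the log HR is about 4 divided by the number of events (Schoenfeld's approximation for the log-rank test). Neither the formula nor the inputs are taken from the included trials, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def events_required(design_hr, alpha=0.05, power=0.90):
    """Approximate events needed to detect design_hr (1:1 allocation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return math.ceil(4 * (z_a + z_b) ** 2 / math.log(design_hr) ** 2)


def critical_hr(events, alpha=0.05):
    """Largest observed HR (below 1) that still reaches two-sided p < alpha."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    # SE of log HR is approximated as 2 / sqrt(events).
    return math.exp(-z_a * 2 / math.sqrt(events))


planned = events_required(0.80)  # events for a 20% design difference
print(planned, round(critical_hr(planned), 3))
print(4000, round(critical_hr(4000), 3))  # a much larger, hypothetical trial
```

Under these assumptions, a trial sized for an HR of 0.80 already declares significance for observed HRs well short of 0.80, and the threshold moves closer to 1.0 as the number of events grows. This is the mechanism by which an overpowered trial can turn a borderline effect into a statistically significant one.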
CIs offer predictive information about the treatment effects that can be expected in post-trial use. Narrow CIs suggest little variability within the population and give assurance that even the wider interval we are bound to get once the drug is marketed may provide a similar difference in treatment effect in the general population.48 Given the wide CIs and their proximity to the line of no difference in most of the included studies, there is uncertainty regarding the effectiveness of such drugs after they are marketed.
Nevertheless, as stated before, while analysing the results of a study, it is important to assess the relevance of the difference between the two drugs. In NI trials, inclusion of an NI margin addresses the issue of clinical relevance.49 We have earlier proposed the need for a superiority margin for superiority trials, similar to that in NI trials.3 6 With this study, we emphasise fixing the clinically meaningful difference for a particular outcome (say, ‘10%’ for ‘composite of death from coronary heart disease, non-fatal MI, fatal or non-fatal ischaemic stroke or unstable angina requiring hospitalisation’) as a ‘superiority margin’. This ensures that the experimental drug is deemed superior only when the entire 95% CI lies beyond this margin. The proposed margin can be chosen based on the research question and the entire active control effect, similar to the NI margin.
Whether there should be a guideline on the difference to be detected in a trial is debatable. Some authors argue that each research question is unique and deserves special consideration. They believe that guidelines encourage the same approach to be used across all study designs, so that individual differences may not receive enough attention.5 However, the absence of guidelines may invite carelessness and encourage investigators to play with sample size, CI and p value.
This study has a few limitations: First, we included articles published in NEJM only, hence, we cannot generalise the study conclusion for trials published in other journals. However, we believe that NEJM is sufficiently representative of leading medical journals and these results should be quite robust. Second, since a single journal was selected for review, only one author did the initial screen followed by all three authors. Third, we did not register our study protocol in PROSPERO which is typically recommended for systematic reviews and meta-analyses. Fourth, we did not include effect size of four studies to calculate average clinical meaningful difference as the effect size in these studies was not mentioned. They have used either annual rate of events or NI margin to calculate sample size. Owing to similar primary efficacy outcome in all the included studies, we presume that the authors would have considered similar effect sizes ranging from 9.375% to 35% to calculate sample size.
This study also raises some important questions such as, ‘Is there a need of superiority margin for superiority trials? Which is the best parameter that can be used to derive superiority margin? and What are the pros and cons of choosing clinically significant difference as superiority margin? Or should it be modified?’
In the absence of a superiority margin, current practice leads to a situation where overpowered superiority trials provide statistically significant results with effect sizes so small that they may be of little clinical relevance. This meta-analysis shows that this practice is widespread in cardiovascular disease and diabetes trials published in NEJM. We hope to generate debate on this practice.
Overpowered cardiovascular trials provide statistical significance but may have little clinical relevance.
Superiority margin must be taken into account while interpreting results of such trials in order to avoid scientific misinterpretation.
Clinically significant difference used to calculate sample size may be used as the superiority margin.
Current research questions
Is there a need of superiority margin for superiority trials?
Which is the best parameter that can be used to derive superiority margin?
What are the pros and cons of choosing clinically significant difference as superiority margin? Or should it be modified?
What is already known on the subject
Superiority trials use traditional null hypothesis statistical testing to decide statistical significance between the study arms. Such analysis can lead to misinterpretation and scientific misuse of superiority trials.
Contributors NG and SM hand-searched and independently reviewed the articles. Disagreements were discussed with NS. All three authors contributed to data analysis and manuscript preparation.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Patient consent for publication Not required.
Provenance and peer review Not commissioned; externally peer reviewed.
Data availability statement There are no data in this work.