In an accompanying article Janine Janosky sets out the case for the use of single subject designs.1 I was asked by my colleague Dr John Mayberry, the editor of the journal, to referee this paper, but felt it would be more appropriate to respond to it, largely to stimulate debate on this issue. I would suggest that the proper applicability of single subject designs is much narrower than this article would imply. I would furthermore warn readers of the dangers of a view that, if left to grow unchecked, could seriously undermine the dominance of the multi-patient randomised clinical trial. That design is now, with very strong justification, accepted as the cornerstone of evidence based clinical practice, and undermining it would have serious consequences for the choice of appropriate management for future patients. The two key issues are equivocation regarding the ambit of the single subject design, and the robustness of the inference to be drawn from data such as figure 1 in Janosky’s paper.
It is well accepted that clinical expertise is needed to apply the findings of large clinical trials to the individual patient. The doctor’s initial training, ongoing CPD, and clinical experience facilitate the recognition of patients who are not “average” and for whom current evidence based guidelines, which are optimised for patient populations, may not be optimal. How to decide on management for a specific patient may be problematic. When the issue relates to maintenance treatment, the single patient design certainly has a role. For example, the patient may have two coexisting conditions for which the therapeutic requirements conflict. Another context is polypharmacy—perhaps the patient is currently taking four drugs, and the clinician suspects that one could be withdrawn without diminution of therapeutic effect.
In some parts of the article, including the six listed “possible research questions”, Dr Janosky clearly implies that the research issue only applies to a specific patient. In other places, a broader scope is implied by phrases such as “unique study populations”, “choosing the patient to participate”, and “typical in terms of the practice demographics (and) for disease presentation and progression”. Dr Janosky concedes that there is an issue of limited generalisability. I would argue that a study of this kind cannot provide any reassurance that we can extrapolate the findings to other patients. One could say “to other similar patients”, but what does “similar” mean in this context? Demographic, physiological, and diagnostic similarity are of little relevance here. The only similarity that matters relates to propensity to respond to the treatment in question, and this can neither be observed nor ensured. By contrast, the conventional large clinical trial relates to patients drawn from a population specified by well defined eligibility criteria, and random allocation ensures that the groups are comparable, within the limits of chance variation, in respect of all possible variables, including counterfactual treatment response. This is what justifies applying the conclusions of the trial to patients at large who fulfil the eligibility criteria used in the trial.
The other key issue relates to drawing an “obvious” conclusion from a limited dataset. This is shaky on two counts, relating to clinical lability and statistical methodology. Dr Janosky refers to the patient “in need of lower fasting blood glucose values”, but there is such a thing as regression towards the mean (strictly a misnomer: regression towards the mode would be a more apt description). The inference that the “switch” in figure 1 is real is strongly dependent on a presupposition that patients don’t just “switch” spontaneously in this way. Perhaps this is reasonable in diabetes. It would not be for remitting/relapsing conditions such as inflammatory bowel disease or multiple sclerosis, and certainly not for thyroid disease or bipolar disorder. What Dr Janosky terms the “primary A-B single subject design”, as used here, is particularly vulnerable to criticism: while it is the simplest within subject design, it is the least adequately controlled, and effective blinding is unlikely to be achieved. The methodological issues arising in the familiar multi-subject crossover design are well known. Borrowing terminology commonly used in that context, the observed treatment difference could equally be interpreted as a period effect, or could be distorted by carry-over. In the example given, the treatment difference could be considerably confounded by seasonal differences.
Furthermore, with regard to statistical methodology, what is the implied cut off between a “real” difference and one that could be attributable to chance? For the data as shown, an unpaired two sample t test would give a highly significant p value, around 0.0001 here, but we do not have sufficient evidence to decide whether an assumption of Gaussian distributional form is reasonable, without considerable extrapolation from data on others. The non-parametric Mann-Whitney test is robust, and yields a two sided exact p value of 1 in 35, or 0.029. This is below 0.05, but much less extreme: normally a p value of this magnitude would not be regarded as strongly convincing. A decision rule approach is more relevant. This might relate to a pre-agreed clinically important difference, although this would share the non-robustness problem. Alternatively, one could abandon conventional hypothesis testing with a low α level and opt for a “pick the winner” approach with implied equal α and β rates. This corresponds more closely to the less formalised “trial and error” course of events that commonly occurs in clinical practice.
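The figure of 1 in 35 can be checked by direct enumeration. The sketch below is illustrative only. It assumes four observations in each phase with no overlap between phases, which is the configuration consistent with an exact two sided p value of 2/70; the glucose values are hypothetical and are not taken from figure 1, and the function name is invented for this example.

```python
from fractions import Fraction
from itertools import combinations


def exact_rank_sum_p(a, b):
    """Two sided exact p value for the rank sum (Mann-Whitney) statistic.

    Enumerates every way of assigning len(a) of the pooled ranks to the
    first group. Assumes there are no tied values.
    """
    pooled = sorted(a + b)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n, m = len(a), len(pooled)
    observed = sum(rank[value] for value in a)
    centre = Fraction(n * (m + 1), 2)  # expected rank sum under the null
    deviation = abs(observed - centre)

    extreme = total = 0
    for subset in combinations(range(1, m + 1), n):
        total += 1
        if abs(sum(subset) - centre) >= deviation:
            extreme += 1
    return Fraction(extreme, total)


# Hypothetical fasting glucose readings, NOT the data in figure 1.
phase_a = [148, 152, 157, 161]
phase_b = [118, 121, 125, 130]

p = exact_rank_sum_p(phase_a, phase_b)
print(p, round(float(p), 3))
```

With four readings per phase there are only 70 possible allocations of ranks, so complete separation in one direction or the other is the most extreme result obtainable and the p value cannot fall below 2/70, however large the observed difference.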
Dr Janosky’s stance contrasts quite sharply with that taken by Guyatt et al.2 This highly informative review of the appropriate use of single subject trials was restricted to double blind, randomised, multiple crossover designs aiming to optimise management of a specific patient. Decisions about efficacy were based on a combination of a signed standardised difference measure of effect size D and a single tailed p value, although my reservation above concerning unquestioned use of parametric methods still applies. Furthermore, an effect size criterion expressed in absolute terms (for example, fasting blood glucose units) would be much more directly interpretable for clinical importance than a relative measure such as D.
Clinicians are forever “trying” patients with different treatments. Use of a single subject design, with additional rigour ensured by multiple periods, randomisation of treatments to periods and blinding, and perhaps some statistical analysis, is certainly one stage more formal and rigorous. But we should not imagine it is anything more than that: we can only validly draw conclusions about that one patient, in their present state. It would be very risky to extrapolate to ostensibly “similar” cases.