Have Randomized Controlled Trials of Neuroprotective Drugs Been Underpowered?
An Illustration of Three Statistical Principles
Background and Purpose—The results of phase III trials of neuroprotective drugs for acute ischemic stroke have been disappointing. We examine the question of whether these trials may have been underpowered.
Methods—Computer simulations were based on the binomial distribution.
Results—We illustrate that even small overestimates of the efficacy of an intervention can lead to a serious reduction in statistical power, that the use of data from phase II studies tends to lead to such overestimation, and that a minimum clinically important difference derived with cost-effectiveness modeling techniques is considerably smaller than might be suggested by intuition.
Conclusions—We recommend placing more emphasis on minimum clinically important differences when planning stroke trials, with these differences being derived from an assessment of the public health impact obtained in conjunction with the use of epidemiological and cost-effectiveness models. Even small benefits, when averaged over a sufficiently large number of cases, will, in total, accrue to a large positive impact on the public health.
Although the use of neuroprotective drugs for acute ischemic stroke treatment has been supported by animal studies, phase I trials, and phase II trials, the results of phase III studies have been consistently disappointing. More than one commentary on this topic has already been given,1 2 3 and a recent academic industry roundtable proposed various recommendations regarding drug development and the optimal design of clinical trials. Many of the roundtable’s recommendations focused on development of improved animal models of ischemic stroke. Other recommendations pertained to issues such as the choice of patient population, the outcome measure, and the window of opportunity.4
Although discussed elsewhere,1 one topic that was not considered by the roundtable was the statistical one of sample size. Here, we address the question of whether the previous phase III trials of neuroprotective drugs have been too small to have the statistical power to detect effects that are nevertheless clinically meaningful. We also consider the related question of “why.” Consideration of this latter question, not previously discussed in the literature, may help elucidate how sample size calculations can be improved in the future.
For simplicity of illustration, we assume that the primary analysis of the phase III randomized controlled trial in question will involve first creating, for each patient, a dichotomous outcome of “good” versus “poor.” For example, a “good” outcome might be defined as survival to the end of the trial with a Barthel Index of either 95 to 100 or within 5 units of the prestroke baseline, with a “poor” outcome defined as death or survival with significant stroke-related deficits. Denote the true (but unknown to the investigator) proportions of intervention and control patients expected to have good outcomes as Pi and Pc, respectively. Similarly denote the observed values of these variables as pi and pc. For concreteness, we assume that Pc=0.40 (ie, that 40% of the patients in the usual care group are expected to have good outcomes). In practice, Pc will depend on both the population under study (eg, whether patients with mild stroke will be included) and the cutoff used to define which outcomes are considered “good.” So long as good outcomes are moderately likely (eg, Pc falling within the range of 0.20 to 0.80), our conclusions are robust to changes in the value of Pc.
For concreteness, we also assume that the primary analysis is a χ2 test (ie, which directly compares pi with pc). In practice, greater statistical power can be gained with various embellishments to this basic analytic strategy: for example, by considering an outcome variable such as the Barthel Index to be continuous rather than dichotomous, by combining multiple correlated outcomes into a single statistical test, and by including covariates such as study site and stroke severity. Nevertheless, the basic statistical principles illustrated here remain valid even in the presence of more sophisticated analytical approaches.
Our analysis plan uses simulations to illustrate 3 possible reasons why a randomized trial of an acute stroke treatment might be underpowered: (1) sensitivity of the results of the sample size and statistical power calculations to small changes in the inputs to these analyses, (2) overestimation of the true (but unknown) difference in outcome rates between intervention and control, and (3) overestimation of the minimum clinically important difference in outcome rates between intervention and control. Details of the simulation methodology are given in the Appendix.
From time to time, we use as examples the results of trials of neuroprotective agents that were published in Stroke during 1996 to 2000. Although not exhaustive, these trials are intended to be illustrative of current practice (and, indeed, the state of the science) in this field.
Sensitivity of Power to Small Changes in Outcome Rates
Power is the probability that a trial will have a positive result, conditional on having correctly specified the true (but unobservable) improvement in outcomes associated with the intervention (ie, in essence, having correctly specified Pc and Pi). We first illustrate that the typical trial of an acute stroke treatment tends to be on the “steep part of the power curve”; in other words, we illustrate that if assumptions about the efficacy of treatment err even slightly on the side of optimism, then statistical power can be greatly reduced.
Table 1 illustrates the relationship between sample size and power. Its rows correspond to the assumptions used by the investigator in determination of the sample size. (For simplicity, we assume that the investigator has correctly specified that Pc=0.40. In practice, misspecification of Pc will be an additional source of error.) For example, if the investigator assumes that Pi=0.50 and uses a traditional 2-sided hypothesis test with a type I error rate of 5% and a power of 80%, then the study will require a sample size of 388 patients per group (Table 1, row 10, column 1).5 For this row in the various columns, the statistical power, with the assumption of a sample size of 388 per group, is presented as a function of the true proportion of good outcomes in the intervention group. For example, if the investigator’s assumption that Pi=0.50 is correct, then the power is the desired 80% (line 10, column 11, underlined). However, if the investigator’s assumption about the outcome rate in the intervention group is optimistic and Pi is actually 0.45, then a trial with 388 patients per group will have a power of only 29% (line 10, column 6). Even a difference of 2% in outcome rates can have an effect on power; for example, if the assumed outcome rate in the intervention group is Pi=0.50 but the actual rate is Pi=0.48, then the power of the study to detect a statistically significant impact of the intervention drops from 80% to 61%.
Another way to use Table 1 is to simply note the sample size per group (column 2), to compare this against the sample sizes of various trials, and then to move to the appropriate row in the table to assess the statistical power. For example, sample sizes in the intervention group for various phase III trials are 680 for clomethiazole,6 565 for tirilazad (as planned),7 464 for piracetam,8 368 for lubeluzole,9 186 for nalmefene,10 and 152 for ebselen.11 These correspond approximately to lines 7 to 11 of Table 1 (ie, with per-group sample sizes ranging from 173 to 787). For sample sizes of this magnitude, if the actual improvement in good outcomes associated with the intervention is 5% (ie, Pi=0.45), statistical power ranges from only 16% (n=173 per group) to 52% (n=787 per group). Similarly, if the actual improvement in good outcomes associated with the intervention is 2% (ie, Pi=0.42), statistical power ranges from only 7% to 13%.
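The steep part of the power curve can be explored with a short analytic sketch. This is not the simulation used for Table 1. It assumes the normal approximation based on the arcsine transformation of proportions (the effect size h described by Cohen), and the function names are hypothetical. Under these assumptions it gives 388 patients per group for Pc=0.40 versus Pi=0.50, and powers close to the 80%, 61%, and 29% quoted above.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def effect_size_h(p_control, p_intervention):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p_intervention)) - 2 * asin(sqrt(p_control))


def sample_size_per_group(p_control, p_assumed, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a 2-sided test (assumption:
    arcsine normal approximation, equal group sizes)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    h = effect_size_h(p_control, p_assumed)
    return ceil(2 * (z_alpha + z_beta) ** 2 / h ** 2)


def approximate_power(n_per_group, p_control, p_true, alpha=0.05):
    """Approximate power of a 2-sided test when the true rate is p_true."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(effect_size_h(p_control, p_true)) * sqrt(n_per_group / 2)
    # Rejections in either tail count toward power.
    return STD_NORMAL.cdf(shift - z_alpha) + STD_NORMAL.cdf(-shift - z_alpha)


if __name__ == "__main__":
    n = sample_size_per_group(0.40, 0.50)
    print("planned n per group:", n)
    for p_true in (0.50, 0.48, 0.45):
        print(f"true Pi={p_true:.2f}: power={approximate_power(n, 0.40, p_true):.2f}")
```

Because the approximation is analytic rather than simulated, its results can differ from the tabulated values by a patient or a percentage point, particularly for small differences in outcome rates.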
Overestimation of the True Intervention Effect
The extent to which the results of Table 1 are worrisome depends in large part on how small Pi is allowed to be. How, then, should the outcome rate in the intervention group be specified? Three typical and often interrelated approaches are data based, theory based, and intuition based. We begin by considering the data-based approach.
Suppose that the efficacy of the intervention (ie, Pi, again with the assumption that Pc is precisely known) is to be estimated from previous studies. Typically, these are 1 or more phase II dose-selection trials, but in some circumstances, previously conducted phase III trials might be available as well. For these purposes, “multiple estimates of Pi” could consist of either more than 1 previous trial and/or subgroup analyses within a single previous trial. For simplicity, in the simulations here, we assume that all multiple trials and/or analyses within the same trial have the same sample size of 100.
First, consider the situation where only 1 previous study is extant. The key element of the data-based approach is that the investigator bases the estimate of sample size for the pivotal phase III trial on exactly the magnitude of the intervention effect observed in the previous data. For example, in the US and Canadian Lubeluzole Ischemic Stroke Study Group,9 “Sample size … was based on the phase-II trial of lubeluzole. The study was powered … to detect a difference between 14% and 23.5% in 3-month mortality” (these being the figures previously observed). As another example, in the Cervene Stroke Study, “Sample size was determined with a pooled estimate of the primary efficacy variable from the prior studies (nalmefene 70%, placebo 55%).”10 Similar reasoning is evident in the discussion of the results of a phase II trial of magnesium sulfate: “Based on the figures obtained from this study, a [phase III] trial to demonstrate the efficacy of intravenous magnesium sulfate would require 712 patients … [to detect a] difference from 40% to 30% in proportion dead or disabled”12 (the 40% and 30% being observed in the phase II trial with 60 patients). In essence, the investigator assumes that the true (but unobservable) value of Pi to be used in the sample size calculation is exactly the same as the value of pi observed in the previous study.
Table 2 illustrates the deleterious effects that chance can have on the above procedure. The table assumes that the true (but unknown to the investigator) proportion of good outcomes in the intervention group is Pi=0.43. Because Pi is unknown, it will be estimated from the observed pi, which in turn is subject to sampling variability (summarized by its sampling distribution). The median value of the sampling distribution is, as might be anticipated, 0.43. However, 1 of 4 times (ie, 75th percentile), the investigator will have the bad luck to observe a pi of ≥0.47. From Table 1, the assumption that Pi is 0.47 when in fact it is 0.43 will lead to a significant reduction in power (ie, from 80% to 52%). This phenomenon becomes more pronounced as the sample size decreases but can still affect the results even with large samples. For example, if the previous study has 500 patients (larger than a typical phase II trial), the 75th percentile of the sampling distribution is 0.45 (data not shown). In any event, the error in thinking here is to ignore sampling variability.
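The sampling variability described above can also be examined without simulation. The sketch below is an illustrative alternative to the simulations behind Table 2: it computes exact binomial tail probabilities, assuming a single previous study of 100 patients and a true rate of Pi=0.43. The function names are hypothetical. It shows that roughly 1 in 4 such studies would observe 47 or more good outcomes per 100 patients.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k good outcomes among n patients."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def prob_at_least(min_good_outcomes, n, p_true):
    """Probability that a study of n patients observes at least
    min_good_outcomes good outcomes when the true rate is p_true."""
    return sum(binomial_pmf(k, n, p_true) for k in range(min_good_outcomes, n + 1))


if __name__ == "__main__":
    # Chance that a 100-patient study with true Pi=0.43 observes pi >= 0.47.
    print(round(prob_at_least(47, 100, 0.43), 3))
```

Passing the threshold as a count (47 of 100) rather than as a proportion avoids floating-point rounding when converting 0.47 back to a number of patients.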
Table 3 illustrates another way things can go wrong. Now suppose that the investigator has ≥1 pi, each based on a sample size of 100, from which to choose. For example, a pharmaceutical company might have data from phase II trials of 3 different compounds and might intend to support a phase III trial of only the most promising one. Alternatively, a phase III trial might already have been conducted whose overall results were negative/equivocal but also produced 10 subgroup analyses, each of which generates an estimate of pi. Finally, assume that the intervention is equally effective throughout, with true (but unobservable) Pi=0.43. Now, even though the actual Pi remains unchanged, the observed values of pi will tend to vary; for example, 3 such observed values might be pi=0.41, pi=0.44, and pi=0.48. In addition, suppose that the investigator intends to base the sample size calculation for the subsequent phase III trial on the best result observed. That is, the investigator will select the maximum observed value of pi, assume that this is the true rate Pi, and then plan the sample size for the following phase III trial accordingly.
In Table 3, the large majority of entries indicate maximum outcome rates that exceed 0.43, thus causing the investigator to overestimate the treatment effect. This bias increases with the number of pi values considered (eg, increases with the number of subgroup analyses). For example, if 5 subgroup analyses are performed, the 50th percentile is pi=0.50. A phase III trial based on this misperception would have a power of only 14% (Table 1). Even if the investigator has good luck and the 25th percentile is observed, then the estimate will be pi=0.47, and the power will only be 52% (Table 1). This bias increases as the sample size of the subgroups decreases (data not shown). The phenomenon described here is well known in the statistical literature as regression to the mean. Here, the error in thinking is to uncritically accept at face value the best of a series of results, without taking into account the effects of chance.
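The upward bias from selecting the best of several results can be sketched from the distribution of a maximum. The code below assumes that the estimates are statistically independent, each from 100 patients with true Pi=0.43, so that the probability that the maximum is at most x equals the single-study cumulative probability raised to the power of the number of estimates. The independence assumption and the function names are illustrative, and because the calculation is exact rather than simulated, its percentiles may differ slightly from those in Table 3.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Probability of at most k good outcomes among n patients."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))


def percentile_of_maximum(q, n_estimates, n, p_true):
    """Smallest proportion x such that the largest of n_estimates
    independent observed proportions is <= x with probability >= q."""
    for k in range(n + 1):
        if binomial_cdf(k, n, p_true) ** n_estimates >= q:
            return k / n
    return 1.0


if __name__ == "__main__":
    for n_estimates in (1, 3, 5, 10):
        median = percentile_of_maximum(0.50, n_estimates, 100, 0.43)
        print(f"{n_estimates:2d} estimates: median of the maximum pi = {median:.2f}")
```

With a single estimate the median equals the true rate, and it rises steadily as more estimates are screened, even though the true rate never changes.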
A related error in thinking, common to both examples in this section, is to confuse the true but unobserved outcome rate Pi with the outcome rate pi observed in the data.
Overestimation of the Minimum Clinically Important Difference
Now suppose that the sample size for the phase III trial in question will be based on the minimum clinically important difference (in true outcome rates), derived using a combination of theory-based methods (eg, cost-effectiveness analysis) and intuition. Table 4 presents potential inputs into a cost-effectiveness analysis to estimate this minimum clinically important difference; namely, expected survival, quality-adjusted survival, and medical costs, from 6 months until death, for patients with various levels of stroke-related disability. These estimates were derived using the stroke policy model (SPM), which, in brief, was developed by the Stroke PORT to describe the natural history of stroke and to aid in decision and cost-effectiveness analysis. Inputs to the SPM include a reanalysis of data from the Framingham Study, >150 000 Medicare claim files, a large survey of patients at risk for major stroke, and an expert-based synthesis of the literature pertaining to the relationship between disability and subsequent outcomes. An article in the Journal of Clinical Epidemiology describes the SPM in more detail and illustrates the use of the SPM in analysis of the results of a hypothetical trial with similar costs and outcomes during its 6 months of follow-up (thus implying that these results cancel) but whose intervention led to slight-to-moderate shifts of the pattern of disability at the conclusion of the trial.13
As described,13 estimation of the lifetime implications of this shift in the pattern of disability involves using the results of Table 4 as inputs into a straightforward weighted average calculation, with its weights being the proportion of patients falling into the various Rankin categories at the conclusion of the follow-up period of the trial. In fact, because the comparison between intervention and usual care is a relative one (eg, the numerator of the incremental cost-effectiveness ratio is the difference in costs between the 2 groups), this calculation need only take into account the relative difference in outcomes between the 2 groups. For example, suppose that the only effect of the intervention is to shift 2% of patients from Rankin 5 to Rankin 2. Then, the cost savings associated with the intervention is (0.02)(283 382−117 583)=$3316. Similarly, the improvement in quality-adjusted life years associated with the intervention is (0.02)(3.48−0.72)=0.055. Even with a conservative estimation that the intervention does not lead to any cost savings by reducing medical use during the period of the trial, as long as its cost is less than $3316, it will dominate the usual care strategy (ie, lead to better health outcomes and lower costs) and clearly would be preferred.
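The arithmetic of this example can be reproduced with a few lines of code. The sketch below uses only the Rankin 2 and Rankin 5 values quoted in the text (the remaining Table 4 entries are not reproduced here), and the function and variable names are hypothetical. The savings and the gain are per patient treated.

```python
# Lifetime medical costs (dollars) and quality-adjusted life years from
# 6 months until death, as quoted in the text for Rankin 2 and Rankin 5.
LIFETIME_COST = {2: 117_583, 5: 283_382}
QUALITY_ADJUSTED_LIFE_YEARS = {2: 3.48, 5: 0.72}


def shift_impact(fraction_shifted, from_rankin, to_rankin):
    """Per-patient cost savings and QALY gain when a fraction of patients
    moves from one Rankin category to a less disabled one."""
    cost_savings = fraction_shifted * (
        LIFETIME_COST[from_rankin] - LIFETIME_COST[to_rankin]
    )
    qaly_gain = fraction_shifted * (
        QUALITY_ADJUSTED_LIFE_YEARS[to_rankin]
        - QUALITY_ADJUSTED_LIFE_YEARS[from_rankin]
    )
    return cost_savings, qaly_gain


if __name__ == "__main__":
    savings, gain = shift_impact(0.02, from_rankin=5, to_rankin=2)
    print(f"cost savings per patient: ${savings:,.0f}")
    print(f"QALY gain per patient: {gain:.3f}")
```

Changing `fraction_shifted` shows how quickly the break-even treatment cost scales with even a small change in the proportion of patients who benefit.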
Although the SPM is a particularly “high-tech” epidemiological model, and even though other models use different conceptual frameworks (eg, patient location rather than level of disability) and may be differently calibrated (eg, providing lower cost estimates by considering a less than complete set of medical costs), the above principle is consistently supported by other models in the published literature.14 15 In essence, the common insight among the various models is that because disabling stroke tends to result in institutionalization, which in turn has very high per diem costs, an acute stroke treatment with even a modest impact will, so long as it is not extraordinarily expensive, lead at a population level to a net improvement in the public health. This required impact is indeed quite modest (as can be verified by experimenting with weighted average calculations using the entries of Table 4), and is considerably smaller than might initially be suggested by intuition.
We do not know whether neuroprotective drugs are effective in the treatment of acute ischemic stroke. We do, however, believe that the phase III trials of these drugs may have been underpowered. (Indeed, a lack of statistical power has often been acknowledged, in hindsight, by the trials’ investigators.) In particular, we have illustrated that even small overestimates of the efficacy of an intervention can lead to a serious reduction in statistical power, and we also illustrated some of the ways in which the efficacy of an intervention might be overestimated. Finally, we illustrated that a minimum clinically important difference, derived with cost-effectiveness modeling techniques, may be considerably smaller than might be suggested by intuition.
Our demonstration has various limitations. For example, a number of issues pertaining to the design and statistical analysis of trials have, for the purposes of illustration, been greatly simplified. Our focus has been almost entirely on statistical issues pertaining to sample size. This implicitly assumes that the more substantive issues, such as patient population, time window, dosage, outcome measures, and so forth, have been adequately resolved and thus that the intervention is being tested under conditions in which it has the best possible chance to demonstrate efficacy.4 Progress is being made on all of these fronts, including the development of more sensitive outcome measures.15 16 In any event, the design and successful completion of a randomized trial of an acute stroke intervention are extraordinarily difficult tasks, and our commentary is in no way intended as a criticism of the efforts of the investigators of these trials.
Having suggested that decision and cost-effectiveness analyses can be useful in helping to estimate the minimum clinically important difference, another limitation pertains to the state of the science for these models. At present, all such models leave much to be desired, especially in their treatment of quality of life, costs, and the impact of disability on subsequent outcomes. Despite their quantitative nature, the conclusions from these models certainly do not represent the same level of evidence as the more mathematically based illustrations here. Nevertheless, the consistency of the basic insight from various models in the literature,13 14 17 namely, that an analytically derived minimum clinically important difference might be much smaller than that suggested by intuition, should merit consideration.
Without engaging in extensive speculation about how intuitive estimates for clinically important differences have been derived in the past, we do note that a 10% difference (eg, 50% good outcomes in 1 group versus 40% good outcomes in another) is approximately at the threshold at which effects tend to be noticeable to individual observers (eg, in an entirely different context, it is approximately the size of the difference in mean height between populations of 15- and 16-year-old girls).3 18 In the absence of a more theory-driven approach, it would be quite reasonable for intuition to suggest a difference of this magnitude. Unfortunately, the task of estimating the size of the treatment effect that would “matter” is likely to be one for which intuition may be ill suited. For example, the cost-effectiveness example illustrated a situation (namely, where only 2% of patients benefited, these having their disability reduced from a Rankin 5 to a Rankin 2) that suggests an intuitively small benefit when assessed from an overall population perspective. Nevertheless, for the few patients who avoid becoming highly disabled, the benefits are great. In any event, assessing the overall impact when few patients receive a large benefit, or many patients receive very small benefits, is a task that is cognitively difficult and thus amenable to assistance from analytic tools such as cost-effectiveness analysis.
In any event, it also seems important to note that intuition can vary over time, specialty, and other circumstances; for example, the effect sizes sought in the various megatrials within the field of cardiology are much smaller than a 5% to 10% absolute difference, yet are considered sufficiently large to be of importance. In essence, what the cardiologists have argued is that based on cost-effectiveness modeling and other formal approaches, small benefits, when averaged over a large number of cases, will in total accrue to a large positive impact on the public health, and this argument is now held to be intuitively reasonable. With >700 000 cases of stroke per year in the United States,19 such an argument applies with equal force to stroke.
Expert commentators have consistently recommended increasing the size of stroke trials. For example, in the recent Feinberg Memorial Lecture, Grotta2 recommended basing trials on a 5% to 10% absolute difference in outcome rates (this corresponds to 388 to 1534 patients per group). Similarly, Dorman and Sandercock1 recommend that sample sizes be far in excess of 750 patients per group. More generally, in the field of randomized trials as a whole, the calculations of sample size and statistical power are often substandard,20 and 1 of the components that is often missing is the minimum clinically important difference. In the absence of explicit guidelines about what constitutes a minimum clinically important difference, investigators can easily base sample size calculations on differences that are too large.21
Underpowered trials, particularly in the field of acute stroke treatment, should be avoided at all costs. In an underpowered trial, patients are placed at potential risk, yet the likelihood of successfully identifying an efficacious intervention (ie, the reason that it is appropriate to place such patients at risk) is low. Not only patients but also manufacturers and the community as a whole bear this risk, and all are denied the benefits of interventions that are potentially effective. From a statistical perspective, perhaps the most important next step in avoiding further underpowered trials would be for the community of stroke researchers to agree on what constitutes a minimum clinically important difference. Presumably, the more forms of evidence that are used in these deliberations, the better.
The simulations for Tables 1 and 2 were produced by assuming that each patient in question had a probability of a good outcome of Pi. In a group of n patients, the total number of patients with a good outcome is the realization of a binomial random variable, with parameters n (ie, sample size) and Pi (ie, probability of success). The observed proportion of patients with good outcomes is this realized count divided by n. For each set of conditions, 10 000 realizations were used. Simulations were performed with the Statistical Analysis System (SAS Institute), and the computer code is available from the authors on request.
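A simulation of this kind can be sketched in a few lines. The following is not the authors' SAS code. It is a minimal Python illustration that assumes equal group sizes, a Pearson χ2 test without continuity correction, and a critical value of 3.841 (1 degree of freedom, 5% type I error). The function names and the random seed are hypothetical.

```python
import random

# 95th percentile of the chi-square distribution with 1 degree of freedom.
CHI_SQUARE_CRITICAL_VALUE = 3.841


def simulate_good_outcomes(n, p, rng):
    """One binomial realization: number of good outcomes among n patients."""
    return sum(rng.random() < p for _ in range(n))


def chi_square_statistic(good_i, good_c, n):
    """Pearson chi-square for a 2x2 table with n patients in each group."""
    total_good = good_i + good_c
    total_poor = 2 * n - total_good
    if total_good == 0 or total_poor == 0:
        return 0.0  # no variation in outcome, so no evidence of a difference
    return 2 * n * (good_i - good_c) ** 2 / (total_good * total_poor)


def simulated_power(n, p_control, p_intervention, n_trials=10_000, seed=2001):
    """Fraction of simulated trials whose chi-square test is significant."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        good_c = simulate_good_outcomes(n, p_control, rng)
        good_i = simulate_good_outcomes(n, p_intervention, rng)
        if chi_square_statistic(good_i, good_c, n) > CHI_SQUARE_CRITICAL_VALUE:
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    # Planned scenario: 388 patients per group, Pc=0.40, Pi=0.50.
    print(simulated_power(388, 0.40, 0.50))
```

Because the estimate is itself random, repeated runs with different seeds will vary by about a percentage point at 10 000 realizations.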
The SPM was developed under contract to the Agency for Health Care Policy and Research (contract 282-91-0028).
- Received July 18, 2000.
- Revision received November 8, 2000.
- Accepted December 8, 2000.
- Copyright © 2001 by American Heart Association
Dorman PJ, Sandercock PAG. Considerations in the design of clinical trials of neuroprotective therapy in acute stroke. Stroke. 1996;27:1507–1515.
Grotta JC. Acute stroke therapy at the millennium: consummating the marriage between the laboratory and the bedside. Stroke. 1999;30:1722–1728.
Muir KW, Grosset DG. Neuroprotection for acute stroke: making clinical trials work. Stroke. 1999;30:180–182.
Stroke Therapy Academic Industry Roundtable (STAIR). Recommendations for standards regarding preclinical neuroprotective and restorative drug development. Stroke. 1999;30:2752–2758.
Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
Wahlgren NG, Ranasinha KW, Rosolacci T, Franke CL, van Erven PMM, Ashwood T, Claesson L, for the CLASS Study Group. Clomethiazole Acute Stroke Study (CLASS): results of a randomized, controlled trial of clomethiazole versus placebo in 1360 acute stroke patients. Stroke. 1999;30:21–28.
The RANTTAS Investigators. A randomized trial of tirilazad mesylate in patients with acute stroke (RANTTAS). Stroke. 1996;27:1453–1458.
De Deyn PP, de Reuck J, Deberdt W, Vlietinck R, Orgogozo JM, for Members of the Piracetam in Acute Stroke Study (PASS) Group. Treatment of acute ischemic stroke with piracetam. Stroke. 1997;28:2347–2352.
Grotta J, for the US and Canadian Lubeluzole Ischemic Stroke Study Group. Lubeluzole treatment of acute ischemic stroke. Stroke. 1997;28:2338–2346.
Clark WM, Raps EC, Tong DC, Kelly RE, for the Cervene Stroke Study Investigators. Cervene (nalmefene) in acute ischemic stroke: final results of a phase III efficacy study. The Cervene Stroke Study Investigators. Stroke. 2000;31:1234–1239.
Yamaguchi T, Sano K, Takakura K, Saito I, Shinohara Y, Asano T, Yasuhara H, for the Ebselen Study Group. Ebselen in acute ischemic stroke: a placebo-controlled, double-blind clinical trial. Stroke. 1998;29:12–17.
Muir KW, Lees KR. A randomized, double-blind, placebo-controlled pilot trial of intravenous magnesium sulfate in acute stroke. Stroke. 1995;26:1183–1188.
Samsa GP, Reutter RA, Parmigiani G, Ancukiewicz M, Abrahamse P, Lipscomb J, Matchar DB. Performing cost-effectiveness analysis by integrating randomized trial data with a comprehensive decision model: application to treatment of acute ischemic stroke. J Clin Epidemiol. 1999;52:259–271.
Caro JJ, Huybrechts KF, for the Stroke Economic Analysis Group (STEM). Predicting long-term costs from functional status. Stroke. 1999;30:2574–2579.
Fagan SC, Morgenstern LB, Petitta A, Ward RE, Tilley BC, Marler JR, Levine SR, Broderick JP, Kwiatkowski TG, Frankel M, Brott TG, Walker MD, and the NINDS rt-PA Stroke Study Group. Cost-effectiveness of tissue plasminogen activator for acute ischemic stroke. Neurology. 1998;50:883–890.
Duncan PW, Wallace D, Lai SM, Johnson D, Embretson S, Laster LJ. The Stroke Impact Scale Version 2.0: evaluation of reliability, validity, and sensitivity to change. Stroke. 1999;30:2131–2140.
Williams LS, Weinberger M, Harris LE, Clark DO, Biller J. Development of a stroke-specific quality of life scale. Stroke. 1999;30:1362–1369.
Samsa GP, Edelman D, Rothman ML, Williams GR, Lipscomb J, Matchar DB. Determining clinically important differences in health status measures: a general approach with illustration to the Health Utilities Index Mark II. Pharmacoeconomics. 1999;15:141–155.
Williams GR, Jiang JG, Matchar DB, Samsa GP. Incidence and occurrence of total (first-ever and recurrent) stroke. Stroke. 1999;30:2523–2528.