(Stroke. 2001;32:669.)
© 2001 American Heart Association, Inc.
Original Contributions |
From the Center for Clinical Health Policy Research (G.P.S., D.B.M.) and Departments of Medicine (G.P.S., D.B.M.) and Community and Family Medicine (G.P.S.), Duke University Medical Center, Durham, NC; and Department of Veterans Affairs Hospital (D.B.M.), Durham, NC.
Correspondence to Gregory P. Samsa, PhD, Duke University Medical Center, Suite 230, First Union Building, 2200 W Main St, Durham, NC 27705. E-mail samsa001{at}mc.duke.edu
| Abstract |
|---|
Methods: Computer simulations were based on the binomial distribution.
Results: We illustrate that even small overestimates of the efficacy of an intervention can lead to a serious reduction in statistical power, that the use of data from phase II studies tends to lead to such overestimation, and that a minimum clinically important difference derived with cost-effectiveness modeling techniques is considerably smaller than might be suggested by intuition.
Conclusions: We recommend placing more emphasis on minimum clinically important differences when planning stroke trials, with these differences being derived from an assessment of the public health impact obtained in conjunction with the use of epidemiological and cost-effectiveness models. Even small benefits, when averaged over a sufficiently large number of cases, will, in total, amount to a large positive impact on the public health.
Key Words: clinical trials; epidemiology; models, statistical; sample size; stroke, ischemic
| Introduction |
|---|
Although discussed elsewhere,1 one topic that was not considered by the roundtable was the statistical one of sample size. Here, we address the question of whether the previous phase III trials of neuroprotective drugs have been too small to have the statistical power to detect effects that are nevertheless clinically meaningful. We also consider the related question of "why." Consideration of this latter question, not previously discussed in the literature, may help elucidate how sample size calculations can be improved in the future.
| Methods |
|---|
For concreteness, we also assume that the primary analysis is a χ² test (ie, one that directly compares pi with pc). In practice, greater statistical power can be gained with various embellishments to this basic analytic strategy: for example, by considering an outcome variable such as the Barthel Index to be continuous rather than dichotomous, by combining multiple correlated outcomes into a single statistical test, and by including covariates such as study site and stroke severity. Nevertheless, the basic statistical principles illustrated here remain valid even in the presence of more sophisticated analytical approaches.
Our analysis plan uses simulations to illustrate 3 possible reasons why a randomized trial of an acute stroke treatment might be underpowered: (1) sensitivity of the results of the sample size and statistical power calculations to small changes in the inputs to these analyses, (2) overestimation of the true (but unknown) difference in outcome rates between intervention and control, and (3) overestimation of the minimum clinically important difference in outcome rates between intervention and control. Details of the simulation methodology are given in the Appendix.
From time to time, we use as examples the results of trials of neuroprotective agents that were published in Stroke during 1996 to 2000. Although not exhaustive, these trials are intended to be illustrative of current practice (and, indeed, the state of the science) in this field.
| Results |
|---|
Table 1 illustrates the relationship between sample size and power. Its rows correspond to the assumptions used by the investigator in determination of the sample size. (For simplicity, we assume that the investigator has correctly specified that Pc=0.40. In practice, misspecification of Pc will be an additional source of error.) For example, if the investigator assumes that Pi=0.50 and uses a traditional 2-sided hypothesis test with a type I error rate of 5% and a power of 80%, then the study will require a sample size of 388 patients per group (Table 1, row 10, column 1).5 For this row, the various columns present the statistical power, with the assumption of a sample size of 388 per group, as a function of the true proportion of good outcomes in the intervention group. For example, if the investigator's assumption that Pi=0.50 is correct, then the power is the desired 80% (line 5, column 11, underlined). However, if the investigator's assumption about the outcome rate in the intervention group is optimistic and Pi is actually 0.45, then a trial with 388 patients per group will have a power of only 29% (line 10, column 6). Even a difference of 2% in outcome rates can have an effect on power; for example, if the assumed outcome rate in the intervention group is Pi=0.50 but the actual rate is Pi=0.48, then the power of the study to detect a statistically significant impact of the intervention drops from 80% to 61%.
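The sample size and power figures quoted above can be approximated with a short calculation. The following is a minimal sketch, assuming the usual normal approximation for comparing 2 proportions without a continuity correction; the function names are illustrative and are not taken from the article, and the article's own Table 1 may have been produced with a different method (for example, Cohen's tables5).

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1


def sample_size_per_group(p_c, p_i, alpha=0.05, power=0.80):
    """Per-group n for a 2-sided test of two proportions.

    Assumption: normal approximation, equal group sizes,
    no continuity correction.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    p_bar = (p_c + p_i) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))           # SD under the null
    alt_sd = sqrt(p_c * (1 - p_c) + p_i * (1 - p_i))  # SD under the alternative
    n = ((z_alpha * null_sd + z_beta * alt_sd) / (p_i - p_c)) ** 2
    return ceil(n)


def power_two_proportions(p_c, p_i_true, n_per_group, alpha=0.05):
    """Power of the 2-sided test when the true intervention rate is p_i_true."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    p_bar = (p_c + p_i_true) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = sqrt(p_c * (1 - p_c) + p_i_true * (1 - p_i_true))
    shift = abs(p_i_true - p_c) * sqrt(n_per_group)
    # Probability of rejecting in either tail of the 2-sided test.
    upper = _STD_NORMAL.cdf((shift - z_alpha * null_sd) / alt_sd)
    lower = _STD_NORMAL.cdf((-shift - z_alpha * null_sd) / alt_sd)
    return upper + lower


if __name__ == "__main__":
    n = sample_size_per_group(0.40, 0.50)
    print("planned n per group:", n)
    for true_rate in (0.50, 0.48, 0.45):
        print(true_rate, round(power_two_proportions(0.40, true_rate, n), 2))
```

Under these assumptions, the sketch gives 388 patients per group for the planned 0.40 versus 0.50 comparison, and powers of approximately 80%, 61%, and 29% when the true intervention rate is 0.50, 0.48, and 0.45, respectively.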
Another way to use Table 1 is to simply note the sample size per group (column 2), to compare this against the sample sizes of various trials, and then to move to the appropriate row in the table to assess the statistical power. For example, sample sizes in the intervention group for various phase III trials are 680 for clomethiazole,6 565 for tirilazad (as planned),7 464 for piracetam,8 368 for lubeluzole,9 186 for nalmefene,10 and 152 for ebselen.11 These correspond approximately to lines 7 to 11 of Table 1 (ie, with per-group sample sizes ranging from 173 to 787). For sample sizes of this magnitude, if the actual improvement in good outcomes associated with the intervention is 5% (ie, Pi=0.45), statistical power ranges from only 16% (n=173 per group) to 52% (n=787 per group). Similarly, if the actual improvement in good outcomes associated with the intervention is 2% (ie, Pi=0.42), statistical power ranges from only 7% to 13%.
Overestimation of the True Intervention Effect
The extent to which the results of Table 1 are worrisome depends in large part on how small Pi is allowed to be. How, then, should the outcome rate in the intervention group be specified? Three typical and often interrelated approaches are data based, theory based, and intuition based. We begin by considering the data-based approach.
Suppose that the efficacy of the intervention (ie, Pi, again with the assumption that Pc is precisely known) is to be estimated from previous studies. Typically, these are 1 or more phase II dose-selection trials, but in some circumstances, previously conducted phase III trials might be available as well. For these purposes, "multiple estimates of Pi" could consist of either more than 1 previous trial and/or subgroup analyses within a single previous trial. For simplicity, in the simulations here, we assume that all multiple trials and/or analyses within the same trial have the same sample size of 100.
First, consider the situation where only 1 previous study is extant. The key element of the data-based approach is that the investigator bases the estimate of sample size for the pivotal phase III trial on exactly the magnitude of the intervention effect observed in the previous data. For example, in the US and Canadian Lubeluzole Ischemic Stroke Study Group,9 "Sample size ... was based on the phase-II trial of lubeluzole. The study was powered ... to detect a difference between 14% and 23.5% in 3-month mortality" (these being the figures previously observed). As another example, in the Cervene Stroke Study, "Sample size was determined with a pooled estimate of the primary efficacy variable from the prior studies (nalmefene 70%, placebo 55%)."10 Similar reasoning is evident in the discussion of the results of a phase II trial of magnesium sulfate: "Based on the figures obtained from this study, a [phase III] trial to demonstrate the efficacy of intravenous magnesium sulfate would require 712 patients ... [to detect a] difference from 40% to 30% in proportion dead or disabled"12 (the 40% and 30% being observed in the phase II trial with 60 patients). In essence, the investigator assumes that the true (but unobservable) value of Pi to be used in the sample size calculation is exactly the same as the value of pi observed in the previous study.
Table 2 illustrates the deleterious effects that chance can have on the above procedure. The table assumes that the true (but unknown to the investigator) proportion of good outcomes in the intervention group is Pi=0.43. Because Pi is unknown, it will be estimated from the observed pi, which in turn is subject to sampling variability (summarized by its sampling distribution). The median value of the sampling distribution is, as might be anticipated, 0.43. However, 1 of 4 times (ie, 75th percentile), the investigator will have the bad luck to observe a pi of 0.47. From Table 1, the assumption that Pi is 0.47 when in fact it is 0.43 will lead to a significant reduction in power (ie, from 80% to 52%). This phenomenon becomes more pronounced as the sample size decreases but can still affect the results even with large samples. For example, if the previous study has 500 patients (larger than a typical phase II trial), the 75th percentile of the sampling distribution is 0.45 (data not shown). In any event, the error in thinking here is to ignore sampling variability.
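The sampling distribution of the observed pi can also be examined exactly rather than by simulation. The following is a minimal sketch, assuming the number of good outcomes in the previous study follows a binomial distribution with the true rate Pi=0.43; the function name is hypothetical. Because it computes exact percentiles, its results may differ by about 0.01 from the simulation-based entries of Table 2.

```python
from math import comb


def binomial_quantile(n, p, q):
    """Smallest count k with P(X <= k) >= q, where X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        # Add the probability of exactly k good outcomes among n patients.
        cumulative += comb(n, k) * p**k * (1 - p) ** (n - k)
        if cumulative >= q:
            return k
    return n  # reached only if rounding keeps the total just below q


if __name__ == "__main__":
    true_rate = 0.43  # true (but unobservable) Pi used in the text
    for n in (100, 500):
        median_p = binomial_quantile(n, true_rate, 0.50) / n
        upper_quartile_p = binomial_quantile(n, true_rate, 0.75) / n
        print(n, median_p, upper_quartile_p)
```

The gap between the median and the 75th percentile narrows as n grows, which is the point made in the text: a larger previous study reduces, but does not remove, the chance of an optimistic estimate.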
Table 3 illustrates another way things can go wrong. Now suppose that the investigator has >1 observed value of pi, each based on a sample size of 100, from which to choose. For example, a pharmaceutical company might have data from phase II trials of 3 different compounds and intend to support a phase III trial of only the most promising one. Alternatively, a phase III trial might already have been conducted whose overall results were negative/equivocal but which also produced 10 subgroup analyses, each of which generates an estimate of pi. Finally, assume that the intervention is equally effective throughout, with true (but unobservable) Pi=0.43. Now, even though the actual Pi remains unchanged, the observed values of pi will tend to vary; for example, 3 such observed values might be pi=0.41, pi=0.44, and pi=0.48. In addition, suppose that the investigator intends to base the sample size calculation for the subsequent phase III trial on the best result observed. That is, the investigator will select the maximum observed value of pi, assume that this is the true rate Pi, and then plan the sample size for the following phase III trial accordingly.
In Table 3, the large majority of entries indicate maximum outcome rates that exceed 0.43, thus causing the investigator to overestimate the treatment effect. This bias increases with the number of pi values considered (eg, increases with the number of subgroup analyses). For example, if 5 subgroup analyses are performed, the 50th percentile is pi=0.50. A phase III trial based on this misperception would have a power of only 14% (Table 1). Even if the investigator has good luck and the 25th percentile is observed, then the estimate will be pi=0.47, and the power will only be 52% (Table 1). This bias increases as the sample size of the subgroups decreases (data not shown). The phenomenon described here is well known in the statistical literature as regression to the mean. Here, the error in thinking is to uncritically accept at face value the best of a series of results, without taking into account the effects of chance.
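The effect of selecting the best of several results can be illustrated with a small simulation. The following is a minimal sketch, assuming each previous study or subgroup has 100 patients, a common true rate of Pi=0.43, and independent binomial outcomes; the function name, number of replications, and random seed are illustrative choices and are not taken from the Appendix.

```python
import random
from statistics import median


def simulate_best_of_k(p_true, n_per_study, k, reps=2000, seed=0):
    """Simulate the maximum observed proportion among k independent studies.

    Returns a list of `reps` maxima; each study has n_per_study patients
    whose outcomes are good with probability p_true.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    best = []
    for _ in range(reps):
        observed = [
            sum(rng.random() < p_true for _ in range(n_per_study)) / n_per_study
            for _ in range(k)
        ]
        best.append(max(observed))
    return best


if __name__ == "__main__":
    for k in (1, 3, 5, 10):
        maxima = simulate_best_of_k(0.43, 100, k)
        share_above = sum(m > 0.43 for m in maxima) / len(maxima)
        print(k, "median of best result:", median(maxima),
              "share above 0.43:", round(share_above, 2))
```

As k increases, the median of the best observed result moves further above the true rate of 0.43, even though every study shares the same underlying efficacy.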
A related error in thinking, common to both examples in this section, is to confuse the true but unobserved outcome rate Pi with the outcome rate pi observed in the data.
Overestimation of the Minimum Clinically Important Difference
Now suppose that the sample size for the phase III trial in question will be based on the minimum clinically important difference (in true outcome rates), derived using a combination of theory-based methods (eg, cost-effectiveness analysis) and intuition. Table 4 presents potential inputs into a cost-effectiveness analysis to estimate this minimum clinically important difference; namely, expected survival, quality-adjusted survival, and medical costs, from 6 months until death, for patients with various levels of stroke-related disability. These estimates were derived using the stroke policy model (SPM), which, in brief, was developed by the Stroke PORT to describe the natural history of stroke and to aid in decision and cost-effectiveness analysis. Inputs to the SPM include a reanalysis of data from the Framingham Study, >150 000 Medicare claim files, a large survey of patients at risk for major stroke, and an expert-based synthesis of the literature pertaining to the relationship between disability and subsequent outcomes. An article in the Journal of Clinical Epidemiology describes the SPM in more detail and illustrates the use of the SPM in analysis of the results of a hypothetical trial with similar costs and outcomes during its 6 months of follow-up (thus implying that these results cancel) but whose intervention led to slight-to-moderate shifts of the pattern of disability at the conclusion of the trial.13
As described,13 estimation of the lifetime implications of this shift in the pattern of disability involves using the results of Table 4 as inputs into a straightforward weighted average calculation, with its weights being the proportion of patients falling into the various Rankin categories at the conclusion of the follow-up period of the trial. In fact, because the comparison between intervention and usual care is a relative one (eg, the numerator of the incremental cost-effectiveness ratio is the difference in costs between the 2 groups), this calculation need only take into account the relative difference in outcomes between the 2 groups. For example, suppose that the only effect of the intervention is to shift 2% of patients from Rankin 5 to Rankin 2. Then, the cost savings associated with the intervention is (0.02)($283 382-$117 583)=$3316. Similarly, the improvement in quality-adjusted life years associated with the intervention is (0.02)(3.48-0.72)=0.055. Even with a conservative estimation that the intervention does not lead to any cost savings by reducing medical use during the period of the trial, as long as its cost is less than $3316, it will dominate the usual care strategy (ie, lead to better health outcomes and lower costs) and clearly would be preferred.
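The weighted average calculation in this example can be written out as follows. This is a minimal sketch that uses only the 2 Rankin levels quoted in the text; the assignment of each figure to Rankin 2 or Rankin 5 is inferred from the worked example, the remaining Table 4 entries are deliberately omitted, and the function and variable names are hypothetical.

```python
# Lifetime estimates per patient from 6 months until death, as quoted in the
# worked example. Assumption: the larger cost and the smaller quality-adjusted
# survival belong to Rankin 5, as the subtraction in the text implies.
LIFETIME_COST = {2: 117_583, 5: 283_382}  # US dollars
LIFETIME_QALY = {2: 3.48, 5: 0.72}        # quality-adjusted life years


def incremental_effect(shifts, cost, qaly):
    """Per-patient cost savings and QALY gain from shifts in disability.

    shifts: list of (fraction_of_patients, from_rankin, to_rankin).
    Only differences between groups matter, so unshifted patients cancel.
    """
    cost_savings = sum(f * (cost[a] - cost[b]) for f, a, b in shifts)
    qaly_gain = sum(f * (qaly[b] - qaly[a]) for f, a, b in shifts)
    return cost_savings, qaly_gain


if __name__ == "__main__":
    # The intervention shifts 2% of patients from Rankin 5 to Rankin 2.
    savings, gain = incremental_effect([(0.02, 5, 2)], LIFETIME_COST, LIFETIME_QALY)
    print(f"cost savings per patient treated: ${savings:,.0f}")
    print(f"QALY gain per patient treated: {gain:.3f}")
```

With these inputs the sketch gives the $3316 and 0.055 quality-adjusted life years stated in the text; other shifts can be explored by adding entries to the list once the corresponding Table 4 values are supplied.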
Although the SPM is a particularly "high-tech" epidemiological model, and even though other models use different conceptual frameworks (eg, patient location rather than level of disability) and may be differently calibrated (eg, providing lower cost estimates by considering a less than complete set of medical costs), the above principle is consistently supported by other models in the published literature.14,15 In essence, the common insight among the various models is that, because disabling stroke tends to result in institutionalization, which in turn has very high per diem costs, an acute stroke treatment with even a modest impact will, so long as it is not extraordinarily expensive, lead at a population level to a net improvement in the public health. This required impact is indeed quite modest (as can be verified by experimenting with weighted average calculations using the entries of Table 4) and is considerably smaller than might initially be suggested by intuition.
| Discussion |
|---|
Our demonstration has various limitations. For example, a number of issues pertaining to the design and statistical analysis of trials have, for the purposes of illustration, been greatly simplified. Our focus has been almost entirely on statistical issues pertaining to sample size. This implicitly assumes that the more substantive issues, such as patient population, time window, dosage, outcome measures, and so forth, have been adequately resolved and thus that the intervention is being tested under conditions in which it has the best possible chance to demonstrate efficacy.4 Progress is being made on all of these fronts, including the development of more sensitive outcome measures.15 16 In any event, the design and successful completion of a randomized trial of an acute stroke intervention are extraordinarily difficult tasks, and our commentary is in no way intended as a criticism of the efforts of the investigators of these trials.
Having suggested that decision and cost-effectiveness analyses can be useful in helping to estimate the minimum clinically important difference, another limitation pertains to the state of the science for these models. At present, all such models leave much to be desired, especially in their treatment of quality of life, costs, and the impact of disability on subsequent outcomes. Despite their quantitative nature, the conclusions from these models certainly do not represent the same level of evidence as the more mathematically based illustrations here. Nevertheless, the consistency of the basic insight from various models in the literature,13 14 17 namely, that an analytically derived minimum clinically important difference might be much smaller than that suggested by intuition, should merit consideration.
Without engaging in extensive speculation about how intuitive estimates for clinically important differences have been derived in the past, we do note that a 10% difference (eg, 50% good outcomes in 1 group versus 40% good outcomes in another) is approximately at the threshold at which effects tend to be noticeable to individual observers (eg, in an entirely different context, it is approximately the size of the difference in mean height between populations of 15- and 16-year-old girls).3,18 In the absence of a more theory-driven approach, it would be quite reasonable for intuition to suggest a difference of this magnitude. Unfortunately, the task of estimating the size of the treatment effect that would "matter" is likely to be one for which intuition is ill suited. For example, the cost-effectiveness example illustrated a situation (namely, one in which only 2% of patients benefited, these having their disability reduced from a Rankin 5 to a Rankin 2) that suggests an intuitively small benefit when assessed from an overall population perspective. Nevertheless, for the few patients who avoid becoming highly disabled, the benefits are great. In any event, assessing the overall impact when few patients receive a large benefit, or many patients receive very small benefits, is a task that is cognitively difficult and thus amenable to assistance from analytic tools such as cost-effectiveness analysis.
In any event, it also seems important to note that intuition can vary over time, specialty, and other circumstances; for example, the effect sizes sought in the various megatrials within the field of cardiology are much smaller than a 5% to 10% absolute difference, yet are considered sufficiently large to be of importance. In essence, what the cardiologists have argued is that based on cost-effectiveness modeling and other formal approaches, small benefits, when averaged over a large number of cases, will in total accrue to a large positive impact on the public health, and this argument is now held to be intuitively reasonable. With >700 000 cases of stroke per year in the United States,19 such an argument applies with equal force to stroke.
Expert commentators have consistently recommended increasing the size of stroke trials. For example, in the recent Feinberg Memorial Lecture, Grotta2 recommended basing trials on a 5% to 10% absolute difference in outcome rates (this corresponds to 388 to 1534 patients per group). Similarly, Dorman and Sandercock1 recommend that sample sizes be far in excess of 750 patients per group. More generally, in the field of randomized trials as a whole, the calculations of sample size and statistical power are often substandard,20 and 1 of the components that is often missing is the minimum clinically important difference. In the absence of explicit guidelines about what constitutes a minimum clinically important difference, investigators can easily base sample size calculations on differences that are too large.21
Underpowered trials, particularly in the field of acute stroke treatment, should be avoided at all costs. In an underpowered trial, patients are placed at potential risk, yet the likelihood of successfully identifying an efficacious intervention (ie, the reason that it is appropriate to place such patients at risk) is low. Not only patients but also manufacturers, as well as the community as a whole, are at risk, all of whom are denied the benefits of interventions that are potentially effective. From a statistical perspective, perhaps the most important next step in avoiding further underpowered trials would be for the community of stroke researchers to agree on what constitutes a minimum clinically important difference. Presumably, the more forms of evidence that are used in these deliberations, the better.
| Appendix 1 |
|---|
| Acknowledgments |
|---|
Received July 18, 2000; revision received November 8, 2000; accepted December 8, 2000.
| References |
|---|
2. Grotta JC. Acute stroke therapy at the millennium: consummating the marriage between the laboratory and the bedside. Stroke. 1999;30:1722-1728.
3. Muir KW, Grosset DG. Neuroprotection for acute stroke: making clinical trials work. Stroke. 1999;30:180-182.
4. Stroke Therapy Academic Industry Roundtable (STAIR). Recommendations for standards regarding preclinical neuroprotective and restorative drug development. Stroke. 1999;30:2752-2758.
5. Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
6. Wahlgren NG, Ranasinha KW, Rosolacci T, Franke CL, van Erven PMM, Ashwood T, Claesson L, for the CLASS Study Group. Clomethiazole Acute Stroke Study (CLASS): results of a randomized, controlled trial of clomethiazole versus placebo in 1360 acute stroke patients. Stroke. 1999;30:21-28.
7. The RANTTAS Investigators. A randomized trial of tirilazad mesylate in patients with acute stroke (RANTTAS). Stroke. 1996;27:1453-1458.
8. De Deyn PP, de Reuck J, Deberdt W, Vlietinck R, Orgogozo JM, for Members of the Piracetam in Acute Stroke Study (PASS) Group. Treatment of acute ischemic stroke with piracetam. Stroke. 1997;28:2347-2352.
9. Grotta J, for the US and Canadian Lubeluzole Ischemic Stroke Study Group. Lubeluzole treatment of acute ischemic stroke. Stroke. 1997;28:2338-2346.
10. Clark WM, Raps EC, Tong DC, Kelly RE, for the Cervene Stroke Study Investigators. Cervene (nalmefene) in acute ischemic stroke: final results of a phase III efficacy study. Stroke. 2000;31:1234-1239.
11. Yamaguchi T, Sano K, Takakura K, Saito I, Shinohara Y, Asano T, Yasuhara H, for the Ebselen Study Group. Ebselen in acute ischemic stroke: a placebo-controlled, double-blind clinical trial. Stroke. 1998;29:12-17.
12. Muir KW, Lees KR. A randomized, double-blind, placebo-controlled pilot trial of intravenous magnesium sulfate in acute stroke. Stroke. 1995;26:1183-1188.
13. Samsa GP, Reutter RA, Parmigiani G, Ancukiewicz M, Abrahamse P, Lipscomb J, Matchar DB. Performing cost-effectiveness analysis by integrating randomized trial data with a comprehensive decision model: application to treatment of acute ischemic stroke. J Clin Epidemiol. 1999;52:259-271.
14. Caro JJ, Huybrechts KF, for the Stroke Economic Analysis Group (STEM). Predicting long-term costs from functional status. Stroke. 1999;30:2574-2579.
15. Fagan SC, Morgenstern LB, Petitta A, Ward RE, Tilley BC, Marler JR, Levine SR, Broderick JP, Kwiatkowski TG, Frankel M, Brott TG, Walker MD, and the NINDS rt-PA Stroke Study Group. Cost-effectiveness of tissue plasminogen activator for acute ischemic stroke. Neurology. 1998;50:883-890.
16. Duncan PW, Wallace D, Lai SM, Johnson D, Embretson S, Laster LJ. The Stroke Impact Scale Version 2.0: evaluation of reliability, validity, and sensitivity to change. Stroke. 1999;30:2131-2140.
17. Williams LS, Weinberger M, Harris LE, Clark DO, Biller J. Development of a stroke-specific quality of life scale. Stroke. 1999;30:1362-1369.
18. Samsa GP, Edelman D, Rothman ML, Williams GR, Lipscomb J, Matchar DB. Determining clinically important differences in health status measures: a general approach with illustration to the Health Utilities Index Mark II. Pharmacol Economics. 1999;15:141-155.
19. Williams GR, Jiang JG, Matchar DB, Samsa GP. Incidence and occurrence of total (first-ever and recurrent) stroke. Stroke. 1999;30:2523-2528.
20. Moher D, Dulberg CS, Wells GA. Statistical power, sample size, and their reporting in randomized controlled trials. JAMA. 1994;272:122-124.
21. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121:200-206.