Strengthening Acute Stroke Trials Through Optimal Use of Disability End Points
Background and Purpose— Suboptimal choices of primary end point for acute stroke trials may have contributed to inconclusive results. The Barthel Index (BI) and Rankin Scale (RS) have been widely used and analyzed in various ways. We sought to investigate the most powerful end point for use in acute stroke trials.
Methods— Data from the Glycine Antagonist in Neuroprotection (GAIN) International Trial were used to simulate 24 000 clinical trials exploring various patterns and magnitudes of treatment effect and thus to estimate the statistical power for a range of end points based on the BI or RS.
Results— RS end points were more powerful than BI end points. End points dichotomized toward the favorable extreme of either scale or adjusted according to baseline prognosis (“patient-specific” end point) were among the most powerful. Combining RS and BI in a “global” end point was also successful. Improvements in statistical power indicated that using a RS end point instead of BI ≥60 could reduce the sample size by up to 84% (95% CI, 80% to 87%), 73% (95% CI, 68% to 79%) for a patient-specific BI end point, or 81% (95% CI, 76% to 85%) for a global end point.
Conclusions— The RS and global end points are preferable to BI end points; the position of the cut point is also important. Better choices of end point substantially strengthen trial power for a given trial size or allow reduced sample sizes without loss of statistical power.
Most clinical trials in acute stroke have been unsuccessful in demonstrating a positive therapeutic effect. Neuroprotective trials have likely been underpowered to detect subtle but clinically important treatment effects. Statistical power is the probability that a statistical test identifies a significant treatment effect (where one truly exists) at a given significance level and sample size.1 Inappropriate choice of cut point for analysis of the outcome scale(s) may be one of several factors contributing to a lack of statistical power. Efforts must be made to optimize the analysis of clinical trials for both ethical and practical reasons.2
A variety of primary end points have been used in acute stroke trials. The Barthel Index (BI)3 and the modified Rankin Scale (RS)4 have been the most commonly used disability outcome measures. The BI is a 10-item scale in which disability is assessed on various aspects of self-care, such as dressing and toilet use. It has a maximum score of 100 (fully independent, physically functioning). The RS is a 6-point scale in which a patient is rated from 0 (no symptoms) to 5 (severe disability). Both the RS and BI have been shown to be reliable and valid for use in stroke5; however, the RS may be less reproducible because of its relative lack of structure.6
To date, functional outcome scores have usually been dichotomized as favorable versus unfavorable, although there is little consensus on the optimal cut point,7 and the choice is often arbitrary. The most commonly used end point in published trials has been the BI cut point at 60, at which a patient is thought to be capable of independence from full-time care.8 Cut points anywhere from BI 55 to BI 100 have been used.5 The RS has been used less frequently, although cut points of ≤2 (slight or no disability) and ≤1 (no significant disability) have been used. A trichotomized BI end point (split into 3 categories) has also been used.9
The BI has a U-shaped distribution, in which patient outcomes cluster at the extremes. The quarter of patients who die are arbitrarily scored 0; the 40% who recover are scored 95 or 100. Since the remaining third have BI scores distributed between 5 and 90, any cut point selected within this range will have a small number of patients populating the adjacent categories: as few as 5% of the patients may lie 5 or 10 points below a cut point of 60. If it is assumed that a drug treatment effect will improve patients by only 1 or 2 BI categories and that not all patients will improve, the potential to detect such a small shift must be negligible. In contrast, patients are more heavily represented around BI 95, and here small improvements applying to a larger number of subjects may be more readily detected. Clearly, however, dichotomization at this mild end of the scale disadvantages the contribution of more severely affected patients; on average, their outcomes will be much poorer, and small but valuable improvements caused by treatment would not be measured. To allow both mildly and moderately severely affected patients to contribute to the significance test, a second cut point can be added, forming a trichotomized analysis.
Another approach is to use a global end point, simultaneously incorporating outcome measures from different domains such as handicap and activities of daily living. This is conceptually appealing because no single outcome measure describes all dimensions of recovery from stroke, yet it has received limited attention to date. The statistical power of a global end point should be greater than or equal to that of an individual end point10 but may be weakened with the inclusion of a scale less influenced by a treatment.
There is considerable heterogeneity in stroke severity; using an end point with a fixed cut point may render many patients uninformative. It may be appropriate to group patients according to clinical presentation and to vary cut points according to group. This “patient-specific” end point would give a more realistic assessment of a treatment effect and allow all patients to contribute to the results of the trial.
This article explores the optimal primary end points incorporating the BI and RS. We assessed a selection of end points used in published trials as well as patient-specific and global end points. We sought to establish which end point would perform best under likely trial circumstances. The Oxford classification11 was used to categorize patients by clinical presentation in this study.
Our statistical approach is described in an appendix to this article (available online at http://stroke.ahajournals.org). Briefly, we based our work on the patients from the Glycine Antagonist in Neuroprotection (GAIN) International Trial9 data set. The GAIN trial was neutral; however, to avoid any bias, only the placebo patients were used. We generated 24 000 clinical trials, each with 1400 patients split between active treatment and placebo groups (700 per group), representing 33.6 million randomized patients. Within each trial, patients were simulated by randomly sampling with replacement from the GAIN data. The characteristics of every simulated patient were based on a real example from the GAIN trial, preserving the correlation between the National Institutes of Health Stroke Scale (NIHSS),12 Oxford classification, and final outcome described by RS and BI. The placebo and treatment groups were generated slightly differently, so that the simulated treatment group was forced to have slightly milder stroke as assessed by NIHSS at baseline. The difference between the average NIHSS score for the 2 groups varied from 0 through 3 points (described as treatment level), but for clarity our results concentrate on the 2-point difference. This treatment level is equivalent to a relative risk reduction in being dead or disabled of 9%, an absolute risk reduction of 4%, or an odds ratio of 1.19, with the use of BI ≥60.
The above “fixed” effect was our most basic approach since it assumes that treatment is uniformly effective in all patients. Consequently, we also simulated effects in which benefit from treatment was dependent on certain patient characteristics, such as age and sex (neuroprotective effect, denoted NP); in which a uniform benefit was offset in a randomly selected subgroup by deterioration to mimic the effect of thrombolysis (TP1); and finally, an effect that was dependent on patient characteristics, with deterioration in some patients (TP2). In summary, there were 24 000 trials: 1500 simulated trials for each of 4 treatment effects and 4 treatment levels, with every trial involving 1400 patients.
Published cut points were used when we dichotomized or trichotomized the BI and/or RS (Table 1). We also explored patient-specific cut points, in which we specified different thresholds for favorable outcome according to baseline prognosis, using the Oxford classification to group patients. We chose thresholds that were close to the median value of BI or RS achieved by each Oxford classification category in the original GAIN trial.
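The difference between a fixed and a patient-specific dichotomy can be sketched in a few lines of Python. This is an illustration only: the function names are ours, and the thresholds in `RS_FAVORABLE_MAX` are hypothetical placeholders, not the values listed in Table 1. The four category labels are those of the Oxford classification.

```python
# Hypothetical RS thresholds per Oxford classification category.
# Placeholders for illustration; the study's actual cut points are in Table 1.
RS_FAVORABLE_MAX = {"TACI": 4, "PACI": 2, "LACI": 2, "POCI": 1}


def favorable_fixed(rs, cut=1):
    """Fixed dichotomy: favorable if RS is at or below a single cut point."""
    return rs <= cut


def favorable_patient_specific(rs, oxford_class, thresholds=RS_FAVORABLE_MAX):
    """Patient-specific dichotomy: the cut point depends on baseline prognosis."""
    return rs <= thresholds[oxford_class]


# Toy patients as (Oxford class, RS at 90 days); made-up values.
patients = [("TACI", 4), ("PACI", 3), ("LACI", 1), ("POCI", 2)]
fixed = [favorable_fixed(rs) for _, rs in patients]
specific = [favorable_patient_specific(rs, oc) for oc, rs in patients]
```

In this toy example the severe (TACI) patient with RS 4 counts as unfavorable under the fixed RS ≤1 rule but as favorable under the patient-specific rule, which is how such an end point lets severely affected patients contribute to the comparison.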
Estimation of Statistical Power
We analyzed the simulated trials via Pearson’s χ² test for dichotomized end points and the Cochran-Mantel-Haenszel χ² test13 for trichotomized end points. The global end points were analyzed via generalized estimating equations.14 A bootstrap approach was used to calculate CIs for the power.
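For a dichotomized end point, the power estimate is simply the proportion of simulated trials whose test is significant. A minimal Python sketch follows, under stated assumptions: a 2-sided significance level of 0.05 and a percentile bootstrap are our choices for illustration (the text does not specify either), and the function names are hypothetical.

```python
import math
import random


def pearson_chi2_p(a, b, c, d):
    """P value of Pearson's chi-squared test (1 df, no continuity correction)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # a degenerate table carries no evidence of a difference
    chi2 = n * (a * d - b * c) ** 2 / denom
    # With 1 df, the chi-squared survival function equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def estimate_power(p_values, alpha=0.05):
    """Proportion of simulated trials that reach significance."""
    return sum(p < alpha for p in p_values) / len(p_values)


def bootstrap_ci(p_values, alpha=0.05, n_boot=1000, seed=0):
    """Percentile bootstrap 95% CI for the power (an assumed variant)."""
    rng = random.Random(seed)
    n = len(p_values)
    stats = sorted(
        estimate_power([p_values[rng.randrange(n)] for _ in range(n)], alpha)
        for _ in range(n_boot)
    )
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]
```

Each simulated trial contributes one P value, computed from the counts of favorable and unfavorable outcomes in the two groups; `estimate_power` and `bootstrap_ci` are then applied to the list of P values for one end point and one treatment scenario.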
The end points were compared by calculating the sample size that would be required to maintain the same statistical power when 1 end point was chosen in preference to BI ≥60 with the use of standard sample size equations.15–17 If an end point were more powerful, the required sample size expressed as a percentage would be <100%. For an overall comparison of the end points, binary logistic regression was used to model the proportion of significant trials, adjusted for treatment effect size.
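One way to see how a difference in power translates into a relative sample size is the usual normal approximation, in which power is roughly Φ(θ√n − z), where θ is the standardized effect that an end point captures per patient and z is the critical value. The Python sketch below uses that approximation; it is not necessarily the exact form of the equations in the cited references, and the function name, the 0.05 significance level, and the power values in the example are our assumptions.

```python
from statistics import NormalDist


def relative_sample_size(power_new, power_ref, alpha=0.05):
    """Sample size needed with a new end point, as a fraction of that needed
    with the reference end point, to reach the same target power.

    Both power values must have been observed at the same trial size.
    From power = Phi(theta * sqrt(n) - z_alpha), theta is proportional to
    (z_alpha + z_power), and the required n is proportional to 1 / theta**2.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    return ((z_alpha + z(power_ref)) / (z_alpha + z(power_new))) ** 2


# Hypothetical power values, not taken from Table 2.
ratio = relative_sample_size(power_new=0.80, power_ref=0.50)
```

A value below 1 (below 100% when expressed as a percentage) means that the new end point needs fewer patients than the reference end point for the same power.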
The pattern of results was similar across all treatment effect patterns for both the RS and BI end points (Table 2). Power was lowest under the NP and TP2 effects. The BI ≥60 dichotomy was consistently the least powerful end point. Among the remaining BI end points, the ≥95 dichotomy and the patient-specific dichotomized end points were jointly the most powerful (Figure). The RS end points followed a less consistent pattern. The RS ≤2 end point was the least powerful RS end point for all treatment effect patterns; end points incorporating RS ≤3 were no better (data not shown). Depending on the treatment effect pattern, the RS ≤1, the RS ≤1 and ≤2 trichotomy, or the dichotomized patient-specific end point was the most powerful. The range of power was narrower for the RS end points than the BI end points.
Both the dichotomized and patient-specific global end points were more powerful than the BI end points for all treatment effect patterns but not always more powerful than RS ≤1 or the RS ≤1 and ≤2 trichotomy. Generally, the patient-specific global end point was less powerful than the dichotomized global end point.
Table 3 compares the end points in terms of required sample sizes relative to BI ≥60. For the BI end points, the greatest sample size reduction was obtained under the TP2 effect and the patient-specific end point or the BI ≥95 end point. The RS end points generally had larger sample size reductions. Either of the global end points could reduce the sample sizes even further, depending on the underlying treatment effect pattern.
Overall, the RS end points were more powerful than the BI end points (Table 4). The odds of achieving a statistically significant result increased by 89% under a fixed treatment effect if a RS end point were used instead of a BI end point.
Our results have important implications for the choice of primary end point in acute stroke trials. Primary end points that include the RS are more powerful than those based on the BI. The position of the cut point on these scales is also of great importance; end points dichotomized toward the favorable extreme were more powerful. The patient-specific BI and the trichotomized RS also performed well.
Our analyses were performed with a range of treatment effects, and our findings are reasonably consistent across a likely range of trial conditions. However, since all analyses used the GAIN International database, applying the end points to an independent data set may be informative.
Broderick and colleagues18 used National Institute of Neurological Disorders and Stroke (NINDS) trial data and established that the RS dichotomized at ≤1 was the most effective in differentiating between the treatment groups in that trial. The BI dichotomized at ≥95 was also effective. However, since such an analysis is data dependent, it may not be generalizable. An analysis that relies solely on choosing positive end points from a selection of trials in which putative effects may have been seen is subject to selection bias and random variability. Our method involves assumptions about the generation of the treatment effect (since it assumes that outcome at 90 days is related to initial stroke severity). We used a sampling-based approach in which the “simulated” outcomes at 90 days were generated by selecting outcomes from the GAIN International database; real patient outcomes were used, and the correlation structure between the outcome measures was retained. By simulating 1500 trials of each treatment scenario (equivalent to 33.6 million patients), we achieved accurate estimates of statistical power.
Dichotomization may be less sensitive than trichotomized end points, global end points, or patient-specific end points.19 Berge and Barer20 supported the separate definition of favorable outcome for subgroups of patients to maximize the power of stroke trials. Patient-specific end points would ensure that trial results are generalizable across a wide range of stroke severity. Our cut points were chosen on the basis of the Oxford classification category; further work is required to assess more appropriate methods of selecting the cut points.
The inclusion of only the BI and RS in the global end points may have restricted the power. These outcome measures are highly correlated; the full potential of a global end point to assess many different dimensions of recovery was not exploited. The inclusion of other outcome measures such as the NIHSS, as in the NINDS trial,21 may further improve the power. However, some regulatory authorities, such as the European Medicines Evaluation Agency, have been reluctant to consider a global end point that combines diverse outcome measures.22
Most stroke trials are powered to detect an absolute risk reduction of 10%. This study used a treatment effect level that was equivalent to an absolute risk reduction of 4% (BI ≥60, fixed effect). We believe that this is a more realistic effect of a stroke intervention. This has resulted in levels of statistical power substantially below 80%, suggesting that the sample size of 1400 is too small. The final column in Table 2 shows the power that was achieved when a 3-point decrease in baseline NIHSS was applied (absolute effect of 9%). With this larger treatment effect, the power for all end points exceeds 80%, and although the absolute differences among the end points are smaller, the hierarchy is unchanged.
Treatment effect patterns influence study power. This may have been underestimated in stroke trial design. When treatment benefit is restricted to subgroups, lower power is observed because the average benefit is diluted by nonresponders. For example, our NP effect restricted the benefit for elderly women, and the overall absolute risk reduction was reduced to 2%.
Three trials have demonstrated a positive therapeutic effect in acute stroke: the NINDS recombinant tissue plasminogen activator (rtPA) trial,21 Stroke Treatment With Ancrod Trial (STAT),23 and Prolyse in Acute Cerebral Thromboembolism (PROACT) II.24 None of those trials used the most commonly published end points. The NINDS trial used a global end point incorporating the BI (≥95), RS (≤1), NIHSS (≤1), and Glasgow Outcome Scale25 (=1). PROACT II used RS ≤2, and the STAT study used BI ≥95 or score equal to prestroke value. It is notable that these end points were among the most powerful we assessed. However, a post hoc analysis of the European Cooperative Acute Stroke Study (ECASS) II trial26 found that if RS ≤2 had been used instead of RS ≤1, the trial would have been positive.
It is not only the choice of end point that can influence the power of a clinical trial: validity of outcome measures and restrictive entry criteria may also be factors. STAT and NINDS both restricted time to treatment to 3 hours. PROACT II restricted entry to patients with proven middle cerebral artery occlusion.
We have demonstrated the disadvantage of BI ≥60 as a primary end point. Trials that are currently in progress should consider revisions to their statistical analysis plan before unblinding takes place to optimize statistical power. Such a decision has recently been announced by the international Intravenous Magnesium Efficacy in Stroke (IMAGES) trial group.27
In conclusion, this study has shown that many clinical trials in acute stroke have not used an optimal primary end point, which may have led to inconclusive results. Statistical power is not sufficient to render a trial informative, but it may be a prerequisite. Substantial and significant increases in power are observed when a dichotomized end point cut at the favorable extreme of the BI or RS, a patient-specific end point, or a global end point is used. On average, RS end points appear more powerful than BI end points, whether analyzed alone or as part of a global end point.
Simulation of Outcomes
For our analyses, we based our simulations on ischemic stroke patients from the GAIN International Trial1 placebo group to generate 24 000 clinical trials, each with a sample size of 1400 patients (700 per treatment group). Outcomes at 90 days were generated by means of a selective sampling approach, as follows.
For each simulated trial, 2 samples of 700 patients were drawn at random, with replacement, from the GAIN data. Replacement means that the same patient can appear more than once in the same trial. The first sample was assigned to the simulated placebo group, and the second sample was used to generate outcomes at 90 days for a simulated active treatment group. This was achieved by calculating a revised baseline NIHSS score for each patient, representing a therapeutic drug effect. Outcomes were then sampled with replacement from GAIN patients who had the revised NIHSS value, specifying also that the outcome should come from a patient with the same Oxford classification as the original patient, to ensure that both simulated groups had similar clinical characteristics. The patients generated in this way were assigned to the simulated active treatment group. For example, a patient with a baseline NIHSS score of 5 and Oxford classification of partial anterior circulation infarct (PACI) could have a revised NIHSS score of 3. A new patient with an NIHSS score of 3 would be randomly selected with replacement from the PACI subset of the GAIN data to provide a representative outcome at 90 days. This process was repeated 1500 times (each representing a single clinical trial) for each treatment effect and magnitude. The simulated trial data sets therefore contained baseline NIHSS and 90-day outcomes that had been assessed under standard clinical trial conditions.
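The sampling procedure for one simulated trial can be sketched in Python as follows. This is a simplified illustration, not the code used in the study: the record layout, the function name, and the toy pool are hypothetical, and two details are our assumptions because the text does not specify them (what happens when no patient matches the revised NIHSS score and Oxford classification, and which baseline score is stored for the simulated treated patient).

```python
import random
from collections import defaultdict


def simulate_trial(pool, n_per_group, shift, rng):
    """Simulate one trial by resampling placebo patients.

    pool: list of dicts with keys 'nihss', 'oxford', 'rs', 'bi'.
    shift: reduction in baseline NIHSS that represents the drug effect.
    """
    # Index the pool by (NIHSS, Oxford class) so that donors are easy to find.
    by_profile = defaultdict(list)
    for p in pool:
        by_profile[(p["nihss"], p["oxford"])].append(p)

    # Placebo group: plain sampling with replacement.
    placebo = [rng.choice(pool) for _ in range(n_per_group)]

    treated = []
    for _ in range(n_per_group):
        original = rng.choice(pool)
        revised = max(original["nihss"] - shift, 0)
        donors = by_profile.get((revised, original["oxford"]))
        # Assumption: with no matching donor, keep the original outcome.
        donor = rng.choice(donors) if donors else original
        treated.append({
            "nihss": original["nihss"],  # assumption: original baseline kept
            "oxford": original["oxford"],
            "rs": donor["rs"],           # outcome borrowed from the donor
            "bi": donor["bi"],
        })
    return placebo, treated


# Toy pool with made-up values, mirroring the PACI example in the text.
pool = [
    {"nihss": 5, "oxford": "PACI", "rs": 3, "bi": 60},
    {"nihss": 3, "oxford": "PACI", "rs": 1, "bi": 95},
]
placebo, treated = simulate_trial(pool, n_per_group=10, shift=2,
                                  rng=random.Random(1))
```

In the toy pool, a treated patient drawn with NIHSS 5 receives the outcome of the NIHSS 3 patient from the same Oxford category, as in the worked example above.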
Treatment Effect Patterns
Several treatment effect patterns were simulated to mimic a range of conceivable drug effects, as follows: (1) fixed treatment effect, with a uniform shift in the baseline NIHSS score of all patients, regardless of initial stroke severity; (2) neuroprotective treatment effect (NP), in which the benefit a patient gained from the treatment was dependent on age and sex. Some neuroprotective agents tend to lower blood pressure, especially in older women, which could result in a lower chance of favorable outcome at 90 days.2 NP was set so that a woman would receive half the benefit of a man, and an older patient would receive less benefit than a young patient; (3) thrombolytic treatment effect (TP1), in which the patients had a fixed treatment effect, but 10% of patients deteriorated by an average of 4 NIHSS points (mimicking symptomatic hemorrhagic transformation).3 These patients were randomly selected, depending on their baseline NIHSS score; a severe stroke was more likely to deteriorate; and (4) another thrombolytic effect (TP2), in which the patients received benefit from treatment that was dependent on stroke severity such that a milder stroke received a greater benefit; TP2 also included deterioration.
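The four patterns can be expressed as functions that return the NIHSS shift applied to a given patient. The Python sketch below is illustrative only. The halved benefit for women, the 10% deterioration rate, and the 4-point deterioration come from the description above; the age threshold, the age factor, the linear dependence of deterioration risk on baseline NIHSS, the reference score of 12, the severity rule in TP2, and the rounding in `revised_nihss` are all hypothetical choices of ours.

```python
def fixed_shift(patient, level):
    """Fixed effect: the same NIHSS improvement for every patient."""
    return float(level)


def np_shift(patient, level, old_age=75, age_factor=0.5):
    """NP effect: benefit depends on sex and age."""
    benefit = float(level)
    if patient["sex"] == "F":
        benefit *= 0.5         # from the text: half the benefit of a man
    if patient["age"] >= old_age:
        benefit *= age_factor  # assumption: threshold and factor are made up
    return benefit


def deterioration_probability(nihss, overall_rate=0.10, reference_nihss=12):
    """Assumed rule: risk of deterioration rises linearly with baseline NIHSS."""
    return min(1.0, overall_rate * nihss / reference_nihss)


def tp1_shift(patient, level, rng, harm=4.0):
    """TP1 effect: uniform benefit, offset by deterioration in a random subgroup."""
    shift = float(level)
    if rng.random() < deterioration_probability(patient["nihss"]):
        shift -= harm          # assumption: exactly 4 points, not a mean of 4
    return shift


def tp2_shift(patient, level, rng, harm=4.0, mild_below=8):
    """TP2 effect: severity-dependent benefit, with deterioration in some."""
    # Assumption: milder strokes get the full benefit, others half of it.
    shift = float(level) if patient["nihss"] < mild_below else 0.5 * level
    if rng.random() < deterioration_probability(patient["nihss"]):
        shift -= harm
    return shift


def revised_nihss(patient, shift):
    """Apply a shift to the baseline score, rounded and kept within 0 to 42."""
    return min(42, max(0, round(patient["nihss"] - shift)))
```

Here `rng` is any object with a `random()` method, such as `random.Random(seed)`; passing it in explicitly keeps each simulated trial reproducible.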
Several sizes of treatment effect were applied, equal to a 0-, 1-, 2-, or 3-point reduction in baseline NIHSS score. The results presented are for a 2-point decrease, which is equivalent to a relative risk reduction in being dead or disabled of 9%, an absolute risk reduction of 4%, or an odds ratio of 1.19, with the use of BI ≥60.
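The three effect measures quoted above describe the same shift in the proportion of patients who are dead or disabled, and their relationship can be checked with a short calculation. The Python sketch below back-calculates the placebo risk from the rounded figures (an absolute reduction of 4% and a relative reduction of 9% imply a placebo risk of about 0.04/0.09, or 44%); this inferred risk is our approximation, not a value reported in the text, and the function name is hypothetical.

```python
def effect_measures(p_bad_placebo, p_bad_treated):
    """Return (absolute risk reduction, relative risk reduction,
    odds ratio for a favorable outcome) for two risks of a poor outcome."""
    arr = p_bad_placebo - p_bad_treated
    rrr = arr / p_bad_placebo
    odds_favorable_treated = (1 - p_bad_treated) / p_bad_treated
    odds_favorable_placebo = (1 - p_bad_placebo) / p_bad_placebo
    return arr, rrr, odds_favorable_treated / odds_favorable_placebo


# Placebo risk inferred from the rounded ARR and RRR (an approximation).
p_placebo = 0.04 / 0.09
arr, rrr, odds_ratio = effect_measures(p_placebo, p_placebo - 0.04)
```

With these rounded inputs the odds ratio comes out at roughly 1.18, close to the reported 1.19; a small discrepancy is expected because the inputs are themselves rounded.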
Lees KR, Asplund K, Carolei A, Davis SM, Diener HD, Kaste M, Orgogozo JM, Whitehead J, for the GAIN International Investigators. Glycine Antagonist (Gavestinel) in Neuroprotection (GAIN International) in patients with acute stroke: a randomised controlled trial. Lancet. 2000; 355: 1949–1954.
Squire IB, Lees KR, Pryse-Phillips W, Kertesz A, Bamford J, for the Lifarizine Study Group. The effects of lifarizine in acute cerebral infarction: a pilot safety study. Cerebrovasc Dis. 1996; 6: 156–160.
Lees KR. Thrombolysis. Br Med Bull. 2000; 56: 389–400.
This study was supported by a collaborative studentship from the Medical Research Council and Pfizer to F.B. Young and by a Medical Research Council career development fellowship to Dr Weir. The GAIN trial was sponsored by GlaxoWellcome (now GlaxoSmithKline). GlaxoSmithKline had no involvement in this analysis or article.
- Received March 28, 2003.
- Revision received July 1, 2003.
- Accepted July 11, 2003.
Bland M. An Introduction to Medical Statistics. 3rd ed. Oxford, UK: Oxford University Press; 2000.
Stroke Therapy Academic Industry Roundtable II (STAIR-II). Recommendations for clinical trial evaluation of acute stroke therapies. Stroke. 2001; 32: 1598–1606.
Mahoney FI, Barthel DW. Functional evaluation: the Barthel Index. Md Med J. 1965; 14: 56–61.
Duncan PW, Jorgensen HS, Wade DT. Outcome measures in acute stroke trials: a systematic review and some recommendations to improve practice. Stroke. 2000; 31: 1429–1438.
Wilson JTL, Hareendran A, Grant M, Baird T, Schulz UGR, Muir KW, Bone I. Improving the assessment of outcomes in stroke: use of a structured interview to assign grades on the modified Rankin Scale. Stroke. 2002; 33: 2243–2246.
Sulter G, Steen C, De Keyser J. Use of the Barthel Index and modified Rankin Scale in acute stroke trials. Stroke. 1999; 30: 1538–1541.
Lees KR, Asplund K, Carolei A, Davis SM, Diener HD, Kaste M, Orgogozo JM, Whitehead J, for the GAIN International Investigators. Glycine Antagonist (Gavestinel) in Neuroprotection (GAIN International) in patients with acute stroke: a randomised controlled trial. Lancet. 2000; 355: 1949–1954.
Tilley BC, Marler JR, Geller NL, Lu M, Legler J, Brott TG, Lyden PD, Grotta J, for the NINDS rtPA Stroke Study Group. Use of a global test for multiple outcomes in stroke trials with application to the National Institute of Neurological Disorders and Stroke t-PA Stroke Trial. Stroke. 1996; 27: 2136–2142.
Brott T, Adams HP, Olinger CP, Marler JR, Barsan WG, Biller J, Spilker J, Holleran R, Eberle R, Hertzberg V, Rorick M, Moomaw CJ, Walker M. Measurements of acute cerebral infarction: a clinical examination scale. Stroke. 1989; 20: 864–870.
Agresti A. An Introduction to Categorical Data Analysis. New York, NY: Wiley; 1996.
Broderick JP, Lu M, Kothari R, Levine SR, Lyden PD, Haley EC, Brott TG, Grotta J, Tilley BC, Marler JR, Frankel M, and the NINDS rtPA Stroke Study Group. Finding the most powerful measures of the effectiveness of tissue plasminogen activator in the NINDS tPA Stroke Trial. Stroke. 2000; 31: 2335–2341.
Lees KR. Neuroprotection is unlikely to be effective in humans using current trial designs: an opposing view. Stroke. 2002; 33: 308–309.
Committee for Proprietary Medicinal Products (CPMP). Points to Consider on Clinical Investigation of Medicinal Products for the Treatment of Acute Stroke. 2001. Available at http://www.emea.eu.int/pdfs/human/ewp/056098en.pdf.