(Stroke. 2003;34:2676.)
© 2003 American Heart Association, Inc.
Original Contributions
From the Division of Cardiovascular and Medical Sciences, University of Glasgow, Gardiner Institute, Western Infirmary (F.B.Y., K.R.L.), and Robertson Centre for Biostatistics, University of Glasgow (C.J.W.), Glasgow, Scotland.
Reprint requests to Fiona B. Young, Division of Cardiovascular and Medical Sciences, University of Glasgow, Gardiner Institute, Western Infirmary, Glasgow G11 6NT, Scotland. E-mail Fby1w{at}clinmed.gla.ac.uk
| Abstract |
|---|
Methods Data from the Glycine Antagonist in Neuroprotection (GAIN) International Trial were used to simulate 24 000 clinical trials exploring various patterns and magnitudes of treatment effect and thus to estimate the statistical power for a range of end points based on the BI or RS.
Results RS end points were more powerful than BI end points. End points dichotomized toward the favorable extreme of either scale or adjusted according to baseline prognosis ("patient-specific" end point) were among the most powerful. Combining RS and BI in a "global" end point was also successful. Relative to BI ≥60, the improvements in statistical power indicated that the sample size could be reduced by up to 84% (95% CI, 80% to 87%) with a RS end point, 73% (95% CI, 68% to 79%) with a patient-specific BI end point, or 81% (95% CI, 76% to 85%) with a global end point.
Conclusions The RS and global end points are preferable to BI end points; the position of the cut point is also important. Better choices of end point substantially strengthen trial power for a given trial size or allow reduced sample sizes without loss of statistical power.
Key Words: clinical trials; end point determination; neuroprotection; stroke, acute; thrombolysis
| Introduction |
|---|
A variety of primary end points have been used in acute stroke trials. The Barthel Index (BI)3 and the modified Rankin Scale (RS)4 have been the most commonly used disability outcome measures. The BI is a 10-item scale in which disability is assessed on various aspects of self-care, such as dressing and toilet use. It has a maximum score of 100 (fully independent, physically functioning). The RS is a 6-point scale in which a patient is rated from 0 (no symptoms) to 5 (severe disability). Both the RS and BI have been shown to be reliable and valid for use in stroke5; however, the RS may be less reproducible because of its relative lack of structure.6
To date, functional outcome scores have usually been dichotomized as favorable versus unfavorable, although there is little consensus on the optimal cut point,7 and selection of this is often arbitrary. The most commonly used end point in published trials has been the BI cut point at 60, at which a patient is thought to be capable of independence from full-time care.8 BI cut points have ranged from 55 to 100.5 The RS has been used less frequently, although outcomes of ≤2 (slight or no disability) and ≤1 (no significant disability) have been utilized. A trichotomized BI end point (split into 3 categories) has also been used.9
The BI has a U-shaped distribution, in which patient outcomes cluster at the extremes. The quarter of patients who die are arbitrarily scored 0; the 40% who recover are scored 95 or 100. Since the remaining third have BI scores distributed between 5 and 90, any cut point selected within this range will have a small number of patients populating the adjacent categories: as few as 5% of the patients may lie 5 or 10 points below a cut point of 60. If it is assumed that a drug treatment effect will improve patients by only 1 or 2 BI categories and that not all patients will improve, the potential to detect such a small shift must be negligible. In contrast, patients are more heavily represented around BI 95, and here small improvements applying to a larger number of subjects may be more readily detected. Clearly, however, dichotomization at this mild end of the scale disadvantages the contribution of more severely affected patients; on average, their outcomes will be much poorer, and small but valuable improvements caused by treatment would not be measured. To allow both mildly and moderately severely affected patients to contribute to the significance test, a second cut point can be added, forming a trichotomized analysis.
Another approach is to use a global end point, simultaneously incorporating outcome measures from different domains such as handicap and activities of daily living. This is conceptually appealing because no single outcome measure describes all dimensions of recovery from stroke, yet it has received limited attention to date. The statistical power of a global end point should be greater than or equal to that of an individual end point10 but may be weakened with the inclusion of a scale less influenced by a treatment.
There is considerable heterogeneity in stroke severity; using an end point with a fixed cut point may render many patients uninformative. It may be appropriate to group patients according to clinical presentation and to vary cut points according to group. This "patient-specific" end point would give a more realistic assessment of a treatment effect and allow all patients to contribute to the results of the trial.
This article explores the optimal primary end points incorporating the BI and RS. We assessed a selection of end points used in published trials as well as patient-specific and global end points. We sought to establish which end point would perform best under likely trial circumstances. The Oxford classification11 was used to categorize patients by clinical presentation in this study.
| Methods |
|---|
60.
The above "fixed" effect was our most basic approach since it assumes that treatment is uniformly effective in all patients. Consequently, we also simulated effects in which benefit from treatment was dependent on certain patient characteristics, such as age and sex (neuroprotective effect, denoted NP); in which a uniform benefit was offset in a randomly selected subgroup by deterioration to mimic the effect of thrombolysis (TP1); and finally, an effect that was dependent on patient characteristics, with deterioration in some patients (TP2). In summary, there were 24 000 trials: 1500 simulated trials for each of 4 treatment effects and 4 treatment levels, with every trial involving 1400 patients.
End Points
Published cut points were used when we dichotomized or trichotomized the BI and/or RS (Table 1). We also explored patient-specific cut points, in which we specified different thresholds for favorable outcome according to baseline prognosis, using the Oxford classification to group patients. We chose thresholds that were close to the median value of BI or RS achieved by each Oxford classification category in the original GAIN trial.
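The end point definitions described above can be sketched as small functions. This is an illustrative sketch only: the fixed cut points (BI ≥60, BI ≥95, RS ≤1, RS ≤2) come from the text, but the function names and the patient-specific thresholds in `PATIENT_SPECIFIC_BI` are hypothetical placeholders, not the values used in the study.

```python
# Hypothetical patient-specific BI thresholds per Oxford classification category.
# Placeholder values for illustration only; NOT the thresholds used in the study.
PATIENT_SPECIFIC_BI = {"TACI": 60, "PACI": 95, "LACI": 95, "POCI": 95}


def bi_dichotomy(bi, cut=60):
    """Favorable outcome (True) if the Barthel Index is at or above the cut point."""
    return bi >= cut


def rs_dichotomy(rs, cut=1):
    """Favorable outcome (True) if the Rankin Scale is at or below the cut point."""
    return rs <= cut


def rs_trichotomy(rs, best=1, middle=2):
    """Three ordered categories: 2 (RS <= best), 1 (RS <= middle), 0 (otherwise)."""
    if rs <= best:
        return 2
    if rs <= middle:
        return 1
    return 0


def bi_patient_specific(bi, oxford):
    """Dichotomize the BI with a threshold that depends on the Oxford category."""
    return bi >= PATIENT_SPECIFIC_BI[oxford]
```

Under these placeholder thresholds, a BI of 90 would count as favorable for a patient in the most severe category but unfavorable for a patient in a milder category, which is the intent of a prognosis-adjusted end point.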
Estimation of Statistical Power
We analyzed the simulated trials via Pearson's χ² test for dichotomized end points and the Cochran-Mantel-Haenszel χ² test13 for trichotomized end points. The global end points were analyzed via generalized estimating equations.14 A bootstrap approach was used to calculate CIs for the power.
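For a dichotomized end point, the power estimate is simply the proportion of simulated trials in which the test is significant. The sketch below shows one way this could be computed with the standard library. The significance level of 0.05, the percentile bootstrap, and all function names are assumptions made for illustration; the text does not specify these details.

```python
import math
import random


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def estimate_power(p_values, alpha=0.05):
    """Proportion of simulated trials that reach significance (alpha is assumed)."""
    return sum(p < alpha for p in p_values) / len(p_values)


def bootstrap_ci(p_values, alpha=0.05, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the power, resampling trials with replacement."""
    rng = random.Random(seed)
    n = len(p_values)
    powers = sorted(
        estimate_power(rng.choices(p_values, k=n), alpha) for _ in range(n_boot)
    )
    lower = powers[int((1 - level) / 2 * n_boot)]
    upper = powers[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lower, upper
```

Each simulated trial contributes one p-value from its 2×2 table of treatment group by favorable outcome; the list of p-values for a given end point and treatment scenario is then passed to `estimate_power` and `bootstrap_ci`.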
The end points were compared by calculating the sample size that would be required to maintain the same statistical power when 1 end point was chosen in preference to BI ≥60 with the use of standard sample size equations.15–17 If an end point were more powerful, the required sample size expressed as a percentage would be <100%. For an overall comparison of the end points, binary logistic regression was used to model the proportion of significant trials, adjusted for treatment effect size.
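One way to translate two power estimates obtained at the same trial size into a relative sample size is a normal approximation, sketched below. This is an assumption-laden illustration and not necessarily the equations cited above: it assumes a two-sided test at a hypothetical significance level of 0.05 and that the required sample size scales with the inverse square of the standardized effect. The function name is hypothetical.

```python
from statistics import NormalDist


def relative_sample_size(power_ref, power_new, alpha=0.05):
    """Sample size needed with the new end point, as a percentage of the
    reference end point's sample size, to keep the same power.

    Normal approximation: at a fixed trial size, z_(1 - alpha/2) + z_power is
    proportional to the standardized effect, and the required sample size is
    proportional to 1 / effect**2. Requires power_new > alpha / 2.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    effect_ref = z_alpha + z(power_ref)
    effect_new = z_alpha + z(power_new)
    return 100 * (effect_ref / effect_new) ** 2
```

With hypothetical power values of 0.30 for the reference end point and 0.60 for an alternative, `relative_sample_size(0.30, 0.60)` gives a value below 100%, meaning the alternative end point would need fewer patients for the same power.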
| Results |
|---|
The BI ≥60 dichotomy was consistently the least powerful end point. Among the remaining BI end points, the ≥95 dichotomy and the patient-specific dichotomized end points were equally the most powerful (Figure). The RS end points followed a less consistent pattern. The RS ≤2 end point was the least powerful for all treatment effect patterns; end points incorporating RS ≤3 were no better (data not shown). Depending on the treatment effect pattern, the RS ≤1, the RS ≤1 and ≤2 trichotomy, or the dichotomized patient-specific end point was the most powerful. The range of power was narrower for the RS end points than the BI end points.
Both the dichotomized and patient-specific global end points were more powerful than the BI end points for all treatment effect patterns but not always more powerful than RS ≤1 or the RS ≤1 and ≤2 trichotomy. Generally, the patient-specific global end point was less powerful than the dichotomized global end point.
Table 3 compares the end points in terms of required sample sizes relative to BI ≥60. For the BI end points, the greatest sample size reduction was obtained under the TP2 effect and the patient-specific end point or the BI ≥95 end point. The RS end points generally had larger sample size reductions. Either of the global end points could reduce the sample sizes even further, depending on the underlying treatment effect pattern.
Overall, the RS end points were more powerful than the BI end points (Table 4). The odds of achieving a statistically significant result increased by 89% under a fixed treatment effect if a RS end point were used instead of a BI end point.
| Discussion |
|---|
Our analyses were performed with a range of treatment effects, and our findings are reasonably consistent across a likely range of trial conditions. However, since all analyses used the GAIN International database, applying the end points to an independent data set may be informative.
Broderick and colleagues18 used National Institute of Neurological Disorders and Stroke (NINDS) trial data and established that the RS dichotomized at ≤1 was the most effective in differentiating between the treatment groups in that trial. The BI dichotomized at ≥95 was also effective. However, since such an analysis is data dependent, it may not be generalizable. An analysis that relies solely on choosing positive end points from a selection of trials in which putative effects may have been seen is subject to selection bias and random variability. Our method involves assumptions about the generation of the treatment effect (since it assumes that outcome at 90 days is related to initial stroke severity). We used a sampling-based approach in which the "simulated" outcomes at 90 days were generated by selecting outcomes from the GAIN International database; real patient outcomes were used, and the correlation structure between the outcome measures was retained. By simulating 1500 trials of each treatment scenario (equivalent to 33.6 million patients), we achieved accurate estimates of statistical power.
Dichotomization may be less sensitive than trichotomized end points, global end points, or patient-specific end points.19 Berge and Barer20 supported the separate definition of favorable outcome for subgroups of patients to maximize the power of stroke trials. Patient-specific end points would ensure that trial results are generalizable across a wide range of stroke severity. Our cut points were chosen on the basis of the Oxford classification category; further work is required to assess more appropriate methods of selecting the cut points.
The inclusion of only the BI and RS in the global end points may have restricted the power. These outcome measures are highly correlated; the full potential of a global end point to assess many different dimensions of recovery was not exploited. The inclusion of other outcome measures such as the NIHSS may further improve the power, as used in the NINDS trial.21 However, some regulatory authorities, such as the European Medicines Evaluation Authority, have been reluctant to consider a global end point that combines diverse outcome measures.22
Most stroke trials are powered to detect an absolute risk reduction of 10%. This study used a treatment effect level that was equivalent to an absolute risk reduction of 4% (BI ≥60, fixed effect). We believe that this is a more realistic effect of a stroke intervention. This has resulted in levels of statistical power substantially below 80%, suggesting that the sample size of 1400 is too small. The final column in Table 2 shows the power that was achieved when a 3-point decrease in baseline NIHSS was applied (absolute effect of 9%). With this larger treatment effect, the power for all end points exceeds 80%, and although the absolute differences among the end points are smaller, the hierarchy is unchanged.
Treatment effect patterns influence study power. This may have been underestimated in stroke trial design. When treatment benefit is restricted to subgroups, lower power is observed because the average benefit is diluted by nonresponders. For example, our NP effect restricted the benefit for elderly women, and the overall absolute risk reduction was reduced to 2%.
Three trials have demonstrated a positive therapeutic effect in acute stroke: the NINDS recombinant tissue plasminogen activator (rtPA) trial,21 Stroke Treatment With Ancrod Trial (STAT),23 and Prolyse in Acute Cerebral Thromboembolism (PROACT) II.24 None of those trials used the most commonly published end points. The NINDS trial used a global end point incorporating the BI (≥95), RS (≤1), NIHSS (≤1), and Glasgow Outcome Scale25 (=1). PROACT II used RS ≤2, and the STAT study used BI ≥95 or score equal to prestroke value. It is notable that these end points were among the most powerful we assessed. However, a post hoc analysis of the European Cooperative Acute Stroke Study (ECASS) II trial26 found that if RS ≤2 had been used instead of RS ≤1, the trial would have been positive.
It is not only the choice of end point that can influence the power of a clinical trial: validity of outcome measures and restrictive entry criteria may also be factors. STAT and NINDS both restricted time to treatment to 3 hours. PROACT II restricted entry to patients with proven middle cerebral artery occlusion.
We have demonstrated the disadvantage of BI ≥60 as a primary end point. Trials that are currently in progress should consider revisions to their statistical analysis plan before unblinding takes place to optimize statistical power. Such a decision has recently been announced by the international Intravenous Magnesium Efficacy in Stroke (IMAGES) trial group.27
In conclusion, this study has shown that many clinical trials in acute stroke have not used an optimal primary end point, which may have led to inconclusive results. Statistical power is not sufficient to render a trial informative, but it may be a prerequisite. Substantial and significant increases in power are observed when a dichotomized end point cut at the favorable extreme of the BI or RS, a patient-specific end point, or a global end point is used. On average, RS end points appear more powerful than BI end points, whether analyzed alone or as part of a global end point.
| Appendix |
|---|
For each simulated trial, 2 random samples of 700 patients were randomly selected, with replacement, from the GAIN data. Replacement means that the same patient can appear more than once in the same trial. The first sample was assigned to the simulated placebo group, and the second sample was used to generate outcomes at 90 days for a simulated active treatment group. This was achieved by calculating a revised baseline NIHSS score for each patient, representing a therapeutic drug effect. Outcomes were then sampled with replacement from the GAIN data with the revised NIHSS value, specifying also that the outcome should come from a patient with the same Oxford classification as the original patient, to ensure that both simulated groups had similar clinical characteristics. The patients generated in this way were assigned to the simulated active treatment group. For example, a patient with a baseline NIHSS score of 5 and Oxford classification of partial anterior circulation infarct (PACI) could have a revised NIHSS score of 3. A new patient with a NIHSS score of 3 would be randomly selected with replacement from the PACI subset of the GAIN data to provide a representative outcome at 90 days. This process was repeated 1500 times (each representing a single clinical trial) for each treatment effect and magnitude. The simulated trial data sets therefore contained baseline NIHSS and 90-day outcomes that had been assessed under standard clinical trial conditions.
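The resampling procedure described above can be sketched as follows for the fixed treatment effect. This is a minimal illustration, not the authors' code: the record layout (dictionaries with `nihss`, `oxford`, `bi`, and `rs` keys) and the function names are hypothetical, and the fallback to the nearest available NIHSS score when no exact match exists is an assumption, since the text does not say how that case was handled.

```python
import random
from collections import defaultdict


def build_index(data):
    """Group patient records by (Oxford category, baseline NIHSS score)."""
    index = defaultdict(list)
    for rec in data:
        index[(rec["oxford"], rec["nihss"])].append(rec)
    return index


def draw_donor(index, oxford, nihss, rng):
    """Pick a donor with the same Oxford category and the revised NIHSS score.

    Assumption: if no patient has exactly that score, the nearest available
    score within the same Oxford category is used instead.
    """
    scores = [s for (ox, s) in index if ox == oxford]
    nearest = min(scores, key=lambda s: abs(s - nihss))
    return rng.choice(index[(oxford, nearest)])


def simulate_trial(data, shift, n_per_arm=700, seed=None):
    """Simulate one trial: a placebo arm and an active arm of n_per_arm patients."""
    rng = random.Random(seed)
    index = build_index(data)
    placebo = rng.choices(data, k=n_per_arm)  # sampled with replacement
    active = []
    for base in rng.choices(data, k=n_per_arm):
        revised = max(0, base["nihss"] - shift)  # fixed treatment effect
        donor = draw_donor(index, base["oxford"], revised, rng)
        # Keep the original baseline record; take the donor's 90-day outcomes.
        active.append({**base, "bi": donor["bi"], "rs": donor["rs"]})
    return placebo, active
```

Repeating `simulate_trial` 1500 times for a given shift, and applying an end point and significance test to each result, would yield the inputs for a power estimate.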
Treatment Effect Patterns
Several treatment effect patterns were simulated to mimic a range of conceivable drug effects, as follows: (1) fixed treatment effect, with a uniform shift in the baseline NIHSS score of all patients, regardless of initial stroke severity; (2) neuroprotective treatment effect (NP), in which the benefit a patient gained from the treatment was dependent on age and sex. Some neuroprotective agents tend to lower blood pressure, especially in older women, which could result in a lower chance of favorable outcome at 90 days.2 NP was set so that a woman would receive half the benefit of a man, and an older patient would receive less benefit than a young patient; (3) thrombolytic treatment effect (TP1), in which the patients had a fixed treatment effect, but 10% of patients deteriorated by an average of 4 NIHSS points (mimicking symptomatic hemorrhagic transformation).3 These patients were randomly selected, depending on their baseline NIHSS score; a severe stroke was more likely to deteriorate; and (4) another thrombolytic effect (TP2), in which the patients received benefit from treatment that was dependent on stroke severity such that a milder stroke received a greater benefit; TP2 also included deterioration.
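The first three patterns can be sketched as functions that return the NIHSS shift applied to a single patient. Only the halved benefit for women, the 10% deterioration rate, and the average 4-point deterioration come from the text. Everything else here is a labeled assumption: the age threshold and its halving factor, the way the deterioration probability scales with baseline NIHSS, the use of a constant 4-point deterioration, and the choice to subtract it from the benefit. TP2 is omitted because the text does not specify how benefit varies with severity.

```python
import random


def fixed_shift(patient, size):
    """Fixed effect: the same NIHSS improvement for every patient."""
    return size


def np_shift(patient, size, old_age=75):
    """Neuroprotective pattern: benefit depends on sex and age."""
    benefit = size
    if patient["sex"] == "F":
        benefit *= 0.5  # from the text: a woman receives half the benefit of a man
    if patient["age"] >= old_age:
        benefit *= 0.5  # hypothetical age threshold and reduction factor
    return benefit


def tp1_shift(patient, size, rng, mean_nihss, rate=0.10, harm=4):
    """Thrombolytic pattern: fixed benefit, with deterioration in some patients.

    Assumption: the chance of deterioration is proportional to baseline NIHSS,
    scaled so that it averages `rate` across the trial population, and a
    deteriorating patient's benefit is offset by a constant `harm` points.
    """
    p_harm = min(1.0, rate * patient["nihss"] / mean_nihss)
    if rng.random() < p_harm:
        return size - harm  # net worsening when harm exceeds the benefit
    return size
```

A negative return value represents a net increase in the revised NIHSS score, that is, a patient who is simulated as worse off on treatment.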
Several sizes of treatment effect were applied, equal to a 0-, 1-, 2-, or 3-point reduction in baseline NIHSS score. The results presented are for a 2-point decrease, which is equivalent to a relative risk reduction in being dead or disabled of 9%, an absolute risk reduction of 4%, or an odds ratio of 1.19, with the use of BI ≥60.
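The relationship among these three effect measures can be checked with a short calculation. The control-group risk of about 44% used in the usage note is not stated in the text; it is back-calculated here from the rounded 4% absolute and 9% relative risk reductions, so the result only approximately reproduces the reported odds ratio.

```python
def effect_measures(risk_control, risk_treated):
    """Absolute risk reduction, relative risk reduction, and the odds ratio
    for a favorable outcome, given the risks of an unfavorable outcome."""
    arr = risk_control - risk_treated
    rrr = arr / risk_control
    odds_favorable_treated = (1 - risk_treated) / risk_treated
    odds_favorable_control = (1 - risk_control) / risk_control
    return arr, rrr, odds_favorable_treated / odds_favorable_control
```

With assumed risks of 0.44 (control) and 0.40 (treated), `effect_measures(0.44, 0.40)` gives an absolute risk reduction of 0.04, a relative risk reduction of about 0.09, and an odds ratio of about 1.18, close to the reported 1.19 given the rounding of the inputs.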
| Acknowledgments |
|---|
This study was supported by a collaborative studentship from the Medical Research Council and Pfizer to F.B. Young and by a Medical Research Council career development fellowship to Dr Weir. The GAIN trial was sponsored by GlaxoWellcome (now GlaxoSmithKline). GlaxoSmithKline had no involvement in this analysis or article.
Received March 28, 2003; revision received July 1, 2003; accepted July 11, 2003.