Improving Trial Power Through Use of Prognosis-Adjusted End Points
Background and Purpose— The stroke patient population is heterogeneous, leading to wide variation in outcome caused by differences in age, initial severity, and presence of concomitant disease. Setting an identical recovery target for all patients in intervention trials may conceal individually important therapeutic treatment effects. Instead, a variable end point that takes severity or likely prognosis into account may be more informative.
Methods— We used data from the Glycine Antagonist in Neuroprotection (GAIN) International trial to assess statistical power of various primary end points for intervention trials. We selected prognosis-adjusted cut points based on Barthel Index (BI) or Rankin Scale (RS) using a prognostic model, or assigned a fixed end point within subgroups of patients defined by their Oxford category or National Institutes of Health Stroke Scale (NIHSS) score. We simulated a treatment effect and estimated statistical power with standard formulae.
Results— Assignment of end points using a prognostic model for individual patients increased statistical power, when compared with assigning end points using only the Oxford classification. For the BI, power was increased from 60% to 88% (equivalent to a 49% reduction in sample size if power remains unchanged). With the RS end points, power was increased from 84% to 92% (or a 24% reduction in sample size). Versus a fixed end point for all patients, model-based methods increased power by 22 percentage points for BI≥95 and 14 percentage points for RS≤1 (effective sample size reductions 43% and 34%).
Conclusion— Prognosis-adjusted end points can increase statistical power compared with fixed end points. Assessment is based on realistic goals for individual patients and yet trial results remain generalizable.
The stroke patient population is heterogeneous. Assigning a common end point for an entire stroke trial population of patients could mask important findings in trial results. For example, a patient with a severe stroke may never reach total independence (Barthel Index1 [BI]≥95) but may still consider being independent from essential care (BI≥60) to be a good outcome. On the other hand, relatively mildly affected patients may only consider themselves to have recovered if they reach total independence. To account for the heterogeneity, the cut point on a recovery scale could be varied for individual patients. This would tailor end points to suit the prognosis or baseline characteristics of patients. Such variation of cut points in a trial would allow patients to be assessed on achievable goals, while ensuring that the trial remains generalizable. Allowing for patients to be assessed on different goals applies the principles of goal attainment scaling,2 which considers the outcome to be favorable if the patient achieves a prespecified objective.
A few trials have already used variable end points to take account of initial severity. The Stroke Treatment with Ancrod Trial (STAT)3 assessed whether patients achieved a BI score of at least 95 or had a score at 3 months at least equal to their prestroke value. The Abciximab in Emergent Stroke Treatment Trial (AbESTT; H.P. Adams, written communication, November 2002) used a dichotomized Rankin Scale (RS)4 secondary end point that split patients into 3 groups using the baseline National Institutes of Health Stroke Scale (NIHSS) score for each patient:5,6 RS≤0 for baseline NIHSS of 0 to 7, RS≤1 for baseline NIHSS of 8 to 14, and RS≤2 for baseline NIHSS >14. Berge and Barer7 also suggested that it may be appropriate to use variable criteria for assessing patients with different stroke severity levels but did not define the criteria to be used.
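The AbESTT rule quoted above can be written as a small lookup. The following Python sketch is illustrative only: the function names are hypothetical, and it encodes nothing beyond the three NIHSS bands and RS cut points stated in the text.

```python
def abestt_rs_cut_point(baseline_nihss: int) -> int:
    """Return the highest RS score counted as favorable for a baseline NIHSS score.

    Bands follow the text: NIHSS 0-7 -> RS<=0, 8-14 -> RS<=1, >14 -> RS<=2.
    """
    if baseline_nihss < 0:
        raise ValueError("NIHSS score cannot be negative")
    if baseline_nihss <= 7:
        return 0
    if baseline_nihss <= 14:
        return 1
    return 2


def is_favorable(baseline_nihss: int, rs_at_follow_up: int) -> bool:
    """True if the follow-up RS score meets the severity-adjusted cut point."""
    return rs_at_follow_up <= abestt_rs_cut_point(baseline_nihss)
```

Under this rule a patient with a baseline NIHSS of 18 and a follow-up RS of 2 counts as a favorable outcome, whereas the same RS score in a patient with a baseline NIHSS of 5 does not.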
We have previously described the use of prognosis-adjusted end points for the BI and RS,8 where we divided patients into subgroups using the Oxford classification.9 We chose appropriate cut points for each Oxford category by picking a value on the RS and BI that fell close to the median value of the scale within that category, basing our estimates on placebo data from the Glycine Antagonist in Neuroprotection (GAIN) International trial.10 To enhance the selection of end points, we now consider the use of factors that are more closely related to baseline severity and prognosis, because assigning cut points with a prognostic model may be more powerful.
We aimed to assess a range of prognosis-adjusted end points. We examined cut points for both the RS and BI that were assigned either using subgroups of patients or using prognostic models. These prognosis-adjusted end points were compared according to their statistical power estimated under likely trial circumstances and were also contrasted with power estimates for the traditional fixed end points of RS≤1 and BI≥95.
End Points Assessed
The end points that we assessed can be considered in 2 categories: those in which cut points were assigned to subgroups of patients and those that used a prognostic model to assign cut points to patients. The subgroup approach assigned cut points based on the pretreatment assessment of either the Oxford Classification or the NIHSS (Table 1). Our other approach used a prognostic model to assign the cut points to individual patients. We built separate models for each of the RS and BI, using ordinal logistic regression.11 We initially split each scale into 3 categories: 0 to 55, 60 to 90, and 95 to 100 for the BI and 0 to 1, 2, and 3 to 6 for the RS. Later, we considered more complex models with extra response categories. We used stepwise regression: variables entered into the model for selection included total baseline NIHSS, age, Oxford classification, individual components of the baseline NIHSS, and various risk factors. To take account of the side of the stroke, we added factors for the worst leg and arm scores from the NIHSS. We continued to add variables to the model for as long as inclusion of a variable significantly improved the amount of variation explained by the model. However, for both outcome scales we also developed simple models that incorporated only age and baseline NIHSS.
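To make the modelling step concrete, the sketch below shows how a fitted ordinal model could yield cumulative probabilities for a 3-category BI outcome. It assumes a proportional-odds form, which is one common type of ordinal logistic regression; the text does not state which form was fitted. The coefficients and intercepts are hypothetical values chosen for illustration and are not the fitted BI2 estimates.

```python
import math

# Hypothetical values for illustration only; NOT estimates from the GAIN data.
BETA_NIHSS = -0.20          # higher baseline NIHSS lowers the chance of a good outcome
BETA_AGE = -0.03            # older age lowers the chance of a good outcome
INTERCEPTS = [4.0, 2.5]     # one per cut: BI>=60, then BI>=95 (must decrease)


def cumulative_probs(linear_predictor: float, intercepts: list[float]) -> list[float]:
    """P(outcome at or above each cut) under a proportional-odds model."""
    return [1.0 / (1.0 + math.exp(-(a + linear_predictor))) for a in intercepts]


def predict_cumulative(baseline_nihss: float, age: float) -> list[float]:
    """Return [P(BI>=60), P(BI>=95)] for one patient in the simple two-covariate model."""
    lp = BETA_NIHSS * baseline_nihss + BETA_AGE * age
    return cumulative_probs(lp, INTERCEPTS)
```

Because the intercepts decrease as the target becomes stricter, the predicted probability of reaching BI≥95 never exceeds that of reaching BI≥60 for the same patient.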
We developed all models using the GAIN International trial10 placebo data. There was no significant treatment effect observed in the GAIN trial, so the final prognostic models were tested on the GAIN investigative treatment group data to check for generalizability and accuracy.
Our prognostic models predicted the cumulative probability of a given patient being in each outcome category. Initially, we set the probability threshold at 50% (patients were assigned cut points that they had at least a 50% chance of achieving), but later we considered alternative thresholds to find the optimal cut point. For example, using a 3 category BI model with categories set to be 0 to 55, 60 to 90, and ≥95 and with a probability threshold of 50%, the highest outcome category would be predicted if a given patient had at least a probability of 50% of achieving a score of 95 or more. If the probability of achieving a score of 95 or more was <50% but the probability of obtaining a score of 60 or more was at least 50%, then the outcome category of the patient was predicted to be 60 to 90. If neither of these constraints was achieved, then the outcome was predicted as 0 to 55. We took the lower bound of these ranges for the cut point (ie, 60 for a patient predicted to lie between 60 and 90).
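The assignment rule described above can be sketched in a few lines of Python. The function name and the dictionary layout are assumptions made for illustration; the logic follows the text: take the strictest cut point whose predicted probability meets the threshold, otherwise fall back to the lower bound of the lowest category.

```python
def assign_cut_point(prob_at_least: dict[int, float],
                     threshold: float = 0.5,
                     floor: int = 0) -> int:
    """Assign a BI cut point to one patient.

    prob_at_least maps each candidate cut point (the lower bound of an outcome
    category, e.g. 60 or 95) to the predicted probability of reaching it.
    floor is the lower bound of the lowest category (0 for the 0-55 range).
    """
    # Try the most demanding target first (highest BI score).
    for cut in sorted(prob_at_least, reverse=True):
        if prob_at_least[cut] >= threshold:
            return cut
    return floor


# A patient with a 30% chance of BI>=95 and a 70% chance of BI>=60:
patient = {95: 0.30, 60: 0.70}
cut_at_50 = assign_cut_point(patient, threshold=0.50)   # 60
cut_at_25 = assign_cut_point(patient, threshold=0.25)   # 95
```

The two calls show the effect of the threshold: lowering it from 50% to 25% moves this patient from the BI≥60 target to the stricter BI≥95 target. For the RS, where lower scores are better, the same idea would apply with the ordering of cut points reversed.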
We also investigated alternative probability thresholds, ranging from 45% down to 5%. In a previous study,8 end points toward the favorable end of the outcome scale had been more powerful. Reducing the probability threshold below 50% assigns patients a cut point that is more difficult to attain (the cut point moves toward the most favorable extreme of the scale).
Simulation of Treatment Effect and Estimation of Statistical Power
We simulated a treatment effect that assumed all patients would derive equal benefit from the treatment. In practice this is unlikely to be a valid assumption; however, our previous work has shown that more complex treatment patterns tend to alter the magnitude but not the direction of the conclusions. Our method consisted of applying treatment effects to the GAIN International placebo data. The treatment effect, expressed as an odds ratio, was estimated using the data from a previous simulation study.8 Each simulated clinical trial in the study consisted of 2 treatment groups. Patients entered into the simulated placebo groups were simply selected at random, with replacement, from the GAIN data. In contrast, patients entered into the simulated treatment groups were selected so as to have had a milder stroke on average than the placebo group (ie, they had lower baseline NIHSS scores). We assumed that this would confer a greater chance of favorable outcome at 90 days. We believe that this is the closest artificial treatment effect that we can generate to mimic a scenario in which a neuroprotectant or thrombolytic has an early effect to limit infarct extent.
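One way to picture this resampling scheme is the Python sketch below. The text does not say exactly how the milder treatment group was selected, so the mechanism here is an assumption: each treated patient is drawn by sampling a patient at random, lowering that patient's baseline NIHSS by the treatment level, and then taking the recorded outcome of a patient who actually had that lower score. The function name and the tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def simulate_trial(patients, shift, n_per_arm, rng):
    """Resample one simulated trial from (baseline_nihss, favorable) tuples.

    Placebo arm: plain sampling with replacement.
    Treated arm (ASSUMED mechanism): sample a patient, subtract `shift` NIHSS
    points, and substitute a real patient whose baseline score equals the
    shifted value (or the nearest available lower score).
    """
    by_nihss = defaultdict(list)
    for patient in patients:
        by_nihss[patient[0]].append(patient)
    scores = sorted(by_nihss)

    placebo = [rng.choice(patients) for _ in range(n_per_arm)]

    treated = []
    for _ in range(n_per_arm):
        nihss, _outcome = rng.choice(patients)
        target = max(nihss - shift, scores[0])
        # Highest observed score that does not exceed the shifted target.
        usable = [s for s in scores if s <= target]
        treated.append(rng.choice(by_nihss[usable[-1]]))
    return placebo, treated
```

With `shift=0` both arms are drawn from the same distribution, which corresponds to a treatment level of 0; larger shifts give the treated arm progressively milder baseline severity and therefore better recorded outcomes.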
The simulated clinical trials were used to estimate the difference between groups for each given treatment level in terms of an odds ratio. Treatment level could be defined as the difference in baseline NIHSS score that we had artificially generated (0, 1, 2, or 3 points). We then examined the 3-month outcomes of the patients and estimated differences between the active treatment and placebo groups for each end point. Bootstrap CIs12 for the odds ratios were constructed using 1000 replications. We calculated the statistical power of each end point and its 95% CI using standard formulae.13 We could then compare end points on the basis of this estimate of statistical power. We also calculated any potential reduction in sample size that we could introduce without reduction of power, assuming that we used the revised end point instead of a fixed end point that incorporated BI≥95 or RS≤1.
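The odds ratio and its bootstrap interval can be illustrated as follows. This is a minimal sketch under stated assumptions, not the analysis code used in the study: it uses a percentile interval, resamples each arm independently, and adds 0.5 to every cell of the 2×2 table so that the odds ratio stays defined when a resample contains an empty cell. All function names are hypothetical.

```python
import random


def odds_ratio(treated, placebo):
    """Odds ratio for a favorable outcome; each arm is a list of booleans.

    0.5 is added to every cell (an ASSUMED continuity correction) to avoid
    division by zero in small or extreme resamples.
    """
    a = sum(treated) + 0.5
    b = len(treated) - sum(treated) + 0.5
    c = sum(placebo) + 0.5
    d = len(placebo) - sum(placebo) + 0.5
    return (a / b) / (c / d)


def bootstrap_ci(treated, placebo, n_rep=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the odds ratio."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_rep):
        t = rng.choices(treated, k=len(treated))   # resample with replacement
        p = rng.choices(placebo, k=len(placebo))
        estimates.append(odds_ratio(t, p))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * n_rep)]
    upper = estimates[int((1 + level) / 2 * n_rep) - 1]
    return lower, upper
```

Here `n_rep=1000` matches the number of replications stated in the text; the choice of a percentile interval rather than another bootstrap interval type is an assumption.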
Barthel Index End Points
From stepwise ordinal logistic regression, we found baseline NIHSS, age, presence of diabetes, the gaze component of the NIHSS, and the worst leg NIHSS motor score to be the best predictors of outcome at 90 days (we labeled this model BI1). We labeled our simple, robust BI model, which included only baseline NIHSS and age, BI2.
Compared with using the Oxford classification to assign cut points to patients, only model BI2 increased the statistical power (Table 2). We found no further improvement in power through increasing the model complexity.
Rankin Scale End Points
From stepwise regression, we found baseline NIHSS, age, and worst leg score to be closely related to RS outcome. We called this model RS1, and a further model that included only baseline NIHSS and age was termed RS2.
On average, the RS patient specific end points had higher statistical power than the BI end points (Table 3). Subgrouping the patients by Oxford category delivered the lowest statistical power, whereas subgrouping by NIHSS produced the highest power. When we used a model instead of subgroups, we found that we achieved good power through the simple approach of RS2 that controlled only for age and baseline NIHSS.
Optimal Probability Threshold for Model-Based End Points
For the analyses above, we assigned target outcomes to patients such that there would be a 50% chance that they would attain the chosen recovery target. The statistical power of our model-based prognosis-adjusted end points was further improved when the probability thresholds were optimized (Table 4). For the BI model, greatest power was obtained with a probability threshold of 25%, whereas for the RS model the optimal probability threshold was 30%. This suggests that assessing patients on stricter criteria for recovery results in a more powerful end point. The power that we obtained by using the RS2 model at its optimal probability threshold exceeded that obtained when patients were subgrouped by NIHSS: 0.921 (95% CI: 0.913, 0.928) compared with 0.882 (95% CI: 0.871, 0.893).
Comparison of Prognosis-Adjusted End Points to Fixed End Points
We compared the prognosis-adjusted end points to what could be considered the best fixed end points (BI≥95 or RS≤1). With a treatment level of 2, the BI≥95 and RS≤1 fixed end points obtained statistical power of 0.657 (95% CI: 0.639, 0.676) and 0.784 (95% CI: 0.770, 0.800), respectively. These values are inferior to those obtained with the model-based or NIHSS subgroup-based prognosis-adjusted end points.
Comparison of End Points in Terms of a Relative Sample Size
Finally, we compared selected end points in terms of a relative sample size (Table 5). All of the model-based end points used the optimal probability thresholds discussed in the previous section. For the BI end points, the sample size could be reduced by 43% (95% CI: 41%, 45%) if the model BI2 was used instead of the BI≥95 dichotomized end point. Using the BI2 model end point rather than the Oxford category end point would allow an effective sample size reduction of 50% (95% CI: 48%, 52%). For the RS end points, using the RS2 model to assign cut points could reduce the sample size by 34% (95% CI: 32%, 36%) compared with the RS≤1 dichotomy. If the NIHSS subgroup end point was used instead of the RS≤1 dichotomy, the sample size could be reduced by 24% (95% CI: 22%, 28%). Using the Oxford category to subgroup patients could result in reductions in sample size of 14% (95% CI: 11%, 17%).
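The link between a gain in power and a reduction in sample size can be illustrated with the usual normal approximation. The text refers only to standard formulae, so the specific formula below is an assumption: for a two-sided test at significance level alpha, the required sample size is taken to be proportional to the square of (z for 1 − alpha/2 plus z for the power). The function name is hypothetical.

```python
from statistics import NormalDist


def sample_size_reduction(power_reference, power_new, alpha=0.05):
    """Fractional sample size saving from switching to the more powerful end point.

    Both powers are assumed to be measured at the same sample size. The result
    is 1 - n_new / n_reference, where n_new is the sample size at which the new
    end point would reach the reference end point's power.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    ratio = ((z_alpha + z(power_reference)) / (z_alpha + z(power_new))) ** 2
    return 1 - ratio


# Powers quoted in the text: RS<=1 fixed end point 0.784, RS2 model 0.921.
rs_saving = sample_size_reduction(0.784, 0.921)
```

Under these assumptions `rs_saving` comes out at roughly one third, close to the 34% reduction reported for RS2 against the RS≤1 dichotomy.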
We have shown that prognosis-adjusted end points may be preferable to fixed end points. The prognosis-adjustment approach can be enhanced by assigning cut points based on individually estimated patient prognosis rather than categorization into subgroups. Our simple model-based approach, which assigns cut points using age and baseline NIHSS to estimate prognosis, produced the greatest power advantage for the BI end points. This is partly because the BI starts with low inherent power. Splitting the patients into groups using the NIHSS (as used in the AbESTT) before assigning RS end points proved effective, as did our model-based approach that considered the age and baseline NIHSS of the patients. These prognosis-adjusted end points were almost invariably more favorable than the best traditional fixed end points.
Our model-based method performed better than the approach in which prognosis was identified from NIHSS or Oxford subgroups. Although the NIHSS gives a reasonable measure of initial severity, it disregards factors such as age that also influence prognosis. The Oxford classification gives an even more crude grading of severity and as a guide to prognosis it is more suitable for epidemiological purposes than for prediction of outcome in individuals.
Although our model-based approach was slightly more powerful than the NIHSS subgroup method used in AbESTT, the absolute advantage was marginal and the simplicity of the AbESTT method may outweigh this advantage. Such an end point will consequently be easier to understand in the context of clinical trials.
We used the placebo patients from the GAIN International trial.10 The GAIN trial showed no effect of gavestinel. It would be informative also to use data from other sources to validate the results. Our treatment effect is artificially generated. We hope that a successful thrombolytic or neuroprotectant would have an almost immediate effect in limiting infarct extent and thus initial severity, but there can be no guarantee that this would hold true. Only 1 pattern of treatment effect was considered: a fixed effect where all patients were assumed to improve by the same magnitude. Although little is known about the true effect of most stroke interventions, this pattern is unlikely to be clinically valid; however, our previous work8 has shown that more complex treatment patterns tend to alter the magnitude but not the direction of the conclusions. Other factors can also influence the statistical power of a trial and should be considered, including patient selection, sample size, and time to treatment.14,15 Elsewhere, we have also recently proposed that age and baseline NIHSS should be used together when considering eligibility for acute stroke studies.16
We have investigated a range of prognosis-adjusted end points and found that adjusting the cut points for patients depending on prognosis offers analytical power advantages over use of a single fixed end point. Our optimal method of assigning cut points used a prognostic model that considered age and baseline NIHSS, though simply subgrouping according to baseline NIHSS was more straightforward and almost as effective. Prognosis-adjusted end points allow patients to be assessed on realistic achievable goals, while allowing the clinical trial results to be generalizable. Maximizing trial power makes development of treatment for stroke more attainable and less expensive.
The GAIN trial was sponsored by GlaxoWellcome (now GlaxoSmithKline). GlaxoSmithKline had no involvement in this analysis or article. F.B.Y. is supported by a collaborative studentship from the Medical Research Council and Pfizer. C.J.W. is funded by a Medical Research Council career development fellowship.
- Received May 19, 2004.
- Revision received September 20, 2004.
- Accepted October 22, 2004.
Mahoney FI, Barthel DW. Functional evaluation: the Barthel Index. Md Med J. 1965; 14: 56–61.
van Swieten JC, Koudstaal PJ, Visser MC, Schouten HJ, van Gijn J. Interobserver agreement for the assessment of handicap in stroke patients. Stroke. 1988; 19: 604–607.
Brott T, Adams HP, Olinger CP, Marler JR, Barsan WG, Biller J, Spilker J, Holleran R, Eberle R, Hertzberg V, Rorick M, Moomaw CJ, Walker M. Measurements of acute cerebral infarction: a clinical examination scale. Stroke. 1989; 20: 864–870.
Adams HP, Hacke W, Blumki A, Clark W, Hansen MD, LeClerc J. Adjusting favourable outcomes following treatment of acute ischemic stroke as influenced by baseline severity of neurological impairments. Cerebrovasc Dis. 2003; 16. (Abstract.)
Young FB, Lees KR, Weir CJ. Strengthening acute stroke trials through optimal use of disability end points. Stroke. 2003; 34: 2676–2680.
Lees KR, Asplund K, Carolei A, Davis SM, Diener HD, Kaste M, Orgogozo JM, Whitehead J; for the GAIN International Investigators. Glycine antagonist (gavestinel) in neuroprotection (GAIN International) in patients with acute stroke: a randomised controlled trial. Lancet. 2000; 355: 1949–1954.
Agresti A. Categorical Data Analysis. New York, NY: Wiley; 1990.
Davison AC, Hinkley DV. Bootstrap Methods and Their Application. Cambridge, UK: Cambridge University Press; 1997.
Dorman PJ, Sandercock P. Considerations in the design of clinical trials of neuroprotective therapy in acute stroke. Stroke. 1996; 27: 1507–1515.
Weir CJ, Kaste M, Lees KR. Targeting neuroprotection clinical trials in ischemic stroke patients with potential to benefit from therapy. Stroke. 2004; 35: 2111–2116.