Combined Clinical and Imaging Information as an Early Stroke Outcome Measure
Background and Purpose— Imaging information has been proposed as a potential surrogate outcome in stroke clinical trials. The purpose of this study was to determine whether an early outcome measure combining clinical and imaging information is better than either alone in predicting 3-month outcome in acute ischemic stroke patients.
Methods— Clinical information (National Institutes of Health Stroke Scale) and imaging information (CT infarct volume), measured at 1 week from 201 patients from the Randomized Trial of Tirilazad Mesylate in Acute Stroke (RANTTAS) study, were used in a multivariable logistic regression analysis to predict excellent and devastating 3-month outcome. The combined models were compared with the infarct volume models and the clinical models. Discrimination, calibration, and change in global model chi-square were assessed.
Results— The combined models and models using clinical information alone had areas under the receiver operating characteristic curves that did not differ significantly (probability value = 0.092 to 0.4), ranging from 0.83 to 0.95. The imaging alone models performed less well (P<0.005) and had areas under the receiver operating characteristic curves that ranged from 0.70 to 0.80.
Conclusions— The National Institutes of Health Stroke Scale at 1 week is highly predictive of 3-month outcome in ischemic stroke patients. The addition of 1-week infarct volume does not improve the accuracy of the predictive model.
There is great interest in the stroke clinical research community in developing a valid surrogate outcome measure.1–5 The ideal surrogate outcome measure would be determined early in recovery after acute stroke and would accurately reflect long-term outcomes. A valid surrogate outcome measure might reduce the time and cost of acute stroke trials.
Cranial CT infarct volume in stroke patients is one of the imaging measures that has been considered as a potential surrogate outcome in stroke clinical trials. Recently its value as an early measure has been questioned as a result of the weakness of the association between infarct volume and standard clinical outcome measures.6 Clinical information, including measures of neurological function and stroke severity, in combination with CT infarct volume has not been evaluated as a potential early outcome measure.
The purpose of this analysis was to determine whether the combination of clinical information and infarct volume information measured at 1 week after acute stroke will predict 3-month clinical outcome better than either clinical or imaging information alone in the participants of an acute stroke clinical trial.
Subjects and Methods
The stroke patients used for this analysis were participants in the Randomized Trial of Tirilazad Mesylate in Patients with Acute Stroke (RANTTAS).7 The RANTTAS trial was a multicenter, randomized, double-blind, vehicle-controlled trial of 556 fully eligible patients who were treated with tirilazad mesylate or vehicle within 6 hours of stroke symptom onset. Tirilazad is a nonglucocorticoid 21-aminosteroid. Patients were treated at 1 of 27 North American centers between May 1993 and December 1994. The protocol was approved by each center’s institutional review board and subjects gave informed consent. The primary outcome measures for this trial were the Barthel Index (BI),8 a basic activities-of-daily-living scale, and the Glasgow Outcome Scale (GOS),9 a global disability scale, both collected at 3 months. The National Institutes of Health Stroke Scale (NIHSS),10 a measure of neurological dysfunction and stroke severity, was also collected at 3 months as an additional key outcome measure. These 3-month outcome measurements were obtained by study investigators who were trained to perform the measurements accurately and consistently.7
There were also numerous early measures captured in the RANTTAS trial at 7 to 10 days (±1 day) from stroke onset. These included the NIHSS and head CT infarct volume. A prospective random sample of half of all fully eligible patients was selected at study entry to submit 7- to 10-day head CT scans for volumetric measurement of infarct size. Noncontrast CTs were done using a standardized protocol that specified 5-mm slices. Infarct volume was calculated centrally using planimetric techniques by an investigator blinded to the treatment group and the clinical characteristics of the patient.11 A total of 556 fully eligible patients were enrolled in the trial, and 256 fully eligible patients with CT and clinical measures from the RANTTAS trial were included in this study. Because treatment with tirilazad mesylate did not have an effect on outcome,7 the 2 treatment groups were combined for this analysis.
Independent variables for the predictive models were prespecified and included the 7- to 10-day NIHSS score as the clinical predictor and the 7- to 10-day head CT infarct volume as the imaging predictor.
The dependent variables for the model were also prespecified. These included the 3 commonly used stroke clinical trial outcome measures: NIHSS, BI, and GOS collected at 3 months. Each of these is a well-established and reliable measure of some aspect of clinical outcome.8–10 Each of the 3 outcome scales was prospectively dichotomized based on clinically relevant and literature-supported thresholds.7,12–14 Each outcome measure was dichotomized twice. The first dichotomization identified excellent outcome versus everything else, where excellent outcome reflected full or nearly full recovery. The second dichotomization identified devastating outcome versus everything else, where devastating outcome reflected nursing home level disability or death. The definitions of the dichotomizations are shown in Table 1.
Subjects were excluded from the analysis if they were missing the outcome variable required by the model (NIHSS, n=35; BI, n=27; GOS, n=27). In addition, if either the 7- to 10-day NIHSS score (n=17) or CT infarct volume (n=3) was missing, the subject was eliminated from the analysis. For 21 subjects who were imaged outside the window of 7 to 10 days (±1 day), the infarct volume from that scan was used as the 1-week infarct volume. These included 10 patients who were imaged before the window (earliest was day 3) and 8 patients imaged after the window, all but 1 of them within 30 days. The remaining 3 patients had incomplete data because the exact timing of the imaging was unknown. This resulted in a total of 201 subjects for all the NIHSS analyses and 206 for all the BI and GOS analyses reported below.
We used multivariable logistic regression analysis to estimate predictive models. The variable selection and the analysis were designed to avoid overfitting the models,12,15 including by limiting the number of prespecified predictor variables. We estimated 6 distinct combined models using this data set. We used restricted cubic splines15,16 with 3 knots (10th, 50th, 90th percentiles) for the independent variables of NIHSS and CT infarct volume to allow nonlinearity in the models. A combined model (including the clinical and imaging predictors) was developed for each of the 3 outcome measures (NIHSS, BI, GOS) for each of the 2 levels of outcome (excellent outcome, devastating outcome). For each of the combined models, comparison models including clinical variable only models and imaging variable only models were also evaluated. Table 1 demonstrates the different models. The change in model explanatory power was evaluated by testing the change in global model chi-square.
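The spline terms and the chi-square comparison described above can be illustrated with a minimal sketch. It assumes the commonly used truncated-power form of the restricted cubic spline, which is linear beyond the outer knots; the function names and the knot values are hypothetical and are not taken from the study. With 3 knots each predictor contributes 2 columns, so a combined model uses 4 degrees of freedom, and adding one predictor to a single-predictor model is a 2-degree-of-freedom change.

```python
import math


def rcs_basis(x, knots):
    """Restricted cubic spline basis for a single value x.

    Returns [x, s_1(x), ..., s_{k-2}(x)] for k knots, so 3 knots give
    2 columns (2 degrees of freedom) per predictor.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def cube_plus(u):
        # Truncated cubic: u^3 when u is positive, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    scale = (t[-1] - t[0]) ** 2  # keeps the spline column on a scale close to x
    last_gap = t[-1] - t[-2]
    basis = [float(x)]
    for j in range(k - 2):
        term = (cube_plus(x - t[j])
                - cube_plus(x - t[-2]) * (t[-1] - t[j]) / last_gap
                + cube_plus(x - t[-1]) * (t[-2] - t[j]) / last_gap)
        basis.append(term / scale)
    return basis


def chi_square_change_p_value_2df(chi2_small_model, chi2_large_model):
    """P value for a change in global model chi-square on 2 df.

    For 2 degrees of freedom the chi-square survival function is exp(-x/2).
    """
    change = max(chi2_large_model - chi2_small_model, 0.0)
    return math.exp(-change / 2.0)


# Hypothetical knots for a 1-week NIHSS score (illustrative, not study values).
nihss_knots = (1, 5, 12)
print(rcs_basis(8, nihss_knots))
# Hypothetical model chi-square values for a clinical and a combined model.
print(chi_square_change_p_value_2df(60.0, 62.0))
```

In practice the knots would be placed at the 10th, 50th, and 90th percentiles of each predictor in the data set, and the basis columns would be entered into the logistic regression in place of the raw predictor.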
A number of secondary analyses were conducted to assess whether the primary results were confounded by other variables or treatment bias. To evaluate the role of age as a potential confounder, we compared the 2-variable combined models with a 3-variable combined model including age as an additional clinical predictor. Sensitivity analyses were also conducted to evaluate both the effect of excluding those with zero infarct volume and excluding those who received tirilazad mesylate in the original trial.
Model performance among the combined models (clinical and imaging predictor), clinical models (clinical predictor alone), and imaging models (imaging predictor alone) was assessed using the area under the receiver operating characteristic curve (ROC curve) as our measure of discrimination. The ROC curve is a plot of the sensitivity versus 1 − specificity, that is, the true-positive rate versus the false-positive rate. The area under the ROC curve reflects the models’ ability to discriminate between those with excellent outcome and all other outcomes and those with devastating outcome compared with all other outcomes. An area under the ROC curve of 1 is perfect discrimination, and an area of 0.5 reflects discrimination that is no better than random chance.
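As an illustration of this measure, the area under the ROC curve equals the probability that a randomly chosen subject with the outcome receives a higher predicted probability than a randomly chosen subject without it. The following is a minimal sketch of that calculation; the function name and the example values are hypothetical and are not study data.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve computed as the probability of concordance.

    labels: 1 for the outcome of interest, 0 for everything else.
    A tie between a positive and a negative score counts as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one subject in each outcome group")
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Made-up predicted probabilities and outcomes, for illustration only.
predicted = [0.9, 0.7, 0.6, 0.4, 0.2]
outcome = [1, 1, 0, 1, 0]
print(roc_auc(predicted, outcome))
```

The pairwise loop is quadratic in the number of subjects, which is adequate for a data set of a few hundred patients.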
Calibration curves were used to assess model calibration and are a plot of predicted probability of outcome versus actual outcome. In a calibration graph, the 45-degree line (ideal line or line of identity) represents perfect calibration, where each predicted probability of an outcome exactly matches the actual probability of an outcome. The closer the model calibration curve is to the ideal line, the better the calibration.
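The idea behind a calibration plot can be sketched with a simple grouped comparison of predicted and observed outcome frequencies. This is a simplification assumed for illustration: the curves reported here are bootstrap-corrected calibration curves, whereas the sketch below only sorts subjects by predicted probability and splits them into equal-sized groups. The function name and the example values are hypothetical.

```python
def calibration_table(predicted, observed, n_groups=5):
    """Group subjects by predicted probability and compare with outcomes.

    Returns one (mean predicted probability, observed outcome rate, group size)
    tuple per group, ordered from lowest to highest predicted probability.
    """
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    if n_groups < 1 or n_groups > n:
        raise ValueError("n_groups must be between 1 and the number of subjects")
    rows = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, obs_rate, len(chunk)))
    return rows


# Made-up predictions and outcomes, for illustration only.
predicted = [0.05, 0.10, 0.20, 0.35, 0.60, 0.70, 0.85, 0.95]
observed = [0, 0, 0, 1, 0, 1, 1, 1]
for mean_pred, obs_rate, size in calibration_table(predicted, observed, n_groups=4):
    print(round(mean_pred, 3), obs_rate, size)
```

Points for which the observed rate is close to the mean predicted probability lie near the 45-degree line.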
Each model was internally validated using bootstrap validation techniques.17 This method of internal validation assesses how accurately the models will predict outcome in a new similar sample of stroke patients. Resampling occurred 100 times for each bootstrap validation. All discrimination and calibration data presented are bootstrap corrected (bias corrected). All modeling analyses were done using Splus 4.5 software (MathSoft Inc.).
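One common form of bootstrap validation estimates the optimism of the apparent performance and subtracts it. The sketch below assumes that form; the exact procedure implemented in the software used for this analysis may differ. The names optimism_corrected_auc and fit_direction are hypothetical, and the toy "model" only orients a single predictor rather than fitting a logistic regression.

```python
import random


def roc_auc(scores, labels):
    """Concordance probability; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    total = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return total / (len(pos) * len(neg))


def optimism_corrected_auc(xs, ys, fit, n_boot=100, seed=0):
    """Apparent AUC minus the mean bootstrap estimate of optimism.

    fit(xs, ys) must return a function that maps a predictor value to a score.
    """
    rng = random.Random(seed)
    n = len(xs)
    model = fit(xs, ys)
    apparent = roc_auc([model(x) for x in xs], ys)
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(by)) < 2:
            continue  # a resample with a single outcome class cannot be scored
        boot_model = fit(bx, by)
        auc_in_resample = roc_auc([boot_model(x) for x in bx], by)
        auc_in_original = roc_auc([boot_model(x) for x in xs], ys)
        optimism.append(auc_in_resample - auc_in_original)
    if not optimism:
        return apparent
    return apparent - sum(optimism) / len(optimism)


def fit_direction(xs, ys):
    """Toy stand-in for model fitting: orient the predictor toward outcome 1."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    sign = 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0
    return lambda x: sign * x


# Made-up predictor values and outcomes, for illustration only.
xs = [2, 4, 5, 7, 9, 11, 12, 15, 18, 22]
ys = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
print(optimism_corrected_auc(xs, ys, fit_direction, n_boot=100, seed=1))
```

The default of 100 resamples mirrors the number stated above; a real application would pass a function that refits the full regression model in each resample.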
Simple Spearman correlations were also calculated between each of the predictor variables and each of the outcomes, as well as between the 2 predictor variables. The partial predictive power of each independent variable was assessed using plots of each of the independent variables versus the predicted probability of outcome using the combined models.
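For reference, a Spearman correlation is the Pearson correlation of the ranks of the two variables, with tied values given their average rank. A minimal sketch follows; the function names and the example values are hypothetical and are not study data.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(vx * vy)


# Made-up 1-week NIHSS scores and infarct volumes (cm3), for illustration only.
nihss = [2, 4, 4, 9, 15, 21]
volume = [0, 0, 12, 30, 25, 180]
print(spearman(nihss, volume))
```

Average ranks matter in data such as these, where many subjects share the same value (for example, an infarct volume of zero).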
Results
There were 256 fully eligible patients enrolled in the RANTTAS trial who submitted 7- to 10-day CT scans to the central registry. Two hundred one subjects were used for the NIHSS models, and 206 subjects were used for the BI and GOS models because of missing variables as described above. Table 2 demonstrates the baseline characteristics of the 206 patients used for analysis. The mean age was approximately 69 years and the median NIHSS at baseline was 10, suggesting moderate neurological deficit. The majority of subjects were white, and few were disabled before their enrolling stroke.
The values measured for early outcome predictor variables, the 3-month outcome variables, and the dichotomized outcome frequencies are listed in Table 3. The bootstrap corrected (bias corrected) areas under the ROC curve for the combined models, the clinical only models, and the imaging only models are shown graphically in Figure 1. The combined models for each of the excellent and devastating outcomes performed almost exactly as did the models based on clinical information alone (P value for difference ranged from 0.092 to 0.4). Areas under the ROC curve ranged from 0.83 to 0.94 for the combined models and from 0.84 to 0.95 for models using clinical information alone. The imaging alone models did not perform as well as the other models (P value for differences <0.005) and had areas under the ROC curve that ranged from 0.70 to 0.80. Model discrimination did not differ for the secondary analyses including the addition of age or the exclusion of zero infarct volume (data not shown). For the analysis excluding those who received tirilazad mesylate, the imaging alone models did not perform as well as the other models for 5 of the 6 models; in the sixth, the imaging alone model was indistinguishable from the other models (data not shown).
Calibration curves for 2 of the combined models are demonstrated in Figure 2 (top and bottom). These represent the best calibration (top) and the worst calibration (bottom) among the 6 models. The hatched line reflects perfect calibration, and the solid line is the bootstrap corrected (bias corrected) calibration curve for the given model. The excellent outcome as determined by the BI combined model has a calibration curve that is nearly superimposed on the ideal line, suggesting excellent calibration. The calibration curve for a devastating outcome as determined by the NIHSS combined model deviates more from the ideal but still represents very good calibration. The calibration curves for the other 4 combined models are not shown but resemble the calibration curve in Figure 2 (top).
The role of each of the predictor variables (clinical and imaging) in the combined model was also examined as demonstrated in Figure 3 (top and bottom). Figure 3 (top) demonstrates the role of 7- to 10-day NIHSS score in predicting the probability of outcome as determined by the combined model. Figure 3 (bottom) demonstrates the role of the 7- to 10-day infarct volume by head CT in predicting the probabilities of outcome by the combined model.
Spearman correlations between the predictor variables and the outcome variables were calculated to assess the strength of these relationships. Table 4 shows that the Spearman correlations with outcome are consistently higher for the NIHSS score than for the infarct volume. The correlation between the 2 predictor variables was 0.64.
Discussion
These data from the RANTTAS clinical trial population suggest that NIHSS score measured about 1 week after an acute ischemic stroke is highly predictive of 3-month outcome as determined by the NIHSS, BI, or GOS. One-week infarct volume is also predictive of 3-month outcome as determined by the NIHSS, BI, and GOS, but not as strongly as the NIHSS score. The combined model, including both the clinical and infarct volume information measured at 1 week, does not predict 3-month outcome better than the 1-week clinical information alone. Figure 3 (top and bottom) illustrates that predictions are dominated by the clinical variable. In Figure 3 (top) for a mild to moderate stroke (NIHSS score 0 to 15) there is a narrow band of predicted probabilities, suggesting that the NIHSS is dominating the prediction and is minimally influenced by the other predictor (CT infarct volume). The graph in Figure 3 (bottom) indicates that for a small to moderate stroke volume (0 to 200 cm³), there is a broad range of predicted probabilities, suggesting that the other predictor variable (NIHSS) is controlling the prediction of outcome.
Previous data have shown that infarct volume is related to stroke outcome,6,11 which is consistent with our data. It is uncertain why infarct volume adds relatively little to the prediction of 3-month outcome in this stroke population. Lesion location may confound the relationship between infarct size and clinical outcome, because 2 different infarcts of the same size in different locations could have very different functional expression. This may be magnified when the clinical measure is able to capture this functional difference, but the imaging variable is not. The large number of subjects with no infarct volume detected in this data set (71) also suggests that a potential lack of CT scan sensitivity may limit the use of this imaging technique as an early outcome measure. One-week head CT may be missing small infarcts because of lack of sensitivity, posterior fossa infarcts due to artifact, or other infarcts due to fogging effect.6 More sensitive imaging techniques, such as MRI, have been shown to have a stronger relationship with clinical outcome.3 Because both infarct volume and NIHSS measure some degree of stroke severity, it may also be that the NIHSS captures more complete information as it relates to 3-month outcome than does CT, although they clearly do not measure the exact same thing. Other potentially confounding variables such as previous brain injury or disability, medical complications, and differences in therapy may also play a role.
Patient age did not seem to be a confounding variable, in that the addition of age in the secondary analysis added little to the models’ predictive ability. Previous data have repeatedly suggested that age is an independent predictor of 3-month outcome.12,14,18–20 The inability of age to add to the predictive ability of the model may reflect the fact that the influence of age on outcome is already captured by the 1-week NIHSS score. For example, the age effect may relate to more medical complications that have surfaced by 1 week but are then captured by the NIHSS. Medical complications have been demonstrated to be related to death but not so clearly related to 3-month disability.21 Other potential confounders such as prestroke disability, history of diabetes, previous stroke history, and stroke subtype were not analyzed. These have all been shown to have relationships to outcome and could be confounding our results.12
There are several limitations of this study. First, the models were only tested in the data sets from which they were derived. Although the bootstrap internal validation strongly suggests that the models were not overfitted to the data set and are likely to perform as well in a similar population, these models have not been externally validated with independent data. Our strict regression modeling technique followed published guidelines15: we used a limited number of prespecified predictor variables, allowed nonlinearity of the predictor variables, and used internal validation to obtain a bias-corrected estimate of the models’ performance. All of these steps increase the likelihood that these models will perform equally well in a similar independent data set. The rule of 10 requires that there be at least 10 of the least frequent outcomes for each degree of freedom used in the model. The model for devastating outcome as measured by NIHSS, which used 4 degrees of freedom but had only 28 least frequent outcomes (as shown in Table 3), was the only model that violated the rule of 10, because it had fewer than 10 outcomes for each degree of freedom. This model did not validate as well internally and is less likely than the other 5 models to perform as well in another data set. The overall modest size of the data set also limits our ability to identify relationships. In a larger data set, other variables could be added to the model to potentially improve the prediction, but in this data set, we were limited to relatively few variables.
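The rule-of-10 arithmetic for the devastating outcome NIHSS model can be made explicit with a minimal sketch. The function names are hypothetical; the inputs (28 least frequent outcomes, 201 subjects, 4 degrees of freedom) are the values given in the text.

```python
def outcomes_per_df(n_with_outcome, n_total, df):
    """Number of least frequent outcomes available per degree of freedom."""
    least_frequent = min(n_with_outcome, n_total - n_with_outcome)
    return least_frequent / df


def max_df_under_rule(n_with_outcome, n_total, per_df=10):
    """Largest number of degrees of freedom the rule of 10 would allow."""
    least_frequent = min(n_with_outcome, n_total - n_with_outcome)
    return least_frequent // per_df


# Devastating outcome by NIHSS: 28 least frequent outcomes among 201 subjects,
# modeled with 4 degrees of freedom (2 spline terms for each of 2 predictors).
print(outcomes_per_df(28, 201, 4))   # 7.0, below the threshold of 10
print(max_df_under_rule(28, 201))    # 2
```

By this rule, 28 outcomes would support only 2 degrees of freedom, which is the size of a single-predictor model with a 3-knot spline.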
Another limitation of this analysis is the loss of approximately 50 subjects because of missing data. Although the age and baseline stroke severity of the subjects with missing data were the same as those of the subjects we analyzed (data not shown), there is always the possibility that this has biased our sample. The fact that our baseline data and infarct volume correlations resemble those found in other published analyses also argues against a significant bias.11
A third potential limitation is the large number of zero infarct volume subjects (71), which could have resulted in a biased result. However, when we re-estimated clinical, imaging, and combined models on the 135 subjects with a measurable infarct volume, we obtained the same results. ROC areas from the clinical models were almost identical to those from the combined models, and ROC areas from the imaging models were much lower. The large number of participants without any measurable infarct volume, however, does raise the question of the stroke severity of this clinical trial population. These results may not be generalizable to other, more severely injured stroke populations. The use of more sensitive imaging measures, such as MRI, may result in fewer subjects with no measurable infarct volume.
Although the original RANTTAS trial demonstrated that tirilazad mesylate was not effective in changing the outcomes of these stroke patients, we conducted a second sensitivity analysis to assess whether our primary results were biased by an unsuspected effect of the experimental drug. The discrimination of the clinical models and combined models was very similar, and the imaging models performed less well in 5 of the 6 models and were the same in the sixth. These data argue against a bias due to drug effect.
A valid early outcome measure that is easily and reliably obtained, inexpensive, and highly predictive of 3-month outcome in stroke clinical trials could be very valuable. Such a measure could potentially reduce the follow-up time required for patients, the cost of trials, and the number of patients lost to follow-up in trials. At a minimum, a highly predictive regression model could be used to predict 3-month outcome for those lost to follow-up at 3 months, allowing them to be included in the data analysis. In general, when patients in a clinical trial are lost to follow-up, there are 3 analysis options: (1) those subjects can be dropped from the analysis, (2) the last observation can be carried forward, or (3) a prediction model can be used to estimate the likely outcome.
If there is an independently established and valid prediction model with good explanatory power, the prediction option should provide the best estimate of the treatment effect. This is because it provides the least biased estimate of the missing data. Using predictions, one could conduct the analysis with all randomized cases and maintain intention to treat unaffected by potentially biased losses to follow-up and with more accuracy than carrying forward the last observation. Because an area under the ROC curve of >0.8 is generally accepted as a strong enough relationship to make individual predictions,12 these data are encouraging that individual predictions with such a model may be acceptably accurate. If valid, these models could be used to impute 3-month outcome for participants lost to follow-up in stroke clinical trials.
Although our analysis used 1-week CT infarct volume, which added little to the prediction, further study of other imaging information, such as that obtained by MRI, may improve the prediction of long-term outcomes after acute ischemic stroke.
Infarct volume measured at 1 week adds little to NIHSS score measured at 1 week as an early outcome measure for predicting 3-month excellent and devastating stroke outcome as measured by NIHSS, BI, and GOS. If these results are proven valid, it would be difficult to justify the expense and patient inconvenience of CT imaging 1 week after acute stroke if the purpose of the scan is to obtain an early indicator of 3-month clinical outcome.
Dr Johnston is supported by the National Institutes of Health-National Institute of Neurologic Disorders and Stroke (K23NS02168-01). The RANTTAS study was supported, in part, by the National Institutes of Health-National Institute of Neurologic Disorders and Stroke (R01-NS31554) and Pharmacia and Upjohn Company (Kalamazoo, Mich).
The RANTTAS investigators are listed in the Appendix of Reference 7.
This work was presented, in part, at the 2nd Neurology Outcomes Research Conference of the American Neurological Association, Boston, Mass, October 15, 2000.
- Received September 7, 2001.
- Revision received October 30, 2001.
- Accepted November 14, 2001.
- Warach S, Boska M, Welch KMA. Pitfalls and potential of clinical diffusion-weighted MR imaging in acute stroke. Stroke. 1997; 28: 481–482.
- Warach S, Moseley M, Johnston K, Adams H, Zivin J. Diffusion weighted imaging: ready for prime time? Plenary Session Presented at the 23rd International Joint Conference on Stroke and Cerebral Circulation, February 5, 1998, Orlando, Fla.
- Thijs BN, Lansberg MG, Beaulieu C, Marks MP, Moseley ME, Albers GW. Is early ischemic lesion volume on diffusion-weighted imaging an independent predictor of stroke outcome? A multivariable analysis. Stroke. 2000; 31: 2597–2602.
- Saver JL, Johnston KC, Homer D, Wityk R, Koroshetz W, Truskowski LL, Haley EC. Infarct volume as a surrogate or auxiliary outcome measure in ischemic stroke clinical trials. Stroke. 1999; 30: 293–298.
- The RANTTAS Investigators. A randomized trial of tirilazad mesylate in patients with acute stroke (RANTTAS). Stroke. 1996; 27: 1453–1458.
- Lyden P, Brott T, Tilley B, Welch KM, Mascha EJ, Levine S, Haley EC, Grotta J, Marler J. Improved reliability of the NIH Stroke Scale using video training. NINDS TPA stroke study group. Stroke. 1994; 25: 2220–2226.
- Brott T, Marler JR, Olinger CP, Adams HP Jr, Tomsick T, Barsan WG, Biller J, Eberle R, Hertzberg V, Walker M. Measurements of acute cerebral infarction: lesion size by computed tomography. Stroke. 1989; 20: 871–875.
- Johnston KC, Connors AF, Wagner DP, Knaus WA, Wang X, Haley EC Jr. A predictive risk model for outcomes of ischemic stroke. Stroke. 2000; 31: 448–455.
- The NINDS t-PA Stroke Study Group. Generalized efficacy of t-PA for acute stroke. Subgroup analysis of the NINDS t-PA stroke trial. Stroke. 1997; 28: 2119–2125.
- Harrell FE, Lee KL, Mark DB. Tutorial in biostatistics: multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996; 15: 367–387.
- Efron B, Gong G. A leisurely look at the bootstrap, the jackknife, and cross-validation. Am Stat. 1983; 37: 36–48.
- Censori B, Camerlingo M, Casto L, Ferraro B, Gazzaniga GC, Cesana B, Mamoli A. Prognostic factors in first-ever stroke in the carotid artery territory seen within 6 hours after onset. Stroke. 1993; 24: 532–535.
- Johnston KC, Jiang YL, Lyden PD, Hanson SK, Feasby TE, Adams RJ, Faught RE Jr, Haley EC Jr. Medical and neurological complications of ischemic stroke: experience from the RANTTAS trial. Stroke. 1998; 29: 447–453.