Predicting Tissue Outcome From Acute Stroke Magnetic Resonance Imaging
Improving Model Performance by Optimal Sampling of Training Data
Background and Purpose— It has been hypothesized that algorithms predicting the final outcome in acute ischemic stroke may provide future tools for identifying salvageable tissue and hence guide individualized therapy. We developed means of quantifying predictive model performance to identify model training strategies that optimize performance and reduce bias in predicted lesion volumes.
Methods— We optimized predictive performance based on the area under the receiver operating characteristic curve for logistic regression and used simulated data to illustrate the effect of an unbalanced (unequal number of infarcting and surviving voxels) training set on predicted infarct risk. We then tested the performance and optimality of models based on perfusion-weighted, diffusion-weighted, and structural MRI modalities by changing the proportion of mismatch voxels in balanced training material.
Results— Predictive performance (area under the receiver operating characteristic curve) based on all brain voxels is excessively optimistic and insensitive to performance in mismatch tissue. The ratio of infarcting and noninfarcting voxels used to train predictive algorithms significantly biases tissue infarct risk estimates. An optimal training strategy is obtained using a balanced training set. For the patient data presented, mismatch voxels optimally constitute 60% of the noninfarcting voxels in a balanced training set.
Conclusions— An equal number of infarcting and noninfarcting voxels should be used when training predictive models. The choice of test and training sets critically affects predictive model performance and should be closely evaluated before comparisons across patient cohorts.
Early identification of tissue at risk of infarction after acute ischemic stroke may aid clinical decision-making and potentially improve long-term patient outcome. Neuroimaging correlates of the ischemic core (irreversibly damaged tissue), the ischemic penumbra (electrically silent yet viable tissue), and (functioning) tissue with benign oligemia1 are therefore of immense clinical and scientific importance.
The perfusion–diffusion mismatch—tissue showing prolonged bolus transit time characteristics by perfusion-weighted MRI but normal diffusion-weighted imaging (DWI) characteristics—has found widespread use as an operational target for thrombolytic treatment.2 Although this dichotomization may identify patients who respond favorably to thrombolytic treatment beyond 3 hours,3 increasing evidence suggests that diffusion and transit time metrics do not suffice to outline the ischemic penumbra: DWI lesions are observed to reverse with early recanalization,4,5 emphasizing the importance of parallel physiological parameters in determining tissue fate. The perfusion–diffusion mismatch, in turn, generally overestimates final lesion size.6 Methods combining acute, multimodal voxel-based image information with follow-up imaging outcome (infarct, salvaged) into models predicting tissue outcome have therefore attracted considerable interest7–10 and are hypothesized to outperform prediction of outcome based on the mismatch dichotomization.
Although predictive models integrate multimodal voxel-by-voxel image information from cases with known outcome into models that may predict outcome in subsequent patient cases, the underlying training material may critically affect the precision and reliability of such predictions. Intuitively, the model must be trained on a representative range of initial image findings (eg, transit time and diffusion parameter changes) and corresponding outcomes (infarcting, noninfarcting tissue) to gain sufficient precision to guide outcome predictions and warrant group comparisons.
We suggest a clinically informative measure of predictive performance sensitive to correct classification of mismatch tissue. The study uses logistic regression model theory to determine the optimal training set (relative number of infarcting and noninfarcting voxels) and further examines how model performance depends on the proportion of voxels affected by the stroke (mismatch voxels) in the training material. Finally, we examine how predicted tissue outcome and predicted lesion volume depend on training strategy.
Training a Predictive Model
Prediction of voxel outcome can be performed by identifying thresholds for a single image modality1,11,12 or by combining several modalities.9,10,13 In the latter case, infarction risk is assumed to depend on a linear combination of the acute modalities x1, …, xk in a particular voxel, eg,

p(infarct) = 1 / (1 + exp(−(β0 + β1x1 + … + βkxk)))

in which the coefficients β0, …, βk are derived from a training set where voxel outcomes are known. More flexible models such as k-nearest neighbor14 and multivariate adaptive regression splines15 can be applied in a similar fashion.
Results from logistic regression model theory show that optimal models are obtained by using training data with a balanced number of infarcting and surviving voxels.16 Consequently, a subset of surviving voxels matching the number of voxels in the patient's final lesion must be chosen for model training.
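The balanced-sampling strategy can be sketched as follows — a minimal NumPy illustration, not the authors' implementation; the single-modality data, sample sizes, and fitting routine are hypothetical:

```python
import numpy as np

def fit_logistic(X, y, lr=0.05, n_iter=5000):
    """Fit p(infarct) = 1/(1 + exp(-(b0 + X @ b))) by gradient ascent on the log-likelihood."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ beta, -60, 60)))
        beta += lr * Xb.T @ (y - p) / len(y)     # gradient of the log-likelihood
    return beta

def balanced_training_set(X, y, rng):
    """Subsample surviving voxels (y == 0) to match the number of infarcting voxels (y == 1)."""
    infarct = np.flatnonzero(y == 1)
    survive = np.flatnonzero(y == 0)
    keep = rng.choice(survive, size=len(infarct), replace=False)
    idx = np.concatenate([infarct, keep])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
# Hypothetical single-modality voxel data (an MTT-like value); survivors dominate.
X = np.concatenate([rng.normal(5, 2.5, 900), rng.normal(10, 4.5, 100)])[:, None]
y = np.concatenate([np.zeros(900), np.ones(100)])

X_bal, y_bal = balanced_training_set(X, y, rng)
beta = fit_logistic(X_bal, y_bal)
```

The balanced subset pairs every infarcting voxel with exactly one surviving voxel, so the fitted intercept is not dragged down by the overwhelming majority of trivially surviving tissue.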
Testing a Predictive Model
The ability of a predictive algorithm to correctly predict final outcome is quantified by the size of the regions indicated in Figure 1A. For a given threshold, the number of correctly classified infarcting (true-positive [TP]) and surviving (true-negative [TN]) voxels must be high, whereas the number of voxels falsely predicted to survive (false-negative [FN]) or infarct (false-positive [FP]) must be low. This corresponds to simultaneously maximizing sensitivity (TP/(TP+FN)) and specificity (TN/(TN+FP)). To obtain a single performance measure independent of thresholds, the area under the receiver operating characteristic curve (AUC) can be calculated. The receiver operating characteristic curve is obtained by plotting sensitivity against (1−specificity) over all thresholds. It can be shown that AUC is the probability that an infarcting voxel receives a higher score (risk of infarction) than a surviving voxel,17 also known as the Wilcoxon-Mann-Whitney test statistic.
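The equivalence between the ROC area and the Wilcoxon-Mann-Whitney probability can be checked numerically; a sketch with synthetic scores (all names and distributions are illustrative assumptions):

```python
import numpy as np

def auc_mann_whitney(pos, neg):
    """AUC as P(infarcting voxel scores higher than surviving voxel), ties counting 1/2."""
    pos, neg = np.asarray(pos)[:, None], np.asarray(neg)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

def auc_trapezoid(pos, neg):
    """Same area computed from the ROC curve: sensitivity vs (1 - specificity)."""
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    order = np.argsort(-scores)                    # sweep threshold from high to low
    tp = np.cumsum(labels[order])
    fp = np.cumsum(1 - labels[order])
    sens = np.concatenate([[0.0], tp / tp[-1]])    # sensitivity = TP/(TP+FN)
    fpr = np.concatenate([[0.0], fp / fp[-1]])     # 1 - specificity = FP/(FP+TN)
    return np.sum(np.diff(fpr) * (sens[1:] + sens[:-1]) / 2)  # trapezoidal area

rng = np.random.default_rng(1)
pos = rng.normal(10, 4.5, 500)   # scores of infarcting voxels
neg = rng.normal(5, 2.5, 500)    # scores of surviving voxels
```

For continuous scores (no ties) the two computations agree exactly, which is the Wilcoxon-Mann-Whitney identity stated above.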
Differentiating infarcting and surviving voxels in the mismatch area (ie, separating benign oligemia from penumbra) is a challenging aspect of predictive modeling—and indeed clinical management—in acute stroke. Consequently, we divide surviving voxels from a patient into 2 distinct subsets: surviving voxels from the mismatch area (R) and surviving voxels outside the mismatch area (S), ie, predominantly contralateral tissue. Performance can now be assessed using infarcting voxels together with surviving voxels from either R (AUCR) or S (AUCS; Figure 1B). It can be shown that the AUC calculated using all voxels (both R, S, and infarcting voxels) is a linear combination of these measures

AUC = ωR · AUCR + ωS · AUCS

where AUCR is calculated using all infarcted voxels and voxels from R and AUCS is calculated using all infarcted voxels and voxels from S (Figure 1B). The weights ωR and ωS are the volume fractions of R and S relative to the volume of all surviving voxels. AUCR hence estimates the ability to differentiate infarcting and surviving voxels within the perfusion lesion area, whereas AUCS measures the ability to correctly differentiate between the infarct and surviving tissue outside the mismatch area. The volume of R is far smaller than the volume of S, and therefore the AUC value obtained using all voxels will be dominated by performance in regions that trivially never infarct (AUCS). Consequently, AUC based on all voxels is insensitive to correct classification of mismatch voxels. To simultaneously maximize AUCS and AUCR, we consider the mean of AUCR and AUCS as a measure of overall predictive performance, AUC0

AUC0 = (AUCR + AUCS) / 2
AUC0 assigns equal importance to differentiation of infarcting voxels from affected and unaffected noninfarcting voxels.
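The decomposition of the overall AUC into its R and S components can be verified numerically; the risk-score distributions below are hypothetical stand-ins chosen only to make S dominate in volume:

```python
import numpy as np

def auc(pos, neg):
    """Wilcoxon-Mann-Whitney AUC with ties counting one half."""
    pos, neg = np.asarray(pos)[:, None], np.asarray(neg)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

rng = np.random.default_rng(2)
# Hypothetical infarct-risk scores: infarcting voxels, surviving mismatch voxels (R),
# and surviving voxels outside the mismatch (S); S vastly outnumbers R.
infarct = rng.normal(0.70, 0.15, 300)
R = rng.normal(0.50, 0.15, 200)
S = rng.normal(0.10, 0.10, 5000)

auc_R, auc_S = auc(infarct, R), auc(infarct, S)
w_R = len(R) / (len(R) + len(S))        # volume fraction of R among survivors
w_S = len(S) / (len(R) + len(S))
auc_all = auc(infarct, np.concatenate([R, S]))   # equals w_R*auc_R + w_S*auc_S
auc_0 = (auc_R + auc_S) / 2             # equal weight to both survivor regions
```

Because w_S is close to 1, auc_all is driven almost entirely by the easy discrimination against unaffected tissue, whereas auc_0 exposes the harder mismatch discrimination.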
In the following, voxels used to calculate performance of predictive models are denoted as test data. To illustrate that AUC depends on the test data, we consider performance quantified by the characteristic AUC measures (AUC, AUCR, AUCS). To avoid overestimating performance due to overfitting (a patient appearing in both test and training data), training and test data sets are sampled independently using jackknifing (a leave-one-out procedure9).
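Leave-one-patient-out sampling can be sketched generically; `fit` and `predict` are placeholder callables of this illustration, and the toy "model" (the overall training infarct rate) stands in for any real predictive model:

```python
import numpy as np

def leave_one_out_predictions(patients, fit, predict):
    """For each patient, train on every *other* patient and score the held-out
    patient's voxels, so no patient appears in both training and test data."""
    preds = {}
    for pid in patients:
        train = [p for k, p in patients.items() if k != pid]
        X = np.vstack([p["X"] for p in train])
        y = np.concatenate([p["y"] for p in train])
        model = fit(X, y)
        preds[pid] = predict(model, patients[pid]["X"])
    return preds

# Toy stand-ins: the "model" is just the overall infarct rate of the training voxels.
fit = lambda X, y: y.mean()
predict = lambda model, X: np.full(len(X), model)

patients = {
    "p1": {"X": np.zeros((4, 1)), "y": np.array([1., 1., 0., 0.])},
    "p2": {"X": np.zeros((2, 1)), "y": np.array([1., 0.])},
    "p3": {"X": np.zeros((3, 1)), "y": np.array([0., 0., 0.])},
}
preds = leave_one_out_predictions(patients, fit, predict)
```

Each patient's predictions are computed from a model that never saw that patient, which is what protects the performance measures from the optimistic bias of overfitting.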
Materials and Methods
Patients and Image Acquisition
Thirty-three patients presenting with symptoms of acute stroke were imaged with a 3.0-T MR scanner (Signa Excite; General Electric Medical Systems, Milwaukee, Wis). The protocol consisted of DWI, T2* gradient echo, fluid-attenuated inversion recovery, and gradient echo echoplanar perfusion-weighted imaging. All patients were treated with intravenous tissue plasminogen activator. Follow-up T2 fast spin echo images were acquired after 3 months. See Hjort et al18 for more details.
Cerebral blood flow, cerebral blood volume, and mean transit time (MTT) were calculated by oSVD19 with an automatic input function search algorithm20 using the perfusion analysis program PENGUIN (www.cfin.au.dk/software/penguin). Tracer arrival delay and dispersion were also determined as the time to peak of the residue function. Predictive algorithms were based on these perfusion measures in addition to T2, DWI, and apparent diffusion coefficient. All images were normalized with respect to mean values in normal-appearing contralateral white matter. Relative differences were applied for all physiological parameters, except for time to peak and MTT, in which absolute differences were used. The mismatch region was defined as the mismatch between acute perfusion and diffusion lesions, which were outlined on MTT and apparent diffusion coefficient maps by a neuroradiologist blinded to clinical and other MRI data as described in Wu et al.10 Infarcted tissue was delineated on follow-up images by a neuroradiologist. Analyses were limited to 16 patients with a follow-up lesion volume of ≥5 mL. Patient demographics are shown in the Table.
Simulation Study: Effect of Infarcting Voxel Proportion in Training Data
To assess sensitivity to training data selection and verify theoretical results,16 we simulated MTT values (n=1000) of infarcting and noninfarcting voxels based on MTT values obtained in patient data (noninfarcting: 5±2.5 seconds; infarcting: 10±4.5 seconds). Logistic regression models were developed with the proportion of surviving voxels ranging from 2% to 98%. Independent test samples (n=1000) were simulated from the same MTT distributions, likewise varying the proportion of surviving voxels. Mean risk of infarction for infarcting voxels in the simulated test data was calculated for each model.
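The simulation can be reproduced in outline; the MTT distributions follow the text, but the fitting routine, seed, and the two example balance fractions are our own choices (a sketch, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60, 60)))

def fit_logistic(x, y, lr=0.05, n_iter=5000):
    """Plain gradient-ascent logistic fit of p(infarct) on a single covariate."""
    Xb = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        beta += lr * Xb.T @ (y - sigmoid(Xb @ beta)) / len(y)
    return beta

rng = np.random.default_rng(4)
n = 1000

def training_set(frac_surviving):
    """MTT values (seconds): surviving ~ N(5, 2.5), infarcting ~ N(10, 4.5)."""
    n_surv = int(n * frac_surviving)
    mtt = np.concatenate([rng.normal(5, 2.5, n_surv), rng.normal(10, 4.5, n - n_surv)])
    y = np.concatenate([np.zeros(n_surv), np.ones(n - n_surv)])
    return mtt, y

risk_at_10 = {}
for frac in (0.96, 0.50):          # heavily unbalanced vs balanced training
    b = fit_logistic(*training_set(frac))
    risk_at_10[frac] = sigmoid(b[0] + b[1] * 10.0)   # predicted risk at MTT = 10 s
```

A training set dominated by surviving voxels pushes the fitted intercept down, so the same voxel receives a much lower predicted infarct risk than under balanced training — the bias the simulation study quantifies.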
Effect of Training Data Selection on Performance in Patient Data
To model sensitivity to selection of the noninfarcting voxels in a balanced training sample, we varied the sample fraction of surviving voxels uniformly sampled from the mismatch (R) and unaffected tissue (S) in training material. For each sample fraction, 16 models (each with one patient left out in training) were developed using jackknifing. Using these models, the performance measures AUC, AUCR, AUCS, and AUC0 were calculated for each patient and an optimal training sample was determined as the balance between surviving voxels from R and S, which maximizes AUC0. The ensuing model is denoted by M0. For comparison, we developed models using training sets in which all surviving voxels were from S (model MC) or all surviving voxels were from R (model MP).
Regional Infarct Risk Dependence on Training Data Selection
Infarct risk in core (apparent diffusion coefficient lesion), recruited tissue (mismatch tissue infarcted on follow-up), salvaged tissue (mismatch tissue normal on follow-up), and normal (ipsilateral) tissue10 was obtained using models M0, MC, and MP. Predicted lesion volume at a threshold of 50% was calculated for each model. The predicted lesion volumes were compared pairwise and against the measured volumes of the true outcome lesion. Moreover, the measured volumes were compared with the volumes of the acute MTT lesions. Comparisons were performed using the Wilcoxon signed-rank test.
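The volume comparison is a paired nonparametric test; a sketch with entirely hypothetical volumes, using `scipy.stats.wilcoxon` for the signed-rank test:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical lesion volumes (mL) for a small cohort: predicted volumes come from
# thresholding each patient's risk map at 50%, measured volumes from follow-up MRI.
measured  = np.array([12., 30., 55., 8., 90., 40., 22., 70.])
predicted = measured + np.array([35., 20., 60., 10., 80., 30., 25., 50.])

stat, p = wilcoxon(predicted, measured)   # paired signed-rank test of the bias
bias = np.median(predicted - measured)    # median overestimation in mL
```

With every predicted volume exceeding the measured one, the signed-rank test rejects the hypothesis of zero median bias even in a cohort this small.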
Effect of Test Data Balance on AUC
To further illustrate the dependence of AUC on test balance, we calculated AUC using infarcting voxels and surviving voxels from R (AUCR) for M0. We then recalculated AUC after adding surviving voxels from S in increments of 2% until all brain voxels were used for calculation of AUC.
Results

Simulation Study: Effect of Infarcting Voxel Proportion in Training Data
Median infarction risk increases substantially when the number of infarcting voxels in the training set increases (Figure 2), eg, a voxel with MTT=10 seconds has a predicted infarct risk of 18% if 4% of the training sample consists of infarcting voxels in contrast to 76% when using a balanced training set. The result demonstrates the importance of sampling equally from infarcting and surviving voxels as indicated by theoretical results.16
Training Data Selection Dependence on Performance in Patient Data
The performance measures AUC, AUCR, and AUCS depend critically on training strategy (Figure 3A). AUC and AUCS do not differ substantially and decrease as the sampling fraction of surviving mismatch voxels increases, ie, the ability to correctly predict unaffected tissue is compromised. This decrease is small if the sampling fraction in R is between 0% and 60%, but as high as 12.4% if it exceeds 60%.
However, AUCR increases continuously as the sample fraction of surviving mismatch voxels increases. Maximum AUC0 is obtained for a sampling fraction of 60% (Figure 3B), suggesting an optimal proportion of noninfarcting mismatch voxels. The corresponding model is denoted M0. Sampling fractions from 40% to 60%, however, give similar results.
Regional Infarct Risk Dependence on Training Data Selection
Risk of infarction in core, recruited, salvaged, and normal tissue depends critically on which model (MC, M0, or MP) is applied for prediction (Figure 4). Using the common 50% threshold for predicting the final infarct results in a large part of salvaged tissue being falsely classified as infarcted for model MC in comparison to model M0. Moreover, interquartile range for risk of infarction in salvaged tissue is larger for MC compared with M0. The predicted lesion volumes of M0 were significantly different from the predicted lesion volumes of MC and MP (P<0.001). The predicted lesion volumes overestimate the final infarct for the 3 models (median differences and interquartile ranges: 78 mL [60 to 116] for MC, 35 mL [24 to 65] for M0, 78 mL [33 to 137] for MP). The acute MTT lesion overestimates the final infarct by 75 mL in median (interquartile range, 66 to 114), which is significantly larger than the bias observed using M0. Figure 5 shows risk maps thresholded at 50% for Patients 11 and 13 using models MC, M0, and MP and the corresponding acute DWI and MTT images. Patient 11 had 50% reperfusion 2 hours after treatment, whereas Patient 13 did not reperfuse. MC clearly overestimates the final infarct of both patients, whereas higher accuracy is observed for M0. Applying MP, areas of false infarct predictions are observed in normal tissue.
Effect of Test Data Balance on AUC
As can be seen in Figure 3C, adding surviving voxels from S to the test set implies an increase in calculated AUC, which we ascribe to different distribution of infarction risks of noninfarcted voxels outside and inside the mismatch. This further illustrates the dependence of calculated performance on test sets. Figure 3D illustrates the interacting effects of testing and training strategies on AUC.
Discussion
Our results confirm earlier findings9,10,13 that logistic regression models may predict final outcome with higher accuracy than the conventional mismatch. Confirming theoretical results, our simulations further demonstrate that a balanced training strategy should be applied to avoid bias in predicted infarct risk. We find that tissue infarct risk estimates depend critically on the choice of the balanced training data (Figures 4 and 5). Furthermore, our results demonstrate how model training may be further refined by determining the ratio of mismatch versus unaffected voxels among noninfarcting tissue to optimize model performance.
As a means of assessing predicted infarct volume, a threshold of >50% predicted infarct risk has been proposed.9,10,16 Our study supports this threshold choice yet reveals that predicted lesion volumes based on a 50% threshold vary greatly depending on training strategy. Final lesion volume is consistently overestimated by applying a 50% threshold, but bias may be minimized by the proposed optimal training strategy. Moreover, bias is significantly smaller using the optimal training strategy than with the conventional mismatch approach. Small susceptibility artifacts caused part of the observed overestimation. Given the immense potential of predictive algorithms in clinical decision-making and drug efficacy testing demonstrated in pioneering work so far,10,13 our work emphasizes the importance of careful, optimal selection of training data. In particular, comparing observed lesion volumes in patients after a specific treatment with lesion volumes predicted under the condition of no treatment is only possible if the composition of training sets is carefully chosen. This also applies to designs in which treatment efficacy is demonstrated as a decrease in regional infarct risk (Figure 4).
One disadvantage of logistic regression models is that estimating SEs of model parameters requires independence of voxel outcomes. Although this is a common assumption,9,10 the pathophysiology of stroke predicts a regional correlation of infarct risk levels that affects the reliability of derived CIs and subsequent hypothesis testing. To obtain correct error estimates, we suggest that spatial autocorrelation structures21 should be modeled.
AUC is an established measure of predictive ability, although it may mislead as a measure of performance. It avoids subjectivity in threshold selection by summarizing model performance over all possible risk thresholds. In particular, thresholds corresponding to diagnostically unacceptably low sensitivity and specificity contribute to AUC. Furthermore, the actual scale of infarction risks and the goodness of fit of the model are ignored. Although AUC is theoretically independent of the balance between positive and negative outcomes in a test set, we have shown that AUC changes according to the regional clustering of outcomes. Predicting voxel outcome in the mismatch area is of primary interest both clinically and when assessing treatment efficacy. Because AUC calculated using all brain voxels is insensitive to discrimination of infarcting and noninfarcting mismatch voxels, we suggested the measure AUCR, which is sensitive to performance in the mismatch area. Other measures of performance used in the literature, eg, the root mean square error (Brier score), accuracy, and Youden's J-index, would also be affected by the choice of testing material.
The sampling method for selecting training material depends on the specific outlining of the perfusion-weighted imaging lesion and, consequently, optimal sample fraction may depend on the perfusion metric and threshold applied. Using a time-to-peak threshold of 6 seconds as mismatch criterion did not alter the optimal fraction criterion significantly (results not shown). The sampling method can also be applied for other supervised learning methods used for classification of tissue outcome. The optimal choice of training set may vary according to patient cohort, therapy, time from onset, type of stroke, and so on.
There are several limitations to the study. Despite the overwhelming number of image voxels included, the patient cohort was limited, and we did not include anatomic, stroke type, or reperfusion information, all speculated to improve specificity of outcome prediction.10,22 The correct assignment of voxel outcome depends on accurate coregistration of acute and follow-up images and is possibly biased by infarct shrinkage. Means of reducing this potential bias by excluding voxels located at infarct edges, as well as means of automatically outlining final lesion volume to reduce bias caused by interrater classification disagreement, should hence be pursued.
Acknowledgments
We acknowledge Niels Hjort, Mahmoud Ashkanian, and Christine Sølling for collecting the data used in the manuscript.
Sources of Funding
Supported by the Danish National Research Foundation (L.Ø., K.M.), the EU Commission's Sixth Framework Program (L.Ø., K.Y.J., Contract 027294), and GlaxoSmithKline (K.M., educational grant).
- Received March 11, 2009.
- Revision received May 14, 2009.
- Accepted May 28, 2009.
References
Astrup J, Siesjö BK, Symon L. Thresholds in cerebral ischemia—the ischemic penumbra. Stroke. 1981; 12: 723–725.
Hjort N, Butcher K, Davis SM, Kidwell CS, Koroshetz WJ, Röther J, Schellinger PD, Warach S, Østergaard L. Magnetic resonance imaging criteria for thrombolysis in acute cerebral infarct. Stroke. 2005; 36: 388–397.
Davis SM, Donnan GA, Parsons MW, Levi C, Butcher KS, Peeters A, Barber PA, Bladin C, De Silva DA, Byrnes G, Chalk JB, Fink JN, Kimber TE, Schultz D, Hand PJ, Frayne J, Hankey G, Muir K, Gerraty R, Tress BM, Desmond PM; EPITHET Investigators. Effects of alteplase beyond 2 h after stroke in the Echoplanar Imaging Thrombolytic Evaluation Trial (EPITHET): a placebo-controlled randomised trial. Lancet Neurol. 2008; 7: 299–309.
Fiehler J, Foth M, Kucinski T, Knab R, Bezold M, Weiller C, Zeumer H, Röther J. Severe ADC decreases do not predict irreversible tissue damage in humans. Stroke. 2002; 33: 79–86.
Sorensen AG, Copen WA, Østergaard L, Buonanno FS, Gonzalez RG, Rordorf G, Rosen BR, Schwamm LH, Weisskoff RM, Koroshetz WJ. Hyperacute stroke: simultaneous measurement of relative cerebral blood volume, relative cerebral blood flow, and mean tissue transit time. Radiology. 1999; 210: 519–527.
Jacobs MA, Mitsias P, Soltanian-Zadeh H, Santhakumar S, Ghanei A, Hammond R, Peck DJ, Chopp M, Patel S. Multiparametric MRI tissue characterization in clinical stroke with correlation to clinical outcome: part 2. Stroke. 2001; 32: 950–957.
Rose SE, Chalk JB, Griffin MP, Janke AL, Chen F, McLachan GJ, Peel D, Zelaya FO, Markus HS, Jones DK, Simmons A, O'Sullivan M, Jarosz JM, Strugnell W, Doddrell DM, Semple J. MRI based diffusion and perfusion predictive model to estimate stroke evolution. Magn Reson Imaging. 2001; 19: 1043–1053.
Wu O, Koroshetz WJ, Ostergaard L, Buonanno FS, Copen WA, Gonzales RG, Rordorf G, Rosen BR, Schwamm LH, Weisskoff RM, Sorensen AG. Predicting tissue outcome in acute human cerebral ischemia using combined diffusion- and perfusion-weighted MR imaging. Stroke. 2001; 32: 933–942.
Wu O, Christensen S, Hjort N, Dijkhuizen RM, Kucinski T, Fiehler J, Thomalla G, Röther J, Østergaard L. Characterizing physiological heterogeneity of infarction risk in acute ischaemic stroke using MRI. Brain. 2006; 129: 2384–2393.
Butcher KS, Parsons M, MacGregor L, Barber PA, Chalk J, Bladin C, Levi C, Kimber T, Schultz D, Fink J, Tress B, Donnan G, Davis S. Refining the perfusion–diffusion mismatch hypothesis. Stroke. 2005; 36: 1153–1159.
Wintermark M, Flanders AE, Velthuis B, Meuli R, van Leeuwen M, Goldsher D, Pineda C, Serena J, van der Schaaf I, Waaijer A, Anderson J, Nesbit G, Gabriely I, Medina V, Quiles A, Pohlman S, Quist M, Schnyder P, Bogousslavsky J, Dillon WP, Pedraza S. Perfusion-CT assessment of infarct core and penumbra—receiver operating characteristic curve analysis in 130 patients suspected of acute hemispheric stroke. Stroke. 2006; 37: 979–985.
Mouridsen K, Wu O, Koroshetz WJ, Sorensen A, Østergaard L. Optimal parameter choice in predicting final outcome in acute stroke. In Proceedings of the 12th Annual Scientific Meeting of the International Society for Magnetic Resonance in Medicine (ISMRM), 2004, Kyoto, Japan.
Cramer JS. Predictive performance of the binary logit model in unbalanced samples. Statistician. 1999; 48: 85–94.
Hjort N, Ashkanian M, Sølling C, Mouridsen K, Christensen S, Gyldensted C, Andersen G, Østergaard L. MRI detection of early blood–brain barrier disruption: parenchymal enhancement predicts focal hemorrhagic transformation after thrombolysis. Stroke. 2008; 39: 1025–1028.
Mouridsen K, Christensen S, Gyldensted L, Østergaard L. Automatic selection of arterial input functions for quantification of cerebral perfusion using cluster analysis. Magn Reson Med. 2006; 55: 524–531.
Bivand RS, Pebesma EJ, Gómez-Rubio V. Applied Spatial Data Analysis With R. New York: Springer; 2008.
Wu O, Christensen S, Rosa-Neto P, Hjort N, Rodell A, Dijkhuizen RM, Fiehler J, Röther J, Østergaard L. Anatomy as a parameter in multiparametric MRI-based predictive algorithms. In Proceedings of the 12th Annual Scientific Meeting of the International Society for Magnetic Resonance in Medicine (ISMRM), 2004, Kyoto, Japan.