A Matching Algorithm to Address Imbalances in Study Populations
Application to the National Institute of Neurological Diseases and Stroke Recombinant Tissue Plasminogen Activator Acute Stroke Trial
Background and Purpose— Outcome from stroke is highly dependent on baseline conditions. Patients with stroke have a wide range of severities, ages, and etiologies and it has proven difficult to achieve randomization of key variables in clinical trials. We present a new post hoc approach to achieve balance among selected variables. To illustrate the approach, we rebalanced the National Institute of Neurological Diseases and Stroke Recombinant Tissue Plasminogen Activator trial, in which the contribution of baseline imbalances continues to be debated.
Methods— We selected baseline stroke severity (National Institutes of Health Stroke Scale), age, and glucose as matching criteria. The closest matched placebo and treated subjects were identified based on nearness to each other in 3-dimensional Euclidean space. Matching was performed within the quintiles of National Institutes of Health Stroke Scale that have been previously used to assess balance. Subjects who could not be matched were eliminated. Outcomes were assessed using the original specified National Institute of Neurological Diseases and Stroke trial measures.
Results— We successfully matched the 2 arms resulting in nearly identical baseline characteristics and distribution among quintiles. Despite fewer subjects after outlier elimination, the primary outcome measures remained significantly improved. After rebalancing, the magnitude of benefit was reduced by 13% to 23%. Benefit was apparent mostly in the large vessel occlusion subtype.
Conclusion— This study demonstrated the feasibility of rebalancing individual subjects within a randomized trial. After rebalancing and outlier elimination, recombinant tissue plasminogen activator continued to demonstrate improved outcome. That the apparent treatment effect was reduced suggests that imbalances contributed to the magnitude of the original National Institute of Neurological Diseases and Stroke outcomes. This method could in theory be applied to any data set to find matched subjects for outcome or other analyses.
Many early positive results in acute stroke treatment trials were not subsequently replicated.1 Although there are many potential explanations, we contended that imbalances in baseline variables contributed to apparent early success and may also obscure potentially positive results.1 Imbalances are particularly important in acute stroke given the range of severities and types of subjects typically enrolled in these trials, but they are not unique to stroke and can affect any disease with variable presentation. Methods have been used to adjust for or assess imbalances, including testing for covariance and regression. However, because the relationship between baseline variables and outcome is not necessarily linear,1 we contend that these methods are flawed. Moreover, we argued that complex outcome analyses can magnify the effect of imbalances if they further subdivide subjects into groups in which imbalances are even more likely.1
Earlier, we established a pooled control population sample from published randomized trials and generated a continuous nonlinear model by which any particular treatment group can be compared based on its median and mean characteristics (PPREDICTS, Pooled Placebo REsponse DICtates Treatment Success1). This method was able to accurately identify studies that subsequently were either positive or negative when pursued to Phase 3. It can be argued that the group characteristics may still obscure important imbalances within the populations. Moreover, the method depends on sufficient control population data available for each outcome measure to generate a model.2 We report a new application of a method that involves direct matching of placebo and treatment arm subjects based on important baseline characteristics that have been associated with outcome. We apply this method to the National Institute of Neurological Diseases and Stroke (NINDS) Recombinant Tissue Plasminogen Activator (rtPA) trial, because imbalances in the 2 arms have been the subject of debate since its original publication.3–5
The controversy regarding NINDS imbalances has prompted reanalyses and considerable discussion.6–10 These controversies generally involved various ways to correct for baseline imbalances or focused on subgroups based on initial stroke severity. The most recent analysis, using the released complete individual data set, concluded that rtPA treatment had minimal effect on outcome7 and used a correction method not dependent on the nature of the relationship between baseline factors and outcome. However, these authors compared outcomes with a variable not originally intended as the primary outcome measure (change in National Institutes of Health Stroke Scale [NIHSS] from baseline) that is also nonlinear across the range of severity.11 In this report, we describe our method and demonstrate that the originally specified outcomes can be tested for the newly matched groups.
A custom Matlab program was written to match placebo subjects with treated subjects based on baseline characteristics followed by elimination of outliers and comparison of outcomes with this matched sample. The NINDS rtPA trial database was obtained from www.ntis.gov.12
A number of matching methods such as Euclidean distance, Mahalanobis distance, and propensity score matching have been proposed for a variety of conditions.13–15 A simple matching method based on weighted Euclidean distances was adopted to obtain a pair of subjects considered the nearest neighbors in an n-dimensional space.15 Euclidean distance was defined as follows:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

where $p_i$ and $q_i$ are the weighted values of the $i$th baseline variable for the 2 subjects being compared and $n$ is the number of variables.
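As a minimal sketch of the distance measure described here, the following Python function computes a weighted Euclidean distance between 2 subjects. The function name, argument layout, and example values are illustrative assumptions and are not taken from the authors' Matlab program.

```python
import math

def weighted_euclidean(p, q, weights):
    """Distance between points p and q after scaling each coordinate by its weight."""
    if not (len(p) == len(q) == len(weights)):
        raise ValueError("p, q, and weights must have the same length")
    return math.sqrt(sum((w * (a - b)) ** 2 for a, b, w in zip(p, q, weights)))

# Hypothetical subjects described by (baseline NIHSS, age, glucose)
placebo_subject = (14, 67, 150)
treated_subject = (15, 65, 140)
unit_weights = (1.0, 1.0, 1.0)  # placeholder weights, for illustration only
print(weighted_euclidean(placebo_subject, treated_subject, unit_weights))
```

Multiplying each variable by its weight before differencing is equivalent to scaling each coordinate difference by the same weight, which is the form used in the sketch.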
Finding the Nearest Neighbor
For this study, Delaunay triangulation16–18 was used to identify the nearest neighbor for each placebo (source) subject from the rtPA-treated (target) population in 3-dimensional space of baseline NIHSS, age, and pretreatment glucose. Delaunay triangulation is based on a variation of the Convex Hull algorithm.19 We elected to match on 3 baseline parameters simultaneously (NIHSS, age, and blood glucose), all factors associated with outcomes in stroke.20–22 As proposed for using Euclidean distance matching,23 these parameters were verified to be independent of each other for both the source and target groups by Pearson and Spearman tests.24 Baseline NIHSS, age, and pretreatment glucose were multiplied by weighting factors of 9.9, 2.29, and 1, respectively, so that the 3 variables had equal means when distances were determined. After weighting, the mean of each factor equaled 150.5, which was the mean of the variable with the largest values, glucose. The purpose of weighting is to avoid the factor with the largest value having an inordinate influence on the matching.15 We matched only those subjects in whom all 3 baseline variables were available. There were 2 subjects in the rtPA arm without a recorded glucose value, so these subjects, and correspondingly 2 control arm subjects, were not included in the final analysis.
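The weighting step can be read as dividing the largest variable mean by each variable's own mean. On that reading, the reported factors imply mean values of approximately 15.2 for NIHSS (150.5/9.9) and 65.7 for age (150.5/2.29). The sketch below illustrates this interpretation in Python. The function name and the toy values are assumptions for illustration only, and all variable means are assumed to be nonzero.

```python
from statistics import mean

def mean_equalizing_weights(columns):
    """Return one weight per variable so that every weighted mean equals the largest mean."""
    means = [mean(column) for column in columns]
    target = max(means)
    return [target / m for m in means]

# Hypothetical toy values for three baseline variables (not trial data)
nihss = [10, 15, 20]
age = [60, 65, 70]
glucose = [140, 150, 160]

weights = mean_equalizing_weights([nihss, age, glucose])
weighted_means = [w * mean(col) for w, col in zip(weights, [nihss, age, glucose])]
print(weights)
print(weighted_means)
```

The variable with the largest mean receives a weight of 1, which matches the reported factor for glucose.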
From the Delaunay triangulation, the distance between each source–target pair was computed. A method to identify and eliminate outliers was adapted from the published literature.25 This method considers the distribution of data points in quartiles and establishes a threshold beyond which points are considered outliers. The threshold distance was determined by a formula based on the 25th and 75th percentiles (Q25 and Q75) of the pair distances.25 [Equation not reproduced.]
As recommended, pairs with distances greater than the threshold were considered outliers25 and eliminated from further consideration. Baseline characteristics and 90-day outcomes of modified Rankin Scores (mRS: 0 to 6), Barthel Index (BI: 0 to 100), Glasgow Outcome Scale (GOS: 1 to 5), 90-day mortality, and 90-day NIHSS for the matched pairs were compared before and after matching.
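One common quartile-based rule of this kind is the upper fence Q75 + k × (Q75 − Q25). The Python sketch below uses that rule with k = 1.5, which is a conventional choice and an assumption here, because the exact formula and multiplier used in this study are not given in the text. The function names and the percentile interpolation method are also illustrative choices.

```python
from statistics import quantiles

def outlier_threshold(distances, k=1.5):
    """Upper fence Q75 + k * (Q75 - Q25); k = 1.5 is an assumed, conventional multiplier."""
    q25, _, q75 = quantiles(distances, n=4, method="inclusive")
    return q75 + k * (q75 - q25)

def split_pairs(pairs, k=1.5):
    """pairs: list of (source_id, target_id, distance). Returns (kept, outliers)."""
    threshold = outlier_threshold([distance for _, _, distance in pairs], k)
    kept = [pair for pair in pairs if pair[2] <= threshold]
    outliers = [pair for pair in pairs if pair[2] > threshold]
    return kept, outliers

# Hypothetical matched pairs: (placebo id, rtPA id, weighted distance)
example_pairs = [(0, 3, 1.0), (1, 5, 2.0), (2, 4, 3.0), (3, 0, 4.0), (4, 1, 5.0), (5, 2, 100.0)]
kept, outliers = split_pairs(example_pairs)
print(len(kept), len(outliers))
```

Different percentile conventions give slightly different Q25 and Q75 values in small samples, so the resulting threshold depends on the interpolation method as well as on k.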
Handling of Subjects With Prior Disability
Because prior disability may have an effect on recovery of function in subjects, we decided to take advantage of the “no prior disability” flag in the NINDS database. After excluding subjects with prior disability, matching and outlier elimination steps were repeated.
When subjects in both arms are grouped into quintiles of NIHSS (1 to 5, 6 to 10, 11 to 15, 16 to 20, >20), there were still more rtPA subjects in the least severe NIHSS quintile and fewer subjects in the most severe quintile after this first round of matching. To correct for this imbalance in the distribution of NIHSS, the matched output of the “handling of subjects with prior disability” step was subjected to within-quintile matching to generate an equal number of subjects in each quintile of placebo and rtPA-treated subjects. The magnitude of treatment effect (the absolute difference in proportions of subjects achieving the outcome criteria between the rtPA and the placebo groups) was determined before and after within-quintile matching and the difference expressed as a percentage of the original difference in outcomes.
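The stratification and the effect-size comparison described above can be sketched in a few lines of Python. The function names are hypothetical, and the proportions in the example are invented for illustration and are not trial results.

```python
def nihss_quintile(nihss):
    """Map a baseline NIHSS score to strata 1-5, 6-10, 11-15, 16-20, >20 (indexed 0 to 4)."""
    if nihss < 1:
        raise ValueError("the strata described in the text start at NIHSS 1")
    return min((nihss - 1) // 5, 4)

def treatment_effect(p_treated, p_placebo):
    """Absolute difference in the proportions of subjects achieving an outcome criterion."""
    return p_treated - p_placebo

def percent_reduction(effect_before, effect_after):
    """Change in treatment effect, expressed as a percentage of the original effect."""
    if effect_before == 0:
        raise ValueError("original effect is zero; percentage change is undefined")
    return 100.0 * (effect_before - effect_after) / effect_before

# Invented proportions, for illustration only
before = treatment_effect(0.40, 0.20)
after = treatment_effect(0.38, 0.22)
print(nihss_quintile(14), percent_reduction(before, after))
```

Expressing the change relative to the original difference explains why a small absolute change in a small effect, such as mortality, can look large in percentage terms.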
Matching Within Stroke Subtype
The NINDS trialists recorded a baseline stroke subtype for each subject. We applied these subtypes to our matched samples, using the within-quintiles matched group, including those with prior disability, to compare with the original data set results.
Other Statistical Tests
Comparisons of distributions in terms of median NIHSS pre- and postmatching were performed by using Wilcoxon rank sum test. Student t test was used to compare means and variances of distributions. Fisher test was used to compare proportions in each group achieving mRS 0 to 1, BI of 95 to 100, GOS of 1, and NIHSS of 0 to 1 in the prematched sample. For matched groups, the paired proportions of mRS 0 to 1, BI of 95 to 100, GOS of 1, NIHSS of 0 to 1, and mortality were tested using the McNemar test for discordant pairs as recommended.26 These tests were verified with Stata Version 8.
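For readers unfamiliar with the McNemar test, the following Python sketch shows how discordant pairs drive the result. It implements the exact binomial form of the test. Whether the exact or the chi-square form was used in this study is not stated, so this choice, along with the function names and example data, is an assumption.

```python
from math import comb

def discordant_counts(pairs):
    """pairs: iterable of (treated_success, placebo_success) booleans for matched pairs."""
    pairs = list(pairs)
    b = sum(1 for treated, placebo in pairs if treated and not placebo)
    c = sum(1 for treated, placebo in pairs if placebo and not treated)
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value computed from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical matched pairs: did each member of the pair reach the outcome criterion?
example = [(True, False)] * 10 + [(True, True)] * 4 + [(False, False)] * 6
b, c = discordant_counts(example)
print(b, c, mcnemar_exact(b, c))
```

Concordant pairs, in which both or neither subject achieved the outcome, do not enter the calculation. Only pairs in which the 2 matched subjects differ contribute to the test.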
Initial Group Matching
The weighted Euclidean distance between each rtPA-treated and nearest placebo arm patient was calculated in a 3-dimensional space of baseline NIHSS, age, and baseline glucose. Pairs with distances greater than the threshold were eliminated. The Figure shows all the rtPA- and placebo-treated subjects plotted in a 3-dimensional space of NIHSS, age, and glucose. Blue circles (n=566; filled: rtPA; open: placebo) represent subjects matched with the closest neighbor; outlier subjects are represented by red circles (n=54) and were eliminated from further consideration. Matching and outlier elimination resulted in balanced groups of patients in terms of the median NIHSS (rtPA: 14 versus placebo: 14; P=0.61; Table 1) and means of NIHSS (14.4±7.3 versus 14.7±6.7; P=0.68), age (67.9±11.4 versus 66.5±11.8; P=0.15), and glucose (144±67.4 versus 145±66.6; P=0.90). Prematch and postmatch baseline characteristics are shown in Table 1.
Postmatch, the proportions of subjects in the rtPA group achieving mRS of 0 to 1 (0.39 versus 0.26), BI of 95 to 100 (0.52 versus 0.40), GOS of 1 (0.45 versus 0.32), and NIHSS ≤1 (0.34 versus 0.21) were higher than the proportions in the placebo group. The McNemar test of paired proportions was significant for each functional outcome measure (Table 1), indicating that the benefit of rtPA persisted after this rebalancing step. Mortality was similar in the rtPA and placebo groups (0.17 versus 0.20; P=0.27).
Exclusion of subjects with prior disability (24 in the treated and 24 in the placebo group) followed by 3-dimensional matching and outlier elimination again resulted in balanced baseline factors (Table 2). Proportion of subjects achieving functional outcomes of mRS 0 to 1, BI 95 to 100, GOS of 1, and 90-day NIHSS ≤1 remained significant for rtPA (Table 2). Mortality was again similar in both groups (0.16 versus 0.18; P=0.51).
To achieve equivalency in the range of severities among both arms, within-quintile matching was performed to generate the same number of subjects for rtPA- and placebo-treated groups in each quintile. Balance in terms of baseline factors persisted in these 2 groups (Table 2). The final baseline median NIHSS for each quintile is shown at the bottom of Table 3. The McNemar test of matched pairs was significant for all functional outcomes: mRS 0 to 1, BI 95 to 100, GOS of 1, and NIHSS ≤1 (Table 2). We calculated the magnitude of treatment effect for all functional outcome measures before and after within-quintile matching. Within-quintile matching resulted in reduced differences between rtPA and placebo (Table 2). The magnitude decreased by 19% for those that achieved an mRS 0 to 1, 23% for BI 95 to 100, 13% for GOS of 1, and 21% for NIHSS ≤1. Mortality was minimally affected, although the change appears large when expressed as a percentage (−2% before and −1% after matching).
Stroke Subtype Matching
As reported in the original NINDS publication,3 the prematched comparisons based on subtype classification at baseline showed that the maximal benefit in terms of absolute differences in functional outcomes was for the “small vessel occlusion” subtype and the least benefit was for the “cardioembolic” stroke subtype (left 3 columns in Table 4; note that the NINDS paper did not specify any statistical test for these observations and our calculation had some minor discrepancies from the NINDS report3). Initial group matching, when presented by stroke subtype, did not demonstrate appreciable change in the apparent benefit for each subtype. However, within-quintile matching reduced the apparent benefit in both “small vessel occlusion” and “cardioembolic stroke” with minimal to no reduction in the magnitude of effect in subjects classified as having “large vessel occlusion.” The quality of matching diminished, however, presumably as the number of subjects available in each subtype diminished.
Imbalances in baseline characteristics are important in a condition such as stroke, which is not a homogeneous disease in terms of etiology, severity, location, or comorbid conditions. In this regard, the NINDS rtPA trial is not unique.1 Baseline stroke severity as reflected in NIHSS accounts for approximately 80% of functional outcome and mortality variance21 with lesser contributions from age and glucose.22 Imbalances can be lessened by expanding the number of subjects, although imbalances can still persist. There may also be factors that are not yet known, and hence not tested for balance, that influence outcome.
The relationship between baseline factors and outcome is not necessarily linear1; hence, it is not clear how best to “correct” for differences. Because of the uncertain relationship among variables, we elected to mathematically match subjects for their baseline characteristics before applying statistical tests of outcomes and validated that we were able to successfully match for key variables. In essence, we sacrificed larger numbers of subjects and reduced power in exchange for more homogeneous populations. After rebalancing, the benefit from rtPA persisted, although the beneficial effect of treatment was somewhat reduced on a percentage basis. Nevertheless, our analysis indicates that rtPA is an effective treatment for improving outcome in acute ischemic stroke. Subtype analyses were performed and suggested that the majority of benefit from rtPA was in the large vessel occlusion group. However, the quality of matching was reduced with the smaller numbers in the subtypes. These results should be further tempered by the tentative diagnosis of subtypes in the NINDS rtPA trial in which only those studies available at baseline were used.3
Different matching methods have been proposed in other circumstances. A propensity score is obtained by first developing a linear regression model for a combined group (of case and control subjects together) and then calculating the probability of each subject being in the case (treatment) group.13,14 Two subjects in the case and control groups with the least difference in propensity score are considered matched. Propensity score measurement makes the following assumptions: (1) that the case and control groups have a large number of subjects; (2) that the distributions are normal (Gaussian) in shape; and (3) that the distributions are identical or nearly identical in terms of mean and SD.13,27 This method and the underlying assumptions are likely valid only when applied to large databases with thousands of subjects.28 If the 2 distributions of cases and control subjects are nonoverlapping in terms of independent variables, then it is likely that the propensity scores of most of the cases will be nearly 1 and those of the controls nearly 0. If one were to then match based on equivalency of propensity scores (propensity score distance), then the matches of 2 unequal distributions will be imperfect. Mahalanobis distance measurement is based on mean, variance, and covariance among independent variables in the control group, ignoring the treatment (case) group properties.13 Thus, this would be an appropriate method only if the case and control groups had nearly the same mean, variance, and covariances. Additionally, as the number of dimensions of comparison increases, finding a good match becomes more difficult and the computational cost rises.13
Weighted Euclidean matching, as performed here, does not impose the restriction in terms of equality of mean, variance, and covariance between the 2 comparisons or that placebo and target arms be from overlapping distributions.15 A relationship is also not assumed between baseline variables and the likelihood of being assigned to the treatment arm or the placebo arm. Euclidean matching does assume independence of the variables used for distance calculations23 and if the variables covary, Mahalanobis matching may have to be used with the attendant problems. Although we selected 3 variables widely considered important in outcome in stroke, the Euclidean matching is easily scalable to >3 variables.
A number of different methods have been proposed for finding closest points in multidimensional space, including “exhaustive searching” and the Delaunay-Convex hull method.16,17 The exhaustive search method first calculates all potential distances between points and then ranks them to identify the shortest distance or closest match. Because it calculates all the potential distances, it may be more computationally costly compared with Delaunay triangulation if the number of dimensions or subjects is very large.16,17 Although we selected the Delaunay triangulation method for use here, it is likely that either method as well as others proposed in the literature16 would be acceptable.
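The exhaustive search alternative is simple enough to sketch directly. The Python function below is an illustration of that approach, not of the Delaunay triangulation used in this study, and its name and example coordinates are hypothetical. Inputs are assumed to be already weighted.

```python
import math

def nearest_neighbors(sources, targets):
    """For each source point, return (index of closest target, distance) by exhaustive search."""
    if not targets:
        raise ValueError("at least one target point is required")
    matches = []
    for source in sources:
        # Compute the distance to every target and keep the smallest one
        best_index, best_distance = min(
            ((j, math.dist(source, target)) for j, target in enumerate(targets)),
            key=lambda item: item[1],
        )
        matches.append((best_index, best_distance))
    return matches

# Hypothetical weighted coordinates (NIHSS, age, glucose) for illustration
placebo_points = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
rtpa_points = [(9.0, 9.0, 9.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
print(nearest_neighbors(placebo_points, rtpa_points))
```

This sketch computes one distance per source–target combination, so its cost grows with the product of the 2 group sizes. It also allows the same target to be selected for more than one source; how repeated matches were handled in the study is not described, so any rule for resolving them would be a further assumption.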
The application to the NINDS rtPA trial indicated that a proportion of the difference in outcome from treatment was due to imbalances in the populations. Although in this case, benefit was maintained, such a contribution could conceivably obscure a smaller treatment effect in either direction (ie, false-negative or -positive) for a less effective agent. We do not suggest that this post hoc analysis method obviates the need for randomized trials. Blinded randomization reduces other potential biases such as selective recruitment, assignment of treatment or dropouts, and the subjective nature of outcome assessments, but randomization itself does not ensure balanced populations. In theory, our method could be applied to any condition in which the individual data sets are available and matching can be performed to any available population or groups of control subjects.
P.M. and T.A.K. hold the copyright and a patent application has been submitted on their behalf for PPREDICTS. They have no relationship to the NINDS rtPA trial or to Genentech, Inc.
- Received November 24, 2009.
- Revision received December 22, 2009.
- Accepted December 24, 2009.
Mandava P, Kent TA. A method to determine stroke trial success using multidimensional pooled control functions. Stroke. 2009; 40: 1803–1810.
Wasiewski WW. To Phase 3 or not to Phase 3? Stroke. 2009; 40: 1553–1554.
Mann J. NINDS reanalysis committee’s reanalysis of the NINDS trial. Stroke. 2005; 36: 230–231.
Ingall TJ, O'Fallon WM, Asplund K, Goldfrank LR, Hertzberg VS, Louis TA, Christianson TJ. Findings from the reanalysis of the NINDS Tissue Plasminogen Activator for Acute Ischemic Stroke Treatment Trial. Stroke. 2004; 35: 2418–2424.
Kwiatkowski T, Libman R, Tilley BC, Lewandowski C, Grotta JC, Lyden P, Levine SR, Brott T; National Institute of Neurological Disorders and Stroke Recombinant Tissue Plasminogen Activator Study Group. The impact of imbalances in baseline stroke severity on outcome in the National Institute of Neurological Disorders and Stroke Recombinant Tissue Plasminogen Activator Stroke Study. Ann Emerg Med. 2005; 45: 377–384.
Gladstone D, Hill M, Black S. tPA for acute stroke: balancing baseline imbalance. CMAJ. 2002; 166: 1652–1653.
Savitz SI, Lew R, Bluhmki E, Hacke W, Fisher M. Shift analysis versus dichotomization of modified Rankin Scale outcome scores in the NINDS and ECASS-II trials. Stroke. 2007; 38: 3205–3212.
D'Agostino RB. Tutorials in Biostatistics: Statistical Methods in Clinical Studies. West Sussex, UK: John Wiley & Sons Ltd; 2004: 67–83.
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983; 70: 41–55.
Bergstralh EJ, Kosanke JL. Computerized matching of cases to controls. Technical report 56. Available at: http://mayoresearch.mayo.edu/mayo/research/biostat/upload/56.pdf. Accessed September 2, 2009.
Skiena SS. Computational geometry. In: The Algorithm Design Manual. New York: Springer-Verlag; 1998: 345–396.
Matlab. The Language of Technical Computing. Boston: The Mathworks Inc; 2000: 11.18–11.37.
Uchino K, Billheimer D, Cramer SC. Entry criteria and baseline characteristics predict outcome in acute stroke trials. Stroke. 2001; 32: 909–916.
Weimar C, König IR, Kraywinkel K, Ziegler A, Diener HC. German Stroke Study Collaboration. Age and National Institutes of Health Stroke Scale Score within 6 hours after onset are accurate predictors of outcome after cerebral ischemia: development and external validation of prognostic models. Stroke. 2004; 35: 158–162.
Hill T, Lewicki P. Statistics: Methods and Applications. A Comprehensive Reference for Science, Industry and Data Mining. Tulsa, OK: Statsoft Inc; 2006: 163.
Myers JL, Well A. Research Design and Statistical Analysis. Hillsdale, NJ: Lawrence Erlbaum Associates Inc; 1995.
NIST/SEMATECH e-Handbook of Statistical Methods. Available at: www.itl.nist.gov/div898/handbook/. Accessed November 23, 2009.
Rosner B. Fundamentals of Biostatistics, 6th ed. Belmont, CA: Thomson; 2006: 408–412.
Shadish WR, Cook TD, Campbell DT. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin Company; 2002: 164.