Finding the Most Powerful Measures of the Effectiveness of Tissue Plasminogen Activator in the NINDS tPA Stroke Trial
Background and Purpose—We sought to identify the most powerful binary measures of the treatment effect of tissue plasminogen activator (tPA) in the National Institute of Neurological Disorders and Stroke (NINDS) rtPA Stroke Trial.
Methods—Using the Classification and Regression Tree (CART) algorithm, we evaluated binary cut points and combination of binary cut points with the 4 clinical scales and head CT imaging measures in the NINDS tPA Stroke Trial at 4 times after treatment: 2 hours, 24 hours, 7 to 10 days, and 3 months. The first analysis focused on detecting evidence of “early activity” of tPA with the use of outcome measures derived from the 2-hour and 24-hour clinical and radiographic measures. The second analysis focused on longer-term outcome and “efficacy” and used outcome measures derived from 7- to 10-day and 3-month measures. After identifying the cut points with the ability to classify patients into the tPA and placebo groups using part I data from the trial, we then used data from part II of the trial to validate the results.
Results—Of the 5 most powerful outcome measures for early activity of tPA, 4 involved the National Institutes of Health Stroke Scale (NIHSS) score at 24 hours or changes in the NIHSS score from baseline to 24 hours. The best overall single outcome measure was an NIHSS score ≤2 at 24 hours, which provided an odds ratio of 5.4 (95% CI, 2.4 to 12.1) and a projected sample size of 58 per treatment group assuming an α of 0.05 (2-sided test) and a power of 80% using part I data. The top 2 and 3 of the top 5 outcome measures for detecting the longer-term efficacy of tPA also involved the NIHSS score. A Rankin score of 0 or 1 at 3 months was the third most powerful outcome measure. Outcome measures identified by CART from part I data were not as sensitive in detecting the effectiveness of tPA when applied to part II data.
Conclusions—Measures using the NIHSS and a Rankin score ≤1 were the most sensitive discriminators of the effectiveness of tPA in the NINDS tPA Stroke Trial compared with the other clinical and radiological measures. The outcome measures identified in this exploratory analysis (eg, NIHSS score ≤2 at 24 hours) would be best used as an outcome measure in future phase II trials of recanalization begun within the first 3 hours after stroke onset, with inclusion and exclusion criteria similar to those in the NINDS tPA Stroke Trial.
Stroke is the third leading cause of death and the leading cause of adult disability, but recruitment of patients into clinical trials of acute ischemic stroke during the 1990s has been poor. The major reason for the lack of eligible patients is the narrow time window to treatment in almost all of the recent and ongoing clinical trials. Intravenous tissue plasminogen activator (tPA), the only therapy for acute stroke that is approved by the Food and Drug Administration, was proven to be effective in 2 studies in which all patients were treated within 3 hours and nearly half within 90 minutes.1 The combination of a limited number of eligible patients and many promising new drugs and devices for acute ischemic stroke indicates the need for the efficient design of future randomized phase II and phase III clinical trials.
Selection of the primary outcome measure or end point is among the most important considerations in the design of a clinical trial and depends not only on the disease under study but also on the expected mechanism and effect of a given therapy. Ideally, a study outcome measure should be easy to perform, reproducible, valid, clinically meaningful, and resistant to bias.2 It should also detect clinically relevant differences in the effectiveness of various therapies for a given disease with the smallest number of patients possible.
The National Institute of Neurological Disorders and Stroke (NINDS) rtPA Stroke Trial, the basis for approval of tPA, used 4 different clinical scales (primary outcome measures) as prespecified binary end points.1 Additionally, volumetric measurement of infarct by CT (secondary measure) was a prespecified end point. The present exploratory study was designed to determine which binary end points, or combination of end points, would consistently require the smallest number of patients to detect a significantly beneficial effect of tPA, assuming a power of 80% and an α of 0.05, if such a study were repeated. We hoped that such an analysis might provide guidance concerning selection of outcome measures or end points for future phase II and phase III studies of therapy for acute ischemic stroke.
Subjects and Methods
The NINDS rtPA Stroke Trial was composed of 2 separate parts with nearly identical study methodology but with different goals.1 The focus of part I was identifying early activity of tPA, and the prespecified end point of interest was improvement of ≥4 points from the baseline National Institutes of Health Stroke Scale (NIHSS) to the 24-hour NIHSS or complete resolution (NIHSS score=0) at 24 hours (evidence of early activity of tPA). The goal of part II was to determine whether tPA was associated with a significantly better long-term functional and neurological outcome. The prespecified end point for part II was a favorable outcome measured from 4 neurological and functional scales dichotomized to clearly identify patients with minimal or no neurological or functional deficit: Rankin Scale score of 0 or 1,3 NIHSS score of 0 or 1,4 Glasgow Outcome Scale score of 1,5 or Barthel Index score of 95 to 100.6 A global statistic was used to test the effect of tPA on the likelihood of a favorable outcome compared with the placebo-treated group.7 In addition, detailed image analyses of CT images obtained at baseline, 24 hours, 7 to 10 days, and 3 months enabled comparisons of the total lesion volume between the 2 treatment groups.8
This exploratory analysis was designed to identify the most powerful binary end points with regard to detecting a treatment effect of tPA in the NINDS rtPA Stroke Trial. In other words, if we were starting the NINDS tPA Stroke Trial today, which end points or outcome measures would be the most likely to detect the activity or efficacy of tPA with the smallest sample size possible? To accomplish this, we wanted to consider all binary end points that could be constructed from the 4 clinical scales and imaging measures at 4 times after treatment: 2 hours, 24 hours, 7 to 10 days, and 3 months.
We conducted 2 separate analyses. The first analysis focused on detecting evidence of “early activity” of tPA and used outcome measures derived from 2- and 24-hour clinical and radiographic measures. The second analysis focused on longer-term outcome and “efficacy” and used outcome measures derived from 7- to 10-day and 3-month measures.
The analyses were performed with the use of the Classification and Regression Tree (CART) algorithm.9 CART is a nonparametric method of binary recursive partitioning. CART was used to construct variable partitions to classify patients into either the treatment or placebo group with a simple recursive tree structure. CART started with outcomes of interest (eg, a given NIHSS score at 24 hours, changes in the NIHSS score from baseline to 24 hours, and lesion volume at 24 hours). For each outcome (eg, NIHSS score at 2 hours), CART explored all the possible cutoff points (eg, values of 0, ≤1, and ≤42) and picked the end point (the cutoff point) that provided the greatest separation of the patients into the tPA-treated group and the placebo-treated group. CART then identified the end point with the greatest ability to separate the 2 treatment groups from among all the binary end points that were considered. This end point was used to divide patients into 2 subgroups. Within each of these 2 subgroups, CART again selected a cutoff point for every remaining outcome of interest and identified, from among these end points, the one with the greatest ability to divide the patients into treatment groups. The process continued until no further end point could be identified (see Figure 1 for an example). We then calculated the sample sizes for each end point or end point combination on the basis of 80% power, a 2-sided test, and an α of 0.05. We used the part I data from the NINDS rtPA Stroke Trial to explore all the possible end points using the CART approach and used the part II data to validate the results.
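The cut point search for a single outcome can be illustrated with a minimal sketch. It assumes that separation is scored by the decrease in Gini impurity, a common CART splitting criterion; the text above does not state which criterion was used, and the function names and toy data are hypothetical.

```python
from typing import List, Optional, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of 0/1 group labels (1 = tPA, 0 = placebo)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_cut_point(scores: List[float],
                   treated: List[int]) -> Tuple[Optional[float], float]:
    """Try every split of the form `score <= cut` and return the cut with the
    largest impurity decrease, together with that decrease.

    Assumption: impurity decrease (Gini) stands in for "greatest separation".
    Returns (None, 0.0) if no split improves on the unsplit sample.
    """
    n = len(scores)
    parent = gini(treated)
    best_cut: Optional[float] = None
    best_gain = 0.0
    # The largest observed value is skipped: it would leave one side empty.
    for cut in sorted(set(scores))[:-1]:
        left = [t for s, t in zip(scores, treated) if s <= cut]
        right = [t for s, t in zip(scores, treated) if s > cut]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain


# Hypothetical 24-hour scores for 8 patients (not trial data).
toy_scores = [0, 1, 2, 2, 10, 12, 15, 20]
toy_treated = [1, 1, 1, 1, 0, 0, 0, 0]
cut, gain = best_cut_point(toy_scores, toy_treated)
```

A full recursive partitioning would repeat this search over every candidate outcome, split the sample on the winning end point, and then recurse within each subgroup.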
We compared the sample size between the end points identified by this exploratory analysis and the original primary end points for parts I and II in the NINDS rtPA Stroke Trial. The sample size calculation was based on a χ² test of proportions. Since power is a function of sample size, if we had fixed sample size and computed power, holding constant the α and proportions of interest, the same end points would have been chosen. We chose instead to fix power and compute sample size to provide information that is more interpretable to the clinician.
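A projected sample size of this kind can be sketched as follows. The sketch assumes the usual normal-approximation formula for comparing 2 proportions without continuity correction, which corresponds to a χ² test of proportions; the exact formula used in the analysis is not given here, and the function name and example proportions are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Patients needed per group to detect p1 vs p2 with a 2-sided test.

    Assumed formula (no continuity correction):
        n = (z_a * sqrt(2 * pbar * (1 - pbar))
             + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))**2 / (p1 - p2)**2
    where pbar is the mean of p1 and p2.
    """
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0) or p1 == p2:
        raise ValueError("p1 and p2 must differ and lie strictly between 0 and 1")
    z_a = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # 2-sided critical value
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2.0
    numerator = (z_a * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_b * math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


# Hypothetical response rates for a candidate end point (not trial data).
n_per_group = sample_size_per_group(0.50, 0.30)
```

Because the required sample size falls as the difference between the 2 proportions grows, ranking end points by projected sample size at fixed power gives the same ordering as ranking them by power at fixed sample size.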
To explore how entry criteria may affect the choice of end point for a future trial based on the data from the NINDS rtPA Stroke Trial, we repeated the entire procedure including only patients with an NIHSS score ≥10 at baseline. This cutoff was selected because of a planned trial of intravenous/intra-arterial tPA (the Interventional Management of Stroke [IMS] Trial), in which patients will have treatment started within 3 hours of onset. In this trial, an NIHSS score ≥10 will be used to ensure a high likelihood of a visible intra-arterial clot at angiography.10
Results
Four of the 5 most powerful end points in detecting early activity of tPA involved the NIHSS (Table 1). The other end point was CT lesion volume ≤0.3 cm³ at 24 hours. The best overall single end point was NIHSS score ≤2 at 24 hours, which provided an odds ratio of 5.4 (95% CI, 2.4 to 12.1) and a projected sample size of 58 per group. By comparison, the original primary end point of part I of the NINDS rtPA Stroke Trial for early activity of tPA was improvement ≥4 points from the baseline NIHSS to the 24-hour NIHSS or complete resolution (NIHSS score=0) at 24 hours. According to this end point, we would project a needed sample size of 625 per group to reach 80% power using part I data and 573 using part II data under similar assumptions.
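For readers who want to see how an odds ratio and its confidence interval follow from a 2×2 table, a minimal sketch is given below. It assumes the Woolf (log odds ratio) method for the interval; the method used for the intervals reported here is not stated, and the function name and counts are hypothetical rather than trial data.

```python
import math
from statistics import NormalDist
from typing import Tuple


def odds_ratio_ci(a: int, b: int, c: int, d: int,
                  level: float = 0.95) -> Tuple[float, float, float]:
    """Odds ratio and confidence interval from a 2x2 table.

    a: tPA patients meeting the end point      b: tPA patients not meeting it
    c: placebo patients meeting the end point  d: placebo patients not meeting it

    Assumption: Woolf interval, exp(log(OR) +/- z * sqrt(1/a + 1/b + 1/c + 1/d)).
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this method")
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts (not trial data).
or_est, ci_low, ci_high = odds_ratio_ci(20, 10, 10, 20)
```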
A combination of 2 measures provided only slightly lower projected sample sizes in detecting early activity than use of a single measure (Figure 1). For example, NIHSS score ≤2 or change in NIHSS score ≥8 between baseline and 2 hours was associated with an odds ratio of 5.22 with a sample size of 40 per treatment group, assuming the same power and α level. The next best 2-measure combination in detecting differences between the 2 treatment groups was NIHSS score ≤2 (good outcome) or NIHSS score at 24 hours ≥26 (bad outcome; sample size, 47).
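A 2-measure combination of this kind is a single binary end point built from 2 criteria joined by "or". The sketch below shows one way to derive such a flag and the response rate in each group. It assumes that the NIHSS score ≤2 criterion refers to the 24-hour score and that "change ≥8" means an improvement of at least 8 points; the record layout, field names, and patients are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Patient:
    treated: bool        # True = tPA, False = placebo
    nihss_baseline: int
    nihss_2h: int
    nihss_24h: int


def meets_composite(p: Patient) -> bool:
    """Assumed reading: NIHSS <= 2 at 24 hours OR improvement of >= 8 points
    from baseline to 2 hours."""
    return p.nihss_24h <= 2 or (p.nihss_baseline - p.nihss_2h) >= 8


def response_rates(patients: Iterable[Patient]) -> Tuple[float, float]:
    """Proportion meeting the composite end point in (tPA, placebo) groups."""
    everyone: List[Patient] = list(patients)
    tpa = [p for p in everyone if p.treated]
    placebo = [p for p in everyone if not p.treated]
    if not tpa or not placebo:
        raise ValueError("both treatment groups must be non-empty")

    def rate(group: List[Patient]) -> float:
        return sum(meets_composite(p) for p in group) / len(group)

    return rate(tpa), rate(placebo)


# Hypothetical patients (not trial data).
toy_patients = [
    Patient(True, 12, 3, 5),     # meets the 2-hour improvement criterion
    Patient(True, 10, 9, 1),     # meets the 24-hour score criterion
    Patient(True, 15, 14, 12),   # meets neither
    Patient(False, 8, 8, 2),     # meets the 24-hour score criterion
    Patient(False, 20, 18, 19),  # meets neither
]
tpa_rate, placebo_rate = response_rates(toy_patients)
```

The 2 resulting proportions are the inputs that a sample size projection for the composite end point would require.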
Table 2 demonstrates the results when the best single end points for detecting early activity selected with part I data are applied to part II data. The 3 end points with the smallest projected sample size with the use of part I data were also sensitive end points when applied to part II data, but part II data indicate that the needed number of patients with the use of these end points in a proposed study would be somewhat higher. These 3 end points were NIHSS score ≤2 at 24 hours, change between baseline and 24-hour NIHSS score ≥15 points, and NIHSS score ≤5 at 2 hours. The projected sample sizes for these 3 end points as determined by part II data ranged from 96 to 121 per treatment group. Applying the top 3 combinations of 2 end points in part I of the trial to part II data resulted in projected sample sizes of 113 to 134, which were not smaller than projected sample sizes using single end points.
Table 3 presents the best end points with regard to detecting differences in longer-term outcome or efficacy. The 2 most sensitive end points involved the NIHSS (projected sample size, 70 to 73). A Rankin score of 0 or 1 at 3 months was the third most sensitive end point (projected sample size, 91). A combination of 2 measures did not appreciably change the needed sample size. For example, the best combination was a change in NIHSS score from baseline to 7 to 10 days of ≥24 or lesion volume at 3 months of ≤0.1 cm³ (odds ratio, 3.72; sample size, 62).
By comparison, the original primary end point for efficacy in part II of the NINDS tPA Stroke Trial was a composite end point of Rankin Scale score of 0 or 1, Glasgow Outcome Scale score of 1, Barthel Index score of 95 to 100, and NIHSS score of 0 or 1. Using the composite end point, we would project a needed sample size of 122 per group based on part I data and 232 per group based on part II data, using similar assumptions concerning α and power.
Table 4 illustrates results when the best single end points for detecting longer-term efficacy with part I data are applied to part II data. In general, the projected sample sizes are higher. Applying the combination of measures to part II data gave similar projected sample sizes compared with projected sample sizes with the use of single measures alone.
Table 5 demonstrates the substantial change in the best outcome measures when the analysis is limited to patients with NIHSS score ≥10 at baseline. NIHSS measures still consistently perform best, but the most discriminating NIHSS measures in this subgroup differ from those identified when the entire sample of patients is used. Of the best 5 end points as determined from part I data, only 2 measures were among the best as determined by part II data. These 2 measures were change in NIHSS score ≥15 (or return to NIHSS score of 0) from baseline to 24 hours and NIHSS score ≤2 at 24 hours. Using either of these 2 measures, one would expect to need 100 to 200 patients per treatment group for a phase II study of safety and early drug activity.
Discussion
We have found that the NIHSS score is an excellent outcome measure for use in phase II studies, in which the primary goal is to screen potential stroke therapies for early activity and safety as well as long-term outcome. Measures using the NIHSS were the most sensitive and consistent discriminators of the effectiveness of tPA in both parts of the NINDS rtPA Stroke Trial compared with the other clinical and radiological measures. In addition, use of an NIHSS score ≤2 at 24 hours was a more powerful measure of a treatment effect of tPA in this group of patients than the composite global end point at 3 months, the primary outcome measure of efficacy in part II of the NINDS rtPA Stroke Trial.1 7 A Rankin score of 0 or 1 at 3 months and at 7 to 10 days and a Barthel Index score of ≥95 at 7 to 10 days were other sensitive measures of longer-term efficacy in addition to NIHSS-based end points.
The power of the NIHSS outcome measure to detect a treatment effect has several possible explanations. First, the NIHSS has been shown to be reproducible, valid, and an excellent predictor of long-term outcome.4 11 12 13 Second, the 42-point NIHSS is more sensitive to change than global scales such as the modified Rankin, which has only 5 levels of change. Finally, great care was taken in the NINDS rtPA Stroke Trial to ensure that all treating investigators could perform the NIHSS reliably and correctly before the trial was started and as the trial progressed.1 13 Such attention to performance of the NIHSS reduces variability and makes change in the score more likely to be related to real changes in the patient’s condition.
The most powerful measure of the effectiveness of tPA during the first 3 months after stroke was a given level or amount of change in the NIHSS during the first 24 hours, or even the first 2 hours, after start of treatment. This finding supports the clinical observation that dramatic improvement in neurological function during the first hours after treatment may be an excellent marker of a treatment effect.14 In the original NINDS tPA Pilot Trial, some patients improved so dramatically during the first several hours that they were termed “on-the-table responders” by the investigators. Our analysis appears to confirm this clinical impression of an early treatment effect in some patients.
The CART method examines only possible binary cut points and does not evaluate other nonbinary end points (eg, comparing the median NIHSS in 2 treatment groups at 24 hours). Binary end points work best when outcome measures are not distributed normally but are skewed or clustered. The distribution of measures of outcome after stroke, using scales such as the Barthel Index, Rankin Scale, and NIHSS, is quite skewed. Patient scores on these scales tend to cluster at the very good and bad ends of the given scale, with relatively fewer patients in the middle of the scale. Figure 2 demonstrates just such a distribution, using the Barthel Index at 3 months in tPA-treated and placebo patients from the NINDS tPA Stroke Trial. The distribution of radiographic end points such as CT volume of brain infarction is also skewed. Thus, the CART method should be a useful tool to explore clinical and radiographic end points in acute stroke studies. The J- or U-shaped distribution of scale scores such as the Barthel Index, or the skewed distribution of volumes of infarction as measured by CT, emphasizes that investigators must consider carefully the distribution of scale scores from similar previous trials, the desired clinically relevant outcomes, and the expected effect of a therapy when choosing study end points.
Global outcome measures using multiple appropriate end points have been shown to increase the power of a study when individual end points are correlated in the same direction, compared with using only one of the individual end points.7 Such an approach was used successfully for part II of the NINDS trial. Retrospectively, the global outcome method was applied to the European Cooperative Acute Stroke Study (ECASS) I and gave a positive result.15 However, the global outcome approach is only as good as the individual outcome measures. The more powerful the individual end points, the better the global outcome approach should be. Using the validated results of an exploratory CART analysis, one could improve the selection of individual end points for a global outcome measure.
The NIHSS criterion identified in this exploratory analysis (eg, NIHSS score ≤2 at 24 hours) would be appropriate as an end point in a proposed phase II trial of recanalization begun within the first 3 hours after stroke onset with inclusion and exclusion criteria similar to those in the NINDS trial. Different end points may be more appropriate in other proposed trials depending on the entry criteria, the time to treatment, and the suspected action of the drug. For example, when we included only patients with NIHSS score ≥10, the projected best end points differed from those identified when the whole patient sample was used.
Comparison of end points in the ECASS II16 and NINDS rtPA Stroke Trial1 illustrates how selection of an end point depends on the time to treatment and expected action of the drug. In the NINDS rtPA Stroke Trial, in which patients were treated within 3 hours, a Rankin score of 0 or 1 at 3 months was the third most sensitive end point with regard to long-term efficacy. Using this end point would require only 91 patients per treatment group to detect a benefit for tPA in part I of the tPA Stroke Trial. A Rankin score of 0 to 2 at 3 months was a much less sensitive end point for detecting a difference and would require 212 patients per treatment group.
By contrast, the ECASS II study treated patients within 6 hours of onset and used a Rankin score of 0 or 1 as the primary study outcome.16 When this end point was used, the ECASS II study was negative. When the ECASS II study was analyzed post hoc with an end point of Rankin score of 0 to 2, a benefit for tPA-treated patients was demonstrated. The Prolyse in Acute Cerebral Thromboembolism (PROACT) II trial, an intra-arterial study that also treated patients within 6 hours, also found a Rankin score of 0 to 2 a more sensitive measure of the benefit of prourokinase than a Rankin score of 0 to 1.17 The different sensitivities of different cut points for the Rankin Scale make biological sense. Thrombolytic therapy administered soon after onset of ischemia, as in the NINDS tPA Stroke Trial, would be more likely to open the occluded artery more quickly and to salvage brain tissue than therapy given at a later time. Such patients treated sooner would be more likely to return to normal or near normal, as measured by a Rankin score of 0 or 1. Patients who receive thrombolytic therapy after 3 hours of symptom onset would be less likely to have sparing of ischemic brain but could still accrue some benefit. Such patients may be less likely to return to normal than if treated with intravenous tPA within 3 hours but may have a shift in their disability from severe to mild or moderate or a Rankin score of 0 to 2. For a future study of a recanalization therapy begun within 3 hours of onset, a Rankin score of 0 or 1 would likely be a better 3-month end point than a Rankin score of 0 to 2. For a study of neuroprotection given at a later time window, another cut point may be better.
Our study also demonstrates the potential problems of the use of exploratory post hoc analyses from small to moderate studies to determine end points for subsequent trials without subsequently validating the results prospectively.18 19 20 A definitively positive trial like the NINDS trial, as determined by both primary and secondary end points, is the ideal way to explore the sensitivities of various end points. Many of the end points that were quite sensitive to the early activity of tPA and long-term effectiveness in part I patients were less sensitive with the use of part II data. Ideally, one should choose an end point that is consistently sensitive in detecting a treatment effect in multiple studies, such as in both part I and part II of the NINDS tPA Stroke Trial. The reasons underlying the different sensitivities of various end points in part I and part II of the trial may relate in part to overall differences in baseline characteristics between the patients in part I or part II that have been shown to be related to long-term outcome (eg, baseline NIHSS score or age) or to random error. We plan to explore this issue further in the NINDS data set.
Some investigators have suggested that radiological end points, particularly changes in MR diffusion/perfusion, may be used as surrogate end points in acute stroke studies and require fewer patients to demonstrate the activity of a given therapy.21 22 23 In our study high-quality analysis of the volume of brain infarction as measured by CT was not as sensitive to a treatment effect as the clinical scale measures. To equal the sensitivities of the NIHSS within the 3-hour time window, MRI measures would need to be able to detect differences between treatment groups with fewer than 60 to 70 patients per group. Certainly MRI will never surpass neurological scales such as the NIHSS in terms of speed, ease of use, lack of missing data, and cost. In addition, the time to obtain MRI, even with echo planar imaging, will delay the time to treatment for at least the near future. Whether end points using diffusion or diffusion/perfusion MRI can be a more sensitive measure of treatment effect in a clinical trial than the NIHSS during the first 24 hours is a hypothesis that needs to be tested.
This study was supported by National Institutes of Health contracts NO1-NS-02382, NO1-NS-02374, NO1-NS-02377, NO1-NS-02379, NO1-NS-02373, NO1-NS-02378, NO1-NS-02376, and NO1-NS-02380.
- Received March 30, 2000.
- Revision received July 18, 2000.
- Accepted July 18, 2000.
- Copyright © 2000 by American Heart Association
Lyden PD, Hanston L. Assessment scales for the evaluation of stroke patients. J Stroke Cerebrovasc Dis. 1998;7:113–127.
Van Swieten JC, Koudstaal PJ, Visser MC, Schouten HJA, van Gijn J. Interobserver agreement for the assessment of handicap in stroke patients. Stroke. 1988;19:604–607.
Brott TG, Adams HP Jr, Olinger CP, Marler JR, Barsan WG, Biller J, Spilker J, Holleran R, Eberle R, Hertzberg V. Measurements of acute cerebral infarction: a clinical examination scale. Stroke. 1989;20:864–870.
Tilley BC, Marler JR, Geller NL, Lu M, Leger J, Brott T, Lyden P, Grotta J. Use of a global test for multiple outcomes in stroke trials with application to the National Institute of Neurological Disorders and Stroke t-PA Stroke Study Trial. Stroke. 1996;27:2136–2142.
NINDS rt-PA Stroke Study Group. Effect of intravenous rt-PA on ischemic stroke lesion size measured by computed tomography. Stroke. In press.
Breiman L, Friedman J, Stone CJ. Classification and Regression Trees. New York, NY: Chapman and Hall; 1984.
Lewandowski CA, Frankel M, Tomsick TA, Broderick J, Frey J, Clark W, Starkman S, Grotta J, Spilker J, Khoury J, Brott T, and the EMS Bridging Trial Investigators. Combined intravenous and intra-arterial r-TPA versus intra-arterial therapy of acute ischemic stroke: Emergency Management of Stroke (EMS) Bridging Trial. Stroke. 1999;30:2598–2605.
Lyden P, Lu M, Jackson C, Marler JR, Kothari R, Brott T, Zivin J, and the NINDS t-PA Stroke Trial Investigators. Underlying structure of the National Institutes of Health Stroke Scale: results of a factor analysis. Stroke. 1999;30:2347–2354.
Adams HP Jr. Baseline NIH Stroke Scale score strongly predicts outcome after stroke: a report of the Trial of Org 10172 in Acute Stroke Treatment (TOAST). Stroke. 1999;30:2496. Abstract.
Lyden P, Brott T, Tilley B, Welch KM, Mascha EJ, Levine S, Haley EC, Grotta J, Marler J, for the NINDS t-PA Stroke Study Group. Improved reliability of the NIH Stroke Scale using video training. Stroke. 1994;25:2220–2226.
Brott T, Haley EC, Levy DE, Barsan W, Broderick J, Sheppard G, Spilker J, Kongable G, Reed R, Marler J. Urgent therapy for stroke: pilot study of tissue plasminogen activator administered within 90 minutes. Stroke. 1992;23:632–640.
Hacke W, Bluhmki E, Steiner T, Tatlisumak T, Mahagne M, Sacchetti M, Meier D. Dichotomized efficacy endpoints and global endpoint analysis applied to the ECASS intention-to-treat data set: post hoc analysis of ECASS I. Stroke. 1998;29:2073–2075.
Hacke W, Kaste M, Fieschi C, von Kummer R, Davalos A, Meier D, Larrue V, Bluhmki E, Davis S, Donnan G, Schneider D, Diez-Tejedor E, Trouillas P, for the Second European-Australian Acute Stroke Study Investigators. Randomised, double-blind, placebo-controlled trial of thrombolytic therapy with intravenous alteplase in acute ischemic stroke (ECASS II). Lancet. 1998;352:1245–1251.
Furlan A, Higashida R, Wechsler L, Gent M, Rowley H, Kase C, Pessin M, Ahuja A, Callahan F, Clark W, Silver F, Rivera F, for the PROACT Investigators. Intra-arterial prourokinase for acute ischemic stroke: the PROACT II Study: a randomized controlled trial. JAMA. 1999;282:2003–2011.
Diener HC, Hacke W, Hennerici M, Rådberg J, Hantson L, DeKeyser J, for the Lubeluzole International Study Group. Lubeluzole in acute ischemic stroke: a double-blind, placebo-controlled phase II trial. Stroke. 1996;27:76–81.
Clark WM, Williams BJ, Selzer KA, Zweifler RM, Sabounjian LA, Gammans RE. A randomized efficacy trial of citicoline in patients with acute ischemic stroke. Stroke. 1999;30:2592–2597.
Warach S, Boska M, Welch K. Pitfalls and potential of clinical diffusion-weighted MR imaging in acute stroke. Stroke. 1997;28:481–482.
Darby DG, Barber PA, Gerraty RP, Desmond PM, Yang Q, Parsons M, Li T, Tress BM, Davis SM. Pathophysiological topography of acute ischemia by combined diffusion-weighted and perfusion MRI. Stroke. 1999;30:2043–2052.
Albers GW. Expanding the window for thrombolytic therapy in acute stroke: the potential role of acute MRI for patient selection. Stroke. 1999;30:2230–2237.