(Stroke. 2006;37:2644.)
© 2006 American Heart Association, Inc.
Comments, Opinions, and Reviews |
From the Department of Molecular and Experimental Medicine, The Scripps Research Institute, La Jolla, Calif.
Correspondence to James A. Koziol, PhD, The Scripps Research Institute, 10550 North Torrey Pines Road, MEM216, La Jolla, CA 92037. E-mail koziol{at}scripps.edu
| Abstract |
|---|
Methods: We provide an abbreviated discussion of analytical methods for ordinal scales and related effect size measures; we illustrate these methods and their interpretation with outcome data from the SAINT I study of NXY-059 in acute ischemic stroke.
Results: The nonparametric Mann-Whitney statistic provides a straightforward method for analysis of the modified Rankin Scale, and incorporates associated measures of effect size. These measures are directly related to the concepts of Number Needed to Treat and Number Needed to Harm.
Conclusions: Our re-examination of the outcome data from the SAINT I study provides little evidence for the purported efficacy of NXY-059. More broadly, analysis and interpretation of ordinal outcome scales based on numerical values ascribed to the steps of the scale should be done cautiously. Statistical treatment of multiple primary outcome measures in acute stroke clinical trials should be established before analysis. Lastly, conflating statistical and clinical significance should be avoided.
Key Words: Mann-Whitney statistics; numbers needed to treat or harm
| Introduction |
|---|
| Primary Efficacy End Point of the SAINT I Study |
|---|
We present the mRS primary outcome of the SAINT I study in Table 2, which was derived from the authors' Figure 2. The efficacy analysis included 849 patients in the placebo arm and 850 patients in the NXY-059 arm. Note that the authors grouped patients with mRS=5 or 6; the individual breakdown is not available.
The authors declared that "NXY-059 significantly improved the overall distribution of scores on the modified Rankin scale, as compared with placebo (P=0.038 by the Cochran-Mantel-Haenszel test)". We are unable to verify their analysis because their procedure relies on prior knowledge of stratification factors that are also unavailable to us. Nevertheless, the data in Table 2 are prototypical of ordinal outcome measures in stroke clinical trials and will be used for illustrative purposes. Moreover, insofar as "prognostic factors that influence stroke outcome were well matched and showed no interaction with the treatment effect" in the SAINT I study, our reanalyses will provide additional insight into the trial findings.
| Nonparametric Analysis of Efficacy |
|---|
Incidentally, if we ignore the categorical nature of these data, and merely compare mean mRS scores on the 2 arms with a standard 2-sample t test, we have: mean mRS=2.841 (SD 1.75) on the placebo arm, mean mRS=2.713 (SD 1.80) on the NXY-059 arm, t(1697)=1.484, 1-sided P=0.069, 2-sided P=0.138. With this large sample size, a Z test could also have been used with equivalent results. Not surprisingly, the nonparametric and parametric approaches here are in close accord.3
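As a rough check on the figures above, the large-sample Z test can be reproduced from the quoted summary statistics alone. The sketch below is ours, not the authors' code: it assumes the rounded means and SDs given in the text, so it should agree with the reported statistic only up to rounding, and the function name `two_sample_z` is illustrative.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Large-sample Z statistic and P values for a difference in means."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)         # standard error of the difference
    z = (mean1 - mean2) / se
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail probability P(Z > z)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))    # P(|Z| > |z|)
    return z, p_one_sided, p_two_sided

# Summary statistics as quoted in the text (placebo first, then NXY-059).
z, p1, p2 = two_sample_z(2.841, 1.75, 849, 2.713, 1.80, 850)
print(f"z = {z:.3f}, 1-sided P = {p1:.3f}, 2-sided P = {p2:.3f}")
```

Only the standard library is used here; the normal tail area comes from `math.erfc`, which is adequate because the sample is large enough for the t and normal reference distributions to be practically indistinguishable.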
| Effect Size Measure Based on the Mann-Whitney Statistic |
|---|
In the context of the SAINT I study, let $X_1, X_2, \ldots, X_{849}$ denote the mRS scores of the 849 patients on the placebo arm, and let $Y_1, Y_2, \ldots, Y_{850}$ denote the mRS scores of the 850 patients on the NXY-059 arm. The Mann-Whitney statistic $U$ is defined as
$$U = \sum_{i=1}^{m} \sum_{j=1}^{n} U_{ij},$$
where $U_{ij}$=1, 1/2, or 0 depending on whether $Y_j$ is greater than, equal to, or less than $X_i$, m=849, and n=850. Then, U/mn is an empirical estimate of the effect size measure $\theta$=Pr(Y>X)+1/2 Pr(Y=X) (Pr denoting probability). Again in the present context, larger mRS scores are more deleterious than smaller scores, so a measure of efficacy of NXY-059 relative to placebo would be $1-\theta$=Pr(X>Y)+1/2 Pr(X=Y). This is simply the probability that an individual randomly assigned to placebo would have a larger (more deleterious) mRS outcome than an individual randomly assigned to NXY-059, plus one-half the probability that there would be no difference in mRS outcome.
With continuous response variables X and Y, the probability of tied data, Pr(X=Y), would be 0, and the effect size measure $\theta$=Pr(Y>X) is immediately interpretable as a measure of the separation of the 2 underlying distributions. Because there is considerable overlap of mRS scores in the arms of the SAINT I study, it is perhaps more reasonable here to consider the 3 measures Pr(Y>X) (better outcomes on placebo), Pr(X>Y) (better outcomes on NXY-059), and Pr(X=Y) (no difference in outcome). Acion et al have termed these measures probabilistic indices, and argued that they provide intuitive and simple effect size measures with categorical outcomes.4
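To make the pairwise definition concrete, the following minimal sketch counts the pairs directly and derives the three probabilistic indices from the same counts. The scores are hypothetical and chosen only for illustration (they are not the SAINT I data), and the function name `mann_whitney_indices` is ours.

```python
from itertools import product

def mann_whitney_indices(x, y):
    """Return (U, U/mn, Pr(Y>X), Pr(X>Y), Pr(X=Y)) from raw scores."""
    m, n = len(x), len(y)
    greater = sum(1 for xi, yj in product(x, y) if yj > xi)   # pairs scoring 1
    ties = sum(1 for xi, yj in product(x, y) if yj == xi)     # pairs scoring 1/2
    u = greater + 0.5 * ties           # U: sum of the pair scores over all m*n pairs
    effect = u / (m * n)               # empirical Pr(Y>X) + 1/2 Pr(Y=X)
    pr_y_gt_x = greater / (m * n)
    pr_tie = ties / (m * n)
    pr_x_gt_y = 1.0 - pr_y_gt_x - pr_tie
    return u, effect, pr_y_gt_x, pr_x_gt_y, pr_tie

# Hypothetical mRS scores for illustration only (NOT the SAINT I data).
placebo = [0, 1, 2, 3, 3, 4, 5]   # plays the role of X
treated = [0, 1, 1, 2, 3, 4]      # plays the role of Y
u, effect, p_gt, p_lt, p_tie = mann_whitney_indices(placebo, treated)
print(u, round(effect, 3), round(p_gt, 3), round(p_lt, 3), round(p_tie, 3))
```

The brute-force loop over all pairs is quadratic in the sample sizes; for trial-sized data with only a handful of categories, working from category counts is cheaper and gives the same result.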
Our point estimates of these measures from the SAINT I study are: Pr(Y>X)=0.392, Pr(X>Y)=0.431, and Pr(X=Y)=0.177. However, it is desirable to report CIs for these measures. Newcombe gives a nice discussion of confidence interval construction for measures derived from the Mann-Whitney statistic5,6; we here adopted a bootstrap approach, which has reasonable properties with a minimum of model assumptions. We first drew 10 000 random samples with replacement of sizes 849 and 850 using the empirical distributions for placebo and NXY-059 respectively, as given in Table 2, and calculated the 3 measures Pr(X>Y), Pr(X=Y), and Pr(Y>X) from each of the random samples. With the bootstrap percentiles,7 we then found the joint 95% CIs for these 3 measures to be 0.404 to 0.460, 0.173 to 0.182, and 0.364 to 0.419.
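A bootstrap of this kind might be sketched as follows. Two assumptions should be flagged: the category counts are hypothetical placeholders with the right arm totals (the Table 2 counts are not reproduced here), and the intervals are simple per-measure percentile intervals, since the exact construction of the joint intervals is not spelled out in the text. All function names are ours.

```python
import random
from collections import Counter

def indices_from_counts(cx, cy):
    """Pr(Y>X), Pr(X>Y), Pr(X=Y) from per-category counts (same category order)."""
    total = sum(cx) * sum(cy)
    y_gt = sum(cx[i] * sum(cy[i + 1:]) for i in range(len(cx)))  # pairs with Y above X
    tie = sum(a * b for a, b in zip(cx, cy))                     # pairs in the same category
    return y_gt / total, (total - y_gt - tie) / total, tie / total

def resample_counts(counts, rng):
    """Draw sum(counts) observations with replacement from the empirical distribution."""
    k = len(counts)
    drawn = Counter(rng.choices(range(k), weights=counts, k=sum(counts)))
    return [drawn.get(i, 0) for i in range(k)]

def bootstrap_percentile_ci(cx, cy, reps=2000, level=0.95, seed=0):
    """Percentile CI for each of the three indices, computed one measure at a time."""
    rng = random.Random(seed)
    samples = [indices_from_counts(resample_counts(cx, rng), resample_counts(cy, rng))
               for _ in range(reps)]
    lo_idx = int(reps * (1 - level) / 2)
    hi_idx = int(reps * (1 + level) / 2) - 1
    intervals = []
    for column in zip(*samples):
        ordered = sorted(column)
        intervals.append((ordered[lo_idx], ordered[hi_idx]))
    return intervals

# Hypothetical counts for mRS 0, 1, 2, 3, 4, 5-6 (NOT the SAINT I Table 2 data).
placebo_counts = [90, 160, 130, 140, 190, 139]   # sums to 849
treated_counts = [110, 170, 130, 130, 180, 130]  # sums to 850
point = indices_from_counts(placebo_counts, treated_counts)
cis = bootstrap_percentile_ci(placebo_counts, treated_counts, reps=1000)
for name, p, (lo, hi) in zip(("Pr(Y>X)", "Pr(X>Y)", "Pr(X=Y)"), point, cis):
    print(f"{name}: {p:.3f} ({lo:.3f} to {hi:.3f})")
```

Resampling category counts rather than individual scores keeps each replicate cheap; raising `reps` to 10 000 would match the number of replicates described in the text.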
| Number Needed to Treat |
|---|
Unfortunately, in the present setting, the determination of a nominal 95% CI is fraught with difficulties. We calculated a bootstrap 95% CI for the difference in mean mRS (placebo minus treatment) as -0.026 to 0.289. As above, the bootstrap CI was based on 10 000 replicate samples. If the lower limit of the 95% CI for the mean difference in mRS had been positive, then we would merely have inverted the 2 limits of this CI to derive the corresponding CI for NNT. But because the CI for the mean difference in mRS includes 0 (that is, no treatment effect), the inverted "CI" will include the value infinity (1/0), and the negative limit -0.026 of the CI is indicative of "harm" with NXY-059, not efficacy.
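The inversion problem can be illustrated with a short sketch. It assumes one common convention for a difference CI that spans 0, namely reporting a benefit range and a harm range that each extend to infinity; the function name `nnt_interval` and the dictionary layout are ours. The limits passed in are those quoted in the text, with the lower limit taken as negative, as the text describes.

```python
import math

def nnt_interval(diff_lower, diff_upper):
    """Invert a CI for a treatment difference (positive = benefit) into NNT terms.

    Returns a dict with a range for NNT (benefit) and/or NNH (harm); an upper
    bound of math.inf marks a range that runs through 'no effect'.
    """
    if diff_lower > diff_upper:
        raise ValueError("diff_lower must not exceed diff_upper")
    result = {}
    if diff_lower > 0:                       # whole CI favours treatment
        result["NNT"] = (1 / diff_upper, 1 / diff_lower)
    elif diff_upper < 0:                     # whole CI favours control
        result["NNH"] = (1 / -diff_lower, 1 / -diff_upper)
    else:                                    # CI includes 0: unbounded ranges
        if diff_upper > 0:
            result["NNT"] = (1 / diff_upper, math.inf)
        if diff_lower < 0:
            result["NNH"] = (1 / -diff_lower, math.inf)
    return result

# Bootstrap CI for the difference in mean mRS (placebo minus treatment) from the text.
ranges = nnt_interval(-0.026, 0.289)
print(ranges)
```

With an interval that excludes 0, such as 0.1 to 0.2, the same function simply swaps and inverts the limits, which is the straightforward case described in the text.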
There is an alternative formulation of the concept of NNT with ordinal data, developed explicitly for the mRS.10 With the notation introduced in the previous section, Saver would define NNT as 1/Pr(X>Y), and number needed to harm (NNH) as 1/Pr(Y>X). Using Saver's criteria, we immediately calculate joint 95% CIs as 2.17 to 2.48 for NNT (better outcome on NXY-059 than placebo), and 2.39 to 2.75 for NNH (better outcome on placebo than NXY-059).
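These intervals follow from taking reciprocals of the limits for the corresponding probabilistic indices, with the limits swapping places. A minimal check, using the interval limits reported earlier (reading the first limit for Pr(X>Y) as 0.404) and a helper name of our own choosing:

```python
def reciprocal_interval(prob_lower, prob_upper):
    """Map a CI for a probability onto a CI for its reciprocal (the limits swap)."""
    if not 0 < prob_lower <= prob_upper <= 1:
        raise ValueError("need 0 < prob_lower <= prob_upper <= 1")
    return 1 / prob_upper, 1 / prob_lower

# CIs for the probabilistic indices as reported in the text.
nnt = reciprocal_interval(0.404, 0.460)   # from Pr(X>Y): better outcome on NXY-059
nnh = reciprocal_interval(0.364, 0.419)   # from Pr(Y>X): better outcome on placebo
print([round(v, 2) for v in nnt], [round(v, 2) for v in nnh])
```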
It is apparent that invocation of the concept of NNT with ordinal data, as with the outcome variable of the SAINT I study (Table 2), is less transparent than with binomial data: in particular, we must also entertain the real possibility of indifference, that is, no difference in outcome between the competing regimens (here NXY-059 and placebo), as well as the converse possibility, reflected in the NNH. We remark that the "continuous variable" formulation used by Lees et al for NNT is highly dependent on the scoring scheme used for the categories: we could have weighted the categorical scores differently, resulting in different NNT values. In contrast, the Saver measures would not be affected, so long as the transformations in scores were monotone (because the underlying Mann-Whitney statistic is invariant to monotone transformations of the data). Regardless, whenever the CI for the "difference parameter" includes 0 (no efficacy), construction and interpretation of CIs for NNT are troublesome. Indeed, had Lees et al followed conventional practice by constructing a CI for the difference in mRS between the NXY-059 and placebo arms, this would have alerted them and interested readers to the problematic nature of the efficacy of NXY-059 based on the concept of NNT.
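The dependence on the scoring scheme can be checked numerically. In the sketch below, both the scores and the alternative monotone scoring are hypothetical (neither comes from the trial): rescoring changes the mean difference, and hence any NNT derived from it, while the pairwise comparison probabilities stay exactly the same.

```python
def indices(x, y):
    """Pr(Y>X), Pr(X>Y), Pr(X=Y) over all pairs of raw scores."""
    pairs = [(xi, yj) for xi in x for yj in y]
    total = len(pairs)
    y_gt = sum(yj > xi for xi, yj in pairs) / total
    x_gt = sum(xi > yj for xi, yj in pairs) / total
    return y_gt, x_gt, 1 - y_gt - x_gt

def mean(values):
    return sum(values) / len(values)

# Hypothetical mRS scores (NOT trial data) and a hypothetical monotone rescoring
# that stretches the distance between the more severe categories.
rescore = {0: 0, 1: 1, 2: 2, 3: 4, 4: 7, 5: 12}
placebo = [0, 1, 2, 3, 3, 4, 5]
treated = [0, 1, 1, 2, 3, 4]
placebo_rescored = [rescore[s] for s in placebo]
treated_rescored = [rescore[s] for s in treated]

diff_original = mean(placebo) - mean(treated)
diff_rescored = mean(placebo_rescored) - mean(treated_rescored)
print("mean difference:", round(diff_original, 3), "vs", round(diff_rescored, 3))
print("indices unchanged:", indices(placebo, treated) == indices(placebo_rescored, treated_rescored))
```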
| Statistical Versus Clinical Significance |
|---|
If it is agreed that the mRS provides the appropriate numerical scale on which to base efficacy judgments for clinical trials such as SAINT I, then we suggest that a moderate effect size of 0.5,11 equivalently, a "one-half standard deviation" rule12,13 might be invoked as indicative of "clinical significance". According to this guideline, an observed difference of about 0.8 units (1/2 SD, from Table 2) on the mRS scale would be suggestive of clinical significance. In contrast, the SAINT I study has an observed effect size of 0.17 (0.13/pooled SD) in favor of NXY-059, quite modest by these standards.14 Even if one accepts the authors' declaration of statistical significance for the shift in mRS scores favoring NXY-059, they appear to have conflated statistical with clinical significance.
| Choice of Outcome Measures |
|---|
Now, it is not at all unusual to have 2 or more primary outcomes in Phase III clinical trials in stroke: indeed, the National Institute of Neurological Disorders and Stroke (NINDS) recombinant tissue plasminogen activator stroke trial incorporated 4 neurological and functional scales (Rankin, NIHSS, Glasgow Outcome Scale, and Barthel Index) in the primary assessment of whether recombinant tissue plasminogen activator was associated with a significantly improved long-term functional and neurological outcome.17 On the other hand, it is highly unusual not to adjust statistically for the 2 primary outcomes of SAINT I, either by declaring a priori that one or the other outcome would need to achieve statistical significance at the α=0.05/2=0.025 level when tested individually in order to declare statistical significance for the trial at the overall α=0.05 level (this being a simple Bonferroni correction for the multiple testing), or by combining the 2 outcome measures into 1 global statistic, assessed at the overall α=0.05 level.18 Unfortunately, Lees et al followed neither standard, instead reporting uncorrected P values for their 2 primary outcome measures; they then imputed statistical significance for the trial solely on the basis of their analysis of mRS outcomes. This declaration is puzzling, especially as the underlying clinical protocol would likely have specified a statistical adjustment to obviate inflation of Type I error with 2 primary outcome variables, particularly when reporting findings to regulatory agencies.
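The Bonferroni arithmetic is simple enough to state as a short sketch. The only trial figure used is the mRS P value quoted from the SAINT I report earlier in this article; the function name `bonferroni_threshold` is ours.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the overall Type I error at or below alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

threshold = bonferroni_threshold(0.05, 2)   # 2 primary outcomes, overall level 0.05
mrs_p = 0.038                               # mRS P value quoted from the SAINT I report
print(f"per-test level = {threshold}, mRS significant after correction: {mrs_p <= threshold}")
```

Under this correction the quoted mRS P value would not reach the per-test level, which is the point of the argument above.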
When choosing outcome scales for acute stroke clinical trials, it would be prudent to select scales with proven reliability, validity, and responsiveness (that is, sensitivity to clinical change).19 We refer interested readers to New and Buchbinder20 for discussion of the mRS from a clinimetric viewpoint (and also, an extensive and relevant bibliography), and Bruno et al,21 for discussion of the NIHSS.
| Conclusions |
|---|
Multiple primary end points in acute stroke clinical trials are not uncommon. One would hope the various outcome scales are congruent and statistical adjustment for multiple end points is appropriate. Lastly, conflation of statistical and clinical significance should be avoided.
| Acknowledgments |
|---|
Sources of Funding
This research was supported in part by the Stein Endowment Fund, Department of Molecular and Experimental Medicine, The Scripps Research Institute.
Disclosures
None.
Received March 21, 2006; revision received June 27, 2006; accepted July 19, 2006.
| References |
|---|
2. van Swieten JC, Koudstaal PJ, Visser MC, Schouten HJA, van Gijn J. Interobserver agreement for the assessment of handicap in stroke patients. Stroke. 1988; 19: 604–607.
3. Moses LE, Emerson JD, Hosseini H. Analyzing data from ordered categories. New Engl J Med. 1984; 311: 442–448.
4. Acion L, Peterson JJ, Temple S, Arndt S. Probabilistic index: an intuitive non-parametric approach to measuring the size of treatment effects. Statist Med. 2006; 25: 591–602.
5. Newcombe RG. Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: General issues and tail-area-based methods. Statist Med. 2006; 25: 543–557.
6. Newcombe RG. Confidence intervals for the effect size measure based on the Mann-Whitney statistic. Part 2: Asymptotic methods and evaluation. Statist Med. 2006; 25: 559–573.
7. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman & Hall, New York, 1993.
8. Cook RJ, Sackett DL. The number needed to treat: a clinically useful measure of treatment effect. Brit J Med. 1995; 310: 452–454.
9. Altman DG. Confidence intervals for the number needed to treat. Brit J Med. 1998; 317: 1309–1312.
10. Saver JL. Number needed to treat estimates incorporating effects over the entire range of clinical outcomes. Arch Neurol. 2004; 61: 1066–1070.
11. Cohen J. Statistical Power Analysis for the Behavioral Sciences, Revised Edition. New York: Academic Press; 1977.
12. Norman GR, Sloan JA, Wyrwich KW. Interpretation of changes in health-related quality of life: the remarkable universality of half a standard deviation. Medical Care. 2003; 5: 582–592.
13. Sloan JA, Dueck A. Issues for statisticians in conducting analyses and translating results for quality of life end points in clinical trials. J Biopharm Statist. 2004; 14: 73–96.
14. Del Zoppo GJ. Stroke and neurovascular protection. New Engl J Med. 2006; 354: 553–555.
15. Dromerick AW, Edwards DF, Diringer MN. Sensitivity to changes in disability after stroke: A comparison of four scales useful for clinical trials. J Rehabil Res Dev. 2003; 40: 1–8.
16. Young FB, Weir CJ, Lees KR. Comparison of the National Institutes of Health Stroke Scale with disability outcome measures in acute stroke trials. Stroke. 2005; 36: 2187–2192.
17. NINDS rt-PA Stroke Study Group. Tissue plasminogen activator for acute ischemic stroke. New Engl J Med. 1995; 333: 1581–1587.
18. Tilley BC, Marler J, Geller NL, Lu M, Legler J, Brott T, Lyden P, Grotta J. Use of a global test for multiple outcomes in stroke trials with application to the National Institute of Neurological Disorders and Stroke t-PA Stroke Trial. Stroke. 1996; 27: 2136–2142.
19. Streiner D, Norman G. Health Measurement Scales: A Practical Guide to Their Development and Use, 3rd ed. New York: Oxford University Press; 2003.
20. New PW, Buchbinder R. Critical appraisal and review of the Rankin scale and its derivatives. Neuroepidemiology. 2006; 26: 4–15.
21. Bruno A, Saha C, Williams LS. Using change in the National Institutes of Health Stroke Scale to measure treatment effect in acute stroke trials. Stroke. 2006; 37: 920–921.