Response to Letter by Saver
The principal purpose of our article1 is didactic: we present a valid, straightforward method for comparing outcomes in 2 treatment groups if the basis of comparison is an ordinal response scale. The technique, based on the Mann-Whitney nonparametric statistic, is readily available to practitioners in most elementary statistics software programs. We illustrated the technique with data reproduced from the SAINT I trial of NXY-059 in acute ischemic stroke.2
In fact, the motivation for our article was the SAINT I publication.2 A hallmark of rigorous science is reproducibility: can reported findings in the scientific literature be replicated independently? On reading the New England Journal of Medicine article, we were dismayed that the authors failed to provide sufficient information to allow us to verify their assertion of the “statistical significance” of one of the primary outcomes of the SAINT I trial, the shift in modified Rankin Scale (mRS) values in the NXY-059 intervention group relative to the control group. Indeed, even if we had been provided the raw data accruing from the SAINT I trial, we would have been unable to validate their declaration of statistical significance, because we would have had to make educated guesses about how they undertook their stratified analysis. Though this may well be a reflection of our failings as data analysts, our frustration certainly would have been less palpable had the authors merely followed CONSORT guidelines3 for statistical reports of clinical trials.
A corollary development of the Mann-Whitney approach to the analysis of ordinal data is the construction of probabilistic indices, which represent simple effect size measures with categorical outcomes. (See Koziol and Feng1 for exposition and references.) We also described1 how a measure of number-needed-to-treat (NNT) may be derived from this formulation. Dr Saver takes us (and, incidentally, the SAINT I authors) to task for not adopting his methodology for calculating NNT, to which we plead nolo contendere. Nevertheless, let us argue (perhaps more forcefully than in Koziol and Feng1) that the very concept of NNT is of dubious value (or, at least, needs to be interpreted with extreme caution) when applied to ordinal categorical data such as the mRS in stroke or Kurtzke’s Expanded Disability Status Scale (EDSS) in multiple sclerosis.
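To make the notion of a probabilistic index concrete, the following minimal Python sketch computes, from category counts, the probabilities that a randomly chosen treated patient has a better, equal, or worse outcome than a randomly chosen control patient. The counts are hypothetical and are not SAINT I data; the function name and the convention of splitting ties equally are assumptions made for illustration, and the indices described in Koziol and Feng1 may be defined differently.

```python
def probabilistic_index(treated, control):
    """Compare two arms on an ordered scale (lower category = better outcome).

    treated, control: lists of patient counts per category, in the same order.
    Returns (p_better, p_tie, p_worse) for a random treated patient
    versus a random control patient.
    """
    better = tie = worse = 0
    for i, n_t in enumerate(treated):
        for j, n_c in enumerate(control):
            pairs = n_t * n_c
            if i < j:
                better += pairs   # treated patient in a better category
            elif i == j:
                tie += pairs      # same category
            else:
                worse += pairs    # treated patient in a worse category
    total = sum(treated) * sum(control)
    return better / total, tie / total, worse / total


# Hypothetical counts for mRS categories 0..5 (illustration only)
treated_counts = [20, 25, 20, 15, 10, 10]
control_counts = [15, 20, 20, 20, 15, 10]

p_better, p_tie, p_worse = probabilistic_index(treated_counts, control_counts)
# Ties split equally between the two arms
index = p_better + 0.5 * p_tie
print(p_better, p_tie, p_worse, index)
```

With ties split equally, the combined index equals the Mann-Whitney U statistic divided by the number of treated-control pairs, so a value above 0.5 favors the treated arm and a value of 0.5 indicates no difference.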
Let us begin with one statement we can validate from the New England Journal of Medicine article2: the clinical benefit of NXY-059 “amounts to an average improvement of 0.13 point on the mRS per patient, which suggests that about 8 patients would need to be treated to achieve improvement equal to 1 point on the scale for one patient.” As we noted previously,1 the authors’ calculation of NNT is the reciprocal of the observed difference in mean mRS between the two arms, 1/(2.84 − 2.71) = 1/0.13 = 7.69, where 2.84 and 2.71 are the mean mRS scores in the placebo and NXY-059 arms of the SAINT I trial. This is a “standard” construction of NNT4; our criticism related primarily to the authors’ failure to report a confidence interval for their estimate of NNT.5 Now, let us undertake a Gedanken experiment. Suppose that a new interventional drug NID-001 replaced NXY-059 in the SAINT I setting. Of the 850 patients randomized to drug NID-001, 389 had mRS 0 at 90 days, and 461 had mRS 5 (or 6). The mean mRS in the NID-001 arm is 2.71 (461 × 5/850 ≈ 2.71, scoring the poor outcomes as 5), indicating improvement relative to the placebo arm, and with the same NNT as NXY-059. Given the equivalence of NXY-059 and NID-001 in terms of NNT, should the drugs be considered equally efficacious?
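The arithmetic of this thought experiment can be checked directly. The following is a minimal Python sketch; the function names are illustrative, the only inputs are the figures quoted above, and the poor outcomes in the NID-001 arm are scored as mRS 5.

```python
def mean_score(counts):
    """Mean of an ordinal score, given a mapping {score: number of patients}."""
    total = sum(counts.values())
    return sum(score * n for score, n in counts.items()) / total


def nnt_from_means(mean_control, mean_treated):
    """NNT taken as the reciprocal of the difference in mean scores."""
    return 1.0 / (mean_control - mean_treated)


# Figures quoted in the text
placebo_mean = 2.84
nxy_mean = 2.71

# Hypothetical NID-001 arm: 389 patients at mRS 0, 461 at mRS 5
nid_counts = {0: 389, 5: 461}
nid_mean = mean_score(nid_counts)

nnt_nxy = nnt_from_means(placebo_mean, nxy_mean)
nnt_nid = nnt_from_means(placebo_mean, nid_mean)

print(round(nid_mean, 2))  # mean mRS in the NID-001 arm
print(round(nnt_nxy, 2))   # NNT from the reported means
print(round(nnt_nid, 2))   # NNT using the unrounded NID-001 mean
```

Both NNT values round to about 8 patients, even though one arm reflects a small shift spread across many patients and the other an all-or-nothing response.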
Clearly, the concept of NNT does not distinguish between scenarios in which there is apparently a modest change of mRS in most patients, as with NXY-059, and the more dramatic response profile with NID-001, in which only a select subgroup of patients benefits, albeit enormously. Furthermore, one should not lose sight of the fact that neither the mRS nor the EDSS represents a linear scale: a shift of 1 point from 1 to 0 does not necessarily have the same import or utility to patients as a shift from 3 to 2, or 6 to 5. Were we to ascribe different numerical values to the categories of the mRS, for example using a 100 (mRS=0) to 0 (mRS=6) quality of life scale, then the numerical estimates of NNT could change substantially. We suggest that the likelihood of particular outcomes (eg, excellent stroke outcome, or severe disability) is far more relevant to acute stroke patients and their families than the more elusive concept of an expected shift of 0.13 point in the mRS if NXY-059 (or NID-001) is prescribed. Technically, given the categorical nature of the mRS, NNT or any other univariate measure of treatment effect entails a loss of information relative to full specification of the discrete distributions of mRS across outcome categories in the intervention and reference cohorts. In general, these distributions are fully characterized by the individual probabilities of the categories, which cannot be captured by a single parameter or measure. We believe that informed judgment and treatment decisions in acute stroke should be based on complete information relating to the constellation of potential outcomes, rather than on a single summary statistic.
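The sensitivity of a mean-based summary to the scoring of the categories can be illustrated with a short Python sketch. The NID-001 counts are those of the thought experiment above; the "gradual" arm and the utility weights are hypothetical values chosen only for illustration and are not drawn from any trial or validated quality of life instrument.

```python
def mean_value(counts, values):
    """Mean of values[category] over patients, given {category: count}."""
    total = sum(counts.values())
    return sum(values[cat] * n for cat, n in counts.items()) / total


# Raw mRS scoring: each category is worth its own number
linear = {k: k for k in range(7)}

# Hypothetical utility weights, 100 (mRS 0) down to 0 (mRS 6)
utility = {0: 100, 1: 90, 2: 75, 3: 50, 4: 25, 5: 5, 6: 0}

# NID-001 arm from the thought experiment: all-or-nothing response
nid_arm = {0: 389, 5: 461}

# Hypothetical arm of the same size with the same mean mRS,
# but with every patient at mRS 2 or 3
gradual_arm = {2: 245, 3: 605}

for name, arm in (("NID-001", nid_arm), ("gradual", gradual_arm)):
    print(name, mean_value(arm, linear), mean_value(arm, utility))
```

The two arms have identical mean mRS, and hence identical NNT against any common comparator, yet their mean utilities differ under the hypothetical weights; any NNT computed from mean differences therefore depends on the scoring chosen for the categories.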
Koziol JA, Feng AC. On the analysis and interpretation of outcome measures in stroke clinical trials: lessons from the SAINT I study of NXY-059 for acute ischemic stroke. Stroke. 2006; 37: 2644–2647.
The CONSORT Statement. Available at: www.consort-statement.org. Accessed October 10, 2006.
Cook RJ, Sackett DL. The number needed to treat: a clinically useful measure of treatment effect. BMJ. 1995; 310: 452–454.
Altman DG. Confidence intervals for the number needed to treat. BMJ. 1998; 317: 1309–1312.