Effect Size Measures and Their Relationships in Stroke Studies
- effect size measure
- Mann-Whitney measure
- numbers needed to treat
- proportional odds ratio
- randomized controlled trial
- risk difference
Many articles in Stroke have considered good statistical practice for adequate planning and high-power analysis for stroke trials. They have discussed which test may be adequate and powerful, proposals for an effect size measure, and proposals for defining number needed to treat (NNT) based on an ordinal scale (see online-only Data Supplement for citations).
The clinical problem is straightforward. We read results of trials that fall into 3 categories: unequivocally neutral or even negative; overwhelmingly positive; or encouraging but open to various interpretations according to the approach taken to statistical analysis and presentation of findings. This dependence on methodology for the third group undermines our confidence in effective treatments and can prompt unjustified repetition of trials of ineffective treatments. A robust, powerful, and universal statistical approach is required.
The statistical problem is more complex. Of the dozens of available statistical tests, each may be uniquely powerful in certain circumstances. However, trials intended as confirmatory for regulatory approval or for use in clinical guidelines demand that the analysis plan be prespecified, so the test of choice should minimize assumptions yet maximize power for the anticipated difference between treatment groups. Furthermore, it is not sufficient to indicate that 1 treatment is significantly different from another: clinical research guidelines require that the magnitude of the treatment effect be declared using a so-called effect size measure accompanied by its measure of precision, the confidence interval.
Thus, there is need for a robust test, preferably for all data types—binary, ordinal, or continuous—and a test-related effect size measure, with a confidence interval that matches the test-related P value. This requirement for an adequate analysis of study data fortunately restricts the plethora of available tests to a small number of useful candidates.
Here, we describe and explain the relationships between 2 tests, of which 1 is well-known and the other is less familiar. Both are suitable for the analysis of binary, ordinal, and continuous data, and both offer associated confidence intervals. Using intravenous thrombolysis trial data, we illustrate the relative merits of these tests when compared with other approaches. We demonstrate why dichotomizing ordinal scales is undesirable. Finally, we show that in principle each well-known effect size measure can be re-expressed as the popular NNT.
Family of Mann–Whitney Measures
One family of effect size measures fulfills both desirable criteria: robustness, that is minimizing assumptions, and relevance for clinical research in which medicinal products are tested against a reference treatment. This family is based on the concept of proversions. Two groups are contrasted by comparing each patient of 1 group with each member of the other and counting the number of cases for which there is superiority for 1 or the other group. The number of proversions for the 2 groups can be related to the total number of comparisons to express a probability. In statistical textbooks, the probabilities of the 2 groups are called P(X<Y) and P(Y<X). If the scale is not a continuous one but an ordered scale with only a few categories, there will also be patients in each group with equal outcomes, yielding the so-called tied values. The number of ties defines the probability P(X=Y).
With these 3 probability values, we can construct several general measures of treatment superiority. Two possible measures and the principles of this procedure were discussed in a recent publication in this journal.1 Our discussion concentrates on the following 3 further measures:
Mann–Whitney measure of superiority [MW = P(X < Y)+0.5P(X = Y)]
Mann–Whitney difference [MWD = P(X < Y)−P(Y < X)]
Mann–Whitney centered [MWC = MW−0.5]
Of note, each of these measures can be transformed into either of the other 2, and all are robust measures of superiority, differing only marginally in their interpretation when describing the degree of superiority for presenting study results. No assumption is involved in this procedure; it requires only simple counting.
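The counting procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the function name and the toy scores are ours, and larger scores are assumed to mean better outcomes.

```python
from itertools import product

def mann_whitney_measures(x, y):
    """Return (MW, MWD, MWC) for two samples by pairwise counting.

    Larger values are treated as better outcomes; for scales where lower
    is better (e.g. the modified Rankin scale), negate the scores first.
    """
    n_pairs = len(x) * len(y)
    less = sum(1 for a, b in product(x, y) if a < b)      # pairs with X < Y
    greater = sum(1 for a, b in product(x, y) if a > b)   # pairs with Y < X
    ties = n_pairs - less - greater                       # pairs with X = Y
    p_less, p_greater, p_tie = less / n_pairs, greater / n_pairs, ties / n_pairs
    mw = p_less + 0.5 * p_tie    # Mann-Whitney measure of superiority
    mwd = p_less - p_greater     # Mann-Whitney difference
    mwc = mw - 0.5               # Mann-Whitney centered
    return mw, mwd, mwc

# Hypothetical toy scores, not trial data
reference = [1, 2, 2, 3]
test = [2, 3, 3, 4]
mw, mwd, mwc = mann_whitney_measures(reference, test)
```

For these toy scores, 11 of the 16 comparisons favor the test group, 1 favors the reference group, and 4 are tied, so that MW = 0.8125, MWD = 0.625, and MWC = 0.3125; the identity MWD = 2×MWC holds for any data.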
All 3 measures are related to the well-known Wilcoxon–Mann–Whitney (WMW) test. The Mann–Whitney measure of superiority is the probability that a randomly selected patient of 1 group is better off than a randomly selected patient of the other group. The Mann–Whitney difference (MWD) is also in use because when applied to binary data the measure is identical with the risk difference (RD). These are essentially synonymous with other well-known measures of relevance: Kendall's τ, Somers' d, Goodman–Kruskal γ, etc. The Mann–Whitney centered is less familiar but is important for interpretation in relationship to RD and NNT as explained later. All 3 measures can be interconverted or expressed as a standardized mean difference via the relation MW = Φ(d/√2), with Φ(·) denoting the standard normal cumulative distribution function and d being the standardized difference (Table 1).
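The link between MW and the standardized difference can be sketched as follows, under the assumption (ours, for illustration) that both groups are normally distributed with a common standard deviation σ and that ties have probability zero:

```latex
X \sim N(\mu_X, \sigma^2), \quad Y \sim N(\mu_Y, \sigma^2)
\;\Longrightarrow\; Y - X \sim N(\mu_Y - \mu_X,\; 2\sigma^2)

MW = P(X < Y) = P(Y - X > 0)
   = \Phi\!\left(\frac{\mu_Y - \mu_X}{\sigma\sqrt{2}}\right)
   = \Phi\!\left(\frac{d}{\sqrt{2}}\right),
\qquad d = \frac{\mu_Y - \mu_X}{\sigma}
```

The factor √2 arises because the variance of the difference of 2 independent observations is twice the common variance.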
Displaying Outcomes: the Mann–Whitney Measures and the P–P Graph
Stroke researchers have become familiar with display of outcomes, such as Rankin scales, according to paired horizontal stacked bar graphs, sometimes termed Grotta bars by stroke researchers (Figure 1).2 However, although these readily illustrate at which levels in the ordinal Rankin scale benefit may have occurred, they do not display the total magnitude of effect so clearly.
There is another graph type that can display all Mann–Whitney measures, their relationships, and even relationships to other measures of interest. This is the percentile–percentile plot (P–P plot),3,4 better known to stroke researchers as the receiver operating characteristic curve. This graph gives simply the cumulative distribution of the data of 1 group plotted against that of the other group. The MW superiority measure is equal to the area under the curve of the receiver operating characteristic curve.5 Familiarity with this makes the concept of the MW superiority measure easy to grasp.
Taking as an example the data from the European Cooperative Acute Stroke Study-III (ECASS-III) trial of intravenous alteplase versus placebo,6,7 Figure 2 shows the P–P plot of both cumulative distributions. Because the curve is completely above the diagonal line, this indicates superiority of alteplase (statisticians call this type of difference stochastic ordering when the line is always above or on the line of identity). The area under the curve is identical with the MW measure, which in this example is MW=0.546. This is only a small superiority but is typical for clinical stroke research. Of interest, the lower bound of the confidence interval is 0.508 so that the superiority is statistically significant.
The ECASS-III data represent a convenient situation with a monotonically ascending curve, not far from being convex. The graph illustrates practically no difference between treatments at high Rankin values, representing severe disability or death; only scores in the range 0 to 4 show a difference.
We can also see how this graph incorporates the various possible dichotomizations (responder definitions). Each vertical line from the P–P curve to the diagonal is just the RD of a dichotomization.
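The relationship between the P–P curve, the MW measure, and the cut-point RDs can be sketched in Python. The category counts below are hypothetical (not ECASS-III data), the function names are ours, and categories are assumed to be ordered from best to worst outcome so that a curve above the diagonal favors the treated group.

```python
from itertools import accumulate

def pp_curve(control_counts, treated_counts):
    """Cumulative proportions of each group, starting at the origin (0, 0)."""
    n_c, n_t = sum(control_counts), sum(treated_counts)
    x = [0.0] + [c / n_c for c in accumulate(control_counts)]
    y = [0.0] + [t / n_t for t in accumulate(treated_counts)]
    return x, y

def area_under_pp(x, y):
    """Trapezoidal area under the piecewise-linear P-P curve."""
    return sum((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2
               for i in range(1, len(x)))

# Hypothetical counts per ordinal category (best outcome first), not trial data
control = [10, 20, 30, 40]
treated = [20, 30, 30, 20]

x, y = pp_curve(control, treated)
mw = area_under_pp(x, y)

# The vertical distance from the curve to the diagonal at each interior
# cut point is the risk difference of the corresponding dichotomization.
risk_differences = [y[i] - x[i] for i in range(1, len(x) - 1)]
```

With these toy counts the area under the curve is 0.635, whereas the cut-point RDs are 0.1, 0.2, and 0.2, so the result of a dichotomized analysis depends on which cut point was prespecified.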
If we restrict our result to only one of the cut points, we have a smaller area under the curve when working with a convex curve, which means loss of patient information. Also, if the overall superiority is small, the fluctuation at the various cut points is so large that dichotomizing the ordinal scale resembles a lottery, particularly as clinical trials intending dichotomization must prespecify their choice of cut point. We choose to prespecify one of the various possible RDs at our peril unless the choice has unique clinical relevance. Certainly, we do not need to dichotomize to derive an NNT, as we explain later. Ordinal analysis is sometimes loosely described as a shift analysis. In fact, the WMW test seeks evidence for more than simply shift: it can also detect a beneficial treatment difference regardless of the pattern of benefit. The P–P plot or Grotta bars will illustrate the pattern, and if a test of where the maximal difference lies should be required, the Kolmogorov–Smirnov test does this, taking into account the number of possible differences.8 Unfortunately, the Kolmogorov–Smirnov test cannot give a confidence interval around the dichotomized analysis.
Mann–Whitney Measure and Odds Ratio
The Mann–Whitney measure and the odds ratio (OR) are 2 sides of the same coin. One can be derived from the other, which is useful for the interpretation of study results; these in turn can be transformed into other measures of effect size, including standardized difference or NNT. The formula for obtaining MW from the OR, as demonstrated by our research team (H.Z. and V.R.), is as follows:
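One way to arrive at such a relation is sketched here under a continuous proportional-odds model. This is an assumption made for illustration; with few categories and ties the relation holds only approximately, and the notation is ours. Let S_X(t) = P(X > t) and S_Y(t) = P(Y > t), and let the odds of exceeding any threshold in group Y be OR times those in group X:

```latex
\frac{S_Y(t)}{1 - S_Y(t)} = OR \cdot \frac{S_X(t)}{1 - S_X(t)}
\;\Longrightarrow\;
S_Y = \frac{OR \, s}{1 + (OR - 1)\, s}, \qquad s = S_X(t)

MW = P(X < Y) = \int_0^1 \frac{OR \, s}{1 + (OR - 1)\, s} \, ds
   = \frac{OR \,\bigl(OR - 1 - \ln OR\bigr)}{(OR - 1)^2}, \qquad OR \neq 1
```

For OR → 1 the expression tends to MW = 0.5, the null value.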
This procedure assumes proportionality of the odds. If this assumption does not apply, it is safer to calculate the MW statistic directly by counting because this is the gold standard of analysis, making no assumptions. A conversion program implemented by our research group (H.Z. and V.R.) is also available at http://www.idv-software.com/mw-convert/.
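A conversion of this kind can be sketched in Python. The closed form used below is the one that follows from a continuous proportional-odds model; it is our assumption for illustration and is not taken from the conversion program mentioned above. The function names are hypothetical.

```python
import math

def mw_from_or(odds_ratio):
    """MW implied by an odds ratio under continuous proportional odds."""
    a = odds_ratio - 1.0
    if abs(a) < 1e-8:
        return 0.5  # limiting value at OR = 1
    return odds_ratio * (a - math.log(odds_ratio)) / (a * a)

def or_from_mw(mw, lo=1e-6, hi=1e6, tol=1e-12):
    """Invert mw_from_or by bisection on the log scale.

    Relies on MW increasing monotonically with OR.
    """
    log_lo, log_hi = math.log(lo), math.log(hi)
    while log_hi - log_lo > tol:
        mid = (log_lo + log_hi) / 2
        if mw_from_or(math.exp(mid)) < mw:
            log_lo = mid
        else:
            log_hi = mid
    return math.exp((log_lo + log_hi) / 2)

# MW reported in the text for ECASS-III
implied_or = or_from_mw(0.546)
```

The OR obtained in this way is the value implied by the formula for MW=0.546; it need not coincide exactly with a test-based or regression-based OR estimated from the same data.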
We can check and validate these translation procedures. Taking the data of the ECASS-III study, Table 2 presents the results side by side: first, MW derived directly from the counting process (the number of patients with better outcome); second, the proportional OR derived from the terms of the P value calculation (the so-called test-based result); and finally, MW derived from the OR.
The equivalence is understandable because the P–P plot shows near proportionality of the ORs.
Note that the P value is identical for all measures, P=0.0196, so that the associated confidence intervals should not cover the null difference (which is 0.5 for MW and 1.0 for OR). The computer outputs for all these calculations are given in the online-only Data Supplement (Tables I and II in the online-only Data Supplement). In passing, we note that our OR was not calculated as usual based on logistic regression but based on the P value as proposed by Lachin.9
Of interest, the MW measure based on pure counting is assumption free, whereas the calculation of the OR and the derived MW measure demands at least near proportionality of effect. Therefore, the direct way is the gold standard among these procedures.
In practical terms, if a treatment, such as an endovascular approach, were to increase the proportion of patients with excellent outcome at the expense of a small increase in mortality or severe dependence, then the OR for surviving would not be comparable with the OR for recovering fully and proportionality would not occur. The MW statistic could indicate net benefit, but the pooled OR may be misleading.
Useful Definitions of RD
It is well-known that the simple RD in a 2×2 table is identical with the MWD. Thus, until recently, it was taken for granted that the MWD is a generalized RD, which can be used for calculating NNT based on categories of an ordinal scale. However, using geometric and mathematical arguments, it can be shown that the weighted average of the RDs for the whole area (not only of the empirical RDs), the expected RD, is identical to MW–0.5, which is half of the hitherto assumed amount 2×(MW–0.5)=MWD (H. Zimmermann, Dipl. Math., and V.W. Rahlfs, PhD, Comments on number-to-treat derived from ordinal scales, submitted manuscript, 2012). The MWD in this situation is larger than any of the possible RDs, a fact which simply is not reasonable. A recent publication of the Optimising the Analysis of Stroke (OAST) Collaboration proposed calculating NNT from ordinal scales based on the difference of the superiority probabilities of the 2 groups being compared.10 However, this difference of probabilities is just again the MWD so that the given NNT numbers are too small.
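One way to read the geometric argument is sketched below; the uniform weighting along the horizontal axis and the notation are our assumptions. Let u be the cumulative proportion in the reference group and G(u) the height of the P–P curve, so that the RD at position u is G(u) − u:

```latex
E[RD] = \int_0^1 \bigl( G(u) - u \bigr)\, du
      = \underbrace{\int_0^1 G(u)\, du}_{\text{area under the curve} \;=\; MW} - \frac{1}{2}
      = MW - 0.5

MWD = P(X < Y) - P(Y < X) = 2\,MW - 1 = 2\,(MW - 0.5) = 2\, E[RD]
```

The expected RD is thus the area between the P–P curve and the diagonal, and the MWD is exactly twice that area.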
The new finding is of considerable importance for interpretation. We get a fresh interpretation for an MW measure based on RDs. For instance, an RD of 6 percentage points is now identical with an MW=0.56, which according to Cohen's effect size definition is a small difference. This seems plausible.
There is another useful definition of the RD that gives an absolute upper bound of RD (maximum RD [MRD]). The derivation formula is11 as follows:
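An upper bound of this kind can be derived under a continuous proportional-odds model. The sketch below rests on that assumption and uses our notation; it may differ in form from the cited formula. With s the proportion of the reference group beyond a given cut point, the RD at that cut point and its maximum are:

```latex
RD(s) = \frac{OR \, s}{1 + (OR - 1)\, s} - s, \qquad
\frac{d\,RD}{ds} = 0 \;\Longleftrightarrow\; 1 + (OR - 1)\, s^{*} = \sqrt{OR}

MRD = RD(s^{*}) = \frac{\sqrt{OR} - 1}{\sqrt{OR} + 1}
```

The maximum is attained at s* = 1/(√OR + 1), that is, at a cut point close to the median of the reference group when the OR is near 1.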
Both RD definitions are of interest for interpretation, the (summarizing) expected RD and the local maximum RD: each should be cited with its corresponding confidence interval. In the following, we give the expected RD and the derived generalized NNT, as well as maximum RD and derived minimum NNT, for the ECASS-III study data:
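The arithmetic for the expected RD and the generalized NNT can be sketched from the MW values quoted earlier in the text (MW=0.546, lower confidence bound 0.508). The function names are ours, and only the quantities stated in the text are used; the maximum RD is not computed here because it requires the OR.

```python
def expected_rd(mw):
    """Expected (weighted average) risk difference, MW - 0.5."""
    return mw - 0.5

def generalized_nnt(mw):
    """Generalized NNT as the reciprocal of the expected RD."""
    rd = expected_rd(mw)
    if rd <= 0:
        raise ValueError("no net benefit: NNT is undefined for MW <= 0.5")
    return 1.0 / rd

# Values quoted in the text for ECASS-III
mw_estimate, mw_lower = 0.546, 0.508

rd_point = expected_rd(mw_estimate)        # 4.6 percentage points
nnt_point = generalized_nnt(mw_estimate)   # roughly 21.7
nnt_at_lower = generalized_nnt(mw_lower)   # roughly 125
```

Because the NNT is a reciprocal, the lower confidence bound of MW yields the upper bound of the NNT.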
Of interest is the comparison of the different proposed numbers for RD and NNT, all based on the ordinal modified Rankin scale in ECASS-III. The described maximum RD for ordinal scales is nearly the RD for a well-chosen cut point of the ordinal scale in well-behaved data situations (convex curve in the P–P plot). The newly proposed expected (weighted average) RD is, of course, smaller. It is only half of the value described by Bath et al10 and much smaller than the RD given by Saver et al.12 A detailed table with comparisons of numbers for the different procedures is given in the online-only Data Supplement to this article (Table IV in the online-only Data Supplement).
The question might be raised whether the NNT estimate is attenuated or inflated by nondifferential misclassification (judgment error) as is well-known for dichotomous data. However, Lu et al13 have shown that, based on work with the Glasgow Outcome Scale (GOS) in the field of traumatic brain injury, the effect of misclassification is likely to be small when ordinal scales are used instead of dichotomous scales. This militates against the frequently encountered dichotomization of ordinal scales. In any case, clinical trial researchers should strive to avoid misclassification by applying appropriate measures for enhancing reliability of judgment scale data.
Our approach delivers only the net effect of treatment. Although it may be intriguing to express treatment effects as an NNT for benefit contrasted with an NNT for harm, estimation of these opposing effects has a subjective component,12 whereas the net effect can be calculated without any assumption unsupported by the trial design.
Choice of Test(s)
It is now well recognized that dichotomizing an ordinal or continuous scale and analyzing the resulting binary scale means discarding patient information, a loss that can only be compensated for by substantially increasing the number of patients in clinical trials. The loss may be ≥50%.14–16
Dichotomization and analysis of binary data, the so-called responder analysis, are no longer recommended, at least for first-line analysis. The only justification for reporting is the clinical meaningfulness of the treatment effect.14
Despite a multitude of tests that are frequently used for the analysis of ordinal data,17 all of these tests can be reduced to 2 principles: WMW-type (stochastic superiority or rate difference) and OR-type tests (Table 3).
Because the WMW procedures and OR procedures are closely related, permitting 1 result to be re-expressed as the other (at least for the approximately proportional odds situation), there seems little advantage of one compared with the other. For interpretation, it may be helpful to present both results.
Are there reasons in practical work to prefer one or the other? According to current guidelines, there is 1 basic stipulation: the P value of each analysis procedure must be accompanied by the associated effect size measure and its confidence interval. This requirement is fulfilled by logistic regression, as well as by the WMW test, with its associated family of MW measures for stochastic superiority and their confidence intervals. In addition, 2 other features are desirable for an efficient data analysis: adjustment for covariates and generalization to multiple end points. The first is available for ordinal logistic regression. The second is available for logistic regression as well: it was applied in binary form within the National Institute of Neurological Disorders and Stroke (NINDS) trial of alteplase by Tilley et al20 and has been described in ordinal form by Whitehead et al.21 These 2 desirable features are also available for the WMW test. The first is well known under the name stratified analysis (eg, the van Elteren pooling principle for WMW) and is available as Cochran–Mantel–Haenszel pooling or as formal meta-analysis pooling procedures of WMW results. The generalization of the WMW test to multiple end points with more powerful directional tests is known as the Wei–Lachin procedure22,23 and was recently discussed in a similar format under the name modified O’Brien.24 These generalizations allow the combination of binary, ordinal, and continuous data into 1 powerful analysis.
Comparing the relative merits of OR and WMW procedures, we recommend the latter because there are fewer assumptions and restrictions when using the WMW principle (proportionality, absence of treatment-by-covariate interaction, power loss when applying binary logistic regression). Thus, we could call the WMW test the gold standard test because it is based on simple ranking of data (Wilcoxon principle) or just counting of ordered comparisons (Mann–Whitney principle). The use of the Mann–Whitney effect size measure also allows the calculation of an average RD and then the derivation of a generalized NNT, which is an effect size measure currently required by many guidelines.17 However, we emphasize that data analysis using ordinal logistic regression should give correct and unbiased results in most analyses of stroke trials because the procedure seems to be reasonably robust to departures from the proportionality assumption. This could be verified with empirical data from studies in the field of stroke.17
Sample Size Determination
The sample size calculation method of Whitehead with respect to ordinal data with few categories is well-known.25 For the WMW test, Frick and Rahlfs26 developed a more exact method from a method proposal by Noether (Program available).27,28
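For orientation, Noether's approximation for the WMW test can be sketched in Python. This is the simpler approximation, not the more exact Frick and Rahlfs method or the program cited above; the function name and default values are ours.

```python
import math
from statistics import NormalDist

def noether_total_n(mw, alpha=0.05, power=0.80, allocation=0.5):
    """Approximate total sample size for a two-sided WMW test.

    Noether's formula: N = (z_{1-alpha/2} + z_{power})^2
                           / (12 * c * (1 - c) * (MW - 0.5)^2),
    where c is the fraction of patients allocated to one group.
    """
    if not 0 < allocation < 1:
        raise ValueError("allocation must lie strictly between 0 and 1")
    if mw == 0.5:
        raise ValueError("MW = 0.5 means no effect; sample size is unbounded")
    z = NormalDist().inv_cdf
    numerator = (z(1 - alpha / 2) + z(power)) ** 2
    denominator = 12 * allocation * (1 - allocation) * (mw - 0.5) ** 2
    return math.ceil(numerator / denominator)

# Total patients needed to detect MW = 0.546 (the ECASS-III estimate quoted
# in the text) with 80% power at a two-sided 5% level and 1:1 allocation
total_n = noether_total_n(0.546)
```

Because the required number grows with the inverse square of MW−0.5, small superiorities of the size typical in stroke research call for trials with well over 1000 patients under these settings.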
Some other data evaluation methods in use should be mentioned here. Analysis of ordinal scales with only a few categories should not be performed using the differences of medians and the median test. This analysis is not sensitive to small improvements (in most cases there is only a difference of 0 or 1 point). Note that this approach failed for ECASS-I.29 Although there are more sensitive tests available (Su-Wei procedure),30 these tests should only be applied when the data situation is one of pure shift (for each individual patient), which cannot be true for bounded scales (eg, 0–6 modified Rankin scale). Although the t test with its test for mean difference is more sensitive for small differences and in well-behaved data situations delivers comparably good results, we would not recommend it as a first-line analysis procedure because of its strong dependence on the specific data situation.
For continuous or quasi-continuous data, we would not recommend parametric procedures, such as the t test, linear regression, ANOVA, and ANCOVA, when more adequate and more powerful procedures are available for this data situation. Robust procedures lose little but can gain a lot when applied instead of parametric procedures. An additional advantage of the nonparametric WMW test is that all scales may be analyzed with a comparable effect size measure using the same procedure.
In summary, the MW effect size measure and the derived generalized NNT deliver an overall picture for the complete ordinal scale, giving the net effect of superiority and inferiority across the whole scale, and the result does not rest on assumptions (gold standard test). This is a substantial advantage when performing confirmatory studies that require prespecified analysis methods. Examination of the data with a P–P plot can offer information about the nature of the difference between groups, although Grotta bars provide some corresponding information and are now familiar to stroke researchers. For instance, the P–P graph can show why one dichotomization gives a favorable result, whereas another does not. If one prefers the OR as the first-line analysis, we strongly recommend comparing the result with that of the gold standard analysis. If results agree, there is no objection to calculating the OR.
Figure 3 gives an overview of all discussed effect size measures and their relationship. In conclusion, we recommend the Mann–Whitney effect size measure (and WMW test). This is not only because it is in itself a valuable robust measure of a beneficial treatment effect, but also because it is the missing link in relationships among other well-known effect size measures. Possessing the MW measure, we can have in principle all other well-known measures.
Dr Rahlfs is active as consultant for EVER Neuro Pharma and receives honoraria for this activity. H. Zimmermann is a consultant for mathematical statistics for idv-Data Analysis and Study Planning. Dr Lees chaired the European Stroke Organisation working group on stroke outcome measures; chairs the outcomes adjudication committees for the Clot Lysis: Evaluating Accelerated Resolution of Intraventricular Hemorrhage III (CLEAR III), EuropHyp-1, and SITS-OPEN trials; chairs independent data monitoring committees for trials in stroke sponsored by commercial organizations including Boehringer Ingelheim, Lundbeck, Grifols, Fundació Privada ICTUS Malaltia Vascular; chairs the Virtual International Stroke Trials Archive; has consulted with EVER Neuro Pharma; and participated in the Stroke Academic Industry Roundtables (STAIR) I-VIII but has received no support from any source in connection with the present article.
The opinions expressed in the article are not necessarily those of the editors or of the American Heart Association.
Guest Editor for this article was Ralph L. Sacco, MD.
The online-only Data Supplement is available with this article at http://stroke.ahajournals.org/lookup/suppl/doi:10.1161/STROKEAHA.113.003151/-/DC1.
- Received August 12, 2013.
- Accepted November 12, 2013.
- © 2013 American Heart Association, Inc.
- Howard G, Waller JL, Voeks JH, Howard VJ, Jauch EC, Lees KR, et al.
- Wilk MB, Gnanadesikan R.
- Bluhmki E, Chamorro A, Dávalos A, Machnig T, Sauce Ch, Wahlgren N, et al.
- Lachin JM.
- Saver JL, Gornbein J, Grotta J, Liebeskind D, Lutsep H, Schwamm L, et al.
- 15. The Optimising Analysis of Stroke Trials (OAST) Collaboration. Can we improve the statistical analysis of stroke trials? Statistical reanalysis of functional outcomes in stroke trials. Stroke. 2007;38:1911–1915.
- Bath PM, Geeganage C, Gray LJ, Collier T, Pocock S.
- Bath PM, Lees KR, Schellinger PD, Altman H, Bland M, Hogg C, et al.
- Tilley BC, Marler J, Geller NL, Lu M, Legler J, Brott T, et al.
- 28. Nnpar, program for calculating sample size, power and other parameters for the Wilcoxon-Mann-Whitney test. http://www.idv-cro.com/cms/index.php?id=195&L=1. Accessed November 25, 2013.