(Stroke. 2005;36:777.)
© 2005 American Heart Association, Inc.
Original Contributions
From the Department of Psychology (J.T.L.W., B.M.), University of Stirling, Stirling, UK; Outcomes Research (A.Hareendran), Pfizer Ltd, Sandwich, UK; the Department of Medicine for the Elderly (A.Hendry), Wishaw General Hospital, Wishaw, UK; the Department of Medicine for the Elderly (J.P.), Victoria Infirmary, Glasgow, UK; and the Department of Neurology (I.B., K.W.M), Institute of Neurological Sciences, Glasgow, UK.
Correspondence to J.T.L. Wilson, Department of Psychology, University of Stirling, Stirling FK9 4LA, UK. E-mail J.T.L.Wilson{at}stir.ac.uk
| Abstract |
|---|
Methods Inter-rater agreement was studied among raters from 3 stroke centers. Fifteen raters were recruited who were experienced in stroke care but came from a variety of professional backgrounds. Patients at least 6 months after stroke were first assessed using conventional mRS definitions. After completion of initial mRS assessments, raters underwent training in the use of a structured interview, and patients were re-assessed. In a separate component of the study, intrarater variability was studied using 2 raters who performed repeat assessments using the mRS and the mRS-SI. The design of the latter part of the study also allowed investigation of possible improvement in rater agreement caused by repetition of the assessments. Agreement was measured using the κ statistic (unweighted and weighted using quadratic weights).
Results Inter-rater reliability: Pairs of raters assessed a total of 113 patients on the mRS and mRS-SI. For the mRS, overall agreement between raters was 43% (κ=0.25, κw=0.71), and for the structured interview overall agreement was 81% (κ=0.74, κw=0.91). Agreement between raters was significantly greater on the mRS-SI than the mRS (P<0.001). Intrarater reliability: Repeatability of both the mRS and mRS-SI was excellent (κ=0.81, κw≥0.94).
Conclusions Although individual raters are consistent in their use of the mRS, inter-rater variability is nonetheless substantial. Rater variability on the mRS is thus particularly problematic for studies involving multiple raters. There was no evidence that improvement in inter-rater agreement occurred simply with repetition of the assessment. Use of a structured interview improves agreement between raters in the assessment of global outcome after stroke.
Key Words: clinical trials; disability evaluation; outcome
| Introduction |
|---|
Previous evaluations of inter-rater reliability of the mRS1,2,5,7,8 have used raters of similar background, usually from a single center,1,5,7,8 and may therefore have underestimated potential inter-rater variability. One way of investigating the importance of differences between raters is by comparing the conventional open-ended use of the mRS with a structured interview (SI). The SI for the mRS1 was developed to standardize the manner of administration of the scale while still allowing room for clinical judgment. It was developed based on input from patients and clinicians regarding aspects of disability after a stroke. We previously examined the inter-rater reliability of the mRS and SI (the mRS-SI), and concluded that using SI improved reliability.1
The current study investigated rater variability in outcome assessment using the mRS. To simulate more closely the conditions of a multicenter study, we investigated inter-rater reliability when using numerous raters with different professional backgrounds. We also investigated the repeatability of assessments and studied the extent to which repetition effects might be contributing to improvement in agreement between raters.
| Subjects and Methods |
|---|
For the study, each patient was interviewed independently by 2 raters. After all assessments on the mRS had been completed, raters were trained in the use of the SI and the method of obtaining a Rankin grade.1 Training consisted of an explanation by the investigator, followed by a video showing an example interview that the raters scored, and 2 example transcripts that the raters scored. Each patient was then assessed again by the same 2 raters using the SI.
Intrarater Agreement
Repeatability of the mRS and the mRS-SI was investigated for 2 raters. The raters were nurses with experience of working in neurology but were without a specific background in stroke care and were not involved in the inter-rater study. A description of the mRS was given to the raters, and they each performed 12 assessments of stroke patients with the scale before commencing the study. During the practice interviews, raters also used the Barthel Index.
To investigate repeatability (test-retest reliability) of the mRS, each rater independently assessed patients on 2 occasions using the mRS (at an interval of 7 to 14 days). After all assessments had been completed, the raters were trained in the use of the SI and the procedure for grading patients on the mRS. They then rated another set of patients on 2 occasions (at an interval of 7 to 14 days).
Possible effects of repeat assessment were investigated by comparing inter-rater agreement on first interviews with inter-rater agreement on second interviews for each of the 2 parts of the intrarater agreement study. If repeat assessment improved consistency between raters, then this would be expressed as greater agreement on the second interviews than first interviews.
Statistical Analysis
Agreement between raters was analyzed using the κ statistic. This statistic corrects for agreement by chance, and a weighted value (κw) takes into consideration the size of disagreements. To allow comparison with previous studies, we give both the unweighted and weighted values of κ, together with the 95% confidence intervals (CIs) of the values. The weighted value of κ (using quadratic weights) is equivalent to the intraclass correlation coefficient used for continuous measures.10 Conventional interpretation of the degree of agreement indicated by κ is: 0 to 0.20, "poor"; 0.21 to 0.40, "fair"; 0.41 to 0.60, "moderate"; 0.61 to 0.80, "good"; and 0.81 to 1.00, "very good."11 To compare methods of rating, we used appropriate nonparametric tests.
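The κ and κw statistics described above can be illustrated with a minimal Python sketch. It is not the analysis code used in the study: the function name `cohen_kappa`, its parameters, and the example ratings are hypothetical, and it assumes grades are coded as integers 0 to 5 and that quadratic weights take the usual form (i − j)² / (k − 1)².

```python
def cohen_kappa(ratings_a, ratings_b, n_categories=6, quadratic=False):
    """Cohen's kappa for two raters grading the same patients.

    Grades are integers 0..n_categories-1 (mRS grades 0 to 5 by default).
    With quadratic=True, disagreements are penalized by squared distance.
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(ratings_a)
    k = n_categories

    # observed[i][j]: proportion of patients graded i by rater A and j by rater B
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1.0 / n

    # Marginal proportions for each rater
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def penalty(i, j):
        # Disagreement weight: 0/1 for unweighted kappa, squared distance otherwise
        if quadratic:
            return ((i - j) / (k - 1)) ** 2
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed_dis = sum(penalty(i, j) * observed[i][j] for i, j in cells)
    expected_dis = sum(penalty(i, j) * row[i] * col[j] for i, j in cells)
    if expected_dis == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    # Kappa = 1 - (observed disagreement / disagreement expected by chance)
    return 1.0 - observed_dis / expected_dis


# Invented ratings for illustration only (not study data)
rater_1 = [0, 1, 2, 3, 4, 5]
rater_2 = [1, 1, 2, 3, 4, 4]
print(cohen_kappa(rater_1, rater_2))
print(cohen_kappa(rater_1, rater_2, quadratic=True))
```

For these invented ratings, where both disagreements are by a single category, the unweighted value is about 0.60 and the quadratic-weighted value about 0.93. This shows why κw can be high even when exact agreement is modest.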
| Results |
|---|
Ratings on the mRS and on the SI are shown in Tables 1 and 2. On the mRS, there was exact agreement between raters in only 43% of cases, and it was most common to disagree by one category (50% of cases). Table 1 shows that disagreement was greater at some category boundaries than others. For example, raters agreed in 88.5% of cases on whether patients had a Rankin score of 0 to 1, but in only 71% of cases on whether they had a Rankin score of 0 to 2.
The unweighted kappa (κ) value for inter-rater agreement on the conventional mRS was 0.25 (95% CI, 0.16 to 0.35), and the weighted kappa value (κw) was 0.71 (95% CI, 0.53 to 0.88). On the SI, there was agreement in 81% of cases: κ=0.74 (95% CI, 0.64 to 0.84) and κw=0.91 (95% CI, 0.73 to 1.0). To compare the two methods of assessment, we analyzed category differences between raters. Grades assigned by rater 1 were subtracted from grades assigned by rater 2, and the absolute values were computed. Comparison of the extent of disagreement between raters when using the mRS and mRS-SI showed that there was less disagreement using the SI (Wilcoxon z=5.35; P<0.001).
A comparison of the ratings made on the mRS in the first part of the study with the ratings made for the same patients using the mRS-SI is shown in Table 3. There was exact agreement in 48% of cases: κ=0.32 (95% CI, 0.26 to 0.39) and κw=0.72 (95% CI, 0.60 to 0.85), indicating good agreement between gradings. However, there was a shift in the gradings toward categories indicative of greater disability when using the SI. The overall difference between the distributions of Rankin grades was significant (Wilcoxon z=5.1; P<0.001).
Intrarater Agreement
Fifty patients were recruited for the initial assessment, 48 (27 male) of whom returned for a second assessment. The median interval between first and second assessments was 7 days (range, 4 to 13 days). The mean age of patients assessed twice was 68.6 years (range, 41 to 86 years), and the mean interval poststroke was 14.9 months (range, 7 to 25 months). Twenty-two patients were interviewed by rater 1 first, and 26 by rater 2 first.
Comparison of Rankin grades on first and second assessment on the mRS showed that there was agreement between first and second assessment in 85% of cases for rater 1, and in 96% for rater 2; κ=0.81 (95% CI, 0.66 to 0.96) for rater 1 and κ=0.95 (95% CI, 0.66 to 1.0) for rater 2, and κw=0.94 (95% CI, 0.66 to 1.0) for rater 1 and κw=0.99 (95% CI, 0.70 to 1.0) for rater 2.
Fifty new patients were recruited for a first assessment using the SI, 48 (29 male) of whom returned for a second assessment. The median interval between first and second assessments was 7 days (range, 7 to 9 days). The mean age of patients assessed twice was 68.4 years (range, 32 to 86 years), and the mean interval after stroke was 14.2 months (range, 6 to 24 months). Twenty-five patients were interviewed by rater 1 first, and 23 by rater 2 first. On the mRS-SI, there was agreement between first and second assessment in 88% of cases for rater 1, and 98% for rater 2; κ=0.84 (95% CI, 0.68 to 0.99) for rater 1 and κ=0.97 (95% CI, 0.81 to 1.0) for rater 2, and κw=0.96 (95% CI, 0.67 to 1.0) for rater 1 and κw=0.99 (95% CI, 0.71 to 1.0) for rater 2.
Impact of Repeat Assessment
It is conceivable that the differences in agreement on the mRS and mRS-SI observed in the study with multiple raters are caused by repetition of the assessment; that is, repetition of the assessment may improve agreement between raters. To examine whether this was likely to be the case, we looked at changes in agreement when the assessment was repeated. Calculation of inter-rater agreement on the mRS showed that exact agreement between raters was 71% on first interview and 65% on the second interview; this difference was not significant on a Wilcoxon test (z=1.0, P=0.32). Agreement between raters on the mRS-SI was 58% on first interview and 56% on the second interview; this was not significantly different (Wilcoxon test, z=0.38, P=0.71).
| Discussion |
|---|
Inter-rater agreement on the conventional mRS was low, with an unweighted κ value of only 0.25. Previous studies of the reliability of the mRS have reported rather higher levels of inter-rater reliability.1,2,5,8 The current findings thus suggest that use of multiple raters from different professions for the conventional mRS is likely to increase variability. On the SI, there was significant improvement in agreement between raters. The results thus confirm that the reliability of assessment on the mRS can be improved by use of the SI. Information variance (raters having different levels/sources of information) and criterion variance (raters using different criteria) have been identified as the 2 most important sources of unreliability of questionnaires used to classify patients.12 The SI provides a systematic framework for the collection of information, suggesting questions to be asked for each level. Although it allows rephrasing and follow-up of questions and the collection of both objective and subjective information, the framework promotes better consistency of the level of information. There is also a guide for interpreting responses to inform the criteria used to assign a patient to a Rankin grade.
Most disagreements on the mRS were by only one category, and the scale thus achieves a satisfactory level of inter-rater reliability assessed by the weighted κ statistic. Quadratic weighting emphasizes the size of disagreements (differences are squared), and this is reasonable when the κ statistic is used as a measure of the reliability of the mRS as a 6-point scale used to describe overall disability in a sample. However, in clinical trials the mRS is almost always dichotomized and consideration is confined to intergroup differences above and below a specific boundary (for example, mRS 0 to 2 versus mRS 3 to 5). In this case, weighting is irrelevant, and what matters is the extent of agreement at the particular cut-point chosen. The current findings indicate that inter-rater agreement at some boundaries is relatively poor.
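Agreement at a particular cut-point can be computed directly from paired ratings, as in the following sketch. The function name `agreement_at_cutpoint` and the example ratings are hypothetical, and the sketch assumes the dichotomy is "grade at or below the cut" versus "grade above the cut".

```python
def agreement_at_cutpoint(ratings_a, ratings_b, cut):
    """Proportion of patients whom both raters place on the same side of a
    dichotomy (grade <= cut versus grade > cut)."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("need two equally long, non-empty rating lists")
    same_side = sum((a <= cut) == (b <= cut)
                    for a, b in zip(ratings_a, ratings_b))
    return same_side / len(ratings_a)


# Invented ratings for illustration only (not study data)
rater_1 = [0, 1, 2, 3, 4, 5]
rater_2 = [1, 1, 3, 3, 4, 4]
for cut in (1, 2):
    print(cut, agreement_at_cutpoint(rater_1, rater_2, cut))
```

In this invented example the raters agree on every patient at the 0 to 1 versus 2 to 5 split but disagree on one patient in six at the 0 to 2 versus 3 to 5 split. The same set of one-category disagreements can therefore matter at one boundary and not at another.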
The intrarater reliability study shows that there is excellent repeatability for both the conventional mRS and the SI. The level of agreement for the mRS is similar to that reported by Wolfe et al5 for repeatability of this assessment. The results thus confirm that individual raters are consistent over time in their use of the conventional mRS and the SI. The results also indicate that improvement in agreement between raters does not occur simply with repeat testing. Inter-rater agreement on second testing was very similar to agreement on first testing for both methods of assessing disability. Effects of repetition are thus not an explanation of the improvement in agreement between raters observed in the first study. An unexpected observation in the intrarater reliability study is that using the SI did not lead to greater agreement between raters. This part of the study was not designed to allow comparison of inter-rater reliability, and it may simply be that the patients recruited for the second part of the study included more problematic cases (for example, cases close to borderlines between categories, cases with pre-existing disability, or cases with inconsistent reports).
The results show that whereas individual raters are consistent in their use of the mRS, inter-rater variability can nonetheless be substantial. An implication of the disagreement between raters found in the present study is that different raters focus on different aspects of disability. If so, then there is no single consensus on the application of the mRS. It is likely for example that some raters focus on physical disability and use some simple rules of thumb to assess patients, whereas other raters take more account of social limitations, residual symptoms, and level of functioning before the stroke. As a consequence, some raters showed a large shift in ratings when using the SI, whereas others showed little or no change.
This study simulates the scenario in most multicenter trials, in which outcome assessment may be performed by individuals from a variety of professional backgrounds. Whereas internal or local use of the conventional mRS by a small cohort of individuals with similar training or background may provide an adequate measure of disability, conventional use of the mRS in multicenter clinical trials is likely to reduce the power of the study as a result of increased variability. The effect of misclassifications on the power of clinical trials is discussed by Choi et al;6 in a hypothetical example using dichotomized outcomes, a 10% misclassification rate reduced an overall difference in favor of treatment from 7.5% to 6%, whereas a 20% misclassification reduced the difference to 4.5%. In the current study, the disagreement rate for dichotomized outcomes on the conventional mRS reached 29% when ratings were split between Rankin 2 and 3. Our study indicates that any desire to continue using the conventional mRS for reasons of comparability with previous data is likely to be greatly outweighed by its poor inter-rater reliability. The conventional mRS has proved quite contentious when used alone as the primary endpoint, with discrepant results being obtained in some trials based on arbitrary choices of dichotomization cut-point (eg, PROACT 213 ECASS II14,15). Use of the SI produced a shift in the overall distribution of ratings in comparison to the conventional mRS. It is likely that the distribution obtained with the SI results from taking a more systematic and comprehensive approach to assessment of disability and handicap after stroke. We have previously found evidence of the validity of the mRS-SI with respect to other measures of impairment, activities of daily living, and quality of life.16 The shift obtained in the distribution of Rankin grades emphasizes the importance of further work on the construct validity of the SI.
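The figures quoted above from the hypothetical example of Choi et al are consistent with a simple attenuation model, sketched below. The model is an assumption, not necessarily the one Choi et al used: each patient's dichotomized outcome is flipped with the same probability m in both treatment arms, regardless of the true outcome. Under that assumption an observed rate becomes p(1 − m) + (1 − p)m, so a true difference Δ between arms shrinks to (1 − 2m)Δ. The function names and the arm-level rates (40.0% and 47.5%) are hypothetical; only the 7.5% difference and the 10% and 20% misclassification rates come from the text.

```python
def observed_rate(true_rate, m):
    """Rate of good outcome after flipping each outcome with probability m."""
    return true_rate * (1.0 - m) + (1.0 - true_rate) * m


def observed_difference(true_control, true_treated, m):
    """Treatment difference seen after symmetric misclassification at rate m.
    Algebraically equal to (1 - 2m) * (true_treated - true_control)."""
    return observed_rate(true_treated, m) - observed_rate(true_control, m)


# Hypothetical arm rates chosen only so that the true difference is 7.5%
control, treated = 0.400, 0.475
for m in (0.0, 0.10, 0.20):
    diff = observed_difference(control, treated, m)
    print(f"misclassification {m:.0%}: observed difference {diff:.1%}")
```

Under this assumed model, the observed difference is 6.0% at a 10% misclassification rate and 4.5% at 20%, matching the quoted figures. The result does not depend on the arm-level rates chosen, only on their difference.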
The potential for improvement in the power of clinical trials, and the consequent reduction in required sample size, as a result of reduced variability is an important consideration when designing trials. The findings suggest a number of practical ways in which the effects of observer variability in the assessment of outcome can be reduced in multicenter studies. The choice of where to dichotomize outcomes is often based on arguments concerning the clinical meaning of the split; however, other things being equal, agreement on the conventional mRS seems better at the Rankin 1 to 2 boundary than at the Rankin 2 to 3 boundary. The number of raters involved in the study should be kept to the minimum that is practical (for example, no more than 1 rater per center). Agreement on the mRS is improved if raters use standardized questions and guidelines and are given training. In most clinical trials, raters perform a number of assessments (including, for example, the National Institutes of Health or Scandinavian Stroke Scales, and the Barthel Index) before finally assigning a global outcome rating on the mRS. Although this will standardize at least some of the information that is collected by raters and should reduce inter-rater variability, there is also a need for consensus guidelines for resolving issues that arise when assessing outcome on the mRS. Training for raters in the interpretation of this information may reduce variability of outcome assessment, but the value of unstructured training is uncertain. Finally, the use of the SI after appropriate training appears to produce consistent and substantial improvements in reliability, and it should be considered the optimal approach at the present time.
| Footnotes |
|---|
Received August 12, 2004; revision received November 19, 2004; accepted December 2, 2004.