(Stroke. 2002;33:2243.)
© 2002 American Heart Association, Inc.
Original Contributions
From the Department of Psychology, University of Stirling, Stirling, UK (J.T.L.W., M.G.); Outcomes Research, Pfizer Ltd, Sandwich, UK (A.H.); and Department of Neurology, Institute of Neurological Sciences, Glasgow, UK (T.B., U.G.R.S., K.W.M., I.B.).
Correspondence to J.T.L. Wilson, Department of Psychology, University of Stirling, Stirling FK9 4LA, UK. E-mail J.T.L.Wilson{at}stir.ac.uk
| Abstract |
|---|
Methods Sixty-three patients with stroke 6 to 24 months previously were interviewed and graded independently on the modified Rankin Scale by 2 observers. These observers then underwent training in use of a structured interview for the scale that covered 5 areas of everyday function. Eight weeks after the first assessment, the same observers reassessed 58 of these patients using the structured interview.
Results Interrater reliability was measured with the κ statistic (weighted with quadratic weights). For the scale applied conventionally, overall agreement between the 2 raters was 57% (κw=0.78); 1 rater assigned significantly lower grades than the other (P=0.048). On the structured interview, the overall agreement between raters was 78% (κw=0.93), and there was no overall difference between raters in grades assigned (P=0.17). Rankin grades from the conventional assessment and the structured interview were highly correlated, but there was significantly less disagreement between raters when the structured interview was used (P=0.004).
Conclusions Variability and bias between raters in assigning patients to Rankin grades may be reduced by use of a structured interview. Use of a structured interview for the scale could potentially improve the quality of results from clinical studies in stroke.
Key Words: clinical trials; disability evaluation; outcome; outcome assessment
| Introduction |
|---|
The reliability of the modified Rankin Scale has been investigated and found to be satisfactory.3 However, comparison with the Barthel Index4 indicates lower levels of interrater agreement for the modified Rankin Scale5 and suggests that some raters may systematically assign higher or lower Rankin grades than others. The descriptions given for the categories of the modified Rankin Scale are broad and open to subjective interpretation. Walking is the only explicit criterion for assessment mentioned, and even for this criterion, it is not specified whether someone requiring an aid should be considered able to walk. It is therefore left open to raters to develop idiosyncratic criteria or to apply the scale in an impressionistic manner. Discrepancies are particularly striking for Rankin grades 2, 3, and 4, and it has been suggested that interviewers should use a checklist of activities of daily living (ADL) to produce greater uniformity in the application of the scale.3 Wolfe and colleagues5 have advocated using Barthel scores to generate ratings on the Rankin Scale and shown that this improves reliability. However, this approach can be applied only to the lower outcome categories of the modified Rankin Scale that relate to the basic ADL assessed by the Barthel Index.
The Glasgow Outcome Scale6 is similar in concept to the Rankin Scale, and the problem of impressionistic use of the Glasgow Outcome Scale has been addressed through the use of a structured interview.7 The purpose of the present study was to develop a structured interview for the modified Rankin Scale and to compare this form of assessment with the conventional application of the scale. We wanted to investigate whether using a structured interview could increase agreement between raters on the modified Rankin Scale. We carried out 2 interrater reliability studies with the same patients 8 weeks apart, the first with the conventional scale and the second with the structured interview. To reduce the possibility of change between the first and second assessments, only patients who had suffered stroke ≥6 months previously were included in the study.
| Methods |
|---|
15 minutes to administer.
Participants
The study was confined to patients surviving stroke by ≥6 months. Study inclusion criteria were as follows: age ≥18 years; diagnosis of stroke 6 to 24 months previously; living at home, living in an institution, and/or attending outpatient clinics; and ability to respond appropriately to interview in English. Excluded from the study were patients with terminal cancer, seizure disorder, dementia, substance or alcohol abuse, and major organ failure (unstable cardiopulmonary function, impaired hepatic or renal function resulting in episodic alterations in functional ADL); those unable and/or unlikely to comprehend and follow the study protocol; and patients not contacted on the advice of their general practitioners. Informed consent was obtained for each study participant.
Procedure
Both raters were neurologists in training: rater 1 (T.B.) was a specialist registrar in neurology with 4 years of experience, and rater 2 (U.G.R.S.) was a senior house officer with 2 years of experience. Before beginning the study, the raters practiced applying the modified Rankin Scale in a stroke population. Patients were assessed on 2 occasions 8 weeks apart. On the first occasion, the 2 raters interviewed each patient independently and assigned a rating on the modified Rankin Scale. The raters were instructed not to confer about ratings of individual patients. Rankin grades were assigned immediately after the initial interview. After all patients had been assessed, the raters were trained to use the structured interview to assign Rankin grades. The patients were then recalled, and each patient was independently assessed by both raters with the structured interview.
Statistical Analysis
Strength of agreement between raters is described with the κ statistic that corrects for agreement by chance. When there are >2 points on an assessment, it is appropriate to use a weighted value (κw) to take into account the size of disagreements. To facilitate comparison with previous studies, we used quadratic weights for this analysis. Quadratic weights penalize extreme disagreements particularly heavily (differences are squared), and it has been shown that when weighted this way, κw is comparable to the intraclass correlation coefficient used for continuous measures.9 Brennan and Silman10 suggest the following interpretation of the κ statistic (weighted appropriately) for the agreement between clinical measures: 0 to 0.20=poor, 0.21 to 0.40=fair, 0.41 to 0.60=moderate, 0.61 to 0.80=good, and 0.81 to 1.00=very good. The 95% CIs for κw values are given. Ratings were also compared through the use of appropriate nonparametric tests.
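The weighted statistic described above can be illustrated with a minimal Python sketch. It is not the software used in the study (none is named in the text). The function names `cohen_kappa` and `agreement_band`, the coding of grades as integers starting at 0, and the example grades are all assumptions made for illustration. The quadratic disagreement weight is taken as (i - j)^2 / (k - 1)^2 for k categories, which is the usual textbook form.

```python
from typing import Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int],
                n_categories: int, quadratic: bool = True) -> float:
    """Cohen's kappa for two raters grading the same patients.

    Grades are integers 0..n_categories-1 (an assumed coding). With
    quadratic=True the disagreement weight is (i - j)^2 / (k - 1)^2;
    otherwise every disagreement counts equally (unweighted kappa).
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    if n_categories < 2:
        raise ValueError("need at least 2 categories")
    n, k = len(rater_a), n_categories

    # Cross-tabulate the two raters' grades.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    observed_disagreement = 0.0
    chance_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            if quadratic:
                weight = (i - j) ** 2 / (k - 1) ** 2
            else:
                weight = 0.0 if i == j else 1.0
            observed_disagreement += weight * observed[i][j] / n
            chance_disagreement += weight * row_totals[i] * col_totals[j] / n ** 2
    if chance_disagreement == 0:
        raise ValueError("kappa is undefined when every rating is the same grade")
    return 1.0 - observed_disagreement / chance_disagreement


def agreement_band(kappa: float) -> str:
    """Label a kappa value with the interpretation bands quoted in the text.

    Values below 0 are also labelled "poor" here; that is an assumption,
    since the quoted bands start at 0.
    """
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Hypothetical grades for 8 patients (not study data), 5 categories (0 to 4).
example_rater_1 = [0, 1, 2, 3, 4, 2, 3, 1]
example_rater_2 = [0, 1, 2, 3, 4, 3, 2, 1]
unweighted = cohen_kappa(example_rater_1, example_rater_2, 5, quadratic=False)
weighted = cohen_kappa(example_rater_1, example_rater_2, 5, quadratic=True)
print(round(unweighted, 2), agreement_band(unweighted))
print(round(weighted, 2), agreement_band(weighted))
```

In this made-up example the two disagreements are each by a single grade, so the quadratic weighting penalizes them lightly and the weighted value (about 0.92) is higher than the unweighted value (0.68). The same pattern appears in the results reported here, where the weighted values exceed the unweighted ones.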
| Results |
|---|
The 58 patients (31 men) who took part in both assessments were between 37 and 90 years of age (mean, 68.3 years; SD, 10.95 years). The first assessment took place 6 to 24 months after stroke (mean, 17.1 months; SD, 5.2 months).
Ratings from the first interview, in which the modified Rankin Scale was applied in the conventional manner, are given in Table 2. The overall agreement between raters was 57%; the unweighted κ statistic was 0.44, and κw was 0.78 (95% CI, 0.53 to 1.0). In 8 cases, rater 1 rated patients less favorably than rater 2, and in 17 cases, rater 1 rated patients more favorably. Comparison of grades given by raters indicated a significant overall difference between observers (Wilcoxon Z=-1.98, P=0.048, 2-tailed test).
Ratings from the structured interview for the modified Rankin Scale are given in Table 3. The overall agreement between raters was 78%, κ=0.70, and κw=0.93 (95% CI, 0.67 to 1.0). In 4 cases, patients were rated less favorably by rater 1, and in 9 cases, they were rated more favorably. There was no significant difference in the overall rankings assigned (Wilcoxon Z=-1.4, P=0.17).
To compare the studies, we analyzed disagreements between raters. There were 25 disagreements between raters in the first study and 13 in the second study. Rankin grades for rater 2 were subtracted from Rankin grades for rater 1, and the absolute differences with and without the structured interview were compared. The analysis showed that the extent of disagreement was less when the structured interview was used (Wilcoxon Z=-2.85, P=0.004).
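The comparison of absolute disagreements can be sketched as a paired Wilcoxon signed-rank test. The Python below is an illustration under stated assumptions, not the analysis code used in the study: it uses the large-sample normal approximation with a tie correction and no continuity correction, and it reports Z from the sum of positive ranks, so its sign convention may differ from the reported values. The function names and the example grades are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def two_sided_p(z: float) -> float:
    """Two-sided P value for a standard normal Z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def wilcoxon_signed_rank_z(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Paired Wilcoxon signed-rank test, normal approximation.

    Zero differences are dropped, tied absolute differences share the
    average rank, and the variance is corrected for ties.
    Returns (Z, two-sided P).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda idx: abs(diffs[idx]))
    ranks: List[float] = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        group_size = end - start + 1
        tie_term += group_size ** 3 - group_size
        start = end + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(variance)
    return z, two_sided_p(z)


# Hypothetical grades for 6 patients (not study data).
conventional_r1 = [2, 3, 1, 4, 2, 3]
conventional_r2 = [3, 3, 2, 2, 2, 4]
structured_r1 = [2, 3, 1, 4, 2, 3]
structured_r2 = [2, 3, 2, 3, 2, 3]

# Absolute disagreement per patient under each method, then a paired test.
abs_conventional = [abs(a - b) for a, b in zip(conventional_r1, conventional_r2)]
abs_structured = [abs(a - b) for a, b in zip(structured_r1, structured_r2)]
z, p = wilcoxon_signed_rank_z(abs_conventional, abs_structured)
print(z, p)
```

As a consistency check, `two_sided_p` applied to the Z values reported here reproduces the reported P values to the precision given (for example, Z=-1.98 gives P of about 0.048 and Z=-2.85 gives P of about 0.004).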
The overall distributions of ratings for each rater (given in the "total" columns in Tables 2 and 3) do not differ substantially between the 2 assessments. Although there were differences in the individual ratings between the 2 assessments (22 for rater 1, 19 for rater 2), only 1 difference was by >1 category. To test whether there was a significant shift in overall scoring for each rater, we compared the Rankin scores given without the structured interview with those obtained with the structured interview using the Wilcoxon test. For both observers, there was no overall difference in the ratings assigned on the 2 assessments (Wilcoxon Z=-1.8, P=0.072 for rater 1; Z=-0.69, P=0.491 for rater 2). The 2 assessments were also highly correlated (Spearman's correlation, 0.82, P<0.001 for rater 1; Spearman's correlation, 0.90, P<0.001 for rater 2).
| Discussion |
|---|
The findings of the present study are consistent with previous studies of the reliability of the modified Rankin Scale.3,5,11 The interrater reliability of conventional assessment with the modified Rankin Scale is satisfactory but nonetheless open to improvement. Direct comparison of the present findings with previous reports is complicated because the distributions of gradings differ. The recruitment criteria for the present study tended to eliminate the most mildly disabled and the most severely disabled groups. To allow comparison with the present study, we reanalyzed the study of Van Swieten et al,3 confining analysis to 67 patients in Rankin categories 0 to 4. This analysis yielded an overall agreement of 61%, a κ of 0.49, and a κw of 0.80 (quadratic weights). Wolfe et al5 reported values of κw ranging from 0.75 to 0.96 for interobserver agreement on the modified Rankin Scale, and Bamford et al11 gave a value of 0.72 for the version of the Rankin Scale used in the Oxfordshire Community Stroke Project. The reliability of the conventional Rankin Scale in the present study is thus very similar to previous reports. In agreement with the study of Wolfe et al,5 the present study also demonstrates that significant bias may be present even when the κ value is satisfactory. Wolfe et al reported systematic differences in the overall rankings produced by their 3 raters.
Limitations of the present study are that only 2 raters were used, and both came from similar professional backgrounds. In large clinical trials, multiple observers contribute data, and 2 observers do not represent this situation. Using multiple observers with different backgrounds may lead to greater divergence in the application of the conventional modified Rankin Scale (ie, lower reliability), and there could consequently be a larger effect of introducing a standardized procedure. In the present study, it was not possible to counterbalance order of assessment with the conventional and structured interview, because exposure to the structured interview will inevitably affect the style of approach adopted in the conventional assessment. Once exposed to the structured interview, raters may simply continue to ask the same questions when asked to assess patients on the conventional scale. It is possible that some of the reduction in variability is due to a practice effect for raters or another time difference. However, interrater reliability for the conventional assessment obtained in our study was similar to previous reports, and interrater reliability with the structured interview was similar to that found for the use of a structured interview for the Glasgow Outcome Scale in head-injured patients.7 Our results are therefore comparable with previous studies and suggest that major differences resulting from practice are unlikely. Further study could define the relevance of any time effects and investigate interrater reliability when the assessment is used by multiple raters from different professional backgrounds. After the study, the raters were debriefed in detail, and reasons for disagreements were identified. We have subsequently developed a set of detailed guidelines for the interview and a video for use in training raters.
Training raters is particularly likely to be of importance in multicenter clinical trials, and raters should have the opportunity to observe an interview being conducted and record responses themselves. It may also be desirable to have an accreditation process to ensure that observers are applying the scale appropriately.
Van Swieten et al3 noted that further improvement in the Rankin Scale should be possible, and our results confirm that it is possible to reduce variability in ratings. The present findings probably have most relevance to the conduct of multicenter clinical trials. Use of the modified Rankin Scale as a functional assessment in clinical trials is supported by a recent analysis,12 and it is likely to remain a popular end point. Demonstrating convincing treatment effects in acute stroke trials has proved to be a challenge, and it has been suggested that use of less-than-optimal methods to measure outcome may be responsible for problems of inconsistent findings.2 Choi and colleagues13 have demonstrated that misclassification not only reduces the power of a clinical trial but also reduces the size of the observed treatment effect on dichotomous outcomes. The present investigation shows that use of a structured interview for the modified Rankin Scale may help to reduce variation between raters and improve the quality of results in clinical studies.
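The effect of misclassification on a dichotomous outcome can be illustrated with a small worked example. This is a generic sketch of nondifferential misclassification, not a reproduction of the analysis by Choi and colleagues. The sensitivity and specificity of the rating, the function names, and the outcome proportions are all assumed values chosen for illustration. Under these assumptions, the observed difference between arms equals the true difference multiplied by (sensitivity + specificity - 1).

```python
def observed_proportion(true_proportion: float, sensitivity: float,
                        specificity: float) -> float:
    """Proportion classified as 'good outcome' when grading is imperfect.

    True good outcomes are recognised with probability `sensitivity`;
    true poor outcomes are wrongly graded good with probability
    (1 - specificity).
    """
    return (sensitivity * true_proportion
            + (1.0 - specificity) * (1.0 - true_proportion))


def observed_difference(treated: float, control: float,
                        sensitivity: float, specificity: float) -> float:
    """Observed treatment effect (difference in proportions) under the
    same misclassification rates in both arms."""
    return (observed_proportion(treated, sensitivity, specificity)
            - observed_proportion(control, sensitivity, specificity))


# Hypothetical trial: 50% good outcome on treatment vs 40% on control,
# with 90% sensitivity and 90% specificity of the outcome rating.
true_effect = 0.50 - 0.40
seen_effect = observed_difference(0.50, 0.40, sensitivity=0.90, specificity=0.90)
print(true_effect, seen_effect)
```

With these made-up numbers, a true 10-percentage-point benefit is observed as an 8-point benefit, which shows how rating error can shrink the apparent treatment effect as well as add noise.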
| Footnotes |
|---|
Received October 31, 2001; revision received May 6, 2002; accepted May 10, 2002.