Improving the Assessment of Outcomes in Stroke
Use of a Structured Interview to Assign Grades on the Modified Rankin Scale
Background and Purpose— The modified Rankin Scale is widely used to assess changes in activity and lifestyle after stroke, but it has been criticized for its subjectivity. The purpose of the present study was to compare conventional assessment on the modified Rankin Scale with assessment through a structured interview.
Methods— Sixty-three patients with stroke 6 to 24 months previously were interviewed and graded independently on the modified Rankin Scale by 2 observers. These observers then underwent training in use of a structured interview for the scale that covered 5 areas of everyday function. Eight weeks after the first assessment, the same observers reassessed 58 of these patients using the structured interview.
Results— Interrater reliability was measured with the κ statistic (weighted with quadratic weights). For the scale applied conventionally, overall agreement between the 2 raters was 57% (κw=0.78); 1 rater assigned significantly lower grades than the other (P=0.048). On the structured interview, the overall agreement between raters was 78% (κw=0.93), and there was no overall difference between raters in grades assigned (P=0.17). Rankin grades from the conventional assessment and the structured interview were highly correlated, but there was significantly less disagreement between raters when the structured interview was used (P=0.004).
Conclusions— Variability and bias between raters in assigning patients to Rankin grades may be reduced by use of a structured interview. Use of a structured interview for the scale could potentially improve the quality of results from clinical studies in stroke.
The Rankin Scale1 has wide acceptance as a measure of functional outcome after stroke and has become one of the most popular end points for clinical trials.2 The version of the scale most commonly used in trials is the modified Rankin Scale,3 which is a simple 6-point assessment that includes reference to both limitations in activity and changes in lifestyle.
The reliability of the modified Rankin Scale has been investigated and found to be satisfactory.3 However, comparison with the Barthel Index4 indicates lower levels of interrater agreement for the modified Rankin Scale5 and suggests that some raters may systematically assign higher or lower Rankin grades than others. The descriptions given for the categories of the modified Rankin Scale are broad and open to subjective interpretation. Walking is the only explicit criterion for assessment mentioned, and even for this criterion, it is not specified whether someone requiring an aid should be considered able to walk. It is therefore left open to raters to develop idiosyncratic criteria or to apply the scale in an impressionistic manner. Discrepancies are particularly striking for Rankin grades 2, 3, and 4, and it has been suggested that interviewers should use a checklist of activities of daily living (ADL) to produce greater uniformity in the application of the scale.3 Wolfe and colleagues5 have advocated using Barthel scores to generate ratings on the Rankin Scale and shown that this improves reliability. However, this approach can be applied only to the lower outcome categories of the modified Rankin Scale that relate to the basic ADL assessed by the Barthel Index.
The Glasgow Outcome Scale6 is similar in concept to the Rankin Scale, and the problem of impressionistic use of the Glasgow Outcome Scale has been addressed through the use of a structured interview.7 The purpose of the present study was to develop a structured interview for the modified Rankin Scale and to compare this form of assessment with the conventional application of the scale. We wanted to investigate whether using a structured interview could increase agreement between raters on the modified Rankin Scale. We carried out 2 interrater reliability studies with the same patients 8 weeks apart, the first with the conventional scale and the second with the structured interview. To reduce the possibility of change between the first and second assessments, only patients who had suffered stroke ≥6 months previously were included in the study.
The structured interview (Table 1) differs from the conventional guided interview for the modified Rankin Scale by defining specific questions to grade each category. The structured interview developed for the study consists of 5 sections: (1) constant care, (2) basic ADL, (3) instrumental ADL, (4) limitations in participation in usual social roles, and (5) checklist for the presence of common stroke symptoms. Items for inclusion in the interview were selected after review of outcome assessments used in stroke and focus groups held with stroke patients. An initial draft of the interview was piloted before the final version for the study was produced. Section 2 of the interview is based on items from the Barthel Index and follows the definitions provided by Collin et al.8 Sections 3 and 4 are adapted from the structured interview for the extended Glasgow Outcome Scale.7 Unlike a questionnaire, a structured interview allows reformulation of questions to suit the particular circumstances. In each section, restrictions in activities before stroke are recorded, and raters are instructed to discount preexisting limitations in the final rating. Raters were encouraged to interview a relative when possible and to base their assessments on ability to do the task rather than performance. Further details concerning the principles involved in administering a structured interview are given elsewhere.7 The structured interview took ≈15 minutes to administer.
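The grading logic implied by the 5 sections can be illustrated with a minimal sketch. It assumes, as this text does not state explicitly, that each section corresponds to one Rankin grade in descending order of severity (constant care to grade 5, down to symptoms only at grade 1) and that a limitation already present before the stroke is simply skipped. The names `SectionAnswer`, `SECTION_GRADES`, and `assign_grade` are hypothetical and are not part of the published interview.

```python
from dataclasses import dataclass


@dataclass
class SectionAnswer:
    limited_now: bool              # limitation reported at the interview
    limited_before: bool = False   # same limitation already present before stroke


# ASSUMPTION: one grade per section, checked from most to least severe.
SECTION_GRADES = [
    ("constant_care", 5),
    ("basic_adl", 4),
    ("instrumental_adl", 3),
    ("social_roles", 2),
    ("symptoms", 1),
]


def assign_grade(answers):
    """Return an illustrative Rankin grade (0 to 5) from per-section answers."""
    for section, grade in SECTION_GRADES:
        answer = answers.get(section)
        # Preexisting limitations are discounted, so they do not set the grade.
        if answer is not None and answer.limited_now and not answer.limited_before:
            return grade
    return 0  # no new limitation or symptom recorded


# Hypothetical patient: needed help with shopping before the stroke,
# and now also has reduced participation in social roles.
example = {
    "instrumental_adl": SectionAnswer(limited_now=True, limited_before=True),
    "social_roles": SectionAnswer(limited_now=True),
}
print(assign_grade(example))
```

In this sketch the preexisting instrumental ADL limitation is passed over, so the grade is decided by the first section with a new limitation.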
The study was confined to patients surviving stroke by ≥6 months. Study inclusion criteria were as follows: age ≥18 years; diagnosis of stroke 6 to 24 months previously; living at home, living in an institution, and/or attending outpatient clinics; and ability to respond appropriately to interview in English. Excluded from the study were patients with terminal cancer, seizure disorder, dementia, substance or alcohol abuse, or major organ failure (unstable cardiopulmonary function, impaired hepatic or renal function resulting in episodic alterations in functional ADL); those unable and/or unlikely to comprehend and follow the study protocol; and patients not contacted on the advice of their general practitioners. Informed consent was obtained for each study participant.
Both raters were neurologists in training: rater 1 (T.B.) was a specialist registrar in neurology with 4 years of experience, and rater 2 (U.G.R.S.) was a senior house officer with 2 years of experience. Before beginning the study, the raters practiced applying the modified Rankin Scale in a stroke population. Patients were assessed on 2 occasions 8 weeks apart. On the first occasion, the 2 raters interviewed each patient independently and assigned a rating on the modified Rankin Scale. The raters were instructed not to confer about ratings of individual patients. Rankin grades were assigned immediately after the initial interview. After all patients had been assessed, the raters were trained to use the structured interview to assign Rankin grades. The patients were then recalled, and each patient was independently assessed by both raters with the structured interview.
Strength of agreement between raters is described with the κ statistic that corrects for agreement by chance. When there are >2 points on an assessment, it is appropriate to use a weighted value (κw) to take into account the size of disagreements. To facilitate comparison with previous studies, we used quadratic weights for this analysis. Quadratic weights penalize extreme disagreements particularly heavily (differences are squared), and it has been shown that when weighted this way, κw is comparable to the intraclass correlation coefficient used for continuous measures.9 Brennan and Silman10 suggest the following interpretation of the κ statistic (weighted appropriately) for the agreement between clinical measures: 0 to 0.20=poor, 0.21 to 0.40=fair, 0.41 to 0.60=moderate, 0.61 to 0.80=good, and 0.81 to 1.00=very good. The 95% CIs for κw values are given. Ratings were also compared through the use of appropriate nonparametric tests.
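As an illustration of the statistic described above, the following sketch computes κ from a square table of counts, with no weights, linear weights, or quadratic weights, and maps a value onto the Brennan and Silman bands. The function names `kappa` and `interpret_kappa` and the example table are hypothetical; the table is not data from this study, and treating negative values as "poor" is an assumption.

```python
def kappa(table, weight="quadratic"):
    """Cohen's kappa from a square table of counts.

    Rows are rater 1 grades, columns are rater 2 grades.
    weight: "none" (unweighted), "linear", or "quadratic".
    """
    if weight not in ("none", "linear", "quadratic"):
        raise ValueError("weight must be 'none', 'linear', or 'quadratic'")
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(row[j] for row in table) for j in range(k)]

    def w(i, j):
        # Disagreement weight: 0 on the diagonal, larger further from it.
        if weight == "none":
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1)
        return d * d if weight == "quadratic" else d

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(w(i, j) * table[i][j] for i, j in cells) / n
    expected = sum(w(i, j) * row_tot[i] * col_tot[j] for i, j in cells) / (n * n)
    return 1.0 - observed / expected


def interpret_kappa(value):
    """Brennan and Silman bands; values below 0 are treated as 'poor' here."""
    if value <= 0.20:
        return "poor"
    if value <= 0.40:
        return "fair"
    if value <= 0.60:
        return "moderate"
    if value <= 0.80:
        return "good"
    return "very good"


# Hypothetical 3-grade table, for illustration only.
demo = [[10, 2, 0],
        [3, 12, 1],
        [0, 2, 10]]
print(kappa(demo, "none"), kappa(demo, "quadratic"))
```

Because quadratic weights penalize a 2-grade disagreement 4 times as heavily as a 1-grade disagreement, a table in which almost all disagreements lie next to the diagonal tends to give a quadratic κw well above the unweighted κ.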
Sixty-three patients were initially recruited into the study and took part in the first assessment, in which the modified Rankin Scale was applied in a conventional manner; 58 patients (92%) returned 8 weeks later for a second assessment with the Rankin structured interview. Reasons for loss to follow-up were death (n=1), serious illness (n=2), alcohol abuse (n=1), and failure to attend (n=1).
The 58 patients (31 men) who took part in both assessments were between 37 and 90 years of age (mean, 68.3 years; SD, 10.95 years). The first assessment took place 6 to 24 months after stroke (mean, 17.1 months; SD, 5.2 months).
Ratings from the first interview, in which the modified Rankin Scale was applied in the conventional manner, are given in Table 2. The overall agreement between raters was 57%; the unweighted κ statistic was 0.44, and κw was 0.78 (95% CI, 0.53 to 1.0). In 8 cases, rater 1 rated patients less favorably than rater 2, and in 17 cases, rater 1 rated patients more favorably. Comparison of grades given by raters indicated a significant overall difference between observers (Wilcoxon Z=−1.98, P=0.048, 2-tailed test).
Ratings from the structured interview for the modified Rankin Scale are given in Table 3. The overall agreement between raters was 78%, κ=0.70, and κw=0.93 (95% CI, 0.67 to 1.0). In 4 cases, patients were rated less favorably by rater 1, and in 9 cases, they were rated more favorably. There was no significant difference in the overall rankings assigned (Wilcoxon Z=−1.4, P=0.17).
To compare the studies, we analyzed disagreements between raters. There were 25 disagreements between raters in the first study and 13 in the second study. Rankin grades for rater 2 were subtracted from Rankin grades for rater 1, and the absolute differences with and without the structured interview were compared. The analysis showed that the extent of disagreement was less when the structured interview was used (Wilcoxon Z=−2.85, P=0.004).
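The Wilcoxon comparisons reported in this section can be sketched as follows. The source does not say which software or options were used, so the choices here are assumptions: zero differences are dropped, tied absolute differences receive average ranks, the variance is corrected for ties, and the P value comes from the normal approximation. The function names and all grade lists are hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (z, two-sided p) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        avg_rank = (start + end) / 2 + 1  # mean of 1-based ranks start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg_rank
        t = end - start + 1
        tie_term += t ** 3 - t
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48  # tie-corrected
    z = (w_plus - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


def abs_diffs(a, b):
    """Absolute grade differences between two raters, patient by patient."""
    return [abs(u - v) for u, v in zip(a, b)]


# Bias between raters: hypothetical grades with 45 agreements, 4 patients
# graded one step higher by rater 1, and 9 graded one step lower.
rater1 = [2] * 45 + [3] * 4 + [1] * 9
rater2 = [2] * 58
z_bias, p_bias = wilcoxon_signed_rank(rater1, rater2)
print(round(z_bias, 2), round(p_bias, 2))

# Extent of disagreement: made-up grades for the same 8 patients,
# assessed conventionally and with the structured interview.
conv1, conv2 = [2, 3, 1, 4, 2, 3, 0, 1], [3, 3, 2, 2, 2, 4, 0, 2]
si1, si2 = [2, 3, 1, 3, 2, 3, 0, 1], [2, 3, 2, 3, 2, 3, 0, 1]
z_ext, p_ext = wilcoxon_signed_rank(abs_diffs(conv1, conv2), abs_diffs(si1, si2))
print(z_ext, p_ext)
```

The first example mirrors the 4 and 9 discordant cases reported for the structured interview, under the further assumption, not stated in the text, that every disagreement was by exactly one grade. With that assumption the sketch gives roughly Z=−1.39 and P≈0.17, in line with the reported values.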
The overall distributions of ratings for each rater (given in the “total” columns in Tables 2 and 3) do not differ substantially between the 2 assessments. Although there were differences in the individual ratings between the 2 assessments (22 for rater 1, 19 for rater 2), only 1 difference was by >1 category. To test whether there was a significant shift in overall scoring for each rater, we compared the Rankin scores given without the structured interview with those obtained with the structured interview using the Wilcoxon test. For both observers, there was no overall difference in the ratings assigned on the 2 assessments (Wilcoxon Z=−1.8, P=0.072 for rater 1; Z=−0.69, P=0.491 for rater 2). The 2 assessments were also highly correlated (Spearman’s correlation, 0.82, P<0.001 for rater 1; Spearman’s correlation, 0.90, P<0.001 for rater 2).
The results for the conventionally applied modified Rankin Scale indicate good interrater reliability but also suggest that significant bias may be present. The structured interview for the modified Rankin Scale had very good interrater reliability, the extent of disagreement between raters was less, and significant bias was not present. Comparison of ratings on the structured interview and the conventional Rankin showed that they were highly correlated, indicating that the structured interview has satisfactory criterion validity when measured against the conventional Rankin as a standard.
The findings of the present study are consistent with previous studies of the reliability of the modified Rankin Scale.3,5,11 The interrater reliability of conventional assessment with the modified Rankin Scale is satisfactory but nonetheless open to improvement. Direct comparison of the present findings with previous reports is complicated because the distributions of gradings differ. The recruitment criteria for the present study tended to eliminate the most mildly disabled and the most severely disabled groups. To allow comparison with the present study, we reanalyzed the study of Van Swieten et al,3 confining analysis to 67 patients in Rankin categories 0 to 4. This analysis yielded an overall agreement of 61%, a κ of 0.49, and a κw of 0.80 (quadratic weights). Wolfe et al5 reported values of κw ranging from 0.75 to 0.96 for interobserver agreement on the modified Rankin Scale, and Bamford et al11 gave a value of 0.72 for the version of the Rankin Scale used in the Oxfordshire Community Stroke Project. The reliability of the conventional Rankin Scale in the present study is thus very similar to previous reports. In agreement with the study of Wolfe et al,5 the present study also demonstrates that significant bias may be present even when the κ value is satisfactory. Wolfe et al reported systematic differences in the overall rankings produced by their 3 raters.
Limitations of the present study are that only 2 raters were used, and both came from similar professional backgrounds. In large clinical trials, multiple observers contribute data, and 2 observers do not represent this situation. Using multiple observers with different backgrounds may lead to greater divergence in the application of the conventional modified Rankin Scale (ie, lower reliability), and there could consequently be a larger effect of introducing a standardized procedure. In the present study, it was not possible to counterbalance the order of the conventional assessment and the structured interview, because exposure to the structured interview will inevitably affect the style of approach adopted in the conventional assessment. Once exposed to the structured interview, raters may simply continue to ask the same questions when asked to assess patients on the conventional scale. It is possible that some of the reduction in variability is due to a practice effect for raters or to some other change over the 8 weeks between assessments. However, interrater reliability for the conventional assessment obtained in our study was similar to previous reports, and interrater reliability with the structured interview was similar to that found for the use of a structured interview for the Glasgow Outcome Scale in head-injured patients.7 Our results are therefore comparable with previous studies and suggest that major differences resulting from practice are unlikely. Further study could define the relevance of any time effects and investigate interrater reliability when the assessment is used by multiple raters from different professional backgrounds. After the study, the raters were debriefed in detail, and reasons for disagreements were identified. We have subsequently developed a set of detailed guidelines for the interview and a video for use in training raters.
Training raters is particularly likely to be of importance in multicenter clinical trials, and raters should have the opportunity to observe an interview being conducted and record responses themselves. It may also be desirable to have an accreditation process to ensure that observers are applying the scale appropriately.
Van Swieten et al3 noted that further improvement in the Rankin Scale should be possible, and our results confirm that it is possible to reduce variability in ratings. The present findings probably have most relevance to the conduct of multicenter clinical trials. Use of the modified Rankin Scale as a functional assessment in clinical trials is supported by a recent analysis,12 and it is likely to remain a popular end point. Demonstrating convincing treatment effects in acute stroke trials has proved to be a challenge, and it has been suggested that use of less-than-optimal methods to measure outcome may be responsible for problems of inconsistent findings.2 Choi and colleagues13 have demonstrated that misclassification not only reduces the power of a clinical trial but also reduces the size of the observed treatment effect on dichotomous outcomes. The present investigation shows that use of a structured interview for the modified Rankin Scale may help to reduce variation between raters and improve the quality of results in clinical studies.
This project was supported by a research grant from Pfizer UK to the University of Stirling. We would like to thank Peter J. Snyder, PhD, Robert Bagdorf, MD, and Michael Krams, MD, for their comments and support of the project.
Copies of the Structured Interview for the Modified Rankin Scale and accompanying notes can be obtained from the corresponding author or at http://www.stir.ac.uk/psychology/staff/JTLW1.
- Received October 31, 2001.
- Revision received May 6, 2002.
- Accepted May 10, 2002.
- Rankin J. Cerebral vascular accidents in patients over the age of 60: II. Prognosis. Scottish Med J. 1957; 2: 200–215.
- Duncan PW, Jorgensen HS, Wade DT. Outcome measures in acute stroke trials: a systematic review and some recommendations to improve practice. Stroke. 2000; 31: 1429–1438.
- van Swieten JC, Koudstaal PJ, Visser MC, Schouten HJ, van Gijn J. Interobserver agreement for the assessment of handicap in stroke patients. Stroke. 1988; 19: 604–607.
- Mahoney FI, Barthel DW. Functional evaluation: the Barthel index. Maryland State Med J. 1965; 14: 61–65.
- Wolfe CDA, Taub NA, Woodrow EJ, Burney PG. Assessment of scales of disability and handicap for stroke patients. Stroke. 1991; 22: 1242–1244.
- Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ. 1992; 304: 1491–1494.
- Broderick JP, Lu M, Kothari R, Levine SR, Lyden PD, Haley EC, Brott TG, Grotta J, Tilley BC, Marler JR, Frankel M. Finding the most powerful measures of the effectiveness of tissue plasminogen activator in the NINDS tPA Stroke Trial. Stroke. 2000; 31: 2335–2341.