Reliability (Inter-rater Agreement) of the Barthel Index for Assessment of Stroke Survivors
Systematic Review and Meta-analysis
Background and Purpose—The Barthel Index (BI) is a 10-item measure of activities of daily living that is frequently used in clinical practice and as a trial outcome measure in stroke. We sought to describe the reliability (interobserver variability) of the standard BI in stroke cohorts using systematic review and meta-analysis of published studies.
Methods—Two assessors independently searched various multidisciplinary electronic databases from inception to April 2012 inclusive. Inclusion criteria comprised: original research, human stroke participants, and inter-rater reliability data on equivalent methods of BI administration. Manuscripts were reviewed against prespecified inclusion criteria. Primary outcome for meta-analysis was reliability, measured by weighted κ (κw).
Results—From 20 210 titles, 306 abstracts were reviewed, 12 studies met inclusion criteria, and 10 were included in meta-analysis (n=543 participants; range of participants in studies, 7–21). There was substantial clinical heterogeneity with respect to method of BI application, population studied, and assessors. Two papers were graded high quality. Overall interobserver reliability of standard administration of the BI was excellent (κw, 0.93; 95% confidence interval, 0.90–0.96, random effects modeling).
Conclusions—The BI has excellent inter-rater reliability for standard administration after stroke. However, included studies were modest in size, with clinical heterogeneity and variable methodological quality. Despite these limitations, standard BI seems an appropriate outcome measure for stroke trials and practice.
The Barthel Index (BI), originally described in 1955 by Dr Florence Mahoney and Dorothea Barthel,1 is a 10-item measure of activities of daily living. In stroke medicine, BI is used in clinical practice to assess baseline abilities, to quantify functional change after rehabilitation, and to inform discharge planning.2 BI is also a frequently used functional outcome measure for clinical stroke trials, second only to modified Rankin Scale (mRS) in prevalence.3
To draw meaningful conclusions from clinical trials, robust outcome measures are required. Classical test theory describes important clinimetric properties of scales—reliability, validity, and responsiveness.4 Various measures of reliability are described: internal consistency, intraobserver variability, and interobserver variability. It is the latter that is most important for a clinical trial where multiple observers will be using the test.
Certain stroke outcome measures are limited by poor reliability. A recent review of mRS described potential for important interobserver variability when traditional assessment is used.5 Interobserver variability has implications for the validity of trial results; poor reliability in outcomes assessment implies a degree of misclassification and risks drawing erroneous conclusions from clinical trial end points.6,7
We sought to collate and synthesize the published evidence describing the interobserver variability of standard administration BI in stroke survivors, using systematic review and meta-analysis.
Two clinician researchers (L.D., S.G.) with a background in stroke medicine independently reviewed the published literature. There are no published guidelines specific to systematic review of psychometric data; throughout the process, we adhered to reporting and conduct guidance based on Meta-analysis of Observational Studies in Epidemiology (MOOSE)8 and Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA).9 Search strategy, inclusion criteria, and analysis strategy were prespecified and documented in a review protocol (available from the authors on request).
Eligibility Criteria and Study Selection
All eligibility assessments were performed independently and blinded by 2 reviewers (L.D., T.Q.). Disagreements between reviewers were resolved by consensus.
Study population was any age, human, stroke survivors. We included all clinical and pathological stroke subtypes (ischemic, hemorrhagic, transient ischemic attack). Where study population was mixed stroke and nonstroke, we included the study if population of stroke survivors was greater than an arbitrary cut point of 85%. We used no restrictions relating to the BI assessor and specifically did not exclude studies on the basis of background or training of observers.
All studies purporting to measure BI reliability through patient interview (interobserver variability of BI scoring) were reviewed with no specific restriction on the basis of study design, intervention, or language. We excluded studies that compared differing BI measures, for example, reliability of structured telephone questionnaire versus unstructured face-to-face assessment. We did not prespecify exclusion criteria relating to numbers of participants or assessors.
We applied no restrictions on the basis of BI assessment methodology. We prespecified subgroup analysis based on method of BI assessment (face to face, telephone). We recognize that several scales use the label BI; we included all papers that purported to assess BI, but our primary analysis was limited to those scales based on the original 10-item scale—standard BI.1
We devised a sensitive search strategy with assistance from an information scientist. We interrogated a comprehensive battery of cross-discipline electronic databases from inception to April 2012 inclusive: ARIF (University of Birmingham), Cochrane Database of Systematic Reviews (Cochrane Library), CINAHL (EBSCO), EMBASE (OVID), EMBASE Classic (OVID), LILACS (BIREME), Google Scholar (Google), MEDION (Universities of Maastricht and Leuven), MEDLINE (OVID), OLDMEDLINE (OVID), and Web of Science (Thomson Reuters).
We formulated keywords using National Library of Medicine Medical Subject Headings (MeSH) and study-specific terms; keywords were designed to be as inclusive as possible. We created search terms according to the key concepts of: (1) functional assessment, (2) stroke, and (3) reliability assessment. Where the electronic database allowed, we used explosion features for keyword searching (Figure 1).
In addition to the electronic database search, we handsearched contemporary reviews, key reference works,10,11 and high impact journals in the field of stroke medicine (Age and Ageing, Cerebrovascular Diseases, International Journal of Stroke, Journal of the American Geriatrics Society, Stroke). To identify studies not yet in print, we handsearched proceedings of scientific meetings for the period January 2010 to May 2012 (American Stroke Association—International Stroke Conference; European Stroke Conference; World Stroke Organization—World Stroke Congress). We searched bibliographies of all retrieved articles for further references, and the process was repeated until no new articles were found.
From the list of included titles, we reviewed abstracts for appropriateness to the study question. We retrieved the full text of any article that either reviewer suspected may be relevant. Where potentially relevant data were not available in the published manuscript, we attempted electronic contact with the original authors. For those studies not published in English language, translation services were used.
We assessed internal and external validity of the search. A third researcher (T.Q.) checked a random selection of 1000 titles for inclusion and results were compared with the researchers’ consensus inclusion list. Before formulating the search strategy, we selected 5 papers relevant to BI properties.1,2,12–14 We cross-checked our list of titles from literature searching to ensure all papers were included. Our search strategy was developed independent of and blinded to the choice of 5 titles for validation.
We extracted data to a prespecified and piloted proforma. Data extraction was performed independently and blinded by 2 researchers (L.D., T.Q.). Disagreement was resolved by discussion and consensus.
Quality Assessment and Risk of Bias
We assessed study quality and risk of bias using a prespecified bespoke tool based on the Guidelines for Reporting Reliability and Agreement studies (GRRAS).15 Important elements were: description of the study population (recruitment, demographics, stroke severity); description of the assessors (number, occupation, experience with functional assessment, or BI); description of time between first and second assessment; description of methodology of BI administration; description of training of the assessor in BI; description of sample size calculation; and description of blinding.
Each category was scored high quality if it was described in sufficient detail in the paper and carried no potential for bias. A priori, we defined a high-quality paper as one scoring high quality on ≥5 of the 7 criteria. We piloted the tool using 2 papers and revised it accordingly. Quality assessment was performed independently and blinded by 2 researchers (L.D., T.Q.), with disagreement resolved by discussion and consensus.
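As a concrete illustration, the ≥5-of-7 grading rule described above can be sketched in a few lines of code. The criterion labels and data layout below are hypothetical paraphrases of the text, not the review's actual proforma:

```python
# Hypothetical sketch of the 7-criterion, GRRAS-based grading rule:
# a paper is graded high quality when >= 5 of the 7 criteria score high.
CRITERIA = [
    "study population described",
    "assessors described",
    "time between assessments described",
    "BI administration method described",
    "assessor training in BI described",
    "sample size calculation described",
    "blinding described",
]

def is_high_quality(scores: dict) -> bool:
    """scores maps each criterion to True (scored high) or False."""
    if set(scores) != set(CRITERIA):
        raise ValueError("expected a score for each of the 7 criteria")
    return sum(scores.values()) >= 5
```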
Reliability of BI is traditionally described using one of the following: κ statistics (which may be weighted to account for the degree of difference between observers [κw]), the intraclass correlation coefficient (ICC), the Bland–Altman method, or percentage agreement between observers. We included any of these descriptors and derived others where the data allowed. ICC equates to κw when quadratic weighting is applied. κ statistics range from κ=0.00 (no agreement other than that expected by chance) to κ=1.00 (complete agreement). We used standard definitions of poor (κ=0.00–0.20), fair (κ=0.21–0.40), moderate (κ=0.41–0.60), good (κ=0.61–0.80), and very good (κ=0.81–1.00) agreement.16
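To make the κw definition concrete, the sketch below computes a quadratic-weighted κ for two raters scoring the same patients on an ordinal scale. This is a minimal pure-Python illustration of the standard formula; the toy score values are invented and do not come from any included study:

```python
from collections import Counter

def weighted_kappa(rater_a, rater_b, categories):
    """Quadratic-weighted kappa for two raters' ordinal scores."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Observed joint distribution of the two raters' scores
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[index[a]][index[b]] += 1 / n
    pa, pb = Counter(rater_a), Counter(rater_b)
    # Quadratic disagreement weights: w_ij = ((i - j) / (k - 1))**2,
    # so kappa_w = 1 - (observed / chance-expected weighted disagreement)
    num = den = 0.0
    for i, ci in enumerate(categories):
        for j, cj in enumerate(categories):
            w = ((i - j) / (k - 1)) ** 2
            num += w * obs[i][j]
            den += w * (pa[ci] / n) * (pb[cj] / n)
    return 1 - num / den

# Identical scores -> complete agreement
scores = [0, 5, 10, 15, 20, 20, 10]
print(weighted_kappa(scores, scores, [0, 5, 10, 15, 20]))  # 1.0
```

A single one-step disagreement between raters reduces κw only slightly under quadratic weighting, which is one reason κw and ICC tend to look favorable on coarse ordinal scales.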
For meta-analysis, we used a single-group descriptive analysis. We described standardized differences and corresponding 95% confidence intervals (95% CIs). We present both fixed and random effects model data. We assessed consistency (heterogeneity) by visual inspection of study data plots. We planned to explore potential publication bias by plotting κ against the inverse of its standard error to allow visual inspection of symmetry. All calculations were performed using Comprehensive Meta-analysis software (Version 2; Biostat, Englewood, NJ).
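The analysis itself was run in Comprehensive Meta-analysis software; as an illustration of what fixed- and random-effects pooling of per-study estimates involves, here is a minimal inverse-variance / DerSimonian–Laird sketch. The study estimates and standard errors shown are invented for illustration, not the review's data:

```python
import math

def pool_estimates(estimates, std_errors):
    """Inverse-variance fixed-effect and DerSimonian-Laird random-effects
    pooling of per-study point estimates (here, kappa-w values)."""
    w = [1 / se ** 2 for se in std_errors]
    fixed = sum(wi * e for wi, e in zip(w, estimates)) / sum(w)
    # Cochran's Q and between-study variance tau^2 (DerSimonian-Laird)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, estimates))
    df = len(estimates) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights add tau^2 to each study's variance
    w_re = [1 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * e for wi, e in zip(w_re, estimates)) / sum(w_re)
    se_re = math.sqrt(1 / sum(w_re))
    ci = (pooled - 1.96 * se_re, pooled + 1.96 * se_re)
    return fixed, pooled, ci

# Hypothetical per-study kappa-w values and standard errors
fixed, random_eff, (lo, hi) = pool_estimates([0.95, 0.90, 0.99],
                                             [0.02, 0.03, 0.01])
```

When between-study heterogeneity is small, tau² shrinks toward zero and the two models converge, which is consistent with the similar fixed- and random-effects κw reported in the Results.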
From 20 210 original titles, 306 abstracts were eligible for review, 35 papers were initially considered for inclusion, and 12 studies involving 627 patients met our inclusion criteria.17–28 Of these, 10 (n=543 subjects) had data that allowed meta-analysis.17–26 Study populations were exclusively stroke survivors. Four reports required translation (German, Dutch, and Portuguese).17–20 We contacted 3 authors for additional information and received additional data from 2. Included studies were from various countries (n=9), using assessors of differing background and experience and assessing BI at various time points poststroke (range, 14 days to >1 year; Table 1).
Internal validity of our search strategy was confirmed as an independent third reviewer found no titles other than those already selected by the original 2 researchers. External validity of our search strategy was confirmed as all prespecified titles were included in the search results.
Variability in BI scoring was described using various measures: κ statistics (n=3), κw (n=6), ICC (n=4), and percentage agreement (n=2). Interobserver variability of BI varied from near perfect (κw=0.99) to good (κ=0.62) in the original descriptions (Table 2). To make use of the largest dataset, quadratic κw and ICC scores were combined for the meta-analysis. For those studies that described results in terms of κ or percentage agreement, insufficient data were presented in the original reports to allow back derivation of κw or ICC.
Meta-analysis suggested overall very good reliability for standard face-to-face BI (κw, 0.95; 95% CI, 0.94–0.96, fixed effects modeling; κw, 0.93; 95% CI, 0.90–0.96, random effects modeling); there were insufficient suitable data to perform meta-analysis for other methods of BI assessment (Figure 2).
Visual inspection suggested modest statistical heterogeneity. There was substantial clinical heterogeneity across the included studies, particularly with respect to BI assessors, population studied, and numbers included. Our quality assessment tool described 2 papers as high quality. The most common limitations in published descriptions of studies were lack of sample size calculation (n=11), poor description of blinding (n=10), and poor description of BI training (n=7) (Figure 3). Given the small number of included studies in the meta-analysis with limited spread of standard error, we did not use funnel plot asymmetry to assess for publication bias.
Certain studies described reliability across individual items of the BI (n=7). The items bladder and bowel continence were consistently rated as having very good reliability (Table 3).
We performed 1 post hoc subanalysis comparing papers in which raters were described as having training in BI application and administration (n=5 papers, 4 suitable for meta-analysis)19,20,22,26 with papers that gave no description of BI training (n=7, 6 suitable for meta-analysis).17,18,21,23–25 Meta-analysis (random effects model) suggested a strong trend toward improved reliability in papers describing BI training: κw (trained), 0.95 (95% CI, 0.92–0.98); κw (no training), 0.91 (95% CI, 0.89–0.92).
We studied published reports of reproducibility of BI assessment scores. Our results suggest excellent interobserver reliability of the standard BI as a stroke outcome measure.
The reliability of BI is often quoted as a particular strength of this outcome measure for use as a stroke trial end point.29 A previous systematic review of the reliability of BI is available, but this study excluded stroke survivors.12 We present the first systematic review of the interobserver reliability of BI in stroke. Our review findings are broadly similar to the nonstroke analysis12—published accounts describe good reliability of BI but clinical heterogeneity and variable study quality preclude any definitive statements.
The preferred functional outcome measure in stroke research is the mRS.3 Interobserver variability in mRS is perceived as a potential limitation of this measure, although use of novel administration techniques may improve mRS reliability.30 Based on our data, BI would seem the more reliable outcome measure, although differences are modest: κw=0.95 for BI, compared with κw=0.90 for standard mRS in a recent meta-analysis.5 Reliability is only one of the important clinimetric properties to be considered when choosing a functional outcome measure. Other favorable clinimetric properties of BI include various measures of validity and acceptability.2 However, BI performs less well for other important properties, most notably responsiveness to clinical change and well-recognized floor and ceiling effects.2 In this respect, mRS remains preferable to BI14 and we would not suggest replacing mRS as the study end point of choice. As the 2 scales measure differing constructs, activities of daily living (BI) and global activity with a focus on walking (mRS), measurement of both may have some utility.
The utility of meta-analysis is dependent on the quality of the included papers. In this regard, our analysis is limited by the substantial clinical heterogeneity between included papers and by potential for bias. To limit heterogeneity, we focused on studies that compared equivalent BI administration methods. We recognize that many good studies have described the reliability of differing BI techniques, for example, structured telephone versus face-to-face interview.31 Study quality and potential for bias were a concern: only 2 of the included papers were graded as high quality, and no paper fulfilled all of our prespecified quality criteria. Specific reporting guidance for reliability studies is relatively new, and we encourage future studies to make use of this resource.15 Our data were not suited to conventional assessment of publication bias, and it is possible that the favorable reliability described could be, in part, because of publication bias. However, unlike intervention trials, there is no reason to suspect that the results of a psychometric study would affect its chance of publication.
It has been demonstrated that training improves consistency in application of stroke assessment scales.32 Training and certification resources are available for common stroke scales (mRS; National Institutes of Health Stroke Scale).30,32 These are recognized by regulatory authorities and have become industry standard for large-scale trials. Our subanalysis suggested that training may be associated with improved reliability of BI. This analysis comes with certain caveats: papers may have used training but not described it in the published report. The difference in reliability between trained and untrained raters was modest and may not be clinically important given that both groups still had excellent reliability. Nonetheless, we would encourage the use of standardized training in BI.33
A strength of our analysis was the robust assessment of available data sources. We considered a number of reports from non-English and nonmedical sources, and several non-English language reports were included in the final analysis. Our literature searching strategy was as comprehensive and systematic as possible. The spread of reliability estimates obtained suggests no overt publication bias.
There were limitations to our study methodology. No universally accepted method for the analysis of multiple κ statistics from differing populations has been described. Analysis of κ statistics is subject to numerous caveats.14 In general, the reliability of a tool is a function of the tool itself, the subjects, the assessor, and the setting, and so κ will vary with population size and number of assessors. To give a summary of reliability across multiple studies, we performed a meta-analysis that made the fewest assumptions of the data. We use this to give a summary estimate but recognize that some statisticians do not feel κ statistics should be pooled. Similarly, our analyses of heterogeneity and publication bias were those that made the fewest assumptions of the data.
If we wish to improve our understanding of how BI performs as a clinical and trial outcome measure in contemporary practice, the ideal study methodology would involve a large series of trained observers of differing backgrounds, from differing international centers, assessing BI using a standard methodology on unselected patients at a predetermined time point after discharge. The importance of studying the properties of a scale in a setting that mirrors clinical practice can be seen from review studies describing mRS reliability. Many of the published studies reporting reasonable mRS reliability are single center, with small numbers of raters and stroke survivors.5 However, interobserver variation is considerable in those studies that are multicenter, with multiple raters from different professional backgrounds and large numbers of stroke survivors.33 This is the situation that more closely mirrors a contemporary stroke intervention study. We found no published papers with a design similar to this ideal. Thus, suitably powered, multicenter studies of BI properties would still provide useful data.
Our data suggest excellent reliability of BI as a stroke outcome measure across the published literature, although studies included in our meta-analysis may not be representative of contemporary, multicenter stroke trials. As there was a suggestion of improved reliability with BI training, we would encourage use of standardized methods of BI administration or training.34
Dr Quinn has assisted with creation of training resources for stroke scales and has received payment from Training Campus to support this work. The other authors have no conflicts to report.
- Received September 28, 2012.
- Revision received October 28, 2012.
- Accepted November 21, 2012.
- © 2013 American Heart Association, Inc.
34. University of Glasgow. Training Campus. The Barthel index of activities of daily living. English program homepage. Available at: http://barthel-english.trainingcampus.net. 2011. Accessed July 31, 2012.