Improving Interrater Agreement About Brain Microbleeds
Development of the Brain Observer MicroBleed Scale (BOMBS)
Background and Purpose— If the diagnostic and prognostic significance of brain microbleeds (BMBs) is to be investigated and applied in clinical practice, observer variation in BMB assessment must be minimized.
Methods— Two doctors used a pilot rating scale to describe the number and distribution of BMBs (round, low-signal lesions, <10 mm diameter on gradient echo MRI) among 264 adults with stroke or TIA. They were blinded to clinical data and their counterpart’s ratings. Disagreements were adjudicated by a third observer, who informed the development of a new Brain Observer MicroBleed Scale (BOMBS), which was tested in a separate cohort of 156 adults with stroke.
Results— In the pilot study, agreement about the presence of ≥1 BMB in any location was moderate (κ=0.44; 95% CI, 0.32–0.56), but agreement was worse in lobar locations (κ=0.44; 95% CI, 0.30–0.58) than in deep (κ=0.62; 95% CI, 0.48–0.76) or posterior fossa locations (κ=0.66; 95% CI, 0.47–0.84). Using BOMBS, agreement about the presence of ≥1 BMB improved in any location (κ=0.68; 95% CI, 0.49–0.86) and in lobar locations (κ=0.78; 95% CI, 0.60–0.97).
Conclusion— Interrater reliability concerning the presence of BMBs was moderate to good, and could be improved with the use of the BOMBS rating scale, which takes into account the main sources of interrater disagreement identified by our pilot scale.
In stroke medicine, the burning questions about brain microbleeds (BMBs) concern their diagnostic significance (for example, for the ante mortem diagnosis of cerebral amyloid angiopathy and other disorders) and whether BMBs should influence the use of antithrombotic and thrombolytic drugs.1 If the diagnostic and prognostic significances of BMBs are to be investigated and used for these purposes in clinical practice, then definitive research studies should fulfill a variety of prerequisites,2 including knowledge of the usefulness of the scales used to rate BMBs as well as the intrarater and interrater variation of the individual researchers using the scales.
Previous studies have reported variable levels of interrater agreement among 2 or 3 observers (Table 1).3–15 Studies of interrater reliability found κ values ranging from 0.33 (fair)3 to 0.88 (excellent).10,11 Most found κ>0.7, but sample sizes were small, 95% CIs were not always provided, and the properties of the rating scales used were not described.
In view of the variation in reported interrater agreement, the potential for a rating scale to improve levels of agreement, and the absence of a rating scale for BMBs, we further quantified interrater agreement about the presence, number, size, and location of BMBs to develop a simple BMB classification scheme that might minimize observer variation.
Subjects and Methods
We studied a subset of patients from a hospital-based stroke register (the Edinburgh Stroke Study, http://www.dcn.ed.ac.uk/ess/)16 (Table 2). Consecutive consenting stroke and transient ischemic attack (TIA) patients were recruited to the register from outpatient clinics and hospital admissions (total n=2160). In the current study we included those patients who had undergone at least 1 MRI scan with gradient echo (GRE) sequences (n=264). If a patient had >1 MRI, then we used the earliest scan.
MRIs were performed on a GE Signa LX 1.5-Tesla machine with 22-mT m−1 maximum strength gradients using the manufacturer-supplied quadrature birdcage head coil. Diagnostic MR imaging included (all axial sequences): diffusion-weighted (TR, 9999; TE, 98.8; matrix, 128×128; FoV, 24×24; slice thickness, 5 mm; slice gap, 1 mm; NEX, 1 [where TR indicates repetition time; TE, echo time; FoV, field of view; NEX, number of excitations]); T2-weighted (TR, 6300; TE, 107; matrix, 256×256; FoV, 24×18; slice thickness, 5 mm; slice gap, 1.5 mm; NEX, 2); fluid-attenuated inversion recovery (TR, 9002; TE, 147; matrix, 256×256; FoV, 24×24; slice thickness, 5 mm; slice gap, 1.5 mm; NEX, 1); and GRE (T2*; TR, 620; TE, 15; flip angle, 20; FoV, 24×18; matrix, 256×192; slice thickness, 5 mm; slice gap, 1 mm; NEX, 2).
Brain Microbleed Rating
A neuroradiologist (G.M.P.) and a neurologist (C.C.), both with experience in rating BMBs, independently assessed all MRI sequences, on cut film, for all 264 adults using a pilot BMB rating scale. The scale required the reader to quantify BMBs subdivided by size (<5 mm, 5–10 mm), side of brain (left, right), and location (lobar [cortex/gray–white junction; subcortical white matter], deep [basal ganglia gray matter; internal and external capsules; thalamus], and posterior fossa [brain stem; cerebellum]; Figure 1). All BMBs were measured manually. Each observer was blinded to clinical data and to the other observer’s ratings. BMBs were defined as homogeneous, round foci, <10 mm diameter (no minimum size was specified), of low signal intensity on GRE T2*-weighted MRI. The observers were aware of the main BMB mimics. Low-signal lesions on GRE T2* within a lesion compatible with an infarct were regarded as hemorrhagic transformations rather than BMBs.
Development and Testing of the Brain Observer MicroBleed Scale (BOMBS)
After the assessment of interrater agreement with the pilot scale, a senior neuroradiologist (J.M.W.) reviewed MRI scans about which the 2 observers disagreed in their BMB ratings. We developed the Brain Observer MicroBleed Scale (BOMBS) to account for some of the common sources of disagreement and other major problems encountered (see Supplemental Figure I, available online at http://stroke.ahajournals.org and www.sbirc.ed.ac.uk/imageanalysis.html). We re-evaluated agreement between the same 2 observers using BOMBS in a different set of patients undergoing identical MRI sequences and parameters to quantify BMBs subdivided by size, side of brain, and location (as before), but with the addition of a further subdivision into “certain” and “uncertain” BMB categories. The study population for the assessment of BOMBS was a series of 156 patients with stroke (different from the 264 in whom the pilot rating scale was tested) who had been recruited in 2 other stroke studies requiring MRI. Both studies recruited from the same hospital sources as the Edinburgh Stroke Study. One study recruited patients with lacunar or nondisabling cortical ischemic stroke (the Mild Stroke Study); the other included outpatients presenting >1 week after a mild stroke, in whom CT scanning would not discriminate between ischemic and hemorrhagic stroke, requiring MRI for stroke subtyping.
We quantified observer agreement using the unweighted κ statistic for nominal data (such as dichotomized presence versus absence of ≥1 BMB) analyzed in any brain location and in separate brain areas (lobar, deep, and posterior fossa). When using BOMBS, we calculated κ for BMBs rated certain, and for BMBs rated certain or uncertain. Intraclass correlation coefficients were calculated to assess agreement between observers for the overall numbers of BMBs. When exploring interobserver reliability in measurements of BMB size, we restricted our analysis to MRI scans on which both raters had observed definite BMBs in the same brain location. All analyses were performed in SPSS version 13.0, except for confidence intervals for κ, which were calculated using Confidence Interval Analysis software.17
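As an illustration of the unweighted κ calculation described above, the following Python sketch computes κ and an approximate 95% CI from a 2×2 table of presence versus absence ratings by 2 observers. The cell counts are hypothetical and are not data from this study. The simple large-sample standard error is also an assumption; the method implemented in the Confidence Interval Analysis software may differ.

```python
import math


def cohen_kappa(both_present, a_only, b_only, both_absent, z=1.96):
    """Unweighted Cohen's kappa with an approximate CI for a 2x2 table.

    Cells: both observers rate >=1 BMB, only observer A does,
    only observer B does, and neither does.
    """
    n = both_present + a_only + b_only + both_absent
    p_observed = (both_present + both_absent) / n
    # Marginal proportion of scans each observer rated as positive.
    p_a = (both_present + a_only) / n
    p_b = (both_present + b_only) / n
    # Agreement expected by chance if the observers rated independently.
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (p_observed - p_expected) / (1 - p_expected)
    # Simple large-sample standard error (an assumption, not the
    # documented method of the software used in the study).
    se = math.sqrt(p_observed * (1 - p_observed) / (n * (1 - p_expected) ** 2))
    return kappa, kappa - z * se, kappa + z * se


# Hypothetical counts for 100 scans (not data from this study).
kappa, lower, upper = cohen_kappa(20, 5, 10, 65)
print(f"kappa = {kappa:.3f} (95% CI, {lower:.3f} to {upper:.3f})")
```

Because κ subtracts chance-expected agreement from observed agreement, 2 observers who agree on 85% of scans in this hypothetical table obtain a κ well below 0.85.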
Ethical approval was granted by the Lothian Research Ethics Committee.
Results
Study of Interrater Reliability Using a Pilot Rating Scale
Agreement about the presence/absence of ≥1 BMB in any location in the brain was moderate (κ, 0.44; 95% CI, 0.32–0.56), but it appeared to be better in deep and posterior fossa locations when compared to lobar areas (Table 2). The intraclass correlation coefficient for the overall number of BMBs was 0.91 (95% CI, 0.88–0.93). The 2 observers disagreed about the presence of ≥1 BMB on 65 MRI scans. When these disagreements were reviewed by a third observer (J.M.W.), most were found to occur when there was doubt about whether there was 1 BMB on a scan or none. The main causes for disagreement were common BMB mimics such as vascular flow voids (cortical and perforator vessels), irregularly shaped lesions, lesions too pale to be confident about them being a BMB, partial volume artifacts from the petrous temporal bone or orbit, and variable signal dropout (Figure 2).
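The intraclass correlation coefficient for BMB counts can be sketched in Python as follows. The report does not state which form of the coefficient was requested in SPSS, so the two-way random-effects, absolute-agreement, single-rater form (often written ICC(2,1)) is an assumption here, and the counts are hypothetical rather than study data.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per scan, each holding one BMB
    count per observer.
    """
    n = len(ratings)        # number of scans
    k = len(ratings[0])     # number of observers
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares: scans, observers, and residual.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical BMB counts per scan for observers A and B.
counts = [[0, 0], [1, 0], [3, 2], [0, 1], [12, 10], [5, 6], [0, 0], [2, 2]]
print(f"ICC(2,1) = {icc_2_1(counts):.2f}")
```

A coefficient computed on counts can be dominated by a few scans with many BMBs, which may help explain why agreement about number can be high even when agreement about presence versus absence is only moderate.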
Development and Testing of the BOMBS
We revised the pilot rating scale on the basis of the causes of the observed disagreements to develop the BOMBS (Supplemental Figure I). Interrater agreement about the presence/absence of ≥1 BMB improved using BOMBS when the analysis was restricted to certain BMBs, but remained similar to the pilot rating scale when considering certain and uncertain BMBs (Table 2). No significant difference in interrater reliability was discernible between brain locations using BOMBS (Table 2). The intraclass correlation coefficient for the overall number of certain BMBs was 0.93 (95% CI, 0.91–0.95). There were 27 definite BMBs observed by both raters in the same brain location, 25 of which were rated in the same size category (93%; 95% CI, 77–98); 2 BMBs were considered to be ≥5 mm by observer A, but <5 mm by observer B.
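The interval reported for agreement on size category (25 of 27 BMBs) is consistent with a Wilson score interval for a proportion. The report does not name the method, so treating it as a Wilson interval is an assumption. A minimal Python sketch of that calculation:

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# 25 of 27 definite BMBs were rated in the same size category.
lower, upper = wilson_interval(25, 27)
print(f"{25 / 27:.0%} (95% CI, {lower:.0%} to {upper:.0%})")
# prints: 93% (95% CI, 77% to 98%)
```

The interval is asymmetric around 93% because the Wilson method keeps the limits within 0% to 100% when the proportion is close to 1 and the sample is small.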
Discussion
With our simple pilot rating scale, we found that the assessment of BMBs on GRE T2* MR images in patients with stroke or TIA was not straightforward, with only moderate levels of interrater agreement, comparable to previous studies. BOMBS (Supplemental Figure I) improved interrater reliability when all brain locations were analyzed together, and particularly in lobar locations, which were identified in our pilot study as a difficult part of the brain to rate (Table 2). Although the consideration of BMB mimics is widely recognized as being important, observer variation persists, even when mimics are carefully thought about during MR scan review. BOMBS had its main effect by differentiating certain from uncertain BMBs; uncertainty about BMBs may be an important problem, because it applied to between one-third and one-half of BMBs in this study (Table 2).
BMB maximum diameters in previous research have varied from 2 to 5 mm, to ≤7 mm and ≤10 mm.2 In this study, using a maximum diameter of 10 mm, we found few BMBs >5 mm in diameter, and we found only 2 disagreements about BMB size, but further studies are needed of observer agreement in BMB size categorization and of the pathological substrates for BMBs of varying sizes in different patient populations.
We found good agreement about the total number of BMBs. It is quite possible that the number of BMBs may influence their prognostic significance,1 but this is not beyond doubt. On both these counts, continuing to collect the total number of BMBs rated by any observer—rather than subdividing a rating scale into no/few/many BMBs—will contribute to improving agreement about BMB number, as well as determining what the numeric thresholds for BMB prognostic/therapeutic significance are. Furthermore, studying observers’ certainty in relation to their ratings of the presence/absence of BMBs as well as the number and size of BMBs seen will help in understanding whether small/uncertain BMBs are more likely to be counted in patients with multiple certain BMBs than those with a solitary certain BMB or multiple uncertain BMBs. BOMBS appeared to influence rater behavior in our study (Table 2); for example, observer A rated more BMBs than observer B in the pilot study, but this pattern was reversed with BOMBS.
Although studies have described the interrater reliability of BMB ratings (Table 1), we sought to improve agreement as our primary objective. We used κ and our study design fulfilled the assumptions inherent in the κ statistic: the subjects undergoing study and the observers were independent, and the categories in the scale were independent, mutually exclusive, and exhaustive.18 The design of BOMBS benefited from the lessons gleaned using the pilot scale and independent review of the scans about which the observers disagreed. This study also benefited from using consistent imaging parameters and the same range of sequences (including GRE T2* in all), and blinding of each observer to the other’s ratings.
The main weakness of this study is that its findings have not yet been validated in larger cohorts, in other disease groups, and among other observers. We encourage other researchers to do so to explore the generalizability of BOMBS. The influence of practice effects in our observers cannot be ruled out, but even before their ratings using the pilot rating scale both had experience of interpreting BMBs on MRI. The prevalence of BMBs appeared to be lower when BOMBS was tested, probably because the patients whose MRI scans were used to test BOMBS were younger and had milder strokes than those whose MRIs were used for the pilot rating scale; a systematic review has found BMB prevalence to be lower in these groups.2 An artifact of the κ statistic is that it is affected in complex ways by the prevalence of the abnormality undergoing study, but a decrease in BMB prevalence from 20%–40% to 18%–25% is unlikely to have significantly biased the observers or affected the properties of the κ statistic.
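The dependence of κ on prevalence can be shown with a short Python sketch. Both tables below are hypothetical, with deliberately extreme prevalences chosen to make the effect visible; they do not represent the prevalences in this study.

```python
def kappa_2x2(both_present, a_only, b_only, both_absent):
    """Unweighted Cohen's kappa for a 2x2 presence/absence table."""
    n = both_present + a_only + b_only + both_absent
    p_observed = (both_present + both_absent) / n
    p_a = (both_present + a_only) / n
    p_b = (both_present + b_only) / n
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_observed - p_expected) / (1 - p_expected)


# Both hypothetical tables hold 100 scans with 90% observed agreement.
balanced = kappa_2x2(45, 5, 5, 45)  # each observer rates 50% of scans positive
rare = kappa_2x2(5, 5, 5, 85)       # each observer rates 10% of scans positive
print(f"prevalence 50%: kappa = {balanced:.2f}")
print(f"prevalence 10%: kappa = {rare:.2f}")
```

With identical observed agreement, κ is lower in the low-prevalence table because chance-expected agreement is higher when most scans are rated negative by both observers.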
The BOMBS dichotomization of BMBs into certain and uncertain was intended to improve agreement about the existence of certain BMBs. Prioritizing the identification of certain BMBs would result in improved specificity (at the expense of sensitivity) by identifying BMBs more reliably, which could improve the internal and external validities of research projects and encourage more reliable identification of BMBs should they become relevant in clinical practice. Investigators could also explore the robustness of their study findings using sensitivity analyses (by restricting analyses to certain BMBs only, or by including both certain and uncertain BMBs, which would improve sensitivity at the expense of specificity). This dichotomization also permits the identification of a separate group of MRIs with uncertain BMBs to help improve understanding of why and how observers disagree about BMBs, and to follow up such patients to determine if these uncertain BMBs mature into certain BMBs.
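One way such a sensitivity analysis might be set up is sketched below in Python. The `ScanRating` record and the ratings are hypothetical illustrations, not the BOMBS form itself or data from this study. Each scan is dichotomized twice, once counting certain BMBs only and once counting certain or uncertain BMBs, and κ is computed under each definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanRating:
    """Hypothetical per-scan record: one observer's BMB counts."""
    certain: int
    uncertain: int


def has_bmb(rating, include_uncertain):
    """Dichotomize a scan as having >=1 BMB under the chosen definition."""
    return rating.certain > 0 or (include_uncertain and rating.uncertain > 0)


def kappa(xs, ys):
    """Unweighted Cohen's kappa for two equal-length lists of booleans."""
    n = len(xs)
    p_observed = sum(x == y for x, y in zip(xs, ys)) / n
    p_x, p_y = sum(xs) / n, sum(ys) / n
    p_expected = p_x * p_y + (1 - p_x) * (1 - p_y)
    return (p_observed - p_expected) / (1 - p_expected)


def kappa_for(ratings_a, ratings_b, include_uncertain):
    xs = [has_bmb(r, include_uncertain) for r in ratings_a]
    ys = [has_bmb(r, include_uncertain) for r in ratings_b]
    return kappa(xs, ys)


# Hypothetical (certain, uncertain) counts on the same 10 scans.
pairs_a = [(1, 0), (2, 0), (0, 1), (0, 0), (0, 0),
           (0, 1), (0, 0), (1, 1), (0, 0), (0, 0)]
pairs_b = [(1, 0), (1, 1), (0, 0), (0, 0), (0, 1),
           (0, 0), (0, 0), (1, 0), (0, 0), (0, 0)]
observer_a = [ScanRating(c, u) for c, u in pairs_a]
observer_b = [ScanRating(c, u) for c, u in pairs_b]

print(f"certain only: {kappa_for(observer_a, observer_b, False):.2f}")
print(f"certain or uncertain: {kappa_for(observer_a, observer_b, True):.2f}")
```

Keeping the certain and uncertain counts as separate fields, rather than storing a single presence flag, lets either definition be applied afterward without re-reading the scans.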
Our findings should be regarded as a baseline measure of observer agreement for future studies using BOMBS. Further work on ways of improving observer agreement about BMBs is needed, and training observers to recognize certain and uncertain BMBs, as well as their mimics, is an obvious priority (Figure 2). BOMBS will also enable others to study agreement about BMB size, number, brain location, and diagnostic certainty, as well as exploring the influence of these factors on the diagnostic and prognostic usefulness of BMBs.
Because the clinical implications of BMBs remain to be established, there is still an opportunity to improve the reliability of BMB assessments by the use (and further development) of the BOMBS rating scale, so that adequately powered, well-designed studies will be able to answer the outstanding clinical concerns about BMBs’ diagnostic and prognostic value, and whether they should influence the prescription of antiplatelet, anticoagulant, or thrombolytic drugs. The use of a standard scale for BMBs is also essential for future studies to enable comparisons and meta-analyses.
Acknowledgments
Professor Martin Dennis, along with a team of other stroke specialists, recruited and clinically characterized many of the patients in this study. Programming support for the datasets analyzed was provided by Mike McDowall and Aidan Hutchison.
Sources of Funding
C.C. was supported by a grant from the EA 2691 and ADRINORD. The UK Medical Research Council funded R.A.S.S. (Clinician Scientist Fellowship G108/613). The Wellcome Trust funded F.D. (075611), as well as C.L.M.S. and C.A.J. (Clinician Scientist Award to CS WT063668MF). The NHS R&D Health Technology Assessment Panel funded S.K. (96/08/01), and the scans and associated data collection were funded by these 2 sources and Chief Scientist Office of the Scottish Executive (CZB/4/281). The imaging was conducted in the SFC Brain Imaging Research Centre at the University of Edinburgh (www.sbirc.ed.ac.uk).
- Received May 26, 2008.
- Accepted June 10, 2008.
References
Fiehler J, Albers GW, Boulanger J-M, Derex L, Gass A, Hjort N, Kim JS, Liebeskind DS, Neumann-Haefelin T, Pedraza S, Rother J, Rothwell P, Rovira A, Schellinger PD, Trenkler J, for the MRSG. Bleeding risk analysis in stroke imaging before thrombolysis (BRASIL): Pooled analysis of t2*-weighted magnetic resonance imaging data from 570 patients. Stroke. 2007; 38: 2738–2744.
Cordonnier C, Al-Shahi Salman R, Wardlaw J. Spontaneous brain microbleeds: Systematic review, subgroup analyses and standards for study design and reporting. Brain. 2007; 130: 1988–2003.
Jeerakathil T, Wolf PA, Beiser A, Hald JK, Au R, Kase CS, Massaro JM, DeCarli C. Cerebral microbleeds: Prevalence and associations with cardiovascular risk factors in the Framingham study. Stroke. 2004; 35: 1831–1835.
Roob G, Schmidt R, Kapeller P, Lechner A, Hartung HP, Fazekas F. MRI evidence of past cerebral microbleeds in a healthy elderly population. Neurology. 1999; 52: 991–994.
Greenberg SM, O'Donnell HC, Schaefer PW, Kraft E. MRI detection of new hemorrhages: Potential marker of progression in cerebral amyloid angiopathy. Neurology. 1999; 53: 1135–1138.
Kakuda W, Thijs VN, Lansberg MG, Bammer R, Wechsler L, Kemp S, Moseley ME, Marks MP, Albers GW. Clinical importance of microbleeds in patients receiving iv thrombolysis. Neurology. 2005; 65: 1175–1178.
Lee SH, Park JM, Kwon SJ, Kim H, Kim YH, Roh JK, Yoon BW. Left ventricular hypertrophy is associated with cerebral microbleeds in hypertensive patients. Neurology. 2004; 63: 16–21.
Viswanathan A, Guichard JP, Gschwendtner A, Buffon F, Cumurcuic R, Boutron C, Vicaut E, Holtmannspotter M, Pachai C, Bousser MG, Dichgans M, Chabriat H. Blood pressure and haemoglobin a1c are associated with microhaemorrhage in CADASIL: A two-centre cohort study. Brain. 2006; 129: 2375–2383.
Lee S-H, Kim BJ, Roh J-K. Silent microbleeds are associated with volume of primary intracerebral hemorrhage. Neurology. 2006; 66: 430–432.
Lemmens R, Gorner A, Schrooten M, Thijs V. Association of apolipoprotein E epsilon2 with white matter disease but not with microbleeds. Stroke. 2007; 38: 1185–1188.
Vernooij MW, van der Lugt A, Ikram MA, Wielopolski PA, Niessen WJ, Hofman A, Krestin GP, Breteler MM. Prevalence and risk factors of cerebral microbleeds: The Rotterdam scan study. Neurology. 2008; 70: 1208–1214.
Jackson C, Crossland L, Dennis M, Wardlaw JCS. Assessing the impact of the requirement for explicit consent in a hospital-based stroke study. QJM. 2008; 101: 281–289.
Bryant T. Confidence Interval Analysis. (2.0.0 build 41). Bristol: BMJ Books; 2000.