Background and Purpose The evaluation of cerebrovascular end points in prospective studies is often based exclusively on medical record examination and may be made by more than one observer over time. To address the issues of adequacy of medical record information and consistency in diagnosis over time, we evaluated interobserver agreement for the main items of the stroke classification system used in the Physicians’ Health Study. This trial included 22 071 physicians randomly assigned in 1982 to receive either aspirin or placebo to assess the subsequent risk of cardiovascular events, including stroke.
Methods Stroke subtype, stroke severity, and certainty of diagnosis were first classified from medical records from the years 1982 through 1988. The 216 stroke events reported in this period were independently reclassified in 1994 and compared with the initial classification using kappa statistics.
Results Overall agreement in major stroke types (hemorrhagic, ischemic, undetermined stroke) as well as in hemorrhagic stroke subtypes was excellent (κ=0.81 and κ=0.95, respectively). A wide range of values for the ischemic stroke subtypes (κ=0.13 to κ=0.96) was obtained. Agreement was substantial in assessment of stroke severity (κ=0.71), and it was fair (κ=0.33) for certainty of diagnosis.
Conclusions Interobserver agreement is high for major stroke types as well as for categories of hemorrhagic stroke on the basis of review of medical records and results of imaging data. The classification of ischemic stroke subtypes, however, is subject to substantial interobserver disagreement. Periodic reclassification of random samples of end points might be considered in long-term prospective studies to assess potential misclassification of events by different observers.
A valid and reproducible clinical diagnosis is essential in observational studies and in randomized stroke trials. Furthermore, in studies of agents such as aspirin, which decrease clotting but may increase bleeding, differentiation among the major types of ischemic infarction and hemorrhage is an absolute necessity. While studies of smaller sample sizes often include patient examinations, larger studies tend to rely solely on reviews of medical records, including diagnostic test results. In addition, classification of end points in prospective studies of long duration may be made by more than one observer. Classification systems based on patient history and examination are fairly reliable in differentiating among the major stroke types of subarachnoid hemorrhage (SAH), intracerebral hemorrhage (ICH), and ischemic infarction.1 However, the reliability of stroke subtype classification is poor unless the diagnostic criteria for each subtype are clearly defined and the results of diagnostic tests are included.1 2 3
To address the issues of adequacy of medical record information as well as consistency of diagnosis over time, we independently reclassified and assessed interobserver agreement in the classification of the 216 strokes in the Physicians’ Health Study reported before the early termination of the aspirin component of the trial in January 1988.
Subjects and Methods
A detailed description of the participants and methods of the Physicians’ Health Study has previously been published.4 Briefly, 22 071 US male physicians, 40 to 84 years of age at entry in 1982, were assigned randomly to aspirin, beta carotene, both active agents, or both placebos, with a 2×2 factorial design. All randomized participants were free from self-reported prior stroke, transient ischemic attack, or myocardial infarction. Information was collected at baseline with mailed questionnaires, including information about age, height, weight, systolic and diastolic blood pressures, physical activity, alcohol consumption, cigarette smoking, and history of angina pectoris, coronary revascularization procedures, diabetes mellitus, and hypertension. Every 6 months for the first year and annually thereafter, participants were mailed a brief questionnaire asking about their compliance with the randomized treatment assignments and the occurrence of new events, including stroke and transient ischemic attack. Nonfatal strokes were reported on the semiannual or annual questionnaires. Deaths were usually reported by families or postal authorities. Persistent nonrespondents were telephoned. All procedures involved in the study were approved by the institutional review board of the Brigham and Women’s Hospital, and participants gave informed consent for enrollment in the Physicians’ Health Study.
While the beta carotene component of the study is still ongoing, the aspirin component was terminated early on January 15, 1988, because of the emergence of a statistically extreme 44% reduction in the risk of a first myocardial infarction among individuals assigned to aspirin.5 By that time, participants had been followed for an average of 60.2 months. Morbidity follow-up was 99.7% complete, and mortality follow-up was 100%.
A diagnosis of stroke was confirmed after the review of medical records and all other available information by the End Points Committee (consisting of two internists, one cardiologist, and one neurologist, all blinded to the treatment assignment). A definite stroke was defined as a focal neurological deficit that lasted longer than 24 hours and was attributable to a vascular event. Strokes were classified into six subtypes on the basis of presence of risk factors, the mode of onset, clinical findings, test results, and the nature and location of the occluded vessel: embolic infarct, atherothrombotic infarct, embolic or thrombotic infarct (undifferentiable), SAH, ICH, and stroke of undetermined type. The latter category applied to instances in which the clinical data, although consistent with stroke, did not allow a distinction between ischemic and hemorrhagic subtype. The confidence (certainty) about the diagnosis of stroke was classified as possible, probable, or certain. Severity of stroke at hospital discharge or at time of stabilization for patients who were not hospitalized was determined using the following six-grade scale: 1, no residual impairment; 2, minor nonfunctionally impairing deficit; 3, mild functional deficit with some restriction of lifestyle; 4, moderate deficit significantly interfering with activities of daily life; 5, dependent state requiring chronic care; and 6, fatal. All newly reported strokes were first classified according to this system by one neurologist (Harris H. Funkenstein, MD) between 1982 and 1988. In 1994, two other neurologists (K.B., C.S.K.) independently reclassified these strokes on the basis of the same medical records but blinded to the first classification. They discussed all cases to reach consensus on the reclassification of each stroke. The first and second classifications were then compared in terms of stroke subtype diagnosis, stroke severity, and certainty of diagnosis. 
The three neurologists who conducted the initial and second reviews of the cases based the diagnosis of stroke subtype on their best clinical judgment after reviewing all available clinical and laboratory information rather than on the application of predefined precise diagnostic criteria for each stroke subtype.
The kappa statistic (κ)6 is the most commonly used measure of agreement for categorical data. It is chance corrected, ie, it compares the amount of observed agreement with that expected by chance, taking into account the prevalence of the item measured. The unweighted κ for a dichotomized item (or symptom) and two observers has been extended to instances with multiple observers and to observations with more than two categories on an ordinal scale.7 8 In the latter situation, either standardized (ie, quadratic) or self-defined weights should be used. The κ based on quadratic weights is asymptotically equal to the intraclass correlation. However, the intraclass correlation is only to be used with interval data, while κ is based on nominal (two categories only) or ordinal data.9 We calculated quadratic-weighted κ for stroke severity and certainty of diagnosis, since these variables were measured on an ordinal scale. In addition, unweighted κ values for a dichotomized outcome were calculated for each stroke category. To obtain a summary measure of agreement within major stroke types (ie, ischemic or hemorrhagic), an overall κ was calculated as the unweighted average10 of the κ values for the corresponding single categories. If the agreement is that expected by chance, κ=0. Generally, a κ of 0.80 or higher can be considered excellent agreement, between 0.40 and 0.80 moderate to substantial, between 0.20 and 0.40 fair, and less than or equal to 0.20 slight or poor.11
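As a minimal sketch of the calculation described above (not the software used in the study, which the text does not specify), the following Python functions compute unweighted and quadratic-weighted κ from a square table of two raters' classifications and attach the verbal labels quoted above. The function names, the example counts, and the treatment of a value falling exactly on 0.40 are illustrative assumptions.

```python
from typing import Optional, Sequence


def cohen_kappa(table: Sequence[Sequence[float]],
                weights: Optional[str] = None) -> float:
    """Chance-corrected agreement between two raters.

    table[i][j] is the number of cases placed in category i by the first
    rater and in category j by the second rater.
    weights=None gives the unweighted kappa; weights="quadratic" gives the
    quadratic-weighted kappa for ordered categories.
    """
    k = len(table)
    if k == 0 or any(len(row) != k for row in table):
        raise ValueError("table must be square and non-empty")
    n = float(sum(sum(row) for row in table))
    if n == 0:
        raise ValueError("table contains no cases")

    # Marginal proportions for each rater.
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    def weight(i: int, j: int) -> float:
        if weights is None:
            return 1.0 if i == j else 0.0
        if weights == "quadratic":
            if k == 1:
                return 1.0
            return 1.0 - ((i - j) ** 2) / float((k - 1) ** 2)
        raise ValueError("weights must be None or 'quadratic'")

    # Observed and chance-expected (weighted) agreement.
    p_obs = sum(weight(i, j) * table[i][j] / n
                for i in range(k) for j in range(k))
    p_exp = sum(weight(i, j) * row_p[i] * col_p[j]
                for i in range(k) for j in range(k))
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_obs - p_exp) / (1.0 - p_exp)


def interpret_kappa(kappa: float) -> str:
    """Verbal label following the cut points quoted in the text.

    Assumption: a value of exactly 0.40 is labeled 'moderate to substantial'.
    """
    if kappa >= 0.80:
        return "excellent"
    if kappa >= 0.40:
        return "moderate to substantial"
    if kappa > 0.20:
        return "fair"
    return "slight or poor"


# Hypothetical 2x2 table (counts invented for illustration, not study data).
example = [[20, 5],
           [10, 15]]
print(f"kappa = {cohen_kappa(example):.2f}")
```

For a dichotomized item the quadratic weights reduce to the unweighted case, so both calls return the same value on a 2×2 table; the weights matter only for ordered scales with three or more categories, such as severity and certainty.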
Results
The reclassification diagnosed all 216 events of the first classification as strokes. The bases for the diagnosis were the clinical features included in the medical records along with reports of diagnostic tests, including neuroimaging data. The latter included CT in 184 (85%) patients, MRI in 17 (8%), and cerebral angiography in 41 (19%).
Table 1 presents a summary of agreement on major stroke types between the first and the second classifications. An overall agreement of 93.1% with a κ of 0.81 was obtained. None of the strokes initially classified as ischemic were reclassified as hemorrhagic. However, 12 of the initial ischemic strokes were reclassified as undetermined in contrast to only 2 such diagnoses in the first classification, indicating a more conservative approach in the later interpretation of diagnostic results. Only 1 of 35 strokes initially classified as hemorrhagic was reclassified as undetermined. This patient had a clinical presentation suggestive of SAH but without proof of the diagnosis by CT or lumbar puncture. The first classification of the event as hemorrhagic was based solely on the symptoms at onset; the second classification labeled the stroke as undetermined because of the lack of confirmatory laboratory data for intracranial hemorrhage.
Table 2 gives a summary of agreement for stroke subtypes. Every category of ischemic stroke from the first classification showed a spread over the ischemic stroke categories of the reclassification, indicating low agreement. The undetermined stroke category revealed good agreement, although the number of events classified was quite small. Agreement on hemorrhagic stroke categories was very high, reaching perfect agreement on ICH and reclassifying only 2 of 10 SAHs as ICHs.
Table 3 presents the κ for each stroke category. On the level of major stroke types, agreement was excellent, with κ=0.98 for hemorrhagic and κ=0.82 for ischemic strokes. Regarding stroke subtypes, agreement was also excellent for the two categories of hemorrhagic stroke but was substantially lower for the three categories of ischemic stroke. Undetermined strokes revealed a moderate agreement (κ=0.45) between the two classifications. The overall κ values for the two major stroke types represent an average of the agreement within the corresponding single categories. They reveal excellent agreement on the hemorrhagic stroke categories (κ=0.95) but only fair agreement on ischemic stroke subtype diagnosis (κ=0.34).
Table 4 shows the overall interobserver agreement on stroke severity. Since 10 cases were coded as having “missing information” in the first classification, only 206 cases were reevaluated. In addition, the category representing death (grade 6) was excluded from the calculation of the κ because perfect agreement was implicit and its inclusion would have artificially increased the overall κ value for stroke severity. Since severity is listed on a scale from 1 to 5 as defined earlier, only quadratic weights were used to calculate the weighted κ for this table. Agreement occurred in 94.3% of all cases compared with 80.4% expected by chance alone, yielding a moderate to substantial agreement (κ=0.71). A closer look at each category of severity reveals that the agreement was substantial for cases with no residual symptoms and was only slight for those with minor functional deficits, but it improved again in more severe cases. The following κ values were obtained for each category of severity: 0.47 (no residual, grade 1), 0.19 (nonfunctional deficit, grade 2), 0.17 (mild functional deficit, grade 3), 0.26 (moderate deficit, grade 4), 0.46 (severe deficit, chronic care, grade 5), and 1.00 (fatal, grade 6).
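The overall severity κ can be checked from the two percentages reported in the preceding paragraph. This is a minimal sketch, assuming that the 94.3% and 80.4% figures are the quadratic-weighted observed and chance-expected agreement; the function name is illustrative.

```python
def kappa_from_agreement(p_observed: float, p_expected: float) -> float:
    """Kappa = (observed - expected) / (1 - expected), proportions in [0, 1]."""
    if not (0.0 <= p_observed <= 1.0 and 0.0 <= p_expected < 1.0):
        raise ValueError("proportions must lie in [0, 1], with p_expected < 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Percentages reported in the text for stroke severity (grades 1 to 5).
severity_kappa = kappa_from_agreement(0.943, 0.804)
print(f"kappa = {severity_kappa:.2f}")  # prints "kappa = 0.71"
```

The denominator shows why the correction matters here: with about 80% agreement expected by chance, only about 20 percentage points of agreement beyond chance were attainable, and the raters achieved roughly 14 of them.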
For confidence in stroke diagnosis (an item generated after the evaluation of all data concerning the event), total agreement was found in 78.2% of cases. Since the categories of possible, probable, and definite stroke also represent an ordering, only quadratic weights were used to calculate the weighted κ. A κ value of 0.33 was obtained, demonstrating only a fair interobserver agreement for degree of certainty in the stroke diagnosis.
Discussion
These data revealed excellent interobserver agreement in the diagnosis of major stroke types and of hemorrhagic stroke subtypes and substantial agreement in stroke severity ratings. The agreement for subtypes of ischemic stroke and certainty of diagnosis was substantially lower.
With regard to stroke subtype, interobserver agreement is generally lower for ischemic than hemorrhagic categories, regardless of the level of diagnostic workup. Previous evaluations of interobserver agreement on stroke classification from clinical impression and from medical records have found wide ranges for the κ.1 2 3 12 Gross et al1 found that the addition of complete workup information to the findings from physical examination and patient history alone increased the agreement from κ=0.15 to κ=0.38 when using a nine-category scale. Reducing the stroke subtypes to four (ischemic, SAH, ICH, and other) by collapsing all ischemic categories and providing the workup information to the same physician who initially examined the patient improved the κ value further to 0.69. The interobserver agreement among the physicians who only had access to complete written information on patients was lower, although not significantly so, than the agreement among those who examined the patient in person (κ=0.54 versus κ=0.61). However, the total number of patients in this study was quite small (n=17). Gordon et al3 tested the agreement of the ischemic stroke classification used in the TOAST (Trial of ORG 10172 in Acute Stroke Treatment) study13 by sending out 18 written case reports to 24 neurologists. They were asked to classify the cases on the basis of the patients’ histories, description of the physical examination, and test results. The overall interobserver agreement was a significant 54% increase over chance (κ=0.54). Values for each subtype varied between κ=0.75 for cardioembolic stroke and κ=0.51 for ischemic stroke due to small artery occlusion. However, the number of patients was small (n=18), and the sample contained detailed clinical data and diagnostic workup, intentionally including instances of particular diagnostic uncertainty, to test the degree of variability in stroke subtype diagnosis among investigators selected to participate in a controlled clinical trial.
In contrast, our study evaluated agreement on 216 stroke cases, both hemorrhagic and ischemic, with a heterogeneous level of diagnostic workup, reflecting general clinical practice.
The use of CT has greatly facilitated the diagnosis of hemorrhagic stroke and the differentiation of SAH and ICH. Thus, CT is for the most part responsible for the high levels of interrater agreement in these stroke subtypes (in our study, κ=0.96 for ICH and κ=0.82 for SAH) and in the diagnosis of “ischemic” stroke as a category defined by the absence of blood on initial CT. However, the situation is quite different in the classification of subtypes of ischemic stroke after exclusion of a hemorrhagic event by CT. Infarct subtype categorization is usually based on a combination of data, including the affected vascular territory and/or infarct mechanism (ie, [cardio]embolic versus atherothrombotic).12 13 14 15 16 17 18 19 20 Because this differentiation is derived from data on a variety of test results and clinical findings, the κ is generally lower, reflecting differences of opinion on the diagnostic value of clinical findings and laboratory results among physicians. Furthermore, disagreement among observers is enhanced by the lack of predefined strict criteria for the diagnosis of ischemic stroke subtypes, as shown in our study. However, the high interobserver reliability for major stroke types observed in this study serves the purpose of the Physicians’ Health Study, which was primarily concerned with the incidence of ischemic or hemorrhagic events during use of an agent (aspirin) capable of altering the incidence of both types of cerebrovascular event. For trials of stroke therapy or for observation of potential differential effects of risk factors or interventions on specific subtypes of brain infarction, detailed diagnostic criteria need to be preestablished to ensure improved interobserver reliability.
Interrater agreement on stroke severity has not been previously evaluated in a retrospective manner. On the basis of the prospectively generated results reported by Shinar et al2 and Lindley et al,12 interrater agreement for neurological signs differs widely (eg, weak arm, κ=0.77; sensory loss, arm, κ=0.15).12 Thus, it is not surprising that we found better agreement on severity in cases with major residual deficits or other severe symptoms and in cases with no residual symptoms at all. Similar results were obtained by van Swieten et al,21 who found excellent interobserver reliability at both ends of a six-item modified Rankin disability scale,22 with fair agreement in the intermediate degrees of handicap after stroke. This study and ours also show that a six-step scale is adequate to classify poststroke handicap, resulting in an overall acceptable interrater agreement.
Agreement on the certainty of stroke diagnosis was only fair (κ=0.33) in our study. However, all cases initially diagnosed as stroke were diagnosed as stroke in the second review. The low level of agreement in the categories of certainty of diagnosis only reflects the assignment of different degrees of strength to the available evidence by the raters rather than doubt about the actual diagnosis of a stroke event. This subjective difference on the certainty of stroke diagnosis was also observed in the study of Gross et al.1 Although there were variations between “low” and “high” subjective confidence on initial clinical impressions among observers, their agreement on the final stroke diagnosis (based on all available data) was not different (κ=0.34 and 0.39, respectively).
The interpretation of our results has to take certain factors into account. The time elapsed between the first and the second classification varied between 6 and 12 years. Although the technical quality of diagnostic tests (especially neuroimaging and Doppler ultrasonography) substantially improved during this period, the reclassification was generally done using the same data as in the first classification. In a few instances, additional medical records from stroke recurrence with more data (such as MRI) related to the first event or the availability of diagnostic test results sent with long delay broadened the evidence for a specific subtype reclassification, but this was a rare occurrence and could not have accounted for a substantial difference in agreement.
In summary, these data demonstrate high interobserver agreement for major stroke types as well as for categories of hemorrhagic stroke with a classification system based on review of medical records. The classification of ischemic stroke subtypes carries some uncertainty because of the complexities of diagnosis based on interpretation of combined clinical and laboratory data and because of the lack of preestablished criteria for their diagnosis. Thus, if ischemic stroke subtype is a major end point in a clinical trial, it is evident that clear diagnostic criteria need to be established before initiation of the trial. These data also demonstrate that neurologist raters analyzing the same medical records years apart produced reliable results on the diagnosis of major stroke types as tested by interobserver agreement. Thus, misclassification of strokes in such circumstances is not likely to be a plausible explanation for the observed results. To ensure quality control in long-term prospective studies, periodic reclassification of a sample of randomly selected stroke end points by persons other than the study neurologists should be considered to identify possible personal diagnostic biases or misapplication of diagnostic criteria.
This study was supported by grants HL-26490, HL-34595, CA-34944, and CA-40360 from the National Institutes of Health and by a scholarship from the German Academic Exchange Service (DAAD) (Dr Berger). The authors are grateful to Charles H. Hennekens, MD, for helpful comments during the preparation of the manuscript, to Frances LaMotte, BS, for her invaluable computer expertise, and to Robert J. Glynn, ScD, for assistance in the statistical analysis of the data.
This article is dedicated to the memory of Harris H. Funkenstein, MD, who served as neurologist on the End Points Committee of the Physicians’ Health Study from its inception until his untimely and tragic death.
- Received October 4, 1995.
- Accepted October 25, 1995.
- Copyright © 1996 by American Heart Association
References
Gordon DL, Bendixen BH, Adams HP, Clarke W, Kappelle LJ, Woolson RF, and the TOAST Investigators. Interphysician agreement in the diagnosis of subtypes of acute ischemic stroke: implications for clinical trials. Neurology. 1993;43:1021-1027.
Norman GR, Streiner DL. Biostatistics: The Bare Essentials. St Louis, Mo: Mosby-Year Book; 1994.
Fleiss JL. Statistical Methods for Rates and Proportions. New York, NY: John Wiley & Sons; 1981:211-225.
Lindley RI, Warlow CP, Wardlaw JM, Dennis MS, Slattery J, Sandercock PAG. Interobserver reliability of a clinical classification of acute cerebral infarction. Stroke. 1993;24:1801-1804.
Adams HP, Bendixen BH, Kappelle LJ, Biller J, Love BB, Gordon DL, Marsh EE, and the TOAST Investigators. Classification of subtype of acute ischemic stroke: definitions for use in a multicenter clinical trial. Stroke. 1993;24:35-41.
Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ. 1992;304:1491-1494.
Caplan LR. Diagnosis and treatment of ischemic stroke. JAMA. 1991;266:2413-2418.
Bogousslavsky J, van Melle G, Regli F. The Lausanne Stroke Registry: analysis of 1000 consecutive patients with first stroke. Stroke. 1988;19:1083-1092.
Foulkes MA, Wolf PA, Price TR, Mohr JP, Hier DB. The Stroke Data Bank: design, methods and baseline characteristics. Stroke. 1988;19:547-554.
Weisberg LA. Diagnostic classification of stroke, especially lacunes. Stroke. 1988;19:1071-1073.
Mohr JP, Barnett HJM. Classification of ischemic strokes. In: Barnett HJM, Mohr JP, Stein BM, Yatsu FM, eds. Stroke: Pathophysiology, Diagnosis, and Management. New York, NY: Churchill Livingstone; 1986:281-291.
Spitzer K, Thie A, Caplan LR, Kunze K. The Microstroke expert system for stroke type diagnosis. Stroke. 1989;20:1353-1356.
van Swieten JC, Koudstaal PJ, Visser MC, Schouten HJA, van Gijn J. Interobserver agreement for the assessment of handicap in stroke patients. Stroke. 1988;19:604-607.