Interobserver Agreement for the Bedside Clinical Assessment of Suspected Stroke
Background and Purpose— Stroke remains primarily a clinical diagnosis, with information obtained from history and examination determining further management. We aimed to measure inter-rater reliability for the clinical assessment of stroke, with emphasis on items of history, timing of symptom onset, and diagnosis of stroke or mimic. We explored reasons for poor reliability.
Methods— The study was based in an urban hospital with an acute stroke unit. Pairs of observers independently assessed suspected stroke patients. Findings from history, neurological examination, and the diagnosis of stroke or mimic were recorded on a standard form. Reliability was measured by the κ statistic. We assessed the impact of observer experience and confidence, time of assessment, and patient-related factors of age, confusion, and aphasia on inter-rater reliability.
Results— Ninety-eight patients were recruited. Most items of the history and the diagnosis of stroke were found to have moderate to good inter-rater reliability. There was agreement for the hour and minute of symptom onset in only 45% of cases. Observer experience and confidence improved reliability; patient-related factors of confusion and aphasia made the assessment more difficult. There was a trend for worse inter-rater reliability among patients assessed very early and very late after symptom onset.
Conclusions— Clinicians should be aware that inter-rater reliability of the clinical assessment is affected by a variety of factors and is improved by experience and confidence. Our findings have implications for the training of doctors who assess patients with suspected stroke and identify the more reliable components of the clinical assessment.
The medical history and neurological examination form the cornerstone of the emergent evaluation of patients with suspected stroke. When considering treatment with thrombolysis, determining the time of symptom onset and whether the patient awoke with symptoms is critical.1 The bedside assessment needs to be reliable and accurate. The 2 components are linked: if inter-rater reliability (or agreement) is poor, the findings will often not be valid (or accurate). Many studies have measured the inter-rater reliability of elements of the neurological examination, finding it to vary from fair to good.2–11 Only 2 studies measured items of the history, and these found surprisingly poor inter-rater reliability.2,3 No studies have assessed the inter-rater reliability of the clinical diagnosis of stroke (or stroke mimic).12
As part of a larger project to improve the clinical assessment of patients with suspected stroke (brain attack), we aimed to determine the inter-rater reliability of the components of the clinical assessment of patients with brain attack and to explore whether modifiable external factors (such as experience, time from onset, or level of clinical confidence) could explain clinical disagreements. We were particularly interested in testing items of the history, the overall diagnosis of stroke or stroke mimic, and the timing of symptom onset, given their importance in determining eligibility for acute stroke therapies.
Materials and Methods
The study was based in an urban teaching hospital with an acute stroke unit. Patients admitted to hospital with symptoms of brain attack (defined as “apparently focal brain dysfunction of apparently abrupt onset”) were studied prospectively by a pair of examiners. We wanted a broad range of patients with suspected stroke, so we did not set time limits for inclusion in the study. Allocation of examining pairs to patients was determined by a randomized and counterbalanced list. Time between assessments was kept to a minimum to reduce the effect that fluctuations in patient state could have on assessment of inter-rater reliability. All patients provided consent, and the study had ethics committee approval.
Because we recruited patients with clinical features of suspected stroke, it was necessary to determine the final diagnosis of the event. We convened a panel of experts (stroke physicians, neurologists, and a neuroradiologist) to review the clinical, neuroimaging, and other laboratory data after the patient had been discharged and to assign the final diagnosis by consensus.
Four observers participated in the study, deliberately chosen to reflect a broad range of neurology experience. Three were between 5 and 9 years postgraduation from medical school (2 had completed 4 years of neurology training, and 1 was an internal medicine physician without formal training in neurology). The fourth was a final year medical student who received 4 weeks of practical training in clinical neurology assessment before the study commenced.
Observers took a history and performed a neurological examination on each patient. History could only be taken from the patient, rather than a family member or ambulance officer, and each observer was blind to the clinical record. We measured key items, chosen a priori by consensus among the stroke team, considered to be important in making a clinical diagnosis of stroke (vascular risk factors, history of the presenting symptom, neurological examination, and diagnostic formulation). Items were scored as present, absent, or unknown. No specific instructions as to how to elicit or score the assessment were given. Observers independently recorded their findings on a standard data form at the end of their assessment, which was placed in a sealed envelope before any discussion about the patient was permitted.
The National Institutes of Health Stroke Scale (NIHSS) was not assessed in this study because several other studies have assessed its inter-rater reliability.5,6,9,10 The 3 medically qualified examiners had completed the NIHSS training videotapes and were familiar with the scale. Observers were given written definitions for the Oxfordshire Community Stroke Project (OCSP) classification.13
Reliability was described using Cohen’s κ statistic, a measure of the extent to which agreement is greater than expected by chance alone.14 κ values range between 0 (chance agreement) and +1.00 (complete agreement); κ<0.20 represents poor agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, good agreement; and 0.81 to 1.00, excellent agreement.15 We did not use the weighted κ statistic in this study. Analyses were performed in SPSS version 11.0.0 (SPSS Inc), and CIs for κ values were calculated using Confidence Interval Analysis software.16
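To make the statistic concrete, the following is a minimal sketch (in Python; not the study's actual analysis, which used SPSS) of the unweighted Cohen's κ: observed agreement corrected for the agreement expected by chance from each observer's marginal frequencies. The stroke/mimic labels are invented for illustration and are not study data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    n = len(rater_a)
    assert n == len(rater_b) and n > 0
    # Observed proportion of agreement
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each observer's marginal frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    pe = sum((freq_a[c] / n) * (freq_b[c] / n)
             for c in freq_a.keys() | freq_b.keys())
    return (po - pe) / (1 - pe)

# Invented example: 100 assessments, the two observers agree on 90 diagnoses
a = ["stroke"] * 70 + ["mimic"] * 20 + ["stroke"] * 5 + ["mimic"] * 5
b = ["stroke"] * 70 + ["mimic"] * 20 + ["mimic"] * 5 + ["stroke"] * 5
print(round(cohens_kappa(a, b), 2))  # 0.73: "good" agreement
```

Note that 90% raw agreement yields κ=0.73 here, because roughly 62% agreement would be expected by chance alone given these marginal frequencies.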
We selected a range of variables with fair or moderate agreement for further exploration. Given its importance in determining eligibility for thrombolysis, we also looked at timing of symptom onset. Prespecified subgroups included the following.
Examiners’ Experience
We assessed the impact of experience by comparing the results for patients assessed by a medical student–physician pair with patients assessed by a physician–physician pair.
Examiners’ Level of Confidence
Observers were asked to rate their confidence for their clinical findings in the history and examination for each patient. A highly confident pair was defined as both observers stating that they were very confident; a low confidence pair was defined as 1 or both observers rating their confidence as low. Comparing the 2 categories allowed the impact of confidence to be assessed.
Time From Onset of Symptoms to Examination
We divided the patient sample into those assessed within 12 hours, 12 to 24 hours, 24 to 48 hours, and >48 hours after symptom onset. This allowed an assessment of the impact of time on the inter-rater reliability of the clinical assessment.
Patient-Related Factors
Although not modifiable, the impact of patient-related factors such as age, confusion, and aphasia was also assessed. For this analysis, confusion or aphasia was deemed to be present when the first medically qualified examiner scored it as present. A patient was classed as confused if he/she made ≥1 error on tests of orientation and attention.
Results
We recruited 98 patients over a 3-month period. The sample was not consecutive (27 patients admitted during the study period were not included because examiners were unavailable or consent was declined). Median age of the study population was 79 years (interquartile range [IQR] 69 to 86 years). The final diagnosis was a stroke in 74 patients (76%) and a condition mimicking stroke in 24 patients (24%). Median time from symptom onset to hospital presentation was 4 hours and 48 minutes (IQR 1 hour and 54 minutes to 13 hours and 18 minutes). Fifteen patients (15%) were assessed within 12 hours of onset, 26 (27%) between 12 and 24 hours, 28 (29%) between 24 and 48 hours, and 29 (29%) >48 hours after onset of symptoms (latest time was 8 days). Median time from beginning the first examination to beginning the second was 56 minutes (IQR 30 to 114 minutes). At the time of assessment, 10 patients had no clinical signs (all but 2 still had symptoms).
Inter-Rater Reliability of the Clinical Assessment
Agreement was good for the presence of 3 vascular risk factors but moderate for the others (Table 1). There was good or better agreement for many focal neurological symptoms and signs and the diagnosis of stroke or mimic (κ=0.77). The OCSP classification had moderate reliability. Observers had difficulty with the classification of partial anterior circulation syndrome (PACS; κ=0.43), and disagreements were most frequently between total anterior circulation syndrome and PACS, and lacunar syndrome and PACS.
There was good inter-rater reliability for whether an exact time of symptom onset could be determined (κ=0.63). However, both observers wrote the time and date of symptom onset in only 42 patients. There was agreement for date, hour, and minute in 19 of 42 (45%), agreement for date and hour in 11 of 42 (26%), agreement for date in only 9 of 42 (21%), and complete disagreement for date and time in 3 of 42 (7%). In the latter 3 patients, whose symptoms had fluctuated, the dispute arose over whether symptoms were first noted on waking or the night before. The median difference in onset times for the 23 patients for whom there was disagreement was 30 minutes (IQR 10 to 90 minutes). Inter-rater reliability was moderate for factors such as whether the patient woke from sleep with deficit, had improved since onset, or symptoms were stable (Table 1).
Exploring Reasons for Poor Inter-Rater Reliability
The Impact of Experience
A pair of physicians saw 45 patients, and a medical student–physician pairing saw 53 patients. The κ was higher for the doctor–doctor pairing across almost all components of the clinical assessment (Table 2). The difference was greatest for the neurological examination (eg, κ for visual neglect was 0.23 for student–doctor and 0.64 for doctor–doctor). The κ values for whether there was an exact time of onset were similar in both groups.
Clinician’s Level of Confidence
When both observers were certain of their findings, inter-rater reliability was higher than when 1 or both observers were uncertain (Table 2). The medical student was certain of her findings from the history on 63% of occasions, similar to the physicians (61%), but the physicians were more certain of their examination findings (83%) compared with the medical student (56% of occasions; P<0.001; χ2).
The Impact of Time
A medical student–physician pair and a physician–physician pair saw an equal proportion of patients at each of the 4 time intervals. Inter-rater reliability was highest when assessing patients between 12 and 24 hours after symptom onset (Table 2). Reliability appeared to be worse for patients who were assessed very early or very late.
The Impact of Patient-Related Factors
In general, inter-rater reliability was substantially worse for patients with aphasia, slightly worse for patients with confusion, and no different in those >80 years of age (Table 3).
Discussion
Stroke remains a clinical diagnosis, supported, in most cases but not all, by relevant brain imaging abnormalities. Important treatment decisions must be made on the basis of a rapid history and examination,12 often obtained by inexperienced doctors. A key issue is the timing of symptom onset,1 but its reliability has not been studied previously. It is important that stroke clinicians understand which clinical data are unreliable, why, and what can be done to improve reliability.
We found that the diagnosis of stroke or mimic had substantial agreement. Most items of the neurological examination had moderate or better reliability, as several other studies have found.2–11 Many items of the history were reliable, which contrasts with the only 2 studies that have tested the history.2,3 Greater examiner experience and confidence, which are almost certainly interlinked, increased reliability. The assessment was made more difficult by patient factors such as aphasia and confusion, and was less reliable in those presenting very early or late.
Our study adds important additional information to the previous studies because it is pragmatic and generalizable. We used real patients as subjects rather than clinical vignettes8 or videotaped observations.9 Our subjects presented with the undifferentiated clinical syndrome of “brain attack,” unlike early studies in which stable stroke patients were used and observers were expecting to find neurological symptoms or signs.3–5,7,10 Observers had different levels of experience, which reflects clinical reality (they were not highly trained stroke neurologists,3,4 nor were the signs demonstrated to the less experienced members of the team by a stroke neurologist5).
There are limitations to our study. Many of the above factors might be expected to reduce inter-rater reliability. In particular, we obtained history only from the patient, whereas often in clinical practice, family members or emergency staff are able to provide some information. Signs can fluctuate, and patients become fatigued between the first and second assessment. Conversely, reliability may have been inflated by the impact of “coaching” between first and subsequent assessments. Although our selection criteria (brain attack) were inclusive, referral bias may have resulted in more patients with stroke being recruited. This might be expected to result in better reliability than an unselected population. Few of our patients were seen in the hyperacute phase, so it is possible that in this population, reliability (particularly for time of onset) is better or worse than was observed in our study. More studies of patients presenting within 3 hours are needed, but these would be difficult, partly because the patient’s condition might fluctuate more rapidly in these early hours and partly because attempting to record 2 examinations might interfere with delivery of acute treatments.
There are several well-described limitations to the use of the κ statistic that influence the interpretation of our results.15,17 Very low (or high) prevalence results in high levels of expected agreement, and consequently the κ value is often low despite near perfect agreement.15 It is important to inspect the raw data for prevalence effects. Observer bias, in this sense a systematic difference between 2 observers in the way questions are answered, is a form of disagreement with important practical implications, but it is not separately identified by κ. We minimized bias by ensuring counterbalanced allocation of observers (ie, equal pairings of observers, and each observer was observer 1 and 2 an equal number of times). Statistical methods to adjust for bias and prevalence have been proposed, and some experts believe the intraclass correlation coefficient is a better measure, but nevertheless κ remains the most widely used index of agreement.17
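The prevalence effect can be demonstrated numerically. The following sketch (invented numbers, not study data) computes κ from a 2×2 table of two observers' yes/no ratings: both scenarios have identical 90% raw agreement, but κ collapses when the finding is rare, because chance agreement on "no" is then very high.

```python
def kappa_2x2(both_yes, a_only, b_only, both_no):
    """Cohen's kappa from the four cells of a 2x2 agreement table."""
    n = both_yes + a_only + b_only + both_no
    po = (both_yes + both_no) / n          # observed agreement
    pa_yes = (both_yes + a_only) / n       # observer A's "yes" marginal
    pb_yes = (both_yes + b_only) / n       # observer B's "yes" marginal
    pe = pa_yes * pb_yes + (1 - pa_yes) * (1 - pb_yes)  # chance agreement
    return (po - pe) / (1 - pe)

# Balanced prevalence: 90 of 100 agreements, finding present in ~half
print(round(kappa_2x2(45, 5, 5, 45), 2))  # 0.8: good agreement
# Rare finding: still 90 of 100 agreements, but only 2 "yes-yes" cases
print(round(kappa_2x2(2, 5, 5, 88), 2))   # 0.23: fair agreement only
```

This is why inspecting the raw agreement table alongside κ matters: the second scenario's low κ reflects prevalence, not a worse pair of observers.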
What are the implications of our study? This study will help physician trainers to identify the reliable (and unreliable) components of a clinical assessment. Such assessments remain important even in this technological age as they are universally generalizable and can help triage patients more appropriately for fast-track imaging, thrombolysis, or interhospital transfer. One strategy to improve the reliability of the clinical assessment would be to provide less experienced observers with detailed rules or guidelines for the interpretation of the information obtained. Such guidelines improve reliability for the diagnosis of transient ischemic attack18 and the NIHSS,9 probably by increasing the examiner’s confidence.
The determination of the time of symptom onset is crucial for acute stroke therapy. We found that the examiners agreed on the hour and minute of symptom onset in <50% of patients. A difference of even 30 minutes in determining the time of onset could prevent a patient from receiving thrombolysis under the current license. This emphasizes the need to corroborate the time of symptom onset with witnesses or time-related data (such as the start of a television program).
In conclusion, we demonstrated that the inter-rater reliability for many elements of clinical assessment is modest. Yet, for all its limitations (and for the foreseeable future), the bedside history and physical examination direct the immediate management of patients with suspected stroke. More research is required to evaluate methods to improve inter-rater reliability, with a particular emphasis on timing of symptom onset.
P.J.H. was funded by Chief Scientist Office, Health Department, Scottish Executive, grant reference CZB/4/14. J.A.H. received a grant from the Hersenstichting Nederland (Dutch Brain Foundation), Korte Houtstraat 10 in The Hague. We thank Sarah Keir for assisting with patient recruitment, and Steff Lewis for statistical advice. All authors were involved in devising the study and reviewing drafts of manuscripts. P.J.H., J.A.H., J.K. and B.L. recruited and examined the patients. P.J.H. analyzed the data and wrote the manuscript. J.M.W. and M.S.D. obtained funding and were responsible for the overall study.
- Received November 13, 2005.
- Revision received December 22, 2005.
- Accepted January 1, 2006.
References
Adams HP Jr, Adams RJ, Brott T, del Zoppo GJ, Furlan A, Goldstein LB, Grubb RL, Higashida R, Kidwell C, Kwiatkowski TG, Marler JR, Hademenos GJ; Stroke Council of the American Stroke Association. Guidelines for the early management of patients with ischemic stroke: a scientific statement from the Stroke Council of the American Stroke Association. Stroke. 2003; 34: 1056–1083.
Tomasello F, Mariani F, Fieschi C, Argentino C, Bono G, De Zanche L, Inzitari D, Martini A, Perrone P, Sangiovanni G. Assessment of inter-observer differences in the Italian multicenter study on reversible cerebral ischemia. Stroke. 1982; 13: 32–35.
Gelmers HJ, Gorter K, de Weerdt CJ, Wiezer HJ. Assessment of interobserver variability in a Dutch multicenter study on acute ischemic stroke. Stroke. 1988; 19: 709–711.
Brott T, Adams HP Jr, Olinger CP, Marler JR, Barsan WG, Biller J, Spilker J, Holleran R, Eberle R, Hertzberg V. Measurements of acute cerebral infarction: a clinical examination scale. Stroke. 1989; 20: 864–870.
Lindley RI, Warlow CP, Wardlaw JM, Dennis MS, Slattery J, Sandercock PA. Interobserver reliability of a clinical classification of acute cerebral infarction. Stroke. 1993; 24: 1801–1804.
Gordon DL, Bendixen BH, Adams HP Jr, Clarke W, Kapelle LJ, Woolson RF, TOAST Investigators. Interphysician agreement in the diagnosis of subtypes of acute ischemic stroke: implications for clinical trials. Neurology. 1993; 43: 1021–1027.
Lyden P, Brott T, Tilley B, Welch KM, Mascha EJ, Levine S, Haley EC, Grotta J, Marler J. Improved reliability of the NIH Stroke Scale using video training. NINDS TPA Stroke Study Group. Stroke. 1994; 25: 2220–2226.
Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ. 1992; 304: 1491–1494.
Altman DG, Machin D, Bryant TN, Gardner MJ. Statistics With Confidence: Confidence Intervals and Statistical Guidelines. London, UK: BMJ Books; 2000.
Al Shahi R, Pal N, Lewis SC, Bhattacharya JJ, Sellar RJ, Warlow CP. Observer agreement in the angiographic assessment of arteriovenous malformations of the brain. Stroke. 2002; 33: 1501–1509.
Koudstaal PJ, Gerritsma JG, van Gijn J. Clinical disagreement on the diagnosis of transient ischemic attack: is the patient or the doctor to blame? Stroke. 1989; 20: 300–301.