(Stroke. 2006;37:776.)
© 2006 American Heart Association, Inc.
Original Contributions
From the Division of Clinical Neurosciences, Western General Hospital, Crewe Rd, Edinburgh, UK (J.A.H., J.K., R.I.L., B.L., M.S.D., J.M.W.); and the RMH Stroke Centre (P.J.H.), Department of Neurology, Royal Melbourne Hospital, and Department of Medicine (Neuroscience), Monash University (Alfred Hospital Campus), Victoria, Australia.
Correspondence to Dr Peter Hand, Department of Neurology, c/- Post Office, Royal Melbourne Hospital, Victoria 3050, Australia. E-mail peter.hand{at}mh.org.au
| Abstract |
|---|
Methods The study was based in an urban hospital with an acute stroke unit. Pairs of observers independently assessed suspected stroke patients. Findings from the history and neurological examination, and the diagnosis of stroke or mimic, were recorded on a standard form. Reliability was measured by the κ statistic. We assessed the impact of observer experience and confidence, time of assessment, and patient-related factors of age, confusion, and aphasia on inter-rater reliability.
Results Ninety-eight patients were recruited. Most items of the history and the diagnosis of stroke were found to have moderate to good inter-rater reliability. There was agreement for the hour and minute of symptom onset in only 45% of cases. Observer experience and confidence improved reliability; patient-related factors of confusion and aphasia made the assessment more difficult. There was a trend for worse inter-rater reliability among patients assessed very early and very late after symptom onset.
Conclusions Clinicians should be aware that inter-rater reliability of the clinical assessment is affected by a variety of factors and is improved by experience and confidence. Our findings have implications for the training of doctors who assess patients with suspected stroke and identify the more reliable components of the clinical assessment.
Key Words: classification, diagnosis, observer variation, stroke assessment
| Introduction |
|---|
As part of a larger project to improve the clinical assessment of patients with suspected stroke (brain attack), we aimed to determine the inter-rater reliability of the components of the clinical assessment of patients with brain attack and to explore whether modifiable external factors (such as experience, time from onset, or level of clinical confidence) could explain clinical disagreements. We were particularly interested in testing items of history and the overall diagnosis of stroke or stroke mimic, and timing of symptom onset, given their importance in determining eligibility for acute stroke therapies.
| Materials and Methods |
|---|
Because we recruited patients with clinical features of suspected stroke, it was necessary to determine the final diagnosis of the event. We convened a panel of experts (stroke physicians, neurologists, and a neuroradiologist) to review the clinical, neuroimaging, and other laboratory data after the patient had been discharged and to assign the final diagnosis by consensus.
Observers
Four observers participated in the study, deliberately chosen to reflect a broad range of neurology experience. Three were between 5 and 9 years postgraduation from medical school (2 had completed 4 years of neurology training, and 1 was an internal medicine physician without formal training in neurology). The fourth was a final year medical student who received 4 weeks of practical training in clinical neurology assessment before the study commenced.
Data Collection
Observers took a history and performed a neurological examination on each patient. History could only be taken from the patient, rather than a family member or ambulance officer, and each observer was blind to the clinical record. We measured key items, chosen a priori by consensus among the stroke team, considered to be important in making a clinical diagnosis of stroke (vascular risk factors, history of the presenting symptom, neurological examination, and diagnostic formulation). Items were scored as present, absent, or unknown. No specific instructions as to how to elicit or score the assessment were given. Observers independently recorded their findings on a standard data form at the end of their assessment, which was placed in a sealed envelope before any discussion about the patient was permitted.
The National Institutes of Health Stroke Scale (NIHSS) was not assessed in this study because several other studies have assessed its inter-rater reliability.5,6,9,10 The 3 medically qualified examiners had completed the NIHSS training videotapes and were familiar with the scale. Observers were given written definitions for the Oxfordshire Community Stroke Project (OCSP) classification.13
Statistical Analysis
Reliability was described using Cohen's κ statistic, a measure of the extent to which agreement is greater than expected by chance alone.14 κ values range between 0 (chance agreement) and +1.00 (complete agreement); κ <0.2 represents poor agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, good agreement; and 0.81 to 1.00, excellent agreement.15 We did not use the weighted κ statistic in this study. Analyses were performed in SPSS version 11.0.0 (SPSS Inc), and CIs for κ values were calculated using Confidence Interval Analysis software.16
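As an illustration of the statistic described above, the following is a minimal sketch of unweighted Cohen's κ for two observers scoring one item as present, absent, or unknown. It is not the SPSS procedure used in the study. The function names `cohen_kappa` and `agreement_band` and the ten example ratings are hypothetical, and a κ of exactly 0.20 is assumed to fall in the "poor" band.

```python
from collections import Counter
import math

def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same patients."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed proportion of patients on whom the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return math.nan  # kappa is undefined when chance agreement is complete
    return (observed - expected) / (1 - expected)

def agreement_band(kappa):
    """Map kappa to the descriptive bands quoted in the text."""
    if kappa <= 0.20:  # assumption: exactly 0.20 is treated as poor
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "excellent"

# Hypothetical ratings of one item by two observers (not study data).
observer_1 = ["present"] * 4 + ["absent"] * 5 + ["unknown"]
observer_2 = ["present"] * 3 + ["absent"] * 5 + ["present", "unknown"]

kappa = cohen_kappa(observer_1, observer_2)
print(f"kappa = {kappa:.2f} ({agreement_band(kappa)})")
```

In this made-up example the observers agree on 8 of 10 patients (80%), but chance agreement is 42%, so κ is about 0.66.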
Subgroup Analyses
We selected a range of variables with fair or moderate agreement for further exploration. Given its importance in determining eligibility for thrombolysis, we also looked at timing of symptom onset. Prespecified subgroups included the following.
Observer Experience
We assessed the impact of experience by comparing the results for patients assessed by a medical student–physician pair with patients assessed by a physician–physician pair.
Examiners' Level of Confidence
Observers were asked to rate their confidence for their clinical findings in the history and examination for each patient. A highly confident pair was defined as both observers stating that they were very confident; a low confidence pair was defined as 1 or both observers rating their confidence as low. Comparing the 2 categories allowed the impact of confidence to be assessed.
Time From Onset of Symptoms to Examination
We divided the patient sample into those assessed within 12 hours, 12 to 24 hours, 24 to 48 hours, and >48 hours after symptom onset. This allowed an assessment of the impact of time on the inter-rater reliability of the clinical assessment.
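The four time strata above could be assigned as in the sketch below. The text does not say how patients assessed at exactly 12, 24, or 48 hours were classified, so the boundary handling here is an assumption, and the function name `onset_to_exam_band` is hypothetical.

```python
def onset_to_exam_band(hours):
    """Assign the time from symptom onset to examination to one of four strata.

    Assumption: boundary values go to the later band, except 48 hours,
    which stays in "24 to 48 h" because the last band is strictly >48 hours.
    """
    if hours < 0:
        raise ValueError("time from onset cannot be negative")
    if hours < 12:
        return "<12 h"
    if hours < 24:
        return "12 to 24 h"
    if hours <= 48:
        return "24 to 48 h"
    return ">48 h"

# Hypothetical delays in hours (not study data).
for delay in (3, 12, 30, 72):
    print(delay, "->", onset_to_exam_band(delay))
```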
Although not modifiable, the impact of patient-related factors such as age, confusion, and aphasia was also assessed. For this analysis, confusion or aphasia was deemed to be present when the first medically qualified examiner scored it as present. A patient was classed as confused if he/she made
1 errors on tests of orientation and attention.
| Results |
|---|
Inter-Rater Reliability of the Clinical Assessment
Agreement was good for the presence of 3 vascular risk factors but moderate for the others (Table 1). There was good or better agreement for many focal neurological symptoms and signs and the diagnosis of stroke or mimic (κ=0.77). The OCSP classification had moderate reliability. Observers had difficulty with the classification of partial anterior circulation syndrome (PACS; κ=0.43), and disagreements were most frequently between total anterior circulation syndrome and PACS, and lacunar syndrome and PACS.
There was good inter-rater reliability for whether an exact time of symptom onset could be determined (κ=0.63). However, both observers wrote the time and date of symptom onset in only 42 patients. There was agreement for date, hour, and minute in 19 of 42 (45%), agreement for date and hour in 11 of 42 (26%), agreement for date in only 9 of 42 (21%), and complete disagreement for date and time in 3 of 42 (7%). In the latter 3 patients, whose symptoms had fluctuated, the dispute arose over whether symptoms were first noted on waking or the night before. The median difference in onset times for the 23 patients for whom there was disagreement was 30 minutes (IQR 10 to 90 minutes). Inter-rater reliability was moderate for factors such as whether the patient woke from sleep with deficit, had improved since onset, or symptoms were stable (Table 1).
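The onset-time breakdown can be checked arithmetically from the counts reported in the text. The short sketch below does only that; the variable names are hypothetical.

```python
# Counts reported in the text for the 42 patients in whom both observers
# recorded a date and time of symptom onset.
onset_agreement = {
    "date, hour, and minute": 19,
    "date and hour only": 11,
    "date only": 9,
    "complete disagreement": 3,
}

total = sum(onset_agreement.values())
percentages = {level: round(100 * n / total) for level, n in onset_agreement.items()}
# Patients without agreement down to the minute.
disagreed = total - onset_agreement["date, hour, and minute"]

for level, pct in percentages.items():
    print(f"{level}: {onset_agreement[level]} of {total} ({pct}%)")
print(f"any disagreement on onset time: {disagreed} of {total}")
```

The four categories sum to 42, and the 23 patients without minute-level agreement match the group for which the median difference is reported.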
Exploring Reasons for Poor Inter-Rater Reliability
The Impact of Experience
A pair of physicians saw 45 patients, and a medical student–physician pairing saw 53 patients. The κ was higher for the doctor–doctor pairing across almost all components of the clinical assessment (Table 2). The difference was greatest for the neurological examination (eg, κ for visual neglect was 0.23 for student–doctor and 0.64 for doctor–doctor). The κ values for whether there was an exact time of onset were similar in both groups.
Clinicians' Level of Confidence
When both observers were certain of their findings, inter-rater reliability was higher than when 1 or both observers were uncertain (Table 2). The medical student was certain of her findings from the history on 63% of occasions, similar to the physicians (61%), but the physicians were more certain of their examination findings (83%) compared with the medical student (56% of occasions; P<0.001; χ² test).
The Impact of Time
A medical student–physician pair and a physician–physician pair saw an equal proportion of patients at each of the 4 time intervals. Inter-rater reliability was highest when assessing patients between 12 and 24 hours after symptom onset (Table 2). Reliability appeared to be worse for patients who were assessed very early or very late.
Patient-Related Factors
In general, inter-rater reliability was substantially worse for patients with aphasia, slightly worse for patients with confusion, and no different in those >80 years of age (Table 3).
| Discussion |
|---|
We found that the diagnosis of stroke or mimic had substantial agreement. Most items of the neurological examination had moderate or better reliability, as several other studies found.2–11 Many items of history were reliable, which refutes the only 2 studies that have tested the history.2,3 Greater examiner experience and confidence, which are almost certainly interlinked, increased reliability. The assessment was made more difficult by patient factors such as aphasia and confusion, and in those presenting very early or late.
Our study adds important information to the previous studies because it is pragmatic and generalizable. We used real patients as subjects rather than clinical vignettes8 or videotaped observations.9 Our subjects presented with the undifferentiated clinical syndrome of "brain attack," unlike early studies in which stable stroke patients were used and observers were expecting to find neurological symptoms or signs.3–5,7,10 Observers had different levels of experience, which reflects clinical reality (they were not highly trained stroke neurologists,3,4 nor were the signs demonstrated to the less experienced members of the team by a stroke neurologist5).
There are limitations to our study. Many of the above factors might be expected to reduce inter-rater reliability. In particular, we obtained history only from the patient, whereas often in clinical practice, family members or emergency staff are able to provide some information. Signs can fluctuate, and patients become fatigued between the first and second assessment. Conversely, reliability may have been inflated by the impact of "coaching" between first and subsequent assessments. Although our selection criteria (brain attack) were inclusive, referral bias may have resulted in more patients with stroke being recruited. This might be expected to result in better reliability than an unselected population. Few of our patients were seen in the hyperacute phase, so it is possible that in this population, reliability (particularly for time of onset) is better or worse than was observed in our study. More studies of patients presenting within 3 hours are needed, but these would be difficult, partly because the patient's condition might fluctuate more rapidly in these early hours and partly because attempting to record 2 examinations might interfere with delivery of acute treatments.
There are several well-described limitations to the use of the κ statistic that influence the interpretation of our results.15,17 Very low (or high) prevalence results in high levels of expected agreement, and consequently the κ value is often low despite near perfect agreement.15 It is important to inspect the raw data for prevalence effects. Observer bias, in this sense a systematic difference between 2 observers in the way questions are answered, is a form of disagreement with important practical implications, but it is not separately identified by κ. We minimized bias by ensuring counterbalanced allocation of observers (ie, equal pairings of observers, and each observer was observer 1 and 2 an equal number of times). Statistical methods to adjust for bias and prevalence have been proposed, and some experts believe the intraclass correlation coefficient is a better measure, but nevertheless κ remains the most widely used index of agreement.17
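The prevalence effect can be illustrated with two hypothetical 2×2 agreement tables that share the same raw agreement (94%) but differ in how common the finding is. The counts below are invented for illustration and are not taken from the study; `kappa_from_2x2` is a hypothetical helper.

```python
def kappa_from_2x2(both_present, only_first, only_second, both_absent):
    """Return (raw agreement, Cohen's kappa) for a present/absent item."""
    n = both_present + only_first + only_second + both_absent
    observed = (both_present + both_absent) / n
    # Marginal proportion of "present" ratings for each observer.
    first_present = (both_present + only_first) / n
    second_present = (both_present + only_second) / n
    expected = (first_present * second_present
                + (1 - first_present) * (1 - second_present))
    return observed, (observed - expected) / (1 - expected)

# Rare finding: present in about 4% of ratings (hypothetical counts).
rare_agreement, rare_kappa = kappa_from_2x2(1, 3, 3, 93)
# Common finding: present in about 50% of ratings (hypothetical counts).
common_agreement, common_kappa = kappa_from_2x2(47, 3, 3, 47)

print(f"rare:   agreement {rare_agreement:.2f}, kappa {rare_kappa:.2f}")
print(f"common: agreement {common_agreement:.2f}, kappa {common_kappa:.2f}")
```

With identical raw agreement, κ is about 0.22 for the rare finding and 0.88 for the common one, which is why the raw counts should be inspected alongside κ.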
What are the implications of our study? This study will help physician trainers to identify the reliable (and unreliable) components of a clinical assessment. Such assessments remain important even in this technological age as they are universally generalizable and can help triage patients more appropriately for fast-track imaging, thrombolysis, or interhospital transfer. One strategy to improve the reliability of the clinical assessment would be to provide less experienced observers with detailed rules or guidelines for the interpretation of the information obtained. Such guidelines improve reliability for the diagnosis of transient ischemic attack18 and the NIHSS,9 probably by increasing the examiners' confidence.
The determination of time of symptom onset is crucial for acute stroke therapy. We found that the examiners agreed on the hour and minute of symptom onset in <50% of patients. A difference of even 30 minutes in determining time of onset could prevent a patient receiving thrombolysis under current license. This emphasizes the need to corroborate the time of symptom onset with witnesses or time-related data (such as the start of a television program).
In conclusion, we demonstrated that the inter-rater reliability for many elements of clinical assessment is modest. Yet, for all its limitations (and for the foreseeable future), the bedside history and physical examination direct the immediate management of patients with suspected stroke. More research is required to evaluate methods to improve inter-rater reliability, with a particular emphasis on timing of symptom onset.
| Acknowledgments |
|---|
Received November 13, 2005; revision received December 22, 2005; accepted January 1, 2006.