Retrospective Assessment of Initial Stroke Severity
Comparison of the NIH Stroke Scale and the Canadian Neurological Scale
Background and Purpose—The NIH Stroke Scale (NIHSS) and the Canadian Neurological Scale (CNS) have been reported to be useful for the retrospective assessment of initial stroke severity. However, unlike the CNS, the NIHSS requires detailed neurological assessments that may not be reflected in all patient records, potentially limiting its applicability. We assessed the reliability of the retrospective algorithms and the proportions of missing items for the NIHSS and CNS in stroke patients admitted to an academic medical center (AMC) and 2 community hospitals.
Methods—Randomly selected records of patients with ischemic stroke admitted to an AMC (n=20) and community hospitals with (CH1, n=19) and without (CH2, n=20) acute neurological consultative services were reviewed. NIHSS and CNS scores were assigned independently by 2 neurologists using published algorithms. Interrater reliability of the scores was determined with the intraclass correlation coefficient, and the numbers of missing items were tabulated.
Results—The intraclass correlation coefficients for the NIHSS and CNS, respectively, were 0.93 (95% CI, 0.82 to 1.00) and 0.97 (95% CI, 0.90 to 1.00) for the AMC, 0.89 (95% CI, 0.75 to 1.00) and 0.88 (95% CI, 0.73 to 1.00) for CH1, and 0.48 (95% CI, 0.26 to 0.70) and 0.78 (95% CI, 0.60 to 0.96) for CH2. More NIHSS items were missing at CH2 (62%) than at the AMC (27%) and CH1 (23%; P=0.0001). In comparison, 33%, 0%, and 8% of CNS items were missing from records at CH2, the AMC, and CH1, respectively (P=0.0001).
Conclusions—Interrater agreement was almost perfect for retrospectively assigned NIHSS and CNS scores for patients initially evaluated by a neurologist at both the AMC and CH1. Agreement for the CNS was substantial at CH2, but agreement for the NIHSS was only moderate in this setting. The proportion of missing items was higher for the NIHSS than for the CNS in each setting, particularly limiting its application in the hospital without acute neurological consultative services.
Quality of care and cost-effectiveness studies focused on ischemic stroke require adjustment for initial stroke severity, the most powerful predictor of outcome.1 The NIH Stroke Scale (NIHSS) and the Canadian Neurological Scale (CNS), 2 stroke impairment scales originally designed for prospective scoring,2,3 have been used for the retrospective assessment of initial stroke severity.4–6 In addition, both scales are helpful in determining prognosis.7 The NIHSS is the most commonly used impairment measure in ischemic stroke treatment trials, and its retrospective application for outcome studies would permit direct comparisons with data from the prospective trials.
The retrospective application of these scales has been assessed only in limited settings thus far. For example, the NIHSS was found to be both reliable and valid when applied retrospectively in a study of patients enrolled in clinical trials who had been prospectively assessed.4 A retrospective algorithm developed to apply the NIHSS on the basis of data extracted from patients’ medical records in an academic hospital setting also appeared to be reliable and valid.5 The reliability and validity of the CNS were established in a similar setting.6 However, in comparison to the CNS, the NIHSS requires detailed neurological evaluations that may not be reflected in all patient records. Even when the NIHSS was applied retrospectively in an academic medical center (AMC), only 1 record provided information that permitted completion of all of its items.5
Only a minority of stroke patients are admitted to AMCs. The majority are cared for by non-neurologists and are admitted to community hospitals.8 However, neither the NIHSS nor the CNS has been used retrospectively in these settings. Because of the detail necessary for NIHSS scoring, we hypothesized that retrospective assessment of the CNS would be more reliable than the NIHSS when using records from community hospitals in which evaluations were performed by non-neurologists. The aim of the present study was to assess the reliability of the published retrospective algorithms and to document the proportions of missing items for the NIHSS and CNS in stroke patients admitted to an AMC in comparison to community hospitals with and without acute neurological consultative services.
Subjects and Methods
Randomly selected medical records of patients with ischemic stroke admitted to an AMC (n=20) and community hospitals with (CH1, n=19) and without (CH2, n=20) acute neurological consultative services were reviewed. Patients were identified from a stroke registry at the AMC and on the basis of the International Classification of Diseases, 9th Revision (ICD-9), codes at the community hospitals, with the diagnosis verified by record review. Only patients with a primary diagnosis of ischemic stroke and whose records included at least an admission history, admission physical examination, and discharge summary were included.
The initial neurological examination documented in the admission notes was preferentially used for retrospective assessment of stroke severity. Discharge summaries were used only when the admission note was not available and the admission neurological examination was adequately documented (n=2). Data abstraction was performed independently by 2 neurologists who were certified in prospective administration of the NIHSS but were not blinded to the source of the hospital records. The NIHSS and the CNS scores were assigned using published algorithms.5,6 Missing items from the NIHSS and the CNS were scored as normal.5,6 Scores for individual scale items, the total scores, and the number of missing items for each scale were recorded.
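As a minimal sketch of the scoring convention above, the following Python snippet tallies a total score and a missing-item count for one abstracted record, treating undocumented items as normal. The item names and values are hypothetical and use the NIHSS convention that a normal finding scores 0; this is an illustration, not the published abstraction instrument.

```python
# Hypothetical item abstraction: None marks an item not documented in the record.
# Per the published retrospective algorithms, missing items are scored as normal,
# which on the NIHSS means a contribution of 0 to the total.
def score_record(items):
    """Return (total score, number of missing items) for one abstracted record."""
    missing = sum(1 for value in items.values() if value is None)
    total = sum(value for value in items.values() if value is not None)
    return total, missing

# Invented example record with 2 of 5 items undocumented
record = {"loc": 0, "gaze": 1, "visual_fields": None, "motor_arm": 2, "dysarthria": None}
print(score_record(record))  # (3, 2)
```

Counting the missing items alongside the total, as both reviewers did here, is what makes the per-setting missing-item comparisons in the Results possible.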
Interrater reliability was assessed with the intraclass correlation coefficient (ICC). The ICC partitions the total variance of the sample into the variance due to differences among reviewers, the variance due to differences among subjects, and the unexplained residual variance. The ICC is maximized when the variance due to differences among subjects is high relative to the variance due to differences among reviewers and the residual variance.9 Weighted κ scores were calculated for individual items of the NIHSS and CNS for each hospital. For reference, both the ICC and weighted κ scores may be interpreted according to the following guidelines: chance (0), poor (0 to 0.19), fair (0.20 to 0.39), moderate (0.40 to 0.59), substantial (0.60 to 0.79), and almost perfect agreement (0.80 to 1.0). Statistical analysis was performed using the SAS statistical software package. Kruskal-Wallis nonparametric ANOVA statistics were used to compare scores among hospitals. The protocol was reviewed and exempted by the Institutional Review Board at each hospital.
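To illustrate the variance decomposition behind the ICC, the sketch below computes a two-way random-effects ICC(2,1) from the standard ANOVA mean squares on invented paired ratings. The function name and data are hypothetical assumptions for this example; the study's own analysis was performed in SAS, and the exact ICC model used there is not specified here.

```python
import numpy as np

def icc_2_1(ratings):
    """Two-way random-effects ICC(2,1) for an n_subjects x n_raters matrix."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    subj_means = ratings.mean(axis=1)
    rater_means = ratings.mean(axis=0)
    # Mean squares from the two-way ANOVA decomposition
    ms_subjects = k * np.sum((subj_means - grand_mean) ** 2) / (n - 1)
    ms_raters = n * np.sum((rater_means - grand_mean) ** 2) / (k - 1)
    ss_error = (np.sum((ratings - grand_mean) ** 2)
                - k * np.sum((subj_means - grand_mean) ** 2)
                - n * np.sum((rater_means - grand_mean) ** 2))
    ms_error = ss_error / ((n - 1) * (k - 1))
    # High between-subject variance relative to rater and residual variance -> ICC near 1
    return (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_raters - ms_error) / n)

# Hypothetical NIHSS totals assigned by 2 raters to 5 records
scores = [[4, 4], [8, 7], [12, 12], [2, 3], [15, 15]]
print(round(icc_2_1(scores), 2))  # 0.99
```

Because the raters here differ by at most 1 point while the records span 2 to 15, between-subject variance dominates and the ICC approaches 1, mirroring the "almost perfect" values reported for the AMC.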
Results
The mean ages of the patients were 69.2±16.3, 67.8±11.8, and 68.9±13.0 years at the AMC, CH1, and CH2, respectively (ANOVA, P=0.95). The proportion of female stroke patients was 60%, 58%, and 65% at the 3 hospitals, respectively (χ2 test, P=0.90). The median and range of the NIHSS and CNS total scores are given in Table 1. There were no significant differences in the median NIHSS (Kruskal-Wallis, P=0.317) or CNS (Kruskal-Wallis, P=0.316) total scores among the 3 hospitals. On the basis of the levels of agreement measured by the ICCs, agreement for the NIHSS total scores was moderate for CH2, in which the retrospective ratings were based on evaluations performed by non-neurologists, but almost perfect for the AMC and CH1. In contrast, the levels of agreement for the CNS were substantial for CH2 and almost perfect at the AMC and CH1 (Table 2).
The medians, ranges, and proportions of missing items for each scale in each setting are shown in Table 3. CH2 had significantly more missing items for both the NIHSS and the CNS than the AMC or CH1 (Kruskal-Wallis χ2=70.6, P=0.0001, and χ2=52.2, P=0.0001, for NIHSS and CNS, respectively). The proportions of missing items from the NIHSS and the CNS by setting are given in Figures 1 and 2, respectively.
Interobserver agreement for individual retrospective NIHSS items is shown in Table 4. For the AMC, substantial (κ range, 0.6 to 0.8) or almost perfect (κ>0.8) agreement was found for all items of the NIHSS except for visual fields (κ=0.47). At CH1, 9 of 13 items showed substantial or better agreement, but all 3 level of consciousness (LOC) subscores and gaze assessments resulted in only fair or moderate agreement. At CH2, only 4 of 13 items had substantial or better agreement. In addition, extinction was missing from 100% of the records at CH2; therefore, the κ score for this item could not be assessed. Visual field assessment resulted in chance agreement.
Interobserver agreement for individual retrospective CNS items is given in Table 5. For the AMC, all items had almost perfect agreement, with the exception of orientation assessment, which was moderate. At CH1, all items had substantial or better agreement except for LOC and orientation assessments, which were fair to moderate. At CH2, all items had substantial or better agreement except for distal arm and leg assessments, which were only fair.
Discussion
We found almost perfect levels of agreement for total NIHSS and CNS scores assigned retrospectively to patients with ischemic stroke at the AMC and CH1 but not at CH2. Therefore, the detailed neurological assessment required for retrospective assignment of the NIHSS was most reliable in settings in which neurologists performed the acute stroke evaluations. These data also show that the retrospective CNS scores were more reliable than the NIHSS when non-neurological specialists documented the evaluation. Substantially more NIHSS than CNS items were missing in each setting.
The retrospective NIHSS algorithm was developed using typical hospital discharge summaries of stroke patients.5 However, the retrospective application of the NIHSS depends on documentation of the neurological examination at admission. The retrospective algorithm was developed and assessed with patients admitted to an AMC and was validated against prospective scores. In this setting, items missing from the retrospective score were frequently normal when correlated with the prospective score.5 The assumption that missing items are normal may be appropriate when neurologists evaluate acute stroke patients, but, as shown in the present study, non-neurologists may not routinely document neurological examinations in the same detail as neurologists. We found that retrospective interpretation of ambiguously documented examination findings was difficult, leading to poorer reliability of the retrospective NIHSS. The retrospective scoring algorithm may need to be revised to address this problem.
The proportion of missing items was higher for both scales at CH2 than for the AMC or CH1 (Table 3). More than one half of the NIHSS items were missing from each record at CH2, and some individual records lacked 90% of the items for this scale. According to the retrospective algorithms, missing items are scored as normal. In settings with a high proportion of missing items, this could lead to underestimation of the magnitude of the stroke-related deficit and to uncertainty about its nature. The high proportion of missing retrospective NIHSS items limits its application for outcome studies, particularly when acute neurological consultation is not available. In addition, the items most commonly documented at CH2 were motor deficit of the arm (60%) and leg (60%) and LOC (56%), effectively reducing the NIHSS score to the items captured by the CNS.
Consistent with the findings of Williams et al,5 the individual NIHSS items most commonly missing from patient records were assessments of dysarthria, visual fields, and neglect/extinction. Assessments of motor arm deficit, visual fields, and aphasia were particularly unreliable at CH2. Items such as hemianopia and extinction may not be documented because those features of the examination may not affect clinical decision-making at admission. However, both hemianopia and neglect/inattention are predictors of functional independence and outcome in stroke patients.1,10,11 Therefore, inconsistencies in the documentation of these items in the retrospective NIHSS may provide misleading assessments of stroke prognosis and outcome.
The reliability of the individual NIHSS items was better at the AMC than in either of the community hospitals (Table 4). This result may be due to a higher level of documentation in a teaching hospital than in a community hospital setting.5 Furthermore, when the examination was documented by a non-neurologist in the community hospital, the raters disagreed more frequently on which items were missing. This affected scoring (missing NIHSS items were scored as normal), and therefore the estimates of the weighted κ scores were statistically unstable, with large standard errors and wide confidence limits. In addition, for items such as visual fields, simple agreement was high (95%), but the weighted κ was low (0). This paradox has been well described and occurs when the marginal or column values are unbalanced, leading to a high expected proportion of agreement as a result of chance.12,13 Because the numerator of the κ formula is the observed proportion of agreement minus the proportion expected by chance, the resulting κ score is low. These κ scores should therefore be interpreted with caution.
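The κ paradox described above can be reproduced numerically. The counts below are hypothetical, not the study's data, but they are chosen to mirror the reported pattern: when one rater's marginal is completely unbalanced (all "normal"), 95% simple agreement still yields a κ of 0, because the chance-expected agreement equals the observed agreement.

```python
# Cohen's kappa: observed agreement corrected for chance-expected agreement.
def cohens_kappa(a, b, categories):
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from the two raters' marginal proportions
    p_exp = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if p_exp == 1.0:          # guard: kappa undefined when chance agreement is total
        return 0.0
    return (p_obs - p_exp) / (1 - p_exp)

# Unbalanced marginals: rater A scores all 20 visual-field items "normal";
# rater B agrees on 19 and scores 1 "abnormal".
rater_a = ["normal"] * 20
rater_b = ["normal"] * 19 + ["abnormal"]
print(sum(x == y for x, y in zip(rater_a, rater_b)) / 20)  # 0.95 simple agreement
print(cohens_kappa(rater_a, rater_b, ["normal", "abnormal"]))  # 0.0
```

Here p_obs = 0.95 and p_exp = 1.0 × 0.95 + 0 × 0.05 = 0.95, so the numerator vanishes and κ = 0 despite near-total agreement, which is why such κ values warrant cautious interpretation.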
This study has several limitations. First, most patients had only mild to moderate strokes (median NIHSS, 4.5 to 8), reflecting the patient populations at the participating hospitals. Reliability may differ in settings with patients who have a wider range of deficits. Second, neither the NIHSS nor the CNS scores were assessed prospectively (the gold standard) as a method of validation. However, the purpose of this study was to assess the comparative reliability of the NIHSS and the CNS, not to revalidate the retrospective algorithms. Third, some bias cannot be excluded because the abstractors were not blinded to the source of the hospital records. However, it is very difficult to blind the assessments effectively because of obvious differences in the ways in which examinations were documented in the medical record. Finally, the abstractors for the present study were neurologists certified in the prospective use of the NIHSS. In practice, abstractors should also be certified in the retrospective application of either algorithm if the algorithms are to be used in clinical studies.
Unlike the comprehensive neurological assessment provided by the NIHSS, the CNS focuses on LOC and motor deficits. As a result, the NIHSS is more frequently used in prospective clinical trials to give a fuller assessment of stroke-related impairments. Both scales have significant prognostic value;7 however, the increased comprehensiveness of the NIHSS must be balanced against the greater reliability of the CNS as applied retrospectively, especially in studies relying on assessments documented in the medical record by non-neurologists. The impairments captured by the CNS correlate well with disability measurements of activities of daily living such as the Barthel Index,14 a common outcome scale used in both prospective15 and retrospective16 stroke studies. Therefore, the CNS provides a reliable and clinically meaningful retrospective assessment of initial stroke severity that can be applied in a variety of clinical settings.
A reliable and valid assessment of stroke severity is a critical covariate for the analysis and interpretation of outcome studies.1 The reliability of the retrospective NIHSS and CNS scoring was acceptable in both AMC and CH settings in which neurologists performed and documented the initial evaluations. However, the high proportion of missing items limits the application of the retrospective NIHSS in a setting in which a neurologist did not perform the evaluation at the time of hospital admission. The use of the CNS may result in a more reliable assessment of stroke severity than the NIHSS for retrospective outcome studies that include community hospitals without acute neurological consultative services.
Funding for this research was provided by the Agency for Healthcare Research and Quality training grant. The authors would like to thank Gregory Samsa, PhD, for his statistical expertise and advice. We would also like to thank Joan Mesler, Robert Mitchell, MD, and Kyle McDermott for their assistance with obtaining patient records.
- Received August 31, 2000.
- Revision received November 23, 2000.
- Accepted December 6, 2000.
- Copyright © 2001 by American Heart Association
Fullerton K, MacKenzie G, Stout R. Prognostic indices in stroke. Q J Med. 1988;250:147–162.
Brott T, Adams HP, Olinger CP, Marler JR, Barsan WG, Biller J, Spilker J, Holleran R, Eberle R, Hertzberg V, Rorick M, Moomaw CJ, Walker M. Measurements of acute cerebral infarction: a clinical examination scale. Stroke. 1989;20:864–870.
Cote R, Battista RN, Wolfson C, Boucher J, Adam J, Hachinski V. The Canadian Neurological Scale: validation and reliability assessment. Neurology. 1989;39:638–643.
Kasner SE, Chalela JA, Luciano JM, Cucchiara BL, Raps EC, McGarvey ML, Conroy MB, Localio AR. Reliability and validity of estimating the NIH Stroke Scale score from medical records. Stroke. 1999;30:1534–1537.
Williams LS, Yilmaz EY, Lopez-Yunez AM. Retrospective assessment of initial stroke severity with the NIH Stroke Scale. Stroke. 2000;31:858–862.
Goldstein LB, Chilukuri V. Retrospective assessment of initial stroke severity with the Canadian Neurological Scale. Stroke. 1997;28:1181–1184.
Muir KW, Weir CJ, Murray GD, Povey C, Lees KR. Comparison of neurological scales and scoring systems for acute stroke prognosis. Stroke. 1996;27:1817–1820.
Mitchell JB, Ballard DJ, Whisnant JP, Ammering CJ, Samsa GP, Matchar DB. What role do neurologists play in determining the costs and outcomes of stroke patients? Stroke. 1996;27:1937–1943.
Allen C. Predicting the outcome of acute stroke: a prognostic score. J Neurol Neurosurg Psychiatry. 1984;47:475–480.
Jongbloed L. Prediction of function after stroke: a critical review. Stroke. 1986;17:765–776.
Bushnell C, Phillips-Bute B, Laskowitz D, Lynch J, Chilukuri V, Borel C. Survival and outcome after endotracheal intubation for acute stroke. Neurology. 1999;52:1374–1381.