Retrospective Assessment of Initial Stroke Severity With the NIH Stroke Scale
Background and Purpose—It is important to adjust stroke outcomes for differences in initial stroke severity. The NIH Stroke Scale (NIHSS) is a commonly used stroke severity measure but has been validated for retrospective scoring only in a subset of stroke clinical trial participants. The purpose of this research was to assess the validity and reliability of an algorithm for retrospective NIHSS scoring in a setting with usual chart documentation.
Methods—An algorithm for retrospective NIHSS scoring was developed with written history and physical admission notes. Missing physical examination data were scored as normal. One investigator prospectively scored the admission NIHSS in 32 consecutive stroke patients. Two raters retrospectively scored the NIHSS by applying the algorithm to photocopied admission notes. Linear regression was used to assess interrater reliability and agreement between prospective and retrospective NIHSS scores. The Wilcoxon signed rank test was used to assess systematic scoring bias. Weighted kappa statistics were calculated to assess the level of agreement of individual NIHSS items.
Results—Only 1 admission note was complete for all NIHSS elements. Interrater reliability was near perfect (r2=0.98, P<0.001). Agreement between prospective and retrospective NIHSS score was also excellent (r2=0.94, P<0.001) and there was no systematic bias in retrospective scores. Agreement for individual items was moderate to high for all items except level of consciousness.
Conclusions—Retrospective NIHSS scoring with the algorithm is reliable and unbiased even when physical examination elements are missing from the written record. Stroke research using retrospective review of charts or of administrative databases should adjust for differences in stroke severity using such an algorithm.
There is increasing emphasis on assessing quality of care at both local and national levels. Quality is usually inferred by linking specific structure or process of care indicators to patient-level outcomes such as mortality, length of stay, or functional outcome. Interpreting these data, particularly across institutions, requires case mix adjustment. This can be difficult, because quality assessments and case mix adjustment are usually done from retrospective analyses of administrative databases that do not contain clinically relevant variables for measuring disease severity.
When patient-level stroke outcomes are assessed, initial stroke severity is one of the variables that must be taken into account in adjusting outcomes for differences in case mix.1 2 This adjustment is critical, because it is well established that stroke severity at onset influences many outcomes, including mortality, length of stay, progression of deficit, and eventual functional recovery.3 4 5 6 7 8 Although initial stroke severity can be validly assessed in prospective studies, requisite variables are not typically recorded in a quantifiable way in most patient-care settings.
The NIH Stroke Scale (NIHSS) is a well validated and commonly used stroke impairment scale that sums the scores from individual elements of the neurological examination to provide an overall stroke impairment score.9 10 Although the NIHSS was recently reported to be reliable and valid for retrospective scoring,11 the study was conducted exclusively in stroke clinical trial participants, for whom the level of chart documentation is likely higher than in usual clinical practice. This report also did not address the reliability of individual NIHSS items. The aim of this research was to assess the validity and reliability of an algorithm for retrospective NIHSS scoring that would not require research-level documentation of the neurological examination. We hypothesized that (1) chart documentation for all elements of the NIHSS in non–clinical trial participants would be less complete than previously reported and (2) the algorithm devised for retrospective scoring would provide a reasonable estimate of initial stroke severity.
Subjects and Methods
An algorithm for retrospective NIHSS scoring was developed using written admission history and physical examinations (H&Ps) from patients admitted with acute ischemic stroke (Appendix). Instructions were developed for scoring each of the 12 individual NIHSS elements (level of consciousness, response to questions, response to commands, best gaze, visual, facial palsy, motor, ataxia, sensory, best language, dysarthria, and extinction/inattention). Missing NIHSS elements were coded as normal. The algorithm was pilot tested in a series of 10 ischemic stroke patients with written H&Ps. After algorithm development, one investigator prospectively scored the initial NIHSS within 24 hours of admission in a consecutive series of 32 ischemic stroke patients. This examination was not recorded in the medical record. All patients were admitted to the neurology service at one of 3 hospitals. In each case, the admitting team consisted of a medical student, neurology resident, and staff neurologist. Admitting students and physicians were not aware of this research project, and written H&Ps were not standardized. This study was approved by the institutional review board.
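The rule that missing elements are coded as normal can be illustrated with a minimal Python sketch. This is not the algorithm in the Appendix: the element labels, the function name `retrospective_total`, and the example note are all hypothetical, and the sketch assumes an abstractor has already assigned a numeric score to each documented element (with "motor" entered as a single subtotal).

```python
from typing import Mapping, Optional

# The 12 NIHSS elements named in the Methods. The key spellings are
# hypothetical labels chosen for this sketch.
NIHSS_ELEMENTS = (
    "level_of_consciousness", "questions", "commands", "best_gaze",
    "visual", "facial_palsy", "motor", "ataxia", "sensory",
    "best_language", "dysarthria", "extinction_inattention",
)


def retrospective_total(
    abstracted: Mapping[str, Optional[int]],
) -> tuple[int, list[str]]:
    """Sum element scores, coding undocumented elements as normal (0).

    `abstracted` maps an element label to the score assigned from the
    written note; an absent key or a value of None means the element
    was not documented. Returns the total and the list of missing
    elements.
    """
    unknown = set(abstracted) - set(NIHSS_ELEMENTS)
    if unknown:
        raise ValueError(f"unrecognized elements: {sorted(unknown)}")
    missing = [e for e in NIHSS_ELEMENTS if abstracted.get(e) is None]
    # Missing elements contribute 0, i.e. they are treated as normal.
    total = sum(abstracted.get(e) or 0 for e in NIHSS_ELEMENTS)
    return total, missing


# Hypothetical note documenting only three elements.
example_note = {"facial_palsy": 1, "motor": 4, "best_language": 1,
                "dysarthria": None}
example_total, example_missing = retrospective_total(example_note)
```

Returning the list of missing elements alongside the total mirrors the step in which one rater recorded which elements were absent from each H&P.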
Two raters certified in prospective administration of the NIHSS used the written algorithm to independently complete retrospective NIHSS assessments from photocopied admission H&Ps. If needed, H&Ps were edited before retrospective review to remove any mention of the NIHSS. Stroke Team notes, if present, were not used for retrospective scoring. One rater also noted which NIHSS elements were missing from the written H&P.
Linear regression was used to assess interrater reliability and to examine the relationship between retrospective (NIH-R) and prospective (NIH-P) scores. Because NIHSS scores are ordinal and not usually normally distributed, the Wilcoxon signed rank test was used to assess systematic differences in NIH-P and NIH-R. Weighted kappa statistics were calculated to determine the level of agreement of individual items of the NIH-P and NIH-R. Weighted kappa was used because NIHSS items are ordinal, not binary, and this technique adjusts for the amount of agreement expected by chance and the magnitude of individual disagreements. We used the usual quadratic weighting scheme, which bases disagreement weights on the square of the amount of discrepancy, thus making the weighted kappa equivalent to the intraclass correlation coefficient.12
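As an illustration of the quadratic weighting described above, the following Python sketch computes a weighted kappa for one ordinal item scored by two methods. It is a generic textbook formulation, not the software used in this study; the function name `quadratic_weighted_kappa` and the example ratings are hypothetical, and item scores are assumed to be integers from 0 to `n_categories - 1`.

```python
from collections import Counter
from typing import Sequence


def quadratic_weighted_kappa(
    a: Sequence[int], b: Sequence[int], n_categories: int
) -> float:
    """Weighted kappa with disagreement weights (i - j)^2 / (k - 1)^2."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be paired and non-empty")
    if n_categories < 2:
        raise ValueError("need at least two categories")
    n = len(a)
    observed = Counter(zip(a, b))   # joint counts of (score_a, score_b)
    row, col = Counter(a), Counter(b)  # marginal counts for each rating
    max_d2 = (n_categories - 1) ** 2
    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            w = (i - j) ** 2 / max_d2
            obs_disagreement += w * observed[(i, j)] / n
            # Chance-expected proportion from the two marginals.
            exp_disagreement += w * (row[i] / n) * (col[j] / n)
    if exp_disagreement == 0:
        raise ValueError("kappa undefined: no expected disagreement")
    return 1.0 - obs_disagreement / exp_disagreement


# Hypothetical scores for one 3-level item (0, 1, 2) in four patients.
prospective = [0, 1, 2, 2]
retrospective = [0, 1, 2, 1]
kappa = quadratic_weighted_kappa(prospective, retrospective, 3)
```

Because the weights grow with the square of the discrepancy, a 2-point disagreement on an item is penalized four times as heavily as a 1-point disagreement.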
Results
The 32 ischemic stroke patients were representative of our usual stroke population, with mean (SD) age 63 (14) years; 50% were male. Mean NIH-P was 5.4 (median 3, range 0 to 24). Only 1 admitting H&P was complete for all NIHSS elements (Table 1); 9 (29%) were complete except for the “extinction/inattention” element. Other items frequently missing were (n with element absent) “dysarthria” (10), “commands” (5), “ataxia” (4), and “visual” (4).
Interrater reliability between retrospective raters was nearly perfect (Figure 1; r=0.99, r²=0.98, P<0.001). Agreement was also nearly perfect for overall prospective and retrospective NIHSS score (Figure 2; r=0.97, r²=0.94, β=1.04, 95% CI for β 0.94 to 1.14, P<0.001). NIH-R scores were not significantly different from NIH-P scores (mean 5.3 and 5.4, respectively; median 3 and 3, respectively; P=0.55 by Wilcoxon signed rank test), which indicated no systematic bias in retrospective scoring. Weighted kappa for agreement between individual NIH-P and NIH-R items was moderate to high for all items except level of consciousness, ranging from 0.54 for “commands” to 0.94 for “best gaze” (Table 2). In general, agreement was best among motor items and worst among higher cortical function items.
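The regression summaries reported above (slope β, r, and the coefficient of determination) can be reproduced for any set of paired total scores with a short Python sketch. The function name `simple_regression` and the example data are hypothetical, and the choice of which score is treated as the predictor is an assumption of the sketch, since the text does not state it. The reported coefficients of determination are consistent with squaring the reported correlations (0.99 squared is about 0.98; 0.97 squared is about 0.94).

```python
from math import sqrt
from typing import Sequence


def simple_regression(
    x: Sequence[float], y: Sequence[float]
) -> tuple[float, float, float, float]:
    """Ordinary least squares of y on x.

    Returns (slope, intercept, r, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary in both series")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r, r * r


# Hypothetical paired totals (not study data): prospective scores as x,
# retrospective scores as y.
nih_p = [0, 2, 3, 5, 9, 14, 24]
nih_r = [0, 2, 4, 5, 8, 15, 23]
slope, intercept, r, r_squared = simple_regression(nih_p, nih_r)
```

A slope near 1 with an intercept near 0 would indicate that retrospective scores track prospective scores across the whole range rather than only on average, which complements the signed rank test for systematic bias.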
Discussion
Our data show that NIHSS scores can be reliably estimated from written admission H&Ps even when specific elements of the NIHSS are missing from the written record. This supports the assumption that neurological examination items not mentioned in the written record are usually normal.
Kasner and colleagues11 also found that retrospective NIH scores could be reliably generated, with an 86% probability of correctly ranking NIH scores in 5-point interval categories at both admission and discharge. They also found that raters with varying levels of clinical stroke expertise (although all were certified in NIHSS prospective assessment, as in our study) were equally proficient at retrospective scoring. A major difference between our study and that of Kasner et al was that all but 1 of their 39 admission notes had “complete or near-complete” neurological examinations, while only 1 of our 32 notes was complete for all NIHSS items. The most likely reason for this difference is that, as expected, documentation about participants in clinical trials is more complete than documentation for nontrial participants. Our study thus extends the generalizability of retrospective NIHSS scoring, showing that our algorithm reliably estimates the initial NIHSS score in ischemic stroke patients, even in settings with incomplete documentation of the neurological examination.
Our data also provide an assessment of which of the NIHSS items are most accurate in retrospective assessment. We found that, in general, motor items such as limb motor function and gaze (eye movements) had higher agreement than cortical items such as level of consciousness or response to questions. One reason for this difference may be that motor response is usually assessed with a standardized scale, often the Medical Research Council scale, whereas higher cortical functions are recorded in a more individualized manner, thus potentially increasing the variability between observers. This result, however, is opposite to that reported in the retrospective validation of the Canadian Neurological Scale (CNS),13 in which Goldstein and colleagues found a higher level of agreement on speech, level of consciousness, and orientation items compared with motor items. This disparity may be related to the differing requirements for CNS and NIHSS items. For example, NIHSS items for commands and orientation require response to specific questions or commands and thus may be more difficult to accurately score retrospectively, whereas CNS motor items require both proximal and distal ratings for each limb and so may be less accurate when only a single motor score is recorded for each limb.
The importance of adjusting for initial stroke severity when assessing subsequent clinical outcomes cannot be overstated. Numerous studies have found that initial stroke severity is a strong predictor of many important outcomes, including mortality, length of stay, discharge destination, and functional outcomes.3 4 5 6 7 8 Most administrative databases include variables such as mortality, length of stay, and discharge destination as important clinical outcomes but do not include a measure of initial stroke severity. If such databases are used for comparing quality of stroke care between institutions or providers, adjustment must be made for case mix, including stroke severity. Our data provide an algorithm with which to reliably assess initial NIHSS scores for this purpose. Ultimately, our ability to demonstrate what constitutes high-quality stroke care will require us to link various structure and process variables to improved patient-level outcomes; this linkage is most convincingly demonstrated if databases prospectively record and appropriately adjust for differences in initial stroke severity.
Although we demonstrated excellent reliability for retrospective NIHSS scoring and for interrater NIHSS assessments, it is important to note that all raters were certified in prospective NIHSS assessment and thus may be more adept at interpreting written H&Ps in the context of the NIHSS examination. We also performed this study exclusively in teaching hospitals, where one could argue that the level of documentation of the neurological examination might be higher than in nonteaching settings. The effect of a higher proportion of missing examination items, or of raters not trained in prospective NIHSS administration, on the reliability of the retrospective algorithm is not known. We also had relatively few comatose patients, so the performance of the algorithm in these most severely affected stroke patients remains to be determined.
Nonetheless, these data support the use of this retrospective algorithm to reliably score the NIHSS from written history and physical examinations, even when specific NIHSS items are missing from the written record. The use of such an algorithm will enhance our understanding of the influence of initial stroke severity on subsequent outcomes and will improve our ability to draw meaningful conclusions about stroke outcomes from administrative databases.
Dr Williams is supported by a Research Career Development Award, Office of Research and Development, Health Services Research and Development, Department of Veterans Affairs.
Reviews of this manuscript were directed by Prof Marie-Germaine Bousser.
- Received December 3, 1999.
- Revision received January 12, 2000.
- Accepted January 12, 2000.
- Copyright © 2000 by American Heart Association
References
Davenport RJ, Dennis MS, Warlow CP. Effect of correcting outcome data for case mix: an example from stroke medicine. BMJ. 1996;312:1503–1505.
Oxbury JM, Greenhall RCD, Grainger KMR. Predicting the outcome of stroke: acute stage after cerebral infarction. BMJ. 1975;3:125–127.
Muir KW, Weir CJ, Murray GD, Povey C, Lees KR. Comparison of neurological scales and scoring systems for acute stroke prognosis. Stroke. 1996;27:1817–1820.
Samuelsson M, Soderfeldt B, Olsson GB. Functional outcome in patients with lacunar infarction. Stroke. 1996;27:842–846.
DeGraba TJ, Hallenbeck JM, Pettigrew KD, Dutka AJ, Kelly BJ. Progression in acute stroke: value of the initial NIH Stroke Scale score on patient stratification in future trials. Stroke. 1999;30:1208–1212.
Brott T, Adams HP Jr, Olinger CP, Marler JR, Barsan WG, Biller J, Spilker J, Holleran R, Eberle R, Hertzberg V, Rorick M, Moomaw CJ, Walker M. Measurement of acute cerebral infarction: a clinical examination scale. Stroke. 1989;20:864–870.
Kasner SE, Chalela JA, Luciano JM, Cucchiara BL, Raps EC, McGarvey ML, Conroy MB, Localio AR. Reliability and validity of estimating the NIH Stroke Scale score from medical records. Stroke. 1999;30:1534–1537.
Goldstein LB, Chilukuri V. Retrospective assessment of initial stroke severity with the Canadian Neurological Scale. Stroke. 1997;28:1181–1184.