Training as a Prerequisite for Reliable Use of NIH Stroke Scale
To the Editor:
Before new therapies for ischemic stroke are established, their safety and effectiveness must be proven. In particular, the numerous multicenter acute stroke trials currently being performed require a valid, efficient, and reliable measure of patient status and outcome after treatment. Interrater variation in the assessment of neurological deficits could mean that important treatment effects remain concealed, which in turn may mislead therapeutic decisions. A commonly used yardstick for measuring the outcome of neurological deficits in stroke patients is the National Institutes of Health Stroke Scale (NIHSS).1 2 3 The NIHSS can be applied reliably not only by experienced neurologists but also by nonneurologists or even nonphysicians (eg, study nurses),4 5 6 7 provided the raters are well trained and given detailed instructions. In the NINDS study, the investigators were video trained and required to pass an examination.1
Whether the NIHSS provides precise and reliable data when applied without an intensive training program has not yet been addressed, however. We therefore investigated the reliability of the NIHSS as used by trained and untrained raters in 22 stroke patients in the Neurological Department at the University Hospital of Cologne. Diagnosis was confirmed by CT. Eighteen patients had ischemic and 4 hemorrhagic stroke; 3 of the strokes were infratentorial and 19 supratentorial; 13 lesions were left and 6 right hemispheric. Five patients were obtunded or comatose, and 5 had severe aphasia.
Four neurologists in our department independently assessed the patients’ neurological status. Two raters were experienced in using the NIHSS, video trained, and instructed with the material of the NINDS group (available from B.C. Tilley, PhD, Biostatistics and Research Epidemiology, Henry Ford Health Science Center, 1 Ford Place, Suite 3E, Detroit, MI 48202). The other two were inexperienced in the application of the NIHSS and were given no information other than the original NIHSS examination form. The instructions for rating in this form are very short and do not go into detail on how to handle problematic cases, such as aphasic, comatose, or unresponsive patients. To minimize a possible bias from a training or fatigue effect in the patients, untrained and trained raters were assigned in random order. To reduce the impact of fluctuations in the patients’ neurological state, evaluations had to be performed within a narrow time window (<90 minutes) and only in the subacute stage of stroke (>12 hours after symptom onset).5
To assess interobserver reliability, the κ statistic was computed for every individual item using SPSS for Windows 7.0 (SPSS Inc) and BMDP 7.0 Dynamic (BMDP Statistical Software, Inc). The degree of interrater agreement based on κ is considered excellent if κ>0.80, substantial if κ is between 0.61 and 0.80, moderate if between 0.41 and 0.60, fair if between 0.21 and 0.40, and slight or poor if κ≤0.20.8 Additionally, we compared the total scores because they are generally used for assessment of outcome in clinical studies. Strictly speaking, however, this is not valid, because the NIHSS is a collection of ordinal rather than numerical ratings.7
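For readers unfamiliar with the statistic, the unweighted κ used here can be sketched as follows. This is an illustrative reimplementation, not the SPSS/BMDP analysis described above; the function names and example ratings are hypothetical, and the agreement bands mirror those quoted in the text.

```python
# Illustrative sketch of unweighted Cohen's kappa for one NIHSS item
# scored by two raters over the same patients. Hypothetical helper names;
# the letter's actual analysis used SPSS 7.0 and BMDP 7.0 Dynamic.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa: chance-corrected agreement of two raters."""
    assert len(ratings_a) == len(ratings_b) and len(ratings_a) > 0
    n = len(ratings_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement from the product of marginal proportions.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    if expected == 1.0:  # both raters used one identical category throughout
        return 1.0
    return (observed - expected) / (1 - expected)

def interpret(kappa):
    """Agreement bands as quoted in the letter (Landis-Koch style)."""
    if kappa > 0.80:
        return "excellent"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    return "slight or poor"

# Hypothetical item scores for 6 patients from two raters:
k = cohens_kappa([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 0])
print(round(k, 2), interpret(k))  # → 0.75 substantial
```

The chance correction is what distinguishes κ from raw percent agreement: two raters who score most patients 0 on an item will agree often by chance alone, and κ discounts exactly that.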
Among trained raters, the mean κ was 0.61 (SD=0.17), indicating substantial interrater reliability. In the untrained group, κ was 0.33 (SD=0.22), indicating only fair interrater reliability. Between trained and untrained raters, the unweighted κ was 0.45 (SD=0.2), indicating moderate agreement. Thus, the reliability achieved among untrained raters, and between trained and untrained raters, was substantially poorer than that achieved among trained raters.
Furthermore, the reliability of individual items differed substantially between trained and untrained raters. Among trained raters, only 2 items showed merely fair agreement, limb ataxia (κ=0.34) and neglect (κ=0.32); no item showed poor agreement. Among the untrained raters, the items ataxia (κ=−0.03), gaze (κ=0.06), visual fields (κ=−0.02), and dysarthria (κ=0.18) were poorly reliable, and 6 further items were only fairly reliable. The mean total score for the trained raters was 11.81 (SD=10.02; range, 2 to 40); the maximum difference between their scores was 3 points in 5 patients, 2 points in 4 patients, and 1 point in 7 patients. Identical scores were reached in 5 patients (1 missing). Among untrained raters, the maximum difference was 10 points, and a difference of ≥4 points was found in 4 patients. Between trained and untrained raters, the difference in total scores reached ≥4 points in 12 patients.
The results of the present study suggest that good interrater reliability of the NIHSS3 4 5 6 7 depends on adequate training of the raters. Interobserver reliability among trained raters was substantial, with a mean of κ=0.61, which is comparable to prior study results (κ between 0.51 and 0.69).3 4 5 6 7 By contrast, only fair reliability was achieved among untrained raters (κ=0.33).
The discrepancies in interobserver reliability among untrained observers and between untrained and trained observers (κ=0.45) are alarming. They may influence not only study results but also therapeutic decisions. Especially in a large international multicenter trial with numerous raters from different countries, high reliability of the assessment score as a primary outcome measure is crucial. Our study demonstrates that even within a single department, only fair reliability can be reached without an adequate training program.
Apart from the difficulties in the assessment of some individual items reported in former studies,3 4 5 6 7 one of the major problems the untrained raters faced was the lack of instruction needed for the assessment of the 10 comatose, obtunded, unresponsive, and aphasic patients. In 8 of these patients, there were substantial differences (≥4 points) in total scores between trained and untrained raters, because the assessment of several individual items, such as gaze, visual fields, and dysarthria, is especially problematic in these patients if no detailed instruction is provided.
In conclusion, the NIHSS cannot be applied reliably without a systematic training program and knowledge of its detailed instructions. Therefore, standardized use of the NIHSS is mandatory.
Copyright © 1998 by American Heart Association
Albanese MA, Clarke WR, Adams HP, Woolson RF, and TOAST Investigators. Ensuring reliability of outcome measures in multicenter clinical trials of treatments for acute ischemic stroke. Stroke. 1994;25:1746–1751.
Brott T, Adams HP, Olinger CP, Marler JR, Barsan WG, Biller J, Spilker J, Holleran R, Eberle R, Hertzberg V, Rorick M, Moomaw CJ, Walker M. Measurements of acute cerebral infarction: a clinical examination scale. Stroke. 1989;20:864–870.
Goldstein LB, Samsa GP. Reliability of the National Institutes of Health Stroke Scale: extension to non-neurologists in the context of a clinical trial. Stroke. 1997;28:307–310.
Lyden P, Brott T, Tilley B, Welch KMA, Mascha EJ, Levine S, Haley EC, Grotta J, Marler J, and the NINDS TPA Stroke Study Group. Improved reliability of the NIH Stroke Scale using video training. Stroke. 1994;25:2220–2226.