Reliability of the National Institutes of Health Stroke Scale
Extension to Non-Neurologists in the Context of a Clinical Trial
Background and Purpose The reliability of the National Institutes of Health Stroke Scale (NIHSS) has been established through testing its use in live and videotaped patients. This reliability testing has primarily focused on the use of the scale by neurologists. We sought to determine the reliability of the NIHSS as used by non-neurologists in the context of a clinical trial.
Methods In anticipation of the initiation of a randomized trial of a new therapy for patients with acute ischemic stroke, 30 physician investigators (30% of whom were not neurologists) and 29 non-physician study coordinators were trained in the use of the NIHSS at an informational and training conference using standardized videotaped patient examinations. A series of 4 patients was rated initially. After 3 months, the same 4 patients were rerated, providing a measure of intraobserver reliability. An additional series of 4 new patients was also rated after 3 months and, with the initial 4 ratings, provided data for assessment of interobserver reliability.
Results Overall, 28% of the raters had previous experience with the NIHSS, and 22% had previously used the videotapes as used in the present trial. The coefficients of determination (r²) were each greater than .95 when the means of the two ratings of the same 4 cases were compared between (1) neurologists and other types of physicians, (2) physicians and study coordinators, (3) raters who had prior experience with the NIHSS and those without prior experience, and (4) raters who had used the videotapes in the past and those who had never viewed the tapes. The calculated r² values were greater than .98 for the initial rating of the first 4 cases and for the later rating of the 4 new cases. The slopes of the regression lines were all near 1, indicating that the raters were similarly calibrated. The intraclass correlation coefficients were .93 and .95, reflecting high levels of intraobserver and interobserver reliability.
Conclusions These data extend the previously demonstrated reliability of the NIHSS to non-neurologists and show that both a variety of physician investigators and nurse study coordinators can be rapidly trained to reliably apply the scale in the context of an actual clinical trial.
Ensuring the reliability of outcome measures is essential to the conduct of clinical stroke trials. Variability in the measurement of stroke-related neurological deficits could both obscure the demonstration of potentially meaningful treatment effects and necessitate sample sizes larger than those required on the basis of the inherent variability in stroke impairments and recovery. Several reliable and at least partially validated stroke impairment scales have been developed over the last decade.1 The National Institutes of Health Stroke Scale (NIHSS) is a graded neurological examination rating speech and language, cognition, visual field deficits, motor and sensory impairments, and ataxia and has become a standard part of the clinical assessments used in many recent interventional trials. Considerable effort has been expended to devise methods of guaranteeing the reliability of this instrument as used in multicenter trials.2 3 The efficacy of these methods, including the use of standardized instructional and certification videotaped examinations of stroke patients, has been demonstrated for trial investigators, most of whom were neurologists.3 4 However, because of the large number of ongoing and planned acute stroke trials, physicians other than neurologists are now acting as investigators. Additionally, to ensure adequate blinding, efficacy trials often require that outcome assessments be performed by a set of observers other than those involved in the active treatment of the patient. Because of this requirement, nonphysician study coordinators and non-neurologists are becoming responsible for obtaining neurological assessments. Therefore, establishing the reliability of the assessment scale and the efficacy of training procedures for a broader range of observers in the setting of a clinical trial is critical. 
As part of a planned interventional acute stroke trial, we tested the reliability of the NIHSS after a brief training session in a diverse group of physician investigators and study coordinators, many of whom had no prior experience with the scale.
Subjects and Methods
Subjects were 30 physician investigators and 29 study coordinators representing 30 institutions who attended an informational and training conference in anticipation of the planned initiation of a randomized trial testing the efficacy of a neuroprotective agent for patients with acute carotid-distribution ischemic stroke. As part of this conference, investigators and coordinators participated in a training session in the use of the NIHSS (Table). Participants were given written instructions in the use of the rating scale with an emphasis on pitfalls potentially affecting reliability.5 6 These instructions were then reviewed in a didactic session. In particular, specific situations in which individual items could not be rated were stressed (eg, amputation or joint fusion would not permit rating of motor function or ataxia of the affected limb; dysarthria could not be rated in intubated patients), and the need to score every item was emphasized. Only the first response or attempt was scored for responses to the level of consciousness questions (item 1A), level of consciousness commands (item 1C), and distal motor function (item 12). Raters were instructed to give a score of 2 for responses to the level of consciousness questions (item 1B) in aphasic or stuporous patients and a score of 1 for patients unable to speak for reasons unrelated to aphasia. Detailed instructions for the rating of facial paresis were reviewed (item 4). Ataxia (item 7) was scored only if present and out of proportion to weakness. Sensory disturbances (item 8) were also to be scored only if present, except for comatose patients, who were given a score of 2. For language function (item 9), a score of 3 was to be given to comatose patients.
After this instructional session, the participants were asked to independently rate a series of 4 videotaped stroke patients with a wide range of impairments (Henry Ford Hospital and Health Sciences Center NIHSS Training Tape Cases 1 through 4)3 using the NIHSS to provide an initial measure of interobserver reliability. Data were also collected concerning physician specialty, study coordinator training, whether the rater had used the NIHSS in the past, and whether the rater had used the NIHSS training and certification videotapes in the past. This session was followed by a discussion period during which any uncertainties in the use of the NIHSS were addressed. After 3 months, the investigators and coordinators were asked to rerate the 4 videotaped stroke patients that they had scored at the earlier training session without reference to their prior responses to provide a measure of intraobserver reliability. In addition, they were asked to rate 4 new cases (Henry Ford Hospital and Health Sciences Center NIHSS Certification Tape 1, Cases 1 through 4) to both provide another measure of interobserver reliability and determine whether reliability had diminished since the initial training.
Mean scores were first calculated for each case on the basis of the various groupings. The degree to which the scores were calibrated between groups of raters was then determined with linear regression analyses. Overall levels of agreement were assessed with intraclass correlation coefficients (ICC).7 After ANOVA indicated that the variance components for physician specialty, role (physician versus study coordinator), previous NIHSS experience, previous NIHSS videotape experience, and time were not statistically distinguishable from 0, the ICCs were calculated as ICC=σs²/(σs²+σe²), where σs² is the variance component for subject and σe² is the variance component for the residual error. The ICC will be high if raters are similarly calibrated (ie, the slopes of the regression lines are near 1) and the variation between subjects is large relative to the variation between raters. A calculated ICC=1 reflects perfect reliability. Frequencies were compared with χ² statistics.
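The variance-components calculation described above can be illustrated numerically. The sketch below (with hypothetical rating data, not the study's) computes a one-way random-effects ICC, which reduces to σs²/(σs²+σe²) under the simplifying assumption that, as in the analysis above, all rater-related variance components are pooled into the residual error:

```python
import numpy as np

def icc_oneway(ratings):
    """One-way random-effects ICC = sigma_s^2 / (sigma_s^2 + sigma_e^2).

    ratings: 2-D array-like, rows = subjects (cases), columns = raters.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    subject_means = ratings.mean(axis=1)
    # Between-subject and within-subject mean squares from one-way ANOVA
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - subject_means[:, None]) ** 2) / (n * (k - 1))
    sigma_s2 = (ms_between - ms_within) / k  # subject variance component
    sigma_e2 = ms_within                     # residual error variance
    return sigma_s2 / (sigma_s2 + sigma_e2)

# Hypothetical NIHSS totals: 4 videotaped cases, each scored by 3 raters
scores = [[2, 3, 2],
          [9, 10, 9],
          [17, 16, 18],
          [24, 23, 24]]
print(icc_oneway(scores))  # close to 1 when raters agree closely
```

As in the study's data, the ICC is high here because the between-case variation (mild through severe strokes) dwarfs the between-rater variation.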
Results
The physician-investigator group consisted of 21 neurologists, 7 intensive care or emergency physicians, 1 physiatrist, and 1 internist (70% neurologists and 30% non-neurologists). All except 1 study coordinator were registered nurses or certified nurse practitioners. Overall, 28% of the raters had previous experience with the NIHSS, and 22% had previously used the training and certification videotapes used in the present trial. There were no significant differences in the proportions of physicians and study coordinators who had previously used the NIHSS (30% of physicians versus 28% of study coordinators; χ²=0.041, P=.84) or the videotapes (23% of physicians versus 21% of study coordinators; χ²=0.060, P=.81).
Fig 1 gives plots of the mean scores for neurologists compared with other types of physicians, for physicians compared with study coordinators, for raters who had prior experience with the NIHSS compared with those without prior experience, and for raters who had used the videotapes in the past compared with those who had never viewed the videotapes for both the 4 cases rated during the initial training session and for the second group of 4 new cases viewed 3 months later. The mean scores reflect patients with both mild and severe impairments. The coefficients of determination (r²) are greater than .98 for each comparison, indicating that the scores given by each group account for more than 98% of the variance in the scores given by the respective comparison group. The slopes of the regression lines are all near 1, indicating that the raters were similarly calibrated in their scorings (ie, a lack of systematic differences in the use of the scale). Fig 2 gives plots of the mean scores for the same groups for the 4 cases rated during the initial training session and the repeated scores of the same 4 cases 3 months later. The calculated r² values are all greater than .95, and the slopes of the regression lines are all near 1.
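The calibration check described here, regressing one rater group's mean scores on another's and inspecting the slope and r², can be sketched as follows (the mean scores below are hypothetical illustrations, not the study's data):

```python
import numpy as np

# Hypothetical mean NIHSS scores for 8 videotaped cases, one value per
# rater group (eg, neurologists versus other physicians); illustrative only.
group_a = np.array([2.1, 5.8, 9.4, 13.0, 16.7, 20.2, 23.9, 27.1])
group_b = np.array([2.3, 5.6, 9.7, 12.8, 16.9, 20.0, 24.2, 26.8])

# A slope near 1 indicates the two groups are similarly calibrated
slope, intercept = np.polyfit(group_a, group_b, 1)

# Coefficient of determination: fraction of variance in one group's
# scores accounted for by the other group's scores
r2 = np.corrcoef(group_a, group_b)[0, 1] ** 2

print(f"slope={slope:.3f}, r^2={r2:.4f}")
```

Note that a high r² alone would not establish calibration; a group that systematically scored twice as high would still correlate perfectly, which is why the slope is examined as well.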
The ICC was .94 for the 4 cases rated at the initial training session and .92 for the 4 new cases rated 3 months later. The overall ICC based on the ratings of these 8 cases was .95, reflecting a high level of interobserver reliability. The ICC was .93 for the cases rated during the initial training session and rerated after 3 months had elapsed, indicating a high level of intraobserver reliability.
Discussion
These data extend the previously demonstrated reliability of the NIHSS to non-neurologists and show that both a variety of physician investigators and nurse study coordinators can be rapidly trained to reliably apply the scale in the context of an actual clinical trial. The reliability of the assessments did not vary with physician specialty, and the reliability was not influenced by whether the rater was a physician or study coordinator or whether the rater had prior experience with the NIHSS or with the training or certification videotapes. These latter findings suggest that the procedures used in the present study to train the raters were effective.
The reliabilities of the individual items that compose the NIHSS have been studied extensively, both with the videotapes used in the present study and with live patients.3 4 5 6 The items rating facial paresis and limb ataxia were consistently found to be the least reliable. The scoring of these items and other items that have proved difficult in certain types of patients was stressed to the raters. Although there was some decrease in reliability between the initial scoring of the training-videotape patients (ICC=.94) and the scoring of the certification-videotape patients 3 months later (ICC=.92), the difference was small and not statistically significant. However, this decrease in reliability could have increased in magnitude with time, the so-called “drift effect.”4 Because the present trial was not initiated, we do not have further longitudinal data to address this issue. Reliability assessments and periodic recertification during the course of a long clinical trial may be necessary.
Acknowledgments
The Dextrorphan in Acute Stroke Trial was organized and supported by Hoffmann-La Roche Inc. The videotaped examinations of stroke patients were produced at the Henry Ford Hospital and Health Sciences Center and used with permission. The authors wish to thank Dr Lynna M. Lesko for ensuring that the investigators and study coordinators completed the certification ratings even after the trial was canceled and Jannie E. Del Vecchio for providing the raw data that permitted these analyses.
- Received June 27, 1996.
- Revision received August 23, 1996.
- Accepted August 23, 1996.
- Copyright © 1997 by American Heart Association
References
Lyden P, Brott T, Tilley B, Welch KMA, Mascha EJ, Levine S, Haley EC, Grotta J, Marler J, for the NINDS TPA Stroke Study Group. Improved reliability of the NIH Stroke Scale using video training. Stroke. 1994;25:2220-2226.
Albanese MA, Clarke WR, Adams HP Jr, Woolson RF, for the TOAST Investigators. Ensuring reliability of outcome measures in multicenter clinical trials of treatments for acute ischemic stroke: the program developed for the Trial of ORG 10172 in Acute Stroke Treatment (TOAST). Stroke. 1994;25:1746-1751.
Brott T, Adams HP Jr, Olinger CP, Marler JR, Barsan WG, Biller J, Spilker J, Holleran R, Eberle R, Hertzberg V, Rorick M, Moomaw CJ, Walker M. Measurements of acute cerebral infarction: a clinical examination scale. Stroke. 1989;20:864-870.
Fleiss JL. The Design and Analysis of Clinical Experiments. New York, NY: John Wiley & Sons Inc; 1986:1-31.