Modified National Institutes of Health Stroke Scale for Use in Stroke Clinical Trials
Prospective Reliability and Validity
Background and Purpose— The National Institutes of Health Stroke Scale (NIHSS) has been criticized for its complexity and variability. Prior formal clinimetric analyses were used to obtain a modified version of the NIHSS (mNIHSS), which retrospectively demonstrated improved reliability and validity. We sought to prospectively measure the reliability and validity of the mNIHSS in an independently collected cohort.
Methods— Forty-five patients with a history of stroke or intracerebral hemorrhage were evaluated at the University of California, San Diego, Stroke Center from September 2000 through March 2001. Each patient was tested by 2 NIHSS-certified neurologists using the NIHSS, mNIHSS, Barthel Index, and Modified Rankin scales.
Results— There was a large percentage of high κ values with the mNIHSS. Only 10 (66.67%) of 15 NIHSS κ scores showed excellent agreement, whereas 10 (90.91%) of 11 mNIHSS κ scores showed excellent agreement. As predicted, the mNIHSS was more reliable than the NIHSS because of the exclusion of items with low κ values. With the use of correlation coefficient analysis, the mNIHSS was as valid as the NIHSS.
Conclusions— This prospective study found high reliability and continued validity by using a previously developed mNIHSS. Items found to have low κ values were consistent with the previously derived retrospective mNIHSS. The resulting mNIHSS scale has much higher κ values. The mNIHSS showed improved agreement between examiners and was also easier to administer, having fewer and simpler items. Further prospective evaluation should assess whether the mNIHSS could be used in lieu of the NIHSS.
The National Institutes of Health Stroke Scale (NIHSS) is a graded neurological examination that assesses speech, language cognition, inattention, visual field abnormalities, motor and sensory impairments, and ataxia. The scale was developed for use in acute-stroke therapy trials1,2 and has since been widely used as a standard part of the assessment in clinical trials. This scale, along with many others, has been evaluated for its clinical usefulness in the assessment of the stroke patient.3,4
The ideal stroke scale should be valid, reliable, and easy to administer in multiple settings by a broad range of health care practitioners.5 The NIHSS satisfies many, but not all, of the criteria for an ideal stroke scale. The NIHSS is not time-consuming to administer, taking <8 minutes to perform.1 Overall interrater reliability has been shown in multicenter stroke trials.1,2,6,7 NIHSS reliability has been extended to nonneurologist physicians and nonphysician study coordinators in clinical trials,8 as well as in community-based studies.9 This reliability improves with personal and videotape training.6,10 Factor analysis demonstrated content validity of the NIHSS.11 Regarding outcomes, the NIHSS has very good sensitivity, specificity, and accuracy in predicting clinical results at 3 months.12,13 However, this scale does contain specific items with poor reliability and redundancy9,14 and has been criticized for its complexity and resulting variability.
An ideal stroke scale, one that is reliable, valid, and predictive of patient outcome, would be frequently used in the evaluation of stroke patients. Although the NIHSS is useful in clinical research trials, it may not be routinely used in other situations. When emergency department acute-stroke study data were evaluated, a CT scan was performed in almost all cases, and a neurology consult was obtained in the majority of cases, but only 1.2% of patients with acute stroke had evidence of an NIHSS being performed.15 Although the NIHSS is well suited for acute and general use, it still falls short of the ideal stroke scale requirements.
Formal clinimetric analyses were used to improve the NIHSS, resulting in a modified version of the NIHSS (mNIHSS).14 With fewer items and simpler grading scales, the mNIHSS was intended to be easier to administer. The reliability has been demonstrated with certification data used previously for reliability testing of the NIHSS.14 Items with poor κ values were reduced from 20% to 14% with use of the mNIHSS. Validity was tested with National Institute of Neurological Disorders and Stroke outcome result data, which has also been used previously to investigate the NIHSS14: The mNIHSS demonstrated improved reliability and validity by use of factor and coefficient analysis. Power was greater with the mNIHSS, resulting in the potential for smaller sample sizes in clinical trials. The resulting simplified mNIHSS was felt to be a scale that is reliable, valid, and easy to administer in the clinical research setting. However, the mNIHSS has not been prospectively evaluated to confirm these findings. The present study prospectively evaluated the reliability and validity of the mNIHSS.
Subjects and Methods
The Modified NIHSS
The Figure displays the mNIHSS scoring sheet.14 Four questions were removed from the NIHSS to create the final mNIHSS. The first item, level of consciousness (LOC), was dropped because factor analysis demonstrated it to be redundant. The remaining “consciousness” items had higher κ values and were retained. The ataxia, facial weakness, and dysarthria items exhibited poor reliability, with facial weakness and dysarthria items also being redundant. Therefore, these items were deleted. The sensory question was simplified to 2 choices because of poor reliability of the third choice item. The maximum possible score with the use of this simplified scale is 31, compared with 42 for the original scale. All examiners in the present study were certified in use of the NIHSS with the use of previously published methods.10
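The drop from 42 to 31 maximum points can be checked item by item. The sketch below assumes the per-item maxima of the standard 15-item NIHSS scoring sheet (the text above states only the totals), and the dictionary keys are illustrative names, not identifiers from the study.

```python
# Maximum points per item on the standard 15-item NIHSS.
# Assumption: these per-item maxima come from the published NIHSS scoring
# sheet; the surrounding text reports only the totals (42 and 31).
NIHSS_MAX = {
    "loc": 3,
    "loc_questions": 2,
    "loc_commands": 2,
    "gaze": 2,
    "visual_fields": 3,
    "facial_palsy": 3,
    "left_arm_motor": 4,
    "right_arm_motor": 4,
    "left_leg_motor": 4,
    "right_leg_motor": 4,
    "ataxia": 2,
    "sensory": 2,
    "aphasia": 3,
    "dysarthria": 2,
    "neglect": 2,
}

# Items deleted to form the mNIHSS, as described in the text.
REMOVED_ITEMS = {"loc", "facial_palsy", "ataxia", "dysarthria"}


def build_mnihss_max(nihss_max):
    """Return per-item maxima for the mNIHSS derived from the NIHSS."""
    mnihss = {k: v for k, v in nihss_max.items() if k not in REMOVED_ITEMS}
    # The sensory item was simplified to 2 choices, so its maximum is 1.
    mnihss["sensory"] = 1
    return mnihss


MNIHSS_MAX = build_mnihss_max(NIHSS_MAX)

print(len(NIHSS_MAX), sum(NIHSS_MAX.values()))    # 15 items, 42 points
print(len(MNIHSS_MAX), sum(MNIHSS_MAX.values()))  # 11 items, 31 points
```

Removing the four items accounts for 10 points and collapsing the sensory item accounts for 1 more, which matches the 11-point difference between the two scales.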
Patients and Procedures
Patients seen at the University of California, San Diego, Stroke Center from September 2000 through March 2001 with a diagnosis of cerebrovascular accident or intracerebral hemorrhage were included in the present study. All patients exhibited deficits lasting >24 hours. Hospitalized and clinic patients were included in the present study to test the reliability of stroke scales in a wide range of clinical settings. Patients with nonvascular structural lesions or seizures accounting for symptoms were excluded. A series of 45 patients was evaluated by a pool of 4 certified NIHSS examiners; each patient was tested simultaneously by 2 of the 4 total examiners with the use of the NIHSS and mNIHSS. Scoring was performed in a blinded fashion, with neither examiner knowing the other examiner’s scoring results. The Barthel Index (BI)16 and Modified Rankin scale (MRS)17 were scored only for outpatients (n=27) to allow for comparison with assessments of activities of daily living and functional impairment, because it was not possible to determine accurate functional scales on patients with new neurological deficits in the acute hospital setting. The BI and MRS were obtained at the same time as the NIHSS and mNIHSS. Patient demographic data and risk factor information were also obtained.
Nonparametric methods were used, because the evaluated stroke scale scores are ordinal level data.5,18 Interobserver reliability was rated with weighted κ statistics for each item of the NIHSS and mNIHSS. The κ statistic measures agreement among observers over and above that expected by chance alone.19 In the present study, weighted κ values were used because this method assigns weights to disagreements based on the magnitude of disagreement and also accounts for agreements based on chance alone.18 This method is generally used for ordinal scales, such as the ones being evaluated here, in which changes in numerical values are not consistent across all questions. The weighted κ value is qualified as follows: κ<0.40 defines poor agreement, κ between 0.40 and 0.75 defines moderate agreement, and κ>0.75 defines excellent agreement.18 Reliability was tested by using data from all 4 potential observers. Specifically, interrater reliability was assessed by comparing stroke scale scores between 2 raters at a time. At times, however, each of the 4 raters was compared with each of the others for interrater reliability assessments. Each examiner was blinded to the other rater’s scoring.
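As an illustration of the weighted κ calculation described above, the following is a minimal sketch for 2 raters scoring a single ordinal item. The text does not state whether linear or quadratic weights were applied, so the linear weighting here is an assumption, and the function names and example scores are hypothetical rather than taken from the study.

```python
def weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's weighted kappa for two raters on one ordinal item.

    Scores must be integers in 0..n_categories-1. Linear disagreement
    weights |i - j| / (k - 1) are an assumption of this sketch.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of patients")
    if n_categories < 2:
        raise ValueError("need at least 2 score categories")
    n = len(rater_a)
    k = n_categories

    # Observed joint proportions of (rater A score, rater B score).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted disagreement actually observed and expected by chance.
    d_obs = sum(abs(i - j) / (k - 1) * observed[i][j]
                for i in range(k) for j in range(k))
    d_exp = sum(abs(i - j) / (k - 1) * row[i] * col[j]
                for i in range(k) for j in range(k))
    if d_exp == 0.0:
        # Both raters used one identical category throughout: kappa undefined.
        return float("nan")
    return 1.0 - d_obs / d_exp


def agreement_label(kappa):
    """Apply the cutoffs given in the text (0.40 and 0.75)."""
    if kappa < 0.40:
        return "poor"
    if kappa <= 0.75:
        return "moderate"
    return "excellent"


# Hypothetical scores for one 5-level item (0 to 4) in 8 patients.
examiner_1 = [0, 1, 2, 2, 3, 4, 0, 1]
examiner_2 = [0, 1, 2, 3, 3, 4, 0, 2]
kappa = weighted_kappa(examiner_1, examiner_2, n_categories=5)
print(round(kappa, 3), agreement_label(kappa))
```

Because the number of score levels differs between items, `n_categories` is passed per item rather than fixed for the whole scale.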
Validity was examined with the Spearman rank coefficients to compare the stroke scales. In settings such as those in the present study, in which an independent gold-standard method of evaluation is not available, criterion validity cannot be assessed. Predictive/outcome validity was evaluated by comparing the NIHSS and new mNIHSS with functional outcome and disability scales. Construct validity was assessed by comparing the accepted NIHSS with the newly proposed stroke scale (mNIHSS) directly.20 Validity measurements were performed on only those patients evaluated as outpatients (n=27). In these patients, it was possible to obtain accurate functional outcome scale measurements (BI and MRS) to assess validity.
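For readers less familiar with rank correlation, the sketch below shows one common way to compute a Spearman coefficient: convert each set of scores to ranks (averaging the ranks of tied values) and take the Pearson correlation of the ranks. The function names and the example totals are hypothetical, and the study's own statistical software is not specified in the text.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples of at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical total scores for 6 patients (not study data).
nihss_totals = [0, 2, 5, 5, 11, 24]
mnihss_totals = [0, 1, 3, 4, 8, 20]
print(round(spearman_rho(nihss_totals, mnihss_totals), 3))
```

Working on ranks rather than raw totals is what makes the coefficient appropriate for ordinal scale scores.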
Table 1 presents the baseline characteristics of the 45 patients. In the present study, 23 (51%) patients were white, 7 (16%) patients were African American, and 8 (18%) patients identified themselves as Hispanic. There were 30 (67%) male patients with the mean age of 65 years (range 42 to 88 years). Ischemic stroke represented 42 (93%) and intracerebral hemorrhage represented 3 (7%) of the total patients. There were no patients with transient ischemic attack (TIA) evaluated. There were 18 (40%) acute inpatient evaluations and 27 (60%) outpatient clinic evaluations. Time since symptom onset ranged from 1 to 13 days (mean 4.72 days) for the inpatients and 0 to 12,410 days (mean 531.6 days) for the outpatients. One patient was seen 34 years after the onset of symptoms. Excluding this patient from analysis, the range was 0 to 478 days (mean 74.77 days).
Other demographic data obtained are summarized in Table 1. Forty-four percent of the patients had a history of prior stroke or TIA. Seventy-three percent of the patients had hypertension, 42% had coronary artery disease, 47% had diabetes, 49% had hyperlipidemia, and 18% had atrial fibrillation. A family history of stroke or TIA was present in 22% of the patients. Tobacco use was acknowledged in 18%, and current alcohol use was found in 22%. Forty-seven percent of the patients had only mild neurological deficits at the time of examination.
The NIHSS scores obtained ranged from 0 to 24 (median 5). The mNIHSS scores ranged from 0 to 20 (median 3). Total NIHSS scores did not differ between examiners by >4 points, whereas total mNIHSS scores did not differ between examiners by >2 points.
The NIHSS and mNIHSS were prospectively evaluated for reliability on the basis of weighted κ scores. The specific κ values for each item on the NIHSS and mNIHSS are shown (Table 2). Regarding the original NIHSS, 10 (66.67%) items (LOC questions, LOC commands, visual fields, left arm motor, right arm motor, left leg motor, right leg motor, sensory, aphasia, and neglect) displayed excellent agreement beyond chance alone. Four (26.67%) NIHSS items (LOC, gaze palsy, facial palsy, and ataxia) displayed good agreement beyond chance alone. One (6.67%) NIHSS item (dysarthria) displayed poor to no agreement beyond chance alone. The mNIHSS was also evaluated. Ten (90.91%) mNIHSS items (LOC questions, LOC commands, visual fields, left arm motor, right arm motor, left leg motor, right leg motor, sensory, aphasia, and neglect) displayed excellent agreement beyond chance alone. Only 1 mNIHSS item (gaze palsy) displayed good agreement beyond chance alone. No mNIHSS items displayed poor to no agreement beyond chance alone (Table 3).
The relatively good reliability of the standard NIHSS was again shown prospectively with 10 (66.67%) of 15 of the items having an excellent κ value (>0.75). With use of the mNIHSS, the reliability between examiners was highly apparent with a much larger percentage, 10 (90.91%) of 11 of the high/excellent agreement of κ values. The remaining 1 item on the mNIHSS (gaze palsy) still maintained good agreement of κ values (as in the NIHSS).
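The percentages quoted above follow directly from the item counts. A minimal sketch of the arithmetic, using the counts reported in the text (the variable and function names are illustrative):

```python
def share_percent(count, total):
    """Percentage of items in an agreement category, to 2 decimals."""
    return round(100.0 * count / total, 2)


# Item counts per agreement category as reported in the text.
NIHSS_TALLY = {"excellent": 10, "good": 4, "poor": 1}   # 15 items
MNIHSS_TALLY = {"excellent": 10, "good": 1, "poor": 0}  # 11 items

for name, tally in (("NIHSS", NIHSS_TALLY), ("mNIHSS", MNIHSS_TALLY)):
    total = sum(tally.values())
    shares = {label: share_percent(n, total) for label, n in tally.items()}
    print(name, total, shares)
```

The count of items with excellent agreement is the same on both scales (10); the percentage rises from 66.67% to 90.91% because the denominator falls from 15 to 11 items.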
The NIHSS has been shown to be a valid clinical deficit scale.1 In retrospective analyses,14 the mNIHSS showed high correlation with other scales, similar to the NIHSS, and was a valid predictor of outcome. In prospective evaluation, we also found construct validity, in that the mNIHSS performed similarly to the NIHSS. The Spearman correlation coefficient between NIHSS and mNIHSS (for both examiners) was high (0.947, 0.941), with an overall value of 0.944. As a measure of concurrent validity, the NIHSS and mNIHSS were compared with functional outcome measures (BI and MRS). The coefficients for examiner 1 for NIHSS versus BI and MRS were −0.166 and −0.169, respectively. The coefficients for mNIHSS versus BI and MRS were −0.230 and 0.281, respectively. The absolute Spearman values were improved with the use of the mNIHSS, although values were not statistically significant. Examiner 2 and combined examiners revealed similar trends. Spearman coefficients comparing NIHSS and mNIHSS with BI and MRS are shown in Table 4.
The mNIHSS is a revised stroke scale primarily developed for clinical research and is simpler and easier to administer than the previous NIHSS. The main prerequisites to the ideal stroke-scoring system are reliability, consistency, and validity.5,20,21 Many stroke scales have been developed in the past, but few have been thoroughly tested for interobserver reliability and validity.3,20,21 A few well-known scales have shown high interobserver reliability (International Classification of Diseases score of 10), high reliability across items (NIHSS, the Canadian Neurological Scale, and the European Stroke Scale), and highly reliable measures of disability (BI).21
The present study prospectively evaluated a retrospectively constructed mNIHSS; we found much improved reliability and consistent validity when it was compared with the original scale. Prior studies have retrospectively proved the reliability, validity, and responsiveness of the mNIHSS.14 The items removed from the scale were LOC, facial weakness, ataxia, and dysarthria. The sensory item was collapsed to 2 responses. Items were removed because of poor reliability or redundancy in prior clinimetric studies. The resulting scale continued to have validity, and improved reliability was noted. However, the mNIHSS had not previously been prospectively studied with the use of an independently collected cohort of patients.
In the present study, the NIHSS and mNIHSS were tested prospectively. As in the retrospective evaluation,14 the same unreliable items (LOC, facial palsy, limb ataxia, and dysarthria) were again confirmed (Table 2). These were the same items that were removed from the final version of the mNIHSS.14 In the present study, the gaze-palsy item also had only good reliability.
Reliability may be tested with κ statistics.21–23 In a previous retrospective study, reliability data were collected on the 11 remaining items from the original NIHSS scoring sheet (15 total items). Theoretically, this may have allowed for inadvertent biasing of the scores. To exclude this possibility, the 11-item scale was tested prospectively. Previous studies have used a difference of ≥4 points on the NIHSS to reflect a clinically significant change beyond interrater variability.24,25 Total NIHSS scores did not differ between examiners by >4 points, whereas total mNIHSS scores did not differ by >2 points. The mNIHSS score was more consistent and reproducible.
The reliability of the mNIHSS compares favorably with that of the NIHSS reported by Brott et al1 and Goldstein et al,2 with 10 (90.91%) of 11 mNIHSS items having κ values in the excellent range compared with only 10 (66.67%) of 15 items on the NIHSS. As predicted, the mNIHSS is made more reliable by excluding NIHSS items with low κ values and is, therefore, an improved scale.
Construct validity is measured by using correlation coefficients to test a newly developed scale against a previously used scale.20 In the present study, the mNIHSS was found to be a valid predictor of the original NIHSS. The correlation coefficients between the total NIHSS score and the total mNIHSS score were excellent (Table 4). When prospectively compared with the original scale, the nearly identical correlation coefficients confirm that the mNIHSS behaves similarly to the original.
As a measure of concurrent validity, the mNIHSS was compared with the BI and MRS. The absolute Spearman values were higher for the mNIHSS than for the NIHSS; however, the trend was not statistically significant. In the present study, the majority of the patients had only mild clinical deficits. This clustering of patients with mild deficits has made it difficult to draw further conclusions regarding concurrent validity of this new stroke scale, especially at the higher end of the scale.
The mNIHSS, which is easier to administer because of fewer and less complicated items, was prospectively shown to be a more reliable and accurate stroke deficit scale. The mNIHSS can potentially be used in lieu of the original NIHSS in similar settings. Accurate assessment of clinical deficit, inclusion and exclusion criteria for clinical trial enrollment, and potential guidelines for safe thrombolytic use are all areas for potential use.
Other indications, capitalizing on strengths of the mNIHSS, can be further evaluated. NIHSS data abstraction has previously been found to be a reliable and valid method for the estimation of the NIHSS from medical records.26–28 This means of data collection has been adopted, in several studies, to evaluate increased numbers of patients when an initial NIHSS was not performed at the time of admission. If the mNIHSS can also be abstracted from medical records with a high degree of reliability and validity, a record of initial patient presentation that is more accurate and easier to obtain may be found. This may make clinical trial data analysis more efficient and accurate, allowing for increased patient numbers included in study trials. A prospective analysis of medical record abstraction using the mNIHSS may be indicated.
Previous studies have evaluated the role of telemedicine in the evaluation of stroke. One study tested whether NIHSS agreement would persist if performed over a telemedicine link.29 A good interrater correlation coefficient was seen (0.97, P<0.001). Although 4 items had excellent weighted κ correlations, certain items (LOC, ataxia, and commands) had poor reliability. If the improved mNIHSS were implemented, this application of telemedicine could be potentially expanded, allowing for broader-reaching evaluation of stroke patients in areas previously not staffed by NIHSS-certified examiners. A prospective evaluation using the mNIHSS could be performed.
The potential benefits of the present study must be viewed in light of its limitations. First, the small study size must be taken into consideration. Although the results are statistically significant, further evaluation in a setting with larger patient numbers and multiple examiners should be considered. The validity assessments were performed only on outpatients (n=27) to accurately assess functional outcome scales, thus limiting the conclusions that can be drawn. Further validity assessments are planned with a larger number of patients. Second, the relatively low stroke scale values obtained in the present study may not be representative of the stroke population as a whole. Most patients had mild to moderate strokes (median NIHSS 5, maximum 24, and minimum 0). This is a 7- to 9-point lower median than is found in other studies.24,30 A broader range of deficits must be evaluated to fully test reliability. This clustering of mild deficits may have negatively affected the accurate assessment of scale validity. Third, the present study did not correlate mNIHSS with imaging findings of stroke size, nor did it assess the stroke scale as a predictor of outcome after a stroke. As previously shown, baseline NIHSS strongly predicts outcomes after stroke, with a score of ≥16 being associated with a high probability of death or severe disability and a score of ≤6 being associated with a good recovery.13 The mNIHSS was retrospectively shown to be a predictor of stroke outcome. This should be prospectively addressed in the future. The correlation of outcome with stroke scale score can be a measure of validity. Because the NIHSS, mNIHSS, BI, and MRS were all obtained at the same time, a true measure of predictive validity cannot be assessed by the present study. A prospective follow-up of these patients, or another study, could help to strengthen the concurrent or predictive validity of this scale.
Previous findings revealed that the NIHSS favors left/dominant hemisphere strokes, with right hemisphere events being consistently larger than left hemisphere events.31 In the NIHSS, 7 of 42 points are related to language function, whereas only 2 of 42 points are attributed to neglect functions. By dropping the dysarthria question from the mNIHSS, the balance may be shifted more toward minimizing the lateralization bias. Subsequently, the mNIHSS may be a more accurate representation of the true clinical deficit. This should be further evaluated with studies on the NIHSS and mNIHSS with regard to potential hemispheric bias.
The clinical utility of the mNIHSS has yet to be evaluated outside of the clinical research setting or with nonneurologist physicians or nurse coordinators. A simplified mNIHSS may allow for improved and increased use to provide the physician with a reliable and accurate assessment of patient deficit at 1 specific time point, with possible far-reaching implications for patient outcome. The mNIHSS has initially been tested by a specialized group of neurologists previously certified in use of the NIHSS. This may limit the applicability of the results to other groups, and further evaluation in a broader setting with nonneurologist physicians or nurse coordinators should be performed.
The mNIHSS is not the ideal stroke-scoring scale. However, it is an improvement over many of the scales used in the past. Overall, the improved reliability and the preserved validity of the mNIHSS make it a very attractive clinical deficit scale for the evaluation of stroke patients in the clinical research setting and beyond.
This study was supported in part by the Veterans’ Affairs Medical Research Division and a Grant-in-Aid from the American Heart Association.
- Received November 15, 2001.
- Revision received February 2, 2002.
- Accepted February 19, 2002.
- Brott T, Adams HP, Olinger CP, Marler JR, Barsan WG, Biller J, Spilker J, Holleran R, Eberle R, Hertzberg V, Rorick M, Moomaw CJ, Walker M. Measurements of acute cerebral infarction: a clinical examination scale. Stroke. 1989; 20: 864–870.
- Cote R, Battista RN, Wolfson C, Boucher J, Adam J, Hachinski V. The Canadian Neurological Scale: validation and reliability assessment. Neurology. 1989; 39: 638–643.
- de Haan R, Horn J, Limburg M, Van Der Meulen J, Bossuyt P. A comparison of five stroke scales with measures of disability, handicap, and quality of life. Stroke. 1993; 24: 1178–1181.
- Asplund K. Clinimetrics in stroke research. Stroke. 1987; 18: 528–530.
- Schmulling S, Grond M, Rudolf J, Kiencke P. Training as a prerequisite for reliable use of NIH Stroke Scale. Stroke. 1998; 29: 1258–1259.
- Albanese MA, Clarke WR, Adams HP Jr, Woolson RF. Ensuring reliability of outcome measures on multicenter clinical trials of treatments for acute ischemic stroke: the program developed for the trial of ORG 10172 in acute stroke treatment (TOAST). Stroke. 1994; 25: 1746–1751.
- Goldstein L, Samsa G. Reliability of the National Institutes of Health Stroke Scale. Stroke. 1997; 28: 307–310.
- Lyden P, Brott T, Tilley B, Welch KMA, Mascha EJ, Levine S, Haley EC, Grotta J, Marler J. Improved reliability of the NIH Stroke Scale using video training: NINDS tPA Stroke Study Group. Stroke. 1994; 25: 2220–2226.
- Lyden P, Lu M, Jackson C, Marler J, Kothari R, Brott T, Zivin J. Underlying structure of the National Institutes of Health Stroke Scale: results of a factor analysis: NINDS tPA Stroke Trial Investigators. Stroke. 1999; 30: 2347–2354.
- Muir KW, Weir CJ, Murray GD, Povey C, Lees KR. Comparison of neurological scales and scoring systems for acute stroke prognosis. Stroke. 1996; 27: 1817–1820.
- Adams HP Jr, Bendixen BH, Leira E, Chang KC, Davis PH, Woolson RF, Clarke WR, Hansen MD. Antithrombotic treatment of ischemic stroke among patients with occlusion or severe stenosis of the internal carotid artery: a report of the Trial of Org 10172 in Acute Stroke Treatment (TOAST). Neurology. 1999; 53: 122–125.
- Lyden PD, Lu M, Levine S, Brott TG, Broderick J. A modified National Institutes of Health Stroke Scale for use in stroke clinical trials: preliminary reliability and validity. Stroke. 2001; 32: 1310–1317.
- Mahoney FT, Barthel DW. Functional evaluation: Barthel Index. Md Med J. 1965; 14: 61–65.
- Fleiss JL. Statistical Methods for Rates and Proportions. New York, NY: Wiley and Sons; 1981: 212–236.
- Lyden PD, Lau GT. A critical appraisal of stroke evaluation and rating scales. Stroke. 1991; 22: 1345–1352.
- D’Olhaberriague L, Litvan I, Mitsias P, Mansbach H. A reappraisal of reliability and validity studies in stroke. Stroke. 1996; 27: 2331–2336.
- Wityk R, Pessin MS, Kaplan RF, Caplan LR. Serial assessment of acute stroke using the NIH stroke scale. Stroke. 1994; 25: 362–365.
- Kasner SE, Chalela JA, Luciano JM, Cucchiara BL, Raps EC, McGarvey ML, Conroy MB, Localio AR. Reliability and validity of estimating the NIH Stroke Scale score from medical records. Stroke. 1999; 30: 1534–1537.
- Williams LS, Yilmaz E, Lopez-Yunez AM. Retrospective assessment of initial stroke severity with the NIH Stroke Scale. Stroke. 2000; 31: 858–862.
- Bushnell CD, Johnston DC, Goldstein L. Retrospective assessment of initial stroke severity: comparison of the NIH Stroke Scale and the Canadian Neurological Scale. Stroke. 2001; 32: 656–660.
- Shafqat S, Kvedar JC, Guanci MM, Chang Y, Schwamm LH. Role for telemedicine in acute stroke. Stroke. 1999; 30: 2141–2145.
- Woo D, Broderick J, Kothari R, Lu M, Brott T, Marler J, Grotta J, for the NINDS rt-PA Stroke Study Group. Does the National Institutes of Health Stroke Scale favor left hemisphere strokes? Stroke. 1999; 30: 2355–2359.