A Modified National Institutes of Health Stroke Scale for Use in Stroke Clinical Trials
Preliminary Reliability and Validity
Background and Purpose—The National Institutes of Health Stroke Scale (NIHSS) is accepted widely for measuring acute stroke deficits in clinical trials, but it contains items that exhibit poor reliability or do not contribute meaningful information. To improve the scale for use in clinical research, we used formal clinimetric analyses to derive a modified version, the mNIHSS. We then sought to demonstrate the validity and reliability of the new mNIHSS.
Methods—The mNIHSS was derived from our prior clinimetric studies of the NIHSS by deleting poorly reproducible or redundant items (level of consciousness, face weakness, ataxia, dysarthria) and collapsing the sensory item into 2 responses. Reliability of the mNIHSS was assessed with the certification data originally collected to assess the reliability of investigators in the National Institute of Neurological Disorders and Stroke (NINDS) rtPA (recombinant tissue plasminogen activator) Stroke Trial. Validity of the mNIHSS was assessed with the outcome results of the NINDS rtPA Stroke Trial.
Results—Reliability was improved with the mNIHSS: the number of scale items with poor κ coefficients on either of the certification tapes decreased from 8 (20%) to 3 (14%) with the mNIHSS. With the use of factor analysis, the structure underlying the mNIHSS was found to be identical to that of the original scale. On serial use of the scale, goodness of fit coefficients were higher with the mNIHSS. With data from part I of the trial, the proportion of patients who improved ≥4 points within 24 hours after treatment was statistically significantly increased by tPA (odds ratio, 1.3; 95% confidence limits, 1.0, 1.8; P=0.05). Likewise, the odds ratio for complete/nearly complete resolution of stroke symptoms 3 months after treatment was 1.7 (95% confidence limits, 1.2, 2.6) with the mNIHSS. Other outcomes showed the same agreement when the mNIHSS was compared with the original scale. The mNIHSS showed good responsiveness, ie, was useful in differentiating patients likely to hemorrhage or have a good outcome after stroke.
Conclusions—The mNIHSS appears to be identical clinimetrically to the original NIHSS when the same data are used for validation and reliability. Power appears to be greater with the mNIHSS with the use of 24-hour end points, suggesting the need for fewer patients in trials designed to detect treatment effects comparable to rtPA. The mNIHSS contains fewer items and might be simpler to use in clinical research trials. Prospective analysis of reliability and validity, with the use of an independently collected cohort, must be obtained before the mNIHSS is used in a research setting.
The ideal stroke scale should be valid, reliable, and simple to administer. While no current scale satisfies all these requirements,1–5 introducing a new scale is a formidable undertaking, given the expense associated with rigorous clinimetric scale design.6 To improve the usefulness of a currently used scale, the National Institutes of Health Stroke Scale (NIHSS), we conducted several investigations of its reliability, validity, and internal structure.7,8 The NIHSS is generally reliable, although some items, such as ataxia and dysarthria, exhibit low κ scores (poor interrater reliability).9 We found that the internal structure of the scale consists of 2 basic factors, relating to each of the 2 cerebral hemispheres.8 These previous investigations suggested that the NIHSS might be modified to make it simpler and easier to use. A scale with excess items, or items that do not contribute to the score in a meaningful way, wastes time and effort. Therefore, we propose a simplified version, the modified NIHSS (mNIHSS). To estimate the reliability and validity of the new scale, we used the same data we used previously to investigate the NIHSS. Using the previously published data allows easy comparison of the clinimetric features of the mNIHSS and facilitates the design of a prospective investigation, which is required to fully evaluate the mNIHSS.
The mNIHSS was devised on the basis of the results of prior work. The first item in the scale, level of consciousness, exhibited reasonably good reliability (κ=0.62) but in a factor analysis was redundant.8 This item was dropped, but the 2 remaining “consciousness” items were retained because they showed higher κ values. The reliability of the ataxia item was poor, and it contributed little to the internal structure of the scale; this item was therefore deleted.7 The facial weakness and dysarthria items exhibited poor reliability and appeared to be redundant in a factor analysis and therefore were removed. The sensory item was collapsed from 3 to 2 choices, on the basis of the poor κ seen in the reliability study.7 The resulting mNIHSS is shown in Table 1⇓.
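The derivation above leaves 11 items and a worst possible score of 31 (see Methods below). As a minimal sketch of the resulting scoring scheme, the item names and per-item maxima below are our reading of the standard NIHSS item ranges, not a reproduction of the published Table 1; treat them as assumptions.

```python
# Hypothetical sketch of mNIHSS scoring. Item names and per-item maxima are
# assumptions based on the standard NIHSS item ranges; the 11 retained items
# sum to the worst score of 31 stated in the text.
MNIHSS_ITEM_MAX = {
    "loc_questions": 2,    # item 1B
    "loc_commands": 2,     # item 1C
    "gaze": 2,
    "visual_fields": 3,
    "left_arm_motor": 4,
    "right_arm_motor": 4,
    "left_leg_motor": 4,
    "right_leg_motor": 4,
    "sensory": 1,          # collapsed from 3 responses to 2 (scored 0 or 1)
    "language": 3,
    "neglect": 2,
}

def mnihss_total(scores):
    """Sum the 11 mNIHSS item scores, validating each against its range."""
    total = 0
    for item, maximum in MNIHSS_ITEM_MAX.items():
        value = scores[item]
        if not 0 <= value <= maximum:
            raise ValueError(f"{item} must be in 0..{maximum}, got {value}")
        total += value
    return total

# Worst possible score: every item at its maximum.
print(mnihss_total(MNIHSS_ITEM_MAX))  # 31
```

The validation step matters in practice: intention-to-treat imputation (described in Methods) assigns this worst score of 31 to patients who died or were lost to follow-up.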
To evaluate the mNIHSS, we used the data from the National Institute of Neurological Disorders and Stroke (NINDS) rtPA Stroke Trials, which were 2 randomized, double-blind, placebo-controlled trials to assess the effect of recombinant tissue plasminogen activator (rtPA) treatment for patients with acute ischemic stroke.10 Eight clinical centers participated in the study, with >40 hospitals involved. Six hundred twenty-four patients were enrolled in 2 separate trials (part I, n=291; part II, n=333). The 2 trials had the same recruitment and data collection protocol but different primary outcomes. In part I, rtPA activity (≥4-point improvement over baseline) was assessed 24 hours after stroke onset, but all end points were evaluated 90 days after stroke as well. Part II was a study of the rtPA treatment efficacy 90 days after stroke onset. In both parts, the NIHSS scores were collected at baseline, 2 hours after treatment onset, and 24 hours, 7 to 10 days, and 3 months after stroke onset. Lesion volumes were estimated from unenhanced CT scans obtained 3 months after stroke with quantitative volumetry. The details of the CT method will be the subject of another publication.
We measured mNIHSS interrater and intrarater reliability with the certification data collected previously.7 The details of the design of the prior version of the scale, as well as our video certification method, are described in detail in our prior report.7 Briefly, we printed the scale instructions on the face of the stroke scale form so that the examiner always had them available for reference. We videotaped 4 different examiners examining 11 certification patients on 2 different certification tapes. Each certification tape contained a brief introduction, intended to standardize the manner in which the investigators viewed the patients. To measure the effectiveness of the training process, we summarized the agreement among raters using κ statistics for the case of multiple raters who examine a few subjects.11 12 The unweighted κ is qualified as follows: κ <0.40 defines poor concordance, κ between 0.40 and 0.75 defines moderate concordance, and κ >0.75 defines excellent concordance.12
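The κ summary for multiple raters described above can be sketched with Fleiss' generalization of κ, together with the concordance cutoffs quoted in the text.12 The small rating matrix in the test is invented for illustration; this is a sketch of the statistic, not the trial's certification analysis.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for multiple raters assigning subjects to categories.

    `ratings` is a list of per-subject category counts, e.g. [[4, 0], [1, 3]]:
    one row per subject, one column per category, each row summing to the
    (constant) number of raters.
    """
    n = len(ratings)         # subjects
    r = sum(ratings[0])      # raters per subject
    k = len(ratings[0])      # categories

    # Per-subject agreement: proportion of rater pairs that agree.
    p_subject = [(sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings]
    p_bar = sum(p_subject) / n

    # Chance agreement from the marginal category proportions.
    p_cat = [sum(row[j] for row in ratings) / (n * r) for j in range(k)]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)

def concordance(kappa):
    """Qualify kappa with the cutoffs used in the text (0.40 and 0.75)."""
    if kappa < 0.40:
        return "poor"
    if kappa <= 0.75:
        return "moderate"
    return "excellent"
```

With perfect agreement (every rater picks the same category for each subject, and the subjects span more than one category) the statistic is 1; values near 0 indicate chance-level agreement.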
Content Validity: Factor Analysis
To assess the underlying structure of mNIHSS items, we used factor analysis, as we have described in detail.8 We recalculated the goodness of fit on the basis of a 4-factor solution restricted to 11 NIHSS items involved in the mNIHSS, using data collected in the NINDS rtPA Stroke Trials. We evaluated the goodness of fit on data collected at the baseline in parts I (n=291) and II (n=333) for reliability.13 To ensure that the factor structure was free of time or treatment confounding effects, the goodness of fit was calculated for data collected at 2 hours, 24 hours, 7 to 10 days, and 3 months after rtPA treatment or placebo treatment. This analysis was intended to determine whether the mNIHSS behaved in a manner similar to the NIHSS over serial examinations.
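The published analysis is a confirmatory 4-factor model evaluated with a goodness of fit index, which cannot be reproduced from the text alone. As a loose, self-contained illustration of the underlying idea, that items driven by the same hemisphere correlate more strongly with one another than with items driven by the other hemisphere, one can simulate hypothetical item scores and inspect the correlation blocks (all data below are simulated, not trial data):

```python
import random

random.seed(0)

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Simulate two independent latent "hemisphere" severities; three hypothetical
# items load on each latent variable, plus item-specific noise.
n_patients = 500
left = [random.gauss(0, 1) for _ in range(n_patients)]
right = [random.gauss(0, 1) for _ in range(n_patients)]
items = [[v + random.gauss(0, 0.5) for v in latent]
         for latent in (left, left, left, right, right, right)]

# Items 0-2 share a latent factor, as do items 3-5.
within = [pearson(items[i], items[j]) for i in range(3) for j in range(i + 1, 3)]
cross = [pearson(items[i], items[j]) for i in range(3) for j in range(3, 6)]
avg_within = sum(within) / len(within)
avg_cross = sum(cross) / len(cross)
```

A factor analysis of such data recovers the two-block structure; the trial analysis additionally splits motor items into separate left and right motor factors, giving the 4-factor solution.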
Concurrent and Predictive Validity
To estimate the validity of the scale, we determined mNIHSS scores from data collected in the NINDS rtPA Stroke Trials on 624 patients.10 To assess criterion and predictive validity, we compared the relationship between mNIHSS and NIHSS scores at baseline, 24 hours, and 3 months using Spearman correlation coefficients. The intention-to-treat approach was used to impute the worst score (31 for the mNIHSS, 42 for the NIHSS) for patients who died or missed follow-up. To measure concurrent validity, we compared the correlation of the mNIHSS with other measures of neurological function (Barthel Index,14 modified Rankin Scale,15 and Glasgow Coma Scale16 ) measured at 3 months, on the basis of both scores and dichotomized variables. The φ coefficients were calculated among the binary outcomes. We used the mNIHSS to test for a treatment effect on improvement at 24 hours and on minimal or no disability at 3 months after stroke (a 3-month favorable outcome), for comparison with the original report.10 Improvement at 24 hours is defined as a ≥4-point decrease from baseline in the mNIHSS or complete resolution of neurological deficits at 24 hours from stroke onset, as was used in the prior trial. The 3-month favorable outcome is the summary statistic defined from a set of measures at 3 months: Barthel Index ≥95, Rankin Scale ≤1, Glasgow Outcome Scale=1, and mNIHSS ≤1. The Mantel-Haenszel test was used with stratification according to clinical center and/or time when studying the 0- to 180-minute group for improvement at 24 hours.17 The relative risk and 95% confidence limits are reported. A relative risk >1 with 95% confidence limits excluding 1 indicated a significant treatment benefit of rtPA compared with placebo.
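The stratified relative-risk calculation described above can be sketched as follows. The Mantel-Haenszel pooling is standard; the Greenland-Robins variance used for the confidence limits is a common choice but is an assumption on our part, not necessarily the trial's exact computation, and the table counts in the test are invented.

```python
import math

def mh_relative_risk(strata, z=1.96):
    """Mantel-Haenszel pooled relative risk across strata with a 95% CI.

    Each stratum is ((a, b), (c, d)): a/b = improved/not improved under
    treatment, c/d = improved/not improved under placebo. The confidence
    interval uses the Greenland-Robins variance for log(RR_MH).
    """
    num = den = var_num = 0.0
    for (a, b), (c, d) in strata:
        n1, n2 = a + b, c + d            # group sizes in this stratum
        total = n1 + n2
        num += a * n2 / total            # Mantel-Haenszel numerator
        den += c * n1 / total            # Mantel-Haenszel denominator
        var_num += ((a + c) * n1 * n2 - a * c * total) / total ** 2
    rr = num / den
    se = math.sqrt(var_num / (num * den))
    return rr, rr * math.exp(-z * se), rr * math.exp(z * se)
```

With a single stratum this reduces to the crude relative risk; a pooled relative risk whose 95% limits exclude 1 corresponds to the significance criterion stated in the text.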
To test whether the mNIHSS could predict differential stroke outcomes (similar to responsiveness), we used the mNIHSS to replace the NIHSS in a predictive model of acute intracranial hemorrhages.18 The details of this model have been published: the NIHSS proved to be one of the key predictors in that analysis, suggesting that the data set would be a good test of the responsiveness of the mNIHSS.19 We also modeled the probability of a 3-month favorable outcome in a prior study; again, we substituted the mNIHSS for the NIHSS in that model.20
The reliability of the mNIHSS, measured from the certification videotapes, was as good as or better than that of the NIHSS (Table 2⇓). On both tapes, the proportion of items showing poor κ values decreased from 20% to 14%. For the mNIHSS, 55% of the items showed excellent agreement, compared with 40% for the NIHSS. Compared with 8 items for the NIHSS, only 3 mNIHSS items showed poor agreement: the level of consciousness questions (item 1C), gaze, and aphasia.
The internal structure of the mNIHSS was identical to that of the NIHSS, a measure of content validity (sometimes called structural validity). The 4-factor solution, which was the best in our prior investigation of the NIHSS, was also found in the mNIHSS (Table 3⇓). The loading of each new scale item on the 4 factors is shown. The higher the loading, the greater is the contribution of that item to the factor. As with the NIHSS, the first factor seems related to left hemisphere function, containing language and the 2 items requiring correct verbal responses. The second factor, containing the gaze, visual field, neglect, and sensory items, can be related to right hemisphere function, since right cortical neglect influences the responses on the sensory and visual items, while aphasia interferes with responses to the gaze, visual, and sensory items. These 2 effects may work together to cause these items to load most heavily on the right hemisphere factor. Gaze, visual, and sensory items are influenced by the left hemisphere as well; they load on the right hemisphere factor because of the combined effects of aphasia and neglect. The third factor represents left hemisphere motor function, while the fourth factor represents right hemisphere motor function. The dysarthria item loads weakly on both motor factors, as might be expected, but loads most heavily on the left hemisphere factor. The goodness of fit statistic (comparative fit index=0.96) is equal to that of the NIHSS. When used over time, and in placebo-treated compared with rtPA-treated groups, the mNIHSS goodness of fit values were as robust as those of the NIHSS, ranging from 0.93 to 0.96. These coefficients were all greater than the comparable values found for the NIHSS.
The mNIHSS was valid for detecting a drug treatment effect. The scale scores clearly differentiated the 2 treatment groups at 24 hours, as illustrated in Table 4⇓. In the original trial, early improvement was defined as complete resolution of the neurological deficit or an improvement from baseline in the NIHSS score by ≥4 points 24 hours after the onset of stroke. For comparison with the original scale, the previously published data are listed.10 The relative risk of scoring ≥4 points better than baseline is shown. Part I of the study was designed prospectively to detect this effect, which was not statistically significant in the original study. With the use of the mNIHSS, however, part I was statistically significant in the 91- to 180-minute and the 0- to 180-minute strata. Each analysis in Table 4⇓ indicates that the confidence limits are generally comparable with the use of the mNIHSS. Using another definition of early improvement, such as ≥5 or ≥6 points, results in similar findings: more rtPA-treated patients showed early improvement.
The mNIHSS also differentiated the treatment groups at 3 months (Table 5⇓). The results are shown for various strata from the original study to allow comparison with Table 4⇑ in the original publication.10 In the original trial, the primary outcome was defined as an improved odds of recovery 3 months after treatment, using a global odds ratio. The global odds ratio is computed from all 4 outcome scales. The odds ratios calculated with the mNIHSS in place of the original NIHSS are comparable. The slight decreases in the probability values indicate a possible increase in power. The odds ratios and relative risk ratios for a score of 0 or 1 at 3 months are also shown. Again, the ratios are comparable, and the probability values are slightly lower with the use of the mNIHSS. Power was estimated directly with the use of a method for multiple correlated binary outcomes.21 With part I data, the power to detect a ≥4-point improvement by 24 hours increased from 24% to 51% with the mNIHSS. With part II data, the power to detect a favorable outcome (score of 0 or 1) 3 months after stroke did not improve with the mNIHSS.
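The trial's power estimates come from a method for multiple correlated binary outcomes21 that is not reproduced here. As a simplified stand-in, the power of a two-sided test comparing a single proportion of responders between two equal groups can be approximated with the normal approximation; the proportions and group size below are illustrative assumptions only.

```python
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test of proportions
    (normal approximation): probability of rejecting the null when the
    true responder proportions are p1 (treatment) and p2 (placebo)."""
    nd = NormalDist()
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group) ** 0.5
    z_alpha = nd.inv_cdf(1 - alpha / 2)      # critical value, two-sided
    z_effect = abs(p1 - p2) / se             # standardized true difference
    return nd.cdf(z_effect - z_alpha)
```

The qualitative point is the one made in the text: a more reliable scale shrinks the effective noise, which raises `z_effect` for the same true difference and therefore raises power, or equivalently allows a smaller `n_per_group` for the same power.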
When we substituted the mNIHSS for the original scale in the logistic models, the results were identical to our prior reports. The prediction model of hemorrhage was based on the rtPA-treated group only because there were so few hemorrhages in the placebo group. The final multivariable symptomatic hemorrhage prediction model contained the same 2 variables as in our prior report: early CT findings and categorized mNIHSS. The odds ratios for symptomatic hemorrhage in the rtPA-treated patients (n=306) were 1.65 (95% confidence limits, 1.13, 3.40) for the mNIHSS score and 6.90 (95% confidence limits, 2.00, 23.78) for early CT findings of ischemia. These odds ratios and confidence limits are similar to those obtained in the prior report with the original NIHSS. The model for predicting all hemorrhages within 36 hours of treatment, symptomatic and asymptomatic, was also identical to that in the previous report, containing the same variables: smoking, baseline mNIHSS, early ischemic CT findings, and admission blood pressure. Odds ratios and confidence limits were similar to those in the previous report (data not shown).
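The multivariable logistic models above cannot be reproduced from the text, but their basic building block, an odds ratio with confidence limits, can be sketched. The 2×2 counts in the test are invented for illustration, and the Woolf (log-based) interval is a standard textbook construction, not necessarily the method used in the trial analyses.

```python
import math

def odds_ratio_woolf(a, b, c, d, z=1.96):
    """Odds ratio from a 2x2 table with a Woolf (log-based) 95% CI.

    a, b = outcome present/absent in the exposed (e.g. high-mNIHSS) group;
    c, d = outcome present/absent in the unexposed group.
    """
    or_ = (a * d) / (b * c)
    # Standard error of log(OR): sqrt of summed reciprocal cell counts.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return or_, or_ * math.exp(-z * se), or_ * math.exp(z * se)
```

An odds ratio whose lower 95% limit exceeds 1, as for the CT and mNIHSS terms reported above, indicates a predictor significantly associated with hemorrhage in this framework.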
The prediction of a favorable outcome uses baseline variables to model a favorable outcome on the global odds ratio, which incorporates all 4 outcome variables. The analysis with the mNIHSS was identical to the analysis with the original NIHSS, reported previously.20 The odds ratio for a treatment effect was 2.28 (95% confidence limits, 1.62, 3.22), which compares favorably with the first report. Of considerable importance is the observation that the use of the mNIHSS yields the same interaction terms in the final model, showing that the mNIHSS is similar to the original scale in logistic regression analyses.
We propose a revised stroke scale for clinical research, the mNIHSS, that is simplified and should be somewhat easier to administer than the NIHSS. Our data indicate that the mNIHSS is more reliable than the preceding scale and is valid when used to analyze the data from the NINDS rtPA Stroke Trial. The items removed from the scale were level of consciousness, facial weakness, ataxia, and dysarthria. These were dropped because they exhibited poor reliability or were redundant in our previous clinimetric studies. Our data confirm that deleting these items improved the reliability and maintained the validity of the mNIHSS. The sensory item was collapsed from 3 to 2 responses to improve reliability. Of course, the mNIHSS must be tested in a prospective fashion, with a new data set used for confirmation. The previous version of the scale may be more appropriate for clinical monitoring of patients because the additional items may detect changes in patient status that would be missed by the mNIHSS. The clinical utility of both scales outside of a research setting remains to be explored.
The scale proved nearly identical to the original when used in various statistical modeling procedures. The correlation of various outcomes, shown in Table 6⇓, may be viewed as evidence of concurrent validity, in the sense that patients who score well on the Rankin, Glasgow Outcome, or Barthel scales also score well on the mNIHSS. Furthermore, when compared with the original scale, the correlation coefficients are nearly identical, confirming that the mNIHSS behaves in a manner similar to the original. More importantly, the scale imitated the original in the predictive models, which can be taken as an indicator of responsiveness. That is, the mNIHSS tends to predict response of patients to tPA as well as the original scale when used in the multivariable model.20 Likewise, the mNIHSS predicts likelihood of hemorrhage after tPA treatment as well as the original in the multivariable model of symptomatic hemorrhage.19 However, another view of responsiveness includes the property of predicting within-individual changes; we could not assess this aspect of responsiveness.
A limitation of this investigation is the use of data collected previously. The NIHSS scores collected during the reliability study and during the actual NINDS rtPA Stroke Trial were recorded on forms that contained all 15 items of the NIHSS. For the present analysis we simply analyzed the 11 items proposed for the mNIHSS. While reasonable, this exercise has the potential pitfall that the scores may be biased: it is a theoretical concern that the presence of the 4 deleted items may inadvertently have influenced the scores on the remaining items. To truly estimate the reliability statistics, the 11-item scale must be tested directly in a prospective design. Nevertheless, our data support further investigation of the mNIHSS and suggest that it may indeed be more reliable than the NIHSS.
As an indicator of validity, we used the mNIHSS to predict outcome in the original NINDS rtPA Stroke Trial. The scale certainly predicted the outcome that was detected with the NIHSS. The values for relative risk shown in Table 4⇑ parallel those seen in the original study using the NIHSS.10 However, the absolute values are higher, the confidence intervals are generally narrower, and more of the tests are statistically significant, findings predicted by the improved reliability of the mNIHSS. One might question, however, whether the mNIHSS might somehow overestimate a putative drug effect. It is true that the original study showed no significant effect on the main outcome variable at 24 hours: a ≥4-point improvement on the NIHSS. However, we have shown elsewhere that in fact there was a highly significant effect on outcome at 24 hours.22 If any other criterion had been used for early improvement, ie, ≥5 or ≥6 points, then the original study would have been significantly positive. Therefore, the observation that the mNIHSS would have yielded a positive study on the 24-hour outcome is consistent with the improved power of the modified scale. The greater power of the mNIHSS implies that fewer patients would be needed in future trials to detect significant treatment effects with the use of 24-hour end points.
The outcome data 3 months after treatment suggest that the power for the mNIHSS is similar to the original. For each stratum we studied (Table 5⇑), the confidence limits were generally narrower and the probability values were generally lower than with the NIHSS. Reassuringly, the global odds ratio, calculated from all 4 outcome scales, changed only slightly when the mNIHSS replaced the NIHSS. This confirms the robust nature of the global test and further suggests that the mNIHSS may be combined with the other scales in the global test. In confirmation of this, we note that the correlation coefficients among the scales remained about the same when the mNIHSS was substituted for the NIHSS (Table 6⇑).
Our study does not address the potential bias of the original NIHSS toward hemispheric lateralization. For each 5-point category of the NIHSS <20, the median volume of right hemisphere infarction was approximately double the volume of left hemisphere infarction.23 The significance of this observation is not yet clear, and we plan further studies of both the NIHSS and the mNIHSS with regard to potential hemispheric bias. Additionally, we did not assess the effect of eliminating other, low-reliability items such as gaze or neglect; these items were believed essential to characterizing the stroke study population.
The mNIHSS is not the ideal stroke scale, but it is clearly an improvement over the previous NIHSS. When used by trained investigators, it shows greater reliability and power than its predecessor and is valid for detecting a treatment effect. Further prospective confirmation is needed before clinical use.
The following persons and institutions participated in the NINDS rtPA Stroke Trial: Clinical Centers: University of Cincinnati (150 patients): Principal Investigator: T. Brott; Co-investigators: J. Broderick, R. Kothari; M. O’Donoghue, W. Barsan, T. Tomsick; Study Coordinators: J. Spilker, R. Miller, L. Sauerbeck; Affiliated Sites: St Elizabeth Hospital (South), J. Farrell, J. Kelly, T. Perkins, R. Miller; University Hospital, T. McDonald; Bethesda North Hospital, M. Rorick, C. Hickey; St Luke Hospital (East), J. Armitage, C. Perry; Providence Hospital, K. Thalinger, R. Rhude; The Christ Hospital, J. Armitage, J. Schill; St Luke Hospital (West), P.S. Becker, R.S. Heath, D. Adams; Good Samaritan Hospital, R. Reed, M. Klei; St Francis/St George Hospital, A. Hughes, R. Rhude; Bethesda Oak Hospital, J. Anthony, D. Baudendistel; St Elizabeth Hospital (North), C. Zadicoff, R. Miller; St Luke Hospital–Kansas City, M. Rymer, I. Bettinger, P. Laubinger. University of California, San Diego (146 patients): Principal Investigator: P. Lyden; Co-investigators: J. Dunford, J. Zivin; Study Coordinators: K. Rapp, T. Babcock, P. Daum, D. Persona; Affiliated Sites: University of California, San Diego, M. Brody, C. Jackson, S. Lewis, J. Liss, Z. Mahdavi, J. Rothrock, T. Tom, R. Zweifler; Sharp Memorial Hospital, R. Kobayashi, J. Kunin, J. Licht, R. Rowen, D. Stein; Mercy Hospital, J. Grisolia, F. Martin; Scripps Memorial Hospital, Chaplin, N. Kaplitz, J. Nelson, A. Neuren, D. Silver; Tri-City Medical Center, T. Chippendale, E. Diamond, M. Lobatz, D. Murphy, D. Rosenberg, T. Ruel, M. Sadoff, J. Schim, J. Schleimer; Mercy General Hospital, Sacramento, R. Atkinson, D. Wentworth, R. Cummings, R. Frink, P. Heublein; San Diego Veterans Administration Medical Center. University of Texas Medical School, Houston (104 patients): Principal Investigator: J.C. Grotta; Co-investigators: T. DeGraba, M. Fisher, A. Ramirez, S. Hanson, L. Morgenstern, C. Sills, W. Pasteur, F. Yatsu, K. Andrews, C. 
Villar-Cordova, P. Pepe; Study Coordinators: P. Bratina, L. Greenberg, S. Rozek, K. Simmons; Affiliated Sites: Hermann Hospital; St Lukes Episcopal Hospital; Lyndon Baines Johnson General Hospital; Memorial Northwest Hospital; Memorial Southwest Hospital; Heights Hospital; Park Plaza Hospital; Twelve Oaks Hospital; Houston Fire Department Emergency Medical Services. Long Island Jewish Medical Center (72 patients): Principal Investigators: T.G. Kwiatkowski, S.H. Horowitz; Co-investigators: R. Libman, R. Kanner, R. Silverman, J. LaMantia, C. Mealie, R. Duarte; Study Coordinators: R. Donnarumma, M. Okola, V. Cullin, E. Mitchell. Henry Ford Hospital (62 patients): Principal Investigator: S.R. Levine; Co-investigators: C.A. Lewandowski, G. Tokarski, N.M. Ramadan, P. Mitsias, M. Gorman, B. Zarowitz, J. Kokkinos, J. Dayno, P. Verro, C. Gymnopoulos, R. Dafer, L. D’Olhaberriague; Study Coordinators: K. Sawaya, S. Daley, M. Mitchell. Emory University School of Medicine (39 patients): Principal Investigator: M. Frankel, B. Mackay; Co-investigators: J. Weissman, J. Washington, B. Nguyen, A. Cook, H. Karp, M. Williams, T. Williamson; Study Coordinators: C. Barch, J. Braimah, B. Faherty, J. MacDonald, S. Sailor; Affiliated Sites: Grady Memorial Hospital; Crawford Long Hospital; Emory University Hospital; South Fulton Hospital, M. Kozinn, L. Hellwick. University of Virginia Health System (37 patients): Principal Investigator: E.C. Haley, Jr; Co-investigators: T.P. Bleck, W.S. Cail, G.H. Lindbeck, M.A. Granner, S.S. Wolf, M.W. Gwynn, R.W. Mettetal, Jr, C.W.J. Chang, N.J. Solenski, D.G. Brock, G.F.Ford; Study Coordinators: G.L. Kongable, K.N. Parks, S.S. Wilkinson, M.K. Davis; Affiliated Sites: University of Virginia Health System, E.C. Haley, Jr; Winchester Medical Center, G.L. Sheppard, D.W. Zontine, K.H. Gustin, N.M. Crowe, S.L. Massey. University of Tennessee (14 patients): Principal Investigators: M. Meyer, K. Gaines; Study Coordinators: A. Payne, C. Bales, J. Malcolm, R. 
Barlow, M. Wilson; Affiliated Sites: Baptist Memorial Hospital, C. Cape; Methodist Hospital Central, T. Bertorini; Jackson Madison County General Hospital, K. Misulis; University of Tennessee Medical Center, W. Paulsen, D. Shepard. Coordinating Center: Henry Ford Health Sciences Center: Principal Investigator: B.C. Tilley; Co-investigators: K.M.A. Welch, S.C. Fagan, M. Lu, S. Patel, E. Masha, J. Verter; Study Coordinators: J. Boura, J. Main, L. Gordon; Programmers: N. Maddy, T. Chociemski; CT Reading Centers: Part A—Henry Ford Health Sciences Center, J. Windham, H. S. Zadeh; Part B—University of Virginia Medical Center, W. Alves, M.F. Keller, J.R. Wenzel; Central Laboratory: Henry Ford Hospital, N. Raman, L. Cantwell; Drug Distribution Center: A. Warren, K. Smith, E. Bailey. NINDS, Project Officer: J.R. Marler. Data and Safety Monitoring Committee: J.D. Easton, J.F. Hallenbeck, G. Lan, J.D. Marsh, M.D. Walker. Genentech, Inc, Participants: J. Froelich, MD, J. Breed, F. Wang-Chow.
This study was supported by grants from the National Stroke Association, the Veterans Affairs Research Service, and the NINDS (N01-NS02382, N01-NS02374, N01-NS02377, N01-NS02381, N01-NS02379, N01-NS02373, N01-NS02376, N01-NS02378, N01-NS02380). The authors are very grateful to E. Clarke Haley, Jr, for editorial review.
A list of the trial investigators is found in the Appendix.
- Received August 15, 2000.
- Revision received December 1, 2000.
- Accepted December 20, 2000.
- Copyright © 2001 by American Heart Association
Asplund K. Clinimetrics in stroke research. Stroke. 1987;18:528–530.
Boysen G. Stroke scores and scales. Cerebrovasc Dis. 1992;2:239–247.
de Haan R, Horn J, Limburg M, Van Der Meulen J, Bossuyt P. A comparison of five stroke scales with measures of disability, handicap, and quality of life. Stroke. 1993;24:1178–1181.
D’Olhaberriague L, Litvan I, Mitsias P, Mansbach H. A reappraisal of reliability and validity studies in stroke. Stroke. 1996;27:2331–2336.
Lyden PD, Hantson L. Assessment scales for the evaluation of stroke patients. J Stroke Cerebrovasc Dis. 1998;7:113–127.
Cote R, Battista RN, Wolfson C, Boucher J, Adam J. The Canadian Neurological Scale: validation and reliability assessment. Neurology. 1989;39:638–643.
Lyden P, Brott T, Tilley B, Welch KMA, Mascha EJ, Levine S, Haley EC, Grotta J, Marler J, for the NINDS tPA Stroke Study Group. Improved reliability of the NIH Stroke Scale using video training. Stroke. 1994;25:2220–2226.
Lyden P, Lu M, Jackson C, Marler J, Kothari R, Brott T, Zivin J, for the National Institute of Neurological Disorders and Stroke rt-PA Stroke Study Group. Underlying structure of the National Institutes of Health Stroke Scale: results of a factor analysis. Stroke. 1999;30:2347–2354.
Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York, NY: John Wiley & Sons; 1981.
Siegel S, Castellan NJ Jr. Nonparametric Statistics for the Behavioral Sciences. 2nd ed. New York, NY: McGraw-Hill; 1992:284–291.
Tilley BC, Marler J, Geller NL, Lu M, Legler J, Brott T, Lyden P, Grotta J. Use of a global test for multiple outcomes in stroke trials with application to the National Institute of Neurological Disorders and Stroke t-PA Trial. Stroke. 1996;27:2136–2142.
NINDS t-PA Stroke Study Group. Intracerebral hemorrhage after intravenous t-PA therapy for ischemic stroke. Stroke. 1997;28:2109–2118.
NINDS t-PA Stroke Study Group. Generalized efficacy of t-PA for acute stroke: subgroup analysis of the NINDS t-PA Stroke Trial. Stroke. 1997;28:2119–2125.
Tilley BC, Lu M, for the NINDS t-PA Trial Investigators. Analytic issues for stroke clinical trials. Presented at: Society for Clinical Trials meeting; May 1996; Pittsburgh, Pa.
Woo D, Broderick J, Kothari R, Lu M, Brott T, Marler J, Grotta J, for the NINDS rt-PA Stroke Study Group. Does the National Institutes of Health Stroke Scale favor left hemisphere strokes? Stroke. 1999;30:2355–2359.
It has long been recognized that clinical observations need to be objectively measured to better compare data from different sources, improve communication between health professionals, and better plan patient care.R1 R2 Although stroke scales have been in use for several decades, initially they did not adhere to any well-established or rigorous criteria but merely included items from the neurological examination to which different numerical values were assigned.R3 Over the years, under the influence of the fields of psychometrics and biostatistics, criteria to assess the reliability and the validity of stroke scales have been proposed and implemented.R4 Although there has been some debate concerning the need for and/or value of stroke scales,R5 most today would recognize their utility, especially for measuring neurological impairment in the acute and subacute setting.
Because they can be used for different purposes, such as evaluating a new therapy within a randomized controlled trial or closely monitoring the neurological evolution of stroke patients after admission to hospital to guide management, stroke scales must meet different needs and thus can vary in their design and content.R6 For example, a scale used to measure a therapeutic effect would benefit from being more comprehensive and sensitive, to detect even small but beneficial effects of a new drug.R7 On the other hand, a scale used by different health care professionals to monitor a patient's evolution and detect clinically meaningful changes in neurological status would need to put more emphasis on reproducibility, simplicity, and responsiveness to within-individual changes.R8
Regardless of the indication for which they were designed, stroke scales should adhere to some basic and recognized principles that will ensure their reliability and validity.R4 Some of these include (1) simple and nonambiguous definitions for the modalities tested, (2) a minimum number of grades per item to minimize variability, (3) selection of the most relevant neurological deficits, (4) ease of use and interpretation by observers with different medical backgrounds, and (5) brevity and simplicity. Reliability, an important attribute of any scale, can be assessed through different approaches. These may include intraobserver reliability, which determines the stability of a measure at 2 different points in time when applied by the same observer; interrater reliability, which assesses the reproducibility of a measure between 2 or more raters at the same point in time; and internal consistency, which measures the extent to which the scale’s items substantively measure the same clinical concept. The scale also has to possess validity, which means that it should reflect as closely as possible the clinical phenomenon under study. In general, one should be concerned with 3 types of validity: first, content validity, which is an index of how well the scale reflects the components of what is being measured; second, criterion validity, which determines whether the scale reflects the current neurological status as defined by a gold standard (concurrent validity) or predicts the future health status of the patient (predictive validity); and finally, construct validity, which assesses whether the neurological deficit measured by the stroke scale is different from other types of deficits quantified by another scale (discriminant validity) and also evaluates the capacity of the scale to correlate with other measures of the same construct over time to detect meaningful changes (convergent validity). 
Convergent validation is equivalent to establishing the responsiveness of the scale, which is another important property of any clinical instrument.
In the past several years, the NIH Stroke Scale (NIHSS) has been the most widely used clinical instrument for assessing therapeutic interventions in acute stroke trials. It is fairly comprehensive and has undergone several assessments of its reliability and validity.R7 R9 In addition, a certification program using training videotapes has been in use to increase its reliability.R10
In the preceding article, Lyden et al propose a modified and simplified version of the NIHSS. Their goal was to increase both the reliability of the scale and its capacity to reflect more meaningful clinical information. To do this, they used 2 techniques: first, they identified and excluded selected items that had shown either poor reliability or redundancy in previous analyses, and second, they collapsed the choices for the sensory item. The resulting modified NIHSS (mNIHSS) retains 11 of the 15 items in the initial version, with a worst possible score of 31 compared with 42 in the original scale, a reduction of about 25% in total points. Based on the data sets from the NINDS rt-PA Stroke Trials and previously collected certification data, they then tested the mNIHSS for validity and reliability. The authors report improved reliability with the mNIHSS, which was to be expected, but interestingly they also show that the modified scale retains good content validity when compared with the NIHSS. In addition, the modified scale performs as well as the original NIHSS in terms of correlation with other concurrently administered scales that reflect various other aspects of patient function, and it appears at least as powerful (if not more so) in predicting certain specific outcomes, such as neurological impairment at 24 hours. This study represents an important advance; however, some issues remain unresolved and could be explored in future work. These include the determination of new cut-point criteria to define meaningful clinical changes or to predict outcome, and the differential weighting of items to address the problem of hemispheric bias alluded to by the authors.
The current report constitutes very good news for all physicians and other healthcare professionals involved in evaluating new therapies for acute stroke patients. These results, based on retrospective analyses, show promise and represent a first step toward greater acceptance and utilization of the NIH Stroke Scale for the quantification of neurological impairment in clinical trials. Appropriately, the authors recognize that additional prospective validation studies will be required to confirm and strengthen the present findings. We strongly encourage the authors in this undertaking and look forward to using the modified NIHSS in future clinical trials.
- Received August 15, 2001.
- Revision received December 1, 2000.
- Accepted December 20, 2000.
Asplund K. Clinimetrics in stroke research. Stroke. 1987;18:528–530.
Feinstein AR. An additional basic science for clinical medicine, III: the challenges of comparison and measurement. Ann Intern Med. 1983;99:705–712.
Candelise L (ed). Stroke scores and scales. Cerebrovasc Dis. 1992;2:239–247.
Hantson L, DeKeyser J. Neurological scales in the assessment of cerebral infarction. Cerebrovasc Dis. 1994;4(suppl 2):7–14.
Brott T, Adams HP Jr, Olinger CP, Marler JR, Barsan WG, Biller J, Spiker J, Holleran R, Eberle R, Hertzberg V, Rorick M, Moomaw CJ, Walker M. Measurements of acute cerebral infarction: a clinical examination scale. Stroke. 1989;20:864–870.
Côté R, Battista RN, Wolfson C, Boucher J, Adam J, Hachinski V. The Canadian Neurological Scale: validation and reliability assessment. Neurology. 1989;39:638–643.
Goldstein LB, Bertels C, Davis JN. Interrater reliability of the NIH Stroke Scale. Arch Neurol. 1989;46:660–662.
Lyden P, Brott T, Tilley B, Welch KMA, Mascha EJ, Levine S, Haley EC, Grotta J, Marler J, and the NINDS TPA Stroke Study Group. Improved reliability of the NIH Stroke Scale using video training. Stroke. 1994;25:2220–2226.