(Stroke. 1997;28:1174-1180.)
© 1997 American Heart Association, Inc.
From the Rehabilitation Institute of Chicago (A.W.H., R.L.H., J.R.M., D.I., L.L., P.S., E.J.R.) and the Department of Physical Medicine and Rehabilitation, Northwestern University Medical School (A.W.H., R.L.H., J.R.M., D.I., E.J.R.), Chicago, Ill.
| Abstract |
|---|
Methods Rating scale (or Rasch) analysis of the 15 NIH Stroke Scale items was conducted using the BIGSTEPS computer program to evaluate (1) the range of impairment assessed by the items, (2) the items' coherence with an underlying construct of impairment, and (3) range of impairment measured in rehabilitation patients. We sought to maximize the range of impairment measured by conducting analyses recursively; at each subsequent step, the worst fitting item was deleted or rescored. The sample comprised 1291 admission and discharge records from 693 rehabilitation inpatients with stroke.
Results Thirteen items arrayed the sample across a sufficient range of impairment. The limb ataxia item fit poorly and was deleted; lower ratings for this item were associated with higher scores on the total scale. Pupillary response was also deleted because ratings reflected poor congruence with the total score. Best language was rescored because intermediate ratings were inconsistently related to the total score. Patients with hemorrhagic strokes had poorer fitting measures than did patients with ischemic strokes.
Conclusions The items in a revised NIH Stroke Scale worked well together to define the severity of impairment resulting from stroke that is observed during medical rehabilitation. Directions regarding limb ataxia should be modified to indicate untestability due to hemiplegia.
Key Words: disability evaluation; rehabilitation; stroke assessment
| Introduction |
|---|
=.69); limited evidence of interrater
reliability was reported by Goldstein and associates,2 who
found moderate to substantial agreement for 9 of 13 items in 20 cases.
Construct validity was supported by modest correlations between raw
score and volume of lesion at 1 week from CT scan (r=.68)
and clinical outcome at 3 months (r=.79).1 Sensitivity to
change was documented by Rothrock et al3; they reported a
24% rate of spontaneous improvement to the point of no or only mild
deficits in patients with ischemic stroke who did not receive
rehabilitation, although they did not describe item-specific rates of
improvement. Lyden and associates4 found that video
training of raters was effective in achieving moderate to excellent
agreement on all but two items, facial paresis and ataxia, for which
they recommended item revisions. Scale limitations are reflected in the
finding that many patients could not be tested on some items and that
normal scores were obtained on admission by many patients.
A valid measure of impairment severity resulting from stroke would allow clinicians and researchers to quantify both the extent of neurological recovery that occurs during acute management and medical rehabilitation and the relationships between disease, impairment, disability, and handicap. The WHO5 described a model of disablement that draws distinctions between various aspects of disease consequences, including effects on individuals, their families, and society. Four illness realms defined in this model are (1) the underlying diagnosis or disease; (2) impairment, the loss or abnormality of physical or psychological capabilities; (3) disability, the restriction in activities of daily living; and (4) handicap, the social disadvantage due to limited ability to fulfill a role that is normal for that person.6 The NIH Stroke Scale focuses on impairment severity. A scale with good measurement properties should contain items that cover a wide range of impairment levels and that, when combined, are sensitive to improvements over time or due to treatment. The clinical basis for item definition appears clear and is based on neurological examination: sensory, motor, reflex, and language functions are often disrupted after stroke. However, because certain impairments have a greater impact on disability than others, patients selected for inpatient rehabilitation may require a different or modified measurement tool than that used for patients in acute settings.
While the NIH Stroke Scale appears to be useful for constructing a measure of impairment, its utility and validity in medical rehabilitation have not been evaluated extensively. The high proportion of acute-care cases that could not be scored on some items and the apparent insensitivity of the items to neurological recovery in the original sample of Brott et al1 require that the measurement properties be evaluated more thoroughly. Also, the sum of item scores is not an interval-level measure, although the items may form the basis of such a measure. The goal of this study was to evaluate the clinical utility and validity of the scale for patients during medical rehabilitation using a psychometric approach called rating scale (or Rasch) analysis.
| Subjects and Methods |
|---|
Instrument
The NIH Stroke Scale1 consists of 15 items that
assess the severity of impairment in LOC, ability to respond to
questions and to obey simple commands, pupillary response, deviation of
gaze, extent of hemianopsia, facial palsy, resistance to gravity in the
weaker limb, plantar reflex, limb ataxia, sensory loss, visual neglect,
dysarthria, and aphasia severity. Each item is rated on a 3- or 4-point
ordinal scale. It was intended to assist in the examination of patients
with acute cerebral infarction. Table 1 lists the items
and reproduces Brott's original summary of patient testability and
incidence of impairment for each item. Many of the items appear to have
limited utility, given the high proportion of subjects who were rated
as normal at admission (LOC), who were untestable (limb ataxia), or who
were rated more poorly 1 week later (sensory).
Statistical Analysis
The raw score obtained by summing NIH Stroke Scale item
responses is ordinal in nature, which precludes its use in
parametric statistical comparisons because these raw data only
allow rank ordering of scores. A measurement procedure that can be used
to develop reliable, valid, and interval-scaled measures from ordinal
scores is rating scale or Rasch analysis,7 named
after the Danish mathematician whose work in the 1950s and 1960s has
been widely applied in educational testing and more recently in
rehabilitation outcome measurement. Interval (also called linear)
measures possess the advantage of having equal intervals between units
of the scale. When distributed in a reasonably normal fashion, measures
from an interval scale can be confidently subjected to
parametric statistical analyses that relate independent
and dependent variables. Transforming ordinal raw scores to
interval measures allows one to quantify individuals' impairments
along an equal-interval continuum of severity, make quantitative
comparisons within an individual across time, or compare severity
levels between individuals or across groups of individuals.
Rasch analysis8 helps evaluate to what extent patients' responses to NIH Stroke Scale items are dominated by the one dimension it purports to measure (severity of stroke impairment). This procedure requires viewing impairment items as forming a continuum of tasks that range from easy to perform to difficult to perform. If the tasks form a single construct, one would expect patients of any impairment level to be more able to perform easy tasks and less able to perform difficult tasks. Thus, when patients are able (or unable) to perform a task that they would (or would not) have been expected to perform on the basis of their overall impairment level, their responses misfit the model. While individual patients do not always respond as expected on a given series of tasks, the finding that a substantial proportion respond unexpectedly to a task provides an indication that the task does not "fit" with the remaining tasks in forming a unidimensional construct. It is often the case that an item set contains "noisy" items or that the definitions of scale categories are vague, which contributes to imprecise measurement. Strategies for fine-tuning item sets include rewording or deleting items, rescoring rating scales, or segregating patients into more homogeneous subgroups. The derived interval measure becomes useful in the description and evaluation of patients' stroke severity when the evidence for unidimensionality is compelling. Attaining a fine-tuned item set allows each patient to be characterized by a single interval-level impairment measure and each item by an estimate of its difficulty (called its calibration).
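The expectation described above, that a given patient is more likely to show the modeled response on an easy item than on a difficult one, can be sketched with the simplest (dichotomous) form of the Rasch model. This is only an illustration in Python, not the rating scale formulation estimated by BIGSTEPS; the function name and the logit values are hypothetical.

```python
import math

def rasch_probability(person_measure, item_difficulty):
    """Probability of the modeled response under the dichotomous Rasch model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(person_measure - item_difficulty)))

# Hypothetical values: one person at -1.0 logits, one item located at the
# person's level and one item located 3 logits higher.
p_matched = rasch_probability(-1.0, -1.0)
p_hard = rasch_probability(-1.0, 2.0)
print(round(p_matched, 3), round(p_hard, 3))  # prints 0.5 0.047
```

A response that the model considers improbable (for example, the second case occurring anyway) is what the text calls an unexpected response; many such responses to the same item indicate misfit.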
Several criteria are used to judge and improve the adequacy of a measure. These criteria include (1) person separation (the range of impairments represented by the patients in the sample) and item separation (the range of impairments covered by the measure), (2) item fit (the extent to which the sample as a whole responds unexpectedly to specific items) and person fit (the extent to which individuals or diagnostic subgroups respond idiosyncratically to the item set), and (3) scale structure (the extent to which raters are using the steps in the scale correctly and consistently). The BIGSTEPS program9 provides these statistics. The range of impairment represented in a given sample is summarized with a person separation index, defined as the ratio of the true spread of the measures to their measurement error. The index thus expresses the spread of a given sample of patients in units of the error in their measures. A clinically useful set of items should define at least three strata of patients (eg, "high," "moderate," and "low" levels of impairment), which corresponds to a separation index of at least 2.0. A related statistic, called item separation, indicates the item set's potential range of measurement; larger values indicate a potentially greater range of impairment that the item set can measure.
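The link between a separation index and the number of strata can be made concrete with a short sketch. The text does not give the formulas; the ones below are the conventional definitions from the Rasch measurement literature (true spread obtained by removing error variance from observed variance, and strata = (4G + 1) / 3) and are assumed here rather than taken from this study. Function names are hypothetical.

```python
import math

def separation_index(observed_sd, rmse):
    """Separation G: error-adjusted ("true") SD of the measures divided by
    their root mean square measurement error (both in logits)."""
    true_variance = max(observed_sd ** 2 - rmse ** 2, 0.0)
    return math.sqrt(true_variance) / rmse

def strata(separation):
    """Number of statistically distinct levels implied by a separation
    index, using the conventional (4G + 1) / 3 rule."""
    return (4.0 * separation + 1.0) / 3.0

# A separation of 2.0 corresponds to three strata; a separation of 1.88
# corresponds to slightly fewer than three.
for g in (1.88, 2.0, 2.05):
    print(f"separation {g:.2f} -> {strata(g):.2f} strata")
```

Under this rule, the separation of 2.0 cited in the text is exactly the point at which three strata can be distinguished.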
Two indicators of fit are defined: infit and outfit. Infit is sensitive to irregular patterns of responses for items that are close to patients' impairment levels. Outfit is sensitive to extremely unexpected or rare responses. For example, a problem with large outfit would occur when a score associated with an overall severe level of impairment reflected a pattern of impairment in which relatively common signs (eg, facial palsy) are absent while signs indicative of severe impairment are present (eg, reduced LOC). While both infit and outfit are useful indicators of noise, large outfit usually reflects gross anomalies that might reflect the presence of patients with unique patterns of impairment. However, large infit usually reflects more serious problems in the item's coherence with the measure's underlying construct. A pattern of poorly fitting items should give users pause to consider whether their construct of impairment should be defined differently from what they hypothesized. Other reasons for poor item fit include ambiguous wording of an item, a misordered scale structure, and peculiar item use by a subgroup of patients. Small fit statistics are generally not a concern, although they provide insights into how an item set might be shortened by deleting redundant items.
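The difference between the two fit indicators can be illustrated with a small sketch for a single dichotomous item. It assumes the standard mean-square definitions (outfit as the unweighted mean of squared standardized residuals, infit as the variance-weighted version); the person measures and responses are invented for illustration and are not data from this study.

```python
import math

def rasch_probability(person_measure, item_difficulty):
    return 1.0 / (1.0 + math.exp(-(person_measure - item_difficulty)))

def item_fit(responses, person_measures, item_difficulty):
    """Return (infit, outfit) mean squares for one dichotomous item."""
    squared_residuals, variances, squared_z = [], [], []
    for x, b in zip(responses, person_measures):
        p = rasch_probability(b, item_difficulty)
        variance = p * (1.0 - p)           # model variance of the response
        residual_sq = (x - p) ** 2         # squared raw residual
        squared_residuals.append(residual_sq)
        variances.append(variance)
        squared_z.append(residual_sq / variance)
    infit = sum(squared_residuals) / sum(variances)   # information weighted
    outfit = sum(squared_z) / len(squared_z)          # outlier sensitive
    return infit, outfit

# Hypothetical person measures (logits) and an item at 0 logits.
measures = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
expected_pattern = [0, 0, 0, 1, 1, 1]
surprising_pattern = [1, 0, 0, 1, 1, 1]  # far-off-target person responds unexpectedly

infit_a, outfit_a = item_fit(expected_pattern, measures, 0.0)
infit_b, outfit_b = item_fit(surprising_pattern, measures, 0.0)
print(f"expected pattern:   infit {infit_a:.2f}, outfit {outfit_a:.2f}")
print(f"surprising pattern: infit {infit_b:.2f}, outfit {outfit_b:.2f}")
```

A single unexpected response from a person located far from the item inflates outfit much more than infit, which is why outfit is described as sensitive to rare, extreme surprises.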
Items in a set may be rated using a common scale that ranges from devastating impairment (eg, rated as 3) to no impairment (eg, rated as 0). When items do not share a single rating scale (eg, a consistently defined 0 to 3 scale), a model is used in which each item is allowed to define a unique scale structure. While interpretation of such a model is cumbersome, this approach is necessary given the varying number of categories defined for each NIH Stroke Scale item (ranging from 3 to 5) and various definitions of the scale across items. Accordingly, we used what is called a partial credit approach. A desirable scale characteristic is that the average measure of impairment across all items should increase with each step on each individual item. Our analytic strategy was to maximize person separation (ie, the range of impairment represented in the sample) while minimizing problems with inconsistently used scale structures and poorly fitting items.
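A partial credit formulation lets each item carry its own set of step difficulties, so items with different numbers of categories can be combined in one measure. The sketch below shows how category probabilities follow from one item's step difficulties under the usual partial credit model; the step values are hypothetical and are not calibrations from this study.

```python
import math

def partial_credit_probabilities(person_measure, step_difficulties):
    """Category probabilities (categories 0..m) for one item whose m step
    difficulties are specific to that item (all values in logits)."""
    cumulative = [0.0]  # category 0 has an empty sum
    for step in step_difficulties:
        cumulative.append(cumulative[-1] + (person_measure - step))
    numerators = [math.exp(c) for c in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]

def expected_rating(person_measure, step_difficulties):
    probabilities = partial_credit_probabilities(person_measure, step_difficulties)
    return sum(k * p for k, p in enumerate(probabilities))

# Hypothetical items: one with three categories (two steps) and one with
# four categories (three steps).
three_category_item = [-0.5, 1.0]
four_category_item = [-1.0, 0.0, 1.5]

for measure in (-2.0, 0.0, 2.0):
    print(measure,
          round(expected_rating(measure, three_category_item), 2),
          round(expected_rating(measure, four_category_item), 2))
```

In a well-functioning scale structure the expected rating on every item rises with the person measure, which mirrors the desirable characteristic described in the text.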
| Results |
|---|
Rasch Analysis
The initial Rasch analysis of all 15 items (summarized in
row 1 of Table 2) yielded a person separation of 1.88
and an item separation of 15.33. While the item separation indicates
that the potential breadth of the measure was large (2.0 indicates
adequate spread), the items were able to distinguish slightly fewer
than three strata of rehabilitation patients (person separation, 1.88).
This initial analysis revealed problems with several items that
contributed to poor person separation. We used a strategy of rescoring
those items that had poor scale structure (ie, items with rating steps
that are associated inconsistently with better total scores)
and deleting the worst fitting items sequentially to improve the
measure. Reducing problems of fit with item or scale structures has the
effect of improving person separation by deleting the sources of error
(noise) and thus increasing precision (the ratio of signal to noise).
The most notable problem was with limb ataxia; better scores on this
item were associated with poorer scores on the remaining items. The
large item infit (1.81) and outfit (8.43) revealed a considerable
number of unexpected responses in ataxia ratings. The negative
correlation (-.22) between the item and the total score also
illustrates this problem. The directions to score ataxia as
"normal" (0) in patients with hemiplegia who were rated as more
impaired on other items account for this problem.
|
Deletion of limb ataxia (row 2 in Table 2) yielded an improved person
separation (2.04) and item separation of 16.79. The language item had a
confusing step structure in that the two intermediate steps (mild to
moderate versus severe aphasia) were used inconsistently;
patients rated as having severe aphasia had total measures that
indicated less impairment overall than did patients with mild to
moderate aphasia. Consequently, we rescored the item by combining the
middle two levels of this item. The results of rescoring this item are
summarized in row 3 of Table 2. The separation statistics were
essentially unchanged, and pupillary response still was the poorest
fitting item based on an item outfit of 2.58. Deletion of pupillary
response (shown in row 4 of Table 2) yielded a 13-item solution with
slightly improved person separation (2.05).
Final Statistics
Table 3 shows the item fit statistics for the 13
retained items in order of increasing difficulty. All items had
excellent fit statistics. The "noisiest" item based on infit was
best visual function; its infit of 1.12 means that it contains 12%
more information than does the average item. In contrast, the best
motor arm item with an infit of .85 provides redundant information
because it contains only 85% of the expected amount of information.
Inclusion of two similar items (best motor arm and leg) probably
accounts for this finding. Several items with large outfits remained.
However, these items tended to be relatively difficult (best gaze, best
visual function) or easy (LOC questions). Relatively difficult or easy
items with large outfits are of less concern because they provide
diagnostic information about individual patients.
|
Not shown in Table 3 was the finding that all items have increasing
average measures across the scale steps and increasing step measures.
Another index of good step structure is the ratio of the observed to
expected outfit at each step for each item; a ratio of 1.0 is
desirable, with values greater than 1.6 providing cause for concern.
The worst ratio of observed to expected fit at any step for any item
was 3.27 for best visual at step 2. While relatively few patients were
rated with bilateral hemianopsia, some of them performed unexpectedly
well on the other items, revealing that visual function can be
disrupted in isolation from other functions.
Table 3 shows that LOC was the rarest characteristic of stroke
rehabilitation patients (with a difficulty of 2.73 logits, only the
most impaired patients showed impairment on it), followed by best gaze,
best visual, and best language. The item that reflected the most common
characteristic of stroke patients was facial palsy (with a difficulty
of -1.19 logits, even patients with the least impairment were
affected), followed by plantar reflex, dysarthria, LOC questions, best
motor arm, sensory, and LOC commands. Fig 1 illustrates
the range of person impairment and item difficulties for these 13
items. The left-hand column shows the distribution of patients'
measures (under the "Persons" heading); the right-hand column
shows the distribution of item difficulties. The mean±SD person
measure of -1.84±1.45 logits is considerably below the mean item
difficulty (fixed at 0.00 logits), indicating that the items are
targeted above the average impairment level of this sample, that is, to
a more impaired sample of patients. Although no patients were scored as
completely impaired by these 13 items, 25 patients were scored as
having no impairment. The capacity of the items to reflect more severe
impairment may be useful in acute medical settings but not in a
rehabilitation setting.
Person Fit Analysis
We investigated the nature of misfitting ratings of patient
impairments by examining relationships between patient outfits (outfit
selected because it is more sensitive to anomalous patterns than infit)
and stroke characteristics. Multivariate ANOVA was used
to examine differences in person outfits at admission and discharge
(time period) for four stroke categories (intracerebral hemorrhage,
subarachnoid hemorrhage, thrombotic stroke, and embolic stroke) and for
patients with left- and right-sided strokes. Significant main effects
were found for stroke category (F[3,371]=4.92, P<.01) and time period
(F[1,371]=6.14, P<.02); the interaction between stroke
category and side approached significance (F[3,371]=2.26,
P=.08). Fig 2 shows that patients with left
intracerebral and subarachnoid
hemorrhage strokes had larger outfits (noisier measures) than
did patients with thrombotic or embolic strokes, that admission outfits
generally were greater than discharge outfits, and that patients with
left hemorrhagic strokes tended to have larger outfits than did
patients with right hemorrhagic strokes. This item set reflects the
idiosyncratic ways in which hemorrhagic stroke impairments are
manifested. "Noisier" measurement of impairment occurs at
admission among patients with hemorrhagic strokes, and particularly
left hemorrhagic strokes.
Improvement in Impairment
Admission and discharge NIH Stroke Scale measures for the 598
patients with ratings at both time points were correlated significantly
(r=.82, P<.001); on average, patients had a
statistically significant reduction in impairment by discharge
(admission mean, -1.65 logits; discharge mean, -2.20 logits;
t[df=597]=14.5, P<.001).
Clinical Applications
The conversion between raw scores and linear measures for patients
with complete items is listed in Table 4; a plot of the
raw scores against the linear measure would reveal an ogival
(S-shaped) relationship in which the raw scores and interval
measures are related linearly in the mid range of values even though
the relationship becomes curvilinear toward the top and bottom ends of
the range. This curve illustrates the mathematically necessary
relationship between the finite range of impairment measured by the raw
scores and the infinite range of impairment implied by the interval
measure. In the middle ranges of impairment, the raw NIH Stroke Scale
scores provide a reasonably linear estimate of impairment. The
consequence of using raw scores for patients with very low ratings is
to overestimate their actual impairment and for patients with very high
ratings, to underestimate their actual impairment.
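The S-shaped relationship between raw scores and interval measures can be reproduced with a short sketch. It uses dichotomous items and evenly spaced, hypothetical item difficulties as a simplification; it does not use the calibrations in Table 3 or the conversion in Table 4.

```python
import math

def expected_raw_score(measure, item_difficulties):
    """Expected raw score at a given measure: the sum of the modeled
    probabilities across items (dichotomous simplification)."""
    return sum(1.0 / (1.0 + math.exp(-(measure - d))) for d in item_difficulties)

# Hypothetical difficulties for 13 items, evenly spaced from -3 to +3 logits.
item_difficulties = [-3.0 + 0.5 * i for i in range(13)]

for measure in (-6, -4, -2, 0, 2, 4, 6):
    score = expected_raw_score(measure, item_difficulties)
    print(f"{measure:+d} logits -> expected raw score {score:5.2f}")

# The same 1-logit difference moves the raw score much more in the middle
# of the range than near the ceiling.
mid_gain = expected_raw_score(1.0, item_difficulties) - expected_raw_score(0.0, item_difficulties)
end_gain = expected_raw_score(6.0, item_difficulties) - expected_raw_score(5.0, item_difficulties)
print(f"gain from 0 to 1 logits: {mid_gain:.2f}; gain from 5 to 6 logits: {end_gain:.2f}")
```

Because the raw score is bounded by the number of items while the measure is unbounded, raw-score differences near the floor and ceiling understate differences in the underlying measure.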
|
Fig 3 provides a self-scoring key for the 13-item
measure. For a given patient, clinicians can circle responses to the 13
items and then mark a vertical line that passes through the midpoint of
the ratings; the point where this line intersects the horizontal axis
is the estimated measure for that person. In this hypothetical example,
the patient's average measure is about .5 logits, indicating moderate
impairment. Unusual responses should be immediately evident, giving
cause to reconsider the ratings or to explore further the
idiosyncrasies of impairment in a specific patient. The rating of no
sensory impairment fits poorly with the patient's overall level of
impairment and should be investigated further. Estimates of impairment
can be derived when fewer than 13 items are rated as well, although
with less precision.
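The graphical estimate read from a scoring key has a numerical counterpart: given item difficulties, a person measure can be estimated by maximum likelihood from whichever items were rated. The sketch below assumes dichotomous items, hypothetical difficulties, and a simple Newton-Raphson update; it is not the procedure used to build Fig 3. Unrated items are passed as None.

```python
import math

def estimate_measure(responses, item_difficulties, max_iterations=100):
    """Maximum likelihood person measure and standard error (logits).

    responses: 0/1 ratings, or None for items that were not rated.
    """
    rated = [(x, d) for x, d in zip(responses, item_difficulties) if x is not None]
    raw_score = sum(x for x, _ in rated)
    if raw_score == 0 or raw_score == len(rated):
        raise ValueError("extreme scores have no finite estimate")

    def probabilities(theta):
        return [1.0 / (1.0 + math.exp(-(theta - d))) for _, d in rated]

    theta = 0.0
    for _ in range(max_iterations):
        probs = probabilities(theta)
        information = sum(p * (1.0 - p) for p in probs)
        step = (raw_score - sum(probs)) / information
        step = max(-1.0, min(1.0, step))  # damp large jumps
        theta += step
        if abs(step) < 1e-8:
            break
    information = sum(p * (1.0 - p) for p in probabilities(theta))
    return theta, 1.0 / math.sqrt(information)

# Hypothetical item difficulties and ratings.
difficulties = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
theta_full, se_full = estimate_measure([1, 1, 1, 0, 0, 0], difficulties)
theta_part, se_part = estimate_measure([1, 1, None, None, 0, 0], difficulties)
print(f"all items rated:  {theta_full:+.2f} logits (SE {se_full:.2f})")
print(f"two items unrated: {theta_part:+.2f} logits (SE {se_part:.2f})")
```

In this example both estimates land at the same point, but the standard error is larger when fewer items are rated, which is the loss of precision noted in the text.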
| Discussion |
|---|
Improvements to the measure could be made by revising the directions for ataxia; providing a "not testable" option would eliminate confusion between hemiplegia resulting in untestability and unimpaired walking. Aphasia assessments may have been difficult to make given the presence of tracheostomy in some patients, particularly those with reduced LOC; provision of an "untestable" option may also be useful for this item. The need to rescore the aphasia rating may reflect patient selection criteria for rehabilitation. Patients with less motor disability and severe aphasia are often admitted to rehabilitation, whereas patients with severe motor disability and severe aphasia are often poor rehabilitation candidates. While the potential range of the measure is large, this sample was arrayed across a relatively small range of the items. In contrast, our work with the Functional Independence Measure instrument,10 an ordinal measure of disability, used with a similar sample, produced a measure that differentiated patients on a wide range of tasks and yielded a person separation of 3.74 for the motor items and 2.17 for the cognitive items. The items may be targeted better in patients in acute medical settings because the average item severity was targeted above the average patient measure of this sample. In contrast, relatively little variance may be seen in samples comprising patients drawn from outpatient and community settings. Additional items that reflect more subtle impairment for patients during medical rehabilitation would better target the scale in this setting. Separate ratings for sitting and standing balance might help distinguish impairment severity better. Finally, the utility of this revision in acute medical settings should be evaluated also.
Measures for patients with hemorrhagic strokes had poorer fits than did measures for patients with thrombotic or embolic stroke; a tendency for patients with left-sided hemorrhagic strokes to have poorer fit was also found. Patients with hemorrhage are of two types: intracerebral and subarachnoid. These are quite distinct conditions and are likely to influence person fit in a heterogeneous sample such as this. Patients with left-sided strokes, especially hemorrhagic strokes, typically perform poorly on the LOC questions and commands items, which reflect LOC, language, or both. Hemorrhagic stroke causes variable impairments, unlike the predictable impairments that follow vascular anatomy in ischemic stroke. It is also difficult to assess sensation in patients with aphasia. Reasons for these differences need to be explored further.
The NIH Stroke Scale is used widely to evaluate neurological change in pharmaceutical trials. Use of an interval measure with known reliability derived from this scale should provide greater sensitivity for these studies. Rehabilitation studies that wish to distinguish functional recovery and handicap reduction attributable to clinical interventions from neurological recovery will also benefit from this linear measure of impairment.
Limitations of this study include selection of patients from only one rehabilitation hospital and ratings provided by a small number of physician raters. It is possible that a narrower or broader range of impairment is seen in patients referred for medical rehabilitation in other settings. Additional training of raters might enhance item fit. However, the extent of training provided to physicians in this study is apt to reflect real world situations, thus enhancing the generalizability of these findings.
In summary, clinicians and researchers now have a valid method of describing severity of impairment found in patients undergoing stroke rehabilitation. Such a measure complements the available variety of linear disability (Functional Independence Measure,10 Patient Evaluation Conference System,11 LORS III12) and handicap measures (Craig Hospital Assessment and Reporting Technique13), thus realizing the assessment of the WHO model of impairment, disability, and handicap.
| Footnotes |
|---|
An earlier version was presented at the Mid-Western Educational Research Association Meeting, Chicago, Ill, October 13, 1995.
Received August 7, 1996; revision received February 18, 1997; accepted March 14, 1997.
| References |
|---|
2. Goldstein LB, Bertels C, Davis JN. Inter-rater reliability of the NIH stroke scale. Arch Neurol. 1989;46:660-662.
3. Rothrock JF, Clark WM, Lyden PD. Spontaneous early improvement following ischemic stroke. Stroke. 1995;26:1358-1360.
4. Lyden P, Brott T, Tilley B, Welch KMA, Mascha EJ, Levine S, Haley EC, Grotta J, Marler J, for the National Institute of Neurological Disorders and Stroke TPA Stroke Study Group. Improved reliability of the NIH Stroke Scale using video training. Stroke. 1994;25:2220-2226.
5. World Health Organization. Classification of impairments, disabilities, and handicaps. Geneva, Switzerland: World Health Organization; 1980.
6. US Department of Health and Human Services. Post-Stroke Rehabilitation. AHCPR publication 95-0662, May 1995.
7. Rasch G. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen, Denmark: Danmarks Paedagogiske Institut; 1960 (Chicago, Ill: University of Chicago Press; 1980).
8. Wright BD, Masters G. Rating scale analysis: Rasch measurement. Chicago, Ill: MESA Press; 1982.
9. Linacre JM. BIGSTEPS for PC Compatibles. Chicago, Ill: Mesa Press; 1995.
10. Linacre JM, Heinemann AW, Wright BD, Granger C, Hamilton BB. The structure and stability of the Functional Independence Measure. Arch Phys Med Rehabil. 1994;75:127-132.
11. Fisher WP, Harvey RF, Taylor P, Kilgore KM, Kelly CK. Rehabits: a common language of functional assessment. Arch Phys Med Rehabil. 1995;76:113-122.
12. Velozo CA, Magalhaes LC, Pan AW, Leiter P. Functional scale discrimination at admission and discharge: Rasch analysis of the Level of Rehabilitation Scale III. Arch Phys Med Rehabil. 1995;76:705-712.
13. Whiteneck GG, Charlifue SW, Gerhart KA, Richardson GN. Quantifying handicap: a new measure of long-term rehabilitation outcomes. Arch Phys Med Rehabil. 1992;73:519-526.