(Stroke. 2009;40:2507.)
© 2009 American Heart Association, Inc.
Original Contributions
From the Departments of Neurosciences (P.L., R.R.) and Family and Preventive Medicine (R.R., L.L.), University of California–San Diego School of Medicine, San Diego, Calif; the Department of Neurology (P.L.), Veterans Administration Medical Center, San Diego, Calif; and the National Institute of Neurological Disorders and Stroke (M.E., M.W., J.M.), Bethesda, Md.
Correspondence to Patrick Lyden, UCSD Stroke Center, OPC Third Floor, Suite #3, 200 W Arbor Drive, San Diego CA 92103. E-mail plyden{at}ucsd.edu
| Abstract |
|---|
Methods: We sought to measure interrater reliability of the certification DVD among general users using methodology previously published for the DVD. All raters who used the DVD certification through the American Heart Association web site were included in this study. Each rater evaluated one of 3 certification groups.
Results: Responses were received from 8214 raters overall, 7419 raters using the Internet and 795 raters using other venues. Among raters from other venues, 33% of all responses came from registered nurses, 23% from emergency department MD/other emergency department/other physicians, and 44% from neurologists. Half (51%) of raters were previously National Institutes of Health Stroke Scale-certified and 93% were from the United States/Canada. Item responses were tabulated, scoring performed as previously published, and agreement measured with unweighted kappa coefficients for individual items and an intraclass correlation coefficient for the overall score. In addition, agreement in this study was compared with the agreement obtained in the original DVD validation study to determine if there were differences between novice and experienced users. Kappas ranged from 0.15 (ataxia) to 0.81 (Item 1c, Level of Consciousness-commands [LOCC] questions). Of 15 items, 2 showed poor, 11 moderate, and 2 excellent agreement based on kappa scores. Agreement was slightly lower than that obtained from expert users for LOCC, best gaze, visual fields, facial weakness, motor left arm, motor right arm, and sensory loss. The intraclass correlation coefficient for total score was 0.85 (95% CI, 0.72 to 0.90). Reliability scores were similar among specialists and there were no major differences between nurses and physicians, although scores tended to be lower for neurologists and trended higher among raters not previously certified. Scores were similar across various certification settings.
Conclusions: The data suggest that certification using the National Institute of Neurological Disorders and Stroke DVDs is robust and surprisingly reliable for National Institutes of Health Stroke Scale certification across multiple venues.
Key Words: clinimetrics; reliability; scales; stroke
| Introduction |
|---|
| Methods |
|---|
We obtained certification scores from users in the following venues: single user (home or desktop), small groups, large groups, and a web site. Single users took the DVD home or to an office, watched the training video, and then watched the certification video cases. Small group certifications occurred at single sites where the training video was shown and then no more than 12 users watched the certification video and marked score sheets individually. Large group certifications occurred at meetings of trial investigators participating in a variety of clinical trials; the training video was shown and then certification patients were shown. In the large group settings, each user marked their own score sheet without discussion among other users. From all venues, score sheets were faxed to the University of California–San Diego Stroke Clinical Trial Coordinating Center for scoring using the published algorithm.7 The training/certification web site is sponsored by the American Heart Association. Users were encouraged to watch the training video over the Internet before certifying on one of the 3 certification groups; scores were recorded on the web site and then raw data were transmitted to the University of California–San Diego.
Descriptive analysis was performed on all data in the data set. The number of raters who certified using this DVD was tabulated by setting (individual, small group, investigator meeting, and web site) as well as specialty (RN, emergency department MD, neurology, other emergency department, other), prior certification status (yes, no), and country (US/Canada, others), if collected. Summaries of the individual item score as well as the total NIHSS were generated.
Reliability was assessed for the individual items of the NIHSS as well as the overall score. Scores of the individual items were tabulated. Agreement for the individual items among raters was assessed using the unweighted kappa statistic (κ) for multiple raters9 with a 95% CI obtained using the bootstrap resampling technique with 1000 replicates. The methods used here are similar to the methods used in the original DVD validation study to allow comparison between the 2 studies.8 In this study, the bootstrap technique was used instead of the jackknife technique because there are several instances when the jackknife technique was not appropriate.10 Agreement between this study and the original DVD study was considered to be statistically different if the estimated κ in the original study did not fall into the 95% CI for κ in this study. Using similar methods, reliability of the individual items was assessed separately for the subgroups of patients by setting as well as specialty, certification status, and country, if available. Comparison of κ statistics across subgroups was done using the bootstrap technique for correlated data.11 Ninety-five percent CIs for differences in κ between 2 subgroups were calculated. The Bonferroni correction was used to adjust for multiple comparisons within each subgroup comparison. In addition, the scatterplot of the item scores for each subject was used to visually compare and confirm the reliability graphically and the consistency of item score by group.
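As an illustration only, the item-level procedure described above (unweighted κ for multiple raters, a bootstrap 95% CI from 1000 replicates, and the rule that a reference κ outside that CI counts as a difference) might be sketched as follows. This is not the study's code. The function names and the demonstration scores are hypothetical. The sketch assumes a balanced design in which every patient is scored by the same number of raters, and it assumes that patients are the resampling unit with a percentile interval; the text specifies neither the resampling unit nor the interval type.

```python
import math
import random
from collections import Counter


def fleiss_kappa(ratings):
    """Unweighted kappa for multiple raters (Fleiss).

    ratings: one list per patient holding the category each rater assigned.
    Assumes every patient has the same number of ratings (balanced design).
    """
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each patient needs the same number (>= 2) of ratings")
    counts = [Counter(row) for row in ratings]
    # Observed agreement: share of agreeing rater pairs, averaged over patients.
    p_obs = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / len(ratings)
    # Chance agreement from the pooled category proportions.
    pooled = Counter()
    for cnt in counts:
        pooled.update(cnt)
    total = len(ratings) * n_raters
    p_exp = sum((c / total) ** 2 for c in pooled.values())
    if math.isclose(p_exp, 1.0):
        return math.nan  # only one category was used; kappa is undefined
    return (p_obs - p_exp) / (1.0 - p_exp)


def bootstrap_kappa_ci(ratings, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling patients (an assumption)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(ratings) for _ in ratings]
        k = fleiss_kappa(sample)
        if not math.isnan(k):
            stats.append(k)
    if not stats:
        raise ValueError("kappa was undefined in every bootstrap replicate")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[max(int((1 - alpha / 2) * len(stats)) - 1, 0)]
    return lower, upper


def differs_from_reference(ci, reference_kappa):
    """Decision rule from the text: different if the reference is outside the CI."""
    lower, upper = ci
    return not (lower <= reference_kappa <= upper)


# Hypothetical scores for one item: 6 patients, each rated by 5 raters.
item_scores = [
    [0, 0, 0, 0, 1],
    [1, 1, 1, 2, 1],
    [2, 2, 2, 2, 2],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [2, 1, 2, 2, 2],
]
kappa = fleiss_kappa(item_scores)
ci = bootstrap_kappa_ci(item_scores)
print(f"kappa = {kappa:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The study data were unbalanced across settings, so a real analysis would need a form of κ that allows different numbers of raters per patient; the balanced version is used here only to keep the sketch short.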
Agreement on the overall total NIHSS was assessed with an intraclass correlation coefficient (ICC) obtained using a one-way random effects model for repeated measurements with continuous outcomes (with ratings nested within patients).12 The bootstrap resampling technique was used to obtain 95% CIs for the ICC. There are 2 comparisons that are of interest in this study: (1) ICC in the current study with that obtained in the DVD validation study; and (2) ICC in this study among the subgroups. The first was assessed by determining if the 95% CI for the ICC in this study contained the ICC from the DVD validation study. If it did, there was no evidence to indicate a difference in ICC between the 2 studies. ICCs in the present study were compared between subgroups for setting, specialty, prior certification status, and country by calculating the 95% CI for the difference in ICC for correlated data between 2 subgroups. If zero is included in the CI, there is no evidence to indicate a difference. To compare ICC among the 3 groups of patients (A, B, and C), Fisher's Z transformation for comparison of independent ICCs was used.1 In both instances, the Bonferroni correction was applied to adjust for multiple comparisons. Similar to item score, the scatterplot of the total NIHSS for each subject was used to visualize the variability of scores by subgroups.
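The ICC step can be sketched in the same hedged way. The example below computes a one-way random effects ICC from the between-patient and within-patient mean squares, takes a percentile bootstrap CI, and checks whether a reference ICC lies inside that interval. The function names and the total scores are hypothetical. The sketch assumes a balanced design with the same number of ratings per patient, and it assumes resampling of patients, which the text does not specify. It does not implement the Fisher Z comparison or the Bonferroni adjustment.

```python
import math
import random


def icc_oneway(scores):
    """One-way random effects ICC for single ratings.

    scores: one list per patient holding the total score from each rater.
    Assumes every patient has the same number of ratings (balanced design).
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 patients, each with the same number (>= 2) of ratings")
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    means = [sum(row) / k for row in scores]
    # Between-patient and within-patient mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, means) for x in row
    ) / (n * (k - 1))
    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        return math.nan  # no variation at all; ICC is undefined
    return (ms_between - ms_within) / denom


def bootstrap_icc_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the ICC, resampling patients (an assumption)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(scores) for _ in scores]
        value = icc_oneway(sample)
        if not math.isnan(value):
            stats.append(value)
    if not stats:
        raise ValueError("ICC was undefined in every bootstrap replicate")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[max(int((1 - alpha / 2) * len(stats)) - 1, 0)]
    return lower, upper


def ci_contains(ci, reference_icc):
    """True when the reference ICC lies inside the CI (no evidence of a difference)."""
    return ci[0] <= reference_icc <= ci[1]


# Hypothetical total NIHSS scores: 6 patients, each rated by 4 raters.
total_scores = [
    [2, 3, 2, 2],
    [7, 8, 7, 9],
    [15, 14, 16, 15],
    [0, 0, 1, 0],
    [22, 20, 23, 21],
    [5, 6, 5, 4],
]
icc = icc_oneway(total_scores)
icc_ci = bootstrap_icc_ci(total_scores)
print(f"ICC = {icc:.2f}, 95% CI = ({icc_ci[0]:.2f}, {icc_ci[1]:.2f})")
```

With unbalanced data, as in the study, the same ICC would normally be estimated from the variance components of a fitted one-way random effects model rather than from these closed-form mean squares.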
To assess the mean effect of the covariates on the total NIHSS, a random intercept mixed effects regression model was fit to the data.
| Results |
|---|
coefficients for individual items and an ICC for the overall score. Table 1 indicates the range of values obtained on each item over all 18 patients. The mean NIHSS total score was 8.0±6.6 (median, 7; range, 0 to 41). The spread of responses in individual items and total scores appeared similar among the subgroups, namely, sites, specialties, and prior NIHSS certification status.
Table 2 compares the agreement obtained using the unweighted κ from the current data set with that of the original DVD study.1 The agreements ranged from 0.15 (ataxia) to 0.81 (Item 1c, Level of Consciousness-commands [LOCC]) using the current data set. The agreements obtained from this group of raters were similar to those of the original DVD study on all items of the NIHSS except for 7 items with lower agreement (LOCC, best gaze, visual fields, facial weakness, motor left arm, motor right arm, and sensory loss).
Among all 18 certification patients, the agreement was similar across all subgroups and among all venues. Results were remarkably similar to the results in the original DVD validation study except for some small inconsistent differences across certain subgroups (data not shown). Agreement in 4 fields (LOCQ, LOCC, visual fields, and motor left leg) was higher in other countries compared with the United States/Canada. Among specialties, emergency department MDs had higher agreement in motor right leg compared with nurses; in LOCC, motor right leg, and sensory loss compared with neurologists; and in motor left leg and motor right leg compared with other specialties. Nurses showed greater agreement in dysarthria compared with neurologists and in motor left arm and motor left leg compared with other specialties. Agreement in LOCQ was higher in noncertified raters than in certified raters. Comparing venues, individual users showed higher agreement in extinction/neglect compared with the large group setting and higher agreement in visual fields and motor left arm compared with web users; in the large group setting, scores showed lower agreement in extinction/neglect compared with the web setting; the small group setting showed higher agreement in motor left arm than web users. There was no significant difference in agreement across the 3 certification groups.
Table 3 lists the intraclass correlation coefficient for the overall total NIHSS score and total NIHSS by subgroup. There continues to be very good agreement in the total NIHSS score across all venues and subgroups (overall ICC of 0.85; 95% CI, 0.72 to 0.90). There were no statistically significant differences in mean NIHSS scores by country or prior NIHSS certification status. There was a statistically significant interaction between specialty and setting in mean NIHSS scores (P=0.046); however, there were no clinically significant differences. Although there were slight differences in ICC across covariates, in all cases, the agreement still remained very high. Agreement was lower among raters from the United States/Canada compared with the raters from other countries. The ICC was slightly lower among neurologists compared with the nurses, emergency department MDs, other MDs, and other physicians. Similarly, the raters with prior certification had slightly lower agreement than those who were not certified previously. The ICC was slightly lower in the small group setting compared with the individual, investigator meeting, or web settings. The ICCs for certification Groups A and B were slightly lower than that for Group C.
| Discussion |
|---|
We found no differences in the ICC of the total NIHSS when the DVD was used by neurologists, emergency department physicians, and nurses, suggesting that the NIHSS may be appropriate for use in clinical research trials as well as in daily communication among healthcare providers. Agreement among those identifying themselves as neurologists was slightly lower than individuals identifying themselves as registered nurses, emergency department/other MDs, or other specialties, but the results were statistically similar and generally excellent. Agreement across various settings was similar and generally moderate to excellent.
The DVD format has some advantages over videotape. The digital images can be loaded onto a web site, and the American Heart Association successfully implemented a web-based training campus using our images. This web site allows raters to view the training and certification patient videos online. The DVD technology is more widely available now than videotapes, so NIHSS certification should be possible for many more years, even if videotapes become obsolete.
This study contains certain limitations, the most important of which is that most of the raters were from the United States and Canada. We were able to determine that the scoring sheet works well for novice as well as experienced users in North America. However, these scores may not be generally applicable for non-English-speakers or raters in other countries. Therefore, we continue to collect scores from the web site to determine if the same scoring sheet generally works well outside of North America. Another inherent limitation is that video technology is a poor substitute for direct examination. In the absence of widespread proctored certification, however, no other option is available. Video certification is now widely used in many disciplines with reasonable validity and reliability.2 It is likely that web-based video training and certification will become more widespread, because the cost efficiencies are significant. Finally, the web site does not require viewing of the training video before attempted certification, so an unknown number of novice users could have tried to certify without proper training.
Due to the unbalanced group sizes, small cells for item scores, and a crossed study design, we did not use weighted κ statistics. Unweighted κ scores may underestimate agreement, yet in this study, the unweighted κ scores were comparable to the unweighted scores obtained in the primary DVD study and the weighted scores obtained in previous videotape studies. Therefore, the agreement among the viewers was at least as good and likely better than that seen previously with the videotapes. Agreement using the DVD continues to be surprisingly good and consistent among experienced as well as novice users.
| Acknowledgments |
|---|
Sources of Funding
This work was supported by National Institute of Neurological Disorders and Stroke P50 NS044148 and the Veterans Affairs Medical Research Service.
Disclosures
None.
Received July 23, 2008; accepted August 8, 2008.
| References |
|---|
2. Mohammad YM, Divani AA, Jradi H, Hussein HM, Hoonjan A, Qureshi AI. Primary stroke center: basic components and recommendations. South Med J. 2006; 99: 749–752.
3. Lyden P, Lu M, Jackson C, Marler J, Kothari R, Brott T, Zivin J. Underlying structure of the National Institutes of Health Stroke Scale: results of a factor analysis. NINDS tPA Stroke Trial Investigators. Stroke. 1999; 30: 2347–2354.
4. Goldstein L, Samsa G. Reliability of the National Institutes of Health Stroke Scale. Stroke. 1997; 28: 307.
5. Goldstein LB, Bartels C, Davis JN. Interrater reliability of the NIH Stroke Scale. Arch Neurol. 1989; 46: 660.
6. Albanese MA, Clarke WR, Adams HP Jr, Woolson RF. Ensuring reliability of outcome measures on multicenter clinical trials of treatments for acute ischemic stroke: the program developed for the Trial of ORG 10172 in Acute Stroke treatment (TOAST). Stroke. 1994; 25: 1746.
7. Lyden P, Brott T, Tilley B, Welch KM, Mascha EJ, Levine S, Haley HC, Grotta J, Marler J. Improved reliability of the NIH Stroke Scale using video training. NINDS tPA Stroke Study Group. Stroke. 1994; 25: 2220–2226.
8. Lyden P, Raman R, Liu L, Grotta J, Broderick J, Olson S, Shaw S, Spilker S, Meyer B, Emr M, Warren M, Marler J. NIHSS training and certification using a new digital video disk is reliable. Stroke. 2005; 36: 2446–2449.
9. Fleiss JL. Statistical Methods for Rates and Proportions. New York: John Wiley and Sons; 1981.
10. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman & Hall/CRC; 1993: 436.
11. McKinzie DP, Mackinnon AJ, Peladeau N, Onghena P, Bruce PC, Clarke DM, Harrigan S, McGorry PD. Comparing correlated kappas by resampling: is one level of agreement significantly different from another? J Psychiatr Res. 1996; 30: 483.
12. Zar JH. Biostatistical Analysis. 4th ed. Princeton, NJ: Prentice Hall; 1999: 390–392.