Improving the Efficiency of Stroke Trials
Feasibility and Efficacy of Group Adjudication of Functional End Points
Background and Purpose—Use of the modified Rankin scale (mRS) in multicenter trials may be limited by interobserver variability. We assessed the effect of this variability on trial power and developed a novel group adjudication approach.
Methods—We generated power and sample size estimates from simulated trials modeled with varying mRS reliability. We conducted a virtual acute stroke trial across 14 UK sites to develop a group adjudication approach. Traditional mRS interviews, performed at local sites, were digitally recorded and scored by adjudication committee. We assessed the effect of translation by comparing scores in translated mRS interviews, originally conducted in English and Mandarin. Agreement was measured using κ and weighted κ (κw) statistics and intraclass correlation coefficient.
Results—Statistical simulations suggest that improving mRS reliability from κ=0.25 to κ=0.5 or 0.7 may allow reductions in sample size of n=386 or 490 in a typical n=2000 study. Our virtual acute stroke trial included 370 participants and 563 mRS video assessments. We adjudicated mRS in 538 of 563 (96%) study visits. At 30 and 90 days, 161 of 280 (57.5%) and 131 of 258 (50.8%) clips showed interobserver disagreement. Agreement within the adjudication committee was good (30-day κw=0.85 [95% confidence interval, 0.81–0.86]; 90-day κw=0.86 [95% confidence interval, 0.82–0.88]) without significant or systematic bias in mRS scoring compared with the local mRS. Interobserver reliability of translated mRS assessments was similar to native language clips (native [n=69] κw=0.91 [95% confidence interval, 0.94–0.99]; translated [n=89] κw=0.90 [95% confidence interval, 0.83–0.96]).
Conclusions—Achievable improvements in interobserver reliability may substantially reduce study sample size, with associated financial benefits. Central adjudication of mRS assessments is feasible (including across international centers), valid and reliable despite the challenges of mRS assessment in large clinical trials.
Clinical trials in acute stroke are large and expensive, even with recent innovations in trial design.1 The modified Rankin scale (mRS) is the most commonly used outcome measure in stroke research.2 The mRS is an ordinal scale with 7 categories, ranging from full recovery, through increasing degrees of disability, to death.3 Typically, mRS assessment is based on a clinician’s rating of a patient interview, and interobserver variability is common.4 Meta-analysis suggests an overall reliability of κ=0.62 (κw=0.9),4 but this may be lower (κ=0.25) in multicenter studies.5 Mandatory training in mRS assessment is used in most trials to mitigate this,6 but the problem persists. The end point misclassification inherent in this interobserver variability may affect trial power7 and treatment effect size.8
Central adjudication of trial end points is routinely used in a variety of settings but has rarely been used in stroke. Group adjudication of mRS has been based on review of written summaries9 or telephone interview.10,11 Advantages of a remote adjudication approach include the following: expert opinion and experience; quality control of mRS interviews; and avoidance of potential bias where local observers are not, or cannot be, blinded to treatment allocation. To date, remote functional assessment has been limited by the difficulty of capturing a high-fidelity recording suitable for off-line review. Furthermore, most trials are international, and culturally sensitive translation may also prove problematic. Our pilot data suggest that a video-based mRS assessment is a valid and reliable solution.12
We present data exploring 3 complementary themes: (1) We describe the effect of varying magnitudes of mRS interobserver variability on required study sample size using statistical modeling techniques (henceforth referred to as sample size simulations). (2) We describe the feasibility and reliability of central adjudication of mRS in a clinical trial setting (Central Adjudication of modified Rankin scale assessments in acute Stroke trials [CARS]). (3) We describe the feasibility and reliability of central adjudication of translated mRS interviews (translation substudy).
Sample Size Simulations
We performed simulations to demonstrate the effect of increasing mRS reliability and using multiple observers to assign mRS scores. We generated power estimates from simulated mRS studies under various combinations of sample size (N), effect size (δ), reliability (unweighted κ and quadratically weighted κw), adjudication panel size (Nadj), and methods of summarizing mRS across adjudicators (mode, mean, and median; Appendix B in the online-only Data Supplement).
Effect of Increasing mRS Reliability
Effect of Using Multiple Scores
To assess the benefit of multiple mRS assessments, we performed simulations using summary statistics (mode/mean/median) of Nadj=1, 2, 4, and 9.
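The two simulation ideas above, degrading single-rater reliability and averaging scores across an adjudication panel, can be sketched in a small Monte Carlo experiment. This is an illustrative sketch only: the latent disability distribution, rater error magnitudes, arm sizes, and test statistic below are assumptions for demonstration, not the study's actual simulation settings.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

def observed_mrs(latent, rater_sd, n_adj, rng):
    """Mean of n_adj raters' scores; each rater adds independent Gaussian
    error (rater_sd) to the latent disability, then rounds and clips to
    the 0-6 mRS range. rater_sd=0 approximates a perfectly reliable rater."""
    ratings = latent[None, :] + rng.normal(0.0, rater_sd, (n_adj, latent.size))
    return np.clip(np.round(ratings), 0, 6).mean(axis=0)

def power(delta, rater_sd, n_adj=1, n_per_arm=150, n_sims=400, alpha=0.05):
    """Fraction of simulated trials with a significant rank-sum test."""
    hits = 0
    for _ in range(n_sims):
        ctrl = rng.normal(3.0, 1.5, n_per_arm)           # latent disability
        trt = rng.normal(3.0 - delta, 1.5, n_per_arm)    # treatment shifts it down
        p = mannwhitneyu(observed_mrs(trt, rater_sd, n_adj, rng),
                         observed_mrs(ctrl, rater_sd, n_adj, rng)).pvalue
        hits += p < alpha
    return hits / n_sims

p_reliable = power(delta=0.5, rater_sd=0.0)         # near-perfect raters
p_noisy = power(delta=0.5, rater_sd=1.5)            # substantial rater error
p_panel = power(delta=0.5, rater_sd=1.5, n_adj=4)   # mean of 4 noisy raters
print(p_noisy, p_panel, p_reliable)
```

Under these assumptions, rater noise attenuates power at a fixed sample size, and averaging a panel of noisy raters recovers part of the loss, which is the qualitative pattern the simulations above set out to quantify.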
We performed a virtual multicenter acute stroke trial to assess the reliability and feasibility of central adjudication of locally recorded mRS assessments in a multicenter trial setting. We enrolled patients within 48 hours of stroke onset, who had a demonstrable deficit on the National Institutes of Health Stroke scale. Our sole exclusion criterion was a premorbid mRS of >3. Ethical approval for all study procedures was granted by Scotland A Research Ethics Committee (08/MRE00/72) and Essex2 Research Ethics Committee (08/H0203/147). We collected written informed consent from the patients or their proxies.
Baseline clinical, demographic, imaging, and laboratory data were collected at the time of recruitment (Appendix C in the online-only Data Supplement). Patients were reviewed at 30 and 90 days with mRS and National Institutes of Health Stroke scale assessments. All observers were trained and certified in the use of mRS.6 An electronic case report form (eCRF), held by the Robertson Center for Biostatistics, was used to enter all data.
Local mRS interviews were performed according to the normal practice of each center and were recorded using a digital video camera (initially a Canon high-definition video camera [Canon HF100] but later a Flip Mino camera). Using a desktop tripod, the video camera was positioned to capture the patient’s face and torso, and an external boundary microphone was used (unnecessary for the Flip camera). The local investigator assigned an mRS score and recorded this in the eCRF but did not reveal it on the video clip. The recorded assessments were uploaded via the eCRF.
Central Review of mRS Assessments
Uploaded clips were assessed for quality, anonymity, and blinding of locally assigned score by the trial outcomes manager. Windows Movie Maker software was used to remove any patient-identifying information. Thereafter, mRS assessments were distributed to 4 members of the end point committee for scoring. The end point committee was composed of experienced stroke clinicians from the coordinating center (3 professors, 2 senior lecturers, and 2 clinical research fellows). Each member reviewed the assessment and assigned a score, blinded to every other score. On completion of scoring, full agreement delivered a final score. Where ≥1 scores disagreed, the video was forwarded for committee review and discussion (Figure I in the online-only Data Supplement). Agreement among assessors was measured using κ statistics (κ/κw [Fleiss–Cohen weights]),14 intraclass correlation coefficient (ICC), and Bland–Altman plots.15 We assessed for temporal trends in the reliability to quantify any learning effect in the adjudication committee. We examined the degree to which each assessor was precise and accurate to exclude any systematic bias in scoring between raters (Appendix C in the online-only Data Supplement).
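As a minimal illustration of the agreement statistic used throughout, quadratically weighted κ for a pair of raters can be computed directly from their confusion matrix. This sketch assumes integer 0 to 6 mRS scores for two raters; a trial analysis would use a validated statistical package rather than this hand-rolled version.

```python
import numpy as np

def quadratic_weighted_kappa(r1, r2, n_cat=7):
    """Cohen's kappa with quadratic (Fleiss-Cohen) disagreement weights
    for two raters scoring the same clips on the 0-6 mRS."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    obs = np.zeros((n_cat, n_cat))
    for a, b in zip(r1, r2):
        obs[a, b] += 1            # observed joint distribution of scores
    obs /= obs.sum()
    # expected joint distribution under chance, from the raters' marginals
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    i, j = np.indices((n_cat, n_cat))
    w = (i - j) ** 2              # quadratic disagreement weights
    # note: degenerate if both raters use only a single identical category
    return 1.0 - (w * obs).sum() / (w * exp).sum()

# identical scores give perfect agreement (kappa = 1.0)
print(quadratic_weighted_kappa([0, 1, 2, 3, 4, 5, 6], [0, 1, 2, 3, 4, 5, 6]))
```

Quadratic weights penalize a two-grade disagreement four times as heavily as a one-grade disagreement, which is why κw in the Results is consistently higher than unweighted κ: most panel disagreements were only a single grade apart.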
To estimate the reliability that would be delivered by combining multiple ratings, we used the Spearman–Brown prediction formula.16 We used the observed reliability of a single panel member (ICC) to predict the likely improvement in reliability (ICC) with groups of ≤10 observers (for full details, refer to Appendix D in the online-only Data Supplement). We compared our observed reliability estimate from 2 observers with the predicted value.
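The Spearman–Brown prediction formula itself is simple; the sketch below applies it to the single-rater ICC of 0.87 observed at day 90 (reported in the Results) and reproduces the predicted panel reliabilities.

```python
def spearman_brown(icc_single, n_raters):
    """Predicted reliability of the mean of n_raters' scores, given the
    reliability (ICC) of a single rater: k*r / (1 + (k-1)*r)."""
    return n_raters * icc_single / (1 + (n_raters - 1) * icc_single)

# With the observed single-rater ICC of 0.87 at day 90:
print(round(spearman_brown(0.87, 2), 2))  # 0.93, matching the observed 2-rater ICC
print(round(spearman_brown(0.87, 4), 2))  # 0.96, the predicted 4-rater ICC
```

The formula's diminishing returns explain why the largest reliability gain comes from adding the second rater, with progressively smaller gains for each additional panel member.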
The effect of translation of mRS assessments was assessed in collaboration with the Department of Neurology, Peking University First Hospital, China. Trained and certified mRS assessors between 2 university hospitals (n=5 Glasgow and n=5 Beijing) scored mRS assessments in English and Mandarin, each working in his or her native language and after translation. In an initial sample (n=20), 2 versions of the translated transcript were prepared: 1 by an mRS-certified clinician and the other by a linguist with no medical background. These mRS interviews were scored twice using each transcript in turn ≥2 months apart. A larger sample of translated Mandarin clips was subsequently scored to assess interobserver variability.
We assessed the feasibility of incorporating translation into the central adjudication process. A sample of CARS clips was randomly selected (using R statistical software) for translation and rescoring. Trained and certified CARS investigators overdubbed English language summaries for mRS interviews they had not previously seen. Digital dictaphone audio files were merged with the original video file automatically via the eCRF. The new file was presented to be rescored by committee members. Interobserver variability was compared with that seen in scores from the original video files.
Power/Sample Size Simulations
We describe findings using data from an exemplar phase III randomized controlled trial (National Institute of Neurological Disorders and Stroke tissue-type plasminogen activator trial) and a simulated trial sample size of n=2000 (for full results, see Tables I and II in the online-only Data Supplement).
Effect of Increasing mRS Reliability
Improving reliability in mRS scoring from κ (κw) 0.25 (0.74) to 0.5 (0.92) and 0.7 (0.96) reduced sample size by n=386 and 490, respectively (Table 1).
Effect of Using Multiple Scores
Using mean scores from 2, 4, or 9 adjudicators reduced sample size by n=54, 172, and 318, respectively. Use of the mode or median score did not convey comparable benefits (Table 2).
We recruited 373 participants from 14 centers (Scotland: 6, England: 7, Wales: 1). Baseline demographic characteristics, stroke severity, and comorbidities were similar in patients who completed the study and in those who withdrew (Appendix E and Table III in the online-only Data Supplement). We completed 563 follow-up visits with uploaded interviews. Interview duration ranged from 1 to 24 minutes (mean [SD], 5.5 minutes [14 seconds]; Figure 1).
An adjudicated mRS score was possible in 538 (96%) assessments; editing for anonymity was required in 39 (7.2%). Early technical failures were responsible for the majority of missing adjudicated scores: technical problems occurred at a median of 159 study days (interquartile range, 111–221), against a median total study duration of 510 days (interquartile range, 458–558). Poor audio was encountered in 19 clips (3.4%), rendering scoring impractical in 7 (1.2%; Tables IV and V in the online-only Data Supplement). A repeat assessment was requested in 15 cases (2.7%). The adjudicating committee scored 538 clips, and disagreement from ≥1 of the 4 observers was noted in 161 of 280 (57.5%) and 131 of 258 (50.8%) clips at 30 and 90 days, respectively (Figure 2).
Agreement between local mRS and adjudicated mRS was good (Table 3; Figure IIa and IIb in the online-only Data Supplement). Agreement among panel members was κ=0.59 (95% confidence interval, 0.53–0.63) and κw=0.86 (95% confidence interval, 0.82–0.88) at 90 days (Table 4). Agreement was similar for clips scored early or late in the course of the study: κw=0.88 early versus κw=0.82 late (P=0.146).
The magnitude of disagreement between raters equates to small levels of variability on the mRS scale, typically less than one tenth of an mRS grade. We found that there was no systematic bias between panel members (Table VI and Figure III in the online-only Data Supplement).
Using the estimated reliability of a single panel member (ICC) at day 90 of 0.87 and the Spearman–Brown prediction formula, we derived the reliability of mRS with multiple raters. The observed reliability with 2 raters (ICC, 0.92) was similar to the predicted figure (ICC, 0.93). Increasing the number of raters to 4 predicted an increase in the reliability of ICC to 0.96 (Table VII in the online-only Data Supplement).
We assessed 69 clips in their original language (9 English and 60 Mandarin). Interobserver reliability for native language assessment (n=69) was good (κw=0.91). Translated mRS assessments (total n=89 and dual translation n=20) maintained good reliability (κw=0.90). Placing an mRS-trained clinician in the translation role had no demonstrable benefit for the reliability of translated mRS assessments: κw=0.91 with medical input (n=20) and κw=0.91 with linguist transcription (n=20).
To assess the feasibility of incorporating a translation step into the central adjudication model, we assessed 60 mRS clips; the duration ranged from 1.13 to 9.5 minutes (mean [SD], 4.5 minutes [10 seconds]). Overdubbing to mimic translation was undertaken by 6 investigators. All audio files were successfully uploaded and automatically merged with the video clip. Reliability in the modified clips (κw=0.85) was similar to that seen in the original files (κw=0.88; Table 4).
We found significant potential for reducing required trial sample size and increasing trial power by improving the reliability of mRS assessments. We also developed and assessed a system for central adjudication of mRS end points that was feasible and performed favorably compared with traditional mRS assessments. We report for the first time measures of mRS reliability from a multicenter, multinational, and multilingual study. This is important because previous studies assessing the interobserver reliability of the mRS have predominantly been conducted in small, single-center settings with highly motivated individuals.
Sample Size Simulations
Studies with a sample size larger than necessary are economically and ethically unjustified. Our simulations using real-life mRS distributions from previous phase IV randomized controlled trials suggest that modest improvements in interobserver reliability may have substantial effects on sample size. The use of multiple mRS assessments conveys similar benefits. Formally incorporating mRS reliability into sample size calculations would prove too complex; optimizing the reliability of end point assessment is therefore crucial. Strategies such as mRS training will have a beneficial effect; however, encouraged by the results of our modeling, we assessed the potential use of a novel group adjudication process for mRS.
We developed a system for central adjudication of mRS end points that was feasible and performed favorably compared with traditional mRS assessments. Acceptability and feasibility are suggested because recruitment exceeded our target of n=300, with prompt enrollment across several centers, including investigators with differing levels of experience in stroke research. The use of patient interview videos is familiar to most investigators completing mRS training and certification.6
We have demonstrated feasibility of low-cost video transfer with a high rate of technical success. The Internet eCRF allowed convenient and secure transfer of video files to the adjudication committee. Although there was heterogeneity in the length and content of the video clips, most were between 4 and 6 minutes long (mean [SD], 5.5 minutes [14 seconds]). This heterogeneity is a function of all traditional mRS interviews (face to face or video) and reflects the complexities of some stroke survivors’ disabilities. The original camera system mandated an external microphone and postinterview processing of video files before upload. The simpler camera system we used for later assessments proved more reliable and economical. Thus, most technical failures occurred early in the study. In the event of technical failure, had this been a real intervention trial, we would have used the local mRS score as a default. The adjudication committee was able to reach consensus in all but 3 mRS clips (0.5%).
Our data on the reliability of adjudicated mRS assessment require some consideration. Intrarater reliability has been reported previously4,17; we considered only inter-rater reliability in this study with a virtual trial design. In the context of a multicenter clinical trial in which end points will be assessed only once, interobserver variability is most relevant. Comparing local and remote mRS scores gives some measure of the inherent variation in scoring across a multicenter study. This interobserver variation was considerable (κ=0.48), albeit not as pronounced as in a previous smaller study (κ=0.25). Through the use of end point committee review, we hope to score the true mRS of the study participant. In the absence of a gold standard disability assessment, we are unable to assess this directly. We are encouraged that there is a degree of disagreement between local and adjudicated mRS (suggesting that the adjudication process adds something to standard assessment), but that this variability is not so large as to suggest that adjudicated mRS is systematically different from traditional mRS.
Structured mRS assessment tools, including the recent Rankin Focused Assessment,18 have been proposed to improve mRS reliability. Compared with our group adjudication system, it could be argued that these tools offer an economical and simplified method of improving mRS. However, few of the tools have been independently validated or assessed in a contemporary randomized controlled trial context. In meta-analysis, subgroup analysis comparing structured and unstructured approaches showed no difference in reliability,4 and our previous data suggest that questionnaire-based mRS assessments confer no advantage when used in a real-world setting.17 If more focused assessments prove increasingly popular and widely adopted, our approach could readily incorporate them and offer the further advantages described below.
Including patient-reported outcome measures is a challenge in multicultural and multilingual trials. Where standardized assessment tools are used, there is guidance to ensure that translation is accurate and reliable with several stages of forward and back translation and validation of each step.19 There is no guidance for using translated patient interviews as an outcome measure. We have shown that the reliability of the mRS as an outcome measure is maintained when scoring translated assessments from 2 culturally diverse populations. The incorporation of a translation step into the central adjudication process seems technically feasible.
Strengths of Our Study Program
The potential advantages of central outcome assignment are numerous. Any central adjudication panel allows a degree of expert review, and we have demonstrated that no important data are lost in the process of video recording and remote assessment. Blinding is crucial to the integrity of trial outcomes, and a remote group adjudication approach may prove invaluable where this is difficult (eg, neurosurgical interventions or complex rehabilitation trials). Central adjudication assures quality control: repeat assessment or further information can be gathered if assessment is inadequate or below standard. In these circumstances, a group review approach may prevent a potentially erroneous outcome score being recorded. The video approach retains a hard copy of the outcome assessment allowing trialists to re-examine functional outcome data where there are data queries. Finally, it offers remote source data verification of the patient’s existence and consent in a way that no document can offer.
Limitations of Our Study Program
Our study population had a large proportion of participants with mild stroke and may not be representative of all acute intervention studies. We experienced a substantial number of withdrawals (17% of visits missed). This may be attributable to the observational nature of the study and the mild clinical deficits, allowing participants to return to their usual active lives within the study period and reducing their motivation to participate in research. The majority of withdrawals occurred before the first assessment, so we assume they were not related to the study procedures; these patients declined standard mRS assessment and not just the video recording. Intuitively, some aspects of a participant encounter that may affect the mRS score are not captured on video, for example, how the participant traveled to and mobilized within the consultation. Previous research comparing face-to-face mRS interview with video mRS interview did not suggest that this limits reliability.17 To ensure that we were able to measure the performance of the adjudication team, the design of this study ensured that the committee was blinded to the local investigator score at the time of consensus scoring. In practice, where there is disagreement, open discussion between the adjudication committee and the local center for clarification or further information can improve the quality of scoring. Such contact has both scientific and educational value, unless the local rater could be prejudiced through knowledge of treatment assignment.
Our translation work involves only 1 language in a small sample of our data set. We have demonstrated that the translation step is feasible to incorporate into a central adjudication model, but further work with multiple languages is desirable for generalizability.
Implications for Future Practice and Research
The mRS is imperfect, but as a global functional outcome measure, it should be retained as the preferred primary outcome measure in future research.20 Our iterative work program has demonstrated the importance of reliable trial outcomes and a potential method to improve functional end point assessment. There is now a need to apply our approach to real-world intervention trials. Based on our encouraging initial experiences, central adjudication using our infrastructure is already being used in the National Institutes of Health–supported Clot Lysis Evaluating Accelerated Resolution (CLEAR-3) trial for treatment of intraventricular hemorrhage (ISRCTN70157009; National Institutes of Health grant number 5U01NS062851-02).
We are grateful to the Central Adjudication of modified Rankin Scale assessments in Acute Stroke trial investigators, our trial manager (Pamela MacKenzie), the Robertson Centre for Biostatistics staff (Sharon Kean, Jane Aziz, and Alan Stevenson), and the study participants for making this work possible.
Sources of Funding
This study was competitively funded by a grant from Chief Scientist Office, UK (CBZ/4/595), and supported by the Stroke Research Network. The funder did not contribute to study design, study conduct, report preparation, or submission.
We hold grants to provide central adjudication in modified Rankin Scale (mRS) end points from the US National Institutes of Health (NIH), NIH grant number 5U01NS062851-02, and from the European Union FP7 program for the European, multicentre, randomized, phase III, clinical trial of hypothermia plus medical treatment versus best medical treatment alone for acute ischemic stroke (EuroHYP-1) trial. Dr Dawson has received honoraria for delivering lectures on mRS from Lundbeck. Dr Quinn has received honoraria and consulting fees from training campus for assistance with the development of training materials relating to mRS and Barthel Index. The other authors report no conflicts.
Guest Editor for this article was Bruce Ovbiagele, MD, MSc, MAS.
The online-only Data Supplement is available with this article at http://stroke.ahajournals.org/lookup/suppl/doi:10.1161/STROKEAHA.113.002266/-/DC1.
- Received May 23, 2013.
- Accepted August 15, 2013.
- © 2013 American Heart Association, Inc.