(Stroke. 2006;37:1065.)
© 2006 American Heart Association, Inc.
Original Contributions
From the Department of Neurology (R.D., M.A.R., D.G.N., E.B.R., D.W.D.), University Hospital of Münster, Germany; Department of Neurology (M.K.), University Hospital of Giessen, Germany; Department of Neurology (M.S.), University Hospital of Düsseldorf, Germany; Division of Cardiovascular and Medical Sciences (K.L.), University of Glasgow, United Kingdom; Department of Neurology (V.L.), University of Toulouse, France; Centre for Clinical Neuroscience (H.S.M.), St. George's, University of London, United Kingdom; and Department of Neurology (D.W.D.), Centre Hospitalier de Luxembourg.
Correspondence to Martin A. Ritter, MD, Department of Neurology, University of Münster, Albert-Schweitzer-Straße 33, 48129 Münster, Germany. E-mail ritterm{at}uni-muenster.de
| Abstract |
|---|
Methods As primary quality control, centers participating in CARESS evaluated, before the study, a reference digital audio tape (DAT) containing both MES and artifacts. Interobserver agreement in classifying signals as MES was expressed as proportions of specific agreement of positive ratings (ps+ values). For all DATs included in CARESS (n=300), the number of MES counted online and the number counted off-line by the central reader were compared using correlation coefficients. As secondary control, a sample of 16 of 300 DATs was cross-validated by another independent reader (post-trial validator).
Results For the reference tape, the cumulative ps+ value was 0.894 based on 12 of 14 observers. Two observers with very different results improved after a training procedure. Agreement between post-trial validator and central reader was ps+=0.805, indicating very good agreement. Correlation between online evaluation and off-line evaluation of DATs was very good overall (cumulative correlation coefficient=0.84; P<0.001).
Conclusion Multicenter studies using MES as outcome parameter are feasible. However, primary and secondary quality control procedures are important.
Key Words: antiplatelet agents; carotid stenosis; stroke; ultrasonography, Doppler, transcranial
| Introduction |
|---|
The aim of the present study was to implement primary and secondary quality control procedures for the evaluation of MES. Primary quality control mechanisms before patient recruitment were designed to ensure consistency in criteria used to detect MES across centers. Secondary mechanisms were designed to ensure reproducibility of the off-line analysis performed by the central reader.
| Methods |
|---|
All recordings had been made previously for research purposes.3,13 The Doppler signal had been recorded on digital audio tape (DAT) with an EME TCD device (TC 4040; Nicolet/EME) using a 2-MHz transducer. The machine used a 128-point fast Fourier transform (FFT) analysis and a graded color scale to display the intensity of the received Doppler signal. FFT time frame overlap was 67%. A sample volume of 4 to 5 mm in length was used, and ultrasound emission power was set to 22% of maximum output. Copies of this tape were sent to each center interested in participating in the trial. The ultrasound devices in the centers were set as the centers used them for MES evaluation in daily routine. The following types of ultrasound devices were used for playback in the respective centers: DWL Multidop X4 and Nicolet/EME TC4040. Signal processing was performed at the playback stage using FFTs, filter settings, and window overlaps as specified in Table 1.
Published international consensus criteria for MES detection were used by each observer.12 Although a decibel threshold has been shown to improve agreement,10,11 no decibel threshold was used for this study because different analysis packages on different TCD systems measure decibel levels in different ways, making comparisons of measured intensities unreliable. Each observer noted the exact time of MES occurrence in (hh:mm:ss) according to the counter on the DAT recorder. A time window of ±1 s was allowed for allocating a MES correctly. The decibel value measured in the reference center was used in analysis to determine whether disagreement in signal evaluation was dependent on signal intensity.
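The ±1 s allocation rule described above can be illustrated with a short sketch. It is not the procedure used in the study; the function names (`to_seconds`, `match_events`), the greedy matching strategy, and the sample timestamps are assumptions made for illustration only.

```python
def to_seconds(stamp):
    """Convert an hh:mm:ss counter reading into seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def match_events(times_a, times_b, tolerance_s=1):
    """Pair events of two observers that lie within +/- tolerance_s seconds.

    Returns (both, only_a, only_b): events marked by both observers,
    by observer A only, and by observer B only. Each event is used at
    most once (greedy one-to-one matching on sorted times).
    """
    a = sorted(times_a)
    b = sorted(times_b)
    i = j = both = 0
    while i < len(a) and j < len(b):
        diff = a[i] - b[j]
        if abs(diff) <= tolerance_s:
            both += 1          # same signal within the allowed window
            i += 1
            j += 1
        elif diff < 0:
            i += 1             # a[i] is too early to match any remaining b
        else:
            j += 1             # b[j] is too early to match any remaining a
    return both, len(a) - both, len(b) - both


# Hypothetical counter readings, not data from the study.
observer_a = [to_seconds(t) for t in ["00:01:05", "00:02:10", "00:03:00"]]
observer_b = [to_seconds(t) for t in ["00:01:06", "00:02:15", "00:03:00"]]
print(match_events(observer_a, observer_b))
```

The three counts returned by `match_events` correspond to the quantities needed for a signal-to-signal comparison between two observers.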
In total, 14 different experienced observers from 8 different centers were involved: (1) Münster, Germany (6 investigators); (2) Glasgow, United Kingdom (2 investigators); and (3–8) Bern, Switzerland; Düsseldorf, Germany; Gießen, Germany; London, United Kingdom; Lübeck, Germany; and Toulouse, France (1 investigator each).
Because the classification of MES depends on the judgment of a human observer, we classified an event as definite MES if ≥9 different investigators agreed on the same signal. These events were taken as the "gold standard." Interobserver agreement was expressed as proportions of specific agreement for positive ratings (ps+):15 ps+=2MESb/(2MESb+MES1+MES2), where MESb indicates the number of events classified as MES by both observers; MES1 indicates the number of events classified as MES only by the first observer but not by the second observer; and MES2 indicates the number of events classified as MES only by the second observer but not by the first observer.
ps+ values provide the probability that if one randomly selected observer makes a positive rating (in this case, declaring a signal to be a MES at a defined time), another randomly selected observer will also make a positive rating (ie, declare the same signal as MES). With multiple observers, the proportion expresses this probability cumulatively across all observers. ps+ values are comparable with widely used κ statistics. However, κ values are not meaningful in the case of MES because no specific negative rating is made during MES analysis.15
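As a worked illustration of the ps+ definition, the following minimal sketch computes the pairwise value from the three counts defined above. The multi-observer function uses one common generalization of specific agreement (summing over signals); whether the cumulative values in this study were computed in exactly this way is an assumption, and the function names and example counts are hypothetical.

```python
def ps_plus_pair(mes_both, mes_first_only, mes_second_only):
    """Pairwise ps+ = 2*MESb / (2*MESb + MES1 + MES2)."""
    denominator = 2 * mes_both + mes_first_only + mes_second_only
    if denominator == 0:
        raise ValueError("no positive ratings; ps+ is undefined")
    return 2 * mes_both / denominator


def ps_plus_multi(positive_counts, n_observers):
    """Specific positive agreement pooled over several observers.

    positive_counts[k] is the number of observers who rated signal k
    as MES (only signals with at least one positive rating are listed).
    Assumed generalization: sum(n*(n-1)) / sum(n*(N-1)), which reduces
    to the pairwise formula when n_observers == 2.
    """
    numerator = sum(n * (n - 1) for n in positive_counts)
    denominator = sum(n * (n_observers - 1) for n in positive_counts)
    if denominator == 0:
        raise ValueError("no positive ratings; ps+ is undefined")
    return numerator / denominator


# Hypothetical counts, not data from the study.
print(ps_plus_pair(mes_both=10, mes_first_only=2, mes_second_only=3))
print(ps_plus_multi([2, 1, 1], n_observers=2))
```

With two observers, a signal marked by both contributes 2 to numerator and denominator and a signal marked by one contributes 1 to the denominator only, so both functions agree in that case.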
Secondary Quality Control: Cross-Validation of Off-Line Reading Results by Post-Trial Validator
Relevant Study Design
Patients were eligible for the CARESS trial if they had recently symptomatic (with TIA or stroke within the last 3 months) carotid stenosis ≥50% based on ultrasound criteria. In subjects meeting these entry criteria, a screening TCD was performed, and subjects were eligible for randomization if ≥1 MES were detected during a 1-hour recording. The detailed study design has been reported previously.9
During CARESS, 3 recordings within 1 week were performed on-site in the randomizing centers and were stored on DAT. DATs were analyzed off-line by the central reader. The central reader was blinded to patient identity and to which tape of a given patient corresponded to which visit.
Assessment of Off-Line Reader
For secondary quality control, 16 DATs of 300 recorded for the core study were transferred to London (the reference center for the United Kingdom) after the central reading procedure and were evaluated in the same fashion as in Münster by 1 experienced observer. The second reference reader was termed "post-trial validator."
Five tapes of the 16 were randomly selected by the central data management center in Paris while the study was ongoing. Eleven tapes were selected after the trial, stratified by center and MES count. All tapes had been allocated to 1 of 5 strata, and at least 1 tape was selected from each stratum as well as 1 from each center. The strata and the number of tapes within each stratum are given in Table 2.
Stratification ensured that no tape was selected without any MES (ie, 142 tapes negative on both online and off-line analysis). Tapes were selected to cover a range from few to many MES. The number of MES and the exact time of their occurrence were documented, thus enabling a signal-to-signal comparison with the results of the central reader, and ps+ values were calculated.
Correlation of Off-Line and Online Reading Results
After the end of the CARESS trial and unblinding of the reading results, the online findings documented by the study site investigator were compared with the results of the central reader. A signal-to-signal comparison was not possible because online reading did not record the exact time points at which MES occurred. Intercenter agreement was therefore expressed both by the numbers of MES detected and by the decision whether a tape was MES positive or not. Correlation coefficients were calculated for each center independently, as well as for all centers cumulatively.
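The count-based comparison can be sketched as follows. The type of correlation coefficient is not specified here, so the use of a rank (Spearman) correlation is an assumption, as are the function names and the sample counts.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(online_counts, offline_counts):
    """Rank correlation between online and off-line MES counts per tape."""
    return pearson(average_ranks(online_counts), average_ranks(offline_counts))


def positivity_agreement(online_counts, offline_counts):
    """Fraction of tapes on which both readings agree on MES positive/negative."""
    same = sum((a > 0) == (b > 0) for a, b in zip(online_counts, offline_counts))
    return same / len(online_counts)


# Hypothetical MES counts per tape for one center, not data from the study.
online = [0, 3, 0, 5]
offline = [0, 2, 1, 4]
print(spearman(online, offline), positivity_agreement(online, offline))
```

Running `spearman` once per center and once on the pooled tapes of all centers mirrors the per-center and cumulative comparison described above.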
| Results |
|---|
≥1 observer. Ninety-one signals were classified as MES by ≥9 observers. Table 3 shows the results separately for each individual observer.
Two observers (12 and 13) had results very different from the remaining 12 observers. These observers detected only 25 and 27 of the 91 reference MES. A total of 145 signals were detected by only 1 or both of these 2 observers. The performances of these 2 observers were independent of signal intensity. The other 12 observers detected ≥79 of the 91 reference MES (86%). Of the remaining 56 signals that were detected by ≥1 of the 12 observers who performed consistently, only 7 signals had intensities >5 dB as measured by the reference center, indicating that disagreement occurred mainly in signals with low intensity.
Excluding observers 12 and 13, the ps+ value was 0.894, signifying an approximately 90% likelihood that if 1 of a pair of observers declared an event as MES, the other observer also declared the event as MES. After including investigators 12 and 13, the overall ps+ value (agreement between all observers) fell to 0.694. The results of the observers apart from 12 and 13 were considered excellent.
Observers 12 and 13 were therefore considered outliers requiring retraining. Retraining was performed by 1 of the reference centers. The observers were then re-evaluated using a second reference tape created in an analogous way as the first one. Two observers from the 12 observers with good agreement on the first tape agreed on 75 MES on the second reference tape. For this tape, ps+ was 0.95 for investigator 12 and 0.71 for investigator 13, showing marked improvement.
Secondary Quality Control: Cross-Validation of the Off-Line Reading Results With the Findings of the Post-Trial Validator
The ps+ value for the 5 tapes randomly selected during the study was 0.815. An additional 11 tapes were validated after the study. For the 3 stratum-1 tapes judged MES negative during the online analysis but found positive during off-line analysis, all were found MES negative by the post-trial validator. For the 3 stratum-2 tapes judged MES positive during the online analysis but MES negative during the off-line analysis, all 3 were also found MES negative by the post-trial validator's analysis. For the 1 tape from stratum-3 (ie, MES positive online and MES positive off-line, but with only few MES), the post-trial validator agreed with all 6 signals of the central reader but also found 2 more signals. There was good agreement for the 1 tape from stratum-4 (ie, MES positive online as well as off-line, with a considerable number of MES); the post-trial validator and central reader agreed on 7 signals, and each found 1 additional signal. There was also good agreement on the 3 tapes in the high-count stratum. The overall ps+ value for these 11 tapes was 0.802. The corresponding value for all 16 tapes was ps+=0.805. In summary, there was very good agreement between central reader and post-trial validator.
Correlation of Off-Line and Online Reading Results
For 6 of 107 randomized patients, the central reading center did not report off-line MES during baseline recording. Therefore, these 6 patients were excluded from the per-protocol analysis to avoid bias because only online MES-positive patients at baseline met the entry criteria. The numbers of MES registered during central reading versus online analysis within each tape are given separately for each center in the Figure.
There was an excellent correlation between the online reader and the central reader in 6 of 11 centers (correlation coefficient ≥0.94 in each). For 3 centers, agreement was very good in 1 (correlation coefficient=0.88), good in 1 (0.69), and borderline in 1 (0.42; P=0.057). For 2 centers, there was no correlation between off-line and online analysis (correlation coefficients=0.02 and 0.13). In both of these centers, the ultrasound technician had changed during the trial. The correlation of all centers combined compared with the central reader was excellent (correlation coefficient=0.84; P<0.001).
The main source of disagreement was, as expected, low intensity of the signals (≤5 dB as measured by the EME machine used for central analysis). This was true both for the comparison between the online reader and the central reader and for the comparison between the post-trial validator and the central reader.
| Discussion |
|---|
Similar issues apply to all studies using neuroimaging as a surrogate end point. For the interpretation of computed tomography (CT) or MRI scans as surrogate markers of disease, agreement rates of 0.6 to 0.7 are considered very acceptable if κ statistics are applied.16–18 ps+ values are comparable with κ statistics.15 Therefore, our study shows that equivalent or higher levels of agreement can be reached in MES reading than in CT or MRI reading.
We demonstrated that an important quality control feature is an assessment of performance before study commencement. Although most observers demonstrated high levels of agreement at this stage, 2 observers reported MES findings very different from those of the others. The high overall level of agreement for most observers is consistent with previous international reproducibility studies, which have shown good agreement among most centers except for a few outliers.10,11 In our study, the identification of outliers before enrollment of any patients allowed retraining of technicians in these centers, after which performance markedly improved. Subsequent comparison of online and off-line analysis demonstrated poor correlation in only 2 centers, both of which had a change in ultrasound technician during the study. This emphasizes the importance of re-evaluating any new technicians during the study, something that was not performed in CARESS.
Previous studies have shown high levels of interobserver reproducibility in the detection of MES.10,11 These were cross-sectional studies and involved a number of internationally recognized and expert centers. We demonstrated similar levels of agreement between most centers in this study. Our study extends these findings in a number of ways. First, we included all centers in the CARESS trial, not all of which had such extensive experience in MES detection. Second, we evaluated the effect of retraining less well-performing centers and demonstrated that this resulted in a marked improvement in performance. Third, in this prospective study, we monitored the performance of centers during the trial by comparing their online analysis with the off-line analysis of the central reading center. This demonstrated the need for continued monitoring and particularly for re-evaluation and, if necessary, training if the technicians performing the recordings change during the study. This is a standard procedure in other clinical trials using neuroimaging.16–18 It was not performed in the CARESS study.
There are a number of potential limitations in our study. First, 290 signals were declared as MES by ≥1 observer, but only 91 MES were considered as "reference MES" against which statistical tests were performed. This seemingly high level of disagreement in the number of MES is well explained by the fact that the 2 outliers were responsible for 145 of 199 of the signals not considered as MES for the purpose of the study and that only 7 of the remaining 56 signals had high signal intensities. This demonstrates that the disagreement occurred mainly in the low-intensity range, something that was expected before the trial.
Therefore, not using a decibel threshold for MES could be regarded as the second limitation. However, because of the different equipment used, it was not possible to use a fixed value. The lack of an intensity threshold also allowed us to evaluate agreement across the full range of MES intensity and demonstrated the relationship between disagreement and intensity. A third limitation could be that not all centers used the same ultrasound device for signal analysis. This could lead to a systematic error for centers that use a device different from that of the central reading center. However, our study does not support this assumption because the results of the readers for the reference tape and during the trial using DWL or EME machines were not very different.
In summary, we have shown that it is possible to standardize MES reading and that MES are a robust tool suitable as a surrogate marker for clinical trials as long as independent quality control mechanisms are installed.
| Footnotes |
|---|
Received August 30, 2005; revision received November 28, 2005; accepted December 14, 2005.