Qualitative Comparison of the Reliability of Health Status Assessments With the EuroQol and SF-36 Questionnaires After Stroke
Background and Purpose—The reliability of the EuroQol and SF-36 questionnaires after stroke is not known. We therefore aimed to assess and compare the test-retest reliability of both instruments in a group of stroke patients.
Methods—A total of 2253 patients with stroke entered by United Kingdom hospitals in the International Stroke Trial were randomized to follow up with either the EuroQol or the SF-36 instruments. For both instruments, we randomly selected one third of respondents and asked them to complete another, identical questionnaire. We assessed test-retest reliability using agreement statistics: unweighted κ statistics for the categorical domains of the EuroQol and intraclass correlation coefficients for the EuroQol visual analog scale, utility scores, and SF-36.
Results—For the five categorical domains of the EuroQol, reproducibility was generally good (κ ranged from 0.63 to 0.80). The reproducibility of the domains of the SF-36 was qualitatively similar for all the domains except mental health (intraclass correlation coefficient=.28). However, the 95% confidence intervals for the difference in scores between test and retest were substantial. For both instruments, reproducibility was better when the patient completed the questionnaires than when a proxy did.
Conclusions—Both the EuroQol and SF-36 have acceptable and qualitatively similar test-retest reliability. Therefore, either instrument might function effectively as a discriminatory measure for assessing health-related quality-of-life outcomes in groups of patients after stroke. However, our data do not support the use of either instrument for serial assessments in individual patients unless very large differences over time are expected.
The selection of an outcome measure must be based on its psychometric attributes, which include feasibility, validity, reliability, and sensitivity to change.1 2 Reliability is the extent to which a measure is free from random error in the population of interest1 3 4 and refers to its internal consistency as well as its reproducibility. The reproducibility of a measure is the degree to which it yields consistent scores over time among respondents who are assumed not to have changed (test-retest reproducibility) or the extent to which different observers may administer it to a particular patient and achieve similar results (interobserver reproducibility). Measures with poor reliability will be less efficient at distinguishing patients with different health states because differences in score may be obscured by random error.
The EuroQol and SF-36 are widely used generic instruments for the measurement of HRQoL that have been validated recently in patients with stroke.5 6 Although both instruments provide reliable assessments of HRQoL in the general population,7 8 9 their reliability in stroke patients has not been assessed. We therefore aimed to assess the test-retest reliability of both instruments in a group of stroke patients.
Patients and Allocation to the EuroQol or SF-36
In a previous study, we examined response rates to postal versions of the EuroQol and SF-36; we randomly allocated patients to receive either the EuroQol or the SF-36. We have described in detail elsewhere the methods used to identify patients and the format of the instruments.10 Briefly, the study included patients with confirmed or suspected ischemic stroke who had been enrolled between March 2, 1993 and May 31, 1995 by any of the United Kingdom hospitals participating in the International Stroke Trial.11 We included all patients who were not known to have died by the time of the survey. We incorporated the EuroQol and SF-36 into booklets that included additional questions recording the patients’ demographic details, their functional outcome after stroke, and whether they had completed the booklet themselves. The questionnaire booklets were identical in all respects other than the nature of the HRQoL instrument.
We then randomly sampled one third of the patients who had responded within approximately 3 weeks to the first questionnaire for repeat testing with the same HRQoL instrument (test-retest reliability). We mailed the second questionnaire booklet containing the appropriate instrument to all eligible patients along with a personalized letter and a postage-paid reply envelope. The letter explained the purpose of the repeat questionnaire and asked the subjects to respond if possible without the help of another person, and if not, to give the questionnaire to a close relative or caregiver who was willing to respond on the patient’s behalf. We sent a reminder letter and another, identical questionnaire to any patient who had not responded within 14 days. We made no further attempts to contact nonrespondents thereafter. We marked individual questionnaire booklets with labels that included details of the patient’s name, address, trial identifying number, and questionnaire allocation. We generated all letters and labels directly from the randomization code using a computerized mail-merge program.
Reliability is a generic term used to indicate both the internal consistency of a scale and its reproducibility.12 We assessed internal consistency, the extent to which the items within a dimension correlate with each other, for the items composing each of the domains of the SF-36 using Cronbach’s α-coefficient (SPSS for Windows, release 6.1). We calculated α-coefficients for each of the eight SF-36 domains using responses to the initial questionnaires. Accepted minimal standards for α-coefficients are .7 for group comparisons and greater than .9 for comparisons between individual patients or within the same patient over time.1
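As a concrete illustration of the internal-consistency calculation described above, the following minimal Python sketch computes Cronbach’s α from raw item scores. The function name and data are illustrative, not taken from the study; the formula itself is the standard one, α = k/(k−1) · (1 − Σ item variances / total-score variance).

```python
def cronbach_alpha(items):
    """Cronbach's alpha for k items scored by the same n respondents.

    `items` is a list of k lists, each holding one item's scores for the
    n respondents in the same order.
    """
    k, n = len(items), len(items[0])

    def var(xs):  # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    sum_item_vars = sum(var(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - sum_item_vars / var(totals))
```

Perfectly parallel items yield α = 1; by the minimal standards quoted above, a domain would need α ≥ .7 to support group comparisons and α > .9 for comparisons within individual patients.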
We examined test-retest reliability by calculating agreement statistics. We performed these analyses only for patients who had complete data on test and retest for any particular domain. For the categorical domains of the EuroQol, we used an unweighted κ statistic to calculate agreement beyond that which might be expected by chance.13 We used the ICC to examine agreement for the continuous data generated by the eight domains of the SF-36 and the visual analog scale (and utilities) of the EuroQol13; for these data, we also calculated the arithmetic mean and standard deviation of the differences between the test and retest administrations.
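The two agreement statistics described above can be sketched in a few lines of Python. These are textbook formulas (Cohen’s unweighted κ for paired categorical ratings, and a one-way random-effects ICC for two administrations per subject); the function names and any data are illustrative, and the study itself may have used a different ICC variant.

```python
def unweighted_kappa(test, retest):
    """Cohen's unweighted kappa for paired categorical ratings."""
    n = len(test)
    cats = set(test) | set(retest)
    p_obs = sum(t == r for t, r in zip(test, retest)) / n
    p_exp = sum((test.count(c) / n) * (retest.count(c) / n) for c in cats)
    return (p_obs - p_exp) / (1 - p_exp)


def icc_oneway(test, retest):
    """One-way random-effects ICC for two administrations per subject."""
    n = len(test)
    means = [(t + r) / 2 for t, r in zip(test, retest)]
    grand = sum(means) / n
    ms_between = 2 * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((t - m) ** 2 + (r - m) ** 2
                    for t, r, m in zip(test, retest, means)) / n
    return (ms_between - ms_within) / (ms_between + ms_within)
```

Both statistics equal 1 for perfect agreement; κ equals 0 when observed agreement is no better than chance, which is why it suits the three-level EuroQol domains while the ICC suits the continuous SF-36 scores.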
To aid in the clinical interpretation of the findings, we aimed to determine the frequency of potentially important differences between test and retest for both instruments. For the five categorical domains of the EuroQol, we considered any change in score potentially important because each of the three levels is explicitly defined. Because “important clinical change” is harder to define for the SF-36, we examined the frequency of differences of varying size.
Of the 4016 patients randomized by the United Kingdom centers in the International Stroke Trial between March 2, 1993 and May 31, 1995, 2253 (56%) patients were known to be alive and at a known address at the start of the present study. Of these, 1125 were randomized to receive a EuroQol questionnaire and 1128 an SF-36 questionnaire (Figure). Patients received the initial questionnaires after a mean period of 64±30 weeks (mean±SD) from their stroke. The response frequency was significantly greater in patients allocated to the EuroQol (80% versus 75% of those allocated to the SF-36 responded after one reminder; P=.003).10 Of these respondents, 271 were selected at random to receive another, identical EuroQol questionnaire and 253 were randomized to follow-up with an additional SF-36 (fewer patients received a repeat SF-36 because fewer patients responded to the initial questionnaire). Both groups had similar characteristics at the time they entered the International Stroke Trial (Table 1).
Of the patients allocated to another, identical EuroQol, 234 patients (86%) responded; of these, 122 (52%) completed the repeat EuroQol without help. Of the 111 repeat EuroQol questionnaires completed with the help of another person (data regarding who completed the questionnaire was missing for 1 patient), 94 were completed with the help of the same individual. Of the 122 patients who managed to complete the EuroQol without help, 54 were independent in activities of daily living. Only 7 of the patients who required help to complete the questionnaire were independent in activities of daily living.
A similar proportion (83%) of patients allocated to another, identical SF-36 responded; of these, 106 (51%) completed the repeat SF-36 without help (58 of these 106 patients were independent in activities of daily living). Of the 101 remaining forms completed with the help of another person (data regarding who completed the questionnaire was missing for 2 patients), 79 were completed with the help of the same individual. Of these 101 patients, 16 were independent in activities of daily living. The mean period between completion of the initial questionnaire and mailing of the repeat questionnaire was 21±7 days for the SF-36 and 21±9 days for the EuroQol. There were no significant differences in time from stroke for patients who did or did not require help with form completion for either instrument.
Table 2 shows the internal consistency for the SF-36. Cronbach’s α reliability coefficients were .8 or greater for all the domains, suggesting very good or excellent internal consistency. We have reported test-retest reliability separately for the forms completed by the patients, for the forms completed on behalf of patients by a proxy, and for all forms combined; the ICCs were generally acceptable or good (Table 3). For all eight domains, reproducibility was better when the patient assessed HRQoL than when a proxy did. The mean of the difference between test and retest ranged from −1.8 to 3.1 for the different domains.
Reproducibility ranged from moderate to good for the five descriptive domains of the EuroQol (κ statistics ranged from 0.63 to 0.80) (Table 4). As for the SF-36, reliability was consistently better for questionnaires completed by the patients. The overall assessments of HRQoL with the EuroQol and the EuroQol utilities had excellent reproducibility, as shown in Table 4. Tables 5 and 6 report the frequency of potentially important disagreements between test and retest for both instruments.
We found that the test-retest reliability of the EuroQol and SF-36 was generally good when assessed after stroke. For both instruments, reproducibility was worst in the domains that examined psychological functioning. Mental health measured with the SF-36 had particularly poor reliability (ICC=.28). This finding may reflect the subjective nature of this domain; alternatively, it may arise because the ICC compares the variance between patients with the total variance, and all patients had relatively similar outcomes for this domain.12
There is no standard that qualitatively defines the results of agreement statistics; for example, no consensus exists about the meaning of “κ=.5.”14 15 This gives rise to inconsistency in the interpretation of the clinical significance of any given κ value or ICC.12 We therefore examined the mean and standard deviation of the differences and the frequency of potentially important disagreement for both the SF-36 and EuroQol to try to clarify the practical implications of our findings. We did not find substantial mean differences between test and retest for any of the domains of the SF-36 or for the assessment of overall HRQoL using the EuroQol “thermometer.” However, we found the standard deviations of the differences were large for most domains (approximately ±20) and even larger for the physical and emotional role functioning domains. This degree of variability means that neither instrument would be suitable for serial studies within the same stroke patient or for making serial comparisons between individual patients after stroke. Potentially important disagreement was also frequent. Our findings do indicate that either instrument would function adequately to compare groups of patients, such as in a parallel group randomized, controlled trial.
There were a number of potential sources for poor test-retest reproducibility in the current study. These included the nature of the domain under study, whether the patients completed the questionnaires themselves, change in the patient’s health state between test and retest, and measurement error. We consistently observed better reliability when patients completed the questionnaires themselves than when a proxy completed them on the patient’s behalf. This finding may be because these instruments were designed to assess a patient’s uniquely personal view of their own health state.4 We might have underestimated the reproducibility of the assessments in patients who required help, because in approximately 20% of cases they sought help from a different person for the repeat form. Alternatively, it may simply be that HRQoL is less stable for more severely affected patients who are unable to complete the questionnaires themselves (usually because of physical and cognitive deficits after the stroke).16 In this situation, rating of the patient’s health status by individuals other than the patient (eg, a family member, friend, or caregiver) may be the only means of assessing the patient’s HRQoL. Although these proxy assessments were not as reproducible and may not be as valid17 18 as those performed by the patients themselves, they appeared to be at least reasonably reliable in the current study.
Poor test-retest reproducibility may be due in part to change in the patient’s health state between the initial test and the subsequent retest. We assessed reproducibility over an interval of several weeks, when the patient’s neurological status was likely to be stable. We considered this period long enough to minimize memory effects but short enough that a real change in the patient’s health was unlikely. Some investigators suggest that patients who report a change in health state during the study period should be excluded from comparisons of test-retest reliability to isolate the “noise” associated with the instrument.1 19 We did not do this, because such exclusions do not indicate the true “noise” in the population of interest, and it is this level of variability that a measure must exceed if it is to detect change in a treatment group.
Measurement error associated with the instrument can result from either a lack of intelligibility or ambiguity in its wording. It may also occur if patients find the content lacks relevance to their situation. Elderly people may not regard some of the questions of the SF-36 about work or vigorous activities (domains of physical and emotional role functioning) as being relevant to them.20 In our study (in which the mean age of the patients was 70 years), these domains had particularly poor test-retest reliability.
The current estimates for the internal consistency and reproducibility of the SF-36 in stroke patients are similar to those obtained in previous studies in other patient groups.19 21 22 Our estimates for the internal consistency of the SF-36 are also consistent with those reported by Anderson and coworkers5 in their study of the validity of the SF-36 when administered by interview after stroke. Weinberger and colleagues21 reported that the mode of administration (face-to-face interview, self-completed questionnaire, or telephone interview) did not appear to affect the reproducibility of the SF-36. It therefore seems reasonable to generalize our conclusions to other modes of administration of the SF-36 after stroke, for instance, by interview.
We were only able to compare the test-retest reliability of the EuroQol and SF-36 indirectly, in a qualitative manner, because no single statistical technique could be used to assess agreement for both categorical and continuous data. Within this limit, both instruments seemed to have similar reliability. We could have reclassified the SF-36 outcome data into several new categories to make a direct comparison with the EuroQol possible, but such an arbitrary approach would be hard to validate. We therefore reported the frequency of what we considered might be “potentially important differences” for both instruments. This approach at least allows a broad qualitative comparison of the reliability of the two instruments, although because there is no consensus regarding what constitutes a clinically meaningful change for either instrument, even this approach has limited value. Because the mobility, self-care, social functioning, pain, and psychological domains of the EuroQol have just three distinct levels (for example, mobility: “I have no problems in walking about,” “I have some problems in walking about,” “I am confined to bed”), we considered any change in these domains to be potentially important. The definition of a potentially important change with the SF-36 is more controversial. Some investigators consider differences of five points in any of its domains potentially important.19 However, this difference is not directly comparable to a change of one level on the EuroQol. We therefore reported the frequency of disagreement for four empirically chosen differences in score (5, 10, 20, and 40 points). These analyses support the conclusion that unless investigators are seeking to identify very large differences (eg, >40 points with the SF-36), neither instrument is likely to reliably identify change over time in HRQoL within an individual patient after a stroke.
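The threshold analysis just described amounts to tabulating, for each cutoff, the proportion of respondents whose test and retest scores differ by at least that amount. A minimal sketch (hypothetical function name and data; the default cutoffs are the four used for the SF-36 above):

```python
def disagreement_frequency(test, retest, thresholds=(5, 10, 20, 40)):
    """Proportion of respondents whose absolute test-retest score
    difference meets or exceeds each threshold."""
    diffs = [abs(t - r) for t, r in zip(test, retest)]
    return {th: sum(d >= th for d in diffs) / len(diffs)
            for th in thresholds}
```

Tabulating several cutoffs side by side, rather than committing to one definition of “important change,” is what permits the broad qualitative comparison between the two instruments.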
We were only able to compare the reliability of the EuroQol and SF-36 indirectly. The groups who received the initial EuroQol and SF-36 were similar, but there were inevitably some differences between the groups who were sent repeat questionnaires because some selection bias had taken place at this stage. An alternative approach would have been to give all patients both instruments twice (test-retest). We felt, however, that this would place an unacceptable burden on patients and so might have adversely affected the response rates. The comparison may also have been biased because the EuroQol asks patients to report their health state on that particular day, whereas the SF-36 asks patients about their health over the previous 4 weeks. We were therefore surprised that the qualitative estimates of reliability of the SF-36 and EuroQol were so similar. This finding suggests that either day-to-day fluctuation in a patient’s health state was small or that the patients did not pay much attention to the exact wording of the questionnaires.
In summary, both the EuroQol and SF-36 have acceptable and qualitatively similar test-retest reliability when administered after stroke and completed by patients or their proxies. Either instrument might function effectively as a discriminatory measure for assessing HRQoL outcomes in groups of patients, as in a large, parallel group, randomized, controlled trial or an audit study. Sample size calculations for observational studies and randomized trials must take the reliability of both instruments into account. Doing so will generally increase the sample size but should reduce the risk of a false-negative or type II statistical error. Our data do not support the use of either instrument for serial assessments in individual patients unless very large differences over time are expected.
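One standard way to make the sample-size recommendation above concrete: when reliability is expressed as an ICC, the observed between-patient variance is the true variance inflated by 1/ICC, so the per-group sample size for comparing two group means scales by 1/ICC. The following sketch uses the usual two-sample normal-approximation formula under that assumption; it is illustrative, not a calculation from this study.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd_true, reliability, alpha=0.05, power=0.80):
    """Per-group n for a two-arm comparison of means, inflated for
    imperfect test-retest reliability (0 < reliability <= 1)."""
    z = NormalDist().inv_cdf
    n_ideal = (2 * (z(1 - alpha / 2) + z(power)) ** 2
               * sd_true ** 2 / delta ** 2)
    return math.ceil(n_ideal / reliability)
```

Halving reliability doubles the required n, which is the sense in which ignoring an instrument’s reliability risks an underpowered study and a type II error.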
Selected Abbreviations and Acronyms
HRQoL = health-related quality of life
ICC = intraclass correlation coefficient
Paul Dorman is supported by a UK Medical Research Council Training Fellowship. Barbara Farrell, Jim Slattery, and Peter Sandercock are supported by grants from the UK Medical Research Council. The International Stroke Trial was sponsored by the UK Medical Research Council, the European Union, and the Stroke Association. This study was supported by a grant from Glaxo Wellcome plc. We would like to thank all the patients, their families, and caregivers for their keen participation.
- Received June 24, 1997.
- Revision received August 18, 1997.
- Accepted September 15, 1997.
- Copyright © 1998 by American Heart Association
Scientific Advisory Committee. Instrument review criteria. Medical Outcome Trust Bulletin. 1995; September:I-IV.
Hobart JC, Lamping DL, Thompson AJ. Evaluating neurological outcome measures: the bare essentials. J Neurol Neurosurg Psychiatry. 1996;60:127–130.
Anderson C, Laubscher S, Burns R. Validation of the Short Form 36 (SF-36) health survey questionnaire among stroke patients. Stroke. 1996;27:1812–1816.
Dorman PJ, Waddell F, Slattery J, Dennis M, Sandercock P. Is the EuroQol a valid measure of health-related quality of life after stroke? Stroke. 1997;28:1876–1882.
Brazier JE, Harper R, Jones NMB, O’Cathain A, Thomas KJ, Usherwood T, Westlake L. Validating the SF-36 health survey questionnaire: new outcome measure for primary care. BMJ. 1992;305:160–164.
van Agt HME, Essink-Bot ML, Krabbe PFM, Bonsel GJ. Test-retest reliability of health state valuations collected with the EuroQol questionnaire. Soc Sci Med. 1994;39:1537–1544.
Dorman PJ, Slattery JM, Farrell B, Dennis MS, Sandercock PA, and the United Kingdom Collaborators in the International Stroke Trial. A randomised comparison of the EuroQol and SF-36 after stroke. BMJ. 1997;315:461.
Deyo RA, Diehr P, Patrick DL. Reproducibility and responsiveness of health status measures. Control Clin Trials. 1991;12(suppl):142S–158S.
Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ. 1992;304:1491–1494.
McDowell I, Newell C. Measuring Health: A Guide to Rating Scales and Questionnaires. New York, NY: Oxford University Press; 1996.
Segal ME, Schall RR. Determining functional/health status and its relation to disability in stroke survivors. Stroke. 1994;25:2391–2397.
Dorman PJ, Waddell F, Slattery J, Dennis M, Sandercock P. Are proxy assessments of health status after stroke with the EuroQol questionnaire feasible, accurate and unbiased? Stroke. 1997;28:1883–1887.
Ruta DA, Abdalla MI, Garratt AM, Coutts A, Russell IT. SF 36 health survey questionnaire, I: reliability in two patient based studies. Qual Health Care. 1994;3:180–185.
Hayes V, Morris J, Wolfe C, Morgan M. The SF-36 health survey questionnaire: is it suitable for use with older adults? Age Ageing. 1995;24:120–125.