Quality of Life Measurement After Stroke
Uses and Abuses of the SF-36
Background and Purpose— The Medical Outcomes Study 36-item Short-Form Health Survey (SF-36) is widely used to measure health status after stroke. However, a fundamental assumption for its valid use after stroke has not been comprehensively tested: is it legitimate to generate scores for 8 scales and 2 summary measures using the standard algorithms? We tested this assumption.
Methods— SF-36 data from 177 people after stroke were examined (71% male; mean age, 62). We tested 6 scaling criteria to determine the legitimacy of generating the 8 SF-36 scale scores using Likert’s method of summed ratings, and we tested 2 scaling criteria to determine the appropriateness of the standard SF-36 algorithms for weighting and combining scale scores to generate 2 summary measures (physical and mental).
Results— Scaling assumptions were fully satisfied for 6 of the 8 scales, but 3 of these 6 scales had notable floor and/or ceiling effects. Assumptions for generating 2 SF-36 summary measures were not satisfied.
Conclusions— In this sample, 5 of the 8 SF-36 scales had limited validity as outcome measures after stroke, and the reporting of physical and mental summary scores was not supported. Results raise questions about the use of the SF-36 in stroke, and the SF-12 that is developed from it, and highlight the importance of testing scaling assumptions when applying existing scales to new populations.
Recognition that healthcare evaluations should incorporate patients’ perspectives has led to the widespread use of patient-reported health rating scales as outcome measures. The purpose of these rating scales is to quantify, as rigorously as possible, aspects of health so that differences between people, changes over time, and changes associated with interventions can be detected accurately. Health rating scales are usually disease specific or generic. A key advantage of generic scales is that they enable comparisons across diseases. However, the use of generic rating scales assumes that they satisfy minimum psychometric requirements across diverse clinical populations.1,2 These assumptions are frequently untested. This is of concern because the quality of inferences made from any study hinges directly on the scientific quality of the measurement instrument used.
The Medical Outcomes Study 36-item Short-Form Health Survey (SF-36) is a widely used, generic, patient-report, health status measure.3 It is recommended for use in health policy evaluations, general population surveys, clinical research, and clinical practice.4 In neurology, the SF-36 has been used in stroke (a MEDLINE search indicates more than 30 articles), motor neuron disease,5 Parkinson’s disease,6 epilepsy,7 headache,8 and multiple sclerosis.9–12 Moreover, it has been used frequently as a validating instrument in the psychometric evaluation of new measures.13–17
Two types of SF-36 scores can be generated (Figure 1). Scores for 8 scales are generated by summing items, without weighting or standardization, and scores for 2 summary measures are generated by combining weighted scale scores. The 8 scales provide a comprehensive profile of health status; however, the 2 summary measures have features that make them more advantageous for clinical trials. These features include better measurement precision, smaller confidence intervals, the elimination of floor and ceiling effects, simpler analysis by reducing the number of statistical tests required and avoiding the problem of multiple testing, and superior (theoretically) responsiveness.18,19 Summary measures are also more easily interpreted because their scores are directly related to scores for the general US population, which have been transformed to a mean of 50 and a standard deviation of 10.
Among the studies using the SF-36 in people with stroke, several have examined some of its psychometric properties. These studies report adequate internal consistency reliability20 and support the convergent and discriminant construct validity21 and group differences validity20 of the SF-36 in stroke. Floor and ceiling effects have been demonstrated by some21,22 but not others,20 and the adequacy of completion rates in the elderly has been questioned.22 Despite these articles and comments that the SF-36 is a valid measure of health-related quality of life after stroke,21 no study has examined a fundamental prerequisite for rigorous measurement: tests of scaling assumptions. Tests of scaling assumptions determine whether it is legitimate to generate scale and summary scores using the algorithms proposed by the developers. Because psychometric properties are sample dependent,1,23 the performance of a measure in a specific application is more important than its performance generally.1 This study tested the scaling assumptions for the SF-36 in a sample of people with ischemic stroke.
Subjects and Methods
Participants, Recruitment, and Data Collection
People with ischemic stroke were recruited from 3 hospitals in Indianapolis between March 1999 and April 2001: an inner-city county hospital, a university-affiliated tertiary care hospital, and a Department of Veterans Affairs medical center. Patients were identified at the time of admission and, at follow-up, those who could communicate reliably (National Institutes of Health Stroke Scale Language score <2 and no history of dementia) completed the SF-36 as part of a study to develop a disease-specific stroke scale (Stroke-Specific Quality of Life Scale24). Demographic and clinical data were collected by interview and from medical records. Stroke severity on admission was determined using the retrospective Canadian Neurologic Scale.25 Low scores indicate greater impairment. SF-36 data were collected between 1 and 11 months after stroke in a face-to-face interview.
The Measurement Model of the SF-36
The measurement model of the SF-36 hypothesizes that 35 of the 36 items are grouped into 8 multi-item scales (physical functioning [PF], role limitations physical [RP], bodily pain [BP], general health perceptions [GH], energy/vitality [VT], social functioning [SF], role limitations emotional [RE], and mental health [MH]) that are aggregated into 2 summary measures (physical component [PCS] and mental component [MCS]). The remaining item is not used in scoring (Figure 1).
Different processes are used to generate scores for scales and summary measures. Likert’s method of summated ratings is used for scales.26 That is, item responses are summed without weighting or standardization. Before this is undertaken, 2 items are recalibrated and the scoring of 9 items is reversed so that high scores always indicate better health.3 Finally, because the 8 scales have different ranges, they are transformed to have a common range of 0 (worst health) to 100 (best health). Although these summed scores have the same range, they have different metrics. For example, a score of 60 does not have the same meaning across scales. To facilitate comparisons across scales, summed scores can be expressed as z scores (standard) or T scores (standardized) so that individual scores are reported in standard deviation units.27 These transformations are rarely undertaken.
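The rescaling step can be illustrated with a minimal sketch. It assumes the usual linear transformation (raw score minus the lowest possible raw score, divided by the possible raw range, multiplied by 100). The function name and the 10-item, 3-option example are illustrative assumptions and are not taken from the scoring manual.

```python
def transform_scale(raw_sum, lowest_possible, highest_possible):
    """Linearly rescale a summed raw scale score to 0 (worst) .. 100 (best)."""
    if highest_possible <= lowest_possible:
        raise ValueError("highest_possible must exceed lowest_possible")
    if not lowest_possible <= raw_sum <= highest_possible:
        raise ValueError("raw_sum lies outside the possible range")
    return 100.0 * (raw_sum - lowest_possible) / (highest_possible - lowest_possible)


# Hypothetical scale: 10 items, each scored 1-3, so raw sums run from 10 to 30.
midpoint_score = transform_scale(20, 10, 30)
```

Because every scale is mapped onto the same 0 to 100 range by its own raw range, equal transformed scores on different scales still reflect different item content and different numbers of response options, which is why the metrics differ.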
Scores for the 2 summary measures are generated in 3 stages.18 First, scale scores are standardized (z score transformation) by subtracting the US population mean for that scale and dividing the difference by the US population standard deviation for that scale. Next, z scores are multiplied by their respective factor score coefficients, derived from US population data, and summed. Finally, these aggregated scores are standardized using a linear T score transformation to have a mean of 50 and standard deviation of 10, in the general US population.
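The 3 stages can be sketched as follows. This is an illustration only: the means, standard deviations, and coefficients below are placeholders invented for the example, not the published US norms or factor score coefficients, and the sketch assumes that the weighted sum already has a mean of 0 and a standard deviation of 1 in the reference population.

```python
import numpy as np


def summary_score(scale_scores, pop_means, pop_sds, coefficients):
    """Combine 8 scale scores (0-100) into one summary score on a T metric."""
    scale_scores = np.asarray(scale_scores, dtype=float)
    pop_means = np.asarray(pop_means, dtype=float)
    pop_sds = np.asarray(pop_sds, dtype=float)
    coefficients = np.asarray(coefficients, dtype=float)
    # Stage 1: z score each scale against the reference population.
    z = (scale_scores - pop_means) / pop_sds
    # Stage 2: weight each z score by its coefficient and sum.
    aggregate = float(np.dot(z, coefficients))
    # Stage 3: linear T score transformation (mean 50, SD 10).
    return 50.0 + 10.0 * aggregate


# PLACEHOLDER inputs (order: PF, RP, BP, GH, VT, SF, RE, MH). Not real norms.
placeholder_means = np.full(8, 70.0)
placeholder_sds = np.full(8, 20.0)
placeholder_coefficients = np.array([0.4, 0.3, 0.3, 0.2, 0.0, 0.0, -0.2, -0.2])

# A respondent scoring exactly at the reference mean on every scale.
at_norm = summary_score(placeholder_means, placeholder_means,
                        placeholder_sds, placeholder_coefficients)
```

The sketch makes the dependence explicit: every input other than the respondent's scale scores comes from the reference population, so the summary score is only as appropriate as those weights are for the sample being studied.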
The results of factor analytic studies led to the discovery of the 2 SF-36 summary measures. Principal components analysis (PCA) of intercorrelations among SF-36 scales from several studies consistently extracted 2 components, with similar scale-to-component correlations, that accounted for the majority of total SF-36 variance and variance in each of the individual SF-36 scales.18 These findings indicated that 2 summary measures could be generated without substantial loss of information. The 2 components were interpreted as measures of physical and mental health because the scales correlating highest with them were PF and RP, and MH and RE, respectively.
Is It Legitimate to Report SF-36 Scale Scores in Stroke?
Six scaling assumptions must be satisfied for SF-36 scale scores to be generated using the item groups proposed by the developers and Likert’s method of summated ratings.
(1) Items in each scale must be roughly parallel (that is, measured at the same point on the scale and have similar variances); otherwise, they do not contribute equally to the variance of the total score and must be standardized before combination.26 This criterion is evaluated by examining the symmetry of item response distributions and the equivalence of item mean scores and standard deviations.1
(2) Items in each scale must measure a common underlying construct; otherwise, it is not appropriate to combine them to generate a total score.26 This criterion is evaluated by examining the correlation between each item and the total score computed from the remaining items in that scale. Excluding the item from its own total prevents values from being inflated by item overlap. These values are known as item-total correlations corrected for overlap, corrected item-total correlations, or item-own scale correlations. A range of minimum values has been recommended: 0.20 (reference 27), 0.30 (reference 28), and 0.40 (reference 1).
(3) Items in each scale should contain a similar proportion of information concerning the construct being measured; otherwise, they should be given different weights.26 This criterion is determined by examining the equivalence of corrected item-total correlations. Recently, Ware et al29 have stated that this criterion can be considered satisfied when values exceed 0.30, even if they vary.
(4) Items must be correctly grouped into scales. That is, items must correlate substantially higher with the construct they are purported to measure than with the other constructs measured by the instrument.29 This criterion, termed item convergent and discriminant validity, is evaluated for each scale by comparing the difference between item-own and item-other scale correlations with 2 standard errors of a correlation coefficient (2×1/√n).1 Results for each scale are reported as percent definite and probable scaling success and failure rates. Definite scaling successes are scored when item-own scale correlations exceed item-other scale correlations by more than 2 standard errors. Probable scaling successes are scored when item-own scale correlations exceed item-other scale correlations by less than 2 standard errors. Probable scaling failures are scored when item-other scale correlations exceed item-own scale correlations by less than 2 standard errors. Definite scaling failures are scored when item-other scale correlations exceed item-own scale correlations by more than 2 standard errors.29
(5) Scales must generate reliable estimates; otherwise, their scores cannot be confidently interpreted. This criterion is satisfied when Cronbach’s alpha coefficients30 for each scale exceed 0.70 (reference 31) or 0.80 (reference 28). Cronbach’s alpha is a measure of reliability reflecting the (weighted) average correlation among all items in a scale (internal consistency).32,33
(6) Scales must demonstrate that they measure distinct constructs; otherwise, interpretation of their scores is confounded. This criterion is satisfied when the correlations among scales are substantially less than their respective reliability estimates.29
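Criteria 2, 4, and 5 rest on simple computations that can be sketched in a few lines. The sketch below is not the analysis software used in this study; the function names and the tiny demonstration data set are assumptions made for illustration, and the handling of exact ties in the classification is our own choice because the criteria do not define it.

```python
import math

import numpy as np


def corrected_item_total(items):
    """Correlate each item with the sum of the OTHER items in its scale."""
    items = np.asarray(items, dtype=float)  # rows = respondents, cols = items
    total = items.sum(axis=1)
    correlations = []
    for j in range(items.shape[1]):
        rest = total - items[:, j]  # remove the item to avoid overlap
        correlations.append(float(np.corrcoef(items[:, j], rest)[0, 1]))
    return correlations


def cronbach_alpha(items):
    """Cronbach's alpha from item variances and the variance of the total."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - sum_item_variances / total_variance)


def classify_scaling(r_own, r_other, n):
    """Compare item-own and item-other correlations against 2 x (1 / sqrt(n))."""
    two_se = 2.0 / math.sqrt(n)
    difference = r_own - r_other
    if difference > two_se:
        return "definite success"
    if difference > 0:
        return "probable success"
    if difference > -two_se:
        return "probable failure"  # assumption: an exact tie counts here
    return "definite failure"


# Hypothetical 2-item scale answered identically by 3 respondents.
demo_items = np.array([[1, 1], [2, 2], [3, 3]])
demo_alpha = cronbach_alpha(demo_items)
```

With n=177, for example, `classify_scaling(0.50, 0.40, 177)` returns a probable rather than a definite success, because the difference of 0.10 is smaller than 2 standard errors.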
Is It Legitimate to Report SF-36 Summary Scores in Stroke?
(1) PCA, with orthogonal (varimax) rotation of extracted components, of the correlations among the 8 SF-36 scales should support a 2D model of health that explains approximately 80% to 85% of the total reliable variance in the SF-36 scales and at least 75% of the reliable variance in each of the 8 SF-36 scales.3,18
(2) The magnitude and pattern of correlations between the 8 SF-36 scales and the 2 rotated components should support their interpretation as measures of physical and mental health and be consistent with other studies. That is, the PCS measure should correlate strongly (>0.70) with the PF, RP, and BP scales and weakly (<0.30) with the MH and RE scales, and vice versa for the MCS measure.3,18
These 2 scaling assumptions were examined by undertaking a scale-level PCA with varimax rotation. Eigenvalues34 and the scree plot35 were examined to determine the optimum number of components to rotate.
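A scale-level PCA of this kind can be sketched with NumPy. This is a generic illustration, not the software used in this study: the function names, the eigenvalue-greater-than-1 retention rule as coded, and the 4-variable correlation matrix are assumptions chosen for the example. The sketch reports the proportion of total variance explained and does not adjust for scale reliability.

```python
import numpy as np


def principal_components(corr):
    """Return sorted eigenvalues and loadings for components with eigenvalue > 1."""
    eigvals, eigvecs = np.linalg.eigh(np.asarray(corr, dtype=float))
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    n_keep = int(np.sum(eigvals > 1.0))
    loadings = eigvecs[:, :n_keep] * np.sqrt(eigvals[:n_keep])
    return eigvals, loadings


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation using the standard SVD-based iteration."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        column_ss = np.diag((rotated ** 2).sum(axis=0))
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - rotated @ column_ss / p)
        )
        rotation = u @ vt
        new_criterion = s.sum()
        if criterion > 0 and new_criterion < criterion * (1 + tol):
            break  # the criterion has stopped improving
        criterion = new_criterion
    return loadings @ rotation


# Hypothetical correlations among 4 variables forming 2 loose clusters.
demo_corr = np.array([
    [1.0, 0.6, 0.1, 0.1],
    [0.6, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.5],
    [0.1, 0.1, 0.5, 1.0],
])
demo_eigvals, demo_loadings = principal_components(demo_corr)
demo_rotated = varimax(demo_loadings)
# Variance in each variable explained by the retained components.
demo_communalities = (demo_rotated ** 2).sum(axis=1)
demo_total_explained = demo_communalities.sum() / demo_corr.shape[0]
```

The rotated loadings play the role of the scale-to-component correlations referred to in criterion 2, and the communalities correspond to the per-scale variance explained in criterion 1. An orthogonal rotation changes the loadings but leaves the communalities unchanged.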
Results
A total of 177 people with stroke were studied. Table 1 reports the gender and race distribution, stroke subtype, length of stay, stroke severity at onset, post-stroke rehabilitation and rehabilitation status at time of interview, and employment status at time of interview. There were more men than women, more whites than blacks, and almost half had small-vessel strokes (which have a better outcome). The stroke severity and rehabilitation data indicate that this is a sample of survivors from mild to moderate stroke.
Is It Legitimate to Generate SF-36 Scale Scores in Stroke?
All response options were endorsed for each item (Table 2), but endorsement frequencies were quite variable (range, 1.7% to 77.7%). Item response-option frequency distributions were relatively symmetrical for 6 of the 8 scales (PF, BP, GH, VT, SF, and MH), skewed toward worse health (low scores) for the RP scale, and skewed toward better health for the RE scale. Nevertheless, items within each of the 8 scales had similar mean scores and standard deviations, indicating they were roughly parallel. Skewness is explained below.
All item-own scale correlations corrected for overlap, except 2 items in the GH scale (11a and 11c), exceeded 0.40, indicating that, for the other 7 scales, the items in each scale measured a common underlying construct and that the criterion of Ware et al29 for equivalence of item-total correlations was satisfied (Table 3). All item-own scale correlations exceeded item-other scale correlations, indicating no scaling failures. Most item-own scale correlations exceeded item-other scale correlations by at least 2 standard errors (>0.15), indicating definite scaling successes. However, in 2 scales, a substantial proportion of item-own scale correlations exceeded item-other scale correlations by less than 2 standard errors (<0.15), indicating high probable scaling success rates (SF=43%, GH=29%). This finding indicated that items in these 2 scales were limited in their ability to discriminate between constructs that are hypothesized to be different.
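The 0.15 threshold follows directly from the sample size, using the approximation for the standard error of a correlation coefficient given under criterion 4 and n=177:

```latex
2 \times \mathrm{SE}(r) \approx 2 \times \frac{1}{\sqrt{n}}
  = \frac{2}{\sqrt{177}} \approx \frac{2}{13.30} \approx 0.15
```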
Alpha coefficients ranged from 0.68 to 0.90, indicating that most scales generated reliable scores (Table 4). One scale (GH) failed to satisfy the criterion of 0.70, and 2 scales (GH and SF) just failed to satisfy the more stringent criterion of 0.80. Intercorrelations among scales (range, 0.16 to 0.55) were substantially below their respective alphas, indicating that they were measuring 8 related but distinct constructs.
Scores for the 8 SF-36 scales spanned the entire, or almost entire (MH scale), scale ranges, demonstrating good variability (Table 5). The shape of the distributions was examined to determine the extent to which they were non-normal. Skewed distributions are those that are bunched up at one end. This is reflected in skewness statistics outside the recommended range of −1 to +1 (reference 36). Scores for the RP scale were positively skewed (skewness +1.18; distribution bunched to the left, low score end of the scale), indicating that respondents tended to be more physically disabled. In contrast, scores for the RE and MH scales were negatively skewed (skewness −0.84 and −0.67, respectively; distributions bunched to the right, high score end of the scale), indicating respondents had relatively better health in these domains. Scores for the other 5 scales were more evenly distributed (skewness −0.32 to +0.28). There were notable floor effects for 2 scales (RP 59.1%, RE 19.9%) and ceiling effects for 3 scales (RE 63.1%, SF 29.9%, BP 25.6%).
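Floor and ceiling effects and skewness are straightforward to compute from transformed scale scores. The sketch below is illustrative: the function names and the 6 demonstration scores are hypothetical, and the skewness function uses the simple moment coefficient, which is an assumption because the estimator used in this study is not stated.

```python
import numpy as np


def floor_ceiling(scores, minimum=0.0, maximum=100.0):
    """Percent of respondents at the lowest and highest possible score."""
    scores = np.asarray(scores, dtype=float)
    floor_pct = 100.0 * np.mean(scores == minimum)
    ceiling_pct = 100.0 * np.mean(scores == maximum)
    return floor_pct, ceiling_pct


def skewness(scores):
    """Moment coefficient of skewness: third central moment / SD cubed."""
    scores = np.asarray(scores, dtype=float)
    centered = scores - scores.mean()
    second_moment = np.mean(centered ** 2)
    third_moment = np.mean(centered ** 3)
    return third_moment / second_moment ** 1.5


# Hypothetical 0-100 scores bunched at the low end of the scale.
demo_scores = [0, 0, 0, 25, 50, 100]
demo_floor, demo_ceiling = floor_ceiling(demo_scores)
demo_skew = skewness(demo_scores)
```

In the demonstration data, half of the respondents sit at the minimum and the skewness is positive, the same pattern (bunching at the low score end) described above for the RP scale.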
Is It Legitimate to Generate SF-36 Summary Scores in Stroke?
Principal components analysis of intercorrelations among SF-36 scales extracted 2 components with eigenvalues exceeding unity, and the scree plot supported the existence of 2 higher order factors. This supported the hypothesis that a 2D model of health underpins the SF-36 in stroke. However, these 2 components explained less than 60% of the total reliable variance in all SF-36 scales and less than 75% of the reliable variance in 5 of the 8 scales (Table 6). Therefore, a substantial amount of information from SF-36 scales is lost when summary measures are reported in stroke. In addition, the magnitude and pattern of scale-to-component correlations in stroke differ from the US general population, indicating that the scale weights used to generate scores for the summary measures are not entirely applicable to people with stroke (Figure 2).
Discussion
This study has comprehensively examined the basic assumptions underpinning the scoring of the SF-36 in people with ischemic stroke. Scaling assumptions were not fully satisfied for either scale or summary scores. This has important clinical implications for the use of the SF-36, and the SF-12 that is derived from it,37,38 in stroke.
Results support the generation of summed scores for 6 of the 8 SF-36 scales (PF, RP, BP, VT, RE, and MH). However, GH and SF scales have low alphas and limited item convergent and discriminant validity, indicating imprecision and confounding of scores, respectively. In addition, 3 of the 6 scales for which scaling assumptions were satisfied have important floor (RP, RE) and/or ceiling effects (BP, RE) that limit the value of these scales in clinical trials. The ceiling effect (percent sample at maximum score) represents a subsample of people whose score cannot increase regardless of any clinical improvement and for whom worsening of function may not register as a change in score. The floor effect (percent at minimum score) represents a subsample for whom clinical improvement may not register as a change in score and for whom worsening of function cannot be assessed. Therefore, floor and ceiling effects underestimate the impact of treatments in clinical studies, resulting in type II errors.
It is therefore important that the spectrum of health covered by the scale matches that of the study sample. This can be determined, to some extent, by examining the items and item response options of the scale, and the samples from whom and for whom the scale was developed. Even generic measures have some degree of specificity. For example, the SF-36 was developed to compare the effects of different healthcare financing arrangements and, consequently, is aimed toward less disabling medical conditions. Similarly, the Barthel Index, a generic measure of physical function, was developed for people with severe problems and is less appropriate for samples with milder physical disabilities. Investigators should recognize that the term generic is relative and does not indicate universal applicability.
Results do not support the computation of SF-36 PCS and MCS scores in stroke. Even though PCA supports a 2D model of health, this model does not explain as much of the variance in SF-36 scales as required, and the pattern of scale-to-component correlations is not consistent with findings in other clinical populations.2,39,40 Scale-to-component correlations for the GH and SF scales are the reverse of that required for the interpretation of the 2 components as measures of physical and mental health. Also, the VT scale loads on both components equally and, therefore, fails to discriminate between them. The finding that the 2 components explain less than 60% of the variance in SF-36 scales (therefore almost half of the total information in SF-36 scale scores is lost) suggests that even stroke-specific algorithms for PCS and MCS summary scores may not be feasible.
Other authors have questioned the validity of SF-36 summary scores.41–43 They argue that orthogonal rotation of extracted factors (which assume factors are uncorrelated) is inappropriate because there is strong evidence that physical and mental health are related. They recommend oblique factor rotations (which assume factors are correlated). We repeated the factor analytic studies using oblique (promax) rotations and examined different methods of extraction (principal axis factoring). Similar results were generated. Because our failure to reproduce the higher order factor structure of SF-36 scales could be explained by the limited reliability and validity of the GH and SF scales, we repeated the analyses without them. This did not rectify the problem.
Results from this study replicate findings in multiple sclerosis44 and motor neuron disease.45 These 3 studies suggest the problem may be more fundamental than the method of factor rotation. Factor analysis is a data reduction technique that analyses the relationships (correlations or covariances) among a group of variables with the aim of identifying clusters of variables that are empirically distinct. Figure 2 demonstrates that the scales do not cluster as in other studies. Table 4 indicates the correlations among the 4 scales hypothesized to correlate highest with the physical health component (PF, RP, BP, and GH; range, 0.25 to 0.55; mean, 0.36), and those hypothesized to correlate highest with the mental component (MH, RE, SF, VT; range, 0.28 to 0.51; mean, 0.43), are notably lower than those reported by others (PCS=0.52 to 0.65, mean 0.57; MCS=0.44 to 0.63, mean 0.54; reference 18) and similar to those between physical and mental scales (range, 0.16 to 0.53; mean, 0.36). Consequently, relationships among the 8 health constructs measured by the SF-36 are disease specific.
This is a small study and its generalizability is uncertain. However, the SF-36 scale scores, their rank order, and the presence of floor and ceiling effects are similar to those reported in a large study of ischemic stroke patients for the UK collaborators of the International Stroke Trial.46 In addition, work from other neurological diseases demonstrates that results are consistent across diverse clinical samples.44 Nevertheless, the data indicate that this is a group of survivors of mild to moderate strokes. Severe strokes are underrepresented, as a result of language difficulty and difficulty returning to the clinics for appointments. Further studies are now essential to carefully evaluate the role of the SF-36 in stroke. This could be achieved from existing data sets.
It is also important to note that this study has only addressed scaling assumptions. These are fundamental criteria that should be satisfied before more detailed psychometric evaluations are undertaken. More extensive evaluations of SF-36 scales are required to determine the extent to which they are valid and responsive indicators of the health constructs they purport to measure. Finally, this study demonstrates that stroke has direct or secondary effects on the major aspects of health (physical, psychological, and social function). Other health domains, not addressed by generic measures, are almost certainly affected by stroke. Although studies developing disease-specific rating scales are beginning to define these domains,24,47 the health impact of stroke and the patterns of change in health over time remain poorly understood.
The findings of this study provide the beginnings of evidence-based guidance for the use of the SF-36 in stroke. Until more extensive data are available, we recommend that SF-36 GH and SF scale scores should be interpreted with caution because there is evidence from this small study that neither is a reliable and valid indicator. The RP, BP, SF, and RE scales might be poor choices as outcome measures for effectiveness and longitudinal studies because results from this study suggest they could underestimate health changes. Scores for the SF-36 PCS and MCS, and therefore the SF-12 from which only PCS and MCS scores can be constructed, cannot definitely be considered reliable and valid indicators of physical and mental health. This study also highlights the importance of comprehensive scale evaluation and underlines the need for caution when taking measures “off the shelf,” expecting that psychometric attributes have been fully tested or that measures are generally applicable.
Dr Hobart is funded by a grant from the UK National Health Service Health Technology Assessment Programme, but the views and opinions expressed do not necessarily reflect those of the NHS Executive. Dr Williams is supported by a Research Career Development Award, Department of Veterans Affairs, Health Services Research and Development. This work was performed, in part, in the Regenstrief Institute for Health Care.
- Received September 9, 2001.
- Revision received February 5, 2002.
- Accepted February 5, 2002.
- Ware JE Jr, Snow KK, Kosinski M, Gandek B. SF-36 Health Survey manual and interpretation guide. Boston, Mass: Nimrod Press; 1993.
- Jenkinson C, Fitzpatrick R, Swash M, Peto V, and the ALS-HPS Steering Group. The ALS Health Profile Study: quality of life of ALS patients and carers in Europe. J Neurol. 2000; 247: 35–40.
- Jenkinson C, Peto V, Fitzpatrick R, Greenhall R, Hyman N. Self-reported functioning and well-being in patients with Parkinson’s disease: comparison of the Short-Form Health Survey (SF-36) and the Parkinson’s Disease Questionnaire (PDQ-39). Age Ageing. 1995; 24: 505–509.
- Freeman JA, Langdon DW, Hobart JC, Thompson AJ. Health-related quality of life in people with multiple sclerosis undergoing inpatient rehabilitation. J Neurol Rehabil. 1996; 10: 185–194.
- Rothwell PM, McDowell Z, Wong CK, Dorman PJ. Doctors and patients don’t agree: cross sectional study of patients’ and doctors’ perceptions and assessments of disability in multiple sclerosis. BMJ. 1997; 314: 1580–1583.
- Cella DF, Dineen K, Arnason B, Reder A, Webster KA, Karabatsos G, Chang C, Lloyd S, Steward J, Stefoski D. Validation of the functional assessment of multiple sclerosis quality of life instrument. Neurology. 1996; 47: 129–139.
- Ware JE Jr, Kosinski MA, Keller SD. SF-36 Physical and Mental Health Summary Scales: A User’s Manual. Boston, Mass: The Health Institute, New England Medical Center; 1994.
- Anderson C, Laubscher S, Burns R. Validation of the Short Form 36 (SF-36) health survey questionnaire among stroke patients. Stroke. 1996; 27: 1812–1816.
- Dorman P, Dennis M, Sandercock P. How do scores on the EuroQol relate to scores on the SF-36 after stroke. Stroke. 1999; 30: 2146–2151.
- O’Mahony PG, Rodgers H, Thomson RG, Dobson R, James OF. Is the SF-36 suitable for assessing health status of older stroke patients? Age Ageing. 1998; 27: 19–22.
- Lord FM, Novick MR. Statistical Theories of Mental Test Scores. Reading, Mass: Addison-Wesley; 1968.
- Williams LS, Weinberger M, Harris LE, Clark DO, Biller J. Development of a stroke-specific quality of life scale. Stroke. 1999; 30: 1362–1369.
- Goldstein LB, Chilukuri V. Retrospective assessment of initial stroke severity with the Canadian Neurologic Scale. Stroke. 1997; 28: 1181–1184.
- Likert RA. A technique for the measurement of attitudes. Arch Psychol. 1932; 140: 5–55.
- Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. 2nd ed. Oxford: Oxford University Press; 1995.
- Nunnally JC, Bernstein IH. Psychometric Theory. 3rd ed. New York: McGraw-Hill; 1994.
- Ware JE Jr, Harris WJ, Gandek B, Rogers BW, Reese PR. MAP-R for Windows: Multitrait/Multi-item Analysis Program—Revised User’s Guide. Boston, Mass: Health Assessment Laboratory; 1997.
- Stewart AL, Ware JE Jr, eds. Measuring Functioning and Well-being: The Medical Outcomes Study Approach. Durham, NC: Duke University Press; 1992.
- Spector PE. Summated Rating Scale Construction: An Introduction. Newbury Park, Calif: Sage; 1992.
- Ware JE Jr, Kosinski M, Keller SD. How to Score the SF-12 Physical and Mental Summary Scales. Boston, Mass: The Health Institute, New England Medical Center; 1994.
- Brazier JE, Harper R, Jones NMB, O’Cathain A, Thomas KJ, Usherwood T, Westlake L. Validating the SF-36 health survey questionnaire: new outcome measure for primary care. BMJ. 1992; 305: 160–164.
- Hays RD, Prince-Embury S, Chen H. RAND-36 Health Status Inventory. San Antonio, Tex: Psychological Corporation; 1998.
- Hobart JC, Freeman JA, Lamping DL, Fitzpatrick R, Thompson AJ. The SF-36 in multiple sclerosis (MS): why basic assumptions must be tested. J Neurol Neurosurg Psychiatry. 2001; 71: 363–370.
- Dorman P, Slattery J, Farrell B, Dennis M, Sandercock P. A randomised comparison of the EuroQol and SF-36 after stroke: United Kingdom collaborators in the International Stroke Trial. BMJ. 1997; 315: 461.