Chapter 18: Patient-reported outcomes

Bradley C Johnston, Donald L Patrick, Tahira Devji, Lara J Maxwell, Clifton O Bingham III, Dorcas E Beaton, Maarten Boers, Matthias Briel, Jason W Busse, Alonso Carrasco-Labra, Robin Christensen, Bruno R da Costa, Regina El Dib, Anne Lyddiatt, Raymond W Ostelo, Beverley Shea, Jasvinder Singh, Caroline B Terwee, Paula R Williamson, Joel J Gagnier, Peter Tugwell, Gordon H Guyatt

Key Points:

Cite this chapter as: Johnston BC, Patrick DL, Devji T, Maxwell LJ, Bingham III CO, Beaton D, Boers M, Briel M, Busse JW, Carrasco-Labra A, Christensen R, da Costa BR, El Dib R, Lyddiatt A, Ostelo RW, Shea B, Singh J, Terwee CB, Williamson PR, Gagnier JJ, Tugwell P, Guyatt GH. Chapter 18: Patient-reported outcomes [last updated October 2019]. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.5. Cochrane, 2024. Available from www.training.cochrane.org/handbook.

18.1 Introduction to patient-reported outcomes

18.1.1 What are patient-reported outcomes?

A patient-reported outcome (PRO) is “any report of the status of a patient’s health condition that comes directly from the patient without interpretation of the patient’s response by a clinician or anyone else” (FDA 2009). PROs are one of several clinical outcome assessment methods that complement biomarkers, measures of morbidity (e.g. stroke, myocardial infarction), burden (e.g. hospitalization), and survival used and reported in clinical trials and non-randomized studies (FDA 2018).

Patient-reported outcome measures (PROMs) are instruments that are used to measure the PROs, most often self-report questionnaires. Although investigators may address patient-relevant outcomes via proxy reports or observations from caregivers, health professionals, or parents and guardians, these are not PROMs but rather clinician-reported or observer-reported outcomes (Powers et al 2017).

PROs provide crucial information for patients and clinicians facing choices in health care. Conducting systematic reviews and meta-analyses including PROMs and interpreting their results is not straightforward, and guidance can help review authors address the challenges.

The objectives of this chapter are to: (i) describe the category of outcomes known as PROs and their importance for healthcare decision making; (ii) illustrate the key issues related to reliability, validity and responsiveness that systematic review authors should consider when including PROs; and (iii) address the structure and content (domains, items) of PROs and provide guidance for combining information from different PROs. This chapter outlines a step-by-step approach to addressing each of these elements in the systematic review process. The focus is on the use of PROs in randomized trials, and what is crucial in this context when selecting PROs to include in a meta-analysis. The principles also apply to systematic reviews of non-randomized studies addressing PROs (e.g. dealing with adverse drug reactions).

18.1.2 Why patient-reported outcomes?

PROs provide patients’ perspectives regarding treatment benefit and harm, directly measure treatment benefit and harm beyond survival, major morbid events and biomarkers, and are often the outcomes of most importance to patients and families.

Self-reported outcomes often correlate poorly with physiological and other outcomes such as performance-related outcomes, clinician-reported outcomes, or biomarkers. In asthma, Yohannes and colleagues (Yohannes et al 1998) found that variability in exercise capacity contributed to only 3% of the variability in breathing problems on a patient self-report questionnaire. In chronic obstructive pulmonary disease (COPD), the reported correlations between forced expiratory volume (FEV1) and quality of life (QoL) are weak (r=0.14 to 0.41) (Jones 2001). In peripheral arterial occlusive disease, correlations between haemodynamic variables and QoL are low (e.g. r=–0.17 for QoL pain subscale and Doppler sonographic ankle/brachial pressure index) (Müller-Bühl et al 2003). In osteoarthritis, there is discordance between radiographic arthritis and patient-reported pain (Hannan et al 2000). These findings emphasize the often important limitations of biomarkers for informing the impact of interventions on the patient experience or the patient’s perspective of disease (Bucher et al 2014).
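To put correlations and "percentage of variability" on the same footing, the square of a correlation coefficient gives the proportion of variance that two measures share. The short sketch below applies this to the correlations quoted above; the function name is an illustrative assumption, and the calculation assumes a simple linear relationship between the two measures.

```python
def shared_variance_percent(r):
    """Percentage of variance shared by two measures with Pearson correlation r."""
    return round(r ** 2 * 100, 1)

# Correlations quoted in the text (COPD range and peripheral arterial disease)
for r in (0.14, 0.41, -0.17):
    print(f"r = {r:+.2f} -> about {shared_variance_percent(r)}% shared variance")
```

On this scale, even the strongest COPD correlation quoted (r=0.41) corresponds to less than one fifth of the variance in QoL being shared with FEV1.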

PROs are essential when externally observable patient-important outcomes are rare or unavailable. They provide the only reasonable strategy for evaluating the impact of treatment on many conditions, including pain syndromes, fatigue, disorders such as irritable bowel syndrome, sexual dysfunction and emotional function, and on adverse effects such as nausea and anxiety, for which physiological measurements are limited or unavailable.

18.2 Formulation of the review

In this section we describe PROMs in more detail and discuss some issues to consider when deciding which PROMs to address in a review.

A common term used in the health status measurement literature is construct. Construct refers to what PROMs are trying to measure, the concept that defines the PROM such as pain, physical function or depressive mood. Constructs are the postulated attributes of the person that investigators hope to capture with the PROM (Cronbach and Meehl 1955).

Many different ways exist to label and classify PROMs and the constructs they measure. For instance, reports from patients include signs (observable manifestations of a condition), sensations (most commonly classified as symptoms that may be attributable to disease and/or treatment), behaviours and abilities (commonly classified as functional status), general perceptions or feelings of well-being, general health, satisfaction with treatment, reports of adverse effects, adherence to treatment, and participation in social or community events and health-related quality of life (HRQoL).

Investigators can use different approaches to capture patient perspectives, including interviews, self-completed questionnaires and diaries, administered via different interfaces such as hand-held devices or computers. Review authors must identify the postulated constructs that are important to patients, determine the extent to which the PROMs used and reported in the trials address those constructs, assess the characteristics (measurement properties) of the PROMs used, and communicate this information to the reader (Calvert et al 2013).

Focusing now on HRQoL, an important PRO, some approaches attempt to cover the full range of health-related patient experience – including, for instance, self-care, and physical, emotional and social function – and thus enable comparisons between the impact of treatments on HRQoL across diseases or conditions. Authors often call these approaches generic instruments (Guyatt et al 1989, Patrick and Deyo 1989). These include utility measures such as the EuroQol five dimensions questionnaire (EQ-5D) or the Health Utilities Index (HUI). They also include health profiles such as the Short Form 36-item (SF-36) or the SF-12; these have come to dominate the field of health profiles (Tarlov et al 1989, Ware et al 1995, Ware et al 1996). An alternative approach to measuring PROs is to focus on much more specific constructs: PROMs may be specific to function (e.g. sleep, sexual function), to a disease (e.g. asthma, heart failure), to a population (e.g. the frail elderly) or to a symptom (e.g. pain, fatigue) (Guyatt et al 1989, Patrick and Deyo 1989). Another domain-specific measurement system now receiving attention is the Patient-Reported Outcomes Measurement Information System (PROMIS). PROMIS is a National Institutes of Health funded PROM programme using computerized adaptive testing from large item banks for over 70 domains (e.g. anxiety, depression, pain, social function) relevant to a wide variety of chronic diseases (Cella et al 2007, Witter 2016, PROMIS 2018).

Authors often use the terms ‘quality of life’, ‘health status’, ‘functional status’, ‘HRQoL’ and ‘well-being’ loosely and interchangeably. Systematic review authors must therefore consider carefully the constructs that the PROMs have actually measured. To do so, they may need to examine the items or questions included in a PROM.

Another issue to consider is whether and how the individual items of instruments are weighted. A number of approaches can be used to arrive at weights (Wainer 1976). Utility instruments designed for economic analysis put greater emphasis on item weighting, attempting ultimately to present HRQoL as a continuum anchored between death and full health. Many PROMs weight items equally in the calculation of the overall score, a reasonable approach. Readers can refer to a helpful overview of classical test theory and item response theory to understand better the merits and limitations of weighting (Cappelleri et al 2014).

Table 18.2.a presents a framework for considering and reporting PROMs in clinical trials, including their constructs and how they were measured. A good understanding of the PROMs identified in the included studies for a review is essential to appropriate analysis of outcomes across studies, and appraisal of the certainty of the evidence.

Table 18.2.a Checklist for describing and assessing PROMs in clinical trials. Adapted from Guyatt et al (1997)

1. What were the PROMs assessing?

1.1. What concepts or constructs were the PROMs used in the study assessing?

1.2. What rationale (if any) for selection of concepts or constructs did the authors provide?

1.3. Were patients involved in the development (e.g. focus groups, surveys) of PROMs?

2. Omissions

2.1. Were there any important aspects of patients’ health (e.g. symptoms, function, perceptions) or quality of life (e.g. overall evaluation, satisfaction with life) that were not reported in this study? A search for ‘Core Outcome Sets’ for the condition would be helpful (see Section 18.4.1).

3. What were the measurement strategies?

3.1. Did investigators use instruments that yield a single indicator or index number, or a profile, or a battery of instruments?

3.2. Did investigators use specific or generic measures, or both?

4. Did the instruments work in the way they were supposed to work – validity?

4.1. Was evidence of prior validation for use in the current population presented?

5. Did the instruments work in the way they were supposed to work – responsiveness?

5.1. Are the PROMs able to detect important change in patient status, even if those changes are small?

6. Can you make the magnitude of effect (if any) understandable to readers – interpretability?

6.1. If the intervention has had an apparent impact on a PROM, can you provide users with a sense of whether that effect is trivial, small but important, moderate, or large?

18.3 Appraisal of evidence

18.3.1 Measurement of PROs: single versus multiple time-points

To be useful, instruments must be able to distinguish between situations of interest (Boers et al 1998). When results are available for only one time-point (e.g. for classification), the key issue for PROMs is to be able to distinguish individuals with more desirable scores from those whose scores are less desirable. The key measurement issues in such contexts are reliability and cross-sectional construct validity (Kirshner and Guyatt 1985, Beaton et al 2016).

In longitudinal studies such as randomized trials, investigators usually obtain measurements at multiple time-points, for example at the beginning of the trial and again following administration of the interventions. In this context, PROMs must be able to distinguish those who have experienced positive changes over time from those who have experienced negative changes, those who experienced less positive change, or those who experienced no change at all, and to estimate accurately the magnitude of those changes. The key measurement issues in these contexts – sometimes referred to as evaluative – are responsiveness and longitudinal construct validity (Kirshner and Guyatt 1985, Beaton et al 2016).

18.3.2 Reliability

Intuitively, many think of reliability as obtaining the same scores on repeated administration of an instrument in stable respondents. That stability (or lack of measurement error) is important, but not sufficient. Satisfactory instruments must be able to distinguish between individuals despite measurement error.

Reliability statistics therefore look at the ratio of the variability between respondents (typically the numerator of a reliability statistic) to the total variability (the variability between respondents plus the variability within respondents). The most commonly used statistics to measure reliability are the kappa coefficient for categorical data, the weighted kappa coefficient for ordered categorical data, and the intraclass correlation coefficient for continuous data (de Vet et al 2011).
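The following minimal sketch illustrates two of these statistics on made-up data. It assumes a one-way random-effects intraclass correlation coefficient (one of several ICC forms) computed from repeated measurements in stable respondents, and an unweighted Cohen's kappa for two sets of categorical ratings. The function names and all data are hypothetical and are not drawn from any study cited here.

```python
from collections import Counter
from statistics import mean

def icc_oneway(scores):
    """One-way random-effects ICC.

    scores: one list per respondent, each holding k repeated measurements.
    """
    n, k = len(scores), len(scores[0])
    grand_mean = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    # Mean square between respondents
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Mean square within respondents (measurement error)
    msw = sum((x - m) ** 2 for row, m in zip(scores, row_means) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def cohen_kappa(first, second):
    """Unweighted Cohen's kappa for two equally long lists of categories."""
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_1, counts_2 = Counter(first), Counter(second)
    # Agreement expected by chance from the marginal frequencies
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical test-retest scores for four stable respondents
retest = [[10, 12], [20, 18], [30, 31], [40, 39]]
print(f"ICC   = {icc_oneway(retest):.2f}")

# Hypothetical yes/no answers on two occasions
print(f"kappa = {cohen_kappa(['y', 'y', 'n', 'n'], ['y', 'n', 'n', 'n']):.2f}")
```

In the ICC example the respondents differ widely from one another while their repeated scores differ little, so the between-respondent share of total variability, and hence the coefficient, is high.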

Limitations in reliability will be of most concern for the review author when randomized trials have failed to establish the superiority of an experimental intervention over a comparator intervention. The reason is that lack of reliability cannot create intervention effects that are not present, but can obscure true intervention effects as a result of random error. When a systematic review does not find evidence that an intervention affects a PROM, review authors should consider whether this may be due to poor reliability (e.g. if reliability coefficients are less than 0.7) rather than lack of an effect.
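One way to see how poor reliability can obscure an effect is a standard result from classical test theory: when measurement error is random and unrelated to the intervention, the standardized mean difference (SMD) expected with an imperfect instrument equals the error-free SMD multiplied by the square root of the reliability coefficient. The sketch below is illustrative only; the function name and the effect size of 0.50 are assumptions, not values from any cited study.

```python
from math import sqrt

def attenuated_smd(true_smd, reliability):
    """Expected observed SMD under classical test theory with random measurement error."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be greater than 0 and at most 1")
    return true_smd * sqrt(reliability)

# Hypothetical error-free effect of 0.50 standard deviation units
for reliability in (0.9, 0.7, 0.5):
    observed = attenuated_smd(0.50, reliability)
    print(f"reliability {reliability:.1f}: expected observed SMD {observed:.2f}")
```

Under these assumptions, an instrument with reliability 0.7 would be expected to show an effect of about 0.42 rather than 0.50, making a true effect harder to detect in trials of a given size.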

18.3.3 Validity

Validity has to do with whether the instrument is measuring what it is intended to measure. Content validity assessment involves patient and clinician evaluation of the relevance and comprehensiveness of the content contained in the measures, usually obtained through qualitative research with patients and families (Johnston et al 2012). Guidance is available on the assessment of content validity for PROMs used in clinical trials (Patrick et al 2011a, Patrick et al 2011b).

Construct validity involves examining the logical relationships that should exist between assessment measures. For example, in patients with COPD, we would expect that patients with lower treadmill exercise capacity generally will have more dyspnoea (shortness of breath) in daily life than those with higher exercise capacity, and we would expect to see substantial correlations between a new measure of emotional function and existing emotional function questionnaires.

When we are interested in evaluating change over time – that is, in the context of evaluation when measures are available both before and after an intervention – we examine correlations of change scores. For example, patients with COPD who deteriorate in their treadmill exercise capacity should, in general, show increases in dyspnoea, while those whose exercise capacity improves should experience less dyspnoea. Similarly, a new emotional function instrument should show concurrent improvement in patients who improve on existing measures of emotional function. The technical term for this process is testing an instrument’s longitudinal construct validity.

Review authors should look for evidence of the validity of PROMs used in clinical studies. Unfortunately, reports of randomized trials using PROMs seldom review or report evidence of the validity of the instruments they use. When such evidence is reported, review authors can gain some reassurance from statements (backed by citations) that the questionnaires have been previously validated; otherwise, they could seek additional published information on named PROMs. Ideally, review authors should look for systematic reviews of the measurement properties of the instruments in question. The Consensus-based standards for the selection of health measurement instruments (COSMIN) website offers a database of such reviews (COSMIN Database of Systematic Reviews). In addition, the Patient-Reported Outcomes and Quality of Life Instruments Database (PROQOLID) provides documentation of the measurement properties for over 1000 PROs.

If the validity of the PROMs used in a systematic review remains unclear, review authors should consider whether the PROM is an appropriate measure of the review’s planned outcomes, or whether it should be excluded (ideally, this would be considered at the protocol stage), and any included results should be interpreted with appropriate caution. For instance, in a review of flavonoids for haemorrhoids, authors of primary trials used PROMs to ascertain patients’ experience with pain and bleeding (Alonso-Coello et al 2006). Although the wording of these PROMs was simple and made intuitive sense, the absence of formal validation raises concerns over whether these measures can give meaningful data to distinguish between the intervention and its comparators.

A final concern about validity arises if the measurement instrument is used with a different population, or in a culturally and linguistically different environment from the one in which it was developed. Ideally, PROMs should be re-validated in each study, but systematic review authors should be careful not to be too critical on this basis alone.

18.3.4 Responsiveness

In the evaluative context, randomized trial participant measurements are typically available before and after the intervention. PROMs must therefore be able to distinguish among patients who remain the same, improve or deteriorate over the course of the trial (Guyatt et al 1987, Revicki et al 2008). Authors often refer to this measurement property as responsiveness; alternatives are sensitivity to change or ability to detect change.

As with reliability, responsiveness becomes an issue when a meta-analysis suggests no evidence of a difference between an intervention and control. An instrument with a poor ability to measure change can result in false-negative results, in which the intervention improves how patients feel, yet the instrument fails to detect the improvement. This problem may be particularly salient for generic questionnaires that have the advantage of covering all relevant areas of HRQoL, but the disadvantage of covering each area superficially or without the detail required for the particular context of use (Wiebe et al 2003, Johnston et al 2016a). Thus, in studies that show no difference in PROMs between intervention and control, lack of instrument responsiveness is one possible reason. Review authors should look for published evidence of responsiveness. If there is an absence of prior evidence of responsiveness, this represents a potential reason for being less certain about evidence from a series of randomized trials. For instance, a systematic review of respiratory muscle training in COPD found no effect on patients’ function. However, two of the four studies that assessed a PROM used instruments without established responsiveness (Smith et al 1992).

18.3.5 Reporting bias

Studies focusing on PROs often use a number of PROMs to measure the same or similar constructs. This situation creates a risk of selective outcome reporting bias, in which trial authors select for publication a subset of the PROMs on the basis of the results; that is, those that indicate larger intervention effects or statistically significant P values (Kirkham et al 2010). Further detailed discussion of selective outcome reporting is presented in Chapter 7 (Section 7.2.3.3); see also Chapter 8 (Section 8.7).

Systematic reviews focusing on PROs should be alert to this problem. When only a small number of eligible studies have reported results for a particular PROM, particularly if the PROM is mentioned in a study protocol or methods section, or if it is a salient outcome that one would expect conscientious investigators to measure, review authors should note the possibility of reporting bias and consider rating down certainty in evidence as part of their GRADE assessment (see Chapter 14) (Guyatt et al 2011). For instance, authors of a systematic review evaluating the responsiveness of PROs among patients with rare lysosomal storage diseases encountered eligible studies in which the use of a PRO was described in the methods, but there were either no data or limited PRO data in the results. When authors did present some information about results, the reports sometimes included only interim or end-of-study results. Such instances are likely to be an indication of selective outcome reporting bias: it seems implausible that, if results showed apparent benefit on PROs, investigators would mention a PRO in the methods and subsequently fail to report results (Johnston et al 2016b).

18.4 Synthesis and interpretation of evidence

18.4.1 Selecting from multiple PROMs

The definition of a particular PRO may vary between studies, and this may justify use of different instruments (i.e. different PROMs). Even if the definitions are similar (or if, as happens more commonly, the investigators do not define the PRO), the investigators may choose different instruments to measure the PROs, especially if there is a lack of consensus on which instrument to use (Prinsen et al 2016).

When trials report results for more than one instrument, authors should – independent of knowledge of the results and ideally at the protocol stage – create a hierarchy based on reported measurement properties of PROMs (Tendal et al 2011, Christensen et al 2015), considering a detailed understanding of what each PROM measures (see Table 18.2.a), and its demonstrated reliability, validity, responsiveness and interpretability (see Section 18.3). This will allow authors to decide which instruments will be used for data extraction and synthesis. For example, the following instruments are all validated, patient-reported pain instruments that an investigator may use in a primary study to assess an intervention’s usefulness for treating pain:

In some clinical fields core outcome sets are available to guide the use of appropriate PROs (COMET 2018). Only rarely do these include specific guidance on which PROMs are preferable, although methods have been proposed for this (Prinsen et al 2016). Within the field of rheumatology, the Outcome Measures in Rheumatology (OMERACT) initiative has developed a conceptual framework known as OMERACT Filter 2.0 to identify both core domain sets (what outcome should be measured) and core outcome measurement sets (how the outcome should be measured, i.e. which PROM to use) (Boers et al 2014). This is a generic framework and applicable to those developing core outcome sets outside the field of rheumatology.

As an example of a pre-defined hierarchy, for knee osteoarthritis, OMERACT has used a published hierarchy based on responsiveness for extraction of PROMs evaluating pain and physical function for performing systematic reviews (Juhl et al 2012).

Authors should decide in advance whether to exclude PROMs not included in the hierarchy, or to include additional measures where none of the preferred measures are available.
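Applying a pre-specified hierarchy at data extraction can be sketched as follows. This is an illustration only: the instrument names, trial labels and function name are hypothetical placeholders, not a published hierarchy. For each trial, the highest-ranked instrument that the trial reports is selected; trials reporting none of the ranked instruments are flagged so that the decision made in advance (exclude, or include an additional measure) can be applied.

```python
def select_prom(reported, hierarchy):
    """Return the highest-ranked instrument a trial reported, or None if none is ranked."""
    for instrument in hierarchy:  # hierarchy is ordered from most to least preferred
        if instrument in reported:
            return instrument
    return None

# Hypothetical hierarchy fixed at the protocol stage, before seeing any results
pain_hierarchy = ["Instrument A", "Instrument B", "Instrument C"]

# Hypothetical instruments reported by each trial
trials = {
    "Trial 1": ["Instrument C", "Instrument A"],
    "Trial 2": ["Instrument B"],
    "Trial 3": ["Instrument X"],
}

for trial, reported in trials.items():
    choice = select_prom(reported, pain_hierarchy)
    if choice is None:
        print(f"{trial}: no ranked instrument reported; apply the pre-specified rule")
    else:
        print(f"{trial}: extract {choice}")
```

Because the selection depends only on the ordered list and on which instruments were reported, it can be carried out without reference to the trial results.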

18.4.2 Synthesizing data from multiple PROMs

While a hierarchy can be helpful in identifying the review authors’ preferred measures, and excluding some measures considered inappropriate, it remains likely that authors will encounter studies using several different PROMs to measure a given construct, either within one study or across multiple studies. Authors must then decide how to approach synthesis of multiple measures, and among them, consider which measures to include in a single meta-analysis on a particular construct (Tendal et al 2011, Christensen et al 2015).

When deciding if statistical synthesis is appropriate, review authors will often find themselves reading between the lines to try to get a precise notion of the underlying construct for the PROMs used. They may have to consult the articles that describe the development and prior use of PROMs included in the primary studies, or look at the instruments to understand the concepts being measured.

For example, authors of a Cochrane Review of cognitive behavioural therapy (CBT) for tinnitus included HRQoL as a PRO (Martinez-Devesa et al 2007), assessed with different PROMs: four trials using the Tinnitus Handicap Questionnaire; one trial the Tinnitus Questionnaire; and one trial the Tinnitus Reaction Questionnaire. Review authors compared the content of the PROMs and concluded that statistical pooling was appropriate.

The most compelling evidence regarding the appropriateness of including different PROMs in the same meta-analysis would come from a finding of substantial correlations between the instruments. For example, the two major instruments used to measure HRQoL in patients with COPD are the Chronic Respiratory Questionnaire (CRQ) and the St. George’s Respiratory Questionnaire (SGRQ). Correlations between the two questionnaires in individual studies have varied from 0.3 to 0.6 in both cross-sectional (correlations at a point in time) and longitudinal (correlations of change) comparisons (Rutten-van Mölken et al 1999, Singh et al 2001, Schünemann et al 2003, Schünemann et al 2005). In one study, investigators examined the correlations between group mean changes in the CRQ and SGRQ in 15 studies including 23 patient groups and found a correlation of 0.88 (Puhan et al 2006).
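The two kinds of comparison described above can be sketched as follows, using a Pearson correlation on scores at one time-point (cross-sectional) and on change scores (longitudinal). All data and names are hypothetical, and both instruments are assumed to be scored in the same direction; where a higher score means better health on one instrument and worse health on the other, the correlation would be expected to be negative and its absolute value would be the quantity of interest.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient for two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def change_scores(before, after):
    """Follow-up score minus baseline score for each patient."""
    return [a - b for b, a in zip(before, after)]

# Hypothetical scores for five patients on two instruments
a_before, a_after = [40, 55, 35, 60, 50], [48, 57, 45, 61, 58]
b_before, b_after = [3.0, 4.5, 3.5, 5.0, 4.0], [3.8, 4.6, 4.4, 5.3, 4.5]

cross_sectional = pearson(a_before, b_before)
longitudinal = pearson(change_scores(a_before, a_after), change_scores(b_before, b_after))
print(f"cross-sectional r = {cross_sectional:.2f}, longitudinal r = {longitudinal:.2f}")
```

The same function can be applied to group mean changes across studies rather than to individual patients, which is the kind of comparison that produced the higher correlation reported by Puhan and colleagues.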

Ideally, the decision to combine scores from different PROMs would be based not only on their measuring similar constructs but also on their satisfactory validity and, depending on whether measurements were available before and after the intervention or only after it, on their responsiveness or reliability, respectively. For example, extensive evidence of validity is available for both the CRQ and the SGRQ. The CRQ has, however, proved more responsive than the SGRQ: in an investigation that included 15 studies using both instruments, standardized response means of the CRQ (median 0.51, interquartile range (IQR) 0.19 to 0.98) were significantly higher (P <0.001) than those associated with the SGRQ (median 0.26, IQR −0.03 to 0.40) (Puhan et al 2006). As a result, pooling results from trials using these two instruments could lead to underestimates of intervention effect in studies using the SGRQ (Puhan et al 2006, Johnston et al 2010). This can be tested using a sensitivity analysis of studies using the more responsive versus less responsive instrument.
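For readers unfamiliar with the standardized response mean (SRM), it is conventionally calculated as the mean change score divided by the standard deviation of the change scores, so that instruments scored on different scales can be compared for responsiveness. A minimal sketch under that definition, with hypothetical data and an illustrative function name:

```python
from statistics import mean, stdev

def standardized_response_mean(before, after):
    """Mean change divided by the sample standard deviation of the change scores."""
    changes = [a - b for b, a in zip(before, after)]
    return mean(changes) / stdev(changes)

# Hypothetical baseline and follow-up scores for four patients on one instrument
before = [10, 12, 14, 16]
after = [12, 13, 17, 18]
print(f"SRM = {standardized_response_mean(before, after):.2f}")
```

A larger SRM for one instrument than for another in the same patients suggests that the first instrument is the more responsive, which is the comparison summarized above for the CRQ and SGRQ.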

Usually, detailed data such as those described above will be unavailable. Investigators must then fall back on intuitive decisions about the extent to which different instruments are measuring the same underlying concept. For example, the authors of a meta-analysis of psychosocial interventions in the treatment of pre-menstrual syndrome faced a profusion of outcome measures, with 25 PROMs used in their nine eligible studies (Busse et al 2009). They dealt with this problem by having two experienced clinical researchers, knowledgeable about the study area and not otherwise involved in the review, independently examine each instrument – including all domains – and group 16 PROMs into six discrete conceptual categories. Any discrepancies were resolved by discussion to achieve consensus. Table 18.4.a details the categories and the included instruments within each category.

Authors should follow the guidance elsewhere in this Handbook on appropriate methods of synthesizing different outcome measures in a single analysis (Chapter 10) and interpreting these results in a way that is most meaningful for decision makers (Chapter 15).

Table 18.4.a Examples of potentially combinable PROMs measuring similar constructs from a review of psychosocial interventions in the treatment of pre-menstrual syndrome (Busse et al 2009). Reproduced with permission of Karger

Beck Anxiety Inventory

Menstrual Symptom Diary-Anxiety domain