Generalizing Evidence From Randomized Clinical Trials to Target Populations
Abstract
Properly planned and conducted randomized clinical trials remain susceptible to a lack of external validity. The authors illustrate a model-based method to standardize observed trial results to a specified target population using a seminal human immunodeficiency virus (HIV) treatment trial, and they provide Monte Carlo simulation evidence supporting the method. The example trial enrolled 1,156 HIV-infected adult men and women in the United States in 1996, randomly assigned 577 to a highly active antiretroviral therapy and 579 to a largely ineffective combination therapy, and followed participants for 52 weeks. The target population was US people infected with HIV in 2006, as estimated by the Centers for Disease Control and Prevention. Results from the trial apply, albeit muted by 12%, to the target population, under the assumption that the authors have measured and correctly modeled the determinants of selection that reflect heterogeneity in the treatment effect. In simulations with a heterogeneous treatment effect, a conventional intent-to-treat estimate was biased with poor confidence limit coverage, but the proposed estimate was largely unbiased with appropriate confidence limit coverage. The proposed method standardizes observed trial results to a specified target population and thereby provides information regarding the generalizability of trial results.
Properly planned and conducted randomized clinical trials (henceforth referred to as trials) typically provide stronger internal validity than observational study designs, such as prospective cohort studies. Such trials accomplish heightened internal validity by ensuring that the conditions necessary for proper inference are met. Specifically, trials ensure consistency (1–3) and positivity (1, 4) by design and no unmeasured confounding in expectation by randomization (5, 6). Trials and cohort studies constrain the amount of selection bias (7) due to dropout when near-complete patient follow-up is attained. However, even such trials are susceptible to a lack of external validity, or generalizability (8, 9), as recently discussed (10–12). This susceptibility is a function of the extent to which trial participants do not represent the target population. For an example of when trials might selectively enroll from the target population, a recent study (13) applied eligibility criteria from 32 human immunodeficiency virus (HIV) trials (largely funded by the National Institutes of Health) to the Women's Interagency HIV Study (14) (the largest observational cohort of HIV-infected women in the United States) and found that, across trials, a median of 58% of women would have been eligible for a given trial (range, 32.4%–100%).
In simple settings, trial results may be mapped to a target population by using nonparametric direct standardization (15, 16 (p. 49)). However, when there are many covariates, or some covariates are continuous, direct standardization will fail. Here, we illustrate a model-based method to standardize observed trial results to a specified target population. Thereby, this method provides information regarding the generalizability of the trial results to the specified target population. We apply the method to the AIDS (acquired immunodeficiency syndrome) Clinical Trial Group (ACTG) 320 study (17), a landmark trial in HIV care that compared a novel highly active antiretroviral therapy combination (henceforth referred to as treatment) with a largely ineffective existing therapy combination (henceforth referred to as control). In addition, we provide a limited Monte Carlo simulation evaluation of the proposed method.
MATERIALS AND METHODS
Study population
Between January 1996 and January 1997, 1,156 patients were recruited from 33 AIDS clinical trial units and 7 National Hemophilia Foundation sites in the United States and Puerto Rico (17). Eligible patients were 1) at least 16 years of age, 2) HIV positive, 3) immunosuppressed (i.e., CD4 cell count <201 cells/mm³), 4) experienced with antiretroviral therapy (i.e., at least 3 months of prior zidovudine use), and 5) able to care for themselves (i.e., Karnofsky performance test score ≥70). Patients were excluded if they had a week or more prior treatment with the nucleoside reverse transcriptase inhibitor lamivudine or had any prior treatment with protease inhibitors. Institutional review boards at each of the participating institutions approved the study protocol, and written informed consent was given by all study patients. The public-use ACTG 320 data set was used in the present study and is available from the National Technical Information Service (http://www.ntis.gov/).
At enrollment, patients were stratified by CD4 cell count (i.e., 0–50 vs. 51–200 cells/mm³) and were randomly assigned with equal allocation to the treatment group (n = 577) or control group (n = 579) (17). The therapy for the control group consisted of the 2 nucleoside reverse transcriptase inhibitors zidovudine and lamivudine, whereas the therapy for the treatment group consisted of the same 2 nucleoside reverse transcriptase inhibitors plus the protease inhibitor indinavir. Characteristics of the 1,156 trial patients are given in Table 1.
Table 1.
Characteristics of 1,156 HIV-infected Patients in the AIDS Clinical Trial Group 320 Study in 1996–1997 Followed for 1 Year and of the Estimated 54,220 HIV-infected Individuals in the United States in 2006
Characteristic | Trial Patients, No. | Trial Patients, % | US Population, No. | US Population, % |
Age, years | 38 (33, 44) | | NA | |
Age group, years | | | | |
13–29 | 106 | 9 | 18,500 | 34 |
30–39 | 515 | 45 | 16,740 | 31 |
40–49 | 388 | 34 | 13,370 | 25 |
≥50 | 147 | 13 | 5,610 | 10 |
Male sex | 956 | 83 | 39,810 | 73 |
Race | | | | |
White, non-Hispanic | 623 | 54 | 19,580 | 36 |
Black, non-Hispanic | 328 | 28 | 24,920 | 46 |
Hispanic | 205 | 18 | 9,720 | 18 |
CD4 count, cells/mm³ | 75 (33, 137) | | NA | |
Abbreviations: AIDS, acquired immunodeficiency syndrome; HIV, human immunodeficiency virus; NA, not available.
After randomization, patients were monitored with study visits at weeks 4, 8, and 16, and every 8 weeks thereafter, until a first occurrence of an AIDS-defining illness, death, or the planned end of follow-up at 52 weeks. Fifty-one of 1,156 patients (4%) dropped out during follow-up. Of the 51 dropouts, 20 and 31 were in the treatment and control groups, respectively. Ninety-six of 1,156 patients (8%) incurred endpoints: 70 developed AIDS, and 26 died. Of the 96 endpoints, 33 were observed in the treatment group and 63 in the control group. Noncompliance is ignored here; it was previously described (17), and methods to account for noncompliance (18, 19) revealed only a modest difference from the hazard ratio obtained by intent-to-treat (20).
Target population
For illustrative purposes, we chose as the target population the US estimate of the number of people infected with HIV in 2006. This estimate was provided by the Centers for Disease Control and Prevention (21, 22). HIV incidence is not directly measured in the United States. However, innovative immunoassays are able to distinguish between recent and established infections, allowing estimates of HIV incidence (23–25). Information on newly diagnosed HIV cases in 22 states was reported to the Centers for Disease Control and Prevention for 2006. Remnant diagnostic serum specimens from patients aged 13 years or older were tested with an immunoassay to classify infections as recent or established. HIV incidence was estimated by using a statistical approach with adjustment for testing frequency and was extrapolated to the United States (26). Characteristics of the target population are also provided in Table 1. For this target population, we did not have individual-level data but did have the joint distribution (i.e., cross-classification) of select characteristics, namely, sex, race, and age groups.
Statistical methods
We begin with a description of the notation we will use. Uppercase letters denote random variables, and lowercase letters denote realizations of random variables or constants. Let Ti* = Ti ∧ Ci, the minimum of Ti and Ci, where Ti and Ci are positive, real-valued times to the event of interest and right censoring, respectively, for population member i = 1 to n. We assume here that right censoring is not informative, or formally that f(T) = f(T | C), where f(.|.) is the conditional density function. Let Yi = 1 denote the occurrence of the event of interest (i.e., Ti* = Ti).
A population-level treatment effect is a comparison of potential event times across different levels of a treatment, say X = x. Formally, this is a comparison of the distribution of Ti1 and Ti0, where Tx is the potential event time under treatment x. One way to quantify this comparison is to imagine a Cox proportional hazards model (27) on the potential event times as hTx(t) = h0(t)×exp(αx), where the estimand exp(α) is the ratio of the hazard had the population been exposed to the treatment to the hazard had the population been exposed to the control.
Let Si = 1 denote selection from the target population into the trial sample of Σi Si patients. Where Si = 1, let Xi = 1 denote random assignment to the treatment group and Xi = 0 to the control group. Typically, a treatment effect is estimated in the trial sample, perhaps by using an analogous Cox model, hT(t) = h0(t) × exp(βXi), where estimation is by Cox's partial likelihood (28). The log hazard ratio in the trial sample β will not generally equal the population estimand α, except under conditions described in the Appendix. A derivation of the bias in the trial sample estimate of the population effect, defined as differences in means or proportions, is also given in the Appendix. Next, we describe the use of inverse probability-of-selection weights, which are an extension of Horvitz-Thompson weights (29). Such weights have been used extensively in survey sampling (30–32) and for confounder control (33), and they have been discussed in the context of selection bias (7, 19, 34, 35) and response bias in 2-phase studies (36).
Define the inverse probability-of-selection weight as
$$W_i = \begin{cases} \dfrac{P(S_i = 1)}{P(S_i = 1 \mid Z_i)} & \text{if } S_i = 1, \\ 0 & \text{otherwise,} \end{cases}$$
where P(.|.) is the conditional probability function. Let Z be an n-by-p matrix of discrete or continuous variables that describe the composition of the target population. For instance, in the simplest form, say the target population may be described by only a single binary characteristic such as sex, Z = Zi = 0 or 1. In our example, the target population is described by the complete cross-classification of sex, race, and age groups.
From the weight definition above, zero weights are given to target population members who are not selected into the trial sample, and real-valued positive weights are given to members who are selected. For selected members, the numerator of the weights, which is an estimate of the marginal probability of being selected, implies that E(Wi|Si = 1) = 1, where E(.|.) is the conditional expectation. The numerator is used to ensure that the weighted sample remains the same size as the observed sample in expectation. For selected members, the denominator of the weights is an estimate of the probability of being selected into the sample conditional on a vector of measured characteristics Z. The weights Wi are therefore inversely proportional to an estimate of the conditional probability of being selected. On the basis of the findings given in the Appendix, the collection of characteristics Z is chosen, based on prior knowledge and data exploration, to include factors 1) on which the trial sample differs from the target population and 2) for which there is heterogeneity in the effect of treatment on the outcome of interest.
The conditional probabilities for both the numerator and denominator of the weights were obtained by using linear-logistic regression models, specifically,
$$P(S_i = 1) = \frac{1}{1 + \exp(-\delta)} \quad \text{and} \quad P(S_i = 1 \mid Z_i) = \frac{1}{1 + \exp(-Z_i \phi)},$$
where 1/[1 + exp(−δ)] is the marginal probability of being selected into the trial sample from the target population; Zi includes a column of 1’s for the intercept; and exp(φk), for k = 1 to p, are the odds ratios for being selected for each component of the n-by-p covariate matrix Z (the φk themselves are log odds ratios). In the models used for the ACTG 320 trial data, we included the characteristics themselves, as well as product terms to account for the joint distribution.
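As a rough illustration of how such weights behave, the sketch below computes stabilized weights for age group alone from the counts in Table 1. It assumes (the source does not spell this out) that trial patients are stacked with the target pseudo-population, with S = 1 for trial rows and S = 0 for target rows. With a single categorical Z, a saturated logistic model reproduces the observed cell proportions, so no iterative fitting is needed. The function and variable names are hypothetical, and the published analysis used the full cross-classification of sex, race, and age group rather than age alone.

```python
# Counts by age group taken from Table 1 (trial patients and estimated US population).
AGE_GROUPS = ["13-29", "30-39", "40-49", ">=50"]
TRIAL_COUNTS = [106, 515, 388, 147]
TARGET_COUNTS = [18500, 16740, 13370, 5610]


def selection_weights(trial_counts, target_counts):
    """Return stabilized inverse probability-of-selection weights, one per stratum.

    Assumption (not stated in the source): trial rows (S = 1) are stacked with
    target pseudo-population rows (S = 0), and the selection model is saturated,
    so fitted probabilities equal the observed cell proportions.
    """
    n_trial = sum(trial_counts)
    n_all = n_trial + sum(target_counts)
    p_marginal = n_trial / n_all  # estimate of P(S = 1)
    weights = []
    for n1, n0 in zip(trial_counts, target_counts):
        p_conditional = n1 / (n1 + n0)  # estimate of P(S = 1 | Z = z)
        weights.append(p_marginal / p_conditional)
    return weights


weights = selection_weights(TRIAL_COUNTS, TARGET_COUNTS)

# Average weight over the trial patients; stabilization should make this 1.
mean_weight = sum(w * n for w, n in zip(weights, TRIAL_COUNTS)) / sum(TRIAL_COUNTS)

for group, w in zip(AGE_GROUPS, weights):
    print(f"{group}: W = {w:.2f}")
print(f"mean weight among trial patients = {mean_weight:.3f}")
```

Under these assumptions the youngest age group, which is underrepresented in the trial, receives a weight well above 1, and the remaining groups receive weights below 1.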
An inverse probability-of-selection-weighted Cox proportional hazards model may be fit by using the following weighted partial likelihood:
$$L(\gamma) = \prod_{i=1}^{n} \left[ \frac{\exp(\gamma X_i)}{\sum_{k=1}^{n} R_k(t_i)\, W_k \exp(\gamma X_k)} \right]^{W_i Y_i},$$
where Rk(ti) = 1 if patient k is at risk for the event at the event time for patient i, namely ti. The resultant log hazard ratio, γ, provides a consistent, asymptotically normal estimate of the population treatment effect α under the assumption that the model for the denominator of the selection weight includes all characteristics that both 1) differ between trial sample and target population and 2) demonstrate heterogeneity in the treatment effect. A proof of the consistency of the proposed method for the special case of the difference in means or proportions is provided in the Appendix. Throughout, hazard ratios are used to measure the strength of association, 95% confidence limits are used to measure precision, robust variances (37–39) are used in conjunction with weighted Cox models (40), and confidence limit ratios are used to compare precision across estimates. The confidence limit ratio is simply the ratio of the upper to the lower confidence limit. The proportional hazards assumption appeared reasonable in these data under the original intent-to-treat analysis (P for heterogeneity = 0.263) and the proposed weighted analysis (P for heterogeneity = 0.211).
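The weighted partial likelihood can be sketched in a few lines for a single treatment indicator. This is a minimal illustration, not the authors' software: the function names are hypothetical, ties are handled in the Breslow fashion (an assumption), the maximizer is a simple golden-section search, and the robust variance that the text pairs with weighted Cox models is not computed here.

```python
import math


def weighted_log_partial_likelihood(gamma, times, events, x, w):
    """Weighted Cox log partial likelihood for a single covariate x.

    times  : observed times (event or censoring)
    events : 1 if the event was observed, 0 if censored
    x      : treatment indicator (0 or 1)
    w      : selection weights for the trial patients
    """
    total = 0.0
    for i in range(len(times)):
        if events[i] != 1:
            continue
        # Weighted sum over everyone still at risk at the event time of patient i.
        risk_sum = sum(
            w[k] * math.exp(gamma * x[k])
            for k in range(len(times))
            if times[k] >= times[i]
        )
        total += w[i] * (gamma * x[i] - math.log(risk_sum))
    return total


def maximize(times, events, x, w, lo=-5.0, hi=5.0, tol=1e-8):
    """Golden-section search for the maximizing gamma (the function is concave)."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        fc = weighted_log_partial_likelihood(c, times, events, x, w)
        fd = weighted_log_partial_likelihood(d, times, events, x, w)
        if fc < fd:
            a = c  # the maximum lies to the right of c
        else:
            b = d  # the maximum lies to the left of d
    return (a + b) / 2


# Toy data (hypothetical, for illustration only).
toy_times = [1, 2, 3, 4, 5, 6]
toy_events = [1, 1, 1, 1, 0, 1]
toy_x = [0, 1, 0, 0, 1, 1]
gamma_unweighted = maximize(toy_times, toy_events, toy_x, [1.0] * 6)
gamma_rescaled = maximize(toy_times, toy_events, toy_x, [2.0] * 6)
```

Multiplying every weight by the same constant leaves the maximizing value unchanged, which is why stabilizing the weights affects the effective sample size but not the point estimate.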
RESULTS
The intent-to-treat analysis of the ACTG 320 trial found a hazard ratio for AIDS or death of 0.51 (95% confidence limits: 0.33, 0.77) for the 577 people randomly assigned to the treatment group relative to the 579 randomly assigned to the control group. In the trial, older age was associated with higher incidence of AIDS or death (P for trend = 0.0315); compared with the age group 13–29 years, the hazard ratios for the age groups 30–39, 40–49, and ≥50 were 1.33 (95% confidence limits: 0.56, 3.15), 1.43 (95% confidence limits: 0.59, 3.44), and 2.32 (95% confidence limits: 0.92, 5.82), respectively. However, neither male sex (hazard ratio = 0.98, 95% confidence limits: 0.55, 1.73) nor race was strongly associated with incident AIDS or death (compared with white non-Hispanics, the hazard ratio for black non-Hispanics was 0.77 (95% confidence limits: 0.46, 1.28) and for Hispanics was 1.19 (95% confidence limits: 0.71, 2.00)).
Table 2 presents adjusted odds ratios for selection into the trial from the estimated US population infected in 2006. For presentation in Table 2, we omitted product terms between components of Z, but such terms are included for construction of W, as noted above. Males, those of white race or Hispanic ethnicity (compared with black race), and those aged 30 years or older were more likely to be selected into the trial.
Table 2.
Odds Ratios and 95% Confidence Limits for Selection Into the AIDS Clinical Trial Group 320 Study in 1996–1997 From the Estimated US Population Infected With HIV in 2006
Characteristic | Odds Ratio | 95% CL |
Age group, years | | |
13–29 | 1 | |
30–39 | 4.93 | 4.02, 6.12 |
40–49 | 4.75 | 3.82, 5.89 |
≥50 | 4.29 | 3.34, 5.52 |
Male sex | 1.51 | 1.29, 1.77 |
Race | ||
White, non-Hispanic | 1 | |
Black, non-Hispanic | 0.51 | 0.45, 0.59 |
Hispanic | 1.53 | 1.28, 1.83 |
Abbreviations: AIDS, acquired immunodeficiency syndrome; CL, confidence limit; HIV, human immunodeficiency virus.
Table 3 presents the hazard ratios and 95% confidence limits applicable to the trial patients as well as for the target population. As expected, based on the results in Table 2 and the age-stratified trial results in Table 3, when accounting only for the difference in age between the trial sample and target population, the hazard ratio was markedly muted from 0.51 to 0.68 because the trial selected for older population members, for whom the treatment effect appeared stronger. Similar results can be obtained by use of direct standardization when the dimension of Z is low. For instance, if the age-stratified hazard ratios (i.e., 1.87, 0.21, 0.84, and 0.59; Table 3) are log-transformed and combined by using the target population frequency distribution (i.e., 0.34, 0.31, 0.25, and 0.10; Table 1), the antilog of the direct standardized estimate is 0.69, which is similar to our model-based standardized estimate of 0.68. Furthermore, when we accounted only for the difference in sex between the trial sample and target population, the hazard ratio was slightly weaker because (as shown in Tables 2 and 3) the trial selected for males and the treatment effect appeared stronger in males. Finally, when we accounted only for the difference in race/ethnicity between the trial sample and target population, the hazard ratio was stronger because the trial selected against blacks and the treatment effect appeared stronger in blacks.
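The direct standardization arithmetic described above can be checked with a short calculation. The sketch uses only the age-stratified hazard ratios from Table 3 and the target age distribution from Table 1; the variable names are illustrative.

```python
import math

# Age-stratified hazard ratios (Table 3) and target population age distribution (Table 1),
# ordered as 13-29, 30-39, 40-49, and >=50 years.
stratum_hazard_ratios = [1.87, 0.21, 0.84, 0.59]
target_proportions = [0.34, 0.31, 0.25, 0.10]

# Weighted average of the log hazard ratios, then back-transform.
standardized_log_hr = sum(
    p * math.log(hr) for p, hr in zip(target_proportions, stratum_hazard_ratios)
)
standardized_hr = math.exp(standardized_log_hr)

print(f"directly standardized hazard ratio = {standardized_hr:.2f}")
```

Averaging on the log scale keeps the combination symmetric for ratios above and below 1.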
Table 3.
Hazard Ratios and 95% Confidence Limits for Incident AIDS or Death Within 1 Year for Patients in the AIDS Clinical Trial Group 320 Study in 1996–1997 and for the Population of Individuals Infected With HIV in 2006, United States
Analysis | Hazard Ratio | 95% CL | CL Ratio |
Trial results | | | |
Intent-to-treat | 0.51 | 0.33, 0.77 | 2.33 |
Age-group stratified, years | | | |
13–29 | 1.87 | 0.34, 10.2 | |
30–39 | 0.21 | 0.09, 0.48 | |
40–49 | 0.84 | 0.41, 1.70 | |
≥50 | 0.59 | 0.24, 1.45 | |
Sex stratified | | | |
Male | 0.47 | 0.29, 0.74 | |
Female | 0.76 | 0.28, 2.10 | |
Race stratified | | | |
White, non-Hispanic | 0.59 | 0.34, 1.01 | |
Black, non-Hispanic | 0.30 | 0.11, 0.83 | |
Hispanic | 0.54 | 0.22, 1.36 | |
Population results | |||
Age weighted | 0.68 | 0.39, 1.17 | 3.00 |
Sex weighted | 0.53 | 0.34, 0.82 | 2.41 |
Race weighted | 0.46 | 0.29, 0.72 | 2.48 |
Age-sex-race weighted | 0.57 | 0.33, 1.00 | 3.03 |
Abbreviations: AIDS, acquired immunodeficiency syndrome; CL, confidence limit; HIV, human immunodeficiency virus.
All weighted estimates have wider confidence intervals than those in the trial, expressed in Table 3 as confidence limit ratios. The wider interval widths reflect the difference between the trial sample and target population. In fact, for the 3 single-attribute-weighted estimates in Table 3, the ranking of the confidence limit ratios accords with the distance between the hazard ratio for the trial sample and the hazard ratio for the target population.
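The confidence limit ratios and their ranking can be reproduced from the limits reported in Table 3. In this sketch, "distance" is taken to be the absolute difference between the weighted and intent-to-treat hazard ratios, which is an assumption; the source does not define the distance measure.

```python
# (hazard ratio, lower CL, upper CL) as reported in Table 3.
trial = (0.51, 0.33, 0.77)
weighted = {
    "age": (0.68, 0.39, 1.17),
    "sex": (0.53, 0.34, 0.82),
    "race": (0.46, 0.29, 0.72),
}


def cl_ratio(lower, upper):
    """Confidence limit ratio: upper limit divided by lower limit."""
    return upper / lower


ratios = {name: cl_ratio(lo, hi) for name, (_, lo, hi) in weighted.items()}
# Assumed distance: absolute difference from the intent-to-treat hazard ratio.
distances = {name: abs(hr - trial[0]) for name, (hr, _, _) in weighted.items()}

by_ratio = sorted(weighted, key=ratios.get)
by_distance = sorted(weighted, key=distances.get)

print("trial CL ratio:", round(cl_ratio(trial[1], trial[2]), 2))
print("ordered by CL ratio:", by_ratio)
print("ordered by distance:", by_distance)
```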
When we simultaneously accounted for differences in age, sex, and race/ethnicity between the trial sample and target population, the hazard ratio was weakened from 0.51 to 0.57. This somewhat muted effect is apparent in Figure 1, which presents the complement of the Kaplan-Meier survival curves for the trial and the analogous curves (41) for the target population. Moreover, precision is somewhat decreased when inference is generalized to the target population, as is evident by the confidence limit ratios in Table 3.
Figure 1. Complement of the Kaplan-Meier survival curves, acquired immunodeficiency syndrome (AIDS) Clinical Trial Group 320 Study, 1996–1997, United States. A) intent-to-treat; B) selection probability weighted. Solid lines represent patients randomly assigned to the control group; dashed lines represent patients randomly assigned to the treatment group.
In the next section, we describe a simulation experiment. Our goal was to assess some finite-sample properties of the proposed method.
SIMULATIONS
Simulation design
We compare the proposed method with conventional intent-to-treat estimates of the hazard ratio in a setting that mimics the ACTG 320 trial. To compare the approaches, we calculated bias, computed as the estimated log hazard ratio minus the true log hazard ratio (described below); standard error, computed as the average of the estimated standard errors; Monte Carlo standard error, computed as the standard deviation of the estimated log hazard ratios; root mean squared error, computed as the square root of the squared bias plus the squared Monte Carlo standard error; and confidence limit coverage, computed as the proportion of times the confidence limit contains the true hazard ratio (estimated with the standard error, not the Monte Carlo standard error). Simulation results are subject to Monte Carlo error; on the basis of the 10,000 simulations, the 95% confidence limit coverage estimates have a simulation standard error of about 0.2%.
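The performance measures defined above translate directly into code. The following sketch is one way to compute them from a list of estimated log hazard ratios and their standard errors; the function names and the 1.96 multiplier for 95% limits are assumptions rather than details given in the source.

```python
import math
import statistics


def summarize(estimates, standard_errors, true_value, z=1.96):
    """Summarize simulated log hazard ratio estimates against the true value."""
    bias = statistics.fmean(estimates) - true_value
    average_se = statistics.fmean(standard_errors)
    monte_carlo_se = statistics.stdev(estimates)
    root_mse = math.sqrt(bias**2 + monte_carlo_se**2)
    # Coverage uses the estimated standard errors, not the Monte Carlo standard error.
    covered = [
        est - z * se <= true_value <= est + z * se
        for est, se in zip(estimates, standard_errors)
    ]
    return {
        "bias": bias,
        "average_se": average_se,
        "monte_carlo_se": monte_carlo_se,
        "root_mse": root_mse,
        "coverage": sum(covered) / len(covered),
    }


def coverage_simulation_se(coverage, n_simulations):
    """Binomial standard error of an estimated coverage proportion."""
    return math.sqrt(coverage * (1 - coverage) / n_simulations)


# Monte Carlo error of a 95% coverage estimate based on 10,000 simulations.
print(f"{coverage_simulation_se(0.95, 10_000):.4f}")
```

With 10,000 simulations and nominal coverage, the binomial standard error is about 0.002, in line with the 0.2% quoted above.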
A simulated data record comprises values for Z, X, T, and S; we drew i = 1 to 10,000 simulated population member records for each of 10,000 simulation data sets. First, a Bernoulli random variable was generated with marginal probability of 0.5 for a single demographic characteristic Z. Second, a Bernoulli random variable was generated with marginal probability 0.5 for treatment X. Third, a lognormal random variable was generated conditional on the realized value of Z and X with density
![Lognormal density of T conditional on X and Z, with parameters α0, α1, α2, and σ](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2915476/bin/amjepidkwq084fx5_ht.jpg)
The parameters {α0, α1, α2, σ} were chosen to represent 3 scenarios. To inform these simulations, the parameters of a lognormal model (which was the best-fitting parametric model of those explored for the control group in the ACTG 320 trial) were α0 = 8.585 (standard error, 0.4340) and σ = 2.709 (standard error, 0.2885). First, we chose α0 = 8.585, α1 = 0, α2 = 0, and σ = 2.709 such that the expectation of the true hazard ratio was 1, and we term this scenario “no effect.” Second, we chose α0 = 8.585, α1 = 1.23, α2 = 0, and σ = 2.709 such that the expectation of the true hazard ratio was approximately 0.5, as in the ACTG 320 trial; we term this scenario “homogeneous effect.” Third, we chose α0 = 8.585, α1 = 0.75, α2 = 2.10, and σ = 2.709 such that the expectation of the true hazard ratio was again approximately 0.5, as in ACTG 320, and we term this scenario “heterogeneous effect.”
The lognormal-distributed times were administratively censored at a fixed time such that, in expectation, we observed approximately 100 events per study, as in the ACTG 320 trial. In all cases, the reference for calculation of bias and confidence limit coverage was the true hazard ratio obtained in the complete target population. Last, a Bernoulli random variable was generated for selection into the trial S, conditional on the realized value of the demographic characteristic Z, as 1/{1 + exp[ − β0 − β1Zi]}, with β1 set at log(4) to reflect the size of selection effects observed in the ACTG 320 trial for age groups and β0 chosen to maintain a marginal probability of 0.1. We calculated both naïve and robust (40) variance estimates for the weighted models.
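A data-generating process along these lines might be sketched as follows. Several details are assumptions rather than facts from the source: the mean of log T is taken to be α0 + α1x + α2xz (that is, α2 is assumed to multiply the product of X and Z), the censoring time of 365 is a hypothetical placeholder because the fixed time is not reported, and the intercept β0 is found by bisection. All function names are illustrative.

```python
import math
import random


def expit(value):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-value))


def solve_beta0(beta1, p_z=0.5, target=0.1):
    """Bisection for the intercept giving marginal selection probability `target`."""
    lo, hi = -20.0, 20.0
    for _ in range(200):
        mid = (lo + hi) / 2
        marginal = (1 - p_z) * expit(mid) + p_z * expit(mid + beta1)
        if marginal < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def simulate_population(n, alpha0, alpha1, alpha2, sigma, beta0, beta1, censor_time, rng):
    """Draw (z, x, observed time, event indicator, s) records for one population."""
    records = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        x = 1 if rng.random() < 0.5 else 0
        # Assumption: alpha2 multiplies x*z, so the treatment effect differs by z.
        mu = alpha0 + alpha1 * x + alpha2 * x * z
        t = rng.lognormvariate(mu, sigma)
        s = 1 if rng.random() < expit(beta0 + beta1 * z) else 0
        observed = min(t, censor_time)  # administrative censoring
        event = 1 if t <= censor_time else 0
        records.append((z, x, observed, event, s))
    return records


beta1 = math.log(4)
beta0 = solve_beta0(beta1)
rng = random.Random(2010)  # arbitrary seed for reproducibility
population = simulate_population(
    10_000, 8.585, 0.75, 2.10, 2.709, beta0, beta1,
    censor_time=365.0,  # hypothetical value; the source does not report it
    rng=rng,
)
selected = [record for record in population if record[4] == 1]
print(len(selected), "of", len(population), "selected into the trial")
```

With β1 = log(4) and a marginal selection probability of 0.1, the implied selection probabilities are roughly 0.04 for Z = 0 and 0.16 for Z = 1, so the true stabilized weights are about 2.3 and 0.64, respectively.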
Simulation results
Across all simulations, the estimated stabilized weights (in the selected samples) had a mean of 1.00 (standard deviation, ≅0.66) with minimum and maximum values of about 0.6 and 2.75, respectively.
In Table 4, for the no-effect and homogeneous-effect scenarios, both the conventional intent-to-treat estimate of the hazard ratio and the weighted estimate of the hazard ratio provide unbiased estimates with appropriate confidence limit coverage. In such cases, the conventional hazard ratio is more precise than the weighted hazard ratio, as evidenced by a 1.2-fold relative root mean squared error (0.227/0.189, Table 4).
Table 4.
Simulation Results for 10,000 Samples per Scenario Each of Size 10,000 With 1,000 Patients per Trial
Scenario | Bias | Average SE | Monte Carlo SE | Root MSE | 95% CL Coverage |
Intent-to-treat | |||||
No effect | −0.003 | 0.188 | 0.189 | 0.189 | 0.955 |
Homogeneous effect | −0.014 | 0.232 | 0.235 | 0.235 | 0.945 |
Heterogeneous effect | −0.777 | 0.229 | 0.230 | 0.810 | 0.070 |
Selection probability weighted | |||||
No effect | −0.003 | 0.225 | 0.227 | 0.227 | 0.948 |
Homogeneous effect | −0.020 | 0.275 | 0.280 | 0.279 | 0.948 |
Heterogeneous effect | 0.000 | 0.246 | 0.246 | 0.246 | 0.955 |
Abbreviations: CL, confidence limit; MSE, mean squared error; SE, standard error.
For the heterogeneous-effect scenario in Table 4, the conventional estimate is severely biased, leading to abysmal confidence limit coverage. However, the weighted estimate is largely unbiased with appropriate confidence limit coverage. Use of naïve standard errors for the weighted estimate led to undercoverage of the confidence limits (coverage of 0.902, 0.906, and 0.871 for the 3 scenarios, respectively), but the use of robust standard errors raised the coverage to nominal levels, as shown in Table 4. Similar simulations were conducted for the odds ratio, with equally supportive results (data not shown).
DISCUSSION
We illustrated a method to generalize inferences from a randomized clinical trial to a specified target population using inverse probability-of-selection weights. In the ACTG 320 trial, the method demonstrated that inferences apply, albeit muted by 12%, to estimates of the US population infected in 2006, under the assumption that we have measured and correctly modeled the determinants of selection that reflect heterogeneity in the etiologic effect. The proposed method is supported by a limited Monte Carlo experiment.
The approach proposed here is one of model-based standardization (16, 42–44). Perhaps Lane and Nelder stated the idea most succinctly, in its simpler one-sample form, “consider a survey of the incidence of disease in cattle, where proportions affected are recorded for different age groups in different regions. After the selection and fitting of a suitable model, the fitted proportions can be combined with population frequencies (assuming these to be known) and summed over regions, to give a prediction of the total incidence for the whole country” (42, p. 614). Next, we provide some important caveats.
First, akin to the assumptions of no unmeasured confounders and no unmeasured informative censoring, in practice we will only at best be able to identify and measure a subset of the characteristics that lead to effect heterogeneity. Therefore, the proposed estimator will only approximate the etiologic effect in a defined target population to the extent that we capture said characteristics and specify them appropriately in the selection model. The proposed approach, or any that we can imagine, will need to have information on the joint distribution of the modifying characteristics in the target population and have these characteristics measured in the trial. In some settings, this may be difficult. In our case, we did not have individual-level data on the target population and instead used the summary statistics in the target population to construct a pseudo-population with the appropriate joint distribution of sex, race, and age groups. The method could easily be extended to incorporate individual-level data on the population if it were available, as illustrated by Stuart et al. (45).
Furthermore, in our example, the CD4 cell count differs between the target population and the trial sample. Based on an understanding of the natural history of HIV infection, the target population of recently infected people has relatively normal immune function. However, the trial sample is immune suppressed, as shown in Table 1. If the treatment effect is heterogeneous with respect to immune function, then we would be missing an important characteristic. Indeed, the related issue of when to start HIV therapies with respect to CD4 cell count is of prime clinical concern (46–48) but beyond our current scope. Moreover, even with a correct set of measured characteristics, one must correctly specify the selection model to maintain valid inferences; this requirement is relaxed if the selection model is saturated, as in our example. By “saturated” we mean that there is a parameter for every cell in the cross-classified data table such that data are not smoothed (of course, data reflecting continuous factors, such as age, are categorized). Exploration of this central assumption of measuring the relevant heterogeneity factors requires more attention. However, it seems that even a partially corrected mapping of the trial result to a target population is a step forward.
A second caveat is that we concentrated on mapping an observed intent-to-treat result estimated in a trial to a specified target population defined by baseline characteristics that are measured in the trial and are known in the target population. Here, we did not attempt to account for postrandomization variables (12), but such steps may be important especially when the intent-to-treat estimate is subject to nontrivial bias due to noncompliance (49, 50).
Third, we ignored uncertainty in the distribution of characteristics in the target population. Information for the target population used here was based on a large, nationally representative sample. However, in settings where the distribution of characteristics in the target population is subject to a large amount of random error, the proposed method should be extended to account for this uncertainty. This is a topic for future research.
Fourth, the information observed for the trial patients is optimal for standardization to the same population structure, so the reduction in precision observed when generalizing from a trial to a target population will be an increasing function of the “distance” between the trial patients and target population. At one extreme, where the trial patients and target population are completely divergent, there is no information for estimating the effect in the target population, and one must rely on extrapolation.
Last, we illustrated methods here using the Cox proportional hazards model (27) because the hazard ratio is a central measure of association in randomized clinical trials. However, the hazard ratio is intrinsically susceptible to selection bias, and perhaps survival curves or relative survival times provide more clear measures of causal effects (51–53).
In conclusion, the proposed method standardizes observed trial results to a specified target population. Therefore, the proposed method provides direct information regarding generalizability of the trial results to the specified target population. The approach may prove useful in projecting the effects of interventions in populations that may differ in composition from those studied in randomized clinical trials. Moreover, the present approach can itself be generalized to nonrandomized settings, which is a topic we intend to discuss in future work.
Acknowledgments
Author affiliations: Department of Epidemiology, Gillings School of Global Public Health and Center for AIDS Research, University of North Carolina, Chapel Hill, North Carolina (Stephen R. Cole); Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland (Elizabeth A. Stuart); and Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland (Elizabeth A. Stuart).
Dr. Cole was supported in part through National Institutes of Health (NIH) grants R03-AI-071763, R01-AA-01759, and P30-AI-50410. Dr. Stuart was supported in part through NIH grant K25-MH083946.
The authors thank Drs. Sander Greenland and Tyler VanderWeele for their expert advice.
Conflict of interest: none declared.
Glossary
Abbreviations
ACTG: AIDS Clinical Trials Group
AIDS: acquired immunodeficiency syndrome
HIV: human immunodeficiency virus
APPENDIX
In this Appendix, we provide a proof of the asymptotic consistency of the proposed method for the case of the mean or proportion; results extend to the hazard but are not proven here. As a preliminary step, we derive the bias in the conventional intent-to-treat estimator.
Let α = E(Y1) − E(Y0) and β = E(Y1|S = 1) − E(Y0|S = 1), where E(.|.) is the conditional expectation taken with respect to an enumerated target population indexed by i = 1 to n, Yix is the potential outcome that would have occurred for person i under treatment X = x, and S∈{0,1} is an indicator of pretreatment selection into a sample of the target population. Although public health practitioners and clinicians may be interested in the population average treatment effect α, a conventional intent-to-treat comparison of groups randomized to treatment X = x from the sample provides an estimate of β.
If α≠E(Y1|Z = z) − E(Y0|Z = z) and P(S = 1)≠P(S = 1|Z = z) for a pretreatment covariate Z, where P(.|.) is a conditional probability taken with respect to the target population, then, except in circumstances of chance balancing cancellations, α≠β. Informally, the expectation of the difference in potential outcomes in the target population differs from the expectation of the difference in potential outcomes in an observed sample of the target population defined by S = 1 when there is heterogeneity in the causal effect of treatment X due to Z and the sample selection mechanism depends on Z.
If we limit our setting only in that Z ∈ {0,1}, then the bias of β, as a measure of α, is β − α = bxz × P(Z = 1) × [P(S = 1|Z = 1)/P(S = 1) − 1], where bxz is a coefficient representing the heterogeneity in the X effect due to Z in the outcome model E(Yi) = b0 + bxXi + bxzXiZi. A main effect for Z can be added to the outcome model without inducing any problem other than unnecessarily complicating the steps below. Therefore, the bias depends positively on the effect heterogeneity, the prevalence of the heterogeneity characteristic Z, the proportion of the target population not sampled, and the strength of the association between the heterogeneity characteristic and selection. In particular, there is no bias if there is no heterogeneity in the X effect due to Z, no one in the population has the heterogeneity characteristic (i.e., P(Z = 1) = 0), the “sample” consists of the entire population (i.e., P(S = 1) = 1), or sample selection does not depend on Z (i.e., P(S = 1|Z = z) = P(S = 1)). A derivation of the bias follows. First,
![Equation image amjepidkwq084fx7_ht.jpg](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2915476/bin/amjepidkwq084fx7_ht.jpg)
and
![Equation image amjepidkwq084fx8_ht.jpg](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2915476/bin/amjepidkwq084fx8_ht.jpg)
where Yi = Yix when Xi = x by the consistency assumption (2). Then,
![Equation image amjepidkwq084fx10_ht.jpg](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2915476/bin/amjepidkwq084fx10_ht.jpg)
where, in the fourth line of the above derivation, P(Z = 1|S = 1) = P(S = 1|Z = 1)P(Z = 1)/P(S = 1) by Bayes’ rule. We have also assumed the independence condition E(X|S,Z,Yx) = E(X). This condition says that treatment assignment is independent of sample selection, the covariate, and the potential outcomes. This ignorable treatment mechanism would be granted in expectation by random treatment assignment that does not depend on S or Z. This assumption may be relaxed for stratified randomization or nonrandomized studies (where E(X) may depend on Z), but that leads to a more complex formula for the bias.
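The bias expression can be checked numerically. Under the outcome model E(Yi) = b0 + bxXi + bxzXiZi with binary Z, the effect of X within stratum Z = z is bx + bxz × z, so α = bx + bxz × P(Z = 1) and β = bx + bxz × P(Z = 1|S = 1), which gives β − α = bxz × P(Z = 1) × [P(S = 1|Z = 1)/P(S = 1) − 1] after applying Bayes’ rule. The following minimal sketch computes both sides; the function name and all parameter values are hypothetical and serve only as illustration.

```python
def effects(b_x, b_xz, p_z, s_z1, s_z0):
    """Return (alpha, beta, closed-form bias) for a binary covariate Z.

    p_z  = P(Z = 1)
    s_z1 = P(S = 1 | Z = 1)
    s_z0 = P(S = 1 | Z = 0)
    """
    p_s = s_z1 * p_z + s_z0 * (1 - p_z)    # P(S = 1), law of total probability
    p_z_given_s = s_z1 * p_z / p_s         # P(Z = 1 | S = 1), Bayes' rule
    alpha = b_x + b_xz * p_z               # effect in the target population
    beta = b_x + b_xz * p_z_given_s        # effect among those selected
    bias = b_xz * p_z * (s_z1 / p_s - 1)   # closed-form bias of beta for alpha
    return alpha, beta, bias


# Hypothetical values: selection favors Z = 1, where the effect is larger.
alpha, beta, bias = effects(b_x=1.0, b_xz=0.5, p_z=0.3, s_z1=0.4, s_z0=0.1)
print(alpha, beta, beta - alpha, bias)

# No bias when selection does not depend on Z.
_, _, no_bias = effects(b_x=1.0, b_xz=0.5, p_z=0.3, s_z1=0.2, s_z0=0.2)
print(no_bias)
```

With these values, β exceeds α because the stratum with the larger treatment effect is overrepresented in the sample, and the closed-form expression matches the direct difference β − α.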
Assume further that P(S = 1|Z = z,Yx) = P(S = 1|Z = z) or, in words, that we have an ignorable sample selection mechanism, conditional on Z, and that P(S = 1|Z = z) > 0 for all individuals. Such a mechanism would be granted in expectation by Z-stratified random sampling from the target population, where every individual has a positive probability of being selected.
Let γ = E[S × I(X = 1) × Y/{w(Z)e(∅)}] − E[S × I(X = 0) × Y/{w(Z)e(∅)}], where I(.) is the indicator function, w(Z) = P(S = 1|Z = z), and e(∅) = P(X = x) for the treatment level x named in the corresponding indicator. Then, γ is an inverse probability-of-selection-weighted expectation of the difference in potential outcomes in an observed sample of the target population, and γ = α under the above-stated assumptions. A proof builds on the work of Horvitz and Thompson (29) and extends that of Lunceford and Davidian (54) to allow for selection as
![Equation image amjepidkwq084fx13_ht.jpg](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2915476/bin/amjepidkwq084fx13_ht.jpg)
Steps are given by the law of conditional expectations, the consistency assumption, rearrangement, Z-conditional ignorable selection and ignorable treatment mechanisms, and the definitions of w(Z) and e(∅), respectively. Similarly, E[S × I(X = 0) × Y/{w(Z)e(∅)}] = E(Y0).
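The consistency result can be illustrated with a small Monte Carlo sketch. This is not the simulation study reported in the paper: the population size, coefficients, selection probabilities, and random seed below are hypothetical, the selection probabilities w(Z) are treated as known rather than estimated, and the outcome is continuous with standard normal noise. The weighted estimator is written in Horvitz-Thompson form, that is, as a sum over the sample divided by the target population size n.

```python
import random


def simulate(n=200_000, seed=1):
    """Simulate a target population of size n, select a trial sample with
    probability depending on Z, randomize X within the sample, and return
    (conventional ITT estimate, selection-weighted estimate)."""
    rng = random.Random(seed)
    b_0, b_x, b_xz = 0.0, 1.0, 0.5   # outcome model coefficients (assumed)
    p_z = 0.3                        # P(Z = 1)
    w = {1: 0.4, 0: 0.1}             # w(Z) = P(S = 1 | Z = z), treated as known
    e = {1: 0.5, 0: 0.5}             # P(X = x) for x = 1 and x = 0

    sum_y = {1: 0.0, 0: 0.0}         # unweighted outcome sums by arm
    count = {1: 0, 0: 0}             # arm sizes
    ht = {1: 0.0, 0: 0.0}            # sums of S * I(X = x) * Y / (w(Z) * e)

    for _ in range(n):
        z = 1 if rng.random() < p_z else 0
        if rng.random() >= w[z]:
            continue                 # S = 0: not in the trial, contributes 0
        x = 1 if rng.random() < e[1] else 0
        y = b_0 + b_x * x + b_xz * x * z + rng.gauss(0.0, 1.0)
        sum_y[x] += y
        count[x] += 1
        ht[x] += y / (w[z] * e[x])

    itt = sum_y[1] / count[1] - sum_y[0] / count[0]
    ipsw = ht[1] / n - ht[0] / n     # divide by the target population size
    return itt, ipsw


itt, ipsw = simulate()
alpha = 1.0 + 0.5 * 0.3                  # b_x + b_xz * P(Z = 1)
beta = 1.0 + 0.5 * (0.4 * 0.3) / 0.19    # b_x + b_xz * P(Z = 1 | S = 1)
print(f"alpha={alpha:.3f} beta={beta:.3f} itt={itt:.3f} ipsw={ipsw:.3f}")
```

In this sketch, the conventional intent-to-treat contrast should land near β, whereas the selection-weighted contrast should land near α, at the cost of greater sampling variability from the weights.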