This vignette will compare different ways of applying inverse-probability weighting to address selection bias when there is missing data. We will consider a running data example from the ACT study, where an association of interest can only be obtained from the subset of individuals in the autopsy sub-cohort, and demonstrate methods of analysis to address this selection bias using the R software. Of interest are design-based analyses, which will apply weights to the completely observed data in order to address the varying probabilty of being observed across different indivduals.This makes fewer assumptions, and thus is more robust, than a model that relies on multiple imputation to fill in the missing data. However, in this vignette we will explore augmented inverse-probability weighted estimators that will be able to improve on the efficiency of typical weighted estimators by involving data on the whole cohort, rather than only those with completely observed data.
We will first look at a couple of equivalent ways to obtain the commonly used, but inefficient, Horwitz-Thompson inverse-probability weighted (IPW) estimator. When using an IPW estimator, one needs to account for the estimated weights when making inference. In this vignette, we also show a simple method that properly account for the estimation of weights in the IPW analysis by using a type of sandwich estimator of the variance (from the stacked estimating equation approach [1]), which is more convienient than the bootstrap esitmator. We will then demonstrate how to implement an augmented inverse-probability weighted estimator (AIPW) using a technique called generalized raking (GR), which is also referred to as survey calibration [2]. AIPW estimators are generally going to be recommeded over IPW analyses of a richly characterized cohort such as the ACT cohort, because there will likely be a rich amount of information on the broader cohort that can be used to increase efficiency compared to the more typical IPW analyses that ignores data on those individuals without complete data.
We will use data from the autopsy cohort (complete data subset N=850) and demonstrate IPW and AIPW analyses for a binary outcome first in detail, then provide examples of similar estimators for continuous, ordinal and survival outcomes. Table 1 below shows the the subject characteristics of the analysis data set (N=850) from the autopsy cohort and how the characteristics compare to all those in ACT and all those in ACT that are now deceased.
#dat <- as.data.table(read_sas(paste(data_dir,"NP_example_2023_05_22.sas7bdat" ,sep="")))
dat <- as.data.table(read_sas(paste(data_dir,"NP_example_2024_04_16.sas7bdat" ,sep="")))
## creat FU_years variable in years (computed as days_to_last_biennial/365.25)
dat[, FU_years := days_to_last_biennial/365.25]
## create potential factor variables of interest (exposures, adjustment covariates, or other descriptors)
dat[, postHS := factor(postHS, levels=c(0,1), labels=c("High school or less","Beyond high school"))]
dat[, sex := factor(gender, levels=c(1,2), labels=c("Male","Female"))]
dat[, cohort := factor(cohort, levels=c(1,2,3), labels=c("Original","Expansion","Replacement"))]
dat[, nh_white := factor(nh_white, levels=c(0,1), labels=c("No","Yes"))]
dat[, apoe := factor(apoe, levels=c(0,1), labels=c("No e4 allele","Yes e4 allele"))]
dat[, dementia := factor(dementia, levels=c(0,1), labels=c("No dementia","Yes dementia"))]
dat[, htn_f := factor(htn_f, levels=c(0,1), labels=c("No","Yes"))]
dat[, heart_dis_f := factor(heart_dis_f, levels=c(0,1), labels=c("No","Yes"))]
dat[, cv_dis_f := factor(cv_dis_f, levels=c(0,1), labels=c("No","Yes"))]
dat[, gait_able_f := factor(gait_able_f, levels=c(0,1), labels=c("Not able","Able"))]
## create select versions of factor variables with missing as a separate (allowed) category
dat[, htn_f_nacat := fct_na_value_to_level(htn_f, "Unknown")]
dat[, heart_dis_f_nacat := fct_na_value_to_level(heart_dis_f, "Unknown")]
dat[, cv_dis_f_nacat := fct_na_value_to_level(cv_dis_f, "Unknown")]
dat[, gait_able_f_nacat := fct_na_value_to_level(gait_able_f, "Unknown")]
## create select versions of continuous variables with missing as a separate (allowed) category
dat[, gait_time_f_nomiss := gait_time_f]
dat[, gait_nomiss := as.numeric(!is.na(gait_time_f))]
dat[gait_nomiss==0 , gait_time_f_nomiss := 0]
dat[, adl_sum_f_nomiss := adl_sum_f]
dat[, adl_nomiss := as.numeric(!is.na(adl_sum_f))]
dat[adl_nomiss==0 , adl_sum_f_nomiss := 0]
dat[, iadl_sum_f_nomiss := iadl_sum_f]
dat[, iadl_nomiss := as.numeric(!is.na(iadl_sum_f))]
dat[iadl_nomiss==0 , iadl_sum_f_nomiss := 0]
dat[, cesd_score_f_nomiss := cesd_score_f]
dat[, cesd_nomiss := as.numeric(!is.na(cesd_score_f))]
dat[cesd_nomiss==0 , cesd_score_f_nomiss := 0]
## construct spline terms - age_f
### Currently not used, but could explore if this made a difference in IPW and outcome models.
knots_age_f <- quantile(dat[,age_f], seq(0.05, 0.95, 0.30), na.rm=TRUE)
dat[, c("age_f_spl1","age_f_spl2","age_f_spl3") := data.table(rcspline.eval(dat$age_f, knots=knots_age_f, inclx=TRUE))]
## create binary and categorical outcome variables of interest
dat[braak_stg %in% c(0,1,2,3,4), braak_bin := 0]
dat[braak_stg %in% c(5,6), braak_bin := 1]
dat[is.na(braak_stg), braak_bin := NA]
label(dat$braak_bin) <- "Braak stage (V-VI vs. 0-IV)"
dat[, braak_stg := factor(braak_stg, levels=c(0,1,2,3,4,5,6), labels=c("0","I","II","III","IV","V","VI"))]
label(dat$braak_stg) <- "Braak stage"
## create inclusion criteria flags for 3 samples: autopsy, deceased, all
dat[, auto_inc := 0]
dat[neuropath==1, auto_inc := 1]
dat[, dth_inc := 0]
dat[!is.na(age_death), dth_inc := 1]
dat[, all_inc := 0]
dat[, all_inc := 1]
| Autopsy sample (N=850) |
All deceased (N=3198) |
Entire ACT cohort (N=5763) |
|
|---|---|---|---|
| Educational degree | |||
| High school or less | 429 (50%) | 1891 (59%) | 2602 (45%) |
| Beyond high school | 421 (50%) | 1306 (41%) | 3160 (55%) |
| Missing | 0 (0%) | 1 (0.0%) | 1 (0.0%) |
| Sex | |||
| Male | 360 (42%) | 1376 (43%) | 2420 (42%) |
| Female | 490 (58%) | 1822 (57%) | 3343 (58%) |
| ACT enrollment cohort | |||
| Original | 518 (61%) | 2217 (69%) | 2581 (45%) |
| Expansion | 193 (23%) | 565 (18%) | 811 (14%) |
| Replacement | 139 (16%) | 416 (13%) | 2371 (41%) |
| Age (ACT study entry) | |||
| Mean (SD) | 76.7 (6.6) | 76.3 (6.6) | 74.1 (6.4) |
| Age (Latest ACT biennial visit) | |||
| Mean (SD) | 86.1 (6.6) | 84.3 (7.0) | 81.9 (7.7) |
| Age (Death) | |||
| Mean (SD) | 88.7 (6.7) | 87.0 (7.1) | 87.0 (7.1) |
| Missing | 0 (0%) | 0 (0%) | 2565 (44.5%) |
| Non-Hispanic white | |||
| No | 49 (6%) | 299 (9%) | 638 (11%) |
| Yes | 801 (94%) | 2897 (91%) | 5113 (89%) |
| Missing | 0 (0%) | 2 (0.1%) | 12 (0.2%) |
| APOE genotype | |||
| No e4 allele | 593 (70%) | 2095 (66%) | 3453 (60%) |
| Yes e4 allele | 228 (27%) | 738 (23%) | 1233 (21%) |
| Missing | 29 (3.4%) | 365 (11.4%) | 1077 (18.7%) |
| Dementia (DSM-IV) | |||
| No dementia | 464 (55%) | 2104 (66%) | 4416 (77%) |
| Yes dementia | 386 (45%) | 1094 (34%) | 1347 (23%) |
| Hypertension | |||
| No | 293 (34%) | 1022 (32%) | 2179 (38%) |
| Yes | 510 (60%) | 1970 (62%) | 3291 (57%) |
| Missing | 47 (5.5%) | 206 (6.4%) | 293 (5.1%) |
| Coronary artery disease | |||
| No | 513 (60%) | 1771 (55%) | 3790 (66%) |
| Yes | 258 (30%) | 1069 (33%) | 1467 (25%) |
| Missing | 79 (9.3%) | 358 (11.2%) | 506 (8.8%) |
| Cerebrovascular disease | |||
| No | 520 (61%) | 1983 (62%) | 4107 (71%) |
| Yes | 245 (29%) | 828 (26%) | 1109 (19%) |
| Missing | 85 (10.0%) | 387 (12.1%) | 547 (9.5%) |
| Able to perform gait test | |||
| Not able | 50 (6%) | 138 (4%) | 153 (3%) |
| Able | 554 (65%) | 2144 (67%) | 4383 (76%) |
| Missing | 246 (28.9%) | 916 (28.6%) | 1227 (21.3%) |
| Seconds to complete gait test | |||
| Mean (SD) | 5.5 (3.4) | 5.2 (2.9) | 4.6 (2.4) |
| Missing | 296 (34.8%) | 1054 (33.0%) | 1380 (23.9%) |
| # of ADL difficulties | |||
| Mean (SD) | 1.6 (1.8) | 1.3 (1.7) | 0.9 (1.5) |
| Missing | 100 (11.8%) | 484 (15.1%) | 657 (11.4%) |
| # of IADL difficulties | |||
| Mean (SD) | 1.5 (1.7) | 1.2 (1.6) | 0.8 (1.4) |
| Missing | 112 (13.2%) | 528 (16.5%) | 725 (12.6%) |
| 10-item CES-D score | |||
| Mean (SD) | 4.4 (4.6) | 4.7 (4.8) | 4.4 (4.6) |
| Missing | 158 (18.6%) | 651 (20.4%) | 863 (15.0%) |
Now we fit the selection models that predict the probability that an observation is complete, i.e. all the covariates and the outcome are observed for that observation. A good rule of thumb for which covariates to stick in the selection model would be to include those variables you plausibly think could be related to the probablity of consenting to autopsy, dying (so getting autopsied), and the outcome of interest. If the variable is not plausibly related to outcome, then it wont help to put it in the selection model, and in fact could inflate the variance [9,10]. A bit more discussion of considerations of variables to include in the selection model is provided at the end of this document.
## restrict data to target population of interest (in this case, entire ACT cohort)
dat.target <- copy(dat[all_inc==1])
## set basic predictors to include in selection models
vars.sel <- c("postHS", "sex", "age_f", "cohort", "dementia")
## restrict data further to require non-missing values for all selection model predictors
dat.target <- dat.target[rowSums(is.na(dat.target[, ..vars.sel]))==0]
dat.target <- dat.target[!(auto_inc==1 & is.na(braak_bin))] # to avoid break code
## fit selection model for whether included in autopsy sample (logistic regression for odds of auto_inc=1)
selection.formula <-as.formula("auto_inc ~ postHS + sex + age_f + cohort + dementia")
fit.sel <- glm(selection.formula,
family = binomial(link="logit"),
data = dat.target)
## generate inverse probability of selection weights among the autopsy sample
#### currently cant use stabilized weights in two-phase poisson sampling survey variance calculations
dat.target$sel.prob<-predict(fit.sel, type="response")
dat.target$wts.sel <- 1/dat.target$sel.prob
summary(fit.sel) # Selection model summary
##
## Call:
## glm(formula = selection.formula, family = binomial(link = "logit"),
## data = dat.target)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.966334 0.516949 -13.476 < 2e-16 ***
## postHSBeyond high school 0.276636 0.083051 3.331 0.000866 ***
## sexFemale -0.179054 0.080951 -2.212 0.026974 *
## age_f 0.061354 0.005995 10.234 < 2e-16 ***
## cohortExpansion 0.147153 0.100544 1.464 0.143313
## cohortReplacement -0.948522 0.111718 -8.490 < 2e-16 ***
## dementiaYes dementia 0.875689 0.082048 10.673 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4802.2 on 5756 degrees of freedom
## Residual deviance: 4260.4 on 5750 degrees of freedom
## AIC: 4274.4
##
## Number of Fisher Scoring iterations: 5
### Summarize distribution of the weights
summary(dat.target$wts.sel)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.767 4.669 8.364 12.879 19.429 61.696
Some weights are a bit large, but since < 100, we will not do anything to stabilize. This is a bias-variance tradeoff that needs to be considered for each individual setting. There is no universal rule of thumb. If of concern, one could trim the weights to be say < 10, and see if that qualitatively changes conclusions from the analysis. Before seeing the results decide first which analysis is primary, and report the other one as a sensitivity analysis. Dont make this a p-value driven decision.
auc.rocr <- performance(prediction(fitted(fit.sel), fit.sel$y), "auc")
roc.rocr <- performance(prediction(fitted(fit.sel), fit.sel$y), "tpr","fpr")
plot(roc.rocr, main=paste("AUC =", round(auc.rocr@y.values[[1]],4)), cex.main=1)
abline(a=0,b=1)
The AUC is 0.74, showing moderate success at predicting missingness (0.5 is a useless binary prediction model and 1 is perfection).
To work with AIPW, it is handy to review how to perform the standard IPW analysis with a regression function in the survey package. First we will look at three different ways to fit IPW, in order to compare estimators and convince you that survey gives you the same answer as the other packages that are perhaps more familiar to some users. Note that these models below have naive standard errors, that dont account for the fact that the weights were estimated, but the point of this section is just to compare point estimates. Answers are identical between these three methods. This could lead to conservative SE estimates. A commonly fitted fix for that is to bootstrap the SE, however, there is a simpler way to fix the standard errors, which we will demonstrate below in IPW Method 4. After this section we will perform all analyses (both IPW and AIPW) using the survey package.
We fit an outcome model using modified Poisson regression to estimate the relative risk model for a binary outcome with standard IPW. We fit the IPW model 3 different ways using some common approaches in R.
##########################################################################
#### Three IPW approaches that do not adjust SE for estimated weights ####
### The main idea here is to see that the point estimates are the same ###
##########################################################################
### Using an outcome formula variable to save a little typing, with some simple R code
outcome.formula<- as.formula("braak_bin ~postHS + sex + age_death + cohort")
## Method 1: using svyglm (from survey package)
svyglm.bin.wt <- svyglm(formula = outcome.formula,
design = svydesign(id = ~1,
weights = ~wts.sel,
data = dat.target[auto_inc==1]),
family = quasipoisson(link="log"))
## Method 2: using geeglm (from geepack package)
gee.bin.wt <- geeglm(formula = outcome.formula,
id = SUBJECT,
family = poisson(link="log"),
data = dat.target[auto_inc==1],
weights = wts.sel)
## Method 3: using glm with sandwich package
glm.bin.wt <- glm(formula = outcome.formula,
family = quasipoisson(link="log"),
data = dat.target[auto_inc==1],
weights = wts.sel)
## equivalent naive Wald CIs from the 3 approaches
#### Original Scale
#round(exp(cbind(svyglm.bin.wt$coefficients, confint(svyglm.bin.wt))), 3)
#round(exp(cbind(gee.bin.wt$coefficients, confint.default(gee.bin.wt))), 3)
#round(exp(cbind(glm.bin.wt$coefficients,
# "2.5 %"=glm.bin.wt$coefficients-1.96*sqrt(diag(sandwich(glm.bin.wt))),
# "97.5 %"=glm.bin.wt$coefficients+1.96*sqrt(diag(sandwich(glm.bin.wt))))), 3)
#### CI for each method on the log Scale
round(cbind(svyglm.bin.wt$coefficients, confint(svyglm.bin.wt)), 3)
## 2.5 % 97.5 %
## (Intercept) -7.246 -8.802 -5.690
## postHSBeyond high school -0.134 -0.374 0.106
## sexFemale 0.125 -0.119 0.368
## age_death 0.067 0.050 0.084
## cohortExpansion 0.132 -0.113 0.377
## cohortReplacement 0.179 -0.120 0.479
round(cbind(gee.bin.wt$coefficients, confint.default(gee.bin.wt)), 3)
## 2.5 % 97.5 %
## (Intercept) -7.246 -8.799 -5.693
## postHSBeyond high school -0.134 -0.373 0.105
## sexFemale 0.125 -0.119 0.368
## age_death 0.067 0.050 0.084
## cohortExpansion 0.132 -0.112 0.376
## cohortReplacement 0.179 -0.120 0.478
round(cbind(glm.bin.wt$coefficients,
"2.5 %"=glm.bin.wt$coefficients-1.96*sqrt(diag(sandwich(glm.bin.wt))),
"97.5 %"=glm.bin.wt$coefficients+1.96*sqrt(diag(sandwich(glm.bin.wt)))), 3)
## 2.5 % 97.5 %
## (Intercept) -7.246 -8.799 -5.693
## postHSBeyond high school -0.134 -0.373 0.105
## sexFemale 0.125 -0.119 0.368
## age_death 0.067 0.050 0.084
## cohortExpansion 0.132 -0.112 0.376
## cohortReplacement 0.179 -0.120 0.478
This method produces estimates that are asymptoptically equivalent to the above IPW estimates, but not exactly the same point estimates. The estWeights function from the survey package is applying the method discussed in Robins, Rotnitzky, and Zhao 1994 (RRZ) [3] to estimate weights and then perform a weighted outcome regression. Further details on this approach are provided in section 2.3 of Lumley et al 2011 [2]. Using the estWeights function returns a design with the estimated weights and the necessary information to allow the subsequent outcome regression applied in the survey package to adjust the standard errors to account for estimated weights, using the stacked estimating equation sandwich approach. Note, this “sandwich” SE estimator is different than the “robust SE” sandwich estimator returned for a GEE weighted regression, which would be accounting for the weighting but not the fact that the weights were estimated.
The following chunk of code uses the estWeights command [4] to obtain the IPW weights. Here we only need to provide estWeights with the list of variables to put in the selection model. We use the same covariates that determined wts.sel above. We then estimate the IPW outcome regression and compare the results to svyglm (which had naive SE estimates). There isnt a clear pattern in terms of which method has smaller SE, but we trust the RRZfit estimates to appropriately account for the estimated weights. Thus regardless of whether SE are smaller, we reccomend this RRZ method of applying IPW and getting SE.
# the formula here is for the selection model
designEstWt<-estWeights(dat.target, formula=~postHS+sex+age_f+cohort+dementia,
subset=~I(auto_inc==1))
# Now we peform the IPW outcome regression
RRZfit<-svyglm(braak_bin~postHS+sex+age_death+cohort,design=designEstWt,family = quasipoisson(link="log"))
### log scale confidence interval, compare to IPW
round(cbind(RRZfit$coefficients, confint(RRZfit)), 3)
## 2.5 % 97.5 %
## (Intercept) -7.246 -8.759 -5.734
## postHSBeyond high school -0.134 -0.378 0.110
## sexFemale 0.125 -0.123 0.372
## age_death 0.067 0.051 0.084
## cohortExpansion 0.132 -0.117 0.381
## cohortReplacement 0.179 -0.130 0.489
### We can compare the two IPW fits. We see point estimates are the same, though in this case the SE arent always bigger for the naive approach.
summary(RRZfit)
##
## Call:
## svyglm(formula = braak_bin ~ postHS + sex + age_death + cohort,
## design = designEstWt, family = quasipoisson(link = "log"))
##
## Survey design:
## estWeights(dat.target, formula = ~postHS + sex + age_f + cohort +
## dementia, subset = ~I(auto_inc == 1))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.246350 0.770446 -9.405 < 2e-16 ***
## postHSBeyond high school -0.134053 0.124276 -1.079 0.281
## sexFemale 0.124512 0.126082 0.988 0.324
## age_death 0.067262 0.008349 8.056 2.7e-15 ***
## cohortExpansion 0.131842 0.126920 1.039 0.299
## cohortReplacement 0.179378 0.157536 1.139 0.255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.6920678)
##
## Number of Fisher Scoring iterations: 5
summary(svyglm.bin.wt)
##
## Call:
## svyglm(formula = outcome.formula, design = svydesign(id = ~1,
## weights = ~wts.sel, data = dat.target[auto_inc == 1]), family = quasipoisson(link = "log"))
##
## Survey design:
## svydesign(id = ~1, weights = ~wts.sel, data = dat.target[auto_inc ==
## 1])
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.24635 0.79278 -9.140 < 2e-16 ***
## postHSBeyond high school -0.13405 0.12224 -1.097 0.273
## sexFemale 0.12451 0.12419 1.003 0.316
## age_death 0.06726 0.00857 7.849 1.28e-14 ***
## cohortExpansion 0.13184 0.12469 1.057 0.291
## cohortReplacement 0.17938 0.15260 1.175 0.240
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.6920678)
##
## Number of Fisher Scoring iterations: 5
This approach relies on building a working model that uses data from the entire cohort (not just the complete subset) to produce auxilliary variables that will be used to adjust the weights for the inverse-weighted analysis. As discussed by Lumley et al [2] or Breslow et al 2009 [5], the ideal auxiliary variables would be the influence functions for each regression coefficient of interest. These are not observable since we have missing data, but we can approximate the influence functions for the desired regression coefficients by fitting an approximation to this model, which buys efficiency over the standard IPW if it is a good model, harmless if not (namely efficiency will remain roughly the same as for IPW). The steps are outlined in Breslow et al 2009 [5].
Build imputation model for missing (phase 2) variables using all phase 1 data. Something unusual here is we impute data for the variables prone to missingness even for people who had the phase 2 data observed (namely even for autopsy subset). Note, you could do this by building an imputation model and using multiple imputations, which has theortetical advantages, or you could simply use a single imputation [8]. A single imputation approach could also be simply to use a proxy or surrogate for the missing data. We will demonstrate each of these approaches below.
Fit the outcome working model using the imputed missing data (e.g. age at death and neuropath outcome) for everyone and get influence function for all regression coef in outcome model. These are the auxilliary variables used for raking in step 3. Raking is the name of the process (aka survey calibration) that uses these influence functions to adjust the IPW weights in order to incorporate into the weights the information from the indivduals not in the complete case subset. This process leverages the correlation between the auxiiliary variables (observed on whole cohort) and the true regression coefficient influence functions (only estimable on the complete case subset).
Calibrate the IPW weights using these influence functions.
Question: how much better is this than IPW?
Answer: For our example, we will see it appears we are getting some numerical benefit in terms of narrower confidence intervals, up to 10% efficiency.
We build an imputation model for the missing variables. In our example this is age at death and braak_bin outcome. We implement MICE. We first set up a dataset for imputation. We need to impute values of all variables (including outcome) prone to missingness for ALL individuals (whether originally observed or not). So, we append the original dataset with rows for each individual with observed data on all variables, but having the value of variables prone to missingness as NA; these rows are called “miss_phase2” below.
# Defining parameters and formulas to be used below.
miss_formula<-"auto_inc~postHS+sex+age_f+cohort+dementia"
formula <- "braak_bin~postHS+sex+age_death+cohort" # for the outcome (substantive) model
NimpRaking <- 25 # Number of imputed datasets. This takes a few min to run
# Counting parameters in outcome model
initfit <- glm(formula = as.formula(formula), family = "quasipoisson", data = dat.target)
coefnum <- length(coef(initfit))
data.mi <- model.frame(as.formula(formula), data = dat.target, na.action = NULL)
phase_2_indx <- (dat.target$auto_inc==1)
miss_phase2 <- data.mi[dat.target$auto_inc==1, ]
miss_cols <- which(colSums(is.na(data.mi)) > 0)
miss_phase2[, miss_cols] <- NA
data.mi2 <- rbind(data.mi, miss_phase2)
fake_phase_2 <- ((1:nrow(data.mi2)) %in% (nrow(data.mi) + 1):nrow(data.mi2))
data.mi <- data.mi2
# Create NimpRaking datasets where phase 2 data are imputed for all N individuals.
init <- mice::mice(data.mi, maxit = 0)
pred.matrix <- init$predictorMatrix
data.imputed <- mice::mice(data.mi, predictorMatrix = pred.matrix, m = NimpRaking, maxit = 20, print = FALSE)
We generate auxiliary variables for each imputed dataset. We do this by obtaining the influence functions from fitting a logistic regression model within each imputed dataset and then averaging values of influence functions across imputed datasets. This final step generates the expected value of the true influence function given the observed data with under ideal conditions would be the maximally efficient AIPW (Breslow et al 2009, [5]).
infMat_all <- array(data = 0, dim = c(nrow(dat.target), coefnum, NimpRaking))
for (iter in 1:NimpRaking) {
# Limiting the dataset to imputed data, i.e. removing rows where phase 2 data was originally observed.
imp_init <- mice::complete(data.imputed, iter)
impData_i <- imp_init[1:nrow(dat.target), ]
impData_i[phase_2_indx, ] <- imp_init[fake_phase_2, ]
mifit <- glm(formula = as.formula(formula), family = quasipoisson(link="log"), data = impData_i,x=TRUE)
infMat_all[, , iter] <- IC(mifit)
}
infMat <- rowMeans(infMat_all, dims = 2)
# Choose raking variables and add relevant raking variables to the original data frame. Here, we use generalized raking with all influence functions;
# rakeformula = ~ inf1 + ... + infk, where k = # coef fit in regression.
rakeformula <- "~ inf1"
for (i in 1:coefnum) {
varname <- paste0("inf", i)
dat.target$inf <- infMat[, i]
names(dat.target)[names(dat.target) == "inf"] <- varname
if (i > 1) {
rakeformula <- paste0(rakeformula, "+", varname)
}
}
Calibrate the weights and obtain the AIPW esitmator by fitting the desired outcome model with the calibrated weights.
# Creating a survey object and calibrating the weights. The design is individual (Poisson) sampling with selection prob = 1/wts.sel. This is the design to use for arbirary MAR models
mydesign <- survey::twophase(
id = list(~1, ~1), subset = ~I(dat.target$auto_inc == 1),
prob = list(NULL, ~I(1 / wts.sel)), data = dat.target,
pps = list(NULL, poisson_sampling(1 / dat.target$wts.sel[dat.target$auto_inc==1]))
)
infcal <- survey::calibrate(mydesign, formula = as.formula(rakeformula), phase = 2, calfun = "raking")
# Fitting the outcome model: conditional treatment effect of interest.
rakefit <- survey::svyglm(formula, design = infcal, family = quasipoisson)
summary(rakefit)
##
## Call:
## svyglm(formula = formula, design = infcal, family = quasipoisson)
##
## Survey design:
## survey::calibrate(mydesign, formula = as.formula(rakeformula),
## phase = 2, calfun = "raking")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.225998 0.781732 -9.244 < 2e-16 ***
## postHSBeyond high school -0.140157 0.123838 -1.132 0.258
## sexFemale 0.136729 0.123960 1.103 0.270
## age_death 0.066753 0.008449 7.901 8.69e-15 ***
## cohortExpansion 0.156903 0.125136 1.254 0.210
## cohortReplacement 0.216525 0.152092 1.424 0.155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.6905394)
##
## Number of Fisher Scoring iterations: 5
We can calculate the relative efficiency of the raking fit to IPW (with naive SE) and the RRZ procedure. Raking is more efficient for all variables with RE >1.
var.rakefit<-summary(rakefit)$coef[,2]^2 # GR it
var.RRZfit<-summary(RRZfit)$coef[,2]^2 # RRZ fit
### Percent relative efficiency with RRZ
round(var.RRZfit/var.rakefit*100,0)
## (Intercept) postHSBeyond high school sexFemale
## 97 101 103
## age_death cohortExpansion cohortReplacement
## 98 103 107
There was general improvement.
The above raking procedure is exactly the same for any glm regression. In this section, we will demonstrate the procedure for an ordinal and a survival outcome.
In this example, we will apply raking to the ordinal outcome braak_stg. For raking an ordinal outcome, we need a modified version of svy_vglm(), writting by Rod Walker. The VGAM package, on which this relies, is a bit fussy with the data frame needing to be a data.frame and not a tibble. So the svy_vglm_modify() function below was the way that we were able to get it to work. Thomas Lumley (core R developer) noted he didnt need the modification of svy_vglm(), so somehow this may depend on the R environment whether you need the modified function or can get away with using the original svy_vglm() in the svyVGAM package.
We begin by defining the modified version of the svyVGAM fucntion so that the column names are correctly abstracted from the design object when using estWeights function from the survey package.
Now we are ready to do the raking on the oridnal outcome braak stage. Note, the VGAM functions like the ordinal variable to be an ordered factor. This isnt imperative, but it does allow you to run these functions without getting any error messages. We will fit a proportional odds model using vglm, which requires the outcome to be an ordered factor. It is strongly recommended that to use the is.ordered() command in R to check that the ordinal factor variable is of ordered type. In particular, if one is going to use raking, which relies on #mice# to create auxiliary variables to calibrate on, the imputation procedures may be of higher quality if the variable is correctly coded as an ordered factor. For instance, a proportional odds type model would be used for imputation by default for an orded factor, whereas a multinomial model would be used for an unordered factor.
The code chunk below does both the IPW and raking (AIPW) esitmators using the survey package functions. For comparison, we estimate the IPW fit with naive SE and the recomended RRZ approach (IPW method #4 above).
dat.target$braakStgO<-factor(dat.target$braak_stg,ordered=TRUE)
# Check that the variable type is ordered factor
is.ordered(dat.target$braakStgO)
## [1] TRUE
outcome.formula<- as.formula("braakStgO ~ postHS + sex + age_death + cohort")
# Creating a survey object for complete case data with IPW weights
IPWdesign<-svydesign(id = ~1,weights = ~wts.sel, data = dat.target[auto_inc==1])
### Fit the IPW weighted proportional odds model
IPWfit<- svy_vglm(formula = outcome.formula,
design = IPWdesign,
family = cumulative(link="logitlink",
parallel=TRUE,
reverse=TRUE))
summary(IPWfit)
## svy_vglm.survey.design(formula = outcome.formula, design = IPWdesign,
## family = cumulative(link = "logitlink", parallel = TRUE,
## reverse = TRUE))
## Independent Sampling design (with replacement)
## svydesign(id = ~1, weights = ~wts.sel, data = dat.target[auto_inc ==
## 1])
## Coef SE z p
## (Intercept):1 -7.711895 1.039343 -7.4200 1.171e-13
## (Intercept):2 -9.357175 1.033995 -9.0495 < 2.2e-16
## (Intercept):3 -10.700008 1.043315 -10.2558 < 2.2e-16
## (Intercept):4 -11.543105 1.049985 -10.9936 < 2.2e-16
## (Intercept):5 -12.477361 1.072494 -11.6340 < 2.2e-16
## (Intercept):6 -13.643352 1.116115 -12.2240 < 2.2e-16
## postHSBeyond high school -0.164737 0.145307 -1.1337 0.256912
## sexFemale 0.220568 0.152764 1.4438 0.148781
## age_death 0.128366 0.011788 10.8895 < 2.2e-16
## cohortExpansion 0.346020 0.166464 2.0786 0.037650
## cohortReplacement 0.543068 0.180278 3.0124 0.002592
We apply the RRZ approach to estimating the selection weights, which ensures we get the correct SE for an IPW analysis.
designEstWt<-estWeights(data=as.data.frame(dat.target), formula=~postHS+sex+age_f+cohort+dementia,
subset=~I(auto_inc==1))
RRZfit <-svy_vglm_modify(formula = outcome.formula,
design = designEstWt,
family = cumulative(link="logitlink",
parallel=TRUE,
reverse=TRUE))
summary(RRZfit)
## svy_vglm_modify(formula = outcome.formula, design = designEstWt,
## family = cumulative(link = "logitlink", parallel = TRUE,
## reverse = TRUE))
## Two-phase sparse-matrix design:
## estWeights(data = as.data.frame(dat.target), formula = ~postHS +
## sex + age_f + cohort + dementia, subset = ~I(auto_inc ==
## 1))
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## svydesign(ids = ~1, fpc = `*phase1*`)
## Coef SE z p
## (Intercept):1 -7.711895 1.017806 -7.5770 3.537e-14
## (Intercept):2 -9.357175 1.009936 -9.2651 < 2.2e-16
## (Intercept):3 -10.700008 1.018049 -10.5103 < 2.2e-16
## (Intercept):4 -11.543105 1.026218 -11.2482 < 2.2e-16
## (Intercept):5 -12.477361 1.048769 -11.8971 < 2.2e-16
## (Intercept):6 -13.643352 1.097642 -12.4297 < 2.2e-16
## postHSBeyond high school -0.164737 0.145291 -1.1338 0.25686
## sexFemale 0.220568 0.153727 1.4348 0.15134
## age_death 0.128366 0.011488 11.1738 < 2.2e-16
## cohortExpansion 0.346020 0.168005 2.0596 0.03944
## cohortReplacement 0.543068 0.181427 2.9933 0.00276
We implement the generalized raking approach, which involves three steps as described above.
Do imputation needed to generate the auxiliary variable.
### First get rid of old influence functions generated above, because otherwise code below breaks
dat.target<-dat.target[,-c("inf1","inf2","inf3","inf4","inf5","inf6")]
# Defining parameters and formulas to be used below.
miss_formula<-"auto_inc~postHS+sex+age_f+cohort+dementia"
formula <- "braakStgO~postHS+sex+age_death+cohort" # for the outcome (substantive) model
NimpRaking <- 25 # Number of imputed datasets. This takes a few min to run
# Counting parameters in outcome model
coefnum <- length(coef(IPWfit))
data.mi <- model.frame(outcome.formula, data = dat.target, na.action = NULL)
phase_2_indx <- (dat.target$auto_inc==1)
miss_phase2 <- data.mi[dat.target$auto_inc==1, ]
miss_cols <- which(colSums(is.na(data.mi)) > 0)
miss_phase2[, miss_cols] <- NA
data.mi2 <- rbind(data.mi, miss_phase2)
fake_phase_2 <- ((1:nrow(data.mi2)) %in% (nrow(data.mi) + 1):nrow(data.mi2))
data.mi <- data.mi2
# Create NimpRaking datasets where phase 2 data are imputed for all N individuals.
init <- mice::mice(data.mi, maxit = 0)
pred.matrix <- init$predictorMatrix
data.imputed <- mice::mice(data.mi, predictorMatrix = pred.matrix, m = NimpRaking, maxit = 20, print = FALSE)
Estimate value of influence functions by fitting a logistic regression model within each imputed dataset, calculating the resulting influence functions, and then averaging values of influence functions across imputed datasets. Note we could not use the IC function for the model returned by vgam and so we calculate them by hand.
infMat_all <- array(data = 0, dim = c(nrow(dat.target), coefnum, NimpRaking))
for (iter in 1:NimpRaking) {
# Limiting the dataset to imputed data, i.e. removing rows where phase 2 data was originally observed.
imp_init <- mice::complete(data.imputed, iter)
impData_i <- imp_init[1:nrow(dat.target), ]
impData_i[phase_2_indx, ] <- imp_init[fake_phase_2, ]
mifit <- vgam(outcome.formula,data=impData_i,family=propodds)
scores <- weights(mifit, deriv = TRUE, type = "working")$deriv
inv_inf <- summaryvglm(mifit)@cov.unscaled
cons <- do.call(cbind, constraints(mifit))
mmat <- model.matrixvlm(mifit)
case_index <- as.numeric(gsub("^(.+):.*", "\\1", rownames(mmat)))
mmatsum <- t(t(rowsum(mmat, case_index, reorder = FALSE))/colSums(cons))
pwts<-1
inffuns <- (((scores/pwts) %*% cons) * mmatsum) %*% inv_inf
infMat_all[, , iter] <- inffuns
}
infMat <- rowMeans(infMat_all, dims = 2)
# rakeformula = ~ inf1 + ... + infk, where k = # coef fit in regression.
rakeformula <- "~ inf1"
for (i in 1:coefnum) {
varname <- paste0("inf", i)
dat.target$inf <- infMat[, i]
names(dat.target)[names(dat.target) == "inf"] <- varname
if (i > 1) {
rakeformula <- paste0(rakeformula, " + ", varname)
}
}
Calibrate the IPW weights and obtain the AIPW estimator.
# Creating a survey object for complete case data with IPW weights
IPWdesign<-svydesign(id = ~1,weights = ~wts.sel, data = dat.target[auto_inc==1])
# Creating a survey object and calibrating the weights. Note the as.data.frame() in the data statement instide the twophase() design. This seemed important in order to get the svy_vglm_modify() funciton to work.
IPWdesign <- survey::twophase(
id = list(~1, ~1), subset = ~I(dat.target$auto_inc == 1),
prob = list(NULL, ~I(1 / wts.sel)), data = as.data.frame(dat.target),
pps = list(NULL, poisson_sampling(1 / dat.target$wts.sel[dat.target$auto_inc==1]))
)
infcal <- survey::calibrate(IPWdesign, formula = as.formula(rakeformula), phase = 2, calfun = "raking")
rakefit <- svy_vglm_modify(formula = outcome.formula,
design = infcal,
family = cumulative(link="logitlink",
parallel=TRUE,
reverse=TRUE))
summary(rakefit)
## svy_vglm_modify(formula = outcome.formula, design = infcal, family = cumulative(link = "logitlink",
## parallel = TRUE, reverse = TRUE))
## Two-phase sparse-matrix design:
## survey::calibrate(IPWdesign, formula = as.formula(rakeformula),
## phase = 2, calfun = "raking")
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Sparse-matrix design object:
## calibrate.pps(phase2, formula, population, calfun = calfun, ...)
## Coef SE z p
## (Intercept):1 -7.524555 1.012833 -7.4292 1.092e-13
## (Intercept):2 -9.087809 1.009447 -9.0028 < 2.2e-16
## (Intercept):3 -10.411351 1.017050 -10.2368 < 2.2e-16
## (Intercept):4 -11.259861 1.025628 -10.9785 < 2.2e-16
## (Intercept):5 -12.190383 1.046002 -11.6543 < 2.2e-16
## (Intercept):6 -13.356525 1.088947 -12.2655 < 2.2e-16
## postHSBeyond high school -0.165664 0.143180 -1.1570 0.247260
## sexFemale 0.221857 0.151971 1.4599 0.144328
## age_death 0.125183 0.011509 10.8771 < 2.2e-16
## cohortExpansion 0.308621 0.167009 1.8479 0.064612
## cohortReplacement 0.506467 0.181357 2.7926 0.005228
As before, we can calculate the relative efficiency of the raking fit to the RRZ procedure. Raking is more efficient for all variables with RE >1.
var.rakefit<-diag(summary(rakefit)$var) # GR fit
var.RRZfit<-diag(summary(RRZfit)$var) # RRZ fit
### Percent relative efficiency with RRZ
round(var.RRZfit/var.rakefit*100,0)
## (Intercept):1 (Intercept):2 (Intercept):3
## 101 100 100
## (Intercept):4 (Intercept):5 (Intercept):6
## 100 101 102
## postHSBeyond high school sexFemale age_death
## 103 102 100
## cohortExpansion cohortReplacement
## 101 100
For the slope paramters in the regression there was mostly a slight improvement. Here the modest improvement is likely due to the weak correlation between the true influence functions and working auxiliary variables. We investigate this.
Really the only step different for applying raking for any particular outcome is how to generate the influence functions. Recall for glm objects, we could just use the IC function from the lava package, as done above. For the ordinal model in the previous example we had to calculate them by hand using the relevant peices from the fitted model. For the Cox model, the survival function in R makes this easy by providing jack-knifed influence functions available in the resid function applied to the coxph model fit. The residual type is “dfbetas”. We illustrate this by adapting an example in Section 8.5.2 in Thomas Lumley’s book on Complex Surveys: A guide to Analysis Using R [6]. This example using the classic National Wilms Tumor Study (NWTS) data set. Data are from Kulich and Lin [7]. The question of interest is to examine whether a tumor’s unfavorale histology affects relapse.
In these data, the exposure variable histol signifies whether the tumour histology was unfavorable, as measured by central pathology (the gold standard). Local histology is also available in the variable instit, which is an error prone proxy for the gold standard histology. We will artificially induce missingness in the variable histol so that is only observed on a subset. In this example, the histol is singly imputed using using a logistic regression prediction model. AIPW with singly imputated auxiliary variables has been noted to be nearly as effcient as with multiple imputation (Han et al 2020 [8]).
# read in the data, which is inside package addhazard
# Documentation for the data is here: https://rdrr.io/cran/addhazard/man/nwtsco.html
# Unfavorable histology
data(nwtsco,package="addhazard")
This version of the NWTS dataset has no missing data, which differs from the dataset used by Lumley [6]. For this example, we artificially induce some missing data based on case status; we can imagine a procedure that verified findings from the local pathology that the rarer and concering unfavorable histology with high probabiity and a subset of the non-cases. For this example, we induce some missingness in the variable histol, the gold standard exposure.
#Institutional histology (instit), 0 = Favorable (FH), 1 = Unfavorable (UH)
pmiss<-ifelse(nwtsco$instit==1, .8, .08)
nwtsco$R<-ifelse(runif(nrow(nwtsco)) < pmiss,1,0)
nwtsco$histol.obs <-ifelse(nwtsco$R==1, nwtsco$histol, NA)
We impute the missing data. The imputation model is the same as in Kulich and Lin [7] and Lumley [6].
### Now let's assume we only observe histol.obs and predict the missing data
impmodel<-glm(histol~instit +I(age>10) + I(stage==4)*study, data=nwtsco, subset=I(R==1), family=binomial)
nwtsco$imphist<-predict(impmodel,newdata=nwtsco, type="response")
The next step of raking proceeds: the working model is fit to the imputed data, the influence functions are extracted from this fit.
Fit working model and estimate auxiliary variables for calibration
workingmodel<-coxph(Surv(trel,relaps)~imphist*age+I(stage>2)*tumdiam, data=nwtsco)
IF<-resid(workingmodel, "dfbeta")
# matrix of influence functions nxk, where n=number of subjects and k=number of regression parameters
colnames(IF)<-paste("inf",1:ncol(IF),sep="")
IF<-data.frame(IF)
nwtsco$inf1<-IF$inf1
nwtsco$inf2<-IF$inf2
nwtsco$inf3<-IF$inf3
nwtsco$inf4<-IF$inf4
nwtsco$inf5<-IF$inf5
nwtsco$inf6<-IF$inf6
The IPW weights are calibrated using these influence function auxiliary variables.
#Set up IPW weights using RRZ approach. The phase II design was stratified sampling on instit
## not sure why couldnt do with twophase()
#designIPW<-survey::twophase(id= list(~1,~1), subset= ~I(R==1), strata=~list(NULL,~instit), data=as.data.frame(nwtsco))
designEstWt<-estWeights(nwtsco, formula=~instit,subset=~I(R==1))
## calibrate on influence functions and stratification variable instit
designCal<-survey::calibrate(designEstWt, phase=2, calfun="raking", ~inf1 + inf2+inf3+inf4+inf5+inf6+instit)
Finally the outcome model is fit with these calibrated weights.
RRZfit<-svycoxph(Surv(trel,relaps)~histol.obs*age+I(stage>2)*tumdiam, design=designEstWt)
AIPWfit<-svycoxph(Surv(trel,relaps)~histol.obs*age+I(stage>2)*tumdiam, design=designCal)
# Compare models
summary(RRZfit)$coef
## Two-phase sparse-matrix design:
## estWeights(nwtsco, formula = ~instit, subset = ~I(R == 1))
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## svydesign(ids = ~1, fpc = `*phase1*`)
## coef exp(coef) se(coef) robust se z
## histol.obs 2.33895656 10.3704100 0.41834624 0.46527016 5.027093
## age 0.06793264 1.0702932 0.04827570 0.05886851 1.153972
## I(stage > 2)TRUE 3.53821951 34.4056059 0.75861335 0.88956318 3.977480
## tumdiam 0.18990012 1.2091288 0.04049039 0.05284044 3.593841
## histol.obs:age -0.17258445 0.8414872 0.08752148 0.08730338 -1.976836
## I(stage > 2)TRUE:tumdiam -0.22328576 0.7998862 0.05663788 0.06409693 -3.483564
## Pr(>|z|)
## histol.obs 4.979698e-07
## age 2.485114e-01
## I(stage > 2)TRUE 6.964956e-05
## tumdiam 3.258390e-04
## histol.obs:age 4.806019e-02
## I(stage > 2)TRUE:tumdiam 4.947841e-04
summary(AIPWfit)$coef
## Two-phase sparse-matrix design:
## survey::calibrate(designEstWt, phase = 2, calfun = "raking",
## ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)
## coef exp(coef) se(coef) robust se z
## histol.obs 2.28904358 9.8654976 0.44786865 0.38752116 5.906887
## age 0.12300428 1.1308893 0.05018026 0.03233580 3.803966
## I(stage > 2)TRUE 1.30594782 3.6911860 0.76044241 0.41594987 3.139676
## tumdiam 0.05896458 1.0607377 0.04336911 0.02517739 2.341965
## histol.obs:age -0.19661601 0.8215060 0.09315553 0.07818661 -2.514702
## I(stage > 2)TRUE:tumdiam -0.07547965 0.9272986 0.06068383 0.03050509 -2.474329
## Pr(>|z|)
## histol.obs 3.486335e-09
## age 1.423977e-04
## I(stage > 2)TRUE 1.691348e-03
## tumdiam 1.918249e-02
## histol.obs:age 1.191331e-02
## I(stage > 2)TRUE:tumdiam 1.334865e-02
As before, we can calculate the relative efficiency of the raking fit to the RRZ procedure. Raking is more efficient for all variables with RE >1.
var.rakefit<-summary(AIPWfit)$coef[,4]^2 # GR fit
## Two-phase sparse-matrix design:
## survey::calibrate(designEstWt, phase = 2, calfun = "raking",
## ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)
var.RRZfit<-summary(RRZfit)$coef[,4]^2 # RRZ fit
## Two-phase sparse-matrix design:
## estWeights(nwtsco, formula = ~instit, subset = ~I(R == 1))
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## svydesign(ids = ~1, fpc = `*phase1*`)
### Percent relative efficiency with RRZ
round(var.RRZfit/var.rakefit*100,0)
## histol.obs age I(stage > 2)TRUE
## 144 331 457
## tumdiam histol.obs:age I(stage > 2)TRUE:tumdiam
## 440 125 441
Note the dramatic improvement in efficiency, due to having good auxiliary variables.
If you have a good proxy for the missing data, a perhaps simpler way to generate the auxiliary variables is to skip the data imputation step and simply use the error prone data in place of the missing data. You could then directly obtain the influence functoins from this working model using the IC function from lava for glm models and the resid function for survival models, as illustrated from the examples above. Each of the columns of the influence function matrix is a separate auxiliary variable and can be used to perform the calibration of the IPW weights. This approach was mentioned by Kulich and Lin 2004 [7]. We demonstrate this using again the Wilms Tumor dataset. The error prone histology variable is instit, which is observed on everyone whereas the gold standard pathology histol is only observed on a subset.
We can skip imputation and just use the error prone variable instit.
Generate influence functions from working model, to be used as auxiliary variable to calibrate the weights.
workingmodel2<-coxph(Surv(trel,relaps)~instit*age+I(stage>2)*tumdiam, data=nwtsco)
IF<-resid(workingmodel2, "dfbeta")
# matrix of influence functions nxk, where n=number of subjects and k=number of regression parameters
colnames(IF)<-paste("inf",1:ncol(IF),sep="")
IF<-data.frame(IF)
nwtsco$inf1<-IF$inf1
nwtsco$inf2<-IF$inf2
nwtsco$inf3<-IF$inf3
nwtsco$inf4<-IF$inf4
nwtsco$inf5<-IF$inf5
nwtsco$inf6<-IF$inf6
The IPW weights are calibrated using these influence function auxiliary variables.
#Set up IPW weights using RRZ approach. The phase II design was stratified sampling on instit
## not sure why couldnt do with twophase()
#designIPW<-survey::twophase(id= list(~1,~1), subset= ~I(R==1), strata=~list(NULL,~instit), data=as.data.frame(nwtsco))
designEstWt<-estWeights(nwtsco, formula=~instit,subset=~I(R==1))
## calibrate on influence functions and stratification variable instit
designCal<-survey::calibrate(designEstWt, phase=2, calfun="raking", ~inf1 + inf2+inf3+inf4+inf5+inf6+instit)
Finally the outcome model is fit with these calibrated weights. We see this AIPW approach is very similar to the previous AIPW results that relied on imputation.
AIPWfit2<-svycoxph(Surv(trel,relaps)~histol.obs*age+I(stage>2)*tumdiam, design=designCal)
# Compare AIPW models
summary(AIPWfit2)$coef
## Two-phase sparse-matrix design:
## survey::calibrate(designEstWt, phase = 2, calfun = "raking",
## ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)
## coef exp(coef) se(coef) robust se z
## histol.obs 2.33540461 10.3336401 0.44854905 0.38116505 6.127017
## age 0.12396396 1.1319751 0.05016502 0.03237237 3.829313
## I(stage > 2)TRUE 1.33449667 3.7980838 0.76148773 0.41782159 3.193939
## tumdiam 0.06028343 1.0621375 0.04345830 0.02516227 2.395787
## histol.obs:age -0.20405354 0.8154187 0.09410745 0.07621885 -2.677206
## I(stage > 2)TRUE:tumdiam -0.07714500 0.9257556 0.06069808 0.03059389 -2.521582
## Pr(>|z|)
## histol.obs 8.954212e-10
## age 1.285014e-04
## I(stage > 2)TRUE 1.403458e-03
## tumdiam 1.658474e-02
## histol.obs:age 7.423902e-03
## I(stage > 2)TRUE:tumdiam 1.168284e-02
summary(AIPWfit)$coef
## Two-phase sparse-matrix design:
## survey::calibrate(designEstWt, phase = 2, calfun = "raking",
## ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)
## coef exp(coef) se(coef) robust se z
## histol.obs 2.28904358 9.8654976 0.44786865 0.38752116 5.906887
## age 0.12300428 1.1308893 0.05018026 0.03233580 3.803966
## I(stage > 2)TRUE 1.30594782 3.6911860 0.76044241 0.41594987 3.139676
## tumdiam 0.05896458 1.0607377 0.04336911 0.02517739 2.341965
## histol.obs:age -0.19661601 0.8215060 0.09315553 0.07818661 -2.514702
## I(stage > 2)TRUE:tumdiam -0.07547965 0.9272986 0.06068383 0.03050509 -2.474329
## Pr(>|z|)
## histol.obs 3.486335e-09
## age 1.423977e-04
## I(stage > 2)TRUE 1.691348e-03
## tumdiam 1.918249e-02
## histol.obs:age 1.191331e-02
## I(stage > 2)TRUE:tumdiam 1.334865e-02
As before, we can calculate the relative efficiency of the raking fit to the RRZ procedure. Raking is more efficient for all variables with RE >1.
var.rakefit<-summary(AIPWfit2)$coef[,4]^2 # GR fit
## Two-phase sparse-matrix design:
## survey::calibrate(designEstWt, phase = 2, calfun = "raking",
## ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)
var.RRZfit<-summary(RRZfit)$coef[,4]^2 # RRZ fit
## Two-phase sparse-matrix design:
## estWeights(nwtsco, formula = ~instit, subset = ~I(R == 1))
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## svydesign(ids = ~1, fpc = `*phase1*`)
### Percent relative efficiency with RRZ
round(var.RRZfit/var.rakefit*100,0)
## histol.obs age I(stage > 2)TRUE
## 149 331 453
## tumdiam histol.obs:age I(stage > 2)TRUE:tumdiam
## 441 131 439
Again there was dramatic improvement over the RRZ approach, which relied on complete case data only and ignored the data missing the histol variable. We can compare the relative efficiency (RE) of two AIPW estimators, the one relying on the imputation prediction model (AIPWfit) versus just using the errorprone variable to impute the missing histol (AIPWfit2). Here RE > 1 indicates the approach that used the error prone variable instit in the working model (and not the predicted histol) is more efficent.
var.rakefit2<-summary(AIPWfit2)$coef[,4]^2 # GR fit with error prone auxiliary variables
## Two-phase sparse-matrix design:
## survey::calibrate(designEstWt, phase = 2, calfun = "raking",
## ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)
var.rakefit<-summary(AIPWfit)$coef[,4]^2 # GR fit with imputed auxiliary variables
## Two-phase sparse-matrix design:
## survey::calibrate(designEstWt, phase = 2, calfun = "raking",
## ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)
### Relative efficiency for the two raking estimators. > 1 means denominator has smaller variance
round(var.rakefit/var.rakefit2*100,0)
## histol.obs age I(stage > 2)TRUE
## 103 100 99
## tumdiam histol.obs:age I(stage > 2)TRUE:tumdiam
## 100 105 99
The two efficiencies are very similar, perhaps a slight gain in efficiency for the approach that used the error prone variables directly in the imputation model.
A common question about IPW models is which variables should be included in the selection model. Here are two helpful references that review the basic principles of building reasonable IPW models [9,10]. One thing to consider, predictors in the selection model should be related to the outcome of interst and not just the missingness, otherwise you risk just inflating the variance [10]. For inverse-probability weighted estimators in a different setting, Schuster et al 2016 [11] warns that traditional principles of avoiding overfitting in small sampes should be considered.
For AIPW esitmators obtained using generalized raking, one should consider only raking on a subset of influence functions in small samples- particuarly with small numbers of events for a binary or survival outcome. Each raking variable introduces an additional constraint that must be satistied. If you choose all influence functions you’ve doubled the number of estimation constraints, possibly moving towards a more unstable estimation problem if there are small numbers of events. In this case, choosing a subset of the influence functions to approximate for raking variables that are thought to be well measured and/or related to the variables of interst could be a good strategy.
Stefanski LA, Boos DD. The calculus of M-estimation. The American Statistician. 2002 Feb 1;56(1):29-38.
Lumley T, Shaw PA, Dai JY. Connections between survey calibration estimators and semiparametric models for incomplete data. International Statistical Review. 2011 Aug;79(2):200-20.
Robins JM, Rotnitzky A, Zhao LP. (1994) Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846-866.
https://r-survey.r-forge.r-project.org/survey/html/estweights.html
Breslow NE, Lumley T, Ballantyne CM, Chambless LE, Kulich M (2009). Improved Horvitz Thompson Estimation of Model Parameters from Two-Phase Stratified Samples: Applications in Epidemiology.” Statistics in Biosciences, 1(1), 32{49. doi:10.1007/s12561-009-9001-6.
Lumley T. Complex surveys: a guide to analysis using R. John Wiley & Sons; 2011 Sep 20.
Kulich M, Lin DY. Improving the efficiency of relative-risk estimation in case-cohort studies. Journal of the American Statistical Association. 2004 Sep 1;99(467):832-44.
Han K, Shaw PA, Lumley T. Combining multiple imputation with raking of weights: An efficient and robust approach in the setting of nearly true models. Statistics in Medicine. 2021 Dec 30;40(30):6777-91.
Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research. 2013 Jun;22(3):278-95.
Little RJ, Carpenter JR, Lee KJ. A comparison of three popular methods for handling missing data: complete-case analysis, inverse probability weighting, and multiple imputation. Sociological Methods & Research. 2024 Aug;53(3):1105-35.
Schuster T, Lowe WK, Platt RW. Propensity score model overfitting led to inflated variance of estimated odds ratios. Journal of Clinical Epidemiology. 2016 Dec 1;80:97-106.
Placeholder for how to cite this document.
Shaw, PA and Walker R (2026, Jan 12). ACT Statistical Guidance:
Weighting methods to handle selection bias due to missing data.
Retrieved from