Adult Changes in Thought (ACT) study, Data Science and Informatics Core, Kaiser Permanente Washington Health Research Institute, Seattle WA; contact email: pamela.a.shaw@kp.org

Introduction

This vignette will compare different ways of applying inverse-probability weighting to address selection bias when there is missing data. We will consider a running data example from the ACT study, where an association of interest can only be obtained from the subset of individuals in the autopsy sub-cohort, and demonstrate methods of analysis to address this selection bias using the R software. Of interest are design-based analyses, which will apply weights to the completely observed data in order to address the varying probabilty of being observed across different indivduals.This makes fewer assumptions, and thus is more robust, than a model that relies on multiple imputation to fill in the missing data. However, in this vignette we will explore augmented inverse-probability weighted estimators that will be able to improve on the efficiency of typical weighted estimators by involving data on the whole cohort, rather than only those with completely observed data.

We will first look at a couple of equivalent ways to obtain the commonly used, but inefficient, Horwitz-Thompson inverse-probability weighted (IPW) estimator. When using an IPW estimator, one needs to account for the estimated weights when making inference. In this vignette, we also show a simple method that properly account for the estimation of weights in the IPW analysis by using a type of sandwich estimator of the variance (from the stacked estimating equation approach [1]), which is more convienient than the bootstrap esitmator. We will then demonstrate how to implement an augmented inverse-probability weighted estimator (AIPW) using a technique called generalized raking (GR), which is also referred to as survey calibration [2]. AIPW estimators are generally going to be recommeded over IPW analyses of a richly characterized cohort such as the ACT cohort, because there will likely be a rich amount of information on the broader cohort that can be used to increase efficiency compared to the more typical IPW analyses that ignores data on those individuals without complete data.

Read in and format example data

We will use data from the autopsy cohort (complete data subset N=850) and demonstrate IPW and AIPW analyses for a binary outcome first in detail, then provide examples of similar estimators for continuous, ordinal and survival outcomes. Table 1 below shows the the subject characteristics of the analysis data set (N=850) from the autopsy cohort and how the characteristics compare to all those in ACT and all those in ACT that are now deceased.

#dat <- as.data.table(read_sas(paste(data_dir,"NP_example_2023_05_22.sas7bdat" ,sep="")))
dat <- as.data.table(read_sas(paste(data_dir,"NP_example_2024_04_16.sas7bdat" ,sep="")))

## creat FU_years variable in years (computed as days_to_last_biennial/365.25) 
dat[, FU_years := days_to_last_biennial/365.25]

## create potential factor variables of interest (exposures, adjustment covariates, or other descriptors)
dat[, postHS := factor(postHS, levels=c(0,1), labels=c("High school or less","Beyond high school"))]
dat[, sex := factor(gender, levels=c(1,2), labels=c("Male","Female"))]
dat[, cohort := factor(cohort, levels=c(1,2,3), labels=c("Original","Expansion","Replacement"))]
dat[, nh_white := factor(nh_white, levels=c(0,1), labels=c("No","Yes"))]
dat[, apoe := factor(apoe, levels=c(0,1), labels=c("No e4 allele","Yes e4 allele"))]
dat[, dementia := factor(dementia, levels=c(0,1), labels=c("No dementia","Yes dementia"))]
dat[, htn_f := factor(htn_f, levels=c(0,1), labels=c("No","Yes"))]
dat[, heart_dis_f := factor(heart_dis_f, levels=c(0,1), labels=c("No","Yes"))]
dat[, cv_dis_f := factor(cv_dis_f, levels=c(0,1), labels=c("No","Yes"))]
dat[, gait_able_f := factor(gait_able_f, levels=c(0,1), labels=c("Not able","Able"))]

## create select versions of factor variables with missing as a separate (allowed) category
dat[, htn_f_nacat := fct_na_value_to_level(htn_f, "Unknown")]
dat[, heart_dis_f_nacat := fct_na_value_to_level(heart_dis_f, "Unknown")]
dat[, cv_dis_f_nacat := fct_na_value_to_level(cv_dis_f, "Unknown")]
dat[, gait_able_f_nacat := fct_na_value_to_level(gait_able_f, "Unknown")]

## create select versions of continuous variables with missing as a separate (allowed) category
dat[, gait_time_f_nomiss := gait_time_f]
dat[, gait_nomiss := as.numeric(!is.na(gait_time_f))]
dat[gait_nomiss==0 , gait_time_f_nomiss := 0]
dat[, adl_sum_f_nomiss := adl_sum_f]
dat[, adl_nomiss := as.numeric(!is.na(adl_sum_f))]
dat[adl_nomiss==0 , adl_sum_f_nomiss := 0]
dat[, iadl_sum_f_nomiss := iadl_sum_f]
dat[, iadl_nomiss := as.numeric(!is.na(iadl_sum_f))]
dat[iadl_nomiss==0 , iadl_sum_f_nomiss := 0]
dat[, cesd_score_f_nomiss := cesd_score_f]
dat[, cesd_nomiss := as.numeric(!is.na(cesd_score_f))]
dat[cesd_nomiss==0 , cesd_score_f_nomiss := 0]

## construct spline terms - age_f
### Currently not used, but could explore if this made a difference in IPW and outcome models.
knots_age_f <- quantile(dat[,age_f], seq(0.05, 0.95, 0.30), na.rm=TRUE)
dat[, c("age_f_spl1","age_f_spl2","age_f_spl3") := data.table(rcspline.eval(dat$age_f, knots=knots_age_f, inclx=TRUE))]

## create binary and categorical outcome variables of interest
dat[braak_stg %in% c(0,1,2,3,4), braak_bin := 0]
dat[braak_stg %in% c(5,6), braak_bin := 1]
dat[is.na(braak_stg), braak_bin := NA]
label(dat$braak_bin) <- "Braak stage (V-VI vs. 0-IV)"
dat[, braak_stg := factor(braak_stg, levels=c(0,1,2,3,4,5,6), labels=c("0","I","II","III","IV","V","VI"))]
label(dat$braak_stg) <- "Braak stage"

## create inclusion criteria flags for 3 samples: autopsy, deceased, all
dat[, auto_inc := 0]
dat[neuropath==1, auto_inc := 1]
dat[, dth_inc := 0]
dat[!is.na(age_death), dth_inc := 1]
dat[, all_inc := 0]
dat[, all_inc := 1]

Table 1

	Autopsy sample (N=850)	All deceased (N=3198)	Entire ACT cohort (N=5763)
Educational degree
High school or less	429 (50%)	1891 (59%)	2602 (45%)
Beyond high school	421 (50%)	1306 (41%)	3160 (55%)
Missing	0 (0%)	1 (0.0%)	1 (0.0%)
Sex
Male	360 (42%)	1376 (43%)	2420 (42%)
Female	490 (58%)	1822 (57%)	3343 (58%)
ACT enrollment cohort
Original	518 (61%)	2217 (69%)	2581 (45%)
Expansion	193 (23%)	565 (18%)	811 (14%)
Replacement	139 (16%)	416 (13%)	2371 (41%)
Age (ACT study entry)
Mean (SD)	76.7 (6.6)	76.3 (6.6)	74.1 (6.4)
Age (Latest ACT biennial visit)
Mean (SD)	86.1 (6.6)	84.3 (7.0)	81.9 (7.7)
Age (Death)
Mean (SD)	88.7 (6.7)	87.0 (7.1)	87.0 (7.1)
Missing	0 (0%)	0 (0%)	2565 (44.5%)
Non-Hispanic white
No	49 (6%)	299 (9%)	638 (11%)
Yes	801 (94%)	2897 (91%)	5113 (89%)
Missing	0 (0%)	2 (0.1%)	12 (0.2%)
APOE genotype
No e4 allele	593 (70%)	2095 (66%)	3453 (60%)
Yes e4 allele	228 (27%)	738 (23%)	1233 (21%)
Missing	29 (3.4%)	365 (11.4%)	1077 (18.7%)
Dementia (DSM-IV)
No dementia	464 (55%)	2104 (66%)	4416 (77%)
Yes dementia	386 (45%)	1094 (34%)	1347 (23%)
Hypertension
No	293 (34%)	1022 (32%)	2179 (38%)
Yes	510 (60%)	1970 (62%)	3291 (57%)
Missing	47 (5.5%)	206 (6.4%)	293 (5.1%)
Coronary artery disease
No	513 (60%)	1771 (55%)	3790 (66%)
Yes	258 (30%)	1069 (33%)	1467 (25%)
Missing	79 (9.3%)	358 (11.2%)	506 (8.8%)
Cerebrovascular disease
No	520 (61%)	1983 (62%)	4107 (71%)
Yes	245 (29%)	828 (26%)	1109 (19%)
Missing	85 (10.0%)	387 (12.1%)	547 (9.5%)
Able to perform gait test
Not able	50 (6%)	138 (4%)	153 (3%)
Able	554 (65%)	2144 (67%)	4383 (76%)
Missing	246 (28.9%)	916 (28.6%)	1227 (21.3%)
Seconds to complete gait test
Mean (SD)	5.5 (3.4)	5.2 (2.9)	4.6 (2.4)
Missing	296 (34.8%)	1054 (33.0%)	1380 (23.9%)
# of ADL difficulties
Mean (SD)	1.6 (1.8)	1.3 (1.7)	0.9 (1.5)
Missing	100 (11.8%)	484 (15.1%)	657 (11.4%)
# of IADL difficulties
Mean (SD)	1.5 (1.7)	1.2 (1.6)	0.8 (1.4)
Missing	112 (13.2%)	528 (16.5%)	725 (12.6%)
10-item CES-D score
Mean (SD)	4.4 (4.6)	4.7 (4.8)	4.4 (4.6)
Missing	158 (18.6%)	651 (20.4%)	863 (15.0%)

Selection Model

Now we fit the selection models that predict the probability that an observation is complete, i.e. all the covariates and the outcome are observed for that observation. A good rule of thumb for which covariates to stick in the selection model would be to include those variables you plausibly think could be related to the probablity of consenting to autopsy, dying (so getting autopsied), and the outcome of interest. If the variable is not plausibly related to outcome, then it wont help to put it in the selection model, and in fact could inflate the variance [9,10]. A bit more discussion of considerations of variables to include in the selection model is provided at the end of this document.

## restrict data to target population of interest (in this case, entire ACT cohort)
dat.target <- copy(dat[all_inc==1]) 

## set basic predictors to include in selection models
vars.sel <- c("postHS", "sex", "age_f", "cohort", "dementia")

## restrict data further to require non-missing values for all selection model predictors 
dat.target <- dat.target[rowSums(is.na(dat.target[, ..vars.sel]))==0]
dat.target <- dat.target[!(auto_inc==1 & is.na(braak_bin))] # to avoid break code

## fit selection model for whether included in autopsy sample (logistic regression for odds of auto_inc=1)
selection.formula <-as.formula("auto_inc ~ postHS + sex + age_f + cohort + dementia")

fit.sel  <- glm(selection.formula,
                family = binomial(link="logit"),
                data = dat.target)

## generate inverse probability of selection weights among the autopsy sample
#### currently cant use stabilized weights in two-phase poisson sampling survey variance calculations
dat.target$sel.prob<-predict(fit.sel, type="response")
dat.target$wts.sel <- 1/dat.target$sel.prob

Summarize selection model and variability of the weights

summary(fit.sel) # Selection model summary

## 
## Call:
## glm(formula = selection.formula, family = binomial(link = "logit"), 
##     data = dat.target)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -6.966334   0.516949 -13.476  < 2e-16 ***
## postHSBeyond high school  0.276636   0.083051   3.331 0.000866 ***
## sexFemale                -0.179054   0.080951  -2.212 0.026974 *  
## age_f                     0.061354   0.005995  10.234  < 2e-16 ***
## cohortExpansion           0.147153   0.100544   1.464 0.143313    
## cohortReplacement        -0.948522   0.111718  -8.490  < 2e-16 ***
## dementiaYes dementia      0.875689   0.082048  10.673  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4802.2  on 5756  degrees of freedom
## Residual deviance: 4260.4  on 5750  degrees of freedom
## AIC: 4274.4
## 
## Number of Fisher Scoring iterations: 5

### Summarize distribution of the weights
summary(dat.target$wts.sel)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.767   4.669   8.364  12.879  19.429  61.696

Some weights are a bit large, but since < 100, we will not do anything to stabilize. This is a bias-variance tradeoff that needs to be considered for each individual setting. There is no universal rule of thumb. If of concern, one could trim the weights to be say < 10, and see if that qualitatively changes conclusions from the analysis. Before seeing the results decide first which analysis is primary, and report the other one as a sensitivity analysis. Dont make this a p-value driven decision.

To evaluate how well the selection model predicted missingness, we plot an ROC curve.

auc.rocr <- performance(prediction(fitted(fit.sel), fit.sel$y), "auc")
roc.rocr <- performance(prediction(fitted(fit.sel), fit.sel$y), "tpr","fpr")
plot(roc.rocr, main=paste("AUC =", round(auc.rocr@y.values[[1]],4)), cex.main=1)
abline(a=0,b=1)

The AUC is 0.74, showing moderate success at predicting missingness (0.5 is a useless binary prediction model and 1 is perfection).

IPW outcome regression

To work with AIPW, it is handy to review how to perform the standard IPW analysis with a regression function in the survey package. First we will look at three different ways to fit IPW, in order to compare estimators and convince you that survey gives you the same answer as the other packages that are perhaps more familiar to some users. Note that these models below have naive standard errors, that dont account for the fact that the weights were estimated, but the point of this section is just to compare point estimates. Answers are identical between these three methods. This could lead to conservative SE estimates. A commonly fitted fix for that is to bootstrap the SE, however, there is a simpler way to fix the standard errors, which we will demonstrate below in IPW Method 4. After this section we will perform all analyses (both IPW and AIPW) using the survey package.

We consider a binary outcome

We fit an outcome model using modified Poisson regression to estimate the relative risk model for a binary outcome with standard IPW. We fit the IPW model 3 different ways using some common approaches in R.

3 IPW methods with naive SE

##########################################################################
#### Three IPW approaches that do not adjust SE for estimated weights ####
### The main idea here is to see that the point estimates are the same ###
##########################################################################

### Using an outcome formula variable to save a little typing, with some simple R code
outcome.formula<- as.formula("braak_bin ~postHS + sex + age_death + cohort")


## Method 1: using svyglm (from survey package)
svyglm.bin.wt  <- svyglm(formula = outcome.formula,
                      design = svydesign(id = ~1,
                                         weights = ~wts.sel,
                                         data = dat.target[auto_inc==1]),
                      family = quasipoisson(link="log"))

## Method 2: using geeglm (from geepack package)
gee.bin.wt  <- geeglm(formula = outcome.formula,
                      id = SUBJECT, 
                      family = poisson(link="log"),
                      data = dat.target[auto_inc==1],
                      weights = wts.sel)

## Method 3: using glm with sandwich package
glm.bin.wt  <- glm(formula = outcome.formula,
                   family = quasipoisson(link="log"),
                   data = dat.target[auto_inc==1],
                   weights = wts.sel)
                   
                   
## equivalent naive Wald CIs from the 3 approaches
#### Original Scale
#round(exp(cbind(svyglm.bin.wt$coefficients, confint(svyglm.bin.wt))), 3)
#round(exp(cbind(gee.bin.wt$coefficients, confint.default(gee.bin.wt))), 3)
#round(exp(cbind(glm.bin.wt$coefficients,
 #               "2.5 %"=glm.bin.wt$coefficients-1.96*sqrt(diag(sandwich(glm.bin.wt))),
  #              "97.5 %"=glm.bin.wt$coefficients+1.96*sqrt(diag(sandwich(glm.bin.wt))))), 3)
                
                
#### CI for each method on the log Scale
round(cbind(svyglm.bin.wt$coefficients, confint(svyglm.bin.wt)), 3)

##                                  2.5 % 97.5 %
## (Intercept)              -7.246 -8.802 -5.690
## postHSBeyond high school -0.134 -0.374  0.106
## sexFemale                 0.125 -0.119  0.368
## age_death                 0.067  0.050  0.084
## cohortExpansion           0.132 -0.113  0.377
## cohortReplacement         0.179 -0.120  0.479

round(cbind(gee.bin.wt$coefficients, confint.default(gee.bin.wt)), 3)

##                                  2.5 % 97.5 %
## (Intercept)              -7.246 -8.799 -5.693
## postHSBeyond high school -0.134 -0.373  0.105
## sexFemale                 0.125 -0.119  0.368
## age_death                 0.067  0.050  0.084
## cohortExpansion           0.132 -0.112  0.376
## cohortReplacement         0.179 -0.120  0.478

round(cbind(glm.bin.wt$coefficients,
                "2.5 %"=glm.bin.wt$coefficients-1.96*sqrt(diag(sandwich(glm.bin.wt))),
                "97.5 %"=glm.bin.wt$coefficients+1.96*sqrt(diag(sandwich(glm.bin.wt)))), 3)

##                                  2.5 % 97.5 %
## (Intercept)              -7.246 -8.799 -5.693
## postHSBeyond high school -0.134 -0.373  0.105
## sexFemale                 0.125 -0.119  0.368
## age_death                 0.067  0.050  0.084
## cohortExpansion           0.132 -0.112  0.376
## cohortReplacement         0.179 -0.120  0.478

IPW Method 4, with correct SE

This method produces estimates that are asymptoptically equivalent to the above IPW estimates, but not exactly the same point estimates. The estWeights function from the survey package is applying the method discussed in Robins, Rotnitzky, and Zhao 1994 (RRZ) [3] to estimate weights and then perform a weighted outcome regression. Further details on this approach are provided in section 2.3 of Lumley et al 2011 [2]. Using the estWeights function returns a design with the estimated weights and the necessary information to allow the subsequent outcome regression applied in the survey package to adjust the standard errors to account for estimated weights, using the stacked estimating equation sandwich approach. Note, this “sandwich” SE estimator is different than the “robust SE” sandwich estimator returned for a GEE weighted regression, which would be accounting for the weighting but not the fact that the weights were estimated.

The following chunk of code uses the estWeights command [4] to obtain the IPW weights. Here we only need to provide estWeights with the list of variables to put in the selection model. We use the same covariates that determined wts.sel above. We then estimate the IPW outcome regression and compare the results to svyglm (which had naive SE estimates). There isnt a clear pattern in terms of which method has smaller SE, but we trust the RRZfit estimates to appropriately account for the estimated weights. Thus regardless of whether SE are smaller, we reccomend this RRZ method of applying IPW and getting SE.

# the formula here is for the selection model
designEstWt<-estWeights(dat.target, formula=~postHS+sex+age_f+cohort+dementia,
                 subset=~I(auto_inc==1))     

# Now we peform the IPW outcome regression
RRZfit<-svyglm(braak_bin~postHS+sex+age_death+cohort,design=designEstWt,family = quasipoisson(link="log"))

### log scale confidence interval, compare to IPW
round(cbind(RRZfit$coefficients, confint(RRZfit)), 3)

##                                  2.5 % 97.5 %
## (Intercept)              -7.246 -8.759 -5.734
## postHSBeyond high school -0.134 -0.378  0.110
## sexFemale                 0.125 -0.123  0.372
## age_death                 0.067  0.051  0.084
## cohortExpansion           0.132 -0.117  0.381
## cohortReplacement         0.179 -0.130  0.489

### We can compare the two IPW fits. We see point estimates are the same, though in this case the SE arent always bigger for the naive approach.
summary(RRZfit)

## 
## Call:
## svyglm(formula = braak_bin ~ postHS + sex + age_death + cohort, 
##     design = designEstWt, family = quasipoisson(link = "log"))
## 
## Survey design:
## estWeights(dat.target, formula = ~postHS + sex + age_f + cohort + 
##     dementia, subset = ~I(auto_inc == 1))
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -7.246350   0.770446  -9.405  < 2e-16 ***
## postHSBeyond high school -0.134053   0.124276  -1.079    0.281    
## sexFemale                 0.124512   0.126082   0.988    0.324    
## age_death                 0.067262   0.008349   8.056  2.7e-15 ***
## cohortExpansion           0.131842   0.126920   1.039    0.299    
## cohortReplacement         0.179378   0.157536   1.139    0.255    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.6920678)
## 
## Number of Fisher Scoring iterations: 5

summary(svyglm.bin.wt)

## 
## Call:
## svyglm(formula = outcome.formula, design = svydesign(id = ~1, 
##     weights = ~wts.sel, data = dat.target[auto_inc == 1]), family = quasipoisson(link = "log"))
## 
## Survey design:
## svydesign(id = ~1, weights = ~wts.sel, data = dat.target[auto_inc == 
##     1])
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -7.24635    0.79278  -9.140  < 2e-16 ***
## postHSBeyond high school -0.13405    0.12224  -1.097    0.273    
## sexFemale                 0.12451    0.12419   1.003    0.316    
## age_death                 0.06726    0.00857   7.849 1.28e-14 ***
## cohortExpansion           0.13184    0.12469   1.057    0.291    
## cohortReplacement         0.17938    0.15260   1.175    0.240    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.6920678)
## 
## Number of Fisher Scoring iterations: 5

Generalized raking

Now let’s improve efficiency of the IPW appraoch by implementing an AIPW estimator using generalized raking (GR) approach.

This approach relies on building a working model that uses data from the entire cohort (not just the complete subset) to produce auxilliary variables that will be used to adjust the weights for the inverse-weighted analysis. As discussed by Lumley et al [2] or Breslow et al 2009 [5], the ideal auxiliary variables would be the influence functions for each regression coefficient of interest. These are not observable since we have missing data, but we can approximate the influence functions for the desired regression coefficients by fitting an approximation to this model, which buys efficiency over the standard IPW if it is a good model, harmless if not (namely efficiency will remain roughly the same as for IPW). The steps are outlined in Breslow et al 2009 [5].

GR Steps for case where we have no good proxy for the missing (phase 2) data

Build imputation model for missing (phase 2) variables using all phase 1 data. Something unusual here is we impute data for the variables prone to missingness even for people who had the phase 2 data observed (namely even for autopsy subset). Note, you could do this by building an imputation model and using multiple imputations, which has theortetical advantages, or you could simply use a single imputation [8]. A single imputation approach could also be simply to use a proxy or surrogate for the missing data. We will demonstrate each of these approaches below.
Fit the outcome working model using the imputed missing data (e.g. age at death and neuropath outcome) for everyone and get influence function for all regression coef in outcome model. These are the auxilliary variables used for raking in step 3. Raking is the name of the process (aka survey calibration) that uses these influence functions to adjust the IPW weights in order to incorporate into the weights the information from the indivduals not in the complete case subset. This process leverages the correlation between the auxiiliary variables (observed on whole cohort) and the true regression coefficient influence functions (only estimable on the complete case subset).
Calibrate the IPW weights using these influence functions.

Question: how much better is this than IPW?

Answer: For our example, we will see it appears we are getting some numerical benefit in terms of narrower confidence intervals, up to 10% efficiency.

GR Step 1

We build an imputation model for the missing variables. In our example this is age at death and braak_bin outcome. We implement MICE. We first set up a dataset for imputation. We need to impute values of all variables (including outcome) prone to missingness for ALL individuals (whether originally observed or not). So, we append the original dataset with rows for each individual with observed data on all variables, but having the value of variables prone to missingness as NA; these rows are called “miss_phase2” below.

    # Defining parameters and formulas to be used below.
    miss_formula<-"auto_inc~postHS+sex+age_f+cohort+dementia"
    formula <- "braak_bin~postHS+sex+age_death+cohort" # for the outcome (substantive) model
    NimpRaking <- 25 # Number of imputed datasets. This takes a few min to run

    # Counting parameters in outcome model
    initfit <- glm(formula = as.formula(formula), family = "quasipoisson", data = dat.target)
    coefnum <- length(coef(initfit))
     
    data.mi <- model.frame(as.formula(formula), data = dat.target, na.action = NULL)    
    phase_2_indx <- (dat.target$auto_inc==1)
    miss_phase2 <- data.mi[dat.target$auto_inc==1, ]
    miss_cols <- which(colSums(is.na(data.mi)) > 0)
    miss_phase2[, miss_cols] <- NA
    data.mi2 <- rbind(data.mi, miss_phase2)
    fake_phase_2 <- ((1:nrow(data.mi2)) %in% (nrow(data.mi) + 1):nrow(data.mi2))
    data.mi <- data.mi2
    
    # Create NimpRaking datasets where phase 2 data are imputed for all N individuals.
  init <- mice::mice(data.mi, maxit = 0)
  pred.matrix <- init$predictorMatrix
  data.imputed <- mice::mice(data.mi, predictorMatrix = pred.matrix, m = NimpRaking, maxit = 20, print = FALSE)

GR Step 2

We generate auxiliary variables for each imputed dataset. We do this by obtaining the influence functions from fitting a logistic regression model within each imputed dataset and then averaging values of influence functions across imputed datasets. This final step generates the expected value of the true influence function given the observed data with under ideal conditions would be the maximally efficient AIPW (Breslow et al 2009, [5]).

  infMat_all <- array(data = 0, dim = c(nrow(dat.target), coefnum, NimpRaking))
    for (iter in 1:NimpRaking) {
      # Limiting the dataset to imputed data, i.e. removing rows where phase 2 data was originally observed.
      imp_init <- mice::complete(data.imputed, iter)
        impData_i <- imp_init[1:nrow(dat.target), ]
        impData_i[phase_2_indx, ] <- imp_init[fake_phase_2, ]
    mifit <- glm(formula = as.formula(formula), family = quasipoisson(link="log"), data = impData_i,x=TRUE)
    infMat_all[, , iter] <- IC(mifit)
    }
  infMat <- rowMeans(infMat_all, dims = 2)


# Choose raking variables and add relevant raking variables to the original data frame. Here, we use generalized raking with all influence functions;
    # rakeformula = ~ inf1 + ... + infk, where k = # coef fit in regression.
    rakeformula <- "~ inf1"
    for (i in 1:coefnum) {
        varname <- paste0("inf", i)
        dat.target$inf <- infMat[, i]
        names(dat.target)[names(dat.target) == "inf"] <- varname
        if (i > 1) {
            rakeformula <- paste0(rakeformula, "+", varname)
        }
    }

GR Step 3

Calibrate the weights and obtain the AIPW esitmator by fitting the desired outcome model with the calibrated weights.

    # Creating a survey object and calibrating the weights. The design is individual (Poisson) sampling with selection prob = 1/wts.sel. This is the design to use for arbirary MAR models
    mydesign <- survey::twophase(
    id = list(~1, ~1), subset = ~I(dat.target$auto_inc == 1),
    prob = list(NULL, ~I(1 / wts.sel)), data = dat.target,
    pps = list(NULL, poisson_sampling(1 / dat.target$wts.sel[dat.target$auto_inc==1]))
    )
    infcal <- survey::calibrate(mydesign, formula = as.formula(rakeformula), phase = 2, calfun = "raking")
    # Fitting the outcome model: conditional treatment effect of interest.
    rakefit <- survey::svyglm(formula, design = infcal, family = quasipoisson)

summary(rakefit)

## 
## Call:
## svyglm(formula = formula, design = infcal, family = quasipoisson)
## 
## Survey design:
## survey::calibrate(mydesign, formula = as.formula(rakeformula), 
##     phase = 2, calfun = "raking")
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -7.225998   0.781732  -9.244  < 2e-16 ***
## postHSBeyond high school -0.140157   0.123838  -1.132    0.258    
## sexFemale                 0.136729   0.123960   1.103    0.270    
## age_death                 0.066753   0.008449   7.901 8.69e-15 ***
## cohortExpansion           0.156903   0.125136   1.254    0.210    
## cohortReplacement         0.216525   0.152092   1.424    0.155    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.6905394)
## 
## Number of Fisher Scoring iterations: 5

We can calculate the relative efficiency of the raking fit to IPW (with naive SE) and the RRZ procedure. Raking is more efficient for all variables with RE >1.

var.rakefit<-summary(rakefit)$coef[,2]^2 # GR it
var.RRZfit<-summary(RRZfit)$coef[,2]^2  # RRZ fit

### Percent relative efficiency with RRZ
round(var.RRZfit/var.rakefit*100,0)

##              (Intercept) postHSBeyond high school                sexFemale 
##                       97                      101                      103 
##                age_death          cohortExpansion        cohortReplacement 
##                       98                      103                      107

There was general improvement.

Generalized raking for other types of outcomes

The above raking procedure is exactly the same for any glm regression. In this section, we will demonstrate the procedure for an ordinal and a survival outcome.

Ordinal outcome

In this example, we will apply raking to the ordinal outcome braak_stg. For raking an ordinal outcome, we need a modified version of svy_vglm(), writting by Rod Walker. The VGAM package, on which this relies, is a bit fussy with the data frame needing to be a data.frame and not a tibble. So the svy_vglm_modify() function below was the way that we were able to get it to work. Thomas Lumley (core R developer) noted he didnt need the modification of svy_vglm(), so somehow this may depend on the R environment whether you need the modified function or can get away with using the original svy_vglm() in the svyVGAM package.

We begin by defining the modified version of the svyVGAM fucntion so that the column names are correctly abstracted from the design object when using estWeights function from the survey package.

Create svy_vglm_modify()

Now we are ready to do the raking on the oridnal outcome braak stage. Note, the VGAM functions like the ordinal variable to be an ordered factor. This isnt imperative, but it does allow you to run these functions without getting any error messages. We will fit a proportional odds model using vglm, which requires the outcome to be an ordered factor. It is strongly recommended that to use the is.ordered() command in R to check that the ordinal factor variable is of ordered type. In particular, if one is going to use raking, which relies on #mice# to create auxiliary variables to calibrate on, the imputation procedures may be of higher quality if the variable is correctly coded as an ordered factor. For instance, a proportional odds type model would be used for imputation by default for an orded factor, whereas a multinomial model would be used for an unordered factor.

The code chunk below does both the IPW and raking (AIPW) esitmators using the survey package functions. For comparison, we estimate the IPW fit with naive SE and the recomended RRZ approach (IPW method #4 above).

Standard IPW with naive SE

dat.target$braakStgO<-factor(dat.target$braak_stg,ordered=TRUE)

# Check that the variable type is ordered factor
is.ordered(dat.target$braakStgO)

## [1] TRUE

outcome.formula<- as.formula("braakStgO ~ postHS + sex + age_death + cohort")

# Creating a survey object for complete case data with IPW weights
IPWdesign<-svydesign(id = ~1,weights = ~wts.sel, data = dat.target[auto_inc==1])

### Fit the IPW weighted proportional odds model
IPWfit<- svy_vglm(formula = outcome.formula,
                      design = IPWdesign,
                      family = cumulative(link="logitlink",
                                                   parallel=TRUE,
                                                   reverse=TRUE))
summary(IPWfit)

## svy_vglm.survey.design(formula = outcome.formula, design = IPWdesign, 
##     family = cumulative(link = "logitlink", parallel = TRUE, 
##         reverse = TRUE))
## Independent Sampling design (with replacement)
## svydesign(id = ~1, weights = ~wts.sel, data = dat.target[auto_inc == 
##     1])
##                                Coef         SE        z         p
## (Intercept):1             -7.711895   1.039343  -7.4200 1.171e-13
## (Intercept):2             -9.357175   1.033995  -9.0495 < 2.2e-16
## (Intercept):3            -10.700008   1.043315 -10.2558 < 2.2e-16
## (Intercept):4            -11.543105   1.049985 -10.9936 < 2.2e-16
## (Intercept):5            -12.477361   1.072494 -11.6340 < 2.2e-16
## (Intercept):6            -13.643352   1.116115 -12.2240 < 2.2e-16
## postHSBeyond high school  -0.164737   0.145307  -1.1337  0.256912
## sexFemale                  0.220568   0.152764   1.4438  0.148781
## age_death                  0.128366   0.011788  10.8895 < 2.2e-16
## cohortExpansion            0.346020   0.166464   2.0786  0.037650
## cohortReplacement          0.543068   0.180278   3.0124  0.002592

RRZ IPW Method with correct SE

We apply the RRZ approach to estimating the selection weights, which ensures we get the correct SE for an IPW analysis.

designEstWt<-estWeights(data=as.data.frame(dat.target), formula=~postHS+sex+age_f+cohort+dementia,
                 subset=~I(auto_inc==1))                          
RRZfit <-svy_vglm_modify(formula = outcome.formula,
                      design = designEstWt,
                      family = cumulative(link="logitlink",
                                                   parallel=TRUE,
                                                   reverse=TRUE))
summary(RRZfit)

## svy_vglm_modify(formula = outcome.formula, design = designEstWt, 
##     family = cumulative(link = "logitlink", parallel = TRUE, 
##         reverse = TRUE))
## Two-phase sparse-matrix design:
##  estWeights(data = as.data.frame(dat.target), formula = ~postHS + 
##     sex + age_f + cohort + dementia, subset = ~I(auto_inc == 
##     1))
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## svydesign(ids = ~1, fpc = `*phase1*`)
##                                Coef         SE        z         p
## (Intercept):1             -7.711895   1.017806  -7.5770 3.537e-14
## (Intercept):2             -9.357175   1.009936  -9.2651 < 2.2e-16
## (Intercept):3            -10.700008   1.018049 -10.5103 < 2.2e-16
## (Intercept):4            -11.543105   1.026218 -11.2482 < 2.2e-16
## (Intercept):5            -12.477361   1.048769 -11.8971 < 2.2e-16
## (Intercept):6            -13.643352   1.097642 -12.4297 < 2.2e-16
## postHSBeyond high school  -0.164737   0.145291  -1.1338   0.25686
## sexFemale                  0.220568   0.153727   1.4348   0.15134
## age_death                  0.128366   0.011488  11.1738 < 2.2e-16
## cohortExpansion            0.346020   0.168005   2.0596   0.03944
## cohortReplacement          0.543068   0.181427   2.9933   0.00276

AIPW method: Generalized raking

We implement the generalized raking approach, which involves three steps as described above.

GR Step 1

Do imputation needed to generate the auxiliary variable.

### First get rid of old influence functions generated above, because otherwise code below breaks
dat.target<-dat.target[,-c("inf1","inf2","inf3","inf4","inf5","inf6")]

    # Defining parameters and formulas to be used below.
    miss_formula<-"auto_inc~postHS+sex+age_f+cohort+dementia"
    formula <- "braakStgO~postHS+sex+age_death+cohort" # for the outcome (substantive) model
    NimpRaking <- 25 # Number of imputed datasets. This takes a few min to run

    # Counting parameters in outcome model
    coefnum <- length(coef(IPWfit))
     
    data.mi <- model.frame(outcome.formula, data = dat.target, na.action = NULL) 
    phase_2_indx <- (dat.target$auto_inc==1)
    miss_phase2 <- data.mi[dat.target$auto_inc==1, ]
    miss_cols <- which(colSums(is.na(data.mi)) > 0)
    miss_phase2[, miss_cols] <- NA
    data.mi2 <- rbind(data.mi, miss_phase2)
    fake_phase_2 <- ((1:nrow(data.mi2)) %in% (nrow(data.mi) + 1):nrow(data.mi2))
    data.mi <- data.mi2
    
    # Create NimpRaking datasets where phase 2 data are imputed for all N individuals.
  init <- mice::mice(data.mi, maxit = 0)
  pred.matrix <- init$predictorMatrix
  data.imputed <- mice::mice(data.mi, predictorMatrix = pred.matrix, m = NimpRaking, maxit = 20, print = FALSE)

GR Step 2

Estimate value of influence functions by fitting a logistic regression model within each imputed dataset, calculating the resulting influence functions, and then averaging values of influence functions across imputed datasets. Note we could not use the IC function for the model returned by vgam and so we calculate them by hand.

  infMat_all <- array(data = 0, dim = c(nrow(dat.target), coefnum, NimpRaking))
    for (iter in 1:NimpRaking) {
      # Limiting the dataset to imputed data, i.e. removing rows where phase 2 data was originally observed.
      imp_init <- mice::complete(data.imputed, iter)
        impData_i <- imp_init[1:nrow(dat.target), ]
        impData_i[phase_2_indx, ] <- imp_init[fake_phase_2, ]
    mifit <- vgam(outcome.formula,data=impData_i,family=propodds) 
    scores <- weights(mifit, deriv = TRUE, type = "working")$deriv  
    inv_inf <- summaryvglm(mifit)@cov.unscaled                      
    cons <- do.call(cbind, constraints(mifit))                  
    mmat <- model.matrixvlm(mifit)                                      
    case_index <- as.numeric(gsub("^(.+):.*", "\\1", rownames(mmat)))   
    mmatsum <- t(t(rowsum(mmat, case_index, reorder = FALSE))/colSums(cons))
    pwts<-1
    inffuns <- (((scores/pwts) %*% cons) * mmatsum) %*% inv_inf     
    infMat_all[, , iter] <- inffuns                                         
    }
  infMat <- rowMeans(infMat_all, dims = 2)

    # rakeformula = ~ inf1 + ... + infk, where k = # coef fit in regression.
    rakeformula <- "~ inf1"
    for (i in 1:coefnum) {
        varname <- paste0("inf", i)
        dat.target$inf <- infMat[, i]
        names(dat.target)[names(dat.target) == "inf"] <- varname
        if (i > 1) {
            rakeformula <- paste0(rakeformula, " + ", varname)
        }
    }

GR Step 3

Calibrate the IPW weights and obtain the AIPW estimator.

    # Creating a survey object for complete case data with IPW weights
IPWdesign<-svydesign(id = ~1,weights = ~wts.sel, data = dat.target[auto_inc==1])


# Creating a survey object and calibrating the weights. Note the as.data.frame() in the data statement instide the twophase() design. This seemed important in order to get the svy_vglm_modify() funciton to work.
IPWdesign <- survey::twophase(
    id = list(~1, ~1), subset = ~I(dat.target$auto_inc == 1),
    prob = list(NULL, ~I(1 / wts.sel)), data = as.data.frame(dat.target),
    pps = list(NULL, poisson_sampling(1 / dat.target$wts.sel[dat.target$auto_inc==1]))
)
infcal <- survey::calibrate(IPWdesign, formula = as.formula(rakeformula), phase = 2, calfun = "raking")

rakefit <- svy_vglm_modify(formula = outcome.formula,
                               design = infcal,
                               family = cumulative(link="logitlink",
                                                   parallel=TRUE,
                                                   reverse=TRUE))
summary(rakefit)

## svy_vglm_modify(formula = outcome.formula, design = infcal, family = cumulative(link = "logitlink", 
##     parallel = TRUE, reverse = TRUE))
## Two-phase sparse-matrix design:
##  survey::calibrate(IPWdesign, formula = as.formula(rakeformula), 
##     phase = 2, calfun = "raking")
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Sparse-matrix design object:
##  calibrate.pps(phase2, formula, population, calfun = calfun, ...)
##                                Coef         SE        z         p
## (Intercept):1             -7.524555   1.012833  -7.4292 1.092e-13
## (Intercept):2             -9.087809   1.009447  -9.0028 < 2.2e-16
## (Intercept):3            -10.411351   1.017050 -10.2368 < 2.2e-16
## (Intercept):4            -11.259861   1.025628 -10.9785 < 2.2e-16
## (Intercept):5            -12.190383   1.046002 -11.6543 < 2.2e-16
## (Intercept):6            -13.356525   1.088947 -12.2655 < 2.2e-16
## postHSBeyond high school  -0.165664   0.143180  -1.1570  0.247260
## sexFemale                  0.221857   0.151971   1.4599  0.144328
## age_death                  0.125183   0.011509  10.8771 < 2.2e-16
## cohortExpansion            0.308621   0.167009   1.8479  0.064612
## cohortReplacement          0.506467   0.181357   2.7926  0.005228

As before, we can calculate the relative efficiency of the raking fit to the RRZ procedure. Raking is more efficient for all variables with RE >1.

var.rakefit<-diag(summary(rakefit)$var) # GR fit
var.RRZfit<-diag(summary(RRZfit)$var)  # RRZ fit

### Percent relative efficiency with RRZ
round(var.RRZfit/var.rakefit*100,0)

##            (Intercept):1            (Intercept):2            (Intercept):3 
##                      101                      100                      100 
##            (Intercept):4            (Intercept):5            (Intercept):6 
##                      100                      101                      102 
## postHSBeyond high school                sexFemale                age_death 
##                      103                      102                      100 
##          cohortExpansion        cohortReplacement 
##                      101                      100

For the slope paramters in the regression there was mostly a slight improvement. Here the modest improvement is likely due to the weak correlation between the true influence functions and working auxiliary variables. We investigate this.

Survival outcome

With singly imputed model for the missing data

Really the only step different for applying raking for any particular outcome is how to generate the influence functions. Recall for glm objects, we could just use the IC function from the lava package, as done above. For the ordinal model in the previous example we had to calculate them by hand using the relevant peices from the fitted model. For the Cox model, the survival function in R makes this easy by providing jack-knifed influence functions available in the resid function applied to the coxph model fit. The residual type is “dfbetas”. We illustrate this by adapting an example in Section 8.5.2 in Thomas Lumley’s book on Complex Surveys: A guide to Analysis Using R [6]. This example using the classic National Wilms Tumor Study (NWTS) data set. Data are from Kulich and Lin [7]. The question of interest is to examine whether a tumor’s unfavorale histology affects relapse.

In these data, the exposure variable histol signifies whether the tumour histology was unfavorable, as measured by central pathology (the gold standard). Local histology is also available in the variable instit, which is an error prone proxy for the gold standard histology. We will artificially induce missingness in the variable histol so that is only observed on a subset. In this example, the histol is singly imputed using using a logistic regression prediction model. AIPW with singly imputated auxiliary variables has been noted to be nearly as effcient as with multiple imputation (Han et al 2020 [8]).

Load NWTS data and induce some missingness

# read in the data, which is inside package addhazard
# Documentation for the data is here: https://rdrr.io/cran/addhazard/man/nwtsco.html
# Unfavorable histology
data(nwtsco,package="addhazard")

This version of the NWTS dataset has no missing data, which differs from the dataset used by Lumley [6]. For this example, we artificially induce some missing data based on case status; we can imagine a procedure that verified findings from the local pathology that the rarer and concering unfavorable histology with high probabiity and a subset of the non-cases. For this example, we induce some missingness in the variable histol, the gold standard exposure.

#Institutional histology (instit), 0 = Favorable (FH), 1 = Unfavorable (UH)

pmiss<-ifelse(nwtsco$instit==1, .8, .08)
nwtsco$R<-ifelse(runif(nrow(nwtsco)) < pmiss,1,0)
nwtsco$histol.obs <-ifelse(nwtsco$R==1, nwtsco$histol, NA)

GR Step 1

We impute the missing data. The imputation model is the same as in Kulich and Lin [7] and Lumley [6].

### Now let's assume we only observe histol.obs and predict the missing data
impmodel<-glm(histol~instit +I(age>10) + I(stage==4)*study, data=nwtsco, subset=I(R==1), family=binomial)
nwtsco$imphist<-predict(impmodel,newdata=nwtsco, type="response")

The next step of raking proceeds: the working model is fit to the imputed data, the influence functions are extracted from this fit.

GR Step 2

Fit working model and estimate auxiliary variables for calibration

workingmodel<-coxph(Surv(trel,relaps)~imphist*age+I(stage>2)*tumdiam, data=nwtsco)
IF<-resid(workingmodel, "dfbeta") 

# matrix of influence functions nxk, where n=number of subjects and k=number of regression parameters
colnames(IF)<-paste("inf",1:ncol(IF),sep="")
IF<-data.frame(IF)
nwtsco$inf1<-IF$inf1
nwtsco$inf2<-IF$inf2
nwtsco$inf3<-IF$inf3
nwtsco$inf4<-IF$inf4
nwtsco$inf5<-IF$inf5
nwtsco$inf6<-IF$inf6

GR Step 3

The IPW weights are calibrated using these influence function auxiliary variables.

#Set up IPW weights using RRZ approach. The phase II design was stratified sampling on instit 
## not sure why couldnt do with twophase()
#designIPW<-survey::twophase(id= list(~1,~1), subset= ~I(R==1), strata=~list(NULL,~instit), data=as.data.frame(nwtsco))
designEstWt<-estWeights(nwtsco, formula=~instit,subset=~I(R==1)) 

## calibrate on influence functions and stratification variable instit
designCal<-survey::calibrate(designEstWt, phase=2, calfun="raking", ~inf1 + inf2+inf3+inf4+inf5+inf6+instit)

Obtain AIPW estimator

Finally the outcome model is fit with these calibrated weights.

RRZfit<-svycoxph(Surv(trel,relaps)~histol.obs*age+I(stage>2)*tumdiam, design=designEstWt)

AIPWfit<-svycoxph(Surv(trel,relaps)~histol.obs*age+I(stage>2)*tumdiam, design=designCal)

# Compare models
summary(RRZfit)$coef

## Two-phase sparse-matrix design:
##  estWeights(nwtsco, formula = ~instit, subset = ~I(R == 1))
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## svydesign(ids = ~1, fpc = `*phase1*`)

##                                 coef  exp(coef)   se(coef)  robust se         z
## histol.obs                2.33895656 10.3704100 0.41834624 0.46527016  5.027093
## age                       0.06793264  1.0702932 0.04827570 0.05886851  1.153972
## I(stage > 2)TRUE          3.53821951 34.4056059 0.75861335 0.88956318  3.977480
## tumdiam                   0.18990012  1.2091288 0.04049039 0.05284044  3.593841
## histol.obs:age           -0.17258445  0.8414872 0.08752148 0.08730338 -1.976836
## I(stage > 2)TRUE:tumdiam -0.22328576  0.7998862 0.05663788 0.06409693 -3.483564
##                              Pr(>|z|)
## histol.obs               4.979698e-07
## age                      2.485114e-01
## I(stage > 2)TRUE         6.964956e-05
## tumdiam                  3.258390e-04
## histol.obs:age           4.806019e-02
## I(stage > 2)TRUE:tumdiam 4.947841e-04

summary(AIPWfit)$coef

## Two-phase sparse-matrix design:
##  survey::calibrate(designEstWt, phase = 2, calfun = "raking", 
##     ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)

##                                 coef exp(coef)   se(coef)  robust se         z
## histol.obs                2.28904358 9.8654976 0.44786865 0.38752116  5.906887
## age                       0.12300428 1.1308893 0.05018026 0.03233580  3.803966
## I(stage > 2)TRUE          1.30594782 3.6911860 0.76044241 0.41594987  3.139676
## tumdiam                   0.05896458 1.0607377 0.04336911 0.02517739  2.341965
## histol.obs:age           -0.19661601 0.8215060 0.09315553 0.07818661 -2.514702
## I(stage > 2)TRUE:tumdiam -0.07547965 0.9272986 0.06068383 0.03050509 -2.474329
##                              Pr(>|z|)
## histol.obs               3.486335e-09
## age                      1.423977e-04
## I(stage > 2)TRUE         1.691348e-03
## tumdiam                  1.918249e-02
## histol.obs:age           1.191331e-02
## I(stage > 2)TRUE:tumdiam 1.334865e-02

As before, we can calculate the relative efficiency of the raking fit to the RRZ procedure. Raking is more efficient for all variables with RE >1.

var.rakefit<-summary(AIPWfit)$coef[,4]^2 # GR fit

## Two-phase sparse-matrix design:
##  survey::calibrate(designEstWt, phase = 2, calfun = "raking", 
##     ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)

var.RRZfit<-summary(RRZfit)$coef[,4]^2  # RRZ fit

## Two-phase sparse-matrix design:
##  estWeights(nwtsco, formula = ~instit, subset = ~I(R == 1))
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## svydesign(ids = ~1, fpc = `*phase1*`)

### Percent relative efficiency with RRZ
round(var.RRZfit/var.rakefit*100,0)

##               histol.obs                      age         I(stage > 2)TRUE 
##                      144                      331                      457 
##                  tumdiam           histol.obs:age I(stage > 2)TRUE:tumdiam 
##                      440                      125                      441

Note the dramatic improvement in efficiency, due to having good auxiliary variables.

Raking in the setting of measurement error

If you have a good proxy for the missing data, a perhaps simpler way to generate the auxiliary variables is to skip the data imputation step and simply use the error prone data in place of the missing data. You could then directly obtain the influence functoins from this working model using the IC function from lava for glm models and the resid function for survival models, as illustrated from the examples above. Each of the columns of the influence function matrix is a separate auxiliary variable and can be used to perform the calibration of the IPW weights. This approach was mentioned by Kulich and Lin 2004 [7]. We demonstrate this using again the Wilms Tumor dataset. The error prone histology variable is instit, which is observed on everyone whereas the gold standard pathology histol is only observed on a subset.

GR Step 1

We can skip imputation and just use the error prone variable instit.

GR Step 2

Generate influence functions from working model, to be used as auxiliary variable to calibrate the weights.

workingmodel2<-coxph(Surv(trel,relaps)~instit*age+I(stage>2)*tumdiam, data=nwtsco)
IF<-resid(workingmodel2, "dfbeta") 

# matrix of influence functions nxk, where n=number of subjects and k=number of regression parameters
colnames(IF)<-paste("inf",1:ncol(IF),sep="")
IF<-data.frame(IF)
nwtsco$inf1<-IF$inf1
nwtsco$inf2<-IF$inf2
nwtsco$inf3<-IF$inf3
nwtsco$inf4<-IF$inf4
nwtsco$inf5<-IF$inf5
nwtsco$inf6<-IF$inf6

GR Step 3

The IPW weights are calibrated using these influence function auxiliary variables.

#Set up IPW weights using RRZ approach. The phase II design was stratified sampling on instit 
## not sure why couldnt do with twophase()
#designIPW<-survey::twophase(id= list(~1,~1), subset= ~I(R==1), strata=~list(NULL,~instit), data=as.data.frame(nwtsco))
designEstWt<-estWeights(nwtsco, formula=~instit,subset=~I(R==1)) 

## calibrate on influence functions and stratification variable instit
designCal<-survey::calibrate(designEstWt, phase=2, calfun="raking", ~inf1 + inf2+inf3+inf4+inf5+inf6+instit)

Finally the outcome model is fit with these calibrated weights. We see this AIPW approach is very similar to the previous AIPW results that relied on imputation.

AIPWfit2<-svycoxph(Surv(trel,relaps)~histol.obs*age+I(stage>2)*tumdiam, design=designCal)

# Compare AIPW models
summary(AIPWfit2)$coef

## Two-phase sparse-matrix design:
##  survey::calibrate(designEstWt, phase = 2, calfun = "raking", 
##     ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)

##                                 coef  exp(coef)   se(coef)  robust se         z
## histol.obs                2.33540461 10.3336401 0.44854905 0.38116505  6.127017
## age                       0.12396396  1.1319751 0.05016502 0.03237237  3.829313
## I(stage > 2)TRUE          1.33449667  3.7980838 0.76148773 0.41782159  3.193939
## tumdiam                   0.06028343  1.0621375 0.04345830 0.02516227  2.395787
## histol.obs:age           -0.20405354  0.8154187 0.09410745 0.07621885 -2.677206
## I(stage > 2)TRUE:tumdiam -0.07714500  0.9257556 0.06069808 0.03059389 -2.521582
##                              Pr(>|z|)
## histol.obs               8.954212e-10
## age                      1.285014e-04
## I(stage > 2)TRUE         1.403458e-03
## tumdiam                  1.658474e-02
## histol.obs:age           7.423902e-03
## I(stage > 2)TRUE:tumdiam 1.168284e-02

summary(AIPWfit)$coef

## Two-phase sparse-matrix design:
##  survey::calibrate(designEstWt, phase = 2, calfun = "raking", 
##     ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)

##                                 coef exp(coef)   se(coef)  robust se         z
## histol.obs                2.28904358 9.8654976 0.44786865 0.38752116  5.906887
## age                       0.12300428 1.1308893 0.05018026 0.03233580  3.803966
## I(stage > 2)TRUE          1.30594782 3.6911860 0.76044241 0.41594987  3.139676
## tumdiam                   0.05896458 1.0607377 0.04336911 0.02517739  2.341965
## histol.obs:age           -0.19661601 0.8215060 0.09315553 0.07818661 -2.514702
## I(stage > 2)TRUE:tumdiam -0.07547965 0.9272986 0.06068383 0.03050509 -2.474329
##                              Pr(>|z|)
## histol.obs               3.486335e-09
## age                      1.423977e-04
## I(stage > 2)TRUE         1.691348e-03
## tumdiam                  1.918249e-02
## histol.obs:age           1.191331e-02
## I(stage > 2)TRUE:tumdiam 1.334865e-02

As before, we can calculate the relative efficiency of the raking fit to the RRZ procedure. Raking is more efficient for all variables with RE >1.

var.rakefit<-summary(AIPWfit2)$coef[,4]^2 # GR fit

## Two-phase sparse-matrix design:
##  survey::calibrate(designEstWt, phase = 2, calfun = "raking", 
##     ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)

var.RRZfit<-summary(RRZfit)$coef[,4]^2  # RRZ fit

## Two-phase sparse-matrix design:
##  estWeights(nwtsco, formula = ~instit, subset = ~I(R == 1))
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## svydesign(ids = ~1, fpc = `*phase1*`)

### Percent relative efficiency with RRZ
round(var.RRZfit/var.rakefit*100,0)

##               histol.obs                      age         I(stage > 2)TRUE 
##                      149                      331                      453 
##                  tumdiam           histol.obs:age I(stage > 2)TRUE:tumdiam 
##                      441                      131                      439

Again there was dramatic improvement over the RRZ approach, which relied on complete case data only and ignored the data missing the histol variable. We can compare the relative efficiency (RE) of two AIPW estimators, the one relying on the imputation prediction model (AIPWfit) versus just using the errorprone variable to impute the missing histol (AIPWfit2). Here RE > 1 indicates the approach that used the error prone variable instit in the working model (and not the predicted histol) is more efficent.

var.rakefit2<-summary(AIPWfit2)$coef[,4]^2 # GR fit with error prone auxiliary variables

## Two-phase sparse-matrix design:
##  survey::calibrate(designEstWt, phase = 2, calfun = "raking", 
##     ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)

var.rakefit<-summary(AIPWfit)$coef[,4]^2 # GR fit with imputed auxiliary variables

## Two-phase sparse-matrix design:
##  survey::calibrate(designEstWt, phase = 2, calfun = "raking", 
##     ~inf1 + inf2 + inf3 + inf4 + inf5 + inf6 + instit)
## Phase 1:
## Independent Sampling design (with replacement)
## svydesign(ids = ~1)
## Phase 2:
## Independent Sampling design
## calibrate(phase2, formula, population, calfun = calfun, ...)

### Relative efficiency for the two raking estimators. > 1 means denominator has smaller variance
round(var.rakefit/var.rakefit2*100,0)

##               histol.obs                      age         I(stage > 2)TRUE 
##                      103                      100                       99 
##                  tumdiam           histol.obs:age I(stage > 2)TRUE:tumdiam 
##                      100                      105                       99

The two efficiencies are very similar, perhaps a slight gain in efficiency for the approach that used the error prone variables directly in the imputation model.

A few pointers for applying IPW and AIPW estimators

A common question about IPW models is which variables should be included in the selection model. Here are two helpful references that review the basic principles of building reasonable IPW models [9,10]. One thing to consider, predictors in the selection model should be related to the outcome of interst and not just the missingness, otherwise you risk just inflating the variance [10]. For inverse-probability weighted estimators in a different setting, Schuster et al 2016 [11] warns that traditional principles of avoiding overfitting in small sampes should be considered.

For AIPW esitmators obtained using generalized raking, one should consider only raking on a subset of influence functions in small samples- particuarly with small numbers of events for a binary or survival outcome. Each raking variable introduces an additional constraint that must be satistied. If you choose all influence functions you’ve doubled the number of estimation constraints, possibly moving towards a more unstable estimation problem if there are small numbers of events. In this case, choosing a subset of the influence functions to approximate for raking variables that are thought to be well measured and/or related to the variables of interst could be a good strategy.

References

Stefanski LA, Boos DD. The calculus of M-estimation. The American Statistician. 2002 Feb 1;56(1):29-38.
Lumley T, Shaw PA, Dai JY. Connections between survey calibration estimators and semiparametric models for incomplete data. International Statistical Review. 2011 Aug;79(2):200-20.
Robins JM, Rotnitzky A, Zhao LP. (1994) Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846-866.
https://r-survey.r-forge.r-project.org/survey/html/estweights.html
Breslow NE, Lumley T, Ballantyne CM, Chambless LE, Kulich M (2009). Improved Horvitz Thompson Estimation of Model Parameters from Two-Phase Stratified Samples: Applications in Epidemiology.” Statistics in Biosciences, 1(1), 32{49. doi:10.1007/s12561-009-9001-6.
Lumley T. Complex surveys: a guide to analysis using R. John Wiley & Sons; 2011 Sep 20.
Kulich M, Lin DY. Improving the efficiency of relative-risk estimation in case-cohort studies. Journal of the American Statistical Association. 2004 Sep 1;99(467):832-44.
Han K, Shaw PA, Lumley T. Combining multiple imputation with raking of weights: An efficient and robust approach in the setting of nearly true models. Statistics in Medicine. 2021 Dec 30;40(30):6777-91.
Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research. 2013 Jun;22(3):278-95.
Little RJ, Carpenter JR, Lee KJ. A comparison of three popular methods for handling missing data: complete-case analysis, inverse probability weighting, and multiple imputation. Sociological Methods & Research. 2024 Aug;53(3):1105-35.
Schuster T, Lowe WK, Platt RW. Propensity score model overfitting led to inflated variance of estimated odds ratios. Journal of Clinical Epidemiology. 2016 Dec 1;80:97-106.

Citation

Placeholder for how to cite this document.

Shaw, PA and Walker R (2026, Jan 12). ACT Statistical Guidance: Weighting methods to handle selection bias due to missing data. Retrieved from

ACT Statistical Guidance: Weighting methods to handle selection bias due to missing data

Pamela Shaw, Rod Walker

2026-01-12

Adult Changes in Thought (ACT) study, Data Science and Informatics Core, Kaiser Permanente Washington Health Research Institute, Seattle WA; contact email: pamela.a.shaw@kp.org

Introduction

Read in and format example data

Table 1

Selection Model

Summarize selection model and variability of the weights

To evaluate how well the selection model predicted missingness, we plot an ROC curve.

IPW outcome regression

We consider a binary outcome

3 IPW methods with naive SE

IPW Method 4, with correct SE

Generalized raking

Now let’s improve efficiency of the IPW appraoch by implementing an AIPW estimator using generalized raking (GR) approach.

GR Steps for case where we have no good proxy for the missing (phase 2) data

GR Step 1

GR Step 2

GR Step 3

Generalized raking for other types of outcomes

Ordinal outcome

Create svy_vglm_modify()

Standard IPW with naive SE

RRZ IPW Method with correct SE

AIPW method: Generalized raking

GR Step 1

GR Step 2

GR Step 3

Survival outcome

With singly imputed model for the missing data

Load NWTS data and induce some missingness

GR Step 1

GR Step 2

GR Step 3

Obtain AIPW estimator

Raking in the setting of measurement error

GR Step 1

GR Step 2

GR Step 3

A few pointers for applying IPW and AIPW estimators

References

Citation