The CompCausal package provides tools for estimating causal effects in comprehensive cohort studies in the presence of unmeasured confounding and missing outcomes. The primary target of inference is the comprehensive cohort causal effect (CCCE), defined as the average causal effect in the entire study cohort.
This vignette is organized into two sections:
Methodology: an overview of the comprehensive cohort study design, causal estimands, identification assumptions, and estimation procedures.
Implementation: practical guidance on using the package functions to estimate the CCCE under user-specified sensitivity parameters.
Methodology
Comprehensive Cohort Study design
The Comprehensive Cohort Study (CCS) design is a nested study design in which eligible individuals may choose to participate either in a randomized controlled trial (RCT) or in a parallel observational study (OBS). Participants in the RCT are randomly assigned to treatment, whereas participants in the OBS select their preferred treatment from the same treatment options available in the RCT. All participants are followed over time, and baseline covariates, treatment assignments, and outcomes are collected for the entire cohort.
Causal estimands
Let be baseline covariates, randomization consent indicator ( for RCT, for OBS), binary treatment indicator, as outcome and the missing outcome indicator ( if is observed, if is missing). Let be the full data without outcome missingness, and the observed data. Let and be the distribution of and .
We further introduce the following notations:
Our causal estimand of interest is the comprehensive cohort causal effect (CCCE), , which represents the average causal effect in the entire study population. Estimation of the CCCE is based on inference for the marginal potential outcome mean , for .
Assumptions
- Consistency: if , for .
- Positivity: and for all in the support of .
- Randomization in RCT: , for .
- Sensitivity Model for Unmeasured Confounding in the OBS:
- Missing-At-Random Assumption for Outcome: .
Under the preceding assumptions, the marginal potential outcome mean is identified by
and are the identifications of and . Details of estimations for and can be found in the paper. We define the randomized trial causal effect (RTCE) as ; and the patient preference causal effect (PPCE) as .
Methods
Estimation of proceeds in four steps:
- Derive the efficient influence function.
- Estimate the nuisance functions that characterize .
- Construct a one-step estimator using sample splitting and truncation.
- Estimate the asymptotic variance and construct confidence intervals.
(1). Derive the efficient influence function for , denoted
where the exact form of and can be found in the paper.
(2). Estimate
To construct a estimator for , we need to estimate , and for . The nuisance functions are estimated using methods that achieve sufficiently fast convergence rates to ensure asymptotic normality of the resulting estimator at the rates. Specifically, we apply the generalized additive model (GAM) with a logistic link function (Hastie et al. 1990) for and for . We apply the single index model (Chiang and Huang 2012) for for . The conditional expectations , and are subsequently computed via numerical integration.
(3). Compute the one-step, truncated, split sample estimator
We employ K-fold sample splitting together with tuning-free Huberization procedure (Wang et al. 2021) to truncate the estimated influence-function components: and .
where is the split membership of the th observation (i.e, ).
(4). Compute variance and confidence interval
The asymptotic variance of is estimated by A two-sided Wald confidence interval for is given by .
Implementation
We now illustrate the use of CompCausal for estimating the CCCE, RTCE, and PPCE using a simulated dataset included with the package.
Data
The package includes a simulated dataset based on the TOIB study (Underwood et al. 2008). The TOIB study is a comprehensive cohort study on the effect of topical versus oral non-steroidal anti-inflammatory drugs (NSAIDs) in managing knee pain among older adults with chronic knee pain. In the analysis, the outcome of interest is Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) pain score () at 12 months. Baseline covariates include age, baseline WOMAC pain score, expected pain one year later, and chronic pain grade.
library(CompCausal)
## load data
data(ccohort)
## data structure
str(ccohort)'data.frame': 563 obs. of 8 variables:
$ Y : num 26.7 72.1 55.3 NA 46 ...
$ M : num 1 1 1 0 1 1 1 0 1 1 ...
$ R : int 1 0 1 1 0 1 1 1 1 1 ...
$ t : num 1 1 1 0 0 0 0 1 0 0 ...
$ age : num 54 63 70 74 55 78 76 61 58 52 ...
$ womac_bq : num 22.2 100 25.8 30.4 27.4 31.6 45.2 73 39.4 39.4 ...
$ expectationb: Factor w/ 3 levels "Much/A little worse",..: 3 2 1 2 3 1 1 3 1 1 ...
$ ChronicPainb: Factor w/ 2 levels "1-2","3-4": 1 2 1 2 1 1 1 1 2 1 ...
The dataset has 563 observations and 8 variables, which include:
- Y: WOMAC pain score at 12 months, continuous variable ranging from 0 to 100. NA indicate missing outcome.
- M: outcome missingness indicator, binary variable. if is observed, if is missing.
- R: randomization consent indicator, binary variable. for RCT, for OBS.
- t: treatment indicator, binary variable. for topical NSAIDs treatment group, for oral NSAIDs treatment group.
- age: age, continuous variable.
- womac_bq: WOMAC pain score at baseline, continuous variable ranging from 0 to 100. No missing value.
- expectationb: expected pain a year from now, categorical variable. Three levels: much/a little worse, about the same, a little/much better/free of pain.
- ChronicPainb: chronic pain grade, categorical variable. Two levels: 1-2, 3-4.
Application
The primary function is est_psi(), which returns point estimates, estimated variance and Wald confidence interval for , and under one or more pre-specified .
Users need to input data and specify several parameters:
- Data: Y, M, R, t, X (baseline covariates as a data frame)
- Estimand of interest: trt=1 if estimating , and ; trt=0 if estimating , and
- Sensitivity parameters: gamma, a vector of value.
- Single index model settings (Redd et al. 2025): kernel, kernel smoothing, choices of Gaussian and Epanechnikov kernel; single_index_method, types of constraints in estimation, choices of setting first coefficient to 1, norm of coefficients to 1, and bandwidth to 1; method, optimization methods, default to
optim; use_mave, whether to apply MAVE (Xia et al. 2002; Wang and Xia 2008) or cumulative sliced inverse regression method (Zhu et al. 2010) to estimate initial value of the coefficients, default to TRUE. - Truncation methods: simple_trunc=TRUE to apply quantile truncation of and , simple_trunc=FALSE to apply tuning-free Huberization procedure (Wang et al. 2021) to influence functions. If simple_trunc=TRUE, specify quantile truncation by setting quant from 0 to 1.
- K-fold sample splitting: fold=K, an integer. seed, set.seed(seed).
Here is an example of inferences for , and under , using 5-fold sample splitting, influence function truncation procedure and specific single index model settings.
## data adaptive truncation
out <- with(ccohort, {
est_psi(Y, M, R, X = data.frame(age, womac_bq, expectationb, ChronicPainb),
t, trt = 1, gamma = c(0, 0.5), fold = 5, seed = 1, IF_output = FALSE,
simple_trunc = FALSE, quant = NULL, kernel="dnorm",
single_index_method="norm1coef", method="optim")
})Results are organized into a table:
## organize result into data frame
res_out <- data.frame(est=c(out$est_trunc, out$est_trunc_R1[1], out$est_trunc_R0),
var=c(out$var_trunc, out$var_trunc_R1[1], out$var_trunc_R0),
lowerCI=c(out$lowerCI_trunc, out$lowerCI_trunc_R1[1], out$lowerCI_trunc_R0),
upperCI=c(out$upperCI_trunc, out$upperCI_trunc_R1[1], out$upperCI_trunc_R0),
gamma=c(0, 0.5, NA, 0, 0.5), type=c("CCCE", "CCCE", "RTCE", "PPCE", "PPCE"))
res_out2 <- as.matrix(round(res_out[, 1:4], 2))
rownames(res_out2) <- c("$\\psi_1(\\widetilde{P}; 0)$", "$\\psi_1(\\widetilde{P}; 0.5)$", "$\\psi_{1,1}(\\widetilde{P})$",
"$\\psi_{1,0}(\\widetilde{P}; 0)$", "$\\psi_{1,0}(\\widetilde{P}; 0.5)$")
## present result
knitr::kable(res_out2, col.names=c("Estimates", "Var", "Lower CI", "Upper CI"))| Estimates | Var | Lower CI | Upper CI | |
|---|---|---|---|---|
| 40.12 | 2.16 | 37.24 | 43.00 | |
| 40.44 | 2.19 | 37.54 | 43.34 | |
| 38.50 | 4.72 | 34.24 | 42.76 | |
| 41.70 | 3.95 | 37.80 | 45.60 | |
| 42.33 | 4.04 | 38.39 | 46.26 |
To estimate treatment effects such as , the estimated influence functions must be retained so that the covariance between the two estimators can be properly accounted for when computing standard errors. Note that to make sure the influence functions align for each individuals under and , it is advised that seed is set to the same value under and .
## t=1
out_t1 <- with(ccohort, {
est_psi(Y, M, R, X = data.frame(age, womac_bq, expectationb, ChronicPainb),
t, trt = 1, gamma = c(0, 0.5), fold = 5, seed = 1, IF_output = TRUE,
simple_trunc = FALSE, quant = NULL, kernel="dnorm",
single_index_method="norm1coef", method="optim")
})
## t=0
out_t0 <- with(ccohort, {
est_psi(Y, M, R, X = data.frame(age, womac_bq, expectationb, ChronicPainb),
t, trt = 0, gamma = c(0, 0.5), fold = 5, seed = 1, IF_output = TRUE,
simple_trunc = FALSE, quant = NULL, kernel="dnorm",
single_index_method="norm1coef", method="optim")
})Results are organized into a table:
## data frame for results
res_out_diff <- expand.grid(gamma1=c(0, 0.5), gamma0=c(0, 0.5), type=c("CCCE", "PPCE"))
res_out_diff <- rbind(res_out_diff, data.frame(gamma1=NA, gamma0=NA, type="RTCE"))
res_out_diff$est <- 0
res_out_diff$var <- 0
## compute variance
scale_n <- 1 / dim(ccohort)[1]
fold_idx <- split(seq_along(out_t1$fold_index_l), out_t1$fold_index_l)
t1_IF_mat <- t(do.call(rbind, out_t1$IF_trunc))
t1_IF_R0_mat <- t(do.call(rbind, out_t1$IF_trunc_R0))
t1_IF_R1_mat <- t(do.call(rbind, out_t1$IF_trunc_R1))
res_out_diff$est[which(res_out_diff$type=="RTCE")] <- out_t1$est_trunc_R1[1]-out_t0$est_trunc_R1[1]
IF_R1_diff <- t1_IF_R1_mat - out_t0$IF_trunc_R1[[1]]
var_R1_temp <- vapply(fold_idx, function(id) colVars(IF_R1_diff[id, ]), numeric(2))
res_out_diff$var[which(res_out_diff$type=="RTCE")] <- rowMeans(var_R1_temp)[1]*scale_n
for (g_0 in seq_along(c(0, 0.5))) {
IF_diff <- t1_IF_mat - out_t0$IF_trunc[[g_0]]
IF_R0_diff <- t1_IF_R0_mat - out_t0$IF_trunc_R0[[g_0]]
var_temp <- vapply(fold_idx, function(id) colVars(IF_diff[id, ]), numeric(2))
var_R0_temp <- vapply(fold_idx, function(id) colVars(IF_R0_diff[id, ]), numeric(2))
indx_CCCE <- which(res_out_diff$gamma0==c(0, 0.5)[g_0]&res_out_diff$type=="CCCE")
res_out_diff$est[indx_CCCE] <- out_t1$est_trunc-out_t0$est_trunc
res_out_diff$var[indx_CCCE] <- rowMeans(var_temp)*scale_n
indx_PPCE <- which(res_out_diff$gamma0==c(0, 0.5)[g_0]&res_out_diff$type=="PPCE")
res_out_diff$est[indx_PPCE] <- out_t1$est_trunc_R0-out_t0$est_trunc_R0
res_out_diff$var[indx_PPCE] <- rowMeans(var_R0_temp)*scale_n
}
## 95% CI
res_out_diff$lowerCI <- res_out_diff$est-qnorm(0.975)*sqrt(res_out_diff$var)
res_out_diff$upperCI <- res_out_diff$est+qnorm(0.975)*sqrt(res_out_diff$var)
## present result
res_out_diff2 <- as.matrix(round(res_out_diff[, 4:7], 2))
rownames(res_out_diff2) <- c("$\\psi_1(\\widetilde{P}; 0)-\\psi_0(\\widetilde{P}; 0)$", "$\\psi_1(\\widetilde{P}; 0.5)-\\psi_0(\\widetilde{P}; 0)$",
"$\\psi_1(\\widetilde{P}; 0)-\\psi_0(\\widetilde{P}; 0.5)$", "$\\psi_1(\\widetilde{P}; 0.5)-\\psi_0(\\widetilde{P}; 0.5)$",
"$\\psi_{1,0}(\\widetilde{P}; 0)-\\psi_{0,0}(\\widetilde{P}; 0)$", "$\\psi_{1,0}(\\widetilde{P}; 0.5)-\\psi_{0,0}(\\widetilde{P}; 0)$",
"$\\psi_{1,0}(\\widetilde{P}; 0)-\\psi_{0,0}(\\widetilde{P}; 0.5)$", "$\\psi_{1,0}(\\widetilde{P}; 0.5)-\\psi_{0,0}(\\widetilde{P}; 0.5)$",
"$\\psi_{1,1}(\\widetilde{P})-\\psi_{0,1}(\\widetilde{P})$")
knitr::kable(res_out_diff2, col.names=c("Estimates", "Var", "Lower CI", "Upper CI"))| Estimates | Var | Lower CI | Upper CI | |
|---|---|---|---|---|
| -0.69 | 4.14 | -4.68 | 3.30 | |
| -0.88 | 4.16 | -4.88 | 3.12 | |
| -0.69 | 4.15 | -4.68 | 3.30 | |
| -0.88 | 4.17 | -4.88 | 3.12 | |
| -1.45 | 8.40 | -7.13 | 4.23 | |
| -1.82 | 8.45 | -7.52 | 3.88 | |
| -1.45 | 8.43 | -7.14 | 4.24 | |
| -1.82 | 8.49 | -7.53 | 3.89 | |
| 0.22 | 8.13 | -5.37 | 5.81 |