Introduction: methodology and implementation • CompCausal

The CompCausal package provides tools for estimating causal effects in comprehensive cohort studies in the presence of unmeasured confounding and missing outcomes. The primary target of inference is the comprehensive cohort causal effect (CCCE), defined as the average causal effect in the entire study cohort.

This vignette is organized into two sections:

Methodology: an overview of the comprehensive cohort study design, causal estimands, identification assumptions, and estimation procedures.
Implementation: practical guidance on using the package functions to estimate the CCCE under user-specified sensitivity parameters.

Methodology

Comprehensive Cohort Study design

The Comprehensive Cohort Study (CCS) design is a nested study design in which eligible individuals may choose to participate either in a randomized controlled trial (RCT) or in a parallel observational study (OBS). Participants in the RCT are randomly assigned to treatment, whereas participants in the OBS select their preferred treatment from the same treatment options available in the RCT. All participants are followed over time, and baseline covariates, treatment assignments, and outcomes are collected for the entire cohort.

Causal estimands

Let $X$ be baseline covariates, $R$ randomization consent indicator ( $1$ for RCT, $0$ for OBS), $T$ binary treatment indicator, $Y$ as outcome and $M$ the missing outcome indicator ( $M=1$ if $Y$ is observed, $M=0$ if $Y$ is missing). Let $O = (X, R, T, Y)$ be the full data without outcome missingness, and $\widetilde{O} = (X,R,T,M,Y:M=1)$ the observed data. Let $P$ and $\widetilde{P}$ be the distribution of $O$ and $\widetilde{O}$ .

We further introduce the following notations:

$\widetilde{F}_{t, r}(y|x)=P(Y\leq y\mid T=t, R=r, X=x, M=1)$
$\widetilde{\mu}_{t,r}(h(Y); x)=E[h(Y) \mid T=t, R=r, X=x, M=1]$
$\pi_{t,r}(x)=P(T = t \mid R=r, X=x)$
$g_r(x)=P(R = r \mid X=x)$
$\eta_m(x, r, t)=P(M =m \mid X=x, R=r, T=t)$

Our causal estimand of interest is the comprehensive cohort causal effect (CCCE), $E[Y(1)]-E[Y(0)]$ , which represents the average causal effect in the entire study population. Estimation of the CCCE is based on inference for the marginal potential outcome mean $E[Y(t)]$ , for $t\in\{0, 1\}$ .

Assumptions

Consistency: $Y(t) = Y$ if $T=t$ , for $t=0,1$ .
Positivity: $0 < P(T=1 \mid R=1) < 1$ and $0 < P(T=1 \mid R=0, X=x) < 1$ for all $x$ in the support of $X$ .
Randomization in RCT: $T \perp \{ Y(t), X \} \mid R=1$ , for $t = 0, 1$ .
Sensitivity Model for Unmeasured Confounding in the OBS: $dF(y(t) \mid T=1-t,R=0,X) = dF(y(t) \mid T=t,R=0,X) \times \frac{ \exp\{ \gamma_t s_t(y(t)) \} }{ \mu_{t, 0} (\exp\{ \gamma_t s_t(Y(t)) \}; X)}$
Missing-At-Random Assumption for Outcome: $M \perp Y \mid T, X, R$ .

Under the preceding assumptions, the marginal potential outcome mean $E[Y(t)]$ is identified by

\begin{aligned} \psi_t(\widetilde{P}; \gamma_t) =& \underbrace{E[ \widetilde{\mu}_{t,1}(Y;X) \big| R=1]}_{\psi_{t,1}(\widetilde{P})} P(R=1)\\ &+ \underbrace{E \bigg[ \widetilde{\mu}_{t,0}(Y;X) \pi_{t,0}(X) +\frac{ \widetilde{\mu}_{t,0}(Y \! \exp\{ \gamma_t s_t(Y)\} ;X) }{\widetilde{\mu}_{t,0}( \exp\{ \gamma_t s_t(Y)\} ;X) } \pi_{1-t,0}(X) \bigg| R=0\bigg]}_{\psi_{t,0}(\widetilde{P}; \gamma_t)} P(R=0) \end{aligned}

$\psi_{t, 1}(\widetilde{P})$ and $\psi_{t, 0}(\widetilde{P}; \gamma_t)$ are the identifications of $E[Y(t)|R=1]$ and $E[Y(t)|R=0]$ . Details of estimations for $E[Y(t)|R=1]$ and $E[Y(t)|R=0]$ can be found in the paper. We define the randomized trial causal effect (RTCE) as $\mbox{RTCE}=\psi_{1, 1}(\widetilde{P})-\psi_{0, 1}(\widetilde{P})$ ; and the patient preference causal effect (PPCE) as $\mbox{PPCE}(\gamma_1,\gamma_0) = \psi_{1, 0}(\widetilde{P}; \gamma_1)-\psi_{0, 0}(\widetilde{P}; \gamma_0)$ .

Methods

Estimation of $\psi_{t}(\widetilde{P}; \gamma_t)$ proceeds in four steps:

Derive the efficient influence function.
Estimate the nuisance functions that characterize $\widetilde{P}$ .
Construct a one-step estimator using sample splitting and truncation.
Estimate the asymptotic variance and construct $95\%$ confidence intervals.

(1). Derive the efficient influence function for $\psi_t(\widetilde{P}; \gamma_t)$ , denoted $\phi_t(\widetilde{P}; \gamma_t)(\widetilde{O})$

$\phi_t(\widetilde{P}; \gamma_t)(\widetilde{O}) = \upsilon_{t, 1}(\widetilde{P})(\widetilde{O})\times P(R=1)+\upsilon_{t, 0}(\widetilde{P}; \gamma_t)(\widetilde{O})\times P(R=0) - \psi_t(\widetilde{P}; \gamma_t) \ ,$

where the exact form of $\upsilon_{t, 1}(\widetilde{P})(\widetilde{O})$ and $\upsilon_{t, 0}(\widetilde{P}; \gamma_t)(\widetilde{O})$ can be found in the paper.

(2). Estimate $\widetilde{P}$

To construct a estimator for $E[Y(t)]$ , we need to estimate $\pi_{t, r}(X)$ , $\eta_1(X, r, t)$ and $\widetilde{F}_{t, r}(y|X)$ for $r=0, 1$ . The nuisance functions are estimated using methods that achieve sufficiently fast convergence rates to ensure asymptotic normality of the resulting estimator at the $\sqrt{n}$ rates. Specifically, we apply the generalized additive model (GAM) with a logistic link function (Hastie et al. 1990) for $\pi_{t, r}(X)$ and $\eta_1(X, r, t)$ for $r=0, 1$ . We apply the single index model (Chiang and Huang 2012) for $\widetilde{F}_{t, r}(y|X)$ for $r=0, 1$ . The conditional expectations $\widetilde{\mu}_{t, 0}\big(Y\exp(\gamma_t s_t( Y)); X\big)$ , $\widetilde{\mu}_{t, 0}\big(\exp(\gamma_t s_t(Y)); X\big)$ and $\widetilde{\mu}_{t, r}\big(Y; X\big)$ are subsequently computed via numerical integration.

(3). Compute the one-step, truncated, split sample estimator

We employ K-fold sample splitting together with tuning-free Huberization procedure (Wang et al. 2021) to truncate the estimated influence-function components: $\upsilon_{t, 1} \left(\widehat{\widetilde{P}}^{(-k)}\right) (\widetilde{O}_i)$ and $\upsilon_{t, 0}\left(\widehat{\widetilde{P}}^{(-k)} ; \gamma_t\right) (\widetilde{O}_i)$ .

$\begin{equation*} \widehat{\psi}^\dagger_t(\gamma_t) = \frac{1}{K} \sum_{k=1}^K \underbrace{\left\{\frac{1}{n_k} \sum_{i: S_i=k}\Bigg[ \underbrace{\upsilon_{t, 1}^\dagger \left(\widehat{\widetilde{P}}^{(-k)}\right) (\widetilde{O}_i)\times P_{n/n_k}^{(-k)}(R=1) +\upsilon_{t, 0}^\dagger \left(\widehat{\widetilde{P}}^{(-k)} ; \gamma_t\right) (\widetilde{O}_i)\times P_{n/n_k}^{(-k)}(R=0)}_{\upsilon_t^\dagger\left(\widehat{\widetilde{P}}^{(-k)} ; \gamma_t\right)(\widetilde{O}_i)}\Bigg]\right\}}_{\widehat{\psi}_t^{(k)\dagger}(\gamma_t)} \end{equation*}$ where $S_i$ is the split membership of the $i$ th observation (i.e, $S_i \in \{1, \ldots, K\}$ ).

(4). Compute variance and $95\%$ confidence interval

The asymptotic variance of $\widehat{\psi}_t^\dagger(\gamma_t)$ is estimated by $\sigma_t^{\dagger 2}=\frac{1}{nK}\sum_{k=1}^K\left\{\frac{1}{n_k-1}\sum_{i: S_i=k}\left\{\upsilon_t^\dagger\left(\widehat{\widetilde{P}}^{(-k)} ; \gamma_t\right)(\widetilde{O}_i)-\widehat{\psi}_t^{(k)\dagger}(\gamma_t)\right\}^2\right\}$ A two-sided $95\%$ Wald confidence interval for $\psi_t(\widetilde{P}; \gamma_t)$ is given by $\widehat{\psi}^\dagger_t(\gamma_t)\pm z_{0.975}\frac{\sigma_t^\dagger}{\sqrt{n}}$ .

Implementation

library(matrixStats)

We now illustrate the use of CompCausal for estimating the CCCE, RTCE, and PPCE using a simulated dataset included with the package.

Data

The package includes a simulated dataset based on the TOIB study (Underwood et al. 2008). The TOIB study is a comprehensive cohort study on the effect of topical versus oral non-steroidal anti-inflammatory drugs (NSAIDs) in managing knee pain among older adults with chronic knee pain. In the analysis, the outcome of interest is Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) pain score ( $0-100$ ) at 12 months. Baseline covariates include age, baseline WOMAC pain score, expected pain one year later, and chronic pain grade.

library(CompCausal)
## load data
data(ccohort)
## data structure
str(ccohort)

'data.frame':   563 obs. of  8 variables:
 $ Y           : num  26.7 72.1 55.3 NA 46 ...
 $ M           : num  1 1 1 0 1 1 1 0 1 1 ...
 $ R           : int  1 0 1 1 0 1 1 1 1 1 ...
 $ t           : num  1 1 1 0 0 0 0 1 0 0 ...
 $ age         : num  54 63 70 74 55 78 76 61 58 52 ...
 $ womac_bq    : num  22.2 100 25.8 30.4 27.4 31.6 45.2 73 39.4 39.4 ...
 $ expectationb: Factor w/ 3 levels "Much/A little worse",..: 3 2 1 2 3 1 1 3 1 1 ...
 $ ChronicPainb: Factor w/ 2 levels "1-2","3-4": 1 2 1 2 1 1 1 1 2 1 ...

The dataset has 563 observations and 8 variables, which include:

Y: WOMAC pain score at 12 months, continuous variable ranging from 0 to 100. NA indicate missing outcome.
M: outcome missingness indicator, binary variable. $1$ if $Y$ is observed, $0$ if $Y$ is missing.
R: randomization consent indicator, binary variable. $1$ for RCT, $0$ for OBS.
t: treatment indicator, binary variable. $1$ for topical NSAIDs treatment group, $0$ for oral NSAIDs treatment group.
age: age, continuous variable.
womac_bq: WOMAC pain score at baseline, continuous variable ranging from 0 to 100. No missing value.
expectationb: expected pain a year from now, categorical variable. Three levels: much/a little worse, about the same, a little/much better/free of pain.
ChronicPainb: chronic pain grade, categorical variable. Two levels: 1-2, 3-4.

Application

The primary function is est_psi(), which returns point estimates, estimated variance and $95\%$ Wald confidence interval for $\psi_t(\widetilde{P}; \gamma_t)$ , $\psi_{t, 0}(\widetilde{P}; \gamma_t)$ and $\psi_{t, 1}(\widetilde{P})$ under one or more pre-specified $\gamma_t$ .

Users need to input data and specify several parameters:

Data: Y, M, R, t, X (baseline covariates as a data frame)
Estimand of interest: trt=1 if estimating $E[Y(1)]$ , $E[Y(1)|R=0]$ and $E[Y(1)|R=1]$ ; trt=0 if estimating $E[Y(0)]$ , $E[Y(0)|R=0]$ and $E[Y(0)|R=1]$
Sensitivity parameters: gamma, a vector of $\gamma_t$ value.
Single index model settings (Redd et al. 2025): kernel, kernel smoothing, choices of Gaussian and Epanechnikov kernel; single_index_method, types of constraints in estimation, choices of setting first coefficient to 1, norm of coefficients to 1, and bandwidth to 1; method, optimization methods, default to optim; use_mave, whether to apply MAVE (Xia et al. 2002; Wang and Xia 2008) or cumulative sliced inverse regression method (Zhu et al. 2010) to estimate initial value of the coefficients, default to TRUE.
Truncation methods: simple_trunc=TRUE to apply quantile truncation of $\frac{1}{\pi_{t, r}(X)}$ and $\frac{1}{\eta_m(X, r, t)}$ , simple_trunc=FALSE to apply tuning-free Huberization procedure (Wang et al. 2021) to influence functions. If simple_trunc=TRUE, specify quantile truncation by setting quant from 0 to 1.
K-fold sample splitting: fold=K, an integer. seed, set.seed(seed).

Here is an example of inferences for $\psi_1(\widetilde{P}; \gamma_1)$ , $\psi_{1, 0}(\widetilde{P}; \gamma_1)$ and $\psi_{1, 1}(\widetilde{P})$ under $\gamma_1=0, 0.5$ , using 5-fold sample splitting, influence function truncation procedure and specific single index model settings.

## data adaptive truncation
out <- with(ccohort, {
  est_psi(Y, M, R, X = data.frame(age, womac_bq, expectationb, ChronicPainb), 
          t, trt = 1, gamma = c(0, 0.5), fold = 5, seed = 1, IF_output = FALSE,
          simple_trunc = FALSE, quant = NULL, kernel="dnorm", 
          single_index_method="norm1coef", method="optim")
})

Results are organized into a table:

## organize result into data frame
res_out <- data.frame(est=c(out$est_trunc, out$est_trunc_R1[1], out$est_trunc_R0), 
                      var=c(out$var_trunc, out$var_trunc_R1[1], out$var_trunc_R0), 
                      lowerCI=c(out$lowerCI_trunc, out$lowerCI_trunc_R1[1], out$lowerCI_trunc_R0), 
                      upperCI=c(out$upperCI_trunc, out$upperCI_trunc_R1[1], out$upperCI_trunc_R0), 
                      gamma=c(0, 0.5, NA, 0, 0.5), type=c("CCCE", "CCCE", "RTCE", "PPCE", "PPCE"))
res_out2 <- as.matrix(round(res_out[, 1:4], 2))
rownames(res_out2) <- c("$\\psi_1(\\widetilde{P}; 0)$", "$\\psi_1(\\widetilde{P}; 0.5)$", "$\\psi_{1,1}(\\widetilde{P})$",
                        "$\\psi_{1,0}(\\widetilde{P}; 0)$", "$\\psi_{1,0}(\\widetilde{P}; 0.5)$")
## present result
knitr::kable(res_out2, col.names=c("Estimates", "Var", "Lower CI", "Upper CI"))

	Estimates	Var	Lower CI	Upper CI
$\psi_1(\widetilde{P}; 0)$	40.12	2.16	37.24	43.00
$\psi_1(\widetilde{P}; 0.5)$	40.44	2.19	37.54	43.34
$\psi_{1,1}(\widetilde{P})$	38.50	4.72	34.24	42.76
$\psi_{1,0}(\widetilde{P}; 0)$	41.70	3.95	37.80	45.60
$\psi_{1,0}(\widetilde{P}; 0.5)$	42.33	4.04	38.39	46.26

To estimate treatment effects such as $\psi_1(\widetilde{P}; \gamma_1)-\psi_0(\widetilde{P}; \gamma_0)$ , the estimated influence functions must be retained so that the covariance between the two estimators can be properly accounted for when computing standard errors. Note that to make sure the influence functions align for each individuals under $t=1$ and $t=0$ , it is advised that seed is set to the same value under $t=1$ and $t=0$ .

## t=1
out_t1 <- with(ccohort, {
  est_psi(Y, M, R, X = data.frame(age, womac_bq, expectationb, ChronicPainb), 
          t, trt = 1, gamma = c(0, 0.5), fold = 5, seed = 1, IF_output = TRUE,
          simple_trunc = FALSE, quant = NULL, kernel="dnorm", 
          single_index_method="norm1coef", method="optim")
})
## t=0
out_t0 <- with(ccohort, {
  est_psi(Y, M, R, X = data.frame(age, womac_bq, expectationb, ChronicPainb), 
          t, trt = 0, gamma = c(0, 0.5), fold = 5, seed = 1, IF_output = TRUE,
          simple_trunc = FALSE, quant = NULL, kernel="dnorm", 
          single_index_method="norm1coef", method="optim")
})

Results are organized into a table:

## data frame for results
res_out_diff <- expand.grid(gamma1=c(0, 0.5), gamma0=c(0, 0.5), type=c("CCCE", "PPCE"))
res_out_diff <- rbind(res_out_diff, data.frame(gamma1=NA, gamma0=NA, type="RTCE"))
res_out_diff$est <- 0
res_out_diff$var <- 0

## compute variance
scale_n <- 1 / dim(ccohort)[1]

fold_idx <- split(seq_along(out_t1$fold_index_l), out_t1$fold_index_l)

t1_IF_mat <- t(do.call(rbind, out_t1$IF_trunc))
t1_IF_R0_mat <- t(do.call(rbind, out_t1$IF_trunc_R0))
t1_IF_R1_mat <- t(do.call(rbind, out_t1$IF_trunc_R1))

res_out_diff$est[which(res_out_diff$type=="RTCE")] <- out_t1$est_trunc_R1[1]-out_t0$est_trunc_R1[1]

IF_R1_diff <- t1_IF_R1_mat - out_t0$IF_trunc_R1[[1]]
var_R1_temp <- vapply(fold_idx, function(id) colVars(IF_R1_diff[id, ]), numeric(2))
res_out_diff$var[which(res_out_diff$type=="RTCE")] <- rowMeans(var_R1_temp)[1]*scale_n

for (g_0 in seq_along(c(0, 0.5))) {
  
  IF_diff    <- t1_IF_mat - out_t0$IF_trunc[[g_0]]
  IF_R0_diff <- t1_IF_R0_mat - out_t0$IF_trunc_R0[[g_0]]

  var_temp    <- vapply(fold_idx, function(id) colVars(IF_diff[id, ]),    numeric(2))
  var_R0_temp <- vapply(fold_idx, function(id) colVars(IF_R0_diff[id, ]), numeric(2))

  indx_CCCE <- which(res_out_diff$gamma0==c(0, 0.5)[g_0]&res_out_diff$type=="CCCE")
  res_out_diff$est[indx_CCCE] <- out_t1$est_trunc-out_t0$est_trunc
  res_out_diff$var[indx_CCCE] <- rowMeans(var_temp)*scale_n
  
  indx_PPCE <- which(res_out_diff$gamma0==c(0, 0.5)[g_0]&res_out_diff$type=="PPCE")
  res_out_diff$est[indx_PPCE] <- out_t1$est_trunc_R0-out_t0$est_trunc_R0
  res_out_diff$var[indx_PPCE] <- rowMeans(var_R0_temp)*scale_n

}
## 95% CI
res_out_diff$lowerCI <- res_out_diff$est-qnorm(0.975)*sqrt(res_out_diff$var)
res_out_diff$upperCI <- res_out_diff$est+qnorm(0.975)*sqrt(res_out_diff$var)

## present result
res_out_diff2 <- as.matrix(round(res_out_diff[, 4:7], 2))
rownames(res_out_diff2) <- c("$\\psi_1(\\widetilde{P}; 0)-\\psi_0(\\widetilde{P}; 0)$", "$\\psi_1(\\widetilde{P}; 0.5)-\\psi_0(\\widetilde{P}; 0)$", 
                             "$\\psi_1(\\widetilde{P}; 0)-\\psi_0(\\widetilde{P}; 0.5)$", "$\\psi_1(\\widetilde{P}; 0.5)-\\psi_0(\\widetilde{P}; 0.5)$",
                             "$\\psi_{1,0}(\\widetilde{P}; 0)-\\psi_{0,0}(\\widetilde{P}; 0)$", "$\\psi_{1,0}(\\widetilde{P}; 0.5)-\\psi_{0,0}(\\widetilde{P}; 0)$",
                             "$\\psi_{1,0}(\\widetilde{P}; 0)-\\psi_{0,0}(\\widetilde{P}; 0.5)$", "$\\psi_{1,0}(\\widetilde{P}; 0.5)-\\psi_{0,0}(\\widetilde{P}; 0.5)$", 
                             "$\\psi_{1,1}(\\widetilde{P})-\\psi_{0,1}(\\widetilde{P})$")
knitr::kable(res_out_diff2, col.names=c("Estimates", "Var", "Lower CI", "Upper CI"))

	Estimates	Var	Lower CI	Upper CI
$\psi_1(\widetilde{P}; 0)-\psi_0(\widetilde{P}; 0)$	-0.69	4.14	-4.68	3.30
$\psi_1(\widetilde{P}; 0.5)-\psi_0(\widetilde{P}; 0)$	-0.88	4.16	-4.88	3.12
$\psi_1(\widetilde{P}; 0)-\psi_0(\widetilde{P}; 0.5)$	-0.69	4.15	-4.68	3.30
$\psi_1(\widetilde{P}; 0.5)-\psi_0(\widetilde{P}; 0.5)$	-0.88	4.17	-4.88	3.12
$\psi_{1,0}(\widetilde{P}; 0)-\psi_{0,0}(\widetilde{P}; 0)$	-1.45	8.40	-7.13	4.23
$\psi_{1,0}(\widetilde{P}; 0.5)-\psi_{0,0}(\widetilde{P}; 0)$	-1.82	8.45	-7.52	3.88
$\psi_{1,0}(\widetilde{P}; 0)-\psi_{0,0}(\widetilde{P}; 0.5)$	-1.45	8.43	-7.14	4.24
$\psi_{1,0}(\widetilde{P}; 0.5)-\psi_{0,0}(\widetilde{P}; 0.5)$	-1.82	8.49	-7.53	3.89
$\psi_{1,1}(\widetilde{P})-\psi_{0,1}(\widetilde{P})$	0.22	8.13	-5.37	5.81

References

Chiang, Chin-Tsang, and Ming-Yueh Huang. 2012. “New Estimation and Inference Procedures for a Single-Index Conditional Distribution Model.” Journal of Multivariate Analysis 111: 271–85.

Hastie, Trevor et al. 1990. “Tibshirani: Generalized Additive Models.” London: Chapman and Hall 1931 (11): 683–741.

Redd, Andrew, Yujing Gao, Bonnie B. Smith, Ravi Varadhan, Andrea J. Apter, and Daniel O. Scharfstein. 2025. SensIAT: An r Package for Conducting Sensitivity Analysis of Randomized Trials with Irregular Assessment Times. https://arxiv.org/abs/2509.22389.

Underwood, Martin, Deborah Ashby, Dawn Carnes, et al. 2008. “Topical or Oral Ibuprofen for Chronic Knee Pain in Older People. The TOIB Study.” Health Technology Assessment.

Wang, Hansheng, and Yingcun Xia. 2008. “Sliced Regression for Dimension Reduction.” Journal of the American Statistical Association 103 (482): 811–21.

Wang, Lili, Chao Zheng, Wen Zhou, and Wen-Xin Zhou. 2021. “A New Principle for Tuning-Free Huber Regression.” Statistica Sinica 31 (4): 2153–77.

Xia, Yingcun, Howell Tong, W. K. Li, and Li-Xing Zhu. 2002. “An Adaptive Estimation of Dimension Reduction Space.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (3): 363–410. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9868.03411.

Zhu, Li-Ping, Li-Xing Zhu, and Zheng-Hui Feng. 2010. “Dimension Reduction in Regressions Through Cumulative Slicing Estimation.” Journal of the American Statistical Association 105 (492): 1455–66.