
Doubly-Robust Inference in R using drtmle
Inverse probability of treatment weighted estimators and doubly robust estimators (including augmented inverse probability of treatment weighted and targeted minimum loss estimators) are widely used in causal inference to estimate and draw inference about the average effect of a treatment. As an intermediate step, these estimators require estimation of key nuisance parameters, which are often regression functions. Typically, regressions are estimated using maximum likelihood and parametric models. Confidence intervals and p-values may be computed based on standard asymptotic results, such as the central limit theorem, the delta method, and the nonparametric bootstrap. However, in high-dimensional settings, maximum likelihood estimation often breaks down and standard procedures no longer yield correct inference. Instead, we may rely on adaptive estimators of nuisance parameters to construct flexible regression estimators. However, the use of adaptive estimators poses a challenge for performing statistical inference about an estimated treatment effect. While doubly robust estimators facilitate inference when all relevant regression functions are consistently estimated, the same cannot be said when at least one nuisance estimator is inconsistent. drtmle implements doubly robust confidence intervals and hypothesis tests for targeted minimum loss estimates of the average treatment effect, in addition to several other recently proposed estimators of the average treatment effect.
Keywords: causal inference, machine learning, semiparametric estimation, treatment effects, multiple robustness, targeted minimum loss estimation
1. Introduction
In many fields, researchers are interested in assessing the population-level effect of a particular treatment on an outcome of interest. The treatment might correspond to a drug, a potentially harmful exposure, or a policy intervention. Often, the treatment cannot be randomized due to ethical or logistical reasons. This has sparked great interest in developing statistical methodology to estimate population-level effects from observational data. Estimation of these effects often requires accounting for putative confounding factors – that is, factors related both to whether a participant receives treatment and to the participant’s outcome. In many settings, researchers measure many potential confounders, for example using administrative databases or genetic sequencing technology. Due to the curse of dimensionality, common statistical estimation techniques are not feasible for such high-dimensional data without restrictive modeling assumptions. When these assumptions are violated, estimates of the population-level effect of a treatment can suffer from a large degree of bias.
This has led to increased interest in adopting adaptive estimation techniques from the machine learning literature to control for high-dimensional confounders when estimating treatment effects. However, facilitating proper statistical inference (i.e., confidence intervals and hypothesis tests) about effect estimates based on adaptive estimators is challenging. In particular, the use of standard causal inference techniques, including both G-computation estimators (Robins 1986) and inverse probability of treatment weighted (IPTW) estimators (Horvitz and Thompson 1952; Robins, Hernan, and Brumback 2000), is challenging when utilizing adaptive estimators. With rare exceptions, if cross-validation is used to select tuning parameters for the nuisance parameter estimators, estimates of the population-level effect will be too biased to achieve standard n^{1/2}-rate asymptotics. Instead, undersmoothing must be used to select tuning parameters to control the bias (M. J. van der Laan, Benkeser, and Cai 2019; Ertefaie, Hejazi, and van der Laan 2022; Hejazi et al. 2022). However, this can prove quite challenging in practice, and few guidelines are available.
A more natural framework for integrating adaptive methods into nuisance regression estimation is provided by doubly robust approaches, including augmented inverse probability of treatment weighted (AIPTW) estimation (Robins, Rotnitzky, and Zhao 1994) and targeted minimum loss (TML) estimation (M. J. van der Laan and Rubin 2006). The asymptotics of these estimators are characterized in part by the convergence rates of nuisance parameter estimates with respect to a proper norm. This fact makes the use of cross-validation more appealing for selection of tuning parameters for estimators of nuisance parameters, thereby providing a more straightforward means of obtaining an estimate of the effect of interest that enjoys standard n^{1/2} asymptotics. Based on the efficient influence function, a key quantity in semiparametric theory, these asymptotic results also yield closed-form standard error estimates, which can serve as the basis for constructing confidence intervals and hypothesis tests.
As an intermediate step, the AIPTW and TML estimators require estimation of two key nuisance parameters: the probability of treatment as a function of confounders (the propensity score, hereafter referred to as the PS) and the average outcome as a function of confounders and treatment (the outcome regression, OR). In addition to their desirable inferential properties, AIPTW and TML estimators also enjoy a double robustness property. That is, the estimators are consistent for the population-level effect of interest if either the PS or the OR is consistently estimated. This property gives the AIPTW and TML estimators a natural appeal: inconsistency in the estimation of one nuisance parameter may be mitigated by the consistent estimation of the other. However, recent work has shown that the double robustness property does not extend to the inferential properties of these estimators when adaptive estimators of the PS and/or OR are used (M. van der Laan 2014; D. Benkeser et al. 2017). Instead, valid inference requires consistent estimation of both the PS and the OR. M. van der Laan (2014) proposed new TML estimators that asymptotically attain a normal sampling distribution with estimable variance under consistent estimation of either the PS or the OR, paving the way for doubly robust inference. D. Benkeser et al. (2017) proposed modifications of these estimators with similar robustness properties and illustrated their practical performance.
The two proposed procedures that yield doubly robust inference are quite involved, notably requiring iterative estimation of additional, complex nuisance parameters. The drtmle package was motivated by the need for an open-source tool implementing these methods. The package implements the TML estimators of population-level effects proposed in M. van der Laan (2014) and D. Benkeser et al. (2017), as well as several extensions, including cross-validated targeted minimum loss estimation (CV-TMLE). Additionally, drtmle implements an IPTW estimator that overcomes shortcomings of existing IPTW implementations with respect to statistical inference. In particular, this estimator allows an adaptive estimator of the PS to be used in construction of the IPTW estimator, while still yielding valid statistical inference about population-level effects. Both the TML and IPTW estimators implemented in the drtmle package are capable of estimating population-level effects of discrete-valued treatments; moreover, the package provides convenient utilities for constructing confidence intervals and hypothesis tests about contrasts of the mean outcome under different levels of treatment. This manuscript explains some of the key concepts underlying doubly robust inference and illustrates usage of the drtmle package for the R language and environment for statistical computing (R Core Team 2022).
In tandem with growing interest in and usage of doubly robust estimation approaches, open-source software tools implementing these methods have proliferated. Notable extant software packages include the tmle R package (Gruber and van der Laan 2012), which implements TML estimators of the effects of static, single time point interventions; the ltmle R package (Lendle et al. 2017), which implements TML estimators of the effects of multiple time point interventions, subject to time-dependent confounding and censoring, as well as estimators of the parameters of a longitudinal working marginal structural model; the tmle3 R package (Coyle 2021), which serves as a generalizable engine for implementing TML estimators within the tlverse ecosystem (M. J. van der Laan et al. 2023), designed to provide a common and extensible framework for TML estimation of the effects of static, dynamic, and stochastic single time point interventions; and the doubleml R and Python packages (Bach et al. 2021, 2022), which implement estimators of causal effects within the double machine learning framework (Chernozhukov et al. 2017, 2018), relying upon Neyman orthogonality and sample-splitting for robustness. From among this non-exhaustive list, drtmle stands out as, to our knowledge, the only implementation of statistical procedures designed to attain doubly robust inference (D. Benkeser et al. 2017), uniquely positioning drtmle as a robust tool for the estimation of treatment effects in observational data settings.
2. Definition and estimators of the average treatment effect
2.1 Parameters of interest
Suppose the observed data consist of n independent and identically distributed copies of the random variable O = (W, A, Y ) ∼ P_{0}. Here, W is a vector of baseline covariates, A ∈ {0, 1} is a binary treatment, Y is an outcome, and P_{0} is the true data-generating distribution. The parameters of interest are, for a ∈ {0, 1},
ψ_{0}(a) := ∫ Q̄_{0}(a, w) dQ_{W,0}(w),    (1)
where Q̄_{0}(a, w) = E_{0}(Y | A = a, W = w) is the so-called outcome regression and Q_{W,0}(w) = Pr_{0}(W ≤ w) is the distribution function of the covariates. The value ψ_{0}(a) represents the marginal (i.e., population-level) treatment-specific average outcome, hereafter referred to as the marginal mean under treatment A = a. The marginal effect of the treatment on the outcome can be assessed by comparing marginal means under different treatment levels. For example, the average treatment effect is often defined as ψ_{0} := ψ_{0}(1) − ψ_{0}(0).
Under additional assumptions, these parameters identify the mean counterfactual outcomes under the different treatments. Specifically, we denote by Y(a) a well-defined counterfactual outcome that would be observed under treatment a. If the conditional randomization assumption (Y(a) ⊥ A | W ) and the positivity condition (Pr_{0}{Pr_{0}(A = a | W ) > 0} = 1) hold, then ψ_{0}(a) identifies the mean of Y(a).
2.2 Estimators
The G-computation (GCOMP) estimator of ψ_{0}(a) may be motivated directly by (1). Suppose we have available an estimate of the OR, Q̄_{n}, and an estimate of the distribution function of baseline covariates, Q_{W,n}. An estimator of ψ_{0}(a) may be obtained by plugging these into (1), ψ_{n}(a) = ∫ Q̄_{n}(a, w) dQ_{W,n}(w). Typically, the empirical distribution function of the covariates is used, so that this estimator can be equivalently expressed
ψ_{n}(a) = n^{−1} ∑_{i=1}^{n} Q̄_{n}(a, W_{i}).    (2)
Inference for GCOMP estimators based on parametric OR estimators may be facilitated through the nonparametric bootstrap. However, if adaptive approaches are used to estimate the OR, inference may not be available without further modification to the estimator (e.g., undersmoothing as discussed above).
An alternative estimator of the treatment-specific population-level mean may be obtained by noting that (1) can be equivalently written as
ψ_{0}(a) = E_{0}{I(A = a) Y / g_{0}(a | W )},    (3)
where we use g_{0}(a | W ) := Pr_{0}(A = a | W ) to denote the propensity score. Suppose we have available g_{n}, an estimate of the PS. Then we may define
ψ_{n}(a) = n^{−1} ∑_{i=1}^{n} I(A_{i} = a) Y_{i} / g_{n}(a | W_{i}),    (4)
which is referred to as the inverse probability of treatment weighted (IPTW) estimator. Like GCOMP estimators, inference for IPTW estimators based on parametric PS estimators may be facilitated through the nonparametric bootstrap. If adaptive approaches are used to estimate the PS, inference may not be available without further modification of the propensity score estimator, whether via targeting, as we discuss below, or undersmoothing (as in Ertefaie, Hejazi, and van der Laan (2022) and Hejazi et al. (2022)).
The GCOMP and IPTW estimators suffer from two important shortcomings. The first is a lack of robustness – that is, validity of GCOMP estimators relies on consistent estimation of the OR, while validity of IPTW estimators relies on consistent estimation of the PS. In practice, we may not know which of the OR or PS is simpler to consistently estimate and thus may not know which of the two estimators to prefer. Second, if adaptive estimators (e.g., based on machine learning) are used to estimate the OR or PS, then the GCOMP and IPTW estimators generally lack a valid basis for asymptotic statistical inference.
The augmented IPTW (AIPTW) estimator
ψ_{n}(a) = n^{−1} ∑_{i=1}^{n} I(A_{i} = a) Y_{i}/g_{n}(a | W_{i}) + n^{−1} ∑_{i=1}^{n} {1 − I(A_{i} = a)/g_{n}(a | W_{i})} Q̄_{n}(a, W_{i})
adds an augmentation term to the IPTW estimator. Equivalently, the estimator can be viewed as adding an augmentation to the GCOMP estimator
ψ_{n}(a) = n^{−1} ∑_{i=1}^{n} Q̄_{n}(a, W_{i}) + n^{−1} ∑_{i=1}^{n} [I(A_{i} = a)/g_{n}(a | W_{i})] {Y_{i} − Q̄_{n}(a, W_{i})}.
The AIPTW estimator overcomes the two limitations of the GCOMP and IPTW estimators. First, it exhibits double robustness in that it is consistent for ψ_{0}(a) if either Q̄_{n} is consistent for the true OR or g_{n} is consistent for the true PS. Furthermore, if both the OR and PS are consistently estimated and regularity conditions hold, one can establish a normal limiting distribution for the AIPTW estimator. Closed-form estimators of the variance of this limiting distribution can be derived and may be used to construct Wald-style confidence intervals and hypothesis tests.
A more recent proposal to improve the robustness and inferential properties of the GCOMP and IPTW estimators utilizes targeted minimum loss estimation (TMLE). TMLE is a general methodology that may be used to modify an estimator of a regression function in a manner that ensures the modified estimator satisfies user-selected equations. In the present problem, one can show that if an estimator of the OR, Q̄^{⋆}_{n}, satisfies
n^{−1} ∑_{i=1}^{n} [I(A_{i} = a)/g_{n}(a | W_{i})] {Y_{i} − Q̄^{⋆}_{n}(a, W_{i})} = 0,    (5)
then the GCOMP estimator based on Q̄^{⋆}_{n} will have the same asymptotic properties as the AIPTW estimator. That is, the estimator will be doubly robust and, provided both the OR and PS are consistently estimated and regularity conditions are satisfied, will have a normal limiting distribution. TMLE provides a means of mapping an initial estimator of the OR into an estimator that satisfies (5). In addition to possessing these desirable asymptotic properties, TML estimators have been shown to outperform AIPTW estimators in finite samples (Porter et al. 2011). We refer readers to M. J. van der Laan and Rose (2011) and M. J. van der Laan et al. (2023) for further details on the TMLE methodology.
3. Doubly robust inference
M. van der Laan (2014) demonstrated that the double robustness property of AIPTW and TML estimators does not extend to their normal limiting distribution if adaptive estimators of the OR and PS are used to construct the estimators. However, in settings with high-dimensional and continuous-valued covariates, adaptive estimation is likely necessary in order to obtain a consistent estimate of either the OR or PS, and thus may be necessary to obtain minimally biased estimates of the marginal means of interest. M. van der Laan (2014) derived modified TML estimators that are doubly robust not only with respect to consistency, but also with respect to asymptotic normality. Furthermore, this work provided closed-form, doubly robust variance estimators that may be used to construct doubly robust confidence intervals and hypothesis tests. D. Benkeser et al. (2017) proposed simplifications of M. van der Laan (2014)’s estimators and demonstrated the practical performance of these estimators via simulation experiments.
The key to the theory underlying these results is finding estimators of the OR and PS that simultaneously satisfy
Q̄_{r,0}(a, g_{n})(w) := E_{0}{Y − Q̄_{n}(a, W ) | g_{n}(a | W ) = g_{n}(a | w), A = a} (the reduced-dimension outcome regression, ROR), and
g_{r,0}(a | Q̄_{n}, g_{n})(w) := Pr_{0}{A = a | Q̄_{n}(a, W ) = Q̄_{n}(a, w), g_{n}(a | W ) = g_{n}(a | w)} (the reduced-dimension propensity score, RPS).
In words, the ROR is the regression of the residual from the outcome regression onto the estimated PS amongst observations with A = a, while the RPS is the propensity for treatment A = a given the estimated OR and PS. We note that a key feature of these reduced-dimension regressions is that they are low-dimensional regardless of the dimension of W and depend on neither the true OR nor the true PS. Thus, these regressions may be consistently estimated at reasonably fast rates irrespective of the dimension of W and irrespective of whether the OR or PS are consistently estimated. Based on these reduced-dimension regressions, M. van der Laan (2014) showed that if estimates of the OR Q̄^{⋆}_{n}, PS g^{⋆}_{n}, ROR Q̄^{⋆}_{r,n}, and RPS g^{⋆}_{r,n} satisfied
then, under regularity conditions described in Theorem 3 of M. van der Laan (2014), the GCOMP estimator based on Q̄^{⋆}_{n} would be doubly robust not only with respect to consistency, but also with respect to its limiting distribution. The author additionally proposed a TMLE algorithm to map initial estimates of the OR, PS, ROR, and RPS into estimates that satisfy these key equations. The regularity conditions required to achieve asymptotic linearity include assumptions on the convergence rates of the OR, PS, ROR, and RPS estimators to their true counterparts.
D. Benkeser et al. (2017) demonstrated alternative equations that can be satisfied to achieve this form of double robustness. In particular, these authors formulated an alternative version of the RPS,
g_{r,0,1}(a | Q̄_{n})(w) := Pr_{0}{A = a | Q̄_{n}(a, W ) = Q̄_{n}(a, w)},
which we refer to as RPS1. They also introduced an additional regression of a weighted residual from the PS on the OR,
g_{r,0,2}(a | Q̄_{n}, g_{n})(w) := E_{0}[{I(A = a) − g_{n}(a | W )}/g_{n}(a | W ) | Q̄_{n}(a, W ) = Q̄_{n}(a, w)],
which we refer to as RPS2. The key feature of RPS1 and RPS2 is that they are both univariate regressions, while RPS is a bivariate regression. D. Benkeser et al. (2017) suggested that this may make RPS1 and RPS2 easier to estimate consistently than RPS. Further, they showed that if estimators of the OR, PS, and ROR satisfy
then the GCOMP estimator based on this Q̄^{⋆}_{n} also enjoys a doubly robust limiting distribution. D. Benkeser et al. (2017) provided a TMLE algorithm that produced such estimates and provided closedform, doubly robust variance estimates.
Remark: In the previous section, we introduced AIPTW and TMLE as two methods for constructing doubly robust (with respect to consistency) estimators of our parameter of interest. Theoretically, these two frameworks are closely related and, practically, the two estimators are generally seen as competitors for estimating marginal means. In the same spirit, the ideas underlying doubly robust inference can be used to construct a modified AIPTW estimator; this modified estimator is also computed and returned by drtmle().
4. Implementation of doubly robust inference
The main workhorse of the drtmle package is the eponymous drtmle() function. This function estimates the treatmentspecific marginal mean for userspecified levels of a discretevalued treatment and computes a doubly robust covariance matrix for these estimates. The function implements either the estimators of M. van der Laan (2014) (if reduction = “bivariate”) or the improved estimators of D. Benkeser et al. (2017) (if reduction = “univariate”). The package includes utility functions for combining these estimates into estimates of average treatment effects, as well as other contrasts, as discussed below.
The main data arguments for drtmle() are W, A, Y, which, as above, represent the covariates, a discretevalued treatment, and an outcome, respectively. Missing data are allowed in A and Y, as discussed below. Beyond the data arguments, it is necessary to specify how to estimate each of the OR, PS, ROR, and RPS (or RPS1 and RPS2). The package provides flexibility in how these regression parameters may be estimated. Once the user has specified the data arguments and how to estimate each regression parameter, the function proceeds as follows:
1. estimate the OR via the internal estimateQ() function;
2. estimate the PS via the internal estimateG() function;
3. estimate the reduced-dimension regressions via the internal estimateQrn() and estimategrn() functions;
4. iteratively modify initial estimates of the OR and PS via the fluctuateQ() and fluctuateG() functions until equations (8)–(10) are approximately solved;
5. compute the TML estimates and covariance estimates based on the final OR.
We refer interested readers to D. Benkeser et al. (2017) for a theoretical derivation of how the iterative modification of initial OR and PS is performed.
In addition to returning doubly robust parameter and covariance estimates, the drtmle() function returns estimates of the GCOMP, AIPTW, standard TMLE, and the modified AIPTW estimator discussed in the remark above. While doubly robust inference based on these estimators is not available, the estimators are provided for comparison. In the following subsections, we demonstrate the use of these methods by examining various options for making calls to drtmle().
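For instance, given a fitted object fit returned by drtmle() with a_0 = c(0, 1), doubly robust confidence intervals and Wald tests for the marginal means and their contrasts can be obtained with the package's ci() and wald_test() utilities (a brief sketch; the object name fit is hypothetical):

```r
# doubly robust confidence intervals for each marginal mean
ci(fit)
# confidence interval for the average treatment effect, psi(1) - psi(0)
ci(fit, contrast = c(-1, 1))
# Wald test of the null hypothesis psi(1) - psi(0) = 0
wald_test(fit, contrast = c(-1, 1))
```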
4.1 Estimating regressions
While the estimators computed by drtmle() are consistent and asymptotically normal under inconsistent estimation of either the OR or PS, it is nevertheless advisable to strive for consistent estimation of both the OR and PS. If both of the regression parameters are consistently estimated, then the estimators computed by drtmle() achieve the nonparametric efficiency bound (that is, the asymptotic variance of the efficient influence function), resulting in narrower confidence intervals and more powerful hypothesis tests. In order for the estimates of the OR and PS to have the greatest chance of consistency, the drtmle() function facilitates estimation using adaptive estimation techniques. In fact, there are three ways to estimate each of the regression parameters: generalized linear models, super learning (M. J. van der Laan, Polley, and Hubbard 2007; Polley et al. 2017), and a user-specified estimation technique. We demonstrate each of these approaches in turn on a simulated data set in which the treatment is not randomized. The true parameter values of interest are ψ_{0}(0) = 0.62 and ψ_{0}(1) = 0.72.
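The original simulation code is not reproduced here; the sketch below generates an observational data set of the same general form (the data-generating process is illustrative only and is not guaranteed to yield the stated true values):

```r
set.seed(1234)
n <- 300
# two baseline covariates
W <- data.frame(W1 = rnorm(n), W2 = rbinom(n, 1, 0.5))
# treatment probability depends on covariates (treatment is not randomized)
A <- rbinom(n, 1, plogis(-0.5 + 0.75 * W$W1 + 0.5 * W$W2))
# binary outcome depends on treatment and covariates
Y <- rbinom(n, 1, plogis(0.5 * A + W$W1 - 0.5 * W$W2))
```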
4.1.1 Using generalized linear models to estimate regressions
Generalized linear models are perhaps the most popular means of estimating regression functions. On account of this legacy, these estimators are included as an option for estimating the regression parameters. The user specifies the right-hand side of a regression formula as a character string in the glm_Q, glm_g, glm_Qr, and glm_gr options to estimate the OR, PS, ROR, and RPS, respectively. These regressions are fit by calling the glm() function from R’s base stats package (R Core Team 2022). Thus, each specification must parse as a valid formula object, as required by glm(). For the PS formula specified by glm_g, the variables in colnames(W) are available for use. For the OR (glm_Q), the variable names available for use in the formula depend on the stratify option. If stratify = FALSE, a single OR is fit to estimate Q̄_{0}(A, W ). In this case, the OR is fit by passing a data.frame containing W and A to the data argument of glm(). Thus, both colnames(W) and “A” may appear in the formula specified in glm_Q. If stratify = TRUE, regressions are fit stratifying on A to separately estimate Q̄_{0}(1, W ) and Q̄_{0}(0, W ). In this case, the glm_Q formula may not include “A”.
To understand the variable names available for the ROR and RPS formulas, recall from Section 3 that these are regressions onto the estimated PS and/or OR. Correspondingly, the variables “gn” and “Qn” may appear in the glm_Qr and glm_gr formulas, as appropriate to the chosen reduction.
The code below illustrates how generalized linear models can be used to estimate the regression parameters. In this call to drtmle(), we specify family = binomial() so that logistic regression is used to estimate the OR. We illustrate non-stratified regression for both the univariate (default) and bivariate reductions below.
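As a sketch (assuming data objects W, A, and Y, with W containing columns W1 and W2), such calls might look like the following:

```r
library(drtmle)
# univariate reduction (default; estimators of Benkeser et al. 2017)
fit_uni <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  family = binomial(),      # logistic regression for the OR
  glm_Q = "W1 + W2 + A",    # OR formula (stratify = FALSE, so "A" may appear)
  glm_g = "W1 + W2",        # PS formula
  glm_Qr = "gn",            # ROR formula, a function of the estimated PS
  glm_gr = "Qn",            # RPS formula, a function of the estimated OR
  stratify = FALSE,
  reduction = "univariate"
)
# bivariate reduction (estimators of van der Laan 2014); the RPS formula
# may depend on both "Qn" and "gn"
fit_biv <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  family = binomial(),
  glm_Q = "W1 + W2 + A", glm_g = "W1 + W2",
  glm_Qr = "gn", glm_gr = "Qn + gn",
  stratify = FALSE,
  reduction = "bivariate"
)
```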
The fit of the reduced-dimension regressions can be plotted post hoc using the plot.drtmle method.
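For example, assuming fit is an object returned by drtmle():

```r
# visualize the fitted reduced-dimension regressions for treatment level a_0 = 1
plot(fit, a_0 = 1)
```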
4.1.2 Using super learner to estimate regressions
Super learning is a loss-based ensemble machine learning technique (M. J. van der Laan, Polley, and Hubbard 2007) that generalizes stacking and stacked regression (Wolpert 1992; Breiman 1996). The user specifies many potential candidate regression algorithms, referred to as a library of candidate estimators, and the super learner algorithm estimates the convex combination of regression estimators that minimizes the V-fold cross-validated risk based on a user-selected loss function (Dudoit and van der Laan 2005). Oracle inequalities show that the super learner estimate performs essentially as well as the unknown best convex combination of the candidate estimators (van der Vaart, Dudoit, and van der Laan 2006). Thus, the methodology may be seen as an asymptotically optimal way of performing estimator selection. Practically, the super learner provides the best chance of obtaining a consistent estimator of both the OR and PS, and thereby an efficient estimator of the marginal means of interest.
Super learning is implemented in the R package SuperLearner (Polley et al. 2017), and drtmle() provides the option to utilize super learning for estimation of the OR, PS, and reduced-dimension regressions via the SL_Q, SL_g, SL_Qr, and SL_gr options. When calling drtmle(), either the generalized linear model options or the super learner options should be invoked. If both are supplied, the function will automatically opt to use the super learner option. The inputs to the super learner options are passed to the SL.library argument of the SuperLearner() function of the SuperLearner package. We refer readers to the documentation of the SuperLearner() function for information on how to properly specify a super learner library. In the code below, we illustrate a call to drtmle() with a relatively simple library. For the OR and PS, the library includes a main-terms generalized linear model (via the function SL.glm from SuperLearner) and an unadjusted generalized linear model (SL.mean). For the reduced-dimension regressions, we use a super learner library with a main-terms generalized linear model (SL.glm) and adaptive regression splines (SL.earth, which utilizes the earth package (Milborrow et al. 2018)).
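A sketch of such a call (assuming data objects W, A, and Y as before):

```r
set.seed(1234)
fit_sl <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  family = binomial(),
  SL_Q = c("SL.glm", "SL.mean"),    # OR library: main-terms GLM + unadjusted GLM
  SL_g = c("SL.glm", "SL.mean"),    # PS library
  SL_Qr = c("SL.glm", "SL.earth"),  # ROR library: GLM + regression splines
  SL_gr = c("SL.glm", "SL.earth"),  # RPS library
  stratify = FALSE
)
```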
The drtmle package also allows users to pass in their own functions for estimating regression parameters via the SL options. These functions must be formatted properly for compatibility with drtmle(). The necessary formatting is identical to the formatting required for writing a SuperLearner wrapper (see SuperLearner::write.SL.template()). The drtmle package includes SL.npreg, an example of how a user can specify a function for estimating regression parameters. This function fits a kernel regression estimator using the np package (Hayfield and Racine 2008). Below we show the source code for this function.
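The original listing was not preserved here; the sketch below is consistent with the description that follows (consult the drtmle package source for the authoritative implementation):

```r
SL.npreg <- function(Y, X, newX, family = gaussian(),
                     obsWeights = rep(1, length(Y)),
                     rangeThresh = 1e-7, ...) {
  # only fit the kernel regression if there is sufficient variation in Y
  if (abs(diff(range(Y))) <= rangeThresh) {
    thisMod <- glm(Y ~ 1, data = X)
  } else {
    # cross-validated bandwidth selection, then kernel regression
    bw <- np::npregbw(
      stats::as.formula(paste0("Y ~ ", paste(colnames(X), collapse = "+"))),
      data = data.frame(Y = Y, X)
    )
    thisMod <- np::npreg(bw)
  }
  # predictions on the new data
  pred <- stats::predict(thisMod, newdata = newX)
  fit <- list(object = thisMod)
  class(fit) <- "SL.npreg"
  out <- list(pred = pred, fit = fit)
  return(out)
}
```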
In the above code, we see that a user-specified function should have options Y, X, newX, family, obsWeights, ..., along with other options specific to the function. The option Y corresponds to the outcome of the regression and X to the variables onto which the outcome is regressed. The option newX should be of the same class and dimension as X and is meant to contain new values of the variables at which to evaluate the regression fit. The family and obsWeights options are not strictly necessary (indeed, they are not used by SL.npreg), though these may be useful to include if one wishes to utilize the function as part of a super learner library. The SL.npreg function proceeds by checking whether there is sufficient variation in Y to justify fitting the kernel regression (as controlled by the rangeThresh option) and, if so, fits a kernel regression with cross-validated bandwidth selection via the npregbw() function. The output is then formatted in a specific way so that drtmle() is able to retrieve the appropriate objects from the fitted regression. In particular, the output should be a list with named objects pred and fit: the former contains predictions made from the fitted regression at newX, while the latter contains any information needed to predict new values based on the fitted object. A class is assigned to the fit object so that an S3 predict() method may later be used to predict new values based on the fitted regression object. The user must also specify this predict() method. Use drtmle:::predict.SL.npreg() to view the predict() method for SL.npreg.
In the code below, we demonstrate a call to drtmle() with SL.npreg used to estimate each nuisance parameter. Here, we also illustrate the use of stratify = TRUE, which fits a separate kernel regression of Y onto W in observations with A = 1 and A = 0.
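A sketch of such a call:

```r
fit_npreg <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  SL_Q = "SL.npreg", SL_g = "SL.npreg",
  SL_Qr = "SL.npreg", SL_gr = "SL.npreg",
  stratify = TRUE   # separate kernel regressions of Y on W in each treatment arm
)
```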
In the Appendix, we describe an additional means of using drtmle() with OR and PS estimates that have been obtained externally.
4.1.3 Reducing dependence on random seed
Some users may find it unsettling that, when applying the super learner in real data analysis, the results depend on the random seed that is set. While this is a natural artifact of cross-validation (sample-splitting) and adaptive estimation, one can average results over repeated super learner runs to reduce or remove this dependence – more repeats lead to less dependence on the random seed. There are generally two approaches that can be used: (i) averaging on the scale of the nuisance parameters and (ii) averaging on the scale of the point estimates.
For (i), suppose that we have n_{SL} estimated super learners for the OR, Q̄_{n,j}, j = 1, . . . , n_{SL}, and the PS g_{n,j}, j = 1, . . . , n_{SL}. We can take the average over these fits,
Q̄_{n} = n_{SL}^{−1} ∑_{j=1}^{n_{SL}} Q̄_{n,j},
as our estimate of the OR (similarly for the PS). These averaged estimates are then used to compute the AIPTW or TMLE, as described above.
For (ii), suppose for example that we have n_{SL} TML (similarly, AIPTW) estimates ψ_{n,j}(1) of ψ_{0}(1), each obtained based on one (or more) super learners of the OR and/or PS. We can take the average over these point estimates,
ψ_{n}(1) = n_{SL}^{−1} ∑_{j=1}^{n_{SL}} ψ_{n,j}(1),
as our estimate of ψ_{0}(1).
This feature is implemented via the n_SL and avg_over options, where the former specifies the number of super learner repetitions and the latter specifies the scale(s) over which to average. For example, if avg_over = “drtmle” and n_SL = 2, then two point estimates are computed, each based on a single super learner, and averaged to obtain a final estimate. If instead avg_over = “SL” and n_SL = 2, then two super learners are fit and averaged before a final point estimate is obtained. Finally, if avg_over = c(“drtmle”, “SL”) and n_SL = 2, then averaging on both scales is performed. That is, the final estimate is the average of two point estimates, each obtained using OR and PS estimates that are averaged over two super learners.
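For example, averaging on both scales might be requested as follows (a sketch):

```r
fit_avg <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  SL_Q = c("SL.glm", "SL.mean"), SL_g = c("SL.glm", "SL.mean"),
  SL_Qr = "SL.npreg", SL_gr = "SL.npreg",
  n_SL = 2,                      # repeat each super learner twice
  avg_over = c("drtmle", "SL")   # average point estimates and nuisance fits
)
```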
Unsurprisingly, averaging over multiple super learners can add considerable computational expense to calls to drtmle(). The package is currently parallelized at the level of individual calls to SuperLearner(). With parallelization over enough cores, repeated super learning should not add much to the computational burden relative to a single call to SuperLearner().
4.2 Standard error estimation
This section contains some more advanced mathematical material and may be skipped over by users interested only in implementation. For simplicity, we focus this discussion on the standard TML and AIPTW estimators.
Reported standard error estimates are based on an estimate of the asymptotic variance of the estimators. This variance can be characterized by the (efficient) influence function of the estimators. For example, let ψ_{n}(1) be an AIPTW estimate of ψ_{0}(1). Under regularity conditions,
ψ_{n}(1) = ψ_{0}(1) + n^{−1} ∑_{i=1}^{n} D^{⋆}(Q̄_{0}, g_{0}, Q_{W,0})(O_{i}) + o_{P}(n^{−1/2}),
where
D^{⋆}(Q̄, g, Q_{W})(O) = [I(A = 1)/g(1 | W )] {Y − Q̄(1, W )} + Q̄(1, W ) − ∫ Q̄(1, w) dQ_{W}(w).
Thus, by the central limit theorem, n^{1/2}{ψ_{n}(1) − ψ_{0}(1)} converges in distribution to a mean-zero normal distribution with variance
σ^{2}_{0} = E_{0}[{D^{⋆}(Q̄_{0}, g_{0}, Q_{W,0})(O)}^{2}].
An estimate of σ^{2}_{0} is naturally obtained by taking the empirical variance of D_{i} = D^{⋆}(Q̄_{n}, g_{n}, Q_{W,n})(O_{i}), that is,
σ^{2}_{n} = n^{−1} ∑_{i=1}^{n} (D_{i} − D̄_{n})^{2}, where D̄_{n} = n^{−1} ∑_{i=1}^{n} D_{i}.
Thus, an estimate of the standard error of ψ_{n}(1) is σ_{n}/n^{1/2}. Conveniently, this idea generalizes naturally to estimating the asymptotic covariance matrix of a vector n^{1/2}(ψ_{n}(a) : a).
These ideas generalize immediately to the DRTML estimator, for which we can use the variance of its estimated influence function. The formulas for the influence functions are more complex, but are available in Theorem 1 of D. Benkeser et al. (2017).
4.2.1 Standard errors for TMLE
For TML estimators, it is standard practice to construct D_{i} using the targeted version of the OR (denoted by Q̄^{⋆}_{n} above). Recall that such estimates are obtained by modifying an initial OR estimate so as to ensure that a particular (set of) equation(s) is (are) solved. However, simulations have shown that better standard error estimates for TMLE may be achieved by instead using the initial OR estimate in the computation of standard errors, as described in the preceding subsection. To this end, the drtmle() function includes the targeted_se option, which specifies whether initial nuisance parameter estimates are to be used in computing the standard error (i.e., targeted_se = FALSE).
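For example, a call of the following form (a sketch assuming W, A, and Y from the running example, with covariates named W1 and W2) requests standard errors based on the initial, rather than targeted, nuisance estimates:

```r
# Sketch: use the initial (non-targeted) OR estimate for standard errors.
# The covariate names W1, W2 in the formulas are assumptions.
fit_init_se <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  glm_Q = "W1 + W2", glm_g = "W1 + W2",
  targeted_se = FALSE
)
```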
4.2.2 Cross-validated standard errors
Simulation studies have revealed that when employing machine learning to estimate the OR and/or PS, the estimated standard errors are often too small in small to moderate-sized samples. Problematically, this leads directly to undercoverage of confidence intervals and possibly inflated Type I error rates of hypothesis tests based on these standard error estimates. This should not be too surprising, as the phenomenon is related to the form of the standard error estimates. In particular, such estimates include a term that resembles a (weighted) mean squared error (MSE) of the OR. Since machine learning algorithms often set out to minimize empirical MSE as part of their training process, it follows that the empirical MSE often ends up biased towards zero relative to the true MSE. To more accurately estimate the MSE of a machine learning algorithm, cross-validation can be used. The same sample-splitting idea can be used to obtain more conservative standard error estimates.
Suppose we use V-fold cross-validation to obtain V training-sample-specific estimates of the OR, Q̄_{n,v}, v = 1, . . . , V, and the PS, g_{n,v}, v = 1, . . . , V. For i = 1, . . . , n, let Q̄^{cv}_{n,i}(1) denote the estimate of Q̄_{0}(1, W_{i}) obtained when observation i was in the validation fold. That is, for observations with A_{i} = 1, Q̄^{cv}_{n,i}(1) is an out-of-sample prediction of Y_{i}. We define g^{cv}_{n,i} similarly. Using these estimates, we can define
D^{cv}_{i} := I(A_{i} = 1)/g^{cv}_{n,i}(1 | W_{i}) {Y_{i} − Q̄^{cv}_{n,i}(1)} + Q̄^{cv}_{n,i}(1) − ψ_{n}(1)
and compute an estimate of σ^{2}_{0} using the empirical variance of these cross-validated values,

σ^{2}_{n,cv} = (1/n) Σ_{i=1}^{n} {D^{cv}_{i} − (1/n) Σ_{j=1}^{n} D^{cv}_{j}}².
In drtmle, cross-validated standard error estimates can be obtained by setting se_cv = "full" and se_cvFolds to a number greater than one. For example,
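a call of the following form could be used (a sketch; W, A, Y and the libraries are assumptions carried over from the running example):

```r
# Sketch: fully cross-validated standard errors based on five folds.
fit_full_cv_se <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  SL_Q = c("SL.glm", "SL.mean"),
  SL_g = c("SL.glm", "SL.mean"),
  se_cv = "full", se_cvFolds = 5
)
```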
4.2.3 Partially cross-validated standard errors
Obtaining fully cross-validated standard errors can be fairly time-intensive when super learner is used to estimate the OR and PS. In drtmle, we provide an option to compute standard error estimates that are partially cross-validated and do not require any additional fitting beyond a standard super learner. Here, we briefly describe how these estimates are obtained and show calls to drtmle() for obtaining such estimates.
Suppose we have K candidate regressions in our super learner library for both the OR and PS (in practice, the number could be different for the OR and for the PS). Recall that a super learner estimate of, for example, Q̄_{0}(a, w) is obtained as a weighted combination of estimates from each candidate regression. Let Q̄_{n,k} denote the estimate of Q̄_{0} obtained by training algorithm k using the full data set. The super learner estimate of Q̄_{0}(a, w) is
Q̄_{n}(a, w) = Σ_{k=1}^{K} α_{n,k} Q̄_{n,k}(a, w),
where α_{n} = (α_{n,1}, . . . , α_{n,K}) is a weight vector that is selected by minimizing a cross-validated risk criterion. That is, each of the K algorithms is trained according to a V-fold cross-validation scheme, resulting in V + 1 fits of each algorithm (one in each of V training folds, plus one obtained by fitting the algorithm using the full data). Let Q̄_{n,k,v} denote the k^{th} algorithm trained using the v^{th} training fold. For i = 1, . . . , n, let Q̄^{cv}_{n,k,i}(1) denote the estimate of Q̄_{0}(1, W_{i}) implied by the fitted algorithm Q̄_{n,k,v} that was obtained when observation i was in the validation set. That is, for observations with A_{i} = 1, Q̄^{cv}_{n,k,i}(1) is an out-of-sample prediction of Y_{i} based on the k^{th} algorithm. Now, we can construct a partially cross-validated super learner prediction as follows. For i = 1, . . . , n, let
Q̄^{cv}_{n,i}(1) = Σ_{k=1}^{K} α_{n,k} Q̄^{cv}_{n,k,i}(1)

be the estimate obtained by ensembling the training-set-specific algorithm fits based on α_{n}. We can use these estimates to produce our partially cross-validated standard error estimate as follows. First, define

D̃_{i} := I(A_{i} = 1)/g^{cv}_{n,i}(1 | W_{i}) {Y_{i} − Q̄^{cv}_{n,i}(1)} + Q̄^{cv}_{n,i}(1) − ψ_{n}(1),

where g^{cv}_{n,i} denotes the analogous partially cross-validated super learner estimate of the PS; then, compute an estimate of σ^{2}_{0} using

σ̃^{2}_{n} = (1/n) Σ_{i=1}^{n} {D̃_{i} − (1/n) Σ_{j=1}^{n} D̃_{j}}².
Note that this is not a properly cross-validated estimate, since observation i is used to compute the weight vector α_{n}. However, such estimates can be obtained from the output of a single round of super learning, thereby providing improved standard error estimates at minimal additional computational cost.
Below, we demonstrate how to obtain such estimates using drtmle().
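A call of the following form illustrates this (a sketch; we assume the option value "partial" for se_cv selects the partially cross-validated standard errors):

```r
# Sketch: partially cross-validated standard errors reuse the single
# super learner fit; no additional nuisance fitting is required.
fit_pcv <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  SL_Q = c("SL.glm", "SL.mean"),
  SL_g = c("SL.glm", "SL.mean"),
  se_cv = "partial"
)
```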
4.3 Confidence intervals
The drtmle package contains methods that compute Wald-style doubly robust confidence intervals. The ci() method computes confidence intervals about the estimated marginal means in a fitted object of S3 class “drtmle”, in addition to contrasts between those means. By default, the method gives confidence intervals about each mean.
Alternatively, the contrast option allows users to specify a linear combination of marginal means, which can be used to estimate the average treatment effect, as we show below.
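For instance, with a fitted object fit (assumed to be of class “drtmle” with a_0 = c(0, 1), so the means are ordered as ψ_{n}(0), ψ_{n}(1)), the interval for the average treatment effect can be requested as:

```r
# Sketch: Wald-style CI for the ATE, psi(1) - psi(0).
# The contrast vector is ordered to match a_0 = c(0, 1).
ci(fit, contrast = c(-1, 1))
```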
The contrast option also accepts as input a list of functions to be used to compute a Wald-style confidence interval for a user-specified contrast. For example, we might be interested in the risk ratio ψ_{0}(1)/ψ_{0}(0). When constructing Wald-style confidence intervals for ratios, it is common to compute the interval on the logarithmic scale and subsequently back-transform. That is, the confidence interval takes the form

exp [ log{ψ_{n}(1)/ψ_{n}(0)} ± z_{1−α/2} σ^{log}_{n} ],
where σ^{log}_{n} is the estimated standard error on the logarithmic scale. By the delta method, an estimate of this standard error is given by

σ^{log}_{n} = [ ∇g(ψ_{n})^{⊤} Σ_{n} ∇g(ψ_{n}) ]^{1/2},
where Σ_{n} is the doubly robust covariance matrix estimate output from drtmle() and ∇g(ψ) is the gradient of log{ψ(1)/ψ(0)}. This confidence interval may be obtained as follows.
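A sketch of such a contrast follows, assuming a fitted object fit with a_0 = c(0, 1), so that est[1] and est[2] are the estimated means under A = 0 and A = 1, respectively; the element names follow the contrast interface of ci() described below.

```r
# Sketch: CI for the risk ratio psi(1)/psi(0), computed on the log scale.
riskRatio <- list(
  f = function(eff) { log(eff) },         # transform to the log scale
  f_inv = function(eff) { exp(eff) },     # back-transform
  h = function(est) { est[2] / est[1] },  # contrast: psi(1)/psi(0)
  fh_grad = function(est) {               # gradient of log{psi(1)/psi(0)}
    c(-1 / est[1], 1 / est[2])
  }
)
ci(fit, contrast = riskRatio)
```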
More generally, this manner of inputting the contrast option to the ci() method can be used to compute confidence intervals of the form

f^{−1} ( f(h(ψ_{n})) ± z_{1−α/2} [ ∇f(h(ψ_{n}))^{⊤} Σ_{n} ∇f(h(ψ_{n})) ]^{1/2} ),
where f (contrast$f()) is the transformation of the confidence interval, f^{−1} (contrast$f_inv()) is the back-transformation of the interval, h (contrast$h()) is the contrast of the marginal means, and ∇f(h(ψ_{n})) (contrast$fh_grad()) defines the gradient of the transformed contrast at the estimated marginal means.
We note that this method of computing confidence intervals can also be used to obtain a transformed confidence interval for a particular marginal mean. For example, recall that Y is binary in our running example. Thus, we may wish to enforce that our confidence interval be computed on a scale ensuring that the estimated confidence limits fall within the unit interval [0, 1]. To that end, we can utilize an interval of the form

logit^{−1} ( logit{ψ_{n}(1)} ± z_{1−α/2} σ^{logit}_{n} ),

where σ^{logit}_{n} is the delta-method standard error on the logit scale.
This confidence interval may be implemented as follows.
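For example (a sketch assuming a fitted object fit with a_0 = c(0, 1); qlogis() and plogis() implement the logit and inverse-logit transforms):

```r
# Sketch: CI for psi(1) on the logit scale, keeping limits inside [0, 1].
logitMean <- list(
  f = function(eff) { qlogis(eff) },
  f_inv = function(eff) { plogis(eff) },
  h = function(est) { est[2] },  # the marginal mean under A = 1
  fh_grad = function(est) {
    c(0, 1 / (est[2] * (1 - est[2])))  # gradient of logit at est[2]
  }
)
ci(fit, contrast = logitMean)
```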
4.4 Hypothesis tests
Doubly robust Wald-style hypothesis tests may be implemented using the wald_test() method. As with confidence intervals, there are three options for testing. First, we may test the null hypothesis that the marginal means equal a fixed value (or values). Below, we test the null hypothesis that ψ_{0}(1) = 0.5 and that ψ_{0}(0) = 0.6 by inputting a vector to the null option.
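For example (a sketch assuming fit was fit with a_0 = c(0, 1), so the entries of null are ordered as the means under A = 0 and A = 1):

```r
# Sketch: test H0: psi(0) = 0.6 and H0: psi(1) = 0.5.
wald_test(fit, null = c(0.6, 0.5))
```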
As with the ci() method, the wald_test() method allows users to test linear combinations of marginal means by inputting a vector of ones and negative ones in the contrast option. We can use this to test the null hypothesis of no average treatment effect or to test the null hypothesis that the average treatment effect equals a given reference value, e.g., 0.05, by specifying the null option.
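For example (a sketch; fit is assumed to be a fitted “drtmle” object with a_0 = c(0, 1)):

```r
# Sketch: test H0: psi(1) - psi(0) = 0, then H0: psi(1) - psi(0) = 0.05.
wald_test(fit, contrast = c(-1, 1))
wald_test(fit, contrast = c(-1, 1), null = 0.05)
```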
The wald_test() method also allows for user-specified contrasts to be tested using syntax similar to the ci() method. The only difference syntactically is that it is not strictly necessary to specify f_inv(), as the hypothesis test is performed on the transformed scale. That is, we can generally test the null hypothesis that f(h(ψ_{0})) equals f_{0} (the function f applied to the value passed to null) using the Wald statistic

Z_{n} = {f(h(ψ_{n})) − f_{0}} / [ ∇f(h(ψ_{n}))^{⊤} Σ_{n} ∇f(h(ψ_{n})) ]^{1/2}.
We can use the riskRatio contrast defined previously to test the null hypothesis that ψ_{0}(1)/ψ_{0}(0) = 1.
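A sketch of this call, assuming the riskRatio contrast list described in the confidence interval section has been defined:

```r
# Sketch: test H0: psi(1)/psi(0) = 1 (on the log scale, log(1) = 0).
wald_test(fit, contrast = riskRatio, null = 1)
```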
4.5 Missing data
The drtmle() function supports missing data in A and Y. Such missing data are common in practice and, if ignored, can bias estimates of the marginal means of interest. The estimators previously discussed can be modified to allow such missingness processes to depend on observed covariates. That is, if we define ∆_{A} and ∆_{Y} as indicators that A and Y are observed, respectively, then we can still perform valid inference on E[Y(a)] under the assumptions that (i) Y(a) ⊥ ∆_{A} | W, (ii) Y(a) ⊥ A | ∆_{A} = 1, W, and (iii) Y(a) ⊥ ∆_{Y} | ∆_{A} = 1, A = a, W. In this case, we can redefine the OR to be Q̄_{0}(a, w) := E_{0}(Y | A = a, ∆_{A} = 1, ∆_{Y} = 1, W = w) and similarly redefine the PS to be

g_{0}(a | w) := Pr_{0}(∆_{A} = 1 | W = w) Pr_{0}(A = a | ∆_{A} = 1, W = w) Pr_{0}(∆_{Y} = 1 | A = a, ∆_{A} = 1, W = w).
To demonstrate, we artificially introduce missing values into A and Y in the data from our running example.
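A minimal sketch of such a perturbation (the variable names and the missingness mechanism are illustrative):

```r
# Sketch: set a random 10% of treatments and outcomes to missing.
set.seed(123)
A_miss <- A
Y_miss <- Y
A_miss[sample(seq_along(A_miss), floor(0.1 * length(A_miss)))] <- NA
Y_miss[sample(seq_along(Y_miss), floor(0.1 * length(Y_miss)))] <- NA
```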
The only required modification of the call to drtmle() is in how the PS estimation is performed. The options glm_g and SL_g are still used to fit generalized linear models or a super learner, respectively. However, the code allows for separate regression fits for each of the three components of the PS: the probability that A is observed, the probability of treatment given that A is observed, and the probability that Y is observed.
It is possible to mix generalized linear models and super learning when fitting each of the components of the PS. In the example below, we use the wrapper function SL.glm, which fits a main-terms logistic regression, to fit the regression of ∆_{A} on W; use SL.npreg to fit the regression of A on W; and use a super learner with library SL.glm, SL.mean, and SL.earth to fit the regression of ∆_{Y} on W and A.
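A sketch of this call follows; we assume, per the package documentation, that SL_g accepts a named list with components DeltaA, A, and DeltaY identifying the three PS regressions, and that A_miss and Y_miss contain the artificially missing data.

```r
# Sketch: mix GLM-style wrappers and super learning across PS components.
fit_miss <- drtmle(
  W = W, A = A_miss, Y = Y_miss, a_0 = c(0, 1),
  SL_Q = c("SL.glm", "SL.mean"),
  SL_g = list(
    DeltaA = "SL.glm",                           # Pr(A observed | W)
    A = "SL.npreg",                              # Pr(A = 1 | A observed, W)
    DeltaY = c("SL.glm", "SL.mean", "SL.earth")  # Pr(Y observed | A, W)
  )
)
```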
We can also fit the PS regressions externally and pass in via gn, though this process is slightly more complicated than when no missing data are present. The basic idea is to fit one regression for (i) the indicator of missing A, (ii) A itself, and (iii) the indicator of the outcome being missing. The three PS’s are then multiplied together and an appropriately ordered list is constructed as before. We illustrate with the following example.
Remark: In practice, it is common to also have missingness in W. In this situation, there are varying assumptions under which one can still identify and estimate E[Y(a)]. The only native solution in drtmle currently is to augment W with columns of missingness indicators for each component of W. For example, we could define an indicator ∆_{W,1} that equals 1 if variable W_{1} is observed and 0 otherwise. For observations with ∆_{W,1} = 0, we could use some (possibly very simple) imputation strategy to fill in a value for W_{1}. Candidate learners for the OR and PS may then include appropriate cross-products between W_{1} and ∆_{W,1}. Other learners may opt to treat the imputed value of W_{1} as the true value and ignore ∆_{W,1}. Super learner can be used to help determine the modeling strategy that leads to the optimal model fit conditional on W_{1} and ∆_{W,1}. It is important to note that this approach is only valid for drawing causal inference under the assumption that Y(a) is independent of A conditional on the partially observed variables in W and their missingness indicators. This assumption may not be plausible depending on the form of missingness of W. Other approaches for handling missing data, e.g., multiple imputation, could in theory be coupled with drtmle, though such applications have not been studied empirically and are not available natively in the drtmle package.
4.6 Multilevel treatments
In our examples so far, we have only considered binary treatments. However, drtmle is fully equipped to handle treatments with an arbitrary number of discrete values as well. Suppose that A assumes values in a set 𝒜 of limited cardinality. The PS regression estimation is modified to ensure that

Σ_{a ∈ 𝒜} g_{n}(a | w) = 1 for all w.
If this condition is true, we say that the estimates of the PS are compatible. Many regression methodologies that have been developed work well with binary outcomes, but have not been extended to multilevel outcomes. To ensure that users of drtmle have access to the large number of binary regression techniques when estimating the PS, we have taken an alternative approach to estimation of the propensity for each level of the treatment. Specifically, rather than fitting a single multinomial-type regression, we fit a series of binomial regressions. As a concrete example, consider the case of no missing data and where 𝒜 = {0, 1, 2}. We can ensure compatible estimates of the PS by generating an estimate g_{n}(0 | w) of Pr_{0}(A = 0 | W = w) and g̃_{n}(1 | w) of Pr_{0}(A = 1 | A > 0, W = w). The latter estimate may be generated by regressing I(A = 1) on W in observations with A > 0. Since the outcome is binary, this regression may be estimated using any technique suitable for binary outcomes. By rules of conditional probability, it holds that g_{0}(1 | w) = Pr_{0}(A = 1 | A > 0, W = w){1 − Pr_{0}(A = 0 | W = w)}, and thus an estimate of g_{0}(1 | w) can be computed as g_{n}(1 | w) = g̃_{n}(1 | w){1 − g_{n}(0 | w)}. Finally, we let g_{n}(2 | w) = 1 − g_{n}(0 | w) − g_{n}(1 | w), which ensures compatibility of the estimates for each level of the covariates. This approach generalizes to an arbitrary number of treatment levels. Below, we provide an illustration using simulated data with a treatment that has three levels. The true parameter values in this simulation are ψ_{0}(0) = 0.62, ψ_{0}(1) = 0.72, and ψ_{0}(2) = 0.77.
The multilevel PS can still be estimated using any of the three techniques discussed previously. Here, we illustrate with generalized linear models.
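A sketch of this call, assuming simulated data W (with columns named W1 and W2), treatment A taking values in {0, 1, 2}, and outcome Y:

```r
# Sketch: three-level treatment with GLM-based nuisance estimation.
# stratify = TRUE fits the OR separately within each treatment level.
fit_multi <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1, 2),
  glm_Q = "W1 + W2", glm_g = "W1 + W2",
  stratify = TRUE
)
```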
Above, we used the option a_0 to specify the levels of the treatment at which we were interested in estimating marginal means. There is no need for a_0 to contain all unique values of A. For example, certain levels of the treatment may not be of scientific interest, but we would nevertheless like to use these data to help estimate the OR and PS (by setting stratify = FALSE).
The confidence interval and testing procedures extend to multilevel treatments with minor modifications. We can still obtain a confidence interval and hypothesis test for each of the marginal means by not specifying a contrast.
Alternatively, we can specify a contrast to estimate confidence intervals about the average treatment effect of A = 1 versus A = 0. Calls to wald_test() can also be made.
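For example (a sketch assuming a fitted multilevel object fit_multi with a_0 = c(0, 1, 2)):

```r
# Sketch: ATE of A = 1 versus A = 0; the final zero drops the A = 2 mean.
ci(fit_multi, contrast = c(-1, 1, 0))
wald_test(fit_multi, contrast = c(-1, 1, 0))
```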
Similarly, we can compute confidence intervals comparing the average treatment effect of A = 2 versus A = 0.
User-specified contrasts are also available. For example, we can modify the riskRatio object we used previously for a two-level treatment to compute the risk ratio comparing A = 1 to A = 0 and comparing A = 2 to A = 0.
Similarly, we can compute confidence intervals for the risk ratio comparing the marginal mean for A = 2 to A = 1.
4.7 Cross-validated regression estimates
To avoid overfitting initial regressions, one can use cross-validated estimates of the nuisance regressions. Theoretically, the use of cross-validated regression weakens entropy conditions on nuisance parameter estimators necessary to achieve asymptotic normality (Klaassen 1987; Bickel et al. 1993; Zheng and van der Laan 2010). This is true of the regular TML and AIPTW estimators, as well as of the TML estimator with doubly robust inference. The drtmle() function implements cross-validated estimation of the nuisance regression parameters through the cvFolds option, as shown in the code below.
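A sketch of such a call (W, A, Y and the libraries are assumptions from the running example):

```r
# Sketch: 2-fold cross-validated estimation of the nuisance regressions.
fit_cv <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  SL_Q = c("SL.glm", "SL.mean"),
  SL_g = c("SL.glm", "SL.mean"),
  cvFolds = 2
)
```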
Cross-validation may significantly increase the computation time necessary for running drtmle(). Parallelized nuisance regression fitting is available by setting the option use_future = TRUE. Since, internally, all computations are carried out using futures (with the API provided by the future R package (Bengtsson 2017a, 2021)), setting this option amounts only to parallelizing with futures via future.apply::lapply(); when use_future = FALSE, base lapply() is simply used instead. Setting this option results in parallelization of the estimation of the OR, PS, and reduced-dimension regressions.
If no adjustments are made prior to invoking the drtmle() function, futures are computed sequentially, as per the defaults of the future package. It is left to the user to specify appropriate system-specific future backends that may better suit both their analytical problem and computing environment (e.g., using the future.batchtools package (Bengtsson 2017b) or even simply setting a parallelized future::plan()). Below, we demonstrate registration using an ad hoc cluster on the local system. The choice of backend is flexible, and submission of the job to a variety of schedulers (e.g., SLURM, TORQUE) is also permitted. Interested readers are invited to consult the documentation of the future.batchtools package for further information.
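A sketch of such a registration, using the future package's cluster plan with two local workers:

```r
# Sketch: register an ad hoc local cluster as the future backend.
library(future)
cl <- parallel::makeCluster(2)
plan(cluster, workers = cl)
fit_par <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  SL_Q = c("SL.glm", "SL.mean"),
  SL_g = c("SL.glm", "SL.mean"),
  cvFolds = 2, use_future = TRUE
)
parallel::stopCluster(cl)
plan(sequential)  # restore the default sequential plan
```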
5. Discussion
The drtmle package was designed to provide a user-friendly implementation of the TMLE algorithm that facilitates doubly robust inference about marginal means and average treatment effects. While the theory underlying doubly robust inference is complex, the implementation of the methods only requires users to provide the data and specify how to estimate nuisance regressions. Though estimation of the OR and PS is familiar to those with backgrounds in causal inference-related estimation, the reduced-dimension regressions may appear strange. Indeed, it is difficult to provide intuition for these parameters and more difficult still to determine what scientific background knowledge might guide users in how to estimate these regression parameters. Therefore, we recommend estimating these quantities adaptively, e.g., with the super learner algorithm. The super learner library should contain several relatively smooth estimators such as SL.mean, SL.glm, and SL.gam. In particular, including SL.mean (an unadjusted generalized linear model) may be important due to the fact that, when the OR and PS are consistently estimated, the reduced-dimension regressions are identically equal to zero. Thus, this unadjusted regression estimator may do a good job estimating the reduced-dimension regressions in situations where the OR and PS are estimated well. In addition to relatively smooth estimators, we recommend also including several adaptive estimators such as SL.npreg or SL.earth.
Planned extensions of the drtmle package include extensions to longitudinal settings involving treatments at multiple time points and inclusion of the collaborative targeted maximum likelihood estimator (CTMLE) of M. van der Laan (2014) that also facilitates collaborative doubly robust inference. The drtmle R package is publicly available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=drtmle, and the open source development repository is available on the GitHub platform for collaborative programming at https://github.com/benkeser/drtmle; the package’s documentation (automatically built via the pkgdown R package) is hosted via GitHub Pages at https://benkeser.github.io/drtmle/.
6. Appendix
6.1 Arbitrary user-specified regressions
The drtmle package also allows users to pass in their own estimated OR and PS via the Qn and gn options, respectively. If Qn is specified by the user, drtmle() will ignore the nuisance parameter estimation routines specified by SL_Q and glm_Q; similarly for gn and SL_g, glm_g. As noted in the package documentation, there is a specific format necessary for inputting Qn and gn. For Qn, the entries in the list should correspond to the OR evaluated at each treatment level and the observed values of W, with order determined by the input to a_0. For example, if a_0 = c(0, 1), then Qn[[1]] should be the OR at A = 0 and Qn[[2]] should be the OR at A = 1. The example below illustrates this concept.
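A sketch of this pattern, assuming covariates named W1 and W2 in the running example:

```r
# Sketch: fit the OR externally and pass predictions via Qn.
fit_or <- glm(Y ~ W1 + W2 + A, family = binomial(),
              data = data.frame(W, A = A, Y = Y))
Qn_user <- list(
  # Qn[[1]]: predictions with A set to 0 for every observation
  predict(fit_or, newdata = data.frame(W, A = 0), type = "response"),
  # Qn[[2]]: predictions with A set to 1 for every observation
  predict(fit_or, newdata = data.frame(W, A = 1), type = "response")
)
fit_ext_Q <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  Qn = Qn_user, glm_g = "W1 + W2"
)
```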
Similarly, for gn we need to pass in properly ordered lists. For example, if a_0 = c(0, 1) then gn[[1]] should be the PS for receiving A = 0 and gn[[2]] should be the PS for receiving A = 1. The example below illustrates this concept.
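A similar sketch for the PS (again with assumed covariate names):

```r
# Sketch: fit the PS externally and pass predictions via gn.
fit_ps <- glm(A ~ W1 + W2, family = binomial(),
              data = data.frame(W, A = A))
g1 <- predict(fit_ps, type = "response")  # Pr(A = 1 | W)
gn_user <- list(1 - g1, g1)               # ordered to match a_0 = c(0, 1)
fit_ext_g <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  glm_Q = "W1 + W2", gn = gn_user
)
```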
This flexibility allows users to leverage any regression technique to externally obtain initial estimates of nuisance parameters. The key step is making sure that the ordering of Qn and gn is consistent with a_0.
6.2 Additional methods included in drtmle
6.2.1 Inference for IPTW using adaptive PS estimators
Doubly robust estimators of marginal means are appealing in their robustness and efficiency properties; however, researchers may prefer the IPTW estimator for its intuitive derivation and implementation. Heretofore, inference for IPTW estimators precluded the use of adaptive estimators of the PS, yet M. van der Laan (2014) proposed modified IPTW estimators that allow for adaptive PS estimation by incorporating an additional step to update the initial PS estimates. Like those presented above, this estimator is based on an additional regression parameter, Q̃_{r,n}(a, w) := E_{0}{Y | A = a, g_{n}(W) = g_{n}(w)}, that is, the regression of the outcome Y on the estimated propensity score amongst observations with A = a. The author showed that if g^{⋆}_{n}(a | w), an estimator of the propensity for treatment A = a, satisfied a particular estimating equation involving Q̃_{r,n},
then the IPTW estimator based on g^{⋆}_{n} would asymptotically attain a normal limiting distribution with estimable variance. In fact, M. van der Laan (2014) proposed a TMLE-type algorithm that maps an initial PS estimate into an estimate satisfying this key equation, along with variance estimators that can be used to generate Wald-style confidence intervals and hypothesis tests.
An estimator with equivalent asymptotic properties, based on an initial PS estimate g_{n}(a | w), can alternatively be computed via a one-step correction.
Both this estimator and the TMLE-based IPTW estimator above are implemented in drtmle. Alternative approaches for developing IPTW estimators that are asymptotically normal and attain the nonparametric efficiency bound have been investigated by Ertefaie, Hejazi, and van der Laan (2022) and Hejazi et al. (2022), who relied on undersmoothing-based corrections rather than the TMLE-type corrections of M. van der Laan (2014).
Implementation of inference for adaptive IPTW
The IPTW estimators based on adaptive estimation of the PS are implemented in the adaptive_iptw() function. The adaptive_iptw() function shares many options with drtmle(); however, as the IPTW estimator does not rely on an estimate of the OR, options related to the OR are omitted.
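A sketch of a basic call (assuming W, A, Y from the running example; we assume SL_Qr specifies the library for the additional regression of Y on the estimated PS):

```r
# Sketch: adaptive IPTW with a super learner for the PS.
fit_iptw <- adaptive_iptw(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  SL_g = c("SL.glm", "SL.mean", "SL.earth"),
  SL_Qr = "SL.npreg"
)
ci(fit_iptw)
```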
The “adaptive_iptw” S3 class contains identical methods for computing confidence intervals and Wald-style tests as the “drtmle” class; thus, readers are referred back to the sections on confidence intervals and hypothesis tests for the various options for these procedures. By default, these methods are implemented for the iptw_tmle() estimator of M. van der Laan (2014). We conjecture that this estimator will have improved finite-sample behavior relative to iptw_os() (the alternative estimator described above), though further study in numerical experiments is warranted.
As with drtmle(), the adaptive_iptw() function allows missing data in A and Y. Similarly, adaptive_iptw() also allows for parallelized, cross-validated estimation of the regression parameters via the cvFolds and use_future options. As with the estimators implemented by drtmle(), cross-validation allows for valid inference under weaker assumptions.
6.2.2 Outcome-adaptive propensity score estimation
Ju, Benkeser, and van der Laan (2020) have recently proposed estimators of the average treatment effect to be used in settings where this effect is weakly identifiable, as evidenced by estimated propensity scores of small magnitude. In these settings, doubly robust estimators may exhibit non-robust behavior in small samples. The proposed solution is to target an alternative quantity in the propensity score estimation step. If we are interested in estimating ψ_{0}(a) for a = 0, 1, we can estimate Pr_{0}(A = a | Q̄_{0}(1, W), Q̄_{0}(0, W)), which describes the conditional probability of receiving treatment A = a given the value of the two outcome regressions. For this reason, we call this quantity an outcome-adaptive propensity score. We have shown that all of the usual asymptotics for TMLE carry through when an estimator of this quantity is substituted for the true propensity score. The resultant TML estimator is super-efficient, meaning that it will generally have greater asymptotic efficiency than a standard TML estimator (in fact, greater asymptotic efficiency than is ordinarily available to any regular asymptotically linear estimator). Simulation experiments indicate that this approach also delivers more stable behavior in settings where estimated propensity scores become very small. However, the estimator is not doubly robust – the asymptotics rely on the outcome regression being consistently estimated at an n^{−1/4} rate. The highly adaptive lasso estimator (David Benkeser and van der Laan 2016; M. J. van der Laan 2017; Bibaut and van der Laan 2019), implemented in the open source hal9001 R package (Hejazi, Coyle, and van der Laan 2020; Coyle et al. 2022), can attain such a convergence rate without sacrificing flexible estimation; nevertheless, as a consequence of this convergence rate requirement, this estimator trades off robustness against achieving greater stability and efficiency. See Ju, Benkeser, and van der Laan (2020) for further discussion.
Implementation of outcome-adaptive propensity score
If the outcome-adaptive PS is desired, the user may specify this via the adapt_g option. In this case, the additional targeting steps needed for doubly robust inference are “turned off” (i.e., guard = NULL), since, as discussed above, this estimator is not doubly robust. Note that when adapt_g = TRUE, the data used for PS estimation will be a data.frame with column names QaW, where a takes all the values specified in a_0. For example, if a_0 = c(0, 1), then the PS estimation data frame will have two columns called Q0W and Q1W. This may be relevant for users writing their own SuperLearner wrapper functions or those who make use of the glm_g setting.
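A sketch of such a call (W, A, Y from the running example; guard = NULL as described above):

```r
# Sketch: outcome-adaptive PS estimation; the estimator is not doubly
# robust, so the extra targeting steps are disabled via guard = NULL.
fit_adapt <- drtmle(
  W = W, A = A, Y = Y, a_0 = c(0, 1),
  SL_Q = c("SL.glm", "SL.mean"),
  SL_g = "SL.glm",  # fit to the columns Q0W and Q1W rather than W
  adapt_g = TRUE, guard = NULL
)
```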
Atlanta, GA 30322 USA
Boston, MA 02115 USA
Acknowledgements
NSH gratefully acknowledges the support of the National Science Foundation (award no. DMS 2102840).