Comment on "Statistical Modeling:The Two Cultures" by Leo Breiman

Abstract

Motivated by Breiman's rousing 2001 paper on the "two cultures" in statistics, we consider the role that different modeling approaches play in causal inference. We discuss the relationship between model complexity and causal (mis)interpretation, the relative merits of plug-in versus targeted estimation, issues that arise in tuning flexible estimators of causal effects, and some outstanding cultural divisions in causal inference.

Keywords

Causal Inference, Model Misspecification, Semiparametrics

1. Introduction

Breiman[5] describes two distinct data analysis cultures that animated the statistical community at the turn of the 21st century. One culture, popular among statisticians, starts with an explicit probabilistic model of the data generating process (DGP) and uses this model to answer questions of interest. In Breiman's view, such probabilistic models are frequently misleading; furthermore, he argues that this approach is incapable of tackling modern challenges involving large, complex datasets, with potentially more variables than observations.

Breiman argues that such challenges can only be effectively solved by embracing a purely algorithmic approach. Analysts in this culture choose a method with the highest predictive accuracy, regardless of whether it corresponds to some interpretable, parsimonious model for the DGP. They then extract whatever information they can from the model post-hoc.

In this commentary, we discuss how Breiman's points relate to problems in causal inference. Understanding the effect of X on Y is fundamentally about predicting what happens if the distribution of X is artificially manipulated in some way, which results in an extra layer of complexity compared to standard prediction problems.

We first discuss how causal misinterpretations can arise regardless of whether the data generating process is explicitly specified or not. Next, we highlight two different approaches to causal parameter estimation – plug-in vs. targeted estimation – that are loosely analogous to Breiman's two cultures. We then expand on the targeted approach and discuss how optimal estimation of target parameters can require undersmoothing the nuisance parameter estimates. This is a setting in which simply optimizing for predictive accuracy in the nuisance parameters may lead to suboptimal performance in the overall estimator. Finally, motivated by Breiman, we briefly note several current cultural divisions in causal inference.

2. Model Complexity & Causal (Mis)interpretations

Breiman describes explicit models of the DGP as being motivated by a desire for interpretability. One of his chief complaints is that it is easy to misinterpret such models. If the model does not describe nature well, then conclusions drawn from its parameter estimates may be spurious or misleading. Implicit in Breiman's critique is the fact that misinterpretations of parameters often have a causal flavor. As Breiman alludes to, in many fields, it has historically been common to run a linear or generalized linear regression and then interpret the coefficients as representing causal effects, without consideration of whether this interpretation is justified. In Breiman's view, algorithmic modeling approaches sidestep this pitfall. By not specifying a DGP, these methods avoid presenting users with parameters which they may be tempted to misinterpret.

We argue, however, that explicit models of the DGP are not necessarily intrinsically misleading, and that nothing in the algorithmic approach intrinsically protects against misinterpretation. Here it seems useful to distinguish several senses of the word "model." In one usage, common among statisticians, "model" refers to a collection of probability distributions, one of which is generally assumed to represent the true data generating process. In a second usage, "model" refers to an algorithm that maps a dataset—the training data—to a function which itself maps from the covariate or feature space to the output space. That is, a model is a procedure used to construct a predictor from data. In a third, closely related usage, "model" refers to the output of this procedure in a particular instance, i.e. a fixed predictor induced by a particular training dataset.

We note that the use of a model in the second or third sense need not imply a commitment to a model in the first sense. Simple parametric predictors may be used without assuming that the regression function follows an analogous parametric form. Least squares linear regression, for example, can be understood as estimating the best linear approximation to the true regression function, regardless of the shape of that function[41;7;8]. The resulting coefficients represent predictive relationships among the variables, no matter how they are causally related. The coefficients do not need to be mapped onto any particular process in nature.
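
For concreteness, the population-level fact behind this interpretation can be written as follows (a standard statement, not specific to [41;7;8], assuming only that $\mathbb{E}(XX^\top)$ is invertible and that any intercept is absorbed into X):

$$\beta^{*} = \arg\min_{b}\, \mathbb{E}\big\{(Y - X^\top b)^2\big\} = \mathbb{E}\big(XX^\top\big)^{-1}\mathbb{E}(XY),$$

so least squares estimates $\beta^{*}$, the coefficient vector of the best linear approximation (in mean squared error) to $\mathbb{E}(Y \mid X)$, whether or not that regression function is itself linear.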

On the other side, there is a large and growing literature dedicated to rendering complex algorithmic predictors legible. This work falls under the heading of explainable, interpretable, or transparent machine learning. These methods involve constructing or extracting simple quantities that describe the behavior of a complex model. There are now many such methods available[13;2]. Some involve approximating the complex model by simple global or local models or decision rules; others quantify the "contribution" of each feature to predictions in order to assign importance weights to each of the features. Breiman's own feature importance method for random forests[6] is an example of such a method, and perhaps he would be pleased that there has been so much work in this area since.

While the proliferation of methods for interpretable machine learning speaks to the desire users have to gain insight into the behavior of complex models, the outputs of many of these methods are susceptible to exactly the same types of misinterpretations as the parameter estimates from simpler models. Indeed, many approximation methods involve simple parametric models, such as linear models, and produce precisely the same types of estimates[24]. Feature importance methods yield weights for each feature, which are superficially similar to the coefficient estimates from a linear model. Though we have not attempted to verify this, we suspect that it is common to read feature importance values as indicating relative causal importance, even when this interpretation is not licensed.

It is simple to see that the method used to model the data is orthogonal to whether or not causal inference assumptions are satisfied. Consider a causal graph X ← Y, where Y causes X, and suppose we construct a model that predicts Y from X. Whatever quantities we use to summarize the model, whether they are parameter estimates or feature importance weights or other outputs from interpretable machine learning methods, it may be tempting to read these quantities as representing the causal influence of X on Y. This interpretation will always be invalid, because X has no causal effect on Y.
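
As a toy numerical illustration (a hypothetical simulation of our own, not an example from Breiman or the cited references), the snippet below generates data in which Y causes X, then fits a linear regression and a random forest predicting Y from X and an irrelevant covariate Z. Both the coefficient on X and its feature importance are large, even though intervening on X would not change Y.

```python
# Hypothetical simulation: Y causes X, yet a model predicting Y from X
# yields a large coefficient and a large feature importance for X.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 5000
y = rng.normal(size=n)               # Y is exogenous
x = 2 * y + rng.normal(size=n)       # X is caused by Y
z = rng.normal(size=n)               # an irrelevant covariate
features = np.column_stack([x, z])

ols = LinearRegression().fit(features, y)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(features, y)

print("OLS coefficients (X, Z):      ", ols.coef_)
print("RF feature importances (X, Z):", rf.feature_importances_)
# Neither number reflects a causal effect of X on Y: by construction there is none.
```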

In short, it is always possible to misread causality into a predictive model, regardless of whether the model is parametric or nonparametric, simple or complex. The solution, presumably, is to foreground the assumptions required for causal inference, rather than relying on model opacity to protect us from ourselves. After all, in many scientific settings, causal inference is the goal, whether this is expressed explicitly or not. A user who is determined to find quantities to which causal conclusions can be attached will have many ways of doing so, whatever modeling technique they use.

3. Plug-in Versus Targeted Estimation

Under appropriate causal identification assumptions, there are a variety of ways to proceed with estimation. The two modeling approaches that Breiman describes are roughly analogous to two distinct approaches to causal parameter estimation, which differ in whether they prioritize modeling the entire DGP or just estimating a particular parameter well. In the first approach, one starts by developing a generative model for the entire set of observed data, and then plugs in estimates from this model to answer any generic scientific question of interest. Often finite-dimensional parametric models are used, in which case the estimates take the form of particular coefficients embedded in the model (or combinations thereof). Examples of this kind of approach include the parametric g-formula[32;17] as well as other classical approaches relying on coefficients in logistic regression or Cox models. Matching can also be viewed as a form of plug-in estimator, based on nearest-neighbor-type regression predictions[1]. There are some advantages to this kind of approach, including its simplicity and the relative interpretability of the estimators (i.e., one simply takes the estimand of interest, and plugs in estimates from the model for any unknown quantities).
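
To make the plug-in idea concrete, here is a minimal sketch of a parametric g-formula estimator on simulated data; the data generating process, variable names, and use of scikit-learn's logistic regression are our own illustrative choices, not taken from the cited references.

```python
# Minimal plug-in (g-formula) sketch: fit an outcome model, then average its
# predictions with treatment set to 1 and to 0 for everyone.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=(n, 2))                                             # confounders
a = rng.binomial(1, 1 / (1 + np.exp(-(x[:, 0] - 0.5 * x[:, 1]))))       # treatment
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * a + x[:, 0] + x[:, 1]))))   # outcome

# Step 1: model the outcome given treatment and confounders.
outcome_model = LogisticRegression(C=1e6).fit(np.column_stack([a, x]), y)

# Step 2: plug in predictions under a = 1 and a = 0, then average the difference.
mu1 = outcome_model.predict_proba(np.column_stack([np.ones(n), x]))[:, 1]
mu0 = outcome_model.predict_proba(np.column_stack([np.zeros(n), x]))[:, 1]
print("plug-in (g-formula) estimate of the average treatment effect:", np.mean(mu1 - mu0))
```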

However, there are some important disadvantages to the above plug-in approach. First, simple parametric models induce the risk of substantial bias due to model misspecification. Alternatively, if one uses more flexible methods, then inference is no longer straightforward (e.g., the bootstrap may not provide valid standard errors), and in general the resulting estimator will not be efficient in a nonparametric model (i.e., there will exist different estimators with smaller mean squared error, despite not making stronger assumptions).

These limitations motivate a second approach, where the estimation procedure is tailored to the specific scientific question and estimand of interest. For many estimands, such as average treatment effects of discrete treatments, this tailoring can be framed in terms of semi/nonparametric efficiency theory[23;4;36;33;18]. Namely, for a given pathwise differentiable estimand (i.e., functional or parameter, for example the average treatment effect identified under causal assumptions such as no unmeasured confounding), one can derive the efficient influence function, which yields a benchmark for efficient estimation: no estimator can have mean squared error smaller than the variance of the efficient influence function divided by sample size, in a local asymptotic minimax sense[39]. The efficient influence function is also the crucial ingredient in standard recipes for constructing efficient estimators, for example based on one-step bias correction, estimating equations, or targeted maximum likelihood.
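
For concreteness, this benchmark can be stated (informally, suppressing regularity conditions and the precise construction of the local neighborhoods) as a local asymptotic minimax bound:

$$\liminf_{n \to \infty}\ \sup_{\mathbb{Q} \in \mathcal{N}_n(\mathbb{P})}\ n\, \mathbb{E}_{\mathbb{Q}}\Big[\big\{\widehat{\chi}_n - \chi(\mathbb{Q})\big\}^2\Big] \;\geq\; \mathrm{var}_{\mathbb{P}}\{\phi(O; \mathbb{P})\},$$

where $\mathcal{N}_n(\mathbb{P})$ is a shrinking neighborhood of ℙ and ϕ is the efficient influence function; in this sense no estimator sequence can improve on a mean squared error of var{ϕ(O; ℙ)}/n.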

Although in some cases the estimators themselves may not be as intuitive as simple plug-ins, methods that are tailored to specific estimands of interest come with some important advantages. First, they often force analysts and investigators to think hard about the scientific question when formulating the estimand of interest, rather than resorting right away to modeling and reporting coefficients. More concretely, though, estimators based on influence functions (now often called doubly robust or double machine learning methods) can be optimally efficient, achieving fast parametric 1/n-type mean squared errors even in large nonparametric models where components of the underlying data generating process can only be estimated at slower rates. In addition to the references mentioned above, we also refer to a fast-growing recent literature for more details[37;9;30].

4. Tuning in Targeted Estimation

One of Breiman's core arguments is that methods of data analysis should be chosen based on their out-of-sample prediction performance. In standard prediction problems, this can be accomplished, for instance, via cross-validation. In causal inference, however, the target parameters are often complex functionals that depend on several nuisance functions. Suppose these functions are estimated using an algorithm that depends on a tuning parameter k: how should k be chosen?

One could take the usual route and choose k based on cross-validation in a way that is optimal for estimating the nuisance functions, say in terms of mean squared error. Although this choice can still lead to nonparametrically efficient estimators, it is not always optimal for estimating the functional itself, which is the ultimate target of inference. For example, consider estimating the expected density $\chi(f) = \int f^2(x)\, dx$, which can be estimated at root-n rates even in a nonparametric model, e.g., under smoothness assumptions[3]. An intuitive plug-in estimator is $\widehat{\chi} = \frac{1}{n}\sum_{i=1}^{n} \widehat{f}(X_i)$, where, for convenience, we may think of $\widehat{f}$ as a minimax-optimal estimator of f computed from a separate independent sample. A simple calculation shows that the estimation error contains the term $\int \{\widehat{f}(x) - f(x)\} f(x)\, dx$. In a nonparametric model, the best convergence rate for $\widehat{f}$ is always slower than root-n, which means that this procedure will never result in a root-n consistent estimator unless some modifications are introduced.
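
As a small illustration (a simulation of our own; the standard normal density and the default-bandwidth kernel density estimator are arbitrary choices), the snippet below forms this plug-in estimator from two independent samples:

```python
# Plug-in estimator of the expected density chi(f) = ∫ f(x)^2 dx = E{f(X)},
# with the density estimated on a separate split.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
n = 2000
x_train = rng.normal(size=n)          # sample used to estimate f
x_eval = rng.normal(size=n)           # independent sample used for the average

fhat = gaussian_kde(x_train)          # bandwidth by Scott's rule: tuned for f, not for chi
chi_hat = np.mean(fhat(x_eval))       # plug-in: (1/n) sum_i fhat(X_i)

chi_true = 1 / (2 * np.sqrt(np.pi))   # ∫ phi(x)^2 dx for the standard normal density
print("plug-in estimate:", chi_hat, "  truth:", chi_true)
# The error of this estimator is driven by the slower-than-root-n error of fhat,
# through the term ∫ {fhat(x) - f(x)} f(x) dx discussed above.
```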

4.1 Influence function-based estimators

To achieve faster convergence rates in nonparametric models, plug-in estimators of "smooth" functionals can be "corrected" by adding averages of estimates of their influence function(s)[4;39;33;18;9]. Suppose χ(ℙ) ∈ ℝ is smooth enough to satisfy a von Mises expansion of the form

$$\chi(\widehat{\mathbb{P}}) - \chi(\mathbb{P}) = -\int \phi(o; \widehat{\mathbb{P}})\, d\mathbb{P}(o) + R_2(\widehat{\mathbb{P}}, \mathbb{P})$$

for a second order remainder term R2 and a (first order) influence function ϕ(O; ℙ). This expansion, which is the functional analogue of a Taylor expansion, suggests a "corrected estimator"

$$\widehat{\chi}_{cor} = \chi(\widehat{\mathbb{P}}) + \frac{1}{n}\sum_{i=1}^{n} \phi(O_i; \widehat{\mathbb{P}})$$

where $\chi(\widehat{\mathbb{P}})$ is the plug-in estimator based on an estimate $\widehat{\mathbb{P}}$ of ℙ. We will refer to estimators based on first order influence functions simply as first order estimators, though they are also called doubly robust, targeted, or double machine learning estimators. For instance, for $\chi(\mathbb{P}) = \mathbb{E}\{\mathbb{E}(Y \mid A = 1, X)\}$, i.e. the average treatment effect under no-unmeasured-confounding, $\widehat{\chi}_{cor}$ takes the form of the celebrated doubly-robust estimator[25;26;36]:

$$\widehat{\chi}_{cor} = \frac{1}{n}\sum_{i=1}^{n}\left[\widehat{\mu}_1(X_i) + \frac{A_i}{\widehat{\pi}(X_i)}\left\{Y_i - \widehat{\mu}_1(X_i)\right\}\right]$$

where $\mu_a(x) = \mathbb{E}(Y \mid A = a, X = x)$ and $\pi(x) = \mathbb{P}(A = 1 \mid X = x)$ are the outcome regression and propensity score, respectively.
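
A rough numerical sketch of this corrected estimator (our own illustration with simulated data, a single sample split, and off-the-shelf random forests; truncating the estimated propensity score away from zero is an ad hoc practical choice, not part of the theory above):

```python
# One-step / doubly-robust estimator of chi = E{E(Y | A = 1, X)} with one sample split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(3)
n = 4000
x = rng.normal(size=(n, 2))
a = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))    # treatment, confounded by X
y = x[:, 0] + x[:, 1] + a + rng.normal(size=n)     # outcome; here E{E(Y | A=1, X)} = 1

# Nuisance functions estimated on the first half, influence-function average on the second.
train, est = np.arange(n // 2), np.arange(n // 2, n)
pi_model = RandomForestClassifier(n_estimators=200, random_state=0).fit(x[train], a[train])
mu1_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(
    x[train][a[train] == 1], y[train][a[train] == 1])

pi_hat = np.clip(pi_model.predict_proba(x[est])[:, 1], 0.05, 0.95)
mu1_hat = mu1_model.predict(x[est])

# chi_cor = (1/n) sum_i [ mu1_hat(X_i) + A_i / pi_hat(X_i) * (Y_i - mu1_hat(X_i)) ]
chi_cor = np.mean(mu1_hat + a[est] / pi_hat * (y[est] - mu1_hat))
print("corrected (doubly robust) estimate of E{E(Y | A = 1, X)}:", chi_cor)
```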

The key advantage of correcting a plug-in estimator with an estimate of the functional's first order influence function term is that the remainder error R2 is second order. For instance, in the case of the average treatment effect, we have

$$R_2 = \int \frac{\left\{\pi(x) - \widehat{\pi}(x)\right\}\left\{\mu_1(x) - \widehat{\mu}_1(x)\right\}}{\widehat{\pi}(x)}\, f(x)\, dx$$

where f(x) is the density of X. If $\widehat{\pi}(x)$ is bounded away from zero and one, an application of Cauchy-Schwarz yields that

$$|R_2| \leq C\, \left\|\widehat{\pi} - \pi\right\|\, \left\|\widehat{\mu}_1 - \mu_1\right\|$$

for some constant C. Therefore, $R_2$ can be o(n−1/2) even if the L2 error in estimating the nuisance functions is o(n−1/4), for example. In this case, under mild conditions and provided that the nuisance functions are estimated from an independent sample, $\widehat{\chi}_{cor}$ would be n−1/2-consistent and semiparametric efficient.

Achieving n−1/4 rates for the nuisance functions, and as a result, n1/2-consistency for the functional, generally requires that the classes in which the nuisance parameters live are sufficiently regular. For example, a d-dimensional regression would need to live in an α-Hölder class with α > d/2 in order for a minimax optimal estimator to be n−1/4-consistent[34]. Thus, if π(x) is α-smooth and $\mu_a(x)$ is β-smooth, $R_2$ is o(n−1/2) if α + β > d. Interestingly, there are function classes that admit estimation at rates that do not depend exponentially on d. For instance, functions that are cadlag and have bounded variation norm can be estimated with error o(n−1/4) using the Highly Adaptive Lasso[35]. Similarly, certain neural network classes allow n−1/4-consistency (up to log factors) irrespective of the dimension of the covariates[14].

Corrected estimators based on first-order influence functions represent a substantial improvement over plug-ins because they allow for efficient estimation even in (not overly complex) nonparametric models. Furthermore, they are straightforward to implement, as they do not require any extra tuning beyond estimating the nuisance functions. One may then wonder: can first-order corrected estimators be further improved? The answer is yes, under certain conditions, but the available estimators are not as easily implementable in practice.

4.2 Higher-order & double sample-split estimators

A natural way to improve upon first-order estimators is to correct for additional terms in the von Mises expansion by estimating and subtracting off higher order influence function terms. However, typical functionals do not possess influence functions of order m ≥ 2, and exact corrections are not possible. Despite this, estimators based on approximate higher order influence functions (HOIFs) can improve upon first-order estimators[27;28;30]. At a high level, approximate HOIFs of a functional χ(ℙ) can be derived as exact HOIFs of a "smoothed" functional $\widetilde{\chi}(\mathbb{P})$ that possesses influence functions of all orders. One can then optimally trade off the approximation bias $\chi(\mathbb{P}) - \widetilde{\chi}(\mathbb{P})$ and the error in estimating $\widetilde{\chi}(\mathbb{P})$. Remarkably, higher-order estimators of the average treatment effect and the expected conditional covariance are root-n consistent if α + β > d/2, which is a necessary condition for root-n consistency in the Hölder setup[29]. Further, under smoothness constraints on f, these estimators can be minimax-optimal even outside the root-n regime, i.e. when α + β < d/2[27;30]. In settings where the covariate density f is non-smooth, minimax estimation is largely still an open problem, even for relatively simple causal effect functionals. Finally, HOIFs are higher order U-statistics involving kernels of projections onto suitable finite-dimensional subspaces. To the best of our knowledge, there is currently no data-driven way to choose the orders of the U-statistics and the projections, which can hinder the applicability of HOIF-based estimators in practice.

In order to construct simpler estimators than the ones based on HOIFs, [22] have proposed using particular forms of sample splitting with the intent of estimating different components of the functional with separate, independent subsamples. For example, letting $a(x) = \mathbb{E}(A \mid X = x)$ and $\mu(x) = \mathbb{E}(Y \mid X = x)$, a first-order estimator of the expected conditional covariance $\chi(\mathbb{P}) = \mathbb{E}\{\mathrm{Cov}(A, Y \mid X)\}$ is

$$\widehat{\chi} = \frac{1}{n}\sum_{i=1}^{n}\left\{A_i - \widehat{a}(X_i)\right\}\left\{Y_i - \widehat{\mu}(X_i)\right\}$$

[22] show that if a and μ are estimated with suitable series estimators constructed from separate independent samples and the average is over a third sample independent of the other two, then $\widehat{\chi}$ will be root-n consistent under the minimal condition α + β > d/2 when a is α-smooth and μ is β-smooth. For other functionals, such as the average treatment effect, estimators based on HOIFs remain the only ones known to achieve root-n consistency under the minimal condition. Recently, [19] has extended the approach of [22] and obtained fast rates for estimating the conditional average treatment effect, which is a non-pathwise differentiable parameter that does not satisfy the classic von Mises expansion.
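
A schematic version of this triple sample-splitting construction (our own sketch; it uses off-the-shelf gradient boosting rather than the series estimators analyzed in [22], so the rate guarantees discussed there are not claimed to apply):

```python
# First-order estimator of the expected conditional covariance E{Cov(A, Y | X)},
# with a(x) = E(A | X = x) and mu(x) = E(Y | X = x) estimated on two separate splits
# and the final average taken over a third.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
n = 6000
x = rng.normal(size=(n, 2))
a = x[:, 0] + rng.normal(size=n)                        # "treatment" (continuous here)
y = x[:, 0] + x[:, 1] + 0.5 * a + rng.normal(size=n)    # here E{Cov(A, Y | X)} = 0.5

s1, s2, s3 = np.split(rng.permutation(n), 3)            # three independent subsamples

a_model = GradientBoostingRegressor(random_state=0).fit(x[s1], a[s1])   # ahat from split 1
mu_model = GradientBoostingRegressor(random_state=0).fit(x[s2], y[s2])  # muhat from split 2

resid_a = a[s3] - a_model.predict(x[s3])
resid_y = y[s3] - mu_model.predict(x[s3])
print("estimate of E{Cov(A, Y | X)}:", np.mean(resid_a * resid_y))      # average over split 3
```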

Although these double sample-splitting estimators are simpler to construct than the HOIF estimators, exploiting their advantages over plug-ins or standard first-order estimators does appear to require choosing the tuning parameters for the nuisance functions in a way that is optimal for estimating the functional, rather than the nuisance functions themselves. Therefore, just as with HOIF-based estimators, practical implementation poses some challenges. Often, the optimal choice of the tuning parameters amounts to undersmoothing when estimating the nuisance functions. Undersmoothing has played an important role in functional estimation[12;15;16;38]. However, only the optimal order (relative to the sample size) of the tuning parameters is generally known, leaving the constants unspecified. In finite samples, different choices of these constants could result in widely different estimates, and thus finding the right amount of undersmoothing in applications is difficult. Finding a data-driven method for undersmoothing remains largely an open problem. We refer to Chapter 5 in [40] for a discussion of the practical challenges of undersmoothing, and to [10] for a recent contribution that could play an important role in tuning parameter selection for doubly robust functionals.
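
To make the undersmoothing idea concrete, here is a purely illustrative sketch in the expected-density example of Section 4; the bandwidth rates and constants below are arbitrary placeholders, not recommendations from the cited work.

```python
# Undersmoothing sketch: evaluate the plug-in estimator of ∫ f(x)^2 dx with a
# bandwidth deliberately smaller than one targeted at estimating f itself.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(5)
n = 2000
x_train, x_eval = rng.normal(size=n), rng.normal(size=n)

def plugin_expected_density(bandwidth_factor):
    fhat = gaussian_kde(x_train, bw_method=bandwidth_factor)  # scalar used as KDE factor
    return np.mean(fhat(x_eval))

h_density_optimal = n ** (-1 / 5)    # classical rate when the target is f itself
h_undersmoothed = n ** (-1 / 3)      # smaller bandwidth: noisier fhat, less smoothing bias

print("plug-in, density-optimal bandwidth:", plugin_expected_density(h_density_optimal))
print("plug-in, undersmoothed bandwidth:  ", plugin_expected_density(h_undersmoothed))
print("truth:", 1 / (2 * np.sqrt(np.pi)))
# Which constant to place in front of the undersmoothed rate is exactly the
# practical difficulty discussed in the text.
```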

Finally, the improvements over first-order estimators obtained by either using HOIFs or carefully choosing the tuning parameters and doing double sample splitting rely on some structural assumptions. For instance, choosing the optimal order of the HOIF and the nuisance tuning parameters generally requires knowledge of the smoothness / sparsity of the nuisance functions (e.g. the order of the Hölder class). Remarkably, there are exceptions to this. For instance, [19] (Theorem 3) constructs an estimator of the conditional average treatment effect whose performance depends on two tuning parameters, one of which is optimized by undersmoothing as much as possible. In practice, this suggests one may simply keep undersmoothing until numerical issues arise. When the optimal amount of undersmoothing is hard to obtain in finite samples, Lepski's method[20], originally developed for nonparametric regression problems, selects tuning parameters in a way that can adapt to the unknown smoothness of the function being estimated. A variant of it has been used for estimating the expected value of a density by Giné et al.[11], and with second-order influence function estimators by Liu et al.[21]. Many open problems remain.
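
To convey the flavor of Lepski-type selection, here is a schematic sketch (our own caricature, loosely inspired by the general recipe in [20]; the critical value z and the toy inputs are placeholders, and no optimality is claimed):

```python
# Lepski-type selection sketch: given estimates ordered from most to least smoothing,
# pick the most-smoothed one that agrees, up to noise, with all less-smoothed ones.
import numpy as np

def lepski_select(chi_hat, se, z=2.0):
    """chi_hat, se: estimates and rough standard errors, ordered from most smoothing
    (index 0) to least smoothing. z: critical value (choosing it well is the hard part)."""
    K = len(chi_hat)
    for k in range(K):
        if all(abs(chi_hat[k] - chi_hat[j]) <= z * se[j] for j in range(k + 1, K)):
            return k
    return K - 1

# Toy input: the estimates stabilize once enough smoothing has been removed.
chi_hat = np.array([0.15, 0.22, 0.27, 0.28, 0.28])
se = np.array([0.010, 0.015, 0.020, 0.030, 0.050])
print("selected index:", lepski_select(chi_hat, se))   # picks index 2 here
```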

5. Some Cultural Divisions in Causal Inference

In addition to the distinction between plug-in and targeted estimation, there are a number of interesting cultural divisions that are unique to (or especially prominent in) causal inference, which do not necessarily align with Breiman's contrast of data modeling versus algorithmic modeling. The distinctions arise in the types of targets, the formalisms used to frame assumptions and define estimands, and the techniques used for estimation. Some of these cultural divisions are:

  • causal discovery versus estimation,

  • parametric versus nonparametric methods,

  • graphs versus potential outcomes,

  • explanation versus informing policy,

  • sensitivity analysis versus bounds (versus neither).

Although there is much recent work that aims to bridge many of these divisions, one division remains particularly pronounced: whether the scientific question is about discovery or estimation. Causal discovery[31] usually aims at questions like: "given some large set of variables, how can we determine which have non-zero effects on others?" This is essentially about learning the entire underlying causal structure, e.g., in the form of a graph. In contrast, estimation problems typically focus on one or a small number of exposures, and ask only what their effect is on some outcome of particular interest, leaving unspecified much of the rest of the causal structure (for example, specific relations among confounding variables). Thus far these kinds of problems have been pursued quite separately, with the philosophy and machine learning communities taking the lead on discovery, and with estimation pursued more in statistics, biostatistics, and epidemiology. However, we feel there could be big benefits to more cross-disciplinary collaboration on these problems; for example, this could lead to more focus on inference and error guarantees for discovery in nonparametric models, or on using faithfulness assumptions to enlarge the kinds of questions that are typically asked in the estimation world. For more details, we refer to some excellent recent discussion by Cosma Shalizi of work by Kun Zhang at the Online Causal Inference Seminar.

Edward H. Kennedy
Department of Statistics & Data Science
Carnegie Mellon University
5000 Forbes Ave, Pittsburgh, PA 15213 USA
edward@stat.cmu.edu
Matteo Bonvini
Department of Statistics & Data Science
Carnegie Mellon University
5000 Forbes Ave, Pittsburgh, PA 15213 USA
mbonvini@stat.cmu.edu
Alan Mishler
Department of Statistics & Data Science
Carnegie Mellon University
5000 Forbes Ave, Pittsburgh, PA 15213 USA
amishler@stat.cmu.edu

References

[1]. Alberto Abadie and Guido W Imbens. Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1):235–267, 2006.
[2]. Vijay Arya, Rachel K. E. Bellamy, Pin-Yu Chen, Amit Dhurandhar, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Q. Vera Liao, Ronny Luss, Aleksandra Mojsilović, Sami Mourad, Pablo Pedemonte, Ramya Raghavendra, John Richards, Prasanna Sattigeri, Karthikeyan Shanmugam, Moninder Singh, Kush R. Varshney, Dennis Wei, and Yunfeng Zhang. One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques. arXiv preprint arXiv:1909.03012, 2019.
[3]. Peter J Bickel and Yaacov Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā, pages 381–393, 1988.
[4]. Peter J Bickel, Chris AJ Klaassen, Ya'acov Ritov, and Jon A Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press, 1993.
[5]. Leo Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199–231, 2001.
[6]. Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. doi:10.1023/A:1010933404324.
[7]. Andreas Buja, Lawrence Brown, Richard Berk, Edward George, Emil Pitkin, Mikhail Traskin, Kai Zhang, and Linda Zhao. Models as approximations I: Consequences illustrated with linear regression. Statistical Science, 34(4):523–544, 2019. doi:10.1214/18-STS693.
[8]. Andreas Buja, Lawrence Brown, Arun Kumar Kuchibhotla, Richard Berk, Edward George, and Linda Zhao. Models as approximations II: A model-free theory of parametric regression. Statistical Science, 34(4):545–565, 2019. doi:10.1214/18-STS694.
[9]. Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James M Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
[10]. Yifan Cui and Eric Tchetgen Tchetgen. Selective machine learning of doubly robust functionals. arXiv preprint arXiv:1911.02029, 2019.
[11]. Evarist Giné, Richard Nickl, et al. A simple adaptive estimator of the integrated square of a density. Bernoulli, 14(1):47–61, 2008.
[12]. Larry Goldstein and Karen Messer. Optimal plug-in estimators for nonparametric functional estimation. The Annals of Statistics, pages 1306–1328, 1992.
[13]. Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys, 51(5):1–42, 2018. doi:10.1145/3236009.
[14]. László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.
[15]. Jinyong Hahn. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2):315–331, 1998.
[16]. Keisuke Hirano, Guido W Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.
[17]. Alexander P Keil, Jessie K Edwards, David R Richardson, Ashley I Naimi, and Stephen R Cole. The parametric g-formula for time-to-event data: towards intuition with a worked example. Epidemiology (Cambridge, Mass.), 25(6):889, 2014.
[18]. Edward H Kennedy. Semiparametric theory and empirical processes in causal inference. In: Statistical Causal Inferences and Their Applications in Public Health Research, pages 141–167, 2016.
[19]. Edward H Kennedy. Optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497, 2020.
[20]. Oleg V Lepski and Vladimir G Spokoiny. Optimal pointwise adaptive methods in nonparametric estimation. The Annals of Statistics, pages 2512–2546, 1997.
[21]. Lin Liu, Rajarshi Mukherjee, James Robins, and Eric Tchetgen Tchetgen. Adaptive estimation of nonparametric functionals. arXiv preprint arXiv:1608.01364, 2016.
[22]. Whitney K Newey and James M Robins. Cross-fitting and fast remainder rates for semiparametric estimation. arXiv preprint arXiv:1801.09138, 2018.
[23]. Johann Pfanzagl. Contributions to a general asymptotic statistical theory, volume 13. Springer, 1982.
[24]. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016. doi:10.1145/2939672.2939778.
[25]. James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90 (429):122–129, 1995.
[26]. James M Robins and Naisyin Wang. Inference for imputation estimators. Biometrika, 87(1):113–124, 2000.
[27]. James M Robins, Lingling Li, Eric J Tchetgen Tchetgen, and Aad W van der Vaart. Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman, pages 335–421, 2008.
[28]. James M Robins, Lingling Li, Eric Tchetgen Tchetgen, and Aad W van der Vaart. Quadratic semiparametric von Mises calculus. Metrika, 69(2-3):227–247, 2009.
[29]. James M Robins, Eric J Tchetgen Tchetgen, Lingling Li, and Aad W van der Vaart. Semiparametric minimax rates. Electronic Journal of Statistics, 3:1305–1321, 2009.
[30]. James M Robins, Lingling Li, Rajarshi Mukherjee, E Tchetgen Tchetgen, and Aad W van der Vaart. Minimax estimation of a functional on a structured high dimensional model. The Annals of Statistics, 45(5):1951–1987, 2017.
[31]. Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2000.
[32]. Sarah L Taubman, James M Robins, Murray A Mittleman, and Miguel A Hernán. Intervening on risk factors for coronary heart disease: an application of the parametric g-formula. International Journal of Epidemiology, 38(6):1599–1611, 2009.
[33]. Anastasios A Tsiatis. Semiparametric Theory and Missing Data. New York: Springer, 2006.
[34]. Alexandre B Tsybakov. Introduction to Nonparametric Estimation. New York: Springer, 2009.
[35]. Mark van der Laan. A generally efficient targeted minimum loss based estimator based on the highly adaptive lasso. The International Journal of Biostatistics, 13(2), 2017.
[36]. Mark J van der Laan and James M Robins. Unified Methods for Censored Longitudinal Data and Causality. New York: Springer, 2003.
[37]. Mark J van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, 2011.
[38]. Mark J van der Laan, David Benkeser, and Weixin Cai. Efficient estimation of pathwise differentiable target parameters with the undersmoothed highly adaptive lasso. arXiv preprint arXiv:1908.05607, 2019.
[39]. Aad W van der Vaart. Semiparametric statistics. In: Lectures on Probability Theory and Statistics, pages 331–457, 2002.
[40]. Larry Wasserman. All of Nonparametric Statistics. Springer, 2006.
[41]. Halbert White. Maximum likelihood estimation of misspecified models. Econometrica, pages 1–25, 1982.

Footnotes

* MB and AM contributed equally.
