
Bayesian Causal Forests & the 2022 ACIC Data Challenge:Scalability and Sensitivity

Abstract

We demonstrate how Hahn et al.'s Bayesian Causal Forests model (BCF) can be used to estimate conditional average treatment effects for the longitudinal dataset in the 2022 American Causal Inference Conference Data Challenge. Unfortunately, existing implementations of BCF do not scale to the size of the challenge data. Therefore, we developed flexBCF—a more scalable and flexible implementation of BCF—and used it in our challenge submission. We investigate the sensitivity of our results to the choice of propensity score estimation method and the use of sparsity-inducing regression tree priors. While we found that our overall point predictions were not especially sensitive to these modeling choices, we did observe that running BCF with flexibly estimated propensity scores often yielded better-calibrated uncertainty intervals.

Keywords

Treed ensembles, heterogeneous treatment effects, Bayesian nonparametrics

1. Introduction

Given pairs of observed predictor vectors x ∈ ℝp and scalar outputs y ∈ ℝ, Chipman et al. (2010)'s Bayesian additive regression trees (BART) model approximates the regression function f(x) := 𝔼[y|x] with a sum of binary regression trees. BART often delivers accurate point estimates and well-calibrated uncertainty intervals around evaluations of f(x) without requiring users to (i) pre-specify the functional form of f or (ii) tune any hyper-parameters. This ease-of-use and generally excellent tuning-free estimation, prediction, and uncertainty quantification have made BART a popular "off-the-shelf" modeling tool.

As Hill (2011) noted, these advantages make BART an attractive choice for estimating heterogeneous treatment effects. To wit, conditional average treatment effects may be nonlinear and may depend on complicated interactions between confounders, covariates, and treatment. Correctly specifying such non-linearities and interactions is difficult—if not impossible—in most practical settings, making BART particularly compelling. Given observed triplets (x, y, z) of covariates x, outcome y, and treatment indicator z, Hill (2011) used BART to obtain an estimate f̂(x, z) of the response surface 𝔼[Y | X = x, Z = z] before estimating treatment effects as f̂(x, 1) − f̂(x, 0). In past iterations of the ACIC Data Challenge, BART-based models have consistently ranked among the top-performing methods. One method in particular—Hahn et al. (2020)'s Bayesian causal forests (BCF) model—stands out as particularly adept at estimating heterogeneous treatment effects. In this paper, we describe our experience deploying BCF to the 2022 ACIC Data Challenge.

At a high-level, BCF models the expected outcome as 𝔼[Y |X = x, Z = z] = µ(x)+z ×τ(x) and further approximates the prognostic function µ and the treatment effect function τ using separate regression tree ensembles. When fitting the tree ensembles, BCF includes an estimate of the propensity score as an additional covariate. Although these ensembles are a priori independent, they are dependent a posteriori, as BCF leverages all observations (both treated and control) to learn µ.
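For concreteness, the original cross-sectional BCF model can be fit with the bcf package roughly as in the sketch below. The argument names and the tau component of the fitted object reflect our reading of the package documentation and may differ slightly across versions; the data objects y, z, X, and pihat are placeholders.

## Illustrative call to the bcf package for cross-sectional data. Argument
## names follow our reading of the package documentation; y, z, X, and pihat
## are placeholders (outcomes, binary treatments, covariates, and estimated
## propensity scores, respectively).
library(bcf)
fit <- bcf(y = y, z = z,
           x_control = X,     # covariates for the prognostic function mu
           x_moderate = X,    # covariates for the treatment effect function tau
           pihat = pihat,     # estimated propensity scores, included as a covariate
           nburn = 1000, nsim = 2000)
## The fitted object stores posterior draws of the treatment effect evaluations;
## averaging the draws for each observation gives point estimates of tau(x).
tau_hat <- colMeans(fit$tau)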

At first glance, BCF appears ill-suited to the 2022 ACIC Data Challenge: BCF was initially proposed for cross-sectional data and the Data Challenge features longitudinal data with time-varying covariates and outcomes. However, as we show in Section 2.2, under a mild assumption about the data generating process, the Data Challenge admits an additive decomposition similar to the one used by BCF. Unfortunately, we found that the implementation available in the R package bcf did not scale to the challenge data. We therefore re-implemented a version of BCF that can scale to the Data Challenge.

When writing our implementation and subsequently deploying it during the competition, we made several essentially ad hoc modeling decisions. First, we included an estimate of Imai and Ratkovic (2014)'s covariate balancing propensity score (CBPS) as an additional covariate in the expected outcome model. Second, in our regression tree prior, we selected splitting variables using Linero (2018)'s sparsity-inducing Dirichlet prior instead of the uniform prior that is used in the bcf package. Briefly, Linero (2018)'s prior encourages regression trees to split on a small number of covariates, thereby facilitating a type of automatic variable selection. In the context of treatment effect estimation, such a sparsity-inducing prior can potentially identify the main drivers of effect heterogeneity. We assess the sensitivity of our contest submissions to these modeling choices.

The rest of the paper is organized as follows. We briefly introduce the competition dataset in Section 2 before presenting our identification analysis, introducing our new implementation of BCF, and reporting the results of our initial challenge submission. Then, in Section 3, we describe our post-challenge sensitivity analysis. Finally, we discuss the implications of our findings in Section 4.

2. Identification and estimation

2.1 Review: Study design, notation, and definitions

The 2022 ACIC Data Challenge involved analyzing a total of 3,400 synthetic datasets consisting of 200 independent realizations of 17 different data generating processes (DGPs). Each synthetic dataset simulated observations of beneficiary-level Medicare expenditure data over a period of four years. Each beneficiary received care at one of about 500 different medical practices. In addition to simulated time-invariant beneficiary- and practice-level covariates, each dataset also included average measurements of patient covariates within each practice. In each DGP, treatment was assigned to some practices after two years. The main task of the Challenge was to assess how the treatment deployed at the practice level affected future medical expenditures of beneficiaries in the treated practices.

For each beneficiary i in practice j, let Yijt denote their outcome in year t. Also, for each practice j, let Zj ∈ {0, 1} denote its treatment status, where Zj = 1 indicates that practice j was assigned treatment between year 2 and year 3 and Zj = 0 indicates that practice j was not treated during that period. As the notation suggests, the treatment is assigned only once and, before treatment is assigned, all practices are untreated. Also, treatment is assigned at the practice level and all beneficiaries in a practice receive the same treatment condition. Hence, we omit the subscripts i and t in Zj for notational simplicity. Finally, for each beneficiary i in practice j, let Xij ∈ ℝp be all observed pre-treatment covariates, i.e. non-outcome variables measured at years 1 and 2 before treatment is assigned. To prevent potential feedback, when modeling the full set of observed outcomes, we did not include Yij1 or Yij2 in Xij. Again, for notational simplicity, we omit the subscript t in Xij.

We define causal effects using potential outcomes (Splawa-Neyman et al., 1990; Rubin, 1974). For each beneficiary i in practice j, let Yijt(0) denote their potential outcome at time t when their practice was untreated between years 2 and 3. Similarly, let Yijt(1) denote their potential outcome at time t when their practice was treated between years 2 and 3. Implicit in our notation are the assumptions that (i) the treatment status of practice j cannot affect the potential outcomes of beneficiaries in any other practice j′ ≠ j and (ii) there is only one version of treatment and only one version of non-treatment. Together, these assumptions are often known as the Stable Unit Treatment Value Assumption (Rubin, 1980, 1986, SUTVA). We consider these assumptions reasonable in the context of the Data Challenge, while also noting the Challenge provided insufficient context about the data to probe potential violations.¹

We focus on the conditional average treatment effect on the treated (CATT), defined as

CATT(x, t) = 𝔼[Yijt(1) − Yijt(0) | Xij = x, Zj = 1],   for t ∈ {3, 4}.   (1)

Averaging the CATT with respect to the distribution of covariates among the treated yields the average treatment effect on the treated (ATT), i.e. ATT(t) = 𝔼[CATT(Xij, t) | Zj = 1].

2.2 Review: Causal Identification

To identify the CATT from the observed data, we made the following three assumptions for all beneficiaries and practices:

  • (A1). For all t, Yijt = Yijt(0) × 𝟙 (t ≤ 2) + Yijt(Zj) × 𝟙 (t > 2)

  • (A2). For each s ∈ {1, 2} and t ∈ {3, 4}, (Yijt(0) − Yijs(0)) ⫫ Zj | Xij.

  • (A3). There is a 0 < δ < 1 such that δ < ℙ(Zj = 1 | Xij = x) < 1 − δ for all x.

As written, (A1) implicitly assumes that (i) the observed outcomes of beneficiaries in practice j do not depend on the treatment status of a different practice j′; and that (ii) the observed outcome of beneficiary i in practice j does not depend on the potential outcomes of any other beneficiaries. Because treatment is defined at the practice level and not the beneficiary level, to paraphrase Small et al. (2008, §1.2), the issue of interference between beneficiaries in the same practice does not arise by design. Further, without additional context about the motivating application, it is impossible to probe potential violations of these non-interference assumptions with the provided data. We further note that (A3) is somewhat stronger than the assumption imposed by the Data Challenge organizers. Namely, our assumption applies to all practices and not just the treated practices.

Assuming (A1)–(A3), we can write CATT(x, t) as a function of the observed data:

CATT(x, t) = 𝔼[Yijt − Yijs | Zj = 1, Xij = x] − 𝔼[Yijt − Yijs | Zj = 0, Xij = x],   for s ∈ {1, 2} and t ∈ {3, 4}.   (2)

See Appendix A for a proof.

Notice that, according to Equation (2), we can identify the CATT using either s = 1 or s = 2. This introduces a further constraint, namely that

𝔼[Yij2 − Yij1 | Zj = 1, Xij = x] = 𝔼[Yij2 − Yij1 | Zj = 0, Xij = x].   (3)

In words, Equation (3) roughly states that conditional on covariates, the average pretreatment trend in outcomes is identical between the treated and the untreated groups. The additional structure implied by these assumptions forms the basis for an additional modeling assumption that eventually allows us to deploy BCF.

2.3 Estimation & Implementation

We will additionally assume that for every (t, x, z), we can write

𝔼[Yijt | Xij = x, Zj = z] = µ(x, t) + z × 𝟙(t > 2) × τ(x, t).   (4)

Under Equation (4) and the assumptions introduced in Section 2.2, CATT(x, t) = τ(x, t). Furthermore, 𝔼[Yij2 − Yij1 | Zj = z, Xij = x] = µ(x, 2) − µ(x, 1) for z ∈ {0, 1}, satisfying Equation (3). In contrast to Assumption (A2), which was about trends in outcomes, Equation (4) makes an assumption directly about the outcome at each time t. We note that the functional form of the expression on the right-hand side of Equation (4) is very similar to that used by BCF.
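To see why CATT(x, t) = τ(x, t), substitute Equation (4) into Equation (2): for t ∈ {3, 4} and s ∈ {1, 2},

𝔼[Yijt − Yijs | Zj = 1, Xij = x] = µ(x, t) + τ(x, t) − µ(x, s)   and   𝔼[Yijt − Yijs | Zj = 0, Xij = x] = µ(x, t) − µ(x, s),

so the difference between the two conditional expectations is exactly τ(x, t).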

On this basis, we propose fitting the model in Equation (4) with BCF. That is, we will approximate each of µ and τ with their own ensemble of regression trees, which receive their own independent priors. As noted by Hahn et al. (2020), using an additive decomposition as in Equation (4) allows us to regularize our estimate of τ in a direct and transparent fashion. This is in sharp contrast to approaches like Hill (2011) that flexibly fit the response surface 𝔼[Yijt|Xij, Zj] with a single regression tree ensemble.

2.4 Introducing flexBCF

Unfortunately, we encountered several problems when we tried to use the bcf package to fit the beneficiary-level data. We discovered that bcf makes a copy of all covariate data and then allocates a large matrix to store posterior samples of treatment effect function evaluations. When modeling the beneficiary-level data, we included the practice label as an additional categorical covariate, which bcf subsequently converted into binary indicators (one per practice). As a result, bcf allocated 5GB of memory to copy the covariate data. bcf further attempted to allocate 20GB to store 2000 posterior samples (the number used in the examples in bcf's documentation) for every τ(xij, t) value. In this way, before it could even begin the main MCMC simulation, bcf's memory requirement far exceeded the capacity of our personal computers.

When we tried running bcf on a high-memory computer, we found that the sampler was extremely slow. On further inspection, we discovered that the posterior updates of each regression tree performed many redundant computations. Basically, while updating a single tree, bcf loops over the entire dataset several times to identify the leaf to which every observation is assigned. Because the underlying Gibbs sampler only changes at most two leaves at a time, the map from observations to tree leaves does not change much iteration to iteration, rendering many of these loops redundant.

To overcome these difficulties, we developed the R package flexBCF, which is available at https://github.com/skdeshpande91/flexBCF. flexBCF includes a more efficient implementation of the basic BCF model that does not (i) unnecessarily copy the data, (ii) perform redundant calculations when updating the trees, or (iii) instantiate a matrix to hold all posterior draws of τ(x, t). Instead, our sampler returns a list of character vectors whose elements contain string representations of each regression tree. flexBCF includes a predict routine that takes in these character vectors, a new set of covariates x, and a time index t, and returns posterior samples of τ(x, t). As a result, flexBCF is faster and less memory-intensive than bcf: for each beneficiary-level dataset, it only took a few hours and about 4GB of memory to run multiple MCMC chains in sequence.
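The intended workflow is a lightweight fit-then-predict loop. The following sketch is purely illustrative: the function and argument names are our own placeholders rather than the exported flexBCF interface, which is documented in the package repository.

## Hypothetical sketch of the fit-then-predict workflow described above. The
## function and argument names below are placeholders, not necessarily the
## exported API; see https://github.com/skdeshpande91/flexBCF for details.
library(flexBCF)
fit <- flexBCF(Y_train, Z_train, X_train,   # outcomes, treatments, covariates
               pihat = pihat_train,         # estimated propensity scores
               sparse = TRUE)               # sparsity-inducing splitting prior
## The sampler returns string representations of the sampled tau trees. The
## predict routine re-evaluates those trees at new covariates and a time index,
## so posterior draws of tau(x, t) never need to be stored during sampling.
tau_draws <- predict(fit, X_new, t = 3)     # posterior draws of tau(x, 3)
tau_hat   <- colMeans(tau_draws)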

flexBCF also introduces some methodological changes aimed at achieving more flexible fits to the data. Namely, our package gives users the option to use Linero (2018)'s sparsity-inducing prior on the splitting variables used in the µ and τ ensembles. More substantively, unlike bcf and most implementations of BART-based models, flexBCF does not one-hot encode categorical predictors. In the context of the Data Challenge, such one-hot encoding forces bcf to partition the practices using a recursive "remove one at a time" strategy. This strategy partitions the practices into a few singleton sets and one large set with the remaining practices. In this way, bcf is prevented from forming several small subgroups of practices and partially pooling data within each group. By partitioning the practices directly rather than splitting on binary indicators, flexBCF can form a richer set of partitions, endowing it with much more representational flexibility. See Deshpande (2022) for an argument against one-hot encoding categorical predictors in BART-based models and for a description of an alternative strategy, which is employed by flexBCF.

2.5 Our initial challenge submission

For our initial submission, we fit the model in Equation (4) using flexBCF. We included averages of the beneficiary-level covariates at both pre-treatment time points as covariates for µ and similar averages at all four time points as covariates for τ. Although beneficiary-level covariates are time-invariant, practice composition generally changed over time as beneficiaries attrited from the study. Consequently, conditioning on these post-treatment covariates can lead to biased estimation of the CATT if patient attrition is impacted by treatment (Rosenbaum, 1984). We assumed that beneficiary attrition—and therefore practice composition—was not impacted by treatment. We additionally included an estimated covariate balancing propensity score (Imai and Ratkovic, 2014) as a covariate for both µ and τ. Although we did not include outcomes measured in years 1 or 2 as covariates for µ or τ, we did include these measurements as covariates in our propensity score model. We further placed a sparsity-inducing prior on the splitting variables for both the µ and τ tree ensembles in an attempt to discover which covariates drove heterogeneity in the prognostic function µ and the treatment effect function τ.

We submitted the posterior mean and 90% credible interval of the requested ATT and subgroup SATTs, computed from the MCMC samples of evaluations τ(x, t) (and suitable averages thereof). The credible intervals were formed by taking the 5% and 95% quantiles of the MCMC samples of the estimands. We note that these intervals, like all Bayesian credible intervals, are not theoretically guaranteed to have nominal frequentist coverage. After the Challenge concluded, the organizers provided us with the individual-level treatment effects, which we used to compute the RMSE and uncertainty interval coverage.
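Concretely, once posterior draws of an estimand are in hand (say, draws of the ATT formed by averaging τ(x, t) evaluations over the treated beneficiaries), the submitted summaries reduce to a posterior mean and the 5% and 95% sample quantiles. A minimal base-R sketch with placeholder draws follows.

## Minimal sketch: point estimate and 90% credible interval from MCMC draws.
## att_draws stands in for posterior samples of the ATT (or a subgroup SATT).
att_draws <- rnorm(2000, mean = 20, sd = 5)               # placeholder draws
point_est <- mean(att_draws)                              # submitted point estimate
ci_90     <- quantile(att_draws, probs = c(0.05, 0.95))   # submitted 90% interval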

When fit to the beneficiary-level data, our initial submission achieved RMSE (uncertainty interval coverage) for the overall ATT between 9.2 (33%) and 29.9 (82.5%), with a median of 21.81 (49%) and average of 20.54 (52.15%) across all 17 DGPs. For the SATTs, our RMSEs (coverages) ranged from 10.73 (26.5%) to 39.72 (81%).

When fit to the practice-level data, our initial submission achieved RMSE (coverage) for the overall ATT between 11.11 (65%) and 34.11 (99%), with a median of 23.72 (87.5%) and average of 23.04 (83.68%) across all 17 DGPs. Our RMSEs (coverages) ranged from 11.71 (61%) to 44.29 (99.5%) for individual SATTs.

In sharp contrast to previous iterations of the ACIC Data Challenge, our BCF-based results were fairly middling compared to other submissions, especially in the presence of confounding; in fact, from what we could tell, our RMSE results tended to be around the median value. Interestingly, when fit to the practice-level data, our uncertainty interval coverage improved substantially while the RMSE increased slightly. We suspect that the improvement in coverage is an artifact of sample size. Basically, the posterior distributions of the estimands seemed to be much more diffuse when conditioning on the practice-level data than on the beneficiary-level data.

3. Re-analysis of Challenge Data

To assess the sensitivity of our initial submission results to our ad hoc use of sparsity-inducing priors and choice of propensity score estimate, we re-analyzed the competition data using different propensity score estimates and using (or not) the sparsity-inducing prior on splitting variables for both µ and τ.

We carried out two sensitivity analyses, one for the beneficiary-level data and one for the practice-level data. For brevity, we report only the results for the beneficiary-level data here. The findings for the practice-level data are qualitatively similar and are reported in Section S1 of the Supplementary Materials.

For the beneficiary-level sensitivity analysis, we restricted our re-analysis to using sparsity-inducing priors in both ensembles or neither ensemble and did not consider the other possibilities (e.g. sparsity in µ but not τ or vice versa). In addition to the covariate balancing propensity score (hereafter CBPS), we considered running flexBCF with a propensity score estimated with gradient boosted trees (hereafter GBM). We computed the two propensity score estimates using the R packages CBPS (Fong et al., 2022) and gbm (Greenwell et al., 2022). Note that CBPS assumes that the log-odds of treatment are linear in the covariates while GBM does not make any functional form assumptions.
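For illustration, the two propensity score estimates might be computed as follows. The formula Z ~ . and the data frame practice_df are placeholders for the actual covariate specification (Section 2.5), and the gbm tuning values shown are illustrative rather than the exact values we used; the CBPS call assumes the package's formula interface and fitted.values component.

## Sketch of the two propensity score estimators used in the sensitivity
## analysis. Z ~ . and practice_df are placeholders; tuning values for gbm
## are illustrative.
library(CBPS)
library(gbm)

## Covariate balancing propensity score: log-odds assumed linear in covariates.
cbps_fit   <- CBPS(Z ~ ., data = practice_df)
pihat_cbps <- cbps_fit$fitted.values

## Gradient boosted trees: no functional form assumption on the log-odds.
gbm_fit   <- gbm(Z ~ ., data = practice_df, distribution = "bernoulli",
                 n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)
pihat_gbm <- predict(gbm_fit, newdata = practice_df,
                     n.trees = 1000, type = "response")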

We refer to the model configuration used in our initial Challenge submission as CBPS(S), where the suffix (S) refers to the sparsity-inducing prior. We use the suffix (D) to refer to the uniform prior on splitting variables. So for the beneficiary-level dataset, we considered four model configurations: CBPS(S), CBPS(D), GBM(S), and GBM(D).

The 3,400 datasets consist of 200 replications of 17 different DGPs. For each combination of model configuration and dataset, we computed the posterior mean and a symmetric 90% posterior credible interval using the posterior samples returned by flexBCF for the overall ATT and the SATT for each of the 12 pre-specified subgroups. Using the true values provided by the Data Challenge organizers, we computed the squared difference between the posterior mean and the true value of the estimand. We computed the RMSE as the square root of the average of these squared differences over the 200 replications of each DGP. We further computed uncertainty interval coverage as the proportion of the 200 replications in which our credible interval contained the true value of the estimand.
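In code, the per-DGP summaries amount to simple aggregations over the 200 replications; the base-R sketch below uses placeholder vectors for one estimand within one DGP.

## Sketch: RMSE and interval coverage for one estimand within one DGP.
## All four vectors are placeholders with one entry per replication.
post_mean <- numeric(200)   # posterior means
ci_lower  <- numeric(200)   # 5% posterior quantiles
ci_upper  <- numeric(200)   # 95% posterior quantiles
truth     <- numeric(200)   # true estimand values released by the organizers

rmse     <- sqrt(mean((post_mean - truth)^2))
coverage <- mean(ci_lower <= truth & truth <= ci_upper)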

We report the RMSE and the uncertainty interval coverage within each of five categories of DGPs defined by the amount of confounding and effect heterogeneity. The categories are: (i) no confounding but large effect heterogeneity; (ii) weak confounding with small effect heterogeneity; (iii) weak confounding with large effect heterogeneity; (iv) strong confounding with small effect heterogeneity; and (v) strong confounding with large effect heterogeneity. The Data Challenge included exactly one DGP with no confounding and four DGPs in each of the other categories.

3.1 Overall ATT

Generally speaking, we found that there were not massive differences between the four model configurations when it came to estimating the overall ATT. Perhaps unsurprisingly, in the absence of confounding, each model configuration estimated the ATT quite accurately; the RMSE (coverage) in the unconfounded DGP was 9 (83.5%) for CBPS(S), 8.79 (84%) for CBPS(D), 9.28 (80.5%) for GBM(S), and 9.07 (85%) for GBM(D). However, in the presence of confounding and large effect heterogeneity, we observed increased RMSE and decreased uncertainty interval coverage. Figures 1 and 2 show the RMSE and coverage for the overall ATT for all model configurations and DGPs. Each point in the figures corresponds to a DGP. We indicate the model configuration using the shape, fill, and color of each point, and we connect points corresponding to the same model configuration.

Figure 1. RMSE for the overall ATT based on beneficiary-level data for the confounded DGPs. Dots represent performance averaged across 200 replications of a DGP. Dots corresponding to the same model configuration are connected.

Figure 2. Uncertainty interval coverage for the overall ATT based on beneficiary-level data for the confounded DGPs. Dots represent performance averaged across 200 replications of a DGP. Dots corresponding to the same model configuration are connected.

Interestingly, in the presence of confounding, the GBM-based configurations tended to achieve somewhat smaller RMSEs than the CBPS-based configurations across all DGPs. We further observed that using sparsity-inducing priors consistently yielded slightly higher RMSEs. That said, the differences in performance are quite small.

Uncertainty interval coverage appeared more sensitive to our modeling choices. For instance, using a CBPS-based propensity score often yielded somewhat worse coverage than using the more flexible GBM-based propensity score. On further inspection, we found that the posterior intervals computed with the CBPS-based propensity score were shorter than the posterior intervals computed using the GBM-based propensity score; see Figure S7 in Section S2 of the Supplementary Materials for a comparison of interval lengths.

Across the board, each model configuration performed worse as the amount of confounding increased. We suspect that this is due to a certain amount of model misspecification. Namely, we did not include pre-treatment outcomes or trends as covariates when fitting the model in Equation (4). It is conceivable that the failure to condition on these potential confounders led to the substantial bias (as evidenced by the large RMSEs). The rather low uncertainty interval coverages are likely driven by the combination of this bias and the extremely large sample size. Said differently, the omission of these potential confounders and the large sample size likely led each of our BCF models to concentrate heavily around biased estimates.

3.2 Subgroup effects

In the absence of confounding, the median RMSE (median coverage) across the 12 SATTs was 26.83 (49%) for CBPS(S), 30.04 (52%) for CBPS(D), 27.04 (52%) for GBM(S), and 30.35 (53%) for GBM(D). Figure 3 compares the RMSEs of each SATT for each model configuration across all categories of confounded DGPs. We see that for all model configurations, RMSE increased alongside confounding and heterogeneity. Additionally, we found that using the flexible GBM-based propensity score estimates yielded lower RMSEs than using CBPS-based estimates (triangles in Figure 3 are shifted downwards compared to circles) and that sparsity-inducing priors produced slightly larger RMSEs (filled points shifted downwards compared to unfilled points).

Figure 3. RMSEs for all 12 SATTs based on beneficiary-level data for the confounded DGPs. Dots represent performance averaged across 200 replications of a DGP. All points between dashed vertical lines represent performance on the same DGP.


Figure 4 compares the coverage of the uncertainty intervals across the 12 SATTs. Generally speaking, the uncertainty interval coverage decreased as confounding and heterogeneity increased. The GBM-based models had very slightly higher coverage than the CBPS-based models, and using the sparsity-inducing priors consistently yielded lower coverage (unfilled points are shifted downwards compared to filled points in Figure 4).

Figure 4. Uncertainty interval coverage for all 12 SATTs based on beneficiary-level data for the confounded DGPs. Dots represent performance averaged across 200 replications of a DGP. All points between dashed vertical lines represent performance on the same DGP.

4. Discussion

We showed that this year's ACIC Data Challenge admitted a decomposition similar to the one used by Hahn et al. (2020)'s Bayesian causal forests model. Because the bcf package could not scale to the size of the data in the competition, we implemented a faster and less resource-intensive version, which is available in the flexBCF package. We investigated the sensitivity of our competition submission to two modeling choices: (i) the choice of propensity score method, and (ii) enforcing (or not) sparsity in the BART priors. While we did observe some sensitivity (e.g. flexible nonparametric propensity score estimates yielded longer uncertainty intervals with better coverage than parametric propensity score estimates), the differences in RMSE and coverage were not especially large.

As pointed out by a referee, the fact that the sparsity-inducing priors generally yielded worse performance may be because the Data Challenge organizers provided only those covariates that drove heterogeneity in the released datasets. Because all of the provided covariates were relevant, the sparsity-inducing prior may not have selected every covariate and may instead have inappropriately excluded some important drivers of heterogeneity. Because we generally do not know all drivers of heterogeneity in practice, we do not anticipate that the results of our sensitivity analysis will generalize far beyond the Data Challenge.

While a small performance degradation can be expected as confounding increases, we were rather surprised by how poorly our BCF models performed in the presence of confounding. We suspect that the poor performance stems from our decision to (i) directly model the observed outcomes using Equation (4) as opposed to trends and (ii) exclude pre-treatment outcomes or trends as covariates in Equation (4). In a sense, we attribute the rather poor performance of our BCF models to model misspecification rather than any particular operating characteristic of BCF.

It is also possible that our specific results may be sensitive to other choices including, but not limited to, the inclusion (or exclusion) of certain covariates; the number of trees in the µ and τ ensembles; the choice of prior variance on the leaf node parameters in each tree; and the number of MCMC iterations and number of chains. Due to time constraints, we could not exhaustively assess sensitivity along all these dimensions. However, we note that our package flexBCF renders such sensitivity analyses computationally feasible. And while we do not anticipate our specific results to generalize beyond the Data Challenge, we do hope that flexBCF will facilitate the use of BCF on larger datasets.

Supplementary Material

The supplementary appendix is available as a separate PDF file.

Ajinkya H. Kokandakar
Department of Statistics
University of Wisconsin–Madison
Madison, WI 53706, USA
ajinkya@stat.wisc.edu
Hyunseung Kang
Department of Statistics
University of Wisconsin–Madison
Madison, WI 53706, USA
hyunseung@stat.wisc.edu
Sameer K. Deshpande
Department of Statistics
University of Wisconsin–Madison
Madison, WI 53706, USA
sameer.deshpande@wisc.edu

Acknowledgments

We are grateful to Mariel Finucane, Jennifer Starling, and Dan Thal from Mathematica Policy Research for several helpful discussions following the Data Challenge.

This research was performed using the computing resources and assistance of the UW–Madison Center For High Throughput Computing (CHTC) in the Department of Computer Sciences. The CHTC is supported by UW–Madison, the Advanced Computing Initiative, the Wisconsin Alumni Research Foundation, the Wisconsin Institutes for Discovery, and the National Science Foundation, and is an active member of the OSG Consortium, which is supported by the National Science Foundation and the U.S. Department of Energy's Office of Science.

Support for A.H.K. and S.K.D. was provided by the University of Wisconsin–Madison, Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation.

References

Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.
Sameer K. Deshpande. A new BART prior for flexible modeling with categorical predictors. arXiv preprint arXiv:2211.04459, 2022.
Christian Fong, Marc Ratkovic, and Kosuke Imai. CBPS: Covariate Balancing Propensity Score, 2022. R package version 0.23.
Brandon Greenwell, Bradley Boehmke, Jay Cunningham, and GBM Developers. gbm: Generalized Boosted Regression Models, 2022. R package version 2.1.8.1.
P. Richard Hahn, Jared S. Murray, and Carlos M. Carvalho. Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects. Bayesian Analysis, 15(3):965–1056, 2020.
Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B, 76(1):243–263, 2014.
Antonio R. Linero. Bayesian regression trees for high-dimensional prediction and variable selection. Journal of the American Statistical Association, 113(522):626–636, 2018.
Paul R. Rosenbaum. The consequences of adjustment for a concomitant variable that has been affected by the treatment. Journal of the Royal Statistical Society: Series A, 147(5):656–666, 1984.
Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974.
Donald B. Rubin. Randomization analysis of experimental data: The Fisher randomization test. Journal of the American Statistical Association, 75(371):575–582, 1980.
Donald B. Rubin. Comment. Journal of the American Statistical Association, 81(396):961–962, 1986.
Dylan S. Small, Thomas R. Ten Have, and Paul R. Rosenbaum. Randomization inference in a group-randomized trial of treatments for depression: Covariate adjustment, noncompliance, and quantile effects. Journal of the American Statistical Association, 103(481):271–279, 2008.
Jerzy Splawa-Neyman, D. M. Dabrowska, and T. P. Speed. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5(4):465–472, 1990.

Appendix A. Identification

Recall Assumption (A3), which asserts that there is a 0 < δ < 1 such that for all covariate vectors x, δ < ℙ(Zj = 1 | Xij = x) < 1 − δ. Marginalizing over x, this implies that δ < ℙ(Zj = 1) < 1 − δ so that the population consists of a non-trivial mixture of treated and untreated practices.

Let t be a post-intervention time-point (i.e. t ∈ {3, 4}) and let s be a pre-intervention time-point (i.e. s ∈ {1, 2}). Then, for any such s and t and any x satisfying δ < ℙ(Zj = 1 | Xij = x) < 1 − δ, we have the following:

𝔼[Yijt − Yijs | Zj = 1, Xij = x]

  = 𝔼[Yijt(1) − Yijs(0) | Zj = 1, Xij = x]   (by A1)

  = 𝔼[Yijt(1) − Yijt(0) | Zj = 1, Xij = x] + 𝔼[Yijt(0) − Yijs(0) | Zj = 1, Xij = x]

  = CATT(x, t) + 𝔼[Yijt(0) − Yijs(0) | Zj = 0, Xij = x]   (by A2)

  = CATT(x, t) + 𝔼[Yijt − Yijs | Zj = 0, Xij = x].   (by A1)

Rearranging we can express the estimand of interest in terms of expectations of observables as follows:

CATT(x, t) = 𝔼[Yijt − Yijs | Zj = 1, Xij = x] − 𝔼[Yijt − Yijs | Zj = 0, Xij = x],

which is the same as Equation (2).

Footnotes

1. In fact, our notation makes further implicit assumptions that we similarly regard as reasonable. Namely, we assume that treatment status does not vary amongst beneficiaries in the same practice and does not change once assigned between years 2 and 3.
