
Leveraging Contact Network Information in Clustered Randomized Studies of Contagion Processes
In a randomized study, leveraging covariates related to the outcome (e.g. disease status) may produce less variable estimates of the effect of exposure. For contagion processes operating on a contact network, transmission can only occur through ties that connect affected and unaffected individuals; the outcome of such a process is known to depend intimately on the structure of the network. In this paper, we investigate the use of contact network features as efficiency covariates in exposure effect estimation. Using augmented generalized estimating equations (GEE), we estimate how gains in efficiency depend on the network structure and spread of the contagious agent or behavior. We apply this approach to simulated randomized trials using a stochastic compartmental contagion model on a collection of model-based contact networks and compare the bias, power, and variance of the estimated exposure effects using an assortment of network covariate adjustment strategies. We also demonstrate the use of network-augmented GEEs on a clustered randomized trial evaluating the effects of wastewater monitoring on COVID-19 cases in residential buildings at the University of California San Diego.
Network, Contagion, Statistics
1. Introduction
Contact networks capture the structure of possible pairwise transmissions (represented by network edges) in a population of actors (represented by network nodes) for various types of contagion processes; these may describe the spread of pathogens, behaviors, or ideas. Transmission on the network can only occur through edges that connect exposed and unexposed individuals; given that the structure of the network constrains pairwise transmissions, the outcome of such a process must depend on this structure. In general, the relationship between network structure and contagion processes can be complex (Frank and Strauss, 1986; Newman, 2010); in investigation of exposure effects from observational data, one must consider that network properties may confound such effects. Furthermore, knowledge of the degree to which network features predict outcome can also improve efficiency of estimation.
This paper investigates the degree to which incorporating information about contact network structure and summaries of contagion process outcomes at baseline improves the accuracy of estimates of exposure effects on outcomes that reflect a contagion process operating on a network. Regardless of whether a study is observational or randomized, as long as individual outcomes are correlated only within discrete and independent clusters but not across them, estimates of the treatment or exposure effect that ignore correlation may still be unbiased, provided that all confounding factors are measured and appropriately modeled in the observational setting. Consistent estimation of the variance must adjust for within-cluster outcome correlation (Rosenbaum, 2002; Murray, 1998; Eldridge and Kerry, 2012). Of note, in both randomized and observational studies, contact network information can improve statistical efficiency of estimation (Chemie et al., 2014).
To accommodate the within-cluster correlation of outcomes, we make use of Generalized Estimating Equations (GEEs) (Zeger and Liang, 1986) that provide estimates of the average marginal treatment effect across all clusters in cluster randomized trials or in settings characterized by a correlation across groups of individuals (Murray, 1998). In this manuscript, we consider the augmented GEE (Prague et al., 2016), which allows for the use of a user-defined outcome model to improve estimation efficiency. The variance of this augmented estimate decreases asymptotically if the covariates included in the outcome model predict the outcome. Other semiparametric approaches with potential efficiency gains have been developed, such as targeted maximum likelihood estimation (TMLE) for clustered data (Balzer et al., 2017). Although our investigation focuses on bias and efficiency using the augmented GEE, the issues we discuss are relevant for the TMLE approach as well.
The paper is structured as follows. Section 2 provides some background on networks and presents the details of the estimation procedure, including how to incorporate contact network structure in the estimation process. Section 3 demonstrates the ability of our methods to evaluate the potential gains in efficiency under different conditions in a simulation study that considers a range of network types. Section 4 considers a data example from a clustered randomized trial conducted at the University of California San Diego. Section 5 provides concluding remarks.
2. Methods
2.1 Networks
This section provides background on network concepts and the notation used throughout this paper. We assume that the true data generating mechanism is a simple susceptible-infected (SI) contagion process spreading through a network. We identify a network as a collection of nodes, such that outcomes within the network are correlated while outcomes across networks are independent. Complete data on network and contagion processes are not generally observable; but even when they are not, it may nevertheless be possible to characterize certain features of processes and/or networks.
A simple network consists of a set of nodes {1, ..., n} and a set of edges between them. The placement of edges may be described by an n × n symmetric adjacency matrix e, where element e_{ij} = e_{ji} is 1 if an edge exists between nodes i and j and is 0 otherwise. The degree of node i, denoted as k_{i}, is the number of edges that are adjacent to it: k_{i} = ∑_{j}e_{ij}. Mean neighbor degree of node i, (1/k_{i}) ∑_{j}e_{ij}k_{j}, is the unweighted average of a node’s neighbors’ degrees. Degree assortativity is a composite measure of mean neighbor degree across the network, defined as the Pearson correlation coefficient of degrees of adjacent nodes taken over all network edges (Callaway et al., 2001), which may be calculated as follows. The concept of excess degree is often used in analytical treatment of network models; the excess degree of a node is defined as one less than the (actual) degree of the node. Let the marginal probability of a node having excess degree k be q_{k}, and let the probability of this node connecting with a node with excess degree k′ be p_{kk′}. Degree assortativity can then be calculated for a given network using a sum across all degree pairs (Newman, 2002) as r = (1/σ^{2}_{q}) ∑_{k,k′} kk′(p_{kk′} − q_{k}q_{k′}), where σ^{2}_{q} is the variance of the excess degree distribution for the network. A connected component is a maximal subset of nodes for which a path exists between each pair of nodes. A path exists between two nodes i and j if and only if there exists a subset of edges in the network that connect nodes i and j. Let C be the number of components in the network, and let node i belong to the connected component with label c_{i} ∈ {1, ..., C}, where component labels are assumed to be ordered from largest to smallest. The largest connected component (LCC) contains ∑_{i} 1(c_{i} = 1) nodes, where 1(·) is the indicator function. The mean component size is n/C.
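As an illustration of the structural features defined above, the following minimal Python sketch computes degree, mean neighbor degree, degree assortativity, and component labels from an adjacency matrix. The function names and the five-node example network are hypothetical and are not taken from the study; isolated nodes are assigned a mean neighbor degree of 0.0 by assumption, and assortativity is computed directly as a Pearson correlation over edge ends rather than through the excess degree distribution (the two agree because the correlation is unchanged by subtracting one from every degree).

```python
from collections import deque


def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix e for an undirected simple network."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj


def degrees(adj):
    """Degree k_i = sum_j e_ij."""
    return [sum(row) for row in adj]


def mean_neighbor_degree(adj):
    """Unweighted average of neighbors' degrees (0.0 for isolated nodes, by assumption)."""
    k = degrees(adj)
    result = []
    for row in adj:
        nbr = [k[j] for j, e in enumerate(row) if e]
        result.append(sum(nbr) / len(nbr) if nbr else 0.0)
    return result


def degree_assortativity(adj):
    """Pearson correlation of the degrees found at the two ends of each edge."""
    k = degrees(adj)
    # Each undirected edge is listed in both directions, so both margins match.
    pairs = [(k[i], k[j]) for i, row in enumerate(adj) for j, e in enumerate(row) if e]
    if not pairs:
        return float("nan")
    mean = sum(x for x, _ in pairs) / len(pairs)
    var = sum((x - mean) ** 2 for x, _ in pairs) / len(pairs)
    cov = sum((x - mean) * (y - mean) for x, y in pairs) / len(pairs)
    return cov / var if var > 0 else float("nan")


def component_labels(adj):
    """Labels c_i in {1, ..., C}, ordered from the largest component to the smallest."""
    n = len(adj)
    raw, sizes = [None] * n, []
    for start in range(n):
        if raw[start] is not None:
            continue
        raw[start] = len(sizes)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over one component
            u = queue.popleft()
            size += 1
            for v, e in enumerate(adj[u]):
                if e and raw[v] is None:
                    raw[v] = raw[start]
                    queue.append(v)
        sizes.append(size)
    order = sorted(range(len(sizes)), key=lambda c: -sizes[c])
    rank = {c: r + 1 for r, c in enumerate(order)}
    return [rank[c] for c in raw]


# Hypothetical example: a three-node path plus a separate two-node component.
adj = adjacency(5, [(0, 1), (1, 2), (3, 4)])
labels = component_labels(adj)
print(degrees(adj), mean_neighbor_degree(adj), labels)
print(degree_assortativity(adj), len(adj) / max(labels))  # assortativity, n / C
```

In this small example the largest component carries label 1 and the mean component size is n/C = 5/2.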
The contagion status of each node or that of the node’s network neighbors might be observable at baseline. We describe a person who has been impacted by the contagion process as affected (e.g., infected if the process is infectious or impacted if the process alters behavior), and we use I_{i}(t) to denote the binary process outcome for node i at time t. One simple metric of contagion status is the number of affected neighbors at baseline for node i, ∑_{j} e_{ij}I_{j}(0), or the number of affected individuals at baseline belonging to the same component as a given node, ∑_{j} I_{j}(0)1(c_{j} = c_{i}), where 1(·) is the indicator function. Another metric is the length of the shortest path between each node i and each infected individual j in the network at baseline. The shortest path length between nodes i and j is d_{ij}, where d_{ij} := ∞ when no path exists between the two nodes. The shortest path length from the closest node affected at baseline is min_{j} d_{ij}. The sum of the inverse path lengths to node i is ∑_{j} (d_{ij})^{−1}. In both of these expressions, j ranges over the nodes affected at baseline. Some of these metrics may be difficult to determine in practice given limited knowledge about a network or process outcomes; our interest lies in examining whether their inclusion in analyses yields strong enough improvements in estimation to justify the efforts necessary to gather the required data. Table 1 summarizes these network features.
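The baseline contagion metrics described above can be sketched in a few lines of Python using breadth-first search. This is only an illustration under stated assumptions: the function and key names are hypothetical, a node's own baseline status is counted in its component total but excluded from its distance-based metrics, and unreachable affected nodes contribute 1/∞ = 0 to the inverse path length sum.

```python
from collections import deque
import math


def bfs_distances(adj, source):
    """Shortest path lengths d_ij from `source`; math.inf when no path exists."""
    dist = [math.inf] * len(adj)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, e in enumerate(adj[u]):
            if e and dist[v] == math.inf:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def baseline_covariates(adj, affected):
    """Per-node baseline metrics, where affected[j] plays the role of I_j(0)."""
    n = len(adj)
    sources = [j for j in range(n) if affected[j]]
    dist_from = {j: bfs_distances(adj, j) for j in sources}
    rows = []
    for i in range(n):
        # Distances from node i to every *other* node affected at baseline.
        d = [dist_from[j][i] for j in sources if j != i]
        rows.append({
            "affected_neighbors": sum(adj[i][j] * affected[j] for j in range(n)),
            "affected_in_component": sum(1 for j in sources if dist_from[j][i] < math.inf),
            "nearest_affected": min(d, default=math.inf),
            "sum_inverse_distance": sum(1.0 / x for x in d),
        })
    return rows


# Hypothetical example: path 0-1-2-3 with node 0 affected, plus isolated node 4.
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
rows = baseline_covariates(adj, [1, 0, 0, 0, 0])
for i, row in enumerate(rows):
    print(i, row)
```

Computing one breadth-first search per affected node keeps the sketch simple; for large networks with many affected nodes, a single multi-source search would suffice for the nearest-affected distance.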
2.2 GEE-based estimation of the effect of a randomized exposure on outcome
Generalized estimating equations (GEE) provide a general approach for analyzing correlated outcomes that: i) is more robust to variance structure misspecification and relies less on parametric assumptions than the standard likelihood methods, and ii) provides population level conclusions on the effect of an exposure on an outcome. This section reviews the augmented GEE described in Prague et al. (2016) and describes incorporation of contact network structure in the estimation process.
Consider a randomized study of a contagion process that consists of i = 1, ..., m clusters with j = 1, ..., n_{i} individuals per cluster, where ∑_{i} n_{i} = n is the total number of individuals in the study. The binary outcome for individual j in cluster i, Y_{ij}, is 1 if the individual is affected by the process by the end of the study, otherwise Y_{ij} = 0. Y_{i} = (Y_{i1}, ..., Y_{in_{i}})^{⊤} denotes the associated vector of outcomes in cluster i. We assume that there is no mixing across clusters and that outcomes are independent across them. We consider a setting where some of the clusters are randomized to a specific treatment, intervention, or exposure while others are not; we use A_{i} = a to denote an exposure indicator such that a = 1 for the exposed clusters and a = 0 for the unexposed (control) clusters. In Figure 1, we show a schematic of such a trial. The outcome can be modeled as a function of exposure through a mean model µ_{i}(β, A_{i}) = E(Y_{i} | A_{i}) with a link function h. The general form of a classical GEE is U(β) = ∑^{m}_{i=1} D^{⊤}_{i}V^{−1}_{i}{Y_{i} − µ_{i}(β, A_{i})}, where D_{i} = ∂µ_{i}(β, A_{i})/∂β^{⊤} is the design matrix, V_{i} is the covariance matrix equal to ϕR^{1/2}_{i}C(α)R^{1/2}_{i}, R_{i} is a diagonal matrix with elements var(Y_{ij}), ϕ is the dispersion parameter, and C(α) is the “working” correlation structure with nondiagonal terms α. Parameters are estimated by setting U(β) to 0. Because our goal is to estimate the effect of exposure, our causal parameter of interest is β_{A}, the component of β associated with the exposure A_{i}.
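To make the working covariance concrete, the short Python sketch below assembles V_{i} = ϕR^{1/2}_{i}C(α)R^{1/2}_{i} for one cluster under an exchangeable working correlation, in which every off-diagonal entry of C(α) equals α. It assumes a binary outcome with var(Y_{ij}) = µ_{ij}(1 − µ_{ij}); the function name and the numerical values are hypothetical.

```python
import math


def exchangeable_working_covariance(mu, alpha, phi=1.0):
    """V_i = phi * R^{1/2} C(alpha) R^{1/2} for a single cluster.

    mu    : fitted means mu_ij for the individuals in the cluster
    alpha : common off-diagonal working correlation
    phi   : dispersion parameter
    Assumes a binary outcome, so var(Y_ij) = mu_ij * (1 - mu_ij).
    """
    sd = [math.sqrt(m * (1.0 - m)) for m in mu]
    n = len(mu)
    return [
        [phi * sd[j] * (1.0 if j == l else alpha) * sd[l] for l in range(n)]
        for j in range(n)
    ]


# Hypothetical cluster of three individuals.
V = exchangeable_working_covariance([0.5, 0.5, 0.2], alpha=0.1)
for row in V:
    print([round(v, 4) for v in row])
```

The diagonal reproduces the outcome variances, while each off-diagonal entry is α times the product of the two standard deviations.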
We assume each individual j in cluster i to have a set of P cluster-level and individual-level covariates X_{ij} = (X^{(1)}_{ij}, ..., X^{(P)}_{ij})^{⊤}, with all covariates represented compactly as X_{i} = (X_{i1}, ..., X_{in_{i}})^{⊤}. Some of the covariates may relate to the network in which the individual is embedded and are described in the previous subsection. Even when the effect of intervention is not confounded, such as in a randomized setting, there exist chance imbalances in the post-randomization distribution of covariate values across treatment arms. It is then possible to improve efficiency of estimation by introducing covariate adjustment to the standard GEE framework by augmenting the GEE itself. This requires specification of an outcome model (OM) B_{i}(X_{i}, A_{i} = a, η_{B}) = [B_{ij}(X_{i}, A_{i} = a, η_{B})]_{j=1,...,n_{i}}, which is an arbitrary function of X_{i} for each exposure level, and η_{B} are nuisance parameters to be estimated. Estimation is most efficient if the OM correctly models the probability of the outcome of interest given baseline covariates, E(Y_{ij} | X_{i}, A_{i} = a). We will call the OM “correctly specified” if it corresponds to the true data generation process.
The variance of β is estimated by using an empirical or “sandwich” estimator, which is also robust in the sense that it provides valid standard errors even when the assumed covariance structure is not correct. Given the OM, the exposure effect is represented by the vector of coefficients β.
In this paper, we estimate the OM as E(Y_{ij} | X_{i}, A_{i} = a) based on some regression model, such as a simple GLM. Due to potential treatment-covariate interaction, we fit separate OMs for the two treatment arms. In our simulation study, we will consider (1) augmented GEEs adjusting for each of the thirteen covariates listed in Table 1 alone, (2) an augmented GEE adjusting for all covariates, and (3) an augmented GEE employing a stepwise selection of relevant network covariates X_{i}.
In settings with missing data, one can include an additional propensity score (PS) term to account for data missingness. Including both the OM and the PS yields the doubly robust (DR) GEE, which is consistent and asymptotically normal (CAN) if either the OM or the PS is correctly specified (Prague et al., 2016). Our investigation focuses solely on the use of network covariates in the augmented GEE, but this approach can be generalized to the DR GEE for studies with missing data.
3. Simulation Study
In this section, we describe simulation of a contagion process on a network in a randomized study setting to estimate the effect of an intervention on that process. We use this simulation study to investigate the usefulness of our method to reduce variance of estimates of effects of exposure. As before, we assume that individuals are nested within a collection of independent clusters, each with its own contact network structure, and that the outcome of interest arises from a contagion process propagating through network ties. We also assume that the intervention will reduce the rate of the contagion by varying amounts.
We first describe the network generation for each cluster and the contagious spreading processes as well as the effect of the exposure (or intervention) on the latter. To estimate the average exposure effect, we apply the augmented GEE described above and compare its results to those of a standard GEE; we also examine the effect of the simulation conditions on their relative efficiency.
Simulated Contact Networks
The network generation model in our simulation study is the degree-corrected stochastic block model with degree correlation. Because of the complexity of this model, it is not possible to analytically obtain an estimate of the improvement in efficiency resulting from incorporation of network information in analyses. The original stochastic block model (Anderson et al., 1992) assumes that, in a given network, each node i = 1, ..., n belongs to only one block b_{i} in a partition of nodes ℬ = {1, ..., B}; the set of node memberships is given by the vector b = {b_{1}, ..., b_{n}}. In this model, the probability of an edge between nodes i and j depends only on their block membership: P(e_{ij} = 1) = p_{b_{i}b_{j}}. An extension of this model, the so-called degree-corrected stochastic block model, allows each node i to have arbitrary expected degree θ_{i} := E(k_{i}), where k_{i} is the observed degree of node i (Karrer and Newman, 2011). The likelihood associated with this model assumes that the mean number of edges ν_{ij} between nodes i and j is the product of the expected degrees of nodes i and j (θ_{i} and θ_{j}, respectively), multiplied by the expected amount of mixing ω_{b_{i},b_{j}} between the blocks to which nodes i and j belong. The full likelihood of this model is
P(e | θ, ω, b) = ∏_{i<j} [ν_{ij}^{e_{ij}} / e_{ij}!] exp(−ν_{ij}) × ∏_{i} [(ν_{ii}/2)^{e_{ii}/2} / (e_{ii}/2)!] exp(−ν_{ii}/2),
where ν_{ij} = θ_{i}θ_{j}ω_{b_{i},b_{j}}. The model assumes that e_{ij} is Poisson distributed, allowing for multiple edges between pairs of nodes, which converges to a simple Bernoulli network (having binary edges) for sparse networks in the limit n → ∞ (Karrer and Newman, 2011). The 1/2 terms in the second half of the likelihood account for the fact that self-edges (edges from one node to itself) are counted twice by this indexing.
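The Poisson edge model can be illustrated with a small Python sketch that computes the expected edge counts ν_{ij} = θ_{i}θ_{j}ω_{b_{i},b_{j}} and draws one multigraph from them. This is a simplified illustration rather than the generator used in the study: self-edges are omitted, degree assortative rewiring is not included, and the function names and parameter values are hypothetical. The Poisson draw uses Knuth's multiplication method, which is adequate only for small means.

```python
import math
import random


def expected_edges(theta, blocks, omega):
    """nu_ij = theta_i * theta_j * omega[b_i][b_j] for every pair of nodes."""
    n = len(theta)
    return [
        [theta[i] * theta[j] * omega[blocks[i]][blocks[j]] for j in range(n)]
        for i in range(n)
    ]


def poisson(rng, lam):
    """Poisson draw by Knuth's multiplication method (suitable for small lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_network(theta, blocks, omega, seed=0):
    """Symmetric matrix of Poisson edge counts; self-edges are omitted by assumption."""
    rng = random.Random(seed)
    nu = expected_edges(theta, blocks, omega)
    n = len(theta)
    e = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            e[i][j] = e[j][i] = poisson(rng, nu[i][j])
    return e


# Hypothetical values: four nodes in two blocks, with stronger within-block mixing.
theta = [1.0, 2.0, 1.5, 0.5]
blocks = [0, 0, 1, 1]
omega = [[0.4, 0.1], [0.1, 0.4]]
nu = expected_edges(theta, blocks, omega)
e = sample_network(theta, blocks, omega, seed=1)
for row in e:
    print(row)
```

Entries of e larger than 1 represent multiple edges; thresholding them at 1 would give the simple network that the model approaches in the sparse limit.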
In addition to block structure and node degree, networks may vary in the extent to which degrees of adjacent nodes are correlated (Newman, 2002). One metric for quantifying this property is degree assortativity, which was defined above. Degree assortativity can be varied in the network generating process by performing degree assortative rewiring, which increases or decreases the assortativity in the network while preserving block structure and each node’s degree (Xulvi-Brunet and Sokolov, 2004; Rao et al., 1996). The details of this algorithm are given in the Appendix.
Block Structure
Each network/cluster in our simulation comprises eight blocks. We simulate networks using two types of block structure: random and heterogeneous. For a complete description of the block structures used, see Section 2 of the Supplementary Material. A diagram of these structures is shown in Figure 2.
Contagion Process
We simulate a contagion process (Pastor-Satorras et al., 2015) by employing a stochastic compartmental SI (susceptible-infectious) model (Anderson and May, 1991), shown in Algorithm 1. Initially, S% of all nodes are selected to be affected by the contagion process (capable of transmitting) at random across all study clusters/networks i = 1, ..., m. After initiation, affected node j in network i selects z_{ij} of their k_{ij} neighbors at random and transmits to them with probability p_{0}, where z_{ij} is the node’s affectivity, which may vary between 0 and k_{ij}. Zhou et al. (2006) showed that the properties of spreading processes on networks can depend strongly on affectivity. Unit affectivity and degree affectivity occur when an individual attempts to affect either one partner (selected at random) or all partners, respectively, per unit time. (Illustrative diagrams of the contagion process over time are given in Section 1 of the Supplementary Material.) This process is repeated until B% of the population is affected by contagion, which defines the baseline, or the earliest time at which an infection is assumed to be observed; information available at baseline can be included in the estimation procedure without risk of bias in the exposure effect estimate. Half of the clusters are exposed (A_{i} = 1) and half unexposed (A_{i} = 0), at random. The contagion process continues for T time steps, with a per-infected-node probability of infecting a susceptible network neighbor of p_{0} in unexposed clusters and p_{1} in exposed clusters. The contagion process ends at time T.
1. S% of all nodes are selected uniformly at random to be initially affected.
2. Until B% population incidence:
   For each affected node j (in random order):
   a. Successively select z_{ij} neighbors j_{1}, ..., j_{z_{ij}}.
   b. If neighbor j′ ∈ {j_{1}, ..., j_{z_{ij}}} is already affected, do nothing. If not, affect with probability p_{0}.
3. Repeat T times:
   For each affected node j (in random order):
   a. Successively select z_{ij} neighbors j_{1}, ..., j_{z_{ij}}.
   b. If neighbor j′ ∈ {j_{1}, ..., j_{z_{ij}}} is already affected, do nothing. If not, affect with probability:
      p_{0} for those in unexposed clusters.
      p_{1} for those in exposed clusters.
Algorithm 1: Stochastic Compartmental Contagion Process
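Algorithm 1 can be sketched in Python as follows. This is a minimal illustration for a single cluster rather than the simulation code used in the study, and it makes several assumptions that the algorithm leaves open: seeding and the baseline threshold are applied within the cluster instead of across the whole study population, nodes affected during a sweep do not transmit until the next sweep, and a hypothetical `max_sweeps` guard stops the pre-baseline phase if the threshold cannot be reached. The ring network in the usage example is also hypothetical.

```python
import random


def spread_step(neighbors, affected, affectivity, prob, rng):
    """One sweep over the currently affected nodes, in random order (steps a and b)."""
    order = [j for j, is_affected in enumerate(affected) if is_affected]
    rng.shuffle(order)
    for j in order:
        k = len(neighbors[j])
        z = k if affectivity == "degree" else min(1, k)  # degree vs. unit affectivity
        for target in rng.sample(neighbors[j], z):
            # Already affected neighbors are left unchanged.
            if not affected[target] and rng.random() < prob:
                affected[target] = True


def simulate_cluster(neighbors, exposed, seed_frac=0.02, baseline_frac=0.15, T=5,
                     p0=0.15, p1=0.12, affectivity="unit", seed=0, max_sweeps=10_000):
    """Return (baseline status, final status) for one cluster.

    neighbors : list of neighbor lists, one per node
    exposed   : True if the cluster is randomized to exposure (A_i = 1)
    """
    rng = random.Random(seed)
    n = len(neighbors)
    affected = [False] * n
    # Step 1: seed S% of nodes uniformly at random (at least one node).
    for j in rng.sample(range(n), max(1, round(seed_frac * n))):
        affected[j] = True
    # Step 2: spread with probability p0 until B% of nodes are affected.
    sweeps = 0
    while sum(affected) < baseline_frac * n and sweeps < max_sweeps:
        spread_step(neighbors, affected, affectivity, p0, rng)
        sweeps += 1
    baseline = list(affected)
    # Step 3: T further sweeps, with probability p1 in exposed clusters and p0 otherwise.
    prob = p1 if exposed else p0
    for _ in range(T):
        spread_step(neighbors, affected, affectivity, prob, rng)
    return baseline, affected


# Hypothetical usage: a ring network of 50 nodes in an exposed cluster.
n = 50
ring = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
baseline, final = simulate_cluster(ring, exposed=True)
print(sum(baseline), sum(final))
```

Because the SI process has no recovery, every node affected at baseline remains affected at the end of follow-up.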
Simulation Setting Parameters
Each simulated study consists of a contagion process propagating on the networks associated with m = 48 clusters of size 200, totaling 9600 individuals. The initial seed percentage S% is set to 2%, and the baseline discovery B% is set to 15%. For each scenario, we perform 1000 iterations of the simulation.
The number of clusters was chosen to be roughly comparable to cluster randomized trials investigating varied interventions such as influenza vaccines (Loeb et al., 2016) (52 clusters), hygiene programs (Freeman et al., 2012) (135 clusters over three treatment arms), and financial services (Ksoll et al., 2016) (46 clusters). The validity of the GEE robust sandwich variance estimates requires an asymptotically large number of independent clusters. Thus, it has been suggested that GEEs only be applied when the number of clusters exceeds 30 (Hayes and Moulton, 2017). When the number of clusters is small, a variety of small-sample bias corrections can be applied (Mancl and DeRouen, 2001; Fay and Graubard, 2001; Kauermann and Carroll, 2001). Permutation tests are another alternative and may be used for valid hypothesis testing even if the number of clusters is low. For our application, we use the Fay small-sample adjustment (Fay and Graubard, 2001) built into the CRTgeeDR R package from Prague et al. (2017).
Sensitivity Analysis and Scenarios
To estimate the sensitivity of estimation performance to simulation features, we vary four aspects of the simulation in two ways each: degree distribution (Poisson or heavy-tailed lognormal distribution), assortativity (−0.2 and 0.2), block mixing structure (random or heterogeneous), and affectivity (unit or degree). This leads to a total of 2^{4} = 16 scenarios.
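The full factorial design can be enumerated with a short Python sketch. The factor and level names are hypothetical labels chosen for illustration; in particular, the two assortativity levels are labeled generically as "low" and "high".

```python
from itertools import product

# Four simulation features, each varied in two ways (labels are illustrative).
FACTORS = {
    "degree_distribution": ("poisson", "lognormal"),
    "assortativity": ("low", "high"),
    "block_structure": ("random", "heterogeneous"),
    "affectivity": ("unit", "degree"),
}

# Every combination of levels defines one simulation scenario.
scenarios = [dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())]

print(len(scenarios))
print(scenarios[0])
```

Crossing all four factors gives 2^{4} = 16 scenarios, and each feature takes each of its levels in exactly half of them.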
Contagion Process
The contagion process continues for T = 5 time steps. For simplicity, we assume an exposure effect that reduces contagious spread: the probability each affected node affects a selected neighbor is p_{0} = 0.15 in unexposed clusters and p_{1} = 0.12 in exposed clusters. We chose a small effect of intervention so that the power of the trials remains low enough to observe changes due to augmentation of the GEE. Due to major differences in transmission in scenarios including unit versus degree affectivity, we divide the probability of transmission by a constant factor of 9 in all of the scenarios exhibiting degree affectivity. This constant was chosen to attain similar levels of power between the unit affectivity and degree affectivity scenarios when estimates are obtained from the unadjusted GEE (an average of 42.6% for the unit affectivity scenarios and 41.8% for the degree affectivity scenarios).
Evaluation of overall approach
Performance Metrics
In principle, the randomized setting should be free from confounding factors, and the augmented GEE approach should reduce variance in the estimation of β_{A}. All GEEs used an exchangeable correlation structure. Based on the simulation specifications, we define estimation metrics for the average exposure effect β_{A}. For r = 1, ..., R replicates, the estimate of exposure effect is denoted β̂_{r}, and each estimate has an associated model-based estimated standard deviation. These standard deviation estimates are compared with the empirical bootstrap estimates, which are calculated as the standard deviations of all estimates β̂_{r} in the R replicates. Empirical power and coverage are derived from these simulation study point estimates and their confidence intervals. Power is based on 0.05 level tests of the exposure effect. Coverage is defined as the percentage of replicates for which the true value of exposure effect is included in the 95% confidence interval of its estimate. Since the contagious spreading rates in unexposed and exposed groups are specified as p_{0} and p_{1} at the node level, the true value for the average exposure effect must be estimated through simulation (Woodward, 2001). To estimate the true underlying exposure effect β_{A}, we simulate an additional 20,000 clustered randomized trials, each with a uniquely generated network satisfying the scenario of interest, and obtain the mean treatment effect, averaged over all trials. We define improvement in estimation efficiency as the percent reduction in root mean squared error (RMSE) for each covariate set in the outcome model, comparing the augmented GEE to the unadjusted GEE: 100 × (RMSE_{unadjusted} − RMSE_{augmented}) / RMSE_{unadjusted}.
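The performance metrics described above can be computed from the replicate estimates with a short Python sketch. The function names are hypothetical, the replicate values in the example are invented solely to exercise the code, and the confidence intervals and tests are assumed here to be Wald-type with a normal quantile, which may differ from the small-sample adjusted intervals used in the study.

```python
import math
import statistics

Z_975 = 1.959963984540054  # 97.5th percentile of the standard normal distribution


def performance(estimates, std_errors, true_beta):
    """Summaries over R replicates of one estimator."""
    R = len(estimates)
    squared_errors = [(b - true_beta) ** 2 for b in estimates]
    covered = sum(abs(b - true_beta) <= Z_975 * s for b, s in zip(estimates, std_errors))
    rejected = sum(abs(b) > Z_975 * s for b, s in zip(estimates, std_errors))
    return {
        "bias": statistics.fmean(estimates) - true_beta,
        "model_se": statistics.fmean(std_errors),    # average estimated standard deviation
        "empirical_se": statistics.stdev(estimates),  # standard deviation across replicates
        "rmse": math.sqrt(statistics.fmean(squared_errors)),
        "coverage": covered / R,                      # 95% interval contains the true value
        "power": rejected / R,                        # 0.05-level test of no exposure effect
    }


def rmse_reduction(rmse_unadjusted, rmse_augmented):
    """Percent reduction in RMSE relative to the unadjusted GEE."""
    return 100.0 * (rmse_unadjusted - rmse_augmented) / rmse_unadjusted


# Invented replicate values, used only to exercise the functions.
result = performance([-0.2, -0.3, -0.1, -0.4], [0.1, 0.1, 0.1, 0.1], true_beta=-0.25)
print(result)
print(rmse_reduction(0.10, 0.08))
```

A positive value of `rmse_reduction` indicates that augmentation improved on the unadjusted GEE.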
Simulation Results
Due to inherent randomness, the covariates included in the outcome model may be correlated with the exposure. By augmenting the GEE with such covariates, we can adjust for that imbalance and obtain higher efficiency. Bias in estimation, average model standard error, empirical standard error, RMSE reduction, power, and coverage are provided in Table 2; full simulation results are given in Section 3 in Supplementary Material. Results are averaged across the 16 observational scenarios with the standard deviations across scenarios shown in parentheses. Averaged across all simulation replications and variants, inclusion of covariates in outcome and propensity score models led to gains in efficiency. We also find that a single covariate, the number of affected nodes in the same component as the node at baseline (X^{(10)}), provides a reduction in RMSE comparable to or higher than that achieved by the variable selection approach.
Network feature selection
The covariates selected in a stepwise procedure for inclusion in the outcome model vary across the different simulated datasets. To assess which covariates are most useful for adjustment in the outcome model, we measure the frequency of covariate inclusion and its variability by simulation scenario (see Figure 3). Baseline status (X^{(0)}), degree (X^{(1)}) and covariates related to contagion at baseline (covariates X^{(9)} and X^{(11)}) are included most often; others are selected in a range of frequencies.
Sensitivity Analysis Results
Features of the simulated contagion process affect performance metrics such as power and improvement in RMSE. To evaluate the sizes of these effects, we used simple linear regression treating the RMSE as the outcome and simulation features (i.e., mean degree, degree distribution, assortativity, block mixing structure, infectivity mode, and infection prevalence at baseline) as covariates; these are coded as binary variables. The fitted coefficients represent the metric change when changing simulation features, holding all other simulation features constant. The percent improvement in RMSE is shown in Table 3. For example, holding all other simulation features constant and using stepwise model selection, a contagion process exhibiting degree affectivity shows an additional RMSE reduction of 7.2 percentage points compared to a contagion process exhibiting unit affectivity. Using covariate X^{(9)} as a predictor yields much larger reductions in RMSE if the contagion exhibits degree affectivity instead of unit affectivity. This suggests that the contagion status of a node’s neighbors can be quite predictive of the risk of the node becoming affected, especially in the degree affectivity case. Table 4 shows analogous results for changes in power, holding all other simulation features constant.
4. Application to Clustered Randomized Trial
In the following section, we demonstrate the application of our method to a CRT conducted at the University of California San Diego (UCSD).
Wastewater Monitoring Cluster Randomized Trial
Wastewater monitoring paired with automated reporting systems can be utilized for forecasting COVID-19 cases and preventing outbreaks. Here, we consider the clustered randomized trial component of the wastewater surveillance program implemented at UCSD (Karthikeyan et al., 2021). In this program, wastewater monitors were placed inside manholes associated with selected UCSD residential buildings. Between November 23 and December 29, 2020, manholes were randomized to either receive wastewater monitors or not. The intervention was also paired with an alert system that notified residents when COVID-19 was detected in the wastewater associated with their building.
In keeping with the structure of our simulation example, we define each cluster by a unique manhole. Thus, a single cluster is composed of the set of residence buildings associated with a single manhole. Each residence building is represented by a single node within the contact network associated with its respective cluster. For this simple example, we considered each cluster to be a complete (maximum density) network, such that there exists an edge between every pair of buildings that reside in the same cluster. In total, we considered 41 manholes, and the sizes of manhole-associated clusters ranged between 1 and 7 residence buildings.
During the period of this study, all on-campus UCSD students were mandated to adhere to a biweekly testing schedule. We define the outcome of interest as the total number of positive COVID-19 tests registered during the period of randomization. During the period of randomization, a total of 11930 tests were returned, with 68 of these tests indicating a positive COVID-19 result.
Results
In Table 5, we summarize results obtained by augmenting a GEE with a selection of the various network features defined in Section 2. The definition of several network covariates requires knowledge about the number of cases at baseline, prior to the delivery of randomized treatment. As a proxy for this measure, we consider the number of positive cases registered from each residence during the three months prior to the introduction of wastewater monitoring devices. During this initial baseline period, a total of 20655 tests were returned, and 28 showed a positive COVID-19 result.
In addition, we note that because the network for every cluster was complete, some of the network features are identical. A few examples are X^{(9)} and X^{(10)}, or X^{(1)} and X^{(2)}. We also define an additional network covariate, X^{(13)}, as the sum of the total cases occurring in neighbors at baseline.
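The way distinct network features coincide on complete networks can be checked with a small Python sketch. It uses degree and mean neighbor degree from Section 2.1 as one illustrative pair, without assuming which Table 1 labels they carry; the function names are hypothetical, and a single-building cluster is assigned a mean neighbor degree of 0.0 by assumption.

```python
def complete_network(n):
    """Adjacency matrix with an edge between every pair of distinct nodes."""
    return [[0 if i == j else 1 for j in range(n)] for i in range(n)]


def degree_and_mean_neighbor_degree(adj):
    """Degree k_i and the unweighted mean of the neighbors' degrees for each node."""
    k = [sum(row) for row in adj]
    mnd = []
    for row in adj:
        nbr = [k[j] for j, e in enumerate(row) if e]
        mnd.append(sum(nbr) / len(nbr) if nbr else 0.0)
    return k, mnd


# Cluster sizes in the data example ranged between 1 and 7 buildings.
for n in range(1, 8):
    k, mnd = degree_and_mean_neighbor_degree(complete_network(n))
    print(n, k[0], mnd[0])
```

For every cluster size, both features equal n − 1 for every node, so they carry the same information and vary only between clusters of different sizes.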
All of the selected network covariates led to significant improvements in variance estimates. Particularly of note is the network covariate X^{(1)}, the individual-level covariate of node degree. In most of our simulated scenarios, X^{(1)} led only to minor improvements in efficiency, but in this data example, its associated improvement is significant. This is likely due to a higher variance in degree distribution among clusters in the UCSD trial when compared to our simulations, where degree distribution was controlled by scenario.
5. Discussion
Spread of a contagious process depends on contact network structure and contagious process dynamics. To estimate the marginal effect of exposure on this process, augmented GEE methods as described above can be used to reduce the variation of the exposure effect estimate. Even after randomization, chance imbalances in covariates can still exist between treatment arms. Through a user-specified outcome model, estimation efficiency can be improved by taking these covariate imbalances into account (Stephens et al., 2012).
In settings where network information is available, augmenting the GEE procedure is particularly important. The space of possible networks is high-dimensional; hence network structure, even when described by summary statistics, can vary widely between clusters, and imbalance between treatment arms is almost guaranteed. Furthermore, the degree to which differing networks impact individual outcomes also varies. Due to high potential variability between treatment arms, the inclusion of relevant network covariates in the augmented GEE may lead to significant efficiency gains.
Our focus was investigation of the extent to which power and RMSE in exposure effect estimates depend on specific network properties. Adjusting for contact network features and baseline contagion improved power and yielded a considerable reduction in RMSE across a range of simulation settings.
In our simulation studies, covariates derived from a variety of network features differed in their usefulness in reducing exposure estimate variance. They also differed in the degree to which collection of the required information would be feasible. Obtaining the degree of an individual might be relatively easy to collect in some settings by using a simple survey, but the gains we observed from using this covariate were very modest. We found that the number of nodes in the component affected at baseline yielded the largest reduction in variance; though perhaps more challenging to obtain, this quantity may also be feasible to estimate in some important settings, such as infectious disease outbreaks. The number of neighbors affected at baseline also yielded large reductions in variance and may be estimated under similar settings. While not a network covariate, individual baseline status was selected for in the stepwise models 100% of the time, and as a lone augmentation term, still yielded significant reductions in variance.
In addition, we applied our method to a real data example from a clustered randomized trial on the efficacy of wastewater monitoring at UCSD. Again, augmentation using a variety of network features led to significant increases in statistical efficiency. Notably, one feature (X^{(1)}) that did not lead to significant efficiency gains in the simulated scenarios led to a large improvement in the data example. While the degree distribution across clusters was relatively homogeneous in the simulated scenarios, the complete networks in the data example had high variance in degree distributions. Thus, correcting for chance imbalances in degree distributions across the treatment and control arms was far more effective in the data example. For the data example, we chose to assume complete networks existed within each cluster, even though one might imagine more advanced methods of inferring contact networks between residence buildings. However, we note that augmentation with network features will lead to efficiency gains as long as the network features are imbalanced across treatment arms and are also associated with the outcome of interest. Under these conditions, the use of estimated contact networks or proxies for network features can still increase efficiency.
This work invites several extensions. Information may be missing or misreported for individual outcomes, contact network data, or both; incompleteness of data may lead to bias or increased variation in estimating the exposure effect. Methods for addressing missing information on both networks and outcomes need to be developed. Although we carried out a wide range of simulations, the range of possible scenarios in which the methods might be relevant is large and beyond the scope of any single paper to address. Simulation of other settings would be useful to help guide randomized studies in which augmentation by network features may help improve efficiency of estimation. Appropriate methods still need to be developed for observational studies with network-related confounding. Lastly, the SI contagion model presented in this work most closely resembles the effect of an educational intervention; to model an infectious disease, SIR, SIS, and other compartmental models may be more realistic.
6. Appendix
Degree assortative rewiring
This is performed by randomly selecting two edges within a block pair and rewiring them, as described in Algorithm 2. A diagram of this process is shown in Figure 4. To decrease assortativity, the inequality in Step 3 must be reversed. There exist cases where it is not possible to obtain high (or low) enough assortativity after specifying the degree of each node. When this occurs, we accept the network if assortativity converges to within 10% of the desired assortativity, and regenerate the entire network if it does not.
Algorithm 2 Edge Rewiring to Adjust Assortativity
1. Select two blocks b_{l} and b_{l′} at random.
2. Select two edges (N_{1}, N_{2}) and (N_{3}, N_{4}) at random between blocks b_{l} and b_{l′}.
3. If |k_{N_{1}} − k_{N_{2}}| + |k_{N_{3}} − k_{N_{4}}| > |k_{N_{1}} − k_{N_{4}}| + |k_{N_{2}} − k_{N_{3}}|:
   Remove edges (N_{1}, N_{2}) and (N_{3}, N_{4})
   Add edges (N_{1}, N_{4}) and (N_{2}, N_{3})
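A single pass of Algorithm 2 can be sketched as follows. This is an illustrative sketch rather than the code used for the simulations: it assumes that Step 3 compares sums of absolute degree differences, and it adds two safeguards not stated in the algorithm (swaps that would create a self-loop or a duplicate edge are rejected). The names `rewire_step` and `degrees`, the edge representation, and the toy network are hypothetical.

```python
import random


def degrees(edges):
    """Return a dict node -> degree for a set of undirected edges."""
    deg = {}
    for edge in edges:
        for node in edge:
            deg[node] = deg.get(node, 0) + 1
    return deg


def rewire_step(edges, degree, block, rng, increase=True):
    """Attempt one degree-preserving swap; return True if edges were rewired.

    edges:    set of frozenset({u, v}) undirected edges (modified in place)
    degree:   dict node -> degree (a swap leaves every degree unchanged)
    block:    dict node -> block label
    increase: True to raise assortativity, False to reverse the inequality
    """
    labels = sorted(set(block.values()))
    b1, b2 = rng.choice(labels), rng.choice(labels)  # Step 1

    # Orient each candidate edge so its first endpoint lies in b1, second in b2.
    candidates = []
    for edge in edges:
        u, v = tuple(edge)
        if block[u] == b1 and block[v] == b2:
            candidates.append((u, v))
        elif block[v] == b1 and block[u] == b2:
            candidates.append((v, u))
    if len(candidates) < 2:
        return False
    (n1, n2), (n3, n4) = rng.sample(sorted(candidates), 2)  # Step 2

    # Added safeguards (assumptions): no self-loops, no duplicate edges.
    if len({n1, n2, n3, n4}) < 4:
        return False
    new_a, new_b = frozenset((n1, n4)), frozenset((n2, n3))
    if new_a in edges or new_b in edges:
        return False

    # Step 3: compare degree mismatch before and after the proposed swap.
    current = abs(degree[n1] - degree[n2]) + abs(degree[n3] - degree[n4])
    proposed = abs(degree[n1] - degree[n4]) + abs(degree[n2] - degree[n3])
    accept = current > proposed if increase else current < proposed
    if not accept:
        return False
    edges.difference_update({frozenset((n1, n2)), frozenset((n3, n4))})
    edges.update({new_a, new_b})
    return True


def mismatch(edges, degree):
    """Total absolute degree difference across all edges."""
    return sum(abs(degree[u] - degree[v]) for u, v in map(tuple, edges))


# Toy network (hypothetical): six nodes in two blocks.
rng = random.Random(0)
edges = {frozenset(e) for e in [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (4, 5)]}
block = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
degree = degrees(edges)
before = mismatch(edges, degree)
for _ in range(200):
    rewire_step(edges, degree, block, rng)
after = mismatch(edges, degree)
```

Because each swap removes one edge and adds one edge at each of the four nodes, the degree sequence is preserved throughout. In practice the step would be repeated until assortativity reaches the tolerance described above, which this sketch does not implement.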
Supplementary Material
* Maxwell Wang and Patrick Staples contributed equally to this paper.
Acknowledgments
We are grateful for support from NIH Grants 2T32AI00735826, R37 AI51164, R24 AI106039, R01 AI147441, and R01 AI138901. We thank Natasha Martin, Jingjing Zou, and Tuo Lin for their assistance in preparing the UCSD wastewater monitoring dataset as well as for their important insights about the nature and conduct of the study. We would also like to acknowledge the large number of individuals responsible for the funding, installation, and monitoring of the wastewater samplers.
Ethics
The UCSD institutional review board deemed investigation of effects of wastewater testing not to be human subject research, as personally identifiable information was not recorded.