learning representations for counterfactual inference github

Finally, although TARNETs trained with PM have similar asymptotic properties to kNN, we found that TARNETs trained with PM significantly outperformed kNN in all cases. We consider the task of answering counterfactual questions such as, "Would this patient have lower blood sugar had she received a different medication?". PSM_MI was overfitting to the treated group. Flexible and expressive models for learning counterfactual representations that generalise to settings with multiple available treatments could potentially facilitate the derivation of valuable insights from observational data in several important domains, such as healthcare, economics and public policy. We found that the NN-PEHE correlates significantly better with the real PEHE than MSE (Figure 2), that including more matched samples in each minibatch improves the learning of counterfactual representations, and that PM handles an increasing treatment assignment bias better than existing state-of-the-art methods. For the IHDP and News datasets we respectively used 30 and 10 optimisation runs for each method using randomly selected hyperparameters from predefined ranges (Appendix I). The individual treatment effect (ITE) is sometimes also referred to as the conditional average treatment effect (CATE).
Perfect Match: A Simple Method for Learning Representations For Counterfactual Inference With Neural Networks. Correlation MSE and NN-PEHE with PEHE (Figure 3), https://cran.r-project.org/web/packages/latex2exp/vignettes/using-latex2exp.html, The available command line parameters for runnable scripts are described in, You can add new baseline methods to the evaluation by subclassing, You can register new methods for use from the command line by adding a new entry to the.

PM is easy to implement, compatible with any architecture, does not add computational complexity or hyperparameters, and extends to any number of treatments. PM effectively controls for biased assignment of treatments in observational data by augmenting every sample within a minibatch with its closest matches by propensity score from the other treatments. 167302 within the National Research Program (NRP) 75 "Big Data". $\kappa = 0$ indicates no assignment bias. Given the training data with factual outcomes, we wish to train a predictive model $\hat{f}$ that is able to estimate the entire potential outcomes vector $\hat{Y}$ with $k$ entries $\hat{y}_j$. We found that PM better conforms to the desired behavior than PSM_PM and PSM_MI.
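The minibatch augmentation described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `augment_with_matches`, the list-based data layout, and the choice of matching on the absolute difference in the estimated propensity for the target treatment are all assumptions made for this example.

```python
def augment_with_matches(batch_idx, treatments, propensities, num_treatments):
    """Return the batch indices, each followed by its nearest matches.

    treatments[i]      -- observed treatment of sample i (hypothetical layout)
    propensities[i][t] -- an estimate of p(t | x_i) (hypothetical layout)
    """
    # Group sample indices by the treatment they actually received.
    by_treatment = {t: [] for t in range(num_treatments)}
    for i, t in enumerate(treatments):
        by_treatment[t].append(i)

    augmented = []
    for i in batch_idx:
        augmented.append(i)
        for t in range(num_treatments):
            # Skip the sample's own treatment and treatments with no samples.
            if t == treatments[i] or not by_treatment[t]:
                continue
            # Closest sample that received t, measured by propensity for t.
            match = min(
                by_treatment[t],
                key=lambda j: abs(propensities[j][t] - propensities[i][t]),
            )
            augmented.append(match)
    return augmented
```

With `k` treatments, every original sample contributes up to `k - 1` matched samples, so the augmented batch contains roughly one sample per treatment for each original entry.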
The ATE measures the average difference in effect across the whole population (Appendix B). Interestingly, we found a large improvement over using no matched samples even for relatively small percentages (<40%) of matched samples per batch. As outlined previously, if we were successful in balancing the covariates using the balancing score, we would expect that the counterfactual error is implicitly and consistently improved alongside the factual error. Examples of tree-based methods are Bayesian Additive Regression Trees (BART) Chipman et al. (2010); Chipman and McCulloch (2016) and Causal Forests (CF) Wager and Athey (2017). This makes it difficult to perform parameter and hyperparameter optimisation, as we are not able to evaluate which models are better than others for counterfactual inference on a given dataset. However, current methods for training neural networks for counterfactual inference on observational data are either overly complex, limited to settings with only two available treatments, or both. On the binary News-2, PM outperformed all other methods in terms of PEHE and ATE.

The Perfect Match code is available in the repository d909b/perfect_match (ICLR 2019). You can register new benchmarks for use from the command line by adding a new entry to the, After downloading IHDP-1000.tar.gz, you must extract the files into the.
By using a head network for each treatment, we ensure $t_j$ maintains an appropriate degree of influence on the network output. The shared layers are trained on all samples. A supervised model naïvely trained to minimise the factual error would overfit to the properties of the treated group, and thus not generalise well to the entire population. Counterfactual inference enables one to answer "What if?" questions, such as "What would be the outcome if we gave this patient treatment $t_1$?". This regularises the treatment assignment bias but also introduces data sparsity as not all available samples are leveraged equally for training. For the Python dependencies, see setup.py. Observational studies are rising in importance due to the widespread accumulation of data in fields such as healthcare, education, employment and ecology. The IHDP dataset Hill (2011) contains data from a randomised study on the impact of specialist visits on the cognitive development of children, and consists of 747 children with 25 covariates describing properties of the children and their mothers. We propose a new algorithmic framework for counterfactual inference which brings together ideas from domain adaptation and representation learning.
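The shared-layers-plus-heads design mentioned above can be illustrated with a toy forward pass. This is a sketch under simplifying assumptions: a single shared ReLU layer, one linear head per treatment, and a squared factual loss. The helper names (`dense`, `predict_all`, `factual_loss`) are hypothetical and do not come from the repository.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def dense(x, weights, bias):
    # One fully connected layer: weights is a list of rows, one per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def predict_all(x, shared, heads):
    """Predict one potential outcome per treatment for covariates x."""
    hidden = relu(dense(x, *shared))          # shared layers see every sample
    return [dense(hidden, *head)[0] for head in heads]


def factual_loss(x, t, y, shared, heads):
    # Only the head of the observed treatment t contributes to the loss,
    # because only the factual outcome y is observed.
    return (predict_all(x, shared, heads)[t] - y) ** 2
```

Because the loss touches only the head indexed by `t`, each head is fitted on the samples that received its treatment, while the shared layer receives gradient signal from all samples.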
Similarly, in economics, a potential application would, for example, be to determine how effective certain job programs would be based on results of past job training programs LaLonde (1986). Note that we only evaluate PM, + on X, + MLP, PSM on Jobs. Our experiments demonstrate that PM outperforms a number of more complex state-of-the-art methods in inferring counterfactual outcomes across several benchmarks, particularly in settings with many treatments. We extended the original dataset specification in Johansson et al. (2016) to enable the simulation of arbitrary numbers of viewing devices. The treatment propensity $p(t|X)$ for PM was estimated on the training set. The advantage of matching on the minibatch level, rather than the dataset level Ho et al. (2011), is that it reduces the variance during training, which in turn leads to better expected performance for counterfactual inference (Appendix E). We reassigned outcomes and treatments with a new random seed for each repetition.

We focus on counterfactual questions raised by what are known as observational studies. Finally, we show that learning representations that encourage similarity (also called balance) between the treatment and control populations leads to better counterfactual inference; this is in contrast to many methods which attempt to create balance by re-weighting samples (e.g., Bang & Robins, 2005; Dudík et al., 2011; Austin, 2011). However, in many settings of interest, randomised experiments are too expensive or time-consuming to execute, or not possible for ethical reasons Carpenter (2014); Bothwell et al. In these situations, methods for estimating causal effects from observational data are of paramount importance.
Following Imbens (2000); Lechner (2001), we assume unconfoundedness, which consists of three key parts: (1) Conditional Independence Assumption: The assignment to treatment $t$ is independent of the outcome $y_t$ given the pre-treatment covariates $X$, (2) Common Support Assumption: For all values of $X$, it must be possible to observe all treatments with a probability greater than 0, and (3) Stable Unit Treatment Value Assumption: The observed outcome of any one unit must be unaffected by the assignments of treatments to other units. How do the learning dynamics of minibatch matching compare to dataset-level matching? Shalit et al. (2017) subsequently introduced the TARNET architecture to rectify this issue. We perform experiments that demonstrate that PM is robust to a high level of treatment assignment bias and outperforms a number of more complex state-of-the-art methods in inferring counterfactual outcomes across several benchmark datasets.
As training data, we receive samples $X$ and their observed factual outcomes $y_j$ when applying one treatment $t_j$; the other outcomes cannot be observed. PM is easy to use with existing neural network architectures, simple to implement, and does not add any hyperparameters or computational complexity. Both PEHE and ATE can be trivially extended to multiple treatments by considering the average PEHE and ATE between every possible pair of treatments. For low-dimensional datasets, the covariates $X$ are a good default choice as their use does not require a model of treatment propensity.

Perfect Match is a simple method for learning representations for counterfactual inference with neural networks. Author(s): Patrick Schwab, ETH Zurich patrick.schwab@hest.ethz.ch, Lorenz Linhardt, ETH Zurich llorenz@student.ethz.ch and Walter Karlen, ETH Zurich walter.karlen@hest.ethz.ch. See below for a step-by-step guide for each reported result. The script will print all the command line configurations (13000 in total) you need to run to obtain the experimental results to reproduce the IHDP results. Repeat for all evaluated method / benchmark combinations.
To ensure that differences between methods of learning counterfactual representations for neural networks are not due to differences in architecture, we based the neural architectures for TARNET, CFRNET_Wass, PD and PM on the same, previously described extension of the TARNET architecture Shalit et al. (2017). This repository contains the source code used to evaluate PM and most of the existing state-of-the-art methods at the time of publication of our manuscript. We then defined the unscaled potential outcomes $\bar{y}_j = \tilde{y}_j \cdot [D(z(X), z_j) + D(z(X), z_c)]$ as the ideal potential outcomes $\tilde{y}_j$ weighted by the sum of distances to centroids $z_j$ and the control centroid $z_c$ using the Euclidean distance as distance $D$. We assigned the observed treatment $t$ using $t|x \sim \mathrm{Bern}(\mathrm{softmax}(\kappa \bar{y}_j))$ with a treatment assignment bias coefficient $\kappa$, and the true potential outcome $y_j = C \bar{y}_j$ as the unscaled potential outcomes $\bar{y}_j$ scaled by a coefficient $C = 50$. This is sometimes referred to as bandit feedback (Beygelzimer et al., 2010).
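The outcome weighting and biased treatment assignment of the News benchmark described above can be sketched roughly as below. This is an illustrative reading of the description, not the benchmark's code: the function names are hypothetical, the topic-space vectors are plain Python lists, and sampling one treatment from the softmax distribution is an assumption about how the assignment is drawn when more than two treatments exist.

```python
import math
import random


def softmax(values):
    peak = max(values)                      # subtract the max for stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def unscaled_outcomes(z_x, centroids, z_control, ideal_outcomes):
    # Ideal outcome of each treatment, weighted by the sum of distances to
    # that treatment's centroid and to the control centroid.
    return [y * (euclidean(z_x, z_j) + euclidean(z_x, z_control))
            for y, z_j in zip(ideal_outcomes, centroids)]


def assign_treatment(unscaled, kappa, rng):
    # kappa = 0 gives uniform probabilities, i.e. no assignment bias.
    probs = softmax([kappa * y for y in unscaled])
    treatment = rng.choices(range(len(unscaled)), weights=probs, k=1)[0]
    return treatment, probs


def true_outcomes(unscaled, scale=50.0):
    return [scale * y for y in unscaled]
```

Larger values of `kappa` concentrate the assignment probability on treatments with higher unscaled outcomes, which is what makes the observed data increasingly biased.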
We outline the Perfect Match (PM) algorithm in Algorithm 1 (complexity analysis and implementation details in Appendix D). Formally, this approach is, when converged, equivalent to a nearest neighbour estimator for which we are guaranteed to have access to a perfect match, i.e. an exact match in the balancing score, for observed factual outcomes. This setup comes up in diverse areas, for example off-policy evaluation in reinforcement learning (Sutton & Barto, 1998). However, they are predominantly focused on the most basic setting with exactly two available treatments. The distribution of samples may therefore differ significantly between the treated group and the overall population. We perform extensive experiments on semi-synthetic, real-world data in settings with two and more treatments.

[Figure: Comparison of several state-of-the-art methods for counterfactual inference on the test set of the News-8 dataset when varying the treatment assignment imbalance.]

[Table: Comparison of methods for counterfactual inference with two and more available treatments on IHDP and News-2/4/8/16.]
The fundamental problem in treatment effect estimation from observational data is confounder identification and balancing. In the binary setting, the PEHE measures the ability of a predictive model to estimate the difference in effect between two treatments $t_0$ and $t_1$ for samples $X$. The multi-treatment ATE error averages the pairwise ATE errors over all treatment pairs:

$$\hat{\epsilon}_{mATE} = \frac{1}{\binom{k}{2}} \sum_{i=0}^{k-1} \sum_{j=0}^{i-1} \hat{\epsilon}_{ATE,i,j}$$

We assigned a random Gaussian outcome distribution with mean $\mu_j \sim N(0.45, 0.15)$ and standard deviation $\sigma_j \sim N(0.1, 0.05)$ to each centroid. To address the treatment assignment bias inherent in observational data, we propose to perform SGD in a space that approximates that of a randomised experiment using the concept of balancing scores. However, it has been shown that hidden confounders may not necessarily decrease the performance of ITE estimators in practice if we observe suitable proxy variables Montgomery et al. PM may be used for settings with any number of treatments, is compatible with any existing neural network architecture, simple to implement, and does not introduce any additional hyperparameters or computational complexity. Propensity Dropout (PD) Alaa et al. (2017) is another method using balancing scores that has been proposed to dynamically adjust the dropout regularisation strength for each observed sample depending on its treatment propensity.
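The PEHE and ATE metrics discussed in this section, and their extension to more than two treatments by averaging over all treatment pairs, can be sketched as follows. This sketch assumes access to the full potential outcome vectors, which is only possible on semi-synthetic benchmarks. It uses the squared-error form of the PEHE, and the function names are hypothetical.

```python
from itertools import combinations


def pehe(y_true, y_pred, a, b):
    """Mean squared error of the estimated effect between treatments a and b."""
    n = len(y_true)
    return sum(((yt[b] - yt[a]) - (yp[b] - yp[a])) ** 2
               for yt, yp in zip(y_true, y_pred)) / n


def ate_error(y_true, y_pred, a, b):
    """Absolute error of the estimated average effect between a and b."""
    n = len(y_true)
    true_ate = sum(yt[b] - yt[a] for yt in y_true) / n
    pred_ate = sum(yp[b] - yp[a] for yp in y_pred) / n
    return abs(true_ate - pred_ate)


def pairwise_mean(metric, y_true, y_pred, k):
    # Average the binary metric over all k-choose-2 treatment pairs.
    pairs = list(combinations(range(k), 2))
    return sum(metric(y_true, y_pred, a, b) for a, b in pairs) / len(pairs)
```

For `k = 2` the pairwise mean reduces to the ordinary binary metric, since there is only one pair of treatments.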
The samples $X$ represent news items consisting of word counts $x_i \in \mathbb{N}$, the outcome $y_j \in \mathbb{R}$ is the reader's opinion of the news item, and the $k$ available treatments represent various devices that could be used for viewing, e.g. smartphone, tablet, desktop, television or others Johansson et al. (2016). In addition to a theoretical justification, we perform an empirical comparison with previous approaches to causal inference from observational data. Causal Multi-task Gaussian Processes (CMGP) Alaa and van der Schaar (2017) apply a multi-task Gaussian Process to ITE estimation. GANITE uses a complex architecture with many hyperparameters and sub-models that may be difficult to implement and optimise. PM, in contrast, fully leverages all training samples by matching them with other samples with similar treatment propensities. We consider fully differentiable neural network models $\hat{f}$ optimised via minibatch stochastic gradient descent (SGD) to predict potential outcomes $\hat{Y}$ for a given sample $x$.
The primary metric that we optimise for when training models to estimate ITE is the PEHE Hill (2011). To rectify this problem, we use a nearest neighbour approximation $\hat{\epsilon}_{NN-PEHE}$ of the $\hat{\epsilon}_{PEHE}$ metric for the binary setting Shalit et al. (2017). Since the original TARNET was limited to the binary treatment setting, we extended the TARNET architecture to the multiple treatment setting (Figure 1). We found that including more matches indeed consistently reduces the counterfactual error up to 100% of samples matched. Propensity Score Matching (PSM) Rosenbaum and Rubin (1983) addresses this issue by matching on the scalar probability $p(t|X)$ of $t$ given the covariates $X$.

Run the command line configurations from the previous step in a compute environment of your choice.
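The nearest neighbour approximation of the PEHE mentioned above can be illustrated for the binary case. The sketch rests on assumptions that are not spelled out in the text: the unobserved outcome of each sample is replaced by the factual outcome of its nearest neighbour in the opposite treatment group, neighbours are found by squared Euclidean distance in covariate space, and both treatment groups are non-empty. The name `nn_pehe` is hypothetical.

```python
def nn_pehe(covariates, treatments, outcomes, y_pred):
    """Approximate the binary PEHE using factual outcomes only.

    covariates[i] -- feature list of sample i
    treatments[i] -- observed treatment of sample i (0 or 1)
    outcomes[i]   -- observed factual outcome of sample i
    y_pred[i]     -- predicted outcomes [y0_hat, y1_hat] for sample i
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    total = 0.0
    for i, (x, t, y) in enumerate(zip(covariates, treatments, outcomes)):
        # Nearest neighbour among samples that received the other treatment.
        others = [j for j, tj in enumerate(treatments) if tj != t]
        nn = min(others, key=lambda j: sqdist(x, covariates[j]))
        # Surrogate potential outcomes: own factual plus neighbour's factual.
        y1, y0 = (y, outcomes[nn]) if t == 1 else (outcomes[nn], y)
        total += ((y1 - y0) - (y_pred[i][1] - y_pred[i][0])) ** 2
    return total / len(covariates)
```

Because it needs no counterfactual outcomes, a quantity of this kind can be computed on a validation set and used for model selection, which is the role the NN-PEHE plays in the text.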
The multi-treatment PEHE averages the pairwise PEHE over all treatment pairs:

$$\hat{\epsilon}_{mPEHE} = \frac{1}{\binom{k}{2}} \sum_{i=0}^{k-1} \sum_{j=0}^{i-1} \hat{\epsilon}_{PEHE,i,j}$$

We then randomly pick $k+1$ centroids in topic space, with $k$ centroids $z_j$ per viewing device and one control centroid $z_c$. The optimisation of CMGPs involves a matrix inversion of $O(n^3)$ complexity that limits their scalability. This shows that propensity score matching within a batch is indeed effective at improving the training of neural networks for counterfactual inference. We therefore conclude that matching on the propensity score or a low-dimensional representation of $X$ and using the TARNET architecture are sensible default configurations, particularly when $X$ is high-dimensional. On IHDP, the PM variants reached the best performance in terms of PEHE, and the second best ATE after CFRNET.

[Figure: Comparison of the learning dynamics during training (normalised training epochs; from start = 0 to end = 100 of training, x-axis) of several matching-based methods on the validation set of News-8.]

[Figure: Correlation of MSE and NN-PEHE with PEHE. Scatterplots show a subsample of 1400 data points.]
Representation-balancing methods seek to learn a high-level representation for which the covariate distributions are balanced across treatment groups. This work contains the following contributions: We introduce Perfect Match (PM), a simple methodology based on minibatch matching for learning neural representations for counterfactual inference in settings with any number of treatments.