Background: Machine learning (ML) methods, including super learner, hybrid learner, and collaborative targeted maximum likelihood estimation (CTMLE), promise to reduce bias from unmeasured confounding and to optimize confounding control in a given data set when used for high-dimensional confounding adjustment. However, concerns remain about whether these methods can avoid bias amplification from inappropriate adjustment for instrumental variables or colliders.
Objectives: To provide an in-depth overview of recent advancements that extend the frequently used high-dimensional propensity score (hdPS) framework with ML to help automate covariate identification and prioritization for improved causal effect estimation. Researchers working with high-dimensional real-world data, comprising electronic health record, administrative, clinical, and other healthcare data, will benefit from attending this symposium.
Description: We will first describe how hdPS optimizes confounder identification and selection, free of cognitive biases, through automated feature generation and prioritization, and why it may reduce bias from unmeasured confounding through proxy adjustment. We will offer insight into the conditions under which the assumption of reduced unmeasured confounding is likely to hold, and whether this benefit might outweigh the potential cost of adjusting for colliders or instrumental variables, which can increase residual bias. Additional ML algorithms are thought to optimize covariate prioritization in the framework of causal inference. We will summarize recently proposed hdPS extensions that incorporate such algorithms and assess their ability to address potential complications to the causal interpretation of findings. The discussion of the pros and cons of these approaches will focus on: a) data structures in which the methods might perform similarly or differently; and b) cases of potential model misspecification arising from investigator-driven variable selection in propensity score estimation. Finally, we will introduce CTMLE, which incorporates ML in the estimation of both the treatment mechanism and the outcome regression and provides statistically efficient estimates of target estimands in causal inference. We will discuss the CTMLE framework and its use of cross-validation to select tuning parameters of the embedded ML algorithms.
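As an illustration of the covariate prioritization step described above, the following minimal Python sketch ranks binary candidate covariates by the absolute log of a Bross-type bias multiplier, the kind of score commonly associated with hdPS. It is a simplified sketch under stated assumptions, not the symposium's implementation: the function names, the covariate names, and all numeric inputs are hypothetical, and the handling of protective covariate-outcome associations used in practice is omitted.

```python
import math


def bias_multiplier(p_c1, p_c0, rr_cd):
    """Bross-type multiplicative bias for one binary covariate.

    p_c1  : covariate prevalence among the treated
    p_c0  : covariate prevalence among the untreated
    rr_cd : covariate-outcome relative risk (must be > 0)
    """
    return (p_c1 * (rr_cd - 1.0) + 1.0) / (p_c0 * (rr_cd - 1.0) + 1.0)


def prioritize(covariates, k):
    """Return the names of the k covariates with the largest |log(bias)|."""
    scored = [
        (name, abs(math.log(bias_multiplier(p1, p0, rr))))
        for name, (p1, p0, rr) in covariates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scored[:k]]


# Hypothetical candidates: (prevalence treated, prevalence untreated, RR with outcome)
candidates = {
    "dx_code_A": (0.30, 0.10, 2.0),    # imbalanced and outcome-related
    "rx_code_B": (0.20, 0.20, 3.0),    # balanced across treatment groups
    "proc_code_C": (0.05, 0.15, 1.5),  # imbalanced in the other direction
}

top = prioritize(candidates, 2)
print(top)
```

Under this score, a covariate with no association with the outcome (a relative risk of 1) receives a multiplier of exactly 1 and therefore the lowest possible priority, which is one reason outcome-informed ranking is thought to limit the selection of instrumental variables.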