School of Engineering, December 2024

Both causal modeling and associative feature selection aim to identify essential relationships among a set of modeled variables, albeit with different objectives: causal modeling focuses on functional relationships, whereas associative feature selection focuses on statistical relationships. In practice, causal or associational relations must often be inferred from a dataset with missing entries, despite the bias missing data can introduce. Traditionally, missingness has been attributed to benign data-collection processes, but as datasets are increasingly curated from diverse sources, including untrusted parties, maliciously engineered missingness has become a realistic threat. To make reliable inferences, a practitioner therefore has to understand how the methods used to extract these causal or associational relationships are affected by both benign and adversarial missingness. This dissertation addresses these challenges in three parts.

First, we examine the impact of benign missing data on the model-X knockoffs framework, a recent method that provides false discovery rate (FDR) control across a broad range of feature selection techniques. We identify how the distribution shift that results from imputing missing entries, or from dropping partially observed data points, interferes with the model-X knockoffs' FDR guarantees. We then introduce sufficient conditions under which imputation using the generative model originally intended for FDR calibration preserves all assumptions of the model-X framework.

Second, we study the effects of adversarial missing data on causal structure learning from observational data. We introduce the adversarial missingness threat model, in which an attacker selectively omits data entries. Under this threat model, we show that an adversary can asymptotically render a corrupted causal model an optimal solution by concealing a subset of the features in certain observations. We also propose learning-based attacks that are effective with finite data and show that they can successfully obscure adversarially targeted causal relationships in a variety of experimental setups.

Third, we extend our study of adversarial missingness to associative learning tasks through a bi-level optimization approach. To tailor attacks to standard missing-data handling methods, we develop differentiable approximations for three widely used techniques: mean imputation, regression-based imputation, and complete-case analysis. Our results demonstrate that these attacks can effectively manipulate generalized linear models, shifting p-values from significant to insignificant while omitting less than 20% of the entries of the targeted features.
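To make the first part concrete, the sketch below pairs the standard equicorrelated Gaussian construction of model-X knockoffs (Candès et al., 2018) with imputation that samples missing entries from the same Gaussian model, mirroring the abstract's idea of reusing the FDR-calibration model for imputation. This is a minimal illustration under the assumption that X ~ N(mu, Sigma) with Sigma a known correlation matrix; the function names are ours, and the dissertation's sufficient conditions are not reproduced here.

```python
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, rng):
    """Equicorrelated model-X knockoffs for X ~ N(mu, Sigma)."""
    n, p = X.shape
    # Equicorrelated choice of s, assuming Sigma is a correlation matrix.
    s = np.full(p, min(1.0, 2.0 * np.linalg.eigvalsh(Sigma).min()))
    D = np.diag(s)
    Si = np.linalg.inv(Sigma)
    # Conditional law of the knockoff given X (row-vector form).
    cond_mean = mu + (X - mu) @ (np.eye(p) - Si @ D)
    cond_cov = 2.0 * D - D @ Si @ D
    L = np.linalg.cholesky(cond_cov + 1e-9 * np.eye(p))  # jitter for stability
    return cond_mean + rng.standard_normal((n, p)) @ L.T

def impute_from_model(X, mu, Sigma, rng):
    """Fill NaNs by sampling each row's missing block from its Gaussian conditional."""
    X = X.copy()
    for i in range(X.shape[0]):
        m = np.isnan(X[i])
        if not m.any():
            continue
        o = ~m
        Soo_inv = np.linalg.inv(Sigma[np.ix_(o, o)])
        cmean = mu[m] + Sigma[np.ix_(m, o)] @ Soo_inv @ (X[i, o] - mu[o])
        ccov = Sigma[np.ix_(m, m)] - Sigma[np.ix_(m, o)] @ Soo_inv @ Sigma[np.ix_(o, m)]
        X[i, m] = rng.multivariate_normal(cmean, ccov)
    return X

rng = np.random.default_rng(0)
p, n = 5, 1000
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) correlations
mu = np.zeros(p)
X = rng.multivariate_normal(mu, Sigma, size=n)
X_miss = X.copy()
X_miss[rng.random((n, p)) < 0.1] = np.nan   # 10% benign (MCAR) missingness
X_ko = gaussian_knockoffs(impute_from_model(X_miss, mu, Sigma, rng), mu, Sigma, rng)
```

Because the imputed entries are drawn from the same joint distribution used to build the knockoffs, the imputed data and their knockoffs remain exchangeable in this idealized setting; off-model imputers (e.g., column means) break this property, which is the distribution shift the first part studies.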
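For the second part, the following toy demonstration shows the mechanism behind adversarial missingness in its simplest form: in a two-variable linear model with a true edge x -> y, an attacker who hides x precisely in the most edge-supporting rows can make complete-case analysis attenuate the edge weight toward zero. The x * y thresholding rule and the omission fraction are illustrative choices of ours, not the dissertation's learning-based attack.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
x = rng.standard_normal(n)
y = 0.8 * x + rng.standard_normal(n)        # ground-truth edge x -> y, weight 0.8

# Adversary hides x exactly where the pair (x, y) most supports the edge:
# rows with the largest product x * y (illustrative rule, MNAR by design).
hide = x * y > np.quantile(x * y, 0.7)      # omit x in ~30% of the rows
x_obs = np.where(hide, np.nan, x)

keep = ~np.isnan(x_obs)                     # complete-case analysis
b_full = (x @ y) / (x @ x)
b_cc = (x_obs[keep] @ y[keep]) / (x_obs[keep] @ x_obs[keep])
print(f"full-data slope: {b_full:.2f}   complete-case slope: {b_cc:.2f}")
```

The complete-case slope is markedly attenuated relative to the full-data slope, illustrating how engineered missingness can obscure a targeted causal relationship from any structure learner that scores edges on the observed cases.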
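For the third part, the sketch below uses a greedy heuristic in place of the dissertation's differentiable bi-level attack: it omits the entries of a single targeted feature that contribute most to its regression slope, applies mean imputation, and checks the resulting p-value in a Gaussian GLM (OLS). The score function, stopping rule, and budget are our assumptions for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
x = rng.standard_normal(n)
y = 0.15 * x + rng.standard_normal(n)       # weak but significant effect

def mean_impute(x_col):
    """Replace NaNs with the observed mean (the defense being attacked)."""
    return np.where(np.isnan(x_col), np.nanmean(x_col), x_col)

def pvalue_of_x(x_col, y):
    """p-value of the x coefficient in an OLS fit with intercept."""
    return sm.OLS(y, sm.add_constant(x_col)).fit().pvalues[1]

budget = int(0.20 * n)                       # omit at most 20% of the target column
score = (x - x.mean()) * (y - y.mean())      # per-row contribution to cov(x, y)
order = np.argsort(score)[::-1]              # most slope-supporting rows first
x_att = x.astype(float).copy()
removed = 0
for i in order:
    # Stop once the budget is spent or the coefficient is already insignificant.
    if removed >= budget or pvalue_of_x(mean_impute(x_att), y) > 0.10:
        break
    x_att[i] = np.nan
    removed += 1

print(f"clean p = {pvalue_of_x(x, y):.3f}")
print(f"attacked p = {pvalue_of_x(mean_impute(x_att), y):.3f} "
      f"after omitting {removed / n:.0%} of the target column")
```

Mean imputation maps the omitted entries to the column mean, so they contribute nothing to the covariance with y; removing the most slope-supporting rows therefore drives the coefficient, and hence its significance, toward zero. The dissertation's bi-level formulation replaces this greedy search with gradient-based optimization through differentiable surrogates of the imputation step.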