17 research outputs found
Counterfactual Generation Under Confounding
A machine learning model, under the influence of observed or unobserved
confounders in the training data, can learn spurious correlations and fail to
generalize when deployed. For image classifiers, augmenting a training dataset
using counterfactual examples has been empirically shown to break spurious
correlations. However, the counterfactual generation task itself becomes more
difficult as the level of confounding increases. Existing methods for
counterfactual generation under confounding consider a fixed set of
interventions (e.g., texture, rotation) and are not flexible enough to capture
diverse data-generating processes. Given a causal generative process, we
formally characterize the adverse effects of confounding on any downstream
tasks and show that the correlation between generative factors (attributes) can
be used to quantitatively measure confounding between generative factors. To
minimize such correlation, we propose a counterfactual generation method that
learns to modify the value of any attribute in an image and generate new images
given a set of observed attributes, even when the dataset is highly confounded.
These counterfactual images are then used to regularize the downstream
classifier such that the learned representations are the same across various
generative factors conditioned on the class label. Our method is
computationally efficient, simple to implement, and works well for any number
of generative factors and confounding variables. Our experimental results on
both synthetic (MNIST variants) and real-world (CelebA) datasets show the
usefulness of our approach.
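The abstract's claim that correlation between generative factors quantitatively measures confounding can be illustrated with a minimal numpy sketch (synthetic toy attributes, not the paper's code): a shared confounder drives two attributes apart from independence, and their empirical correlation exposes it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Confounded dataset: a latent confounder drives both attributes
# (e.g., digit class and rotation in a confounded MNIST variant).
confounder = rng.normal(size=n)
attr_a = confounder + 0.3 * rng.normal(size=n)
attr_b = confounder + 0.3 * rng.normal(size=n)
corr_confounded = np.corrcoef(attr_a, attr_b)[0, 1]

# Unconfounded dataset: attributes sampled independently.
attr_a_ind = rng.normal(size=n)
attr_b_ind = rng.normal(size=n)
corr_independent = np.corrcoef(attr_a_ind, attr_b_ind)[0, 1]

print(f"confounded:  {corr_confounded:.2f}")   # high correlation
print(f"independent: {corr_independent:.2f}")  # near zero
```

Counterfactual augmentation that decouples the attributes would, in this picture, drive the measured correlation toward zero.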
Rethinking Counterfactual Data Augmentation Under Confounding
Counterfactual data augmentation has recently emerged as a method to mitigate
confounding biases in the training data for a machine learning model. These
biases, such as spurious correlations, arise due to various observed and
unobserved confounding variables in the data generation process. In this paper,
we formally analyze how confounding biases impact downstream classifiers and
present a causal viewpoint to the solutions based on counterfactual data
augmentation. We explore how removing confounding biases serves as a means to
learn invariant features, ultimately aiding in generalization beyond the
observed data distribution. Additionally, we present a straightforward yet
powerful algorithm for generating counterfactual images, which effectively
mitigates the influence of confounding effects on downstream classifiers.
Through experiments on MNIST variants and the CelebA datasets, we demonstrate
the effectiveness and practicality of our approach.
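The augmentation loop this abstract describes can be sketched schematically (an assumed workflow, not the paper's algorithm; `generate_counterfactual` is a stand-in for a learned generator): for each image, produce counterfactuals that vary a potentially confounded attribute while keeping the label fixed, then train on the union.

```python
import numpy as np

def generate_counterfactual(image, attribute, value):
    """Stand-in for a learned counterfactual generator; here it only
    perturbs one entry so the bookkeeping below is runnable."""
    cf = image.copy()
    cf[0] = value  # placeholder edit; a real generator edits the attribute
    return cf

def augment(dataset, attribute, values):
    """Return the original examples plus label-preserving counterfactuals."""
    augmented = list(dataset)
    for image, label in dataset:
        for v in values:
            augmented.append((generate_counterfactual(image, attribute, v), label))
    return augmented

data = [(np.zeros(4), 0), (np.ones(4), 1)]
aug = augment(data, attribute="rotation", values=[0.0, 0.5])
print(len(aug))  # 2 originals + 2 images x 2 attribute values = 6
```

Because every counterfactual keeps its source label, a classifier trained on `aug` cannot rely on the varied attribute to predict the class.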
Synthesizing Quality Open Data Assets from Private Health Research Studies
Generating synthetic data represents an attractive solution for creating open data, enabling health research and education while preserving patient privacy. We reproduce the research outcomes of two previously published studies, which used private health data, using synthetic data generated with a method that we developed, called HealthGAN. We demonstrate the value of our methodology for generating and evaluating the quality and privacy of synthetic health data. The datasets are from the OptumLabs® Data Warehouse (OLDW). The OLDW is accessed within a secure environment and does not allow exporting patient-level data of any kind, real or synthetic; therefore, HealthGAN exports a privacy-preserving generator model instead. The studies examine questions related to comorbidities of Autism Spectrum Disorder (ASD) using medical records of children with ASD and matched patients without ASD. HealthGAN generates high-quality synthetic data that produce similar results while preserving patient privacy. By creating synthetic versions of these datasets that maintain privacy and achieve a high level of resemblance and utility, we create valuable open health data assets for future research and education efforts.
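One simple resemblance check of the kind alluded to above can be sketched in numpy (a hedged illustration with synthetic Gaussian data, not HealthGAN's evaluation suite): compare column-wise summary statistics of real versus synthetic tables, where smaller gaps mean higher resemblance.

```python
import numpy as np

def marginal_gap(real, synthetic):
    """Mean absolute difference of column-wise means and stds;
    0 means the marginal summary statistics match exactly."""
    mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).mean()
    std_gap = np.abs(real.std(axis=0) - synthetic.std(axis=0)).mean()
    return mean_gap + std_gap

rng = np.random.default_rng(1)
real = rng.normal(loc=2.0, scale=1.5, size=(5000, 8))
good_synth = rng.normal(loc=2.0, scale=1.5, size=(5000, 8))  # resembles real
poor_synth = rng.normal(loc=0.0, scale=1.0, size=(5000, 8))  # does not

print(marginal_gap(real, good_synth) < marginal_gap(real, poor_synth))  # True
```

Real evaluations would add utility checks (do downstream analyses reproduce?) and privacy checks (can any record be re-identified?) on top of such resemblance metrics.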
Evaluating and Mitigating Bias in Image Classifiers: A Causal Perspective Using Counterfactuals
Counterfactual examples for an input -- perturbations that change specific features but not others -- have been shown to be useful for evaluating bias of machine learning models, e.g., against specific demographic groups. However, generating counterfactual examples for images is nontrivial due to the underlying causal structure on the various features of an image. To be meaningful, generated perturbations need to satisfy constraints implied by the causal model. We present a method for generating counterfactuals by incorporating a structural causal model (SCM) in an improved variant of Adversarially Learned Inference (ALI) that generates counterfactuals in accordance with the causal relationships between attributes of an image. Based on the generated counterfactuals, we show how to explain a pre-trained machine learning classifier, evaluate its bias, and mitigate the bias using a counterfactual regularizer. On the Morpho-MNIST dataset, our method generates counterfactuals comparable in quality to prior work on SCM-based counterfactuals (DeepSCM), while on the more complex CelebA dataset our method outperforms DeepSCM in generating high-quality valid counterfactuals. Moreover, generated counterfactuals are indistinguishable from reconstructed images in a human evaluation experiment, and we subsequently use them to evaluate the fairness of a standard classifier trained on CelebA data. We show that the classifier is biased w.r.t. skin and hair color, and how counterfactual regularization can remove those biases. © 2022 IEEE
Evaluating and Mitigating Bias in Image Classifiers: A Causal Perspective Using Counterfactuals
Recent studies have reported biases in machine learning image classifiers,
especially against particular demographic groups. Counterfactual examples for
an input -- perturbations that change specific features but not others -- have
been shown to be useful for evaluating explainability and fairness of machine
learning models. However, generating counterfactual examples for images is
non-trivial due to the underlying causal structure governing the various
features of an image. To be meaningful, generated perturbations need to satisfy
constraints implied by the causal model. We present a method for generating
counterfactuals by incorporating a structural causal model (SCM) in an
improved variant of Adversarially Learned Inference (ALI) that generates
counterfactuals in accordance with the causal relationships between different
attributes of an image. Based on the generated counterfactuals, we show how to
evaluate bias and explain a pre-trained machine learning classifier. We also
propose a counterfactual regularizer that can mitigate bias in the classifier.
On the Morpho-MNIST dataset, our method generates counterfactuals comparable in
quality to prior work on SCM-based counterfactuals. Our method also works on
the more complex CelebA faces dataset; generated counterfactuals are
indistinguishable from original images in a human evaluation experiment. As a
downstream task, we use counterfactuals to evaluate a standard classifier
trained on CelebA data and show that it is biased w.r.t. skin and hair color,
and show how counterfactual regularization can be used to remove the identified
biases.
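The counterfactual regularizer described in the two abstracts above can be sketched in numpy (an assumed form of the penalty, not the papers' implementation): penalize the distance between the classifier's outputs on an image and on its counterfactual, so that changing a protected attribute such as skin or hair color cannot change the prediction.

```python
import numpy as np

def counterfactual_regularizer(logits, cf_logits):
    """Mean squared distance between original and counterfactual logits;
    added to the task loss, it pushes the two outputs together."""
    return float(np.mean((logits - cf_logits) ** 2))

# Toy logits for a batch of two images and their counterfactuals.
logits = np.array([[2.0, -1.0], [0.5, 0.5]])
cf_same = logits.copy()                        # classifier ignores the edit
cf_diff = np.array([[-2.0, 1.0], [0.5, 0.5]])  # classifier flips on image 1

print(counterfactual_regularizer(logits, cf_same))      # 0.0 -> no penalty
print(counterfactual_regularizer(logits, cf_diff) > 0)  # biased -> penalized
```

An unbiased classifier incurs no penalty; one whose output flips when only the protected attribute changes is penalized in proportion to the flip.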
Divided We Rule: Influencer Polarization on Twitter during Political Crises in India
Influencers are key to the nature and networks of information propagation on social media. Influencers are particularly important in political discourse through their engagement with issues, and may derive their legitimacy either solely or in large part through online operation, or have an offline sphere of expertise, such as entertainers, journalists, etc. To quantify influencers' political engagement and polarity, we use Google's Universal Sentence Encoder (USE) to encode the tweets of 6k influencers and 26k Indian politicians during political crises in India. We then obtain aggregate vector representations of the influencers based on their tweet embeddings, which alongside retweet graphs help compute their stance and polarity with respect to these political issues. We find that influencers engage with the topics in a partisan manner, with polarized influencers being rewarded with increased retweeting and following. Moreover, we observe that specific groups of influencers are consistently polarized across all events. We conclude by discussing how our study provides insights into the political schisms of present-day India, but also offers a means to study the role of influencers in exacerbating political polarization in other contexts.
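The aggregation step described above can be sketched as follows (a hedged illustration: the study uses Google's Universal Sentence Encoder and retweet graphs, while here small hand-made vectors stand in for USE tweet embeddings): average an influencer's tweet embeddings, then score polarity by cosine similarity to the centroids of two partisan politician groups.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def polarity(tweet_embeddings, centroid_a, centroid_b):
    """Positive -> closer to group A's centroid, negative -> closer to B's."""
    rep = tweet_embeddings.mean(axis=0)  # aggregate influencer representation
    return cosine(rep, centroid_a) - cosine(rep, centroid_b)

# Toy centroids for two partisan politician groups, plus one influencer's tweets.
centroid_a = np.array([1.0, 0.0, 0.0])
centroid_b = np.array([0.0, 1.0, 0.0])
tweets = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]])  # leans toward group A

print(polarity(tweets, centroid_a, centroid_b) > 0)  # True: polarized toward A
```

With real USE embeddings, the same score computed per event would show whether an influencer's polarity is stable across crises, as the study reports.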
Medical Time-Series Data Generation using Generative Adversarial Networks
Medical data is rarely made publicly available due to high deidentification costs and risks. Access to such data is highly regulated due to its sensitive nature. These factors impede the development of data-driven advancements in the healthcare domain. Synthetic medical data which can maintain the utility of the real data while simultaneously preserving privacy can be an ideal substitute for advancing research. Medical data is longitudinal in nature, with a single patient having multiple temporal events, influenced by static covariates like age, gender, comorbidities, etc. Extending existing time-series generative models to generate medical data can be challenging due to this influence of patient covariates. We propose a workflow wherein we leverage existing generative models to generate such data. We demonstrate this approach by generating synthetic versions of several time-series datasets where static covariates influence the temporal values. We use a state-of-the-art benchmark as a comparative baseline. Our methodology for empirically evaluating synthetic time-series data shows that the synthetic data generated with our workflow has higher resemblance and utility. We also demonstrate how stratification by covariates is required to gain a deeper understanding of synthetic data quality and underscore the importance of including this analysis in evaluation of synthetic medical data quality.
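One common way to let an off-the-shelf time-series generator see static covariates, in the spirit of the workflow above (an assumed conditioning scheme, not necessarily the paper's), is to broadcast each patient's covariates across all timesteps and concatenate them to the temporal features:

```python
import numpy as np

def attach_covariates(series, covariates):
    """series: (T, F) temporal values; covariates: (C,) static values.
    Returns (T, F + C) with the covariates repeated at every timestep,
    so a sequence model conditions on them at each step."""
    T = series.shape[0]
    tiled = np.tile(covariates, (T, 1))
    return np.concatenate([series, tiled], axis=1)

vitals = np.random.default_rng(3).normal(size=(24, 2))  # 24 hourly readings
static = np.array([67.0, 1.0])  # hypothetical covariates: age, gender flag

conditioned = attach_covariates(vitals, static)
print(conditioned.shape)  # (24, 4)
```

The stratified evaluation the abstract calls for then amounts to grouping generated sequences by these covariate columns and checking resemblance and utility within each stratum, not just in aggregate.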