57 research outputs found
On the Trade-offs between Adversarial Robustness and Actionable Explanations
As machine learning models are increasingly being employed in various
high-stakes settings, it becomes important to ensure that predictions of these
models are not only adversarially robust, but also readily explainable to
relevant stakeholders. However, it is unclear if these two notions can be
simultaneously achieved or if there exist trade-offs between them. In this
work, we make one of the first attempts at studying the impact of adversarially
robust models on actionable explanations which provide end users with a means
for recourse. We theoretically and empirically analyze the cost (ease of
implementation) and validity (probability of obtaining a positive model
prediction) of recourses output by state-of-the-art algorithms when the
underlying models are adversarially robust vs. non-robust. More specifically,
we derive theoretical bounds on the differences between the cost and the
validity of the recourses generated by state-of-the-art algorithms for
adversarially robust vs. non-robust linear and non-linear models. Our empirical
results with multiple real-world datasets validate our theoretical results and
show the impact of varying degrees of model robustness on the cost and validity
of the resulting recourses. Our analyses demonstrate that adversarially robust
models significantly increase the cost and reduce the validity of the resulting
recourses, thus shedding light on the inherent trade-offs between adversarial
robustness and actionable explanations.
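To make the two quantities concrete, below is a minimal sketch of how recourse cost and validity could be measured for a hypothetical linear classifier, using an L1 cost and a closed-form counterfactual; this is an illustrative assumption, not the authors' recourse algorithms.

```python
import numpy as np

# Hypothetical linear model: predicts the positive class when w @ x + b >= 0.
w = np.array([1.0, -2.0, 0.5])
b = -0.25

def closest_recourse(x, w, b, eps=1e-2):
    """Minimal-norm change that flips a negative prediction to positive
    (closed form for a linear decision boundary; illustrative only)."""
    margin = w @ x + b
    if margin >= 0:
        return x.copy()                      # already positive, no change needed
    delta = (eps - margin) * w / (w @ w)     # move just past the hyperplane
    return x + delta

def cost(x, x_cf):
    """Cost = effort to implement the recourse, here an L1 distance."""
    return np.abs(x_cf - x).sum()

def validity(X_cf, w, b):
    """Validity = fraction of recourses that obtain a positive prediction."""
    return float(np.mean(X_cf @ w + b >= 0))

X = np.array([[0.2, 0.9, -0.1], [1.5, 0.3, 0.0]])
X_cf = np.vstack([closest_recourse(x, w, b) for x in X])
print(cost(X[0], X_cf[0]), validity(X_cf, w, b))
```

With such metrics in hand, the comparison in the paper amounts to recomputing cost and validity after swapping the underlying weights for those of an adversarially trained model.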
Towards a Unified Framework for Fair and Stable Graph Representation Learning
As the representations output by Graph Neural Networks (GNNs) are
increasingly employed in real-world applications, it becomes important to
ensure that these representations are fair and stable. In this work, we
establish a key connection between counterfactual fairness and stability and
leverage it to propose a novel framework, NIFTY (uNIfying Fairness and
stabiliTY), which can be used with any GNN to learn fair and stable
representations. We introduce a novel objective function that simultaneously
accounts for fairness and stability and develop a layer-wise weight
normalization using the Lipschitz constant to enhance neural message passing in
GNNs. In doing so, we enforce fairness and stability both in the objective
function as well as in the GNN architecture. Further, we show theoretically
that our layer-wise weight normalization promotes counterfactual fairness and
stability in the resulting representations. We introduce three new graph
datasets comprising high-stakes decisions in criminal justice and financial
lending domains. Extensive experimentation with the above datasets demonstrates
the efficacy of our framework.
Comment: Accepted to UAI'2
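As a rough illustration of layer-wise weight normalization via a Lipschitz constant, the sketch below rescales a toy message-passing layer's weight matrix by its spectral norm (an upper bound on the layer's Lipschitz constant). This is a simplified stand-in for the idea, not NIFTY's exact normalization scheme.

```python
import torch
import torch.nn as nn

class LipschitzNormalizedGraphConv(nn.Module):
    """Toy message-passing layer whose weight matrix is rescaled by its
    spectral norm, capping the layer's Lipschitz constant at roughly 1.
    (Illustrative sketch; not the exact layer used in NIFTY.)"""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_dim, out_dim) * 0.1)

    def forward(self, x, adj):
        # Spectral norm = largest singular value, an upper bound on the
        # Lipschitz constant of the linear map x -> x @ W.
        lipschitz = torch.linalg.matrix_norm(self.weight, ord=2)
        w_norm = self.weight / lipschitz.clamp(min=1.0)
        # Mean aggregation over neighbours, then the normalized linear map.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu((adj @ x) / deg @ w_norm)

# Tiny usage example: 4 nodes, 3 features, a symmetric adjacency matrix.
x = torch.randn(4, 3)
adj = torch.tensor([[0, 1, 1, 0], [1, 0, 0, 1],
                    [1, 0, 0, 1], [0, 1, 1, 0]], dtype=torch.float)
layer = LipschitzNormalizedGraphConv(3, 8)
print(layer(x, adj).shape)  # torch.Size([4, 8])
```

Bounding each layer's Lipschitz constant limits how much a small perturbation of a node's features can change its representation, which is the intuition behind using such normalization to promote stability.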
Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability
With the increased deployment of machine learning models in various
real-world applications, researchers and practitioners alike have emphasized
the need for explanations of model behaviour. To this end, two broad strategies
have been outlined in prior literature to explain models. Post hoc explanation
methods explain the behaviour of complex black-box models by identifying
features critical to model predictions; however, prior work has shown that
these explanations may not be faithful, in that they incorrectly attribute high
importance to features that are unimportant or non-discriminative for the
underlying task. Inherently interpretable models, on the other hand, circumvent
these issues by explicitly encoding explanations into model architecture,
meaning their explanations are naturally faithful, but they often exhibit poor
predictive performance due to their limited expressive power. In this work, we
identify a key reason for the lack of faithfulness of feature attributions: the
lack of robustness of the underlying black-box models, especially to the
erasure of unimportant distractor features in the input. To address this issue,
we propose Distractor Erasure Tuning (DiET), a method that adapts black-box
models to be robust to distractor erasure, thus providing discriminative and
faithful attributions. This strategy naturally combines the ease of use of post
hoc explanations with the faithfulness of inherently interpretable models. We
perform extensive experiments on semi-synthetic and real-world datasets and
show that DiET produces models that (1) closely approximate the original
black-box models they are intended to explain, and (2) yield explanations that
match approximate ground truths available by construction. Our code is made
public at https://github.com/AI4LIFE-GROUP/DiET
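The following is a hypothetical sketch of what a single distractor-erasure fine-tuning step could look like, not the released DiET implementation: a learned feature mask erases candidate distractors, and the model is tuned so its prediction on the masked input stays consistent with its prediction on the full input, with the mask doubling as a discriminative attribution.

```python
import torch
import torch.nn as nn

def distractor_erasure_step(model, x, mask_logits, optimizer, baseline=0.0):
    """One illustrative fine-tuning step in the spirit of distractor erasure:
    match the model's prediction on a masked input (distractors replaced by a
    baseline value) to its prediction on the full input, while pushing the
    mask to erase as many features as possible.
    (Hypothetical sketch, not the authors' DiET code.)"""
    mask = torch.sigmoid(mask_logits)               # per-feature keep probability
    x_masked = mask * x + (1 - mask) * baseline     # erase candidate distractors
    with torch.no_grad():
        target = model(x).softmax(dim=-1)           # prediction on the full input
    pred = model(x_masked).log_softmax(dim=-1)
    consistency = nn.functional.kl_div(pred, target, reduction="batchmean")
    sparsity = mask.mean()                          # penalize kept features, i.e. favour erasure
    loss = consistency + 0.1 * sparsity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny usage example with a hypothetical 10-feature classifier.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
mask_logits = nn.Parameter(torch.zeros(10))
optimizer = torch.optim.Adam(list(model.parameters()) + [mask_logits], lr=1e-3)
x = torch.randn(8, 10)
print(distractor_erasure_step(model, x, mask_logits, optimizer))
```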
- …