Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions
Modern deep learning models for NLP are notoriously opaque. This has
motivated the development of methods for interpreting such models, e.g., via
gradient-based saliency maps or the visualization of attention weights. Such
approaches aim to provide explanations for a particular model prediction by
highlighting important words in the corresponding input text. While this might
be useful for tasks where decisions are explicitly influenced by individual
tokens in the input, we suspect that such highlighting is not suitable for
tasks where model decisions should be driven by more complex reasoning. In this
work, we investigate the use of influence functions for NLP, providing an
alternative approach to interpreting neural text classifiers. Influence
functions explain the decisions of a model by identifying influential training
examples. Despite the promise of this approach, influence functions have not
yet been extensively evaluated in the context of NLP, a gap addressed by this
work. We conduct a comparison between influence functions and common
word-saliency methods on representative tasks. As suspected, we find that
influence functions are particularly useful for natural language inference, a
task in which 'saliency maps' may not have a clear interpretation. Furthermore,
we develop a new quantitative measure based on influence functions that can
reveal artifacts in training data.
Comment: ACL 2020
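As a rough illustration of the scoring this line of work relies on, here is a minimal sketch in the spirit of Koh and Liang's influence functions, which the paper applies to NLP. It approximates the inverse Hessian with the identity, so the influence score collapses to a gradient dot product (the full method uses inverse-Hessian-vector products, e.g., via LiSSA); the model, data, and loss below are toy placeholders, not the paper's setup.

```python
# Minimal influence-function sketch. Assumption: Hessian ~ identity,
# so influence ~ -grad(test loss) . grad(train loss), a dot product.
import torch
import torch.nn as nn

def flat_grad(loss, params):
    """Concatenate gradients of `loss` w.r.t. `params` into one vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_scores(model, loss_fn, test_batch, train_examples):
    """Approximate influence of each training example on the test loss.
    More negative scores = upweighting the example would lower the test
    loss (a 'helpful' example), following Koh & Liang's sign convention."""
    params = [p for p in model.parameters() if p.requires_grad]
    x_test, y_test = test_batch
    g_test = flat_grad(loss_fn(model(x_test), y_test), params)
    scores = []
    for x, y in train_examples:
        g_train = flat_grad(loss_fn(model(x), y), params)
        # Influence ~ -g_test^T H^{-1} g_train; with H ~ I, a dot product.
        scores.append(-torch.dot(g_test, g_train).item())
    return scores

# Toy usage: a linear classifier over pre-computed text features.
model = nn.Linear(8, 2)
loss_fn = nn.CrossEntropyLoss()
test_batch = (torch.randn(1, 8), torch.tensor([1]))
train = [(torch.randn(1, 8), torch.tensor([0])) for _ in range(5)]
print(influence_scores(model, loss_fn, test_batch, train))
```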
Identifying Spurious Correlations for Robust Text Classification
The predictions of text classifiers are often driven by spurious correlations
-- e.g., the term 'Spielberg' correlates with positively reviewed movies, even
though the term itself does not semantically convey a positive sentiment. In
this paper, we propose a method to distinguish spurious and genuine
correlations in text classification. We treat this as a supervised
classification problem, using features derived from treatment effect estimators
to distinguish spurious correlations from "genuine" ones. Due to the generic
nature of these features and their small dimensionality, we find that the
approach works well even with limited training examples, and that it is
possible to transport the word classifier to new domains. Experiments on four
datasets (sentiment classification and toxicity detection) suggest that using
this approach to inform feature selection also leads to more robust
classification, as measured by improved worst-case accuracy on the samples
affected by spurious correlations.
Comment: Findings of EMNLP 2020
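A hedged sketch of the paper's core idea: derive treatment-effect-style features per word, then train a small classifier to separate spurious from genuine correlations. The feature below is a naive difference-in-means "effect"; the paper uses stronger estimators (e.g., matched contexts), so treat this stand-in, and all names in it, as illustrative assumptions.

```python
# Sketch: word-level "effect" features + a supervised spurious/genuine classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def naive_effect(word, docs, labels):
    """Difference in positive-label rate with vs. without `word` (naive stand-in
    for the paper's treatment effect estimators)."""
    has = np.array([word in d.split() for d in docs])
    if has.all() or not has.any():
        return 0.0
    y = np.asarray(labels, dtype=float)
    return y[has].mean() - y[~has].mean()

def word_features(word, docs, labels):
    """Small, generic feature vector for one word (assumed features)."""
    freq = sum(word in d.split() for d in docs) / len(docs)
    return [naive_effect(word, docs, labels), freq]

docs = ["spielberg great movie", "boring movie", "great acting",
        "spielberg boring plot", "great direction by spielberg"]
labels = [1, 0, 1, 0, 1]  # 1 = positive review

# A few words hand-labeled as genuine (1) or spurious (0) sentiment cues.
seed_words = {"great": 1, "boring": 1, "spielberg": 0, "movie": 0}
X = [word_features(w, docs, labels) for w in seed_words]
y = list(seed_words.values())

clf = LogisticRegression().fit(X, y)
for w in ["acting", "plot"]:
    p = clf.predict_proba([word_features(w, docs, labels)])[0, 1]
    print(f"P(genuine | {w}) = {p:.2f}")
```

Because the features are generic and low-dimensional, a classifier like this can in principle be trained on one domain's labeled words and applied to another, which is the transportability claim the abstract makes.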
Topics to Avoid: Demoting Latent Confounds in Text Classification
Despite impressive performance on many text classification tasks, deep neural
networks tend to learn frequent superficial patterns that are specific to the
training data and do not always generalize well. In this work, we observe this
limitation with respect to the task of native language identification. We find
that standard text classifiers which perform well on the test set end up
learning topical features which are confounds of the prediction task (e.g., if
the input text mentions Sweden, the classifier predicts that the author's
native language is Swedish). We propose a method that represents the latent
topical confounds and a model which "unlearns" confounding features by
predicting both the label of the input text and the confound. The two
predictors are trained adversarially, in an alternating fashion, to learn a
text representation that predicts the correct label but is less prone to using
information about the confound. We show that this model generalizes better and
learns features that are indicative of the writing style rather than the
content.
Comment: 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019)
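A minimal sketch of the alternating adversarial scheme the abstract describes: one step fits a confound predictor on frozen representations, the next updates the encoder and label head while pushing the confound predictor's loss up. The architecture, toy data, and the weighting `lam` are assumptions, not the paper's actual model.

```python
# Alternating adversarial training to "unlearn" a confound (sketch).
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(32, 16), nn.ReLU())  # text encoder (stand-in)
label_head = nn.Linear(16, 2)                      # predicts the task label
conf_head = nn.Linear(16, 4)                       # predicts the confound (e.g., topic)

opt_main = torch.optim.Adam(
    list(enc.parameters()) + list(label_head.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(conf_head.parameters(), lr=1e-3)
xent = nn.CrossEntropyLoss()
lam = 1.0  # strength of the adversarial term (assumed hyperparameter)

for step in range(200):
    x = torch.randn(8, 32)         # toy batch of text features
    y = torch.randint(0, 2, (8,))  # task labels
    c = torch.randint(0, 4, (8,))  # confound labels (latent topics)

    # (1) Adversary step: fit the confound predictor on frozen representations.
    h = enc(x).detach()
    adv_loss = xent(conf_head(h), c)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # (2) Main step: predict the label while making the confound
    # unpredictable from the representation (maximize the adversary's loss).
    h = enc(x)
    main_loss = xent(label_head(h), y) - lam * xent(conf_head(h), c)
    opt_main.zero_grad(); main_loss.backward(); opt_main.step()
```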
PANDORA Talks: Personality and Demographics on Reddit
Personality and demographics are important variables in the social sciences,
and in NLP they can aid interpretability and the removal of societal biases.
However, datasets with both personality and demographic labels are scarce. To
address this, we present PANDORA, the first large-scale dataset of Reddit
comments labeled with three personality models (including the well-established
Big 5 model) and demographics (age, gender, and location) for more than 10k
users. We showcase the usefulness of this dataset in three experiments, where
we leverage the more readily available data from other personality models to
predict the Big 5 traits, analyze gender classification biases arising from
psycho-demographic variables, and carry out a confirmatory and exploratory
analysis based on psychological theories. Finally, we present benchmark
prediction models for all personality and demographic variables.
Comment: Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media, NAACL 2021, https://www.aclweb.org/anthology/2021.socialnlp-1.1
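A hedged sketch of the first experiment the abstract mentions: using labels from more readily available personality models (e.g., binary MBTI dimensions) as features to predict a Big 5 trait. The data here is synthetic and the column layout is an assumption, not the dataset's actual schema.

```python
# Predicting a Big 5 trait from other personality-model labels (toy sketch).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
mbti = rng.integers(0, 2, size=(n, 4))  # 4 binary MBTI dimensions (assumed encoding)
# Synthetic target loosely tied to the first dimension, as a placeholder.
extraversion = 3.0 * mbti[:, 0] + rng.normal(0, 1, n)

X_tr, X_te, y_tr, y_te = train_test_split(mbti, extraversion, random_state=0)
model = Ridge().fit(X_tr, y_tr)
print("held-out R^2:", round(model.score(X_te, y_te), 3))
```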
Causal Effects of Linguistic Properties
We consider the problem of using observational data to estimate the causal
effects of linguistic properties. For example, does writing a complaint
politely lead to a faster response time? How much will a positive product
review increase sales? This paper addresses two technical challenges related to
the problem before developing a practical method. First, we formalize the
causal quantity of interest as the effect of a writer's intent, and establish
the assumptions necessary to identify this from observational data. Second, in
practice, we only have access to noisy proxies for the linguistic properties of
interest -- e.g., predictions from classifiers and lexicons. We propose an
estimator for this setting and prove that its bias is bounded when we perform
an adjustment for the text. Based on these results, we introduce TextCause, an
algorithm for estimating causal effects of linguistic properties. The method
leverages (1) distant supervision to improve the quality of noisy proxies, and
(2) a pre-trained language model (BERT) to adjust for the text. We show that
the proposed method outperforms related approaches when estimating the effect
of Amazon review sentiment on semi-simulated sales figures. Finally, we present
an applied case study investigating the effects of complaint politeness on
bureaucratic response times.
Comment: To appear at NAACL 2021 (Annual Conference of the North American Chapter of the Association for Computational Linguistics)
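To make the estimation problem concrete, here is a simplified regression-adjustment sketch: fit an outcome model on the noisy proxy treatment plus text features, then average the modeled outcome difference with the treatment toggled. TextCause itself improves the proxy with distant supervision and adjusts with BERT; the bag-of-words model and toy data below are stand-ins, so treat this as illustrative only.

```python
# Effect of a binary linguistic property on an outcome, adjusting for text.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

docs = ["great product fast shipping", "terrible quality broke fast",
        "great value", "bad seller terrible", "great great great"]
proxy_t = np.array([1, 0, 1, 0, 1])          # noisy sentiment proxy (lexicon/classifier)
sales = np.array([9.0, 2.0, 7.0, 1.0, 8.0])  # outcome (toy figures)

X_text = CountVectorizer().fit_transform(docs).toarray()
X = np.hstack([proxy_t[:, None], X_text])
outcome_model = LinearRegression().fit(X, sales)

# Adjustment: average the modeled outcome with T=1 vs. T=0, text held fixed.
X1 = np.hstack([np.ones((len(docs), 1)), X_text])
X0 = np.hstack([np.zeros((len(docs), 1)), X_text])
ate = (outcome_model.predict(X1) - outcome_model.predict(X0)).mean()
print("estimated effect of the property on the outcome:", round(ate, 2))
```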
A Survey of the State of Explainable AI for Natural Language Processing
Recent years have seen important advances in the quality of state-of-the-art
models, but this has come at the expense of models becoming less interpretable.
This survey presents an overview of the current state of Explainable AI (XAI),
considered within the domain of Natural Language Processing (NLP). We discuss
the main categorization of explanations, as well as the various ways
explanations can be arrived at and visualized. We detail the operations and
explainability techniques currently available for generating explanations for
NLP model predictions, to serve as a resource for model developers in the
community. Finally, we point out the current gaps and encourage directions for
future work in this important research area.
Comment: To appear in AACL-IJCNLP 2020
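As a concrete instance of one technique family such surveys cover, here is a toy gradient-times-input saliency map. The model and inputs are placeholders; in real use one would take gradients with respect to the token embeddings of a trained classifier and back-propagate from the predicted class.

```python
# Toy gradient-x-input saliency for one "sentence" (illustrative only).
import torch
import torch.nn as nn

embed = nn.Embedding(100, 16)
clf = nn.Sequential(nn.Flatten(), nn.Linear(4 * 16, 2))

tokens = torch.tensor([[3, 17, 42, 8]])           # one 4-token input
emb = embed(tokens).detach().requires_grad_(True)  # make embeddings a leaf
logits = clf(emb)
logits[0, 1].backward()  # gradient of the class-1 logit (use the predicted class in practice)

# Saliency per token: |gradient * embedding|, summed over the embedding dim.
saliency = (emb.grad * emb).abs().sum(-1).squeeze(0)
print(saliency.tolist())
```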