Impossibility Theorems for Feature Attribution
Despite a sea of interpretability methods that can produce plausible
explanations, the field has also empirically seen many failure cases of such
methods. In light of these results, it remains unclear for practitioners how to
use these methods and choose between them in a principled way. In this paper,
we show that for moderately rich model classes (easily satisfied by neural
networks), any feature attribution method that is complete and linear -- for
example, Integrated Gradients and SHAP -- can provably fail to improve on
random guessing for inferring model behaviour. Our results apply to common
end-tasks such as characterizing local model behaviour, identifying spurious
features, and algorithmic recourse. One takeaway from our work is the
importance of concretely defining end-tasks: once such an end-task is defined,
a simple and direct approach of repeated model evaluations can outperform many
other complex feature attribution methods.
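The "simple and direct approach of repeated model evaluations" the abstract points to can be made concrete as a feature-ablation loop: perturb one feature at a time and re-evaluate the model. The following is a minimal sketch under that reading, assuming a generic callable model f; all names here are illustrative, not the paper's code.

    # Estimate each feature's effect by setting it to a reference value
    # and re-evaluating the model (repeated model evaluations).
    import numpy as np

    def ablation_effects(f, x, baseline):
        """Effect of feature i = f(x) - f(x with feature i set to baseline[i])."""
        fx = f(x)
        effects = np.empty_like(x, dtype=float)
        for i in range(len(x)):
            x_ablated = x.copy()
            x_ablated[i] = baseline[i]
            effects[i] = fx - f(x_ablated)
        return effects

    # Toy nonlinear model standing in for a neural network.
    f = lambda x: x[0] * x[1] + np.sin(x[2])
    x = np.array([1.0, 2.0, 0.5])
    print(ablation_effects(f, x, baseline=np.zeros_like(x)))

Unlike a complete-and-linear attribution, each entry here answers a concrete counterfactual question ("what happens if this feature is removed?"), which is the kind of defined end-task the paper argues for.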
Hierarchical Reinforcement Learning for Open-Domain Dialog
Open-domain dialog generation is a challenging problem; maximum likelihood
training can lead to repetitive outputs, models have difficulty tracking
long-term conversational goals, and training on standard movie or online
datasets may lead to the generation of inappropriate, biased, or offensive
text. Reinforcement Learning (RL) is a powerful framework that could
potentially address these issues, for example by allowing a dialog model to
optimize for reducing toxicity and repetitiveness. However, previous approaches
which apply RL to open-domain dialog generation do so at the word level, making
it difficult for the model to learn proper credit assignment for long-term
conversational rewards. In this paper, we propose a novel approach to
hierarchical reinforcement learning, VHRL, which uses policy gradients to tune
the utterance-level embedding of a variational sequence model. This
hierarchical approach provides greater flexibility for learning long-term,
conversational rewards. We use self-play and RL to optimize for a set of
human-centered conversation metrics, and show that our approach provides
significant improvements -- in terms of both human evaluation and automatic
metrics -- over state-of-the-art dialog models, including Transformers.
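The utterance-level policy-gradient idea can be illustrated with a toy REINFORCE loop. This is a schematic sketch, not the authors' VHRL implementation: the "policy" is a categorical distribution over canned utterances, and the conversation-level reward penalizes repetition.

    import torch

    torch.manual_seed(0)
    n_turns = 4
    # Biased init: the policy initially favours utterance 0 (repetition).
    logits = torch.tensor([2.0, 0.0, 0.0, 0.0, 0.0], requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=0.05)

    for step in range(300):
        dist = torch.distributions.Categorical(logits=logits)
        turns = dist.sample((n_turns,))              # roll out a short "dialog"
        reward = float(len(set(turns.tolist())))     # reward distinct utterances
        loss = -reward * dist.log_prob(turns).sum()  # REINFORCE: credit all turns
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(torch.softmax(logits, dim=0))  # mass spreads out, reducing repetition

Because the reward is assigned to whole utterances rather than individual words, credit assignment for long-term conversational goals becomes easier, which is the core motivation behind the hierarchical setup.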
Moral Foundations of Large Language Models
Moral foundations theory (MFT) is a psychological assessment tool that
decomposes human moral reasoning into five factors, including care/harm,
liberty/oppression, and sanctity/degradation (Graham et al., 2009). People vary
in the weight they place on these dimensions when making moral decisions, in
part due to their cultural upbringing and political ideology. As large language
models (LLMs) are trained on datasets collected from the internet, they may
reflect the biases that are present in such corpora. This paper uses MFT as a
lens to analyze whether popular LLMs have acquired a bias towards a particular
set of moral values. We analyze known LLMs and find they exhibit particular
moral foundations, and show how these relate to human moral foundations and
political affiliations. We also measure the consistency of these biases, or
whether they vary strongly depending on the context of how the model is
prompted. Finally, we show that we can adversarially select prompts that
encourage the model to exhibit a particular set of moral foundations, and that
this can affect the model's behavior on downstream tasks. These findings help
illustrate the potential risks and unintended consequences of LLMs assuming a
particular moral stance.
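In outline, this kind of probing administers questionnaire items to a model and averages the ratings per foundation. The sketch below assumes that setup; query_llm is a hypothetical stub standing in for a real model call, and the items are abbreviated illustrations, not the full MFQ.

    from collections import defaultdict

    # Abbreviated, illustrative MFQ-style items, keyed by foundation.
    MFQ_ITEMS = {
        "care": "Whether or not someone suffered emotionally.",
        "fairness": "Whether or not some people were treated differently than others.",
        "sanctity": "Whether or not someone did something disgusting.",
    }
    PROMPT = ("Rate how relevant this consideration is when you decide whether "
              "something is right or wrong, from 0 (not at all relevant) to 5 "
              "(extremely relevant): {item}\nAnswer with a single number.")

    def query_llm(prompt):
        """Stub standing in for a real chat/completions call."""
        return "3"

    scores = defaultdict(list)
    for foundation, item in MFQ_ITEMS.items():
        reply = query_llm(PROMPT.format(item=item))
        scores[foundation].append(int(reply.strip()))

    profile = {f: sum(v) / len(v) for f, v in scores.items()}
    print(profile)  # per-foundation relevance profile for the probed model

Running the same items under varied prompt contexts and comparing the resulting profiles is one way to measure the consistency of the biases the paper describes.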
Active learning for electrodermal activity classification
To filter noise or detect features within physiological signals, it is often effective to encode expert knowledge into a model such as a machine learning classifier. However, training such a model can require much effort on the part of the researcher; this often takes the form of manually labeling portions of signal needed to represent the concept being trained. Active learning is a technique for reducing human effort by developing a classifier that can intelligently select the most relevant data samples and ask for labels for only those samples, in an iterative process. In this paper we demonstrate that active learning can reduce the labeling effort required of researchers by as much as 84% for our application, while offering equivalent or even slightly improved machine learning performance.
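The iterative query loop described here is, in generic form, pool-based uncertainty sampling. A minimal sketch with a stand-in classifier and synthetic data (the paper's EDA features and classifier are not reproduced):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    labeled = list(range(10))                   # small seed set of labels
    pool = list(range(10, len(X)))              # unlabeled pool

    clf = LogisticRegression(max_iter=1000)
    for _ in range(20):                         # budget: 20 label requests
        clf.fit(X[labeled], y[labeled])
        proba = clf.predict_proba(X[pool])[:, 1]
        pick = pool[int(np.argmin(np.abs(proba - 0.5)))]  # most uncertain sample
        labeled.append(pick)                    # "ask the expert" for this label
        pool.remove(pick)

    clf.fit(X[labeled], y[labeled])
    print("accuracy with", len(labeled), "labels:", round(clf.score(X, y), 3))

Each round spends one expert label on the sample the current classifier is least sure about, which is how labeling effort can drop sharply while performance is preserved.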
Wavelet-based motion artifact removal for electrodermal activity
Electrodermal activity (EDA) recording is a powerful, widely used tool for monitoring psychological or physiological arousal. However, analysis of EDA is hampered by its sensitivity to motion artifacts. We propose a method for removing motion artifacts from EDA, measured as skin conductance (SC), using a stationary wavelet transform (SWT). We modeled the wavelet coefficients as a Gaussian mixture distribution corresponding to the underlying skin conductance level (SCL) and skin conductance responses (SCRs). The goodness-of-fit of the model was validated on ambulatory SC data. We evaluated the proposed method in comparison with three previous approaches. Our method achieved a greater reduction of artifacts while retaining motion-artifact-free data.
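In outline, the method transforms the signal, suppresses coefficients inconsistent with the SC model, and inverts. The sketch below substitutes a simple robust threshold for the paper's Gaussian-mixture criterion; the synthetic signal and all parameter choices are illustrative.

    import numpy as np
    import pywt

    n = 512                                        # must be divisible by 2**level
    t = np.linspace(0.0, 60.0, n)
    sc = 2.0 + 0.3 * np.exp(-((t - 20.0) ** 2) / 4.0)  # tonic SCL + one SCR bump
    sc_noisy = sc.copy()
    sc_noisy[300:310] += 1.5                       # injected motion artifact

    level = 4
    coeffs = pywt.swt(sc_noisy, "haar", level=level)
    cleaned = []
    for cA, cD in coeffs:
        thr = 3.0 * np.median(np.abs(cD)) / 0.6745  # robust MAD-based threshold
        cleaned.append((cA, np.where(np.abs(cD) > thr, 0.0, cD)))

    sc_denoised = pywt.iswt(cleaned, "haar")
    print("max residual vs clean signal:", np.abs(sc_denoised - sc).max())

Zeroing outlier detail coefficients suppresses the sharp transient while the approximation band carries the slowly varying SCL, which is why artifact-free stretches of the signal are largely preserved.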