Quantifying Attention Flow in Transformers
In the Transformer model, "self-attention" combines information from attended
embeddings into the representation of the focal embedding in the next layer.
Thus, across layers of the Transformer, information originating from different
tokens gets increasingly mixed. This makes attention weights unreliable as
explanation probes. In this paper, we consider the problem of quantifying this
flow of information through self-attention. We propose two post hoc methods,
attention rollout and attention flow, that approximate the attention to input
tokens from the raw attention weights, treating attention weights as the
relative relevance of the input tokens. We show that these methods give
complementary views on the flow of information, and compared to raw attention,
both yield higher correlations with importance scores of input tokens obtained
using an ablation method and input gradients.
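As a hedged sketch of the rollout idea described above (averaging heads, mixing in the identity to account for residual connections, and multiplying layer by layer), one might compute it roughly as follows; the exact residual weighting and head aggregation are assumptions, not necessarily the paper's final formulation:

```python
import numpy as np

def attention_rollout(attentions):
    """Roll attention out across layers.

    attentions: list of per-layer arrays, each (heads, tokens, tokens),
    with rows summing to 1. Residual connections are modelled by mixing
    in the identity before multiplying layer by layer.
    """
    rollout = None
    for layer_att in attentions:
        a = layer_att.mean(axis=0)                # average over heads
        a = 0.5 * a + 0.5 * np.eye(a.shape[-1])   # account for residual stream
        a = a / a.sum(axis=-1, keepdims=True)     # renormalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout  # (tokens, tokens): top-layer attention to input tokens
```

Since each factor is row-stochastic, the rolled-out matrix is itself a distribution over input tokens for every position.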
Compositional Distributional Semantics with Long Short Term Memory
We propose an extension of the recursive neural network that makes use
of a variant of the long short-term memory architecture. The extension allows
information low in parse trees to be stored in a memory register (the `memory
cell') and used much later higher up in the parse tree. This provides a
solution to the vanishing gradient problem and allows the network to capture
long range dependencies. Experimental results show that our composition
outperformed the traditional neural-network composition on the Stanford
Sentiment Treebank.
Comment: 10 pages, 7 figures
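The base recursive composition that this abstract extends can be sketched minimally as follows; the parameterization is illustrative (the paper's variant additionally threads an LSTM-style memory cell up the tree, which is not shown here):

```python
import numpy as np

def compose(node, W, b, E):
    """Recursively compose a binary parse tree into a single vector.

    node: a word string (leaf) or a (left, right) pair of subtrees.
    E: word -> embedding dict; W (d, 2d), b (d,): composition parameters.
    This is the plain recursive-network composition tanh(W [l; r] + b).
    """
    if isinstance(node, str):
        return E[node]
    left = compose(node[0], W, b, E)
    right = compose(node[1], W, b, E)
    return np.tanh(W @ np.concatenate([left, right]) + b)
```

Because every non-leaf vector passes through a squashing nonlinearity, information from deep leaves is repeatedly transformed on its way up, which is exactly where the proposed memory register helps.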
Quantifying the vanishing gradient and long distance dependency problem in recursive neural networks and recursive LSTMs
Recursive neural networks (RNNs) and their recently proposed extension,
recursive long short-term memory networks (RLSTMs), are models that compute
representations for sentences, by recursively combining word embeddings
according to an externally provided parse tree. Unlike recurrent networks,
both models thus explicitly make use of the hierarchical structure of a
sentence. In this paper, we demonstrate that RNNs nevertheless suffer from the
vanishing gradient and long distance dependency problems, and that RLSTMs
greatly improve over RNNs on these problems. We present an artificial learning
task that allows us to quantify the severity of these problems for both models.
We further show that a ratio of gradients (at the root node and a focal leaf
node) is highly indicative of the success of backpropagation at optimizing the
relevant weights low in the tree. This paper thus provides an explanation for
existing, superior results of RLSTMs on tasks such as sentiment analysis, and
suggests that the benefits of including hierarchical structure and of including
LSTM-style gating are complementary.
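The root-to-leaf gradient ratio mentioned above can be illustrated with a toy scalar "tree": each level applies a squashing composition, and the gradient reaching a leaf is the product of the local derivatives along the path. This is a simplified illustration of the diagnostic, not the paper's actual measurement setup:

```python
import numpy as np

def gradient_ratio(depth, w=0.5):
    """Ratio of the gradient at the root (1.0) to the gradient reaching
    a leaf `depth` levels down, in a scalar chain h_out = tanh(w * h_in).
    The ratio grows with depth as the leaf gradient vanishes."""
    h, grad = 1.0, 1.0
    for _ in range(depth):
        h = np.tanh(w * h)
        grad *= w * (1.0 - h * h)   # local derivative d tanh(w h)/d h
    return 1.0 / grad
```

A large ratio means backpropagation delivers almost no signal to weights low in the tree, which is the failure mode the artificial task is designed to quantify.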
Unsupervised Dependency Parsing: Let's Use Supervised Parsers
We present a self-training approach to unsupervised dependency parsing that
reuses existing supervised and unsupervised parsing algorithms. Our approach,
called `iterated reranking' (IR), starts with dependency trees generated by an
unsupervised parser, and iteratively improves these trees using the richer
probability models used in supervised parsing that are in turn trained on these
trees. Our system achieves accuracy 1.8% higher than the state-of-the-art
parser of Spitkovsky et al. (2013) on the WSJ corpus.
Comment: 11 pages
Experiential, Distributional and Dependency-based Word Embeddings have Complementary Roles in Decoding Brain Activity
We evaluate 8 different word embedding models on their usefulness for
predicting the neural activation patterns associated with concrete nouns. The
models we consider include an experiential model, based on crowd-sourced
association data, several popular neural and distributional models, and a model
that reflects the syntactic context of words (based on dependency parses). Our
goal is to assess the cognitive plausibility of these various embedding models,
and understand how we can further improve our methods for interpreting brain
imaging data.
We show that neural word embedding models exhibit superior performance on the
tasks we consider, beating the experiential word representation model. The
syntactically informed model gives the overall best performance when predicting
brain activation patterns from word embeddings, whereas the GloVe
distributional method gives the overall best performance when predicting in the
reverse direction (word vectors from brain images). Interestingly, however,
the error patterns of these different models are markedly different. This may
support the idea that the brain uses different systems for processing different
kinds of words. Moreover, we suggest that taking the relative strengths of
different embedding models into account will lead to better models of the brain
activity associated with words.
Comment: accepted at Cognitive Modeling and Computational Linguistics 201
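A common way to set up this kind of encoding evaluation is a closed-form ridge regression from word embeddings to voxel activations; this is a standard baseline sketch, not necessarily the exact model the paper uses:

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Fit W minimizing ||X W - Y||^2 + alpha ||W||^2 in closed form.

    X: (n_words, embedding_dim) word embeddings.
    Y: (n_words, n_voxels) brain activation patterns.
    Returns W of shape (embedding_dim, n_voxels), so X @ W predicts Y.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)
```

Swapping the roles of X and Y gives the reverse (decoding) direction discussed in the abstract.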
DoLFIn: Distributions over Latent Features for Interpretability
Interpreting the inner workings of neural models is a key step in ensuring
the robustness and trustworthiness of the models, but work on neural network
interpretability typically faces a trade-off: either the models are too
constrained to be very useful, or the solutions found by the models are too
complex to interpret. We propose a novel strategy for achieving
interpretability that -- in our experiments -- avoids this trade-off. Our
approach builds on the success of using probability as the central quantity,
such as for instance within the attention mechanism. In our architecture,
DoLFIn (Distributions over Latent Features for Interpretability), we do not
determine beforehand what each feature represents, and the features together
form an unordered set. Each feature has an associated probability ranging from
0 to 1, weighting its importance for further processing. We show that, unlike
attention and saliency-map approaches, this set-up makes it straightforward to
compute the probability with which an input component supports the decision the
neural model makes. To demonstrate the usefulness of the approach, we apply
DoLFIn to text classification, and show that DoLFIn not only provides
interpretable solutions, but even slightly outperforms the classical CNN and
BiLSTM text classifiers on the SST2 and AG-news datasets.
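The core mechanism the abstract describes, an unordered set of latent features each weighted by a probability in [0, 1], might look roughly like this; the parameterization (a sigmoid score per feature) is an assumption for illustration, not the paper's exact architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dolfin_layer(features, scores):
    """Weight an unordered set of latent features by per-feature
    probabilities.

    features: (n_features, dim) latent feature vectors.
    scores: (n_features,) unnormalized importance scores.
    Returns the probability-weighted features and the probabilities.
    """
    p = sigmoid(scores)                # one probability in (0, 1) per feature
    return p[:, None] * features, p
```

Because each feature carries its own probability rather than competing in a softmax, the probability with which a given component supports the final decision can be read off directly.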
Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution
We present a setup for training, evaluating and interpreting neural language
models that uses artificial, language-like data. The data is generated using a
massive probabilistic grammar (based on state-split PCFGs), that is itself
derived from a large natural language corpus, but also gives us complete
control over the generative process. We describe and release both grammar and
corpus, and test for the naturalness of our generated data. This approach
allows us to define closed-form expressions to efficiently compute exact lower
bounds on obtainable perplexity using both causal and masked language
modelling. Our results show striking differences between neural language
modelling architectures and training objectives in how closely they allow
approximating the lower bound on perplexity. Our approach also allows us to
directly compare learned representations to symbolic rules in the underlying
source. We experiment with various techniques for interpreting model behaviour
and learning dynamics. With access to the underlying true source, our results
show striking differences and outcomes in learning dynamics between different
classes of words.
Comment: EMNLP Findings 202
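The intuition behind the perplexity lower bound is that with full access to the true next-token distribution at a position, no model can do better than the source entropy. A toy single-position version of that bound (the paper's closed-form expressions aggregate this over the whole grammar) can be written as:

```python
import numpy as np

def perplexity_lower_bound(p):
    """Per-position perplexity lower bound exp(H(p)) for a known
    next-token distribution p. The best possible model matches p,
    so its cross-entropy equals the entropy of p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # 0 * log(0) contributes nothing
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))
```

For a uniform distribution over k tokens the bound is exactly k, and for a deterministic position it is 1, matching the intuition that perplexity measures the effective branching factor of the source.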