Quantifying Attention Flow in Transformers
In the Transformer model, "self-attention" combines information from attended
embeddings into the representation of the focal embedding in the next layer.
Thus, across layers of the Transformer, information originating from different
tokens gets increasingly mixed. This makes attention weights unreliable as
explanation probes. In this paper, we consider the problem of quantifying this
flow of information through self-attention. We propose two methods, attention
rollout and attention flow, for approximating the attention to input tokens
given attention weights; both are post hoc methods that treat attention weights
as the relative relevance of the input tokens. We show that these methods give
complementary views on the flow of information, and compared to raw attention,
both yield higher correlations with importance scores of input tokens obtained
using an ablation method and input gradients.
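For intuition, attention rollout can be sketched as a cumulative product of per-layer attention maps. A minimal NumPy sketch, assuming head-averaged attention matrices and the common convention of mixing in the identity to account for residual connections; this is an illustrative reconstruction, not the paper's exact code.

    import numpy as np

    def attention_rollout(attentions):
        # attentions: list of (seq_len, seq_len) arrays, one per layer,
        # already averaged over heads; rows are attention distributions.
        seq_len = attentions[0].shape[0]
        rollout = np.eye(seq_len)
        for attn in attentions:
            # Mix in the identity to model the residual connection,
            # then renormalize rows so each stays a distribution.
            attn = 0.5 * attn + 0.5 * np.eye(seq_len)
            attn = attn / attn.sum(axis=-1, keepdims=True)
            rollout = attn @ rollout
        # Row i approximates how much each input token contributes to
        # position i at the top layer.
        return rollout

    # Toy usage: three layers of random attention over five tokens.
    rng = np.random.default_rng(0)
    layers = [rng.dirichlet(np.ones(5), size=5) for _ in range(3)]
    print(attention_rollout(layers).round(3))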
Experiential, Distributional and Dependency-based Word Embeddings have Complementary Roles in Decoding Brain Activity
We evaluate 8 different word embedding models on their usefulness for
predicting the neural activation patterns associated with concrete nouns. The
models we consider include an experiential model, based on crowd-sourced
association data, several popular neural and distributional models, and a model
that reflects the syntactic context of words (based on dependency parses). Our
goal is to assess the cognitive plausibility of these various embedding models,
and understand how we can further improve our methods for interpreting brain
imaging data.
We show that neural word embedding models exhibit superior performance on the
tasks we consider, beating the experiential word representation model. The
syntactically informed model gives the overall best performance when predicting
brain activation patterns from word embeddings; whereas the GloVe
distributional method gives the overall best performance when predicting in the
reverse direction (word vectors from brain images). Interestingly, however,
the error patterns of these different models are markedly different. This may
support the idea that the brain uses different systems for processing different
kinds of words. Moreover, we suggest that taking the relative strengths of
different embedding models into account will lead to better models of the brain
activity associated with words.
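To make the encoding direction concrete (predicting activation patterns from word vectors), a minimal scikit-learn sketch with ridge regression; the array shapes, regularization strength, and scoring choice are illustrative assumptions, not the authors' exact pipeline.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold

    # Hypothetical data: 60 concrete nouns, 300-dim word vectors, 500 voxels.
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(60, 300))   # any of the eight embedding models
    activations = rng.normal(size=(60, 500))  # fMRI response per noun

    scores = []
    for train, test in KFold(n_splits=6).split(embeddings):
        model = Ridge(alpha=1.0).fit(embeddings[train], activations[train])
        pred = model.predict(embeddings[test])
        # Score the fold by mean voxel-wise correlation between
        # predicted and observed activation patterns.
        r = [np.corrcoef(pred[:, v], activations[test, v])[0, 1]
             for v in range(activations.shape[1])]
        scores.append(np.nanmean(r))

    print(f"mean cross-validated voxel correlation: {np.mean(scores):.3f}")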
A Comparison of Architectures and Pretraining Methods for Contextualized Multilingual Word Embeddings
The lack of annotated data in many languages is a well-known challenge within
the field of multilingual natural language processing (NLP). Therefore, many
recent studies focus on zero-shot transfer learning and joint training across
languages to overcome data scarcity for low-resource languages. In this work we
(i) perform a comprehensive comparison of state-of-the-art multilingual word and
sentence encoders on the tasks of named entity recognition (NER) and part of
speech (POS) tagging; and (ii) propose a new method for creating multilingual
contextualized word embeddings, compare it to multiple baselines and show that
it performs at or above state-of-the-art level in zero-shot transfer settings.
Finally, we show that our method allows for better knowledge sharing across
languages in a joint training setting.
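To make the zero-shot transfer setting concrete, a minimal sketch of the evaluation protocol: a linear tagger is fit on source-language token embeddings only and evaluated directly on a target language. The encoder, feature shapes, and tag inventory are placeholders, not the paper's proposed method.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical pre-extracted contextualized token embeddings from a shared
    # multilingual encoder, plus POS-style tags; all shapes are illustrative.
    rng = np.random.default_rng(0)
    X_src, y_src = rng.normal(size=(1000, 768)), rng.integers(0, 17, size=1000)  # e.g. English
    X_tgt, y_tgt = rng.normal(size=(200, 768)), rng.integers(0, 17, size=200)    # target language

    # Zero-shot transfer: no target-language labels are used for training.
    tagger = LogisticRegression(max_iter=1000).fit(X_src, y_src)
    print("zero-shot tagging accuracy:", tagger.score(X_tgt, y_tgt))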
Adaptivity and Modularity for Efficient Generalization Over Task Complexity
Can transformers generalize efficiently on problems that require dealing with
examples with different levels of difficulty? We introduce a new task tailored
to assess generalization over different complexities and present results that
indicate that standard transformers face challenges in solving these tasks.
These tasks are variations of pointer value retrieval previously introduced by
Zhang et al. (2021). We investigate how the use of a mechanism for adaptive and
modular computation in transformers facilitates the learning of tasks that
demand generalization over the number of sequential computation steps (i.e.,
the depth of the computation graph). Based on our observations, we propose a
transformer-based architecture called Hyper-UT, which combines dynamic function
generation from hypernetworks with adaptive depth from Universal Transformers.
This model demonstrates higher accuracy and a fairer allocation of
computational resources when generalizing to higher numbers of computation
steps. We conclude that mechanisms for adaptive depth and modularity complement
each other in improving efficient generalization concerning example complexity.
Additionally, to emphasize the broad applicability of our findings, we
illustrate that in a standard image recognition task, Hyper-UT's performance
matches that of a ViT model but with considerably reduced computational demands
(achieving over 70% average savings by effectively using fewer layers).
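To illustrate the adaptive-depth ingredient, a toy PyTorch sketch in which one shared block is applied a variable number of times under a learned halting signal (Universal Transformer style). This is a simplified stand-in: it omits the ponder cost and the hypernetwork-generated weights that Hyper-UT adds.

    import torch
    import torch.nn as nn

    class AdaptiveDepthEncoder(nn.Module):
        # One shared layer applied repeatedly; a halting unit accumulates
        # probability mass and stops the loop once every position is "done",
        # so easy inputs consume fewer computation steps.
        def __init__(self, d_model=64, max_steps=8, threshold=0.99):
            super().__init__()
            self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.halt = nn.Linear(d_model, 1)
            self.max_steps, self.threshold = max_steps, threshold

        def forward(self, x):                      # x: (batch, seq, d_model)
            halting_mass = torch.zeros(x.shape[:2], device=x.device)
            for step in range(self.max_steps):
                x = self.block(x)
                halting_mass = halting_mass + torch.sigmoid(self.halt(x)).squeeze(-1)
                if bool((halting_mass >= self.threshold).all()):
                    break
            return x, step + 1

    model = AdaptiveDepthEncoder()
    out, steps_used = model(torch.randn(2, 10, 64))
    print(out.shape, "steps used:", steps_used)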
Robust Evaluation of Language–Brain Encoding Experiments
Language–brain encoding experiments evaluate the ability of language models to predict brain responses elicited by language stimuli. The evaluation scenarios for this task have not yet been standardized, which makes it difficult to compare and interpret results. We perform a series of evaluation experiments with a consistent encoding setup and compute the results for multiple fMRI datasets. In addition, we test the sensitivity of the evaluation measures to randomized data and analyze the effect of voxel selection methods. Our experimental framework is publicly available to make modelling decisions more transparent and support reproducibility for future comparisons.
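As one example of the kind of measure such an evaluation framework computes, a minimal sketch of pairwise (2 vs. 2) matching accuracy for a ridge encoding model, with a shuffled-target baseline as a randomization sanity check; the data shapes and regularization strength are assumptions for illustration.

    import numpy as np
    from itertools import combinations
    from sklearn.linear_model import Ridge

    def pairwise_accuracy(pred, true):
        # Fraction of stimulus pairs for which matching predicted to observed
        # scans by correlation picks the correct pairing.
        pairs = list(combinations(range(len(true)), 2))
        correct = 0
        for i, j in pairs:
            right = np.corrcoef(pred[i], true[i])[0, 1] + np.corrcoef(pred[j], true[j])[0, 1]
            wrong = np.corrcoef(pred[i], true[j])[0, 1] + np.corrcoef(pred[j], true[i])[0, 1]
            correct += right > wrong
        return correct / len(pairs)

    # Hypothetical held-out fold: 20 test stimuli, 300-dim features, 200 voxels.
    rng = np.random.default_rng(0)
    X_train, Y_train = rng.normal(size=(80, 300)), rng.normal(size=(80, 200))
    X_test, Y_test = rng.normal(size=(20, 300)), rng.normal(size=(20, 200))

    pred = Ridge(alpha=1.0).fit(X_train, Y_train).predict(X_test)
    print("pairwise accuracy:", pairwise_accuracy(pred, Y_test))
    # Randomization check: shuffling the observed scans should push the
    # score toward the 0.5 chance level.
    print("shuffled baseline:", pairwise_accuracy(pred, rng.permutation(Y_test)))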