Quantifying the vanishing gradient and long distance dependency problem in recursive neural networks and recursive LSTMs
Recursive neural networks (RNNs) and their recently proposed extension,
recursive long short-term memory networks (RLSTMs), are models that compute
representations for sentences by recursively combining word embeddings
according to an externally provided parse tree. Both models thus, unlike
recurrent networks, explicitly make use of the hierarchical structure of a
sentence. In this paper, we demonstrate that RNNs nevertheless suffer from the
vanishing gradient and long distance dependency problem, and that RLSTMs
greatly improve over RNNs on these problems. We present an artificial learning
task that allows us to quantify the severity of these problems for both models.
We further show that a ratio of gradients (at the root node and a focal leaf
node) is highly indicative of the success of backpropagation at optimizing the
relevant weights low in the tree. This paper thus provides an explanation for
existing, superior results of RLSTMs on tasks such as sentiment analysis, and
suggests that the benefits of including hierarchical structure and of including
LSTM-style gating are complementary.
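
To make the gradient-ratio diagnostic concrete, here is a minimal sketch (not the paper's task or architecture): a plain recursive network with tanh composition applied along a deep right-branching tree, comparing the gradient norm at a leaf combined directly below the root with the norm at the deepest leaf. The dimensions, depth, and dummy objective are all illustrative.

    import torch

    dim, depth = 50, 20
    W = 0.1 * torch.randn(dim, 2 * dim)                   # shared composition matrix
    leaves = [torch.randn(dim, requires_grad=True) for _ in range(depth + 1)]

    node = leaves[0]
    for leaf in leaves[1:]:                               # right-branching chain of compositions
        node = torch.tanh(W @ torch.cat([node, leaf]))

    node.sum().backward()                                 # dummy objective at the root

    shallow = leaves[-1].grad.norm()                      # leaf combined directly below the root
    deep = leaves[0].grad.norm()                          # leaf at the bottom of the tree
    print(f"gradient ratio (deep leaf / shallow leaf): {(deep / shallow).item():.2e}")

With a plain tanh composition this ratio shrinks rapidly with depth, which is the vanishing-gradient behaviour the paper quantifies.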
Quantifying Attention Flow in Transformers
In the Transformer model, "self-attention" combines information from attended
embeddings into the representation of the focal embedding in the next layer.
Thus, across layers of the Transformer, information originating from different
tokens gets increasingly mixed. This makes attention weights unreliable as
explanation probes. In this paper, we consider the problem of quantifying this
flow of information through self-attention. We propose two post hoc methods,
attention rollout and attention flow, that approximate the attention to input
tokens given the attention weights, for use when attention weights are taken
as the relative relevance of the input tokens. We show that these methods give
complementary views on the flow of information, and compared to raw attention,
both yield higher correlations with importance scores of input tokens obtained
using an ablation method and input gradients.
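
As a concrete illustration of the first method, the following numpy sketch implements attention rollout as described above: average the heads in each layer, mix in an identity term for the residual connection, re-normalize, and multiply the per-layer matrices from the bottom up. The random attention tensors are stand-ins for a real model's attention weights; attention flow (a max-flow computation over the same attention graph) is not shown.

    import numpy as np

    def attention_rollout(attentions):
        """attentions: list of per-layer arrays, each (num_heads, seq, seq), rows summing to 1."""
        rollout = None
        for layer_attn in attentions:
            a = layer_attn.mean(axis=0)                      # average over heads
            a = 0.5 * a + 0.5 * np.eye(a.shape[0])           # account for the residual connection
            a = a / a.sum(axis=-1, keepdims=True)            # re-normalize rows
            rollout = a if rollout is None else a @ rollout  # compose with the layers below
        return rollout                                       # [i, j] ~ relevance of input token j for position i

    rng = np.random.default_rng(0)
    attn = [rng.dirichlet(np.ones(6), size=(8, 6)) for _ in range(4)]  # 4 layers, 8 heads, seq len 6
    print(attention_rollout(attn).round(2))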
Characteristics of an x-ray preionized TEA CO laser
A cryogenically cooled, sealed-off, x-ray preionized, self-sustained discharge CO laser was successfully operated. It was found that 20 to 40% higher output energies could be obtained using x-ray instead of UV preionization. A maximum output energy of 2.9 J per pulse could be extracted from a 2×2×40 cm³ discharge volume. The maximum electrical efficiency proved to be 12.6%.
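
A back-of-the-envelope check of these figures, assuming the usual definition of electrical efficiency (optical output energy divided by electrical energy deposited in the discharge) and that the quoted maxima refer to comparable operating conditions: 2.9 J at 12.6% efficiency implies roughly 23 J deposited per pulse, and about 18 J of output per litre of discharge volume.

    e_out = 2.9                        # J, maximum output energy per pulse
    eta = 0.126                        # maximum electrical efficiency
    volume_l = (2 * 2 * 40) / 1000.0   # discharge volume in litres

    print(f"deposited energy ~ {e_out / eta:.0f} J per pulse")
    print(f"output density   ~ {e_out / volume_l:.0f} J per litre")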
Unsupervised Dependency Parsing: Let's Use Supervised Parsers
We present a self-training approach to unsupervised dependency parsing that
reuses existing supervised and unsupervised parsing algorithms. Our approach,
called `iterated reranking' (IR), starts with dependency trees generated by an
unsupervised parser, and iteratively improves these trees using the richer
probability models used in supervised parsing that are in turn trained on these
trees. Our system achieves accuracy 1.8% higher than the state-of-the-art
parser of Spitkovsky et al. (2013) on the WSJ corpus.
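
A minimal sketch of this self-training loop, with the parser interfaces (`init_parse`, `train`, `kbest`, `score`) as hypothetical placeholders rather than the paper's actual components:

    def iterated_reranking(sentences, init_parse, train, kbest, score, rounds=5, k=10):
        """init_parse: sentence -> tree (unsupervised);  train: (sentences, trees) -> model;
        kbest: (model, sentence, k) -> candidate trees;  score: (model, tree) -> float."""
        trees = [init_parse(s) for s in sentences]       # step 0: trees from the unsupervised parser
        for _ in range(rounds):
            model = train(sentences, trees)              # richer supervised model trained on current trees
            trees = [max(kbest(model, s, k), key=lambda t: score(model, t))
                     for s in sentences]                 # keep the best-scoring candidate per sentence
        return trees

Each round trains the richer model on the current trees and then uses it to rerank k-best candidates, so the training trees improve from one iteration to the next.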
Experiential, Distributional and Dependency-based Word Embeddings have Complementary Roles in Decoding Brain Activity
We evaluate 8 different word embedding models on their usefulness for
predicting the neural activation patterns associated with concrete nouns. The
models we consider include an experiential model, based on crowd-sourced
association data, several popular neural and distributional models, and a model
that reflects the syntactic context of words (based on dependency parses). Our
goal is to assess the cognitive plausibility of these various embedding models,
and understand how we can further improve our methods for interpreting brain
imaging data.
We show that neural word embedding models exhibit superior performance on the
tasks we consider, beating the experiential word representation model. The
syntactically informed model gives the overall best performance when predicting
brain activation patterns from word embeddings; whereas the GloVe
distributional method gives the overall best performance when predicting in the
reverse direction (word vectors from brain images). Interestingly, however,
the error patterns of these different models are markedly different. This may
support the idea that the brain uses different systems for processing different
kinds of words. Moreover, we suggest that taking the relative strengths of
different embedding models into account will lead to better models of the brain
activity associated with words.
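
For illustration, the following sketch shows the kind of bidirectional prediction being evaluated: a ridge regression from word vectors to voxel activations, and the reverse mapping from activations back to word vectors, scored with cross-validated R². The random arrays stand in for real embeddings and fMRI patterns, and the regression and scoring choices are assumptions, not the paper's exact protocol.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(60, 300))   # 60 concrete nouns x 300-dim word vectors
    activations = rng.normal(size=(60, 500))  # 60 nouns x 500 voxels

    fwd = cross_val_score(Ridge(alpha=1.0), embeddings, activations, cv=5, scoring="r2")
    rev = cross_val_score(Ridge(alpha=1.0), activations, embeddings, cv=5, scoring="r2")
    print(f"embeddings -> activations R^2: {fwd.mean():.3f}")
    print(f"activations -> embeddings R^2: {rev.mean():.3f}")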
Compositional Distributional Semantics with Long Short Term Memory
We propose an extension of the recursive neural network that makes use
of a variant of the long short-term memory architecture. The extension allows
information low in parse trees to be stored in a memory register (the `memory
cell') and used much later higher up in the parse tree. This provides a
solution to the vanishing gradient problem and allows the network to capture
long range dependencies. Experimental results show that our composition
outperformed the traditional neural-network composition on the Stanford
Sentiment Treebank.
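
As a sketch of the composition being described, the following implements one LSTM-style node of a binary parse tree: gated access to the children's memory cells is what lets information from low in the tree be carried upward to much higher nodes. The gating scheme and weight shapes are a generic binary tree-LSTM, not necessarily the paper's exact formulation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def compose(h_l, c_l, h_r, c_r, p):
        """Combine left/right child hidden states (h) and memory cells (c) into the parent's."""
        h = np.concatenate([h_l, h_r])
        i = sigmoid(p["W_i"] @ h + p["b_i"])      # input gate
        f_l = sigmoid(p["W_fl"] @ h + p["b_f"])   # forget gate for the left child cell
        f_r = sigmoid(p["W_fr"] @ h + p["b_f"])   # forget gate for the right child cell
        o = sigmoid(p["W_o"] @ h + p["b_o"])      # output gate
        u = np.tanh(p["W_u"] @ h + p["b_u"])      # candidate update
        c = i * u + f_l * c_l + f_r * c_r         # parent cell keeps gated child information
        return o * np.tanh(c), c                  # parent hidden state, parent memory cell

    dim = 50
    rng = np.random.default_rng(0)
    params = {k: 0.1 * rng.normal(size=(dim, 2 * dim)) for k in ("W_i", "W_fl", "W_fr", "W_o", "W_u")}
    params.update({k: np.zeros(dim) for k in ("b_i", "b_f", "b_o", "b_u")})
    h, c = compose(rng.normal(size=dim), np.zeros(dim), rng.normal(size=dim), np.zeros(dim), params)
    print(h.shape, c.shape)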
