Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks in an encoder-decoder configuration. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer, based
solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to be
superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014
English-to-German translation task, improving over the existing best results,
including ensembles by over 2 BLEU. On the WMT 2014 English-to-French
translation task, our model establishes a new single-model state-of-the-art
BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction
of the training costs of the best models from the literature. We show that the
Transformer generalizes well to other tasks by applying it successfully to
English constituency parsing both with large and limited training data.
Comment: 15 pages, 5 figures
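For reference, the operation at the core of the Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. Below is a minimal single-head NumPy sketch with toy shapes and no masking; it illustrates the formula, not the full multi-head architecture:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

# Toy usage: 3 queries attending over 5 key/value pairs of width 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
out = scaled_dot_product_attention(Q, K, V)          # shape (3, 8)
```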
Attention: Marginal Probability is All You Need?
Attention mechanisms are a central property of cognitive systems, allowing
them to selectively deploy cognitive resources in a flexible manner. Attention
has long been studied in the neurosciences, and there are numerous
phenomenological models that try to capture its core properties. Recently,
attentional mechanisms have become a dominant architectural choice in machine
learning and are the central innovation of Transformers. The dominant intuition
and formalism underlying their development has drawn on ideas of keys and
queries in database management systems. In this work, we propose an alternative
Bayesian foundation for attentional mechanisms and show how this unifies
different attentional architectures in machine learning. This formulation
allows us to identify commonalities across different attention architectures in
machine learning, and suggests a bridge to those developed in neuroscience. We
hope this work will guide more sophisticated intuitions into the key properties
of attention architectures and suggest new ones.
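The Bayesian reading can be made concrete with a small sketch. The generative model below is an assumption chosen for illustration, not necessarily the paper's exact formulation: each key is treated as the mean of an isotropic Gaussian with a uniform prior over keys, and the attention weights then fall out of Bayes' rule as posterior responsibilities:

```python
import numpy as np

def posterior_attention(q, K, tau=1.0):
    """Attention weights as Bayesian posterior responsibilities.

    Illustrative model (an assumption, not the paper's exact one):
    each key k_i is the mean of an isotropic Gaussian with scale tau,
    the prior over keys is uniform, and the query q is the observation.
    The posterior p(i | q) is a softmax of -||q - k_i||^2 / (2 tau^2).
    """
    log_lik = -np.sum((K - q) ** 2, axis=-1) / (2 * tau ** 2)
    log_post = log_lik - np.logaddexp.reduce(log_lik)  # normalize in log space
    return np.exp(log_post)

# For unit-norm q and keys, -||q - k||^2 / 2 = q.k - 1, so these posteriors
# equal softmax(q.k / tau^2): dot-product attention falls out of Bayes' rule.
```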
Attention Is Not All You Need Anymore
In recent years, the popular Transformer architecture has achieved great
success in many application areas, including natural language processing and
computer vision. Many existing works aim to reduce the computational and memory
complexity of the self-attention mechanism in the Transformer by trading off
performance. However, performance is key for the continuing success of the
Transformer. In this paper, a drop-in replacement for the self-attention
mechanism in the Transformer, called the Extractor, is proposed. Experimental
results show that replacing the self-attention mechanism with the Extractor
improves the performance of the Transformer. Furthermore, the proposed
Extractor has the potential to run faster than self-attention, since it has a
much shorter critical path of computation. Additionally, the sequence
prediction problem in the context of text generation is formulated using
variable-length discrete-time Markov chains, and the Transformer is reviewed
from this perspective.
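The variable-length Markov chain framing of next-token prediction can be illustrated with a simple back-off model. This is a generic sketch of the framing, not the paper's construction; the class and parameter names are hypothetical:

```python
from collections import Counter, defaultdict

class VariableLengthMarkovChain:
    """Back-off n-gram model: predict the next token from the longest
    previously seen context, up to max_order tokens. An illustrative
    sketch of the variable-length Markov framing, not the paper's model."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        self.counts = defaultdict(Counter)  # context tuple -> next-token counts

    def fit(self, tokens):
        for i, tok in enumerate(tokens):
            for order in range(self.max_order + 1):
                if i - order < 0:
                    break
                context = tuple(tokens[i - order:i])
                self.counts[context][tok] += 1

    def predict(self, history):
        # Back off from the longest context that has been observed.
        for order in range(min(self.max_order, len(history)), -1, -1):
            context = tuple(history[len(history) - order:])
            if context in self.counts:
                return self.counts[context].most_common(1)[0][0]
        return None

model = VariableLengthMarkovChain(max_order=2)
model.fit("the cat sat on the mat".split())
print(model.predict("on the".split()))  # -> 'mat'
```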
Attention Is (not) All You Need for Commonsense Reasoning
The recently introduced BERT model exhibits strong performance on several
language understanding benchmarks. In this paper, we describe a simple
re-implementation of BERT for commonsense reasoning. We show that the
attentions produced by BERT can be directly utilized for tasks such as the
Pronoun Disambiguation Problem and Winograd Schema Challenge. Our proposed
attention-guided commonsense reasoning method is conceptually simple yet
empirically powerful. Experimental analysis on multiple datasets demonstrates
that our proposed system performs remarkably well in all cases while
outperforming the previously reported state of the art by a margin. While
results suggest that BERT seems to implicitly learn to establish complex
relationships between entities, solving commonsense reasoning tasks might
require more than unsupervised models learned from huge text corpora.
Comment: to appear at ACL 2019
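The idea of reading referents directly off BERT's attention maps can be sketched with the Hugging Face transformers library. This is a deliberately simplified illustration: the paper's actual Maximum Attention Score computation differs, and the example sentence, hard-coded candidate words, and layer/head aggregation are assumptions:

```python
import torch
from transformers import BertTokenizer, BertModel

# Simplified sketch of attention-guided pronoun disambiguation: score each
# candidate by the attention mass the pronoun directs at it, summed over
# all layers and heads. (The paper's MAS scoring is more involved.)
tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "The dog chased the cat because it was angry."
inputs = tok(sentence, return_tensors="pt")
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    atts = model(**inputs).attentions         # tuple of (1, heads, seq, seq)

att = torch.stack(atts).sum(dim=(0, 2))[0]    # sum layers + heads -> (seq, seq)
pronoun = tokens.index("it")
for cand in ("dog", "cat"):
    score = att[pronoun, tokens.index(cand)].item()
    print(f"{cand}: {score:.3f}")             # higher mass -> predicted referent
```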
RITA: Group Attention is All You Need for Timeseries Analytics
Timeseries analytics is of great importance in many real-world applications.
Recently, the Transformer model, popular in natural language processing, has
been leveraged to learn high quality feature embeddings from timeseries, core
to the performance of various timeseries analytics tasks. However, the
quadratic time and space complexities limit Transformers' scalability,
especially for long timeseries. To address this scalability issue, we develop a
timeseries analytics tool, RITA, which uses a novel attention mechanism named
group attention. Group attention dynamically
clusters the objects based on their similarity into a small number of groups
and approximately computes the attention at the coarse group granularity. It
thus significantly reduces the time and space complexity, yet provides a
theoretical guarantee on the quality of the computed attention. The dynamic
scheduler of RITA continuously adapts the number of groups and the batch size
in the training process, ensuring group attention always uses the fewest groups
needed to meet the approximation quality requirement. Extensive experiments on
various timeseries datasets and analytics tasks demonstrate that RITA
outperforms the state-of-the-art in accuracy and is significantly faster --
with speedups of up to 63X.
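A rough sketch of the group-attention idea follows, assuming a simple k-means grouping. The clustering method, count-weighting, and shapes below are illustrative assumptions; RITA's actual algorithm, dynamic scheduler, and approximation guarantee are more involved:

```python
import numpy as np

def group_attention(Q, K, V, n_groups=4, seed=0):
    """Approximate attention at group granularity (illustrative sketch of
    the idea behind group attention, not RITA's exact algorithm): cluster
    the keys into a few groups, attend to one centroid per group, and
    weight each group by how many keys it contains."""
    rng = np.random.default_rng(seed)
    # Crude clustering: a few rounds of k-means on the keys.
    centroids = K[rng.choice(len(K), n_groups, replace=False)]
    for _ in range(5):
        assign = np.argmin(((K[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for g in range(n_groups):
            if np.any(assign == g):
                centroids[g] = K[assign == g].mean(axis=0)
    sizes = np.bincount(assign, minlength=n_groups)         # keys per group
    group_V = np.stack([V[assign == g].mean(axis=0) if sizes[g]
                        else np.zeros(V.shape[1]) for g in range(n_groups)])
    d_k = Q.shape[-1]
    scores = Q @ centroids.T / np.sqrt(d_k)                 # (n_q, n_groups)
    logits = scores + np.log(np.maximum(sizes, 1))          # count-weighted
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ group_V   # O(n_q * n_groups) instead of O(n_q * n_k)
```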
Attention Is All You Need For Blind Room Volume Estimation
In recent years, dynamic parameterization of acoustic environments has
attracted increasing attention in the field of audio processing. One of the key
parameters that characterize the local room acoustics in isolation from
orientation and directivity of sources and receivers is the geometric room
volume. Convolutional neural networks (CNNs) have been widely selected as the
main models for conducting blind room acoustic parameter estimation, which aims
to learn a direct mapping from audio spectrograms to corresponding labels. With
the recent trend of self-attention mechanisms, this paper introduces a purely
attention-based model to blindly estimate room volumes based on single-channel
noisy speech signals. We demonstrate the feasibility of eliminating the
reliance on CNNs for this task; the proposed Transformer architecture takes
Gammatone magnitude spectral coefficients and phase spectrograms as inputs. To
enhance the model performance given the task-specific dataset, cross-modality
transfer learning is also applied. Experimental results demonstrate that the
proposed model outperforms traditional CNN models across a wide range of
real-world acoustic spaces, especially with the help of the dedicated
pretraining and data augmentation schemes.
Comment: 5 pages, 4 figures, submitted to ICASSP 2024
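A minimal PyTorch sketch of a purely attention-based regressor in this spirit is given below, with illustrative feature and model sizes; it is not the paper's exact architecture or training setup:

```python
import torch
import torch.nn as nn

class VolumeTransformer(nn.Module):
    """Minimal sketch of a purely attention-based room-volume regressor:
    spectral feature frames are treated as a token sequence, encoded by a
    Transformer, pooled, and mapped to a scalar (e.g. log room volume).
    Feature choice and sizes are illustrative, not the paper's exact model."""

    def __init__(self, n_features=64, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)   # per-frame embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)            # regression head

    def forward(self, x):                # x: (batch, frames, n_features)
        h = self.encoder(self.proj(x))   # (batch, frames, d_model)
        return self.head(h.mean(dim=1)).squeeze(-1)  # pooled -> log-volume

model = VolumeTransformer()
spec = torch.randn(2, 100, 64)           # e.g. Gammatone magnitude frames
print(model(spec).shape)                 # torch.Size([2])
```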