Search CORE

31,971 research outputs found

Tensorized Self-Attention: Efficiently Modeling Pairwise and Global Dependencies Together

Author: Jiang Jing
Long Guodong
Shen Tao
Zhang Chengqi
Zhou Tianyi
Publication venue
Publication date: 01/01/2019
Field of study

Neural networks equipped with self-attention have parallelizable computation, light-weight structure, and the ability to capture both long-range and local dependencies. Further, their expressive power and performance can be boosted by using a vector to measure pairwise dependency, but this requires to expand the alignment matrix to a tensor, which results in memory and computation bottlenecks. In this paper, we propose a novel attention mechanism called "Multi-mask Tensorized Self-Attention" (MTSA), which is as fast and as memory-efficient as a CNN, but significantly outperforms previous CNN-/RNN-/attention-based models. MTSA 1) captures both pairwise (token2token) and global (source2token) dependencies by a novel compatibility function composed of dot-product and additive attentions, 2) uses a tensor to represent the feature-wise alignment scores for better expressive power but only requires parallelizable matrix multiplications, and 3) combines multi-head with multi-dimensional attentions, and applies a distinct positional mask to each head (subspace), so the memory and computation can be distributed to multiple heads, each with sequential information encoded independently. The experiments show that a CNN/RNN-free model based on MTSA achieves state-of-the-art or competitive performance on nine NLP benchmarks with compelling memory- and time-efficiency

arXiv.org e-Print Archive

Crossref

OPUS - University of Technology Sydney

Recommended from our members

AGM, a dataflow database machine

Author: Bic Lubomir
Hartmann Robert L.
Publication venue: eScholarship, University of California
Publication date: 01/01/1984
Field of study

In recent years, a number of database machines consisting of large numbers of parallel processing elements have been proposed. Unfortunately, one of the main limitations to parallelism in database processing is the I/O bandwidth of the underlying storage devices. One way to solve this problem is to use multiple parallel disk units. The main problem with this approach, however, is the lack of a computational model capable of utilizing the potential of any significant number of such devices.This paper presents a database model which is based on the principles of data-driven computation. According to this model, the database is represented as a network in which each node is conceptually an independent processing element, capable of communicating with other nodes by exchanging messages along the network arcs. To answer a query, one or more such messages, called tokens, are created and injected into the network. These then propagate asynchronously through the network in the search of results satisfying the given query.To investigate the performance of the proposed system, we have implemented the model on a simulated computer architecture. The results of the simulation ex-periments indicate that the model is capable of exploiting the potential I/O band-width of a large number of disk units as well as the computational power of the associated processing elements

eScholarship - University of California

Adaptive Attention Span in Transformers

Author: Bojanowski Piotr
Grave Edouard
Joulin Armand
Sukhbaatar Sainbayar
Publication venue
Publication date: 01/01/2019
Field of study

We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to extend significantly the maximum context size used in Transformer, while maintaining control over their memory footprint and computational time. We show the effectiveness of our approach on the task of character level language modeling, where we achieve state-of-the-art performances on text8 and enwiki8 by using a maximum context of 8k characters.Comment: Accepted to ACL 201

arXiv.org e-Print Archive

Crossref