18,822 research outputs found

Attention Is All You Need

Authors: Gomez Aidan N.; Jones Llion; Kaiser Lukasz; Parmar Niki; Polosukhin Illia; Shazeer Noam; Uszkoreit Jakob; Vaswani Ashish
Publication date: 05/12/2017

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Comment: 15 pages, 5 figures

arXiv.org e-Print Archive
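
As a rough illustration of the attention mechanism the Transformer abstract above refers to, here is a minimal pure-Python sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The function names are illustrative, and the sketch leaves out learned projections, multiple heads, and masking, so it should not be read as the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query.

    queries: list of vectors of length d_k
    keys:    list of vectors of length d_k
    values:  list of vectors (one per key)
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs
```

When every key is identical, the weights are uniform and each output reduces to the plain mean of the value vectors, which is a convenient sanity check.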

Attention Is All You Need

Author: Krönke Christoph
Publication date: 14/04/2023

<intR>²Dok

Masked Attention is All You Need for Graphs

Authors: Buterez David; Janet Jon Paul; Lio Pietro; Oglic Dino
Publication date: 16/02/2024

Graph neural networks (GNNs) and variations of the message passing algorithm are the predominant means for learning on graphs, largely due to their flexibility, speed, and satisfactory performance. The design of powerful and general purpose GNNs, however, requires significant research efforts and often relies on handcrafted, carefully chosen message passing operators. Motivated by this, we propose a remarkably simple alternative for learning on graphs that relies exclusively on attention. Graphs are represented as node or edge sets and their connectivity is enforced by masking the attention weight matrix, effectively creating custom attention patterns for each graph. Despite its simplicity, masked attention for graphs (MAG) has state-of-the-art performance on long-range tasks and outperforms strong message passing baselines and much more involved attention-based methods on over 55 node and graph-level tasks. We also show significantly better transfer learning capabilities compared to GNNs and comparable or better time and memory scaling. MAG has sub-linear memory scaling in the number of nodes or edges, enabling learning on dense graphs and future-proofing the approach.

arXiv.org e-Print Archive
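
The idea in the abstract above, enforcing graph connectivity by masking the attention weight matrix, can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions, not the MAG implementation: node features serve directly as queries, keys, and values (no learned projections), each node may also attend to itself, and the function name `masked_graph_attention` is hypothetical.

```python
import math


def masked_graph_attention(features, adjacency):
    """Self-attention over a node set, restricted to graph neighbours.

    features:  list of node feature vectors (all the same length)
    adjacency: square 0/1 matrix; adjacency[i][j] == 1 means node i
               may attend to node j
    """
    n = len(features)
    d = len(features[0])
    outputs = []
    for i in range(n):
        # Masking step: keep only node i itself (an assumption of this
        # sketch) and the nodes it is connected to.
        allowed = [j for j in range(n) if j == i or adjacency[i][j]]
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(d)
            for j in allowed
        ]
        # Softmax over the unmasked positions only.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append(
            [
                sum(e / total * features[j][c] for e, j in zip(exps, allowed))
                for c in range(d)
            ]
        )
    return outputs
```

An isolated node attends only to itself here, so its output equals its input features; nodes in different connected components never influence each other within a single layer.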

RITA: Group Attention is All You Need for Timeseries Analytics

Authors: Cao Lei; Ives Zachary; Li Guoliang; Liang Jiaming; Madden Samuel
Publication date: 02/06/2023

Timeseries analytics is of great importance in many real-world applications. Recently, the Transformer model, popular in natural language processing, has been leveraged to learn high quality feature embeddings from timeseries, core to the performance of various timeseries analytics tasks. However, the quadratic time and space complexities limit Transformers' scalability, especially for long timeseries. To this end, we develop a timeseries analytics tool, RITA, which uses a novel attention mechanism, named group attention, to address this scalability issue. Group attention dynamically clusters the objects based on their similarity into a small number of groups and approximately computes the attention at the coarse group granularity. It thus significantly reduces the time and space complexity, yet provides a theoretical guarantee on the quality of the computed attention. The dynamic scheduler of RITA continuously adapts the number of groups and the batch size in the training process, ensuring group attention always uses the fewest groups needed to meet the approximation quality requirement. Extensive experiments on various timeseries datasets and analytics tasks demonstrate that RITA outperforms the state-of-the-art in accuracy and is significantly faster, with speedups of up to 63X.

arXiv.org e-Print Archive
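
To make "computing attention at the coarse group granularity" more concrete, the following pure-Python sketch scores a query against one centroid per group of keys instead of against every key. It is a simplified illustration and not RITA's algorithm: the group assignment `group_of` is assumed to be given (the abstract's dynamic clustering and scheduler are omitted), every group is assumed non-empty, and the function name is hypothetical.

```python
import math


def group_attention(query, keys, values, group_of):
    """Approximate attention output for a single query.

    group_of[i] is the index of the group that key i belongs to;
    group indices are assumed to run from 0 to max(group_of) with
    no empty groups.
    """
    d_k = len(query)
    d_v = len(values[0])
    n_groups = max(group_of) + 1

    counts = [0] * n_groups
    key_sums = [[0.0] * d_k for _ in range(n_groups)]
    val_sums = [[0.0] * d_v for _ in range(n_groups)]
    for k, v, g in zip(keys, values, group_of):
        counts[g] += 1
        for j, x in enumerate(k):
            key_sums[g][j] += x
        for j, x in enumerate(v):
            val_sums[g][j] += x

    # One score per group, computed against the group's centroid key.
    scores = [
        sum(q * s / counts[g] for q, s in zip(query, key_sums[g])) / math.sqrt(d_k)
        for g in range(n_groups)
    ]
    peak = max(scores)
    # Weight each group by its size, as if all members shared the centroid.
    weights = [counts[g] * math.exp(scores[g] - peak) for g in range(n_groups)]
    total = sum(weights)

    # Each group contributes the mean of its value vectors.
    return [
        sum(weights[g] / total * val_sums[g][j] / counts[g] for g in range(n_groups))
        for j in range(d_v)
    ]
```

If all keys within a group are identical, this reproduces exact softmax attention while computing only one score per group; the more the keys in a group differ, the coarser the approximation becomes.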

Attention Is All You Need For Blind Room Volume Estimation

Authors: Bao Changchun; Jia Maoshen; Jin Wenyu; Li Meiran; Wang Chunxi
Publication date: 23/09/2023

In recent years, dynamic parameterization of acoustic environments has attracted increasing attention in the field of audio processing. One of the key parameters that characterize the local room acoustics in isolation from orientation and directivity of sources and receivers is the geometric room volume. Convolutional neural networks (CNNs) have been widely selected as the main models for conducting blind room acoustic parameter estimation, which aims to learn a direct mapping from audio spectrograms to corresponding labels. With the recent trend of self-attention mechanisms, this paper introduces a purely attention-based model to blindly estimate room volumes based on single-channel noisy speech signals. We demonstrate the feasibility of eliminating the reliance on CNNs for this task; the proposed Transformer architecture takes Gammatone magnitude spectral coefficients and phase spectrograms as inputs. To enhance the model performance given the task-specific dataset, cross-modality transfer learning is also applied. Experimental results demonstrate that the proposed model outperforms traditional CNN models across a wide range of real-world acoustic spaces, especially with the help of the dedicated pretraining and data augmentation schemes.
Comment: 5 pages, 4 figures, submitted ICASSP 202

arXiv.org e-Print Archive

Gaussian Adaptive Attention is All You Need: Robust Contextual Representations Across Multiple Modalities

Authors: Chadha Aman; Elkins Aaron; Ioannides Georgios
Publication date: 30/01/2024

We propose the Multi-Head Gaussian Adaptive Attention Mechanism (GAAM), a novel probabilistic attention framework, and the Gaussian Adaptive Transformer (GAT), designed to enhance information aggregation across multiple modalities, including Speech, Text and Vision. GAAM integrates learnable mean and variance into its attention mechanism, implemented in a Multi-Headed framework, enabling it to collectively model any Probability Distribution for dynamic recalibration of feature significance. This method demonstrates significant improvements, especially with highly non-stationary data, surpassing the state-of-the-art attention techniques in model performance (up to approximately +20% in accuracy) by identifying key elements within the feature space. GAAM's compatibility with dot-product-based attention models and relatively low number of parameters showcase its adaptability and potential to boost existing attention frameworks. Empirically, GAAM exhibits superior adaptability and efficacy across a diverse range of tasks, including emotion recognition in speech, image classification, and text classification, thereby establishing its robustness and versatility in handling multi-modal data. Furthermore, we introduce the Importance Factor (IF), a new learning-based metric that enhances the explainability of models trained with GAAM-based methods. Overall, GAAM represents an advancement towards the development of better performing and more explainable attention models across multiple modalities.

arXiv.org e-Print Archive

"Attention is all you need". Arquitectura Transformers: descripción y aplicaciones

Author: Nasimba Tipan Alexis Fabian
Publication venue: 'Universidad Miguel Hernandez de Elche'
Publication date: 01/06/2023

Natural language processing, better known by its acronym NLP, has evolved steadily over the years and is now present in tools that ordinary users rely on every day, such as Google Translate. This branch of machine learning has been widely embraced by the scientific community and by industry, which is driving very rapid development. Some of the most common applications of NLP algorithms are text classification, language translation, and text generation. Thanks to their great versatility, they are already being used to solve real-world problems. In the search for the most efficient solutions to the problems of an increasingly digital world, research has advanced on new algorithms for text understanding and generation, such as Transformers, the most widely adopted neural network in this field to date, owing to the potential demonstrated in large language models such as GPT-4 or LaMDA. The aim of this project is to carry out an in-depth study of the neural network known as the Transformer, covering its origins, the neural networks that preceded it, its structure and operation, and its practical application in current models. Finally, we solve a problem by building, training, and testing such a network, which allows a complete analysis of the results obtained.

RediUMH (Universidad Miguel Hernández)