7 research outputs found
Attention weights in transformer NMT fail aligning words between sequences but largely explain model predictions
This work proposes an extensive analysis of the Transformer architecture in the Neural Machine Translation (NMT) setting. Focusing on the encoder-decoder attention mechanism, we prove that attention weights systematically make alignment errors by relying mainly on uninformative tokens from the source sequence. However, we observe that NMT models assign attention to these tokens to regulate the contribution of the two contexts, the source and the prefix of the target sequence, to the prediction. We provide evidence of the influence of wrong alignments on model behavior, demonstrating that the encoder-decoder attention mechanism is not well suited as an interpretability method for NMT only when taken at face value; rather, our analysis clarifies when it is reliable. Finally, based on our analysis, we propose methods that largely reduce the word alignment error rate compared to standard alignments induced from attention weights. This work is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 947657). Peer reviewed. Postprint (published version).
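As a hedged illustration of what inducing word alignments from attention weights typically involves (not necessarily the exact procedure analyzed in this work), the following minimal Python sketch extracts hard source-target alignments by taking the argmax over head-averaged encoder-decoder attention; the tensor shapes and the head averaging are assumptions.

# Minimal sketch: inducing hard word alignments from encoder-decoder
# attention weights (illustrative only; layer/head choice is an assumption).
import numpy as np

def induce_alignments(cross_attn: np.ndarray) -> list[tuple[int, int]]:
    """cross_attn: array of shape (heads, tgt_len, src_len) holding attention
    weights from one decoder layer. Returns (tgt_idx, src_idx) pairs."""
    avg = cross_attn.mean(axis=0)            # average over heads -> (tgt_len, src_len)
    src_for_each_tgt = avg.argmax(axis=-1)   # most-attended source token per target token
    return [(t, int(s)) for t, s in enumerate(src_for_each_tgt)]

# toy example: 2 heads, 3 target tokens, 4 source tokens
rng = np.random.default_rng(0)
attn = rng.random((2, 3, 4))
attn /= attn.sum(axis=-1, keepdims=True)     # normalize each row to sum to 1
print(induce_alignments(attn))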
Measuring the mixing of contextual information in the transformer
The Transformer architecture aggregates input information through the self-attention mechanism, but there is no clear understanding of how this information is mixed across the entire model. Additionally, recent works have demonstrated that attention weights alone are not enough to describe the flow of information. In this paper, we consider the whole attention block (multi-head attention, residual connection, and layer normalization) and define a metric to measure token-to-token interactions within each layer. Then, we aggregate layer-wise interpretations to provide input attribution scores for model predictions. Experimentally, we show that our method, ALTI (Aggregation of Layer-wise Token-to-token Interactions), provides more faithful explanations and increased robustness than gradient-based methods. Javier Ferrando and Gerard I. Gállego are supported by the Spanish Ministerio de Ciencia e Innovación through the project PID2019-107579RB-I00 / AEI / 10.13039/501100011033. Peer reviewed. Postprint (published version).
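The following is a minimal sketch of the general idea of aggregating layer-wise token-to-token interaction matrices into input attributions, in the spirit of rollout-style aggregation; the specific contribution metric defined by ALTI is not reproduced here, and the row-stochastic matrices are stand-in assumptions.

# Sketch: combining per-layer token-to-token contribution matrices into
# input attribution scores by propagating them through the stack.
# The per-layer matrices would normally come from the attention block
# (multi-head attention + residual + layer norm); here they are random stand-ins.
import numpy as np

def aggregate_contributions(layer_mats: list[np.ndarray]) -> np.ndarray:
    """layer_mats[i] has shape (seq_len, seq_len); row j holds how much each
    token contributed to token j at layer i (rows sum to 1). Multiplying the
    matrices traces contributions back to the input tokens."""
    agg = layer_mats[0]
    for mat in layer_mats[1:]:
        agg = mat @ agg                      # compose mixing across layers
    return agg                               # row j: attribution of input tokens to position j

seq_len, n_layers = 5, 4
rng = np.random.default_rng(1)
mats = []
for _ in range(n_layers):
    m = rng.random((seq_len, seq_len))
    mats.append(m / m.sum(axis=1, keepdims=True))   # make each matrix row-stochastic
print(aggregate_contributions(mats).round(2))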
Explaining how transformers use context to build predictions
Language Generation Models produce words based on the previous context. Although existing methods offer input attributions as explanations for a model's prediction, it is still unclear how prior words affect the model's decision throughout the layers. In this work, we leverage recent advances in explainability of the Transformer and present a procedure to analyze models for language generation. Using contrastive examples, we compare the alignment of our explanations with evidence of the linguistic phenomena, and show that our method consistently aligns better than gradient-based and perturbation-based baselines. Then, we investigate the role of MLPs inside the Transformer and show that they learn features that help the model predict words that are grammatically acceptable. Lastly, we apply our method to Neural Machine Translation models, and demonstrate that they generate human-like source-target alignments for building predictions. Javier Ferrando, Gerard I. Gállego and Ioannis Tsiamas are supported by the Spanish Ministerio de Ciencia e Innovación through the project PID2019-107579RB-I00 / AEI / 10.13039/501100011033. Peer reviewed. Preprint.
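To make the contrastive setup concrete, here is a hedged sketch of one gradient-based baseline of the kind compared against: attributing the logit difference between the correct word and a contrastive foil to the context embeddings via gradient-times-input. It is not the paper's own method; the toy model and token ids are placeholders.

# Sketch: contrastive gradient-times-input attribution for a next-word
# prediction, using a toy model (placeholder for a real language model).
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim = 50, 16
embed = nn.Embedding(vocab, dim)
lm_head = nn.Linear(dim, vocab)

def forward_logits(input_ids: torch.Tensor):
    """Toy 'LM': mean-pool the context embeddings and project to the vocabulary."""
    embs = embed(input_ids)                  # (seq_len, dim)
    embs.retain_grad()                       # keep gradients w.r.t. the embeddings
    logits = lm_head(embs.mean(dim=0))       # (vocab,)
    return embs, logits

context = torch.tensor([3, 17, 8, 25])       # placeholder token ids for the prefix
target, foil = 5, 9                          # correct word vs. contrastive foil
embs, logits = forward_logits(context)
contrast = logits[target] - logits[foil]     # contrastive objective
contrast.backward()
# attribution of each context token: gradient times input, summed over dimensions
scores = (embs.grad * embs).sum(dim=-1)
print(scores.detach().round(decimals=3))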
On the locality of attention in direct speech translation
Transformers have achieved state-of-the-art results across multiple NLP tasks. However, the complexity of the self-attention mechanism scales quadratically with the sequence length, creating an obstacle for tasks involving long sequences, as in the speech domain. In this paper, we discuss the usefulness of self-attention for Direct Speech Translation. First, we analyze the layer-wise token contributions in the self-attention of the encoder, unveiling local diagonal patterns. To prove that some attention weights are avoidable, we propose to substitute the standard self-attention with a local, efficient one, setting the amount of context used based on the results of the analysis. With this approach, our model matches the baseline performance and improves efficiency by skipping the computation of the weights that standard attention discards. This work was partially funded by the project ADAVOICE, PID2019-107579RB-I00 / AEI / 10.13039/501100011033, and the UPC INIREC scholarship nº 3522. Peer reviewed. Postprint (published version).
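A minimal sketch of the local self-attention pattern described above, where each position attends only to a fixed window of neighbours; the window size and shapes are illustrative assumptions, and a truly efficient implementation would skip the masked entries rather than compute and then mask them, as done here for brevity.

# Sketch: local (banded) self-attention -- each position may only attend to
# neighbours within a fixed window, mirroring the diagonal patterns above.
import torch
import torch.nn.functional as F

def local_attention(q, k, v, window: int):
    """q, k, v: (seq_len, dim). Positions attend only where |i - j| <= window."""
    seq_len = q.size(0)
    scores = q @ k.T / q.size(-1) ** 0.5                  # (seq_len, seq_len)
    idx = torch.arange(seq_len)
    band = (idx[:, None] - idx[None, :]).abs() <= window  # banded attention mask
    scores = scores.masked_fill(~band, float("-inf"))     # disallow distant positions
    return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
seq_len, dim = 10, 8
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))
out = local_attention(q, k, v, window=2)
print(out.shape)                                          # torch.Size([10, 8])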
The TALP-UPC participation in WMT21 news translation task: an mBART-based NMT approach
This paper describes the submission to the WMT 2021 news translation shared task by the UPC Machine Translation group. The goal of the task is to translate German to French (De-Fr) and French to German (Fr-De). Our submission focuses on fine-tuning a pre-trained model to take advantage of monolingual data. We fine-tune mBART50 using the filtered data and, additionally, we train a Transformer model on the same data from scratch. In the experiments, we show that fine-tuning mBART50 results in 31.69 BLEU for De-Fr and 23.63 BLEU for Fr-De, an increase of 2.71 and 1.90 BLEU, respectively, compared to the model trained from scratch. Our final submission is an ensemble of these two models, which further increases the Fr-De score by 0.3 BLEU. Postprint (published version).
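As a hedged illustration of model ensembling in NMT, the sketch below averages the next-token log-probabilities of two models during greedy decoding; whether this matches the submission's exact ensembling procedure is an assumption, and the model interface shown is a placeholder.

# Sketch: ensembling two translation models at decoding time by averaging
# their next-token log-probabilities (greedy decoding shown for brevity).
# The two models stand in for the fine-tuned mBART50 and the from-scratch
# Transformer; the interface below is an assumption.
import torch

def greedy_ensemble_decode(models, src, bos_id, eos_id, max_len=50):
    """Each model must expose next_log_probs(src, prefix) -> (vocab,) tensor."""
    prefix = [bos_id]
    for _ in range(max_len):
        log_probs = torch.stack([m.next_log_probs(src, prefix) for m in models])
        avg = log_probs.mean(dim=0)          # average the distributions of the two models
        next_id = int(avg.argmax())
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix

class DummyModel:
    """Stand-in scoring model returning a random log-probability vector."""
    def __init__(self, seed):
        self.gen = torch.Generator().manual_seed(seed)
    def next_log_probs(self, src, prefix):
        return torch.log_softmax(torch.randn(100, generator=self.gen), dim=-1)

print(greedy_ensemble_decode([DummyModel(0), DummyModel(1)], src=None, bos_id=1, eos_id=2))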
Speeding up document image classification
This work presents a solution based on light Convolutional Neural Networks (CNNs) for the document classification task, an essential problem in the digitization process of institutions. We show on the RVL-CDIP dataset that we can achieve state-of-the-art results with a set of lighter models such as EfficientNets, and we present their transfer learning capabilities on a smaller in-domain dataset such as Tobacco3482. Moreover, we present an ensemble pipeline which is able to boost models that rely solely on image input by combining image model predictions with those generated by a BERT model on text extracted by OCR. We also show that the batch size can be effectively increased without hindering accuracy, so that the training process can be sped up by parallelizing across multiple GPUs, decreasing the computational time needed.
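A minimal sketch of the late-fusion idea behind such an ensemble pipeline: combining class probabilities from the image model with those from the BERT-on-OCR model by a weighted average; the fusion weight and the 16-class setup are illustrative assumptions.

# Sketch: late fusion of an image classifier and a text classifier by
# weighted averaging of their class probabilities.
import numpy as np

def fuse_predictions(p_image: np.ndarray, p_text: np.ndarray, w: float = 0.5) -> int:
    """p_image, p_text: probability vectors over the same set of classes."""
    fused = w * p_image + (1.0 - w) * p_text
    return int(fused.argmax())                  # index of the predicted class

rng = np.random.default_rng(2)
num_classes = 16                                # assumption: RVL-CDIP has 16 classes
p_img = rng.dirichlet(np.ones(num_classes))     # stand-in for EfficientNet softmax output
p_txt = rng.dirichlet(np.ones(num_classes))     # stand-in for BERT-on-OCR softmax output
print(fuse_predictions(p_img, p_txt, w=0.6))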
Decision-making system for a robotic agent based on reinforcement learning
In this project, a decision-making system based on artificial emotions is proposed for the resolution of problems by autonomous robotic agents. Specifically, it is based on the application of reinforcement learning techniques to the production of artificial emotions, which motivate the agent to solve the problem in the most efficient way. The agent takes into account importance, opportunity and urgency when it faces a new situation, and it is the learning system that determines how these three appraisals are distributed, determining the proper emotional state to make the correct decision. The application used to test this approach consists of a simulation of a crash, where the robotic agent has the mission of cleaning up stains and collecting pieces resulting from an accident. On a basis previously developed in the C++ Builder programming environment, functionalities are added to complete the simulator. In addition, the Python language with its multiple libraries and the TensorFlow framework are used to develop the learning system through neural networks. Ferrando Monsonís, J. (2018). Decision-making system for a robotic agent based on reinforcement learning. Universitat Politècnica de València. http://hdl.handle.net/10251/112129
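Purely as an illustration of the kind of component described above, and not the thesis' actual system, the sketch below maps a state vector to a softmax distribution over the three appraisals (importance, opportunity, urgency) and applies a simple REINFORCE-style update from a scalar reward; all shapes, the update rule, and the reward are assumptions. (PyTorch is used here to keep the examples in one framework, whereas the thesis uses TensorFlow.)

# Illustrative sketch only: a tiny network producing an appraisal distribution
# ("emotional state") and a single policy-gradient-style update step.
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, n_appraisals = 6, 3               # importance, opportunity, urgency
policy = nn.Sequential(nn.Linear(state_dim, 16), nn.ReLU(), nn.Linear(16, n_appraisals))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

state = torch.randn(state_dim)               # placeholder state/sensor features
probs = torch.softmax(policy(state), dim=-1) # distribution over the three appraisals
dist = torch.distributions.Categorical(probs)
appraisal = dist.sample()                    # dominant appraisal chosen for this step
reward = 1.0                                 # placeholder reward from the simulator

loss = -dist.log_prob(appraisal) * reward    # REINFORCE-style update
opt.zero_grad()
loss.backward()
opt.step()
print(probs.detach().round(decimals=3), int(appraisal))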