8 research outputs found

    Measuring the mixing of contextual information in the transformer

    The Transformer architecture aggregates input information through the self-attention mechanism, but there is no clear understanding of how this information is mixed across the entire model. Additionally, recent works have demonstrated that attention weights alone are not enough to describe the flow of information. In this paper, we consider the whole attention block (multi-head attention, residual connection, and layer normalization) and define a metric to measure token-to-token interactions within each layer. Then, we aggregate layer-wise interpretations to provide input attribution scores for model predictions. Experimentally, we show that our method, ALTI (Aggregation of Layer-wise Token-to-token Interactions), provides more faithful explanations and increased robustness than gradient-based methods.
    Javier Ferrando and Gerard I. Gállego are supported by the Spanish Ministerio de Ciencia e Innovación through the project PID2019-107579RB-I00 / AEI / 10.13039/501100011033. Peer reviewed. Postprint (published version).
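    As an illustration of the aggregation idea (not the paper's exact formulation), the following is a minimal PyTorch sketch of combining per-layer token-to-token contribution matrices into input attribution scores by composing them across layers, rollout-style; the function name and the random toy matrices are purely illustrative.

        import torch

        def aggregate_layerwise_contributions(layer_contribs):
            """Compose per-layer token-to-token contribution matrices into
            input attributions (rollout-style aggregation).

            layer_contribs: list of [seq_len, seq_len] tensors, one per layer,
            where row i holds the contribution of each input token to the
            output representation of token i (rows assumed to sum to 1).
            """
            # Before the first layer, each token is explained only by itself.
            attributions = torch.eye(layer_contribs[0].size(0))
            for contrib in layer_contribs:
                # Compose this layer's mixing with everything accumulated so far.
                attributions = contrib @ attributions
            return attributions  # row i: attribution of each input token to position i

        # Toy usage: two layers of row-normalised contribution matrices for 5 tokens.
        contribs = [torch.softmax(torch.randn(5, 5), dim=-1) for _ in range(2)]
        print(aggregate_layerwise_contributions(contribs))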

    Explaining how transformers use context to build predictions

    Language Generation Models produce words based on the previous context. Although existing methods offer input attributions as explanations for a model's prediction, it is still unclear how prior words affect the model's decision throughout the layers. In this work, we leverage recent advances in explainability of the Transformer and present a procedure to analyze models for language generation. Using contrastive examples, we compare the alignment of our explanations with evidence of the linguistic phenomena, and show that our method consistently aligns better than gradient-based and perturbation-based baselines. Then, we investigate the role of MLPs inside the Transformer and show that they learn features that help the model predict words that are grammatically acceptable. Lastly, we apply our method to Neural Machine Translation models, and demonstrate that they generate human-like source-target alignments for building predictions.
    Javier Ferrando, Gerard I. Gállego and Ioannis Tsiamas are supported by the Spanish Ministerio de Ciencia e Innovación through the project PID2019-107579RB-I00 / AEI / 10.13039/501100011033. Peer reviewed. Preprint.
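    For context, here is a minimal sketch of the kind of gradient-based contrastive baseline referred to above: attribute the difference between the logit of the observed word and a contrastive foil to each input position. It assumes a Hugging Face-style causal language model that accepts inputs_embeds and returns logits; the function name and setup are illustrative, not the paper's method.

        import torch

        def contrastive_saliency(model, input_embeds, target_id, foil_id):
            """Gradient of (logit[target] - logit[foil]) at the last position,
            taken with respect to the input embeddings.

            input_embeds: [1, seq_len, hidden] tensor with requires_grad=True.
            Returns one saliency score per input position (L2 norm of the gradient).
            """
            logits = model(inputs_embeds=input_embeds).logits[0, -1]
            contrast = logits[target_id] - logits[foil_id]
            grads, = torch.autograd.grad(contrast, input_embeds)
            return grads[0].norm(dim=-1)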

    Tackling low-resourced sign language translation: UPC at WMT-SLT 22

    This paper describes the system developed at the Universitat Politècnica de Catalunya for the Workshop on Machine Translation 2022 Sign Language Translation Task, in particular, for the sign-to-text direction. We use a Transformer model implemented with the Fairseq modeling toolkit. We have experimented with the vocabulary size, data augmentation techniques and pretraining the model with the PHOENIX-14T dataset. Our system obtains a 0.50 BLEU score on the test set, improving the organizers' baseline by 0.38 BLEU. We note the poor results for both the baseline and our system and, therefore, the limited reliability of our findings.
    This research was partially supported by research grant Adavoice PID2019-107579RB-I00 / AEI / 10.13039/501100011033, and research grants PRE2020-094223, PID2021-126248OB-I00 and PID2019-107255GB-C21. Peer reviewed. Postprint (published version).
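    For reference, corpus-level BLEU as reported above is typically computed with the sacrebleu library alongside Fairseq; a minimal sketch, with made-up example sentences:

        import sacrebleu

        # Hypothetical system outputs and references, for illustration only.
        hypotheses = ["the weather is nice today", "he goes to school"]
        references = [["the weather is good today", "she goes to school"]]

        # corpus_bleu takes a list of hypotheses and a list of reference streams.
        bleu = sacrebleu.corpus_bleu(hypotheses, references)
        print(f"BLEU = {bleu.score:.2f}")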

    On the locality of attention in direct speech translation

    Transformers have achieved state-of-the-art results across multiple NLP tasks. However, the self-attention mechanism complexity scales quadratically with the sequence length, creating an obstacle for tasks involving long sequences, as in the speech domain. In this paper, we discuss the usefulness of self-attention for Direct Speech Translation. First, we analyze the layer-wise token contributions in the self-attention of the encoder, unveiling local diagonal patterns. To prove that some attention weights are avoidable, we propose to substitute the standard self-attention with a local efficient one, setting the amount of context used based on the results of the analysis. With this approach, our model matches the baseline performance, and improves the efficiency by skipping the computation of the weights that standard attention discards.
    This work was partially funded by the project ADAVOICE, PID2019-107579RB-I00 / AEI / 10.13039/501100011033, and the UPC INIREC scholarship nº 3522. Peer reviewed. Postprint (published version).
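    A minimal PyTorch sketch of the windowed (local) self-attention idea described above: each position attends only to neighbours within a fixed window. For clarity it computes the full score matrix and masks it, whereas an efficient implementation would skip the masked positions altogether; the window size and names are illustrative.

        import torch
        import torch.nn.functional as F

        def local_self_attention(q, k, v, window):
            """Scaled dot-product self-attention restricted to a local window:
            position i attends only to positions j with |i - j| <= window."""
            seq_len, dim = q.size(1), q.size(-1)
            scores = q @ k.transpose(-2, -1) / dim ** 0.5        # [batch, seq, seq]
            idx = torch.arange(seq_len)
            outside = (idx[None, :] - idx[:, None]).abs() > window
            scores = scores.masked_fill(outside, float("-inf"))
            return F.softmax(scores, dim=-1) @ v

        # Toy usage: 10 frames of 16-dim features, window of 2 neighbours per side.
        x = torch.randn(1, 10, 16)
        print(local_self_attention(x, x, x, window=2).shape)     # torch.Size([1, 10, 16])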

    Sign language translation from instructional videos

    The advances in automatic sign language translation (SLT) to spoken languages have mostly been benchmarked with datasets of limited size and restricted domains. Our work advances the state of the art by providing the first baseline results on How2Sign, a large and broad dataset. We train a Transformer over I3D video features, using the reduced BLEU as the reference metric for validation instead of the widely used BLEU score. We report a BLEU score of 8.03, and publish the first open-source implementation of its kind to promote further advances.
    This research was partially supported by research grant Adavoice PID2019-107579RB-I00 / AEI / 10.13039/501100011033, research grants PRE2020-094223, PID2021-126248OB-I00 and PID2019-107255GB-C21, and by the Generalitat de Catalunya (AGAUR) under grant agreement 2021-SGR-00478. Peer reviewed. Postprint (published version).
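    As a rough illustration of the setup (dimensions and names are assumptions, not the paper's configuration), a minimal PyTorch sketch of encoding pre-extracted I3D clip features with a standard Transformer encoder, whose output a text decoder would attend to:

        import torch
        import torch.nn as nn

        class VideoFeatureEncoder(nn.Module):
            """Transformer encoder over pre-extracted I3D clip features."""

            def __init__(self, feat_dim=1024, d_model=256, nhead=4, num_layers=2):
                super().__init__()
                self.proj = nn.Linear(feat_dim, d_model)   # project I3D features to model size
                layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, num_layers)

            def forward(self, feats):                       # feats: [batch, num_clips, feat_dim]
                return self.encoder(self.proj(feats))       # [batch, num_clips, d_model]

        # Toy usage: a batch of 2 videos, each represented by 50 I3D clip features.
        feats = torch.randn(2, 50, 1024)
        print(VideoFeatureEncoder()(feats).shape)           # torch.Size([2, 50, 256])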

    End-to-end Speech Translation with Self-supervised Speech Representations

    For years, speech translation has been approached as a concatenation of speech recognition and machine translation. The powerful architectures of deep learning have made end-to-end speech translation feasible. The student will have to use the encoder-decoder architecture based on the Transformer to build multilingual speech translation systems.
    Nowadays, there is a growing interest in the field of Speech Translation (speech-to-text). Traditionally, this task has been faced with the concatenation of Automatic Speech Recognition and Machine Translation modules. Nevertheless, in the last few years, many researchers have proposed the use of an end-to-end approach, in which the speech is not transcribed but directly translated into the target language. Furthermore, there is a notable research trend in the use of self-supervision techniques to train speech encoders. These systems do not need human-annotated data for training, and they can extract much richer speech representations than other traditional methods. In this project, we explored the use of three pre-trained speech encoders (PASE+, APC and Wav2Vec) to improve end-to-end ST, using a Transformer as the core of our model. We trained it with the English-French split of the MuST-C corpus, comprising 492h of speech, and we developed our code on top of Fairseq, creating a repository which will facilitate our group's future research in ST. Our system did not achieve the results of the baseline, but we think that there is still room for improvement, and we believe we will be able to compete with state-of-the-art ST systems using pre-trained speech encoders in the future.
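    As an illustration of the kind of self-supervised speech representations involved, a minimal sketch extracting wav2vec 2.0 features with torchaudio's pretrained pipeline (this downloads pretrained weights on first use); a downstream translation model would consume these frame-level features. The specific bundle and shapes are assumptions for the example, not the project's actual setup.

        import torch
        import torchaudio

        # Pretrained wav2vec 2.0 model from torchaudio's pipelines.
        bundle = torchaudio.pipelines.WAV2VEC2_BASE
        model = bundle.get_model().eval()

        # Dummy 1-second waveform at the expected sample rate (16 kHz).
        waveform = torch.randn(1, bundle.sample_rate)

        with torch.no_grad():
            # extract_features returns one [batch, frames, dim] tensor per encoder layer.
            features, _ = model.extract_features(waveform)

        print(len(features), features[-1].shape)  # e.g. 12 layers of [1, 49, 768]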

    Final Exam

    No full text
    Solved. 2022/2023, 1st semester.