
    Parallel Attention Forcing for Machine Translation

    Attention-based autoregressive models have achieved state-of-the-art performance in various sequence-to-sequence tasks, including Text-To-Speech (TTS) and Neural Machine Translation (NMT), but can be difficult to train. The standard training approach, teacher forcing, guides a model with the reference back-history. During inference, the generated back-history must be used instead. This mismatch limits evaluation performance. Attention forcing has been introduced to address the mismatch, guiding the model with the generated back-history and the reference attention. While successful in tasks with continuous outputs like TTS, attention forcing faces additional challenges in tasks with discrete outputs like NMT. This paper introduces two extensions of attention forcing to tackle these challenges. (1) Scheduled attention forcing automatically turns attention forcing on and off, which is essential for tasks with discrete outputs. (2) Parallel attention forcing makes training parallel, and is applicable to Transformer-based models. The experiments show that the proposed approaches improve the performance of models based on RNNs and Transformers.
    Comment: 13 pages, 8 figures. arXiv admin note: text overlap with arXiv:2104.0126
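
    The following is a minimal, illustrative sketch (not the paper's implementation) of the three decoding regimes this abstract contrasts for an attention-based RNN decoder: teacher forcing (reference back-history), free running (generated back-history), and attention forcing (generated back-history with the reference attention substituted in). The toy decoder, its dimensions, and the mode names are assumptions made for illustration only.

    # Sketch only: ToyAttnDecoder and the mode flags are hypothetical.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyAttnDecoder(nn.Module):
        def __init__(self, vocab_size, hidden_size):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.rnn = nn.GRUCell(2 * hidden_size, hidden_size)
            self.out = nn.Linear(hidden_size, vocab_size)

        def step(self, prev_token, state, enc_states, forced_attn=None):
            # enc_states: (batch, src_len, hidden); state: (batch, hidden)
            scores = torch.bmm(state.unsqueeze(1), enc_states.transpose(1, 2))  # (batch, 1, src_len)
            attn = F.softmax(scores, dim=-1)
            if forced_attn is not None:           # attention forcing: override with reference attention
                attn = forced_attn.unsqueeze(1)
            context = torch.bmm(attn, enc_states).squeeze(1)                    # (batch, hidden)
            state = self.rnn(torch.cat([self.embed(prev_token), context], dim=-1), state)
            return self.out(state), state, attn.squeeze(1)

    def decode(decoder, enc_states, ref_tokens, ref_attn=None, mode="teacher"):
        # ref_tokens: (batch, tgt_len) with <bos> at position 0; ref_attn: (batch, tgt_len, src_len)
        state = enc_states.mean(dim=1)            # toy initial decoder state
        prev = ref_tokens[:, 0]
        logits_per_step = []
        for t in range(1, ref_tokens.size(1)):
            forced = ref_attn[:, t] if (mode == "attention" and ref_attn is not None) else None
            logits, state, _ = decoder.step(prev, state, enc_states, forced)
            logits_per_step.append(logits)
            prev = ref_tokens[:, t] if mode == "teacher" else logits.argmax(dim=-1)
        return torch.stack(logits_per_step, dim=1)  # (batch, tgt_len - 1, vocab)

    Under mode="teacher" the loop reproduces standard teacher forcing; mode="attention" feeds the generated back-history while forcing the reference attention, which is the basic mechanism that scheduled and parallel attention forcing build on.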

    Deliberation Networks and How to Train Them

    Deliberation networks are a family of sequence-to-sequence models that have achieved state-of-the-art performance in a wide range of tasks such as machine translation and speech synthesis. A deliberation network consists of multiple standard sequence-to-sequence models, each one conditioned on the initial input and the output of the previous model. During training, there are several key questions: whether to apply Monte Carlo approximation to the gradients or to the loss, whether to train the standard models jointly or separately, whether to run an intermediate model in teacher forcing or free running mode, and whether to apply task-specific techniques. Previous work on deliberation networks typically explores one or two training options for a specific task. This work introduces a unifying framework covering the various training options, and addresses the above questions. In general, it is simpler to approximate the gradients. When parallel training is essential, separate training should be adopted. Regardless of the task, the intermediate model should be run in free running mode. For tasks where the output is continuous, a guided attention loss can be used to prevent degradation into a standard model.
    Comment: 10 pages, 2 figures
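
    As a concrete (and purely illustrative) rendering of the separate-training recipe this abstract recommends, the sketch below assumes two seq2seq models with a hypothetical interface: the intermediate model produces a draft in free running mode with gradients blocked, and the second pass is trained on the source plus the draft. Neither the interface nor the method names come from the paper.

    # Hypothetical interface: first_pass.greedy_decode(src) -> draft token ids,
    # second_pass.loss(src, draft, ref) -> scalar training loss.
    import torch

    def deliberation_training_step(first_pass, second_pass, optimizer, src, ref):
        # Intermediate model in free running mode; torch.no_grad() keeps training
        # of the two passes separate (no gradient flows into first_pass here).
        with torch.no_grad():
            draft = first_pass.greedy_decode(src)

        # Second pass is conditioned on both the original input and the draft,
        # and trained with ordinary teacher forcing against the reference.
        loss = second_pass.loss(src, draft, ref)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()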

    Exploring teacher forcing techniques for sequence-to-sequence abstractive headline summarization

    Every internet user today is exposed to countless article headlines. These can range from informative, to sensationalist, to downright misleading. These snippets of information can have tremendous impacts on those exposed and can shape one's views on a subject before even reading the associated article. For these reasons and more, it is important that the Natural Language Processing community turn its attention towards this critical part of everyday life by improving current abstractive text summarization techniques. To aid in that endeavor, this project explores various methods of teacher forcing, a technique used during model training for sequence-to-sequence recurrent neural network architectures. A relatively new deep learning library called PyTorch has made experimentation with teacher forcing accessible for the first time and is utilized for this purpose in the project. Additionally, to the author's best knowledge this is the first implementation of abstractive headline summarization in PyTorch. Seven different teacher forcing techniques were designed and experimented with: (1) constant levels of 0%, 25%, 50%, 75%, and 100% teacher forcing probability through the entire training cycle; and (2) two graduated techniques: one that decreased linearly from 100% to 0% over the entire training cycle to convergence, and another that stepped down from 100% to 0% every 12.5% of the training cycle, often coinciding with learning rate annealing. Dozens of generative sequence-to-sequence models were trained with these techniques to observe their differences. The seven teacher forcing techniques were compared to one another via two metrics: (1) ROUGE F-scores, the most common metric used in this field; and (2) average loss over time. Counter to what was expected, this project shows with statistical significance that constant 100% and 75% teacher forcing produced better ROUGE scores than any of the other techniques. These results confirm the use of 100% teacher forcing, the most widely used technique today. However, this throws into question an important assumption by many leading machine learning researchers that dynamic, graduated teacher forcing techniques should result in greater model performance. Questions of ROUGE metric validity, response to more complicated model parameters, and domain specificity are encouraged for further analysis.
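
    The three schedule families described in this abstract (constant, linear decay, and stepwise decay of the teacher forcing probability) can be sketched as follows. This is an illustrative reconstruction, not the project's code; the exact staircase used in the thesis may differ.

    # Illustrative teacher forcing schedules; `progress` is the fraction of the
    # training cycle completed, in [0, 1].
    import random

    def constant(p):
        # e.g. p in {0.0, 0.25, 0.5, 0.75, 1.0}
        return lambda progress: p

    def linear_decay(progress):
        # 100% -> 0% linearly over the whole training cycle
        return 1.0 - progress

    def step_decay(progress, step=0.125):
        # Staircase from 100% toward 0%, dropping every 12.5% of the cycle
        # (often aligned with learning rate annealing).
        return max(0.0, 1.0 - step * int(progress / step))

    def use_teacher_forcing(schedule, progress):
        # At each decoding step (or batch), feed the reference token with the
        # scheduled probability, otherwise feed the model's own prediction.
        return random.random() < schedule(progress)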

    Sentic Computing for Aspect-Based Opinion Summarization Using Multi-Head Attention with Feature Pooled Pointer Generator Network

    Neural sequence-to-sequence models have achieved superlative performance in summarizing text, but they tend to generate generic summaries that under-represent the opinion-sensitive aspects of the document. Additionally, sequence-to-sequence models are prone to a test-train discrepancy (exposure bias) arising from the differing summary decoding processes in the training and testing phases: the models use ground-truth summary words in the decoder during training and their own predicted outputs at test time. This inconsistency leads to error accumulation and substandard performance. To address these gaps, a cognitive aspect-based opinion summarizer, Feature Pooled Pointer Generator Network (FP2GN), is proposed, which selectively attends to thematic and contextual cues to generate sentiment-aware review summaries. This study augments the pointer-generator framework with opinion feature extraction, feature pooling, and a mutual attention mechanism for opinion summarization. The proposed model FP2GN identifies the aspect terms in review text using sentic computing (SenticNet 5 and concept frequency-inverse opinion frequency) and statistical feature engineering. These aspect terms are encoded into context embeddings using weighted average feature pooling, which are processed by a stacked Bi-LSTM encoder–decoder model with multi-head self-attention, inspired by the pointer-generator framework. The decoder uses temporal and mutual attention mechanisms to ensure an appropriate representation of the input sequence. The study also proffers the use of a teacher forcing ratio to curtail exposure-bias-related error accumulation. The model achieves a ROUGE-1 score of 86.04% and a ROUGE-L score of 88.51% on the Amazon Fine Foods dataset. An average gain of 2% over other methods is observed. The proposed model reinforces the pointer generator network architecture with opinion feature extraction, feature pooling, and mutual attention to generate human-readable opinion summaries. Empirical analysis substantiates that the proposed model is better than the baseline opinion summarizers.
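
    The weighted average feature pooling step can be pictured with the short sketch below. It is an assumption-laden illustration, not the FP2GN implementation: the weighting scheme (here, arbitrary importance scores normalised into a convex combination) stands in for whatever scores the paper derives from sentic computing and statistical feature engineering.

    # Pool aspect-term embeddings into one context embedding per review.
    import torch

    def weighted_average_pool(aspect_embeddings, weights):
        # aspect_embeddings: (num_aspects, dim) embeddings of detected aspect terms
        # weights:           (num_aspects,) non-negative importance scores
        w = weights / weights.sum()                              # convex combination
        return (w.unsqueeze(1) * aspect_embeddings).sum(dim=0)   # (dim,)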

    A differentiable BLEU loss. Analysis and first results

    In natural language generation tasks such as neural machine translation and image captioning, there is usually a mismatch between the optimized loss and the de facto evaluation criterion, namely token-level maximum likelihood versus corpus-level BLEU score. This article tries to reduce this gap by defining differentiable computations of the BLEU and GLEU scores. We test this approach on simple tasks, obtaining valuable lessons on its potential applications but also its pitfalls, mainly that these loss functions push each token in the hypothesis sequence toward the average of the tokens in the reference, resulting in a poor training signal.
    Peer Reviewed. Postprint (published version)
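
    To make the idea of a differentiable BLEU-style loss concrete, here is a minimal sketch of a differentiable unigram-precision surrogate. It is a simplified stand-in, not the article's BLEU/GLEU construction: the hypothesis is kept as softmax distributions so that expected clipped token counts remain differentiable.

    import torch
    import torch.nn.functional as F

    def soft_unigram_precision(hyp_logits, ref_ids, vocab_size):
        # hyp_logits: (hyp_len, vocab) unnormalised decoder outputs
        # ref_ids:    (ref_len,) reference token ids
        hyp_probs = F.softmax(hyp_logits, dim=-1)              # (hyp_len, vocab)
        expected_counts = hyp_probs.sum(dim=0)                 # expected count of each vocab item
        ref_counts = torch.bincount(ref_ids, minlength=vocab_size).float()
        clipped = torch.minimum(expected_counts, ref_counts)   # clipped matches, as in BLEU
        return clipped.sum() / hyp_logits.shape[0]             # differentiable scalar in [0, 1]

    Minimising 1 - soft_unigram_precision(...) gives a differentiable sequence-level signal, but it inherits the pitfall noted above: spreading probability mass over the reference tokens can score as well as committing to the right token at each position.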