Improving N-gram Language Models with Pre-trained Deep Transformer
Although n-gram language models (LMs) have been outperformed by state-of-the-art
neural LMs, they are still widely used in speech recognition due to their high
inference efficiency. In this paper, we demonstrate that n-gram LMs can be
improved by neural LMs through a text-generation-based data augmentation method.
In contrast to previous approaches, we employ a large-scale general-domain
pre-training followed by in-domain fine-tuning strategy to construct deep
Transformer-based neural LMs. A large amount of in-domain text is generated with
the well-trained deep Transformer to construct new n-gram LMs, which are then
interpolated with the baseline n-gram systems. Empirical studies on different
speech recognition tasks show that the proposed approach effectively improves
recognition accuracy. In particular, it brings significant relative word error
rate reductions of up to 6.0% for domains with limited in-domain data.
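A minimal sketch of the augmentation-and-interpolation recipe described above: sample text from a Transformer LM (plain GPT-2 stands in here for the in-domain fine-tuned model), estimate an n-gram LM on the synthetic text, and linearly interpolate it with the baseline n-gram LM. The unsmoothed trigram estimator and the fixed interpolation weight are simplifications for illustration, not the authors' actual toolchain.

    # Sketch: neural-LM text generation -> trigram LM -> interpolation with baseline.
    from collections import Counter
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "gpt2"  # in practice: a checkpoint fine-tuned on in-domain text
    tok = AutoTokenizer.from_pretrained(MODEL)
    lm = AutoModelForCausalLM.from_pretrained(MODEL)

    def generate_corpus(prompts, n_per_prompt=10, max_new_tokens=40):
        """Sample synthetic in-domain sentences from the neural LM."""
        texts = []
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids
            out = lm.generate(ids, do_sample=True, top_k=50,
                              num_return_sequences=n_per_prompt,
                              max_new_tokens=max_new_tokens,
                              pad_token_id=tok.eos_token_id)
            texts += tok.batch_decode(out, skip_special_tokens=True)
        return texts

    def trigram_lm(sentences):
        """Maximum-likelihood trigram probabilities (unsmoothed, for illustration only)."""
        tri, bi = Counter(), Counter()
        for s in sentences:
            w = ["<s>", "<s>"] + s.split() + ["</s>"]
            for i in range(2, len(w)):
                tri[(w[i - 2], w[i - 1], w[i])] += 1
                bi[(w[i - 2], w[i - 1])] += 1
        return {k: v / bi[k[:2]] for k, v in tri.items()}

    def interpolate(p_synth, p_base, lam=0.5):
        """Linearly interpolate the synthetic-text LM with the baseline n-gram LM."""
        keys = set(p_synth) | set(p_base)
        return {k: lam * p_synth.get(k, 0.0) + (1 - lam) * p_base.get(k, 0.0) for k in keys}

    synthetic = generate_corpus(["the flight was delayed because"])
    p_new = interpolate(trigram_lm(synthetic), trigram_lm(["the flight was on time"]))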
Sample Efficient Text Summarization Using a Single Pre-Trained Transformer
Language model (LM) pre-training has resulted in impressive performance and
sample efficiency on a variety of language understanding tasks. However, it
remains unclear how to best use pre-trained LMs for generation tasks such as
abstractive summarization, particularly to enhance sample efficiency. In these
sequence-to-sequence settings, prior work has experimented with loading
pre-trained weights into the encoder and/or decoder networks, but used
non-pre-trained encoder-decoder attention weights. We instead use a pre-trained
decoder-only network, where the same Transformer LM both encodes the source and
generates the summary. This ensures that all parameters in the network,
including those governing attention over source states, have been pre-trained
before the fine-tuning step. Experiments on the CNN/Daily Mail dataset show
that our pre-trained Transformer LM substantially improves over pre-trained
Transformer encoder-decoder networks in limited-data settings. For instance, it
achieves 13.1 ROUGE-2 using only 1% of the training data (~3000 examples),
while pre-trained encoder-decoder models score 2.3 ROUGE-2.
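A minimal sketch of the decoder-only formulation described above, assuming a Hugging Face GPT-2 as the pre-trained Transformer LM: article and summary are concatenated into one sequence, and the loss is masked so only summary tokens are predicted, which means the attention over source tokens also starts from pre-trained weights. The gpt2 checkpoint and the eos-token separator are illustrative choices, not the paper's exact setup.

    # Sketch: fine-tune a single decoder-only LM on "article <sep> summary" sequences,
    # computing the LM loss only on the summary tokens.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    SEP = tok.eos_token_id  # stand-in separator between source and summary

    def make_example(article, summary, max_len=1024):
        src = tok(article, truncation=True, max_length=max_len // 2).input_ids
        tgt = tok(summary, truncation=True, max_length=max_len // 2).input_ids
        input_ids = src + [SEP] + tgt
        labels = [-100] * (len(src) + 1) + tgt   # -100 positions are ignored by the loss
        return torch.tensor([input_ids]), torch.tensor([labels])

    input_ids, labels = make_example("Some news article ...", "A short summary ...")
    loss = lm(input_ids=input_ids, labels=labels).loss   # attention over source is pre-trained too
    loss.backward()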
Evaluation of sentence embeddings in downstream and linguistic probing tasks
Despite the fast pace at which new sentence embedding methods appear, it is
still challenging to find comprehensive evaluations of these different
techniques. In recent years, we have seen significant improvements in the field
of sentence embeddings, especially towards the development of universal
sentence encoders that can provide inductive transfer to a wide variety of
downstream tasks. In this work, we perform a comprehensive evaluation of recent
methods using a wide variety of downstream and linguistic feature probing
tasks. We show that a simple bag-of-words approach built on a recently
introduced language model for deep context-dependent word embeddings yields
better results on many tasks than sentence encoders trained on entailment
datasets. We also show, however, that we are still far from a universal encoder
that can perform consistently across several downstream
tasks.
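A rough sketch of the bag-of-words-over-contextual-embeddings idea: mean-pool the contextual word vectors of a frozen encoder into a sentence vector and probe it with a linear classifier. The BERT-style encoder and the two-example logistic-regression probe below are illustrative; the paper's evaluation uses different models and full downstream and probing suites.

    # Sketch: sentence embedding = average of contextual word vectors, probed linearly.
    import numpy as np
    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = AutoModel.from_pretrained("bert-base-uncased")

    def embed(sentences):
        vecs = []
        with torch.no_grad():
            for s in sentences:
                ids = tok(s, return_tensors="pt", truncation=True)
                h = enc(**ids).last_hidden_state[0]   # (tokens, hidden)
                vecs.append(h.mean(dim=0).numpy())    # "bag of words" over contextual vectors
        return np.stack(vecs)

    # Probe: a simple classifier on top of the frozen sentence embeddings.
    X_train = embed(["a great movie", "a terrible movie"])
    clf = LogisticRegression().fit(X_train, [1, 0])
    print(clf.predict(embed(["really enjoyable film"])))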
Jasper: An End-to-End Convolutional Neural Acoustic Model
In this paper, we report state-of-the-art results on LibriSpeech among
end-to-end speech recognition models without any external training data. Our
model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout,
and residual connections. To improve training, we further introduce a new
layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that
the proposed deep architecture performs as well as or better than more complex
choices. Our deepest Jasper variant uses 54 convolutional layers. With this
architecture, we achieve 2.95% WER using a beam-search decoder with an external
neural language model and 3.86% WER with a greedy decoder on LibriSpeech
test-clean. We also report competitive results on the Wall Street Journal and
the Hub5'00 conversational evaluation datasets.
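A compact sketch of the kind of block the abstract describes: a repeated 1D convolution / batch norm / ReLU / dropout sub-block with a residual connection around the stack. The channel count, kernel size, and repeat factor below are illustrative, not Jasper's published configuration.

    # Sketch of a Jasper-style block built only from the listed components.
    import torch
    from torch import nn

    class JasperBlock(nn.Module):
        def __init__(self, channels, kernel_size=11, repeat=3, dropout=0.2):
            super().__init__()
            layers = []
            for _ in range(repeat):
                layers += [
                    nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
                    nn.BatchNorm1d(channels),
                    nn.ReLU(),
                    nn.Dropout(dropout),
                ]
            self.body = nn.Sequential(*layers)

        def forward(self, x):                      # x: (batch, channels, time)
            return torch.relu(x + self.body(x))    # residual connection over the stack

    block = JasperBlock(channels=256)
    print(block(torch.randn(2, 256, 100)).shape)   # -> torch.Size([2, 256, 100])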
Multi-scale Transformer Language Models
We investigate multi-scale transformer language models that learn
representations of text at multiple scales, and present three different
architectures that have an inductive bias to handle the hierarchical nature of
language. Experiments on large-scale language modeling benchmarks empirically
demonstrate favorable likelihood vs. memory footprint trade-offs; e.g., we show
that it is possible to train a hierarchical variant with 30 layers that has a
23% smaller memory footprint and better perplexity than a vanilla transformer
with less than half the number of layers, on the Toronto
BookCorpus. We analyze the advantages of learned representations at multiple
scales in terms of memory footprint, compute time, and perplexity, which are
particularly appealing given the quadratic scaling of transformers' run time
and memory usage with respect to sequence length.
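One generic way to picture such a hierarchical inductive bias (not one of the paper's three specific architectures): run part of the computation on a strided, coarser copy of the sequence and fuse the upsampled result back into the token-level stream. Causal masking is ignored here for brevity, and all dimensions are illustrative.

    # Sketch: a two-scale transformer block that mixes fine and pooled representations.
    import torch
    from torch import nn

    class TwoScaleBlock(nn.Module):
        def __init__(self, d_model=256, nhead=4, stride=2):
            super().__init__()
            self.fine = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.coarse = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.stride = stride

        def forward(self, x):                           # x: (batch, time, d_model)
            fine = self.fine(x)
            pooled = nn.functional.avg_pool1d(          # downsample along time
                x.transpose(1, 2), self.stride).transpose(1, 2)
            coarse = self.coarse(pooled)                # attention over a shorter sequence
            up = nn.functional.interpolate(             # upsample back to full length
                coarse.transpose(1, 2), size=x.size(1)).transpose(1, 2)
            return fine + up                            # fuse the two scales

    x = torch.randn(2, 64, 256)
    print(TwoScaleBlock()(x).shape)                     # -> torch.Size([2, 64, 256])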
A Comprehensive Survey of Grammar Error Correction
Grammar error correction (GEC) is an important application of natural language
processing techniques. The past decade has witnessed significant progress in
GEC, driven by the increasing popularity of machine learning and deep learning,
especially in the late 2010s when near human-level GEC systems became
available. However, no prior work has focused on a complete recapitulation of
this progress. We present the first survey of GEC, offering a comprehensive
retrospective of the literature in this area. We first introduce five public
datasets, data annotation schemas, two important shared tasks, and four
standard evaluation metrics. More importantly, we discuss four kinds of basic
approaches (statistical machine translation based, neural machine translation
based, classification based, and language model based), six commonly applied
performance-boosting techniques for GEC systems, and two data augmentation
methods. Since GEC is typically viewed as a sister task of machine translation,
many GEC systems are based on neural machine translation (NMT) approaches, in
which a neural sequence-to-sequence model is applied. Similarly, some
performance-boosting techniques are adapted from machine translation and are
successfully combined with GEC systems to enhance final performance.
Furthermore, we analyze basic approaches, performance-boosting techniques, and
integrated GEC systems on the basis of their experimental results to draw
clearer patterns and conclusions. Finally, we discuss five prospective
directions for future GEC research.
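As an illustration of the language-model-based family of approaches listed above, the sketch below scores single-word substitution candidates with an off-the-shelf GPT-2 and keeps the most fluent one. The tiny confusion set and the scoring choice are hypothetical simplifications, not a specific system from the surveyed literature.

    # Sketch: LM-based GEC by rescoring candidate edits for fluency.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    CONFUSION = {"a": ["an"], "an": ["a"], "their": ["there", "they're"]}

    def lm_score(sentence):
        """Negative mean cross-entropy under the LM (higher = more fluent)."""
        ids = tok(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(input_ids=ids, labels=ids).loss
        return -loss.item()

    def correct(sentence):
        """Try single-word substitutions from the confusion set; keep the best-scoring one."""
        best, best_score = sentence, lm_score(sentence)
        words = sentence.split()
        for i, w in enumerate(words):
            for alt in CONFUSION.get(w.lower(), []):
                cand = " ".join(words[:i] + [alt] + words[i + 1:])
                score = lm_score(cand)
                if score > best_score:
                    best, best_score = cand, score
        return best

    print(correct("I saw a elephant ."))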
Parallel Iterative Edit Models for Local Sequence Transduction
We present a Parallel Iterative Edit (PIE) model for the problem of local
sequence transduction arising in tasks like Grammatical error correction (GEC).
Recent approaches are based on the popular encoder-decoder (ED) model for
sequence to sequence learning. The ED model auto-regressively captures full
dependency among output tokens but is slow due to sequential decoding. The PIE
model does parallel decoding, giving up the advantage of modelling full
dependency in the output, yet it achieves accuracy competitive with the ED
model for four reasons: (1) predicting edits instead of tokens, (2) labeling
sequences instead of generating sequences, (3) iteratively refining predictions
to capture dependencies, and (4) factorizing logits over edits and their token
argument to harness pre-trained language models like BERT. Experiments on tasks
spanning GEC, OCR correction and spell correction demonstrate that the PIE
model is an accurate and significantly faster alternative for local sequence
transduction.
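A toy sketch of the parallel edit-then-iterate idea (points 1-3 above): a tagger assigns one edit label per source token, all edits are applied in a single parallel pass, and the pass is repeated so later rounds can pick up dependencies. The label set and toy_tagger below are illustrative; the actual PIE model also uses replacement and transformation edits and a BERT-based classifier.

    # Sketch: apply per-token edit labels in parallel, then iterate until stable.
    def apply_edits(tokens, labels):
        """labels[i] is KEEP, DELETE, or APPEND_<word> (insert <word> after token i)."""
        out = []
        for tok, lab in zip(tokens, labels):
            if lab == "DELETE":
                continue
            out.append(tok)
            if lab.startswith("APPEND_"):
                out.append(lab[len("APPEND_"):])
        return out

    def iterative_correct(tokens, tagger, max_rounds=3):
        """Run the parallel tagger until predictions stop changing (or max_rounds)."""
        for _ in range(max_rounds):
            labels = tagger(tokens)              # one parallel pass, no sequential decoding
            new_tokens = apply_edits(tokens, labels)
            if new_tokens == tokens:
                break
            tokens = new_tokens
        return tokens

    # Toy tagger standing in for the learned edit classifier.
    def toy_tagger(tokens):
        return ["APPEND_to" if t == "want" and "to" not in tokens else "KEEP" for t in tokens]

    print(iterative_correct("I want go home".split(), toy_tagger))   # -> ['I', 'want', 'to', 'go', 'home']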
Non-Parametric Adaptation for Neural Machine Translation
Neural networks trained with gradient descent are known to be susceptible to
catastrophic forgetting caused by parameter shift during the training process.
In the context of Neural Machine Translation (NMT) this results in poor
performance on heterogeneous datasets and on sub-tasks like rare phrase
translation. On the other hand, non-parametric approaches are immune to
forgetting, perfectly complementing the generalization ability of NMT. However,
attempts to combine non-parametric or retrieval based approaches with NMT have
only been successful on narrow domains, possibly due to over-reliance on
sentence level retrieval. We propose a novel n-gram level retrieval approach
that relies on local phrase level similarities, allowing us to retrieve
neighbors that are useful for translation even when overall sentence similarity
is low. We complement this with an expressive neural network, allowing our
model to extract information from the noisy retrieved context. We evaluate our
semi-parametric NMT approach on a heterogeneous dataset composed of WMT, IWSLT,
JRC-Acquis and OpenSubtitles, and demonstrate gains on all 4 evaluation sets.
The semi-parametric nature of our approach opens the door for non-parametric
domain adaptation, demonstrating strong inference-time adaptation performance
on new domains without the need for any parameter updates.
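A small sketch of n-gram level retrieval as described above: index the parallel training data by source n-grams and rank neighbors by how many n-grams they share with the query, so phrase-level matches surface even when whole-sentence similarity is low. The set-based index and the simple overlap count are illustrative stand-ins for the paper's retrieval machinery.

    # Sketch: retrieve training pairs by shared source n-grams rather than whole sentences.
    from collections import defaultdict

    def ngrams(tokens, n=3):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def build_index(parallel_corpus, n=3):
        """Map each source n-gram to the training pairs that contain it."""
        index = defaultdict(set)
        for pair_id, (src, _) in enumerate(parallel_corpus):
            for g in ngrams(src.split(), n):
                index[g].add(pair_id)
        return index

    def retrieve(query, parallel_corpus, index, n=3, k=2):
        """Rank training pairs by how many query n-grams they share."""
        scores = defaultdict(int)
        for g in ngrams(query.split(), n):
            for pair_id in index.get(g, ()):
                scores[pair_id] += 1
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        return [parallel_corpus[i] for i in top]

    corpus = [("the contract shall enter into force", "le contrat entre en vigueur"),
              ("see you at the cinema tonight", "on se voit au cinéma ce soir")]
    idx = build_index(corpus)
    print(retrieve("this agreement shall enter into force immediately", corpus, idx))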
Iterative Pseudo-Labeling for Speech Recognition
Pseudo-labeling has recently shown promise in end-to-end automatic speech
recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised
algorithm which efficiently performs multiple iterations of pseudo-labeling on
unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an
existing model at each iteration using both labeled data and a subset of
unlabeled data. We study the main components of IPL: decoding with a language
model and data augmentation. We then demonstrate the effectiveness of IPL by
achieving state-of-the-art word-error rate on the Librispeech test sets in both
standard and low-resource settings. We also study the effect of language models
trained on different corpora to show IPL can effectively utilize additional
text. Finally, we release a new large in-domain text corpus which does not
overlap with the Librispeech training transcriptions to foster research in
low-resource, semi-supervised ASR.
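A schematic of the IPL loop in the abstract's terms, with toy stand-ins (train, decode_with_lm, select_subset) for acoustic-model fine-tuning, LM-fused beam-search decoding, and per-iteration subset selection; only the control flow is meant to be faithful.

    # Sketch: iterative pseudo-labeling with toy stand-in components.
    import random

    def train(model, data, augment=True):        # stand-in: returns an "updated" model
        return {"step": model["step"] + 1, "seen": model["seen"] + len(data)}

    def decode_with_lm(model, audio):            # stand-in for beam search + LM fusion
        return f"hypothesis for {audio}"

    def select_subset(pairs, frac):              # stand-in for confidence-based filtering
        return random.sample(pairs, int(len(pairs) * frac))

    def iterative_pseudo_labeling(model, labeled, unlabeled, rounds=3, frac=0.5):
        for _ in range(rounds):
            pseudo = [(a, decode_with_lm(model, a)) for a in unlabeled]    # pseudo-label
            model = train(model, labeled + select_subset(pseudo, frac))   # fine-tune existing model
        return model

    model = iterative_pseudo_labeling({"step": 0, "seen": 0},
                                      labeled=[("utt1", "hello world")],
                                      unlabeled=["utt2", "utt3", "utt4"])
    print(model)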
The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents
We introduce dodecaDialogue: a set of 12 tasks that measures whether a
conversational agent can communicate engagingly with personality and empathy,
ask questions, answer questions by utilizing knowledge resources, discuss
topics and situations, and perceive and converse about images. By multi-tasking
on such a broad large-scale set of data, we hope to both move towards and
measure progress in producing a single unified agent that can perceive, reason
and converse with humans in an open-domain setting. We show that such
multi-tasking improves over a BERT pre-trained baseline, largely due to
multi-tasking with very large dialogue datasets in a similar domain, and that
the multi-tasking in general provides gains to both text and image-based tasks
using several metrics in both the fine-tune and task transfer settings. We
obtain state-of-the-art results on many of the tasks, providing a strong
baseline for this challenge.
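A sketch of the multi-task training pattern such an agent relies on: each step samples a batch from one of several dialogue datasets and applies a single update to the shared model. The dataset names, the size-weighted sampling scheme, and the model_step callback are placeholders, not the paper's exact recipe.

    # Sketch: one shared model updated on batches sampled from many dialogue tasks.
    import random

    def multitask_train(model_step, datasets, steps=1000):
        """`datasets` maps task name -> list of batches; larger tasks are sampled more often."""
        names = list(datasets)
        weights = [len(datasets[n]) for n in names]          # sample roughly by dataset size
        for _ in range(steps):
            task = random.choices(names, weights=weights)[0]
            batch = random.choice(datasets[task])
            model_step(task, batch)                          # one update of the shared model

    # Toy usage with placeholder "batches".
    data = {"convai2": [f"c{i}" for i in range(8)],
            "wizard_of_wikipedia": [f"w{i}" for i in range(4)],
            "image_chat": [f"i{i}" for i in range(2)]}
    multitask_train(lambda task, batch: None, data, steps=5)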