2,666 research outputs found
Adapting End-to-End Speech Recognition for Readable Subtitles
Automatic speech recognition (ASR) systems are primarily evaluated on
transcription accuracy. However, in some use cases such as subtitling, verbatim
transcription would reduce output readability given limited screen size and
reading time. Therefore, this work focuses on ASR with output compression, a
task challenging for supervised approaches due to the scarcity of training
data. We first investigate a cascaded system, where an unsupervised compression
model is used to post-edit the transcribed speech. We then compare several
methods of end-to-end speech recognition under output length constraints. The
experiments show that with limited data far less than needed for training a
model from scratch, we can adapt a Transformer-based ASR model to incorporate
both transcription and compression capabilities. Furthermore, the best
performance in terms of WER and ROUGE scores is achieved by explicitly modeling
the length constraints within the end-to-end ASR system.Comment: IWSLT 202
Relative Positional Encoding for Speech Recognition and Direct Translation
Transformer models are powerful sequence-to-sequence architectures that are
capable of directly mapping speech inputs to transcriptions or translations.
However, the mechanism for modeling positions in this model was tailored for
text modeling, and thus is less ideal for acoustic inputs. In this work, we
adapt the relative position encoding scheme to the Speech Transformer, where
the key addition is relative distance between input states in the
self-attention network. As a result, the network can better adapt to the
variable distributions present in speech data. Our experiments show that our
resulting model achieves the best recognition result on the Switchboard
benchmark in the non-augmentation condition, and the best published result in
the MuST-C speech translation benchmark. We also show that this model is able
to better utilize synthetic data than the Transformer, and adapts better to
variable sentence segmentation quality for speech translation.Comment: Submitted to Interspeech 202
Consecutive Decoding for Speech-to-text Translation
Speech-to-text translation (ST), which directly translates the source
language speech to the target language text, has attracted intensive attention
recently. However, the combination of speech recognition and machine
translation in a single model poses a heavy burden on the direct cross-modal
cross-lingual mapping. To reduce the learning difficulty, we propose
COnSecutive Transcription and Translation (COSTT), an integral approach for
speech-to-text translation. The key idea is to generate source transcript and
target translation text with a single decoder. It benefits the model training
so that additional large parallel text corpus can be fully exploited to enhance
the speech translation training. Our method is verified on three mainstream
datasets, including Augmented LibriSpeech English-French dataset, TED
English-German dataset, and TED English-Chinese dataset. Experiments show that
our proposed COSTT outperforms the previous state-of-the-art methods. The code
is available at https://github.com/dqqcasia/st.Comment: Accepted by AAAI 2021. arXiv admin note: text overlap with
arXiv:2009.0970
CUE Vectors: Modular Training of Language Models Conditioned on Diverse Contextual Signals
We propose a framework to modularize the training of neural language models
that use diverse forms of sentence-external context (including metadata) by
eliminating the need to jointly train sentence-external and within-sentence
encoders. Our approach, contextual universal embeddings (CUE), trains LMs on
one set of context, such as date and author, and adapts to novel metadata
types, such as article title, or previous sentence. The model consists of a
pretrained neural sentence LM, a BERT-based context encoder, and a masked
transformer decoder that estimates LM probabilities using sentence-internal and
sentence-external information. When context or metadata are unavailable, our
model learns to combine contextual and sentence-internal information using
noisy oracle unigram embeddings as a proxy. Real contextual information can be
introduced later and used to adapt a small number of parameters that map
contextual data into the decoder's embedding space. We validate the CUE
framework on a NYTimes text corpus with multiple metadata types, for which the
LM perplexity can be lowered from 36.6 to 27.4 by conditioning on context.
Bootstrapping a contextual LM with only a subset of the context/metadata during
training retains 85\% of the achievable gain. Training the model initially with
proxy context retains 67% of the perplexity gain after adapting to real
context. Furthermore, we can swap one type of pretrained sentence LM for
another without retraining the context encoders, by only adapting the decoder
model. Overall, we obtain a modular framework that allows incremental, scalable
training of context-enhanced LMs.Comment: To appear in Findings of ACL 202
Linguistic-family-specific Encoders and Decoders for Multilingual Spoken Machine Translation
This project provides a spoken language translation system trained with UN Parallel Corpus and MuST-C, aiming at study the correlation between languages of different linguistic families and the performance of the translation tasks. This SLT system consists of a text-to-text Neural Machine Translation model, whose dataset includes six languages from five linguistic families, and a Automated Speech Recognition model, using dataset that contains four languages from four linguistic families. The combined SLT system is an end2end system, which is a relatively new task, and in this project, the idea is to analyze how would different linguistic families perform when training under the same conditions. Apart from measuring the performance using BLEU score system, this project also performs fine-tuning and zero-shot translation tasks. In general, the obtained BLEU scores are good and similar to original baseline models studies in UNPC and MuST-C papers. Finetuning and zero-shot translation experiments also obtained reasonable results, proving the hypothesized positive correlation between the closeness of languages and the performances of the translation tasks
- …