
    Adapting End-to-End Speech Recognition for Readable Subtitles

    Automatic speech recognition (ASR) systems are primarily evaluated on transcription accuracy. However, in some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time. Therefore, this work focuses on ASR with output compression, a task challenging for supervised approaches due to the scarcity of training data. We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech. We then compare several methods of end-to-end speech recognition under output length constraints. The experiments show that with limited data far less than needed for training a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities. Furthermore, the best performance in terms of WER and ROUGE scores is achieved by explicitly modeling the length constraints within the end-to-end ASR system. Comment: IWSLT 202

    Relative Positional Encoding for Speech Recognition and Direct Translation

    Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less suited to acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is the relative distance between input states in the self-attention network. As a result, the network can better adapt to the variable distributions present in speech data. Our experiments show that the resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result on the MuST-C speech translation benchmark. We also show that this model is able to better utilize synthetic data than the Transformer, and adapts better to variable sentence segmentation quality for speech translation. Comment: Submitted to Interspeech 202

    Consecutive Decoding for Speech-to-text Translation

    Speech-to-text translation (ST), which directly translates source-language speech into target-language text, has attracted intensive attention recently. However, the combination of speech recognition and machine translation in a single model poses a heavy burden on the direct cross-modal cross-lingual mapping. To reduce the learning difficulty, we propose COnSecutive Transcription and Translation (COSTT), an integral approach for speech-to-text translation. The key idea is to generate the source transcript and the target translation text with a single decoder. This benefits model training, since an additional large parallel text corpus can be fully exploited to enhance the speech translation training. Our method is verified on three mainstream datasets: the Augmented LibriSpeech English-French dataset, the TED English-German dataset, and the TED English-Chinese dataset. Experiments show that our proposed COSTT outperforms the previous state-of-the-art methods. The code is available at https://github.com/dqqcasia/st. Comment: Accepted by AAAI 2021. arXiv admin note: text overlap with arXiv:2009.0970

    CUE Vectors: Modular Training of Language Models Conditioned on Diverse Contextual Signals

    We propose a framework to modularize the training of neural language models that use diverse forms of sentence-external context (including metadata) by eliminating the need to jointly train sentence-external and within-sentence encoders. Our approach, contextual universal embeddings (CUE), trains LMs on one set of context, such as date and author, and adapts to novel metadata types, such as article title, or previous sentence. The model consists of a pretrained neural sentence LM, a BERT-based context encoder, and a masked transformer decoder that estimates LM probabilities using sentence-internal and sentence-external information. When context or metadata are unavailable, our model learns to combine contextual and sentence-internal information using noisy oracle unigram embeddings as a proxy. Real contextual information can be introduced later and used to adapt a small number of parameters that map contextual data into the decoder's embedding space. We validate the CUE framework on a NYTimes text corpus with multiple metadata types, for which the LM perplexity can be lowered from 36.6 to 27.4 by conditioning on context. Bootstrapping a contextual LM with only a subset of the context/metadata during training retains 85% of the achievable gain. Training the model initially with proxy context retains 67% of the perplexity gain after adapting to real context. Furthermore, we can swap one type of pretrained sentence LM for another without retraining the context encoders, by only adapting the decoder model. Overall, we obtain a modular framework that allows incremental, scalable training of context-enhanced LMs. Comment: To appear in Findings of ACL 202

    Linguistic-family-specific Encoders and Decoders for Multilingual Spoken Machine Translation

    This project provides a spoken language translation (SLT) system trained with the UN Parallel Corpus (UNPC) and MuST-C, aiming to study the correlation between languages of different linguistic families and performance on translation tasks. The SLT system consists of a text-to-text Neural Machine Translation model, whose dataset includes six languages from five linguistic families, and an Automated Speech Recognition model, whose dataset contains four languages from four linguistic families. The combined SLT system is an end-to-end system, which is a relatively new task, and the idea of this project is to analyze how different linguistic families perform when trained under the same conditions. Apart from measuring performance with BLEU scores, this project also performs fine-tuning and zero-shot translation tasks. In general, the obtained BLEU scores are good and similar to those of the original baseline models studied in the UNPC and MuST-C papers. The fine-tuning and zero-shot translation experiments also obtained reasonable results, supporting the hypothesized positive correlation between the closeness of languages and performance on the translation tasks.