Acoustic-to-Word Models with Conversational Context Information
Conversational context information, higher-level knowledge that spans sentences, can help in recognizing a long conversation. However, existing speech recognition models are typically built at the sentence level and thus may not capture important conversational context information. The recent progress in
end-to-end speech recognition enables integrating context with other available
information (e.g., acoustic, linguistic resources) and directly recognizing
words from speech. In this work, we present a direct acoustic-to-word,
end-to-end speech recognition model capable of utilizing the conversational
context to better process long conversations. We evaluate our proposed approach
on the Switchboard conversational speech corpus and show that our system
outperforms a standard end-to-end speech recognition system.
Comment: NAACL 2019. arXiv admin note: text overlap with arXiv:1808.0217
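The abstract does not spell out how the conversational context enters the model, so the following is only a speculative sketch of one common way to do it: pooling an embedding of the previous utterance's hypothesis into a context vector that is fed to the decoder at every step. All module names and dimensions below are hypothetical, not the paper's design.

```python
# Speculative sketch: condition an acoustic-to-word decoder step on a
# conversational-context vector built from the previous utterance's hypothesis.
import torch
import torch.nn as nn

class ContextConditionedDecoderStep(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # decoder input = previous word embedding + context vector
        self.rnn = nn.GRUCell(emb_dim + emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def context_vector(self, prev_utt_ids):
        # mean-pool the word embeddings of the previous utterance
        return self.embed(prev_utt_ids).mean(dim=1)

    def forward(self, prev_word_ids, prev_utt_ids, hidden):
        ctx = self.context_vector(prev_utt_ids)                  # (B, emb_dim)
        inp = torch.cat([self.embed(prev_word_ids), ctx], dim=-1)
        hidden = self.rnn(inp, hidden)
        return self.out(hidden), hidden

# toy usage: one decoding step with a 3-word previous utterance as context
dec = ContextConditionedDecoderStep()
logits, h = dec(torch.tensor([4]), torch.tensor([[7, 8, 9]]), torch.zeros(1, 512))
```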
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progresses made in deep learning based
acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We then describe acoustic models that are optimized end-to-end, with emphasis on feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research.
Comment: This is an updated version, with the latest literature up to ICASSP 2018, of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based Acoustic Models," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, 201
Stream attention-based multi-array end-to-end speech recognition
Automatic Speech Recognition (ASR) using multiple microphone arrays has achieved great success in far-field robustness. Taking advantage of all the
information that each array shares and contributes is crucial in this task.
Motivated by the advances of joint Connectionist Temporal Classification
(CTC)/attention mechanism in the End-to-End (E2E) ASR, a stream attention-based
multi-array framework is proposed in this work. Microphone arrays, acting as
information streams, are each processed by a separate encoder and decoded under the guidance of both the CTC and attention networks. In terms of attention, a
hierarchical structure is adopted. On top of the regular attention networks,
stream attention is introduced to steer the decoder toward the most informative
encoders. Experiments have been conducted on AMI and DIRHA multi-array corpora
using the encoder-decoder architecture. Compared with the best single-array
results, the proposed framework achieves relative Word Error Rate (WER) reductions of 3.7% and 9.7% on the two datasets, respectively, also outperforming conventional strategies.
Comment: Submitted to ICASSP 201
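A hedged sketch of the hierarchical attention idea described above: a regular attention per array produces one context vector per stream, and a second, stream-level attention weights those contexts before they reach the shared decoder. The module names and dimensions are assumptions, not the authors' implementation.

```python
# Hierarchical (stream) attention sketch: frame-level attention per array,
# then stream-level attention over the resulting context vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamAttention(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=256):
        super().__init__()
        self.frame_score = nn.Linear(enc_dim + dec_dim, 1)    # per-frame attention
        self.stream_score = nn.Linear(enc_dim + dec_dim, 1)   # per-stream attention

    def forward(self, enc_outputs, dec_state):
        # enc_outputs: list of (B, T_i, enc_dim), one per microphone array
        contexts = []
        for enc in enc_outputs:
            q = dec_state.unsqueeze(1).expand(-1, enc.size(1), -1)
            a = F.softmax(self.frame_score(torch.cat([enc, q], -1)), dim=1)
            contexts.append((a * enc).sum(dim=1))              # (B, enc_dim)
        ctx = torch.stack(contexts, dim=1)                     # (B, n_streams, enc_dim)
        q = dec_state.unsqueeze(1).expand(-1, ctx.size(1), -1)
        w = F.softmax(self.stream_score(torch.cat([ctx, q], -1)), dim=1)
        return (w * ctx).sum(dim=1)                            # fused context for decoder

# toy usage: two arrays with different numbers of frames
att = StreamAttention()
fused = att([torch.randn(1, 80, 256), torch.randn(1, 75, 256)], torch.randn(1, 256))
```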
Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition
The performance of automatic speech recognition systems degrades with
increasing mismatch between the training and testing scenarios. Differences in
speaker accents are a significant source of such mismatch. The traditional
approach to deal with multiple accents involves pooling data from several
accents during training and building a single model in multi-task fashion,
where tasks correspond to individual accents. In this paper, we explore an
alternate model where we jointly learn an accent classifier and a multi-task
acoustic model. Experiments on the American English Wall Street Journal and
British English Cambridge corpora demonstrate that our joint model outperforms
the strong multi-task acoustic model baseline. We obtain a 5.94% relative
improvement in word error rate on British English and a 9.47% relative improvement on American English. This illustrates that jointly modeling with accent information improves acoustic model performance.
Comment: Accepted at the 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018)
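As a rough illustration of the joint model, the sketch below shares an encoder between a frame-level acoustic-model head and an utterance-level accent classifier, and sums the two cross-entropy losses. The layer sizes, the loss weight, and the use of mean pooling for the accent branch are assumptions, not the paper's exact configuration.

```python
# Joint accent classifier + multi-task acoustic model (illustrative sketch).
import torch
import torch.nn as nn

class JointAccentAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hid_dim=320, n_senones=3000, n_accents=2):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hid_dim, num_layers=3, batch_first=True)
        self.senone_out = nn.Linear(hid_dim, n_senones)   # acoustic-model task
        self.accent_out = nn.Linear(hid_dim, n_accents)   # accent-classifier task

    def forward(self, feats):
        h, _ = self.encoder(feats)                         # (B, T, hid_dim)
        return self.senone_out(h), self.accent_out(h.mean(dim=1))

model = JointAccentAcousticModel()
senone_logits, accent_logits = model(torch.randn(4, 200, 40))
senone_tgt = torch.randint(0, 3000, (4, 200))
accent_tgt = torch.randint(0, 2, (4,))
loss = (nn.CrossEntropyLoss()(senone_logits.transpose(1, 2), senone_tgt)
        + 0.5 * nn.CrossEntropyLoss()(accent_logits, accent_tgt))   # weighted sum
loss.backward()
```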
Towards End-to-end Automatic Code-Switching Speech Recognition
Speech recognition in mixed languages is difficult to fit into the end-to-end framework due to the lack of data and to overlapping phone sets, for example in words such as "one" in English and "wàn" in Chinese. We propose a CTC-based
end-to-end automatic speech recognition model for intra-sentential
English-Mandarin code-switching. The model is first trained jointly on monolingual datasets and then fine-tuned on the mixed-language corpus. During
the decoding process, we apply beam search and combine the CTC predictions with a language model score. The proposed method is effective at leveraging monolingual corpora and detecting language transitions, and it improves the CER by 5%.
Comment: Submitted to ICASSP 201
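The decoding-time score combination is essentially a log-linear interpolation of the CTC score and the language model score for each beam hypothesis. The sketch below shows only that ranking step; the interfaces and the LM weight are illustrative assumptions.

```python
# Shallow-fusion style rescoring of beam-search hypotheses (illustrative).
import math

def combined_score(ctc_log_prob, lm_log_prob, lm_weight=0.3):
    """Log-linear combination used to rank beam-search hypotheses."""
    return ctc_log_prob + lm_weight * lm_log_prob

def prune_beam(hypotheses, beam_size=10):
    """hypotheses: list of (tokens, ctc_log_prob, lm_log_prob)."""
    scored = sorted(hypotheses,
                    key=lambda h: combined_score(h[1], h[2]),
                    reverse=True)
    return scored[:beam_size]

# toy usage: two competing code-switched hypotheses
beam = [(["one", "万"], math.log(0.4), math.log(0.2)),
        (["万", "one"], math.log(0.3), math.log(0.5))]
print(prune_beam(beam, beam_size=1))
```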
Acoustic-To-Word Model Without OOV
Recently, the acoustic-to-word model based on the Connectionist Temporal
Classification (CTC) criterion was shown as a natural end-to-end model directly
targeting words as output units. However, this type of word-based CTC model
suffers from the out-of-vocabulary (OOV) issue, as it can only model a limited number of words in the output layer and maps all remaining words to an OOV output node. Therefore, such a word-based CTC model can only recognize the frequent words modeled by the network output nodes. It also cannot easily handle hot words that emerge after the model is trained. In this study, we
improve the acoustic-to-word model with a hybrid CTC model which can predict
both words and characters at the same time. With a shared-hidden-layer
structure and modular design, the alignments of words generated from the
word-based CTC and the character-based CTC are synchronized. Whenever the
acoustic-to-word model emits an OOV token, we back off that OOV segment to the
word output generated from the character-based CTC, hence solving the OOV or
hot-words issue. Evaluated on a Microsoft Cortana voice assistant task, the
proposed model can reduce the errors introduced by the OOV output token in the
acoustic-to-word model by 30%.
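A simplified sketch of the back-off step: wherever the word-level branch emits the OOV token, the word spelled out by the character-level branch for that segment is substituted. The alignment synchronization between the two CTC branches is assumed to have been done upstream, and all names are illustrative.

```python
# Back off OOV segments of the word-level hypothesis to the character-level output.
OOV = "<oov>"

def backoff_oov(word_hyp, char_hyp_segments):
    """word_hyp: word-CTC outputs, one per segment.
    char_hyp_segments: character-CTC output for the same segments,
    already collapsed into words (alignment sync assumed done upstream)."""
    merged = []
    for word, char_word in zip(word_hyp, char_hyp_segments):
        merged.append(char_word if word == OOV else word)
    return merged

# toy usage: "play <oov> songs" where the OOV segment spells out "cortana"
print(backoff_oov(["play", OOV, "songs"], ["play", "cortana", "songs"]))
# -> ['play', 'cortana', 'songs']
```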
On the Inductive Bias of Word-Character-Level Multi-Task Learning for Speech Recognition
End-to-end automatic speech recognition (ASR) commonly transcribes audio
signals into sequences of characters, while its performance is evaluated by measuring the word error rate (WER). This suggests that directly predicting sequences of words may instead be helpful. However, training with word-level
supervision can be more difficult due to the sparsity of examples per label
class. In this paper we analyze an end-to-end ASR model that combines a
word-and-character representation in a multi-task learning (MTL) framework. We
show that it improves on the WER and study how the word-level model can benefit
from character-level supervision by analyzing the learned inductive preference
bias of each model component empirically. We find that by adding
character-level supervision, the MTL model interpolates between recognizing
more frequent words (preferred by the word-level model) and shorter words
(preferred by the character-level model).
Comment: Accepted at the IRASL workshop at NeurIPS 201
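A hedged sketch of such a word-and-character multi-task objective: a shared encoder feeds two CTC branches, and their losses are interpolated with a weight. The layer sizes, blank-index convention, and weight value are assumptions for illustration, not the paper's setup.

```python
# Word + character multi-task CTC objective over a shared encoder (sketch).
import torch
import torch.nn as nn

class WordCharMTL(nn.Module):
    def __init__(self, feat_dim=80, hid=320, n_words=10000, n_chars=30):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hid, num_layers=4, batch_first=True)
        self.word_head = nn.Linear(hid, n_words + 1)   # +1 for the CTC blank
        self.char_head = nn.Linear(hid, n_chars + 1)

    def forward(self, feats):
        h, _ = self.encoder(feats)
        return (self.word_head(h).log_softmax(-1),
                self.char_head(h).log_softmax(-1))

def mtl_loss(word_lp, char_lp, word_tgt, char_tgt, in_len, wt_len, ct_len, lam=0.5):
    # word_lp/char_lp: (B, T, C) log-probabilities; CTCLoss expects (T, B, C)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    word_loss = ctc(word_lp.transpose(0, 1), word_tgt, in_len, wt_len)
    char_loss = ctc(char_lp.transpose(0, 1), char_tgt, in_len, ct_len)
    return lam * word_loss + (1.0 - lam) * char_loss   # interpolated MTL loss
```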
Sequence-to-Sequence Models Can Directly Translate Foreign Speech
We present a recurrent encoder-decoder deep neural network architecture that
directly translates speech in one language into text in another. The model does
not explicitly transcribe the speech into text in the source language, nor does
it require supervision from the ground truth source language transcription
during training. We apply a slightly modified attention-based sequence-to-sequence architecture that has previously been used for speech recognition and
show that it can be repurposed for this more complex task, illustrating the
power of attention-based models. A single model trained end-to-end obtains
state-of-the-art performance on the Fisher Callhome Spanish-English speech
translation task, outperforming a cascade of independently trained
sequence-to-sequence speech recognition and machine translation models by 1.8
BLEU points on the Fisher test set. In addition, we find that making use of the
training data in both languages by multi-task training sequence-to-sequence
speech translation and recognition models with a shared encoder network can
improve performance by a further 1.4 BLEU points.
Comment: 5 pages, 1 figure. Interspeech 201
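The shared-encoder multi-task setup can be pictured as one speech encoder feeding two decoders: one producing source-language transcripts (recognition) and one producing target-language text (translation). The schematic below omits the attention mechanism for brevity; dimensions and module choices are assumptions.

```python
# Shared encoder with separate ASR and translation decoders (schematic sketch).
import torch
import torch.nn as nn

class SharedEncoderST(nn.Module):
    def __init__(self, feat_dim=80, hid=256, src_vocab=90, tgt_vocab=90):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hid, num_layers=3, batch_first=True)
        self.asr_decoder = nn.LSTM(hid, hid, batch_first=True)
        self.st_decoder = nn.LSTM(hid, hid, batch_first=True)
        self.asr_out = nn.Linear(hid, src_vocab)   # source-language characters
        self.st_out = nn.Linear(hid, tgt_vocab)    # target-language characters

    def forward(self, feats):
        enc, _ = self.encoder(feats)
        # attention omitted for brevity; both decoders read the same encoder states
        asr_h, _ = self.asr_decoder(enc)
        st_h, _ = self.st_decoder(enc)
        return self.asr_out(asr_h), self.st_out(st_h)

model = SharedEncoderST()
asr_logits, st_logits = model(torch.randn(2, 120, 80))
```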
Jasper: An End-to-End Convolutional Neural Acoustic Model
In this paper, we report state-of-the-art results on LibriSpeech among
end-to-end speech recognition models without any external training data. Our
model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout,
and residual connections. To improve training, we further introduce a new
layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that
the proposed deep architecture performs as well or better than more complex
choices. Our deepest Jasper variant uses 54 convolutional layers. With this
architecture, we achieve 2.95% WER using a beam-search decoder with an external
neural language model and 3.86% WER with a greedy decoder on LibriSpeech
test-clean. We also report competitive results on the Wall Street Journal and
the Hub5'00 conversational evaluation datasets.
Comment: Accepted to INTERSPEECH 201
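Since the abstract enumerates the building blocks (1D convolutions, batch normalization, ReLU, dropout, residual connections), here is a minimal sketch of one such block in PyTorch. The kernel size, channel count, and placement of the residual addition are illustrative, not the paper's exact configuration.

```python
# One Jasper-style block: repeated (Conv1d -> BatchNorm -> ReLU -> Dropout)
# with a residual connection around the whole block.
import torch
import torch.nn as nn

class JasperBlock(nn.Module):
    def __init__(self, channels=256, kernel_size=11, repeat=3, dropout=0.2):
        super().__init__()
        layers = []
        for _ in range(repeat):
            layers += [nn.Conv1d(channels, channels, kernel_size,
                                 padding=kernel_size // 2),
                       nn.BatchNorm1d(channels),
                       nn.ReLU(),
                       nn.Dropout(dropout)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):                       # x: (B, channels, T)
        return torch.relu(x + self.body(x))     # residual connection

block = JasperBlock()
out = block(torch.randn(2, 256, 100))           # -> (2, 256, 100)
```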
Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR
Recently, a few novel streaming attention-based sequence-to-sequence (S2S)
models have been proposed to perform online speech recognition with linear-time
decoding complexity. However, in these models, the decisions to generate tokens
are delayed compared to the actual acoustic boundaries since their
unidirectional encoders lack future information. This leads to an inevitable
latency during inference. To alleviate this issue and reduce latency, we
propose several strategies during training by leveraging external hard
alignments extracted from the hybrid model. We investigate utilizing the alignments in both the encoder and the decoder. On the encoder side, (1)
multi-task learning and (2) pre-training with the framewise classification task
are studied. On the decoder side, we (3) remove inappropriate alignment paths
beyond an acceptable latency during the alignment marginalization, and (4)
directly minimize the differentiable expected latency loss. Experiments on the
Cortana voice search task demonstrate that our proposed methods can
significantly reduce the latency, and even improve the recognition accuracy in
certain cases on the decoder side. We also present some analysis to understand
the behaviors of streaming S2S models.
Comment: Accepted at IEEE ICASSP 202
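One way to picture the differentiable expected-latency term mentioned in point (4): given the model's per-frame probability of emitting each token and the reference boundary frame from the hybrid-model alignment, penalize the expected delay beyond that boundary. The exact formulation in the paper may differ; this sketch only conveys the idea, and all names are hypothetical.

```python
# Differentiable expected-latency penalty (illustrative formulation).
import torch

def expected_latency_loss(emit_probs, ref_frames):
    """emit_probs: (B, U, T) probability of emitting token u at frame t.
       ref_frames: (B, U) reference boundary frame for each token."""
    B, U, T = emit_probs.shape
    frames = torch.arange(T, dtype=emit_probs.dtype)           # (T,)
    delay = frames.view(1, 1, T) - ref_frames.unsqueeze(-1)    # (B, U, T)
    # only delays beyond the reference boundary count as latency
    return (emit_probs * delay.clamp(min=0)).sum(-1).mean()

# toy usage with random emission probabilities and reference boundaries
probs = torch.softmax(torch.randn(2, 5, 40), dim=-1)
refs = torch.randint(0, 40, (2, 5)).float()
loss = expected_latency_loss(probs, refs)
```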