Acoustic-to-Word Models with Conversational Context Information
Conversational context information, higher-level knowledge that spans sentences, can help in recognizing a long conversation. However, existing speech recognition models are typically built at the sentence level and thus may not capture important conversational context information. The recent progress in
end-to-end speech recognition enables integrating context with other available
information (e.g., acoustic, linguistic resources) and directly recognizing
words from speech. In this work, we present a direct acoustic-to-word,
end-to-end speech recognition model capable of utilizing the conversational
context to better process long conversations. We evaluate our proposed approach
on the Switchboard conversational speech corpus and show that our system
outperforms a standard end-to-end speech recognition system.
Comment: NAACL 2019. arXiv admin note: text overlap with arXiv:1808.0217
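The abstract does not spell out how the conversational context enters the model, so the following is only a speculative sketch of one common way to do it: pooling an embedding of the previous utterance's hypothesis into a context vector that is fed to the decoder at every step. All module names and dimensions below are hypothetical, not the paper's design.

```python
# Speculative sketch: condition an acoustic-to-word decoder step on a
# conversational-context vector built from the previous utterance's hypothesis.
import torch
import torch.nn as nn

class ContextConditionedDecoderStep(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # decoder input = previous word embedding + context vector
        self.rnn = nn.GRUCell(emb_dim + emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def context_vector(self, prev_utt_ids):
        # mean-pool the word embeddings of the previous utterance
        return self.embed(prev_utt_ids).mean(dim=1)

    def forward(self, prev_word_ids, prev_utt_ids, hidden):
        ctx = self.context_vector(prev_utt_ids)                  # (B, emb_dim)
        inp = torch.cat([self.embed(prev_word_ids), ctx], dim=-1)
        hidden = self.rnn(inp, hidden)
        return self.out(hidden), hidden

# toy usage: one decoding step with a 3-word previous utterance as context
dec = ContextConditionedDecoderStep()
logits, h = dec(torch.tensor([4]), torch.tensor([[7, 8, 9]]), torch.zeros(1, 512))
```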
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progresses made in deep learning based
acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We then describe acoustic models that are optimized end-to-end, with emphasis on feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research.
Comment: This is an updated version, with the latest literature up to ICASSP 2018, of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based Acoustic Models," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, 201
Stream attention-based multi-array end-to-end speech recognition
Automatic Speech Recognition (ASR) using multiple microphone arrays has achieved great success in far-field robustness. Taking advantage of all the
information that each array shares and contributes is crucial in this task.
Motivated by the advances of joint Connectionist Temporal Classification
(CTC)/attention mechanism in the End-to-End (E2E) ASR, a stream attention-based
multi-array framework is proposed in this work. Microphone arrays, acting as
information streams, are each processed by a separate encoder and decoded under the guidance of both the CTC and attention networks. In terms of attention, a
hierarchical structure is adopted. On top of the regular attention networks,
stream attention is introduced to steer the decoder toward the most informative
encoders. Experiments have been conducted on AMI and DIRHA multi-array corpora
using the encoder-decoder architecture. Compared with the best single-array
results, the proposed framework achieves relative Word Error Rate (WER) reductions of 3.7% and 9.7% on the two datasets, respectively, also outperforming conventional strategies.
Comment: Submitted to ICASSP 201
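A hedged sketch of the hierarchical attention idea described above: a regular attention per array produces one context vector per stream, and a second, stream-level attention weights those contexts before they reach the shared decoder. The module names and dimensions are assumptions, not the authors' implementation.

```python
# Hierarchical (stream) attention sketch: frame-level attention per array,
# then stream-level attention over the resulting context vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamAttention(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=256):
        super().__init__()
        self.frame_score = nn.Linear(enc_dim + dec_dim, 1)    # per-frame attention
        self.stream_score = nn.Linear(enc_dim + dec_dim, 1)   # per-stream attention

    def forward(self, enc_outputs, dec_state):
        # enc_outputs: list of (B, T_i, enc_dim), one per microphone array
        contexts = []
        for enc in enc_outputs:
            q = dec_state.unsqueeze(1).expand(-1, enc.size(1), -1)
            a = F.softmax(self.frame_score(torch.cat([enc, q], -1)), dim=1)
            contexts.append((a * enc).sum(dim=1))              # (B, enc_dim)
        ctx = torch.stack(contexts, dim=1)                     # (B, n_streams, enc_dim)
        q = dec_state.unsqueeze(1).expand(-1, ctx.size(1), -1)
        w = F.softmax(self.stream_score(torch.cat([ctx, q], -1)), dim=1)
        return (w * ctx).sum(dim=1)                            # fused context for decoder

# toy usage: two arrays with different numbers of frames
att = StreamAttention()
fused = att([torch.randn(1, 80, 256), torch.randn(1, 75, 256)], torch.randn(1, 256))
```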
Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition
The performance of automatic speech recognition systems degrades with
increasing mismatch between the training and testing scenarios. Differences in
speaker accents are a significant source of such mismatch. The traditional
approach to deal with multiple accents involves pooling data from several
accents during training and building a single model in multi-task fashion,
where tasks correspond to individual accents. In this paper, we explore an
alternate model where we jointly learn an accent classifier and a multi-task
acoustic model. Experiments on the American English Wall Street Journal and
British English Cambridge corpora demonstrate that our joint model outperforms
the strong multi-task acoustic model baseline. We obtain a 5.94% relative
improvement in word error rate on British English and a 9.47% relative improvement on American English. This illustrates that jointly modeling with accent information improves acoustic model performance.
Comment: Accepted at the 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018)
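As a rough illustration of the joint model, the sketch below shares an encoder between a frame-level acoustic-model head and an utterance-level accent classifier, and sums the two cross-entropy losses. The layer sizes, the loss weight, and the use of mean pooling for the accent branch are assumptions, not the paper's exact configuration.

```python
# Joint accent classifier + multi-task acoustic model (illustrative sketch).
import torch
import torch.nn as nn

class JointAccentAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hid_dim=320, n_senones=3000, n_accents=2):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hid_dim, num_layers=3, batch_first=True)
        self.senone_out = nn.Linear(hid_dim, n_senones)   # acoustic-model task
        self.accent_out = nn.Linear(hid_dim, n_accents)   # accent-classifier task

    def forward(self, feats):
        h, _ = self.encoder(feats)                         # (B, T, hid_dim)
        return self.senone_out(h), self.accent_out(h.mean(dim=1))

model = JointAccentAcousticModel()
senone_logits, accent_logits = model(torch.randn(4, 200, 40))
senone_tgt = torch.randint(0, 3000, (4, 200))
accent_tgt = torch.randint(0, 2, (4,))
loss = (nn.CrossEntropyLoss()(senone_logits.transpose(1, 2), senone_tgt)
        + 0.5 * nn.CrossEntropyLoss()(accent_logits, accent_tgt))   # weighted sum
loss.backward()
```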
Towards End-to-end Automatic Code-Switching Speech Recognition
Speech recognition in mixed languages is difficult to fit into the end-to-end framework due to the lack of data and to overlapping phone sets, for example in words such as "one" in English and "wàn" in Chinese. We propose a CTC-based
end-to-end automatic speech recognition model for intra-sentential
English-Mandarin code-switching. The model is first trained jointly on monolingual datasets and then fine-tuned on the mixed-language corpus. During
the decoding process, we apply beam search and combine the CTC predictions with a language model score. The proposed method is effective at leveraging monolingual corpora and detecting language transitions, and it improves the CER by 5%.
Comment: Submitted to ICASSP 201
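The decoding-time score combination is essentially a log-linear interpolation of the CTC score and the language model score for each beam hypothesis. The sketch below shows only that ranking step; the interfaces and the LM weight are illustrative assumptions.

```python
# Shallow-fusion style rescoring of beam-search hypotheses (illustrative).
import math

def combined_score(ctc_log_prob, lm_log_prob, lm_weight=0.3):
    """Log-linear combination used to rank beam-search hypotheses."""
    return ctc_log_prob + lm_weight * lm_log_prob

def prune_beam(hypotheses, beam_size=10):
    """hypotheses: list of (tokens, ctc_log_prob, lm_log_prob)."""
    scored = sorted(hypotheses,
                    key=lambda h: combined_score(h[1], h[2]),
                    reverse=True)
    return scored[:beam_size]

# toy usage: two competing code-switched hypotheses
beam = [(["one", "万"], math.log(0.4), math.log(0.2)),
        (["万", "one"], math.log(0.3), math.log(0.5))]
print(prune_beam(beam, beam_size=1))
```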
Acoustic-To-Word Model Without OOV
Recently, the acoustic-to-word model based on the Connectionist Temporal
Classification (CTC) criterion was shown as a natural end-to-end model directly
targeting words as output units. However, this type of word-based CTC model
suffers from the out-of-vocabulary (OOV) issue, as it can only model a limited number of words in the output layer and maps all remaining words to an OOV output node. Therefore, such a word-based CTC model can only recognize the frequent words modeled by the network output nodes. It also cannot easily handle hot words that emerge after the model is trained. In this study, we
improve the acoustic-to-word model with a hybrid CTC model which can predict
both words and characters at the same time. With a shared-hidden-layer
structure and modular design, the alignments of words generated from the
word-based CTC and the character-based CTC are synchronized. Whenever the
acoustic-to-word model emits an OOV token, we back off that OOV segment to the
word output generated from the character-based CTC, hence solving the OOV or
hot-words issue. Evaluated on a Microsoft Cortana voice assistant task, the
proposed model can reduce the errors introduced by the OOV output token in the
acoustic-to-word model by 30%.
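A simplified sketch of the back-off step: wherever the word-level branch emits the OOV token, the word spelled out by the character-level branch for that segment is substituted. The alignment synchronization between the two CTC branches is assumed to have been done upstream, and all names are illustrative.

```python
# Back off OOV segments of the word-level hypothesis to the character-level output.
OOV = "<oov>"

def backoff_oov(word_hyp, char_hyp_segments):
    """word_hyp: word-CTC outputs, one per segment.
    char_hyp_segments: character-CTC output for the same segments,
    already collapsed into words (alignment sync assumed done upstream)."""
    merged = []
    for word, char_word in zip(word_hyp, char_hyp_segments):
        merged.append(char_word if word == OOV else word)
    return merged

# toy usage: "play <oov> songs" where the OOV segment spells out "cortana"
print(backoff_oov(["play", OOV, "songs"], ["play", "cortana", "songs"]))
# -> ['play', 'cortana', 'songs']
```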
On the Inductive Bias of Word-Character-Level Multi-Task Learning for Speech Recognition
End-to-end automatic speech recognition (ASR) commonly transcribes audio
signals into sequences of characters, while its performance is evaluated by measuring the word error rate (WER). This suggests that directly predicting sequences of words may instead be helpful. However, training with word-level
supervision can be more difficult due to the sparsity of examples per label
class. In this paper we analyze an end-to-end ASR model that combines a
word-and-character representation in a multi-task learning (MTL) framework. We
show that it improves on the WER and study how the word-level model can benefit
from character-level supervision by analyzing the learned inductive preference
bias of each model component empirically. We find that by adding
character-level supervision, the MTL model interpolates between recognizing
more frequent words (preferred by the word-level model) and shorter words
(preferred by the character-level model).
Comment: Accepted at the IRASL workshop at NeurIPS 201
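A hedged sketch of such a word-and-character multi-task objective: a shared encoder feeds two CTC branches, and their losses are interpolated with a weight. The layer sizes, blank-index convention, and weight value are assumptions for illustration, not the paper's setup.

```python
# Word + character multi-task CTC objective over a shared encoder (sketch).
import torch
import torch.nn as nn

class WordCharMTL(nn.Module):
    def __init__(self, feat_dim=80, hid=320, n_words=10000, n_chars=30):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hid, num_layers=4, batch_first=True)
        self.word_head = nn.Linear(hid, n_words + 1)   # +1 for the CTC blank
        self.char_head = nn.Linear(hid, n_chars + 1)

    def forward(self, feats):
        h, _ = self.encoder(feats)
        return (self.word_head(h).log_softmax(-1),
                self.char_head(h).log_softmax(-1))

def mtl_loss(word_lp, char_lp, word_tgt, char_tgt, in_len, wt_len, ct_len, lam=0.5):
    # word_lp/char_lp: (B, T, C) log-probabilities; CTCLoss expects (T, B, C)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    word_loss = ctc(word_lp.transpose(0, 1), word_tgt, in_len, wt_len)
    char_loss = ctc(char_lp.transpose(0, 1), char_tgt, in_len, ct_len)
    return lam * word_loss + (1.0 - lam) * char_loss   # interpolated MTL loss
```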
Sequence-to-Sequence Models Can Directly Translate Foreign Speech
We present a recurrent encoder-decoder deep neural network architecture that
directly translates speech in one language into text in another. The model does
not explicitly transcribe the speech into text in the source language, nor does
it require supervision from the ground truth source language transcription
during training. We apply a slightly modified attention-based sequence-to-sequence architecture that has previously been used for speech recognition and
show that it can be repurposed for this more complex task, illustrating the
power of attention-based models. A single model trained end-to-end obtains
state-of-the-art performance on the Fisher Callhome Spanish-English speech
translation task, outperforming a cascade of independently trained
sequence-to-sequence speech recognition and machine translation models by 1.8
BLEU points on the Fisher test set. In addition, we find that making use of the
training data in both languages by multi-task training sequence-to-sequence
speech translation and recognition models with a shared encoder network can
improve performance by a further 1.4 BLEU points.
Comment: 5 pages, 1 figure. Interspeech 201
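The shared-encoder multi-task setup can be pictured as one speech encoder feeding two decoders: one producing source-language transcripts (recognition) and one producing target-language text (translation). The schematic below omits the attention mechanism for brevity; dimensions and module choices are assumptions.

```python
# Shared encoder with separate ASR and translation decoders (schematic sketch).
import torch
import torch.nn as nn

class SharedEncoderST(nn.Module):
    def __init__(self, feat_dim=80, hid=256, src_vocab=90, tgt_vocab=90):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hid, num_layers=3, batch_first=True)
        self.asr_decoder = nn.LSTM(hid, hid, batch_first=True)
        self.st_decoder = nn.LSTM(hid, hid, batch_first=True)
        self.asr_out = nn.Linear(hid, src_vocab)   # source-language characters
        self.st_out = nn.Linear(hid, tgt_vocab)    # target-language characters

    def forward(self, feats):
        enc, _ = self.encoder(feats)
        # attention omitted for brevity; both decoders read the same encoder states
        asr_h, _ = self.asr_decoder(enc)
        st_h, _ = self.st_decoder(enc)
        return self.asr_out(asr_h), self.st_out(st_h)

model = SharedEncoderST()
asr_logits, st_logits = model(torch.randn(2, 120, 80))
```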
Jasper: An End-to-End Convolutional Neural Acoustic Model
In this paper, we report state-of-the-art results on LibriSpeech among
end-to-end speech recognition models without any external training data. Our
model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout,
and residual connections. To improve training, we further introduce a new
layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that
the proposed deep architecture performs as well or better than more complex
choices. Our deepest Jasper variant uses 54 convolutional layers. With this
architecture, we achieve 2.95% WER using a beam-search decoder with an external
neural language model and 3.86% WER with a greedy decoder on LibriSpeech
test-clean. We also report competitive results on the Wall Street Journal and
the Hub5'00 conversational evaluation datasets.
Comment: Accepted to INTERSPEECH 201
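Since the abstract enumerates the building blocks (1D convolutions, batch normalization, ReLU, dropout, residual connections), here is a minimal sketch of one such block in PyTorch. The kernel size, channel count, and placement of the residual addition are illustrative, not the paper's exact configuration.

```python
# One Jasper-style block: repeated (Conv1d -> BatchNorm -> ReLU -> Dropout)
# with a residual connection around the whole block.
import torch
import torch.nn as nn

class JasperBlock(nn.Module):
    def __init__(self, channels=256, kernel_size=11, repeat=3, dropout=0.2):
        super().__init__()
        layers = []
        for _ in range(repeat):
            layers += [nn.Conv1d(channels, channels, kernel_size,
                                 padding=kernel_size // 2),
                       nn.BatchNorm1d(channels),
                       nn.ReLU(),
                       nn.Dropout(dropout)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):                       # x: (B, channels, T)
        return torch.relu(x + self.body(x))     # residual connection

block = JasperBlock()
out = block(torch.randn(2, 256, 100))           # -> (2, 256, 100)
```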
Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR
Recently, a few novel streaming attention-based sequence-to-sequence (S2S)
models have been proposed to perform online speech recognition with linear-time
decoding complexity. However, in these models, the decisions to generate tokens
are delayed compared to the actual acoustic boundaries since their
unidirectional encoders lack future information. This leads to an inevitable
latency during inference. To alleviate this issue and reduce latency, we
propose several strategies during training by leveraging external hard
alignments extracted from the hybrid model. We investigate utilizing the alignments in both the encoder and the decoder. On the encoder side, (1)
multi-task learning and (2) pre-training with the framewise classification task
are studied. On the decoder side, we (3) remove inappropriate alignment paths
beyond an acceptable latency during the alignment marginalization, and (4)
directly minimize the differentiable expected latency loss. Experiments on the
Cortana voice search task demonstrate that our proposed methods can
significantly reduce the latency, and even improve the recognition accuracy in
certain cases on the decoder side. We also present some analysis to understand
the behaviors of streaming S2S models.
Comment: Accepted at IEEE ICASSP 202
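One way to picture the differentiable expected-latency term mentioned in point (4): given the model's per-frame probability of emitting each token and the reference boundary frame from the hybrid-model alignment, penalize the expected delay beyond that boundary. The exact formulation in the paper may differ; this sketch only conveys the idea, and all names are hypothetical.

```python
# Differentiable expected-latency penalty (illustrative formulation).
import torch

def expected_latency_loss(emit_probs, ref_frames):
    """emit_probs: (B, U, T) probability of emitting token u at frame t.
       ref_frames: (B, U) reference boundary frame for each token."""
    B, U, T = emit_probs.shape
    frames = torch.arange(T, dtype=emit_probs.dtype)           # (T,)
    delay = frames.view(1, 1, T) - ref_frames.unsqueeze(-1)    # (B, U, T)
    # only delays beyond the reference boundary count as latency
    return (emit_probs * delay.clamp(min=0)).sum(-1).mean()

# toy usage with random emission probabilities and reference boundaries
probs = torch.softmax(torch.randn(2, 5, 40), dim=-1)
refs = torch.randint(0, 40, (2, 5)).float()
loss = expected_latency_loss(probs, refs)
```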