Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation
In this paper, we present a novel modeling method for single-channel
multi-talker overlapped automatic speech recognition (ASR) systems. Fully
neural network based end-to-end models have dramatically improved the
performance of multi-talker overlapped ASR tasks. One promising approach for
end-to-end modeling is autoregressive modeling with serialized output training
in which transcriptions of multiple speakers are recursively generated one
after another. This enables us to naturally capture relationships between
speakers. However, the conventional modeling method cannot explicitly take into
account the speaker attributes of individual utterances such as gender and age
information. In fact, performance deteriorates when the speakers are of the
same gender or close in age. To address this problem, we propose unified
autoregressive modeling for joint end-to-end multi-talker overlapped ASR and
speaker attribute estimation. Our key idea is to handle gender and age
estimation tasks within the unified autoregressive modeling. In the proposed
method, a Transformer-based autoregressive model recursively generates not only
textual tokens but also attribute tokens of each speaker. This enables us to
effectively utilize speaker attributes for improving multi-talker overlapped
ASR. Experiments on Japanese multi-talker overlapped ASR tasks demonstrate the
effectiveness of the proposed method.
Comment: Accepted at Interspeech 202
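As a rough illustration of the serialized-output idea described in this abstract, the sketch below builds a single target sequence that interleaves speaker-attribute tokens with each speaker's transcription, so the autoregressive decoder emits attributes and text one speaker after another. The token inventory (`<sc>`, `<female>`, `<adult>`, `<eos>`) is a hypothetical example, not the paper's actual symbol set.

```python
# Minimal sketch: serializing multi-talker targets with attribute tokens.
# Token names (<sc>, <female>, <adult>, ...) are illustrative assumptions.

def serialize_targets(utterances):
    """utterances: list of dicts with 'gender', 'age_group', and 'tokens'."""
    target = []
    for i, utt in enumerate(utterances):
        if i > 0:
            target.append("<sc>")                 # speaker-change token
        target.append(f"<{utt['gender']}>")       # attribute token, e.g. <female>
        target.append(f"<{utt['age_group']}>")    # attribute token, e.g. <adult>
        target.extend(utt["tokens"])              # textual tokens of this speaker
    target.append("<eos>")
    return target

# Example: two overlapped speakers, generated one after another.
print(serialize_targets([
    {"gender": "female", "age_group": "adult", "tokens": ["hello", "there"]},
    {"gender": "male", "age_group": "senior", "tokens": ["good", "morning"]},
]))
```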
A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition
Building a good speech recognition system usually requires large amounts of
transcribed data, which is expensive to collect. To tackle this problem, many
unsupervised pre-training methods have been proposed. Among these methods,
Masked Predictive Coding (MPC) achieved significant improvements on various speech
recognition datasets with a BERT-like masked reconstruction loss and a Transformer
backbone. However, many aspects of MPC have not been fully investigated. In
this paper, we conduct a further study on MPC and focus on three important
aspects: the effect of pre-training data speaking style, its extension on
streaming model, and how to better transfer learned knowledge from pre-training
stage to downstream tasks. Experiments revealed that pre-training data with a
matching speaking style is more useful for downstream recognition tasks. A
unified training objective combining APC and MPC provided an 8.46% relative error
reduction on a streaming model trained on HKUST. Also, the combination of target
data adaption and layer-wise discriminative training helped the knowledge
transfer of MPC, which achieved 3.99% relative error reduction on AISHELL over
a strong baseline
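For readers unfamiliar with MPC, the following is a minimal sketch of a BERT-like masked reconstruction objective on acoustic features with a Transformer encoder, assuming a 15% masking ratio and an L1 reconstruction loss; the actual recipe in the paper may differ.

```python
import torch
import torch.nn as nn

# Sketch of Masked Predictive Coding (MPC): mask random frames of the input
# features and train a Transformer encoder to reconstruct them.
# Masking ratio and L1 loss are illustrative assumptions.

class MPCModel(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out_proj = nn.Linear(d_model, feat_dim)

    def forward(self, feats, mask_ratio=0.15):
        # feats: (batch, time, feat_dim) acoustic features
        mask = torch.rand(feats.shape[:2], device=feats.device) < mask_ratio
        masked = feats.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked frames
        pred = self.out_proj(self.encoder(self.in_proj(masked)))
        # Reconstruction loss is computed only on the masked positions.
        return nn.functional.l1_loss(pred[mask], feats[mask])

loss = MPCModel()(torch.randn(2, 100, 80))
```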
Transformers with convolutional context for ASR
The recent success of transformer networks for neural machine translation and
other NLP tasks has led to a surge in research work trying to apply them to
speech recognition. Recent efforts studied key research questions around ways
of combining positional embedding with speech features, and stability of
optimization for large scale learning of transformer networks. In this paper,
we propose replacing the sinusoidal positional embedding for transformers with
convolutionally learned input representations. These contextual representations
provide subsequent transformer blocks with relative positional information
needed for discovering long-range relationships between local concepts. The
proposed system has favorable optimization characteristics: our reported
results are produced with a fixed learning rate of 1.0 and no warmup steps. The
proposed model achieves a competitive 4.7% and 12.9% WER on the LibriSpeech
``test clean'' and ``test other'' subsets when no extra LM text is provided
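A minimal sketch of the general idea, assuming a small stack of strided 1-D convolutions supplies positional context in place of sinusoidal positional embeddings; kernel sizes, strides, and model dimensions are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Sketch: convolutional front-end providing relative positional context to a
# Transformer encoder, instead of adding sinusoidal positional embeddings.
# Kernel sizes, strides, and channel counts are illustrative assumptions.

class ConvContextEncoder(nn.Module):
    def __init__(self, feat_dim=80, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, feats):
        # feats: (batch, time, feat_dim); Conv1d expects (batch, channels, time)
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        return self.encoder(x)  # no explicit positional embedding added

out = ConvContextEncoder()(torch.randn(2, 200, 80))
```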
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
Data augmentation is one of the most effective ways to make end-to-end
automatic speech recognition (ASR) perform close to the conventional hybrid
approach, especially when dealing with low-resource tasks. Using recent
advances in speech synthesis (text-to-speech, or TTS), we build our TTS system
on an ASR training database and then extend the data with synthesized speech to
train a recognition model. We argue that, when the training data amount is
relatively low, this approach can allow an end-to-end model to reach hybrid
systems' quality. For an artificial low-to-medium-resource setup, we compare
the proposed augmentation with the semi-supervised learning technique. We also
investigate the influence of vocoder usage on final ASR performance by
comparing the Griffin-Lim algorithm with our modified LPCNet. When applied with an
external language model, our approach outperforms a semi-supervised setup on
LibriSpeech test-clean and is only 33% worse than a comparable supervised setup.
Our system establishes a competitive result for end-to-end ASR trained on the
LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on
test-other
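A schematic of the augmentation recipe, assuming a hypothetical `synthesize` function standing in for the TTS system trained on the ASR corpus; the mixing ratio is an illustrative parameter, not the paper's setting.

```python
import random

# Sketch of TTS-based data augmentation for ASR: synthesize speech for text
# with a TTS system trained on the ASR corpus, then mix real and synthetic
# utterances into one training set. `synthesize` and `synth_ratio` are
# illustrative assumptions.

def build_augmented_train_set(real_utts, texts, synthesize, synth_ratio=1.0):
    """real_utts: list of (audio, transcript); texts: transcripts to synthesize.
    Assumes len(texts) >= synth_ratio * len(real_utts)."""
    n_synth = int(len(real_utts) * synth_ratio)
    synth_utts = [(synthesize(t), t) for t in random.sample(texts, n_synth)]
    train_set = real_utts + synth_utts
    random.shuffle(train_set)
    return train_set
```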
Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition
Although attention based end-to-end models have achieved promising
performance in speech recognition, the multi-pass forward computation in
beam-search increases inference time cost, which limits their practical
applications. To address this issue, we propose a non-autoregressive end-to-end
speech recognition system called LASO (listen attentively, and spell once).
Because of the non-autoregressive property, LASO predicts each textual token in
the sequence without depending on the other tokens. Without beam search, the
one-pass forward propagation greatly reduces the inference time cost of LASO. Because the
model is based on an attention-based feedforward structure, the computation
can be implemented efficiently in parallel. We conduct experiments on the publicly
available Chinese dataset AISHELL-1. LASO achieves a character error rate of
6.4%, which outperforms the state-of-the-art autoregressive transformer model
(6.7%). The average inference latency is 21 ms, which is 1/50 that of the
autoregressive transformer model.
Comment: accepted by INTERSPEECH202
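A minimal sketch of one-pass non-autoregressive decoding of this kind: all output positions are predicted in a single forward pass and read off by position-wise argmax, with no beam search. The model interface, maximum length, and end-of-sentence handling are assumptions.

```python
import torch

# Sketch of non-autoregressive (one-pass) decoding, LASO-style: a single
# forward pass yields logits for every output position at once, so tokens are
# read off position-wise without beam search. `model`, `max_len`, and the
# eos handling are placeholder assumptions.

@torch.no_grad()
def decode_one_pass(model, speech_feats, max_len=64, eos_id=2):
    logits = model(speech_feats, max_len)   # (batch, max_len, vocab) in one pass
    tokens = logits.argmax(dim=-1)          # parallel argmax at every position
    hyps = []
    for seq in tokens.tolist():
        hyp = []
        for t in seq:
            if t == eos_id:                 # stop at the end-of-sentence token
                break
            hyp.append(t)
        hyps.append(hyp)
    return hyps

# Usage (hypothetical): hyps = decode_one_pass(laso_model, fbank_features)
```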
Multilingual End-to-End Speech Recognition with A Single Transformer on Low-Resource Languages
Sequence-to-sequence attention-based models integrate an acoustic,
pronunciation and language model into a single neural network, which makes them
very suitable for multilingual automatic speech recognition (ASR). In this
paper, we are concerned with multilingual speech recognition on low-resource
languages by a single Transformer, one of sequence-to-sequence attention-based
models. Sub-words are employed as the multilingual modeling unit without using
any pronunciation lexicon. First, we show that a single multilingual ASR
Transformer performs well on low-resource languages despite some language
confusion. We then look at incorporating language information into the model by
inserting the language symbol at the beginning or at the end of the original
sub-word sequence, under the condition that the language information is known
during training. Experiments on CALLHOME datasets demonstrate that the
multilingual ASR Transformer with the language symbol at the end performs
better and can obtain a relative 10.5\% average word error rate (WER) reduction
compared to SHL-MLSTM with residual learning. We go on to show that, assuming
the language information is known during both training and testing, a relative
average WER reduction of about 12.4\% can be observed compared to SHL-MLSTM
with residual learning by giving the language symbol as the sentence start
token.
Comment: arXiv admin note: text overlap with arXiv:1805.0623
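A minimal sketch of the target-sequence construction, with hypothetical language tokens; the exact symbols used in the paper may differ.

```python
# Sketch: appending (or prepending) a language symbol to the sub-word target
# sequence for a multilingual Transformer. Token names are illustrative.

def add_language_symbol(subwords, language, position="end"):
    """subwords: list of sub-word tokens; language: e.g. 'ja', 'es'."""
    lang_token = f"<{language}>"
    if position == "start":
        return [lang_token] + subwords      # language symbol as sentence start token
    return subwords + [lang_token]          # language symbol at the end

print(add_language_symbol(["_he", "llo"], "en", position="end"))
```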
Very Deep Self-Attention Networks for End-to-End Speech Recognition
Recently, end-to-end sequence-to-sequence models for speech recognition have
gained significant interest in the research community. While previous
architecture choices revolve around time-delay neural networks (TDNN) and long
short-term memory (LSTM) recurrent neural networks, we propose to use
self-attention via the Transformer architecture as an alternative. Our analysis
shows that deep Transformer networks with high learning capacity are able to
exceed performance from previous end-to-end approaches and even match the
conventional hybrid systems. Moreover, we trained very deep models with up to
48 Transformer layers for both the encoder and decoder, combined with stochastic
residual connections, which greatly improve generalizability and training
efficiency. The resulting models outperform all previous end-to-end ASR
approaches on the Switchboard benchmark. An ensemble of these models achieves
9.9% and 17.7% WER on Switchboard and CallHome test sets respectively. This
finding brings our end-to-end models to competitive levels with previous hybrid
systems. Further, with model ensembling the Transformers can outperform certain
hybrid systems, which are more complicated in terms of both structure and
training procedure.
Comment: Submitted to INTERSPEECH 201
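The stochastic residual connections mentioned above resemble stochastic depth (layer drop); the sketch below wraps an arbitrary residual branch, with the survival probability as an assumed hyperparameter rather than the paper's value.

```python
import torch
import torch.nn as nn

# Sketch of a stochastic residual connection (stochastic depth): during
# training each residual branch is skipped with some probability, which helps
# regularize very deep Transformer stacks. Survival probability is an assumption.

class StochasticResidual(nn.Module):
    def __init__(self, layer, survival_prob=0.8):
        super().__init__()
        self.layer = layer
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training and torch.rand(1).item() > self.survival_prob:
            return x                                    # skip the whole block
        if self.training:
            return x + self.layer(x)                    # standard residual update
        return x + self.survival_prob * self.layer(x)   # rescale at inference

block = StochasticResidual(nn.Sequential(nn.Linear(256, 256), nn.ReLU()))
y = block(torch.randn(4, 256))
```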
Multiresolution and Multimodal Speech Recognition with Transformers
This paper presents an audio visual automatic speech recognition (AV-ASR)
system using a Transformer-based architecture. We particularly focus on the
scene context provided by the visual information, to ground the ASR. We extract
representations for audio features in the encoder layers of the transformer and
fuse video features using an additional crossmodal multihead attention layer.
Additionally, we incorporate a multitask training criterion for multiresolution
ASR, where we train the model to generate both character and subword level
transcriptions.
Experimental results on the How2 dataset indicate that multiresolution
training can speed up convergence by around 50% and relatively improves word
error rate (WER) performance by up to 18% over subword prediction models.
Further, incorporating visual information improves performance with relative
gains of up to 3.76% over audio-only models.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based
architectures.
Comment: Accepted for ACL 202
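A minimal sketch of the crossmodal fusion step, assuming audio encoder states attend to projected video features through one extra multihead attention layer with a residual connection; dimensions and placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of crossmodal fusion: audio encoder states attend to video features
# through an additional multihead attention layer. Dimensions are assumptions.

class CrossModalFusion(nn.Module):
    def __init__(self, d_model=256, nhead=4, video_dim=2048):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_states, video_feats):
        # audio_states: (batch, T_audio, d_model); video_feats: (batch, T_video, video_dim)
        video = self.video_proj(video_feats)
        fused, _ = self.cross_attn(query=audio_states, key=video, value=video)
        return self.norm(audio_states + fused)   # residual fusion of the two modalities

out = CrossModalFusion()(torch.randn(2, 120, 256), torch.randn(2, 30, 2048))
```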
A Comparison of Modeling Units in Sequence-to-Sequence Speech Recognition with the Transformer on Mandarin Chinese
The choice of modeling units is critical to automatic speech recognition
(ASR) tasks. Conventional ASR systems typically choose context-dependent states
(CD-states) or context-dependent phonemes (CD-phonemes) as their modeling
units. However, this choice has been challenged by sequence-to-sequence attention-based
models, which integrate an acoustic, pronunciation and language model into a
single neural network. On English ASR tasks, previous attempts have already
shown that the modeling unit of graphemes can outperform that of phonemes by
sequence-to-sequence attention-based models.
In this paper, we are concerned with modeling units on Mandarin Chinese ASR
tasks using sequence-to-sequence attention-based models with the Transformer.
Five modeling units are explored including context-independent phonemes
(CI-phonemes), syllables, words, sub-words and characters. Experiments on HKUST
datasets demonstrate that lexicon-free modeling units can outperform
lexicon-related modeling units in terms of character error rate (CER). Among
the five modeling units, the character-based model performs best and establishes a new
state-of-the-art CER of on HKUST datasets without a hand-designed
lexicon or extra language model integration, which corresponds to a
relative improvement over the existing best CER of by the joint
CTC-attention based encoder-decoder network.
Comment: arXiv admin note: substantial text overlap with arXiv:1804.1075
NAUTILUS: a Versatile Voice Cloning System
We introduce a novel speech synthesis system, called NAUTILUS, that can
generate speech with a target voice either from a text input or a reference
utterance of an arbitrary source speaker. By using a multi-speaker speech
corpus to train all requisite encoders and decoders in the initial training
stage, our system can clone unseen voices using untranscribed speech of target
speakers on the basis of the backpropagation algorithm. Moreover, depending on
the data circumstance of the target speaker, the cloning strategy can be
adjusted to take advantage of additional data and modify the behaviors of
text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the
situation. We test the performance of the proposed framework by using deep
convolutional layers to model the encoders, decoders, and the WaveNet vocoder.
Evaluations show that it achieves comparable quality with state-of-the-art TTS
and VC systems when cloning with just five minutes of untranscribed speech.
Moreover, it is demonstrated that the proposed framework has the ability to
switch between TTS and VC with high speaker consistency, which will be useful
for many applications.
Comment: Submitted to The IEEE/ACM Transactions on Audio, Speech, and Language Processing