27 research outputs found
Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
We present a structured overview of adaptation algorithms for neural
network-based speech recognition, considering both hybrid hidden Markov model /
neural network systems and end-to-end neural network systems, with a focus on
speaker adaptation, domain adaptation, and accent adaptation. The overview
characterizes adaptation algorithms as based on embeddings, model parameter
adaptation, or data augmentation. We present a meta-analysis of the performance
of speech recognition adaptation algorithms, based on relative error rate
reductions as reported in the literature.
Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27 figures.
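The meta-analysis metric named in this abstract, relative error rate reduction, is a simple ratio; the following is a generic sketch of that computation, not code from the overview itself:

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative error-rate reduction of an adapted system over its baseline.

    Both arguments are word error rates (e.g. in percent); the result is the
    fractional reduction, so 10.0 -> 8.0 gives 0.2 (a 20% relative reduction).
    """
    return (baseline_wer - adapted_wer) / baseline_wer

# Example: adaptation lowers WER from 10.0% to 8.0%.
print(relative_wer_reduction(10.0, 8.0))  # -> 0.2
```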
Multi-Dialect Speech Recognition With A Single Sequence-To-Sequence Model
Sequence-to-sequence models provide a simple and elegant solution for
building speech recognition systems by folding separate components of a typical
system, namely the acoustic (AM), pronunciation (PM), and language (LM) models, into
a single neural network. In this work, we look at one such sequence-to-sequence
model, namely listen, attend and spell (LAS), and explore the possibility of
training a single model to serve different English dialects, which simplifies
the process of training multi-dialect systems without the need for separate AM,
PM and LMs for each dialect. We show that simply pooling the data from all
dialects into one LAS model falls behind the performance of a model fine-tuned
on each dialect. We then incorporate dialect-specific information into the
model, both by inserting the dialect symbol at the end of the original
grapheme sequence in the training targets and by feeding a 1-hot
representation of the dialect into all layers of the model.
Experimental results on seven English dialects show that our proposed system is
effective in modeling dialect variations within a single LAS model,
outperforming a LAS model trained individually on each of the seven dialects by
3.1% to 16.5% relative.
Comment: Submitted to ICASSP 201
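The two dialect-conditioning ideas described in this abstract can be sketched as follows; the dialect inventory, token format, and feature shapes here are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical inventory of seven English dialects.
DIALECTS = ["en-us", "en-gb", "en-in", "en-au", "en-ca", "en-ke", "en-za"]

def append_dialect_symbol(graphemes, dialect):
    """(1) Insert a dialect symbol at the end of the grapheme targets."""
    return graphemes + [f"<{dialect}>"]

def one_hot(dialect):
    """(2) 1-hot dialect vector, to be fed into the layers of the model."""
    vec = [0.0] * len(DIALECTS)
    vec[DIALECTS.index(dialect)] = 1.0
    return vec

def condition_features(frames, dialect):
    """Concatenate the 1-hot dialect vector onto each feature frame."""
    d = one_hot(dialect)
    return [frame + d for frame in frames]

targets = append_dialect_symbol(list("hello"), "en-gb")
print(targets)  # -> ['h', 'e', 'l', 'l', 'o', '<en-gb>']
```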
CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice
Despite recent advances in Automatic Speech Recognition (ASR), the
recognition of accented speech remains a major challenge. To build more
inclusive ASR systems, research has shown that integrating accent
information into a larger ASR framework can mitigate errors on accented
speech. We address multilingual accent classification with the ECAPA-TDNN
and Wav2Vec 2.0/XLSR architectures, which have been shown to perform well
on a variety of speech-related downstream tasks. We introduce a
simple-to-follow recipe aligned with the SpeechBrain
toolkit for accent classification based on Common Voice 7.0 (English) and
Common Voice 11.0 (Italian, German, and Spanish). Furthermore, we establish
a new state of the art for English accent classification, with accuracy as
high as 95%. We also study the internal categorization of the Wav2Vec 2.0
embeddings through t-SNE, noting a degree of clustering based on
phonological similarity. (Our recipe is open-source in the SpeechBrain
toolkit; see: https://github.com/speechbrain/speechbrain/tree/develop/recipes)
Comment: To appear in Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH 202
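The clustering observation in this abstract can be illustrated with a toy, stdlib-only check: embeddings of phonologically similar accents should be closer to each other than to those of dissimilar accents. The vectors below are hypothetical stand-ins for pooled Wav2Vec 2.0 accent embeddings, not data from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 4-dim accent embeddings, two utterances per accent.
embeddings = {
    "en-gb": [[0.9, 0.1, 0.0, 0.2], [0.8, 0.2, 0.1, 0.1]],
    "en-au": [[0.7, 0.3, 0.1, 0.2], [0.8, 0.1, 0.2, 0.3]],  # near en-gb
    "en-in": [[0.1, 0.9, 0.8, 0.1], [0.2, 0.8, 0.9, 0.2]],  # far from en-gb
}

def mean_similarity(group_a, group_b):
    """Average pairwise cosine similarity between two embedding groups."""
    pairs = [(u, v) for u in group_a for v in group_b]
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

near = mean_similarity(embeddings["en-gb"], embeddings["en-au"])
far = mean_similarity(embeddings["en-gb"], embeddings["en-in"])
assert near > far  # clusters by (assumed) phonological similarity
```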