CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice
Despite recent advancements in Automatic Speech Recognition (ASR), the
recognition of accented speech remains a significant problem. In order to
create more inclusive ASR systems, research has shown that the integration of
accent information, as part of a larger ASR framework, can lead to the
mitigation of accented speech errors. We address multilingual accent
classification through the ECAPA-TDNN and Wav2Vec 2.0/XLSR architectures which
have been proven to perform well on a variety of speech-related downstream
tasks. We introduce a simple-to-follow recipe aligned to the SpeechBrain
toolkit for accent classification based on Common Voice 7.0 (English) and
Common Voice 11.0 (Italian, German, and Spanish). Furthermore, we establish a
new state of the art for English accent classification, with accuracy as high
as 95%. We also study the internal categorization of the Wav2Vec 2.0
embeddings through t-SNE, noting that there is a level of clustering based on
phonological similarity. (Our recipe is open-source in the SpeechBrain toolkit,
see: https://github.com/speechbrain/speechbrain/tree/develop/recipes)
Comment: To appear in Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH 202
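The paper's qualitative analysis projects Wav2Vec 2.0 utterance embeddings to two dimensions with t-SNE and inspects the clusters. A minimal sketch of that projection step, using random stand-in vectors in place of real model embeddings (the cluster offsets, dimensions, and perplexity here are all illustrative assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical stand-ins for Wav2Vec 2.0 utterance embeddings: two accent
# "clusters" separated in a 32-dim space (real embeddings would be
# extracted from the pretrained model, not sampled).
accent_a = rng.standard_normal((20, 32)) + 3.0
accent_b = rng.standard_normal((20, 32)) - 3.0
embeddings = np.vstack([accent_a, accent_b])

# Project to 2-D for visual inspection of accent clustering.
points = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)
print(points.shape)  # (40, 2)
```

In practice the 2-D points would then be scatter-plotted and colored by accent label to judge whether phonologically similar accents land near each other.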
Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
We present a structured overview of adaptation algorithms for neural
network-based speech recognition, considering both hybrid hidden Markov model /
neural network systems and end-to-end neural network systems, with a focus on
speaker adaptation, domain adaptation, and accent adaptation. The overview
characterizes adaptation algorithms as based on embeddings, model parameter
adaptation, or data augmentation. We present a meta-analysis of the performance
of speech recognition adaptation algorithms, based on relative error rate
reductions as reported in the literature.
Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27
figures
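The meta-analysis compares adaptation algorithms by relative error rate reduction. That metric is simple to compute; the WER values below are hypothetical, chosen only to illustrate the formula:

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Relative word error rate reduction (%) of an adapted system
    over its unadapted baseline."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer

# Hypothetical example: a 12.0% baseline WER reduced to 10.2% by
# speaker adaptation corresponds to a 15% relative reduction.
print(round(relative_wer_reduction(12.0, 10.2), 1))  # 15.0
```

Reporting relative rather than absolute reductions is what lets the survey compare results across corpora whose baseline error rates differ widely.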
Accented Speech Recognition With Accent-specific Codebooks
Speech accents pose a significant challenge to state-of-the-art automatic
speech recognition (ASR) systems. Degradation in performance across
underrepresented accents is a severe deterrent to the inclusive adoption of
ASR. In this work, we propose a novel accent adaptation approach for end-to-end
ASR systems using cross-attention with a trainable set of codebooks. These
learnable codebooks capture accent-specific information and are integrated
within the ASR encoder layers. The model is trained on accented English
speech, while the test data also contains accents unseen during training.
On the Mozilla Common Voice multi-accented dataset, we show that our proposed
approach yields significant relative improvements in word error rate (WER),
both on English accents seen during training and on unseen accents. Further, we
illustrate benefits for a zero-shot transfer setup on the L2-ARCTIC dataset. We
also compare the performance with other approaches based on accent adversarial
training.
Comment: Accepted to EMNLP 2023 Main Conference (Long Paper)
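The core mechanism is cross-attention in which queries come from encoder frames and keys/values come from a trainable codebook of accent entries. A simplified NumPy sketch of that attention pattern (single head, no projection matrices, random data; the paper's exact parameterisation and integration into the encoder layers differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def codebook_cross_attention(frames, codebook, d_k):
    """Each speech frame attends over a small set of learnable accent codes.

    frames:   (T, d) encoder activations (queries)
    codebook: (C, d) trainable accent entries (keys and values)
    Returns a (T, d) accent-conditioned summary per frame.
    """
    scores = frames @ codebook.T / np.sqrt(d_k)           # (T, C)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)         # softmax over codes
    return weights @ codebook                             # (T, d)

T, C, d = 50, 8, 16                                       # illustrative sizes
frames = rng.standard_normal((T, d))
codebook = rng.standard_normal((C, d))
out = codebook_cross_attention(frames, codebook, d)
print(out.shape)  # (50, 16)
```

During training the codebook entries would be ordinary learnable parameters updated by backpropagation, so each entry can come to specialise in accent-specific information.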
Towards spoken dialect identification of Irish
The Irish language is rich in its diversity of dialects and accents. This
compounds the difficulty of creating a speech recognition system for the
low-resource language, as such a system must contend with a high degree of
variability with limited corpora. A recent study investigating dialect bias in
Irish ASR found that balanced training corpora gave rise to unequal dialect
performance, with performance for the Ulster dialect being consistently worse
than for the Connacht or Munster dialects. Motivated by this, the present
experiments investigate spoken dialect identification of Irish, with a view to
incorporating such a system into the speech recognition pipeline. Two acoustic
classification models are tested, XLS-R and ECAPA-TDNN, in conjunction with a
text-based classifier using a pretrained Irish-language BERT model. The
ECAPA-TDNN, particularly a model pretrained for language identification on the
VoxLingua107 dataset, performed best overall, with an accuracy of 73%. This was
further improved to 76% by fusing the model's outputs with the text-based
model. The Ulster dialect was most accurately identified, with an accuracy of
94%; however, the model struggled to disambiguate between the Connacht and
Munster dialects, suggesting a more nuanced approach may be necessary to
robustly distinguish between the dialects of Irish.
Comment: Accepted to the Interspeech 2023 Workshop of the 2nd Annual Meeting
of the Special Interest Group on Under-resourced Languages (SIGUL), Dublin
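The reported gain from 73% to 76% comes from fusing the acoustic model's outputs with the text-based classifier's. The abstract does not specify the fusion rule, so the sketch below assumes a simple weighted average of the two models' dialect posteriors (all probabilities and the weight are hypothetical):

```python
import numpy as np

DIALECTS = ["Ulster", "Connacht", "Munster"]

def fuse(acoustic_probs, text_probs, alpha=0.5):
    """Late fusion of two classifiers' posteriors by weighted average.
    alpha is an assumed interpolation weight, not a value from the paper."""
    fused = alpha * np.asarray(acoustic_probs) + (1 - alpha) * np.asarray(text_probs)
    return DIALECTS[int(np.argmax(fused))]

# Hypothetical posteriors for one utterance: the acoustic model leans
# Connacht, the text model leans Munster; fusion decides Munster.
acoustic = [0.20, 0.45, 0.35]
text     = [0.10, 0.30, 0.60]
print(fuse(acoustic, text))  # Munster
```

Late fusion of this kind only needs each model's per-class scores, which makes it easy to combine an ECAPA-TDNN acoustic classifier with a BERT-based text classifier trained independently.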