Multi-task Learning for Speaker Verification and Voice Trigger Detection
Automatic speech transcription and speaker recognition are usually treated as
separate tasks even though they are interdependent. In this study, we
investigate training a single network to perform both tasks jointly. We train
the network in a supervised multi-task learning setup, where the speech
transcription branch of the network is trained to minimise a phonetic
connectionist temporal classification (CTC) loss while the speaker recognition
branch of the network is trained to label the input sequence with the correct
label for the speaker. We present a large-scale empirical study where the model
is trained using several thousand hours of labelled training data for each
task. We evaluate the speech transcription branch of the network on a voice
trigger detection task while the speaker recognition branch is evaluated on a
speaker verification task. Results demonstrate that the network is able to
encode both phonetic and speaker information in its learnt
representations while yielding accuracies at least as good as the baseline
models for each task, with the same number of parameters as the independent
models.
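As a rough illustration of the setup described above (not the authors' code), a shared encoder feeding two branch losses can be sketched in NumPy. A frame-level cross-entropy stands in for the CTC loss on the phonetic branch, and an utterance-level cross-entropy is used for the speaker branch; all dimensions, the single-layer encoder, and the task-weighting factor are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper).
T, D, N_PHONES, N_SPEAKERS = 50, 16, 40, 100  # frames, feature dim, class counts

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared encoder: a single linear layer as a stand-in for the acoustic model.
W_enc = rng.normal(0, 0.1, (D, D))
# One head per branch on top of the shared representation.
W_phone = rng.normal(0, 0.1, (D, N_PHONES))
W_spk = rng.normal(0, 0.1, (D, N_SPEAKERS))

features = rng.normal(size=(T, D))             # one utterance
phone_targets = rng.integers(0, N_PHONES, T)   # frame-level phonetic labels
speaker_target = 7                             # utterance-level speaker label

hidden = np.tanh(features @ W_enc)             # shared representation

# Phonetic branch: frame-level cross-entropy (a stand-in for the CTC loss).
phone_probs = softmax(hidden @ W_phone)
loss_phone = -np.mean(np.log(phone_probs[np.arange(T), phone_targets]))

# Speaker branch: mean-pool over time, then classify the speaker.
spk_probs = softmax(hidden.mean(axis=0) @ W_spk)
loss_speaker = -np.log(spk_probs[speaker_target])

# Joint multi-task objective; alpha is a hypothetical weighting hyperparameter.
alpha = 0.5
loss = alpha * loss_phone + (1 - alpha) * loss_speaker
```

Because both losses backpropagate through the same `hidden` representation, the encoder is pushed to retain phonetic and speaker information simultaneously, which is the core of the multi-task setup.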
Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition
With the rapid development of speech assistants, adapting automatic speech
recognition (ASR) solutions originally designed for servers to run directly on
devices has become crucial. Researchers and industry alike prefer end-to-end
ASR systems for on-device speech recognition tasks, because end-to-end systems
can be made resource-efficient while maintaining higher quality than hybrid
systems. However, building end-to-end models requires a significant amount of
speech data. Another challenging task associated with speech assistants is
personalization, which mainly lies in handling out-of-vocabulary (OOV) words.
In this work, we consider building an effective end-to-end ASR system in
low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel
Georgian tasks. To address these problems, we propose a method of dynamic
acoustic unit augmentation based on the BPE-dropout technique. It tokenizes
utterances non-deterministically, which extends the tokens' contexts and
regularizes their distribution, helping the model recognize unseen words. It
also reduces the need to search for an optimal subword vocabulary size. The
technique provides a steady improvement on both regular and personalized
(OOV-oriented) speech recognition tasks (at least a 6% relative WER reduction
and a 25% relative F-score improvement) at no additional computational cost.
Owing to the use of BPE-dropout, our monolingual Turkish Conformer achieved a
competitive result of 22.2% character error rate (CER) and 38.9% word error
rate (WER), close to the best published multilingual system.
Comment: 16 pages, 7 figures
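A minimal sketch of the BPE-dropout idea in plain Python (the merge table and dropout rate below are toy assumptions, not the paper's vocabulary): at each step every applicable merge is skipped with probability `p`, so the same word can tokenize differently across training epochs, exposing the model to varied subword segmentations:

```python
import random

# Toy ranked merge table (lower rank = higher priority); an illustrative
# assumption, not a learned BPE vocabulary.
MERGES = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}

def bpe_dropout_tokenize(word, p=0.1, rng=random):
    """Tokenize `word` with BPE, dropping each candidate merge w.p. `p`."""
    symbols = list(word)
    while True:
        # Collect applicable merges, skipping each one with probability p.
        candidates = [
            (MERGES[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in MERGES and rng.random() >= p
        ]
        if not candidates:
            return symbols
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
```

With `p=0.0` this reduces to deterministic BPE (e.g. `"lower"` becomes the single token `["lower"]` under the toy table), and with `p=1.0` it falls back to character-level tokenization; intermediate values yield the stochastic segmentations that act as the acoustic unit augmentation.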