An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition
The two most popular loss functions for streaming end-to-end automatic speech
recognition (ASR) are the RNN-Transducer (RNN-T) and the connectionist temporal
classification (CTC) objectives. Both perform an alignment-free training by
marginalizing over all possible alignments, but use different transition rules.
Between these two loss types we can classify the monotonic RNN-T (MonoRNN-T)
and the recently proposed CTC-like Transducer (CTC-T), both of which can be
realized using the graph temporal classification-transducer (GTC-T) loss
function. Monotonic transducers have a few advantages. First, RNN-T can suffer
from runaway hallucination, where a model keeps emitting non-blank symbols
without advancing in time, often in an infinite loop. Second, monotonic
transducers consume exactly one model score per time step and are therefore
more compatible and unifiable with traditional FST-based hybrid ASR decoders.
However, the MonoRNN-T so far has been found to have worse accuracy than RNN-T.
It does not have to be that way, though: by regularizing the training, via
joint LAS training or parameter initialization from RNN-T, both MonoRNN-T and
CTC-T perform as well as or better than RNN-T. This is demonstrated for
LibriSpeech and for a large-scale in-house data set.
Comment: Submitted to Interspeech 202
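To make the contrast concrete, the sketch below (PyTorch, with a hypothetical joiner and predictor standing in for a trained model; this is not the paper's implementation) compares greedy decoding for a standard RNN-T, which may emit several non-blank symbols per frame, with a monotonic transducer that consumes exactly one model score per time step.

import torch

BLANK = 0
VOCAB = 32
DIM = 16

# Hypothetical stand-ins for a trained joint network and prediction network.
joiner = torch.nn.Linear(2 * DIM, VOCAB)
predictor = torch.nn.GRUCell(VOCAB, DIM)

def update_predictor(k, state):
    one_hot = torch.nn.functional.one_hot(torch.tensor([k]), VOCAB).float()
    return predictor(one_hot, state)

def greedy_rnnt(enc, max_symbols_per_frame=10):
    # Standard RNN-T: may emit several non-blank symbols per frame.  Without
    # the max_symbols_per_frame cap it can loop forever ("runaway hallucination").
    hyp, state = [], torch.zeros(1, DIM)
    for t in range(enc.size(0)):
        for _ in range(max_symbols_per_frame):
            k = int(joiner(torch.cat([enc[t], state.squeeze(0)])).argmax())
            if k == BLANK:
                break                      # blank advances to the next frame
            hyp.append(k)
            state = update_predictor(k, state)
    return hyp

def greedy_monotonic(enc):
    # Monotonic transducer (MonoRNN-T / CTC-T style): exactly one model score
    # is consumed per time step, so decoding always ends after T steps.
    hyp, state = [], torch.zeros(1, DIM)
    for t in range(enc.size(0)):
        k = int(joiner(torch.cat([enc[t], state.squeeze(0)])).argmax())
        if k != BLANK:
            hyp.append(k)
            state = update_predictor(k, state)
    return hyp

enc = torch.randn(50, DIM)                 # fake encoder output with T=50 frames
print(greedy_rnnt(enc))
print(greedy_monotonic(enc))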
Anchored Speech Recognition with Neural Transducers
Neural transducers have achieved human level performance on standard speech
recognition benchmarks. However, their performance significantly degrades in
the presence of cross-talk, especially when the primary speaker has a low
signal-to-noise ratio. Anchored speech recognition refers to a class of methods
that use information from an anchor segment (e.g., wake-words) to recognize
device-directed speech while ignoring interfering background speech. In this
paper, we investigate anchored speech recognition to make neural transducers
robust to background speech. We extract context information from the anchor
segment with a tiny auxiliary network, and use encoder biasing and joiner
gating to guide the transducer towards the target speech. Moreover, to improve
the robustness of context embedding extraction, we propose auxiliary training
objectives to disentangle lexical content from speaking style. We evaluate our
methods on synthetic LibriSpeech-based mixtures comprising several SNR and
overlap conditions; averaged over all conditions, they yield a 19.6% relative
word error rate improvement over a strong baseline.
Comment: To appear at IEEE ICASSP 202
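The sketch below illustrates, with assumed module names and dimensions rather than the paper's actual architecture, how a context embedding from a tiny auxiliary network over the anchor segment could drive encoder biasing and joiner gating.

import torch
import torch.nn as nn

class AnchorConditioning(nn.Module):
    def __init__(self, feat_dim=80, enc_dim=256, ctx_dim=64):
        super().__init__()
        # tiny auxiliary network summarizing the anchor (e.g. wake-word) segment
        self.aux = nn.GRU(feat_dim, ctx_dim, batch_first=True)
        # encoder biasing: project the context and add it to every encoder frame
        self.bias_proj = nn.Linear(ctx_dim, enc_dim)
        # joiner gating: per-dimension sigmoid gate on the joiner's acoustic input
        self.gate_proj = nn.Linear(ctx_dim, enc_dim)

    def forward(self, anchor_feats, enc_out):
        # anchor_feats: (B, T_anchor, feat_dim); enc_out: (B, T, enc_dim)
        _, ctx = self.aux(anchor_feats)                       # (1, B, ctx_dim)
        ctx = ctx[-1]                                         # (B, ctx_dim)
        biased = enc_out + self.bias_proj(ctx).unsqueeze(1)   # encoder biasing
        gate = torch.sigmoid(self.gate_proj(ctx)).unsqueeze(1)
        return biased * gate                                  # gated joiner input

cond = AnchorConditioning()
anchor = torch.randn(2, 40, 80)            # features of the anchor segment
enc = torch.randn(2, 200, 256)             # encoder output for the full utterance
print(cond(anchor, enc).shape)             # torch.Size([2, 200, 256])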
Towards Selection of Text-to-speech Data to Augment ASR Training
This paper presents a method for selecting appropriate synthetic speech
samples from a given large text-to-speech (TTS) dataset as supplementary
training data for an automatic speech recognition (ASR) model. We trained a
neural network, which can be optimised using cross-entropy loss or ArcFace
loss, to measure the similarity of synthetic data to real speech. We found
that incorporating synthetic samples with considerable dissimilarity to real
speech, owing in part to lexical differences, into ASR training is crucial for
boosting recognition performance. Experimental results on LibriSpeech test sets
indicate that, in order to maintain the same speech recognition accuracy as when
using all TTS data, our proposed solution can substantially reduce the size of
the TTS data used, which is superior to several baseline methods.
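As a rough illustration (hypothetical names and a simplified selection rule, not the paper's code), a small network trained with cross-entropy loss to separate real from synthetic speech can score each TTS utterance, and a subset spanning both similar and deliberately dissimilar samples can be kept:

import torch
import torch.nn as nn

class SimilarityScorer(nn.Module):
    # Binary classifier: P(real | utterance embedding).
    def __init__(self, emb_dim=192):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 2))

    def forward(self, emb):
        return self.head(emb)              # logits for {synthetic, real}

scorer = SimilarityScorer()                # would be trained on real vs. TTS data

def select_tts_subset(embeddings, keep_ratio=0.3):
    # Rank synthetic utterances by their "real-ness" score and keep samples
    # from both ends of the ranking, since the abstract reports that samples
    # dissimilar to real speech also matter for ASR training.
    with torch.no_grad():
        scores = torch.softmax(scorer(embeddings), dim=-1)[:, 1]
    order = torch.argsort(scores, descending=True)
    n_keep = int(keep_ratio * len(order))
    half = n_keep // 2
    return torch.cat([order[:half], order[-(n_keep - half):]])

tts_embeddings = torch.randn(1000, 192)    # hypothetical utterance embeddings
print(select_tts_subset(tts_embeddings).shape)   # torch.Size([300])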
Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data
In this work, we extend the instruction-tuned Llama-2 model with end-to-end
general-purpose speech processing and reasoning abilities while maintaining the
wide range of LLM capabilities, without using any carefully curated paired
data. The proposed model can utilize audio prompts as a replacement for text
and sustain a conversation. Such a model also has extended cross-modal
capabilities such as being able to perform speech question answering, speech
translation, and audio summarization amongst many other closed and open-domain
tasks. This is unlike prior approaches in speech, in which LLMs are extended to
handle audio for a limited number of pre-designated tasks. Experiments show
that our end-to-end approach is on par with or outperforms a cascaded system
(speech recognizer + LLM) in terms of modeling the response to a prompt.
Furthermore, unlike a cascade, our approach shows the ability to interchange
text and audio modalities and utilize the prior context in a conversation to
provide better results.
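A common recipe for this kind of model, sketched below with assumed module names and dimensions rather than the paper's implementation, is to project speech-encoder features into the LLM's embedding space and prepend them to the text token embeddings, so an audio prompt can stand in for text.

import torch
import torch.nn as nn

class AudioPromptAdapter(nn.Module):
    def __init__(self, audio_dim=512, llm_dim=4096, stride=4):
        super().__init__()
        self.stride = stride               # stack frames to shorten the sequence
        self.proj = nn.Linear(audio_dim * stride, llm_dim)

    def forward(self, audio_feats):
        # audio_feats: (B, T, audio_dim) from a (frozen) speech encoder
        B, T, D = audio_feats.shape
        T = (T // self.stride) * self.stride
        stacked = audio_feats[:, :T].reshape(B, T // self.stride, D * self.stride)
        return self.proj(stacked)          # (B, T / stride, llm_dim)

adapter = AudioPromptAdapter()
audio = torch.randn(1, 160, 512)           # speech-encoder output for the prompt
text_emb = torch.randn(1, 20, 4096)        # embedded text tokens of the dialogue
# The decoder-only LLM then attends over audio embeddings in place of text.
llm_input = torch.cat([adapter(audio), text_emb], dim=1)
print(llm_input.shape)                     # torch.Size([1, 60, 4096])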
Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model
Neural network pruning offers an effective method for compressing a
multilingual automatic speech recognition (ASR) model with minimal performance
loss. However, it entails several rounds of pruning and re-training that need to
be run for each language. In this work, we propose the use of an adaptive
masking approach in two scenarios for pruning a multilingual ASR model
efficiently, yielding either sparse monolingual models or a sparse
multilingual model (named Dynamic ASR Pathways). Our approach dynamically
adapts the sub-network, avoiding premature decisions about a fixed sub-network
structure. We show that our approach outperforms existing pruning methods when
targeting sparse monolingual models. Further, we illustrate that Dynamic ASR
Pathways jointly discovers and trains better sub-networks (pathways) of a
single multilingual model by adapting from different sub-network
initializations, thereby reducing the need for language-specific pruning.
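The sketch below illustrates the general idea of adaptive masking during iterative magnitude pruning: the mask is re-derived from the current weight magnitudes at every round instead of being fixed after the first one, so the sub-network can keep adapting. Names and the training hook are placeholders, not the actual Dynamic ASR Pathways procedure.

import torch

def magnitude_mask(weight, sparsity):
    # Keep the largest-magnitude (1 - sparsity) fraction of weights.
    k = int(weight.numel() * (1.0 - sparsity))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()

def prune_round(model, sparsity, train_one_epoch):
    # One round: re-derive the masks (adaptive), apply them, then fine-tune.
    masks = {name: magnitude_mask(p.data, sparsity)
             for name, p in model.named_parameters() if p.dim() > 1}
    for name, p in model.named_parameters():
        if name in masks:
            p.data *= masks[name]          # zero out the pruned weights
    train_one_epoch(model, masks)          # placeholder: gradients would be
    return masks                           # masked here to preserve sparsity

# Usage sketch: gradually raise sparsity, re-adapting the mask each round.
model = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 100))
for sparsity in (0.3, 0.5, 0.7):
    prune_round(model, sparsity, train_one_epoch=lambda m, masks: None)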