Egocentric Audio-Visual Noise Suppression
This paper studies audio-visual noise suppression for egocentric videos, where
the speaker is not captured in the video. Instead, potential noise sources are
visible on screen, with the camera emulating the off-screen speaker's view of
the outside world. This setting is different from prior work in audio-visual
speech enhancement that relies on lip and facial visuals. In this paper, we
first demonstrate that egocentric visual information is helpful for noise
suppression. We compare object recognition and action classification based
visual feature extractors, and investigate methods to align audio and visual
representations. Then, we examine different fusion strategies for the aligned
features, and locations within the noise suppression model to incorporate
visual information. Experiments demonstrate that visual features are most
helpful when used to generate additive correction masks. Finally, in order to
ensure that the visual features are discriminative with respect to different
noise types, we introduce a multi-task learning framework that jointly
optimizes audio-visual noise suppression and video-based acoustic event
detection. The proposed multi-task framework outperforms the audio-only
baseline on all metrics, including a 0.16 PESQ improvement. Extensive ablations
show that these improvements hold with multiple active distractors, over all
noise types, and across different SNRs.
Comment: Under Review at ICASSP 202
End-to-End Speech Recognition Contextualization with Large Language Models
In recent years, Large Language Models (LLMs) have garnered significant
attention from the research community due to their exceptional performance and
generalization capabilities. In this paper, we introduce a novel method for
contextualizing speech recognition models by incorporating LLMs. Our approach
casts speech recognition as a mixed-modal language modeling task based on a
pretrained LLM. We provide audio features, along with optional text tokens for
context, to train the system to complete transcriptions in a decoder-only
fashion. As a result, the system is implicitly incentivized to learn how to
leverage unstructured contextual information during training. Our empirical
results demonstrate a significant improvement in performance, with a 6% WER
reduction when additional textual context is provided. Moreover, our method
performs competitively against a baseline contextualized RNN-T system trained
on a speech dataset more than twenty-five times larger, improving WER by 7.5%
overall and by 17% on rare words. Overall, we demonstrate that by adding only a
handful of trainable parameters via adapters, we can unlock contextualized
speech recognition for the pretrained LLM while retaining its text-only input
functionality.
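A hedged sketch of the decoder-only contextualization described above: speech-encoder features are mapped through a small trainable adapter into the LLM's embedding space, optional context-text embeddings are prepended, and the frozen decoder-only LLM is trained to complete the transcription. The adapter design, dimensions, and the HuggingFace-style inputs_embeds interface are assumptions for illustration, not the paper's exact recipe.

import torch
import torch.nn as nn

class SpeechLLMContextualizer(nn.Module):
    def __init__(self, llm, d_audio=1024, d_model=4096):
        super().__init__()
        self.llm = llm                               # pretrained decoder-only LLM, kept frozen
        for p in self.llm.parameters():
            p.requires_grad = False
        self.adapter = nn.Sequential(                # the only trainable parameters
            nn.Linear(d_audio, d_model), nn.GELU(), nn.Linear(d_model, d_model))

    def forward(self, audio_feats, context_embeds, target_embeds):
        # audio_feats: (B, Ta, d_audio) speech-encoder outputs
        # context_embeds: (B, Tc, d_model) embeddings of optional biasing text
        # target_embeds: (B, Tt, d_model) embeddings of the reference transcription
        audio_embeds = self.adapter(audio_feats)
        inputs = torch.cat([context_embeds, audio_embeds, target_embeds], dim=1)
        # Next-token prediction loss would be applied only over the target span.
        return self.llm(inputs_embeds=inputs)

# Assumed wiring with a HuggingFace-style causal LM that accepts inputs_embeds:
# llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# model = SpeechLLMContextualizer(llm)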
Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data
In this work, we extend the instruction-tuned Llama-2 model with end-to-end
general-purpose speech processing and reasoning abilities while maintaining the
wide range of LLM capabilities, without using any carefully curated paired
data. The proposed model can utilize audio prompts as a replacement for text
and sustain a conversation. The model also gains cross-modal capabilities such
as speech question answering, speech translation, and audio summarization,
among many other closed- and open-domain tasks. This is unlike prior
approaches in speech, in which LLMs are extended to
handle audio for a limited number of pre-designated tasks. Experiments show
that our end-to-end approach is on par with or outperforms a cascaded system
(speech recognizer + LLM) in terms of modeling the response to a prompt.
Furthermore, unlike a cascade, our approach can interchange text and audio
modalities and utilize the prior conversational context to provide better
results.
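As a rough illustration of the modality interchange described above, the Python sketch below maps each conversation turn, whether text token ids or audio features, into a shared embedding space before handing the concatenated sequence to a decoder-only LLM. The embedding table, projector, and dimensions are invented stand-ins, not the paper's components.

import torch
import torch.nn as nn

d_model = 512                                    # assumed LLM embedding size
text_embed = nn.Embedding(32000, d_model)        # stand-in for the LLM token embeddings
audio_proj = nn.Linear(80, d_model)              # stand-in projector for speech features

def embed_turn(turn):
    """Map a single turn (text token ids or audio features) to LLM-space embeddings."""
    if turn["modality"] == "text":
        return text_embed(turn["tokens"])        # (T, d_model)
    return audio_proj(turn["features"])          # (T, d_model)

# A toy conversation that interleaves an audio prompt with text context.
conversation = [
    {"modality": "text", "tokens": torch.randint(0, 32000, (12,))},
    {"modality": "audio", "features": torch.randn(200, 80)},
    {"modality": "text", "tokens": torch.randint(0, 32000, (8,))},
]
prompt_embeds = torch.cat([embed_turn(t) for t in conversation], dim=0)
# prompt_embeds (T_total, d_model) would be fed to the decoder-only LLM, which
# generates the response conditioned on all prior turns regardless of modality.
print(prompt_embeds.shape)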