Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems
Humans tend to change their way of speaking when they are immersed in a noisy
environment, a reflex known as the Lombard effect. Current speech enhancement
systems based on deep learning do not usually take into account this change in
the speaking style, because they are trained with neutral (non-Lombard) speech
utterances recorded under quiet conditions to which noise is artificially
added. In this paper, we investigate the effects that the Lombard reflex has on
the performance of audio-visual speech enhancement systems based on deep
learning. The results show a performance gap of up to approximately 5 dB
between the systems trained on neutral speech and those trained on Lombard
speech. This indicates the benefit of taking into account the mismatch between
neutral and Lombard speech in the design of audio-visual speech enhancement
systems.
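The training setup described above (neutral speech recorded in quiet, with noise added artificially) is a standard data-preparation step. A minimal sketch of that step is shown below; it is not the authors' code, and the signal lengths, sample rate, and placeholder arrays are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): mix a neutral speech utterance with
# noise at a target SNR, as is commonly done to build noisy training data.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` so the mixture has the requested SNR in dB."""
    # Loop or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Hypothetical usage: 1 s of placeholder "speech" and "noise" at 16 kHz, mixed at 0 dB SNR.
sr = 16000
speech = np.random.randn(sr).astype(np.float32) * 0.1
noise = np.random.randn(sr).astype(np.float32) * 0.1
noisy = mix_at_snr(speech, noise, snr_db=0.0)
```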
LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders
Audio-visual speech enhancement aims to extract clean speech from a noisy
environment by leveraging not only the audio itself but also the target
speaker's lip movements. This approach has been shown to yield improvements
over audio-only speech enhancement, particularly for the removal of interfering
speech. Despite recent advances in speech synthesis, most audio-visual
approaches continue to use spectral mapping/masking to reproduce the clean
audio, often resulting in visual backbones added to existing speech enhancement
architectures. In this work, we propose LA-VocE, a new two-stage approach that
predicts mel-spectrograms from noisy audio-visual speech via a
transformer-based architecture, and then converts them into waveform audio
using a neural vocoder (HiFi-GAN). We train and evaluate our framework on
thousands of speakers and 11+ different languages, and study our model's
ability to adapt to different levels of background noise and speech
interference. Our experiments show that LA-VocE outperforms existing methods
according to multiple metrics, particularly under very noisy scenarios.
Comment: Submitted to ICASSP 202
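The two-stage design described above can be pictured as a mel-spectrogram predictor followed by a neural vocoder. The sketch below is a schematic illustration under assumed module names and feature sizes, not the released LA-VocE code; in particular, the vocoder is a trivial placeholder standing in for a pretrained HiFi-GAN generator.

```python
# Schematic sketch of a two-stage AVSE pipeline: (1) a transformer predicts clean
# mel-spectrograms from noisy audio-visual features, (2) a vocoder turns them into audio.
import torch
import torch.nn as nn

class MelPredictor(nn.Module):
    def __init__(self, audio_dim=80, video_dim=512, d_model=256, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(audio_dim + video_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_mels)

    def forward(self, noisy_mels, video_feats):
        # noisy_mels: (B, T, 80) log-mel frames; video_feats: (B, T, 512) lip features,
        # assumed already resampled to the audio frame rate.
        x = self.proj(torch.cat([noisy_mels, video_feats], dim=-1))
        return self.head(self.encoder(x))            # (B, T, 80) predicted clean mels

class PlaceholderVocoder(nn.Module):
    """Stand-in for a pretrained neural vocoder: mels (B, T, 80) -> waveform (B, T*hop)."""
    def __init__(self, n_mels=80, hop=160):
        super().__init__()
        self.net = nn.Linear(n_mels, hop)

    def forward(self, mels):
        return self.net(mels).flatten(1)

# Stage 1 predicts mel-spectrograms; stage 2 vocodes them back to a waveform.
predictor, vocoder = MelPredictor(), PlaceholderVocoder()
mels = predictor(torch.randn(2, 100, 80), torch.randn(2, 100, 512))
wav = vocoder(mels)                                   # (2, 16000) at a 160-sample hop
```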
Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation
Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech with
the help of extra visual information such as lip videos, and has been shown to be more
effective than audio-only speech enhancement. This paper proposes further
incorporating ultrasound tongue images to improve lip-based AV-SE systems'
performance. Knowledge distillation is employed at the training stage to
address the challenge of acquiring ultrasound tongue images during inference,
enabling an audio-lip speech enhancement student model to learn from a
pre-trained audio-lip-tongue speech enhancement teacher model. Experimental
results demonstrate significant improvements in the quality and intelligibility
of the speech enhanced by the proposed method compared to the traditional
audio-lip speech enhancement baselines. Further analysis using phone error
rates (PER) of automatic speech recognition (ASR) shows that palatal and velar
consonants benefit most from the introduction of ultrasound tongue images.
Comment: To be published in InterSpeech 202
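The knowledge-distillation setup described above can be sketched as a student that sees only audio and lip features but is trained to match both the clean target and the outputs of a frozen teacher that additionally sees tongue features. The models and loss weighting below are placeholders under assumed interfaces, not the paper's architectures.

```python
# Minimal knowledge-distillation sketch: an audio-lip student mimics a frozen
# audio-lip-tongue teacher while also fitting the clean target.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEnhancer(nn.Module):
    """Placeholder enhancer: concatenates its input features and maps them to a spectrogram."""
    def __init__(self, in_dim, out_dim=257):
        super().__init__()
        self.net = nn.Linear(in_dim, out_dim)

    def forward(self, *feats):
        return self.net(torch.cat(feats, dim=-1))

def distillation_step(student, teacher, noisy_audio, lip_feats, tongue_feats,
                      clean_target, alpha=0.5):
    """Enhancement loss on the clean target plus a KD loss towards the teacher's output."""
    with torch.no_grad():
        teacher_out = teacher(noisy_audio, lip_feats, tongue_feats)  # teacher sees tongue images
    student_out = student(noisy_audio, lip_feats)                    # student does not
    enh_loss = F.l1_loss(student_out, clean_target)
    kd_loss = F.l1_loss(student_out, teacher_out)
    return enh_loss + alpha * kd_loss

teacher = ToyEnhancer(257 + 128 + 128)   # audio + lip + tongue features
student = ToyEnhancer(257 + 128)         # audio + lip features only
B, T = 2, 50
loss = distillation_step(student, teacher,
                         torch.randn(B, T, 257), torch.randn(B, T, 128),
                         torch.randn(B, T, 128), torch.randn(B, T, 257))
loss.backward()
```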
The Conversation: Deep Audio-Visual Speech Enhancement
Our goal is to isolate individual speakers from multi-talker simultaneous
speech in videos. Existing works in this area have focussed on trying to
separate utterances from known speakers in controlled environments. In this
paper, we propose a deep audio-visual speech enhancement network that is able
to separate a speaker's voice given lip regions in the corresponding video, by
predicting both the magnitude and the phase of the target signal. The method is
applicable to speakers unheard and unseen during training, and for
unconstrained environments. We demonstrate strong quantitative and qualitative
results, isolating extremely challenging real-world examples.
Comment: To appear in Interspeech 2018. We provide supplementary material with interactive demonstrations on http://www.robots.ox.ac.uk/~vgg/demo/theconversatio
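Predicting both magnitude and phase of the target signal amounts to correcting the noisy complex spectrogram rather than only scaling its magnitude. The sketch below illustrates that final reconstruction step with an identity mask and zero phase residual standing in for the network outputs; shapes and STFT parameters are assumptions, not the paper's configuration.

```python
# Minimal sketch: apply a predicted magnitude mask and phase residual to the
# noisy STFT, then invert back to a waveform.
import torch

def apply_mag_phase(noisy_wav, mag_mask, phase_residual, n_fft=512, hop=128):
    # noisy_wav: (B, samples); mag_mask, phase_residual: (B, F, T) network outputs.
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy_wav, n_fft, hop_length=hop, window=window, return_complex=True)
    mag = spec.abs() * mag_mask                       # scaled magnitude
    phase = spec.angle() + phase_residual             # corrected phase
    enhanced = torch.polar(mag, phase)                # back to a complex spectrogram
    return torch.istft(enhanced, n_fft, hop_length=hop, window=window,
                       length=noisy_wav.shape[-1])

# Hypothetical usage with an identity mask and zero phase correction.
wav = torch.randn(1, 16000)
F_bins, T_frames = 512 // 2 + 1, 16000 // 128 + 1
out = apply_mag_phase(wav, torch.ones(1, F_bins, T_frames), torch.zeros(1, F_bins, T_frames))
```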
AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement
Speech enhancement systems are typically trained using pairs of clean and
noisy speech. In audio-visual speech enhancement (AVSE), there is not as much
ground-truth clean data available; most audio-visual datasets are collected in
real-world environments with background noise and reverberation, hampering the
development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based
audio-visual speech enhancement approach that can generate clean speech despite
the challenges of real-world training data. We obtain a subset of nearly clean
speech from an audio-visual corpus using a neural quality estimator, and then
train a diffusion model on this subset to generate waveforms conditioned on
continuous speech representations from AV-HuBERT with noise-robust training. We
use continuous rather than discrete representations to retain prosody and
speaker information. With this vocoding task alone, the model can perform
speech enhancement better than a masking-based baseline. We further fine-tune
the diffusion model on clean/noisy utterance pairs to improve the performance.
Our approach outperforms a masking-based baseline in terms of both automatic
metrics and a human listening test and is close in quality to the target speech
in the listening test. Audio samples can be found at
https://home.ttic.edu/~jcchou/demo/avse/avse_demo.html
Comment: Submitted to ICASSP 202
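The data-curation step described above (using a neural quality estimator to extract a nearly clean training subset) can be sketched as a simple filtering pass. The estimator in the paper is a learned model; in the sketch below `estimate_quality` is only an assumed interface returning a MOS-like score, and the file paths and threshold are hypothetical.

```python
# Minimal sketch: keep only utterances whose estimated quality exceeds a threshold,
# producing a nearly clean subset for training the diffusion vocoder.
from typing import Callable, Iterable, List

def select_clean_subset(utterances: Iterable[str],
                        estimate_quality: Callable[[str], float],
                        threshold: float = 4.0) -> List[str]:
    """Return the paths of utterances whose estimated quality is at least `threshold`."""
    return [path for path in utterances if estimate_quality(path) >= threshold]

# Hypothetical usage with a dummy estimator that rates every file 4.2.
paths = ["spk1/utt001.wav", "spk1/utt002.wav"]
clean_paths = select_clean_subset(paths, lambda p: 4.2)
```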
Audio-Visual Speech Enhancement with Score-Based Generative Models
This paper introduces an audio-visual speech enhancement system that
leverages score-based generative models, also known as diffusion models,
conditioned on visual information. In particular, we exploit audio-visual
embeddings obtained from a self-supervised learning model that has been
fine-tuned on lipreading. The layer-wise features of its transformer-based
encoder are aggregated, time-aligned, and incorporated into the noise
conditional score network. Experimental evaluations show that the proposed
audio-visual speech enhancement system yields improved speech quality and
reduces generative artifacts such as phonetic confusions with respect to the
audio-only equivalent. The latter is supported by the word error rate of a
downstream automatic speech recognition model, which decreases noticeably,
especially at low input signal-to-noise ratios.
Comment: Submitted to ITG Conference on Speech Communicatio
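Aggregating the layer-wise features of a transformer encoder is commonly done with a learned weighted sum before the result is fed to the conditioning branch. The sketch below illustrates that aggregation pattern only; the layer count, feature sizes, and the linear projection standing in for the score network's conditioning input are assumptions, not the paper's implementation.

```python
# Minimal sketch: combine hidden states from all encoder layers with learned
# softmax weights and project them to a conditioning representation.
import torch
import torch.nn as nn

class LayerwiseAggregator(nn.Module):
    def __init__(self, num_layers=12, feat_dim=768, cond_dim=256):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # learned per-layer weights
        self.proj = nn.Linear(feat_dim, cond_dim)

    def forward(self, layer_feats):
        # layer_feats: (num_layers, B, T, feat_dim) hidden states from all encoder layers.
        w = torch.softmax(self.layer_weights, dim=0)                 # convex combination
        mixed = (w.view(-1, 1, 1, 1) * layer_feats).sum(dim=0)       # (B, T, feat_dim)
        return self.proj(mixed)                                      # (B, T, cond_dim) conditioning

agg = LayerwiseAggregator()
cond = agg(torch.randn(12, 2, 100, 768))    # e.g. 12 layers, batch of 2, 100 frames
```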