The Conversation: Deep Audio-Visual Speech Enhancement
Our goal is to isolate individual speakers from multi-talker simultaneous
speech in videos. Existing works in this area have focussed on trying to
separate utterances from known speakers in controlled environments. In this
paper, we propose a deep audio-visual speech enhancement network that is able
to separate a speaker's voice given lip regions in the corresponding video, by
predicting both the magnitude and the phase of the target signal. The method is
applicable to speakers unheard and unseen during training, and for
unconstrained environments. We demonstrate strong quantitative and qualitative
results, isolating extremely challenging real-world examples.
Comment: To appear in Interspeech 2018. We provide supplementary material with
interactive demonstrations on
http://www.robots.ox.ac.uk/~vgg/demo/theconversatio
Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions
Visual crowd counting has been recently studied as a way to enable people
counting in crowd scenes from images. Albeit successful, vision-based crowd
counting approaches could fail to capture informative features in extreme
conditions, e.g., imaging at night and occlusion. In this work, we introduce a
novel task of audiovisual crowd counting, in which visual and auditory
information are integrated for counting purposes. We collect a large-scale
benchmark, named auDiovISual Crowd cOunting (DISCO) dataset, consisting of
1,935 images and the corresponding audio clips, and 170,270 annotated
instances. In order to fuse the two modalities, we make use of a linear
feature-wise fusion module that carries out an affine transformation on visual
and auditory features. Finally, we conduct extensive experiments using the
proposed dataset and approach. Experimental results show that introducing
auditory information can benefit crowd counting under different illumination,
noise, and occlusion conditions. The dataset and code will be released. Code
and data have been made availabl
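The feature-wise fusion mentioned in this abstract can be illustrated with a minimal sketch, assuming the audio features are mapped linearly to a per-channel scale and shift that are then applied to the visual features. All names, shapes, and weights here are hypothetical and are not taken from the DISCO paper or its released code.

```python
# Illustrative sketch only: names, shapes, and weights are hypothetical,
# not taken from the DISCO paper or its released code.

def linear(x, weights, bias):
    """Apply y = W x + b to a feature vector x (plain lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def featurewise_affine_fusion(visual, audio, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each visual channel using audio-derived parameters."""
    gamma = linear(audio, w_gamma, b_gamma)  # per-channel scale
    beta = linear(audio, w_beta, b_beta)     # per-channel shift
    return [g * v + b for g, v, b in zip(gamma, visual, beta)]

# Toy inputs: three visual channels, two audio features.
visual = [1.0, 2.0, 3.0]
audio = [0.5, -1.0]
w_gamma = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gamma = [1.0, 1.0, 1.0]
w_beta = [[0.0, 0.0], [2.0, 0.0], [0.0, -1.0]]
b_beta = [0.0, 0.0, 0.5]

fused = featurewise_affine_fusion(visual, audio, w_gamma, b_gamma, w_beta, b_beta)
print(fused)
```

In a trained model the scale and shift weights would be learned; here they are fixed by hand so the arithmetic can be followed.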
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in automatic
speech recognition, text-to-speech synthesis, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
Egocentric Auditory Attention Localization in Conversations
In a noisy conversation environment such as a dinner party, people often
exhibit selective auditory attention, or the ability to focus on a particular
speaker while tuning out others. Recognizing who somebody is listening to in a
conversation is essential for developing technologies that can understand
social behavior and devices that can augment human hearing by amplifying
particular sound sources. The computer vision and audio research communities
have made great strides towards recognizing sound sources and speakers in
scenes. In this work, we take a step further by focusing on the problem of
localizing auditory attention targets in egocentric video, or detecting who in
a camera wearer's field of view they are listening to. To tackle the new and
challenging Selective Auditory Attention Localization problem, we propose an
end-to-end deep learning approach that uses egocentric video and multichannel
audio to predict the heatmap of the camera wearer's auditory attention. Our
approach leverages spatiotemporal audiovisual features and holistic reasoning
about the scene to make predictions, and outperforms a set of baselines on a
challenging multi-speaker conversation dataset. Project page:
https://fkryan.github.io/saa
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
12 pages, 5 figures, accepted by IJCAI 2023. Preprint
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Audio-visual learning has been a major pillar of multi-modal machine
learning, where the community mostly focused on its modality-aligned setting,
i.e., the audio and visual modalities are both assumed to signal the prediction
target. With the Look, Listen, and Parse dataset (LLP), we investigate the
under-explored unaligned setting, where the goal is to recognize audio and
visual events in a video with only weak labels observed. Such weak video-level
labels only tell which events occur, without indicating the modality in which
they are perceived (audio, visual, or both). To enhance learning in this challenging
setting, we incorporate large-scale contrastively pre-trained models as the
modality teachers. A simple, effective, and generic method, termed Visual-Audio
Label Elaboration (VALOR), is introduced to harvest modality labels for the
training events. Empirical studies show that the harvested labels significantly
improve an attentional baseline by 8.0 in average F-score (Type@AV).
Surprisingly, we found that modality-independent teachers outperform their
modality-fused counterparts, since they are immune to noise from the other,
potentially unaligned modality. Moreover, our best model achieves a new
state of the art on all metrics of LLP by a substantial margin (+5.4 F-score
for Type@AV). VALOR is further generalized to Audio-Visual Event Localization
and achieves a new state of the art there as well. Code is available at:
https://github.com/Franklin905/VALOR