3,383 research outputs found
Rule-embedded network for audio-visual voice activity detection in live musical video streams
Detecting anchor's voice in live musical streams is an important
preprocessing for music and speech signal processing. Existing approaches to
voice activity detection (VAD) primarily rely on audio, however, audio-based
VAD is difficult to effectively focus on the target voice in noisy
environments. With the help of visual information, this paper proposes a
rule-embedded network to fuse the audio-visual (A-V) inputs to help the model
better detect target voice. The core role of the rule in the model is to
coordinate the relation between the bi-modal information and use visual
representations as the mask to filter out the information of non-target sound.
Experiments show that: 1) with the help of cross-modal fusion by the proposed
rule, the detection result of A-V branch outperforms that of audio branch; 2)
the performance of bi-modal model far outperforms that of audio-only models,
indicating that the incorporation of both audio and visual signals is highly
beneficial for VAD. To attract more attention to the cross-modal music and
audio signal processing, a new live musical video corpus with frame-level label
is introduced.Comment: Submitted to ICASSP 202
Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams
Many previous audio-visual voice-related works focus on speech, ignoring the
singing voice in the growing number of musical video streams on the Internet.
For processing diverse musical video data, voice activity detection is a
necessary step. This paper attempts to detect the speech and singing voices of
target performers in musical video streams using audiovisual information. To
integrate information of audio and visual modalities, a multi-branch network is
proposed to learn audio and image representations, and the representations are
fused by attention based on semantic similarity to shape the acoustic
representations through the probability of anchor vocalization. Experiments
show the proposed audio-visual multi-branch network far outperforms the
audio-only model in challenging acoustic environments, indicating the
cross-modal information fusion based on semantic correlation is sensible and
successful.Comment: Accepted by INTERSPEECH 202
Bio-Inspired Modality Fusion for Active Speaker Detection
Human beings have developed fantastic abilities to integrate information from
various sensory sources exploring their inherent complementarity. Perceptual
capabilities are therefore heightened enabling, for instance, the well known
"cocktail party" and McGurk effects, i.e. speech disambiguation from a panoply
of sound signals. This fusion ability is also key in refining the perception of
sound source location, as in distinguishing whose voice is being heard in a
group conversation. Furthermore, Neuroscience has successfully identified the
superior colliculus region in the brain as the one responsible for this
modality fusion, with a handful of biological models having been proposed to
approach its underlying neurophysiological process. Deriving inspiration from
one of these models, this paper presents a methodology for effectively fusing
correlated auditory and visual information for active speaker detection. Such
an ability can have a wide range of applications, from teleconferencing systems
to social robotics. The detection approach initially routes auditory and visual
information through two specialized neural network structures. The resulting
embeddings are fused via a novel layer based on the superior colliculus, whose
topological structure emulates spatial neuron cross-mapping of unimodal
perceptual fields. The validation process employed two publicly available
datasets, with achieved results confirming and greatly surpassing initial
expectations.Comment: Submitted to IEEE RA-L with IROS option, 202
AVA-AVD: Audio-Visual Speaker Diarization in the Wild
Audio-visual speaker diarization aims at detecting "who spoke when" using
both auditory and visual signals. Existing audio-visual diarization datasets
are mainly focused on indoor environments like meeting rooms or news studios,
which are quite different from in-the-wild videos in many scenarios such as
movies, documentaries, and audience sitcoms. To develop diarization methods for
these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD)
dataset. Our experiments demonstrate that adding AVA-AVD into training set can
produce significantly better diarization models for in-the-wild videos despite
that the data is relatively small. Moreover, this benchmark is challenging due
to the diverse scenes, complicated acoustic conditions, and completely
off-screen speakers. As a first step towards addressing the challenges, we
design the Audio-Visual Relation Network (AVR-Net) which introduces a simple
yet effective modality mask to capture discriminative information based on face
visibility. Experiments show that our method not only can outperform
state-of-the-art methods but is more robust as varying the ratio of off-screen
speakers. Our data and code has been made publicly available at
https://github.com/showlab/AVA-AVD.Comment: ACMMM 202
- …