Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Existing audio-visual event localization (AVE) methods handle manually trimmed
videos, each containing only a single event instance. However, this setting is
unrealistic as natural videos often contain numerous audio-visual events with
different categories. To better adapt to real-life applications, in this paper
we focus on the task of dense-localizing audio-visual events, which aims to
jointly localize and recognize all audio-visual events occurring in an
untrimmed video. The problem is challenging as it requires fine-grained
audio-visual scene and context understanding. To tackle this problem, we
introduce UnAV-100, the first Untrimmed Audio-Visual dataset, which contains
10K untrimmed videos with over 30K audio-visual events. Each video has 2.8
audio-visual events on average, and the events are usually related to each
other and might co-occur as in real-life scenes. Next, we formulate the task
using a new learning-based framework, which is capable of fully integrating
audio and visual modalities to localize audio-visual events with various
lengths and capture dependencies between them in a single pass. Extensive
experiments demonstrate the effectiveness of our method as well as the
significance of multi-scale cross-modal perception and dependency modeling for
this task.
Comment: Accepted by CVPR 2023
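As a toy illustration of the dense setting, each untrimmed video carries a list of possibly overlapping (start, end, category) events rather than a single trimmed instance. The video IDs, category names, and timestamps below are hypothetical, not drawn from UnAV-100:

```python
# Each untrimmed video maps to a list of (start_sec, end_sec, category)
# events; events may overlap, unlike the single-event AVE setting.
videos = {
    "video_0001": [(0.0, 4.2, "dog_barking"), (3.5, 9.0, "man_speaking")],
    "video_0002": [(1.0, 2.5, "car_horn"), (2.0, 6.0, "engine"),
                   (5.5, 7.0, "siren")],
}

def events_per_video(annotations):
    """Average number of audio-visual events per untrimmed video."""
    counts = [len(events) for events in annotations.values()]
    return sum(counts) / len(counts)

def overlapping_pairs(events):
    """Pairs of co-occurring events: intervals that intersect in time."""
    pairs = []
    for i, (s1, e1, c1) in enumerate(events):
        for s2, e2, c2 in events[i + 1:]:
            if max(s1, s2) < min(e1, e2):  # non-empty intersection
                pairs.append((c1, c2))
    return pairs

print(events_per_video(videos))                 # → 2.5
print(overlapping_pairs(videos["video_0001"]))  # → [('dog_barking', 'man_speaking')]
```

Dependency modeling in the paper's sense operates on exactly this kind of co-occurrence structure.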
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), also known as natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
attempts to provide a summary of fundamental concepts in TSGV and current
research status, as well as future research directions. As the background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which is the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions.
Comment: 29 pages, 32 figures, 9 tables
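The functional pipeline the survey describes (feature extraction from video and query, multimodal matching, answer prediction of the target moment) can be sketched in a deliberately simplified proposal-based form. The random stand-in features and the cosine-similarity scoring are illustrative assumptions, not any specific surveyed method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for extracted features: T clip features and one query feature.
# We fabricate a query that matches clips 8..11 so retrieval is checkable.
T, D = 20, 16
clip_feats = rng.normal(size=(T, D))
query_feat = clip_feats[8:12].mean(axis=0)

def score_moment(clip_feats, query_feat, s, e):
    """Cosine similarity between the query and the mean feature of clips [s, e)."""
    m = clip_feats[s:e].mean(axis=0)
    return float(m @ query_feat / (np.linalg.norm(m) * np.linalg.norm(query_feat)))

def retrieve_moment(clip_feats, query_feat, min_len=2, max_len=8):
    """Proposal-based answer prediction: enumerate candidate windows and
    return the best-scoring (start, end) moment."""
    return max(
        ((s, s + l) for l in range(min_len, max_len + 1)
         for s in range(len(clip_feats) - l + 1)),
        key=lambda w: score_moment(clip_feats, query_feat, *w),
    )

print(retrieve_moment(clip_feats, query_feat))  # → (8, 12)
```

Proposal-free methods in the survey's taxonomy instead regress the boundaries directly, but the extract-interact-predict structure is the same.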
Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing
This paper focuses on the weakly-supervised audio-visual video parsing task,
which aims to recognize all events belonging to each modality and localize
their temporal boundaries. This task is challenging because only overall labels
indicating the video events are provided for training. However, an event might
be labeled but not appear in one of the modalities, which results in a
modality-specific noisy label problem. In this work, we propose a training
strategy to identify and remove modality-specific noisy labels dynamically. It
is motivated by two key observations: 1) networks tend to learn clean samples
first; and 2) a labeled event would appear in at least one modality.
Specifically, we sort the losses of all instances within a mini-batch
individually in each modality, and then select noisy samples according to the
relationships between intra-modal and inter-modal losses. Besides, we also
propose a simple but valid noise ratio estimation method by calculating the
proportion of instances whose confidence is below a preset threshold. Our
method yields large improvements over the previous state of the art (e.g., from
60.0% to 63.8% on the segment-level visual metric), which demonstrates the
effectiveness of our approach. Code and trained models are publicly available
at https://github.com/MCG-NJU/JoMoLD.
Comment: Accepted by ECCV 2022
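A loose sketch of the two ingredients, the threshold-based noise-ratio estimate and per-modality loss sorting, is below. The threshold value, the selection rule, and the both-modalities tie-break are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def estimate_noise_ratio(confidences, thresh=0.5):
    """Estimated noise ratio: fraction of instances whose confidence falls
    below a preset threshold (0.5 here is an arbitrary illustrative value)."""
    return float((np.asarray(confidences) < thresh).mean())

def select_noisy(audio_losses, visual_losses, noise_ratio):
    """Within a mini-batch, sort each modality's losses individually and
    flag the highest-loss fraction as noisy, following the observation
    that networks learn clean samples first."""
    k = int(round(noise_ratio * len(audio_losses)))
    noisy_a = set(np.argsort(audio_losses)[-k:]) if k else set()
    noisy_v = set(np.argsort(visual_losses)[-k:]) if k else set()
    # Observation 2: a labeled event appears in at least one modality, so
    # an instance flagged in *both* modalities is not removed from both.
    flagged_in_both = noisy_a & noisy_v
    return noisy_a - flagged_in_both, noisy_v - flagged_in_both

print(estimate_noise_ratio([0.2, 0.9, 0.4, 0.8]))             # → 0.5
print(select_noisy([1.0, 5.0, 2.0, 9.0],
                   [9.0, 1.0, 2.0, 1.0], 0.25))               # → ({3}, {0})
```

The flagged labels would then be removed from that modality's targets before the next training step.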
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Audio-visual learning has been a major pillar of multi-modal machine
learning, where the community mostly focused on its modality-aligned setting,
i.e., the audio and visual modalities are both assumed to signal the prediction
target. With the Look, Listen, and Parse dataset (LLP), we investigate the
under-explored unaligned setting, where the goal is to recognize audio and
visual events in a video with only weak labels observed. Such weak video-level
labels only tell which events happen, without indicating the modality in which
they are perceived (audio, visual, or both). To enhance learning in this challenging
setting, we incorporate large-scale contrastively pre-trained models as the
modality teachers. A simple, effective, and generic method, termed Visual-Audio
Label Elaboration (VALOR), is introduced to harvest modality labels for the
training events. Empirical studies show that the harvested labels significantly
improve an attentional baseline by 8.0 in average F-score (Type@AV).
Surprisingly, we found that modality-independent teachers outperform their
modality-fused counterparts, since they are immune to noise from the other,
potentially unaligned modality. Moreover, our best model achieves the new
state-of-the-art on all metrics of LLP by a substantial margin (+5.4 F-score
for Type@AV). VALOR is further generalized to Audio-Visual Event Localization
and achieves the new state-of-the-art as well. Code is available at:
https://github.com/Franklin905/VALOR
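The idea of harvesting modality labels with independent teachers can be sketched as follows. The class vocabulary, the toy embeddings, and the nearest-class rule are hypothetical stand-ins for the contrastively pre-trained teachers, not VALOR's actual procedure:

```python
import numpy as np

# Toy shared embedding space: one axis per class, as a stand-in for the
# text/image/audio encoders of a contrastively pre-trained teacher.
classes = ["dog_barking", "speech", "siren"]
class_emb = np.eye(3)
visual_emb = np.array([0.9, 0.1, 0.0])    # the frames "look like" a dog
audio_emb = np.array([0.1, 0.95, 0.05])   # the track "sounds like" speech

def harvest_modality_label(feat, class_emb, classes, weak_labels):
    """Pick, among the weak video-level labels, the class whose teacher
    embedding is most similar to this single modality's feature. One
    independent teacher per modality keeps noise from the other,
    potentially unaligned modality out of the decision."""
    sims = class_emb @ feat / (
        np.linalg.norm(class_emb, axis=1) * np.linalg.norm(feat))
    return max(weak_labels, key=lambda c: sims[classes.index(c)])

weak = ["dog_barking", "speech"]  # video-level labels, modalities unknown
print(harvest_modality_label(visual_emb, class_emb, classes, weak))  # → dog_barking
print(harvest_modality_label(audio_emb, class_emb, classes, weak))   # → speech
```

A modality-fused teacher would score the concatenated features jointly, which is exactly where the unaligned modality's noise would leak in.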
CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
Audio-visual video segmentation~(AVVS) aims to generate pixel-level maps of
sound-producing objects within image frames and ensure the maps faithfully
adhere to the given audio, such as identifying and segmenting a singing person
in a video. However, existing methods exhibit two limitations: 1) they address
video temporal features and audio-visual interactive features separately,
disregarding the inherent spatial-temporal dependence of combined audio and
video, and 2) they inadequately introduce audio constraints and object-level
information during the decoding stage, resulting in segmentation outcomes that
fail to comply with audio directives. To tackle these issues, we propose a
decoupled audio-video transformer that combines audio and video features from
their respective temporal and spatial dimensions, capturing their combined
dependence. To optimize memory consumption, we design a block that, when
stacked, captures fine-grained audio-visual combinatorial dependence in a
memory-efficient manner. Additionally, we introduce audio-constrained
queries during the decoding phase. These queries contain rich object-level
information, ensuring the decoded mask adheres to the sounds. Experimental
results confirm our approach's effectiveness, with our framework achieving a
new SOTA performance on all three datasets using two backbones. The code is
available at https://github.com/aspirinone/CATR.github.io
Comment: Accepted by ACM MM 2023
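A minimal sketch of the audio-constrained query idea: each decoder query is conditioned on the audio embedding before attending over pixel features, so the decoded masks follow the audio directive. The dimensions, values, and additive conditioning are toy assumptions, not CATR's actual decoder:

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def audio_constrained_decode(queries, audio_feat, pixel_feats):
    """Inject the audio embedding into each learnable object query, then
    attend over pixel features; each row is a soft mask over pixels."""
    q = queries + audio_feat            # audio-constrained queries
    return softmax(q @ pixel_feats.T)   # (num_queries, num_pixels)

queries = np.array([[1.0, 0.0], [0.0, 1.0]])  # two learnable queries (toy)
audio = np.array([0.5, 0.0])                  # audio embedding (toy)
pixels = np.array([[4.0, 0.0], [0.0, 4.0]])   # two pixel features (toy)
masks = audio_constrained_decode(queries, audio, pixels)
```

With the audio embedding pointing along the first axis, the first query's mask concentrates on the first pixel, illustrating how the constraint steers decoding.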
Audio-Visual Segmentation
We propose to explore a new problem called audio-visual segmentation (AVS),
in which the goal is to output a pixel-level map of the object(s) that produce
sound at the time of the image frame. To facilitate this research, we construct
the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise
annotations for the sounding objects in audible videos. Two settings are
studied with this benchmark: 1) semi-supervised audio-visual segmentation with
a single sound source and 2) fully-supervised audio-visual segmentation with
multiple sound sources. To deal with the AVS problem, we propose a novel method
that uses a temporal pixel-wise audio-visual interaction module to inject audio
semantics as guidance for the visual segmentation process. We also design a
regularization loss to encourage the audio-visual mapping during training.
Quantitative and qualitative experiments on the AVSBench compare our approach
to several existing methods from related tasks, demonstrating that the proposed
method is promising for building a bridge between the audio and pixel-wise
visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench.
Comment: ECCV 2022; corrected equation (3) and updated the notation of the
evaluation metrics in the last arXiv version; code is available at
https://github.com/OpenNLPLab/AVSBench
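The core idea of injecting audio semantics as guidance for pixel-level segmentation can be sketched with a similarity-plus-sigmoid toy; this is an illustrative stand-in, not the paper's temporal pixel-wise interaction module:

```python
import numpy as np

def audio_guided_mask(pixel_feats, audio_feat):
    """Score each pixel by its similarity to the audio embedding and pass
    the score through a sigmoid to get a soft sounding-object mask."""
    H, W, D = pixel_feats.shape
    scores = pixel_feats.reshape(-1, D) @ audio_feat
    return (1.0 / (1.0 + np.exp(-scores))).reshape(H, W)

# A 4x4 toy frame whose centre 2x2 region matches the audio semantics.
pixel_feats = np.zeros((4, 4, 3))
pixel_feats[1:3, 1:3] = np.array([3.0, 0.0, 0.0])
audio_feat = np.array([1.0, 0.0, 0.0])
mask = audio_guided_mask(pixel_feats, audio_feat)
```

The sounding region receives scores near 1 while silent background pixels stay at the sigmoid's neutral 0.5, which is the kind of audio-to-pixel bridge the benchmark is designed to evaluate.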
Intermodal Interaction in Deep Neural Networks for the Classification and Localization of Audio-Visual Events
The automatic understanding of the surrounding world has a wide range of
applications, such as surveillance and security, human-computer interaction,
robotics, health care, etc. This understanding can be expressed through
different tasks, such as event classification and localization in space.
Living beings exploit as much of the available information as possible to
understand their surroundings. Inspired by this behavior, artificial neural
networks should likewise jointly use several modalities, for example vision
and hearing.
First, audio-visual models for classification and localization must be
evaluated objectively. We therefore recorded a new dataset to complement the
currently available ones. As no audio-visual model for classification and
localization existed, only the audio part of the dataset is evaluated, with a
model from the literature.
Second, we focus on the core of the thesis: how to jointly use visual and
audio information to solve a specific task, event recognition. The brain does
not perform a "simple" fusion but involves multiple interactions between the
two modalities, creating a strong coupling between the processing of visual
and audio information. Neural networks make it possible to create interactions
between the modalities in addition to fusion. In this thesis, we explore
several strategies for fusing the visual and audio modalities and for creating
interactions between them. These techniques achieved the best performance
compared to state-of-the-art architectures at the time of publication. They
demonstrate the usefulness of audio-visual fusion and, above all, the
importance of interactions between the modalities.
To conclude the thesis, we propose a reference network for the classification
and localization of audio-visual events, tested on the new dataset. Previous
classification models are modified to take spatial localization into account
in addition to classification.
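The thesis's central contrast, merging modalities only at the end versus letting them interact along the way, can be sketched with a toy gated coupling; this is illustrative only, not one of the thesis's actual architectures:

```python
import numpy as np

def late_fusion(audio, visual):
    """Plain fusion: the two streams meet only at a final concatenation."""
    return np.concatenate([audio, visual])

def gated_interaction(audio, visual):
    """A minimal inter-modal interaction: each modality reweights the
    other through a sigmoid gate before fusion, coupling the two
    processing streams rather than merging them only at the end."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    return np.concatenate([audio * sig(visual), visual * sig(audio)])

a, v = np.array([0.0, 2.0]), np.array([2.0, 0.0])
print(late_fusion(a, v))         # → [0. 2. 2. 0.]
print(gated_interaction(a, v))   # → [0. 1. 1. 0.]
```

In the gated version each stream's output already depends on the other modality, which is the coupling the thesis argues matters more than the fusion itself.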