11 research outputs found
Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
The goal of the audio-visual segmentation (AVS) task is to segment the
sounding objects in the video frames using audio cues. However, current
fusion-based methods have performance limitations due to the small
receptive field of convolution and inadequate fusion of audio-visual features.
To overcome these issues, we propose a novel Audio-aware query-enhanced
TRansformer (AuTR) to tackle the task. Unlike existing
methods, our approach introduces a multimodal transformer architecture that
enables deep fusion and aggregation of audio-visual features. Furthermore, we
devise an audio-aware query-enhanced transformer decoder that explicitly helps
the model focus on the segmentation of the pinpointed sounding objects based on
audio signals, while disregarding silent yet salient objects. Experimental
results show that our method outperforms previous methods and demonstrates
better generalization ability in multi-sound and open-set scenarios.
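As a rough illustration of the idea of audio-aware queries in a transformer decoder, the PyTorch-style sketch below conditions a set of learnable object queries on a clip-level audio embedding before cross-attending to visual features; the dimensions and the simple additive conditioning are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class AudioAwareQueryDecoder(nn.Module):
    """Minimal sketch: learnable queries are conditioned on an audio embedding
    before cross-attending to visual features (all sizes are illustrative)."""

    def __init__(self, d_model=256, num_queries=100, num_heads=8, num_layers=3):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)       # learnable object queries
        self.audio_proj = nn.Linear(128, d_model)                # project a 128-d audio feature (assumed size)
        layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, visual_tokens, audio_feat):
        # visual_tokens: (B, HW, d_model) flattened visual features
        # audio_feat:    (B, 128) clip-level audio embedding
        B = visual_tokens.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)  # (B, Nq, d_model)
        a = self.audio_proj(audio_feat).unsqueeze(1)             # (B, 1, d_model)
        q = q + a                                                # audio-aware queries (simple additive conditioning)
        return self.decoder(q, visual_tokens)                    # (B, Nq, d_model) per-query mask embeddings
```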
Microphone array for speaker localization and identification in shared autonomous vehicles
With the current technological transformation in the automotive industry, autonomous vehicles are getting closer to the Society of Automotive Engineers (SAE) automation level 5. This level corresponds to full vehicle automation, where the driving system autonomously monitors and navigates the environment. With SAE level 5, the concept of a Shared Autonomous Vehicle (SAV) will soon become a reality and mainstream. The main purpose of an SAV is to allow unrelated passengers to share an autonomous vehicle without a driver/moderator inside the shared space. However, to ensure their safety and well-being until they reach their final destination, active monitoring of all passengers is required. In this context, this article presents a microphone-based sensor system that is able to localize sound events inside an SAV. The solution is composed of a Micro-Electro-Mechanical System (MEMS) microphone array with a circular geometry connected to an embedded processing platform that relies on Field-Programmable Gate Array (FPGA) technology to run the sound localization algorithms in hardware.
This work is supported by: European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 039334; Funding Reference: POCI-01-0247-FEDER-039334]
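The abstract does not spell out the localization algorithm, but a standard building block for microphone-array sound localization is GCC-PHAT time-delay estimation between microphone pairs. The NumPy sketch below is illustrative only and is not taken from the article.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """GCC-PHAT time-delay estimate between two microphone signals.
    A common building block for array-based localization; shown here only as
    an illustration, not the article's actual algorithm."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                       # PHAT weighting (spectral whitening)
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs
    return tau                                    # delay in seconds between the two channels
```

For a circular array, pairwise delays estimated this way are typically combined (for example, by a steered-response-power search over candidate directions) to obtain the azimuth of the sound event.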
AVSegFormer: Audio-Visual Segmentation with Transformer
The combination of audio and vision has long been a topic of interest in the
multi-modal community. Recently, a new audio-visual segmentation (AVS) task has
been introduced, aiming to locate and segment the sounding objects in a given
video. This task demands audio-driven pixel-level scene understanding for the
first time, posing significant challenges. In this paper, we propose
AVSegFormer, a novel framework for AVS tasks that leverages the transformer
architecture. Specifically, we introduce audio queries and learnable queries
into the transformer decoder, enabling the network to selectively attend to
interested visual features. Besides, we present an audio-visual mixer, which
can dynamically adjust visual features by amplifying relevant and suppressing
irrelevant spatial channels. Additionally, we devise an intermediate mask loss
to enhance the supervision of the decoder, encouraging the network to produce
more accurate intermediate predictions. Extensive experiments demonstrate that
AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is
available at https://github.com/vvvb-github/AVSegFormer.
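The audio-visual mixer described above can be pictured as audio-conditioned channel gating of the visual features. The sketch below is a minimal, hypothetical rendering of that idea; the layer sizes and the sigmoid gating are assumptions rather than AVSegFormer's actual module.

```python
import torch
import torch.nn as nn

class AudioVisualMixer(nn.Module):
    """Sketch of audio-conditioned channel gating: audio decides which visual
    channels to amplify or suppress (dimensions are assumed, not the paper's)."""

    def __init__(self, vis_channels=256, audio_dim=128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(audio_dim, vis_channels),
            nn.Sigmoid(),                        # per-channel weights in (0, 1)
        )

    def forward(self, vis_feat, audio_feat):
        # vis_feat:   (B, C, H, W) visual feature map
        # audio_feat: (B, audio_dim) clip-level audio embedding
        w = self.gate(audio_feat)[:, :, None, None]   # (B, C, 1, 1) channel weights
        return vis_feat * w                            # amplify relevant, suppress irrelevant channels
```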
Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization
The objective of the sound source localization task is to enable machines to
detect the location of sound-making objects within a visual scene. While the
audio modality provides spatial cues to locate the sound source, existing
approaches use audio only in an auxiliary role to compare spatial regions of
the visual modality. Humans, on the other hand, utilize both audio and visual
modalities as spatial cues to locate sound sources. In this paper, we propose
an audio-visual spatial integration network that integrates spatial cues from
both modalities to mimic human behavior when detecting sound-making objects.
Additionally, we introduce a recursive attention network to mimic human
behavior of iterative focusing on objects, resulting in more accurate attention
regions. To effectively encode spatial information from both modalities, we
propose audio-visual pair matching loss and spatial region alignment loss. By
utilizing the spatial cues of both audio and visual modalities and recursively
focusing on objects, our method can perform more robust sound source
localization.
Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source
datasets demonstrate the superiority of our proposed method over existing
approaches. Our code is available at: https://github.com/VisualAIKHU/SIRA-SSL
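The audio-visual pair matching loss mentioned above is, in spirit, a contrastive objective that pulls corresponding audio and visual embeddings together. A minimal InfoNCE-style sketch is given below; the temperature, the symmetric form, and the in-batch negatives are assumptions and may differ from the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def audio_visual_pair_matching_loss(vis_emb, aud_emb, tau=0.07):
    """InfoNCE-style sketch of audio-visual pair matching: corresponding
    audio/visual clips in a batch are positives, all other pairs negatives.
    Illustrative only; not the paper's exact formulation."""
    vis_emb = F.normalize(vis_emb, dim=-1)    # (B, D)
    aud_emb = F.normalize(aud_emb, dim=-1)    # (B, D)
    logits = vis_emb @ aud_emb.t() / tau      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # symmetric cross-entropy: visual-to-audio and audio-to-visual matching
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```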
CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of
sound-producing objects within image frames and ensure the maps faithfully
adhere to the given audio, such as identifying and segmenting a singing person
in a video. However, existing methods exhibit two limitations: 1) they address
video temporal features and audio-visual interactive features separately,
disregarding the inherent spatial-temporal dependence of combined audio and
video, and 2) they inadequately introduce audio constraints and object-level
information during the decoding stage, resulting in segmentation outcomes that
fail to comply with audio directives. To tackle these issues, we propose a
decoupled audio-video transformer that combines audio and video features from
their respective temporal and spatial dimensions, capturing their combined
dependence. To optimize memory consumption, we design a block that, when
stacked, captures fine-grained audio-visual combinatorial dependence in a
memory-efficient manner. Additionally, we introduce audio-constrained
queries during the decoding phase. These queries contain rich object-level
information, ensuring the decoded mask adheres to the sounds. Experimental
results confirm our approach's effectiveness, with our framework achieving a
new SOTA performance on all three datasets using two backbones. The code is
available at https://github.com/aspirinone/CATR.github.io
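One plausible way to realize audio-constrained queries is to let object queries cross-attend to per-frame audio tokens before video decoding, so the decoded masks are steered by the sound. The sketch below illustrates only this general idea; the dimensions and attention layout are assumptions, not CATR's exact design.

```python
import torch
import torch.nn as nn

class AudioConstrainedQueries(nn.Module):
    """Sketch: object queries attend to per-frame audio tokens so decoding is
    steered by the sound (all sizes are illustrative assumptions)."""

    def __init__(self, d_model=256, num_queries=100, num_heads=8):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        self.audio_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, audio_tokens):
        # audio_tokens: (B, T, d_model) audio features for T video frames
        B = audio_tokens.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)   # (B, Nq, d_model)
        q, _ = self.audio_attn(q, audio_tokens, audio_tokens)     # inject the audio constraint
        return q   # audio-constrained queries, to be fed to a video decoder
```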
Self-supervised object detection from audio-visual correspondence
We tackle the problem of learning object detectors without supervision.
Differently from weakly-supervised object detection, we do not assume
image-level class labels. Instead, we extract a supervisory signal from
audio-visual data, using the audio component to "teach" the object detector.
While this problem is related to sound source localisation, it is considerably
harder because the detector must classify the objects by type, enumerate each
instance of the object, and do so even when the object is silent. We tackle
this problem by first designing a self-supervised framework with a contrastive
objective that jointly learns to classify and localise objects. Then, without
using any supervision, we simply use these self-supervised labels and boxes to
train an image-based object detector. With this, we outperform previous
unsupervised and weakly-supervised detectors for the task of object detection
and sound source localization. We also show that we can align this detector to
ground-truth classes with as little as one label per pseudo-class, and show how
our method can learn to detect generic objects that go beyond instruments, such
as airplanes and cats.
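Training an image-based detector from self-supervised signals requires converting localization outputs into pseudo boxes and labels. The sketch below shows one simple, hypothetical way to turn a localization heatmap into a pseudo bounding box by thresholding; the paper's actual procedure may differ.

```python
import numpy as np

def heatmap_to_pseudo_box(heatmap, threshold=0.5):
    """Hypothetical sketch: threshold a localization heatmap and return the
    tight box around all activated pixels, for use as a detector pseudo-label.
    Not the paper's actual pseudo-labeling procedure."""
    mask = heatmap >= threshold * heatmap.max()
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None                                  # nothing activated, no pseudo box
    return (xs.min(), ys.min(), xs.max(), ys.max())  # (x1, y1, x2, y2) in heatmap coordinates
```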