Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
Existing audio-visual sound separation methods assume that sound sources are
visible in the video, which excludes invisible sounds originating beyond the
camera's view. Such sounds lack visible cues, and current methods struggle to
separate them. This paper introduces a novel
"Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework. It includes a
semantic parser for visible and invisible sounds and a separator for
scene-informed separation. AVSA-Sep successfully separates both sound types,
with joint training and cross-modal alignment enhancing effectiveness.
Comment: Accepted at ICCV 2023 - AV4D, 4 figures, 3 tables
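As a rough illustration of the two-stage design the abstract describes (a semantic parser for visible and invisible sounds, feeding a scene-conditioned separator), here is a minimal PyTorch sketch. The module names, feature dimensions, and mask-based separator are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch: parse a visual scene into per-source embeddings, then
# condition a spectrogram-mask separator on each embedding (assumed design).
import torch
import torch.nn as nn

class SemanticParser(nn.Module):
    """Maps a pooled video feature to one embedding per (visible or invisible) source."""
    def __init__(self, vis_dim=512, emb_dim=128, n_sources=2):
        super().__init__()
        self.proj = nn.Linear(vis_dim, n_sources * emb_dim)
        self.n_sources, self.emb_dim = n_sources, emb_dim

    def forward(self, frame_feat):                 # (B, vis_dim)
        e = self.proj(frame_feat)                  # (B, n_sources * emb_dim)
        return e.view(-1, self.n_sources, self.emb_dim)

class ConditionalSeparator(nn.Module):
    """Predicts a spectrogram mask for one source, conditioned on its embedding."""
    def __init__(self, n_freq=256, emb_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq + emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, src_emb):          # (B, T, F), (B, emb_dim)
        cond = src_emb.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
        mask = self.net(torch.cat([mix_spec, cond], dim=-1))
        return mask * mix_spec                     # separated (masked) spectrogram

parser, separator = SemanticParser(), ConditionalSeparator()
frame_feat = torch.randn(4, 512)                   # e.g. pooled visual backbone feature
mix_spec = torch.rand(4, 100, 256)                 # magnitude spectrogram of the mixture
embs = parser(frame_feat)                          # one embedding per source
separated = [separator(mix_spec, embs[:, i]) for i in range(embs.size(1))]
```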
Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
A crucial ability of mobile intelligent agents is to integrate the evidence
from multiple sensory inputs in an environment and to make a sequence of
actions to reach their goals. In this paper, we attempt to approach the problem
of Audio-Visual Embodied Navigation, the task of planning the shortest path
from a random starting location in a scene to the sound source in an indoor
environment, given only raw egocentric visual and audio sensory data. To
accomplish this task, the agent is required to learn from various modalities,
i.e. relating the audio signal to the visual environment. Here we describe an
approach to audio-visual embodied navigation that takes advantage of both
visual and audio pieces of evidence. Our solution is based on three key ideas:
a visual perception mapper module that constructs its spatial memory of the
environment, a sound perception module that infers the relative location of the
sound source from the agent, and a dynamic path planner that plans a sequence
of actions based on the audio-visual observations and the spatial memory of the
environment to navigate toward the goal. Experimental results on a newly
collected Visual-Audio-Room dataset using the simulated multi-modal environment
demonstrate the effectiveness of our approach over several competitive
baselines.
Comment: Accepted by ICRA 2020. Project page: http://avn.csail.mit.ed
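To make the three-module loop more concrete, the toy sketch below plans a shortest path on an occupancy-grid spatial memory toward a goal cell assumed to come from the sound perception module. The grid, the BFS planner, and all names are illustrative stand-ins, not the paper's system.

```python
# Toy dynamic path planner over a spatial memory (occupancy grid), with the
# goal cell assumed to be provided by a sound-direction estimate.
from collections import deque
import numpy as np

def bfs_path(grid, start, goal):
    """Shortest path on a 4-connected grid; cells equal to 1 are obstacles."""
    h, w = grid.shape
    prev = {start: None}
    q = deque([start])
    while q:
        cur = q.popleft()
        if cur == goal:
            break
        for d in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + d[0], cur[1] + d[1])
            if (0 <= nxt[0] < h and 0 <= nxt[1] < w
                    and grid[nxt] == 0 and nxt not in prev):
                prev[nxt] = cur
                q.append(nxt)
    if goal not in prev:
        return []                     # goal unreachable with current memory
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

# Spatial memory built so far by the (here: given) visual perception mapper.
memory = np.zeros((6, 6), dtype=int)
memory[2, 1:5] = 1                    # a wall observed so far
agent = (0, 0)
estimated_goal = (5, 3)               # assumed output of the sound perception module
print(bfs_path(memory, agent, estimated_goal))
```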
Foley Music: Learning to Generate Music from Videos
In this paper, we introduce Foley Music, a system that can synthesize
plausible music for a silent video clip about people playing musical
instruments. We first identify two key intermediate representations for a
successful video to music generator: body keypoints from videos and MIDI events
from audio recordings. We then formulate music generation from videos as a
motion-to-MIDI translation problem. We present a Graph-Transformer framework
that can accurately predict MIDI event sequences in accordance with the body
movements. The MIDI event can then be converted to realistic music using an
off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our
models on videos containing a variety of music performances. Experimental
results show that our model outperforms several existing systems in generating
music that is pleasant to listen to. More importantly, the MIDI representations
are fully interpretable and transparent, thus enabling us to perform music
editing flexibly. We encourage the readers to watch the demo video with audio
turned on to experience the results.
Comment: ECCV 2020. Project page: http://foley-music.csail.mit.ed
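As a rough illustration of the motion-to-MIDI formulation, the sketch below encodes a body-keypoint sequence and autoregressively decodes MIDI event tokens, with a plain Transformer standing in for the paper's Graph-Transformer. The vocabulary size, dimensions, and shapes are assumptions.

```python
# Sketch: sequence-to-sequence translation from 2D body keypoints to MIDI
# event tokens (illustrative stand-in for the paper's Graph-Transformer).
import torch
import torch.nn as nn

N_KEYPOINTS, MIDI_VOCAB, D_MODEL = 25, 512, 256

class MotionToMIDI(nn.Module):
    def __init__(self):
        super().__init__()
        self.pose_proj = nn.Linear(N_KEYPOINTS * 2, D_MODEL)   # flatten (x, y) keypoints
        self.midi_emb = nn.Embedding(MIDI_VOCAB, D_MODEL)
        self.transformer = nn.Transformer(d_model=D_MODEL, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.head = nn.Linear(D_MODEL, MIDI_VOCAB)

    def forward(self, keypoints, midi_tokens):
        # keypoints: (B, T_pose, N_KEYPOINTS, 2); midi_tokens: (B, T_midi)
        src = self.pose_proj(keypoints.flatten(2))
        tgt = self.midi_emb(midi_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(midi_tokens.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.head(out)                                  # next-token logits

model = MotionToMIDI()
logits = model(torch.randn(2, 120, N_KEYPOINTS, 2),
               torch.randint(0, MIDI_VOCAB, (2, 64)))
print(logits.shape)   # (2, 64, 512): predicted MIDI event distribution per step
```

The predicted MIDI events would then be rendered to audio with an off-the-shelf synthesizer, as the abstract notes.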
Mix and Localize: Localizing Sound Sources in Mixtures
We present a method for simultaneously localizing multiple sound sources
within a visual scene. This task requires a model to both group a sound mixture
into individual sources, and to associate them with a visual signal. Our method
jointly solves both tasks at once, using a formulation inspired by the
contrastive random walk of Jabri et al. We create a graph in which images and
separated sounds correspond to nodes, and train a random walker to transition
between nodes from different modalities with high return probability. The
transition probabilities for this walk are determined by an audio-visual
similarity metric that is learned by our model. We show through experiments
with musical instruments and human speech that our model can successfully
localize multiple sounds, outperforming other self-supervised methods. Project
site: https://hxixixh.github.io/mix-and-localize
Comment: CVPR 202
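A compact way to picture the contrastive random-walk objective: build transition probabilities from a learned audio-visual similarity, take a two-step walk from each separated sound to an image and back, and penalize walks that do not return to their starting node. The sketch below assumes precomputed audio and image embeddings; the encoders, batch construction, and hyperparameters are not specified here.

```python
# Cycle-consistency loss for a two-step random walk between separated-sound
# nodes and image nodes (embeddings are assumed to come from learned encoders).
import torch
import torch.nn.functional as F

def random_walk_loss(audio_emb, image_emb, temperature=0.07):
    # audio_emb: (N, D), one row per separated sound; image_emb: (N, D), one per image.
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(image_emb, dim=1)
    sim = a @ v.t() / temperature                 # learned audio-visual similarity
    p_av = sim.softmax(dim=1)                     # transition probs: audio -> image
    p_va = sim.t().softmax(dim=1)                 # transition probs: image -> audio
    p_round_trip = p_av @ p_va                    # two-step walk back to audio nodes
    targets = torch.arange(audio_emb.size(0))     # each walker should return home
    return F.nll_loss(torch.log(p_round_trip + 1e-8), targets)

loss = random_walk_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```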