74 research outputs found
Geometry-aware multi-task learning for binaural audio generation from video
Human audio perception is inherently spatial, and videos with binaural audio simulate this spatial experience by delivering different sounds to the two ears. However, videos are typically recorded with mono audio and hence generally do not offer the rich audio experience of binaural audio. We propose an audio spatialization method that uses the visual information in videos to convert mono audio to binaural. We leverage the spatial and geometric information about the audio present in the visuals of the video to guide the learning process, and we learn these geometry-aware visual features in a multi-task manner to generate rich binaural audio. We also generate a large video dataset with binaural audio in photorealistic environments to better understand and evaluate the task. Through extensive evaluation on two datasets, we demonstrate that our method generates better binaural audio by learning more spatially coherent visual features.
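A minimal sketch of the kind of visually conditioned mono-to-binaural conversion this abstract describes, assuming the common formulation in which the network predicts the left-right difference spectrogram from the mono input; the abstract does not specify the architecture, so all names and sizes here are illustrative.

```python
# Sketch: visually conditioned mono-to-binaural conversion.
# Assumption (not from the abstract): the model predicts a mask for the
# left-right difference STFT, and visual features arrive as one embedding per clip.
import torch
import torch.nn as nn

class Mono2Binaural(nn.Module):
    def __init__(self, freq_bins=257, visual_dim=512, hidden=512):
        super().__init__()
        self.audio_enc = nn.Sequential(
            nn.Linear(2 * freq_bins, hidden), nn.ReLU())      # real+imag per frame
        self.fuse = nn.Sequential(
            nn.Linear(hidden + visual_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * freq_bins))                  # mask for the difference STFT

    def forward(self, mono_spec, visual_feat):
        # mono_spec: (batch, frames, 2*freq_bins); visual_feat: (batch, visual_dim)
        h = self.audio_enc(mono_spec)
        v = visual_feat.unsqueeze(1).expand(-1, h.shape[1], -1)
        mask = self.fuse(torch.cat([h, v], dim=-1))
        diff_spec = mask * mono_spec                           # predicted (L - R) spectrogram
        left = (mono_spec + diff_spec) / 2
        right = (mono_spec - diff_spec) / 2
        return left, right
```

In the paper's setting, the generic visual embedding above would be replaced by the geometry-aware features learned through the multi-task objective.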
Audio-Visual Learning for Scene Understanding
Multimodal deep learning aims at combining the complementary information of different modalities. Among all modalities, audio and video are the predominant ones that humans use to explore the world. In this thesis, we focus our study on audio-visual deep learning so that our networks mimic how humans perceive the world.
Our research includes images, audio signals and acoustic images. The latter provide spatial audio information and are obtained from a planar array of microphones by combining their raw signals with a beamforming algorithm. They better mimic the human auditory system, which cannot be replicated with a single microphone, since one microphone alone provides no spatial sound cues.
However, since microphone arrays are not widespread, we also study how to handle the missing spatialized audio modality at test time.
As a solution, we propose to distill the content of acoustic images into audio features during training in order to handle their absence at test time. This is done for supervised audio classification using the generalized distillation framework, which we also extend to self-supervised learning.
Next, we devise a method for reconstructing acoustic images given a single microphone and an RGB frame. Therefore, when only a standard video is available, we are able to synthesize spatial audio, which is useful for many audio-visual tasks, including sound localization.
Lastly, as another example of restoring one modality from the available ones, we inpaint degraded images with the help of audio features, reconstructing the missing region so that it is not only visually plausible but also semantically consistent with the related sound. This also covers cross-modal generation in the limit case of a completely missing or hidden visual modality: our method deals with it naturally, generating images from sound alone.
In summary, we show how audio can help visual learning and vice versa, by transferring knowledge between the two modalities at training time in order to distill, reconstruct, or restore the missing modality at test time.
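A minimal sketch of the generalized-distillation objective mentioned in the abstract above, assuming a standard classification setup in which a teacher trained on acoustic images provides soft targets to an audio-only student; names and hyper-parameters are illustrative, not the thesis's.

```python
# Sketch: generalized distillation from a privileged (acoustic-image) teacher
# to an audio-only student. Temperatures, weights, and models are placeholders.
import torch
import torch.nn.functional as F

def generalized_distillation_loss(student_logits, teacher_logits, labels,
                                  temperature=2.0, lam=0.5):
    """(1 - lam) * hard-label loss + lam * imitation of the privileged teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
    return (1.0 - lam) * hard + lam * soft
```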
Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio
While 3D human body modeling has received much attention in computer vision, modeling the acoustic equivalent, i.e. modeling 3D spatial audio produced by body motion and speech, has fallen short in the community. To close this gap, we present a model that can generate accurate 3D spatial audio for full human bodies. The system consumes, as input, audio signals from headset microphones and body pose, and produces, as output, a 3D sound field surrounding the transmitter's body, from which spatial audio can be rendered at any arbitrary position in the 3D space. We collect a first-of-its-kind multimodal dataset of human bodies, recorded with multiple cameras and a spherical array of 345 microphones. In an empirical evaluation, we demonstrate that our model can produce accurate body-induced sound fields when trained with a suitable loss. Dataset and code are available online.
Comment: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
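A minimal sketch of the input/output contract described in this abstract, with placeholder encoders standing in for the authors' model: headset audio and body pose go in, one waveform per microphone of a surrounding spherical array comes out.

```python
# Sketch: map (headset audio, body pose) to per-microphone waveforms on a
# surrounding array. Layer sizes and encoders are illustrative placeholders.
import torch
import torch.nn as nn

class BodySoundField(nn.Module):
    def __init__(self, n_joints=25, n_mics=345, hidden=256):
        super().__init__()
        self.pose_enc = nn.Linear(n_joints * 3, hidden)
        self.audio_enc = nn.Conv1d(2, hidden, kernel_size=15, padding=7)  # 2 headset channels
        self.head = nn.Conv1d(2 * hidden, n_mics, kernel_size=1)

    def forward(self, headset_audio, pose):
        # headset_audio: (batch, 2, samples); pose: (batch, n_joints*3)
        a = self.audio_enc(headset_audio)                      # (batch, hidden, samples)
        p = self.pose_enc(pose).unsqueeze(-1).expand_as(a)     # broadcast pose over time
        return self.head(torch.cat([a, p], dim=1))             # (batch, n_mics, samples)
```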
Points2Sound: From mono to binaural audio using 3D point cloud scenes
For immersive applications, the generation of binaural sound that matches the
visual counterpart is crucial to bring meaningful experiences to people in a
virtual environment. Recent works have shown the possibility to use neural
networks for synthesizing binaural audio from mono audio using 2D visual
information as guidance. Extending this approach by guiding the audio using 3D
visual information and operating in the waveform domain may allow for a more
accurate auralization of a virtual audio scene. In this paper, we present
Points2Sound, a multi-modal deep learning model which generates a binaural
version from mono audio using 3D point cloud scenes. Specifically, Points2Sound
consists of a vision network with 3D sparse convolutions which extracts visual
features from the point cloud scene to condition an audio network, which
operates in the waveform domain, to synthesize the binaural version.
Experimental results indicate that 3D visual information can successfully guide
multi-modal deep learning models for the task of binaural synthesis. In
addition, we investigate different loss functions and 3D point cloud
attributes, showing that directly predicting the full binaural signal and using
rgb-depth features increases the performance of our proposed model.Comment: Code, data, and listening examples:
https://github.com/francesclluis/points2soun
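A minimal sketch of the conditioning scheme the abstract describes, with a PointNet-style point encoder standing in for the paper's sparse 3D convolutions and a small 1D convolutional stack standing in for its waveform network; all names and sizes are illustrative.

```python
# Sketch: a global point-cloud embedding conditions a waveform network that
# maps mono audio to the two binaural channels. Not the paper's architecture.
import torch
import torch.nn as nn

class Points2SoundSketch(nn.Module):
    def __init__(self, point_dim=6, embed=128, hidden=64):
        super().__init__()
        # point_dim = xyz + RGB; depth is implicit in the 3D coordinates
        self.point_mlp = nn.Sequential(
            nn.Linear(point_dim, 128), nn.ReLU(), nn.Linear(128, embed))
        self.audio_in = nn.Conv1d(1, hidden, kernel_size=31, padding=15)
        self.cond = nn.Linear(embed, hidden)                    # FiLM-style conditioning bias
        self.audio_out = nn.Conv1d(hidden, 2, kernel_size=31, padding=15)

    def forward(self, mono, points):
        # mono: (batch, 1, samples); points: (batch, n_points, point_dim)
        v = self.point_mlp(points).max(dim=1).values            # global scene embedding
        h = torch.relu(self.audio_in(mono) + self.cond(v).unsqueeze(-1))
        return self.audio_out(h)                                 # (batch, 2, samples): left/right
```

Directly regressing both output channels, as in the last layer here, mirrors the abstract's finding that predicting the full binaural signal works better than predicting only a difference term.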
Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation
The images and sounds that we perceive undergo subtle but geometrically consistent changes as we rotate our heads. In this paper, we use these cues to solve a problem we call Sound Localization from Motion (SLfM): jointly estimating camera rotation and localizing sound sources. We learn to solve these tasks solely through self-supervision. A visual model predicts camera rotation from a pair of images, while an audio model predicts the direction of sound sources from binaural sounds. We train these models to generate predictions that agree with one another. At test time, the models can be deployed independently. To obtain a feature representation that is well-suited to solving this challenging problem, we also propose a method for learning an audio-visual representation through cross-view binauralization: estimating binaural sound from one view, given images and sound from another. Our model can successfully estimate accurate rotations on both real and synthetic scenes, and localize sound sources with accuracy competitive with state-of-the-art self-supervised approaches. Project site: https://ificl.github.io/SLfM/
Comment: ICCV 2023
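A minimal sketch of the audio-visual agreement objective described in this abstract, assuming azimuth-only directions and one particular sign convention; the model interfaces are hypothetical, not the authors' code.

```python
# Sketch: the change in sound direction predicted by the audio model should
# agree with the camera rotation predicted by the visual model.
import torch

def slfm_agreement_loss(rotation_pred, direction_t0, direction_t1):
    """rotation_pred: camera rotation between the two views, radians, shape (batch,).
    direction_t0 / direction_t1: source azimuth predicted from binaural audio at each view."""
    # Assumed convention: if the head rotates by +theta, a static source
    # appears rotated by -theta relative to the head.
    direction_change = direction_t1 - direction_t0
    residual = direction_change + rotation_pred
    # Wrap to (-pi, pi] so the loss stays continuous across the +/- pi boundary.
    residual = torch.atan2(torch.sin(residual), torch.cos(residual))
    return (residual ** 2).mean()
```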