Novel-View Acoustic Synthesis from 3D Reconstructed Rooms
We investigate the benefit of combining blind audio recordings with 3D scene
information for novel-view acoustic synthesis. Given audio recordings from 2-4
microphones and the 3D geometry and material of a scene containing multiple
unknown sound sources, we estimate the sound anywhere in the scene. We identify
the main challenges of novel-view acoustic synthesis as sound source
localization, separation, and dereverberation. While naively training an
end-to-end network fails to produce high-quality results, we show that
incorporating room impulse responses (RIRs) derived from 3D reconstructed rooms
enables the same network to jointly tackle these tasks. Our method outperforms
existing methods designed for the individual tasks, demonstrating its
effectiveness at utilizing 3D visual information. In a simulated study on the
Matterport3D-NVAS dataset, our model achieves near-perfect accuracy on source
localization, a PSNR of 26.44 dB and an SDR of 14.23 dB for source separation
and dereverberation, resulting in a PSNR of 25.55 dB and an SDR of 14.20 dB for
novel-view acoustic synthesis. Code, a pretrained model, and video results are
available on the project webpage (https://github.com/apple/ml-nvas3d).
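To make the role of RIRs concrete, here is a minimal sketch (not the authors' code) of how sound at a novel viewpoint can be rendered once per-source dry signals and source-to-listener RIRs are available; the function and variable names are illustrative assumptions.

```python
# Minimal sketch of RIR-based rendering: estimated dry (separated,
# dereverberated) source signals are convolved with room impulse responses
# for the target listener pose and summed. Names are illustrative.
import numpy as np
from scipy.signal import fftconvolve

def render_novel_view(dry_sources, rirs):
    """dry_sources: list of 1-D arrays, one per sound source.
    rirs: list of 1-D impulse responses from each source to the new view."""
    out_len = max(len(s) + len(h) - 1 for s, h in zip(dry_sources, rirs))
    mix = np.zeros(out_len)
    for s, h in zip(dry_sources, rirs):
        y = fftconvolve(s, h)   # reverberant signal at the novel viewpoint
        mix[:len(y)] += y       # superpose contributions linearly
    return mix
```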
SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio
rendering for 3D environments. Given a 3D mesh of a real-world environment,
SoundSpaces can generate highly realistic acoustics for arbitrary sounds
captured from arbitrary microphone locations. Together with existing 3D visual
assets, it supports an array of audio-visual research tasks, such as
audio-visual navigation, mapping, source localization and separation, and
acoustic matching. Compared to existing resources, SoundSpaces 2.0 has the
advantages of allowing continuous spatial sampling, generalization to novel
environments, and configurable microphone and material properties. To our
knowledge, this is the first geometry-based acoustic simulation that offers
high fidelity and realism while also being fast enough to use for embodied
learning. We showcase the simulator's properties and benchmark its performance
against real-world audio measurements. In addition, we demonstrate two
downstream tasks -- embodied navigation and far-field automatic speech
recognition -- and highlight sim2real performance for the latter. SoundSpaces
2.0 is publicly available to facilitate wider research for perceptual systems
that can both see and hear.
Comment: Camera-ready version. Website: https://soundspaces.org. Project page: https://vision.cs.utexas.edu/projects/soundspaces
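As an illustration of how a simulated RIR can be benchmarked against a real-world measurement, the following sketch estimates reverberation time (RT60) via Schroeder backward integration; this is a standard acoustics recipe, not the SoundSpaces 2.0 API.

```python
# Hedged sketch: RT60 estimated from an impulse response via Schroeder
# backward integration, a common way to compare simulated and measured
# room acoustics. Assumes the decay curve actually reaches the fit range.
import numpy as np

def rt60_from_rir(rir, fs, decay_db=30.0):
    energy = rir.astype(np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]              # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc.max() + 1e-12)
    # Fit the -5 dB to -(5 + decay_db) dB segment, extrapolate to -60 dB.
    i0 = np.argmax(edc_db <= -5.0)
    i1 = np.argmax(edc_db <= -(5.0 + decay_db))
    t = np.arange(len(rir)) / fs
    slope, _ = np.polyfit(t[i0:i1], edc_db[i0:i1], 1)  # dB per second
    return -60.0 / slope
```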
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e., audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 PDF figures
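To ground the discussion of the dominant feature representations, here is a short, self-contained example of computing a log-mel spectrogram from a raw waveform with librosa; the STFT and mel parameters are illustrative defaults, and "example.wav" is a placeholder file.

```python
# Minimal sketch of the log-mel input representation the review discusses,
# computed with librosa (one of several libraries that provide this).
import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=16000)   # raw waveform, mono
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=512, hop_length=160, n_mels=64)
log_mel = np.log(mel + 1e-6)                    # log compression
print(log_mel.shape)                            # (n_mels, n_frames)
```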
Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear
Developing embodied agents in simulation has been a key research topic in
recent years. Exciting new tasks, algorithms, and benchmarks have been
developed in various simulators. However, most of them assume deaf agents in
silent environments, while we humans perceive the world with multiple senses.
We introduce Sonicverse, a multisensory simulation platform with integrated
audio-visual simulation for training household agents that can both see and
hear. Sonicverse models realistic continuous audio rendering in 3D environments
in real time. Together with a new audio-visual VR interface that allows humans
to interact with agents via audio, Sonicverse enables a series of embodied AI
tasks that need audio-visual perception. For semantic audio-visual navigation
in particular, we also propose a new multi-task learning model that achieves
state-of-the-art performance. In addition, we demonstrate Sonicverse's realism
via sim-to-real transfer, which has not been achieved by other simulators: an
agent trained in Sonicverse can successfully perform audio-visual navigation in
real-world environments. Sonicverse is available at:
https://github.com/StanfordVL/Sonicverse.
Comment: In ICRA 2023. Project page: https://ai.stanford.edu/~rhgao/sonicverse/. Code: https://github.com/StanfordVL/sonicverse. Gao and Li contributed equally to this work and are listed in alphabetical order.
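As a rough illustration of the multi-task audio-visual setup described above (shared perception feeding multiple heads trained jointly), here is a minimal PyTorch skeleton; the layer sizes, action space, and auxiliary task are assumptions for the sketch, not the Sonicverse model.

```python
# Illustrative skeleton (not Sonicverse code): shared visual and audio
# encoders feed both a navigation-action head and an auxiliary head
# (e.g., predicting the sounding object's category), trained jointly.
import torch
import torch.nn as nn

class AudioVisualMultiTask(nn.Module):
    def __init__(self, n_actions=4, n_classes=21, dim=128):
        super().__init__()
        self.visual = nn.Sequential(   # encodes an RGB frame
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.audio = nn.Sequential(    # encodes a binaural spectrogram
            nn.Conv2d(2, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.action_head = nn.Linear(2 * dim, n_actions)  # navigation policy
        self.aux_head = nn.Linear(2 * dim, n_classes)     # auxiliary task

    def forward(self, rgb, spec):
        z = torch.cat([self.visual(rgb), self.audio(spec)], dim=-1)
        return self.action_head(z), self.aux_head(z)
```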