We present an end-to-end binaural audio rendering approach (Listen2Scene) for
virtual reality (VR) and augmented reality (AR) applications. We propose a
novel neural-network-based binaural sound propagation method to generate
acoustic effects for 3D models of real environments. Any clean (dry) audio can be convolved with the generated acoustic effects to render audio
corresponding to the real environment. We propose a graph neural network that
uses both the material and the topology information of the 3D scenes and
generates a scene latent vector. Moreover, we use a conditional generative
adversarial network (CGAN) to generate acoustic effects from the scene latent
vector. Our network is able to handle holes or other artifacts in the
reconstructed 3D mesh model. We present an efficient cost function for the generator network to incorporate spatial audio effects. Given the source and
the listener position, our learning-based binaural sound propagation approach
can generate an acoustic effect in 0.1 milliseconds on an NVIDIA GeForce RTX
2080 Ti GPU and can easily handle multiple sources. We have evaluated the
accuracy of our approach with binaural acoustic effects generated using an
interactive geometric sound propagation algorithm and captured real acoustic
effects. We also performed a perceptual evaluation and observed that the audio rendered by our approach is more plausible than audio rendered using prior learning-based sound propagation algorithms.

Project page: https://anton-jeran.github.io/Listen2Scene
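To make the rendering step concrete, the following minimal sketch (not from the paper; the function name render_binaural, the array shapes, and the use of NumPy/SciPy are illustrative assumptions) convolves each dry source signal with a generated binaural impulse response and mixes the per-source results into one stereo output, which is how multiple sources can be combined once the acoustic effects are available.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(dry_signals, brirs):
    """Convolve each dry (anechoic) source signal with its binaural
    impulse response and mix the results into one stereo output.

    dry_signals : list of 1-D arrays, one mono signal per source
    brirs       : list of (N, 2) arrays, left/right impulse responses
                  produced by a learned sound propagation model
    """
    # Full convolution length of the longest source/IR pair
    length = max(len(s) + b.shape[0] - 1 for s, b in zip(dry_signals, brirs))
    out = np.zeros((length, 2))
    for sig, brir in zip(dry_signals, brirs):
        for ch in range(2):  # left and right ears
            wet = fftconvolve(sig, brir[:, ch])
            out[:len(wet), ch] += wet  # sum contributions from all sources
    return out
```

In this sketch the network's output is treated simply as a pair of impulse responses per source, so adding a source only adds one more convolution and sum, consistent with the abstract's claim that multiple sources are easy to handle.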