Neural Multisensory Scene Inference
For embodied agents to infer representations of the underlying 3D physical
world they inhabit, they should efficiently combine multisensory cues from
numerous trials, e.g., by looking at and touching objects. Despite its
importance, multisensory 3D scene representation learning has received less
attention than its unimodal counterpart. In this paper, we propose the
Generative Multisensory Network (GMN) for learning latent representations of 3D
scenes that are partially observable through multiple sensory modalities. We
also introduce the Amortized Product-of-Experts, a novel inference method that
improves computational efficiency and robustness to unseen combinations of
modalities at test time. Experimental results demonstrate that the proposed
model can efficiently infer robust modality-invariant 3D-scene representations
from arbitrary combinations of modalities and perform accurate cross-modal
generation. To support this investigation, we also develop the Multisensory
Embodied 3D-Scene Environment (MESE).
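
As a rough sketch of the Product-of-Experts idea underlying the Amortized
Product-of-Experts, the Python snippet below fuses per-modality Gaussian
posteriors over a shared latent by precision-weighted averaging. The function
name, the diagonal-Gaussian assumption, and the example numbers are
illustrative assumptions, not taken from the paper.

    import numpy as np

    def poe_fuse(mus, logvars, prior_mu=0.0, prior_logvar=0.0):
        # A product of diagonal Gaussians is Gaussian: precisions add, and
        # the mean is the precision-weighted average of the expert means
        # (the prior counts as one extra expert).
        mus = np.vstack([np.full_like(mus[0], prior_mu)] + list(mus))
        logvars = np.vstack([np.full_like(logvars[0], prior_logvar)] + list(logvars))
        precisions = np.exp(-logvars)              # 1 / sigma^2 per expert
        fused_var = 1.0 / precisions.sum(axis=0)   # combined (diagonal) variance
        fused_mu = fused_var * (precisions * mus).sum(axis=0)
        return fused_mu, np.log(fused_var)

    # e.g., fuse a "vision" expert and a "touch" expert over a 4-dim latent
    mu, logvar = poe_fuse(
        mus=[np.zeros(4), np.ones(4)],
        logvars=[np.zeros(4), np.log(0.5) * np.ones(4)],
    )

Because any subset of experts can be multiplied in, this style of fusion
accommodates arbitrary, including previously unseen, combinations of
modalities at test time.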