Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
A crucial ability of mobile intelligent agents is to integrate the evidence
from multiple sensory inputs in an environment and to make a sequence of
actions to reach their goals. In this paper, we address the problem of Audio-Visual Embodied Navigation: planning the shortest path from a random starting location in a scene to a sound source in an indoor environment, given only raw egocentric visual and audio sensory data. To accomplish this task, the agent must learn from multiple modalities, i.e., relate the audio signal to the visual environment. Here we describe an
approach to audio-visual embodied navigation that takes advantage of both
visual and audio pieces of evidence. Our solution is based on three key ideas:
a visual perception mapper module that constructs its spatial memory of the
environment, a sound perception module that infers the relative location of the
sound source from the agent, and a dynamic path planner that plans a sequence
of actions based on the audio-visual observations and the spatial memory of the
environment to navigate toward the goal. Experimental results on a newly collected Visual-Audio-Room dataset, gathered in a simulated multi-modal environment,
demonstrate the effectiveness of our approach over several competitive
baselines.
Comment: Accepted by ICRA 2020. Project page: http://avn.csail.mit.ed
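The three-module design above lends itself to a simple control loop. Below is a minimal, hypothetical sketch of how such an agent might wire the pieces together; the module interfaces, method names, and signatures are our own illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the three-module agent described in the abstract.
# The module interfaces below are illustrative assumptions, not the
# authors' actual code.

class AudioVisualAgent:
    def __init__(self, mapper, sound_perceiver, planner):
        self.mapper = mapper                    # builds spatial memory from vision
        self.sound_perceiver = sound_perceiver  # infers source location from audio
        self.planner = planner                  # replans as new evidence arrives

    def step(self, rgb_frame, audio_chunk):
        spatial_memory = self.mapper.update(rgb_frame)            # update the map
        goal_estimate = self.sound_perceiver.locate(audio_chunk)  # relative goal
        # The dynamic planner picks the next action from the current map
        # and the estimated direction of the sound source.
        return self.planner.next_action(spatial_memory, goal_estimate)
```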
Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear
Developing embodied agents in simulation has been a key research topic in
recent years. Exciting new tasks, algorithms, and benchmarks have been
developed in various simulators. However, most of them assume deaf agents in
silent environments, while we humans perceive the world with multiple senses.
We introduce Sonicverse, a multisensory simulation platform with integrated
audio-visual simulation for training household agents that can both see and
hear. Sonicverse models realistic continuous audio rendering in 3D environments
in real time. Together with a new audio-visual VR interface that allows humans
to interact with agents through audio, Sonicverse enables a series of embodied AI
tasks that need audio-visual perception. For semantic audio-visual navigation
in particular, we also propose a new multi-task learning model that achieves
state-of-the-art performance. In addition, we demonstrate Sonicverse's realism
via sim-to-real transfer, which has not been achieved by other simulators: an
agent trained in Sonicverse can successfully perform audio-visual navigation in
real-world environments. Sonicverse is available at:
https://github.com/StanfordVL/Sonicverse.
Comment: In ICRA 2023. Project page: https://ai.stanford.edu/~rhgao/sonicverse/. Code: https://github.com/StanfordVL/sonicverse. Gao and Li contributed equally to this work and are listed in alphabetical order.
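As a rough illustration of the kind of interaction loop such a multisensory simulator supports, here is a hedged sketch; the observation keys and Gym-style step signature are assumptions for the example, not Sonicverse's documented API.

```python
# Hedged sketch of a multisensory episode loop; the "rgb" and "audio"
# observation keys and the Gym-style step() signature are assumptions,
# not Sonicverse's documented interface.

def run_episode(env, policy, max_steps=500):
    obs = env.reset()
    for _ in range(max_steps):
        # Each step the agent both sees and hears: an egocentric RGB frame
        # plus a chunk of continuously rendered spatial audio.
        action = policy(obs["rgb"], obs["audio"])
        obs, reward, done, info = env.step(action)
        if done:
            break
    return obs
```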
SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio
rendering for 3D environments. Given a 3D mesh of a real-world environment,
SoundSpaces can generate highly realistic acoustics for arbitrary sounds
captured from arbitrary microphone locations. Together with existing 3D visual
assets, it supports an array of audio-visual research tasks, such as
audio-visual navigation, mapping, source localization and separation, and
acoustic matching. Compared to existing resources, SoundSpaces 2.0 has the
advantages of allowing continuous spatial sampling, generalization to novel
environments, and configurable microphone and material properties. To our
knowledge, this is the first geometry-based acoustic simulation that offers
high fidelity and realism while also being fast enough to use for embodied
learning. We showcase the simulator's properties and benchmark its performance
against real-world audio measurements. In addition, we demonstrate two
downstream tasks -- embodied navigation and far-field automatic speech
recognition -- and highlight sim2real performance for the latter. SoundSpaces
2.0 is publicly available to facilitate wider research for perceptual systems
that can both see and hear.
Comment: Camera-ready version. Website: https://soundspaces.org. Project page: https://vision.cs.utexas.edu/projects/soundspaces
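The core operation the abstract describes, rendering arbitrary sounds at arbitrary microphone positions, amounts to simulating a room impulse response (RIR) and convolving it with a dry source signal. The sketch below shows that pipeline under stated assumptions; simulate_rir is a hypothetical stand-in for the simulator's geometry-based renderer, not SoundSpaces' real API.

```python
# Minimal sketch of geometry-based rendering: simulate a room impulse
# response for a source/microphone pair, then convolve it with any dry
# sound. simulate_rir is a hypothetical stand-in, not SoundSpaces' API.
from scipy.signal import fftconvolve

def render_at_microphone(mesh, source_pos, mic_pos, dry_sound, simulate_rir):
    rir = simulate_rir(mesh, source_pos, mic_pos)  # geometry-based RIR
    # Convolution yields the sound as it would be captured at mic_pos,
    # including the room's reflections and reverberation.
    return fftconvolve(dry_sound, rir)[: len(dry_sound)]
```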
Learning Semantic-Agnostic and Spatial-Aware Representation for Generalizable Visual-Audio Navigation
Visual-audio navigation (VAN) is attracting increasing attention from the robotics community due to its broad applications, e.g., household robots and rescue robots. In this task, an embodied agent must search for and navigate to a sound source using egocentric visual and audio observations. However, existing methods are limited in two aspects: 1) poor generalization to unheard sound categories; and 2) sample inefficiency in training. Focusing on these two
problems, we propose a brain-inspired plug-and-play method to learn a
semantic-agnostic and spatial-aware representation for generalizable
visual-audio navigation. We design two auxiliary tasks that respectively accelerate the learning of representations with these desired characteristics. With them, the agent learns a spatially correlated representation of visual and audio inputs that generalizes to environments with novel sounds and maps. Experimental results on realistic 3D scenes (Replica and Matterport3D) demonstrate that our method achieves better generalization when zero-shot transferred to scenes with unseen maps and unheard sound categories.
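One plausible way to read the two auxiliary tasks is as extra prediction losses on a shared encoder, added to the navigation objective. The sketch below is an illustrative guess at that structure; the specific heads, targets, and weights are assumptions, not the paper's exact objectives.

```python
# Hedged sketch of auxiliary-task training: two extra losses shape the
# shared representation to be spatial-aware and semantic-agnostic. The
# heads, targets, and weights are illustrative assumptions.
import torch.nn.functional as F

def training_loss(features, nav_loss, direction_head, distance_head,
                  true_direction, true_distance, w_dir=0.1, w_dist=0.1):
    # Auxiliary task 1: predict the sound source's relative direction,
    # encouraging a spatial-aware representation.
    dir_loss = F.mse_loss(direction_head(features), true_direction)
    # Auxiliary task 2: predict distance to the source, a geometric
    # quantity independent of the sound's semantic category.
    dist_loss = F.mse_loss(distance_head(features), true_distance)
    return nav_loss + w_dir * dir_loss + w_dist * dist_loss
```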
HoME: a Household Multimodal Environment
We introduce HoME: a Household Multimodal Environment for artificial agents
to learn from vision, audio, semantics, physics, and interaction with objects
and other agents, all within a realistic context. HoME integrates over 45,000
diverse 3D house layouts based on the SUNCG dataset, a scale which may
facilitate learning, generalization, and transfer. HoME is an open-source,
OpenAI Gym-compatible platform extensible to tasks in reinforcement learning,
language grounding, sound-based navigation, robotics, multi-agent learning, and
more. We hope HoME better enables artificial agents to learn as humans do: in
an interactive, multimodal, and richly contextualized setting.
Comment: Presented at NIPS 2017's Visually-Grounded Interaction and Language Workshop.
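Because HoME advertises OpenAI Gym compatibility, interacting with it should follow the classic Gym loop. The environment id below is a placeholder for illustration, not a registered HoME task.

```python
# Classic OpenAI Gym interaction loop; the environment id is a
# placeholder assumption, not a registered HoME task name.
import gym

env = gym.make("HoME-SoundNav-v0")  # hypothetical id
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy for illustration
    obs, reward, done, info = env.step(action)
env.close()
```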
FM radio: family interplay with sonic mementos
Digital mementos are increasingly problematic, as people acquire large amounts of digital belongings that are hard to access and often forgotten. Based on fieldwork with 10 families, we designed a new type of embodied digital memento, the FM Radio. It allows families to access and play sonic mementos of their previous holidays. We describe our underlying design motivation, where recordings are presented as a series of channels on an old-fashioned radio. User feedback suggests that the device met our design goals: being playful and intriguing, easy to use, and social. It facilitated family interaction and allowed ready access to mementos, thus sharing many of the properties of physical mementos that we intended to trigger.
Show me the way to Monte Carlo: density-based trajectory navigation
We demonstrate the use of uncertain prediction in a system for pedestrian navigation via audio, combining Global Positioning System data, a music player, inertial sensing, magnetic bearing data, and Monte Carlo sampling for a density-following task: a listener's music is modulated according to the changing predictions of user position with respect to a target density, in this case a trajectory or path. We show that this system enables eyes-free navigation along set trajectories or paths unfamiliar to the user, and demonstrate that it can be used effectively across varying trajectory widths and contexts.
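A hedged sketch of the density-following idea: propagate positional uncertainty with Monte Carlo samples, score the samples against the target trajectory's density, and use the score to modulate the audio feedback. The function names and the Gaussian position model are assumptions for the example.

```python
# Monte Carlo density following (illustrative sketch): sample particles
# of the user's predicted position, average their target-trajectory
# density, and map the result to an audio modulation level. The Gaussian
# position model and function names are assumptions.
import numpy as np

def density_feedback(pos_mean, pos_std, target_density, n_samples=1000):
    # Propagate GPS/inertial positional uncertainty via Monte Carlo sampling.
    particles = np.random.normal(pos_mean, pos_std, size=(n_samples, 2))
    # High score: predicted position mass lies on the desired path;
    # low score: the listener is drifting off the trajectory.
    score = float(np.mean([target_density(p) for p in particles]))
    return min(max(score, 0.0), 1.0)  # e.g. scales music volume/filtering
```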
Towards virtual communities on the Web: Actors and audience
We report on ongoing research in a virtual reality environment where visitors can interact with agents that help them obtain information, perform certain transactions, and collaborate with them to get tasks done. Our environment models a theatre in our hometown. We discuss attempts to let this environment evolve into a theatre community with not only goal-directed visitors, but also visitors who are not sure whether they want to buy or just want information, or who simply want to look around. We show that a multi-user and multi-agent environment is needed to realize our goals. Since our environment models a theatre, it is also interesting to investigate the roles of performers and audience within it. For that reason we discuss the capabilities and personalities of agents. Some notes on the historical development of networked communities are included.