Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
A crucial ability of mobile intelligent agents is to integrate the evidence
from multiple sensory inputs in an environment and to make a sequence of
actions to reach their goals. In this paper, we address the problem of Audio-Visual Embodied Navigation: planning the shortest path from a random starting location in a scene to a sound source in an indoor environment, given only raw egocentric visual and audio sensory data. To accomplish this task, the agent must learn from multiple modalities, i.e., relate the audio signal to the visual environment. Here we describe an
approach to audio-visual embodied navigation that takes advantage of both
visual and audio pieces of evidence. Our solution is based on three key ideas:
a visual perception mapper module that constructs its spatial memory of the
environment, a sound perception module that infers the relative location of the
sound source from the agent, and a dynamic path planner that plans a sequence
of actions based on the audio-visual observations and the spatial memory of the
environment to navigate toward the goal. Experimental results on a newly collected Visual-Audio-Room dataset, gathered in a simulated multi-modal environment,
demonstrate the effectiveness of our approach over several competitive
baselines.
Comment: Accepted by ICRA 2020. Project page: http://avn.csail.mit.ed
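The three-module design above lends itself to a simple control loop. Below is a minimal, hypothetical sketch of how such an agent might wire the pieces together; the module interfaces, method names, and signatures are our own illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the three-module agent described in the abstract.
# The module interfaces below are illustrative assumptions, not the
# authors' actual code.

class AudioVisualAgent:
    def __init__(self, mapper, sound_perceiver, planner):
        self.mapper = mapper                    # builds spatial memory from vision
        self.sound_perceiver = sound_perceiver  # infers source location from audio
        self.planner = planner                  # replans as new evidence arrives

    def step(self, rgb_frame, audio_chunk):
        spatial_memory = self.mapper.update(rgb_frame)            # update the map
        goal_estimate = self.sound_perceiver.locate(audio_chunk)  # relative goal
        # The dynamic planner picks the next action from the current map
        # and the estimated direction of the sound source.
        return self.planner.next_action(spatial_memory, goal_estimate)
```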
Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear
Developing embodied agents in simulation has been a key research topic in
recent years. Exciting new tasks, algorithms, and benchmarks have been
developed in various simulators. However, most of them assume deaf agents in
silent environments, while we humans perceive the world with multiple senses.
We introduce Sonicverse, a multisensory simulation platform with integrated
audio-visual simulation for training household agents that can both see and
hear. Sonicverse models realistic continuous audio rendering in 3D environments
in real time. Together with a new audio-visual VR interface that allows humans
to interact with agents through audio, Sonicverse enables a series of embodied AI
tasks that need audio-visual perception. For semantic audio-visual navigation
in particular, we also propose a new multi-task learning model that achieves
state-of-the-art performance. In addition, we demonstrate Sonicverse's realism
via sim-to-real transfer, which has not been achieved by other simulators: an
agent trained in Sonicverse can successfully perform audio-visual navigation in
real-world environments. Sonicverse is available at:
https://github.com/StanfordVL/Sonicverse.
Comment: In ICRA 2023. Project page: https://ai.stanford.edu/~rhgao/sonicverse/. Code: https://github.com/StanfordVL/sonicverse. Gao and Li contributed equally to this work and are listed in alphabetical order.
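As a rough illustration of the kind of interaction loop such a multisensory simulator supports, here is a hedged sketch; the observation keys and Gym-style step signature are assumptions for the example, not Sonicverse's documented API.

```python
# Hedged sketch of a multisensory episode loop; the "rgb" and "audio"
# observation keys and the Gym-style step() signature are assumptions,
# not Sonicverse's documented interface.

def run_episode(env, policy, max_steps=500):
    obs = env.reset()
    for _ in range(max_steps):
        # Each step the agent both sees and hears: an egocentric RGB frame
        # plus a chunk of continuously rendered spatial audio.
        action = policy(obs["rgb"], obs["audio"])
        obs, reward, done, info = env.step(action)
        if done:
            break
    return obs
```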
SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio
rendering for 3D environments. Given a 3D mesh of a real-world environment,
SoundSpaces can generate highly realistic acoustics for arbitrary sounds
captured from arbitrary microphone locations. Together with existing 3D visual
assets, it supports an array of audio-visual research tasks, such as
audio-visual navigation, mapping, source localization and separation, and
acoustic matching. Compared to existing resources, SoundSpaces 2.0 has the
advantages of allowing continuous spatial sampling, generalization to novel
environments, and configurable microphone and material properties. To our
knowledge, this is the first geometry-based acoustic simulation that offers
high fidelity and realism while also being fast enough to use for embodied
learning. We showcase the simulator's properties and benchmark its performance
against real-world audio measurements. In addition, we demonstrate two
downstream tasks -- embodied navigation and far-field automatic speech
recognition -- and highlight sim2real performance for the latter. SoundSpaces
2.0 is publicly available to facilitate wider research for perceptual systems
that can both see and hear.
Comment: Camera-ready version. Website: https://soundspaces.org. Project page: https://vision.cs.utexas.edu/projects/soundspaces
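The core operation the abstract describes, rendering arbitrary sounds at arbitrary microphone positions, amounts to simulating a room impulse response (RIR) and convolving it with a dry source signal. The sketch below shows that pipeline under stated assumptions; simulate_rir is a hypothetical stand-in for the simulator's geometry-based renderer, not SoundSpaces' real API.

```python
# Minimal sketch of geometry-based rendering: simulate a room impulse
# response for a source/microphone pair, then convolve it with any dry
# sound. simulate_rir is a hypothetical stand-in, not SoundSpaces' API.
from scipy.signal import fftconvolve

def render_at_microphone(mesh, source_pos, mic_pos, dry_sound, simulate_rir):
    rir = simulate_rir(mesh, source_pos, mic_pos)  # geometry-based RIR
    # Convolution yields the sound as it would be captured at mic_pos,
    # including the room's reflections and reverberation.
    return fftconvolve(dry_sound, rir)[: len(dry_sound)]
```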
Learning Semantic-Agnostic and Spatial-Aware Representation for Generalizable Visual-Audio Navigation
Visual-audio navigation (VAN) is attracting increasing attention from the robotics community due to its broad applications, e.g., household robots and rescue robots. In this task, an embodied agent must search for and navigate to a sound source using egocentric visual and audio observations. However, existing methods are limited in two aspects: 1) poor generalization to unheard sound categories; and 2) sample inefficiency in training. Focusing on these two
problems, we propose a brain-inspired plug-and-play method to learn a
semantic-agnostic and spatial-aware representation for generalizable
visual-audio navigation. We design two auxiliary tasks that respectively accelerate the learning of representations with these desired characteristics. With them, the agent learns a spatially correlated representation of visual and audio inputs that generalizes to environments with novel sounds and maps. Experimental results on realistic 3D scenes (Replica and Matterport3D) demonstrate that our method achieves better generalization when zero-shot transferred to scenes with unseen maps and unheard sound categories.
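One plausible way to read the two auxiliary tasks is as extra prediction losses on a shared encoder, added to the navigation objective. The sketch below is an illustrative guess at that structure; the specific heads, targets, and weights are assumptions, not the paper's exact objectives.

```python
# Hedged sketch of auxiliary-task training: two extra losses shape the
# shared representation to be spatial-aware and semantic-agnostic. The
# heads, targets, and weights are illustrative assumptions.
import torch.nn.functional as F

def training_loss(features, nav_loss, direction_head, distance_head,
                  true_direction, true_distance, w_dir=0.1, w_dist=0.1):
    # Auxiliary task 1: predict the sound source's relative direction,
    # encouraging a spatial-aware representation.
    dir_loss = F.mse_loss(direction_head(features), true_direction)
    # Auxiliary task 2: predict distance to the source, a geometric
    # quantity independent of the sound's semantic category.
    dist_loss = F.mse_loss(distance_head(features), true_distance)
    return nav_loss + w_dir * dir_loss + w_dist * dist_loss
```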
HoME: a Household Multimodal Environment
We introduce HoME: a Household Multimodal Environment for artificial agents
to learn from vision, audio, semantics, physics, and interaction with objects
and other agents, all within a realistic context. HoME integrates over 45,000
diverse 3D house layouts based on the SUNCG dataset, a scale which may
facilitate learning, generalization, and transfer. HoME is an open-source,
OpenAI Gym-compatible platform extensible to tasks in reinforcement learning,
language grounding, sound-based navigation, robotics, multi-agent learning, and
more. We hope HoME better enables artificial agents to learn as humans do: in
an interactive, multimodal, and richly contextualized setting.
Comment: Presented at NIPS 2017's Visually-Grounded Interaction and Language Workshop.
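Because HoME advertises OpenAI Gym compatibility, interacting with it should follow the classic Gym loop. The environment id below is a placeholder for illustration, not a registered HoME task.

```python
# Classic OpenAI Gym interaction loop; the environment id is a
# placeholder assumption, not a registered HoME task name.
import gym

env = gym.make("HoME-SoundNav-v0")  # hypothetical id
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy for illustration
    obs, reward, done, info = env.step(action)
env.close()
```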
FM radio: family interplay with sonic mementos
Digital mementos are increasingly problematic, as people acquire large amounts of digital belongings that are hard to access and often forgotten. Based on fieldwork with 10 families, we designed a new type of embodied digital memento, the FM Radio. It allows families to access and play sonic mementos of their previous holidays. We describe our underlying design motivation, where recordings are presented as a series of channels on an old-fashioned radio. User feedback suggests that the device met our design goals: being playful and intriguing, easy to use, and social. It facilitated family interaction and allowed ready access to mementos, thus sharing many of the properties of physical mementos that we intended to trigger.
Show me the way to Monte Carlo: density-based trajectory navigation
We demonstrate the use of uncertain prediction in a system for pedestrian navigation via audio, combining Global Positioning System data, a music player, inertial sensing, magnetic bearing data, and Monte Carlo sampling for a density-following task: a listener's music is modulated according to the changing predictions of user position with respect to a target density, in this case a trajectory or path. We show that this system enables eyes-free navigation along set trajectories or paths unfamiliar to the user, and demonstrate that it can be used effectively across varying trajectory widths and contexts.
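A hedged sketch of the density-following idea: propagate positional uncertainty with Monte Carlo samples, score the samples against the target trajectory's density, and use the score to modulate the audio feedback. The function names and the Gaussian position model are assumptions for the example.

```python
# Monte Carlo density following (illustrative sketch): sample particles
# of the user's predicted position, average their target-trajectory
# density, and map the result to an audio modulation level. The Gaussian
# position model and function names are assumptions.
import numpy as np

def density_feedback(pos_mean, pos_std, target_density, n_samples=1000):
    # Propagate GPS/inertial positional uncertainty via Monte Carlo sampling.
    particles = np.random.normal(pos_mean, pos_std, size=(n_samples, 2))
    # High score: predicted position mass lies on the desired path;
    # low score: the listener is drifting off the trajectory.
    score = float(np.mean([target_density(p) for p in particles]))
    return min(max(score, 0.0), 1.0)  # e.g. scales music volume/filtering
```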
Towards virtual communities on the Web: Actors and audience
We report on ongoing research in a virtual reality environment where visitors can interact with agents that help them obtain information, perform certain transactions, and collaborate with them to get tasks done. Our environment models a theatre in our hometown. We discuss attempts to let this environment evolve into a theatre community with not only goal-directed visitors, but also visitors who are not sure whether they want to buy or just want information, or who simply want to look around. We show that a multi-user and multi-agent environment is needed to realize our goals. Since our environment models a theatre, it is also interesting to investigate the roles of performers and audience within it. For that reason we discuss the capabilities and personalities of agents. Some notes on the historical development of networked communities are included.