From Seeing to Moving: A Survey on Learning for Visual Indoor Navigation (VIN)
The Visual Indoor Navigation (VIN) task has drawn increasing attention from the
data-driven machine learning communities, especially with the recently reported
success of learning-based methods. Due to the innate complexity of this task,
researchers have approached the problem from a variety of different
angles, the full scope of which has not yet been captured within an overarching
report. This survey first summarizes the representative work on learning-based
approaches to the VIN task, then identifies and discusses lingering issues
impeding VIN performance, and finally motivates future research in these key
areas worth exploring for the community.
Success Weighted by Completion Time: A Dynamics-Aware Evaluation Criteria for Embodied Navigation
We present Success weighted by Completion Time (SCT), a new metric for
evaluating navigation performance for mobile robots. Several related works on
navigation have used Success weighted by Path Length (SPL) as the primary
method of evaluating the path an agent makes to a goal location, but SPL is
limited in its ability to properly evaluate agents with complex dynamics. In
contrast, SCT explicitly takes the agent's dynamics model into consideration,
and aims to accurately capture how well the agent has approximated the fastest
navigation behavior afforded by its dynamics. While several embodied navigation
works use point-turn dynamics, we focus on unicycle-cart dynamics for our
agent, which better exemplifies the dynamics model of popular mobile robotics
platforms (e.g., LoCoBot, TurtleBot, Fetch, etc.). We also present
RRT*-Unicycle, an algorithm for unicycle dynamics that estimates the fastest
collision-free path and completion time from a starting pose to a goal location
in an environment containing obstacles. We experiment with deep reinforcement
learning and reward shaping to train and compare the navigation performance of
agents with different dynamics models. In evaluating these agents, we show that
in contrast to SPL, SCT is able to capture the advantages in navigation speed a
unicycle model has over a simpler point-turn model of dynamics. Lastly, we show
that we can successfully deploy our trained models and algorithms outside of
simulation in the real world. We embody our agents in a real robot to navigate
an apartment, and show that they can generalize in a zero-shot manner.
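The contrast between SPL and SCT in the abstract above can be illustrated with a short sketch. SPL is commonly defined as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success flag, l_i the shortest path length, and p_i the agent's path length. The sketch assumes SCT has the same shape with completion times in place of path lengths; that assumption, and the names `Episode` and `success_weighted`, are illustrative and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    success: bool        # did the agent reach the goal?
    agent_cost: float    # agent's path length (SPL) or completion time (SCT)
    optimal_cost: float  # shortest path length (SPL) or fastest time (SCT)


def success_weighted(episodes: Sequence[Episode]) -> float:
    """Mean of success * optimal / max(agent, optimal) over all episodes.

    With path lengths this is SPL. With completion times it is an
    assumed SCT-style score (hypothetical form, see the lead-in).
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        denom = max(ep.agent_cost, ep.optimal_cost)
        # Failed episodes contribute zero; a zero denominator is skipped.
        if ep.success and denom > 0.0:
            total += ep.optimal_cost / denom
    return total / len(episodes)


# Three episodes: one twice as costly as optimal, one failed, one optimal.
episodes = [
    Episode(success=True, agent_cost=10.0, optimal_cost=5.0),
    Episode(success=False, agent_cost=3.0, optimal_cost=3.0),
    Episode(success=True, agent_cost=4.0, optimal_cost=4.0),
]
score = success_weighted(episodes)
```

Feeding the same function path lengths or completion times makes the difference concrete: an agent that drives a longer but faster arc could score lower under the length-based variant and higher under the time-based one.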
A Survey of Embodied AI: From Simulators to Research Tasks
There has been an emerging paradigm shift from the era of "internet AI" to
"embodied AI", where AI algorithms and agents no longer learn from datasets of
images, videos or text curated primarily from the internet. Instead, they learn
through interactions with their environments from an egocentric perception
similar to humans. Consequently, there has been substantial growth in the
demand for embodied AI simulators to support various embodied AI research
tasks. This growing interest in embodied AI is beneficial to the greater
pursuit of Artificial General Intelligence (AGI), but there has not been a
contemporary and comprehensive survey of this field. This paper aims to provide
an encyclopedic survey for the field of embodied AI, from its simulators to its
research. By evaluating nine current embodied AI simulators with our proposed
seven features, this paper aims to understand the simulators in their provision
for use in embodied AI research and their limitations. This paper also
surveys the three main research tasks in embodied AI -- visual exploration,
visual navigation and embodied question answering (QA), covering the
state-of-the-art approaches, evaluation metrics and datasets. Finally, with the
new insights revealed through surveying the field, the paper will provide
suggestions for simulator-for-task selections and recommendations for the
future directions of the field.
Comment: Under Review for IEEE TETC
Emergence of Intelligent Navigation Behavior in Embodied Agents from Massive-Scale Simulation
The goal of Artificial Intelligence is to build ‘thinking machines’ that ‘use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves.’ In this dissertation, we will argue that the intelligence required for this goal emerges from massive-scale simulation. We will show a specific case: that intelligent navigation behavior emerges from massive-scale simulation and deep reinforcement learning.
Towards this end, we introduce Decentralized Distributed PPO (DD-PPO), a method that scales reinforcement learning to multiple GPUs and machines. We use DD-PPO to train agents for PointGoal navigation (e.g. ‘Go 5 meters north and 10 meters east relative to start’) for the equivalent of 80 years of human experience. This massive-scale training results in near-perfect autonomous navigation in an unseen environment without access to a map. We then examine the inner workings of a special case of PointGoalNav agents. We find that (1) their memory enables shortcuts, i.e. they efficiently travel through previously unexplored parts of the environment; (2) there is emergence of maps in their memory, i.e. a detailed occupancy grid of the environment can be decoded from it.
We then introduce Variable Experience Rollout (VER), a method that efficiently scales reinforcement learning on a single GPU or machine. We use VER to train chained skills for mobile manipulation. We find a surprising emergence of navigation in skills that do not ostensibly require any navigation. Specifically, the pick skill involves a robot picking an object from a table. During training, the robot was always spawned close to the table and never needed to navigate. However, we find that if navigation actions are part of the action space, the robot learns to navigate and then pick an object in new environments with 50% success, demonstrating surprisingly high out-of-distribution generalization.
Ph.D
Bayesian State Tracking and Sim-to-Real Transfer for Vision-and-Language Navigation
A visually-grounded navigation instruction can be interpreted as a sequence of expected observations and actions an agent following the correct trajectory would encounter and perform. Based on this intuition, we formulate the problem of finding the goal location in Vision-and-Language Navigation (VLN) within the framework of Bayesian state tracking -- learning observation and motion models conditioned on these expectable events. Together with a mapper that constructs a semantic spatial map on-the-fly during navigation, we formulate an end-to-end differentiable Bayes filter and train it to identify the goal by predicting the most likely trajectory through the map according to the instructions. The resulting navigation policy constitutes a new approach to instruction following that explicitly models a probability distribution over states, encoding strong geometric and algorithmic priors while enabling greater explainability. Our experiments show that our approach outperforms a strong LingUNet baseline when predicting the goal location on the map. On the full VLN task, i.e. navigating to the goal location, our approach achieves promising results with less reliance on navigation constraints.
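As background for the Bayesian state tracking formulation above, here is a minimal sketch of the generic predict/update cycle of a discrete Bayes (histogram) filter. It is not the thesis's model: the thesis learns its observation and motion models from the instruction and runs over a semantic spatial map, whereas this sketch assumes a one-dimensional corridor with hand-written tables. The names `predict`, `update`, and `motion_kernel` are hypothetical.

```python
from typing import Dict, List


def normalize(belief: List[float]) -> List[float]:
    """Rescale a non-negative vector so that it sums to one."""
    total = sum(belief)
    if total <= 0.0:
        raise ValueError("belief has no probability mass")
    return [b / total for b in belief]


def predict(belief: List[float], motion_kernel: Dict[int, float]) -> List[float]:
    """Motion step: push probability mass along the motion model.

    motion_kernel maps a cell shift to its probability (assumed to sum
    to one). Shifts that would leave the corridor are clamped to its ends.
    """
    n = len(belief)
    predicted = [0.0] * n
    for i, mass in enumerate(belief):
        for shift, prob in motion_kernel.items():
            j = min(max(i + shift, 0), n - 1)
            predicted[j] += mass * prob
    return predicted


def update(belief: List[float], likelihood: List[float]) -> List[float]:
    """Observation step: reweight each cell by P(observation | cell)."""
    return normalize([b * l for b, l in zip(belief, likelihood)])


# Uniform prior over five cells; the agent probably moves one cell right.
prior = [0.2] * 5
predicted = predict(prior, {0: 0.2, 1: 0.8})
# Hand-written likelihood: the observation matches cell 2 best.
posterior = update(predicted, [0.1, 0.1, 0.9, 0.1, 0.1])
best_cell = max(range(len(posterior)), key=posterior.__getitem__)
```

In a differentiable version of this idea, the two tables would be produced by learned networks and the whole cycle trained end to end, which is the part this sketch leaves out.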
In the second half of the thesis, we study the challenging problem of releasing a robot in a previously unseen environment, and having it follow unconstrained natural language navigation instructions. Recent work on the task of VLN has achieved significant progress in simulation. To assess the implications of this work for robotics, we transfer a VLN agent trained in simulation to a physical robot. To bridge the gap between the high-level discrete action space learned by the VLN agent, and the robot's low-level continuous action space, we propose a subgoal model to identify nearby waypoints, and use domain randomization to mitigate visual domain differences. For accurate sim and real comparisons in parallel environments, we annotate a 325 m² office space with 1.3 km of navigation instructions, and create a digitized replica in simulation. We find that sim-to-real transfer to an environment not seen in training is successful if an occupancy map and navigation graph can be collected and annotated in advance (success rate of 46.8% vs. 55.9% in sim), but much more challenging in the hardest setting with no prior mapping at all (success rate of 22.5%).
M.S