Last-Mile Embodied Visual Navigation
Realistic long-horizon tasks like image-goal navigation involve exploratory
and exploitative phases. Assigned with an image of the goal, an embodied agent
must explore to discover the goal, i.e., search efficiently using learned
priors. Once the goal is discovered, the agent must accurately calibrate the
last-mile of navigation to the goal. As with any robust system, switches
between exploratory goal discovery and exploitative last-mile navigation enable
better recovery from errors. Following these intuitive guide rails, we propose
SLING to improve the performance of existing image-goal navigation systems.
Entirely complementing prior methods, we focus on last-mile navigation and
leverage the underlying geometric structure of the problem with neural
descriptors. With simple but effective switches, we can easily connect SLING
with heuristic, reinforcement learning, and neural modular policies. On a
standardized image-goal navigation benchmark (Hahn et al. 2021), we improve
performance across policies, scenes, and episode complexity, raising the
state-of-the-art from 45% to 55% success rate. Beyond photorealistic
simulation, we conduct real-robot experiments in three physical scenes and find
these improvements to transfer well to real environments.
Comment: Accepted at CoRL 2022. Code and results available at
https://jbwasse2.github.io/portfolio/SLIN
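Since SLING is described as plugging into existing policies through simple switches, the sketch below shows one way such a wrapper could look: an exploration policy runs until the goal image is plausibly matched in the current view, after which a geometric last-mile controller takes over, switching back on failure. The class and method names, the match-count criterion, and the threshold are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (not the authors' code): wrapping an existing exploration
# policy with a last-mile controller behind simple switches, in the spirit
# of SLING. The policy/controller interfaces, the match-count criterion,
# and the threshold are illustrative assumptions.

class SwitchingAgent:
    def __init__(self, exploration_policy, last_mile_controller,
                 match_threshold=20):
        self.explore = exploration_policy        # any heuristic / RL / modular policy
        self.last_mile = last_mile_controller    # geometric controller on matched keypoints
        self.match_threshold = match_threshold
        self.in_last_mile = False

    def act(self, observation, goal_image):
        # Switch to exploitation once the goal is plausibly in view,
        # measured here by the number of feature matches to the goal image.
        matches = self.last_mile.match_features(observation, goal_image)
        if not self.in_last_mile and len(matches) >= self.match_threshold:
            self.in_last_mile = True

        if self.in_last_mile:
            # Assumed interface: the controller reports whether it is still
            # confident in its geometric estimate of the goal.
            action, confident = self.last_mile.step(observation, matches)
            if confident:
                return action
            # Switch back to exploration to recover from a bad lock-on.
            self.in_last_mile = False

        return self.explore.act(observation, goal_image)
```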
Learning to Prevent Monocular SLAM Failure using Reinforcement Learning
Monocular SLAM refers to using a single camera to estimate robot ego motion
while building a map of the environment. While Monocular SLAM is a well studied
problem, automating Monocular SLAM by integrating it with trajectory planning
frameworks is particularly challenging. This paper presents a novel formulation
based on Reinforcement Learning (RL) that generates fail safe trajectories
wherein the SLAM generated outputs do not deviate largely from their true
values. Quintessentially, the RL framework successfully learns the otherwise
complex relation between perceptual inputs and motor actions and uses this
knowledge to generate trajectories that do not cause failure of SLAM. We show
systematically in simulations how the quality of the SLAM estimates dramatically improves
when trajectories are computed using RL. Our method scales effectively across
Monocular SLAM frameworks in both simulation and real-world experiments with
a mobile robot.
Comment: Accepted at the 11th Indian Conference on Computer Vision, Graphics
and Image Processing (ICVGIP) 2018 More info can be found at the project page
at https://robotics.iiit.ac.in/people/vignesh.prasad/SLAMSafePlanner.html and
the supplementary video can be found at
https://www.youtube.com/watch?v=420QmM_Z8v
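As a rough illustration of rewarding trajectories that keep SLAM from failing, the snippet below sketches a reward that combines progress toward the goal with a penalty on drift between the estimated pose and the ground-truth pose available in simulation. The weights and the drift metric are assumptions, not the paper's formulation.

```python
# Minimal sketch (illustrative, not the paper's exact reward): encourage
# progress toward the goal while penalizing drift between the monocular-SLAM
# pose estimate and the simulator's ground-truth pose.

import numpy as np

def slam_safe_reward(slam_pose, true_pose, prev_dist_to_goal, dist_to_goal,
                     drift_weight=1.0, progress_weight=0.5):
    # Translational drift between estimated and true pose (x, y, z).
    drift = np.linalg.norm(np.asarray(slam_pose[:3]) - np.asarray(true_pose[:3]))
    # Progress made toward the goal this step.
    progress = prev_dist_to_goal - dist_to_goal
    return progress_weight * progress - drift_weight * drift
```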
OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav
We present a single neural network architecture composed of task-agnostic
components (ViTs, convolutions, and LSTMs) that achieves state-of-the-art results
on both the ImageNav ("go to location in <this picture>") and ObjectNav ("find
a chair") tasks without any task-specific modules like object detection,
segmentation, mapping, or planning modules. Such general-purpose methods offer
advantages of simplicity in design, positive scaling with available compute,
and versatile applicability to multiple tasks. Our work builds upon the recent
success of self-supervised learning (SSL) for pre-training vision transformers
(ViT). However, while the training recipes for convolutional networks are
mature and robust, the recipes for ViTs are contingent and brittle, and in the
case of ViTs for visual navigation, yet to be fully discovered. Specifically,
we find that vanilla ViTs do not outperform ResNets on visual navigation. We
propose the use of a compression layer operating over ViT patch representations
to preserve spatial information along with policy training improvements. These
improvements allow us to demonstrate positive scaling laws for the first time
in visual navigation tasks. Consequently, our model advances state-of-the-art
performance on ImageNav from 54.2% to 82.0% success and performs competitively
against concurrent state-of-the-art on ObjectNav with a success rate of 64.0% vs.
65.0%. Overall, this work does not present a fundamentally new approach, but
rather recommendations for training a general-purpose architecture that
achieves state-of-the-art performance today and could serve as a strong baseline
for future methods.
Comment: 15 pages, 7 figures, 9 tables
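To make the compression-layer idea concrete, here is a minimal PyTorch sketch of one plausible form: ViT patch tokens are reshaped back onto their spatial grid and compressed with a small convolution rather than pooled into a single vector, so the flattened feature handed to the LSTM policy retains spatial structure. The layer sizes and normalization are assumptions, not the released OVRL-V2 architecture.

```python
# Minimal sketch (assumptions throughout, not the OVRL-V2 release code):
# a "compression layer" that keeps the spatial layout of ViT patch tokens
# by reshaping them into an HxW grid and compressing channels with a small
# convolution, instead of collapsing them into a single pooled vector.

import torch
import torch.nn as nn

class PatchCompression(nn.Module):
    def __init__(self, embed_dim=768, grid_size=14, out_channels=128):
        super().__init__()
        self.grid_size = grid_size
        self.compress = nn.Sequential(
            nn.Conv2d(embed_dim, out_channels, kernel_size=3, padding=1),
            nn.GroupNorm(8, out_channels),
            nn.ReLU(),
        )

    def forward(self, patch_tokens):            # (B, N, D) patch tokens, no CLS
        b, n, d = patch_tokens.shape
        h = w = self.grid_size                  # assumes a square patch grid
        x = patch_tokens.transpose(1, 2).reshape(b, d, h, w)
        x = self.compress(x)                    # (B, C, H, W), spatial info preserved
        return x.flatten(1)                     # flat feature fed to the LSTM policy

# Example: 196 patch tokens from a ViT-B/16 on a 224x224 frame.
feats = PatchCompression()(torch.randn(2, 196, 768))
print(feats.shape)                              # torch.Size([2, 25088])
```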
Navigating to Objects Specified by Images
Images are a convenient way to specify which particular object instance an
embodied agent should navigate to. Solving this task requires semantic visual
reasoning and exploration of unknown environments. We present a system that can
perform this task in both simulation and the real world. Our modular method
solves sub-tasks of exploration, goal instance re-identification, goal
localization, and local navigation. We re-identify the goal instance in
egocentric vision using feature-matching and localize the goal instance by
projecting matched features to a map. Each sub-task is solved using
off-the-shelf components requiring zero fine-tuning. On the HM3D
InstanceImageNav benchmark, this system outperforms a baseline end-to-end RL
policy 7x and a state-of-the-art ImageNav model 2.3x (56% vs 25% success). We
deploy this system to a mobile robot platform and demonstrate effective
real-world performance, achieving an 88% success rate across a home and an
office environment.
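Since the system re-identifies the goal instance with off-the-shelf feature matching, a simple stand-in (not the paper's exact matcher) could look like the following: count keypoint matches between the egocentric frame and the goal image, and treat a high match count as a re-identification; the matched keypoints could then be projected to the map using depth and camera pose. SIFT, the ratio test, and the threshold here are illustrative choices.

```python
# Minimal sketch (a stand-in, not the system's exact components): goal
# instance re-identification by counting keypoint matches between the
# egocentric frame and the goal image. The matcher choice and the
# match-count threshold are assumptions.

import cv2

def goal_reidentified(frame_gray, goal_gray, min_matches=30, ratio=0.75):
    sift = cv2.SIFT_create()
    kp_goal, des_goal = sift.detectAndCompute(goal_gray, None)
    kp_frame, des_frame = sift.detectAndCompute(frame_gray, None)
    if des_goal is None or des_frame is None:
        return False, []
    matcher = cv2.BFMatcher()
    candidates = matcher.knnMatch(des_goal, des_frame, k=2)
    # Lowe's ratio test keeps only distinctive matches.
    good = [m for m, n in (p for p in candidates if len(p) == 2)
            if m.distance < ratio * n.distance]
    # Matched keypoints in the frame can then be projected to the map
    # with depth + camera pose to localize the goal instance.
    frame_points = [kp_frame[m.trainIdx].pt for m in good]
    return len(good) >= min_matches, frame_points
```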
Habitat-Matterport 3D Semantics Dataset
We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is
the largest dataset of 3D real-world spaces with densely annotated semantics
that is currently available to the academic community. It consists of 142,646
object instance annotations across 216 3D spaces and 3,100 rooms within those
spaces. The scale, quality, and diversity of object annotations far exceed
those of prior datasets. A key difference setting apart HM3DSEM from other
datasets is the use of texture information to annotate pixel-accurate object
boundaries. We demonstrate the effectiveness of HM3DSEM dataset for the Object
Goal Navigation task using different methods. Policies trained using HM3DSEM
perform outperform those trained on prior datasets. Introduction of HM3DSEM in
the Habitat ObjectNav Challenge lead to an increase in participation from 400
submissions in 2021 to 1022 submissions in 2022.Comment: 14 Pages, 10 Figures, 5 Table
Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
We present the largest and most comprehensive empirical study of pre-trained
visual representations (PVRs) or visual 'foundation models' for Embodied AI.
First, we curate CortexBench, consisting of 17 different tasks spanning
locomotion, navigation, dexterous, and mobile manipulation. Next, we
systematically evaluate existing PVRs and find that none are universally
dominant.
To study the effect of pre-training data scale and diversity, we combine over
4,000 hours of egocentric videos from 7 different sources (over 5.6M images)
and ImageNet to train different-sized vision transformers using Masked
Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior
work, we find that scaling dataset size and diversity does not improve
performance universally (but does so on average).
Our largest model, named VC-1, outperforms all prior PVRs on average but does
not universally dominate either. Finally, we show that task or domain-specific
adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving
performance competitive with or superior to the best known results on all of the
benchmarks in CortexBench. These models required over 10,000 GPU-hours to train
and can be found on our website for the benefit of the research community.
Comment: Project website: https://eai-vc.github.i
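The evaluation protocol described here, freezing a pre-trained visual encoder and training only a small policy head per task, can be sketched as below. The timm backbone stands in for VC-1, and the action dimension and head size are assumptions rather than details from the paper.

```python
# Minimal sketch (not the release code): use a frozen pre-trained vision
# transformer as the PVR and train only a small policy head on top, in the
# spirit of the CortexBench evaluation protocol. The timm backbone is a
# stand-in for VC-1; action_dim and the head width are assumptions.

import timm
import torch
import torch.nn as nn

class FrozenPVRPolicy(nn.Module):
    def __init__(self, action_dim=4, backbone="vit_base_patch16_224"):
        super().__init__()
        self.encoder = timm.create_model(backbone, pretrained=True, num_classes=0)
        for p in self.encoder.parameters():     # freeze the PVR
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(self.encoder.num_features, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, rgb):                     # rgb: (B, 3, 224, 224)
        with torch.no_grad():
            feats = self.encoder(rgb)           # (B, num_features) pooled features
        return self.head(feats)                 # per-task action logits
```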
HomeRobot: An Open Source Software Stack for Mobile Manipulation Research
Reproducibility in robotics research requires capable, shared hardware platforms which can be used for a wide variety of research. We’ve seen the power of these sorts of shared platforms in more general machine learning research, where there is constant iteration on shared AI platforms like PyTorch. To be able to make rapid progress in robotics in the same way, we propose that we need: (1) shared real-world platforms which allow different teams to test and compare methods at low cost; (2) challenging simulations that reflect real-world environments and especially can drive perception and planning research; and (3) low-cost platforms with enough software to get started addressing all of these problems. To this end, we propose HomeRobot, a mobile manipulator software stack with an associated benchmark in simulation, which is initially based on the low-cost, human-safe Hello Robot Stretch.
What do we learn from a large-scale study of pre-trained visual representations in sim and real environments?
We present a large empirical investigation on the use of pre-trained visual
representations (PVRs) for training downstream policies that execute real-world
tasks. Our study spans five different PVRs, two different policy-learning
paradigms (imitation and reinforcement learning), and three different robots
for 5 distinct manipulation and indoor navigation tasks. From this effort, we
can arrive at three insights: 1) the performance trends of PVRs in the
simulation are generally indicative of their trends in the real world, 2) the
use of PVRs enables a first-of-its-kind result with indoor ImageNav (zero-shot
transfer to a held-out scene in the real world), and 3) the benefits from
variations in PVRs, primarily data-augmentation and fine-tuning, also transfer
to the real-world performance. See project website for additional details and
visuals.
Comment: Project website https://pvrs-sim2real.github.io