Last-Mile Embodied Visual Navigation
Realistic long-horizon tasks like image-goal navigation involve exploratory
and exploitative phases. Assigned with an image of the goal, an embodied agent
must explore to discover the goal, i.e., search efficiently using learned
priors. Once the goal is discovered, the agent must accurately calibrate the
last-mile of navigation to the goal. As with any robust system, switches
between exploratory goal discovery and exploitative last-mile navigation enable
better recovery from errors. Following these intuitive guide rails, we propose
SLING to improve the performance of existing image-goal navigation systems.
Entirely complementing prior methods, we focus on last-mile navigation and
leverage the underlying geometric structure of the problem with neural
descriptors. With simple but effective switches, we can easily connect SLING
with heuristic, reinforcement learning, and neural modular policies. On a
standardized image-goal navigation benchmark (Hahn et al. 2021), we improve
performance across policies, scenes, and episode complexity, raising the
state-of-the-art from 45% to 55% success rate. Beyond photorealistic
simulation, we conduct real-robot experiments in three physical scenes and find
these improvements to transfer well to real environments.
Comment: Accepted at CoRL 2022. Code and results available at
https://jbwasse2.github.io/portfolio/SLIN
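Since SLING is described as plugging into existing policies through simple switches, the sketch below shows one way such a wrapper could look: an exploration policy runs until the goal image is plausibly matched in the current view, after which a geometric last-mile controller takes over, switching back on failure. The class and method names, the match-count criterion, and the threshold are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (not the authors' code): wrapping an existing exploration
# policy with a last-mile controller behind simple switches, in the spirit
# of SLING. The policy/controller interfaces, the match-count criterion,
# and the threshold are illustrative assumptions.

class SwitchingAgent:
    def __init__(self, exploration_policy, last_mile_controller,
                 match_threshold=20):
        self.explore = exploration_policy        # any heuristic / RL / modular policy
        self.last_mile = last_mile_controller    # geometric controller on matched keypoints
        self.match_threshold = match_threshold
        self.in_last_mile = False

    def act(self, observation, goal_image):
        # Switch to exploitation once the goal is plausibly in view,
        # measured here by the number of feature matches to the goal image.
        matches = self.last_mile.match_features(observation, goal_image)
        if not self.in_last_mile and len(matches) >= self.match_threshold:
            self.in_last_mile = True

        if self.in_last_mile:
            # Assumed interface: the controller reports whether it is still
            # confident in its geometric estimate of the goal.
            action, confident = self.last_mile.step(observation, matches)
            if confident:
                return action
            # Switch back to exploration to recover from a bad lock-on.
            self.in_last_mile = False

        return self.explore.act(observation, goal_image)
```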
Learning to Prevent Monocular SLAM Failure using Reinforcement Learning
Monocular SLAM refers to using a single camera to estimate robot ego motion
while building a map of the environment. While Monocular SLAM is a well studied
problem, automating Monocular SLAM by integrating it with trajectory planning
frameworks is particularly challenging. This paper presents a novel formulation
based on Reinforcement Learning (RL) that generates fail safe trajectories
wherein the SLAM generated outputs do not deviate largely from their true
values. Quintessentially, the RL framework successfully learns the otherwise
complex relation between perceptual inputs and motor actions and uses this
knowledge to generate trajectories that do not cause failure of SLAM. We show
systematically in simulations how the quality of the SLAM estimates dramatically improves
when trajectories are computed using RL. Our method scales effectively across
Monocular SLAM frameworks in both simulation and real-world experiments with
a mobile robot.
Comment: Accepted at the 11th Indian Conference on Computer Vision, Graphics
and Image Processing (ICVGIP) 2018 More info can be found at the project page
at https://robotics.iiit.ac.in/people/vignesh.prasad/SLAMSafePlanner.html and
the supplementary video can be found at
https://www.youtube.com/watch?v=420QmM_Z8v
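As a rough illustration of rewarding trajectories that keep SLAM from failing, the snippet below sketches a reward that combines progress toward the goal with a penalty on drift between the estimated pose and the ground-truth pose available in simulation. The weights and the drift metric are assumptions, not the paper's formulation.

```python
# Minimal sketch (illustrative, not the paper's exact reward): encourage
# progress toward the goal while penalizing drift between the monocular-SLAM
# pose estimate and the simulator's ground-truth pose.

import numpy as np

def slam_safe_reward(slam_pose, true_pose, prev_dist_to_goal, dist_to_goal,
                     drift_weight=1.0, progress_weight=0.5):
    # Translational drift between estimated and true pose (x, y, z).
    drift = np.linalg.norm(np.asarray(slam_pose[:3]) - np.asarray(true_pose[:3]))
    # Progress made toward the goal this step.
    progress = prev_dist_to_goal - dist_to_goal
    return progress_weight * progress - drift_weight * drift
```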
OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav
We present a single neural network architecture composed of task-agnostic
components (ViTs, convolutions, and LSTMs) that achieves state-of-the-art results
on both the ImageNav ("go to location in <this picture>") and ObjectNav ("find
a chair") tasks without any task-specific modules like object detection,
segmentation, mapping, or planning modules. Such general-purpose methods offer
advantages of simplicity in design, positive scaling with available compute,
and versatile applicability to multiple tasks. Our work builds upon the recent
success of self-supervised learning (SSL) for pre-training vision transformers
(ViT). However, while the training recipes for convolutional networks are
mature and robust, the recipes for ViTs are contingent and brittle, and in the
case of ViTs for visual navigation, yet to be fully discovered. Specifically,
we find that vanilla ViTs do not outperform ResNets on visual navigation. We
propose the use of a compression layer operating over ViT patch representations
to preserve spatial information along with policy training improvements. These
improvements allow us to demonstrate positive scaling laws for the first time
in visual navigation tasks. Consequently, our model advances state-of-the-art
performance on ImageNav from 54.2% to 82.0% success and performs competitively
against concurrent state-of-the-art on ObjectNav with a success rate of 64.0% vs.
65.0%. Overall, this work does not present a fundamentally new approach, but
rather recommendations for training a general-purpose architecture that
achieves state-of-the-art performance today and could serve as a strong baseline
for future methods.
Comment: 15 pages, 7 figures, 9 tables
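To make the compression-layer idea concrete, here is a minimal PyTorch sketch of one plausible form: ViT patch tokens are reshaped back onto their spatial grid and compressed with a small convolution rather than pooled into a single vector, so the flattened feature handed to the LSTM policy retains spatial structure. The layer sizes and normalization are assumptions, not the released OVRL-V2 architecture.

```python
# Minimal sketch (assumptions throughout, not the OVRL-V2 release code):
# a "compression layer" that keeps the spatial layout of ViT patch tokens
# by reshaping them into an HxW grid and compressing channels with a small
# convolution, instead of collapsing them into a single pooled vector.

import torch
import torch.nn as nn

class PatchCompression(nn.Module):
    def __init__(self, embed_dim=768, grid_size=14, out_channels=128):
        super().__init__()
        self.grid_size = grid_size
        self.compress = nn.Sequential(
            nn.Conv2d(embed_dim, out_channels, kernel_size=3, padding=1),
            nn.GroupNorm(8, out_channels),
            nn.ReLU(),
        )

    def forward(self, patch_tokens):            # (B, N, D) patch tokens, no CLS
        b, n, d = patch_tokens.shape
        h = w = self.grid_size                  # assumes a square patch grid
        x = patch_tokens.transpose(1, 2).reshape(b, d, h, w)
        x = self.compress(x)                    # (B, C, H, W), spatial info preserved
        return x.flatten(1)                     # flat feature fed to the LSTM policy

# Example: 196 patch tokens from a ViT-B/16 on a 224x224 frame.
feats = PatchCompression()(torch.randn(2, 196, 768))
print(feats.shape)                              # torch.Size([2, 25088])
```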
Navigating to Objects Specified by Images
Images are a convenient way to specify which particular object instance an
embodied agent should navigate to. Solving this task requires semantic visual
reasoning and exploration of unknown environments. We present a system that can
perform this task in both simulation and the real world. Our modular method
solves sub-tasks of exploration, goal instance re-identification, goal
localization, and local navigation. We re-identify the goal instance in
egocentric vision using feature-matching and localize the goal instance by
projecting matched features to a map. Each sub-task is solved using
off-the-shelf components requiring zero fine-tuning. On the HM3D
InstanceImageNav benchmark, this system outperforms a baseline end-to-end RL
policy 7x and a state-of-the-art ImageNav model 2.3x (56% vs 25% success). We
deploy this system to a mobile robot platform and demonstrate effective
real-world performance, achieving an 88% success rate across a home and an
office environment.
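Since the system re-identifies the goal instance with off-the-shelf feature matching, a simple stand-in (not the paper's exact matcher) could look like the following: count keypoint matches between the egocentric frame and the goal image, and treat a high match count as a re-identification; the matched keypoints could then be projected to the map using depth and camera pose. SIFT, the ratio test, and the threshold here are illustrative choices.

```python
# Minimal sketch (a stand-in, not the system's exact components): goal
# instance re-identification by counting keypoint matches between the
# egocentric frame and the goal image. The matcher choice and the
# match-count threshold are assumptions.

import cv2

def goal_reidentified(frame_gray, goal_gray, min_matches=30, ratio=0.75):
    sift = cv2.SIFT_create()
    kp_goal, des_goal = sift.detectAndCompute(goal_gray, None)
    kp_frame, des_frame = sift.detectAndCompute(frame_gray, None)
    if des_goal is None or des_frame is None:
        return False, []
    matcher = cv2.BFMatcher()
    candidates = matcher.knnMatch(des_goal, des_frame, k=2)
    # Lowe's ratio test keeps only distinctive matches.
    good = [m for m, n in (p for p in candidates if len(p) == 2)
            if m.distance < ratio * n.distance]
    # Matched keypoints in the frame can then be projected to the map
    # with depth + camera pose to localize the goal instance.
    frame_points = [kp_frame[m.trainIdx].pt for m in good]
    return len(good) >= min_matches, frame_points
```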
Habitat-Matterport 3D Semantics Dataset
We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is
the largest dataset of 3D real-world spaces with densely annotated semantics
that is currently available to the academic community. It consists of 142,646
object instance annotations across 216 3D spaces and 3,100 rooms within those
spaces. The scale, quality, and diversity of object annotations far exceed
those of prior datasets. A key difference setting apart HM3DSEM from other
datasets is the use of texture information to annotate pixel-accurate object
boundaries. We demonstrate the effectiveness of HM3DSEM dataset for the Object
Goal Navigation task using different methods. Policies trained using HM3DSEM
perform outperform those trained on prior datasets. Introduction of HM3DSEM in
the Habitat ObjectNav Challenge lead to an increase in participation from 400
submissions in 2021 to 1022 submissions in 2022.Comment: 14 Pages, 10 Figures, 5 Table
Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
We present the largest and most comprehensive empirical study of pre-trained
visual representations (PVRs) or visual 'foundation models' for Embodied AI.
First, we curate CortexBench, consisting of 17 different tasks spanning
locomotion, navigation, dexterous, and mobile manipulation. Next, we
systematically evaluate existing PVRs and find that none are universally
dominant.
To study the effect of pre-training data scale and diversity, we combine over
4,000 hours of egocentric videos from 7 different sources (over 5.6M images)
and ImageNet to train different-sized vision transformers using Masked
Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior
work, we find that scaling dataset size and diversity does not improve
performance universally (but does so on average).
Our largest model, named VC-1, outperforms all prior PVRs on average but does
not universally dominate either. Finally, we show that task or domain-specific
adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving
performance competitive with or superior to the best known results on all of the
benchmarks in CortexBench. These models required over 10,000 GPU-hours to train
and can be found on our website for the benefit of the research community.
Comment: Project website: https://eai-vc.github.i
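The evaluation protocol described here, freezing a pre-trained visual encoder and training only a small policy head per task, can be sketched as below. The timm backbone stands in for VC-1, and the action dimension and head size are assumptions rather than details from the paper.

```python
# Minimal sketch (not the release code): use a frozen pre-trained vision
# transformer as the PVR and train only a small policy head on top, in the
# spirit of the CortexBench evaluation protocol. The timm backbone is a
# stand-in for VC-1; action_dim and the head width are assumptions.

import timm
import torch
import torch.nn as nn

class FrozenPVRPolicy(nn.Module):
    def __init__(self, action_dim=4, backbone="vit_base_patch16_224"):
        super().__init__()
        self.encoder = timm.create_model(backbone, pretrained=True, num_classes=0)
        for p in self.encoder.parameters():     # freeze the PVR
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(self.encoder.num_features, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, rgb):                     # rgb: (B, 3, 224, 224)
        with torch.no_grad():
            feats = self.encoder(rgb)           # (B, num_features) pooled features
        return self.head(feats)                 # per-task action logits
```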
HomeRobot: An Open Source Software Stack for Mobile Manipulation Research
Reproducibility in robotics research requires capable, shared hardware platforms which can be used for a wide variety of research. We’ve seen the power of these sorts of shared platforms in more general machine learning research, where there is constant iteration on shared AI platforms like PyTorch. To be able to make rapid progress in robotics in the same way, we propose that we need: (1) shared real-world platforms which allow different teams to test and compare methods at low cost; (2) challenging simulations that reflect real-world environments and especially can drive perception and planning research; and (3) low-cost platforms with enough software to get started addressing all of these problems. To this end, we propose HomeRobot, a mobile manipulator software stack with an associated benchmark in simulation, which is initially based on the low-cost, human-safe Hello Robot Stretch.
What do we learn from a large-scale study of pre-trained visual representations in sim and real environments?
We present a large empirical investigation on the use of pre-trained visual
representations (PVRs) for training downstream policies that execute real-world
tasks. Our study spans five different PVRs, two different policy-learning
paradigms (imitation and reinforcement learning), and three different robots
for 5 distinct manipulation and indoor navigation tasks. From this effort, we
can arrive at three insights: 1) the performance trends of PVRs in the
simulation are generally indicative of their trends in the real world, 2) the
use of PVRs enables a first-of-its-kind result with indoor ImageNav (zero-shot
transfer to a held-out scene in the real world), and 3) the benefits from
variations in PVRs, primarily data-augmentation and fine-tuning, also transfer
to the real-world performance. See project website for additional details and
visuals.
Comment: Project website https://pvrs-sim2real.github.io