Learning to See before Learning to Act: Visual Pre-training for Manipulation
Does having visual priors (e.g. the ability to detect objects) facilitate
learning to perform vision-based manipulation (e.g. picking up objects)? We
study this problem under the framework of transfer learning, where the model is
first trained on a passive vision task, and adapted to perform an active
manipulation task. We find that pre-training on vision tasks significantly
improves generalization and sample efficiency for learning to manipulate
objects. However, realizing these gains requires careful selection of which
parts of the model to transfer. Our key insight is that outputs of standard
vision models highly correlate with affordance maps commonly used in
manipulation. Therefore, we explore directly transferring model parameters from
vision networks to affordance prediction networks, and show that this can
result in successful zero-shot adaptation, where a robot can pick up certain
objects with zero robotic experience. With just a small amount of robotic
experience, we can further fine-tune the affordance model to achieve better
results. With just 10 minutes of suction experience or 1 hour of grasping
experience, our method achieves ~80% success rate at picking up novel objects.
Comment: Accepted to ICRA 2020. Project page: http://yenchenlin.me/vision2action
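The parameter-transfer idea above can be illustrated in a few lines. The snippet below is a minimal sketch, assuming a torchvision ResNet-18 backbone and a hypothetical fully-convolutional affordance head; the paper's actual architectures, layer mapping, and fine-tuning procedure may differ.

```python
# Sketch: transfer pretrained vision-model parameters into an affordance network.
# Assumes torchvision ResNet-18; the per-pixel head below is illustrative, not
# the exact architecture used in the paper.
import torch
import torch.nn as nn
import torchvision.models as models

class AffordanceNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # Reuse the pretrained convolutional trunk (everything before avgpool/fc).
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])
        # Hypothetical affordance head: one channel of pick-success logits per pixel.
        self.head = nn.Sequential(
            nn.Conv2d(512, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
        )

    def forward(self, rgb):                 # rgb: (B, 3, H, W)
        return self.head(self.trunk(rgb))   # (B, 1, H, W) affordance logits

model = AffordanceNet()
# Zero-shot use relies on the pretrained features directly; fine-tuning (not
# shown) would update the head, and optionally the trunk, on a small set of
# robot trials.
logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1, 224, 224])
```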
M-EMBER: Tackling Long-Horizon Mobile Manipulation via Factorized Domain Transfer
In this paper, we propose a method to create visuomotor mobile manipulation
solutions for long-horizon activities by leveraging recent advances in
simulation to train visual solutions for mobile manipulation. While
previous works have shown success applying this procedure to autonomous visual
navigation and stationary manipulation, applying it to long-horizon visuomotor
mobile manipulation is still an open challenge that demands both perceptual and
compositional generalization of multiple skills. In this work, we develop
Mobile-EMBER, or M-EMBER, a factorized method that decomposes a long-horizon
mobile manipulation activity into a repertoire of primitive visual skills,
reinforcement-learns each skill, and composes these skills to complete the
long-horizon activity. On a mobile manipulation robot, we find that
M-EMBER completes a long-horizon mobile manipulation activity,
cleaning_kitchen, achieving a 53% success rate. This requires successfully
planning and executing five factorized, learned visual skills.
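A minimal sketch of the factorized skill composition described above follows. The skill names, their interfaces, and the fixed linear plan are hypothetical placeholders; in the paper each skill is a learned visuomotor policy and the decomposition is specific to the cleaning_kitchen activity.

```python
# Sketch of the factorized skill idea: a long-horizon activity is decomposed
# into primitive visual skills, each trained separately (e.g. with RL in
# simulation), then sequenced at execution time. Skill names and the simple
# linear plan below are hypothetical.
from typing import Callable, Dict, List

# Each skill maps an observation to success/failure once it terminates.
Skill = Callable[[dict], bool]

def make_dummy_skill(name: str) -> Skill:
    def run(observation: dict) -> bool:
        print(f"executing skill: {name}")
        return True  # stand-in for a learned visuomotor policy
    return run

SKILLS: Dict[str, Skill] = {
    name: make_dummy_skill(name)
    for name in ["navigate", "pick", "place", "open", "close"]
}

def run_activity(plan: List[str], observation: dict) -> bool:
    """Compose primitive skills into a long-horizon activity."""
    for skill_name in plan:
        if not SKILLS[skill_name](observation):
            return False  # a single failed skill fails the whole activity
    return True

# Hypothetical five-skill plan for a cleaning_kitchen-style activity.
success = run_activity(["navigate", "pick", "navigate", "place", "close"], {})
print("activity succeeded:", success)
```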
Reactive Semantic Planning in Unexplored Semantic Environments Using Deep Perceptual Feedback
This paper presents a reactive planning system that enriches the topological representation of an environment with a tightly integrated semantic representation, achieved by incorporating and exploiting advances in deep perceptual learning and probabilistic semantic reasoning. Our architecture combines object detection with semantic SLAM, affording robust, reactive logical as well as geometric planning in unexplored environments. Moreover, by incorporating a human mesh estimation algorithm, our system is capable of reacting and responding in real time to semantically labeled human motions and gestures. New formal results allow tracking of suitably non-adversarial moving targets, while maintaining the same collision avoidance guarantees. We suggest the empirical utility of the proposed control architecture with a numerical study including comparisons with a state-of-the-art dynamic replanning algorithm, and physical implementation on both a wheeled and legged platform in different settings with both geometric and semantic goals.
For more information: Kod*lab
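A toy sketch of the reactive perception-to-planning loop outlined above: detections update a semantic map, and the planner replans toward a semantic goal at every step. All names below are hypothetical, and the greedy step is a placeholder for the paper's planner with its collision-avoidance guarantees.

```python
# Minimal sketch of a reactive semantic planning loop. Classes and functions
# here are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SemanticMap:
    # label -> list of estimated 2D object positions
    objects: Dict[str, List[Tuple[float, float]]] = field(default_factory=dict)

    def update(self, detections: List[Tuple[str, Tuple[float, float]]]) -> None:
        for label, position in detections:
            self.objects.setdefault(label, []).append(position)

def plan_step(robot_xy, goal_xy, step=0.1):
    """Greedy reactive step toward the goal (stand-in for the real planner,
    which also provides collision-avoidance guarantees)."""
    dx, dy = goal_xy[0] - robot_xy[0], goal_xy[1] - robot_xy[1]
    norm = max((dx * dx + dy * dy) ** 0.5, 1e-6)
    return robot_xy[0] + step * dx / norm, robot_xy[1] + step * dy / norm

semantic_map = SemanticMap()
robot = (0.0, 0.0)
for _ in range(5):                                   # reactive loop
    detections = [("chair", (1.0, 1.0))]             # stand-in for a detector
    semantic_map.update(detections)
    goal = semantic_map.objects["chair"][-1]         # semantic goal: reach a chair
    robot = plan_step(robot, goal)
print("final robot position:", robot)
```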
Multimodal Attention Networks for Low-Level Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination. The goal gets even harder as the actions available to the agent get simpler and move towards low-level, atomic interactions with the environment. This setting takes the name of low-level VLN. In this paper, we strive to create an agent able to tackle three key issues: multi-modality, long-term dependencies, and adaptability towards different locomotive settings. To that end, we devise "Perceive, Transform, and Act" (PTA): a fully-attentive VLN architecture that leaves the recurrent approach behind and is the first Transformer-like architecture to incorporate three different modalities -- natural language, images, and low-level actions -- for agent control. In particular, we adopt an early fusion strategy to merge linguistic and visual information efficiently in our encoder. We then refine the decoding phase with a late fusion extension between the agent's history of actions and the perceptual modalities. We experimentally validate our model on two datasets: PTA achieves promising results in low-level VLN on R2R and good performance on the recently proposed R4R benchmark. Our code is publicly available at https://github.com/aimagelab/perceive-transform-and-act
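The early-fusion encoder and late-fusion decoding described above can be sketched with standard PyTorch Transformer blocks. Dimensions, feature extractors, and the action vocabulary below are illustrative assumptions, not the exact PTA model.

```python
# Sketch of early fusion (language + vision in the encoder) and late fusion
# (action history attends to the fused memory in the decoder). Sizes are
# placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class EarlyLateFusionVLN(nn.Module):
    def __init__(self, vocab=1000, n_actions=6, d=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, d)
        self.img_proj = nn.Linear(2048, d)          # project image-region features to d
        self.act_emb = nn.Embedding(n_actions, d)   # history of low-level actions
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.policy = nn.Linear(d, n_actions)

    def forward(self, words, img_feats, past_actions):
        # Early fusion: concatenate language and visual tokens before encoding.
        fused = torch.cat([self.word_emb(words), self.img_proj(img_feats)], dim=1)
        memory = self.encoder(fused)
        # Late fusion: the action history attends to the fused perceptual memory.
        hist = self.decoder(self.act_emb(past_actions), memory)
        return self.policy(hist[:, -1])              # logits for the next action

model = EarlyLateFusionVLN()
logits = model(
    torch.randint(0, 1000, (1, 12)),   # instruction tokens
    torch.randn(1, 8, 2048),           # image-region features
    torch.randint(0, 6, (1, 4)),       # previous low-level actions
)
print(logits.shape)  # torch.Size([1, 6])
```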
Occupancy Anticipation for Efficient Exploration and Navigation
State-of-the-art navigation methods leverage a spatial memory to generalize
to new environments, but their occupancy maps are limited to capturing the
geometric structures directly observed by the agent. We propose occupancy
anticipation, where the agent uses its egocentric RGB-D observations to infer
the occupancy state beyond the visible regions. In doing so, the agent builds
its spatial awareness more rapidly, which facilitates efficient exploration and
navigation in 3D environments. By exploiting context in both the egocentric
views and top-down maps, our model successfully anticipates a broader map of the
environment, with performance significantly better than strong baselines.
Furthermore, when deployed for the sequential decision-making tasks of
exploration and navigation, our model outperforms state-of-the-art methods on
the Gibson and Matterport3D datasets. Our approach is the winning entry in the
2020 Habitat PointNav Challenge. Project page:
http://vision.cs.utexas.edu/projects/occupancy_anticipation/
Comment: Accepted in ECCV 2020. 19 pages, 6 figures, appendix at end.
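A minimal sketch of the anticipation idea: an encoder-decoder that takes a partially observed egocentric top-down map and predicts occupancy beyond the visible region. Channel counts and the plain convolutional architecture are assumptions; the paper's model additionally exploits egocentric RGB-D view features.

```python
# Sketch of occupancy anticipation as map completion. The 2-channel map
# convention (occupied, explored) and the layer sizes are illustrative.
import torch
import torch.nn as nn

class OccupancyAnticipator(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: 2-channel local map (occupied, explored) from depth projection.
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),
        )

    def forward(self, partial_map):                  # (B, 2, V, V) local map
        # Output: anticipated (occupied, explored) probabilities, incl. unseen areas.
        return torch.sigmoid(self.decoder(self.encoder(partial_map)))

model = OccupancyAnticipator()
anticipated = model(torch.rand(1, 2, 128, 128))
print(anticipated.shape)  # torch.Size([1, 2, 128, 128])
```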