Deep Object-Centric Representations for Generalizable Robot Learning
Robotic manipulation in complex open-world scenarios requires both reliable
physical manipulation skills and effective and generalizable perception. In
this paper, we propose a method in which general-purpose pretrained visual models
serve as an object-centric prior for the perception system of a learned policy.
We devise an object-level attentional mechanism that can be used to determine
relevant objects from a few trajectories or demonstrations, and then
immediately incorporate those objects into a learned policy. A task-independent
meta-attention locates possible objects in the scene, and a task-specific
attention identifies which objects are predictive of the trajectories. The
scope of the task-specific attention is easily adjusted by showing
demonstrations with distractor objects or with diverse relevant objects. Our
results indicate that this approach exhibits good generalization across object
instances using very few samples, and can be used to learn a variety of
manipulation tasks using reinforcement learning.
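To make the two-stage attention concrete, here is a minimal PyTorch sketch. The class name, feature dimension, and pooling scheme are illustrative assumptions; the abstract does not specify the architecture.

```python
# Hedged sketch: a task-specific attention that pools candidate object
# features by learned relevance. Names and dimensions are assumptions.
import torch
import torch.nn as nn

class TaskSpecificAttention(nn.Module):
    """Weights candidate object features by task relevance and pools them."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Learned query standing in for "which objects are predictive
        # of the trajectories".
        self.task_query = nn.Parameter(torch.randn(feat_dim))

    def forward(self, object_feats: torch.Tensor) -> torch.Tensor:
        # object_feats: (num_objects, feat_dim), e.g., crops proposed by a
        # task-independent meta-attention over a pretrained visual model.
        scores = object_feats @ self.task_query           # (num_objects,)
        weights = torch.softmax(scores, dim=0)            # relevance weights
        return (weights.unsqueeze(-1) * object_feats).sum(dim=0)

# The pooled feature would feed the learned policy network.
pooled = TaskSpecificAttention()(torch.randn(5, 256))    # 5 candidate objects
```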
Learning Environment-Aware Affordance for 3D Articulated Object Manipulation under Occlusions
Perceiving and manipulating 3D articulated objects in diverse environments is
essential for home-assistant robots. Recent studies have shown that point-level
affordance provides actionable priors for downstream manipulation tasks.
However, existing works primarily focus on single-object scenarios with
homogeneous agents, overlooking the realistic constraints imposed by the
environment and the agent's morphology, e.g., occlusions and physical
limitations. In this paper, we propose an environment-aware affordance
framework that incorporates both object-level actionable priors and environment
constraints. Unlike object-centric affordance approaches, learning
environment-aware affordance faces the challenge of combinatorial explosion due
to the complexity of various occlusions, characterized by their quantities,
geometries, positions and poses. To address this and enhance data efficiency,
we introduce a novel contrastive affordance learning framework capable of
training on scenes containing a single occluder and generalizing to scenes with
complex occluder combinations. Experiments demonstrate the effectiveness of our
proposed approach in learning affordance considering environment constraints.
Project page at https://chengkaiacademycity.github.io/EnvAwareAfford/
Comment: In 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
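As a rough illustration of the contrastive training idea (train on single-occluder scenes, generalize to occluder combinations), here is a hedged InfoNCE-style sketch in PyTorch. The pairing scheme, temperature, and tensor shapes are assumptions, not the paper's exact loss.

```python
# Hedged sketch of contrastive affordance learning under occlusion.
import torch
import torch.nn.functional as F

def contrastive_affordance_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss: a point's affordance embedding computed in a
    single-occluder scene (positive) is pulled toward its embedding in
    the occluder-free scene (anchor) and pushed away from negatives."""
    anchor = F.normalize(anchor, dim=-1)        # (D,)
    positive = F.normalize(positive, dim=-1)    # (D,)
    negatives = F.normalize(negatives, dim=-1)  # (N, D)
    logits = torch.cat([(anchor * positive).sum().view(1),
                        negatives @ anchor]) / tau
    # The positive pair sits at index 0 of the logits.
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.zeros(1, dtype=torch.long))

loss = contrastive_affordance_loss(torch.randn(64), torch.randn(64),
                                   torch.randn(16, 64))
```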
GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields
It is a long-standing problem in robotics to develop agents capable of
executing diverse manipulation tasks from visual observations in unstructured
real-world environments. To achieve this goal, the robot needs to have a
comprehensive understanding of the 3D structure and semantics of the scene. In
this work, we present GNFactor, a visual behavior cloning agent for
multi-task robotic manipulation with Generalizable Neural Feature
Fields. GNFactor jointly optimizes a generalizable neural
field (GNF) as a reconstruction module and a Perceiver Transformer as a
decision-making module, leveraging a shared deep 3D voxel representation. To
incorporate semantics in 3D, the reconstruction module utilizes a
vision-language foundation model (e.g., Stable Diffusion) to distill
rich semantic information into the deep 3D voxel. We evaluate GNFactor on 3
real robot tasks and perform detailed ablations on 10 RLBench tasks with a
limited number of demonstrations. We observe a substantial improvement of
GNFactor over current state-of-the-art methods in seen and unseen tasks,
demonstrating the strong generalization ability of GNFactor. Our project
website is https://yanjieze.com/GNFactor/ .
Comment: CoRL 2023 Oral.
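The core design is a shared voxel feature consumed by both a reconstruction head and a policy head, trained under a joint objective. Below is a hedged PyTorch sketch; the layer sizes, the 10-channel voxel input, the 0.01 reconstruction weight, and the placeholder targets are all assumptions, and the real GNF renders views via volume rendering and distills Stable Diffusion features, which is elided here.

```python
# Hedged sketch of GNFactor's joint optimization over a shared voxel feature.
import torch
import torch.nn as nn

class GNFactorSketch(nn.Module):
    """Shared voxel encoder feeding a reconstruction head and a policy head."""
    def __init__(self, in_ch=10, voxel_dim=64, action_dim=8):
        super().__init__()
        self.voxel_encoder = nn.Sequential(
            nn.Conv3d(in_ch, voxel_dim, 3, padding=1), nn.ReLU())
        self.recon_head = nn.Conv3d(voxel_dim, 4, 1)   # stands in for the GNF
        self.policy_head = nn.Sequential(nn.Flatten(), nn.LazyLinear(action_dim))

    def forward(self, voxels):
        shared = self.voxel_encoder(voxels)   # the shared deep 3D voxel feature
        return self.policy_head(shared), self.recon_head(shared)

model = GNFactorSketch()
action_logits, recon = model(torch.randn(1, 10, 16, 16, 16))
# Behavior-cloning term (placeholder target) plus weighted reconstruction term.
loss = nn.functional.mse_loss(action_logits, torch.zeros_like(action_logits)) \
     + 0.01 * recon.pow(2).mean()
loss.backward()
```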
Efficient Representations of Object Geometry for Reinforcement Learning of Interactive Grasping Policies
Grasping objects of different shapes and sizes - a foundational, effortless
skill for humans - remains a challenging task in robotics. Although model-based
approaches can predict stable grasp configurations for known object models,
they struggle to generalize to novel objects and often operate in a
non-interactive open-loop manner. In this work, we present a reinforcement
learning framework that learns the interactive grasping of various
geometrically distinct real-world objects by continuously controlling an
anthropomorphic robotic hand. We explore several explicit representations of
object geometry as input to the policy. Moreover, we propose to inform the
policy implicitly through signed distances and show that this is naturally
suited to guide the search through a shaped reward component. Finally, we
demonstrate that the proposed framework is able to learn even in more
challenging conditions, such as targeted grasping from a cluttered bin.
Necessary pre-grasping behaviors such as object reorientation and utilization
of environmental constraints emerge in this case. Videos of learned interactive
policies are available at
https://maltemosbach.github.io/geometry_aware_grasping_policies
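To illustrate how signed distances can shape the reward, here is a hedged NumPy sketch. The clipping thresholds, weights, and sphere SDF are invented for illustration and are not the paper's reward.

```python
# Hedged sketch: reward shaping from a signed distance function (SDF).
import numpy as np

def shaped_reward(fingertip_positions, object_sdf, grasp_success: bool) -> float:
    """Pull fingertips toward (but not into) the object surface, as
    measured by an SDF, plus a sparse success bonus."""
    dists = np.array([object_sdf(p) for p in fingertip_positions])
    proximity = -np.clip(dists, 0.0, 0.2).mean()       # closer is better
    penetration = -np.clip(-dists, 0.0, None).sum()    # penalize interpenetration
    return proximity + 0.5 * penetration + (10.0 if grasp_success else 0.0)

# Example with a sphere SDF of radius 5 cm centered at the origin.
sphere_sdf = lambda p: np.linalg.norm(p) - 0.05
r = shaped_reward([np.array([0.06, 0.0, 0.0]),
                   np.array([0.0, 0.07, 0.0])], sphere_sdf, False)
```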
The Treachery of Images: Bayesian Scene Keypoints for Deep Policy Learning in Robotic Manipulation
In policy learning for robotic manipulation, sample efficiency is of
paramount importance. Thus, learning and extracting more compact
representations from camera observations is a promising avenue. However,
current methods often assume full observability of the scene and struggle with
scale invariance. In many tasks and settings, this assumption does not hold as
objects in the scene are often occluded or lie outside the field of view of the
camera, rendering the camera observation ambiguous with regard to their
location. To tackle this problem, we present BASK, a Bayesian approach to
tracking scale-invariant keypoints over time. Our approach successfully
resolves inherent ambiguities in images, enabling keypoint tracking on
symmetrical objects and occluded and out-of-view objects. We employ our method
to learn challenging multi-object robot manipulation tasks from wrist camera
observations and demonstrate superior utility for policy learning compared to
other representation learning techniques. Furthermore, we show outstanding
robustness towards disturbances such as clutter, occlusions, and noisy depth
measurements, as well as generalization to unseen objects both in simulation
and real-world robotic experiments.
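The Bayesian core of such keypoint tracking is a recursive belief update over candidate locations, which lets ambiguity persist while a keypoint is occluded or out of view. Below is a minimal sketch over a discrete grid; the grid, the random stand-in likelihood, and the update form are illustrative assumptions, not BASK's actual formulation.

```python
# Hedged sketch: recursive Bayesian belief over candidate keypoint locations.
import numpy as np

def bayes_update(prior: np.ndarray, likelihood: np.ndarray) -> np.ndarray:
    """Fuse a per-cell observation likelihood (e.g., descriptor similarity
    from the current wrist-camera frame) into the posterior."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

belief = np.full(100, 1.0 / 100)        # uniform prior over 100 grid cells
for _ in range(5):                      # each frame sharpens the belief
    likelihood = np.random.rand(100)    # stand-in for descriptor matching
    belief = bayes_update(belief, likelihood)
keypoint_cell = belief.argmax()         # most probable keypoint location
```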