2,655 research outputs found
Time-Contrastive Networks: Self-Supervised Learning from Video
We propose a self-supervised approach for learning representations and
robotic behaviors entirely from unlabeled videos recorded from multiple
viewpoints, and study how this representation can be used in two robotic
imitation settings: imitating object interactions from videos of humans, and
imitating human poses. Imitation of human behavior requires a
viewpoint-invariant representation that captures the relationships between
end-effectors (hands or robot grippers) and the environment, object attributes,
and body pose. We train our representations using a metric learning loss, where
multiple simultaneous viewpoints of the same observation are attracted in the
embedding space, while being repelled from temporal neighbors which are often
visually similar but functionally different. In other words, the model
simultaneously learns to recognize what is common between different-looking
images, and what is different between similar-looking images. This signal
causes our model to discover attributes that do not change across viewpoint,
but do change across time, while ignoring nuisance variables such as
occlusions, motion blur, lighting and background. We demonstrate that this
representation can be used by a robot to directly mimic human poses without an
explicit correspondence, and that it can be used as a reward function within a
reinforcement learning algorithm. While representations are learned from an
unlabeled collection of task-related videos, robot behaviors such as pouring
are learned by watching a single 3rd-person demonstration by a human. Reward
functions obtained by following the human demonstrations under the learned
representation enable efficient reinforcement learning that is practical for
real-world robotic systems. Video results, open-source code and dataset are
available at https://sermanet.github.io/imitat
Motor control and strategy discovery for physically simulated characters
In physics-based character animation, motions are realized through control of simulated characters along with their interactions with the virtual environment. In this thesis, we study the problem of character control on two levels: joint-level motor control which transforms control signals to joint torques, and high-level motion control which outputs joint-level control signals given the current state of the character and the environment and the task objective. We propose a Modified Articulated-Body Algorithm (MABA) which achieves stable proportional-derivative (PD) low-level motor control with superior theoretical time complexity, practical efficiency and stability than prior implementations. We further propose a high-level motion control framework based on deep reinforcement learning (DRL) which enables the discovery of appropriate motion strategies without human demonstrations to complete a task objective. To facilitate the learning of realistic human motions, we propose a Pose Variational Autoencoder (P-VAE) to constrain the DRL actions to a subspace of natural poses. Our learning framework can be further combined with a sample-efficient Bayesian Diversity Search (BDS) algorithm and novel policy seeking to discover diverse strategies for tasks with multiple modes, such as various athletic jumping tasks
ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes
Understanding the continuous states of objects is essential for task learning
and planning in the real world. However, most existing task learning benchmarks
assume discrete(e.g., binary) object goal states, which poses challenges for
the learning of complex tasks and transferring learned policy from simulated
environments to the real world. Furthermore, state discretization limits a
robot's ability to follow human instructions based on the grounding of actions
and states. To tackle these challenges, we present ARNOLD, a benchmark that
evaluates language-grounded task learning with continuous states in realistic
3D scenes. ARNOLD is comprised of 8 language-conditioned tasks that involve
understanding object states and learning policies for continuous goals. To
promote language-instructed learning, we provide expert demonstrations with
template-generated language descriptions. We assess task performance by
utilizing the latest language-conditioned policy learning models. Our results
indicate that current models for language-conditioned manipulations continue to
experience significant challenges in novel goal-state generalizations, scene
generalizations, and object generalizations. These findings highlight the need
to develop new algorithms that address this gap and underscore the potential
for further research in this area. See our project page at:
https://arnold-benchmark.github.ioComment: The first two authors contributed equally; 20 pages; 17 figures;
project availalbe: https://arnold-benchmark.github.io
Motion Synthesis and Control for Autonomous Agents using Generative Models and Reinforcement Learning
Imitating and predicting human motions have wide applications in both graphics and robotics, from developing realistic models of human movement and behavior in immersive virtual worlds and games to improving autonomous navigation for service agents deployed in the real world. Traditional approaches for motion imitation and prediction typically rely on pre-defined rules to model agent behaviors or use reinforcement learning with manually designed reward functions. Despite impressive results, such approaches cannot effectively capture the diversity of motor behaviors and the decision making capabilities of human beings. Furthermore, manually designing a model or reward function to explicitly describe human motion characteristics often involves laborious fine-tuning and repeated experiments, and may suffer from generalization issues. In this thesis, we explore data-driven approaches using generative models and reinforcement learning to study and simulate human motions. Specifically, we begin with motion synthesis and control of physically simulated agents imitating a wide range of human motor skills, and then focus on improving the local navigation decisions of autonomous agents in multi-agent interaction settings. For physics-based agent control, we introduce an imitation learning framework built upon generative adversarial networks and reinforcement learning that enables humanoid agents to learn motor skills from a few examples of human reference motion data. Our approach generates high-fidelity motions and robust controllers without needing to manually design and finetune a reward function, allowing at the same time interactive switching between different controllers based on user input. Based on this framework, we further propose a multi-objective learning scheme for composite and task-driven control of humanoid agents. Our multi-objective learning scheme balances the simultaneous learning of disparate motions from multiple reference sources and multiple goal-directed control objectives in an adaptive way, enabling the training of efficient composite motion controllers. Additionally, we present a general framework for fast and robust learning of motor control skills. Our framework exploits particle filtering to dynamically explore and discretize the high-dimensional action space involved in continuous control tasks, and provides a multi-modal policy as a substitute for the commonly used Gaussian policies. For navigation learning, we leverage human crowd data to train a human-inspired collision avoidance policy by combining knowledge distillation and reinforcement learning. Our approach enables autonomous agents to take human-like actions during goal-directed steering in fully decentralized, multi-agent environments. To inform better control in such environments, we propose SocialVAE, a variational autoencoder based architecture that uses timewise latent variables with socially-aware conditions and a backward posterior approximation to perform agent trajectory prediction. Our approach improves current state-of-the-art performance on trajectory prediction tasks in daily human interaction scenarios and more complex scenes involving interactions between NBA players. We further extend SocialVAE by exploiting semantic maps as context conditions to generate map-compliant trajectory prediction. Our approach processes context conditions and social conditions occurring during agent-agent interactions in an integrated manner through the use of a dual-attention mechanism. We demonstrate the real-time performance of our approach and its ability to provide high-fidelity, multi-modal predictions on various large-scale vehicle trajectory prediction tasks
Object-Independent Human-to-Robot Handovers using Real Time Robotic Vision
We present an approach for safe and object-independent human-to-robot
handovers using real time robotic vision and manipulation. We aim for general
applicability with a generic object detector, a fast grasp selection algorithm
and by using a single gripper-mounted RGB-D camera, hence not relying on
external sensors. The robot is controlled via visual servoing towards the
object of interest. Putting a high emphasis on safety, we use two perception
modules: human body part segmentation and hand/finger segmentation. Pixels that
are deemed to belong to the human are filtered out from candidate grasp poses,
hence ensuring that the robot safely picks the object without colliding with
the human partner. The grasp selection and perception modules run concurrently
in real-time, which allows monitoring of the progress. In experiments with 13
objects, the robot was able to successfully take the object from the human in
81.9% of the trials.Comment: IEEE Robotics and Automation Letters (RA-L). Preprint Version.
Accepted September, 2020. The code and videos can be found at
https://patrosat.github.io/h2r_handovers
- …