VPE: Variational Policy Embedding for Transfer Reinforcement Learning
Reinforcement Learning methods are capable of solving complex problems, but the resulting policies might perform poorly in environments that are even slightly different. In robotics especially, training and deployment conditions often vary and data collection is expensive, making retraining undesirable. Training in simulation keeps training times feasible, but suffers from a reality gap when the policy is deployed in real-world settings. This raises the need for efficient adaptation of policies to new environments. We frame this as a problem of transferring knowledge within a family of similar Markov decision processes.
For this purpose we assume that Q-functions are generated by some low-dimensional latent variable. Given such a Q-function, we can find a master policy that adapts to different values of this latent variable. Our method learns both the generative mapping and an approximate posterior over the latent variables, enabling identification of policies for new tasks by searching only in the latent space rather than in the space of all policies. The low-dimensional space and the master policy found by our method enable policies to adapt quickly to new environments. We demonstrate the method on a pendulum swing-up task in simulation and on simulation-to-real transfer of a pushing task.
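The key computational idea, adapting by searching only the low-dimensional latent space, can be illustrated with a minimal sketch. Here a hypothetical episodic_return(z) stands in for rolling out the learned master policy conditioned on latent z in the new environment, and a simple cross-entropy search replaces the paper's approximate posterior inference; the dimensions and the synthetic return are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def episodic_return(z, z_star=np.array([0.5, -0.2])):
        # Hypothetical stand-in for rolling out the master policy conditioned
        # on latent z in the new environment; z_star is the unknown task optimum.
        return -np.sum((z - z_star) ** 2) + 0.01 * rng.normal()

    def adapt_in_latent_space(return_fn, dim=2, iters=30, pop=64, elite=8):
        # Cross-entropy search over the low-dimensional latent space only,
        # never over the full space of policy parameters.
        mu, sigma = np.zeros(dim), np.ones(dim)
        for _ in range(iters):
            zs = mu + sigma * rng.normal(size=(pop, dim))
            scores = np.array([return_fn(z) for z in zs])
            elites = zs[np.argsort(scores)[-elite:]]
            mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3
        return mu

    print("adapted latent:", adapt_in_latent_space(episodic_return))

Because the search space has only a handful of dimensions, even such a crude black-box search converges in a few dozen rollouts, which is the practical payoff of the latent parameterization.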
Global Search with Bernoulli Alternation Kernel for Task-oriented Grasping Informed by Simulation
We develop an approach that benefits from large simulated datasets and takes
full advantage of the limited online data that is most relevant. We propose a
variant of Bayesian optimization that alternates between using informed and
uninformed kernels. With this Bernoulli Alternation Kernel we ensure that
discrepancies between simulation and reality do not hinder adapting robot
control policies online. The proposed approach is applied to a challenging
real-world problem of task-oriented grasping with novel objects. Our further
contribution is a neural network architecture and training pipeline that use
experience from grasping objects in simulation to learn grasp stability scores.
We learn task scores from a labeled dataset with a convolutional network, which
is used to construct an informed kernel for our variant of Bayesian
optimization. Experiments on an ABB YuMi robot with real sensor data demonstrate the success of our approach, despite the challenge of fulfilling task requirements and high uncertainty over the physical properties of objects.

Comment: To appear in the 2nd Conference on Robot Learning (CoRL), 2018.
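A minimal sketch of the alternation idea, not the paper's exact formulation: at each Bayesian optimization step, a Bernoulli draw decides whether the Gaussian-process surrogate uses an uninformed RBF kernel or an informed kernel built on a simulation-learned feature map. The feature map phi, the toy objective f, and all hyperparameters below are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)

    def rbf(A, B, ls=0.2):
        # Uninformed squared-exponential kernel on raw inputs.
        d = A[:, None] - B[None, :]
        return np.exp(-0.5 * (d / ls) ** 2)

    # Hypothetical informed kernel: an RBF over a feature learned in
    # simulation (phi stands in for the network's task-score feature map).
    phi = lambda x: np.sin(3 * x)
    informed = lambda A, B: rbf(phi(A), phi(B), ls=0.5)

    def gp_posterior(K, k_star, k_ss, y, noise=1e-4):
        # Standard GP regression posterior mean and variance on a grid.
        Kinv = np.linalg.inv(K + noise * np.eye(len(y)))
        mu = k_star.T @ Kinv @ y
        var = k_ss - np.sum(k_star * (Kinv @ k_star), axis=0)
        return mu, np.maximum(var, 1e-12)

    f = lambda x: -(x - 0.6) ** 2 + 0.05 * np.sin(20 * x)  # toy objective
    X = rng.uniform(0, 1, 3); y = f(X)
    grid = np.linspace(0, 1, 200)

    for t in range(20):
        # Bernoulli alternation: draw which kernel the surrogate uses.
        k = informed if rng.random() < 0.5 else rbf
        mu, var = gp_posterior(k(X, X), k(X, grid), np.diag(k(grid, grid)), y)
        x_next = grid[np.argmax(mu + 2.0 * np.sqrt(var))]  # UCB acquisition
        X, y = np.append(X, x_next), np.append(y, f(x_next))

    print("best control parameter found:", X[np.argmax(y)])

The alternation is what protects the search: when the simulation-learned kernel is misleading on the real system, the uninformed rounds still explore, so the sim-to-real discrepancy cannot permanently bias the optimizer.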
Reinforcement Learning in Topology-based Representation for Human Body Movement with Whole Arm Manipulation
Moving a human body or a large and bulky object can require the strength of whole arm manipulation (WAM). This type of manipulation places the load on the robot's arms and relies on global properties of the interaction, rather than on local contacts such as grasping or non-prehensile pushing, to succeed. In this paper, we learn to generate motions that enable WAM for holding and transporting humans in certain rescue or patient care scenarios.
We model the task as a reinforcement learning problem in order to provide a
behavior that can directly respond to external perturbation and human motion.
For this, we represent global properties of the robot-human interaction with
topology-based coordinates that are computed from arm and torso positions.
These coordinates also allow transferring the learned policy to other body
shapes and sizes. For training and evaluation, we simulate a dynamic sea-rescue scenario and show in quantitative experiments that the policy can solve unseen scenarios with differently shaped humans, floating humans, or perception noise. Our qualitative experiments show that subsequent transporting after holding is achieved, and we demonstrate that the policy can be transferred directly to a real-world setting.

Comment: Submitted to RA-L with ICRA.
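One common family of topology-based coordinates is the writhe obtained from a discretized Gauss linking integral between two curves. The sketch below follows that common construction rather than the paper's exact definition (an assumption), computing pairwise segment contributions between an arm polyline and a torso polyline:

    import numpy as np

    def writhe_matrix(P, Q):
        # Discretized Gauss linking integral between two polylines.
        # P, Q: (n, 3) and (m, 3) arrays of 3-D points (e.g., robot-arm and
        # human-torso segment endpoints). Returns the (n-1, m-1) matrix of
        # pairwise segment contributions; its sum approximates the writhe.
        dP, dQ = np.diff(P, axis=0), np.diff(Q, axis=0)   # segment vectors
        mP = 0.5 * (P[:-1] + P[1:])                       # segment midpoints
        mQ = 0.5 * (Q[:-1] + Q[1:])
        W = np.zeros((len(dP), len(dQ)))
        for i, (ti, pi) in enumerate(zip(dP, mP)):
            for j, (tj, qj) in enumerate(zip(dQ, mQ)):
                r = pi - qj
                W[i, j] = np.cross(ti, tj) @ r / (np.linalg.norm(r) ** 3 + 1e-9)
        return W / (4 * np.pi)

    arm   = np.array([[0.0, 0.0, 0.0], [0.3, 0.0, 0.1], [0.6, 0.1, 0.2]])
    torso = np.array([[0.3, -0.2, 0.4], [0.3, 0.2, 0.4]])
    print(writhe_matrix(arm, torso))

Coordinates of this kind depend only on the relative geometry of the two curves, which is why a policy trained on them can transfer across body shapes and sizes.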
Probabilistic consolidation of grasp experience
We present a probabilistic model for the joint representation of several sensory modalities and action parameters in a robotic grasping scenario. Our non-linear probabilistic latent variable model encodes relationships between grasp-related parameters, learns the importance of features, and expresses confidence in its estimates. The model learns associations between stable and unstable grasps that it experiences during an exploration phase. We demonstrate the applicability of the model for estimating grasp stability, correcting grasps, identifying objects based on tactile imprints, and predicting tactile imprints from object-relative gripper poses. We performed experiments on a real platform with both known and novel objects, i.e., objects the robot trained with and previously unseen objects. Grasp correction had a 75% success rate on known objects and 73% on novel objects. We compared our model to a traditional regression model, which succeeded in correcting grasps in only 38% of cases.
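To make the cross-modal mechanism concrete, here is a deliberately simplified linear-Gaussian stand-in for the paper's non-linear latent variable model: all modalities share a low-dimensional latent, so observing one modality (a tactile imprint) lets us infer the latent and decode another (an object-relative gripper pose). The dimensions and weight matrices are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(2)

    # Each modality y is generated from a shared low-dimensional latent x
    # via y = W x + noise (a linear stand-in for the non-linear model).
    d_lat, d_tac, d_pose = 2, 8, 3
    W_tac  = rng.normal(size=(d_tac, d_lat))
    W_pose = rng.normal(size=(d_pose, d_lat))

    def infer_latent(y_tac, noise=0.1):
        # Posterior mean of the latent given only the tactile modality,
        # under a standard-normal prior on x.
        A = W_tac.T @ W_tac / noise**2 + np.eye(d_lat)
        return np.linalg.solve(A, W_tac.T @ y_tac / noise**2)

    def predict_pose(y_tac):
        # Cross-modal prediction: tactile imprint -> gripper pose.
        return W_pose @ infer_latent(y_tac)

    x_true = rng.normal(size=d_lat)
    y_tac = W_tac @ x_true + 0.1 * rng.normal(size=d_tac)
    print("true pose:     ", W_pose @ x_true)
    print("predicted pose:", predict_pose(y_tac))

Grasp correction, object identification, and imprint prediction all reduce to this same infer-then-decode pattern, just with different modalities held fixed.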
DataSP: A Differential All-to-All Shortest Path Algorithm for Learning Costs and Predicting Paths with Context
Learning the latent costs of transitions on graphs from trajectory demonstrations under various contextual features is challenging but useful for path planning. Existing methods, however, either oversimplify cost assumptions or scale poorly with the number of observed trajectories. This paper introduces DataSP, a differentiable all-to-all shortest path algorithm that facilitates learning latent costs from trajectories. It allows learning from a large number of trajectories in each learning step without additional computation. Complex latent cost functions of contextual features can be represented in the algorithm through a neural network approximation. We further propose a method to sample paths from DataSP in order to reconstruct and mimic the distribution of observed paths. We prove that the inferred distribution follows the maximum entropy principle. We show that DataSP outperforms state-of-the-art differentiable combinatorial solvers and classical machine learning approaches in predicting paths on graphs.
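The core trick of a differentiable all-to-all shortest path algorithm can be sketched by replacing the hard min in Floyd-Warshall with a smooth softmin, so gradients can flow from path costs back to a network that predicts the latent edge costs. This is a minimal illustration in that spirit, not DataSP's exact formulation:

    import numpy as np

    def softmin(a, b, tau=0.1):
        # Smooth, differentiable relaxation of min(a, b); the shift by m
        # keeps the exponentials numerically stable.
        m = np.minimum(a, b)
        return m - tau * np.log(np.exp(-(a - m) / tau) + np.exp(-(b - m) / tau))

    def soft_floyd_warshall(C, tau=0.1):
        # All-to-all soft shortest-path costs from an edge-cost matrix C.
        # Because softmin is smooth, derivatives of the output with respect
        # to the latent edge costs C are well defined everywhere.
        D = C.copy()
        for k in range(len(D)):
            D = softmin(D, D[:, [k]] + D[[k], :], tau)
        return D

    inf = 1e3  # large cost standing in for "no edge"
    C = np.array([[0.0, 1.0, inf],
                  [inf, 0.0, 1.0],
                  [5.0, inf, 0.0]])
    print(soft_floyd_warshall(C))

Since one pass produces costs for all source-target pairs at once, a whole batch of observed trajectories can supervise a single forward computation, which is where the claimed scalability comes from.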
Towards Interpretable Reinforcement Learning with Constrained Normalizing Flow Policies
Reinforcement learning policies are typically represented by black-box neural
networks, which are non-interpretable and not well-suited for safety-critical
domains. To address both of these issues, we propose constrained normalizing
flow policies as interpretable and safe-by-construction policy models. We
achieve safety for reinforcement learning problems with instantaneous safety
constraints, for which we can exploit domain knowledge by analytically
constructing a normalizing flow that ensures constraint satisfaction. The
normalizing flow corresponds to an interpretable sequence of transformations on
action samples, each ensuring alignment with respect to a particular
constraint. Our experiments reveal benefits beyond interpretability: an easier learning objective and constraint satisfaction maintained throughout the entire learning process. Our approach favors constraints over reward engineering, offering enhanced interpretability, safety, and a direct means of providing domain knowledge to the agent without relying on complex reward functions.
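The safe-by-construction idea can be sketched as follows: each analytically constructed, invertible transform maps unconstrained samples into a particular constraint set, so any action produced by the flow satisfies that constraint by construction. The two transforms below, a box squash and a radial ball squash, are illustrative choices rather than the paper's exact layers:

    import numpy as np

    def box_flow(u, low, high):
        # Invertible map from R^d into the box [low, high]: a tanh squash
        # followed by an affine rescaling, one bound constraint per dimension.
        return low + 0.5 * (np.tanh(u) + 1.0) * (high - low)

    def ball_flow(u, radius):
        # Invertible radial map from R^d into the ball ||a|| <= radius.
        n = np.linalg.norm(u) + 1e-9
        return u / n * radius * np.tanh(n)

    rng = np.random.default_rng(3)
    u = rng.normal(size=2)   # unconstrained sample from the base distribution
    a_box  = box_flow(u, np.array([-1.0, 0.0]), np.array([1.0, 0.5]))
    a_ball = ball_flow(u, radius=1.0)
    print("box-constrained action: ", a_box)
    print("ball-constrained action:", a_ball, "norm:", np.linalg.norm(a_ball))

Because each transform is invertible with a tractable Jacobian, the constrained policy remains a proper normalizing flow: one can both sample actions and evaluate their log-probabilities, as policy-gradient methods require.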
Towards Task-Prioritized Policy Composition
Combining learned policies in a prioritized, ordered manner is desirable
because it allows for modular design and facilitates data reuse through
knowledge transfer. In control theory, prioritized composition is realized by
null-space control, where low-priority control actions are projected into the
null-space of high-priority control actions. Such a method is currently
unavailable for Reinforcement Learning. We propose a task-prioritized composition framework for Reinforcement Learning built around a novel concept: the indifferent-space of Reinforcement Learning policies. Our framework has the potential to facilitate knowledge transfer and modular design while greatly increasing data efficiency and data reuse for Reinforcement Learning agents. Further, our approach can ensure high-priority constraint satisfaction, which makes it promising for learning in safety-critical domains such as robotics. Unlike null-space control, our approach allows learning globally optimal policies for the compound task by online learning in the indifferent-space of higher-level policies after initial compound policy construction.
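As an illustration of prioritized composition in an indifferent-space, using a hypothetical epsilon-threshold construction that is not necessarily the paper's definition: the low-priority policy chooses only among actions whose high-priority value is within epsilon of the maximum, the Reinforcement Learning analogue of projecting a secondary controller into the null-space of a primary one:

    import numpy as np

    actions = np.linspace(-1.0, 1.0, 201)      # discretized 1-D action space

    # Hypothetical learned Q-functions for the two tasks.
    q_high = lambda a: -np.abs(a - 0.2)        # high-priority task
    q_low  = lambda a: -(a - 0.8) ** 2         # low-priority task

    def compose(q_hi, q_lo, actions, eps=0.05):
        # The low-priority policy may only choose among actions the
        # high-priority policy is (near-)indifferent about: those whose
        # high-priority value is within eps of its maximum.
        q = q_hi(actions)
        indifferent = actions[q >= q.max() - eps]   # the indifferent-space
        return indifferent[np.argmax(q_lo(indifferent))]

    print("composed action:", compose(q_high, q_low, actions))

In this toy example, the high-priority task pins the action near 0.2, and the low-priority preference for 0.8 can only pull the choice to the edge of the indifferent set, so the priority ordering is respected by construction.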
