Self-Paced Absolute Learning Progress as a Regularized Approach to Curriculum Learning
The usability of Reinforcement Learning is restricted by the large
computation times it requires. Curriculum Reinforcement Learning speeds up
learning by defining a helpful order in which an agent encounters tasks, i.e.
from simple to hard. Curricula based on Absolute Learning Progress (ALP) have
proven successful in different environments, but waste computation on repeating
already learned behaviour in new tasks. We solve this problem by introducing a
new regularization method based on Self-Paced (Deep) Learning, called
Self-Paced Absolute Learning Progress (SPALP). We evaluate our method in three
different environments. Our method achieves performance comparable to original
ALP in all cases, and reaches it quicker than ALP in two of them. We illustrate
possibilities to further improve the efficiency and performance of SPALP.
Comment: 11 pages, 8 figures. The paper was the result of an Integrated Project at TU Darmstadt for which we received course credit (9 ECTS) and is not meant to be published elsewhere.
Reinforcement Learning Curricula as Interpolations between Task Distributions
In the last decade, the increased availability of powerful computing machinery has led to an increasingly widespread application of machine learning methods. Machine learning has been particularly successful when large models, typically neural networks with an ever-increasing number of parameters, can leverage vast data to make predictions.
While reinforcement learning (RL) has been no exception from this development, a distinguishing feature of RL is its well-known exploration-exploitation trade-off, whose optimal solution – while possible to model as a partially observable Markov decision process – evades computation in all but the simplest problems. Consequently, it seems unsurprising that notable demonstrations of reinforcement learning, such as an RL-based Go agent (AlphaGo) by DeepMind beating the professional Go player Lee Sedol, relied both on the availability of massive computing capabilities and specific forms of regularization that facilitate learning. In the case of AlphaGo, this regularization came in the form of self-play, enabling learning by interacting with gradually more proficient opponents.
In this thesis, we develop techniques that, similarly to the concept of self-play of AlphaGo, improve the learning performance of RL agents by training on sequences of increasingly complex tasks. These task sequences are typically called curricula and are known to side-step problems such as slow learning or convergence to poor behavior that may occur when directly learning in complicated tasks. The algorithms we develop in this thesis create curricula by minimizing distances or divergences between probability distributions of learning tasks, generating interpolations between an initial distribution of easy learning tasks and a target task distribution. Apart from improving the learning performance of RL agents in experiments, developing methods that realize curricula as interpolations between task distributions results in a nuanced picture of key aspects of successful reinforcement learning curricula.
In Chapter 1, we start this thesis by introducing required reinforcement learning notation and then motivating curriculum reinforcement learning from the perspective of continuation methods for non-linear optimization. Similar to curricula for reinforcement learning agents, continuation methods have been used in non-linear optimization to solve challenging optimization problems. This similarity provides an intuition about the effect of the curricula we aim to generate and their limits.
In Chapter 2, we transfer the concept of self-paced learning, initially proposed in the supervised learning community, to the problem of RL, showing that an automated curriculum generation for RL agents can be motivated by a regularized RL objective. This regularized RL objective implies generating a curriculum as a sequence of task distributions that trade off the expected agent performance against similarity to a specified distribution of target tasks. This view on curriculum RL contrasts existing approaches, as it motivates curricula via a regularized RL objective instead of generating them from a set of assumptions about an optimal curriculum. In experiments, we show that an approximate implementation of the aforementioned curriculum – that restricts the interpolating task distribution to a Gaussian – results in improved learning performance compared to regular reinforcement learning, matching or surpassing the performance of existing curriculum-based methods.
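The trade-off described above can be sketched numerically. The following is a minimal, illustrative sketch, not the thesis's algorithm: a Gaussian task distribution over a scalar context is updated by gradient ascent on a weighted sum of a toy expected-performance term and the negative KL divergence to the target task distribution. The performance model, constants, and annealing schedule are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the agent's current competence: performance is high on
# easy tasks near context c = 0 (the model and all constants are assumptions).
def expected_performance_grad(c):
    # Derivative of J(c) = exp(-c^2 / 2) with respect to c.
    return -c * np.exp(-0.5 * c ** 2)

mu_target, sigma_target = 4.0, 0.5   # target task distribution N(4, 0.5^2)

def curriculum_step(mu, sigma, alpha, lr=0.05, n_samples=512):
    """One self-paced update on the Gaussian task distribution N(mu, sigma^2):
    gradient ascent on  alpha * E[J(c)] - KL(N(mu, sigma^2) || target),
    with the variance kept fixed for brevity."""
    c = mu + sigma * rng.standard_normal(n_samples)      # reparameterized samples
    grad_mu = (alpha * expected_performance_grad(c).mean()
               - (mu - mu_target) / sigma_target ** 2)   # exact KL gradient in mu
    return mu + lr * grad_mu

mu, sigma = 0.0, 1.0                  # start on the easy tasks
for k in range(200):
    alpha = max(0.0, 1.0 - k / 100)   # anneal the performance weight toward 0
    mu = curriculum_step(mu, sigma, alpha)
print(round(mu, 2))                   # prints 4.0
```

As the performance weight is annealed, the KL term dominates and the task distribution interpolates from the easy region toward the target mean, which is the qualitative behavior the regularized objective is meant to produce.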
Subsequently, Chapter 3 builds upon the intuition of curricula as sequences of interpolating task distributions established in Chapter 2. Motivated by using more flexible task distribution representations, we show how parametric assumptions play a crucial role in the empirical success of the previous approach and subsequently uncover key ingredients that enable the generation of meaningful curricula without assuming a parametric model of the task distributions. One major ingredient is an explicit notion of task similarity via a distance function between two Markov decision processes. We turn to optimal transport theory, which allows for flexible particle-based representations of the task distributions while properly considering the newly introduced metric structure of the task space. Combined with other improvements to our first method, such as a more aggressive restriction of the curriculum to tasks that are not too hard for the agent, the resulting approach delivers consistently high learning performance in multiple experiments.
In the final Chapter 4, we apply the refined method of Chapter 3 to a trajectory-tracking task, in which we task an RL agent to follow a three-dimensional reference trajectory with the tip of an inverted pendulum mounted on a Barrett Whole Arm Manipulator. The access to only positional information results in a partially observable system that, paired with its inherent instability, underactuation, and non-trivial kinematic structure, presents a challenge for modern reinforcement learning algorithms, which we tackle via curricula. The technically infinite-dimensional task space of target trajectories allows us to probe the developed curriculum learning method for flaws that have not surfaced in the rather low-dimensional experiments of the previous chapters. Through an improved optimization scheme that better respects the non-Euclidean structure of target trajectories, we reliably generate curricula of trajectories to be tracked, resulting in faster and more robust learning compared to an RL baseline that does not exploit this form of structured learning. The learned policy matches the performance of an optimal control baseline on the real system, demonstrating the potential of curriculum RL to learn state estimation and control for non-linear tracking tasks jointly.
In summary, this thesis introduces a perspective on reinforcement learning curricula as interpolations between task distributions. The methods developed under this perspective enjoy a precise formulation as optimization problems and deliver empirical benefits throughout experiments. Building upon this precise formulation may allow future work to advance the formal understanding of reinforcement learning curricula and, with that, enable the solution of challenging decision-making and control problems with reinforcement learning.
Generalization and Transferability in Reinforcement Learning
Reinforcement learning has proven capable of extending the applicability of machine learning to domains in which
knowledge cannot be acquired from labeled examples but only via trial-and-error. Being able to solve problems with such
characteristics is a crucial requirement for autonomous agents that can accomplish tasks without human intervention.
However, most reinforcement learning algorithms are designed to solve exactly one task, not offering means to systematically
reuse previous knowledge acquired in other problems. Motivated by insights from homotopic continuation methods,
in this work we investigate approaches based on optimization- and concurrent systems theory to gain an understanding
of conceptual and technical challenges of knowledge transfer in reinforcement learning domains. Building upon these
findings, we present an algorithm based on contextual relative entropy policy search that allows an agent to generate
a structured sequence of learning tasks that guide its learning towards a target distribution of tasks by giving it control
over an otherwise hidden context distribution. The presented algorithm is evaluated on a number of robotic tasks, in
which a desired system state needs to be reached, demonstrating that the proposed learning scheme helps to increase
and stabilize learning performance.
Variational Hierarchical Mixtures for Learning Probabilistic Inverse Dynamics
Well-calibrated probabilistic regression models are a crucial learning
component in robotics applications as datasets grow rapidly and tasks become
more complex. Classical regression models are usually either probabilistic
kernel machines with a flexible structure that does not scale gracefully with
data or deterministic and vastly scalable automata, albeit with a restrictive
parametric form and poor regularization. In this paper, we consider a
probabilistic hierarchical modeling paradigm that combines the benefits of both
worlds to deliver computationally efficient representations with inherent
complexity regularization. The presented approaches are probabilistic
interpretations of local regression techniques that approximate nonlinear
functions through a set of local linear or polynomial units. Importantly, we
rely on principles from Bayesian nonparametrics to formulate flexible models
that adapt their complexity to the data and can potentially encompass an
infinite number of components. We derive two efficient variational inference
techniques to learn these representations and highlight the advantages of
hierarchical infinite local regression models, such as dealing with non-smooth
functions, mitigating catastrophic forgetting, and enabling parameter sharing
and fast predictions. Finally, we validate this approach on a set of large
inverse dynamics datasets and test the learned models in real-world control
scenarios.
Comment: arXiv admin note: text overlap with arXiv:2011.0521
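The idea of approximating a nonlinear function with a set of local linear units can be sketched as follows. This is an illustrative toy with fixed Gaussian gates fitted by weighted least squares, not the paper's Bayesian nonparametric model or its variational inference; all names, centers, and constants are assumptions made for the example.

```python
import numpy as np

# Toy data: a smooth nonlinear function to approximate.
x = np.linspace(-3.0, 3.0, 200)
y = np.sin(x)

# K local linear units with fixed Gaussian gates (centers and bandwidth are
# hand-picked here; the paper instead adapts the number of units to the data).
centers = np.linspace(-3.0, 3.0, 6)
bandwidth = 0.8

def features(x):
    # Design matrix [1, x] for the first-order (linear) local units.
    return np.stack([np.ones_like(x), x], axis=1)

def gates(x):
    # Normalized Gaussian responsibilities of each unit for each input.
    w = np.exp(-0.5 * ((x[:, None] - centers) / bandwidth) ** 2)
    return w / w.sum(axis=1, keepdims=True)

# Fit every unit by responsibility-weighted least squares.
Phi, R = features(x), gates(x)
params = np.stack([
    np.linalg.solve(Phi.T @ (R[:, k:k + 1] * Phi), Phi.T @ (R[:, k] * y))
    for k in range(len(centers))
])  # shape (K, 2): intercept and slope per unit

def predict(x):
    # Blend each unit's linear prediction with its gate weight.
    return (gates(x) * (features(x) @ params.T)).sum(axis=1)

mean_abs_err = float(np.abs(predict(x) - y).mean())
print(round(mean_abs_err, 3))   # small average error on the toy data
```

Even this crude version captures the nonlinearity with a handful of linear units and supports fast predictions, which is the structural property the hierarchical mixture models above exploit.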
On the Benefit of Optimal Transport for Curriculum Reinforcement Learning
Curriculum reinforcement learning (CRL) allows solving complex tasks by
generating a tailored sequence of learning tasks, starting from easy ones and
subsequently increasing their difficulty. Although the potential of curricula
in RL has been clearly shown in various works, it is less clear how to generate
them for a given learning environment, resulting in various methods aiming to
automate this task. In this work, we focus on framing curricula as
interpolations between task distributions, which has previously been shown to
be a viable approach to CRL. Identifying key issues of existing methods, we
frame the generation of a curriculum as a constrained optimal transport problem
between task distributions. Benchmarks show that this way of curriculum
generation can improve upon existing CRL methods, yielding high performance in
various tasks with different characteristics.
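The framing of a curriculum as an interpolation between task distributions can be illustrated in one dimension, where the optimal transport map between two equally weighted particle sets reduces to sorting and matching particles by rank. The following minimal sketch assumes a scalar task parameter and hand-picked particle values for illustration; it is not the paper's constrained formulation or its benchmark setup.

```python
import numpy as np

# Hypothetical scalar task parameter (e.g., a goal distance). In 1-D the
# optimal transport map between equally weighted particle sets is given by
# sorting, so the Wasserstein interpolation moves each matched particle pair
# along a straight line.
easy_tasks = np.sort(np.array([0.1, 0.2, 0.3, 0.4]))   # initial distribution
hard_tasks = np.sort(np.array([2.0, 2.5, 3.0, 3.5]))   # target distribution

def ot_interpolation(t):
    """Particles of the Wasserstein geodesic at interpolation factor t in [0, 1]."""
    return (1.0 - t) * easy_tasks + t * hard_tasks

# A five-stage curriculum from the easy distribution to the target one.
curriculum = [ot_interpolation(t) for t in np.linspace(0.0, 1.0, 5)]
print(curriculum[2])   # → [1.05 1.35 1.65 1.95]
```

In higher dimensions the sorting map no longer applies and the transport plan must be computed explicitly, which is where the constrained optimal transport formulation described above comes in.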
Tracking Control for a Spherical Pendulum via Curriculum Reinforcement Learning
Reinforcement Learning (RL) allows learning non-trivial robot control laws
purely from data. However, many successful applications of RL have relied on
ad-hoc regularizations, such as hand-crafted curricula, to regularize the
learning performance. In this paper, we pair a recent algorithm for
automatically building curricula with RL on massively parallelized simulations
to learn a tracking controller for a spherical pendulum on a robotic arm via
RL. Through an improved optimization scheme that better respects the
non-Euclidean task structure, we allow the method to reliably generate
curricula of trajectories to be tracked, resulting in faster and more robust
learning compared to an RL baseline that does not exploit this form of
structured learning. The learned policy matches the performance of an optimal
control baseline on the real system, demonstrating the potential of curriculum
RL to jointly learn state estimation and control for non-linear tracking tasks.