Understanding representation learning for deep reinforcement learning
Representation learning is essential to the practical success of reinforcement learning. Through a state representation, an agent can describe its environment to efficiently explore the state space, generalize to new states and perform credit assignment from delayed feedback. These representations may be state abstractions, hand-engineered or fixed features, or implied by a neural network. In this thesis, we investigate several desirable theoretical properties of state representations and, using this categorization, design novel principled RL algorithms aiming at learning these state representations at scale through deep learning.
First, we consider state abstractions induced by behavioral metrics and their generalization properties. We show that supporting the continuity of the value function is central to generalization in reinforcement learning. Together with this formalization, we provide an empirical evaluation comparing various metrics and demonstrating the importance of the choice of a neighborhood in RL algorithms.
Then, we draw on statistical learning theory to characterize what it means for arbitrary state features to generalize in RL. We introduce a new notion called effective dimension of a representation that drives the generalization to unseen states and demonstrate its usefulness for value-based deep reinforcement learning in Atari games.
The third contribution of this dissertation is a scalable algorithm to learn a state representation from a very large number of auxiliary tasks through deep learning. It is a stochastic gradient descent method to learn the principal components of a target matrix by means of a neural network from a handful of entries.
Finally, the last part presents our findings on how the state representation in reinforcement learning influences the quality of an agent’s predictions but is also shaped by these predictions. We provide a formal mathematical model for studying this phenomenon and show how these theoretical results can be leveraged to improve the quality of the learning process.
Metrics and continuity in reinforcement learning
In most practical applications of reinforcement learning, it is untenable to
maintain direct estimates for individual states; in continuous-state systems,
it is impossible. Instead, researchers often leverage state similarity (whether
explicitly or implicitly) to build models that can generalize well from a
limited set of samples. The notion of state similarity used, and the
neighbourhoods and topologies they induce, is thus of crucial importance, as it
will directly affect the performance of the algorithms. Indeed, a number of
recent works introduce algorithms assuming the existence of "well-behaved"
neighbourhoods, but leave the full specification of such topologies for future
work. In this paper we introduce a unified formalism for defining these
topologies through the lens of metrics. We establish a hierarchy amongst these
metrics and demonstrate their theoretical implications on the Markov Decision
Process specifying the reinforcement learning problem. We complement our
theoretical results with empirical evaluations showcasing the differences
between the metrics considered.
Comment: Accepted at AAAI 202
Continuous Hierarchical Representations with Poincaré Variational Auto-Encoders
The variational auto-encoder (VAE) is a popular method for learning a
generative model and embeddings of the data. Many real datasets are
hierarchically structured. However, traditional VAEs map data into a Euclidean
latent space, which cannot efficiently embed tree-like structures. Hyperbolic
spaces with negative curvature can. We therefore endow VAEs with a Poincaré
ball model of hyperbolic geometry as a latent space and rigorously derive the
necessary methods to work with two main Gaussian generalisations on that space.
We empirically show better generalisation to unseen data than the Euclidean
counterpart, and better qualitative and quantitative recovery of
hierarchical structures.
Comment: Advances in Neural Information Processing System
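To make the geometric claim concrete, here is a minimal sketch of the geodesic distance on the Poincaré ball. It assumes curvature fixed to -1 and uses the standard closed form d(x, y) = arcosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2))); the function name and the sample points are illustrative choices and are not taken from the paper.

```python
import math

def poincare_distance(x, y):
    """Geodesic distance between two points of the open unit ball (curvature -1)."""
    sq_norm_x = sum(a * a for a in x)
    sq_norm_y = sum(b * b for b in y)
    if sq_norm_x >= 1.0 or sq_norm_y >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_x) * (1.0 - sq_norm_y)))

# Illustrative points: one halfway to the boundary, one very close to it.
origin = [0.0, 0.0]
near = [0.5, 0.0]
far = [0.99, 0.0]

d_near = poincare_distance(origin, near)
d_far = poincare_distance(origin, far)
print(d_near, d_far)
```

Although the Euclidean distance from the origin to the second point is below 1, its hyperbolic distance is several times larger. Distances blow up near the boundary, which is what leaves room for the exponentially many leaves of a tree.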
Bootstrapped Representations in Reinforcement Learning
In reinforcement learning (RL), state representations are key to dealing with
large or continuous state spaces. While one of the promises of deep learning
algorithms is to automatically construct features well-tuned for the task they
try to solve, such a representation might not emerge from end-to-end training
of deep RL agents. To mitigate this issue, auxiliary objectives are often
incorporated into the learning process and help shape the learnt state
representation. Bootstrapping methods are today's method of choice to make
these additional predictions. Yet, it is unclear which features these
algorithms capture and how they relate to those from other auxiliary-task-based
approaches. In this paper, we address this gap and provide a theoretical
characterization of the state representation learnt by temporal difference
learning (Sutton, 1988). Surprisingly, we find that this representation differs
from the features learned by Monte Carlo and residual gradient algorithms for
most transition structures of the environment in the policy evaluation setting.
We describe the efficacy of these representations for policy evaluation, and
use our theoretical analysis to design new auxiliary learning rules. We
complement our theoretical results with an empirical comparison of these
learning rules for different cumulant functions on classic domains such as the
four-room domain (Sutton et al., 1999) and Mountain Car (Moore, 1990).
Comment: ICML 202
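As a point of reference for the two prediction rules named in the abstract, the sketch below runs expected temporal difference updates and a truncated Monte Carlo expectation on a hypothetical 3-state chain. The transition matrix, rewards, step size and function names are all assumptions made for illustration. In this tabular policy-evaluation setting both rules reach the same value function; the paper's result concerns the learnt representations, which this sketch does not attempt to reproduce.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical 3-state Markov chain induced by a fixed policy (illustrative only).
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
r = [0.0, 0.0, 1.0]
gamma = 0.9

def expected_td0(P, r, gamma, step_size=0.5, sweeps=1000):
    """Expected TD(0): v <- v + step_size * (r + gamma * P v - v)."""
    v = [0.0] * len(r)
    for _ in range(sweeps):
        next_v = matvec(P, v)
        v = [vi + step_size * (ri + gamma * nvi - vi)
             for vi, ri, nvi in zip(v, r, next_v)]
    return v

def truncated_monte_carlo(P, r, gamma, horizon=1000):
    """Expected discounted return sum_t gamma^t P^t r, truncated at `horizon`."""
    v = [0.0] * len(r)
    term = list(r)  # holds gamma^t P^t r for the current t
    for _ in range(horizon):
        v = [vi + ti for vi, ti in zip(v, term)]
        term = [gamma * x for x in matvec(P, term)]
    return v

v_td = expected_td0(P, r, gamma)
v_mc = truncated_monte_carlo(P, r, gamma)
print(v_td, v_mc)
```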
Human Alignment of Large Language Models through Online Preference Optimisation
Ensuring alignment of language models' outputs with human preferences is
critical to guarantee a useful, safe, and pleasant user experience. Thus, human
alignment has been extensively studied recently and several methods such as
Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimisation
(DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper,
our contribution is two-fold. First, we show the equivalence between two recent
alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror
Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD,
that leverages the regularised sampling approach proposed by Nash-MD.
This equivalence may seem surprising at first sight, since IPO is an offline
method whereas Nash-MD is an online method using a preference model. However,
this equivalence can be proven when we consider the online version of IPO, that
is, when both generations are sampled by the online policy and annotated by a
trained preference model. Optimising the IPO loss with such a stream of data
then becomes equivalent to finding the Nash equilibrium of the preference model
through self-play. Building on this equivalence, we introduce the IPO-MD
algorithm that generates data with a mixture policy (between the online and
reference policy) similarly to the general Nash-MD algorithm. We compare
online-IPO and IPO-MD to different online versions of existing losses on
preference data such as DPO and SLiC on a summarisation task.
Perfect Density Models Cannot Guarantee Anomaly Detection
Thanks to the tractability of their likelihood, several deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities through the lens of reparametrization and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for anomaly detection relies on strong and implicit hypotheses, and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.
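The reparametrization argument can be illustrated with a one-dimensional sketch. It assumes a standard normal variable X and the invertible map Y = Phi(X), where Phi is the normal CDF; the change-of-variables formula p_Y(y) = p_X(x) / |Phi'(x)| then gives a density of 1 everywhere on (0, 1). The function names and the two sample points are illustrative assumptions, not the paper's experimental setup.

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """CDF of the standard normal distribution; an invertible map onto (0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def density_after_cdf_map(x):
    """Density of Y = normal_cdf(X) at y = normal_cdf(x), by change of variables."""
    jacobian = normal_pdf(x)  # derivative of normal_cdf evaluated at x
    return normal_pdf(x) / abs(jacobian)

typical, outlier = 0.0, 4.0

# In the original coordinates the outlier has a far lower density ...
print(normal_pdf(typical), normal_pdf(outlier))
# ... but after the invertible map both points have the same density.
print(density_after_cdf_map(typical), density_after_cdf_map(outlier))
```

The same two points are ranked very differently by density depending on the coordinate system, even though the underlying distribution is known exactly. A density threshold therefore encodes an implicit choice of representation.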