Path Consistency Learning in Tsallis Entropy Regularized MDPs
We study the sparse entropy-regularized reinforcement learning (ERL) problem
in which the entropy term is a special form of the Tsallis entropy. The optimal
policy of this formulation is sparse, i.e., at each state, it has non-zero
probability for only a small number of actions. This addresses the main
drawback of the standard Shannon entropy-regularized RL (soft ERL) formulation,
in which the optimal policy is softmax and thus may assign a non-negligible
probability mass to non-optimal actions. This problem is aggravated as the
number of actions increases. In this paper, we follow the work of Nachum et
al. (2017) in the soft ERL setting, and propose a class of novel path
consistency learning (PCL) algorithms, called sparse PCL, for the sparse
ERL problem that can work with both on-policy and off-policy data. We first
derive a sparse consistency equation that specifies a relationship
between the optimal value function and policy of the sparse ERL problem along
any system trajectory. Crucially, a weak form of the converse is also true: we
quantify the sub-optimality of a policy that satisfies sparse consistency and
show that, as the number of actions increases, this sub-optimality is smaller
than that of the soft ERL optimal policy. We then use this result to derive the
sparse PCL algorithms. We empirically compare sparse PCL with its soft
counterpart, and show its advantage, especially in problems with a large number
of actions.
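To make the sparsity concrete: the Tsallis-entropy-regularized greedy policy takes the form of a sparsemax projection of action preferences onto the probability simplex, which assigns exactly zero mass to low-scoring actions. The NumPy sketch below computes sparsemax; it is an illustration of the mechanism, not the paper's sparse PCL algorithm itself.
```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector z onto the probability simplex.

    Illustrative only: the Tsallis-entropy-regularized greedy policy has this
    sparse form, assigning exactly zero probability to low-scoring actions,
    in contrast to softmax.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                # scores in decreasing order
    cssv = np.cumsum(z_sorted)                 # cumulative sums of sorted scores
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv          # actions that stay in the support
    k_max = k[support][-1]
    tau = (cssv[k_max - 1] - 1.0) / k_max      # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

# Only the top-scoring actions receive non-zero probability:
print(sparsemax([2.0, 1.9, 0.1, -1.0]))       # -> [0.55 0.45 0.   0.  ]
```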
The Laplacian in RL: Learning Representations with Efficient Approximations
The smallest eigenvectors of the graph Laplacian are well-known to provide a
succinct representation of the geometry of a weighted graph. In reinforcement
learning (RL), where the weighted graph may be interpreted as the state
transition process induced by a behavior policy acting on the environment,
approximating the eigenvectors of the Laplacian provides a promising approach
to state representation learning. However, existing methods for performing this
approximation are ill-suited in general RL settings for two main reasons:
First, they are computationally expensive, often requiring operations on large
matrices. Second, these methods lack adequate justification beyond simple,
tabular, finite-state settings. In this paper, we present a fully general and
scalable method for approximating the eigenvectors of the Laplacian in a
model-free RL context. We systematically evaluate our approach and empirically
show that it generalizes beyond the tabular, finite-state setting. Even in
the tabular, finite-state setting, it approximates the eigenvectors more
accurately than previous proposals. Finally, we show the potential benefits of
using a Laplacian representation learned with our method in goal-achieving RL
tasks, providing evidence that our technique can significantly improve the
performance of an RL agent.
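To give a flavor of such sample-based approximations, here is a minimal NumPy sketch of an objective of this general kind: an attractive term computed from sampled transitions (pulling embeddings of consecutive states together) plus a penalty that discourages collapse by pushing the features toward orthonormality. The specific penalty and the coefficient `beta` are assumptions for illustration, not necessarily the paper's exact objective.
```python
import numpy as np

def laplacian_repr_loss(phi_s, phi_s_next, phi_u, phi_v, beta=1.0):
    """Sketch of a sample-based objective for learning Laplacian-like features.

    phi_s, phi_s_next : [B, d] embeddings of consecutive states from transitions.
    phi_u, phi_v      : [B, d] embeddings of states drawn independently.
    beta is a hypothetical trade-off coefficient.
    """
    d = phi_s.shape[1]
    # Attractive term: neighboring states should have similar embeddings
    # (a stochastic estimate of the graph Dirichlet energy).
    attractive = np.mean(np.sum((phi_s - phi_s_next) ** 2, axis=1))
    # Repulsive term: one plausible penalty encouraging (near-)orthonormal
    # features so the embedding does not collapse to a constant.
    dots = np.sum(phi_u * phi_v, axis=1)
    repulsive = (np.mean(dots ** 2)
                 - np.mean(np.sum(phi_u ** 2, axis=1))
                 - np.mean(np.sum(phi_v ** 2, axis=1)) + d)
    return attractive + beta * repulsive
```
No matrix of the full Laplacian is ever formed; everything is estimated from minibatches, which is what makes this style of approximation scalable.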
Identifying and Correcting Label Bias in Machine Learning
Datasets often contain biases which unfairly disadvantage certain groups, and
classifiers trained on such datasets can inherit these biases. In this paper,
we provide a mathematical formulation of how this bias can arise. We do so by
assuming the existence of underlying, unknown, and unbiased labels which are
overwritten by an agent who intends to provide accurate labels but may have
biases against certain groups. Despite the fact that we only observe the biased
labels, we are able to show that the bias may nevertheless be corrected by
re-weighting the data points without changing the labels. We show, with
theoretical guarantees, that training on the re-weighted dataset corresponds to
training on the unobserved but unbiased labels, thus leading to an unbiased
machine learning classifier. Our procedure is fast and robust and can be used
with virtually any learning algorithm. We evaluate on a number of standard
machine learning fairness datasets and a variety of fairness notions, finding
that our method outperforms standard approaches in achieving fair
classification.
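The sketch below illustrates, under simplifying assumptions, the kind of reweighting loop the abstract describes: alternately fit a classifier on weighted examples and nudge per-group multipliers in the direction of the observed demographic-parity violation. The multiplicative update, the step size `eta`, and the use of scikit-learn's `LogisticRegression` are illustrative choices, not the paper's exact procedure.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def reweight_for_parity(X, y, group, n_iters=50, eta=0.1):
    """Illustrative reweighting loop for label-bias correction.

    X : feature matrix; y : observed binary labels (0/1);
    group : integer group id per example.
    """
    y = np.asarray(y)
    group = np.asarray(group)
    lam = np.zeros(group.max() + 1)                # one multiplier per group
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_iters):
        w = np.exp(lam[group] * y)                 # re-weight positive examples per group
        clf.fit(X, y, sample_weight=w)
        pred = clf.predict(X)
        overall = pred.mean()
        for g in range(len(lam)):
            violation = pred[group == g].mean() - overall
            lam[g] -= eta * violation              # boost groups predicted positive too rarely
    return clf, np.exp(lam[group] * y)
```
Note that only the example weights change; the observed labels are never edited, which is the key property the abstract emphasizes.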
Reinforcement Learning via Fenchel-Rockafellar Duality
We review basic concepts of convex duality, focusing on the very general and
supremely useful Fenchel-Rockafellar duality. We summarize how this duality may
be applied to a variety of reinforcement learning (RL) settings, including
policy evaluation or optimization, online or offline learning, and discounted
or undiscounted rewards. The derivations yield a number of intriguing results,
including the ability to perform policy evaluation and on-policy policy
gradient with behavior-agnostic offline data, as well as methods to learn a policy via
max-likelihood optimization. Although many of these results have appeared
previously in various forms, we provide a unified treatment and perspective on
these results, which we hope will enable researchers to better use and apply
the tools of convex duality to make further progress in RL.
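For reference, one standard statement of the duality in question (sign conventions vary across sources):
```latex
% Fenchel-Rockafellar duality: for convex f and g, a linear operator A with
% adjoint A^{*}, and suitable regularity conditions,
\min_{x}\; f(x) + g(Ax)
  \;=\;
\max_{y}\; -f^{*}(-A^{*}y) - g^{*}(y),
% where f^{*}(z) = \sup_{x}\,\langle x, z\rangle - f(x) denotes the convex conjugate.
```
Loosely speaking, the different RL settings surveyed in the paper correspond to different choices of f, g, and A.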
Statistical Bootstrapping for Uncertainty Estimation in Off-Policy Evaluation
In reinforcement learning, it is typical to use the empirically observed
transitions and rewards to estimate the value of a policy via either
model-based or Q-fitting approaches. Although straightforward, these techniques
in general yield biased estimates of the true value of the policy. In this
work, we investigate the potential for statistical bootstrapping to be used as
a way to take these biased estimates and produce calibrated confidence
intervals for the true value of the policy. We identify conditions -
specifically, sufficient data size and sufficient coverage - under which
statistical bootstrapping in this setting is guaranteed to yield correct
confidence intervals. In practical situations, these conditions often do not
hold, and so we discuss and propose mechanisms that can be employed to mitigate
their effects. We evaluate our proposed method and show that it can yield
accurate confidence intervals in a variety of conditions, including challenging
continuous control environments and small data regimes.
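A minimal sketch of the basic procedure, assuming a user-supplied value estimator (here called `estimate_value`, a hypothetical name): resample the observed transitions with replacement, re-run the estimator on each resample, and take a percentile interval over the resulting estimates.
```python
import numpy as np

def bootstrap_value_ci(transitions, estimate_value, n_boot=200, alpha=0.05,
                       rng=None):
    """Sketch of statistical bootstrapping for off-policy value estimates.

    transitions    : sequence of observed (s, a, r, s') tuples.
    estimate_value : hypothetical routine mapping a dataset to a scalar value
                     estimate (e.g. a model-based or Q-fitting procedure).
    Returns a (lower, upper) percentile confidence interval for the policy value.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(transitions)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample transitions with replacement
        estimates.append(estimate_value([transitions[i] for i in idx]))
    lo, hi = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```
The sufficiency conditions mentioned above matter here: with too little data or poor coverage, every resample inherits the same bias and the interval can be miscalibrated.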
Trust-PCL: An Off-Policy Trust Region Method for Continuous Control
Trust region methods, such as TRPO, are often used to stabilize policy
optimization algorithms in reinforcement learning (RL). While current trust
region strategies are effective for continuous control, they typically require
a prohibitively large amount of on-policy interaction with the environment. To
address this problem, we propose an off-policy trust region method, Trust-PCL.
The algorithm is the result of observing that the optimal policy and state
values of a maximum reward objective with a relative-entropy regularizer
satisfy a set of multi-step pathwise consistencies along any path. Thus,
Trust-PCL is able to maintain optimization stability while exploiting
off-policy data to improve sample efficiency. When evaluated on a number of
continuous control tasks, Trust-PCL improves the solution quality and sample
efficiency of TRPO.
Comment: ICLR 2018
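The sketch below shows one plausible form of the multi-step consistency error such a method minimizes, with a per-step entropy term (coefficient `tau`) and a relative-entropy term against a prior policy (coefficient `lam`); the exact augmentation and coefficients used by Trust-PCL may differ.
```python
import numpy as np

def path_consistency_error(v_start, v_end, rewards, logp, logp_prior,
                           gamma=0.99, tau=0.01, lam=0.01):
    """Sketch of a multi-step pathwise consistency error with entropy (tau)
    and relative-entropy (lam) regularization; coefficients are illustrative.

    v_start, v_end            : value estimates at the ends of a length-d sub-trajectory.
    rewards, logp, logp_prior : per-step rewards, log pi(a_t|s_t), log pi_prior(a_t|s_t).
    """
    d = len(rewards)
    discounts = gamma ** np.arange(d)
    reg_rewards = (np.asarray(rewards)
                   - tau * np.asarray(logp)
                   - lam * (np.asarray(logp) - np.asarray(logp_prior)))
    # At the regularized optimum this quantity is zero along any sub-trajectory.
    return -v_start + gamma ** d * v_end + np.sum(discounts * reg_rewards)

# Training would minimize the squared error over many on-policy and
# off-policy sub-trajectories.
```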
Bridging the Gap Between Value and Policy Based Reinforcement Learning
We establish a new connection between value and policy based reinforcement
learning (RL) based on a relationship between softmax temporal value
consistency and policy optimality under entropy regularization. Specifically,
we show that softmax consistent action values correspond to optimal entropy
regularized policy probabilities along any action sequence, regardless of
provenance. From this observation, we develop a new RL algorithm, Path
Consistency Learning (PCL), that minimizes a notion of soft consistency error
along multi-step action sequences extracted from both on- and off-policy
traces. We examine the behavior of PCL in different scenarios and show that PCL
can be interpreted as generalizing both actor-critic and Q-learning algorithms.
We subsequently deepen the relationship by showing how a single model can be
used to represent both a policy and the corresponding softmax state values,
eliminating the need for a separate critic. The experimental evaluation
demonstrates that PCL significantly outperforms strong actor-critic and
Q-learning baselines across several benchmarks.
Comment: NIPS 2017
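For reference, the consistency conditions in question can be written as follows (stated here for deterministic transitions and entropy weight τ):
```latex
% Softmax temporal consistency:
V^{*}(s) = \tau \log \sum_{a} \exp\!\big( (r(s,a) + \gamma V^{*}(s')) / \tau \big),
\qquad
\pi^{*}(a \mid s) \propto \exp\!\big( (r(s,a) + \gamma V^{*}(s')) / \tau \big).
% Multi-step path consistency along any sequence s_0, a_0, \ldots, s_d:
V^{*}(s_0) = \gamma^{d} V^{*}(s_d)
  + \sum_{t=0}^{d-1} \gamma^{t} \big( r(s_t, a_t) - \tau \log \pi^{*}(a_t \mid s_t) \big).
```
PCL trains by minimizing the squared violation of the multi-step relation over sub-trajectories, which is what allows it to consume both on-policy and off-policy data.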
Learning to Remember Rare Events
Despite recent advances, memory-augmented deep neural networks are still
limited when it comes to life-long and one-shot learning, especially in
remembering rare events. We present a large-scale life-long memory module for
use in deep learning. The module exploits fast nearest-neighbor algorithms for
efficiency and thus scales to large memory sizes. Except for the
nearest-neighbor query, the module is fully differentiable and trained
end-to-end with no extra supervision. It operates in a life-long manner, i.e.,
without the need to reset it during training.
Our memory module can be easily added to any part of a supervised neural
network. To show its versatility we add it to a number of networks, from simple
convolutional ones tested on image classification to deep sequence-to-sequence
and recurrent-convolutional models. In all cases, the enhanced network gains
the ability to remember and do life-long one-shot learning. Our module
remembers training examples shown many thousands of steps in the past and it
can successfully generalize from them. We set a new state of the art for one-shot
learning on the Omniglot dataset and demonstrate, for the first time, life-long
one-shot learning in recurrent neural networks on a large-scale machine
translation task.
Comment: Conference paper accepted for ICLR'17
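A minimal sketch of the kind of key-value memory described here, with L2-normalized keys queried by nearest neighbor; the simplified update rule (average the key on a correct retrieval, overwrite the oldest slot otherwise) only approximates the module in the paper.
```python
import numpy as np

class EpisodicMemory:
    """Minimal key-value memory queried by nearest neighbor (illustrative)."""

    def __init__(self, size, dim):
        self.keys = np.zeros((size, dim))
        self.values = -np.ones(size, dtype=int)   # stored labels (assumed >= 0)
        self.age = np.zeros(size)

    def query(self, q):
        q = q / (np.linalg.norm(q) + 1e-8)
        i = int(np.argmax(self.keys @ q))         # cosine similarity; keys are normalized
        return self.values[i], i

    def update(self, q, label):
        q = q / (np.linalg.norm(q) + 1e-8)
        self.age += 1
        value, i = self.query(q)
        if value == label:                        # correct hit: refine the stored key
            merged = self.keys[i] + q
            self.keys[i] = merged / (np.linalg.norm(merged) + 1e-8)
            self.age[i] = 0
        else:                                     # miss: write into the oldest slot
            j = int(np.argmax(self.age))
            self.keys[j], self.values[j], self.age[j] = q, label, 0
```
Because the only non-differentiable step is the nearest-neighbor lookup, gradients can flow through the query embedding, which is how such a module is trained end-to-end alongside the host network.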
Group-based Fair Learning Leads to Counter-intuitive Predictions
A number of machine learning (ML) methods have been proposed recently to
maximize model predictive accuracy while enforcing notions of group parity or
fairness across sub-populations. We propose a desirable property for these
procedures, slack-consistency: For any individual, the predictions of the model
should be monotonic with respect to allowed slack (i.e., maximum allowed
group-parity violation). Such monotonicity can be useful for individuals to
understand the impact of enforcing fairness on their predictions. Surprisingly,
we find that standard ML methods for enforcing fairness violate this basic
property. Moreover, this undesirable behavior arises in situations agnostic to
the complexity of the underlying model or approximate optimizations, suggesting
that the simple act of incorporating a constraint can lead to drastically
unintended behavior in ML. We present a simple theoretical method for enforcing
slack-consistency, while encouraging further discussion of the unintended
behaviors potentially induced when enforcing group-based parity.
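A small sketch of how slack-consistency could be checked empirically, assuming a hypothetical `train_with_slack` routine that trains a fairness-constrained model at a given slack level and returns per-example scores:
```python
import numpy as np

def check_slack_consistency(X, slacks, train_with_slack):
    """Fraction of individuals whose scores move monotonically as the allowed
    group-parity violation (slack) is relaxed; illustrative check only.

    train_with_slack : hypothetical user-supplied routine; train_with_slack(X, s)
    returns an array of per-example scores for slack level s.
    """
    scores = np.stack([train_with_slack(X, s) for s in sorted(slacks)])  # [S, N]
    diffs = np.diff(scores, axis=0)
    monotone = np.all(diffs >= 0, axis=0) | np.all(diffs <= 0, axis=0)
    return monotone.mean()
```
A value well below 1.0 would indicate the counter-intuitive, slack-inconsistent behavior the abstract describes.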
Near-Optimal Representation Learning for Hierarchical Reinforcement Learning
We study the problem of representation learning in goal-conditioned
hierarchical reinforcement learning. In such hierarchical structures, a
higher-level controller solves tasks by iteratively communicating goals which a
lower-level policy is trained to reach. Accordingly, the choice of
representation -- the mapping of observation space to goal space -- is crucial.
To study this problem, we develop a notion of sub-optimality of a
representation, defined in terms of expected reward of the optimal hierarchical
policy using this representation. We derive expressions that bound the
sub-optimality and show how these expressions can be translated into
representation learning objectives that may be optimized in practice. Results
on a number of difficult continuous-control tasks show that our approach to
representation learning yields qualitatively better representations as well as
quantitatively better hierarchical policies, compared to existing methods (see
videos at https://sites.google.com/view/representation-hrl).
Comment: ICLR 2019 Conference Paper
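To make the setup concrete, the sketch below shows a common way such goal-conditioned hierarchies are wired together: the learned representation maps observations into the goal space, and the lower-level policy is rewarded for moving toward the goal commanded by the higher level. The negative-distance reward is an illustrative assumption, not necessarily the paper's exact choice.
```python
import numpy as np

def lower_level_reward(phi, next_state, goal):
    """Goal-conditioned low-level reward in a learned representation space.

    phi  : the learned representation mapping observations to goal space.
    goal : the target in goal space commanded by the higher-level controller.
    """
    return -np.linalg.norm(phi(next_state) - goal)

# A higher-level controller would emit a new `goal` every c steps and be
# trained on the environment reward accumulated over those c steps.
```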