A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning
One of the main obstacles to broad application of reinforcement learning
methods is the parameter sensitivity of our core learning algorithms. In many
large-scale applications, online computation and function approximation
represent key strategies in scaling up reinforcement learning algorithms. In
this setting, we have effective and reasonably well understood algorithms for
adapting the learning-rate parameter, online during learning. Such
meta-learning approaches can improve robustness of learning and enable
specialization to the current task, improving learning speed. For
temporal-difference learning algorithms which we study here, there is yet
another parameter, \lambda, that similarly impacts learning speed and
stability in practice. Unfortunately, unlike the learning-rate parameter,
\lambda parametrizes the objective function that temporal-difference methods
optimize. Different choices of \lambda produce different fixed-point
solutions, and thus adapting \lambda online and characterizing the
optimization is substantially more complex than adapting the learning-rate
parameter. There is no meta-learning method for \lambda that can achieve (1)
incremental updating, (2) compatibility with function approximation, and (3)
stability of learning under both on- and off-policy sampling. In this
paper we contribute a novel objective function for optimizing \lambda as a
function of state rather than time. We derive a new incremental,
linear-complexity \lambda-adaptation algorithm that does not require offline batch
updating or access to a model of the world, and present a suite of experiments
illustrating the practicality of our new algorithm in three different settings.
Taken together, our contributions represent a concrete step towards black-box
application of temporal-difference learning methods in real-world problems.
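To make the role of the trace parameter concrete, here is a minimal sketch of linear TD(\lambda) with accumulating eligibility traces in which \lambda is a function of state rather than a single constant. The chain environment and the particular \lambda(s) schedule are illustrative assumptions; the greedy adaptation objective contributed by the paper is not reproduced here.

```python
import numpy as np

# Minimal sketch: linear TD(lambda) with accumulating traces, where the
# trace parameter is a function of state. The chain environment and the
# lambda(s) schedule are illustrative only; the paper's greedy adaptation
# objective is not implemented here.

rng = np.random.default_rng(0)
n_states, gamma, alpha = 10, 0.99, 0.1
features = np.eye(n_states)            # tabular features for clarity

def lam(s):
    # Hypothetical state-dependent trace parameter.
    return 0.9 if s < n_states // 2 else 0.4

w = np.zeros(n_states)
for episode in range(500):
    s = n_states // 2
    z = np.zeros(n_states)             # eligibility trace
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        terminal = s_next < 0 or s_next >= n_states
        r = 1.0 if s_next >= n_states else 0.0
        v_next = 0.0 if terminal else w @ features[s_next]
        delta = r + gamma * v_next - w @ features[s]
        z = gamma * lam(s) * z + features[s]   # accumulating trace
        w += alpha * delta * z
        if terminal:
            break
        s = s_next

print("learned state values:", np.round(w, 2))
```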
Investigating practical linear temporal difference learning
Off-policy reinforcement learning has many applications including: learning
from demonstration, learning multiple goal seeking policies in parallel, and
representing predictive knowledge. Recently there has been a proliferation of
new policy-evaluation algorithms that fill a longstanding algorithmic void in
reinforcement learning: combining robustness to off-policy sampling, function
approximation, linear complexity, and temporal difference (TD) updates. This
paper contains two main contributions. First, we derive two new hybrid TD
policy-evaluation algorithms, which fill a gap in this collection of
algorithms. Second, we perform an empirical comparison to elicit which of these
new linear TD methods should be preferred in different situations, and make
concrete suggestions about practical use. Comment: Autonomous Agents and Multi-agent Systems, 201
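As a point of reference for the family of linear-complexity methods discussed above, the following is a minimal sketch of off-policy linear TD(0) policy evaluation with per-step importance sampling ratios. The synthetic features, rewards, and ratios are assumptions made for illustration; the hybrid algorithms derived in the paper add gradient-correction terms that are not shown, and plain off-policy TD of this form is not guaranteed to be stable in general.

```python
import numpy as np

# Minimal sketch of off-policy linear TD(0) policy evaluation with per-step
# importance sampling ratios. The features, rewards, and ratios below are
# synthetic assumptions. The hybrid methods derived in the paper add
# gradient-correction terms to updates like this one.

rng = np.random.default_rng(0)
d, gamma, alpha = 8, 0.9, 0.05
w = np.zeros(d)

for _ in range(2000):
    x = rng.normal(size=d)              # features of the current state
    x_next = rng.normal(size=d)         # features of the next state
    rho = rng.uniform(0.5, 1.5)         # pi(a|s) / mu(a|s), assumed given
    r = x[0]                            # synthetic reward signal
    delta = r + gamma * (w @ x_next) - w @ x
    w += alpha * rho * delta * x        # importance-weighted TD(0) update

print("learned weight vector:", np.round(w, 2))
```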
Meta-Learning Representations for Continual Learning
A continual learning agent should be able to build on top of existing
knowledge to learn on new data quickly while minimizing forgetting. Current
intelligent systems based on neural network function approximators arguably do
the opposite---they are highly prone to forgetting and rarely trained to
facilitate future learning. One reason for this poor behavior is that they
learn from a representation that is not explicitly trained for these two goals.
In this paper, we propose OML, an objective that directly minimizes
catastrophic interference by learning representations that accelerate future
learning and are robust to forgetting under online updates in continual
learning. We show that it is possible to learn naturally sparse representations
that are more effective for online updating. Moreover, our algorithm is
complementary to existing continual learning strategies, such as MER and GEM.
Finally, we demonstrate that a basic online updating strategy on
representations learned by OML is competitive with rehearsal based methods for
continual learning. We release an implementation of our method at
https://github.com/khurramjaved96/mrcl . Comment: Accepted at NeurIPS19, 15 pages, 10 figures, open-source,
representation learning, continual learning, online learning
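The continual-learning protocol referred to above (tasks arriving sequentially, a single online pass through each, updates applied only to a prediction head on top of a fixed representation) can be sketched as follows. The random-feature representation and the linear regression tasks are stand-ins; the OML meta-objective that would actually train the representation is not implemented here.

```python
import numpy as np

# Toy sketch of the continual-learning protocol: tasks arrive sequentially,
# a single online pass is made through each, and only a linear prediction
# head on top of a FIXED representation is updated. The random ReLU
# representation and linear targets are stand-ins; the OML objective that
# would train the representation is not implemented.

rng = np.random.default_rng(0)
d_in, d_rep, n_tasks, n_per_task, alpha = 5, 64, 3, 300, 0.01

W_rep = rng.normal(size=(d_in, d_rep)) / np.sqrt(d_in)
rep = lambda X: np.maximum(X @ W_rep, 0.0)       # frozen representation

tasks = [rng.normal(size=d_in) for _ in range(n_tasks)]   # random targets
held_out = []
for t in tasks:
    X = rng.normal(size=(200, d_in))
    held_out.append((X, X @ t))

head = np.zeros(d_rep)
for k, t in enumerate(tasks):
    for _ in range(n_per_task):                  # one online pass per task
        x = rng.normal(size=d_in)
        phi = np.maximum(x @ W_rep, 0.0)
        head += alpha * (x @ t - phi @ head) * phi   # online SGD on the head
    # Forgetting check: error on every task seen so far.
    errs = [np.mean((rep(X) @ head - y) ** 2) for X, y in held_out[:k + 1]]
    print(f"after task {k}: held-out MSE per task = {np.round(errs, 3)}")
```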
Identifying global optimality for dictionary learning
Learning new representations of input observations in machine learning is
often tackled using a factorization of the data. For many such problems,
including sparse coding and matrix completion, learning these factorizations
can be difficult, both in terms of efficiency and in guaranteeing that the solution is
a global minimum. Recently, a general class of objectives, which we term induced
dictionary learning models (DLMs), has been introduced; these objectives have an
induced convex form that enables global optimization. Though attractive
theoretically, this induced form is impractical, particularly for large or
growing datasets. In this work, we investigate the use of practical alternating
minimization algorithms for induced DLMs, that ensure convergence to global
optima. We characterize the stationary points of these models, and, using these
insights, highlight practical choices for the objectives. We then provide
theoretical and empirical evidence that alternating minimization, from a random
initialization, converges to global minima for a large subclass of induced
DLMs. In particular, we take advantage of the existence of the (potentially
unknown) convex induced form, to identify when stationary points are global
minima for the dictionary learning objective. We then provide an empirical
investigation into practical optimization choices for using alternating
minimization for induced DLMs, for both batch and stochastic gradient descent. Comment: Updates to previous version include a small modification to
Proposition 2, to only use normed regularizers, and a modification to the
main theorem (previously Theorem 13) to focus on the overcomplete, full rank
setting and to better characterize non-differentiable induced regularizers.
The theory has been significantly modified since version
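The alternating-minimization pattern studied above can be illustrated with a small, hedged example: an \ell_2-regularized two-factor objective for which each subproblem is a ridge regression with a closed-form solution. The induced convex form and the paper's conditions for global optimality are not reproduced; this only shows the optimization loop.

```python
import numpy as np

# Minimal sketch: alternating minimization for a regularized factorization
#   min_{H, D}  ||X - H D||_F^2 + reg * (||H||_F^2 + ||D||_F^2)
# with X (n x d) data, H (n x k) per-sample codes, and D (k x d) dictionary
# atoms. Each subproblem is a ridge regression; the induced convex form and
# the paper's global-optimality analysis are not reproduced here.

rng = np.random.default_rng(0)
n, d, k, reg = 200, 30, 5, 0.1

# Synthetic data that is approximately rank k.
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.01 * rng.normal(size=(n, d))

H = rng.normal(size=(n, k))
D = rng.normal(size=(k, d))

def ridge_solve(A, B, reg):
    # Solve min_W ||A W - B||_F^2 + reg * ||W||_F^2 in closed form.
    return np.linalg.solve(A.T @ A + reg * np.eye(A.shape[1]), A.T @ B)

for it in range(50):
    D = ridge_solve(H, X, reg)          # fix the codes, solve for the dictionary
    H = ridge_solve(D.T, X.T, reg).T    # fix the dictionary, solve for the codes
    obj = np.sum((X - H @ D) ** 2) + reg * (np.sum(H ** 2) + np.sum(D ** 2))
    if it % 10 == 0:
        print(f"iteration {it:2d}  objective {obj:.3f}")
```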
Context-Dependent Upper-Confidence Bounds for Directed Exploration
Directed exploration strategies for reinforcement learning are critical for
learning an optimal policy in a minimal number of interactions with the
environment. Many algorithms use optimism to direct exploration, either through
visitation estimates or upper confidence bounds, as opposed to data-inefficient
strategies like \epsilon-greedy that use random, undirected exploration. Most
data-efficient exploration methods require significant computation, typically
relying on a learned model to guide exploration. Least-squares methods have the
potential to provide some of the data-efficiency benefits of model-based
approaches -- because they summarize past interactions -- with the computation
closer to that of model-free approaches. In this work, we provide a novel,
computationally efficient, incremental exploration strategy, leveraging this
property of least-squares temporal difference learning (LSTD). We derive upper
confidence bounds on the action-values learned by LSTD, with context-dependent
(or state-dependent) noise variance. Such context-dependent noise focuses
exploration on a subset of variable states, and allows for reduced exploration
in other states. We empirically demonstrate that our algorithm can converge
more quickly than other incremental exploration strategies using confidence
estimates on action-values. Comment: Neural Information Processing Systems 201
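The general shape of an upper confidence bound on a least-squares value estimate can be sketched as q_ucb(x) = x^T theta + c * sqrt(x^T A^{-1} x), in the style of linear-bandit UCB. The exploration coefficient c, the synthetic transitions, and the use of the unsymmetrized LSTD matrix in the bonus term below are illustrative assumptions, not the context-dependent bound derived in the paper.

```python
import numpy as np

# Sketch of an optimistic action-value estimate built on a least-squares
# solution:  q_ucb(x) = x^T theta + c * sqrt(x^T A^{-1} x).
# The coefficient c, the synthetic transitions, and the use of the
# (generally non-symmetric) LSTD matrix in the bonus are assumptions.

rng = np.random.default_rng(0)
d, gamma, c = 8, 0.9, 1.0
A = np.eye(d)                           # regularized LSTD summary matrix
b = np.zeros(d)

def lstd_update(A, b, x, r, x_next):
    # Accumulate LSTD(0) statistics: A += x (x - gamma x')^T,  b += r x.
    A += np.outer(x, x - gamma * x_next)
    b += r * x
    return A, b

def ucb_value(A, b, x):
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                   # least-squares value weights
    bonus = c * np.sqrt(max(x @ A_inv @ x, 0.0))
    return x @ theta + bonus

for _ in range(500):                    # synthetic state-action transitions
    x, x_next = rng.normal(size=d), rng.normal(size=d)
    r = x[0] + 0.1 * rng.normal()
    A, b = lstd_update(A, b, x, r, x_next)

x_query = rng.normal(size=d)
print("optimistic action-value at a query point:", round(ucb_value(A, b, x_query), 3))
```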
High-confidence error estimates for learned value functions
Estimating the value function for a fixed policy is a fundamental problem in
reinforcement learning. Policy evaluation algorithms---to estimate value
functions---continue to be developed, to improve convergence rates, improve
stability and handle variability, particularly for off-policy learning. To
understand the properties of these algorithms, the experimenter needs
high-confidence estimates of the accuracy of the learned value functions. For
environments with small, finite state-spaces, like chains, the true value
function can be computed exactly, making accuracy easy to measure. For large or continuous
state-spaces, however, this is no longer feasible. In this paper, we address
the largely open problem of how to obtain these high-confidence estimates, for
general state-spaces. We provide a high-confidence bound relating an empirical
estimate of the value error to the true value error. We use this bound to
design an offline sampling algorithm, which stores the required quantities to
repeatedly compute value error estimates for any learned value function. We
provide experiments investigating the number of samples required by this
offline algorithm in simple benchmark reinforcement learning domains, and
highlight that there are still many open questions to be solved for this
important problem. Comment: Presented at Uncertainty in Artificial Intelligence (UAI) 201
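A simple, cruder version of the idea can be sketched directly: sample states, estimate v_pi at each sampled state with Monte Carlo rollouts, and place a normal-approximation confidence interval on the resulting squared-error estimate. The toy chain, fixed policy, learned values, and interval are assumptions, not the paper's bound or sampling algorithm.

```python
import numpy as np

# Sketch: estimate the mean squared value error of a learned value function
# by sampling states, estimating v_pi(s) with Monte Carlo rollouts, and
# placing a normal-approximation confidence interval on the result. Because
# the rollout average is itself noisy, this simple estimator slightly
# overstates the true error; the paper handles this more carefully.

rng = np.random.default_rng(0)
n_states, gamma = 20, 0.95

def rollout(s, max_steps=200):
    # Discounted return sample from state s under a fixed policy that moves
    # right with probability 0.7; reward 1 on reaching the right end.
    g, disc = 0.0, 1.0
    for _ in range(max_steps):
        s = min(s + 1, n_states - 1) if rng.random() < 0.7 else max(s - 1, 0)
        if s == n_states - 1:
            g += disc * 1.0
            break
        disc *= gamma
    return g

v_hat = np.linspace(0.0, 1.0, n_states)      # some learned value function

errors = []
for _ in range(500):
    s = int(rng.integers(n_states - 1))      # sample a non-terminal state
    g_bar = np.mean([rollout(s) for _ in range(10)])   # MC estimate of v_pi(s)
    errors.append((v_hat[s] - g_bar) ** 2)

errors = np.array(errors)
mean = errors.mean()
half_width = 1.96 * errors.std(ddof=1) / np.sqrt(len(errors))
print(f"estimated MSVE: {mean:.4f}  (approx. 95% CI half-width {half_width:.4f})")
```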
Incremental Truncated LSTD
Balancing between computational efficiency and sample efficiency is an
important goal in reinforcement learning. Temporal difference (TD) learning
algorithms stochastically update the value function, with a linear time
complexity in the number of features, whereas least-squares temporal difference
(LSTD) algorithms are sample efficient but can be quadratic in the number of
features. In this work, we develop an efficient incremental low-rank
LSTD({\lambda}) algorithm that progresses towards the goal of better balancing
computation and sample efficiency. The algorithm reduces the computation and
storage complexity to the number of features times the chosen rank parameter
while summarizing past samples efficiently to nearly obtain the sample
complexity of LSTD. We derive a simulation bound on the solution given by
truncated low-rank approximation, illustrating a bias-variance trade-off
dependent on the choice of rank. We demonstrate that the algorithm effectively
balances computational complexity and sample efficiency for policy evaluation
in a benchmark task and a high-dimensional energy allocation domain. Comment: Accepted to IJCAI 201
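A batch illustration of the truncation idea: form the LSTD statistics, keep only the top-k singular directions of the A matrix, and solve in that subspace. The paper's algorithm maintains the low-rank factors incrementally and never forms A explicitly; forming it here, along with the synthetic transitions, is purely for clarity.

```python
import numpy as np

# Batch illustration of rank truncation in LSTD: accumulate the statistics
# A = sum x (x - gamma x')^T and b = sum r x, then solve using only the
# top-k singular directions of A. The incremental algorithm in the paper
# avoids ever forming A; this sketch only shows the truncation step.

rng = np.random.default_rng(0)
d, gamma, rank, n_samples = 50, 0.9, 5, 2000

theta_true = rng.normal(size=d)        # weights of a consistent value function
A = np.zeros((d, d))
b = np.zeros(d)

for _ in range(n_samples):
    x, x_next = rng.normal(size=d), rng.normal(size=d)
    r = x @ theta_true - gamma * (x_next @ theta_true)   # zero Bellman error
    A += np.outer(x, x - gamma * x_next)
    b += r * x

theta_full = np.linalg.solve(A, b)                       # full LSTD solution

U, s, Vt = np.linalg.svd(A)                              # rank-k pseudoinverse
A_pinv_k = Vt[:rank].T @ np.diag(1.0 / s[:rank]) @ U[:, :rank].T
theta_trunc = A_pinv_k @ b

print("full LSTD error:  ", np.linalg.norm(theta_full - theta_true))
print(f"rank-{rank} LSTD error:", np.linalg.norm(theta_trunc - theta_true))
```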
Estimating the class prior and posterior from noisy positives and unlabeled data
We develop a classification algorithm for estimating posterior distributions
from positive-unlabeled data, that is robust to noise in the positive labels
and effective for high-dimensional data. In recent years, several algorithms
have been proposed to learn from positive-unlabeled data; however, many of
these contributions remain theoretical, performing poorly on real
high-dimensional data that is typically contaminated with noise. We build on
this previous work to develop two practical classification algorithms that
explicitly model the noise in the positive labels and utilize univariate
transforms built on discriminative classifiers. We prove that these univariate
transforms preserve the class prior, enabling estimation in the univariate
space and avoiding kernel density estimation for high-dimensional data. The
theoretical development and both parametric and nonparametric algorithms
proposed here constitute an important step towards widespread use of robust
classification algorithms for positive-unlabeled data. Comment: Fixed a typo in the MSGMM update equations in the appendix. Other
minor changes
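For context, a classic calibration-style estimator of the class prior from positive-unlabeled data (in the spirit of Elkan and Noto) can be sketched as follows, using a discriminative classifier's score as the univariate transform. The Gaussian data and selected-completely-at-random labeling are assumptions, and this baseline does not model noise in the positive labels the way the paper's parametric and nonparametric estimators do.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of class-prior estimation from positive-unlabeled data using a
# discriminative classifier's score as a univariate transform (Elkan and
# Noto style calibration). This is a classic baseline, not the noise-robust
# estimators proposed in the paper.

rng = np.random.default_rng(0)
n, d, true_prior = 5000, 10, 0.3

# Generate data: positives and negatives from shifted Gaussians.
y = (rng.random(n) < true_prior).astype(int)
X = rng.normal(size=(n, d)) + 2.0 * y[:, None]

# Label a random half of the positives (selected completely at random).
s = np.zeros(n, dtype=int)
pos_idx = np.where(y == 1)[0]
s[rng.choice(pos_idx, size=len(pos_idx) // 2, replace=False)] = 1

# Train g(x) ~ p(s=1 | x): labeled positives vs. everything else.
clf = LogisticRegression(max_iter=1000).fit(X, s)
g = clf.predict_proba(X)[:, 1]

# Under the selected-completely-at-random assumption,
# c = p(s=1 | y=1) is estimated by the mean score on labeled positives,
# and the class prior is p(s=1) / c.
c_hat = g[s == 1].mean()
prior_hat = s.mean() / c_hat
print(f"true prior {true_prior:.2f}, estimated prior {prior_hat:.2f}")
```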
Organizing Experience: A Deeper Look at Replay Mechanisms for Sample-based Planning in Continuous State Domains
Model-based strategies for control are critical to obtain sample efficient
learning. Dyna is a planning paradigm that naturally interleaves learning and
planning, by simulating one-step experience to update the action-value
function. This elegant planning strategy has been mostly explored in the
tabular setting. The aim of this paper is to revisit sample-based planning, in
stochastic and continuous domains with learned models. We first highlight the
flexibility afforded by a model over Experience Replay (ER). Replay-based
methods can be seen as stochastic planning methods that repeatedly sample from
a buffer of recent agent-environment interactions and perform updates to
improve data efficiency. We show that a model, as opposed to a replay buffer,
is particularly useful for specifying which states to sample from during
planning, such as predecessor states that propagate information in reverse from
a state more quickly. We introduce a semi-parametric model learning approach,
called Reweighted Experience Models (REMs), that makes it simple to sample next
states or predecessors. We demonstrate that REM-Dyna exhibits similar
advantages over replay-based methods in learning in continuous state problems,
and that the performance gap grows when moving to stochastic domains of
increasing size. Comment: IJCAI 201
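The planning pattern described above can be illustrated in the tabular case: a Dyna-Q loop that, after each real step, samples predecessor state-action pairs from a learned model so value information propagates backwards. The chain environment, uniform-random behaviour policy, and sampling scheme are simple stand-ins; REMs and the continuous-state experiments are not reproduced here.

```python
import numpy as np
from collections import defaultdict

# Tabular Dyna-Q sketch with predecessor sampling: after each real step,
# plan by replaying model transitions, preferring predecessors of the most
# recently updated state so value information propagates backwards. The
# chain environment and random behaviour policy are illustrative stand-ins.

rng = np.random.default_rng(0)
n_states, n_actions = 10, 2            # chain: action 0 = left, 1 = right
gamma, alpha, n_plan = 0.95, 0.5, 10

Q = np.zeros((n_states, n_actions))
model = {}                             # (s, a) -> (r, s') from experience
predecessors = defaultdict(set)        # s' -> set of (s, a) leading to it

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next, s_next == n_states - 1

for episode in range(50):
    s, done = 0, False
    while not done:
        a = int(rng.integers(n_actions))       # uniform random behaviour
        r, s_next, done = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        model[(s, a)] = (r, s_next)
        predecessors[s_next].add((s, a))

        # Planning: walk backwards through sampled predecessors.
        frontier = s
        for _ in range(n_plan):
            preds = list(predecessors[frontier])
            if not preds:
                break
            ps, pa = preds[rng.integers(len(preds))]
            pr, pnext = model[(ps, pa)]
            Q[ps, pa] += alpha * (pr + gamma * Q[pnext].max() - Q[ps, pa])
            frontier = ps
        s = s_next

print("greedy action per state:", np.argmax(Q, axis=1))
```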
An Off-policy Policy Gradient Theorem Using Emphatic Weightings
Policy gradient methods are widely used for control in reinforcement
learning, particularly for the continuous action setting. There have been a
host of theoretically sound algorithms proposed for the on-policy setting, due
to the existence of the policy gradient theorem which provides a simplified
form for the gradient. In off-policy learning, however, where the behaviour
policy is not necessarily attempting to learn and follow the optimal policy for
the given task, the existence of such a theorem has been elusive. In this work,
we solve this open problem by providing the first off-policy policy gradient
theorem. The key to the derivation is the use of emphatic weightings. We
develop a new actor-critic algorithm, called Actor Critic with
Emphatic weightings (ACE), that approximates the simplified
gradients provided by the theorem. We demonstrate in a simple counterexample
that previous off-policy policy gradient methods, particularly
OffPAC and DPG, converge to the wrong solution whereas ACE finds
the optimal solution. Comment: Updated to final NeurIPS version
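The weighting at the heart of the theorem can be sketched on its own: the followon (emphatic) trace F_t = i_t + gamma * rho_{t-1} * F_{t-1}, which reweights states by how much the target policy would have visited them under off-policy sampling. The synthetic importance ratios and uniform interest below are assumptions; the ACE actor update built on top of this weighting is not shown.

```python
import numpy as np

# Sketch of the followon (emphatic) trace used to reweight updates under
# off-policy sampling: F_t = i_t + gamma * rho_{t-1} * F_{t-1}, where rho
# is the importance ratio pi(a|s) / mu(a|s) and i_t is an interest weight.
# The trajectory below is synthetic; this is only the weighting that an
# emphatic actor-critic would apply to its per-step gradients.

rng = np.random.default_rng(0)
gamma, T = 0.9, 20

rho = rng.uniform(0.5, 1.5, size=T)    # pi(a_t|s_t) / mu(a_t|s_t), assumed given
interest = np.ones(T)                  # uniform interest in every state

F = np.zeros(T)
F[0] = interest[0]
for t in range(1, T):
    F[t] = interest[t] + gamma * rho[t - 1] * F[t - 1]

# An emphatically weighted update would scale each step's gradient by F[t]
# (times rho[t]) instead of weighting all visited states equally.
print("emphatic weights:", np.round(F, 2))
```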