A Dantzig Selector Approach to Temporal Difference Learning
LSTD is a popular algorithm for value function approximation. Whenever the
number of features is larger than the number of samples, it must be paired with
some form of regularization. In particular, L1-regularization methods tend to
perform feature selection by promoting sparsity, and thus are well-suited for
high-dimensional problems. However, since LSTD is not a simple regression
algorithm but solves a fixed-point problem, its integration with
L1-regularization is not straightforward and may come with some drawbacks
(e.g., the P-matrix assumption for LASSO-TD). In this paper, we introduce a
novel algorithm obtained by integrating LSTD with the Dantzig Selector. We
investigate the performance of the proposed algorithm and its relationship with
the existing regularized approaches, and show how it addresses some of their
drawbacks.
Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
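As a reference point for the fixed-point problem the abstract refers to, here is a minimal sketch of plain (unregularized) LSTD, the baseline that the proposed Dantzig-Selector variant regularizes; the function name, ridge term, and toy chain are illustrative, not taken from the paper:

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma=0.99, ridge=1e-6):
    """Least-Squares Temporal Difference: solve A w = b for the value-function
    weights, where A = Phi^T (Phi - gamma * Phi') and b = Phi^T r.
    A small ridge term keeps A invertible when the number of features
    exceeds the number of samples (the regime the abstract targets)."""
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(A + ridge * np.eye(A.shape[1]), b)

# Toy two-state chain: state 0 -> state 1 -> terminal, reward 1 per step.
phi      = np.array([[1.0, 0.0], [0.0, 1.0]])   # features of s_t
phi_next = np.array([[0.0, 1.0], [0.0, 0.0]])   # features of s_{t+1}
r        = np.array([1.0, 1.0])
w = lstd(phi, phi_next, r, gamma=0.9)
# w approximates V(s0) = 1 + 0.9 * V(s1) and V(s1) = 1
```

L1-regularized variants such as LASSO-TD and the paper's Dantzig-Selector approach replace this direct linear solve with a constrained sparse estimation problem over the same A and b.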
A Theory of Regularized Markov Decision Processes
Many recent successful (deep) reinforcement learning algorithms make use of
regularization, generally based on entropy or Kullback-Leibler divergence. We
propose a general theory of regularized Markov Decision Processes that
generalizes these approaches in two directions: we consider a larger class of
regularizers, and we consider the general modified policy iteration approach,
encompassing both policy iteration and value iteration. The core building
blocks of this theory are a notion of regularized Bellman operator and the
Legendre-Fenchel transform, a classical tool of convex optimization. This
approach allows for error propagation analyses of general algorithmic schemes
of which (possibly variants of) classical algorithms such as Trust Region
Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy
Programming are special cases. This also draws connections to proximal convex
optimization, especially to Mirror Descent.
Comment: ICML 2019.
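For the entropy regularizer in particular, the regularized Bellman operator has a closed form: the Legendre-Fenchel conjugate of the negative Shannon entropy is the scaled log-sum-exp. A minimal sketch of the resulting soft value iteration, on an illustrative toy MDP (the temperature `tau` and the MDP itself are assumptions, not from the paper):

```python
import numpy as np

def soft_value_iteration(P, R, gamma=0.9, tau=1.0, iters=500):
    """Value iteration where the hard max in the Bellman operator is replaced
    by the Legendre-Fenchel conjugate of the negative-entropy regularizer,
    i.e. the scaled log-sum-exp over actions.
    P: (A, S, S) transition tensor, R: (S, A) rewards."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(iters):
        q = R + gamma * np.einsum('asn,n->sa', P, v)   # Q-values under current v
        v = tau * np.log(np.exp(q / tau).sum(axis=1))  # soft (log-sum-exp) backup
    pi = np.exp((q - v[:, None]) / tau)                # softmax policy, rows sum to 1
    return v, pi

# Toy 2-state, 2-action MDP: action 0 stays put, action 1 switches states.
P = np.array([[[1., 0.], [0., 1.]],
              [[0., 1.], [1., 0.]]])
R = np.array([[1., 0.],    # in state 0, staying pays 1
              [0., 1.]])   # in state 1, switching pays 1
v, pi = soft_value_iteration(P, R)
```

As `tau` tends to 0 the log-sum-exp approaches the hard max and standard value iteration is recovered; swapping in a KL regularizer toward a reference policy changes only the conjugate in the backup.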
When Waiting is not an Option: Learning Options with a Deliberation Cost
Recent work has shown that temporally extended actions (options) can be
learned fully end-to-end as opposed to being specified in advance. While the
problem of "how" to learn options is increasingly well understood, the question
of "what" good options should be has remained elusive. We formulate our answer
to what "good" options should be in the bounded rationality framework (Simon,
1957) through the notion of deliberation cost. We then derive practical
gradient-based learning algorithms to implement this objective. Our results in
the Arcade Learning Environment (ALE) show increased performance and
interpretability.
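The effect of a deliberation cost on option termination can be illustrated with a hypothetical one-parameter toy (the sigmoid parameterization, learning rate, and cost value are assumptions for illustration, not the paper's architecture): the cost `eta` is added to the advantage inside the termination gradient, so terminating only pays off when switching options beats the per-switch cost.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_termination(advantage, eta, lr=0.5, steps=200):
    """Gradient descent on a toy termination objective with a deliberation
    cost: the termination probability beta = sigmoid(theta) follows the
    gradient -(advantage + eta) * d(beta)/d(theta), so a positive eta
    pushes beta down and makes the option persist longer."""
    theta = 0.0
    for _ in range(steps):
        beta = sigmoid(theta)
        grad = (advantage + eta) * beta * (1.0 - beta)  # d(beta)/d(theta) = beta(1-beta)
        theta -= lr * grad
    return sigmoid(theta)

# With zero advantage and no cost the termination probability is unchanged;
# adding a deliberation cost drives it toward zero (longer options).
beta_free   = train_termination(advantage=0.0, eta=0.0)
beta_costly = train_termination(advantage=0.0, eta=1.0)
```

This single-state sketch only captures the direction of the gradient; the paper derives the corresponding update for full option-critic architectures evaluated in the ALE.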