A Deterministic Analysis of an Online Convex Mixture of Expert Algorithms
We analyze an online learning algorithm that adaptively
combines outputs of two constituent algorithms (or the
experts) running in parallel to model an unknown desired signal.
This online learning algorithm is shown to achieve (and in some
cases outperform) the mean-square error (MSE) performance of
the best constituent algorithm in the mixture in the steady-state.
However, the MSE analysis of this algorithm in the literature
uses approximations and relies on statistical models of the
underlying signals and systems. Hence, such an analysis may not
be useful or valid for signals generated by various real-life systems
that exhibit high degrees of nonstationarity, limit cycles, and, in
many cases, even chaotic behavior. In this paper, we produce
results in an individual sequence manner. In particular, we relate
the time-accumulated squared estimation error of this online
algorithm at any time over any interval to the time-accumulated
squared estimation error of the optimal convex mixture of the
constituent algorithms directly tuned to the underlying signal
in a deterministic sense, without any statistical assumptions.
Our analysis thus characterizes the transient, steady-state, and
tracking behavior of this algorithm, with no approximations in the
derivations and no statistical assumptions on the underlying signals,
so that our results are guaranteed to hold. We illustrate the
introduced results through examples.
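As a minimal sketch of the kind of mixture analyzed above: a single convex weight, parametrized through a sigmoid so it stays in (0, 1), combines the two experts' outputs and is updated by stochastic gradient descent on the instantaneous squared error. The sigmoid parametrization and the step size mu are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def convex_mixture(y, y1, y2, mu=0.01):
    """Adaptively combine two expert predictions y1, y2 to track a desired signal y.

    The mixture weight lam_t = sigmoid(rho_t) stays in (0, 1); rho_t is
    updated by stochastic gradient descent on the instantaneous squared error.
    (Illustrative instantiation; step size mu is an assumed hyperparameter.)
    """
    rho = 0.0
    y_hat = np.empty_like(y)
    for t in range(len(y)):
        lam = sigmoid(rho)
        y_hat[t] = lam * y1[t] + (1.0 - lam) * y2[t]
        e = y[t] - y_hat[t]
        # d(e^2)/d rho = -2 e (y1 - y2) lam (1 - lam); descend (factor 2 folded into mu)
        rho += mu * e * (y1[t] - y2[t]) * lam * (1.0 - lam)
    return y_hat
```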
A Second-order Bound with Excess Losses
We study online aggregation of the predictions of experts, and first show new
second-order regret bounds in the standard setting, which are obtained via a
version of the Prod algorithm (and also a version of the polynomially weighted
average algorithm) with multiple learning rates. These bounds are in terms of
excess losses, the differences between the instantaneous losses suffered by the
algorithm and those of a given expert. We then demonstrate the usefulness
of these bounds in the context of experts that report their confidences as a
number in the interval [0,1], using a generic reduction to the standard setting.
We conclude with two other applications in the standard setting, which improve
the known bounds in the case of small excess losses and show bounded regret
against i.i.d. sequences of losses.
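A minimal sketch of a Prod-style update with one learning rate per expert, in the spirit of the bounds above: each weight is multiplied by (1 + eta_k * excess loss), where the excess loss is the algorithm's loss minus expert k's loss. The fixed learning rates and the linear-loss prediction step are simplifying assumptions; the paper's algorithm and its tuning may differ.

```python
import numpy as np

def ml_prod(losses, etas):
    """Prod-style aggregation with one learning rate per expert.

    losses: (T, K) array of expert losses in [0, 1].
    etas:   (K,) per-expert learning rates, assumed to satisfy eta_k <= 1/2
            so that every multiplicative factor below stays positive.
    """
    T, K = losses.shape
    w = np.ones(K)
    total_loss = 0.0
    for t in range(T):
        p = etas * w
        p /= p.sum()                    # prediction weights proportional to eta_k * w_k
        alg_loss = p @ losses[t]        # algorithm's loss (linear-loss case assumed)
        total_loss += alg_loss
        excess = alg_loss - losses[t]   # instantaneous excess loss w.r.t. each expert
        w *= 1.0 + etas * excess        # Prod update, one learning rate per expert
    return total_loss, w
```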
Batch Policy Learning under Constraints
When learning policies for real-world domains, two important questions arise:
(i) how to efficiently use pre-collected off-policy, non-optimal behavior data;
and (ii) how to mediate among different competing objectives and constraints.
We thus study the problem of batch policy learning under multiple constraints,
and offer a systematic solution. We first propose a flexible meta-algorithm
that admits any batch reinforcement learning and online learning procedure as
subroutines. We then present a specific algorithmic instantiation and provide
performance guarantees for the main objective and all constraints. To certify
constraint satisfaction, we propose a new and simple method for off-policy
policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves
strong empirical results in different domains, including the challenging
problem of simulated car driving subject to multiple constraints such as lane
keeping and smooth driving. We also show experimentally that our OPE method
outperforms other popular OPE techniques on a standalone basis, especially in a
high-dimensional setting.
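A minimal sketch of the Lagrangian game underlying this kind of constrained batch learning, under the assumption of a finite candidate policy class with pre-computed off-policy estimates standing in for the batch RL subroutine: the policy player best-responds to the current multipliers, and the lambda player runs projected gradient ascent on the constraint violations. All names and parameters here are illustrative, not the paper's instantiation.

```python
import numpy as np

def constrained_batch_policy_learning(obj_costs, cons_costs, thresholds,
                                      lam_bound=10.0, lr=0.5, iters=200):
    """Lagrangian-game sketch for batch policy learning under constraints.

    obj_costs:  (P,) off-policy estimates of the main objective per candidate policy.
    cons_costs: (P, M) off-policy estimates of M constraint costs per policy.
    thresholds: (M,) constraint thresholds (feasible iff cost <= threshold).

    The argmin over a finite candidate class stands in for a batch RL
    best-response subroutine; the multipliers are updated by projected
    gradient ascent on the constraint violations.
    """
    P, M = cons_costs.shape
    lam = np.zeros(M)
    mix = np.zeros(P)                       # empirical mixture over chosen policies
    for _ in range(iters):
        lagrangian = obj_costs + cons_costs @ lam
        best = int(np.argmin(lagrangian))   # policy player's best response
        mix[best] += 1.0
        violation = cons_costs[best] - thresholds
        lam = np.clip(lam + lr * violation, 0.0, lam_bound)  # ascent step, projected
    return mix / iters, lam                 # mixed policy and final multipliers
```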
Deep Reinforcement Learning from Self-Play in Imperfect-Information Games
Many real-world applications can be described as large-scale games of
imperfect information. To deal with these challenging domains, prior work has
focused on computing Nash equilibria in a handcrafted abstraction of the
domain. In this paper we introduce the first scalable end-to-end approach to
learning approximate Nash equilibria without prior domain knowledge. Our method
combines fictitious self-play with deep reinforcement learning. When applied to
Leduc poker, Neural Fictitious Self-Play (NFSP) approached a Nash equilibrium,
whereas common reinforcement learning methods diverged. In Limit Texas Hold'em,
a poker game of real-world scale, NFSP learnt a strategy that approached the
performance of state-of-the-art, superhuman algorithms based on significant
domain expertise.
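NFSP itself pairs deep RL (for approximate best responses) with supervised learning of an average policy; as a minimal, self-contained illustration of the underlying fictitious self-play principle, here is classical tabular fictitious play on rock-paper-scissors, where each player best-responds to the opponent's empirical average strategy and the averages converge to a Nash equilibrium. This is the idea NFSP scales up, not the paper's algorithm.

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player (zero-sum game).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def fictitious_play(A, iters=20000):
    """Classical fictitious play: each player best-responds to the opponent's
    empirical average strategy. In two-player zero-sum games the average
    strategies converge to a Nash equilibrium."""
    counts1 = np.ones(A.shape[0])   # row-player action counts (uniform start)
    counts2 = np.ones(A.shape[1])   # column-player action counts
    for _ in range(iters):
        avg1 = counts1 / counts1.sum()
        avg2 = counts2 / counts2.sum()
        counts1[np.argmax(A @ avg2)] += 1      # row best response to column average
        counts2[np.argmax(-(avg1 @ A))] += 1   # column best response (payoff is -A)
    return counts1 / counts1.sum(), counts2 / counts2.sum()

p1, p2 = fictitious_play(A)
print(p1, p2)   # both approach the uniform Nash equilibrium (1/3, 1/3, 1/3)
```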