Policy Optimization as Online Learning with Mediator Feedback
Policy Optimization (PO) is a widely used approach to address continuous
control tasks. In this paper, we introduce the notion of mediator feedback that
frames PO as an online learning problem over the policy space. The additional
available information, compared to the standard bandit feedback, allows reusing
samples generated by one policy to estimate the performance of other policies.
Based on this observation, we propose an algorithm, RANDomized-exploration
policy Optimization via Multiple Importance Sampling with Truncation
(RANDOMIST), for regret minimization in PO, that employs a randomized
exploration strategy, unlike the existing optimistic approaches. When
the policy space is finite, we show that under certain circumstances, it is
possible to achieve constant regret, while always enjoying logarithmic regret.
We also derive problem-dependent regret lower bounds. Then, we extend RANDOMIST
to compact policy spaces. Finally, we provide numerical simulations on finite
and compact policy spaces, in comparison with PO and bandit baselines.
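The sample-reuse idea in the abstract above can be illustrated with a truncated multiple importance sampling estimate. The following is a minimal sketch under stated assumptions, not the RANDOMIST algorithm itself: policies are modeled as unit-variance Gaussians over a scalar outcome, the weights use the balance heuristic, and the names `gaussian_pdf`, `truncated_mis_estimate`, `payoff`, and `threshold` are hypothetical, chosen only for illustration.

```python
import math
import random


def gaussian_pdf(z, mean, std=1.0):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def truncated_mis_estimate(samples, behavior_means, counts, target_mean, payoff, threshold):
    """Estimate the target policy's expected payoff from other policies' samples.

    Each sample is weighted by target density / mixture density (balance
    heuristic), and the weight is capped at `threshold` to control variance.
    """
    n = sum(counts)
    total = 0.0
    for z in samples:
        # Mixture of the behavior policies, weighted by their sample shares.
        mixture = sum(c / n * gaussian_pdf(z, m) for m, c in zip(behavior_means, counts))
        weight = min(gaussian_pdf(z, target_mean) / mixture, threshold)
        total += weight * payoff(z)
    return total / n


# Illustrative setup (assumed): two behavior policies, one unseen target policy.
rng = random.Random(0)
behavior_means = [0.0, 1.0]
counts = [500, 500]
samples = [rng.gauss(m, 1.0) for m, c in zip(behavior_means, counts) for _ in range(c)]
estimate = truncated_mis_estimate(
    samples, behavior_means, counts, target_mean=0.5, payoff=lambda z: z, threshold=5.0
)
print(estimate)  # should be close to the target policy's true mean payoff, 0.5
```

In this setup the target policy was never executed, yet its performance is estimated from the other policies' samples. Truncation only takes effect when the target density far exceeds the mixture density, at the cost of some bias.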
A Contextual Bandit Bake-off
Contextual bandit algorithms are essential for solving many real-world
interactive machine learning problems. Despite multiple recent successes on
statistically and computationally efficient methods, the practical behavior of
these algorithms is still poorly understood. We leverage the availability of
large numbers of supervised learning datasets to empirically evaluate
contextual bandit algorithms, focusing on practical methods that learn by
relying on optimization oracles from supervised learning. We find that a recent
method (Foster et al., 2018) using optimism under uncertainty works best
overall. A surprisingly close second is a simple greedy baseline that only
explores implicitly through the diversity of contexts, followed by a variant of
Online Cover (Agarwal et al., 2014) which tends to be more conservative but
robust to problem specification by design. Along the way, we also evaluate
various components of contextual bandit algorithm design such as loss
estimators. Overall, this is a thorough study and review of contextual bandit
methodology.
Matroid Bandits: Fast Combinatorial Optimization with Learning
A matroid is a notion of independence in combinatorial optimization which is
closely related to computational efficiency. In particular, it is well known
that the maximum of a constrained modular function can be found greedily if and
only if the constraints are associated with a matroid. In this paper, we bring
together the ideas of bandits and matroids, and propose a new class of
combinatorial bandits, matroid bandits. The objective in these problems is to
learn how to maximize a modular function on a matroid. This function is
stochastic and initially unknown. We propose a practical algorithm for solving
our problem, Optimistic Matroid Maximization (OMM), and prove two upper bounds,
gap-dependent and gap-free, on its regret. Both bounds are sublinear in time
and at most linear in all other quantities of interest. The gap-dependent upper
bound is tight and we prove a matching lower bound on a partition matroid
bandit. Finally, we evaluate our method on three real-world problems and show
that it is practical.
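The greedy property mentioned in the abstract above can be sketched on a small partition matroid. This is an illustration with known, non-negative weights, not the OMM algorithm, which must learn the weights from stochastic feedback. The names `greedy_max_weight_basis` and `partition_oracle`, and the example items, are hypothetical.

```python
from collections import Counter


def greedy_max_weight_basis(weights, is_independent):
    """Scan items by decreasing weight, keeping each one that preserves independence."""
    chosen = []
    for item in sorted(weights, key=weights.get, reverse=True):
        if is_independent(chosen + [item]):
            chosen.append(item)
    return chosen


def partition_oracle(part_of, capacity):
    """Independence oracle for a partition matroid: at most capacity[p] items per part p."""
    def is_independent(items):
        counts = Counter(part_of[i] for i in items)
        return all(counts[p] <= capacity[p] for p in counts)
    return is_independent


# Illustrative instance (assumed): four items in two parts, one item allowed per part.
weights = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.4}
part_of = {"a": "x", "b": "x", "c": "y", "d": "y"}
capacity = {"x": 1, "y": 1}

basis = greedy_max_weight_basis(weights, partition_oracle(part_of, capacity))
print(basis)  # ['a', 'c']
```

Because the constraint is a matroid, the greedy choice is optimal for the modular (additive) objective: the selected set has total weight 1.4, the maximum over all independent sets in this instance.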