Pure Exploration with Multiple Correct Answers
We determine the sample complexity of pure exploration bandit problems with
multiple good answers. We derive a lower bound using a new game equilibrium
argument. We show how continuity and convexity properties of single-answer
problems ensure that the Track-and-Stop algorithm has asymptotically optimal
sample complexity. However, that convexity is lost when going to the
multiple-answer setting. We present a new algorithm which extends
Track-and-Stop to the multiple-answer case and has asymptotic sample complexity
matching the lower bound.
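For orientation, the lower bound in question can be sketched in the standard fixed-confidence form (our notation and the usual Garivier-Kaufmann-style constant, not a quotation of the paper): any $\delta$-correct strategy on an instance $\mu$ with correct-answer set $\mathcal{O}(\mu)$ satisfies

\[ \mathbb{E}_\mu[\tau_\delta] \;\ge\; T^*(\mu)\,\ln\frac{1}{2.4\,\delta}, \qquad T^*(\mu)^{-1} \;=\; \max_{a \in \mathcal{O}(\mu)}\ \sup_{w \in \Delta_K}\ \inf_{\lambda \,:\, a \notin \mathcal{O}(\lambda)}\ \sum_{k=1}^{K} w_k\,\mathrm{KL}(\mu_k, \lambda_k). \]

With a single correct answer the outer max disappears and the optimisation has the continuity and convexity properties used above; with multiple answers the max breaks convexity, which is why a new algorithm is needed.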
A Formalization of Doob's Martingale Convergence Theorems in mathlib
We present the formalization of Doob's martingale convergence theorems in the
mathlib library for the Lean theorem prover. These theorems give conditions
under which (sub)martingales converge, almost everywhere or in $L^1$. In order
to formalize those results, we build a definition of the conditional
expectation in Banach spaces and develop the theory of stochastic processes,
stopping times and martingales. As an application of the convergence theorems,
we also present the formalization of Lévy's generalized Borel-Cantelli lemma.
This work on martingale theory is one of the first developments of probability
theory in mathlib, and it builds upon diverse parts of that library such as
topology, analysis, and, most importantly, measure theory.
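For reference, a textbook statement of the convergence results being formalized (our paraphrase, not the Lean statement itself): if $(X_n)$ is a submartingale with $\sup_n \mathbb{E}[X_n^+] < \infty$, then

\[ X_n \xrightarrow{\text{a.e.}} X_\infty \quad \text{with } X_\infty \in L^1, \]

and if the family $(X_n)$ is moreover uniformly integrable, the convergence also holds in $L^1$.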
Dealing with Unknown Variances in Best-Arm Identification
The problem of identifying the best arm among a collection of items with
Gaussian reward distributions is well understood when the variances are known.
Despite its practical relevance for many applications, few works have studied it with
unknown variances. In this paper we introduce and analyze two approaches to
deal with unknown variances, either by plugging in the empirical variance or by
adapting the transportation costs. In order to calibrate our two stopping
rules, we derive new time-uniform concentration inequalities, which are of
independent interest. We then illustrate the theoretical and empirical
performance of our two sampling-rule wrappers on Track-and-Stop and on a Top
Two algorithm. Moreover, by quantifying the impact of not knowing the variances
on the sample complexity, we show that it is rather small.
Comment: 73 pages, 5 figures, 3 tables. To be published in the proceedings of the 34th International Conference on Algorithmic Learning Theory, Singapore, 2023.
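As a minimal sketch of the first approach, here is the known-variance Gaussian transportation-cost statistic with the empirical variance plugged in (the function names, the leader-versus-challenger form, and the crude log(1/delta) threshold below are our illustrative assumptions; the paper calibrates its stopping rules with the time-uniform concentration inequalities mentioned above):

import numpy as np

def plugin_stat(means, emp_vars, counts, leader):
    """Gaussian GLR statistic between the empirical leader and each
    challenger, with empirical variances substituted for the unknown
    true ones; stopping uses the minimum over challengers."""
    z = np.inf
    for b in range(len(means)):
        if b == leader:
            continue
        gap = max(means[leader] - means[b], 0.0)
        denom = 2.0 * (emp_vars[leader] / counts[leader] + emp_vars[b] / counts[b])
        z = min(z, gap ** 2 / denom)
    return z

# Crude usage: stop and recommend the leader once the statistic clears a threshold.
means = np.array([0.9, 0.7, 0.3])
emp_vars = np.array([1.1, 0.9, 1.4])
counts = np.array([120, 110, 40])
leader = int(np.argmax(means))
delta = 0.05
if plugin_stat(means, emp_vars, counts, leader) > np.log(1.0 / delta):
    print(f"stop, recommend arm {leader}")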
Structure Adaptive Algorithms for Stochastic Bandits
We study reward maximisation in a wide class of structured stochastic
multi-armed bandit problems, where the mean rewards of arms satisfy some given
structural constraints, e.g. linear, unimodal, sparse, etc. Our aim is to
develop methods that are flexible (in that they easily adapt to different
structures), powerful (in that they perform well empirically and/or provably
match instance-dependent lower bounds) and efficient (in that the per-round
computational burden is small).
We develop asymptotically optimal algorithms from instance-dependent
lower bounds using iterative saddle-point solvers. Our approach generalises
recent iterative methods for pure exploration to reward maximisation, where a
major challenge arises from the estimation of the sub-optimality gaps and their
reciprocals. Still, we manage to achieve all the above desiderata. Notably, our
technique avoids the computational cost of the full-blown saddle point oracle
employed by previous work, while at the same time enabling finite-time regret
bounds.
Our experiments reveal that our method successfully leverages the structural
assumptions, while its regret is at worst comparable to that of vanilla UCB.
Comment: 10+18 pages. To be published in the proceedings of ICML 2020.
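The instance-dependent lower bound matched here can be sketched in the classical Graves-Lai form (our notation; the exact constraint set for each structure may differ in the paper): writing the gaps as $\Delta_a = \max_k \mu_k - \mu_a$, any uniformly efficient policy satisfies

\[ \liminf_{T \to \infty} \frac{\mathrm{Regret}(T)}{\log T} \;\ge\; C(\mu), \qquad C(\mu) = \inf_{w \ge 0} \sum_a w_a \Delta_a \quad \text{s.t.} \quad \sum_a w_a\,\mathrm{KL}(\mu_a, \lambda_a) \ge 1 \ \ \forall \lambda \in \Lambda(\mu), \]

where $\Lambda(\mu)$ is the set of alternative instances within the structure under which the optimal arm of $\mu$ is no longer optimal. The iterative saddle-point solvers above approximate this optimisation across rounds instead of solving it exactly at every step.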
Gamification of Pure Exploration for Linear Bandits
We investigate an active pure-exploration setting, which includes best-arm
identification, in the context of linear stochastic bandits. While
asymptotically optimal algorithms exist for standard multi-armed bandits, the
existence of such algorithms for best-arm identification in linear bandits
has been elusive despite several attempts to address it. First, we provide a
thorough comparison and new insights into the different notions of optimality in the
linear case, including G-optimality, transductive optimality from optimal
experimental design and asymptotic optimality. Second, we design the first
asymptotically optimal algorithm for fixed-confidence pure exploration in
linear bandits. As a consequence, our algorithm naturally bypasses the pitfall
caused by a simple but difficult instance, which most prior algorithms had to be
engineered to deal with explicitly. Finally, we avoid the need to fully solve
an optimal design problem by providing an approach that entails an efficient
implementation.
Comment: 11+25 pages. To be published in the proceedings of ICML 2020.
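For orientation, one of the optimality notions compared above has a compact textbook form (standard experimental-design notation, our sketch): a G-optimal design is a distribution $w$ over the arm set $\mathcal{X} \subset \mathbb{R}^d$ minimising the worst-case prediction variance,

\[ w^\star \in \arg\min_{w \in \Delta(\mathcal{X})}\ \max_{x \in \mathcal{X}}\ \|x\|^2_{A(w)^{-1}}, \qquad A(w) = \sum_{x \in \mathcal{X}} w_x\, x x^\top, \]

and by the Kiefer-Wolfowitz theorem the optimal value equals $d$. Transductive and asymptotic notions replace the max over $\mathcal{X}$ with other target sets or instance-dependent weights.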
Non-Asymptotic Pure Exploration by Solving Games
Pure exploration (aka active testing) is the fundamental task of sequentially gathering information to answer a query about a stochastic environment. Good algorithms make few mistakes and take few samples. Lower bounds (for multi-armed bandit models with arms in an exponential family) reveal that the sample complexity is determined by the solution to an optimisation problem. Existing state-of-the-art algorithms achieve asymptotic optimality by solving a plug-in estimate of that optimisation problem at each step. We interpret the optimisation problem as an unknown game, and propose sampling rules based on iterative strategies to estimate and converge to its saddle point. We apply no-regret learners to obtain the first finite-confidence guarantees that are adapted to the exponential family and which apply to any pure exploration query and bandit structure. Moreover, our algorithms only use a best-response oracle instead of fully solving the optimisation problem.
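As a hedged illustration of the learner-versus-oracle idea, here is a minimal sketch for unit-variance Gaussian best-arm identification (the AdaHedge and optimism details of the actual algorithms are omitted; the step size, iteration count and function names are our choices). The allocation player runs exponential weights against a closed-form best response of the alternative player, and the averaged allocation approximates the optimal proportions in the sample-complexity bound:

import numpy as np

def best_response(mu, w, leader):
    """Best response of the alternative player for unit-variance Gaussian
    arms: for each challenger b != leader, the closest alternative merges
    arms (leader, b) at their w-weighted mean. Returns the minimal
    transportation cost and, via the envelope theorem, its gradient in w."""
    K = len(mu)
    best_val, grad = np.inf, np.zeros(K)
    for b in range(K):
        if b == leader:
            continue
        wa, wb = w[leader], w[b]
        m = (wa * mu[leader] + wb * mu[b]) / (wa + wb)  # weighted midpoint
        val = 0.5 * (wa * (mu[leader] - m) ** 2 + wb * (mu[b] - m) ** 2)
        if val < best_val:
            best_val = val
            g = np.zeros(K)
            g[leader] = 0.5 * (mu[leader] - m) ** 2
            g[b] = 0.5 * (mu[b] - m) ** 2
            grad = g
    return best_val, grad

def solve_game(mu, iters=2000, lr=0.5):
    """Exponential weights (Hedge) for the allocation player against the
    best-response oracle; the game value is concave in w, so ascending
    the supergradient converges to the saddle point."""
    K = len(mu)
    leader = int(np.argmax(mu))
    logits = np.zeros(K)
    avg_w = np.zeros(K)
    for _ in range(iters):
        w = np.exp(logits - logits.max())
        w /= w.sum()
        _, grad = best_response(mu, w, leader)
        logits += lr * grad  # mirror ascent step with entropy regulariser
        avg_w += w
    return avg_w / iters

# Example: approximate optimal sampling proportions for a 3-arm instance.
print(solve_game(np.array([1.0, 0.8, 0.5])))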