Online Stochastic Optimization under Correlated Bandit Feedback
In this paper we consider the problem of online stochastic optimization of a
locally smooth function under bandit feedback. We introduce the high-confidence
tree (HCT) algorithm, a novel anytime $\mathcal{X}$-armed bandit algorithm,
and derive regret bounds matching the performance of the existing state of the
art in terms of the dependency on the number of steps and the smoothness factor. The main
advantage of HCT is that it handles the challenging case of correlated rewards,
whereas existing methods require that the reward-generating process of each arm
is an independent and identically distributed (i.i.d.) random process. HCT also
improves on the state of the art in terms of its memory requirement, as well as
requiring a weaker smoothness assumption on the mean-reward function compared
to previous anytime algorithms. Finally, we discuss how HCT can be applied
to the problem of policy search in reinforcement learning and we report
preliminary empirical results.
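As a toy illustration of the tree-based optimistic principle behind such $\mathcal{X}$-armed bandit methods, the sketch below samples midpoints of an adaptively refined binary partition of $[0,1]$ under noisy feedback. This is not the authors' HCT: the node score, splitting rule, and constants are simplifying assumptions.

```python
import math
import random

# A toy sketch (assumptions throughout): optimistic optimization over [0, 1]
# with a binary partition tree, in the spirit of HCT/HOO-style X-armed
# bandits. This illustrates the idea only; it is not the authors' HCT.

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.n, self.total = 0, 0.0
        self.children = None

    def mid(self):
        return 0.5 * (self.lo + self.hi)

def b_value(node, t):
    # Optimistic score: empirical mean + UCB bonus + smoothness bonus.
    if node.n == 0:
        return float("inf")
    mean = node.total / node.n
    conf = math.sqrt(2.0 * math.log(t + 1) / node.n)
    diam = node.hi - node.lo
    return mean + conf + diam

def optimize(f, budget, noise=0.1, seed=0):
    rng = random.Random(seed)
    root = Node(0.0, 1.0)
    best_x, best_y = root.mid(), -float("inf")
    for t in range(1, budget + 1):
        # Descend the tree, always following the most optimistic child.
        node = root
        while node.children is not None:
            node = max(node.children, key=lambda c: b_value(c, t))
        x = node.mid()
        y = f(x) + rng.gauss(0.0, noise)  # noisy bandit feedback
        node.n += 1
        node.total += y
        if y > best_y:
            best_x, best_y = x, y
        # Split a cell once it has been sampled a couple of times.
        if node.n >= 2:
            m = node.mid()
            node.children = [Node(node.lo, m), Node(m, node.hi)]
    return best_x

x_hat = optimize(lambda x: 1.0 - (x - 0.3) ** 2, budget=2000)
```

The key design point mirrored here is that a cell's score combines a confidence bonus (bandit uncertainty) with a diameter bonus (local smoothness), so sampling concentrates on ever-smaller cells around a global optimum.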
A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption
We study the problem of optimizing a function under a \emph{budgeted number
of evaluations}. We only assume that the function is \emph{locally} smooth
around one of its global optima. The difficulty of optimization is measured in
terms of 1) the amount of \emph{noise} $b$ of the function evaluation and 2)
the local smoothness, $d$, of the function. A smaller $d$ results in a smaller
optimization error. We present a new, simple, and parameter-free approach.
First, for all values of $b$ and $d$, this approach recovers at least the
state-of-the-art regret guarantees. Second, our approach additionally obtains
these results while being \textit{agnostic} to the values of both $b$ and $d$.
This leads to the first algorithm that naturally adapts to an \textit{unknown}
range of noise and leads to significant improvements in the moderate- and
low-noise regimes. Third, our approach also obtains a remarkable improvement
over the state-of-the-art SOO algorithm when the noise is very low, which
includes the case of optimization under deterministic feedback ($b=0$). There,
under our minimal local smoothness assumption, this improvement is of
exponential magnitude and holds for a class of functions that covers the vast
majority of functions that practitioners optimize. We show that this
algorithmic improvement is borne out in experiments, with empirically faster
convergence on common benchmarks.
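For context, the SOO baseline mentioned above can be sketched in a few lines. The version below is a simplified deterministic variant (the trisection rule and cell bookkeeping are assumptions made for brevity): at each sweep it expands, per depth, the best cell whose value beats every shallower depth.

```python
# A toy sketch (assumptions throughout): a minimal SOO-style deterministic
# optimizer on [0, 1], in the spirit of the baseline the abstract improves
# on. Cell handling and the trisection rule are simplifications.
def soo(f, budget):
    # leaves[d] holds unexpanded cells (value at midpoint, lo, hi) at depth d.
    v0 = f(0.5)
    leaves = {0: [(v0, 0.0, 1.0)]}
    evals = 1
    best_y, best_x = v0, 0.5
    while evals < budget:
        v_max = -float("inf")
        for d in sorted(leaves):
            if not leaves[d]:
                continue
            i = max(range(len(leaves[d])), key=lambda j: leaves[d][j][0])
            v, lo, hi = leaves[d][i]
            if v <= v_max:
                continue  # only expand cells better than all shallower ones
            v_max = v
            leaves[d].pop(i)
            third = (hi - lo) / 3.0
            for k in range(3):  # trisect and evaluate each child's midpoint
                clo, chi = lo + k * third, lo + (k + 1) * third
                x = 0.5 * (clo + chi)
                y = f(x)
                evals += 1
                leaves.setdefault(d + 1, []).append((y, clo, chi))
                if y > best_y:
                    best_y, best_x = y, x
                if evals >= budget:
                    break
            if evals >= budget:
                break
        if evals >= budget:
            break
    return best_x

x_soo = soo(lambda x: 1.0 - (x - 0.3) ** 2, budget=300)
```

Note that SOO needs no smoothness parameters as input, which is the property the abstract's approach preserves while also handling an unknown noise range.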
On the robustness of learning in games with stochastically perturbed payoff observations
Motivated by the scarcity of accurate payoff feedback in practical
applications of game theory, we examine a class of learning dynamics where
players adjust their choices based on past payoff observations that are subject
to noise and random disturbances. First, in the single-player case
(corresponding to an agent trying to adapt to an arbitrarily changing
environment), we show that the stochastic dynamics under study lead to no
regret almost surely, irrespective of the noise level in the player's
observations. In the multi-player case, we find that dominated strategies
become extinct and we show that strict Nash equilibria are stochastically
stable and attracting; conversely, if a state is stable or attracting with
positive probability, then it is a Nash equilibrium. Finally, we provide an
averaging principle for 2-player games, and we show that in zero-sum games with
an interior equilibrium, time averages converge to Nash equilibrium for any
noise level.
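The single-player no-regret claim can be illustrated with a discrete-time analogue. The paper studies continuous-time dynamics; the simulation below (payoffs, noise level, and step size are all hypothetical choices) runs exponential weights on stochastically perturbed payoff observations and measures regret against the best fixed action.

```python
import math
import random

# A toy sketch (assumptions throughout): discrete-time exponential-weights
# learning from stochastically perturbed payoff observations, illustrating
# the qualitative claim that noisy feedback still leads to no regret.
def exp_weights_regret(payoffs, horizon, noise_std, eta, seed=0):
    rng = random.Random(seed)
    scores = [0.0] * len(payoffs)
    cum_gain = 0.0
    for _ in range(horizon):
        m = max(scores)  # shift scores for numerical stability
        w = [math.exp(eta * (s - m)) for s in scores]
        z = sum(w)
        p = [wi / z for wi in w]
        # True expected payoff of this round's mixed strategy.
        cum_gain += sum(pi * u for pi, u in zip(p, payoffs))
        # Update scores from noisy payoff observations.
        for a, u in enumerate(payoffs):
            scores[a] += u + rng.gauss(0.0, noise_std)
    best = max(payoffs) * horizon  # best fixed action in hindsight
    return best - cum_gain

regret = exp_weights_regret([0.2, 0.5, 0.9], horizon=5000,
                            noise_std=1.0, eta=0.05)
```

Even with observation noise whose standard deviation exceeds the payoff gaps, the per-round regret shrinks toward zero, matching the qualitative robustness result.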
Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications
Partial monitoring is an expressive framework for sequential decision-making
with an abundance of applications, including graph-structured and dueling
bandits, dynamic pricing and transductive feedback models. We survey and extend
recent results on the linear formulation of partial monitoring that naturally
generalizes the standard linear bandit setting. The main result is that a
single algorithm, information-directed sampling (IDS), is (nearly) worst-case
rate optimal in all finite-action games. We present a simple and unified
analysis of stochastic partial monitoring, and further extend the model to the
contextual and kernelized setting.
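As a crude caricature of the information-directed trade-off: IDS scores actions by the information ratio, a squared regret estimate divided by an information-gain estimate. The paper's IDS randomizes over action distributions and uses partial-monitoring-specific estimates; the deterministic, per-action version below with hypothetical numbers only shows the idea.

```python
# A crude caricature (all estimates and numbers are hypothetical): pick the
# action minimizing the information ratio
#   (squared regret estimate) / (information-gain estimate),
# rather than greedily minimizing estimated regret alone.
def ids_action(regret_est, info_est):
    ratios = [(r * r) / max(i, 1e-12) for r, i in zip(regret_est, info_est)]
    return min(range(len(ratios)), key=lambda a: ratios[a])

# The greedy choice would be action 0 (lowest estimated regret), but its
# observations are nearly uninformative; the IDS-style rule prefers action 1,
# trading slightly more regret for much more information.
chosen = ids_action(regret_est=[0.1, 0.2, 0.5], info_est=[0.01, 0.8, 0.3])
```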
Local and adaptive mirror descents in extensive-form games
We study how to learn $\varepsilon$-optimal strategies in zero-sum imperfect
information games (IIG) with trajectory feedback. In this setting, players
update their policies sequentially based on their observations over a fixed
number of episodes, denoted by $T$. Existing procedures suffer from high
variance due to the use of importance sampling over sequences of actions
(Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we
consider a fixed sampling approach, where players still update their policies
over time, but with observations obtained through a given fixed sampling
policy. Our approach is based on an adaptive Online Mirror Descent (OMD)
algorithm that applies OMD locally to each information set, using individually
decreasing learning rates and a regularized loss. We show that this approach
guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with high
probability and has a near-optimal dependence on the game parameters when
applied with the best theoretical choices of learning rates and sampling
policies. To achieve these results, we generalize the notion of OMD
stabilization, allowing for time-varying regularization with convex increments
- …
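The basic building block applied locally at each information set can be sketched as a single entropic OMD (multiplicative-weights) update on the probability simplex. The fixed loss vector and the decreasing rate $\eta_t = 1/\sqrt{t}$ below are illustrative assumptions, not the paper's per-information-set learning rates or regularized losses.

```python
import math

# A toy sketch (assumptions throughout): entropic Online Mirror Descent on
# the simplex. The paper applies such updates locally at each information
# set; here a single simplex and a fixed loss vector stand in for that.
def omd_step(policy, loss, eta):
    # Entropic OMD in closed form: p'_a is proportional to p_a * exp(-eta * loss_a).
    w = [p * math.exp(-eta * l) for p, l in zip(policy, loss)]
    z = sum(w)
    return [wi / z for wi in w]

# Run with a decreasing learning rate eta_t = 1 / sqrt(t) against a fixed
# loss vector; the policy concentrates on the lowest-loss action.
policy = [1.0 / 3.0] * 3
for t in range(1, 501):
    policy = omd_step(policy, [0.9, 0.5, 0.1], eta=1.0 / math.sqrt(t))
```

With entropic regularization the mirror step has this closed multiplicative form, which is what makes a local, per-information-set application cheap.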