Online Stochastic Optimization under Correlated Bandit Feedback
In this paper we consider the problem of online stochastic optimization of a
locally smooth function under bandit feedback. We introduce the high-confidence
tree (HCT) algorithm, a novel anytime $\mathcal{X}$-armed bandit algorithm,
and derive regret bounds matching the performance of the existing state of the
art in terms of the dependency on the number of steps and the smoothness factor. The main
advantage of HCT is that it handles the challenging case of correlated rewards,
whereas existing methods require that the reward-generating process of each arm
is an independent and identically distributed (i.i.d.) random process. HCT also
improves on the state of the art in terms of its memory requirement, as well as
requiring a weaker smoothness assumption on the mean-reward function compared
to previous anytime algorithms. Finally, we discuss how HCT can be applied
to the problem of policy search in reinforcement learning and we report
preliminary empirical results.
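As a toy illustration of the tree-based optimistic principle behind such $\mathcal{X}$-armed bandit methods, the sketch below samples midpoints of an adaptively refined binary partition of $[0,1]$ under noisy feedback. This is not the authors' HCT: the node score, splitting rule, and constants are simplifying assumptions.

```python
import math
import random

# A toy sketch (assumptions throughout): optimistic optimization over [0, 1]
# with a binary partition tree, in the spirit of HCT/HOO-style X-armed
# bandits. This illustrates the idea only; it is not the authors' HCT.

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.n, self.total = 0, 0.0
        self.children = None

    def mid(self):
        return 0.5 * (self.lo + self.hi)

def b_value(node, t):
    # Optimistic score: empirical mean + UCB bonus + smoothness bonus.
    if node.n == 0:
        return float("inf")
    mean = node.total / node.n
    conf = math.sqrt(2.0 * math.log(t + 1) / node.n)
    diam = node.hi - node.lo
    return mean + conf + diam

def optimize(f, budget, noise=0.1, seed=0):
    rng = random.Random(seed)
    root = Node(0.0, 1.0)
    best_x, best_y = root.mid(), -float("inf")
    for t in range(1, budget + 1):
        # Descend the tree, always following the most optimistic child.
        node = root
        while node.children is not None:
            node = max(node.children, key=lambda c: b_value(c, t))
        x = node.mid()
        y = f(x) + rng.gauss(0.0, noise)  # noisy bandit feedback
        node.n += 1
        node.total += y
        if y > best_y:
            best_x, best_y = x, y
        # Split a cell once it has been sampled a couple of times.
        if node.n >= 2:
            m = node.mid()
            node.children = [Node(node.lo, m), Node(m, node.hi)]
    return best_x

x_hat = optimize(lambda x: 1.0 - (x - 0.3) ** 2, budget=2000)
```

The key design point mirrored here is that a cell's score combines a confidence bonus (bandit uncertainty) with a diameter bonus (local smoothness), so sampling concentrates on ever-smaller cells around a global optimum.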
A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption
We study the problem of optimizing a function under a \emph{budgeted number
of evaluations}. We only assume that the function is \emph{locally} smooth
around one of its global optima. The difficulty of optimization is measured in
terms of 1) the amount of \emph{noise} $b$ of the function evaluation and 2)
the local smoothness, $d$, of the function. A smaller $d$ results in a smaller
optimization error. We present a new, simple, and parameter-free approach.
First, for all values of $b$ and $d$, this approach recovers at least the
state-of-the-art regret guarantees. Second, our approach additionally obtains
these results while being \textit{agnostic} to the values of both $b$ and $d$.
This leads to the first algorithm that naturally adapts to an \textit{unknown}
range of noise and leads to significant improvements in the moderate- and
low-noise regimes. Third, our approach also obtains a remarkable improvement
over the state-of-the-art SOO algorithm when the noise is very low, which
includes the case of optimization under deterministic feedback ($b=0$). There,
under our minimal local smoothness assumption, this improvement is of
exponential magnitude and holds for a class of functions that covers the vast
majority of functions that practitioners optimize. We show that this
algorithmic improvement is borne out in experiments, with empirically faster
convergence on common benchmarks.
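For context, the SOO baseline mentioned above can be sketched in a few lines. The version below is a simplified deterministic variant (the trisection rule and cell bookkeeping are assumptions made for brevity): at each sweep it expands, per depth, the best cell whose value beats every shallower depth.

```python
# A toy sketch (assumptions throughout): a minimal SOO-style deterministic
# optimizer on [0, 1], in the spirit of the baseline the abstract improves
# on. Cell handling and the trisection rule are simplifications.
def soo(f, budget):
    # leaves[d] holds unexpanded cells (value at midpoint, lo, hi) at depth d.
    v0 = f(0.5)
    leaves = {0: [(v0, 0.0, 1.0)]}
    evals = 1
    best_y, best_x = v0, 0.5
    while evals < budget:
        v_max = -float("inf")
        for d in sorted(leaves):
            if not leaves[d]:
                continue
            i = max(range(len(leaves[d])), key=lambda j: leaves[d][j][0])
            v, lo, hi = leaves[d][i]
            if v <= v_max:
                continue  # only expand cells better than all shallower ones
            v_max = v
            leaves[d].pop(i)
            third = (hi - lo) / 3.0
            for k in range(3):  # trisect and evaluate each child's midpoint
                clo, chi = lo + k * third, lo + (k + 1) * third
                x = 0.5 * (clo + chi)
                y = f(x)
                evals += 1
                leaves.setdefault(d + 1, []).append((y, clo, chi))
                if y > best_y:
                    best_y, best_x = y, x
                if evals >= budget:
                    break
            if evals >= budget:
                break
        if evals >= budget:
            break
    return best_x

x_soo = soo(lambda x: 1.0 - (x - 0.3) ** 2, budget=300)
```

Note that SOO needs no smoothness parameters as input, which is the property the abstract's approach preserves while also handling an unknown noise range.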
On the robustness of learning in games with stochastically perturbed payoff observations
Motivated by the scarcity of accurate payoff feedback in practical
applications of game theory, we examine a class of learning dynamics where
players adjust their choices based on past payoff observations that are subject
to noise and random disturbances. First, in the single-player case
(corresponding to an agent trying to adapt to an arbitrarily changing
environment), we show that the stochastic dynamics under study lead to no
regret almost surely, irrespective of the noise level in the player's
observations. In the multi-player case, we find that dominated strategies
become extinct and we show that strict Nash equilibria are stochastically
stable and attracting; conversely, if a state is stable or attracting with
positive probability, then it is a Nash equilibrium. Finally, we provide an
averaging principle for 2-player games, and we show that in zero-sum games with
an interior equilibrium, time averages converge to Nash equilibrium for any
noise level.
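The single-player no-regret claim can be illustrated with a discrete-time analogue. The paper studies continuous-time dynamics; the simulation below (payoffs, noise level, and step size are all hypothetical choices) runs exponential weights on stochastically perturbed payoff observations and measures regret against the best fixed action.

```python
import math
import random

# A toy sketch (assumptions throughout): discrete-time exponential-weights
# learning from stochastically perturbed payoff observations, illustrating
# the qualitative claim that noisy feedback still leads to no regret.
def exp_weights_regret(payoffs, horizon, noise_std, eta, seed=0):
    rng = random.Random(seed)
    scores = [0.0] * len(payoffs)
    cum_gain = 0.0
    for _ in range(horizon):
        m = max(scores)  # shift scores for numerical stability
        w = [math.exp(eta * (s - m)) for s in scores]
        z = sum(w)
        p = [wi / z for wi in w]
        # True expected payoff of this round's mixed strategy.
        cum_gain += sum(pi * u for pi, u in zip(p, payoffs))
        # Update scores from noisy payoff observations.
        for a, u in enumerate(payoffs):
            scores[a] += u + rng.gauss(0.0, noise_std)
    best = max(payoffs) * horizon  # best fixed action in hindsight
    return best - cum_gain

regret = exp_weights_regret([0.2, 0.5, 0.9], horizon=5000,
                            noise_std=1.0, eta=0.05)
```

Even with observation noise whose standard deviation exceeds the payoff gaps, the per-round regret shrinks toward zero, matching the qualitative robustness result.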
Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications
Partial monitoring is an expressive framework for sequential decision-making
with an abundance of applications, including graph-structured and dueling
bandits, dynamic pricing and transductive feedback models. We survey and extend
recent results on the linear formulation of partial monitoring that naturally
generalizes the standard linear bandit setting. The main result is that a
single algorithm, information-directed sampling (IDS), is (nearly) worst-case
rate optimal in all finite-action games. We present a simple and unified
analysis of stochastic partial monitoring, and further extend the model to the
contextual and kernelized setting.
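As a crude caricature of the information-directed trade-off: IDS scores actions by the information ratio, a squared regret estimate divided by an information-gain estimate. The paper's IDS randomizes over action distributions and uses partial-monitoring-specific estimates; the deterministic, per-action version below with hypothetical numbers only shows the idea.

```python
# A crude caricature (all estimates and numbers are hypothetical): pick the
# action minimizing the information ratio
#   (squared regret estimate) / (information-gain estimate),
# rather than greedily minimizing estimated regret alone.
def ids_action(regret_est, info_est):
    ratios = [(r * r) / max(i, 1e-12) for r, i in zip(regret_est, info_est)]
    return min(range(len(ratios)), key=lambda a: ratios[a])

# The greedy choice would be action 0 (lowest estimated regret), but its
# observations are nearly uninformative; the IDS-style rule prefers action 1,
# trading slightly more regret for much more information.
chosen = ids_action(regret_est=[0.1, 0.2, 0.5], info_est=[0.01, 0.8, 0.3])
```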
Local and adaptive mirror descents in extensive-form games
We study how to learn $\varepsilon$-optimal strategies in zero-sum imperfect
information games (IIG) with trajectory feedback. In this setting, players
update their policies sequentially based on their observations over a fixed
number of episodes, denoted by $T$. Existing procedures suffer from high
variance due to the use of importance sampling over sequences of actions
(Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we
consider a fixed sampling approach, where players still update their policies
over time, but with observations obtained through a given fixed sampling
policy. Our approach is based on an adaptive Online Mirror Descent (OMD)
algorithm that applies OMD locally to each information set, using individually
decreasing learning rates and a regularized loss. We show that this approach
guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with high
probability and has a near-optimal dependence on the game parameters when
applied with the best theoretical choices of learning rates and sampling
policies. To achieve these results, we generalize the notion of OMD
stabilization, allowing for time-varying regularization with convex increments
- …
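The basic building block applied locally at each information set can be sketched as a single entropic OMD (multiplicative-weights) update on the probability simplex. The fixed loss vector and the decreasing rate $\eta_t = 1/\sqrt{t}$ below are illustrative assumptions, not the paper's per-information-set learning rates or regularized losses.

```python
import math

# A toy sketch (assumptions throughout): entropic Online Mirror Descent on
# the simplex. The paper applies such updates locally at each information
# set; here a single simplex and a fixed loss vector stand in for that.
def omd_step(policy, loss, eta):
    # Entropic OMD in closed form: p'_a is proportional to p_a * exp(-eta * loss_a).
    w = [p * math.exp(-eta * l) for p, l in zip(policy, loss)]
    z = sum(w)
    return [wi / z for wi in w]

# Run with a decreasing learning rate eta_t = 1 / sqrt(t) against a fixed
# loss vector; the policy concentrates on the lowest-loss action.
policy = [1.0 / 3.0] * 3
for t in range(1, 501):
    policy = omd_step(policy, [0.9, 0.5, 0.1], eta=1.0 / math.sqrt(t))
```

With entropic regularization the mirror step has this closed multiplicative form, which is what makes a local, per-information-set application cheap.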