Calibration and Internal no-Regret with Partial Monitoring
Calibrated strategies can be obtained by playing strategies that have no
internal regret in some auxiliary game. Such strategies can be constructed
explicitly, via Blackwell's approachability theorem, in another auxiliary
game. We establish the converse: a strategy that approaches a convex set can
be derived from the construction of a calibrated strategy. We develop these
tools in the framework of a game with partial monitoring, where players do
not observe the actions of their opponents but receive random signals, to
define a notion of internal regret and construct strategies that have no such
regret.
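Internal regret compares the realized play with the best "swap" modification in hindsight: every round where action i was played is retroactively replaced by some other action j. A minimal sketch of this quantity, assuming observable loss vectors rather than the signal structure studied in the paper:

```python
def internal_regret(actions, losses):
    """Largest cumulative gain from retroactively replacing every play
    of some action i by another action j (losses are to be minimized)."""
    n = len(losses[0])
    best = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # gain from swapping i -> j on exactly the rounds where i was played
            swap_gain = sum(l[a] - l[j] for a, l in zip(actions, losses) if a == i)
            best = max(best, swap_gain)
    return best
```

A strategy has no internal regret when this quantity grows sublinearly in the number of rounds; the paper extends the definition to partial monitoring, where the losses themselves are not observed.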
Approachability of Convex Sets in Games with Partial Monitoring
We provide a necessary and sufficient condition under which a convex set is
approachable in a game with partial monitoring, i.e. where players do not
observe their opponents' moves but receive random signals. This condition
extends Blackwell's criterion from the full monitoring framework, where
players observe at least their payoffs. When our condition is fulfilled, we
explicitly construct an approachability strategy, derived from a strategy
satisfying an internal consistency property in an auxiliary game. We also
provide an example of a convex set that is neither (weakly) approachable nor
(weakly) excludable, a situation that cannot occur in the full monitoring case.
We finally apply our result to describe an ε-optimal strategy of the
uninformed player in a zero-sum repeated game with incomplete information on
one side.
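In the full monitoring case, Blackwell's criterion for a halfspace target {x : u·x ≤ c} reduces to a scalar zero-sum game: the halfspace is approachable iff the first player has a mixed action guaranteeing u·r(p, b) ≤ c against every pure action b. A small numerical sketch of that check, on a hypothetical 2x2 vector-payoff game (the game and the grid resolution are illustrative assumptions):

```python
# Hypothetical 2x2 vector-payoff game: r[(a, b)] is a payoff vector in R^2.
r = {(0, 0): (1.0, 0.0), (0, 1): (0.0, 1.0),
     (1, 0): (0.0, -1.0), (1, 1): (-1.0, 0.0)}

def halfspace_approachable(u, c, grid=200):
    """Approximate Blackwell check for the halfspace {x : u.x <= c}:
    search a grid of mixed actions p for one whose worst-case scalar
    payoff u.r(p, b) over pure responses b is at most c."""
    best = float("inf")
    for k in range(grid + 1):
        p = (k / grid, 1 - k / grid)  # mixed action of player 1
        worst = max(
            sum(p[a] * sum(ui * ri for ui, ri in zip(u, r[(a, b)]))
                for a in range(2))
            for b in range(2))
        best = min(best, worst)
    return best <= c + 1e-9
```

The paper's point is precisely that under partial monitoring this reduction is no longer available as stated, and the extended condition must be formulated on the signal structure.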
Gains and Losses are Fundamentally Different in Regret Minimization: The Sparse Case
We demonstrate that, in the classical non-stochastic regret minimization
problem, gains and losses, to be respectively maximized or minimized, are
fundamentally different. Indeed, under an additional sparsity assumption (at
each stage, only a small number of decisions incur a nonzero outcome), we
derive optimal regret bounds of different orders. Specifically, with gains,
we obtain an optimal regret guarantee in which the classical dependency on
the dimension is replaced by the sparsity size. With losses, we provide
matching upper and lower bounds whose order decreases with the dimension.
Finally, we also study the bandit setting and obtain an upper bound when
outcomes are losses; this bound is proven to be optimal up to a logarithmic
factor.
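The full-information baseline behind such results is the exponential-weights (Hedge) forecaster. A minimal sketch of standard Hedge on loss vectors (not the paper's refined sparse variant; the learning rate eta is left as a parameter):

```python
import math

def hedge(loss_rows, eta):
    """Exponential-weights forecaster on a sequence of loss vectors in [0, 1].
    Returns the regret of its expected loss against the best fixed decision."""
    d = len(loss_rows[0])
    w = [1.0] * d
    total = 0.0
    for losses in loss_rows:
        s = sum(w)
        p = [wi / s for wi in w]                       # current mixed decision
        total += sum(pi * li for pi, li in zip(p, losses))  # expected loss
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
    best = min(sum(row[i] for row in loss_rows) for i in range(d))
    return total - best
```

With losses in [0, 1], the classical guarantee is regret at most log(d)/eta + eta*T/8 after T stages; the paper's contribution is showing how the sparsity assumption changes the optimal order of this quantity, differently for gains and for losses.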
Approachability in unknown games: Online learning meets multi-objective optimization
In the standard setting of approachability there are two players and a target
set. The players repeatedly play a known vector-valued game in which the first
player wants the average vector-valued payoff to converge to the target set,
while the other player tries to exclude it from this set. We revisit this
setting in the spirit of online learning and do not assume that the first
player knows the game structure: she receives an arbitrary vector-valued
reward at every round. She wishes to approach the smallest ("best") possible
set given the observed average payoffs in hindsight. This extension of the
standard setting has implications even when the original target set is not
approachable and it is not obvious which expansion of it should be approached
instead. We show that it is impossible, in general, to approach the best
target set in hindsight, and we propose achievable though ambitious
alternative goals. We further propose a concrete strategy to approach these
goals. Our method does not require projection onto a target set and amounts to
switching between scalar regret minimization algorithms that are run in
episodes. Applications to global cost minimization and to approachability
under sample path constraints are considered.
Stochastic Bandit Models for Delayed Conversions
Online advertising and product recommendation are important domains of
applications for multi-armed bandit methods. In these fields, the reward that
is immediately available is most often only a proxy for the actual outcome of
interest, which we refer to as a conversion. For instance, in web advertising,
clicks can be observed within a few seconds after an ad display but the
corresponding sale --if any-- will take hours, if not days to happen. This
paper proposes and investigates a new stochastic multi-armed bandit model in
the framework proposed by Chapelle (2014) -- based on empirical studies in the
field of web advertising -- in which each action may trigger a future reward
that occurs after a stochastic delay. We assume that the probability
of conversion associated with each action is unknown while the distribution of
the conversion delay is known, distinguishing between the (idealized) case
where the conversion events may be observed whatever their delay and the more
realistic setting in which late conversions are censored. We provide
performance lower bounds as well as two simple but efficient algorithms based
on the UCB and KLUCB frameworks. The latter algorithm, which is preferable when
conversion rates are low, is based on a Poissonization argument, of independent
interest in other settings where aggregation of Bernoulli observations with
different success probabilities is required.
Comment: Conference on Uncertainty in Artificial Intelligence, Aug 2017, Sydney, Australia.
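The idealized (uncensored) setting can be simulated directly: the learner pulls arms with a UCB rule but each conversion is revealed only a fixed number of rounds later, so indices are computed from the conversions revealed so far. A toy sketch (the fixed delay and the simple index are illustrative simplifications, not the paper's KLUCB variant):

```python
import math
import random

def ucb_delayed(conv_probs, delay, horizon, seed=0):
    """Toy UCB simulation with delayed conversions (idealized, uncensored case).
    Means are estimated from revealed conversions only, so early estimates
    are biased low until the pending observations arrive."""
    rng = random.Random(seed)
    k = len(conv_probs)
    pulls = [0] * k
    revealed = [0.0] * k
    pending = []  # (reveal_time, arm, conversion)
    for t in range(horizon):
        due = [x for x in pending if x[0] <= t]
        pending = [x for x in pending if x[0] > t]
        for _, a, conv in due:
            revealed[a] += conv

        def index(a):
            if pulls[a] == 0:
                return float("inf")
            return revealed[a] / pulls[a] + math.sqrt(2 * math.log(t + 1) / pulls[a])

        arm = max(range(k), key=index)
        pulls[arm] += 1
        conv = 1.0 if rng.random() < conv_probs[arm] else 0.0
        pending.append((t + delay, arm, conv))
    return pulls
```

In the censored setting studied in the paper, conversions arriving too late are never observed at all, which is where the Poissonization argument becomes useful.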
Online learning in repeated auctions
Motivated by online advertising auctions, we consider repeated Vickrey
auctions where goods of unknown value are sold sequentially and bidders only
learn (potentially noisy) information about a good's value once it is
purchased. We adopt an online learning approach with bandit feedback to model
this problem and derive bidding strategies for two models: stochastic and
adversarial. In the stochastic model, the observed values of the goods are
random variables centered around the true value of the good. In this case,
logarithmic regret is achievable when competing against well-behaved
adversaries. In the adversarial model, the goods need not be identical and we
simply compare our performance against that of the best fixed bid in hindsight.
We show that sublinear regret is also achievable in this case and prove
matching minimax lower bounds. To our knowledge, this is the first complete set
of strategies for bidders participating in auctions of this type.
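In the stochastic model, a natural strategy sketch (hypothetical illustration, not necessarily the paper's exact construction) is to bid an optimistic upper-confidence estimate of the unknown value: in a second-price auction the winner pays the opposing bid, so optimistic overbidding drives exploration while the value estimate concentrates.

```python
import math
import random

def optimistic_bidder(true_value, opp_bids, noise=0.1, seed=0):
    """Bid a UCB estimate of the good's unknown value; a noisy value
    observation is received only after winning (bandit feedback)."""
    rng = random.Random(seed)
    n, s = 0, 0.0
    wins = 0
    for t, opp in enumerate(opp_bids, start=1):
        bonus = math.sqrt(2 * math.log(t) / n) if n else float("inf")
        bid = (s / n if n else 0.0) + bonus
        if bid > opp:              # Vickrey rule: win and pay the opposing bid
            wins += 1
            n += 1
            s += true_value + rng.gauss(0.0, noise)
    return wins
```

When the true value exceeds the typical opposing bid, such a bidder wins almost every round once its estimate has concentrated; the subtle cases, analyzed in the paper, are those where winning is costly and exploration must be paid for.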
Highly-Smooth Zero-th Order Online Optimization
Vianney Perchet
The minimization of convex functions which are only available through partial
and noisy information is a key methodological problem in many disciplines. In
this paper we consider convex optimization with noisy zero-th order
information, that is noisy function evaluations at any desired point. We focus
on problems with high degrees of smoothness, such as logistic regression. We
show that as opposed to gradient-based algorithms, high-order smoothness may be
used to improve estimation rates, with a precise dependence of our upper-bounds
on the degree of smoothness. In particular, we show that for infinitely
differentiable functions, we recover the same dependence on sample size as
gradient-based algorithms, with an extra dimension-dependent factor. This is
done for both convex and strongly-convex functions, with finite horizon and
anytime algorithms. Finally, we also recover similar results in the online
optimization setting.
Comment: Conference on Learning Theory (COLT), Jun 2016, New York, United States.
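Zeroth-order methods of this kind build a gradient surrogate from function evaluations alone. A minimal two-point sketch (the basic estimator, not the paper's higher-order smoothing construction; step size and smoothing radius are illustrative):

```python
import random

def two_point_gradient(f, x, delta, rng):
    """Estimate the gradient of f at x from two evaluations along a random
    unit direction u:  g = d * (f(x + delta*u) - f(x - delta*u)) / (2*delta) * u,
    which is unbiased for the gradient of a smoothed version of f."""
    d = len(x)
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = sum(ui * ui for ui in u) ** 0.5
    u = [ui / norm for ui in u]
    fp = f([xi + delta * ui for xi, ui in zip(x, u)])
    fm = f([xi - delta * ui for xi, ui in zip(x, u)])
    scale = d * (fp - fm) / (2 * delta)
    return [scale * ui for ui in u]

def zeroth_order_descent(f, x, eta, delta, steps, seed=0):
    """Gradient descent driven only by noisy-free function evaluations."""
    rng = random.Random(seed)
    for _ in range(steps):
        g = two_point_gradient(f, x, delta, rng)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x
```

The paper's contribution is to exploit high-order smoothness to sharpen the rates attainable by estimators of this family under noisy evaluations.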
