818 research outputs found
Optimal No-regret Learning in Repeated First-price Auctions
We study online learning in repeated first-price auctions with censored
feedback, where a bidder, only observing the winning bid at the end of each
auction, learns to adaptively bid in order to maximize her cumulative payoff.
To achieve this goal, the bidder faces a challenging dilemma: if she wins the
bid--the only way to achieve positive payoffs--then she is not able to observe
the highest bid of the other bidders, which we assume is iid drawn from an
unknown distribution. This dilemma, despite being reminiscent of the
exploration-exploitation trade-off in contextual bandits, cannot directly be
addressed by the existing UCB or Thompson sampling algorithms in that
literature, mainly because contrary to the standard bandits setting, when a
positive reward is obtained here, nothing about the environment can be learned.
In this paper, by exploiting the structural properties of first-price
auctions, we develop the first learning algorithm that achieves
regret bound when the bidder's private values are
stochastically generated. We do so by providing an algorithm on a general class
of problems, which we call monotone group contextual bandits, where the same
regret bound is established under stochastically generated contexts. Further,
by a novel lower bound argument, we characterize an lower
bound for the case where the contexts are adversarially generated, thus
highlighting the impact of the contexts generation mechanism on the fundamental
learning limit. Despite this, we further exploit the structure of first-price
auctions and develop a learning algorithm that operates sample-efficiently (and
computationally efficiently) in the presence of adversarially generated private
values. We establish an regret bound for this algorithm,
hence providing a complete characterization of optimal learning guarantees for
this problem
Parameter estimation in softmax decision-making models with linear objective functions
With an eye towards human-centered automation, we contribute to the
development of a systematic means to infer features of human decision-making
from behavioral data. Motivated by the common use of softmax selection in
models of human decision-making, we study the maximum likelihood parameter
estimation problem for softmax decision-making models with linear objective
functions. We present conditions under which the likelihood function is convex.
These allow us to provide sufficient conditions for convergence of the
resulting maximum likelihood estimator and to construct its asymptotic
distribution. In the case of models with nonlinear objective functions, we show
how the estimator can be applied by linearizing about a nominal parameter
value. We apply the estimator to fit the stochastic UCL (Upper Credible Limit)
model of human decision-making to human subject data. We show statistically
significant differences in behavior across related, but distinct, tasks.Comment: In pres
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
We develop a learning principle and an efficient algorithm for batch learning
from logged bandit feedback. This learning setting is ubiquitous in online
systems (e.g., ad placement, web search, recommendation), where an algorithm
makes a prediction (e.g., ad ranking) for a given input (e.g., query) and
observes bandit feedback (e.g., user clicks on presented ads). We first address
the counterfactual nature of the learning problem through propensity scoring.
Next, we prove generalization error bounds that account for the variance of the
propensity-weighted empirical risk estimator. These constructive bounds give
rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM
can be used to derive a new learning method -- called Policy Optimizer for
Exponential Models (POEM) -- for learning stochastic linear rules for
structured output prediction. We present a decomposition of the POEM objective
that enables efficient stochastic gradient optimization. POEM is evaluated on
several multi-label classification problems showing substantially improved
robustness and generalization performance compared to the state-of-the-art.Comment: 10 page
Query Complexity of Derivative-Free Optimization
This paper provides lower bounds on the convergence rate of Derivative Free
Optimization (DFO) with noisy function evaluations, exposing a fundamental and
unavoidable gap between the performance of algorithms with access to gradients
and those with access to only function evaluations. However, there are
situations in which DFO is unavoidable, and for such situations we propose a
new DFO algorithm that is proved to be near optimal for the class of strongly
convex objective functions. A distinctive feature of the algorithm is that it
uses only Boolean-valued function comparisons, rather than function
evaluations. This makes the algorithm useful in an even wider range of
applications, such as optimization based on paired comparisons from human
subjects, for example. We also show that regardless of whether DFO is based on
noisy function evaluations or Boolean-valued function comparisons, the
convergence rate is the same
- …