Optimal No-regret Learning in Repeated First-price Auctions
We study online learning in repeated first-price auctions with censored
feedback, where a bidder, only observing the winning bid at the end of each
auction, learns to adaptively bid in order to maximize her cumulative payoff.
To achieve this goal, the bidder faces a challenging dilemma: if she wins the
bid (the only way to achieve positive payoffs), then she is not able to observe
the highest bid of the other bidders, which we assume is drawn i.i.d. from an
unknown distribution. This dilemma, despite being reminiscent of the
exploration-exploitation trade-off in contextual bandits, cannot be directly
addressed by the existing UCB or Thompson sampling algorithms in that
literature, mainly because, contrary to the standard bandit setting, when a
positive reward is obtained here, nothing about the environment can be learned.
In this paper, by exploiting the structural properties of first-price
auctions, we develop the first learning algorithm that achieves
regret bound when the bidder's private values are
stochastically generated. We do so by providing an algorithm on a general class
of problems, which we call monotone group contextual bandits, where the same
regret bound is established under stochastically generated contexts. Further,
by a novel lower bound argument, we characterize an lower
bound for the case where the contexts are adversarially generated, thus
highlighting the impact of the context-generation mechanism on the fundamental
learning limit. Despite this, we further exploit the structure of first-price
auctions and develop a learning algorithm that operates sample-efficiently (and
computationally efficiently) in the presence of adversarially generated private
values. We establish an regret bound for this algorithm,
hence providing a complete characterization of optimal learning guarantees for
this problem.
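The censored-feedback dilemma described above can be made concrete with a small simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: private values and competing bids are both drawn i.i.d. from Uniform[0, 1], ties go to the bidder, and the bidder follows a fixed, hypothetical shading rule (bid = shade * value). The names `first_price_round`, `simulate`, and `shade` are illustrative only.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RoundOutcome:
    won: bool
    payoff: float
    # Highest competing bid m_t. It is revealed only when the bidder loses,
    # because only then is it the winning bid announced after the auction.
    observed_competing_bid: Optional[float]


def first_price_round(value: float, bid: float, competing_bid: float) -> RoundOutcome:
    """One first-price auction under censored (winning-bid-only) feedback.

    Assumption: ties are broken in the bidder's favour.
    """
    if bid >= competing_bid:
        # She pays her own bid; the announced winning bid is hers,
        # so she only learns that competing_bid <= bid.
        return RoundOutcome(won=True, payoff=value - bid, observed_competing_bid=None)
    # She loses, earns nothing, and sees the competing (winning) bid.
    return RoundOutcome(won=False, payoff=0.0, observed_competing_bid=competing_bid)


def simulate(num_rounds: int, shade: float, seed: int = 0) -> Tuple[float, List[float]]:
    """Play the hypothetical rule bid = shade * value for num_rounds rounds.

    Assumption for illustration: values and competing bids are i.i.d.
    Uniform[0, 1]; in the problem itself the competing-bid distribution
    is unknown to the bidder.
    """
    rng = random.Random(seed)
    total_payoff = 0.0
    revealed: List[float] = []
    for _ in range(num_rounds):
        value = rng.random()
        competing_bid = rng.random()
        outcome = first_price_round(value, shade * value, competing_bid)
        total_payoff += outcome.payoff
        if outcome.observed_competing_bid is not None:
            revealed.append(outcome.observed_competing_bid)
    return total_payoff, revealed


total, revealed = simulate(1000, shade=0.5, seed=1)
print(f"payoff={total:.2f}, competing bids observed={len(revealed)} of 1000")
```

Every value in `revealed` exceeds the bid placed in its round, so an empirical distribution built only from these observations is skewed upward relative to the true competing-bid distribution; rounds that paid off contributed only the bound m_t <= b_t.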
Towards Practical Lipschitz Bandits
Stochastic Lipschitz bandit algorithms balance exploration and exploitation,
and have been used for a variety of important task domains. In this paper, we
present a framework for Lipschitz bandit methods that adaptively learns
partitions of context- and arm-space. Due to this flexibility, the algorithm is
able to efficiently optimize rewards and minimize regret, by focusing on the
portions of the space that are most relevant. In our analysis, we link
tree-based methods to Gaussian processes. In light of our analysis, we design a
novel hierarchical Bayesian model for Lipschitz bandit problems. Our
experiments show that our algorithms can achieve state-of-the-art performance
in challenging real-world tasks such as neural network hyperparameter tuning.
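The abstract does not spell out the algorithm, so the following is only a generic illustration of adaptively partitioning a one-dimensional arm space [0, 1] with UCB-style indices, in the spirit of zooming-type Lipschitz bandit methods. It is not the paper's framework: it omits contexts, the tree/Gaussian-process connection, and the hierarchical Bayesian model. The Lipschitz constant, Gaussian noise level, splitting rule, and the names `Cell` and `adaptive_partition_ucb` are assumptions made for the sketch.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Cell:
    lo: float
    hi: float
    pulls: int = 0
    reward_sum: float = 0.0

    @property
    def mid(self) -> float:
        return 0.5 * (self.lo + self.hi)

    @property
    def width(self) -> float:
        return self.hi - self.lo


def adaptive_partition_ucb(reward_fn: Callable[[float], float], horizon: int,
                           lipschitz: float = 1.0, noise: float = 0.1,
                           seed: int = 0) -> Tuple[List[Cell], List[float]]:
    """Hypothetical adaptive-partition UCB on [0, 1] (illustrative sketch)."""
    rng = random.Random(seed)
    cells: List[Cell] = [Cell(0.0, 1.0)]
    log_term = 2.0 * math.log(horizon)
    arms_played: List[float] = []

    def ucb_index(c: Cell) -> float:
        if c.pulls == 0:
            return math.inf  # every newly created cell is pulled at least once
        radius = math.sqrt(log_term / c.pulls)
        # Empirical mean at the midpoint + confidence radius + Lipschitz bias.
        return c.reward_sum / c.pulls + radius + lipschitz * c.width

    for _ in range(horizon):
        cell = max(cells, key=ucb_index)
        arm = cell.mid
        reward = reward_fn(arm) + rng.gauss(0.0, noise)
        cell.pulls += 1
        cell.reward_sum += reward
        arms_played.append(arm)
        # Refine once statistical error is no larger than the cell's bias term.
        if math.sqrt(log_term / cell.pulls) <= lipschitz * cell.width:
            cells.remove(cell)
            cells.extend([Cell(cell.lo, cell.mid), Cell(cell.mid, cell.hi)])

    cells.sort(key=lambda c: c.lo)
    return cells, arms_played


cells, arms = adaptive_partition_ucb(lambda x: 1.0 - abs(x - 0.7), horizon=2000)
finest = min(cells, key=lambda c: c.width)
print(len(cells), finest.lo, finest.hi)
```

A cell is split only after its confidence radius falls below its Lipschitz bias term, so cells whose index stays low are pulled rarely and remain coarse; refinement therefore tends to concentrate where rewards are high, which is the kind of focus on "relevant portions of the space" the abstract describes.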
Balanced Linear Contextual Bandits
Contextual bandit algorithms are sensitive to the estimation method of the
outcome model as well as the exploration method used, particularly in the
presence of rich heterogeneity or complex outcome models, which can lead to
difficult estimation problems along the path of learning. We develop algorithms
for contextual bandits with linear payoffs that integrate balancing methods
from the causal inference literature in their estimation to make it less prone
to problems of estimation bias. We provide the first regret bound analyses for
linear contextual bandits with balancing and show that our algorithms match the
state-of-the-art theoretical guarantees. We demonstrate the strong practical
advantage of balanced contextual bandits on a large number of supervised
learning datasets and on a synthetic example that simulates model
misspecification and prejudice in the initial training data.
Comment: AAAI 2019 Oral Presentation. arXiv admin note: substantial text
overlap with arXiv:1711.0707
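The abstract says only that balancing methods from the causal inference literature are integrated into the outcome-model estimation. One such balancing method is inverse propensity weighting; the sketch below assumes that choice and fits a per-arm ridge regression in which each logged round is weighted by the clipped inverse of the probability with which the policy selected that arm. It is a minimal illustration, not the paper's exact estimator; `balanced_ridge`, `reg`, and `clip` are hypothetical names, and a small Gaussian-elimination solver keeps the example dependency-free.

```python
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def balanced_ridge(contexts: Sequence[Sequence[float]], rewards: Sequence[float],
                   propensities: Sequence[float], reg: float = 1.0,
                   clip: float = 0.05) -> List[float]:
    """Hypothetical IPW-weighted ridge estimate of one arm's linear payoff.

    theta = (reg * I + sum_i w_i x_i x_i^T)^{-1} sum_i w_i x_i y_i,
    with w_i = 1 / max(p_i, clip) (clipping is an assumed variance control).
    """
    d = len(contexts[0])
    gram = [[reg if i == j else 0.0 for j in range(d)] for i in range(d)]
    moment = [0.0] * d
    for x, y, p in zip(contexts, rewards, propensities):
        w = 1.0 / max(p, clip)  # clipped inverse-propensity weight
        for i in range(d):
            moment[i] += w * x[i] * y
            for j in range(d):
                gram[i][j] += w * x[i] * x[j]
    return solve(gram, moment)


theta = balanced_ridge([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [2.0, -1.0, 1.0],
                       [0.5, 0.2, 0.9])
print(theta)
```

Rounds in which the arm was chosen with low probability receive large weights, which is intended to counteract the over-representation of contexts the policy already favoured; the `clip` floor bounds how much variance any single round can add.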