Optimal No-regret Learning in Repeated First-price Auctions
We study online learning in repeated first-price auctions with censored
feedback, where a bidder, only observing the winning bid at the end of each
auction, learns to adaptively bid in order to maximize her cumulative payoff.
To achieve this goal, the bidder faces a challenging dilemma: if she wins the
bid--the only way to achieve positive payoffs--then she is not able to observe
the highest bid of the other bidders, which we assume is iid drawn from an
unknown distribution. This dilemma, despite being reminiscent of the
exploration-exploitation trade-off in contextual bandits, cannot directly be
addressed by the existing UCB or Thompson sampling algorithms in that
literature, mainly because, contrary to the standard bandit setting, when a
positive reward is obtained here, nothing about the environment can be learned.
In this paper, by exploiting the structural properties of first-price
auctions, we develop the first learning algorithm that achieves a near-optimal
$\widetilde{O}(\sqrt{T})$ regret bound when the bidder's private values are
stochastically generated. We do so by providing an algorithm for a general class
of problems, which we call monotone group contextual bandits, where the same
regret bound is established under stochastically generated contexts. Further,
by a novel lower bound argument, we characterize an $\Omega(T^{2/3})$ lower
bound for the case where the contexts are adversarially generated, thus
highlighting the impact of the context generation mechanism on the fundamental
learning limit. Despite this, we further exploit the structure of first-price
auctions and develop a learning algorithm that operates sample-efficiently (and
computationally efficiently) in the presence of adversarially generated private
values. We establish an $\widetilde{O}(T^{2/3})$ regret bound for this algorithm,
hence providing a complete characterization of optimal learning guarantees for
this problem.
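The monotone feedback structure the abstract describes, where winning at a bid settles the outcome for every higher bid while losing reveals the highest competing bid exactly, can be illustrated with a small simulation. This is a hedged sketch, not the paper's algorithm: the bid grid, the competing-bid distribution, and the UCB-style bonus are all illustrative assumptions.

```python
import random

random.seed(0)
GRID = [i / 20 for i in range(21)]   # candidate bids in [0, 1] (assumption)
n_bids = len(GRID)
wins = [0] * n_bids                  # per-bid counts of "would have won"
trials = [0] * n_bids                # per-bid counts of settled outcomes
VALUE = 0.8                          # bidder's private value, fixed here

def highest_competing_bid():
    # iid draw from an unknown distribution (illustrative choice)
    return 0.9 * max(random.random() for _ in range(3))

total_payoff = 0.0
for t in range(2000):
    def index(j):
        # expected-margin index with a crude UCB-style bonus (assumption)
        if trials[j] == 0:
            return float("inf")
        p_hat = wins[j] / trials[j]
        bonus = 0.1 * (1.0 / trials[j]) ** 0.5
        return (VALUE - GRID[j]) * min(1.0, p_hat + bonus)
    j = max(range(n_bids), key=index)
    b, m = GRID[j], highest_competing_bid()
    if b >= m:
        # win: we never observe m, but every bid >= b would also have won
        total_payoff += VALUE - b
        for k in range(j, n_bids):
            trials[k] += 1
            wins[k] += 1
    else:
        # loss: m itself is revealed, settling every bid on the grid
        for k in range(n_bids):
            trials[k] += 1
            if GRID[k] >= m:
                wins[k] += 1
```

Note how the update rule encodes the censoring: a win only informs the suffix of the grid above the submitted bid, while a loss is fully informative.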
Advancing Ad Auction Realism: Practical Insights & Modeling Implications
This paper proposes a learning model of online ad auctions that allows for
the following four key realistic characteristics of contemporary online
auctions: (1) ad slots can have different values and click-through rates
depending on users' search queries, (2) the number and identity of competing
advertisers are unobserved and change with each auction, (3) advertisers only
receive partial, aggregated feedback, and (4) payment rules are only partially
specified. We model advertisers as agents governed by an adversarial bandit
algorithm, independent of auction mechanism intricacies. Our objective is to
simulate the behavior of advertisers for counterfactual analysis, prediction,
and inference purposes. Our findings reveal that, in such richer environments,
"soft floors" can enhance key performance metrics even when bidders are drawn
from the same population. We further demonstrate how to infer advertiser value
distributions from observed bids, thereby affirming the practical efficacy of
our approach even in a more realistic auction setting.
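The advertiser model here, a mechanism-agnostic adversarial-bandit agent, can be sketched with a textbook Exp3 learner over a discretized bid grid. The grid, the exploration parameter, and the toy environment below are illustrative assumptions, not details from the paper.

```python
import math
import random

class Exp3Bidder:
    """Textbook Exp3 over a fixed bid grid; rewards must lie in [0, 1]."""
    def __init__(self, bids, gamma=0.1):
        self.bids = bids
        self.gamma = gamma
        self.weights = [1.0] * len(bids)

    def probs(self):
        total = sum(self.weights)
        k = len(self.bids)
        return [(1 - self.gamma) * w / total + self.gamma / k
                for w in self.weights]

    def choose_bid(self):
        p = self.probs()
        self.last = random.choices(range(len(self.bids)), weights=p)[0]
        return self.bids[self.last]

    def observe(self, reward):
        # importance-weighted estimate keeps the update unbiased
        p = self.probs()[self.last]
        est = reward / p
        self.weights[self.last] *= math.exp(
            self.gamma * est / len(self.bids))

# Toy environment: value 1.0, a fixed competing bid of 0.4 (assumption).
random.seed(1)
bidder = Exp3Bidder([0.2, 0.5, 0.8])
for _ in range(3000):
    b = bidder.choose_bid()
    reward = (1.0 - b) if b >= 0.4 else 0.0   # first-price payoff in [0, 1]
    bidder.observe(reward)
```

Because the update depends only on the realized reward, the same agent can be dropped into any auction format, which is exactly what makes it useful for counterfactual simulation.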
Multi-Platform Budget Management in Ad Markets with Non-IC Auctions
In online advertising markets, budget-constrained advertisers acquire ad
placements through repeated bidding in auctions on various platforms. We
present a strategy for bidding optimally in a set of auctions that may or may
not be incentive-compatible under the presence of budget constraints. Our
strategy maximizes the expected total utility across auctions while satisfying
the advertiser's budget constraints in expectation. Additionally, we
investigate the online setting where the advertiser must submit bids across
platforms while learning about other bidders' bids over time. Our algorithm
attains sublinear regret in the full-information setting. Finally, we demonstrate
that our algorithms have superior cumulative regret on both synthetic and
real-world datasets of ad placement auctions, compared to existing adaptive
pacing algorithms.
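For context, the adaptive-pacing baseline mentioned above can be sketched in a few lines: bids are shaded by a dual multiplier that tracks how fast the budget is being consumed. Every constant below (values, prices, step size) is an illustrative assumption.

```python
import random

random.seed(2)
T, BUDGET = 1000, 100.0
rho = BUDGET / T                 # target spend per round
mu, eta = 0.0, 0.05              # dual (pacing) multiplier and its step size
spent = 0.0

for t in range(T):
    value = random.random()                  # private value this round
    bid = value / (1.0 + mu)                 # pacing-shaded bid
    price = random.uniform(0.0, 0.8)         # highest competing bid
    won = bid >= price and spent + bid <= BUDGET
    cost = bid if won else 0.0               # first-price payment
    spent += cost
    # spending above the per-round target pushes mu up, shading future bids
    mu = max(0.0, mu + eta * (cost - rho))
```

The multiplier acts as a self-tuning exchange rate between money and value; the paper's contribution is handling several platforms with possibly non-incentive-compatible rules, which this single-platform sketch does not capture.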
Online Learning under Budget and ROI Constraints via Weak Adaptivity
We study online learning problems in which a decision maker has to make a
sequence of costly decisions, with the goal of maximizing their expected reward
while adhering to budget and return-on-investment (ROI) constraints. Existing
primal-dual algorithms designed for constrained online learning problems under
adversarial inputs rely on two fundamental assumptions. First, the decision
maker must know beforehand the value of parameters related to the degree of
strict feasibility of the problem (i.e. Slater parameters). Second, a strictly
feasible solution to the offline optimization problem must exist at each round.
Both requirements are unrealistic for practical applications such as bidding in
online ad auctions. In this paper, we show how such assumptions can be
circumvented by endowing standard primal-dual templates with weakly adaptive
regret minimizers. This results in a "dual-balancing" framework which ensures
that dual variables stay sufficiently small, even in the absence of knowledge
about Slater's parameter. We prove the first best-of-both-worlds no-regret
guarantees that hold in the absence of the two aforementioned assumptions, under
stochastic and adversarial inputs. Finally, we show how to instantiate the
framework to optimally bid in various mechanisms of practical relevance, such
as first- and second-price auctions.
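In the truthful (second-price) instantiation, a primal-dual template of this kind reduces to bidding a multiplier-adjusted value. The sketch below uses one budget multiplier mu and one ROI multiplier lam, both updated by projected gradient steps; the bid formula and all constants are illustrative assumptions, not the paper's algorithm.

```python
import random

random.seed(3)
T, BUDGET, TAU = 2000, 300.0, 1.2      # rounds, budget, ROI target (assumed)
rho = BUDGET / T
lam = mu = 0.0                          # ROI and budget multipliers
eta = 0.02
spend = won_value = 0.0

for t in range(T):
    v = random.random()                          # private value
    # Lagrangian best response in a second-price auction (sketch):
    # a tighter ROI constraint (larger lam, with TAU > 1) lowers the bid
    bid = v * (1 + lam) / (1 + lam * TAU + mu)
    d = random.uniform(0.0, 1.0)                 # highest competing bid
    if bid >= d and spend + d <= BUDGET:
        spend += d                               # second-price payment
        won_value += v
    # projected gradient steps on the two dual variables
    mu = max(0.0, mu + eta * (spend / (t + 1) - rho))
    lam = max(0.0, lam + eta * (TAU * spend - won_value) / (t + 1))
```

The point of the paper's weakly adaptive machinery is precisely to keep lam and mu well behaved without knowing Slater parameters in advance; the fixed step size here is a simplification.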
Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism
We consider undiscounted reinforcement learning (RL) in Markov decision
processes (MDPs) under drifting non-stationarity, i.e., both the reward and
state transition distributions are allowed to evolve over time, as long as
their respective total variations, quantified by suitable metrics, do not
exceed certain variation budgets. We first develop the Sliding Window
Upper-Confidence bound for Reinforcement Learning with Confidence Widening
(SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the
variation budgets are known. In addition, we propose the
Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the
SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a
parameter-free manner, i.e., without knowing the variation budgets. Notably,
learning non-stationary MDPs via the conventional optimistic exploration
technique presents a unique challenge absent in existing (non-stationary)
bandit learning settings. We overcome the challenge by a novel confidence
widening technique that incorporates additional optimism. (To appear in the
proceedings of the 37th International Conference on Machine Learning; shortened
conference version of the journal article available at arXiv:1906.02922.)
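The core sliding-window idea, trusting only recent observations so that drifting rewards age out of the estimates, is easiest to see in a plain bandit sketch. This is a deliberate simplification: the paper's algorithm handles MDPs and adds confidence widening, and the window here covers the last W pulls of each arm rather than the last W rounds. All constants and the drift model are illustrative assumptions.

```python
import math
import random
from collections import deque

random.seed(4)
K, W, T = 2, 200, 2000               # arms, window size, horizon (assumed)
history = [deque(maxlen=W) for _ in range(K)]   # last W rewards per arm

def true_mean(arm, t):
    # drifting environment: arm 0 decays while arm 1 improves
    return 0.8 - 0.6 * t / T if arm == 0 else 0.2 + 0.6 * t / T

picks = []
for t in range(T):
    def sw_ucb(a):
        h = history[a]
        if not h:
            return float("inf")
        bonus = math.sqrt(2 * math.log(t + 1) / len(h))
        return sum(h) / len(h) + bonus
    a = max(range(K), key=sw_ucb)
    picks.append(a)
    reward = 1.0 if random.random() < true_mean(a, t) else 0.0
    history[a].append(reward)
```

A stationary UCB learner would average over the whole history and keep favoring arm 0 long after the drift; the bounded deque is what lets the estimates track the change.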
Online Learning in Multi-unit Auctions
We consider repeated multi-unit auctions with uniform pricing, which are
widely used in practice for allocating goods such as carbon licenses. In each
round, $K$ identical units of a good are sold to a group of buyers that have
valuations with diminishing marginal returns. The buyers submit bids for the
units, and then a price is set per unit so that all the units are sold. We
consider two variants of the auction, where the price is set to the $K$-th
highest bid and the $(K+1)$-st highest bid, respectively.
We analyze the properties of this auction in both the offline and online
settings. In the offline setting, we consider the problem that one player $i$
is facing: given access to a data set that contains the bids submitted by
competitors in past auctions, find a bid vector that maximizes player $i$'s
cumulative utility on the data set. We design a polynomial time algorithm for
this problem, by showing it is equivalent to finding a maximum-weight path on a
carefully constructed directed acyclic graph.
In the online setting, the players run learning algorithms to update their
bids as they participate in the auction over time. Based on our offline
algorithm, we design efficient online learning algorithms for bidding. The
algorithms have sublinear regret, under both full information and bandit
feedback structures. We complement our online learning algorithms with regret
lower bounds.
Finally, we analyze the quality of the equilibria in the worst case through
the lens of the core solution concept in the game among the bidders. We show
that the $(K+1)$-st price format is susceptible to collusion among the bidders;
meanwhile, the $K$-th price format does not have this issue.
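The offline result reduces bid optimization to a maximum-weight path computation on a carefully constructed DAG. That construction is the paper's contribution and is not reproduced here; the sketch below shows only the generic topological-order dynamic program such a reduction bottoms out in, on a toy graph.

```python
def max_weight_path(n, edges):
    """Maximum total weight of a (possibly empty) path in a DAG.

    Nodes are 0..n-1 and every edge (u, v, w) satisfies u < v, so the
    natural order 0..n-1 is already topological.
    """
    best = [0.0] * n                 # best path weight ending at each node
    for u, v, w in sorted(edges):    # sorting by source respects the order
        best[v] = max(best[v], best[u] + w)
    return max(best)

# Example: the heaviest path in this 4-node DAG is 0 -> 1 -> 2 -> 3.
print(max_weight_path(4, [(0, 1, 2.0), (1, 2, 3.0), (0, 2, 1.0), (2, 3, 4.0)]))
# -> 9.0
```

The single pass over edges makes the whole offline step polynomial once the graph is built, which is what powers the online algorithms in the paper.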
Efficient Algorithms for Minimizing Compositions of Convex Functions and Random Functions and Its Applications in Network Revenue Management
In this paper, we study a class of nonconvex stochastic optimization problems
in which the objective is a composition of a convex function $f$ and a random
function $c(x, \xi)$. Leveraging an (implicit) convex reformulation via a
variable transformation, we develop stochastic
gradient-based algorithms and establish their sample and gradient complexities
for achieving an $\epsilon$-global optimal solution. Interestingly, our
proposed Mirror Stochastic Gradient (MSG) method operates only in the original
-space using gradient estimators of the original nonconvex objective and
achieves $\mathcal{O}(\epsilon^{-2})$ sample and gradient complexities,
which match the lower bounds for solving stochastic convex optimization
problems. Under booking limits control, we formulate the air-cargo network
revenue management (NRM) problem with random two-dimensional capacity, random
consumption, and routing flexibility as a special case of the stochastic
nonconvex optimization, where the random function is $c(x, \xi) = x \wedge \xi$,
i.e., the random demand $\xi$ truncates the booking limit decision $x$.
Extensive numerical experiments demonstrate the superior performance of our
proposed MSG algorithm for booking limit control with higher revenue and lower
computation cost than state-of-the-art bid-price-based control policies,
especially when the variance of random capacity is large.
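The single-resource special case gives a feel for the truncation structure: demand $D$ enters only through $\min(x, D)$, so a pathwise stochastic gradient in $x$ is nonzero only on the event $\{D > x\}$. The revenue and penalty numbers and the distributions below are toy assumptions, and this is a plain projected stochastic gradient sketch in the spirit of, not identical to, the MSG method.

```python
import random

random.seed(5)
P, Q = 1.0, 3.0          # unit revenue and unit offloading penalty (assumed)
x, eta = 5.0, 0.1        # booking limit and step size
for t in range(5000):
    D = random.expovariate(1.0 / 10.0)      # random demand, mean 10
    C = random.uniform(6.0, 10.0)           # random capacity
    accepted = min(x, D)                    # demand truncates the limit
    # pathwise gradient of P*min(x, D) - Q*max(min(x, D) - C, 0) in x:
    # nonzero only when demand exceeds the limit (D > x)
    g = (1.0 if D > x else 0.0) * (P - (Q if accepted > C else 0.0))
    x = min(20.0, max(0.0, x + eta * g))    # projected ascent step
```

At stationarity the limit balances marginal revenue against the expected offloading penalty, i.e. $P = Q \cdot \Pr(C < x)$, which for these toy numbers puts $x$ near 7.3; the iterate hovers around that level.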
KEYWORDS: stochastic nonconvex optimization, hidden convexity, air-cargo
network revenue management, gradient-based algorithm