Last Switch Dependent Bandits with Monotone Payoff Functions
In a recent work, Laforgue et al. introduce the model of last switch
dependent (LSD) bandits, in an attempt to capture nonstationary phenomena
induced by the interaction between the player and the environment. Examples
include satiation, where consecutive plays of the same action lead to decreased
performance, or deprivation, where the payoff of an action increases after an
interval of inactivity. In this work, we take a step towards understanding the
approximability of planning LSD bandits, namely, the (NP-hard) problem of
computing an optimal arm-pulling strategy under complete knowledge of the
model. In particular, we design the first efficient constant approximation
algorithm for the problem and show that, under a natural monotonicity
assumption on the payoffs, its approximation guarantee (almost) matches the
state-of-the-art for the special and well-studied class of recharging bandits
(also known as delay-dependent). In this attempt, we develop new tools and
insights for this class of problems, including a novel higher-dimensional
relaxation and the technique of mirroring the evolution of virtual states. We
believe that these novel elements could potentially be used for approaching
richer classes of action-induced nonstationary bandits (e.g., special instances
of restless bandits). In the case where the model parameters are initially
unknown, we develop an online learning adaptation of our algorithm for which we
provide sublinear regret guarantees against its full-information counterpart.
Comment: Accepted to the 40th International Conference on Machine Learning (ICML 2023).
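To make the satiation and deprivation dynamics concrete, the following is a minimal Python sketch of a last-switch-dependent payoff structure and the planning problem it induces. The two-arm instance, payoff functions, and constants are hypothetical choices for illustration, not the model of Laforgue et al.

```python
import itertools

# Minimal simulation sketch of last-switch-dependent payoffs (a hypothetical
# instantiation): consecutive plays of an arm decay its payoff (satiation),
# while a rested arm recovers (deprivation).

def expected_payoff(base, run, rest):
    if run > 0:                               # arm already pulled `run` times in a row
        return base / (1 + run)               # satiation: payoff decays along the run
    return base * min(1.0, 0.5 * rest)        # deprivation: payoff recovers with rest

def evaluate(cycle, bases=(1.0, 0.8), horizon=12):
    """Total expected payoff of a fixed cyclic arm-pulling strategy."""
    run = [0] * len(bases)                    # current consecutive-play run per arm
    rest = [2] * len(bases)                   # rounds since each arm was last pulled
    total = 0.0
    for _, a in zip(range(horizon), itertools.cycle(cycle)):
        total += expected_payoff(bases[a], run[a], rest[a])
        for i in range(len(bases)):
            run[i] = run[i] + 1 if i == a else 0
            rest[i] = 0 if i == a else rest[i] + 1
    return total

print(evaluate([0, 0, 1, 1]), evaluate([0, 1]))  # ~8.1 vs ~6.3
```

Even in this toy instance, the best schedule is not obvious: resting each arm for two rounds outperforms rapid alternation, which is exactly the kind of trade-off the planning problem must resolve.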
Learning to Crawl
Web crawling is the problem of keeping a cache of webpages fresh, i.e.,
having the most recent copy available when a page is requested. This problem is
usually coupled with the natural restriction that the bandwidth available to
the web crawler is limited. The corresponding optimization problem was solved
optimally by Azar et al. [2018] under the assumption that, for each webpage,
both the elapsed time between two changes and the elapsed time between two
requests follow a Poisson distribution with known parameters. In this paper, we
study the same control problem but under the assumption that the change rates
are unknown a priori, and thus we need to estimate them in an online fashion
using only partial observations (i.e., single-bit signals indicating whether
the page has changed since the last refresh). As a point of departure, we
characterise the conditions under which one can solve the problem with such
partial observability. Next, we propose a practical estimator and compute
confidence intervals for it in terms of the elapsed time between the
observations. Finally, we bound the regret of the explore-and-commit algorithm
under a carefully chosen exploration horizon.
Our simulation study shows that our online policy scales well and achieves
close to optimal performance for a wide range of parameters.
Comment: Published at AAAI 2020.
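The estimation step lends itself to a short illustration: under the Poisson assumption, a refresh after elapsed time tau observes a change with probability 1 - exp(-rate * tau), so the unknown rate can be recovered by maximum likelihood from the single-bit signals. The sketch below is one natural estimator under that assumption; the paper's actual estimator and confidence intervals may differ.

```python
import math, random

# Sketch of a maximum-likelihood estimator for a page's Poisson change rate
# from single-bit observations: with rate lam, the probability that the page
# changed at least once during an interval of length tau is 1 - exp(-lam*tau).
# (Illustrative sketch, not the paper's exact estimator.)

def log_likelihood(lam, observations):
    ll = 0.0
    for tau, changed in observations:  # (elapsed time, 1-bit change signal)
        p = 1.0 - math.exp(-lam * tau)
        ll += math.log(p) if changed else -lam * tau
    return ll

def mle_rate(observations, lo=1e-4, hi=100.0, iters=100):
    """Golden-section search for the maximizing rate; the log-likelihood
    is concave in lam for this model, so the search is valid."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - phi * (b - a), a + phi * (b - a)
        if log_likelihood(c, observations) > log_likelihood(d, observations):
            b = d
        else:
            a = c
    return (a + b) / 2

# Simulate refreshes of a page with true change rate 0.7 and recover it.
random.seed(0)
true_rate = 0.7
obs = []
for _ in range(500):
    tau = random.uniform(0.1, 5.0)
    obs.append((tau, random.random() < 1 - math.exp(-true_rate * tau)))
print(round(mle_rate(obs), 2))  # close to 0.7
```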
Bandits with Dynamic Arm-acquisition Costs
We consider a bandit problem where, at any time, the decision maker can add
new arms to her consideration set. A new arm is queried at a cost from an
"arm-reservoir" containing finitely many "arm-types," each characterized by a
distinct mean reward. The cost of a query is reflected in a diminishing
probability of the returned arm being optimal, unbeknownst to the decision
maker; this
feature encapsulates defining characteristics of a broad class of
operations-inspired online learning problems, e.g., those arising in markets
with churn, or those involving allocations subject to costly resource
acquisition. The decision maker's goal is to maximize her cumulative expected
payoffs over a sequence of n pulls, oblivious to the statistical properties as
well as types of the queried arms. We study two natural modes of endogeneity in
the reservoir distribution, and characterize a necessary condition for
achievability of sub-linear regret in the problem. We also discuss a
UCB-inspired adaptive algorithm that is long-run-average optimal whenever said
condition is satisfied, thereby establishing its tightness.
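As a rough illustration of the flavor of such an algorithm, here is a sketch of a UCB1 rule over a growing consideration set that periodically queries the reservoir. The query schedule, two-type reservoir, and Bernoulli rewards are hypothetical stand-ins, not the algorithm or model analyzed in the paper.

```python
import math, random

# Illustrative sketch: a UCB-style rule over a growing consideration set.
# The learner periodically queries a new arm from the reservoir and otherwise
# plays the arm with the highest upper confidence bound.

def query_reservoir():
    # Hypothetical reservoir: the returned arm is a "good" type (mean 0.9)
    # with probability 0.3, otherwise a mediocre type (mean 0.4).
    return 0.9 if random.random() < 0.3 else 0.4

def ucb_with_arm_acquisition(horizon=5000, query_every=500):
    means, counts, sums = [], [], []          # per-arm true mean, pulls, reward sum
    for t in range(1, horizon + 1):
        if (t - 1) % query_every == 0:        # periodically pay to add a new arm
            means.append(query_reservoir())
            counts.append(0)
            sums.append(0.0)
        def ucb_index(i):                     # UCB1 index; unplayed arms go first
            if counts[i] == 0:
                return float("inf")
            return sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
        arm = max(range(len(means)), key=ucb_index)
        reward = 1.0 if random.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return sum(sums) / horizon                # average payoff per round

random.seed(1)
print(round(ucb_with_arm_acquisition(), 3))
```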
Bandits with Deterministically Evolving States
We propose a model for learning with bandit feedback while accounting for
deterministically evolving and unobservable states that we call Bandits with
Deterministically Evolving States. The workhorse applications of our model are
learning for recommendation systems and learning for online ads. In both cases,
the reward that the algorithm obtains at each round is a function of the
short-term reward of the action chosen and how "healthy" the system is (i.e.,
as measured by its state). For example, in recommendation systems, the reward
that the platform obtains from a user's engagement with a particular type of
content depends not only on the inherent features of the specific content, but
also on how the user's preferences have evolved as a result of interacting with
other types of content on the platform. Our general model accounts for
different rates at which the state evolves (e.g., how fast a user's
preferences shift as a result of previous content consumption) and
encompasses standard multi-armed bandits as a special case. The goal of the
algorithm is to minimize a notion of regret against the best fixed sequence of
arms pulled. We analyze online learning algorithms for any possible
parametrization of the evolution rate and obtain regret guarantees whose rates
depend on the regime of this parameter.
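A small sketch makes the state dynamics concrete. Below, the state follows a hypothetical exponential-smoothing update at rate lam and scales the short-term reward of the chosen arm; none of this is the paper's exact parametrization, but it shows why the myopically best arm can be suboptimal.

```python
# Minimal sketch of bandit pulls with a deterministically evolving,
# unobserved state (hypothetical dynamics and reward form, for illustration).

def play_sequence(arms, short_term, health_effect, lam=0.5, state=1.0):
    """Rewards of a fixed arm sequence; the state is never observed directly."""
    rewards = []
    for a in arms:
        rewards.append(short_term[a] * state)               # reward depends on the state
        state = (1 - lam) * state + lam * health_effect[a]  # deterministic evolution
    return rewards

# Arm 1 pays more immediately but degrades the state; arm 0 keeps it healthy.
short_term = {0: 0.5, 1: 1.0}
health_effect = {0: 1.0, 1: 0.2}
print(sum(play_sequence([1] * 10, short_term, health_effect)))    # myopic: ~3.6
print(sum(play_sequence([0, 1] * 5, short_term, health_effect)))  # alternating: ~5.5
```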
Stochastic Bandits with Delay-Dependent Payoffs
Motivated by recommendation problems in music streaming platforms, we propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a class of ranking policies provably approximating, to within a constant factor, the expected reward of the optimal policy. We show an algorithm whose regret with respect to the best ranking policy is bounded by Õ(√(kT)), where k is the number of arms and T is time. Our algorithm uses only O(k ln ln T) switches, which helps when switching between policies is costly. As constructing the class of learning policies requires ordering the arms according to their expectations, we also bound the number of pulls required to do so. Finally, we run experiments to compare our algorithm against UCB on different problem instances.
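The appeal of ranking policies can be seen in a small sketch: cycling through a ranked set of d arms pulls each of them with a constant delay of d rounds, so the choice of d trades off recharge time against the quality of the arms used. The recharge model below (base mean times a saturating recharge factor) is a hypothetical instantiation, not the paper's.

```python
# Illustrative sketch of a ranking-style policy under delay-dependent payoffs.

def expected_reward(base, delay, tau_max=5):
    return base * min(delay, tau_max) / tau_max  # recharges linearly, then saturates

def ranking_policy_value(bases, d):
    """Steady-state per-round reward of cycling through the top-d arms."""
    top = sorted(bases, reverse=True)[:d]        # rank arms by their base means
    return sum(expected_reward(b, d) for b in top) / d

bases = [1.0, 0.9, 0.5, 0.2]
for d in range(1, len(bases) + 1):               # cycle length trades off
    print(d, round(ranking_policy_value(bases, d), 3))  # delay gained vs. arms used
```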