Last Switch Dependent Bandits with Monotone Payoff Functions
In a recent work, Laforgue et al. introduce the model of last switch
dependent (LSD) bandits, in an attempt to capture nonstationary phenomena
induced by the interaction between the player and the environment. Examples
include satiation, where consecutive plays of the same action lead to decreased
performance, and deprivation, where the payoff of an action increases after an
interval of inactivity. In this work, we take a step towards understanding the
approximability of planning in LSD bandits, namely, the (NP-hard) problem of
computing an optimal arm-pulling strategy under complete knowledge of the
model. In particular, we design the first efficient constant approximation
algorithm for the problem and show that, under a natural monotonicity
assumption on the payoffs, its approximation guarantee (almost) matches the
state-of-the-art for the special and well-studied class of recharging bandits
(also known as delay-dependent). In this attempt, we develop new tools and
insights for this class of problems, including a novel higher-dimensional
relaxation and the technique of mirroring the evolution of virtual states. We
believe that these novel elements could potentially be used for approaching
richer classes of action-induced nonstationary bandits (e.g., special instances
of restless bandits). In the case where the model parameters are initially
unknown, we develop an online learning adaptation of our algorithm for which we
provide sublinear regret guarantees against its full-information counterpart.
Comment: Accepted to the 40th International Conference on Machine Learning (ICML 2023).
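As a concrete illustration of the payoff model (a minimal sketch, not the paper's algorithm), the Python snippet below simulates a last-switch-dependent environment in which each arm's expected payoff is a monotone function of the time elapsed since that arm was last pulled; the two arms, their payoff shapes, and the horizon are hypothetical choices for illustration.

    # Hypothetical monotone ("recharging") payoffs: an arm's expected reward
    # grows with tau, the number of rounds since the arm was last pulled, so
    # repetition causes satiation and inactivity lets the arm recharge.
    PAYOFFS = {
        "a": lambda tau: 1.0 - 0.9 ** tau,          # fast recharge
        "b": lambda tau: 0.5 * min(tau, 4) / 4.0,   # slow, capped recharge
    }

    def play(schedule, horizon=20):
        """Total expected payoff of a cyclic arm-pulling schedule."""
        last_pull = {arm: 0 for arm in PAYOFFS}
        total = 0.0
        for t in range(1, horizon + 1):
            arm = schedule[(t - 1) % len(schedule)]
            tau = t - last_pull[arm]    # time since this arm was last pulled
            total += PAYOFFS[arm](tau)
            last_pull[arm] = t
        return total

    # Planning amounts to choosing the schedule; compare two candidates.
    print(play(["a", "b"]), play(["a", "a", "b"]))

Even in this toy instance the value of each pull depends on the whole history of switches, which is the source of the planning problem's hardness.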
Learning to Crawl
Web crawling is the problem of keeping a cache of webpages fresh, i.e.,
having the most recent copy available when a page is requested. This problem is
usually coupled with the natural restriction that the bandwidth available to
the web crawler is limited. The corresponding optimization problem was solved
optimally by Azar et al. [2018] under the assumption that, for each webpage,
both the elapsed time between two changes and the elapsed time between two
requests follow a Poisson distribution with known parameters. In this paper, we
study the same control problem but under the assumption that the change rates
are unknown a priori, and thus we need to estimate them in an online fashion
using only partial observations (i.e., single-bit signals indicating whether
the page has changed since the last refresh). As a point of departure, we
characterise the conditions under which one can solve the problem with such
partial observability. Next, we propose a practical estimator and compute
confidence intervals for it in terms of the elapsed time between the
observations. Finally, we show that the explore-and-commit algorithm achieves
sublinear regret with a carefully chosen exploration horizon.
Our simulation study shows that our online policy scales well and achieves
close to optimal performance for a wide range of the parameters.
Comment: Published at AAAI 2020.
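The estimation step admits a compact illustration. Assuming, as in the abstract, that page changes arrive as a Poisson process with rate d, a refresh after elapsed time w observes a change with probability 1 - exp(-d*w); a maximum-likelihood estimate can then be recovered from the single-bit signals by a one-dimensional search. The sketch below is a generic MLE under that assumption, not necessarily the paper's estimator.

    import math

    def mle_change_rate(ws, cs, lo=1e-8, hi=1e3, iters=100):
        """MLE of a Poisson change rate from single-bit observations.

        ws[i]: elapsed time between refresh i-1 and refresh i.
        cs[i]: 1 if the page changed in that window, else 0.
        The log-likelihood sum_i cs[i]*log(1 - exp(-d*ws[i]))
        - (1 - cs[i])*d*ws[i] is concave in d, so bisection on its
        derivative finds the maximizer. Needs at least one changed and
        one unchanged observation to have an interior optimum.
        """
        def deriv(d):
            s = 0.0
            for w, c in zip(ws, cs):
                if c:
                    # Numerically safe form of w / (exp(d*w) - 1).
                    s += w * math.exp(-d * w) / (1.0 - math.exp(-d * w))
                else:
                    s -= w
            return s

        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if deriv(mid) > 0.0:
                lo = mid    # likelihood still increasing: true rate is larger
            else:
                hi = mid
        return 0.5 * (lo + hi)

    # Three refresh windows of 1, 2, and 4 time units; changes seen twice.
    print(mle_change_rate([1.0, 2.0, 4.0], [0, 1, 1]))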
Influencing Bandits: Arm Selection for Preference Shaping
We consider a non-stationary multi-armed bandit in which the population
preferences are positively and negatively reinforced by the observed rewards.
The objective of the algorithm is to shape the population preferences to
maximize the fraction of the population favouring a predetermined arm. For the
case of binary opinions, two types of opinion dynamics are considered --
decreasing elasticity (modeled as a Polya urn with increasing number of balls)
and constant elasticity (using the voter model). For the first case, we
describe an Explore-then-commit policy and a Thompson sampling policy and
analyse the regret for each of these policies. We then show that these
algorithms and their analyses carry over to the constant elasticity case. We
also describe a Thompson sampling based algorithm for the case when more than
two types of opinions are present. Finally, we discuss the case where the
presence of multiple recommendation systems gives rise to a trade-off between
their popularity and opinion-shaping objectives.
Comment: 14 pages, 8 figures, 24 references, proofs in appendix.
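To make the decreasing-elasticity dynamics concrete, here is a toy simulation (a sketch under assumed dynamics, not the paper's exact model): binary opinions live in a Polya urn whose total number of balls grows every round, so each new reinforcement shifts the population fraction less. The reward probabilities and the reinforcement rule are illustrative assumptions.

    import random

    def polya_opinion_sim(horizon=10000, p=(0.6, 0.5), seed=0):
        """Polya-urn opinion dynamics with decreasing elasticity.

        balls[a] counts supporters of arm a. Recommending arm a yields a
        Bernoulli reward with mean p[a]; a success adds a supporter of a
        (positive reinforcement) and a failure adds one for the rival arm
        (negative reinforcement). Because the urn grows, later rounds move
        the supporter fraction less and less (decreasing elasticity).
        """
        rng = random.Random(seed)
        balls = [1, 1]
        for _ in range(horizon):
            arm = 0                                  # always promote arm 0
            reward = rng.random() < p[arm]
            balls[arm if reward else 1 - arm] += 1
        return balls[0] / sum(balls)                 # fraction favouring arm 0

    print(polya_opinion_sim())   # tends to p[0] = 0.6 under this toy policy

An actual shaping policy would choose which arm to recommend each round (e.g., via Thompson sampling over the unknown p), trading off estimation of the rewards against pushing the urn toward the predetermined arm.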
Quasi-regular sequences and optimal schedules for security games
We study security games in which a defender commits to a mixed strategy for
protecting a finite set of targets of different values. An attacker, knowing
the defender's strategy, chooses which target to attack and for how long. If
the attacker spends time t at a target of value α, and if he
leaves before the defender visits the target, his utility is t·α; if the
defender visits before he leaves, his utility is 0. The defender's
goal is to minimize the attacker's utility. The defender's strategy consists of
a schedule for visiting the targets; it takes her unit time to switch between
targets. Such games are a simplified model of a number of real-world scenarios
such as protecting computer networks from intruders, crops from thieves, etc.
We show that optimal defender play for these continuous-time security games
reduces to the solution of a combinatorial question regarding the existence of
infinite sequences over a finite alphabet, with the following properties for
each symbol σ: (1) σ constitutes a prescribed fraction of the
sequence. (2) The occurrences of σ are spread apart close to evenly, in that
the ratio of the longest to shortest interval between consecutive occurrences
is bounded by a parameter K. We call such sequences K-quasi-regular.
We show that, surprisingly, 2-quasi-regular sequences suffice for optimal
defender play. What is more, even randomized 2-quasi-regular sequences
suffice for optimality. We show that such sequences always exist, and can be
calculated efficiently.
The question of the least K for which deterministic K-quasi-regular
sequences exist is fascinating. Using an ergodic theoretical approach, we show
that deterministic 3-quasi-regular sequences always exist. For 2 ≤ K < 3 we do
not know whether deterministic K-quasi-regular sequences always exist.
Comment: to appear in Proc. of SODA 2018.
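The combinatorial condition is simple to state in code. The checker below tests a finite window of a sequence against the two defining properties: each symbol's empirical frequency is close to its prescribed fraction, and the ratio of longest to shortest gap between consecutive occurrences is at most K. Using a finite window with a frequency tolerance is a simplification of the infinite-sequence definition.

    def is_k_quasi_regular(seq, freqs, K, tol=0.05):
        """Check a finite window of a sequence for K-quasi-regularity.

        seq:   finite window, e.g. "abab...".
        freqs: prescribed fraction per symbol, e.g. {"a": 0.5, "b": 0.5}.
        K:     bound on (longest gap) / (shortest gap) for each symbol.
        tol:   frequency slack (a finite-window simplification).
        """
        n = len(seq)
        for sym, target in freqs.items():
            positions = [i for i, s in enumerate(seq) if s == sym]
            # Property (1): sym makes up roughly its prescribed fraction.
            if abs(len(positions) / n - target) > tol:
                return False
            # Property (2): occurrences are spread apart close to evenly.
            gaps = [b - a for a, b in zip(positions, positions[1:])]
            if gaps and max(gaps) > K * min(gaps):
                return False
        return True

    # Strict alternation is 1-quasi-regular: every gap equals 2.
    print(is_k_quasi_regular("abababab", {"a": 0.5, "b": 0.5}, K=1))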
Bandits with Deterministically Evolving States
We propose a model for learning with bandit feedback while accounting for
deterministically evolving and unobservable states that we call Bandits with
Deterministically Evolving States. The workhorse applications of our model are
learning for recommendation systems and learning for online ads. In both cases,
the reward that the algorithm obtains at each round is a function of the
short-term reward of the action chosen and how "healthy" the system is (i.e.,
as measured by its state). For example, in recommendation systems, the reward
that the platform obtains from a user's engagement with a particular type of
content depends not only on the inherent features of the specific content, but
also on how the user's preferences have evolved as a result of interacting with
other types of content on the platform. Our general model accounts for the
different rate λ at which the state evolves (e.g., how fast a
user's preferences shift as a result of previous content consumption) and
encompasses standard multi-armed bandits as a special case. The goal of the
algorithm is to minimize a notion of regret against the best fixed sequence of
arms pulled. We analyze online learning algorithms for any possible
parametrization of the evolution rate λ and obtain regret guarantees for every
regime of λ, with rates that depend on how fast the state evolves.
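To make the feedback loop between actions and the hidden state concrete, here is one plausible instantiation (the dynamics, arms, and parameters below are assumptions for illustration, not the paper's specification): the reward is the pulled arm's base reward scaled by the current state, after which the state moves at rate λ toward the pulled arm's long-run "health" impact.

    import random

    def evolving_state_bandit(policy, horizon, lam, base, impact, seed=0):
        """Bandit whose unobserved state evolves deterministically.

        Assumed dynamics: state s in [0, 1] is the system's "health".
        Pulling arm a yields (noisy) reward base[a] * s, after which
            s <- (1 - lam) * s + lam * impact[a],
        so lam = 0 freezes the state and recovers a standard bandit.
        """
        rng = random.Random(seed)
        s, total = 1.0, 0.0
        for t in range(horizon):
            a = policy(t)
            total += base[a] * s + 0.01 * rng.gauss(0.0, 1.0)
            s = (1.0 - lam) * s + lam * impact[a]   # deterministic evolution
        return total

    # Arm 1 is myopically best (base 0.9) but degrades the state; varying
    # lam shows why the regime of the evolution rate drives the regret.
    for lam in (0.0, 0.1, 1.0):
        print(lam, evolving_state_bandit(lambda t: 1, 1000, lam,
                                         base=[0.5, 0.9], impact=[1.0, 0.2]))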