Regret Minimisation in Multi-Armed Bandits Using Bounded Arm Memory
In this paper, we propose a constant word (RAM model) algorithm for regret
minimisation for both finite and infinite Stochastic Multi-Armed Bandit (MAB)
instances. Most of the existing regret minimisation algorithms need to remember
the statistics of all the arms they encounter. This may become a problem for
the cases where the number of available words of memory is limited. Designing
an efficient regret minimisation algorithm that uses a constant number of words
has long been interesting to the community. Some early attempts consider the
number of arms to be infinite, and require the reward distribution of the arms
to belong to some particular family. Recently, for finitely many-armed bandits
an explore-then-commit based algorithm~\citep{Liau+PSY:2018} appears to escape this assumption. However, due to its underlying PAC-based elimination, that method incurs high regret. We present a conceptually simple and efficient algorithm that needs to remember statistics of at most $M$ arms, and that for any $K$-armed finite bandit instance enjoys a sub-linear upper bound on regret. We extend it to achieve sub-linear \textit{quantile-regret}~\citep{RoyChaudhuri+K:2018} and empirically verify the efficiency of our algorithm via experiments.
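To make the bounded-memory regime concrete, here is a toy Python sketch (not the paper's algorithm): an explore-then-commit loop that holds running statistics for at most M arms at a time plus one champion slot. The memory budget, per-arm sampling schedule, and Bernoulli rewards are all illustrative assumptions.

import random

def bounded_memory_etc(arm_means, M=2, explore_pulls=200, horizon=20_000):
    """Toy explore-then-commit keeping statistics for at most M arms at a
    time (an illustration of the memory regime, not the paper's algorithm)."""
    champion, champ_mean = None, -1.0          # one extra slot: best arm so far
    total_reward, pulls_used = 0.0, 0
    arms = list(range(len(arm_means)))
    for i in range(0, len(arms), M):           # stream over the arms in blocks of M
        block = arms[i:i + M]
        stats = {a: [0.0, 0] for a in block}   # running (sum, count); at most M arms
        for _ in range(explore_pulls):
            for a in block:
                r = float(random.random() < arm_means[a])   # Bernoulli reward
                stats[a][0] += r
                stats[a][1] += 1
                total_reward += r
                pulls_used += 1
        best = max(block, key=lambda a: stats[a][0] / stats[a][1])
        best_mean = stats[best][0] / stats[best][1]
        if best_mean > champ_mean:             # promote the block's empirical best
            champion, champ_mean = best, best_mean
    for _ in range(horizon - pulls_used):      # commit for the rest of the horizon
        total_reward += float(random.random() < arm_means[champion])
    return champion, horizon * max(arm_means) - total_reward   # (arm, realized regret)

print(bounded_memory_etc([0.3, 0.5, 0.45, 0.6, 0.2]))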
Tight Regret Bounds for Single-pass Streaming Multi-armed Bandits
Regret minimization in streaming multi-armed bandits (MABs) has been studied extensively in recent years. In the single-pass setting with $K$ arms and $T$ trials, a regret lower bound of $\Omega(T^{2/3})$ has been proved for any algorithm with $o(K)$ memory (Maiti et al. [NeurIPS'21]; Agarwal et al. [COLT'22]). On the other hand, the previous best regret upper bound is still $O(K^{1/3} T^{2/3} \log^{1/3}(T))$, achieved by the streaming implementation of simple uniform exploration. The $K^{1/3} \log^{1/3}(T)$ gap leaves open the question of the tight regret bound in single-pass MABs with sublinear arm memory.
In this paper, we answer this open problem and complete the picture of regret minimization in single-pass streaming MABs. We first improve the regret lower bound to $\Omega(K^{1/3} T^{2/3})$ for algorithms with $o(K)$ memory, which matches the uniform exploration regret up to a logarithmic factor in $T$. We then show that the $\log^{1/3}(T)$ factor is not necessary: we can achieve $O(K^{1/3} T^{2/3})$ regret by finding an $\varepsilon$-best arm and committing to it for the rest of the trials. For regret minimization with high constant probability, we can apply the single-memory $\varepsilon$-best arm algorithms of Jin et al. [ICML'21] to obtain the optimal bound. Furthermore, for expected regret minimization, we design an algorithm with a single-arm memory that achieves the optimal regret up to an extra logarithmic factor, and an algorithm with $O(\log^{*}(K))$ memory that attains the optimal $O(K^{1/3} T^{2/3})$ regret, following the $\varepsilon$-best arm algorithm of Assadi and Wang [STOC'20].
We further tested the empirical performance of our algorithms. The simulation results show that the proposed algorithms consistently outperform the benchmark uniform exploration algorithm by a large margin and, on occasion, reduce the regret by up to 70%.
Comment: ICML 2023
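A toy sketch of the "find an eps-best arm, then commit" recipe in a single pass with single-arm memory follows; the per-arm budget below is a heuristic stand-in, not the paper's tuned schedule.

import random

def single_pass_commit(arm_means, T):
    """Toy single-pass, single-arm-memory bandit: estimate each arriving
    arm with a small budget, keep only the empirically best arm, then
    commit to it (an illustration, not the paper's algorithm)."""
    K = len(arm_means)
    budget = max(1, int((T / K) ** (2 / 3)))   # heuristic pulls per arriving arm
    king, king_est = None, -1.0                # the single arm kept in memory
    reward, t = 0.0, 0
    for a, mu in enumerate(arm_means):         # the arms arrive one by one
        est = 0.0
        for _ in range(budget):
            r = float(random.random() < mu)
            est += r
            reward += r
            t += 1
        est /= budget
        if est > king_est:                     # keep the better arm, drop the other
            king, king_est = a, est
    while t < T:                               # commit phase
        reward += float(random.random() < arm_means[king])
        t += 1
    return T * max(arm_means) - reward         # realized regret

print(single_pass_commit([0.4, 0.55, 0.5, 0.62], T=50_000))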
PAC Identification of Many Good Arms in Stochastic Multi-Armed Bandits
We consider the problem of identifying any $k$ out of the best $m$ arms in an $n$-armed stochastic multi-armed bandit. Framed in the PAC setting, this problem generalises both the problem of `best subset selection' and that of selecting `one out of the best $m$' arms [RoyChaudhuri+K:2017]. In applications such as crowd-sourcing and drug design, identifying a single good solution is often not sufficient. Moreover, finding the best subset might be hard due to the presence of many indistinguishably close solutions. Our generalisation of identifying exactly $k$ arms out of the best $m$, where $k \le m$, serves as a more effective alternative. We present a lower bound on the worst-case sample complexity for general $k$, and a fully sequential PAC algorithm, \GLUCB, which is more sample-efficient on easy instances.
Also, extending our analysis to infinite-armed bandits, we present a PAC algorithm that is independent of $n$ and identifies an arm from the best $\rho$ fraction of arms using at most an additive poly-log number of samples more than the lower bound, thereby improving over [RoyChaudhuri+K:2017] and [Aziz+AKA:2018]. The problem of identifying $k$ distinct arms from the best $\rho$ fraction is not always well-defined; for a special class of this problem, we present lower and upper bounds. Finally, through a reduction, we establish a relation between upper bounds for the `one out of the best $\rho$' problem for infinite instances and the `one out of the best $m$' problem for finite instances. We conjecture that it is more efficient to solve `small' finite instances using the latter formulation, rather than going through the former.
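For contrast with the fully sequential algorithm described in the abstract, a uniform-sampling (eps, delta)-PAC baseline is easy to state. The per-arm pull count below is the standard Hoeffding-based budget; the routine is an illustrative baseline, not \GLUCB.

import math, random

def naive_pac_k_of_m(arm_means, k, eps=0.1, delta=0.05):
    """Uniform-sampling baseline for (eps, delta)-PAC arm selection: pull
    every arm the same number of times and return the top-k empirical arms.
    (A baseline for contrast with a fully sequential algorithm; the pull
    count is the standard Hoeffding bound.)"""
    n = len(arm_means)
    pulls = math.ceil(2 / eps**2 * math.log(2 * n / delta))  # per-arm budget
    estimates = []
    for mu in arm_means:
        heads = sum(random.random() < mu for _ in range(pulls))
        estimates.append(heads / pulls)
    # Return the indices of the k largest empirical means.
    return sorted(range(n), key=lambda a: estimates[a], reverse=True)[:k]

print(naive_pac_k_of_m([0.9, 0.8, 0.7, 0.6, 0.5], k=2))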
The Best Arm Evades: Near-optimal Multi-pass Streaming Lower Bounds for Pure Exploration in Multi-armed Bandits
We give a near-optimal sample-pass trade-off for pure exploration in multi-armed bandits (MABs) via multi-pass streaming algorithms: any streaming algorithm with sublinear memory that uses the optimal sample complexity of $O(n/\Delta^2)$ requires $\Omega\left(\frac{\log(1/\Delta)}{\log\log(1/\Delta)}\right)$ passes. Here, $n$ is the number of arms and $\Delta$ is the reward gap between the best and the second-best arm. Our result matches, up to lower order terms, the $O(\log(1/\Delta))$-pass algorithm of Jin et al. [ICML'21], which uses only $O(1)$ memory, and answers an open question posed by Assadi and Wang [STOC'20].
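For intuition, here is a toy multi-pass, O(1)-memory heuristic of the kind such lower bounds are matched against. The budget schedule is illustrative and the sketch carries no formal guarantee; it is not the algorithm of Jin et al.

import random

def multipass_best_arm(arm_means, passes=6):
    """Toy multi-pass, O(1)-memory streaming heuristic for best-arm
    identification: keep one 'king' arm across the stream and duel it
    against every arriving arm, with more samples in later passes
    (illustrative only; no formal guarantee)."""
    king = 0
    for r in range(passes):
        budget = 2 ** (2 * r + 4)          # sampling budget grows each pass
        for a in range(len(arm_means)):    # one pass over the stream
            if a == king:
                continue
            est_a = sum(random.random() < arm_means[a] for _ in range(budget)) / budget
            est_k = sum(random.random() < arm_means[king] for _ in range(budget)) / budget
            if est_a > est_k:              # challenger wins: replace the king
                king = a
    return king

print(multipass_best_arm([0.45, 0.5, 0.52, 0.48]))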
Learning how to act: making good decisions with machine learning
This thesis is about machine learning and statistical approaches
to decision making. How can we learn from data to anticipate the
consequence of, and optimally select, interventions or actions?
Problems such as deciding which medication to prescribe to
patients, who should be released on bail, and how much to charge
for insurance are ubiquitous, and have far-reaching impacts on
our lives. There are two fundamental approaches to learning how
to act: reinforcement learning, in which an agent directly
intervenes in a system and learns from the outcome, and
observational causal inference, whereby we seek to infer the
outcome of an intervention from observing the system.
The goal of this thesis is to connect and unify these key
approaches. I introduce causal bandit problems: a synthesis that
combines causal graphical models, which were developed for
observational causal inference, with multi-armed bandit problems,
which are a subset of reinforcement learning problems that are
simple enough to admit formal analysis. I show that knowledge of
the causal structure allows us to transfer information learned
about the outcome of one action to predict the outcome of an
alternate action, yielding a novel form of structure between
bandit arms that cannot be exploited by existing algorithms. I propose an algorithm for causal bandit problems and prove bounds on the simple regret, demonstrating that it is close to minimax optimal and better than algorithms that do not use the additional causal information.
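A minimal sketch of the information transfer that causal bandits exploit, in a toy "parallel" causal graph: pulling one interventional arm also reveals the naturally sampled values of the other variables, which informs estimates for other arms. The model, variable names, and probabilities below are illustrative assumptions, not the thesis's algorithm.

import random
from collections import defaultdict

def causal_bandit_demo(rounds=5000):
    """Toy parallel causal bandit: X1, X2 are independent coins and the
    reward Y depends only on X1. An arm intervenes on one variable, but
    every pull reveals the non-intervened variable too, so samples from
    one arm inform estimates for other arms (illustrative model only)."""
    def pull(intervene=None):
        x = {1: random.random() < 0.5, 2: random.random() < 0.3}
        if intervene:                      # arm = do(X_var = val)
            var, val = intervene
            x[var] = val
        y = random.random() < (0.7 if x[1] else 0.2)   # Y depends on X1 only
        return x, y

    # Pull only do(X2 = True), yet estimate E[Y | do(X1 = v)] from the
    # observed, naturally sampled values of X1: information transfer.
    stats = defaultdict(lambda: [0, 0])
    for _ in range(rounds):
        x, y = pull(intervene=(2, True))
        stats[x[1]][0] += y
        stats[x[1]][1] += 1
    return {f"E[Y|do(X1={v})]": s / n for v, (s, n) in stats.items()}

print(causal_bandit_demo())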
Exploration with Limited Memory: Streaming Algorithms for Coin Tossing, Noisy Comparisons, and Multi-Armed Bandits
Consider the following abstract coin tossing problem: given a set of $n$ coins with unknown biases, find the most biased coin using a minimal number of coin tosses. This is a common abstraction of various exploration problems in theoretical computer science and machine learning, and it has been studied extensively over the years. In particular, algorithms with optimal sample complexity (number of coin tosses) have been known for this problem for quite some time.
Motivated by applications to processing massive datasets, we study the space complexity of solving this problem with the optimal number of coin tosses in the streaming model. In this model, the coins arrive one by one and the algorithm is only allowed to store a limited number of coins at any point -- any coin not present in memory is lost and can no longer be tossed or compared to arriving coins. Prior algorithms for the coin tossing problem with optimal sample complexity are based on iterative elimination of coins, which inherently requires storing all the coins, leading to memory-inefficient streaming algorithms.
We remedy this state of affairs by presenting a series of improved streaming algorithms for this problem: we start with a simple algorithm that requires storing only $O(\log n)$ coins, and then iteratively refine it further and further, leading to algorithms with $O(\log\log n)$ memory, $O(\log^{*} n)$ memory, and finally one that stores only a single extra coin in memory -- the same space needed to just store the best coin throughout the stream. Furthermore, we extend our algorithms to the problem of finding the $k$ most biased coins, as well as to other exploration problems such as finding top-$k$ elements using noisy comparisons or finding an $\varepsilon$-best arm in stochastic multi-armed bandits, and we obtain efficient streaming algorithms for these problems.
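A simplified king-vs-challenger scheme conveys the flavour of the single-extra-coin regime; the budget schedule below is illustrative and not sample-optimal, and this is not the paper's procedure.

import random

def stream_most_biased(biases, base_budget=64):
    """Keep one 'king' coin plus the arriving coin in memory; each arriving
    challenger duels the king with a sampling budget that grows the longer
    the king has survived (a simplified scheme in the spirit of the
    single-extra-coin regime; not sample-optimal)."""
    king, streak = 0, 0
    for c in range(1, len(biases)):          # coins arrive one at a time
        budget = base_budget * (streak + 1)  # more evidence for a seasoned king
        heads_c = sum(random.random() < biases[c] for _ in range(budget))
        heads_k = sum(random.random() < biases[king] for _ in range(budget))
        if heads_c > heads_k:
            king, streak = c, 0              # challenger dethrones the king
        else:
            streak += 1
    return king

print(stream_most_biased([0.5, 0.52, 0.6, 0.55, 0.58]))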
Exploration via linearly perturbed loss minimisation
We introduce exploration via linear loss perturbations (EVILL), a randomised
exploration method for structured stochastic bandit problems that works by
solving for the minimiser of a linearly perturbed regularised negative
log-likelihood function. We show that, for the case of generalised linear
bandits, EVILL reduces to perturbed history exploration (PHE), a method where
exploration is done by training on randomly perturbed rewards. In doing so, we
provide a simple and clean explanation of when and why random reward
perturbations give rise to good bandit algorithms. With the data-dependent
perturbations we propose, not present in previous PHE-type methods, EVILL is
shown to match the performance of Thompson-sampling-style
parameter-perturbation methods, both in theory and in practice. Moreover, we
show an example outside of generalised linear bandits where PHE leads to
inconsistent estimates, and thus linear regret, while EVILL remains performant.
Like PHE, EVILL can be implemented in just a few lines of code.
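For the linear-Gaussian special case, the recipe as stated in the abstract reduces to a few lines: add a random linear term to the regularised least-squares loss and act greedily with respect to its minimiser. In this numpy sketch, the perturbation scale and ridge penalty are illustrative choices, not the paper's tuned ones.

import numpy as np

rng = np.random.default_rng(0)

def evill_step(X, y, actions, lam=1.0, scale=1.0):
    """One round of linearly-perturbed loss minimisation for a linear
    bandit: minimise 0.5*||X @ th - y||^2 + 0.5*lam*||th||^2 + z @ th
    for a random z, then play the action greedy for the perturbed fit.
    (A sketch of the idea; scale and lam are illustrative.)"""
    d = X.shape[1]
    z = scale * rng.standard_normal(d)            # random linear perturbation
    A = X.T @ X + lam * np.eye(d)
    theta = np.linalg.solve(A, X.T @ y - z)       # closed-form loss minimiser
    return max(actions, key=lambda a: a @ theta)  # greedy w.r.t. perturbed estimate

# Tiny usage example with two fixed actions and a hidden parameter.
theta_star = np.array([1.0, -0.5])
X = rng.standard_normal((20, 2))
y = X @ theta_star + 0.1 * rng.standard_normal(20)
actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(evill_step(X, y, actions))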