
    Regret Minimisation in Multi-Armed Bandits Using Bounded Arm Memory

    In this paper, we propose a constant-word (RAM model) algorithm for regret minimisation in both finite and infinite Stochastic Multi-Armed Bandit (MAB) instances. Most existing regret minimisation algorithms need to remember the statistics of all the arms they encounter, which becomes a problem when the number of available words of memory is limited. Designing an efficient regret minimisation algorithm that uses a constant number of words has long been of interest to the community. Some early attempts consider the number of arms to be infinite and require the reward distributions of the arms to belong to some particular family. Recently, an explore-then-commit based algorithm~\citep{Liau+PSY:2018} for finitely many-armed bandits escapes such assumptions; however, due to the underlying PAC-based elimination, their method incurs a high regret. We present a conceptually simple and efficient algorithm that needs to remember statistics of at most $M$ arms, and for any $K$-armed finite bandit instance it enjoys an $O(KM + K^{1.5}\sqrt{T\log(T/MK)}/M)$ upper bound on regret. We extend it to achieve sub-linear \textit{quantile-regret}~\citep{RoyChaudhuri+K:2018} and empirically verify the efficiency of our algorithm via experiments.
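
    To make the memory constraint concrete, here is a minimal sketch, assuming a generic reward oracle `pull(arm)`, of a bounded-arm-memory strategy: scan the $K$ arms in batches of at most $M-1$ new arms plus the current champion, run plain UCB1 within each batch, and carry only the champion forward. It illustrates the constant-word regime only; the batching and budgeting below are illustrative assumptions, not the algorithm analysed in the paper.

```python
import math
import random

def ucb_on_subset(pull, arm_ids, budget):
    """Run UCB1 on the arms in arm_ids for roughly `budget` pulls;
    return the empirically best arm of the subset."""
    counts = {a: 0 for a in arm_ids}
    sums = {a: 0.0 for a in arm_ids}
    for a in arm_ids:                      # pull each arm once to initialise
        sums[a] += pull(a)
        counts[a] += 1
    for t in range(len(arm_ids), budget):
        ucb = {a: sums[a] / counts[a] + math.sqrt(2 * math.log(t + 1) / counts[a])
               for a in arm_ids}
        a = max(ucb, key=ucb.get)
        sums[a] += pull(a)
        counts[a] += 1
    return max(arm_ids, key=lambda a: sums[a] / counts[a])

def bounded_memory_bandit(pull, K, T, M):
    """Illustrative bounded-memory scan (a sketch, NOT the paper's algorithm):
    statistics for at most M arms are held in memory at any time."""
    arms = list(range(K))
    random.shuffle(arms)
    n_batches = max(1, math.ceil(K / (M - 1)))
    batch_budget = T // n_batches          # crude even split of the horizon
    champion = None
    for i in range(0, K, M - 1):
        batch = arms[i:i + M - 1] + ([champion] if champion is not None else [])
        champion = ucb_on_subset(pull, batch, batch_budget)
    return champion                        # commit to this arm afterwards
```

    A caller might invoke `bounded_memory_bandit(pull, K=50, T=100_000, M=4)`; at no point does the routine hold statistics for more than four arms.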

    Tight Regret Bounds for Single-pass Streaming Multi-armed Bandits

    Regret minimization in streaming multi-armed bandits (MABs) has been studied extensively in recent years. In the single-pass setting with $K$ arms and $T$ trials, a regret lower bound of $\Omega(T^{2/3})$ has been proved for any algorithm with $o(K)$ memory (Maiti et al. [NeurIPS'21]; Agarwal et al. [COLT'22]). However, the previous best regret upper bound is still $O(K^{1/3} T^{2/3}\log^{1/3}(T))$, which is achieved by the streaming implementation of simple uniform exploration. The $O(K^{1/3}\log^{1/3}(T))$ gap leaves open the question of the tight regret bound in single-pass MABs with sublinear arm memory. In this paper, we answer this open problem and complete the picture of regret minimization in single-pass streaming MABs. We first improve the regret lower bound to $\Omega(K^{1/3}T^{2/3})$ for algorithms with $o(K)$ memory, which matches the uniform-exploration regret up to a logarithmic factor in $T$. We then show that the $\log^{1/3}(T)$ factor is not necessary, and we can achieve $O(K^{1/3}T^{2/3})$ regret by finding an $\varepsilon$-best arm and committing to it for the rest of the trials. For regret minimization with high constant probability, we can apply the single-memory $\varepsilon$-best arm algorithms of Jin et al. [ICML'21] to obtain the optimal bound. Furthermore, for expected regret minimization, we design an algorithm with single-arm memory that achieves $O(K^{1/3} T^{2/3}\log(K))$ regret, and an algorithm with $O(\log^{*}(n))$ memory that attains the optimal $O(K^{1/3} T^{2/3})$ regret, following the $\varepsilon$-best arm algorithm of Assadi and Wang [STOC'20]. We further tested the empirical performance of our algorithms; the simulation results show that the proposed algorithms consistently outperform the benchmark uniform exploration algorithm by a large margin and, on occasion, reduce the regret by up to 70%.
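
    For reference, the uniform-exploration baseline that the upper bound above is measured against is simple to state in the single-pass model: sample each arriving arm a fixed number of times, remember only the best empirical mean seen so far, and commit to that arm for the remaining trials. The sketch below uses a generic $(T/K)^{2/3}$-style per-arm budget purely for illustration; the constants and logarithmic terms of the paper's analysis are not reproduced.

```python
import math

def single_pass_explore_then_commit(arm_stream, pull, K, T):
    """Single-pass uniform exploration keeping one arm's statistics in memory.
    The per-arm budget is an illustrative (T/K)^(2/3) choice, which is what
    gives the familiar K^(1/3) T^(2/3) regret shape (log factors ignored)."""
    s = max(1, int((T / K) ** (2.0 / 3.0)))   # pulls spent on each arriving arm
    best_arm, best_mean, spent = None, float("-inf"), 0
    for arm in arm_stream:                    # one pass over the stream
        mean = sum(pull(arm) for _ in range(s)) / s
        spent += s
        if mean > best_mean:                  # keep only the running best: O(1) arm memory
            best_arm, best_mean = arm, mean
    reward = sum(pull(best_arm) for _ in range(max(0, T - spent)))
    return best_arm, reward
```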

    PAC Identification of Many Good Arms in Stochastic Multi-Armed Bandits

    We consider the problem of identifying any $k$ out of the best $m$ arms in an $n$-armed stochastic multi-armed bandit. Framed in the PAC setting, this particular problem generalises both the problem of `best subset selection' and that of selecting `one out of the best $m$' arms [arcsk 2017]. In applications such as crowd-sourcing and drug design, identifying a single good solution is often not sufficient. Moreover, finding the best subset might be hard due to the presence of many indistinguishably close solutions. Our generalisation of identifying exactly $k$ arms out of the best $m$, where $1 \leq k \leq m$, serves as a more effective alternative. We present a lower bound on the worst-case sample complexity for general $k$, and a fully sequential PAC algorithm, \GLUCB, which is more sample-efficient on easy instances. Also, extending our analysis to infinite-armed bandits, we present a PAC algorithm that is independent of $n$ and identifies an arm from the best $\rho$ fraction of arms using at most an additive poly-logarithmic number of samples beyond the lower bound, thereby improving over [arcsk 2017] and [Aziz+AKA:2018]. The problem of identifying $k > 1$ distinct arms from the best $\rho$ fraction is not always well-defined; for a special class of this problem, we present lower and upper bounds. Finally, through a reduction, we establish a relation between upper bounds for the `one out of the best $\rho$' problem for infinite instances and the `one out of the best $m$' problem for finite instances. We conjecture that it is more efficient to solve `small' finite instances using the latter formulation, rather than going through the former.
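
    A naive uniform-sampling baseline makes the problem statement concrete: sample every arm equally, then return the $k$ empirically best arms. By Hoeffding's inequality this is $(\epsilon,\delta)$-PAC for returning $k$ arms whose true means are each at least the $m$-th best mean minus $\epsilon$, but its sample complexity does not adapt to easy instances the way the paper's \GLUCB does; the function name and tuning below are illustrative assumptions.

```python
import math

def naive_pac_k_of_top_m(pull, n, k, eps, delta):
    """Uniform-sampling baseline (NOT the paper's \\GLUCB): with probability
    at least 1 - delta, every returned arm has true mean at least the m-th
    best mean minus eps, for any m >= k."""
    # Hoeffding: t samples per arm make all empirical means eps/2-accurate
    # simultaneously with probability at least 1 - delta.
    t = math.ceil((2.0 / eps ** 2) * math.log(2.0 * n / delta))
    means = [sum(pull(arm) for _ in range(t)) / t for arm in range(n)]
    return sorted(range(n), key=lambda a: means[a], reverse=True)[:k]
```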

    The Best Arm Evades: Near-optimal Multi-pass Streaming Lower Bounds for Pure Exploration in Multi-armed Bandits

    We give a near-optimal sample-pass trade-off for pure exploration in multi-armed bandits (MABs) via multi-pass streaming algorithms: any streaming algorithm with sublinear memory that uses the optimal sample complexity of $O(\frac{n}{\Delta^2})$ requires $\Omega(\frac{\log(1/\Delta)}{\log\log(1/\Delta)})$ passes. Here, $n$ is the number of arms and $\Delta$ is the reward gap between the best and the second-best arms. Our result matches the $O(\log(\frac{1}{\Delta}))$-pass algorithm of Jin et al. [ICML'21] (up to lower-order terms) that only uses $O(1)$ memory, and answers an open question posed by Assadi and Wang [STOC'20].
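
    The pass structure behind such trade-offs can be sketched in a few lines: keep a single `king' arm in memory and, in each pass, compare it against every arriving arm at a geometrically sharper accuracy. The code below is only a schematic illustration of how passes trade against per-pass sample budgets, with an assumed Hoeffding-style budget; it is not the algorithm of Jin et al., nor does it reproduce the sample-optimal accounting that the lower bound above addresses.

```python
import math

def multipass_single_arm_memory(make_stream, pull, n, delta, passes):
    """Schematic multi-pass streaming routine with O(1) arm memory.
    Pass p compares arms at accuracy eps_p = 2^(-p); after enough passes
    the stored 'king' is the best arm with high probability (sketch only)."""
    king, eps = None, 0.5
    for _ in range(passes):
        budget = math.ceil((2.0 / eps ** 2) * math.log(4.0 * n / delta))
        king_mean = (sum(pull(king) for _ in range(budget)) / budget
                     if king is not None else float("-inf"))
        for arm in make_stream():              # one pass over the arm stream
            if arm == king:
                continue
            mean = sum(pull(arm) for _ in range(budget)) / budget
            if mean > king_mean + eps / 2:     # challenger is clearly better
                king, king_mean = arm, mean
        eps /= 2                               # sharper comparisons next pass
    return king
```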

    Learning how to act: making good decisions with machine learning

    This thesis is about machine learning and statistical approaches to decision making. How can we learn from data to anticipate the consequences of, and optimally select, interventions or actions? Problems such as deciding which medication to prescribe to patients, who should be released on bail, and how much to charge for insurance are ubiquitous, and have far-reaching impacts on our lives. There are two fundamental approaches to learning how to act: reinforcement learning, in which an agent directly intervenes in a system and learns from the outcome, and observational causal inference, whereby we seek to infer the outcome of an intervention from observing the system. The goal of this thesis is to connect and unify these key approaches. I introduce causal bandit problems: a synthesis that combines causal graphical models, which were developed for observational causal inference, with multi-armed bandit problems, which are a subset of reinforcement learning problems simple enough to admit formal analysis. I show that knowledge of the causal structure allows us to transfer information learned about the outcome of one action to predict the outcome of an alternate action, yielding a novel form of structure between bandit arms that cannot be exploited by existing algorithms. I propose an algorithm for causal bandit problems and prove bounds on the simple regret demonstrating that it is close to minimax optimal and better than algorithms that do not use the additional causal information.
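
    A toy example of the cross-arm information transfer described above, for the fully unconfounded `parallel' causal graph: because $E[Y \mid do(X_j = v)] = E[Y \mid X_j = v]$ in that graph, one observational sample updates the estimate for every arm it is consistent with. Everything below (the environment, the names, the uniform sampling) is an illustrative assumption, not the thesis' algorithm or its regret analysis.

```python
import random
from collections import defaultdict

def observe(sample_covariates, reward):
    """Draw one purely observational sample: natural covariates x and reward y."""
    x = sample_covariates()
    return x, reward(x)

def estimate_all_arms(sample_covariates, reward, d, n_samples):
    """Estimate E[Y | do(X_j = v)] for all 2*d arms from observational data.
    In the parallel graph there is no confounding, so crediting each sample to
    every arm do(X_j = x[j]) it is consistent with gives unbiased estimates."""
    stats = defaultdict(lambda: [0.0, 0])       # (j, v) -> [reward sum, count]
    for _ in range(n_samples):
        x, y = observe(sample_covariates, reward)
        for j in range(d):                       # credit every consistent arm
            stats[(j, x[j])][0] += y
            stats[(j, x[j])][1] += 1
    # Arms whose value occurs rarely get few samples; a full causal-bandit
    # algorithm would spend its remaining budget intervening on exactly those.
    return {arm: s / c for arm, (s, c) in stats.items() if c > 0}

# Example environment: d = 3 independent binary causes, reward driven by X_0.
d = 3
sample_covariates = lambda: [int(random.random() < 0.5) for _ in range(d)]
reward = lambda x: int(random.random() < (0.7 if x[0] == 1 else 0.3))
print(estimate_all_arms(sample_covariates, reward, d, n_samples=2000))
```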

    Exploration with Limited Memory: Streaming Algorithms for Coin Tossing, Noisy Comparisons, and Multi-Armed Bandits

    Consider the following abstract coin tossing problem: given a set of $n$ coins with unknown biases, find the most biased coin using a minimal number of coin tosses. This is a common abstraction of various exploration problems in theoretical computer science and machine learning, and has been studied extensively over the years. In particular, algorithms with optimal sample complexity (number of coin tosses) have been known for this problem for quite some time. Motivated by applications to processing massive datasets, we study the space complexity of solving this problem with an optimal number of coin tosses in the streaming model. In this model, the coins arrive one by one and the algorithm is only allowed to store a limited number of coins at any point -- any coin not present in memory is lost and can no longer be tossed or compared to arriving coins. Prior algorithms for the coin tossing problem with optimal sample complexity are based on iterative elimination of coins, which inherently requires storing all the coins, leading to memory-inefficient streaming algorithms. We remedy this state of affairs by presenting a series of improved streaming algorithms for this problem: we start with a simple algorithm that requires storing only $O(\log n)$ coins, and then iteratively refine it further and further, leading to algorithms with $O(\log\log n)$ memory, $O(\log^{*} n)$ memory, and finally one that stores only a single extra coin in memory -- the same exact space needed to just store the best coin throughout the stream. Furthermore, we extend our algorithms to the problem of finding the $k$ most biased coins, as well as other exploration problems such as finding top-$k$ elements using noisy comparisons or finding an $\epsilon$-best arm in stochastic multi-armed bandits, and obtain efficient streaming algorithms for these problems.
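
    The memory constraint is easiest to see in the final, single-extra-coin regime: keep one `king' coin, duel each arriving coin against it with a fixed toss budget, and discard the loser forever. The sketch below uses a plain Hoeffding-style duel budget for illustration only; it conveys the streaming restriction, not the careful budgeting that makes the paper's algorithms sample-optimal.

```python
import math

def single_coin_streaming(coin_stream, toss, eps, delta_per_duel):
    """Single-pass routine that stores only ONE candidate coin (plus the
    arriving one).  The duel budget is a naive Hoeffding choice -- an
    illustrative assumption, not the paper's sample-optimal scheme."""
    budget = math.ceil((2.0 / eps ** 2) * math.log(2.0 / delta_per_duel))
    king = None
    for coin in coin_stream:
        if king is None:
            king = coin
            continue
        king_heads = sum(toss(king) for _ in range(budget))
        coin_heads = sum(toss(coin) for _ in range(budget))
        if coin_heads > king_heads:   # loser is discarded and never tossed again
            king = coin
    return king
```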

    Exploration via linearly perturbed loss minimisation

    We introduce exploration via linear loss perturbations (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. With the data-dependent perturbations we propose, not present in previous PHE-type methods, EVILL is shown to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside of generalised linear bandits where PHE leads to inconsistent estimates, and thus linear regret, while EVILL remains performant. Like PHE, EVILL can be implemented in just a few lines of code.
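
    For intuition, here is a minimal sketch of the perturbed-loss recipe in the simplest linear-Gaussian special case, where the regularised squared loss plus a random linear perturbation has a closed-form minimiser. The perturbation scaling and every name below are assumptions made for illustration; the paper's EVILL is defined for generalised linear models and comes with its own tuning and guarantees.

```python
import numpy as np

def perturbed_loss_linear_action(X, y, actions, reg=1.0, noise_sd=1.0, rng=None):
    """One round of linearly-perturbed-loss exploration for a plain LINEAR
    bandit (illustrative special case; the perturbation scaling is an
    assumption, not the paper's prescription).

    We minimise  ||X @ theta - y||^2 + reg * ||theta||^2 + z @ theta  for a
    random z; in this Gaussian case the minimiser has the closed form below."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    A = X.T @ X + reg * np.eye(d)              # regularised Gram matrix
    # Draw z so that -z/2 ~ N(0, noise_sd^2 * A): the perturbed minimiser then
    # spreads like a posterior-style draw around the ridge estimate.
    z = -2.0 * noise_sd * (np.linalg.cholesky(A) @ rng.standard_normal(d))
    theta_perturbed = np.linalg.solve(A, X.T @ y - z / 2.0)
    # Act greedily with respect to the perturbed parameter estimate.
    return max(actions, key=lambda a: float(np.asarray(a) @ theta_perturbed))
```

    Redrawing the perturbation each round and acting greedily with respect to the perturbed minimiser is what plays the role of Thompson-style posterior sampling in this picture.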