
    Regret Minimisation in Multi-Armed Bandits Using Bounded Arm Memory

    In this paper, we propose a constant-word (RAM model) algorithm for regret minimisation in both finite and infinite Stochastic Multi-Armed Bandit (MAB) instances. Most existing regret minimisation algorithms need to remember the statistics of all the arms they encounter, which becomes a problem when the number of available words of memory is limited. Designing an efficient regret minimisation algorithm that uses a constant number of words has long been of interest to the community. Some early attempts consider the number of arms to be infinite and require the reward distributions of the arms to belong to some particular family. Recently, an explore-then-commit based algorithm~\citep{Liau+PSY:2018} for finitely many-armed bandits escapes such assumptions; however, due to the underlying PAC-based elimination, their method incurs a high regret. We present a conceptually simple and efficient algorithm that needs to remember statistics of at most $M$ arms, and for any $K$-armed finite bandit instance it enjoys an $O(KM + K^{1.5}\sqrt{T\log(T/MK)}/M)$ upper bound on regret. We extend it to achieve sub-linear \textit{quantile-regret}~\citep{RoyChaudhuri+K:2018} and empirically verify the efficiency of our algorithm via experiments.
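
    To make the memory constraint concrete, here is a minimal sketch, assuming a generic reward oracle `pull(arm)`, of a bounded-arm-memory strategy: scan the $K$ arms in batches of at most $M-1$ new arms plus the current champion, run plain UCB1 within each batch, and carry only the champion forward. It illustrates the constant-word regime only; the batching and budgeting below are illustrative assumptions, not the algorithm analysed in the paper.

```python
import math
import random

def ucb_on_subset(pull, arm_ids, budget):
    """Run UCB1 on the arms in arm_ids for roughly `budget` pulls;
    return the empirically best arm of the subset."""
    counts = {a: 0 for a in arm_ids}
    sums = {a: 0.0 for a in arm_ids}
    for a in arm_ids:                      # pull each arm once to initialise
        sums[a] += pull(a)
        counts[a] += 1
    for t in range(len(arm_ids), budget):
        ucb = {a: sums[a] / counts[a] + math.sqrt(2 * math.log(t + 1) / counts[a])
               for a in arm_ids}
        a = max(ucb, key=ucb.get)
        sums[a] += pull(a)
        counts[a] += 1
    return max(arm_ids, key=lambda a: sums[a] / counts[a])

def bounded_memory_bandit(pull, K, T, M):
    """Illustrative bounded-memory scan (a sketch, NOT the paper's algorithm):
    statistics for at most M arms are held in memory at any time."""
    arms = list(range(K))
    random.shuffle(arms)
    n_batches = max(1, math.ceil(K / (M - 1)))
    batch_budget = T // n_batches          # crude even split of the horizon
    champion = None
    for i in range(0, K, M - 1):
        batch = arms[i:i + M - 1] + ([champion] if champion is not None else [])
        champion = ucb_on_subset(pull, batch, batch_budget)
    return champion                        # commit to this arm afterwards
```

    A caller might invoke `bounded_memory_bandit(pull, K=50, T=100_000, M=4)`; at no point does the routine hold statistics for more than four arms.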

    Tight Regret Bounds for Single-pass Streaming Multi-armed Bandits

    Regret minimization in streaming multi-armed bandits (MABs) has been studied extensively in recent years. In the single-pass setting with $K$ arms and $T$ trials, a regret lower bound of $\Omega(T^{2/3})$ has been proved for any algorithm with $o(K)$ memory (Maiti et al. [NeurIPS'21]; Agarwal et al. [COLT'22]). However, the previous best regret upper bound is still $O(K^{1/3} T^{2/3}\log^{1/3}(T))$, which is achieved by the streaming implementation of simple uniform exploration. The $O(K^{1/3}\log^{1/3}(T))$ gap leaves open the question of the tight regret bound in single-pass MABs with sublinear arm memory. In this paper, we answer this open problem and complete the picture of regret minimization in single-pass streaming MABs. We first improve the regret lower bound to $\Omega(K^{1/3}T^{2/3})$ for algorithms with $o(K)$ memory, which matches the uniform-exploration regret up to a logarithmic factor in $T$. We then show that the $\log^{1/3}(T)$ factor is not necessary, and we can achieve $O(K^{1/3}T^{2/3})$ regret by finding an $\varepsilon$-best arm and committing to it for the rest of the trials. For regret minimization with high constant probability, we can apply the single-memory $\varepsilon$-best arm algorithms of Jin et al. [ICML'21] to obtain the optimal bound. Furthermore, for expected regret minimization, we design an algorithm with single-arm memory that achieves $O(K^{1/3} T^{2/3}\log(K))$ regret, and an algorithm with $O(\log^{*}(n))$ memory that attains the optimal $O(K^{1/3} T^{2/3})$ regret, following the $\varepsilon$-best arm algorithm of Assadi and Wang [STOC'20]. We further tested the empirical performance of our algorithms; the simulation results show that the proposed algorithms consistently outperform the benchmark uniform exploration algorithm by a large margin and, on occasion, reduce the regret by up to 70%.
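
    For reference, the uniform-exploration baseline that the upper bound above is measured against is simple to state in the single-pass model: sample each arriving arm a fixed number of times, remember only the best empirical mean seen so far, and commit to that arm for the remaining trials. The sketch below uses a generic $(T/K)^{2/3}$-style per-arm budget purely for illustration; the constants and logarithmic terms of the paper's analysis are not reproduced.

```python
import math

def single_pass_explore_then_commit(arm_stream, pull, K, T):
    """Single-pass uniform exploration keeping one arm's statistics in memory.
    The per-arm budget is an illustrative (T/K)^(2/3) choice, which is what
    gives the familiar K^(1/3) T^(2/3) regret shape (log factors ignored)."""
    s = max(1, int((T / K) ** (2.0 / 3.0)))   # pulls spent on each arriving arm
    best_arm, best_mean, spent = None, float("-inf"), 0
    for arm in arm_stream:                    # one pass over the stream
        mean = sum(pull(arm) for _ in range(s)) / s
        spent += s
        if mean > best_mean:                  # keep only the running best: O(1) arm memory
            best_arm, best_mean = arm, mean
    reward = sum(pull(best_arm) for _ in range(max(0, T - spent)))
    return best_arm, reward
```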

    PAC Identification of Many Good Arms in Stochastic Multi-Armed Bandits

    We consider the problem of identifying any $k$ out of the best $m$ arms in an $n$-armed stochastic multi-armed bandit. Framed in the PAC setting, this particular problem generalises both the problem of `best subset selection' and that of selecting `one out of the best $m$' arms [arcsk 2017]. In applications such as crowd-sourcing and drug design, identifying a single good solution is often not sufficient. Moreover, finding the best subset might be hard due to the presence of many indistinguishably close solutions. Our generalisation of identifying exactly $k$ arms out of the best $m$, where $1 \leq k \leq m$, serves as a more effective alternative. We present a lower bound on the worst-case sample complexity for general $k$, and a fully sequential PAC algorithm, \GLUCB, which is more sample-efficient on easy instances. Also, extending our analysis to infinite-armed bandits, we present a PAC algorithm that is independent of $n$ and identifies an arm from the best $\rho$ fraction of arms using at most an additive poly-logarithmic number of samples beyond the lower bound, thereby improving over [arcsk 2017] and [Aziz+AKA:2018]. The problem of identifying $k > 1$ distinct arms from the best $\rho$ fraction is not always well-defined; for a special class of this problem, we present lower and upper bounds. Finally, through a reduction, we establish a relation between upper bounds for the `one out of the best $\rho$' problem for infinite instances and the `one out of the best $m$' problem for finite instances. We conjecture that it is more efficient to solve `small' finite instances using the latter formulation, rather than going through the former.
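
    A naive uniform-sampling baseline makes the problem statement concrete: sample every arm equally, then return the $k$ empirically best arms. By Hoeffding's inequality this is $(\epsilon,\delta)$-PAC for returning $k$ arms whose true means are each at least the $m$-th best mean minus $\epsilon$, but its sample complexity does not adapt to easy instances the way the paper's \GLUCB does; the function name and tuning below are illustrative assumptions.

```python
import math

def naive_pac_k_of_top_m(pull, n, k, eps, delta):
    """Uniform-sampling baseline (NOT the paper's \\GLUCB): with probability
    at least 1 - delta, every returned arm has true mean at least the m-th
    best mean minus eps, for any m >= k."""
    # Hoeffding: t samples per arm make all empirical means eps/2-accurate
    # simultaneously with probability at least 1 - delta.
    t = math.ceil((2.0 / eps ** 2) * math.log(2.0 * n / delta))
    means = [sum(pull(arm) for _ in range(t)) / t for arm in range(n)]
    return sorted(range(n), key=lambda a: means[a], reverse=True)[:k]
```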

    The Best Arm Evades: Near-optimal Multi-pass Streaming Lower Bounds for Pure Exploration in Multi-armed Bandits

    We give a near-optimal sample-pass trade-off for pure exploration in multi-armed bandits (MABs) via multi-pass streaming algorithms: any streaming algorithm with sublinear memory that uses the optimal sample complexity of $O(\frac{n}{\Delta^2})$ requires $\Omega(\frac{\log(1/\Delta)}{\log\log(1/\Delta)})$ passes. Here, $n$ is the number of arms and $\Delta$ is the reward gap between the best and the second-best arms. Our result matches the $O(\log(\frac{1}{\Delta}))$-pass algorithm of Jin et al. [ICML'21] (up to lower-order terms) that only uses $O(1)$ memory, and answers an open question posed by Assadi and Wang [STOC'20].
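
    The pass structure behind such trade-offs can be sketched in a few lines: keep a single `king' arm in memory and, in each pass, compare it against every arriving arm at a geometrically sharper accuracy. The code below is only a schematic illustration of how passes trade against per-pass sample budgets, with an assumed Hoeffding-style budget; it is not the algorithm of Jin et al., nor does it reproduce the sample-optimal accounting that the lower bound above addresses.

```python
import math

def multipass_single_arm_memory(make_stream, pull, n, delta, passes):
    """Schematic multi-pass streaming routine with O(1) arm memory.
    Pass p compares arms at accuracy eps_p = 2^(-p); after enough passes
    the stored 'king' is the best arm with high probability (sketch only)."""
    king, eps = None, 0.5
    for _ in range(passes):
        budget = math.ceil((2.0 / eps ** 2) * math.log(4.0 * n / delta))
        king_mean = (sum(pull(king) for _ in range(budget)) / budget
                     if king is not None else float("-inf"))
        for arm in make_stream():              # one pass over the arm stream
            if arm == king:
                continue
            mean = sum(pull(arm) for _ in range(budget)) / budget
            if mean > king_mean + eps / 2:     # challenger is clearly better
                king, king_mean = arm, mean
        eps /= 2                               # sharper comparisons next pass
    return king
```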

    Learning how to act: making good decisions with machine learning

    This thesis is about machine learning and statistical approaches to decision making. How can we learn from data to anticipate the consequences of, and optimally select, interventions or actions? Problems such as deciding which medication to prescribe to patients, who should be released on bail, and how much to charge for insurance are ubiquitous, and have far-reaching impacts on our lives. There are two fundamental approaches to learning how to act: reinforcement learning, in which an agent directly intervenes in a system and learns from the outcome, and observational causal inference, whereby we seek to infer the outcome of an intervention from observing the system. The goal of this thesis is to connect and unify these key approaches. I introduce causal bandit problems: a synthesis that combines causal graphical models, which were developed for observational causal inference, with multi-armed bandit problems, which are a subset of reinforcement learning problems simple enough to admit formal analysis. I show that knowledge of the causal structure allows us to transfer information learned about the outcome of one action to predict the outcome of an alternate action, yielding a novel form of structure between bandit arms that cannot be exploited by existing algorithms. I propose an algorithm for causal bandit problems and prove bounds on the simple regret demonstrating that it is close to minimax optimal and better than algorithms that do not use the additional causal information.
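
    A toy example of the cross-arm information transfer described above, for the fully unconfounded `parallel' causal graph: because $E[Y \mid do(X_j = v)] = E[Y \mid X_j = v]$ in that graph, one observational sample updates the estimate for every arm it is consistent with. Everything below (the environment, the names, the uniform sampling) is an illustrative assumption, not the thesis' algorithm or its regret analysis.

```python
import random
from collections import defaultdict

def observe(sample_covariates, reward):
    """Draw one purely observational sample: natural covariates x and reward y."""
    x = sample_covariates()
    return x, reward(x)

def estimate_all_arms(sample_covariates, reward, d, n_samples):
    """Estimate E[Y | do(X_j = v)] for all 2*d arms from observational data.
    In the parallel graph there is no confounding, so crediting each sample to
    every arm do(X_j = x[j]) it is consistent with gives unbiased estimates."""
    stats = defaultdict(lambda: [0.0, 0])       # (j, v) -> [reward sum, count]
    for _ in range(n_samples):
        x, y = observe(sample_covariates, reward)
        for j in range(d):                       # credit every consistent arm
            stats[(j, x[j])][0] += y
            stats[(j, x[j])][1] += 1
    # Arms whose value occurs rarely get few samples; a full causal-bandit
    # algorithm would spend its remaining budget intervening on exactly those.
    return {arm: s / c for arm, (s, c) in stats.items() if c > 0}

# Example environment: d = 3 independent binary causes, reward driven by X_0.
d = 3
sample_covariates = lambda: [int(random.random() < 0.5) for _ in range(d)]
reward = lambda x: int(random.random() < (0.7 if x[0] == 1 else 0.3))
print(estimate_all_arms(sample_covariates, reward, d, n_samples=2000))
```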

    Exploration with Limited Memory: Streaming Algorithms for Coin Tossing, Noisy Comparisons, and Multi-Armed Bandits

    Consider the following abstract coin tossing problem: given a set of $n$ coins with unknown biases, find the most biased coin using a minimal number of coin tosses. This is a common abstraction of various exploration problems in theoretical computer science and machine learning, and has been studied extensively over the years. In particular, algorithms with optimal sample complexity (number of coin tosses) have been known for this problem for quite some time. Motivated by applications to processing massive datasets, we study the space complexity of solving this problem with an optimal number of coin tosses in the streaming model. In this model, the coins arrive one by one and the algorithm is only allowed to store a limited number of coins at any point -- any coin not present in memory is lost and can no longer be tossed or compared to arriving coins. Prior algorithms for the coin tossing problem with optimal sample complexity are based on iterative elimination of coins, which inherently requires storing all the coins, leading to memory-inefficient streaming algorithms. We remedy this state of affairs by presenting a series of improved streaming algorithms for this problem: we start with a simple algorithm that requires storing only $O(\log n)$ coins, and then iteratively refine it further and further, leading to algorithms with $O(\log\log n)$ memory, $O(\log^{*} n)$ memory, and finally one that stores only a single extra coin in memory -- the same exact space needed to just store the best coin throughout the stream. Furthermore, we extend our algorithms to the problem of finding the $k$ most biased coins, as well as other exploration problems such as finding top-$k$ elements using noisy comparisons or finding an $\epsilon$-best arm in stochastic multi-armed bandits, and obtain efficient streaming algorithms for these problems.
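
    The memory constraint is easiest to see in the final, single-extra-coin regime: keep one `king' coin, duel each arriving coin against it with a fixed toss budget, and discard the loser forever. The sketch below uses a plain Hoeffding-style duel budget for illustration only; it conveys the streaming restriction, not the careful budgeting that makes the paper's algorithms sample-optimal.

```python
import math

def single_coin_streaming(coin_stream, toss, eps, delta_per_duel):
    """Single-pass routine that stores only ONE candidate coin (plus the
    arriving one).  The duel budget is a naive Hoeffding choice -- an
    illustrative assumption, not the paper's sample-optimal scheme."""
    budget = math.ceil((2.0 / eps ** 2) * math.log(2.0 / delta_per_duel))
    king = None
    for coin in coin_stream:
        if king is None:
            king = coin
            continue
        king_heads = sum(toss(king) for _ in range(budget))
        coin_heads = sum(toss(coin) for _ in range(budget))
        if coin_heads > king_heads:   # loser is discarded and never tossed again
            king = coin
    return king
```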

    Exploration via linearly perturbed loss minimisation

    We introduce exploration via linear loss perturbations (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. With the data-dependent perturbations we propose, not present in previous PHE-type methods, EVILL is shown to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside of generalised linear bandits where PHE leads to inconsistent estimates, and thus linear regret, while EVILL remains performant. Like PHE, EVILL can be implemented in just a few lines of code.
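
    For intuition, here is a minimal sketch of the perturbed-loss recipe in the simplest linear-Gaussian special case, where the regularised squared loss plus a random linear perturbation has a closed-form minimiser. The perturbation scaling and every name below are assumptions made for illustration; the paper's EVILL is defined for generalised linear models and comes with its own tuning and guarantees.

```python
import numpy as np

def perturbed_loss_linear_action(X, y, actions, reg=1.0, noise_sd=1.0, rng=None):
    """One round of linearly-perturbed-loss exploration for a plain LINEAR
    bandit (illustrative special case; the perturbation scaling is an
    assumption, not the paper's prescription).

    We minimise  ||X @ theta - y||^2 + reg * ||theta||^2 + z @ theta  for a
    random z; in this Gaussian case the minimiser has the closed form below."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    A = X.T @ X + reg * np.eye(d)              # regularised Gram matrix
    # Draw z so that -z/2 ~ N(0, noise_sd^2 * A): the perturbed minimiser then
    # spreads like a posterior-style draw around the ridge estimate.
    z = -2.0 * noise_sd * (np.linalg.cholesky(A) @ rng.standard_normal(d))
    theta_perturbed = np.linalg.solve(A, X.T @ y - z / 2.0)
    # Act greedily with respect to the perturbed parameter estimate.
    return max(actions, key=lambda a: float(np.asarray(a) @ theta_perturbed))
```

    Redrawing the perturbation each round and acting greedily with respect to the perturbed minimiser is what plays the role of Thompson-style posterior sampling in this picture.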