
    Combinatorial Bandits Revisited

    This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem, and provide a finite-time analysis of its regret. ESCB has better performance guarantees than existing algorithms, and significantly outperforms these algorithms in practice. In the adversarial setting under bandit feedback, we propose CombEXP, an algorithm with the same regret scaling as state-of-the-art algorithms, but with lower computational complexity for some combinatorial problems. Comment: 30 pages; Advances in Neural Information Processing Systems 28 (NIPS 2015).
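
    As a rough illustration of the optimistic-index idea behind ESCB, the sketch below runs a combinatorial semi-bandit with a handful of base arms; the decision set, the Bernoulli means, and the exploration function f(t) are assumptions, not the paper's exact construction.

```python
# A minimal sketch of an ESCB-style optimistic index for stochastic combinatorial
# semi-bandits: pick the decision (subset of base arms) maximising its empirical
# value plus a combinatorial confidence radius. The decision set, the Bernoulli
# means, and the exploration function f(t) used here are illustrative assumptions.
import itertools
import math
import random

n_base, m = 4, 2                                   # 4 base arms, decisions pick 2 of them
decisions = list(itertools.combinations(range(n_base), m))
true_means = [0.2, 0.5, 0.4, 0.7]                  # assumed per-base-arm Bernoulli means

counts = [0] * n_base
means = [0.0] * n_base

def index(decision, t):
    # optimism: empirical value of the decision plus a radius built from per-arm counts
    f_t = math.log(t + 1) + 4 * m * math.log(math.log(t + 2))
    value = sum(means[i] for i in decision)
    radius = math.sqrt(f_t / 2 * sum(1.0 / max(counts[i], 1) for i in decision))
    return value + radius

for t in range(1, 2001):
    chosen = max(decisions, key=lambda d: index(d, t))
    for i in chosen:                               # semi-bandit feedback: a sample per chosen base arm
        r = 1.0 if random.random() < true_means[i] else 0.0
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]

print("base-arm pull counts:", counts)
```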

    Sequential Monte Carlo Bandits

    In this paper we propose a flexible and efficient framework for handling multi-armed bandits, combining sequential Monte Carlo algorithms with hierarchical Bayesian modeling techniques. The framework naturally encompasses restless bandits, contextual bandits, and other bandit variants under a single inferential model. Despite the model's generality, we propose efficient Monte Carlo algorithms to make inference scalable, based on recent developments in sequential Monte Carlo methods. Through two simulation studies, the framework is shown to outperform other empirical methods, while also naturally scaling to more complex problems with which existing approaches cannot cope. Additionally, we successfully apply our framework to online video-based advertising recommendation, and show its increased efficacy as compared to current state-of-the-art bandit algorithms.
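
    The sketch below illustrates the basic sequential Monte Carlo ingredient, namely particle reweighting and resampling inside Thompson sampling; it assumes simple independent Bernoulli arms rather than the paper's hierarchical model.

```python
# A minimal sketch of Thompson sampling with a particle (sequential Monte Carlo)
# approximation of each arm's posterior. The paper's framework is far more general
# (hierarchical, restless, contextual); this only illustrates the SMC reweighting
# and resampling idea for independent Bernoulli arms, which is an assumption here.
import random

n_arms, n_particles = 3, 500
true_means = [0.3, 0.5, 0.6]                       # assumed Bernoulli means
particles = [[random.random() for _ in range(n_particles)] for _ in range(n_arms)]
weights = [[1.0 / n_particles] * n_particles for _ in range(n_arms)]

def resample(arm):
    # multinomial resampling to counter weight degeneracy
    idx = random.choices(range(n_particles), weights=weights[arm], k=n_particles)
    particles[arm] = [particles[arm][i] for i in idx]
    weights[arm] = [1.0 / n_particles] * n_particles

for t in range(2000):
    # Thompson sampling: draw one particle per arm, play the arm with the best draw
    draws = [random.choices(particles[a], weights=weights[a])[0] for a in range(n_arms)]
    a = max(range(n_arms), key=lambda i: draws[i])
    r = 1.0 if random.random() < true_means[a] else 0.0
    # reweight arm a's particles by the Bernoulli likelihood of the observed reward
    weights[a] = [w * (p if r == 1.0 else 1.0 - p) for w, p in zip(weights[a], particles[a])]
    total = sum(weights[a])
    weights[a] = [w / total for w in weights[a]]
    if 1.0 / sum(w * w for w in weights[a]) < n_particles / 2:   # effective sample size check
        resample(a)

print("posterior mean estimates:",
      [round(sum(w * p for w, p in zip(weights[a], particles[a])), 3) for a in range(n_arms)])
```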

    Correlated Multi-armed Bandits with a Latent Random Source

    We consider a novel multi-armed bandit framework where the rewards obtained by pulling the arms are functions of a common latent random variable. The correlation between arms due to the common random source can be used to design a generalized upper-confidence-bound (UCB) algorithm that identifies certain arms as non-competitive and avoids exploring them. As a result, we reduce a $K$-armed bandit problem to a $(C+1)$-armed problem, where the $C+1$ arms comprise the best arm and $C$ competitive arms. Our regret analysis shows that the competitive arms need to be pulled $\mathcal{O}(\log T)$ times, while the non-competitive arms are pulled only $\mathcal{O}(1)$ times. As a result, there are regimes where our algorithm achieves $\mathcal{O}(1)$ regret, as opposed to the typical logarithmic regret scaling of multi-armed bandit algorithms. We also evaluate lower bounds on the expected regret and prove that our correlated-UCB algorithm achieves $\mathcal{O}(1)$ regret whenever possible.
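
    A hedged toy version of the competitive/non-competitive mechanism might look as follows, with a small discrete latent variable and known reward functions standing in for the paper's pseudo-reward machinery.

```python
# A toy sketch of the competitive / non-competitive arm idea from the correlated
# bandit setting: samples of the most-pulled arm yield pseudo-rewards (upper bounds)
# on every other arm via the known reward functions, and arms whose pseudo-reward
# estimate falls below the leader's mean are not explored. The discrete latent
# variable X and the reward functions g_k below are illustrative assumptions.
import math
import random

xs = [0, 1, 2, 3]                                  # assumed support of the latent X
g = [lambda x: [0.2, 0.4, 0.6, 0.8][x],            # arm 0 reward as a function of X
     lambda x: [0.9, 0.7, 0.3, 0.1][x],            # arm 1
     lambda x: [0.3, 0.3, 0.4, 0.4][x]]            # arm 2
K = len(g)

def pseudo_reward(l, k, y):
    # largest reward arm l could yield, given that arm k returned y
    return max(g[l](x) for x in xs if abs(g[k](x) - y) < 1e-9)

counts, means = [0] * K, [0.0] * K
pseudo_means = [[0.0] * K for _ in range(K)]       # pseudo_means[l][k]: bound on arm l from arm k's samples

for t in range(1, 3001):
    lead = max(range(K), key=lambda a: counts[a])  # most-pulled arm so far
    competitive = [a for a in range(K)
                   if a == lead or pseudo_means[a][lead] >= means[lead]]
    def ucb(a):
        if counts[a] == 0:
            return float("inf")
        return means[a] + math.sqrt(2 * math.log(t) / counts[a])
    a = max(competitive, key=ucb)                  # UCB restricted to competitive arms
    x = random.choice(xs)                          # latent draw, not observed directly
    r = g[a](x)
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]
    for l in range(K):                             # update pseudo-reward estimates of all arms
        pr = pseudo_reward(l, a, r)
        pseudo_means[l][a] += (pr - pseudo_means[l][a]) / counts[a]

print("pulls per arm:", counts)
```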

    Bandits with adversarial scaling

    We study "adversarial scaling", a multi-armed bandit model where rewards have a stochastic and an adversarial component. Our model captures display advertising where the "click-through-rate" can be decomposed to a (fixed across time) arm-quality component and a non-stochastic user-relevance component (fixed across arms). Despite the relative stochasticity of our model, we demonstrate two settings where most bandit algorithms suffer. On the positive side, we show that two algorithms, one from the action elimination and one from the mirror descent family are adaptive enough to be robust to adversarial scaling. Our results shed light on the robustness of adaptive parameter selection in stochastic bandits, which may be of independent interest.Comment: Appeared in ICML 202

    A Unified Approach to Translate Classical Bandit Algorithms to the Structured Bandit Setting

    We consider a finite-armed structured bandit problem in which the mean rewards of different arms are known functions of a common hidden parameter $\theta^*$. Since we do not place any restrictions on these functions, the problem setting subsumes several previously studied frameworks that assume linear or invertible reward functions. We propose a novel approach to gradually estimate the hidden $\theta^*$ and use the estimate together with the mean reward functions to substantially reduce exploration of sub-optimal arms. This approach enables us to fundamentally generalize any classic bandit algorithm, including UCB and Thompson Sampling, to the structured bandit setting. We prove via regret analysis that our proposed UCB-C and TS-C algorithms (structured bandit versions of UCB and Thompson Sampling, respectively) pull only a subset of the sub-optimal arms $O(\log T)$ times, while the other sub-optimal arms (referred to as non-competitive arms) are pulled $O(1)$ times. As a result, in cases where all sub-optimal arms are non-competitive, which can happen in many practical scenarios, the proposed algorithms achieve bounded regret. We also conduct simulations on the MovieLens recommendation dataset to demonstrate the improvement of the proposed algorithms over existing structured bandit algorithms.
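
    A rough sketch of the UCB-C recipe, with assumed mean functions and a grid estimate of the hidden parameter, is given below; the competitive-set rule is simplified for illustration.

```python
# A rough sketch of the UCB-C idea: use the known mean functions mu_k(theta) to
# estimate the hidden parameter, discard arms that cannot be optimal near that
# estimate (non-competitive arms), and run UCB on the rest. The mean functions,
# the grid search over theta, and the fixed competitiveness margin below are
# simplifying assumptions, not the paper's exact construction.
import math
import random

thetas = [i / 100 for i in range(101)]               # assumed parameter grid
mu = [lambda th: th,                                  # known mean functions mu_k(theta)
      lambda th: 0.8 * (1 - th),
      lambda th: 0.5 + 0.1 * th]
true_theta = 0.7
K = len(mu)
counts, means = [0] * K, [0.0] * K

def estimate_theta():
    # weighted least squares fit of theta to the empirical means
    return min(thetas, key=lambda th: sum(counts[k] * (means[k] - mu[k](th)) ** 2
                                          for k in range(K)))

for t in range(1, 3001):
    if 0 in counts:
        a = counts.index(0)                           # pull each arm once to initialise
    else:
        th_hat = estimate_theta()
        best_val = max(mu[k](th_hat) for k in range(K))
        competitive = [k for k in range(K) if mu[k](th_hat) >= best_val - 0.05]
        a = max(competitive,
                key=lambda k: means[k] + math.sqrt(2 * math.log(t) / counts[k]))
    r = 1.0 if random.random() < mu[a](true_theta) else 0.0
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]

print("pulls per arm:", counts)
```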

    Unimodal Bandits without Smoothness

    We consider stochastic bandit problems with a continuous set of arms in which the expected reward is a continuous and unimodal function of the arm. No further assumption is made regarding the smoothness or the structure of the expected reward function. For these problems, we propose the Stochastic Pentachotomy (SP) algorithm and derive finite-time upper bounds on its regret and optimization error. In particular, we show that, for any expected reward function $\mu$ that behaves as $\mu(x) = \mu(x^\star) - C|x - x^\star|^\xi$ locally around its maximizer $x^\star$ for some $\xi, C > 0$, the SP algorithm is order-optimal. Namely, its regret and optimization error scale as $O(\sqrt{T\log(T)})$ and $O(\sqrt{\log(T)/T})$, respectively, when the time horizon $T$ grows large. These scalings are achieved without knowledge of $\xi$ and $C$. Our algorithm is based on asymptotically optimal sequential statistical tests used to successively trim an interval that contains the best arm with high probability. To our knowledge, the SP algorithm constitutes the first sequential arm selection rule that achieves a regret and optimization error scaling as $O(\sqrt{T})$ and $O(1/\sqrt{T})$, respectively, up to a logarithmic factor, for non-smooth expected reward functions as well as for smooth functions with unknown smoothness. Comment: 25 pages.
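
    The interval-trimming idea can be sketched as follows, replacing the paper's sequential statistical tests with a fixed sampling budget per round; the reward function is an assumed example.

```python
# A simplified sketch of interval trimming for a unimodal continuous-armed bandit.
# The real Stochastic Pentachotomy algorithm relies on asymptotically optimal
# sequential tests; here a fixed sample budget per round and a naive comparison
# of empirical means stand in for them, and the reward function is an assumption.
import random

def reward(x):
    # assumed unimodal mean with maximiser at 0.63, plus Gaussian noise
    return 1.0 - abs(x - 0.63) ** 0.8 + random.gauss(0, 0.1)

lo, hi = 0.0, 1.0
for _ in range(30):
    # the four interior points splitting [lo, hi] into five equal parts
    pts = [lo + (hi - lo) * i / 5 for i in range(1, 5)]
    est = [sum(reward(p) for _ in range(200)) / 200 for p in pts]
    # unimodality: if the leftmost point beats its right neighbour, the maximiser
    # lies left of that neighbour, and symmetrically on the right
    if est[0] > est[1]:
        hi = pts[1]
    elif est[3] > est[2]:
        lo = pts[2]
    else:
        lo, hi = pts[0], pts[3]

print("interval containing the best arm (approx):", (round(lo, 3), round(hi, 3)))
```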

    HAMLET -- A Learning Curve-Enabled Multi-Armed Bandit for Algorithm Selection

    Automated algorithm selection and hyperparameter tuning facilitates the application of machine learning. Traditional multi-armed bandit strategies look to the history of observed rewards to identify the most promising arms for optimizing expected total reward in the long run. When considering limited time budgets and computational resources, this backward view of rewards is inappropriate, as the bandit should instead look into the future and anticipate the highest final reward at the end of a specified time budget. This work acts on that insight by introducing HAMLET, which extends the bandit approach with learning curve extrapolation and computation-time awareness for selecting among a set of machine learning algorithms. Results show that HAMLET Variants 1-3 exhibit equal or better performance than other bandit-based algorithm selection strategies in experiments with recorded hyperparameter tuning traces for the majority of considered time budgets. The best-performing HAMLET Variant 3 combines learning curve extrapolation with the well-known upper confidence bound exploration bonus. That variant performs better than all non-HAMLET policies, with statistical significance at the 95% level, for 1,485 runs. Comment: 8 pages, 8 figures; IJCNN 2020: International Joint Conference on Neural Networks.
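
    The sketch below conveys the HAMLET scoring idea, extrapolated end-of-budget performance plus a UCB-style bonus, using synthetic learning curves and a crude two-point extrapolation as stand-ins for the paper's learning curve model.

```python
# A toy sketch of the HAMLET idea: score each candidate algorithm by an extrapolated
# end-of-budget accuracy plus a UCB-style exploration bonus, rather than by the
# average reward observed so far. The synthetic learning curves and the crude
# two-point linear extrapolation below are illustrative assumptions.
import math
import random

budget = 100                                        # total tuning budget (time steps)
curves = [lambda s: 0.9 * (1 - math.exp(-s / 60)),  # slow starter, high ceiling
          lambda s: 0.7 * (1 - math.exp(-s / 10)),  # fast starter, low ceiling
          lambda s: 0.8 * (1 - math.exp(-s / 30))]
K = len(curves)
spent = [0] * K                                     # time allocated to each algorithm
obs = [[] for _ in range(K)]                        # observed (time, accuracy) points

def projected_final(k, t):
    # extrapolate the last two observations linearly to the end of the budget
    (t1, a1), (t2, a2) = obs[k][-2], obs[k][-1]
    slope = (a2 - a1) / max(t2 - t1, 1)
    proj = min(1.0, a2 + slope * (budget - spent[k]))
    bonus = math.sqrt(2 * math.log(t + 1) / len(obs[k]))
    return proj + bonus

for t in range(budget):
    under_observed = [i for i in range(K) if len(obs[i]) < 2]
    k = under_observed[0] if under_observed else max(range(K), key=lambda i: projected_final(i, t))
    spent[k] += 1
    acc = curves[k](spent[k]) + random.gauss(0, 0.01)
    obs[k].append((spent[k], acc))

print("time allocated per algorithm:", spent)
```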

    Learning Unknown Service Rates in Queues: A Multi-Armed Bandit Approach

    Consider a queueing system consisting of multiple servers. Jobs arrive over time and enter a queue for service; the goal is to minimize the size of this queue. At each opportunity for service, at most one server can be chosen, and at most one job can be served. Service is successful with a probability (the service probability) that is a priori unknown for each server. An algorithm that knows the service probabilities (the "genie") can always choose the server of highest service probability. We study algorithms that learn the unknown service probabilities. Our goal is to minimize queue-regret: the (expected) difference between the queue lengths obtained by the algorithm and those obtained by the "genie." Since queue-regret cannot be larger than classical regret, results for the standard multi-armed bandit problem give algorithms for which queue-regret increases no more than logarithmically in time. Our paper shows surprisingly more complex behavior. In particular, as long as the bandit algorithm's queues have relatively long regenerative cycles, queue-regret is similar to cumulative regret and scales (essentially) logarithmically. However, we show that this "early stage" of the queueing bandit eventually gives way to a "late stage", where the optimal queue-regret scaling is $O(1/t)$. We demonstrate an algorithm that (order-wise) achieves this asymptotic queue-regret in the late stage. Our results are developed in a more general model that allows for multiple job classes as well.
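
    A toy simulation of queue-regret under a UCB server-selection rule is sketched below; arrival and service probabilities are assumptions, and the learner attempts service in every slot for simplicity.

```python
# A toy simulation of the queueing bandit: jobs arrive to a single queue, a UCB
# learner picks which server to attempt service with, and queue-regret is the gap
# to a genie that always uses the best server. The arrival and service probabilities
# are illustrative assumptions, and attempting service every slot is a simplification.
import math
import random

service_p = [0.4, 0.6, 0.7]                 # unknown to the learner
arrival_p = 0.5
K = len(service_p)
counts, means = [0] * K, [0.0] * K
q_learner, q_genie = 0, 0

for t in range(1, 20001):
    arrival = 1 if random.random() < arrival_p else 0
    q_learner += arrival
    q_genie += arrival
    # learner: UCB choice of server
    if 0 in counts:
        k = counts.index(0)
    else:
        k = max(range(K), key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
    served = 1 if random.random() < service_p[k] else 0
    counts[k] += 1
    means[k] += (served - means[k]) / counts[k]
    if q_learner > 0 and served:
        q_learner -= 1
    # genie: always uses the best server
    if q_genie > 0 and random.random() < max(service_p):
        q_genie -= 1

print("queue-regret estimate after 20000 slots:", q_learner - q_genie)
```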

    Learning Sequential Channel Selection for Interference Alignment using Reconfigurable Antennas

    In recent years, machine learning techniques have been explored to support, enhance, or augment wireless systems, especially at the physical layer of the protocol stack. Traditional ML-based approaches and optimization are often not suitable due to algorithmic complexity, reliance on existing training data, and/or the distributed setting. In this paper, we formulate a reconfigurable-antenna-based channel selection problem for interference alignment in a multi-user wireless network as a learning problem. More specifically, we propose that by using sequential learning, an effective channel or combination of channels can be selected in order to enhance interference alignment using reconfigurable antennas. We first formulate the channel selection as a multi-armed bandit problem that aims to optimize the sum rate of the network. We show that by using an adaptive sequential learning policy, each node in the network can learn to select optimal channels without requiring full and instantaneous CSI for all the available antenna states. We conduct a performance analysis of our technique for a MIMO interference channel using a conventional IA scheme and quantify the benefits of pattern diversity and learning-based channel selection.
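
    Cast as a bandit, the channel-selection loop can be sketched as below, with placeholder mean sum rates standing in for an interference-alignment simulation.

```python
# A minimal sketch of sequential channel (antenna-state) selection cast as a
# multi-armed bandit: each arm is a channel/antenna-state combination and the
# reward is the observed network sum rate. The mean rates below are placeholder
# assumptions, not an interference-alignment simulation.
import math
import random

mean_rate = [2.1, 3.4, 2.8, 3.9]            # assumed mean sum rate per channel (bits/s/Hz)
K = len(mean_rate)
counts, means = [0] * K, [0.0] * K

for t in range(1, 2001):
    if 0 in counts:
        c = counts.index(0)                 # try each channel once
    else:
        c = max(range(K),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
    rate = max(0.0, random.gauss(mean_rate[c], 0.5))   # observed sum rate this slot
    counts[c] += 1
    means[c] += (rate - means[c]) / counts[c]

print("selections per channel:", counts)
```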

    UCB Algorithm for Exponential Distributions

    In this paper we introduce a new algorithm for Multi-Armed Bandit (MAB) problems, a machine learning paradigm popular within Cognitive Network related topics (e.g., Spectrum Sensing and Allocation). We focus on the case where the rewards are exponentially distributed, which is common when dealing with Rayleigh fading channels. This strategy, named Multiplicative Upper Confidence Bound (MUCB), associates a utility index with every available arm and then selects the arm with the highest index. For every arm, the associated index is equal to the product of a multiplicative factor and the sample mean of the rewards collected from this arm. We show that the MUCB policy has low complexity and is order-optimal. Comment: 10 pages. Introduces Multiplicative Upper Confidence Bound (MUCB) algorithms for the Multi-Armed Bandit problem.
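
    A minimal sketch of a multiplicative index of this kind is shown below; the particular inflation factor is an assumed form, not the paper's definition.

```python
# A minimal sketch of a multiplicative-UCB (MUCB) style index for exponentially
# distributed rewards: the index is the sample mean multiplied by an inflation
# factor that shrinks with the pull count. The specific factor used here is an
# assumed form for illustration; the paper defines its own multiplicative factor.
import math
import random

rates = [1.0, 0.5, 0.8]                       # exponential rate parameters (mean = 1/rate)
K = len(rates)
counts, means = [0] * K, [0.0] * K
alpha = 2.0

for t in range(1, 3001):
    if 0 in counts:
        k = counts.index(0)                   # pull each arm once to initialise
    else:
        k = max(range(K),
                key=lambda i: means[i] * (1.0 + math.sqrt(alpha * math.log(t) / counts[i])))
    r = random.expovariate(rates[k])          # exponentially distributed reward
    counts[k] += 1
    means[k] += (r - means[k]) / counts[k]

print("pulls per arm:", counts)               # the rate-0.5 arm (mean 2.0) should dominate
```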