Multi-armed bandits and applications to large datasets
This thesis considers the multi-armed bandit (MAB) problem, covering both the traditional bandit-feedback setting and graphical bandits, where side information is available. Motivated by the Boltzmann exploration algorithm often used in the more general context of reinforcement learning, we present Almost Boltzmann Exploration (ABE), which fixes Boltzmann exploration's under-exploration issue while retaining a similar functional form. We then present some real-world applications of the MAB framework, comparing the performance of ABE with other bandit algorithms on real-world datasets.
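The abstract does not spell out ABE's exact form, but classical Boltzmann exploration, which ABE modifies, selects arms via a softmax over empirical mean rewards. A minimal sketch, assuming a softmax with an inverse-temperature parameter `eta` (the parameter name and interface are illustrative, not from the thesis):

```python
import math
import random

def boltzmann_action(means, eta):
    """Sample an arm index with probability proportional to exp(eta * mean).

    `means` holds empirical reward estimates per arm; `eta` is an assumed
    inverse-temperature parameter -- larger eta means greedier play.
    """
    weights = [math.exp(eta * m) for m in means]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(means) - 1  # guard against floating-point rounding
```

With a fixed temperature this rule can under-explore arms whose estimates happen to start low, which is the issue the thesis's ABE variant is designed to fix.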
An efficient algorithm for learning with semi-bandit feedback
We consider the problem of online combinatorial optimization under semi-bandit feedback. The goal of the learner is to sequentially select its actions from a combinatorial decision set so as to minimize its cumulative loss. We propose a learning algorithm for this problem that combines the Follow-the-Perturbed-Leader (FPL) prediction method with a novel loss estimation procedure called Geometric Resampling (GR). Contrary to previous solutions, the resulting algorithm can be efficiently implemented for any decision set where efficient offline combinatorial optimization is possible at all. Assuming that the elements of the decision set can be described by d-dimensional binary vectors with at most m non-zero entries, we show that the expected regret of our algorithm after T rounds is O(m sqrt(dT log d)). As a side result, we also improve the best known regret bounds for FPL in the full-information setting to O(m^(3/2) sqrt(T log d)), gaining a factor of sqrt(d/m) over previous bounds for this algorithm.
Comment: submitted to ALT 201
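One round of the FPL+GR scheme described above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the decision set is assumed to be "choose m of d items" (so the offline oracle is just sorting), the perturbation is assumed exponential, and `max_resample` caps the geometric count to keep the sketch finite:

```python
import random

def fpl_gr_round(cum_est_loss, m, eta, true_losses, max_resample=100):
    """One round of Follow-the-Perturbed-Leader with Geometric Resampling.

    `cum_est_loss` holds cumulative estimated losses per component and is
    updated in place; `true_losses` stands in for this round's (otherwise
    unobserved) component losses, revealed only for chosen components.
    """
    d = len(cum_est_loss)

    def oracle():
        # Perturb cumulative losses with i.i.d. exponential noise of rate
        # eta, then call the offline optimizer (here: take the m smallest).
        perturbed = [cum_est_loss[i] - random.expovariate(eta) for i in range(d)]
        return set(sorted(range(d), key=lambda i: perturbed[i])[:m])

    action = oracle()
    # Semi-bandit feedback: losses are observed only for chosen components.
    # Geometric Resampling: K_i counts fresh oracle draws until component i
    # reappears, giving a (truncated) geometric estimate of 1/p_i, so
    # K_i * loss_i approximates the importance-weighted loss estimate.
    for i in action:
        k = 1
        while i not in oracle() and k < max_resample:
            k += 1
        cum_est_loss[i] += k * true_losses[i]
    return action
```

The key point the sketch illustrates is that both action selection and loss estimation only ever invoke the offline optimization oracle, which is why the method stays efficient whenever offline combinatorial optimization is.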