Lipschitz Bandits: Regret Lower Bounds and Optimal Algorithms
We consider stochastic multi-armed bandit problems where the expected reward
is a Lipschitz function of the arm, and where the set of arms is either
discrete or continuous. For discrete Lipschitz bandits, we derive asymptotic
problem-specific lower bounds on the regret of any algorithm, and
propose OSLB and CKL-UCB, two algorithms that efficiently exploit the Lipschitz
structure of the problem. We prove that OSLB is asymptotically
optimal, as its asymptotic regret matches the lower bound. The regret analysis
of our algorithms relies on a new concentration inequality for weighted sums of
KL divergences between the empirical distributions of rewards and their true
distributions. For continuous Lipschitz bandits, we propose to first discretize
the action space, and then apply OSLB or CKL-UCB, algorithms that provably
exploit the structure efficiently. This approach is shown, through numerical
experiments, to significantly outperform existing algorithms that directly deal
with the continuous set of arms. Finally, the results and algorithms are
extended to contextual bandits with similarities. (COLT 2014)
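The continuous-action recipe above is simple to sketch. The following snippet is a minimal illustration under stated assumptions, not the authors' OSLB/CKL-UCB implementation: it discretizes [0, 1] into K arms and runs a plain KL-UCB index on the resulting discrete bandit (so it ignores the extra Lipschitz structure OSLB exploits); the reward function f, Bernoulli rewards, and the choice of K are all assumptions made for the demo.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """Bernoulli KL divergence d(p, q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, tol=1e-6):
    """Largest q >= mean with pulls * d(mean, q) <= log(t), via bisection."""
    target = np.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

def discretized_lipschitz_bandit(f, T, K):
    """Discretize [0, 1] into K arms, then run KL-UCB on the discrete problem.
    f(x) is the (unknown) Bernoulli mean reward at arm x."""
    arms = np.linspace(0.0, 1.0, K)
    pulls, sums = np.zeros(K), np.zeros(K)
    for t in range(1, T + 1):
        if t <= K:                       # initialization: pull each arm once
            a = t - 1
        else:
            means = sums / pulls
            a = int(np.argmax([kl_ucb_index(means[i], pulls[i], t)
                               for i in range(K)]))
        pulls[a] += 1
        sums[a] += np.random.rand() < f(arms[a])   # Bernoulli reward
    return arms[int(np.argmax(sums / pulls))]

# Example with a 1-Lipschitz mean-reward function on [0, 1].
best_arm = discretized_lipschitz_bandit(lambda x: 0.9 - abs(x - 0.6), T=5000, K=20)
```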
Towards Practical Lipschitz Bandits
Stochastic Lipschitz bandit algorithms balance exploration and exploitation,
and have been used in a variety of important task domains. In this paper, we
present a framework for Lipschitz bandit methods that adaptively learns
partitions of context- and arm-space. Due to this flexibility, the algorithm is
able to efficiently optimize rewards and minimize regret, by focusing on the
portions of the space that are most relevant. In our analysis, we link
tree-based methods to Gaussian processes. In light of our analysis, we design a
novel hierarchical Bayesian model for Lipschitz bandit problems. Our
experiments show that our algorithms can achieve state-of-the-art performance
in challenging real-world tasks such as neural network hyperparameter tuning.
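The adaptive partitioning idea is easy to illustrate. The sketch below is a generic zooming-style scheme over a one-dimensional arm space, not the paper's hierarchical Bayesian model or its joint context-and-arm tree: a cell is played while its statistical uncertainty dominates, and is split once the data resolves it more finely than its diameter; the Gaussian noise level and the splitting rule are assumptions made for the demo.

```python
import math
import random

class Cell:
    """A sub-interval of the arm space with running reward statistics."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.pulls, self.total = 0, 0.0

    def width(self):
        return self.hi - self.lo

    def ucb(self, t):
        if self.pulls == 0:
            return float("inf")
        mean = self.total / self.pulls
        conf = math.sqrt(2 * math.log(t + 1) / self.pulls)
        return mean + conf + self.width()   # width term bounds the Lipschitz bias

def adaptive_partition_bandit(f, T):
    """Zooming-style sketch: refine the partition where the data demands it."""
    cells = [Cell(0.0, 1.0)]
    for t in range(1, T + 1):
        cell = max(cells, key=lambda c: c.ucb(t))
        x = random.uniform(cell.lo, cell.hi)     # play a point inside the cell
        cell.pulls += 1
        cell.total += f(x) + random.gauss(0, 0.1)
        # split once the statistical error falls below the cell's diameter
        if math.sqrt(2 * math.log(t + 1) / cell.pulls) < cell.width():
            mid = (cell.lo + cell.hi) / 2
            cells.remove(cell)
            cells += [Cell(cell.lo, mid), Cell(mid, cell.hi)]
    return max(cells, key=lambda c: c.total / max(c.pulls, 1))
```

The effect is that cells near the maximum of f get split repeatedly while flat, low-reward regions stay coarse, which is the sense in which such methods focus on the most relevant portions of the space.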
Structured Stochastic Bandits
In this thesis we address the multi-armed bandit (MAB) problem with stochastic rewards and correlated arms. In particular, we investigate the case where the expected rewards are a Lipschitz function of the arm, and the learning-to-rank problem viewed from a MAB perspective. For the former, we derive a problem-specific lower bound and propose both an asymptotically optimal algorithm (OSLB) and a Pareto-optimal algorithm (POSLB). For the latter, we construct the regret lower bound and determine its closed form in some particular settings, and propose two asymptotically optimal algorithms, PIE and PIE-C. For all algorithms mentioned above, we present performance analysis in the form of theoretical regret guarantees as well as numerical evaluation on artificial datasets and, in the case of PIE and PIE-C, on real-world datasets.
Efficient Online Learning under Bandit Feedback
In this thesis we address the multi-armed bandit (MAB) problem with stochastic rewards and correlated arms. In particular, we investigate the case where the expected rewards are a Lipschitz function of the arm, and extend these results to bandits with arbitrary structure known to the decision maker. In these settings, we derive problem-specific regret lower bounds and propose both asymptotically optimal algorithms (OSLB and OSSB, respectively) and Pareto-optimal algorithms (POSLB and, in the generic setting, POSB). We further examine the learning-to-rank problem viewed from a MAB perspective. We construct the regret lower bound and determine its closed form in some particular settings, and propose two asymptotically optimal algorithms, PIE and PIE-C. We further present a mathematical model of the learning-to-rank problem in which the need for diversity appears naturally, and devise an order-optimal, numerically competitive algorithm, LDR. For all algorithms mentioned above, we present performance analysis in the form of theoretical regret guarantees as well as numerical evaluation on artificial and real-world datasets.
Distributed Trust-Aware Recommender Systems
Collaborative filtering (CF) recommender systems are among the most popular approaches to the information overload problem in social networks, generating accurate predictions based on the ratings of similar users. Traditional CF recommenders suffer from a lack of scalability, while decentralized CF recommenders (DHT-based, gossip-based, etc.) promise to alleviate this problem. In this thesis we therefore propose a decentralized approach to CF recommender systems that uses the T-Man algorithm to create and maintain an overlay network which, in turn, facilitates the generation of recommendations based on the local information of a node. We analyze the influence of the number of rounds and neighbors on prediction accuracy and item coverage, and we propose a new approach to inferring trust values between a user and its neighbors. Our experiments on three important datasets show an improvement in prediction accuracy over previous approaches while using a highly scalable, decentralized paradigm. We also analyze item coverage and show that our system is able to generate predictions for a significant fraction of users, comparable with centralized approaches.
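The prediction step a node can perform from its local view is standard user-based CF and easy to sketch. The snippet below is illustrative only: the T-Man overlay construction and the thesis's trust-inference scheme are omitted, and plain cosine similarity stands in for the trust weight between a node and its neighbors.

```python
import math

def cosine_similarity(r_u, r_v):
    """Similarity between two users, computed over their co-rated items."""
    common = set(r_u) & set(r_v)
    if not common:
        return 0.0
    num = sum(r_u[i] * r_v[i] for i in common)
    den = (math.sqrt(sum(r_u[i] ** 2 for i in common))
           * math.sqrt(sum(r_v[i] ** 2 for i in common)))
    return num / den if den else 0.0

def predict(node_ratings, neighbor_ratings, item):
    """Predict a rating for `item` using only the node's local neighbor view."""
    num = den = 0.0
    for neigh in neighbor_ratings:
        if item in neigh:
            w = cosine_similarity(node_ratings, neigh)
            num += w * neigh[item]
            den += abs(w)
    return num / den if den else None

# Usage: a node with two overlay-selected neighbors predicts item "c".
me = {"a": 5, "b": 3}
neighbors = [{"a": 4, "b": 3, "c": 2}, {"a": 5, "c": 4}]
print(predict(me, neighbors, "c"))   # similarity-weighted average of neighbor ratings
```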
Minimal Exploration in Structured Stochastic Bandits
This paper introduces and addresses a wide class of stochastic bandit problems where the function mapping an arm to its expected reward exhibits some known structural properties. Most existing structures (e.g., linear, Lipschitz, unimodal, combinatorial, dueling) are covered by our framework. We derive an asymptotic instance-specific regret lower bound for these problems, and develop OSSB, an algorithm whose regret matches this fundamental limit. OSSB is not based on the classical principle of "optimism in the face of uncertainty" or on Thompson sampling; rather, it aims at matching the minimal exploration rates of sub-optimal arms as characterized in the derivation of the regret lower bound. We illustrate the efficiency of OSSB using numerical experiments on the linear bandit problem and show that OSSB outperforms existing algorithms, including Thompson sampling.
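The lower bound OSSB matches has the classical Graves-Lai form; the sketch below uses the standard notation of that literature, which the abstract itself does not spell out: d is a KL divergence between reward distributions, Lambda(theta) is the set of "confusing" instances that are indistinguishable on the optimal arm yet have a different optimal arm, and eta_a is the exploration rate of arm a.

```latex
% For any uniformly good algorithm on instance \theta:
\[
  \liminf_{T \to \infty} \frac{R(T)}{\log T} \;\ge\; C(\theta),
\]
% where C(\theta) is the value of the semi-infinite linear program
\[
\begin{aligned}
  \min_{\eta \ge 0}\; & \sum_{a} \eta_a \big(\mu^\star(\theta) - \mu_a(\theta)\big) \\
  \text{s.t.}\;       & \sum_{a} \eta_a \, d\big(\mu_a(\theta), \mu_a(\lambda)\big) \;\ge\; 1
                        \quad \text{for all } \lambda \in \Lambda(\theta).
\end{aligned}
\]
```

OSSB's departure from optimism is then direct: it estimates theta, solves this program, and pulls each sub-optimal arm at the rate given by the minimizer, scaled by log T.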
Learning to Rank: Regret Lower Bound and Efficient Algorithms
Algorithms for learning to rank Web documents, display ads, or other types of items constitute a fundamental component of search engines and more generally of online services. In such systems, when a user makes a request or visits a web page, an ordered list of items (e.g., documents or ads) is displayed; the user scans this list in order, and clicks on the first relevant item, if any. When the user clicks on an item, the reward collected by the system typically decreases with the position of the item in the displayed list. The main challenge in the design of sequential list selection algorithms stems from the fact that the probabilities with which the user clicks on the various items are unknown and need to be learned. We formulate the design of such algorithms as a stochastic bandit optimization problem. This problem differs from the classical bandit framework: (1) the type of feedback received by the system depends on the actual relevance of the various items in the displayed list (if the user clicks on the last item, we know that none of the previous items in the list are relevant); (2) there are inherent correlations between the average relevance of the items (e.g., the user may be interested in a specific topic only). We assume that items are categorized according to their topic and that users are clustered, so that users of the same cluster are interested in the same topic. We investigate several scenarios depending on the side information available on the user before selecting the displayed list: (a) we first treat the case where the topic the user is interested in is known when she places a request; (b) we then study the case where the user cluster is known but the mapping between user clusters and topics is unknown. For both scenarios, we derive regret lower bounds and devise algorithms that approach these fundamental limits.
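The cascading feedback structure described above is easy to simulate. The sketch below is not PIE or PIE-C (whose index computations the abstract does not detail); it only shows the feedback model, with a naive epsilon-greedy list builder standing in for the learner, and the click probabilities, list length, and position-discounted reward are all assumptions made for the demo.

```python
import random

def cascade_feedback(displayed, click_probs):
    """User scans the list in order and clicks the first relevant item.
    Returns the clicked position, or None if nothing is clicked."""
    for pos, item in enumerate(displayed):
        if random.random() < click_probs[item]:
            return pos
    return None

def run(click_probs, n_items, list_len, T, eps=0.1):
    """Naive list-selection loop illustrating the bandit formulation."""
    shown = [1e-9] * n_items     # how often each item was actually scanned
    clicks = [0] * n_items
    total_reward = 0.0
    for _ in range(T):
        ranked = sorted(range(n_items),
                        key=lambda i: clicks[i] / shown[i], reverse=True)
        if random.random() < eps:            # crude exploration
            random.shuffle(ranked)
        displayed = ranked[:list_len]
        pos = cascade_feedback(displayed, click_probs)
        scanned = len(displayed) if pos is None else pos + 1
        # Cascade feedback: every scanned item yields an observation; items
        # scanned before a click are revealed as non-relevant this round.
        for p in range(scanned):
            shown[displayed[p]] += 1
        if pos is not None:
            clicks[displayed[pos]] += 1
            total_reward += 1.0 / (pos + 1)  # reward decays with position
    return total_reward
```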