
    Reward-Biased Maximum Likelihood Estimation for Linear Stochastic Bandits

    Modifying the reward-biased maximum likelihood method originally proposed in the adaptive control literature, we propose novel learning algorithms to handle the explore-exploit trade-off in linear bandit problems as well as generalized linear bandit problems. We develop novel index policies that we prove achieve order-optimality, and show in extensive experiments that their empirical performance is competitive with state-of-the-art benchmark methods. The new policies achieve this with low computation time per pull for linear bandits, resulting in both favorable regret and computational efficiency.
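
    The sketch below illustrates the general reward-biased idea behind an index policy for a linear bandit: maintain a ridge-regression estimate of the unknown parameter and pull the arm maximising predicted reward plus a slowly growing bias term. The function name, bias schedule, and specific index form are illustrative assumptions, not the exact index derived in the paper.

    import numpy as np

    def rbmle_style_index_policy(arms, horizon, bias_rate=1.0, reg=1.0, noise_sd=0.1, seed=0):
        """Toy index policy for a linear bandit with a reward-biased flavour.

        arms: (K, d) array of feature vectors. Each round, fit a ridge
        estimate of the unknown parameter and pull the arm maximising
        predicted reward plus a slowly growing bias term.
        """
        rng = np.random.default_rng(seed)
        K, d = arms.shape
        theta_true = rng.normal(size=d)          # unknown parameter (simulation only)
        V = reg * np.eye(d)                      # regularised design matrix
        b = np.zeros(d)
        rewards = []
        for t in range(1, horizon + 1):
            theta_hat = np.linalg.solve(V, b)    # ridge-regression estimate
            V_inv = np.linalg.inv(V)
            alpha = bias_rate * np.sqrt(np.log(t + 1))   # bias grows slowly with t
            # index = predicted reward + bias scaled by the arm's uncertainty
            widths = np.sqrt(np.einsum('ki,ij,kj->k', arms, V_inv, arms))
            a = int(np.argmax(arms @ theta_hat + alpha * widths))
            x = arms[a]
            r = float(x @ theta_true + noise_sd * rng.normal())
            V += np.outer(x, x)                  # rank-one update of the design matrix
            b += r * x
            rewards.append(r)
        return rewards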

    Top-k selection with pairwise comparisons

    In this work we consider active pairwise top-k selection: the problem of identifying the highest-quality subset of a given size from a set of alternatives, based on information collected from noisy, sequentially chosen pairwise comparisons. We adapt two well-known Bayesian sequential sampling techniques, the Knowledge Gradient policy and the Optimal Computing Budget Allocation framework, to the pairwise setting and compare their performance on a range of empirical tests. We demonstrate that these methods match or outperform the current state-of-the-art racing-algorithm approach.
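
    As a minimal sketch of the problem setup, the toy sampler below keeps Beta posteriors on pairwise win probabilities, queries the most uncertain pair, and returns the k alternatives with the highest empirical win rate. It is a simple uncertainty-driven baseline for illustration only; it does not reproduce the Knowledge Gradient or OCBA allocation rules studied in the paper, and all names and parameters are assumptions.

    import numpy as np

    def pairwise_topk(true_quality, k, budget, seed=0):
        """Toy active pairwise top-k selection.

        Keeps Beta(1, 1) posteriors on "i beats j" for every pair, repeatedly
        queries the pair whose outcome is most uncertain, and returns the k
        alternatives with the highest average estimated win rate.
        """
        rng = np.random.default_rng(seed)
        true_quality = np.asarray(true_quality, dtype=float)
        n = len(true_quality)
        wins = np.ones((n, n))     # pseudo-count: times i beat j
        losses = np.ones((n, n))   # pseudo-count: times i lost to j
        for _ in range(budget):
            var = wins * losses / ((wins + losses) ** 2 * (wins + losses + 1))  # Beta variance
            np.fill_diagonal(var, -1.0)          # never compare an item with itself
            i, j = np.unravel_index(np.argmax(var), var.shape)
            # Noisy comparison: i beats j with a Bradley-Terry-style probability
            p_ij = 1.0 / (1.0 + np.exp(-(true_quality[i] - true_quality[j])))
            if rng.random() < p_ij:
                wins[i, j] += 1; losses[j, i] += 1
            else:
                wins[j, i] += 1; losses[i, j] += 1
        win_rate = (wins / (wins + losses)).mean(axis=1)
        return np.argsort(-win_rate)[:k]         # indices of the selected subset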

    Optimality, Scalability, and Reneging in Bandit Learning

    Bandit learning has been widely applied to handle the exploration-exploitation dilemma in sequential decision problems, and a large number of bandit algorithms have been proposed to resolve it. While many of these algorithms have been proved order-optimal with respect to regret, the difference between the best expected reward and that actually achieved, two fundamental challenges remain. First, the “efficiency” of the best-performing bandit algorithms is often unsatisfactory, where efficiency is measured jointly in terms of reward-maximization performance and computational complexity. For instance, Information Directed Sampling (IDS), variance-based IDS (VIDS), and Kullback-Leibler Upper Confidence Bounds (KL-UCB) have often been reported to achieve outstanding regret performance; unfortunately, they suffer from high computational complexity even after approximation, and that complexity scales poorly as the number of arms increases. Second, most existing bandit algorithms assume that the sequential decision-making process continues forever, whereas in practice users may renege and stop playing; they also assume the underlying reward distribution is homoscedastic. Both assumptions are often violated in real-world applications, where participants may disengage from future interactions if they do not have a rewarding experience, and where the variances of the underlying distributions differ across contexts. To address these challenges, we propose a family of novel bandit algorithms. For the efficiency issue, we propose Biased Maximum Likelihood Estimation (BMLE), a family of novel bandit algorithms that applies to both parametric and non-parametric reward distributions, often has a closed-form solution and low computational complexity, has a quantifiable regret bound, and demonstrates satisfactory empirical performance. To enable bandit algorithms to handle reneging risk and reward heteroscedasticity, we propose the Heteroscedastic Reneging Upper Confidence Bound policy (HR-UCB), a novel UCB-type algorithm that achieves outstanding and quantifiable performance in the presence of reneging risk and heteroscedasticity.
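
    As a rough illustration of the heteroscedasticity point, the sketch below is a UCB-style policy whose exploration bonus uses per-arm empirical variances, so noisier arms receive wider confidence intervals. It is neither the BMLE index nor the HR-UCB policy described above (it ignores reneging entirely); the function name, bonus form, and parameters are assumptions for illustration.

    import numpy as np

    def heteroscedastic_ucb(means, sds, horizon, seed=0):
        """Toy UCB variant whose exploration bonus uses per-arm noise estimates.

        Each arm has its own (unknown) noise level, so the bonus scales with
        the arm's empirical variance and noisier arms are explored more.
        """
        rng = np.random.default_rng(seed)
        means, sds = np.asarray(means, float), np.asarray(sds, float)
        K = len(means)
        counts = np.zeros(K)
        sums = np.zeros(K)
        sq_sums = np.zeros(K)
        pulls = []
        for t in range(1, horizon + 1):
            if t <= K:                            # pull each arm once to initialise
                a = t - 1
            else:
                mu_hat = sums / counts
                var_hat = np.maximum(sq_sums / counts - mu_hat ** 2, 1e-6)
                bonus = np.sqrt(2.0 * var_hat * np.log(t) / counts)  # variance-aware width
                a = int(np.argmax(mu_hat + bonus))
            r = means[a] + sds[a] * rng.normal()  # heteroscedastic Gaussian reward
            counts[a] += 1; sums[a] += r; sq_sums[a] += r ** 2
            pulls.append(a)
        return pulls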

    Efficient pairwise information collection for subset selection

    In this work, we consider the problem of selecting the subset of the top-k best alternatives from a set, where the fitness of the alternatives must be estimated through noisy pairwise sampling. To do this, we propose two novel active pairwise sampling methods, adapted from popular non-pairwise ranking and selection frameworks. We prove that our proposed methods have desirable asymptotic properties, and demonstrate empirically that they can perform better than current state-of-the-art pairwise selection algorithms on a range of tasks. We show how our proposed methods can be integrated into the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to improve fitness evaluation and optimizer performance, including in the evolution of neural-network-based agents for playing No Limit Texas Hold’em poker. Finally, we demonstrate how parametric models can help our proposed sampling algorithms exploit transitive preference structure between pairs of alternatives.
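
    The sketch below shows one plausible way pairwise comparisons can drive an evolutionary optimizer: candidates are ranked purely from noisy pairwise comparison outcomes, and that ranking feeds a simple (mu, lambda) evolution strategy used here as a stand-in for CMA-ES. The fixed per-pair comparison budget, the comparison oracle, and all names are illustrative assumptions, not the adaptive samplers or the CMA-ES integration described in the paper.

    import numpy as np

    def pairwise_rank(population, compare, rounds=3):
        """Rank candidates using only noisy pairwise comparisons.

        compare(x, y) returns True if x is judged better than y. Each pair is
        queried `rounds` times and candidates are ordered by empirical wins.
        """
        n = len(population)
        wins = np.zeros(n)
        for i in range(n):
            for j in range(i + 1, n):
                for _ in range(rounds):
                    if compare(population[i], population[j]):
                        wins[i] += 1
                    else:
                        wins[j] += 1
        return np.argsort(-wins)                  # indices, best first

    def simple_pairwise_es(dim=5, mu=4, lam=16, iters=50, sigma=0.3, seed=0):
        """A (mu, lambda) evolution strategy driven only by pairwise comparisons."""
        rng = np.random.default_rng(seed)
        # Noisy oracle: x beats y if its noisily observed squared norm is smaller.
        compare = lambda x, y: (x ** 2).sum() + rng.normal(scale=0.1) < (y ** 2).sum()
        mean = rng.normal(size=dim)
        for _ in range(iters):
            pop = mean + sigma * rng.normal(size=(lam, dim))   # sample offspring
            order = pairwise_rank(pop, compare)
            mean = pop[order[:mu]].mean(axis=0)                # recombine the mu best
        return mean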