
    Nonparametric Stochastic Contextual Bandits

    We analyze the $K$-armed bandit problem where the reward for each arm is a noisy realization based on an observed context, under mild nonparametric assumptions. We attain tight results for top-arm identification and a sublinear regret of $\widetilde{O}\big(T^{\frac{1+D}{2+D}}\big)$, where $D$ is the context dimension, for a modified UCB algorithm that is simple to implement ($k$NN-UCB). We then give global intrinsic-dimension-dependent and ambient-dimension-independent regret bounds. We also discuss recovering topological structures within the context space based on expected bandit performance and provide an extension to infinite-armed contextual bandits. Finally, we experimentally show the improvement of our algorithm over existing multi-armed bandit approaches for both simulated tasks and MNIST image classification.
    Comment: AAAI 201
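
    As a rough illustration of the $k$NN-UCB idea, here is a minimal sketch (not the paper's exact algorithm: the neighborhood size, confidence bonus, and cold-start rule below are illustrative assumptions). Each arm stores its observed (context, reward) pairs, estimates the mean reward at a new context from the $k$ nearest stored contexts, and adds an exploration bonus that shrinks as nearby data accumulates:

```python
import numpy as np

class KNNUCB:
    """Illustrative k-nearest-neighbour UCB for contextual bandits.

    An arm's value at a context is estimated from the k nearest
    contexts in that arm's history, plus an exploration bonus.
    """

    def __init__(self, n_arms, k=5, alpha=1.0):
        self.k = k          # neighbourhood size (assumed hyperparameter)
        self.alpha = alpha  # exploration strength (assumed hyperparameter)
        self.contexts = [[] for _ in range(n_arms)]
        self.rewards = [[] for _ in range(n_arms)]

    def _ucb(self, arm, x, t):
        xs, rs = self.contexts[arm], self.rewards[arm]
        if len(xs) < self.k:  # too little data: force exploration
            return np.inf
        dists = np.linalg.norm(np.asarray(xs) - x, axis=1)
        nearest = np.argsort(dists)[: self.k]  # k nearest stored contexts
        mean = np.mean(np.asarray(rs)[nearest])
        bonus = self.alpha * np.sqrt(np.log(t + 1) / self.k)
        return mean + bonus

    def select(self, x, t):
        scores = [self._ucb(a, x, t) for a in range(len(self.contexts))]
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.contexts[arm].append(np.asarray(x, dtype=float))
        self.rewards[arm].append(float(reward))
```

    A driver loop would call `select`, observe the reward, then `update`; the paper's actual confidence term depends on its nonparametric assumptions and is more careful than the $\sqrt{\log(t)/k}$ bonus used here.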

    PAC-Bayesian Analysis of the Exploration-Exploitation Trade-off

    We develop a coherent framework for integrative, simultaneous analysis of the exploration-exploitation and model-order-selection trade-offs. We improve over our preceding results on the same subject (Seldin et al., 2011) by combining PAC-Bayesian analysis with a Bernstein-type inequality for martingales. Such a combination is also of independent interest for studies of multiple simultaneously evolving martingales.
    Comment: On-line Trading of Exploration and Exploitation 2 - ICML-2011 workshop. http://explo.cs.ucl.ac.uk/workshop
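
    For context, a standard Bernstein-type inequality for martingales is Freedman's inequality, the kind of variance-sensitive tail bound being combined with PAC-Bayesian analysis here; the paper's PAC-Bayes-Bernstein bound holds uniformly over posterior distributions rather than in this scalar form:

```latex
% Freedman's inequality: a Bernstein-type tail bound for martingales.
% Let M_t = \sum_{s \le t} X_s be a martingale with increments |X_s| \le b,
% and let V_t = \sum_{s \le t} \mathbb{E}[X_s^2 \mid \mathcal{F}_{s-1}]
% denote its predictable quadratic variation. Then for all a, v > 0,
\[
  \Pr\!\left[\exists\, t:\ M_t \ge a \ \text{and}\ V_t \le v\right]
  \;\le\; \exp\!\left(-\frac{a^2}{2\,(v + ab/3)}\right).
\]
```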

    Contextual Bandits with Cross-learning

    In the classical contextual bandits problem, in each round $t$, a learner observes some context $c$, chooses some action $a$ to perform, and receives some reward $r_{a,t}(c)$. We consider the variant of this problem where, in addition to receiving the reward $r_{a,t}(c)$, the learner also learns the values of $r_{a,t}(c')$ for all other contexts $c'$; i.e., the rewards that would have been achieved by performing that action under different contexts. This variant arises in several strategic settings, such as learning how to bid in non-truthful repeated auctions (in this setting the context is the decision maker's private valuation for each auction). We call this problem the contextual bandits problem with cross-learning. The best algorithms for the classical contextual bandits problem achieve $\tilde{O}(\sqrt{CKT})$ regret against all stationary policies, where $C$ is the number of contexts, $K$ the number of actions, and $T$ the number of rounds. We demonstrate algorithms for the contextual bandits problem with cross-learning that remove the dependence on $C$ and achieve regret $O(\sqrt{KT})$ (when contexts are stochastic with known distribution), $\tilde{O}(K^{1/3}T^{2/3})$ (when contexts are stochastic with unknown distribution), and $\tilde{O}(\sqrt{KT})$ (when contexts are adversarial but rewards are stochastic).
    Comment: 48 pages, 5 figures
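
    To make the cross-learning structure concrete, here is a minimal sketch (an assumed interface, not one of the paper's algorithms) of per-context UCB with cross-learning updates: after action $a$ is played, $r_{a,t}(c')$ is observed for every context, so the statistics for $(a, c')$ are updated for all $c'$, not just the realized context:

```python
import math
from collections import defaultdict

class CrossLearningUCB:
    """Illustrative UCB with cross-learning (finite contexts and actions).

    Plain per-context UCB pays for the number of contexts C; here one
    pull of an action updates that action's estimate in every context,
    since r_{a,t}(c') is observed for all c'.
    """

    def __init__(self, contexts, n_actions):
        self.contexts = list(contexts)   # hashable context labels
        self.n_actions = n_actions
        self.counts = defaultdict(int)   # (context, action) -> observations
        self.means = defaultdict(float)  # (context, action) -> running mean

    def select(self, context, t):
        def ucb(a):
            n = self.counts[(context, a)]
            if n == 0:
                return math.inf          # unexplored action in this context
            return self.means[(context, a)] + math.sqrt(2 * math.log(t + 1) / n)
        return max(range(self.n_actions), key=ucb)

    def update(self, action, reward_fn):
        # Cross-learning: one pull reveals r_{a,t}(c') for every context c'.
        for c in self.contexts:
            key = (c, action)
            self.counts[key] += 1
            self.means[key] += (reward_fn(c) - self.means[key]) / self.counts[key]
```

    The sketch only shows why the dependence on $C$ can disappear: each action's per-context sample count grows with every pull, regardless of which context actually occurred. The paper's algorithms treat the stochastic- and adversarial-context regimes with more care.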