14 research outputs found

    Multi-Objective contextual bandits with a dominant objective

    In this paper, we propose a new contextual bandit problem with two objectives, where one of the objectives dominates the other objective. Unlike single-objective bandit problems in which the learner obtains a random scalar reward for each arm it selects, in the proposed problem, the learner obtains a random reward vector, where each component of the reward vector corresponds to one of the objectives. The goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its reward in the dominant objective. In this case, the optimal arm given a context is the one that maximizes the expected reward in the non-dominant objective among all arms that maximize the expected reward in the dominant objective. For this problem, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB), and prove that it achieves sublinear regret with respect to the optimal context dependent policy. Then, we compare the performance of the proposed algorithm with other state-of-the-art bandit algorithms. The proposed contextual bandit model and the algorithm have a wide range of real-world applications that involve multiple and possibly conflicting objectives ranging from wireless communication to medical diagnosis and recommender systems. © 2017 IEEE
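
    As an illustration of the optimality criterion described above, the following minimal sketch (not the MOC-MAB algorithm itself; the function name and tie-breaking tolerance are ours) selects, for a fixed context, the arm that maximizes the non-dominant objective among the arms that maximize the dominant one.

```python
import numpy as np

def lexicographic_optimal_arm(mu_dominant, mu_nondominant, tol=1e-9):
    """Pick the arm that maximizes the non-dominant objective among all arms
    that (approximately) maximize the dominant objective.

    mu_dominant, mu_nondominant: 1-D arrays of expected rewards per arm for a
    fixed context; tol is an illustrative tie-breaking tolerance.
    """
    mu_dominant = np.asarray(mu_dominant, dtype=float)
    mu_nondominant = np.asarray(mu_nondominant, dtype=float)
    best_dominant = mu_dominant.max()
    # Candidate set: arms optimal in the dominant objective.
    candidates = np.flatnonzero(mu_dominant >= best_dominant - tol)
    # Among those, maximize the non-dominant objective.
    return int(candidates[np.argmax(mu_nondominant[candidates])])

# Example: arm 2 ties arm 0 on the dominant objective but beats it on the other.
print(lexicographic_optimal_arm([0.9, 0.5, 0.9], [0.1, 0.8, 0.4]))  # -> 2
```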

    A practical guide to multi-objective reinforcement learning and planning

    Real-world sequential decision-making tasks are generally complex, requiring trade-offs between multiple, often conflicting, objectives. Despite this, the majority of research in reinforcement learning and decision-theoretic planning either assumes only a single objective, or that multiple objectives can be adequately handled via a simple linear combination. Such approaches may oversimplify the underlying problem and hence produce suboptimal results. This paper serves as a guide to the application of multi-objective methods to difficult problems, and is aimed at researchers who are already familiar with single-objective reinforcement learning and planning methods who wish to adopt a multi-objective perspective on their research, as well as practitioners who encounter multi-objective decision problems in practice. It identifies the factors that may influence the nature of the desired solution, and illustrates by example how these influence the design of multi-objective decision-making systems for complex problems. © 2022, The Author(s)
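
    The caution about linear combinations can be made concrete with a small hedged sketch: under any fixed non-negative weighting, a policy whose expected reward vector lies strictly inside the convex hull of the other policies' vectors can never be selected, even if it is the trade-off a user actually wants. The policies and reward vectors below are hypothetical.

```python
import numpy as np

# Hypothetical expected reward vectors (objective_1, objective_2) for three policies.
policies = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([0.0, 1.0]),
    "C": np.array([0.45, 0.45]),  # a balanced trade-off on a concave Pareto front
}

def scalarize(reward_vec, weights):
    """Linear scalarization: collapse a reward vector to a single scalar."""
    return float(np.dot(weights, reward_vec))

# Whatever non-negative weights are chosen, A or B scores at least as high as C,
# so the balanced policy C is never selected by linear scalarization.
for w in [np.array([0.5, 0.5]), np.array([0.8, 0.2]), np.array([0.2, 0.8])]:
    best = max(policies, key=lambda p: scalarize(policies[p], w))
    print(w, "->", best)
```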

    Efficient Algorithms for Generalized Linear Bandits with Heavy-tailed Rewards

    This paper investigates the problem of generalized linear bandits with heavy-tailed rewards, whose $(1+\epsilon)$-th moment is bounded for some $\epsilon \in (0,1]$. Although there exist methods for generalized linear bandits, most of them focus on bounded or sub-Gaussian rewards and are not well-suited for many real-world scenarios, such as financial markets and web advertising. To address this issue, we propose two novel algorithms based on truncation and mean of medians. These algorithms achieve an almost optimal regret bound of $\widetilde{O}(dT^{\frac{1}{1+\epsilon}})$, where $d$ is the dimension of contextual information and $T$ is the time horizon. Our truncation-based algorithm supports online learning, distinguishing it from existing truncation-based approaches. Additionally, our mean-of-medians-based algorithm requires only $O(\log T)$ rewards and one estimator per epoch, making it more practical. Moreover, our algorithms improve the regret bounds by a logarithmic factor compared to existing algorithms when $\epsilon = 1$. Numerical experimental results confirm the merits of our algorithms.
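
    A minimal sketch of the truncation idea for heavy-tailed rewards may help; it is not the paper's algorithm, and the threshold schedule below is purely illustrative: observations whose magnitude exceeds a slowly growing threshold are discarded (set to zero) before averaging, trading a small bias for much lower variance.

```python
import numpy as np

def truncated_mean(rewards, eps, v, t):
    """Truncation-based mean estimate for heavy-tailed rewards.

    rewards: observed rewards for one arm; eps in (0, 1]: the (1+eps)-th moment
    of the rewards is assumed bounded by v; t: current round. The threshold
    grows with t (schedule here is illustrative only).
    """
    rewards = np.asarray(rewards, dtype=float)
    threshold = (v * t / np.log(t + 1.0)) ** (1.0 / (1.0 + eps))
    kept = np.where(np.abs(rewards) <= threshold, rewards, 0.0)
    return kept.mean()

# Heavy-tailed (Pareto-like) sample: the truncated mean is far less sensitive
# to a single extreme observation than the plain empirical mean.
rng = np.random.default_rng(0)
sample = rng.pareto(1.5, size=1000)
print(sample.mean(), truncated_mean(sample, eps=0.4, v=5.0, t=1000))
```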

    Creating Systems and Applying Large-Scale Methods to Improve Student Remediation in Online Tutoring Systems in Real-time and at Scale

    A common problem shared amongst online tutoring systems is the time-consuming nature of content creation. It has been estimated that an hour of online instruction can take 100-300 hours to create. Several systems have created tools to expedite content creation, such as the Cognitive Tutor Authoring Tools (CTAT) and the ASSISTments builder. Although these tools make content creation more efficient, they all still depend on the efforts of a content creator and/or past historical data. These tools do not take full advantage of the power of the crowd. These issues and challenges faced by online tutoring systems provide an ideal environment in which to implement a crowdsourcing solution. I created the PeerASSIST system to address the challenges of tutoring content creation. PeerASSIST crowdsources the work students have done on problems inside the ASSISTments online tutoring system and redistributes that work as a form of tutoring to their peers who are in need of assistance. Multi-objective multi-armed bandit algorithms are used to distribute student work, balancing exploring which work is good with exploiting the best currently known work. These policies are customized to run in a real-world environment with multiple asynchronous reward functions and an infinite number of actions. Inspired by major companies such as Google, Facebook, and Bing, PeerASSIST is also designed as a platform for simultaneous online experimentation in real time and at scale. Currently over 600 teachers (grades K-12) require students to show their work. Over 300,000 instances of student work have been collected from over 18,000 students across 28,000 problems. From the student work collected, 2,000 instances have been redistributed to over 550 students who needed help over the past few months. I conducted a randomized controlled experiment to evaluate the effectiveness of PeerASSIST on student performance. Other contributions include representing learning maps as Bayesian networks to model student performance, creating a machine-learning algorithm to derive students' incorrect processes from their incorrect answers and the inputs of the problem, and applying Bayesian hypothesis testing to A/B experiments. We showed that learning maps can be simplified without practical loss of accuracy and that time-series data is necessary to simplify learning maps when the static data is highly correlated. I also created several interventions to evaluate the effectiveness of the buggy messages generated from the machine-learned incorrect processes. The null results of these experiments demonstrate the difficulty of creating successful tutoring content and suggest that other methods of tutoring content creation (i.e., PeerASSIST) should be explored.
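
    The bandit-based redistribution described above can be pictured with a simple sketch. This is not PeerASSIST's actual policy: it collapses the multiple asynchronous reward functions into one binary reward and uses plain Beta-Bernoulli Thompson sampling, which still captures the explore/exploit balance between newly submitted and best-known student work. All names are illustrative.

```python
import random
from collections import defaultdict

class WorkSelector:
    """Illustrative Beta-Bernoulli Thompson sampling over pieces of student work."""

    def __init__(self):
        self.successes = defaultdict(int)
        self.failures = defaultdict(int)

    def select(self, candidate_work_ids):
        # Sample a plausible success rate for each candidate and pick the largest;
        # newly submitted work starts from a uniform Beta(1, 1) prior, so it is
        # still explored alongside the currently best-known work.
        def draw(work_id):
            return random.betavariate(self.successes[work_id] + 1,
                                      self.failures[work_id] + 1)
        return max(candidate_work_ids, key=draw)

    def update(self, work_id, reward):
        # reward: 1 if the helped student succeeded afterwards, else 0 (assumed).
        if reward:
            self.successes[work_id] += 1
        else:
            self.failures[work_id] += 1

selector = WorkSelector()
chosen = selector.select(["work_17", "work_42", "work_93"])
selector.update(chosen, reward=1)
```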

    Déclinaisons de bandits et leurs applications

    This thesis studies several variants of the bandit problem, a simplified instance of a reinforcement learning (RL) problem that emphasizes the exploration-exploitation trade-off. More specifically, the focus is on three variants: contextual, structured, and multi-objective bandits. In the first, an agent seeks the optimal action for a given context. In the second, an agent seeks the optimal action in a potentially large space characterized by a similarity metric. In the third, an agent seeks the optimal trade-off on a Pareto front according to a preference-articulation function that cannot be observed directly. The thesis introduces algorithms adapted to each of these variants, whose performance is supported by theoretical guarantees and/or empirical experiments. These bandit variants provide a framework for two real-world applications with high potential impact: 1) adaptive treatment allocation for the discovery of personalized cancer treatment strategies; and 2) online optimization of high-resolution microscopy imaging parameters for the efficient acquisition of images usable in neuroscience. The thesis therefore makes algorithmic, theoretical, and applied contributions. An adaptation of the best empirical sampled average (BESA) algorithm, GP BESA, is proposed for the contextual bandit problem. Its potential is highlighted by simulation experiments, which motivated the deployment of the strategy in a laboratory animal study. The promising results show that GP BESA is able to extend the longevity of mice with cancer and thus significantly increase the amount of data collected on the subjects. An adaptation of the Thompson sampling (TS) algorithm, Kernel TS, is proposed for the structured bandit problem in a reproducing kernel Hilbert space (RKHS). A theoretical analysis yields convergence guarantees on the cumulative pseudo-regret. Concentration results for kernel regression with variable regularization, as well as a procedure for adaptively tuning the regularization based on an empirical estimate of the noise variance, are also introduced. These contributions make it possible to lift the classical assumption of a priori knowledge of the noise variance in online kernel regression. Numerical results illustrate the potential of these tools. Empirical experiments also illustrate the performance of Kernel TS and raise interesting questions about the optimality of the theoretical intuitions. A new variant of multi-objective bandits that generalizes the literature is also proposed. More specifically, the new framework considers that the preference articulation between objectives comes from an unobservable function, typically a user (expert), and suggests integrating this expert into the learning loop. The concept of preference radius is then introduced to evaluate the robustness of the expert's preference function to errors in the estimation of the environment. A variant of the TS algorithm, TS-MVN, is proposed and analyzed. Empirical experiments support the theoretical results and constitute a preliminary investigation of questions about the presence of an expert in the learning loop. Combining the structured and multi-objective bandit approaches then makes it possible to tackle the online optimization of STED imaging parameters. Experimental results on a real microscopy setup and with real neuronal samples show that the proposed technique considerably accelerates the parameter characterization process and facilitates the rapid acquisition of images relevant to experts in neuroscience.
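
    To make the Kernel TS idea more concrete, here is a minimal sketch of one decision round under a kernel-ridge/GP posterior. It is an illustration rather than the thesis algorithm: the RBF kernel and the fixed `reg` value are assumptions, and the adaptive, variance-based tuning of the regularization described above is not implemented.

```python
import numpy as np

def rbf(X, Y, lengthscale=1.0):
    """RBF kernel matrix between the rows of X and the rows of Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

def kernel_ts_step(X_obs, y_obs, X_cand, reg=1.0, rng=None):
    """One Kernel-TS-style round (sketch): fit a kernel-ridge/GP posterior to
    past (action, reward) pairs, draw one joint posterior sample over the
    candidate actions, and play the argmax. `reg` stands in for the
    regularization that the thesis tunes from an empirical noise-variance
    estimate."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X_obs)
    K = rbf(X_obs, X_obs) + reg * np.eye(n)
    alpha = np.linalg.solve(K, y_obs)                  # posterior mean weights
    Ks = rbf(X_cand, X_obs)
    mean = Ks @ alpha
    cov = rbf(X_cand, X_cand) - Ks @ np.linalg.solve(K, Ks.T)
    sample = rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(X_cand)))
    return int(np.argmax(sample))

# Toy usage on a 1-D action space with a noisy sinusoidal reward function.
rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 1, size=(10, 1))
y_obs = np.sin(6 * X_obs[:, 0]) + 0.1 * rng.standard_normal(10)
X_cand = np.linspace(0, 1, 50)[:, None]
print(kernel_ts_step(X_obs, y_obs, X_cand, rng=rng))
```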