9 research outputs found

    Conditionally Risk-Averse Contextual Bandits

    Contextual bandits with average-case statistical guarantees are inadequate in risk-averse situations because they might trade off degraded worst-case behaviour for better average performance. Designing a risk-averse contextual bandit is challenging because exploration is necessary but risk-aversion is sensitive to the entire distribution of rewards; nonetheless, we exhibit the first risk-averse contextual bandit algorithm with an online regret guarantee. We conduct experiments in diverse scenarios where worst-case outcomes should be avoided, spanning dynamic pricing, inventory management, and self-tuning software, including a production exascale data processing system.

    Best-Arm Identification for Quantile Bandits with Privacy

    We study the best-arm identification problem in multi-armed bandits with stochastic, potentially private rewards, when the goal is to identify the arm with the highest quantile at a fixed, prescribed level. First, we propose a (non-private) successive elimination algorithm for strictly optimal best-arm identification; we show that our algorithm is $\delta$-PAC and we characterize its sample complexity. Further, we provide a lower bound on the expected number of pulls, showing that the proposed algorithm is essentially optimal up to logarithmic factors. Both upper and lower complexity bounds depend on a special definition of the associated suboptimality gap, designed in particular for the quantile bandit problem: as we show, when the gap approaches zero, best-arm identification is impossible. Second, motivated by applications where the rewards are private, we provide a differentially private successive elimination algorithm whose sample complexity is finite even for distributions with infinite support size, and we characterize its sample complexity as well. Our algorithms do not require prior knowledge of either the suboptimality gap or other statistical information related to the bandit problem at hand. Comment: 24 pages, 4 figures
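To make the idea of successive elimination on quantiles concrete, here is a minimal Python sketch. It is not the algorithm from the paper above: the confidence radius (a DKW-style bound with a union over arms and rounds), the function names, and the `max_rounds` cutoff are all assumptions chosen for illustration, and no privacy mechanism is included.

```python
import math
import random

def empirical_quantile(sorted_xs, p):
    """Smallest sample x such that at least a fraction p of the samples are <= x."""
    n = len(sorted_xs)
    idx = min(n - 1, max(0, math.ceil(p * n) - 1))
    return sorted_xs[idx]

def quantile_successive_elimination(arms, tau, delta, max_rounds=10_000):
    """Return the index of the arm believed to have the highest tau-quantile.

    `arms` is a list of zero-argument callables, each returning one reward.
    The radius `eps` below is an assumed DKW-style choice, not the paper's.
    """
    samples = {i: [] for i in range(len(arms))}
    active = set(samples)
    for t in range(1, max_rounds + 1):
        # Pull every arm that is still in contention once per round.
        for i in active:
            samples[i].append(arms[i]())
        eps = math.sqrt(math.log(4 * len(arms) * t * t / delta) / (2 * t))
        lower, upper = {}, {}
        for i in active:
            xs = sorted(samples[i])
            # Quantile confidence bounds; vacuous while eps is too large.
            lower[i] = empirical_quantile(xs, tau - eps) if tau - eps > 0 else -math.inf
            upper[i] = empirical_quantile(xs, tau + eps) if tau + eps < 1 else math.inf
        best_lower = max(lower.values())
        # Drop arms whose upper bound falls below the best lower bound.
        active = {i for i in active if upper[i] >= best_lower}
        if len(active) == 1:
            return next(iter(active))
    # Budget exhausted: fall back to the best empirical quantile.
    return max(active, key=lambda i: empirical_quantile(sorted(samples[i]), tau))
```

For example, with two hypothetical arms whose rewards are uniform on [0, 1] and on [2, 3], calling `quantile_successive_elimination(arms, tau=0.5, delta=0.1)` eliminates the first arm as soon as the confidence radius drops below 0.5, because the two supports do not overlap.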

    Learning-based perception and control with adaptive stress testing for safe autonomous air mobility

    The use of electrical vertical takeoff and landing (eVTOL) aircraft to provide efficient, high-speed, on-demand air transportation within a metropolitan area is a topic of increasing interest, which is expected to bring fundamental changes to the city infrastructures and daily commutes. NASA, Uber, and Airbus have been exploring this exciting concept of Urban Air Mobility (UAM), which has the potential to provide meaningful door-to-door trip time savings compared with automobiles. However, successfully bringing such vehicles and airspace operations to fruition will require introducing orders-of-magnitude more aircraft to a given airspace volume, and the ability to manage many of these eVTOL aircraft safely in a congested urban area presents a challenge unprecedented in air traffic management. Although there are existing solutions for communication technology, onboard computing capability, and sensor technology, the computation guidance algorithm to enable safe, efficient, and scalable flight operations for dense self-organizing air traffic still remains an open question. In order to enable safe and efficient autonomous on-demand free flight operations in this UAM concept, a suite of tools in learning-based perception and control systems with stress testing for safe autonomous air mobility is proposed in this dissertation. First, a key component for the safe autonomous operation of unmanned aircraft is an effective onboard perception system, which will support sense-and-avoid functions. For example, in a package delivery mission, or an emergency landing event, pedestrian detection could help unmanned aircraft with safe landing zone identification. In this dissertation, we developed a deep-learning-based onboard computer vision algorithm on unmanned aircraft for pedestrian detection and tracking. 
In contrast with existing research on ground-level pedestrian detection, the developed algorithm achieves highly accurate multiple-pedestrian detection from a bird's-eye view when both the pedestrians and the aircraft platform are moving. Second, for aircraft guidance, a message-based decentralized computational guidance algorithm with separation assurance capability is designed and analyzed for both the single-aircraft case and the multiple cooperative aircraft case. The proposed algorithm formulates this problem as a Markov Decision Process (MDP) and solves it with an online algorithm, Monte Carlo Tree Search (MCTS). For the multiple cooperative aircraft case, a novel coordination strategy is introduced using the logit level-$k$ model from behavioral game theory. To achieve higher scalability, we introduce the airspace sector concept into the UAM environment by dividing the airspace into sectors, so that each aircraft only needs to coordinate with aircraft in the same sector. At each decision step, all of the aircraft run the proposed computational guidance algorithm onboard, which guides them to their respective destinations while avoiding potential conflicts among them. In addition, to make the proposed algorithm more practical, we also consider communication constraints and communication loss among the aircraft by modifying our computational guidance algorithms under given constraints (time, bandwidth, and communication loss) and designing air-to-air and air-to-ground communication frameworks to support the computational guidance algorithm. To demonstrate the performance of the proposed computational guidance algorithm, a free-flight airspace simulator that incorporates environment uncertainty is built in an OpenAI Gym environment. 
Numerical experiment results over several case studies, including the roundabout test problem, show that the proposed computational guidance algorithm has promising performance even in the high-density air traffic case. Third, to ensure the developed autonomous systems meet the high safety standards of aviation, we propose a novel, simulation-driven approach for validation that can automatically discover the failure modes of a decision-making system and optimize the parameters that configure the system to improve its safety performance. Using simulation, we demonstrate that the proposed validation algorithm is able to discover failure modes in the system that would be challenging for humans to find and fix, and we show how the algorithm can learn from these failure modes to improve the performance of the decision-making system under test.

    L'Apprentissage Automatique pour la prise de Décisions

    Strategic decision-making over valuable resources should consider risk-averse objectives. Many practical areas of application consider risk as central to decision-making. However, machine learning does not. As a result, research should provide insights and algorithms that endow machine learning with the ability to consider decision-theoretic risk, in particular when estimating decision-theoretic risk on short dependent sequences generated from the most general possible class of processes for statistical inference, and when using decision-theoretic risk objectives in sequential decision-making. This thesis studies these two problems to provide principled algorithmic methods for considering decision-theoretic risk in machine learning. An algorithm with state-of-the-art performance is introduced for accurate estimation of risk statistics on the most general class of stationary–ergodic processes, and risk-averse objectives are introduced in sequential decision-making (online learning) in both the stochastic multi-arm bandit setting and the adversarial full-information setting.

    Risk-Sensitive Online Learning

    Abstract. We consider the problem of online learning in settings in which we want to compete not simply with the rewards of the best expert or stock, but with the best trade-off between rewards and risk. Motivated by finance applications, we consider two common measures balancing returns and risk: the Sharpe ratio [7] and the mean-variance criterion of Markowitz [6]. We first provide negative results establishing the impossibility of no-regret algorithms under these measures, thus providing a stark contrast with the returns-only setting. We then show that the recent algorithm of Cesa-Bianchi et al. [3] achieves nontrivial performance under a modified bicriteria risk-return measure, and also give a no-regret algorithm for a “localized” version of the mean-variance criterion. To our knowledge this paper initiates the investigation of explicit risk considerations in the standard models of worst-case online learning.
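As an illustration of the two risk-return measures named in the abstract above, the following is a minimal Python sketch. The function names, the default risk-free rate of zero, the use of population (rather than sample) variance, and the risk-aversion weight `lam` are assumptions made for illustration; the paper's exact definitions may differ.

```python
import statistics

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the standard deviation of the returns."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    if std == 0:
        raise ValueError("Sharpe ratio is undefined for zero-variance returns")
    return (mean - risk_free) / std

def mean_variance(returns, lam=1.0):
    """Markowitz-style score: mean return minus lam times the variance."""
    return statistics.fmean(returns) - lam * statistics.pvariance(returns)
```

Both scores reward a higher average return and penalize variability, so a steady sequence can outrank a volatile one with the same mean.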

    Risk and Safety in Online Learning and Optimization: Theory and Applications

    149 pages. This dissertation focuses on risk and safety considerations in the design and analysis of online learning algorithms for sequential decision-making problems under uncertainty. The particular motivating application for the mathematical models and methods developed in this dissertation is demand response programs. Demand response programs denote the general family of mechanisms designed to improve the efficiency and reliability of electric power systems by affecting the demand of residential customers. First, we design a risk-sensitive online learning algorithm for linear models. In particular, we consider the setting in which an electric power utility seeks to curtail its peak electricity demand by offering a fixed group of customers a uniform price for reductions in consumption relative to their predetermined baselines. The underlying demand curve, which describes the aggregate reduction in consumption in response to the offered price, is assumed to be affine and subject to unobservable random shocks. Assuming that both the parameters of the demand curve and the distribution of the random shocks are initially unknown to the utility, we investigate the extent to which the utility might dynamically adjust its offered prices to maximize its cumulative risk-sensitive payoff over a finite horizon of T days. In order to do so effectively, the utility must design its pricing policy to balance the trade-off between the need to learn the unknown demand model (exploration) and maximize its payoff (exploitation) over time. We propose a semi-greedy pricing policy and show that its expected regret, defined as the risk-sensitive payoff loss over T days relative to an oracle pricing policy that knows the underlying demand model, is no more than $O(T^{0.5}\log T)$. Moreover, the proposed pricing policy is shown to yield a sequence of prices that converge to the oracle optimal prices in the mean square sense. 
Second, we develop an online learning algorithm for linear models subject to stagewise safety constraints. More specifically, we introduce the safe linear stochastic bandit framework, a generalization of linear stochastic bandits, where, in each stage, the learner is required to select an arm with an expected reward that is no less than a predetermined (safe) threshold with high probability. We assume that the learner initially has knowledge of an arm that is known to be safe, but not necessarily optimal. Leveraging this assumption, we introduce a learning algorithm that systematically combines known safe arms with exploratory arms to safely expand the set of safe arms over time, while facilitating safe greedy exploitation in subsequent stages. In addition to ensuring the satisfaction of the safety constraint at every stage of play, the proposed algorithm is shown to exhibit an expected regret that is no more than $O((T\log T)^{0.5})$ after T stages of play. Third, we extend our methodology developed for linear models to design an online learning algorithm with near-optimal performance for a more general class of nonparametric smooth reward models. Specifically, we adopt the perspective of an aggregator, which seeks to coordinate its purchase of demand reductions from a fixed group of residential electricity customers with its sale of the aggregate demand reduction in a two-settlement wholesale energy market. The aggregator procures reductions in demand by offering its customers a uniform price for reductions in consumption relative to their predetermined baselines. Prior to its realization of the aggregate demand reduction, the aggregator must also determine how much energy to sell into the two-settlement energy market. In the day-ahead market, the aggregator commits to a forward contract, which calls for the delivery of energy in the real-time market. 
The underlying aggregate demand curve, which relates the aggregate demand reduction to the aggregator's offered price, is assumed to be unknown and subject to unobservable, random shocks. Assuming that both the demand curve and the distribution of the random shocks are initially unknown to the aggregator, we investigate the extent to which the aggregator might dynamically adapt its offered prices and forward contracts to maximize its expected profit over a time window of T days. Specifically, we design a dynamic pricing and contract offering policy that balances the aggregator's need to learn the unknown demand model against its desire to maximize its cumulative expected profit over time. In particular, the proposed pricing policy is proven to incur an expected regret over T days that is no greater than $O(T^{0.5}\log^2 T)$.