
    Trading off rewards and errors in multi-armed bandits

    In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation errors) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large reward). In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm, whose performance is provably close to the best possible tradeoff strategy. Finally, we demonstrate on real-world educational data that ForcingBalance returns useful information about the arms without compromising the overall reward.
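    The abstract does not spell out the ForcingBalance algorithm itself, so the sketch below is only an illustration of the reward/error trade-off it describes: a bandit that mixes forced pulls of the least-sampled arm (shrinking every arm's estimation error) with greedy pulls of the empirically best arm (protecting the cumulative reward). The mixing rate epsilon, the Gaussian rewards, and the arm means are assumptions for the example, not the paper's setup.

        import numpy as np

        rng = np.random.default_rng(0)

        def forced_exploration_bandit(means, T, epsilon=0.1):
            """Illustrative reward/error trade-off (not the paper's ForcingBalance):
            with probability epsilon pull the least-sampled arm to reduce the worst
            estimation error, otherwise pull the empirically best arm."""
            K = len(means)
            counts = np.zeros(K)
            sums = np.zeros(K)
            total_reward = 0.0
            for t in range(T):
                if t < K:                      # pull each arm once to initialise
                    arm = t
                elif rng.random() < epsilon:   # forced exploration step
                    arm = int(np.argmin(counts))
                else:                          # greedy exploitation step
                    arm = int(np.argmax(sums / counts))
                r = rng.normal(means[arm], 1.0)
                counts[arm] += 1
                sums[arm] += r
                total_reward += r
            est_error = np.abs(sums / counts - np.asarray(means))
            return total_reward, est_error

        reward, err = forced_exploration_bandit(means=[0.2, 0.5, 0.8], T=5000)
        print(f"cumulative reward: {reward:.1f}, max estimation error: {err.max():.3f}")

    Raising epsilon buys smaller estimation errors at the cost of cumulative reward, which is exactly the trade-off the paper formalizes.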

    Hedging using reinforcement learning: Contextual k-Armed Bandit versus Q-learning

    The construction of replication strategies for contingent claims in the presence of risk and market friction is a key problem of financial engineering. In real markets, continuous replication, such as in the model of Black, Scholes and Merton, is not only unrealistic but also undesirable due to high transaction costs. Over the last decades, stochastic optimal-control methods have been developed to balance between effective replication and losses. More recently, with the rise of artificial intelligence, temporal-difference Reinforcement Learning, in particular variations of Q-learning in conjunction with Deep Neural Networks, have attracted significant interest. From a practical point of view, however, such methods are often relatively sample-inefficient, hard to train and lack performance guarantees. This motivates the investigation of a stable benchmark algorithm for hedging. In this article, the hedging problem is viewed as an instance of a risk-averse contextual k-armed bandit problem, for which a large body of theoretical results and well-studied algorithms are available. We find that the k-armed bandit model naturally fits the P&L formulation of hedging, providing a more accurate and sample-efficient approach than Q-learning and reducing to the Black-Scholes model in the absence of transaction costs and risks.
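    As a rough illustration of the bandit view of hedging described above, the sketch below treats a small grid of hedge ratios as arms, observes a noisy one-period P&L per round under a toy linear payoff with proportional transaction costs, and selects arms with a mean-minus-standard-deviation criterion plus a UCB-style exploration bonus. The payoff model, cost level and risk-aversion weight are assumptions for the example; this is not the algorithm of the paper.

        import numpy as np

        rng = np.random.default_rng(1)

        # Arms: candidate hedge ratios for a short claim (toy setup, not the paper's).
        hedge_ratios = np.linspace(0.0, 1.0, 11)
        cost = 0.002          # proportional transaction cost (assumed)
        risk_aversion = 1.0   # weight on the P&L standard deviation (assumed)

        def pnl(h):
            """One-period P&L of holding h units of the underlying against a short
            claim whose payoff is approximated as 0.6 * dS (toy linear model)."""
            dS = rng.normal(0.0, 1.0)
            return (h - 0.6) * dS - cost * abs(h)

        T, K = 5000, len(hedge_ratios)
        counts = np.zeros(K)
        means = np.zeros(K)
        m2 = np.zeros(K)          # running sum of squared deviations (Welford)

        for t in range(T):
            if t < K:
                arm = t           # try every hedge ratio once
            else:
                std = np.sqrt(m2 / np.maximum(counts - 1, 1))
                bonus = np.sqrt(2 * np.log(t + 1) / counts)   # exploration bonus
                arm = int(np.argmax(means - risk_aversion * std + bonus))
            x = pnl(hedge_ratios[arm])
            counts[arm] += 1
            delta = x - means[arm]
            means[arm] += delta / counts[arm]
            m2[arm] += delta * (x - means[arm])

        print(f"most-played hedge ratio: {hedge_ratios[int(np.argmax(counts))]:.1f}")

    Under this toy model the risk-averse criterion concentrates play on the hedge ratio closest to the claim's exposure (0.6), mirroring how the bandit formulation works directly on the P&L rather than on a learned value function.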

    Rewards and errors in multi-arm bandits for interactive education

    In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation errors) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large reward). In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm, whose performance is provably close to the best possible tradeoff strategy. Finally, we demonstrate on real-world educational data that ForcingBalance returns useful information about the arms without compromising the overall reward.

    Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk

    We study the trade-off between expectation and tail risk for the regret distribution in the stochastic multi-armed bandit problem. We fully characterize the interplay among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of the expected regret exactly affects the decay rate of the regret tail probability in both the worst-case and instance-dependent scenarios. A novel policy is proposed to characterize the optimal regret tail probability for any regret threshold. Concretely, for any given $\alpha \in [1/2, 1)$ and $\beta \in [0, \alpha]$, our policy achieves a worst-case expected regret of $\tilde{O}(T^\alpha)$ (we call it $\alpha$-optimal) and an instance-dependent expected regret of $\tilde{O}(T^\beta)$ (we call it $\beta$-consistent), while enjoying a probability of incurring an $\tilde{O}(T^\delta)$ regret ($\delta \geq \alpha$ in the worst-case scenario and $\delta \geq \beta$ in the instance-dependent scenario) that decays exponentially with a polynomial $T$ term. This decay rate is proved to be the best achievable. Moreover, we discover an intrinsic gap in the optimal tail rate under the instance-dependent scenario between whether the time horizon $T$ is known a priori or not. Interestingly, when it comes to the worst-case scenario, this gap disappears. Finally, we extend our proposed policy design to (1) a stochastic multi-armed bandit setting with non-stationary baseline rewards, and (2) a stochastic linear bandit setting. Our results reveal insights into the trade-off between regret expectation and regret tail risk for both worst-case and instance-dependent scenarios, indicating that more sub-optimality and inconsistency leave room for lighter-tailed risk of incurring a large regret, and that knowing the planning horizon in advance can make a difference in alleviating tail risks.
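    The policy of the paper is not reproduced here, but the distinction it draws between expected regret and regret tail risk can be made concrete with a Monte Carlo sketch: run a standard UCB policy many times and estimate both the mean regret and the probability that the realised regret exceeds a threshold $T^\delta$. The two-armed Gaussian instance, the horizon and the threshold exponent below are illustrative assumptions.

        import numpy as np

        def run_ucb(means, T, rng):
            """One run of UCB1 on Gaussian arms; returns the realised pseudo-regret."""
            K = len(means)
            counts = np.zeros(K)
            sums = np.zeros(K)
            pulls = np.empty(T, dtype=int)
            for t in range(T):
                if t < K:
                    arm = t
                else:
                    ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
                    arm = int(np.argmax(ucb))
                r = rng.normal(means[arm], 1.0)
                counts[arm] += 1
                sums[arm] += r
                pulls[t] = arm
            return (max(means) - np.asarray(means)[pulls]).sum()

        means, T, runs = [0.0, 0.5], 2000, 500
        rng = np.random.default_rng(2)
        regrets = np.array([run_ucb(means, T, rng) for _ in range(runs)])
        delta = 0.75   # regret-threshold exponent (illustrative)
        tail = np.mean(regrets >= T ** delta)
        print(f"mean regret: {regrets.mean():.1f}, P(regret >= T^{delta}): {tail:.3f}")

    Comparing such empirical tail estimates across policies (for example, more or less aggressive exploration) is one way to see the trade-off the abstract describes: policies tuned for a smaller expected regret can exhibit heavier regret tails.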