    Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

    We address the problem of automatic generation of features for value function approximation. Bellman Error Basis Functions (BEBFs) have been shown to reduce the error of policy evaluation with function approximation, with a convergence rate similar to that of value iteration. We propose a simple, fast, and robust algorithm based on random projections to generate BEBFs for sparse feature spaces. We provide a finite-sample analysis of the proposed method and prove that projections logarithmic in the dimension of the original space are enough to guarantee contraction in the error. Empirical results demonstrate the strength of this method.
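    As a concrete illustration of the approach (a minimal sketch, not the authors' exact algorithm), the Python snippet below fits one BEBF by regressing a Bellman residual onto randomly projected sparse features; the projected dimension, ridge penalty, and function name are hypothetical choices for the example.

    import numpy as np

    def bebf_from_projection(phi, td_residual, proj_dim=32, ridge=1e-3, seed=0):
        # phi         : (n, D) feature matrix (typically sparse) for sampled states
        # td_residual : (n,) Bellman residual r + gamma * V(s') - V(s) under the
        #               current value estimate
        rng = np.random.default_rng(seed)
        n, D = phi.shape
        # Random projection; the paper's analysis says a projected dimension
        # logarithmic in D suffices, so proj_dim here is just a placeholder.
        A = rng.normal(0.0, 1.0 / np.sqrt(proj_dim), size=(D, proj_dim))
        Z = phi @ A
        # Ridge regression of the residual on the projected features; the fitted
        # values act as the new basis function on this batch of states.
        beta = np.linalg.solve(Z.T @ Z + ridge * np.eye(proj_dim), Z.T @ td_residual)
        return Z @ beta, (A, beta)   # append Z @ beta as a new feature column

    The new column would be appended to the feature matrix and the value function refit before generating the next BEBF.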

    Valuing Pilot Project Investments in Incomplete Markets: A Compound Option Approach

    We introduce a general framework for valuing pilot project investments in the presence of both market and technical uncertainty. The model generalizes several settings introduced previously in the literature. By distinguishing between the pilot and the commercial stages of the project, we are able to frame the problem as a compound perpetual Bermudan option. We work in an incomplete-market setting where market uncertainty is spanned by tradable assets and technical uncertainty is private to the firm. The value of these investment opportunities, as well as the optimal exercise policy, is obtained by approximate dynamic programming techniques. We prove the convergence of our algorithm and derive a theoretical bound on how the errors compound as the number of stages of the compound option is increased. Furthermore, we show some numerical results and provide an economic interpretation of the model dynamics. Keywords: real options, dynamic programming, incomplete markets.
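    As a rough illustration of the compound-option framing (not the paper's incomplete-market model), the sketch below values a two-stage investment by value iteration on a discretized market state: the pilot-stage option's exercise payoff is taken to be the commercial-stage option value net of the pilot cost. The grid, transition kernel, discount factor, and costs are invented placeholders.

    import numpy as np

    S = np.linspace(10.0, 200.0, 96)        # discretized market state (project value)
    beta = 0.97                             # one-period discount factor
    K_commercial, K_pilot = 100.0, 15.0     # hypothetical exercise costs of each stage

    def transition_matrix(grid, vol=0.15):
        # Crude lognormal-style kernel on the grid, for illustration only.
        P = np.exp(-0.5 * (np.log(grid[None, :] / grid[:, None]) / vol) ** 2)
        return P / P.sum(axis=1, keepdims=True)

    P = transition_matrix(S)

    def perpetual_bermudan(payoff, tol=1e-8):
        # Value iteration for max(exercise now, continue); a contraction since beta < 1.
        V = np.maximum(payoff, 0.0)
        while True:
            V_new = np.maximum(payoff, beta * P @ V)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new

    V_commercial = perpetual_bermudan(S - K_commercial)   # inner (commercial-stage) option
    V_pilot = perpetual_bermudan(V_commercial - K_pilot)  # outer option on the inner value
    print("pilot-stage value at S = 100:", np.interp(100.0, S, V_pilot))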

    Examining average and discounted reward optimality criteria in reinforcement learning

    In reinforcement learning (RL), the goal is to obtain an optimal policy, for which the optimality criterion is fundamentally important. Two major optimality criteria are the average and discounted rewards, where the latter is typically considered an approximation to the former. While the discounted reward is more popular, it is problematic to apply in environments that have no natural notion of discounting. This motivates us to revisit a) the progression of optimality criteria in dynamic programming, b) the justification for, and complications of, an artificial discount factor, and c) the benefits of directly maximizing the average reward. Our contributions include a thorough examination of the relationship between average and discounted rewards, as well as a discussion of their pros and cons in RL. We emphasize that average-reward RL methods possess the ingredients and mechanisms for developing the general discounting-free optimality criterion (Veinott, 1969) in RL. (Comment: 14 pages, 3 figures, 10-page main content)
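    The toy example below (not taken from the paper) contrasts the two criteria on a made-up 2-state, 2-action MDP: standard discounted value iteration at two discount factors, and relative value iteration, whose offset estimates the gain (average reward). All transition and reward numbers are invented for illustration.

    import numpy as np

    P = np.array([  # P[a, s, s']
        [[0.9, 0.1], [0.2, 0.8]],
        [[0.5, 0.5], [0.4, 0.6]],
    ])
    R = np.array([  # R[a, s]
        [1.0, 0.0],
        [2.0, -0.5],
    ])

    def discounted_vi(gamma, iters=2000):
        V = np.zeros(2)
        for _ in range(iters):
            Q = R + gamma * (P @ V)          # Q[a, s]
            V = Q.max(axis=0)
        return Q.argmax(axis=0)

    def relative_vi(iters=2000):
        # Subtract the value of a reference state each step so the iterates track
        # differential values; the subtracted offset estimates the gain.
        h = np.zeros(2)
        for _ in range(iters):
            Q = R + P @ h
            h_new = Q.max(axis=0)
            gain = h_new[0]                  # reference state 0
            h = h_new - gain
        return Q.argmax(axis=0), gain

    print("policy, gamma=0.5 :", discounted_vi(0.5))
    print("policy, gamma=0.99:", discounted_vi(0.99))
    print("average-reward policy and gain:", relative_vi())

    For a discount factor close enough to one, the discounted-optimal policy is also average-reward optimal, which is the sense in which discounting approximates the average-reward criterion.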

    Stability of Q-Learning Through Design and Optimism

    Q-learning has become an important part of the reinforcement learning toolkit since its introduction in the dissertation of Chris Watkins in the 1980s. This paper is in part a tutorial on stochastic approximation and Q-learning, providing details regarding the INFORMS APS inaugural Applied Probability Trust Plenary Lecture, presented in Nancy, France, June 2023. The paper also presents new approaches to ensure stability and potentially accelerated convergence for these algorithms, and for stochastic approximation in other settings. Two contributions are entirely new: 1. Stability of Q-learning with linear function approximation has been an open research topic for over three decades. It is shown that with appropriate optimistic training in the form of a modified Gibbs policy, there exists a solution to the projected Bellman equation, and the algorithm is stable (in terms of bounded parameter estimates). Convergence remains one of many open topics for research. 2. The new Zap Zero algorithm is designed to approximate the Newton-Raphson flow without matrix inversion. It is stable and convergent under mild assumptions on the mean flow vector field for the algorithm and compatible statistical assumptions on an underlying Markov chain. The algorithm is a general approach to stochastic approximation, which in particular applies to Q-learning with "oblivious" training, even with non-linear function approximation. (Comment: Companion paper to the INFORMS APS inaugural Applied Probability Trust Plenary Lecture, presented in Nancy, France, June 2023. Slides available online, DOI 10.13140/RG.2.2.24897.3312)
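    The sketch below is a generic illustration of Q-learning with linear function approximation driven by a Gibbs (softmax) behavior policy, loosely in the spirit of the optimistic training described above; it is not the paper's algorithm, and the toy MDP, one-hot features, temperature, and step sizes are assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 5, 2, 0.9

    # Random toy MDP and one-hot state-action features.
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, :]
    R = rng.normal(size=(n_states, n_actions))

    def phi(s, a):
        f = np.zeros(n_states * n_actions)
        f[s * n_actions + a] = 1.0
        return f

    def gibbs(q_values, temperature=1.0):
        z = q_values / temperature
        p = np.exp(z - z.max())
        return p / p.sum()

    theta = np.zeros(n_states * n_actions)
    s = 0
    for t in range(1, 50_000):
        q_s = np.array([phi(s, a) @ theta for a in range(n_actions)])
        a = rng.choice(n_actions, p=gibbs(q_s))          # Gibbs behavior policy
        s_next = rng.choice(n_states, p=P[s, a])
        q_next = max(phi(s_next, b) @ theta for b in range(n_actions))
        td = R[s, a] + gamma * q_next - phi(s, a) @ theta
        theta += (1.0 / t) * td * phi(s, a)              # standard Q(0) update
        s = s_next

    With one-hot features this reduces to tabular Q-learning; the delicate case addressed in the paper is general linear (and non-linear) function approximation.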

    Order acceptance with reinforcement learning

    Order Acceptance (OA) is one of the main functions in a business control framework. Basically, OA involves a 0/1 (i.e., reject/accept) decision for each order. Always accepting an order when capacity is available could leave the system unable to accept more attractive orders in the future. Another important aspect is the availability of information to the decision maker. We use a stochastic modeling approach based on Markov decision theory, together with learning methods from Artificial Intelligence, to deal with uncertainty and long-term decisions in OA. Reinforcement Learning (RL) is a relatively new approach that combines this modeling idea with a solution method. Here we report on RL solutions for some OA models.
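    As a concrete toy version of the accept/reject framing (not the paper's model), the sketch below learns an order-acceptance policy with tabular Q-learning on a hypothetical capacity-constrained system; the order types, profits, sizes, and replenishment rule are invented for the example.

    import numpy as np

    rng = np.random.default_rng(1)
    CAPACITY = 10
    order_profit = np.array([1.0, 4.0])    # type 0: cheap order, type 1: lucrative order
    order_size = np.array([1, 3])
    order_prob = np.array([0.8, 0.2])      # lucrative orders are rare

    # State: (remaining capacity, type of arriving order); actions: 0 = reject, 1 = accept.
    Q = np.zeros((CAPACITY + 1, 2, 2))
    eps, alpha, gamma = 0.1, 0.1, 0.95

    cap, otype = CAPACITY, rng.choice(2, p=order_prob)
    for step in range(200_000):
        a = rng.integers(2) if rng.random() < eps else int(Q[cap, otype].argmax())
        if a == 1 and order_size[otype] <= cap:
            reward, cap_next = order_profit[otype], cap - order_size[otype]
        else:
            reward, cap_next = 0.0, cap
        if rng.random() < 0.3:             # capacity is periodically replenished
            cap_next = CAPACITY
        otype_next = rng.choice(2, p=order_prob)
        target = reward + gamma * Q[cap_next, otype_next].max()
        Q[cap, otype, a] += alpha * (target - Q[cap, otype, a])
        cap, otype = cap_next, otype_next

    # A learned policy may reject cheap orders when capacity is scarce, holding
    # space for the rarer, more profitable type.
    print("accept a cheap order at capacity 3?", bool(Q[3, 0].argmax()))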

    Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes

    We study offline reinforcement learning (RL) in the face of unmeasured confounders. Due to the lack of online interaction with the environment, offline RL faces two significant challenges: (i) the agent may be confounded by the unobserved state variables; (ii) the offline data collected a priori do not provide sufficient coverage of the environment. To tackle these challenges, we study policy learning in confounded MDPs with the aid of instrumental variables. Specifically, we first establish value function (VF)-based and marginalized importance sampling (MIS)-based identification results for the expected total reward in confounded MDPs. Then, by leveraging pessimism and our identification results, we propose various policy learning methods with finite-sample suboptimality guarantees for finding the optimal in-class policy under minimal data coverage and modeling assumptions. Lastly, our extensive theoretical investigations and a numerical study motivated by kidney transplantation demonstrate the promising performance of the proposed methods.
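    The sketch below illustrates only the generic "pessimism" ingredient, via count-based lower-confidence-bound penalties in tabular offline value iteration; it is not the paper's instrumental-variable estimator, and the penalty constant, data format, and function name are assumptions made for the example.

    import numpy as np

    def pessimistic_vi(transitions, n_states, n_actions, gamma=0.95, c=1.0, iters=500):
        # transitions: iterable of (s, a, r, s_next) tuples from the offline dataset.
        counts = np.zeros((n_states, n_actions))
        r_sum = np.zeros((n_states, n_actions))
        p_counts = np.zeros((n_states, n_actions, n_states))
        for s, a, r, s_next in transitions:
            counts[s, a] += 1
            r_sum[s, a] += r
            p_counts[s, a, s_next] += 1

        safe = np.maximum(counts, 1)
        R_hat = r_sum / safe
        P_hat = p_counts / safe[:, :, None]
        bonus = c / np.sqrt(safe)          # penalty is largest where data are scarce

        V = np.zeros(n_states)
        for _ in range(iters):
            Q = R_hat - bonus + gamma * (P_hat @ V)   # pessimistic Bellman backup
            V = Q.max(axis=1)
        return Q.argmax(axis=1), V

    The paper's contribution replaces the naive empirical model above with IV-based identification of the value function under unmeasured confounding; this sketch only shows where a pessimism penalty enters the backup.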