Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
We address the problem of automatic feature generation for value function
approximation. Bellman Error Basis Functions (BEBFs) have been shown to reduce
the error of policy evaluation with function approximation, with a convergence
rate similar to that of value iteration. We propose a simple, fast, and robust
algorithm based on random projections to generate BEBFs for sparse feature
spaces. We provide a finite-sample analysis of the proposed method, and prove
that projections logarithmic in the dimension of the original space suffice
to guarantee contraction in the error. Empirical results demonstrate the
strength of this method.
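The approach described above can be sketched in a few lines: compute the Bellman error of the current value estimate, randomly project the sparse features down to a low dimension, and fit the error in the projected space to obtain a new basis function. This is a minimal illustration with an assumed toy policy-evaluation setup (random rewards, transition matrix, and binary sparse features), not the paper's algorithm or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse-feature policy-evaluation setup (all quantities are assumptions)
n, D, d = 200, 1000, 30      # states, sparse dimension, projected dimension ~ O(log D)
Phi = np.zeros((n, D))
for i in range(n):
    Phi[i, rng.integers(0, D, size=5)] = 1.0   # 5 active features per state

gamma = 0.9
R = rng.normal(size=n)                  # rewards under the evaluated policy
P = rng.dirichlet(np.ones(n), size=n)   # transition matrix under the policy
V = np.zeros(n)                         # current value estimate

# Bellman error of the current estimate: r + gamma * P V - V
bellman_err = R + gamma * P @ V - V

# Project the sparse features to d dimensions with a random Gaussian matrix
Proj = rng.normal(scale=1.0 / np.sqrt(d), size=(D, d))
Z = Phi @ Proj

# Fit the Bellman error in the projected space to form the next basis function
w, *_ = np.linalg.lstsq(Z, bellman_err, rcond=None)
new_bebf = Z @ w
V = V + new_bebf                        # value update using the new BEBF
```

Repeating this step appends one basis function per iteration; the paper's analysis concerns how small the projected dimension d can be while still contracting the error.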
Valuing Pilot Project Investments in Incomplete Markets: A Compound Option Approach
We introduce a general framework to value pilot project investments in the presence of both market and technical uncertainty. The model generalizes different settings introduced previously in the literature. By distinguishing between the pilot and the commercial stages of the project, we are able to frame the problem as a compound perpetual Bermudan option. We work in an incomplete market setting where market uncertainty is spanned by tradable assets and technical uncertainty is private to the firm. We solve for the value of these investment opportunities, as well as the optimal exercise policy, using approximate dynamic programming techniques. We prove the convergence of our algorithm and derive a theoretical bound on how the errors compound as the number of stages of the compound option is increased. Furthermore, we present numerical results and provide an economic interpretation of the model dynamics.
Keywords: real options, dynamic programming, incomplete markets
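The Bermudan exercise structure at the heart of this approach can be illustrated with a least-squares Monte Carlo sketch (Longstaff-Schwartz regression), a standard approximate-dynamic-programming method for Bermudan options. The single-stage put and all parameters below are illustrative assumptions, not the paper's compound, incomplete-market model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters for a Bermudan-style put (assumed, not from the paper)
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
steps, paths = 50, 20000
dt = T / steps
disc = np.exp(-r * dt)

# Simulate geometric Brownian motion paths for the underlying
Zn = rng.standard_normal((paths, steps))
S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * Zn, axis=1))

# Backward induction: regress continuation values, compare with exercise values
payoff = np.maximum(K - S[:, -1], 0.0)          # value if held to maturity
for t in range(steps - 2, -1, -1):
    payoff *= disc                              # discount one step back
    itm = (K - S[:, t]) > 0                     # regress only on in-the-money paths
    if itm.sum() > 10:
        X = np.vander(S[itm, t], 3)             # quadratic basis in the price
        beta, *_ = np.linalg.lstsq(X, payoff[itm], rcond=None)
        cont = X @ beta                         # estimated continuation value
        exercise = (K - S[itm, t]) > cont
        payoff[itm] = np.where(exercise, K - S[itm, t], payoff[itm])

price = disc * payoff.mean()
```

A compound option would nest one such backward induction per stage, which is where the paper's bound on error compounding across stages becomes relevant.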
Examining average and discounted reward optimality criteria in reinforcement learning
In reinforcement learning (RL), the goal is to obtain an optimal policy, for
which the optimality criterion is fundamentally important. Two major optimality
criteria are average and discounted rewards, where the latter is typically
considered an approximation to the former. While the discounted reward is
more popular, it is problematic to apply in environments that have no natural
notion of discounting. This motivates us to revisit a) the progression of
optimality criteria in dynamic programming, b) justification for and
complication of an artificial discount factor, and c) benefits of directly
maximizing the average reward. Our contributions include a thorough examination
of the relationship between average and discounted rewards, as well as a
discussion of their pros and cons in RL. We emphasize that average-reward RL
methods possess the ingredient and mechanism for developing the general
discounting-free optimality criterion (Veinott, 1969) in RL.
Comment: 14 pages, 3 figures, 10-page main content
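The relationship between the two criteria examined above can be checked numerically on a toy Markov reward process: the average reward (gain) g equals the limit of (1 - gamma) times the discounted value as gamma approaches 1. The two-state chain below is an assumed example.

```python
import numpy as np

# Two-state Markov reward process under a fixed policy (assumed example)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])

# Average reward (gain): g = pi . r, with pi the stationary distribution of P
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()            # here pi = (2/3, 1/3)
g = pi @ r                    # so g = 2/3

for gamma in (0.9, 0.99, 0.999):
    v = np.linalg.solve(np.eye(2) - gamma * P, r)   # discounted values
    print(gamma, (1 - gamma) * v)    # approaches (g, g) as gamma -> 1
```

This limiting connection is exactly why the discounted criterion is often treated as an approximation to the average-reward criterion, and why it degrades when no natural discount factor exists.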
Stability of Q-Learning Through Design and Optimism
Q-learning has become an important part of the reinforcement learning toolkit
since its introduction in the dissertation of Chris Watkins in the 1980s. The
purpose of this paper is in part to provide a tutorial on stochastic
approximation and Q-learning, with details regarding the INFORMS APS inaugural
Applied Probability Trust Plenary Lecture, presented in Nancy, France, in June 2023.
The paper also presents new approaches to ensure stability and potentially
accelerated convergence for these algorithms, and stochastic approximation in
other settings. Two contributions are entirely new:
1. Stability of Q-learning with linear function approximation has been an
open topic for research for over three decades. It is shown that with
appropriate optimistic training in the form of a modified Gibbs policy, there
exists a solution to the projected Bellman equation, and the algorithm is
stable (in terms of bounded parameter estimates). Convergence remains one of
many open topics for research.
2. The new Zap Zero algorithm is designed to approximate the Newton-Raphson
flow without matrix inversion. It is stable and convergent under mild
assumptions on the mean flow vector field for the algorithm, and compatible
statistical assumption on an underlying Markov chain. The algorithm is a
general approach to stochastic approximation which in particular applies to
Q-learning with "oblivious" training even with non-linear function
approximation.
Comment: Companion paper to the INFORMS APS inaugural Applied Probability
Trust Plenary Lecture, presented in Nancy, France, in June 2023. Slides
available online, DOI 10.13140/RG.2.2.24897.3312
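As a generic illustration of training with a Gibbs (Boltzmann) policy, the sketch below runs tabular Q-learning with softmax action selection on a small random MDP. This is not the paper's modified Gibbs policy for the linear-function-approximation setting; the dynamics and all parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Small random MDP (assumed): nS states, nA actions
nS, nA, gamma, tau = 3, 2, 0.9, 1.0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state distribution
R = rng.normal(size=(nS, nA))                   # rewards

Q = np.zeros((nS, nA))
s = 0
for k in range(20000):
    # Gibbs (Boltzmann) policy: pi(a|s) proportional to exp(Q(s, a) / tau)
    logits = Q[s] / tau
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(nA, p=probs)
    s2 = rng.choice(nS, p=P[s, a])
    alpha = 1.0 / (1.0 + k / 100.0)             # diminishing step size
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s2].max() - Q[s, a])
    s = s2
```

In the tabular case the iterates stay bounded by max|R| / (1 - gamma); the open question the paper addresses is what happens to such boundedness once Q is replaced by a linear function approximator.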
Order acceptance with reinforcement learning
Order Acceptance (OA) is one of the main functions in a business control framework. Basically, OA involves a 0/1 (i.e., reject/accept) decision for each order. Always accepting an order when capacity is available could prevent the system from accepting more profitable orders in the future. Another important aspect is the availability of information to the decision maker. We use a stochastic modeling approach based on Markov decision theory, together with learning methods from Artificial Intelligence, to deal with uncertainty and long-term decisions in OA. Reinforcement Learning (RL) is a relatively new approach that combines this modeling idea with a solution method. Here we report on RL solutions for some OA models.
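A minimal version of the OA setting can be cast as a small MDP and handled with tabular Q-learning: the state is the remaining capacity plus the type of the arriving order, and the action is the 0/1 reject/accept decision. All dynamics and rewards below are illustrative assumptions, not one of the paper's models.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy OA model: each period an order of a random type arrives;
# accepting it earns its reward and consumes one unit of capacity, and
# capacity replenishes by one unit with probability 0.3 (up to CAP).
CAP = 3
REWARDS = np.array([1.0, 5.0])      # low-value and high-value order types
P_TYPE = np.array([0.8, 0.2])
gamma, alpha, eps = 0.95, 0.1, 0.1

# Q[capacity, order_type, action], action 0 = reject, 1 = accept
Q = np.zeros((CAP + 1, 2, 2))
cap, otype = CAP, int(rng.choice(2, p=P_TYPE))
for k in range(200000):
    # epsilon-greedy behavior policy over the 0/1 decision
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[cap, otype].argmax())
    if a == 1 and cap > 0:
        reward, cap2 = float(REWARDS[otype]), cap - 1
    else:
        reward, cap2 = 0.0, cap
    if rng.random() < 0.3:          # slow capacity replenishment
        cap2 = min(cap2 + 1, CAP)
    otype2 = int(rng.choice(2, p=P_TYPE))
    Q[cap, otype, a] += alpha * (reward + gamma * Q[cap2, otype2].max() - Q[cap, otype, a])
    cap, otype = cap2, otype2
```

The learned Q-values capture the trade-off in the abstract: accepting every order when capacity is available is not necessarily optimal, because capacity spent on a low-value order cannot serve a later high-value one.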
Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes
We study offline reinforcement learning (RL) in the presence of unmeasured
confounders. Due to the lack of online interaction with the environment,
offline RL faces two significant challenges: (i) the agent
may be confounded by the unobserved state variables; (ii) the offline data
collected a priori may not provide sufficient coverage of the environment. To
tackle these challenges, we study policy learning in confounded
MDPs with the aid of instrumental variables. Specifically, we first establish
value function (VF)-based and marginalized importance sampling (MIS)-based
identification results for the expected total reward in the confounded MDPs.
Then, by leveraging pessimism and our identification results, we propose various
policy learning methods with finite-sample suboptimality guarantees for
finding the optimal in-class policy under minimal data coverage and modeling
assumptions. Lastly, our extensive theoretical investigations and a numerical
study motivated by kidney transplantation demonstrate the promising
performance of the proposed methods.
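The role of pessimism under poor data coverage can be illustrated in the simplest possible setting, a single-state problem with three actions: penalizing each empirical value estimate by an uncertainty term steers the learned policy away from poorly covered actions. The numbers below are assumptions, and the sketch deliberately omits the paper's instrumental-variable identification.

```python
import numpy as np

# Logged counts and empirical mean rewards for 3 actions in a single state;
# action 2 is barely covered, and its small sample looks good by chance.
counts = np.array([500, 400, 3])
means = np.array([0.99, 0.52, 1.50])    # action 2's mean is a lucky fluke

greedy = int(means.argmax())            # naive choice: the poorly covered action

# Pessimism: subtract an uncertainty penalty that shrinks with coverage
pessimistic = means - 1.0 / np.sqrt(counts)
safe = int(pessimistic.argmax())        # prefers the well-covered action
```

The same principle, a lower-confidence-bound correction, is what allows suboptimality guarantees under minimal coverage assumptions: the learned policy only competes on actions the data actually supports.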