8 research outputs found
Pricing Bermudan options using regression: Optimal rates of convergence for lower estimates
The problem of pricing Bermudan options using simulations and nonparametric regression is considered. We derive optimal non-asymptotic bounds for the low biased estimate based on a suboptimal stopping rule constructed from some estimates of the optimal continuation values. These estimates may be of different nature, they may be local or global, with the only requirement being that the deviations of these estimates from the true continuation values can be uniformly bounded in probability. As an illustration, we discuss a class of local polynomial estimates which, under some regularity conditions, yield continuation values estimates possessing the required property
Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming
We consider the classical nite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for fi nding the optimal Q-factors. Instead of policy evaluation by solving
a linear system of equations, our algorithm requires (possibly inexact) solution of a nonlinear system of equations, involving estimates of state costs as well as Q-factors. This is Bellman's equation for an optimal
stopping problem that can be solved with simple Q-learning iterations, in the case where a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99], in the case where feature-based Q-factor approximations are used. In exact/lookup table representation form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of
asynchronous/modi ed policy iteration, with lower overhead and more reliable convergence advantages over existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal di erence implementations are used, our algorithm resolves e ffectively the inherent difficulties of existing schemes due to inadequate exploration
Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming (Revised)
The revised technical report C-2010-10We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal Q-factors. Instead of policy evaluation by solving a linear system of equations, our algorithm requires (possibly inexact) solution of a nonlinear system of equations, involving estimates of state costs as well as Q-factors. This is Bellman's equation for an optimal stopping problem that can be solved with simple Q-learning iterations, in the case where a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99], in the case where feature-based Q-factor approximations are used. In exact/lookup table representation form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower overhead and/or more reliable convergence advantages over existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm resolves effectively the inherent difficulties of existing schemes due to inadequate exploration
New Error Bounds for Approximations from Projected Linear Equations
Joint Technical Report of U.H. and M.I.T.
Technical Report C-2008-43
Dept. Computer Science
University of Helsinki
and LIDS Report 2797
Dept. EECS
M.I.T.
July 2008; revised July 2009We consider linear fixed point equations and their approximations by projection on a low dimensional subspace. We derive new bounds on the approximation error of the solution, which are expressed in terms of low dimensional matrices and can be computed by simulation. When the fixed point mapping is a contraction, as is typically the case in Markov decision processes (MDP), one of our bounds is always sharper than the standard contraction-based bounds, and another one is often sharper. The former bound is also tight in a worst-case sense. Our bounds also apply to the non-contraction case, including policy evaluation in MDP with nonstandard projections that enhance exploration. There are no error bounds currently available for this case to our knowledge
Recommended from our members
Approximate dynamic programming for large scale systems
Sequential decision making under uncertainty is at the heart of a wide variety of practical problems. These problems can be cast as dynamic programs and the optimal value function can be computed by solving Bellman's equation. However, this approach is limited in its applicability. As the number of state variables increases, the state space size grows exponentially, a phenomenon known as the curse of dimensionality, rendering the standard dynamic programming approach impractical. An effective way of addressing curse of dimensionality is through parameterized value function approximation. Such an approximation is determined by relatively small number of parameters and serves as an estimate of the optimal value function. But in order for this approach to be effective, we need Approximate Dynamic Programming (ADP) algorithms that can deliver `good' approximation to the optimal value function and such an approximation can then be used to derive policies for effective decision-making. From a practical standpoint, in order to assess the effectiveness of such an approximation, there is also a need for methods that give a sense for the suboptimality of a policy. This thesis is an attempt to address both these issues. First, we introduce a new ADP algorithm based on linear programming, to compute value function approximations. LP approaches to approximate DP have typically relied on a natural `projection' of a well studied linear program for exact dynamic programming. Such programs restrict attention to approximations that are lower bounds to the optimal cost-to-go function. Our program -- the `smoothed approximate linear program' -- is distinct from such approaches and relaxes the restriction to lower bounding approximations in an appropriate fashion while remaining computationally tractable. The resulting program enjoys strong approximation guarantees and is shown to perform well in numerical experiments with the game of Tetris and queueing network control problem. Next, we consider optimal stopping problems with applications to pricing of high-dimensional American options. We introduce the pathwise optimization (PO) method: a new convex optimization procedure to produce upper and lower bounds on the optimal value (the `price') of high-dimensional optimal stopping problems. The PO method builds on a dual characterization of optimal stopping problems as optimization problems over the space of martingales, which we dub the martingale duality approach. We demonstrate via numerical experiments that the PO method produces upper bounds and lower bounds (via suboptimal exercise policies) of a quality comparable with state-of-the-art approaches. Further, we develop an approximation theory relevant to martingale duality approaches in general and the PO method in particular. Finally, we consider a broad class of MDPs and introduce a new tractable method for computing bounds by consider information relaxation and introducing penalty. The method delivers tight bounds by identifying the best penalty function among a parameterized class of penalty functions. We implement our method on a high-dimensional financial application, namely, optimal execution and demonstrate the practical value of the method vis-a-vis competing methods available in the literature. In addition, we provide theory to show that bounds generated by our method are provably tighter than some of the other available approaches