The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation
Function approximation is widely used in reinforcement learning to handle the
computational difficulties associated with very large state spaces. However,
function approximation introduces errors which may lead to instabilities when
using approximate dynamic programming techniques to obtain the optimal policy.
Therefore, techniques such as lookahead for policy improvement and m-step
rollout for policy evaluation are used in practice to improve the performance
of approximate dynamic programming with function approximation. We
quantitatively characterize, for the first time, the impact of lookahead and
m-step rollout on the performance of approximate dynamic programming (DP) with
function approximation: (i) without a sufficient combination of lookahead and
m-step rollout, approximate DP may not converge, (ii) both lookahead and m-step
rollout improve the convergence rate of approximate DP, and (iii) lookahead
helps mitigate the effect of function approximation and the discount factor on
the asymptotic performance of the algorithm. Our results are presented for two
approximate DP methods: one which uses least-squares regression to perform
function approximation and another which performs several steps of gradient
descent of the least-squares objective in each iteration.
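A minimal sketch of the scheme this abstract studies, on a toy random MDP with linear value function approximation; all names, sizes, and the MDP itself are hypothetical stand-ins, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, d, gamma, H, m = 20, 3, 4, 0.9, 3, 5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a] = next-state dist.
R = rng.random((nS, nA))                         # rewards r(s, a)
Phi = rng.random((nS, d))                        # linear features

def lookahead_policy(theta):
    # Policy improvement via H-step lookahead on top of V = Phi @ theta.
    v = Phi @ theta
    for _ in range(H - 1):
        v = (R + gamma * P @ v).max(axis=1)      # exact backups below the root
    return (R + gamma * P @ v).argmax(axis=1)    # greedy at the root

def rollout_targets(pi, theta):
    # Policy evaluation via an m-step rollout, bootstrapping with Phi @ theta.
    Ppi, Rpi = P[np.arange(nS), pi], R[np.arange(nS), pi]
    y = Phi @ theta
    for _ in range(m):
        y = Rpi + gamma * Ppi @ y
    return y

theta = np.zeros(d)
for _ in range(50):                              # approximate DP iterations
    pi = lookahead_policy(theta)
    y = rollout_targets(pi, theta)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares regression
```

The final line corresponds to the first of the two methods the abstract mentions (least-squares regression); the gradient-descent variant would replace it with a few gradient steps on the same objective.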
Dynamic shortest path problem with travel-time-dependent stochastic disruptions: hybrid approximate dynamic programming algorithms with a clustering approach
We consider a dynamic shortest path problem with stochastic disruptions in the network. We use both historical information and real-time information about the network for the dynamic routing decisions. We model the problem as a discrete-time finite-horizon Markov Decision Process (MDP). For networks with many levels of disruption, the MDP faces the curses of dimensionality. We first apply an Approximate Dynamic Programming (ADP) algorithm with a standard value function approximation. Then, we improve the ADP algorithm by exploiting the structure of the disruption transition functions. We develop a hybrid ADP with a clustering approach using both a deterministic lookahead policy and a value function approximation. We develop a test bed of networks to evaluate the quality of the solutions. The hybrid ADP algorithm with the clustering approach significantly reduces the computational time while still providing good-quality solutions. Keywords: Dynamic shortest path problem, Approximate Dynamic Programming, Disruption handling, Clustering
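A hypothetical sketch (not the authors' code) of the hybrid decision rule just described: a deterministic one-step lookahead over expected arc travel times combined with a value function approximation for the cost-to-go; the clustering of disruption levels is omitted here.

```python
def choose_next_node(node, level, neighbors, travel_time, vfa):
    """Greedy lookahead: expected arc time plus approximate cost-to-go.

    travel_time(u, v, level): expected time on arc (u, v) at disruption level.
    vfa(v, level): approximate remaining travel time from node v.
    """
    return min(neighbors[node],
               key=lambda v: travel_time(node, v, level) + vfa(v, level))
```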
Policy Search: Any Local Optimum Enjoys a Global Performance Guarantee
Local Policy Search is a popular reinforcement learning approach for handling large state spaces. Formally, it searches locally in a parameterized policy space in order to maximize the associated value function averaged over some predefined distribution. It is probably commonly believed that the best one can hope for in general from such an approach is a local optimum of this criterion. In this article, we show the following surprising result: \emph{any} (approximate) \emph{local optimum} enjoys a \emph{global performance guarantee}. We compare this guarantee with the one satisfied by Direct Policy Iteration, an approximate dynamic programming algorithm that does some form of Policy Search: while the approximation error of Local Policy Search may generally be bigger (because local search requires considering a space of stochastic policies), we argue that the concentrability coefficient that appears in the performance bound is much nicer. Finally, we discuss several practical and theoretical consequences of our analysis.
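In the notation suggested by the abstract (our rendering, with $\nu$ the predefined state distribution and $\pi_\theta$ the parameterized stochastic policy), the criterion Local Policy Search maximizes is

J_\nu(\theta) = \mathbb{E}_{s \sim \nu}\left[ v_{\pi_\theta}(s) \right] = \sum_{s} \nu(s)\, v_{\pi_\theta}(s),

and the result says any approximate local maximum of $J_\nu$ carries a global performance guarantee.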
Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL
Recent works have shown that tackling offline reinforcement learning (RL)
with a conditional policy produces promising results. The Decision Transformer
(DT) combines the conditional policy approach and a transformer architecture,
showing competitive performance against several benchmarks. However, DT lacks
stitching ability -- one of the critical abilities for offline RL to learn the
optimal policy from sub-optimal trajectories. This issue becomes particularly
significant when the offline dataset only contains sub-optimal trajectories. On
the other hand, the conventional RL approaches based on Dynamic Programming
(such as Q-learning) do not have the same limitation; however, they suffer from
unstable learning behaviours, especially when they rely on function
approximation in an off-policy learning setting. In this paper, we propose the
Q-learning Decision Transformer (QDT) to address the shortcomings of DT by
leveraging the benefits of Dynamic Programming (Q-learning). QDT uses the Dynamic Programming results to relabel the return-to-go values in the training data and then trains the DT on the relabelled data. Our approach efficiently exploits the benefits of the two approaches, each compensating for the other's shortcomings, to achieve better performance. We demonstrate this empirically on both simple toy environments and the more complex D4RL benchmark, where QDT shows competitive performance gains.
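A minimal sketch of the relabelling step described above, under our reading of the abstract: back a learned value estimate up through each trajectory and take the max with the observed return-to-go, so sub-optimal trajectories receive "stitched" targets. The function v_fn, the trajectory format, and the use of a discount are our assumptions, not the paper's API.

```python
import numpy as np

def relabel_returns_to_go(trajectory, v_fn, gamma=1.0):
    """trajectory: list of (state, action, reward) tuples."""
    rtg = np.zeros(len(trajectory))
    future = 0.0
    for t in reversed(range(len(trajectory))):
        state, _action, reward = trajectory[t]
        # Replace the empirical return-to-go whenever the learned value
        # function promises more than the trajectory actually achieved.
        future = max(reward + gamma * future, v_fn(state))
        rtg[t] = future
    return rtg  # train the Decision Transformer on these relabelled targets
```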
Approximation Benefits of Policy Gradient Methods with Aggregated States
Folklore suggests that policy gradient can be more robust to misspecification
than its relative, approximate policy iteration. This paper studies the case of
state-aggregation, where the state space is partitioned and either the policy
or value function approximation is held constant over partitions. This paper
shows that a policy gradient method converges to a policy whose per-period regret is bounded by $\epsilon$, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as $\epsilon/(1-\gamma)$, where $\gamma$ is a discount factor. Theoretical results synthesize recent analysis of policy gradient methods with insights of Van Roy (2006) into the critical role of state-relevance weights in approximate dynamic programming.
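A toy illustration (not the paper's code) of the state-aggregated representation this abstract studies: the state space is partitioned and the softmax policy is held constant over each partition, so all states in a partition share one parameter row. Sizes and the random partition are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, nK = 12, 2, 3                   # states, actions, partitions
partition = rng.integers(nK, size=nS)   # partition[s] -> aggregate index
theta = np.zeros((nK, nA))              # one parameter row per partition

def policy(s):
    # Every state in a partition shares the same action distribution.
    logits = theta[partition[s]]
    p = np.exp(logits - logits.max())
    return p / p.sum()
```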