Policy evaluation with temporal differences: a survey and comparison
Policy evaluation is an essential step in most reinforcement learning approaches. It yields a value function, the quality assessment of states for a given policy, which can be used in a policy improvement step. Since the late 1980s, this research area has been dominated by temporal-difference (TD) methods due to their data efficiency. However, core issues such as stability guarantees in the off-policy scenario, improved sample efficiency and probabilistic treatment of the uncertainty in the estimates have only been tackled recently, which has led to a large number of new approaches.
This paper aims to make these new developments accessible in a concise overview, with a focus on the underlying cost functions, the off-policy scenario, and regularization in high-dimensional feature spaces. By presenting the first extensive, systematic comparative evaluation of TD, LSTD, LSPE, FPKF, the residual-gradient algorithm, Bellman residual minimization, GTD, GTD2 and TDC, we shed light on the strengths and weaknesses of these methods. Moreover, we present alternative versions of LSTD and LSPE with drastically improved off-policy performance.
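As a point of reference for the TD methods compared above, tabular TD(0) policy evaluation can be sketched in a few lines. The three-state chain below is an illustrative toy, not an experiment from the survey:

```python
# Tabular TD(0) policy evaluation on a tiny deterministic Markov chain:
# states 0 -> 1 -> 2 (terminal), reward 1.0 on reaching the terminal state.
# The chain and all constants are illustrative, not taken from the survey.

def td0_policy_evaluation(episodes=2000, alpha=0.1, gamma=1.0):
    V = [0.0, 0.0, 0.0]          # value estimates; state 2 is terminal
    for _ in range(episodes):
        s = 0
        while s != 2:
            s_next = s + 1       # fixed policy: always step right
            r = 1.0 if s_next == 2 else 0.0
            # TD(0) update: move V[s] toward the bootstrapped target
            target = r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
            s = s_next
    return V

print(td0_policy_evaluation())   # both non-terminal values approach 1.0
```

With an undiscounted deterministic chain, both non-terminal values converge to the true return of 1.0, which makes the bootstrapping in the update easy to check by hand.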
Adaptive Step-Sizes for Reinforcement Learning
The central theme motivating this dissertation is the desire to develop reinforcement learning algorithms that “just work” regardless of the domain in which they are applied. The largest impediment to this goal is the sensitivity of reinforcement learning algorithms to the step-size parameter used to rescale incremental updates. Adaptive step-size algorithms attempt to reduce this sensitivity, or eliminate the step-size parameter entirely, by automatically adjusting the step size throughout the learning process. Such algorithms provide an alternative to the standard “guess-and-check” approach to finding parameters, known as parameter tuning.
However, the problems with parameter tuning are currently masked by the way experiments are conducted and presented. In this dissertation we seek algorithms that perform well over a broad subset of reinforcement learning problems with minimal parameter tuning. To accomplish this we begin by addressing the limitations of current empirical methods in reinforcement learning and propose improvements with benefits far outside the area of adaptive step-sizes.
In order to study adaptive step-sizes in reinforcement learning, we show that the general form of the adaptive step-size problem is a combination of two dissociable problems (adaptive scalar step-size and update whitening). We then derive new parameter-free adaptive scalar step-size algorithms for the reinforcement learning algorithm Sarsa(λ) and use our improved empirical methods to conduct a thorough experimental study of step-size algorithms in reinforcement learning. Our adaptive algorithms (VES and PARL2) both eliminate the need for a tunable step-size parameter and perform at least as well as Sarsa(λ) with an optimized step-size value. We conclude by developing natural temporal difference algorithms that provide an approximate solution to the update whitening problem and improve performance over their non-natural counterparts.
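The scalar step-size problem the dissertation isolates can be illustrated with a simple normalized step size for linear TD(0). This is a generic sketch, not the VES or PARL2 algorithms: the 1/‖φ‖² cap is just the classical bound that keeps a single update from overshooting its own TD target on that sample.

```python
# Illustrative normalized step size for linear TD(0), not the VES/PARL2
# algorithms from the dissertation: setting alpha = 1 / ||phi||^2 makes one
# update land exactly on its own TD target for the current sample.

def normalized_td_update(w, phi, reward, phi_next, gamma=0.99):
    """One linear TD(0) update with a per-sample normalized step size."""
    v = sum(wi * fi for wi, fi in zip(w, phi))
    v_next = sum(wi * fi for wi, fi in zip(w, phi_next))
    delta = reward + gamma * v_next - v          # TD error
    norm_sq = sum(fi * fi for fi in phi) or 1.0  # guard the zero-feature case
    alpha = 1.0 / norm_sq                        # parameter-free step size
    return [wi + alpha * delta * fi for wi, fi in zip(w, phi)]
```

After the update, the new prediction w·φ equals the old target exactly (holding v_next fixed), regardless of the feature scale — which is the sensitivity that a hand-tuned constant step size cannot provide.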
Ensemble Reinforcement Learning: A Survey
Reinforcement Learning (RL) has emerged as a highly effective technique for addressing various scientific and applied problems. Despite its success, certain complex tasks remain challenging to address with a single model and algorithm alone. In response, ensemble reinforcement learning (ERL), a promising approach that combines the benefits of both RL and ensemble learning (EL), has gained widespread popularity. ERL leverages multiple models or training algorithms to comprehensively explore the problem space and possesses strong generalization capabilities. In this study, we present a comprehensive survey on ERL to provide readers with an overview of recent advances and challenges in the field. First, we introduce the background and motivation for ERL. Second, we analyze in detail the strategies that have been successfully applied in ERL, including model averaging, model selection, and model combination. Subsequently, we summarize the datasets and analyze the algorithms used in relevant studies. Finally, we outline several open questions and discuss future research directions of ERL. By providing a guide for future scientific research and engineering applications, this survey contributes to the advancement of ERL.
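Model averaging, the first ERL strategy listed in the abstract, can be sketched in a few lines. The Q-tables below are made-up toy values, not data from the survey:

```python
# Minimal sketch of one ERL strategy, model averaging: combine the action
# values of several independently trained Q-functions and act greedily on
# the mean. The toy Q-tables below are illustrative, not from the survey.

def ensemble_action(q_tables, state):
    """Pick the action whose Q-value, averaged over the ensemble, is highest."""
    n_actions = len(q_tables[0][state])
    avg = [sum(q[state][a] for q in q_tables) / len(q_tables)
           for a in range(n_actions)]
    return max(range(n_actions), key=avg.__getitem__)

# Three learners disagree on state 0; the averaged vote prefers action 1.
members = [
    {0: [0.2, 0.9]},
    {0: [0.6, 0.5]},
    {0: [0.1, 0.7]},
]
print(ensemble_action(members, 0))  # -> 1
```

Here one learner would choose action 0 on its own, but averaging over the ensemble smooths out its estimation error — the basic mechanism behind the generalization benefit described above.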
Adaptive Railway Traffic Control using Approximate Dynamic Programming
Railway networks around the world have become challenging to operate in recent decades, with a mixture of track layouts running several different classes of trains at varying operational speeds. This complexity is a result of the sustained increase in passenger numbers: in many countries, railways are now more popular than ever as a means of commuting to cities. To address these operational challenges, governments and railway undertakings are encouraging the development of intelligent and digital transport systems that regulate and optimise train operations in real time, increasing capacity and customer satisfaction through improved use of existing railway infrastructure.

Accordingly, this thesis presents an adaptive railway traffic control system for real-time operations based on a data-based approximate dynamic programming (ADP) approach with integrated reinforcement learning (RL). By assessing requirements and opportunities, the controller aims to reduce the delays of trains that entered a control area behind schedule by re-scheduling control plans in real time at critical locations in a timely manner. The present data-based approach depends on an approximation to the value function of dynamic programming after optimisation from a specified state, which is estimated dynamically from operational experience using RL techniques. By using this approximation, ADP avoids extensive explicit evaluation of performance and so reduces the computational burden substantially. This thesis explores formulations of the approximation function and variants of the RL techniques used to estimate it. Evaluation of the controller shows considerable reductions in delay compared with current industry practice.
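The one-step lookahead at the heart of such an ADP controller can be sketched as follows. The cost, transition, and feature definitions here are placeholders, not the thesis's actual formulation:

```python
# Hedged sketch of the ADP decision step described above: at a decision
# point, score each candidate control action by its immediate cost plus a
# learned linear approximation of the future cost, then pick the cheapest.
# Feature and cost definitions are placeholders, not the thesis's model.

def approx_value(weights, features):
    """Linear value-function approximation: V(s) ~ w . phi(s)."""
    return sum(w * f for w, f in zip(weights, features))

def choose_action(state, actions, transition, cost, features, weights):
    """One-step lookahead using the approximate value function."""
    def score(a):
        s_next = transition(state, a)
        return cost(state, a) + approx_value(weights, features(s_next))
    return min(actions, key=score)
```

The weights would be estimated from operational experience with an RL update (e.g. temporal differences), which is what lets the controller avoid the explicit performance evaluation a full dynamic program would require.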