Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting
In reinforcement learning (RL), one of the key components is policy
evaluation, which aims to estimate the value function (i.e., expected long-term
accumulated reward) of a policy. With a good policy evaluation method, the RL
algorithms will estimate the value function more accurately and find a better
policy. When the state space is large or continuous, Gradient-based
Temporal Difference (GTD) policy evaluation algorithms with linear function
approximation are widely used. Considering that the collection of the
evaluation data is both time- and reward-consuming, a clear understanding of the
finite sample performance of the policy evaluation algorithms is very important
to reinforcement learning. Under the assumption that data are i.i.d. generated,
previous work provided the finite sample analysis of the GTD algorithms with
constant step size by converting them into convex-concave saddle point
problems. However, it is well known that in RL problems the data are
generated from Markov processes rather than being i.i.d. In this paper, in the realistic
Markov setting, we derive the finite sample bounds for the general
convex-concave saddle point problems, and hence for the GTD algorithms. We have
the following observations based on our bounds. (1) With various choices of
step size, the GTD algorithms converge. (2) The convergence rate is determined by the
step size, with the mixing time of the Markov process appearing as a coefficient: the
faster the Markov process mixes, the faster the convergence. (3) We explain why the
experience replay trick is effective: it improves the mixing property of the
Markov process. To the best of our knowledge, our analysis is the first to
provide finite sample bounds for the GTD algorithms in the Markov setting.
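For readers who want a concrete picture of the algorithms being analyzed, the following is a minimal sketch of a GTD2-style two-timescale update with linear function approximation and diminishing step sizes; all names, the step-size schedule, and the stream interface are illustrative assumptions, not code from the paper.

```python
import numpy as np

def gtd2(stream, d, gamma=0.99, a=0.5, b=1.0):
    """Illustrative GTD2 update on a Markovian stream of transitions.

    `stream` is assumed to yield (phi, r, phi_next), where phi is the
    d-dimensional feature vector of the current state.  Step-size schedules
    are placeholders; the paper studies how such choices affect the bounds.
    """
    theta = np.zeros(d)   # value-function (primal) weights, slow timescale
    w = np.zeros(d)       # auxiliary (dual) weights, fast timescale
    for t, (phi, r, phi_next) in enumerate(stream, start=1):
        alpha = a / t          # slow, diminishing step size
        beta = b / t ** 0.66   # faster step size for the auxiliary weights
        delta = r + gamma * theta @ phi_next - theta @ phi   # TD error
        theta += alpha * (phi - gamma * phi_next) * (w @ phi)
        w += beta * (delta - w @ phi) * phi
    return theta
```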
Finite-sample Analysis of Greedy-GQ with Linear Function Approximation under Markovian Noise
Greedy-GQ is an off-policy two timescale algorithm for optimal control in
reinforcement learning. This paper develops the first finite-sample analysis
for the Greedy-GQ algorithm with linear function approximation under Markovian
noise. Our finite-sample analysis provides theoretical justification for
choosing stepsizes for this two timescale algorithm for faster convergence in
practice, and suggests a trade-off between the convergence rate and the quality
of the obtained policy. Our paper extends the finite-sample analyses of two
timescale reinforcement learning algorithms from policy evaluation to optimal
control, which is of more practical interest. Specifically, in contrast to
existing finite-sample analyses for two timescale methods, e.g., GTD, GTD2 and
TDC, where their objective functions are convex, the objective function of the
Greedy-GQ algorithm is non-convex. Moreover, the Greedy-GQ algorithm is also
not a linear two-timescale stochastic approximation algorithm. Our techniques
in this paper provide a general framework for finite-sample analysis of
non-convex value-based reinforcement learning algorithms for optimal control. Comment: UAI 202
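As a rough illustration of the kind of update the analysis covers, here is a sketch of the standard Greedy-GQ recursion with linear state-action features; the function and variable names are hypothetical and the constant step sizes are placeholders, not the paper's choices.

```python
import numpy as np

def greedy_gq(stream, feats, actions, d, gamma=0.99, alpha=1e-3, beta=1e-2):
    """Sketch of the two-timescale Greedy-GQ update for optimal control.

    `stream` yields behavior-policy transitions (s, a, r, s_next) and
    `feats(s, a)` returns a d-dimensional feature vector; both are assumed
    interfaces for illustration only.
    """
    theta = np.zeros(d)   # Q-function weights (slow timescale)
    w = np.zeros(d)       # correction weights (fast timescale)
    for s, a, r, s_next in stream:
        phi = feats(s, a)
        # greedy action at the next state under the current Q estimate
        a_star = max(actions, key=lambda a2: theta @ feats(s_next, a2))
        phi_next = feats(s_next, a_star)
        delta = r + gamma * theta @ phi_next - theta @ phi   # TD error
        theta += alpha * (delta * phi - gamma * (w @ phi) * phi_next)
        w += beta * (delta - w @ phi) * phi
    return theta
```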
On a convergent off-policy temporal difference learning algorithm in on-line learning environment
In this paper we provide a rigorous convergence analysis of an "off-policy"
temporal difference learning algorithm with linear function approximation and
per-time-step linear computational complexity in an "online" learning environment.
The algorithm considered here is TDC with importance weighting introduced by
Maei et al. We support our theoretical results by providing suitable empirical
results for standard off-policy counterexamples. Comment: 14 pages. arXiv admin note: text overlap with arXiv:1503.0910
Finite-Sample Analysis of Proximal Gradient TD Algorithms
In this paper, we analyze the convergence rate of the gradient temporal
difference learning (GTD) family of algorithms. Previous analyses of this class
of algorithms use ODE techniques to prove asymptotic convergence, and to the
best of our knowledge, no finite-sample analysis has been done. Moreover, there
has not been much work on finite-sample analysis for convergent off-policy
reinforcement learning algorithms. In this paper, we formulate GTD methods as
stochastic gradient algorithms w.r.t. a primal-dual saddle-point objective
function, and then conduct a saddle-point error analysis to obtain
finite-sample bounds on their performance. Two revised algorithms are also
proposed, namely projected GTD2 and GTD2-MP, which offer improved convergence
guarantees and acceleration, respectively. The results of our theoretical
analysis show that the GTD family of algorithms is indeed comparable to the
existing LSTD methods in off-policy learning scenarios. Comment: 31st Conference on Uncertainty in Artificial Intelligence (UAI).
arXiv admin note: substantial text overlap with arXiv:2006.0397
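For readers unfamiliar with this formulation, the saddle-point objective alluded to above can be written, in standard GTD notation (a generic restatement rather than an excerpt from the paper), as $\min_{\theta}\max_{w} L(\theta, w) = \langle b - A\theta, w\rangle - \tfrac{1}{2}\|w\|_C^2$ with $A = \mathbb{E}[\phi_t(\phi_t - \gamma\phi_{t+1})^\top]$, $b = \mathbb{E}[r_t\phi_t]$ and $C = \mathbb{E}[\phi_t\phi_t^\top]$; maximizing over $w$ in closed form gives $\tfrac{1}{2}\|b - A\theta\|_{C^{-1}}^2$, i.e. half the MSPBE, and stochastic gradient descent-ascent on $L$ recovers the GTD2 updates.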
A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning
One of the main obstacles to broad application of reinforcement learning
methods is the parameter sensitivity of our core learning algorithms. In many
large-scale applications, online computation and function approximation
represent key strategies in scaling up reinforcement learning algorithms. In
this setting, we have effective and reasonably well understood algorithms for
adapting the learning-rate parameter, online during learning. Such
meta-learning approaches can improve robustness of learning and enable
specialization to the current task, improving learning speed. For
temporal-difference learning algorithms which we study here, there is yet
another parameter, λ, that similarly impacts learning speed and
stability in practice. Unfortunately, unlike the learning-rate parameter,
λ parametrizes the objective function that temporal-difference methods
optimize. Different choices of λ produce different fixed-point
solutions, and thus adapting λ online and characterizing the
optimization is substantially more complex than adapting the learning-rate
parameter. There is no meta-learning method for λ that can achieve (1)
incremental updating, (2) compatibility with function approximation, and (3)
stability of learning under both on- and off-policy sampling. In this
paper we contribute a novel objective function for optimizing λ as a
function of state rather than time. We derive a new incremental, linear
complexity λ-adaptation algorithm that does not require offline batch
updating or access to a model of the world, and present a suite of experiments
illustrating the practicality of our new algorithm in three different settings.
Taken together, our contributions represent a concrete step towards black-box
application of temporal-difference learning methods in real-world problems.
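To fix ideas, the sketch below shows where a state-dependent trace parameter enters the standard accumulating-trace TD(λ) update; it does not reproduce the paper's objective or its adaptation rule, and `lam_of` is a hypothetical placeholder for however λ(s) is chosen.

```python
import numpy as np

def td_lambda_state_dependent(stream, lam_of, d, gamma=0.99, alpha=1e-2):
    """TD(lambda) with a trace parameter that depends on the current state.

    `stream` yields (phi, r, phi_next) and `lam_of(phi)` returns the value of
    lambda used at the current state; both interfaces are assumptions made for
    this sketch.  How lambda(s) should be chosen is exactly what the paper's
    objective addresses and is not reproduced here.
    """
    theta = np.zeros(d)   # value-function weights
    z = np.zeros(d)       # accumulating eligibility trace
    for phi, r, phi_next in stream:
        delta = r + gamma * theta @ phi_next - theta @ phi   # TD error
        z = gamma * lam_of(phi) * z + phi                    # state-dependent trace decay
        theta += alpha * delta * z
    return theta
```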
Investigating practical linear temporal difference learning
Off-policy reinforcement learning has many applications including: learning
from demonstration, learning multiple goal seeking policies in parallel, and
representing predictive knowledge. Recently there has been a proliferation of
new policy-evaluation algorithms that fill a longstanding algorithmic void in
reinforcement learning: combining robustness to off-policy sampling, function
approximation, linear complexity, and temporal difference (TD) updates. This
paper contains two main contributions. First, we derive two new hybrid TD
policy-evaluation algorithms, which fill a gap in this collection of
algorithms. Second, we perform an empirical comparison to elicit which of these
new linear TD methods should be preferred in different situations, and make
concrete suggestions about practical use. Comment: Autonomous Agents and Multi-agent Systems, 201
Direct Gradient Temporal Difference Learning
Off-policy learning enables a reinforcement learning (RL) agent to reason
counterfactually about policies that are not executed and is one of the most
important ideas in RL. It, however, can lead to instability when combined with
function approximation and bootstrapping, two arguably indispensable
ingredients for large-scale reinforcement learning. This is the notorious
deadly triad. Gradient Temporal Difference (GTD) is one powerful tool to solve
the deadly triad. Its success results from solving a double sampling issue
indirectly with weight duplication or Fenchel duality. In this paper, we
instead propose a direct method to solve the double sampling issue by simply
using two samples in a Markovian data stream with an increasing gap. The
resulting algorithm is as computationally efficient as GTD but gets rid of
GTD's extra weights. The only price we pay is a logarithmically increasing
memory as time progresses. We provide both asymptotic and finite sample
analysis, where the convergence rate is on-par with the canonical on-policy
temporal difference learning. Key to our analysis is a novel refined
discretization of limiting ODEs. Comment: Submitted to JMLR in Apr 202
On the Global Convergence of Actor-Critic: A Case for Linear Quadratic Regulator with Ergodic Cost
Despite the empirical success of the actor-critic algorithm, its theoretical
understanding lags behind. In a broader context, actor-critic can be viewed as
an online alternating update algorithm for bilevel optimization, whose
convergence is known to be fragile. To understand the instability of
actor-critic, we focus on its application to linear quadratic regulators, a
simple yet fundamental setting of reinforcement learning. We establish a
nonasymptotic convergence analysis of actor-critic in this setting. In
particular, we prove that actor-critic finds a globally optimal pair of actor
(policy) and critic (action-value function) at a linear rate of convergence.
Our analysis may serve as a preliminary step towards a complete theoretical
understanding of bilevel optimization with nonconvex subproblems, which is
NP-hard in the worst case and is often solved using heuristics. Comment: 41 page
Concentration bounds for temporal difference learning with linear function approximation: The case of batch data and uniform sampling
We propose a stochastic approximation (SA) based method with randomization of
samples for policy evaluation using the least squares temporal difference
(LSTD) algorithm. Our proposed scheme is equivalent to running regular temporal
difference learning with linear function approximation, albeit with samples
picked uniformly from a given dataset. Our method results in an O(d)
improvement in complexity in comparison to LSTD, where d is the dimension of
the data. We provide non-asymptotic bounds for our proposed method, both in
high probability and in expectation, under the assumption that the matrix
underlying the LSTD solution is positive definite. The latter assumption can be
easily satisfied for the pathwise LSTD variant proposed in [23]. Moreover, we
also establish that using our method in place of LSTD does not impact the rate
of convergence of the approximate value function to the true value function.
These rate results coupled with the low computational complexity of our method
make it attractive for implementation in big data settings, where d is large.
A similar low-complexity alternative for least squares regression is well-known
as the stochastic gradient descent (SGD) algorithm. We provide finite-time
bounds for SGD. We demonstrate the practicality of our method as an efficient
alternative for pathwise LSTD empirically by combining it with the least
squares policy iteration (LSPI) algorithm in a traffic signal control
application. We also conduct another set of experiments that combines the SA
based low-complexity variant for least squares regression with the LinUCB
algorithm for contextual bandits, using the large scale news recommendation
dataset from Yahoo.
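The core scheme described above is simple to sketch: plain TD(0) with linear features, but with each update driven by a transition drawn uniformly at random from a fixed batch. Names and the constant step size below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def td0_uniform_from_batch(dataset, d, n_iters, gamma=0.99, alpha=1e-2, seed=0):
    """TD(0) with linear features, sampling transitions uniformly from a batch.

    `dataset` is assumed to be a list of (phi, r, phi_next) tuples collected
    offline; each iteration picks one uniformly at random and applies a
    regular TD(0) update.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    for _ in range(n_iters):
        phi, r, phi_next = dataset[rng.integers(len(dataset))]  # uniform sample
        delta = r + gamma * theta @ phi_next - theta @ phi      # TD error
        theta += alpha * delta * phi                             # TD(0) step
    return theta
```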
Distributed Policy Evaluation Under Multiple Behavior Strategies
We apply diffusion strategies to develop a fully-distributed cooperative
reinforcement learning algorithm in which agents in a network communicate only
with their immediate neighbors to improve predictions about their environment.
The algorithm can also be applied to off-policy learning, meaning that the
agents can predict the response to a behavior different from the actual
policies they are following. The proposed distributed strategy is efficient,
with linear complexity in both computation time and memory footprint. We
provide a mean-square-error performance analysis and establish convergence
under constant step-size updates, which endow the network with continuous
learning capabilities. The results show a clear gain from cooperation: when the
individual agents can estimate the solution, cooperation increases stability
and reduces bias and variance of the prediction error; but, more importantly,
the network is able to approach the optimal solution even when none of the
individual agents can (e.g., when the individual behavior policies restrict
each agent to sample a small portion of the state space). Comment: 36 pages, 4 figures, accepted for publication in IEEE Transactions on
Automatic Control
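The cooperative structure can be sketched as an adapt-then-combine diffusion step: each agent first performs a local update from its own sample and then averages its neighbors' intermediate estimates with fixed combination weights. The sketch below uses a plain TD(0) step in place of the paper's off-policy update, and all names are illustrative.

```python
import numpy as np

def diffusion_td_step(thetas, transitions, C, gamma=0.99, alpha=1e-2):
    """One adapt-then-combine diffusion step for cooperative policy evaluation.

    `thetas[k]` is agent k's weight vector, `transitions[k]` is its current
    (phi, r, phi_next) sample, and `C[l, k]` are combination weights (each
    column sums to one, nonzero only for neighbors of k).  These interfaces
    are assumptions for the sketch.
    """
    n = len(thetas)
    psi = []
    for k in range(n):                               # local adaptation step
        phi, r, phi_next = transitions[k]
        delta = r + gamma * thetas[k] @ phi_next - thetas[k] @ phi
        psi.append(thetas[k] + alpha * delta * phi)
    psi = np.stack(psi)                              # shape (n, d)
    return [C[:, k] @ psi for k in range(n)]         # combine over neighbors
```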