Two Timescale Stochastic Approximation with Controlled Markov noise and Off-policy temporal difference learning
We present for the first time an asymptotic convergence analysis of two
time-scale stochastic approximation driven by "controlled" Markov noise. In
particular, both the faster and slower recursions have non-additive controlled
Markov noise components in addition to martingale difference noise. We analyze
the asymptotic behavior of our framework by relating it to limiting
differential inclusions in both time-scales that are defined in terms of the
ergodic occupation measures associated with the controlled Markov processes.
Finally, we present a solution to the off-policy convergence problem for
temporal difference learning with linear function approximation, using our
results.
Comment: 23 pages (relaxed some important assumptions from the previous version), accepted in Mathematics of Operations Research in Feb, 201
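The coupled fast/slow recursions analyzed above can be sketched generically. This is a minimal illustration of the two-timescale structure only, with additive Gaussian noise standing in for the martingale-difference term; the drift functions, step sizes, and the toy fixed point below are my own choices, not the paper's setting (which has non-additive controlled Markov noise).

```python
import numpy as np

rng = np.random.default_rng(0)

def two_timescale_sa(f, g, w0, theta0, n_iters=10_000):
    """Generic two-timescale stochastic approximation sketch.

    Fast recursion:  w_{n+1}     = w_n + b_n * (f(w_n, theta_n) + noise)
    Slow recursion:  theta_{n+1} = theta_n + a_n * (g(w_n, theta_n) + noise)

    With a_n / b_n -> 0, the fast iterate w tracks its equilibrium for
    the current theta, while theta evolves on the slower timescale.
    """
    w = np.asarray(w0, dtype=float)
    theta = np.asarray(theta0, dtype=float)
    for n in range(1, n_iters + 1):
        b_n = 1.0 / n ** 0.6   # faster (larger) step size
        a_n = 1.0 / n          # slower step size, a_n / b_n -> 0
        w = w + b_n * (f(w, theta) + 0.01 * rng.standard_normal(w.shape))
        theta = theta + a_n * (g(w, theta) + 0.01 * rng.standard_normal(theta.shape))
    return w, theta

# Toy example: the fast variable tracks theta, the slow one is driven to 1.
f = lambda w, th: th - w    # fast equilibrium: w*(theta) = theta
g = lambda w, th: 1.0 - w   # slow equilibrium: theta* = 1
w_hat, theta_hat = two_timescale_sa(f, g, w0=0.0, theta0=0.0)
```

With decreasing step sizes on both timescales, both iterates settle near the joint equilibrium (here, 1).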
The Essential Dynamics Algorithm: Essential Results
This paper presents a novel algorithm for learning in a class of stochastic Markov decision processes (MDPs) with continuous state and action spaces that trades speed for accuracy. A transform of the stochastic MDP into a deterministic one is presented which captures the essence of the original dynamics, in a sense made precise. In this transformed MDP, the calculation of values is greatly simplified. The online algorithm estimates the model of the transformed MDP and simultaneously does policy search against it. Bounds on the error of this approximation are proven, and experimental results in a bicycle riding domain are presented. The algorithm learns near optimal policies in orders of magnitude fewer interactions with the stochastic MDP, using less domain knowledge. All code used in the experiments is available on the project's web site
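One plausible reading of the deterministic transform is a surrogate MDP whose transition maps each state-action pair to the mean of the stochastic next-state distribution. That is only an illustration of the idea of a deterministic surrogate: the function names below are hypothetical, and the paper's transform is more refined than a plain mean.

```python
import numpy as np

def deterministic_transform(step_stochastic, n_samples=256, rng=None):
    """Hypothetical sketch: convert a stochastic transition s' ~ P(.|s, a)
    into a deterministic one by averaging sampled next states (a
    certainty-equivalent surrogate MDP). Value calculation against such a
    surrogate needs no expectation over transitions."""
    if rng is None:
        rng = np.random.default_rng(0)

    def step_deterministic(s, a):
        # Monte Carlo estimate of the mean next state for (s, a).
        samples = np.array([step_stochastic(s, a, rng) for _ in range(n_samples)])
        return samples.mean(axis=0)

    return step_deterministic

# Toy stochastic dynamics: s' = s + a + Gaussian noise.
step = lambda s, a, rng: s + a + 0.5 * rng.standard_normal()
det_step = deterministic_transform(step)
s_next = det_step(0.0, 1.0)   # close to the mean next state, 1.0
```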
Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning
The stochastic approximation algorithm is a widely used probabilistic method
for finding a zero of a vector-valued function, when only noisy measurements of
the function are available. In the literature to date, one can make a
distinction between "synchronous" updating, whereby every component of the
current guess is updated at each time, and "asynchronous" updating, whereby
only one component is updated. In principle, it is also possible to update, at
each time instant, some but not all components of the current guess, which
might be termed "batch asynchronous stochastic approximation" (BASA). One can
also make a distinction between using a "local" clock versus a "global" clock.
In this paper, we propose a unified formulation of batch asynchronous
stochastic approximation (BASA) algorithms, and develop a general methodology
for proving that such algorithms converge, irrespective of whether global or
local clocks are used. These convergence proofs make use of weaker hypotheses
than existing results. For example: existing convergence proofs when a local
clock is used require that the measurement noise is an i.i.d. sequence. Here, it
is assumed that the measurement errors form a martingale difference sequence.
Also, all results to date assume that the stochastic step sizes satisfy a
probabilistic analog of the Robbins-Monro conditions. We replace this by a
purely deterministic condition on the irreducibility of the underlying Markov
processes.
As specific applications to Reinforcement Learning, we introduce "batch"
versions of the temporal difference algorithm for value iteration, and
the Q-learning algorithm for finding the optimal action-value function, and
also permit the use of local clocks instead of a global clock. In all cases, we
establish the convergence of these algorithms, under milder conditions than in
the existing literature.
Comment: 27 pages
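The batch-asynchronous scheme with local clocks can be sketched as follows. This is a generic illustration, not the paper's algorithm: the drift function, noise model, and step-size rule below are my own assumptions, and the per-component step size simply uses each component's local update count.

```python
import numpy as np

rng = np.random.default_rng(0)

def basa(h, theta0, n_iters=20_000, batch_size=2):
    """Batch asynchronous stochastic approximation sketch: seek a zero of
    h from noisy evaluations, updating only a random subset of components
    at each step. Each component's step size is driven by its own "local
    clock" -- the number of times that component has been updated."""
    theta = np.asarray(theta0, dtype=float).copy()
    local_clock = np.zeros(theta.shape, dtype=int)
    for _ in range(n_iters):
        # Choose which components to update this instant (the "batch").
        idx = rng.choice(theta.size, size=batch_size, replace=False)
        local_clock[idx] += 1
        noisy = h(theta) + 0.05 * rng.standard_normal(theta.shape)
        # Per-component step size 1 / (local update count).
        theta[idx] += (1.0 / local_clock[idx]) * noisy[idx]
    return theta

# Toy example: find the zero of h(theta) = b - theta, i.e. theta* = b.
b = np.array([1.0, -2.0, 3.0, 0.5])
theta_star = basa(lambda th: b - th, theta0=np.zeros(4))
```

Setting `batch_size=1` recovers fully asynchronous updating, and `batch_size=theta.size` recovers synchronous updating, matching the distinction drawn in the abstract.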
Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint
The classic objective in a reinforcement learning (RL) problem is to find a
policy that minimizes, in expectation, a long-run objective such as the
infinite-horizon discounted or long-run average cost. In many practical
applications, optimizing the expected value alone is not sufficient, and it may
be necessary to include a risk measure in the optimization process, either as
the objective or as a constraint. Various risk measures have been proposed in
the literature, e.g., mean-variance tradeoff, exponential utility, the
percentile performance, value at risk, conditional value at risk, prospect
theory and its later enhancement, cumulative prospect theory. In this article,
we focus on the combination of risk criteria and reinforcement learning in a
constrained optimization framework, i.e., a setting where the goal is to find a
policy that optimizes the usual objective of infinite-horizon
discounted/average cost, while ensuring that an explicit risk constraint is
satisfied. We introduce the risk-constrained RL framework, cover popular risk
measures based on variance, conditional value-at-risk and cumulative prospect
theory, and present a template for a risk-sensitive RL algorithm. We survey
some of our recent work on this topic, covering problems encompassing
discounted cost, average cost, and stochastic shortest path settings, together
with the aforementioned risk measures in a constrained framework. This
non-exhaustive survey is aimed at giving a flavor of the challenges involved in
solving a risk-sensitive RL problem, and outlining some potential future
research directions
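The constrained template can be illustrated with one of the surveyed risk measures, CVaR, handled by Lagrangian relaxation. The function names, threshold, and toy cost distribution below are illustrative assumptions; an actual risk-sensitive RL algorithm would ascend in the multiplier and descend in the policy parameters on this quantity.

```python
import numpy as np

def cvar(costs, alpha=0.95):
    """Conditional value-at-risk at level alpha: the mean cost over the
    worst (1 - alpha) tail of the empirical cost distribution."""
    costs = np.sort(np.asarray(costs, dtype=float))
    var = np.quantile(costs, alpha)      # value-at-risk (tail cutoff)
    return costs[costs >= var].mean()

def lagrangian(costs, lam, threshold, alpha=0.95):
    """Risk-constrained objective sketch:
        minimize E[cost]  subject to  CVaR_alpha(cost) <= threshold,
    relaxed to  E[cost] + lam * (CVaR_alpha(cost) - threshold)."""
    return np.mean(costs) + lam * (cvar(costs, alpha) - threshold)

# Toy episode costs drawn from a heavy-ish one-sided distribution.
costs = np.random.default_rng(0).exponential(1.0, size=10_000)
L = lagrangian(costs, lam=0.5, threshold=4.0)
```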
Diffusion Approximations for Online Principal Component Estimation and Global Convergence
In this paper, we propose to adopt diffusion approximation tools to study
the dynamics of Oja's iteration, which is an online stochastic gradient descent
method for principal component analysis. Oja's iteration maintains a running
estimate of the true principal component from streaming data and enjoys low
time and space complexity. We show that Oja's iteration for the top eigenvector
generates a continuous-state discrete-time Markov chain over the unit sphere.
We characterize Oja's iteration in three phases using diffusion approximation
and weak convergence tools.
further provides a finite-sample error bound for the running estimate, which
matches the minimax information lower bound for principal component analysis
under the additional assumption of bounded samples.
Comment: Appeared in NIPS 201
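Oja's iteration for the top eigenvector can be sketched as below; each step takes a stochastic gradient step and renormalizes, so the iterates indeed live on the unit sphere. The step size and the toy covariance are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def oja_top_eigvec(stream, dim, eta=0.01):
    """Oja's iteration sketch: online estimate of the top eigenvector of
    E[x x^T] from streaming samples x. The normalized iterates form a
    continuous-state discrete-time Markov chain on the unit sphere."""
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)
    for x in stream:
        w = w + eta * x * (x @ w)   # stochastic gradient step
        w /= np.linalg.norm(w)      # project back onto the unit sphere
    return w

# Toy stream: coordinate 0 has variance 4, the rest variance 1, so the
# top eigenvector of the covariance is e_1.
def stream(n=20_000, dim=5):
    scale = np.ones(dim)
    scale[0] = 2.0
    for _ in range(n):
        yield scale * rng.standard_normal(dim)

w_hat = oja_top_eigvec(stream(), dim=5)   # aligns with +/- e_1
```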