Experiments with Infinite-Horizon, Policy-Gradient Estimation
In this paper, we present algorithms that perform gradient ascent of the
average reward in a partially observable Markov decision process (POMDP). These
algorithms are based on GPOMDP, an algorithm introduced in a companion paper
(Baxter and Bartlett, this volume), which computes biased estimates of the
performance gradient in POMDPs. The algorithm's chief advantages are that it
uses only one free parameter, beta, which has a natural interpretation as a
bias-variance trade-off; that it requires no knowledge of the underlying state;
and that it can be applied to infinite state, control, and observation spaces. We
show how the gradient estimates produced by GPOMDP can be used to perform
gradient ascent, both with a traditional stochastic-gradient algorithm, and
with an algorithm based on conjugate-gradients that utilizes gradient
information to bracket maxima in line searches. Experimental results are
presented illustrating both the theoretical results of (Baxter and Bartlett,
this volume) on a toy problem, and practical aspects of the algorithms on a
number of more realistic problems.
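As an illustration of the estimator underlying these experiments, the following is a minimal sketch of the GPOMDP eligibility-trace recursion with discount parameter beta; the environment interface (env.reset, env.step), the sampling routine policy, and grad_log_policy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gpomdp_gradient_estimate(env, policy, grad_log_policy, theta, beta, T, rng):
    """Sketch of the GPOMDP eligibility-trace estimate of the average-reward gradient.

    beta in [0, 1) discounts the eligibility trace: larger beta means lower bias
    but higher variance of the resulting gradient estimate.
    """
    obs = env.reset()
    z = np.zeros_like(theta)       # eligibility trace
    delta = np.zeros_like(theta)   # running gradient estimate
    for t in range(T):
        action = policy(theta, obs, rng)                 # sample a_t ~ pi_theta(. | obs)
        obs_next, reward = env.step(action)              # assumed interface
        z = beta * z + grad_log_policy(theta, obs, action)
        delta += (reward * z - delta) / (t + 1)          # running average of r_t * z_t
        obs = obs_next
    return delta   # biased estimate; bias vanishes as beta -> 1
```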
Linear Convergence for Natural Policy Gradient with Log-linear Policy Parametrization
We analyze the convergence rate of the unregularized natural policy gradient
algorithm with log-linear policy parametrizations in infinite-horizon
discounted Markov decision processes. In the deterministic case, when the
Q-value is known and can be approximated by a linear combination of a known
feature function up to a bias error, we show that a geometrically-increasing
step size yields a linear convergence rate towards an optimal policy. We then
consider the sample-based case, when the best representation of the Q-value
function among linear combinations of a known feature function is known up to
an estimation error. In this setting, we show that the algorithm enjoys the
same linear guarantees as in the deterministic case up to an error term that
depends on the estimation error, the bias error, and the condition number of
the feature covariance matrix. Our results build upon the general framework of
policy mirror descent and extend previous findings for the softmax tabular
parametrization to the log-linear policy class.
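The following is a minimal sketch of one natural policy gradient iteration for the log-linear class, in the common form where the update direction is the least-squares fit of Q-value estimates onto the feature map; the names phi, q_hat, and the geometric step-size schedule are illustrative assumptions.

```python
import numpy as np

def npg_loglinear_step(theta, phi, states, actions, q_hat, eta):
    """One sketch NPG update for a log-linear policy pi_theta(a|s) propto exp(theta^T phi(s,a)).

    The direction w is the least-squares fit of sampled Q-value estimates onto
    the features; the new parameters are theta + eta * w.
    """
    features = np.array([phi(s, a) for s, a in zip(states, actions)])
    targets = np.array([q_hat(s, a) for s, a in zip(states, actions)])
    w, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return theta + eta * w

def geometric_step_sizes(eta0=0.1, r=1.5):
    """Geometrically-increasing step sizes eta_t = eta0 * r**t, r > 1 (illustrative values)."""
    t = 0
    while True:
        yield eta0 * r**t
        t += 1
```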
Sharp high-probability sample complexities for policy evaluation with linear function approximation
This paper is concerned with the problem of policy evaluation with linear
function approximation in discounted infinite horizon Markov decision
processes. We investigate the sample complexities required to guarantee a
predefined estimation error of the best linear coefficients for two widely-used
policy evaluation algorithms: the temporal difference (TD) learning algorithm
and the two-timescale linear TD with gradient correction (TDC) algorithm. In
both the on-policy setting, where observations are generated from the target
policy, and the off-policy setting, where samples are drawn from a behavior
policy potentially different from the target policy, we establish the first
sample complexity bound with high-probability convergence guarantee that
attains the optimal dependence on the tolerance level. We also exhibit an
explicit dependence on problem-related quantities, and show in the on-policy
setting that our upper bound matches the minimax lower bound on crucial problem
parameters, including the choice of the feature maps and the problem dimension.
Comment: The first two authors contributed equally.
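For concreteness, below is a minimal sketch of the two policy evaluation algorithms with linear function approximation, TD(0) and two-timescale TDC; the environment interface, feature map phi, and step sizes are illustrative assumptions.

```python
import numpy as np

def td0_linear(env, policy, phi, dim, gamma, alpha, num_steps, rng):
    """Sketch of on-policy TD(0) with a linear value estimate V(s) ~= w^T phi(s)."""
    w = np.zeros(dim)
    s = env.reset()
    for _ in range(num_steps):
        s_next, r = env.step(policy(s, rng))
        delta = r + gamma * w @ phi(s_next) - w @ phi(s)   # TD error
        w += alpha * delta * phi(s)                        # semi-gradient update
        s = s_next
    return w

def tdc_linear(env, policy, phi, dim, gamma, alpha, zeta, num_steps, rng):
    """Sketch of two-timescale TDC: the auxiliary weights u are updated on the
    faster timescale (zeta > alpha) and correct the main update of w."""
    w, u = np.zeros(dim), np.zeros(dim)
    s = env.reset()
    for _ in range(num_steps):
        s_next, r = env.step(policy(s, rng))
        f, f_next = phi(s), phi(s_next)
        delta = r + gamma * w @ f_next - w @ f
        w += alpha * (delta * f - gamma * (f @ u) * f_next)
        u += zeta * (delta - f @ u) * f
        s = s_next
    return w
```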
Optimal Energy Allocation for Kalman Filtering over Packet Dropping Links with Imperfect Acknowledgments and Energy Harvesting Constraints
This paper presents a design methodology for optimal transmission energy
allocation at a sensor equipped with energy harvesting technology for remote
state estimation of linear stochastic dynamical systems. In this framework, the
sensor measurements as noisy versions of the system states are sent to the
receiver over a packet dropping communication channel. The packet dropout
probabilities of the channel depend on both the sensor's transmission energies
and time-varying wireless fading channel gains. The sensor has access to an
energy harvesting source, which is everlasting but unreliable compared to
conventional batteries with fixed energy storage. The receiver
performs optimal state estimation with random packet dropouts to minimize the
estimation error covariances based on received measurements. The receiver also
sends packet receipt acknowledgments to the sensor via an erroneous feedback
communication channel which is itself packet dropping.
The objective is to design optimal transmission energy allocation at the
energy harvesting sensor to minimize either a finite-time horizon sum or a long
term average (infinite-time horizon) of the trace of the expected estimation
error covariance of the receiver's Kalman filter. These problems are formulated
as Markov decision processes with imperfect state information. The optimal
transmission energy allocation policies are obtained by the use of dynamic
programming techniques. Using the concept of submodularity, the structure of
the optimal transmission energy policies is studied. Suboptimal solutions are
also discussed which are far less computationally intensive than optimal
solutions. Numerical simulation results are presented illustrating the
performance of the energy allocation algorithms.
Comment: Submitted to IEEE Transactions on Automatic Control. arXiv admin note: text overlap with arXiv:1402.663
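A minimal sketch of the receiver-side error-covariance recursion under packet dropouts, together with a Monte-Carlo evaluation of the finite-horizon sum of trace(P_k) for a fixed energy sequence, is given below; the map arrival_prob from transmission energy to packet-arrival probability and all matrix names are illustrative assumptions (in the paper the dropout probability also depends on the fading channel gain).

```python
import numpy as np

def covariance_update(P, A, C, Q, R, packet_received):
    """One step of the receiver's Kalman-filter error-covariance recursion:
    time update always, measurement update only if the packet arrives."""
    P_pred = A @ P @ A.T + Q                       # time update
    if not packet_received:
        return P_pred
    S = C @ P_pred @ C.T + R                       # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)            # Kalman gain
    return (np.eye(P.shape[0]) - K @ C) @ P_pred   # measurement update

def finite_horizon_cost(P0, A, C, Q, R, energies, arrival_prob, horizon, rng):
    """Monte-Carlo sketch of the finite-horizon sum of trace(P_k) under a given
    transmission-energy sequence; arrival_prob maps energy to the probability
    that the measurement packet is received (an assumption here)."""
    P, cost = P0, 0.0
    for k in range(horizon):
        received = rng.random() < arrival_prob(energies[k])
        P = covariance_update(P, A, C, Q, R, received)
        cost += np.trace(P)
    return cost
```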
Simultaneous Perturbation Algorithms for Batch Off-Policy Search
We propose novel policy search algorithms in the context of off-policy, batch
mode reinforcement learning (RL) with continuous state and action spaces. Given
a batch collection of trajectories, we perform off-line policy evaluation using
an algorithm similar to that by [Fonteneau et al., 2010]. Using this
Monte-Carlo like policy evaluator, we perform policy search in a class of
parameterized policies. We propose both first order policy gradient and second
order policy Newton algorithms. All our algorithms incorporate simultaneous
perturbation estimates for the gradient as well as the Hessian of the
cost-to-go vector, since the latter is unknown and only biased estimates are
available. We demonstrate their practicality on a simple 1-dimensional
continuous state space problem.
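The following is a minimal sketch of the simultaneous-perturbation (SPSA-style) gradient estimate on which such first-order policy search can be built; evaluate_policy stands in for the batch, off-line Monte-Carlo-like evaluator and is an assumption here.

```python
import numpy as np

def spsa_gradient(theta, evaluate_policy, c, rng):
    """SPSA-style two-sided estimate of the gradient of the cost J(theta):
    one Rademacher perturbation of all coordinates gives a full gradient
    estimate from only two off-line policy evaluations on the batch."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)    # simultaneous perturbation
    j_plus = evaluate_policy(theta + c * delta)
    j_minus = evaluate_policy(theta - c * delta)
    return (j_plus - j_minus) / (2.0 * c * delta)        # elementwise, since 1/delta_i = delta_i

# A first-order search then iterates theta_{k+1} = theta_k - a_k * spsa_gradient(...)
# with diminishing step sizes a_k; a Newton variant additionally perturbs twice
# per iteration to estimate the Hessian.
```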
Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint
The classic objective in a reinforcement learning (RL) problem is to find a
policy that minimizes, in expectation, a long-run objective such as the
infinite-horizon discounted or long-run average cost. In many practical
applications, optimizing the expected value alone is not sufficient, and it may
be necessary to include a risk measure in the optimization process, either as
the objective or as a constraint. Various risk measures have been proposed in
the literature, e.g., mean-variance tradeoff, exponential utility, the
percentile performance, value at risk, conditional value at risk, prospect
theory and its later enhancement, cumulative prospect theory. In this article,
we focus on the combination of risk criteria and reinforcement learning in a
constrained optimization framework, i.e., a setting where the goal is to find a
policy that optimizes the usual objective of infinite-horizon
discounted/average cost, while ensuring that an explicit risk constraint is
satisfied. We introduce the risk-constrained RL framework, cover popular risk
measures based on variance, conditional value-at-risk and cumulative prospect
theory, and present a template for a risk-sensitive RL algorithm. We survey
some of our recent work on this topic, covering problems encompassing
discounted cost, average cost, and stochastic shortest path settings, together
with the aforementioned risk measures in a constrained framework. This
non-exhaustive survey is aimed at giving a flavor of the challenges involved in
solving a risk-sensitive RL problem, and outlining some potential future
research directions.
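As a flavor of the algorithm template, the following is a minimal sketch of a primal-dual (Lagrangian) update for a CVaR-constrained objective; the gradient estimators, sampling routine, and step sizes are placeholders rather than any specific algorithm from the surveyed work.

```python
import numpy as np

def empirical_cvar(costs, alpha):
    """Empirical CVaR at level alpha: mean of the worst (1 - alpha) fraction of costs."""
    costs = np.asarray(costs)
    var = np.quantile(costs, alpha)
    return costs[costs >= var].mean()

def primal_dual_step(theta, lam, sample_costs, grad_cost, grad_risk,
                     risk_budget, alpha, eta_theta, eta_lam):
    """One primal-dual update for: minimize E[cost] s.t. CVaR_alpha(cost) <= risk_budget.

    grad_cost and grad_risk are placeholder gradient estimators (e.g. likelihood-ratio
    or simultaneous-perturbation based); sample_costs draws trajectory costs under
    the current policy pi_theta.
    """
    costs = sample_costs(theta)
    # Descend the Lagrangian L(theta, lam) = E[cost] + lam * (CVaR_alpha(cost) - budget)
    theta = theta - eta_theta * (grad_cost(theta, costs) + lam * grad_risk(theta, costs, alpha))
    # Ascend in the multiplier, projected onto lam >= 0
    lam = max(0.0, lam + eta_lam * (empirical_cvar(costs, alpha) - risk_budget))
    return theta, lam
```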