Simultaneous Perturbation Algorithms for Batch Off-Policy Search
We propose novel policy search algorithms in the context of off-policy, batch
mode reinforcement learning (RL) with continuous state and action spaces. Given
a batch collection of trajectories, we perform off-line policy evaluation using
an algorithm similar to that of [Fonteneau et al., 2010]. Using this
Monte-Carlo-like policy evaluator, we perform policy search within a class of
parameterized policies. We propose both first-order policy gradient and
second-order policy Newton algorithms. All our algorithms incorporate
simultaneous perturbation estimates of the gradient as well as the Hessian of
the cost-to-go, since the cost-to-go is unknown and only biased estimates of it
are available. We demonstrate their practicality on a simple 1-dimensional
continuous state space problem.
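To make the simultaneous perturbation idea concrete, here is a minimal Python sketch of an SPSA-style first-order policy search, assuming a black-box batch evaluator J as a stand-in for the paper's Monte-Carlo-like off-policy evaluator; the step-size schedules and toy cost below are illustrative choices, not the paper's:

    import numpy as np

    rng = np.random.default_rng(0)

    def spsa_gradient(J, theta, delta):
        # Rademacher +/-1 perturbation; two calls to the noisy batch evaluator J
        d = rng.choice([-1.0, 1.0], size=theta.shape)
        return (J(theta + delta * d) - J(theta - delta * d)) / (2.0 * delta * d)

    def policy_gradient_search(J, theta0, iters=500):
        # First-order scheme: descend the estimated cost-to-go with decaying steps
        theta = np.asarray(theta0, dtype=float)
        for k in range(1, iters + 1):
            theta -= (0.5 / k) * spsa_gradient(J, theta, delta=0.1 / k ** 0.25)
        return theta

    # Toy stand-in for the off-policy evaluator: quadratic cost minimised at 1
    J = lambda th: float(np.sum((th - 1.0) ** 2))
    print(policy_gradient_search(J, np.zeros(2)))  # approximately [1., 1.]

The second-order (policy Newton) variant additionally forms a simultaneous perturbation estimate of the Hessian from extra perturbed evaluations and typically preconditions the gradient step with its inverse.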
Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint
The classic objective in a reinforcement learning (RL) problem is to find a
policy that minimizes, in expectation, a long-run objective such as the
infinite-horizon discounted or long-run average cost. In many practical
applications, optimizing the expected value alone is not sufficient, and it may
be necessary to include a risk measure in the optimization process, either as
the objective or as a constraint. Various risk measures have been proposed in
the literature, e.g., mean-variance tradeoff, exponential utility, the
percentile performance, value at risk, conditional value at risk, prospect
theory and its later enhancement, cumulative prospect theory. In this article,
we focus on the combination of risk criteria and reinforcement learning in a
constrained optimization framework, i.e., a setting where the goal is to find a
policy that optimizes the usual objective of infinite-horizon
discounted/average cost, while ensuring that an explicit risk constraint is
satisfied. We introduce the risk-constrained RL framework, cover popular risk
measures based on variance, conditional value-at-risk and cumulative prospect
theory, and present a template for a risk-sensitive RL algorithm. We survey
some of our recent work on this topic, covering problems encompassing
discounted cost, average cost, and stochastic shortest path settings, together
with the aforementioned risk measures in a constrained framework. This
non-exhaustive survey is aimed at giving a flavor of the challenges involved in
solving a risk-sensitive RL problem, and outlining some potential future
research directions.
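As a rough illustration of such a template (a hedged sketch, not any specific algorithm from the survey): a Lagrangian relaxation handled by stochastic approximation, here with a CVaR constraint written in its Rockafellar-Uryasev form; sample_cost is a hypothetical oracle returning a single-episode cost and a gradient estimate with respect to the policy parameter:

    import numpy as np

    def risk_constrained_search(sample_cost, theta0, alpha=0.95, bound=1.0,
                                iters=5000, a_th=1e-2, a_v=1e-2, a_lam=1e-3):
        # Minimise E[C(theta)] s.t. CVaR_alpha(C(theta)) <= bound, using
        # CVaR_alpha(C) = min_v { v + E[(C - v)^+] / (1 - alpha) }
        theta = np.asarray(theta0, dtype=float)
        v, lam = 0.0, 0.0                     # VaR estimate and Lagrange multiplier
        for _ in range(iters):
            c, g = sample_cost(theta)         # cost sample and gradient estimate
            tail = float(c > v) / (1.0 - alpha)
            theta -= a_th * (1.0 + lam * tail) * g            # policy descent
            v -= a_v * (1.0 - tail)                           # VaR variable descent
            cvar_est = v + max(c - v, 0.0) / (1.0 - alpha)
            lam = max(0.0, lam + a_lam * (cvar_est - bound))  # multiplier ascent
        return theta, v, lam

In the algorithms surveyed, these updates typically run on separate timescales (policy fastest, multiplier slowest), so that standard multi-timescale stochastic approximation arguments give convergence to a local saddle point of the Lagrangian.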
Two Timescale Convergent Q-learning for Sleep-Scheduling in Wireless Sensor Networks
In this paper, we consider an intrusion detection application for Wireless
Sensor Networks (WSNs). We study the problem of scheduling the sleep times of
the individual sensors to maximize the network lifetime while keeping the
tracking error to a minimum. We formulate this problem as a
partially-observable Markov decision process (POMDP) with continuous
state-action spaces, in a manner similar to (Fuemmeler and Veeravalli [2008]).
However, unlike their formulation, we consider infinite horizon discounted and
average cost objectives as performance criteria. For each criterion, we propose
a convergent on-policy Q-learning algorithm that operates on two timescales,
while employing function approximation to handle the curse of dimensionality
associated with the underlying POMDP. Our proposed algorithm incorporates a
policy gradient update using a one-simulation simultaneous perturbation
stochastic approximation (SPSA) estimate on the faster timescale, while the
Q-value parameter (arising from a linear function approximation for the
Q-values) is updated in an on-policy temporal difference (TD) algorithm-like
fashion on the slower timescale. The feature selection scheme employed in each
of our algorithms manages the energy and tracking components in a manner that
assists the search for the optimal sleep-scheduling policy. For the sake of
comparison, in both discounted and average settings, we also develop a function
approximation analogue of the Q-learning algorithm. This algorithm, unlike the
two-timescale variant, does not possess theoretical convergence guarantees.
Finally, we also adapt our algorithms to include a stochastic iterative
estimation scheme for the intruder's mobility model. Our simulation results on
a 2-dimensional network setting suggest that our algorithms result in better
tracking accuracy at the cost of only a few additional sensors, in comparison
to a recent prior work.
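A hedged sketch of this two-timescale structure (all interfaces hypothetical: step simulates one POMDP transition under the policy with the given parameter and returns the current feature vector, the single-stage cost, and the next feature vector):

    import numpy as np

    rng = np.random.default_rng(7)

    def two_timescale_q_spsa(step, theta, w, gamma=0.9, iters=20000, delta=0.1):
        # Faster timescale: policy update from a one-simulation SPSA estimate;
        # slower timescale: TD(0)-style update of the linear Q-value parameter w.
        theta, w = np.asarray(theta, dtype=float), np.asarray(w, dtype=float)
        for k in range(1, iters + 1):
            b_k = 1.0 / k ** 0.6              # faster step size (decays more slowly)
            a_k = 1.0 / k                     # slower step size
            d = rng.choice([-1.0, 1.0], size=theta.shape)
            phi_s, cost, phi_s2 = step(theta + delta * d)  # single perturbed run
            q = w @ phi_s
            theta -= b_k * q / (delta * d)    # one-simulation SPSA descent direction
            td_err = cost + gamma * (w @ phi_s2) - q
            w += a_k * td_err * phi_s         # on-policy TD-like update
            # (projection of theta onto the feasible set of sleep times omitted)
        return theta, w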
Fast gradient descent for drifting least squares regression, with application to bandits
Online learning algorithms often require recomputing least squares
regression estimates of parameters. We study improving the computational
complexity of such algorithms by using stochastic gradient descent (SGD) type
schemes in place of classic regression solvers. We show that SGD schemes
efficiently track the true solutions of the regression problems, even in the
presence of a drift. This finding, coupled with an O(d) improvement in
complexity, where d is the dimension of the data, makes them attractive for
implementation in big data settings. In the case where strong convexity in
the regression problem is guaranteed, we provide bounds on the error both in
expectation and high probability (the latter is often needed to provide
theoretical guarantees for higher level algorithms), despite the drifting least
squares solution. As an example of this case, we prove that the regret
performance of an SGD version of the PEGE linear bandit algorithm
[Rusmevichientong and Tsitsiklis 2010] is worse than that of PEGE itself only
by a factor of . When strong convexity of the regression problem
cannot be guaranteed, we investigate the use of adaptive regularisation. We make
an empirical study of an adaptively regularised, SGD version of LinUCB [Li et
al. 2010] in a news article recommendation application, which uses the
large-scale news recommendation dataset from the Yahoo! front page. These experiments
show a large gain in computational complexity, with a consistently low tracking
error and click-through-rate (CTR) performance that is close to that of LinUCB.
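As a toy illustration of the tracking result (our own drift model and constants, not the paper's setup), a constant-step-size SGD iterate follows a slowly drifting least squares solution at O(d) cost per step, where recomputing the regression solution would cost O(d^2) or more:

    import numpy as np

    rng = np.random.default_rng(3)
    d, T, eta = 10, 20000, 0.05               # constant step lets SGD track the drift
    theta_true, theta_sgd = rng.normal(size=d), np.zeros(d)

    for t in range(T):
        theta_true += 1e-3 * rng.normal(size=d)     # drifting regression solution
        x = rng.normal(size=d)
        y = x @ theta_true + 0.1 * rng.normal()
        theta_sgd -= eta * (x @ theta_sgd - y) * x  # O(d) SGD step

    print(np.linalg.norm(theta_sgd - theta_true))   # tracking error stays modest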