Is the Bellman residual a bad proxy?
This paper aims at theoretically and empirically comparing two standard
optimization criteria for Reinforcement Learning: i) maximization of the mean
value and ii) minimization of the Bellman residual. For that purpose, we place
ourselves in the framework of policy search algorithms, which are usually
designed to maximize the mean value, and derive a method that minimizes the
residual over policies. A theoretical analysis
shows how good this proxy is for policy optimization, and notably that it is
better than its value-based counterpart. We also propose experiments on
randomly generated generic Markov decision processes, specifically designed for
studying the influence of the involved concentrability coefficient. They show
that the Bellman residual is generally a bad proxy to policy optimization and
that directly maximizing the mean value is much better, despite the current
lack of deep theoretical analysis. This might seem obvious, as directly
addressing the problem of interest is usually better, but given the prevalence
of (projected) Bellman residual minimization in value-based reinforcement
learning, we believe that this question is worth considering.
Comment: Final NIPS 2017 version (title, among other things, changed)
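For concreteness, the two competing criteria can be sketched in generic notation as follows; here ν is a state distribution and T_* the Bellman optimality operator, with the exact norms and distributions following the paper's setup rather than this shorthand:

    % Criterion (i): maximize the mean value of policy pi under nu
    \max_{\pi} \; J_{\nu}(\pi) = \mathbb{E}_{s \sim \nu}\!\left[ V^{\pi}(s) \right]

    % Criterion (ii): minimize the Bellman residual of the policy's value,
    % used here as a proxy for the objective above
    \min_{\pi} \; \left\| T_* V^{\pi} - V^{\pi} \right\|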
Foresighted Demand Side Management
We consider a smart grid with an independent system operator (ISO), and
distributed aggregators who have energy storage and purchase energy from the
ISO to serve their customers. All the entities in the system are foresighted:
each aggregator seeks to minimize its own long-term payments for energy
purchase and operational costs of energy storage by deciding how much energy to
buy from the ISO, and the ISO seeks to minimize the long-term total cost of the
system (e.g. energy generation costs and the aggregators' costs) by dispatching
the energy production among the generators. The decision making of the entities
is complicated for two reasons. First, the information is decentralized: the
ISO does not know the aggregators' states (i.e. their energy consumption
requests from customers and the amount of energy in their storage), and each
aggregator does not know the other aggregators' states or the ISO's state (i.e.
the energy generation costs and the status of the transmission lines). Second,
the coupling among the aggregators is unknown to them. Specifically, each
aggregator's energy purchase affects the price, and hence the payments of the
other aggregators. However, none of them knows how its decision influences the
price because the price is determined by the ISO based on its state. We propose
a design framework in which the ISO provides each aggregator with a conjectured
future price, and each aggregator distributively minimizes its own long-term
cost based on its conjectured price as well as its local information. The
proposed framework can achieve the social optimum despite being decentralized
and involving complex coupling among the various entities.
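A minimal sketch of the interaction pattern described above, with hypothetical names and a one-shot myopic decision standing in for the long-term problems each entity actually solves in the paper:

    # Sketch of one coordination round under conjectured pricing.
    # The ISO broadcasts a conjectured future price; each aggregator then
    # decides using only its local state, never seeing the others' states.

    def iso_conjecture_prices(iso_state, num_aggregators):
        """Toy stand-in: conjecture a price from the cheapest generator."""
        return [min(iso_state["generation_costs"])] * num_aggregators

    def aggregator_decide(local_state, conjectured_price, price_threshold=30.0):
        """Toy myopic rule: cover any shortfall, and buy one extra unit for
        storage when the conjectured price looks cheap. The paper's
        aggregators instead minimize a long-term (dynamic) cost."""
        shortfall = max(0.0, local_state["customer_demand"]
                        - local_state["storage_level"])
        top_up = 1.0 if conjectured_price < price_threshold else 0.0
        return shortfall + top_up

    iso_state = {"generation_costs": [25.0, 40.0], "line_status": "nominal"}
    aggregators = [
        {"customer_demand": 10.0, "storage_level": 3.0},
        {"customer_demand": 6.0, "storage_level": 8.0},
    ]
    prices = iso_conjecture_prices(iso_state, len(aggregators))
    purchases = [aggregator_decide(a, p) for a, p in zip(aggregators, prices)]
    print(purchases)  # [8.0, 1.0]: decentralized decisions, no shared state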
f-Divergence constrained policy improvement
To ensure stability of learning, state-of-the-art generalized policy
iteration algorithms augment the policy improvement step with a trust region
constraint bounding the information loss. The size of the trust region is
commonly determined by the Kullback-Leibler (KL) divergence, which not only
captures the notion of distance well but also yields closed-form solutions. In
this paper, we consider a more general class of f-divergences and derive the
corresponding policy update rules. The generic solution is expressed through
the derivative of the convex conjugate function to f and includes the KL
solution as a special case. Within the class of f-divergences, we further focus
on a one-parameter family of α-divergences to study effects of the
choice of divergence on policy improvement. Previously known as well as new
policy updates emerge for different values of α. We show that every type
of policy update comes with a compatible policy evaluation resulting from the
chosen f-divergence. Interestingly, the mean-squared Bellman error minimization
is closely related to policy evaluation with the Pearson χ²-divergence
penalty, while the KL divergence results in the soft-max policy update and a
log-sum-exp critic. We carry out asymptotic analysis of the solutions for
different values of α and demonstrate the effects of using different
divergence functions on a multi-armed bandit problem and on standard
reinforcement learning problems.
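The improvement step the abstract describes has roughly the following shape (a sketch, not the paper's exact statement: η is the constraint multiplier, λ(s) a per-state normalizer, A an advantage-style critic term, and f* the convex conjugate of f):

    % Policy improvement under an f-divergence trust region:
    \max_{\pi} \; \mathbb{E}_{\pi}\!\left[ A(s, a) \right]
    \quad \text{s.t.} \quad D_f\!\left( \pi \,\|\, \pi_{\mathrm{old}} \right) \le \epsilon

    % Generic solution via the derivative of the convex conjugate f^*:
    \pi(a \mid s) \;\propto\; \pi_{\mathrm{old}}(a \mid s)\,
      (f^*)'\!\left( \frac{A(s, a) - \lambda(s)}{\eta} \right)

    % KL special case, f(x) = x log x: (f^*)'(y) = e^{y-1}, which recovers
    % the familiar soft-max update pi ∝ pi_old exp(A / eta).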
Deploy-As-You-Go Wireless Relay Placement: An Optimal Sequential Decision Approach using the Multi-Relay Channel Model
We use information theoretic achievable rate formulas for the multi-relay
channel to study the problem of as-you-go deployment of relay nodes. The
achievable rate formulas are for full-duplex radios at the relays and for
decode-and-forward relaying. Deployment is done along the straight line joining
a source node and a sink node at an unknown distance from the source. The
problem is for a deployment agent to walk from the source to the sink,
deploying relays as he walks, given that the distance to the sink is
exponentially distributed with known mean. As a precursor, we apply the
multi-relay channel achievable rate formula to obtain the optimal power
allocation to relays placed along a line, at fixed locations. This permits us
to obtain the optimal placement of a given number of nodes when the distance
between the source and sink is given. Numerical work suggests that, at low
attenuation, the relays are mostly clustered near the source in order to be
able to cooperate, whereas at high attenuation they are uniformly placed and
work as repeaters. We also prove that the effect of path-loss can be entirely
mitigated if a large enough number of relays are placed uniformly between the
source and the sink. The structure of the optimal power allocation for a given
placement of the nodes then motivates us to formulate the problem of as-you-go
placement of relays along a line of exponentially distributed length, and with
the exponential path-loss model, so as to minimize a cost function that is
additive over hops. The hop cost trades off a capacity limiting term, motivated
from the optimal power allocation solution, against the cost of adding a relay
node. We formulate the problem as a total cost Markov decision process,
establish results for the value function, and provide insights into the
placement policy and the performance of the deployed network via numerical
exploration.
Comment: 21 pages. arXiv admin note: substantial text overlap with arXiv:1204.432
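The total-cost formulation can be illustrated with a toy discretized version (everything here is illustrative: the hop cost below is a stand-in for the capacity-limiting term the paper derives from the optimal power allocation, and the step size, costs, and path-loss constant are made up):

    import math

    DELTA = 0.1       # step length taken by the deployment agent (toy units)
    MEAN_DIST = 5.0   # mean of the exponentially distributed sink distance
    RHO = 1.0         # exponential path-loss constant (assumed)
    C_RELAY = 2.0     # cost of placing one relay (assumed)
    R_MAX = 5.0       # truncation of the hop-length grid

    # Memoryless probability that the sink appears within the next step.
    p_sink = 1.0 - math.exp(-DELTA / MEAN_DIST)

    def hop_cost(d):
        # Toy capacity-limiting term that grows with hop length under
        # exponential path loss; stands in for the paper's derived cost.
        return math.exp(RHO * d) - 1.0

    n = int(R_MAX / DELTA)
    V = [0.0] * (n + 1)  # V[i]: cost-to-go at distance i*DELTA from last node

    for _ in range(2000):  # value iteration for the total-cost MDP
        new_V = [0.0] * (n + 1)
        for i in range(n + 1):
            r = i * DELTA
            j = min(i + 1, n)
            # Place a relay here: pay C_RELAY, the hop resets to length DELTA.
            place = C_RELAY + p_sink * hop_cost(DELTA) + (1 - p_sink) * V[1]
            # Keep walking: the current hop stretches by DELTA.
            skip = p_sink * hop_cost(r + DELTA) + (1 - p_sink) * V[j]
            new_V[i] = min(place, skip)
        V = new_V

    # The resulting policy is threshold-like: walk until the stretched hop
    # becomes too expensive, then place a relay.
    for i in range(n + 1):
        r = i * DELTA
        place = C_RELAY + p_sink * hop_cost(DELTA) + (1 - p_sink) * V[1]
        skip = p_sink * hop_cost(r + DELTA) + (1 - p_sink) * V[min(i + 1, n)]
        if place <= skip:
            print(f"place a relay about {r:.1f} units after the previous node")
            break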
Sensor Management for Tracking in Sensor Networks
We study the problem of tracking an object moving through a network of
wireless sensors. In order to conserve energy, the sensors may be put into a
sleep mode with a timer that determines their sleep duration. It is assumed
that an asleep sensor cannot be communicated with or woken up, and hence the
sleep duration needs to be determined at the time the sensor goes to sleep
based on all the information available to the sensor. Having sleeping sensors
in the network could result in degraded tracking performance; therefore, there
is a tradeoff between energy usage and tracking performance. We design sleeping
policies that attempt to optimize this tradeoff and characterize their
performance. As an extension to our previous work in this area [1], we consider
generalized models for object movement, object sensing, and tracking cost. For
discrete state spaces and continuous Gaussian observations, we derive a lower
bound on the optimal energy-tracking tradeoff. It is shown that in the low
tracking error regime, the generated policies approach the derived lower bound.
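A toy sketch of the tradeoff a sleeping policy navigates, for one sensor and a random-walk object on a line of cells (the names, costs, and movement model here are illustrative, not the paper's generalized models):

    import numpy as np

    N_CELLS = 21       # discrete line of cells the object moves on
    SENSOR_CELL = 15   # cell covered by the sensor choosing its sleep time
    START_CELL = 10    # current (known) object position
    HORIZON = 20       # planning horizon in time steps
    C_ENERGY = 1.0     # energy cost per time step spent awake (assumed)
    C_TRACK = 5.0      # tracking cost per step the object goes unobserved

    # Random walk: stay, step left, or step right with equal probability,
    # with mass reflected at the boundaries.
    P = np.zeros((N_CELLS, N_CELLS))
    for i in range(N_CELLS):
        for j in (i - 1, i, i + 1):
            if 0 <= j < N_CELLS:
                P[i, j] += 1.0 / 3.0
        P[i, i] += 1.0 - P[i].sum()

    belief = np.zeros(N_CELLS)
    belief[START_CELL] = 1.0

    # The sleep duration must be fixed now: an asleep sensor cannot be
    # woken up, so we trade the energy saved by sleeping t steps against
    # the expected time the object occupies our cell while we sleep.
    best_t, best_cost = 0, float("inf")
    expected_miss = 0.0
    for t in range(HORIZON + 1):
        cost = C_ENERGY * (HORIZON - t) + C_TRACK * expected_miss
        if cost < best_cost:
            best_t, best_cost = t, cost
        belief = belief @ P                    # diffuse the belief one step
        expected_miss += belief[SENSOR_CELL]   # occupancy during step t+1

    print(f"sleep for {best_t} of {HORIZON} steps, expected cost {best_cost:.2f}")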