
    Is the Bellman residual a bad proxy?

    This paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual. For that purpose, we place ourselves in the framework of policy search algorithms, which are usually designed to maximize the mean value, and derive a method that minimizes the residual $\|T_* v_\pi - v_\pi\|_{1,\nu}$ over policies. A theoretical analysis shows how good this proxy is for policy optimization, and notably that it is better than its value-based counterpart. We also propose experiments on randomly generated generic Markov decision processes, specifically designed for studying the influence of the involved concentrability coefficient. They show that the Bellman residual is generally a bad proxy for policy optimization and that directly maximizing the mean value is much better, despite the current lack of deep theoretical analysis. This might seem obvious, as directly addressing the problem of interest is usually better, but given the prevalence of (projected) Bellman residual minimization in value-based reinforcement learning, we believe this question is worth considering.
    Comment: Final NIPS 2017 version (title, among other things, changed)
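
    As a concrete illustration of the quantity being minimized, the sketch below computes $\|T_* v_\pi - v_\pi\|_{1,\nu}$ exactly on a small randomly generated tabular MDP. The function name and the random-MDP setup are ours for illustration; this is not the paper's code.

```python
import numpy as np

def bellman_residual(P, r, pi, gamma, nu):
    """Weighted L1 Bellman residual ||T_* v_pi - v_pi||_{1,nu}.

    P:  (A, S, S) transition kernels P[a, s, s'],
    r:  (S, A) expected rewards,
    pi: (S, A) stochastic policy,
    nu: (S,)  weighting distribution over states.
    """
    S, A = r.shape
    r_pi = (pi * r).sum(axis=1)                    # reward under pi, (S,)
    P_pi = np.einsum('sa,ast->st', pi, P)          # kernel under pi, (S, S)
    v_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)  # exact v_pi
    # Optimal Bellman operator applied to v_pi: max over actions.
    Tv = (r + gamma * np.einsum('ast,t->sa', P, v_pi)).max(axis=1)
    return nu @ np.abs(Tv - v_pi)

# Random generic MDP, in the spirit of the experiments described above.
rng = np.random.default_rng(0)
S, A = 10, 3
P = rng.dirichlet(np.ones(S), size=(A, S))         # random kernels
r = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)             # random policy
print(bellman_residual(P, r, pi, gamma=0.9, nu=np.full(S, 1 / S)))
```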

    Foresighted Demand Side Management

    We consider a smart grid with an independent system operator (ISO) and distributed aggregators who have energy storage and purchase energy from the ISO to serve their customers. All the entities in the system are foresighted: each aggregator seeks to minimize its own long-term payments for energy purchases and the operational costs of its energy storage by deciding how much energy to buy from the ISO, and the ISO seeks to minimize the long-term total cost of the system (e.g., energy generation costs and the aggregators' costs) by dispatching the energy production among the generators. The decision making of the entities is complicated for two reasons. First, the information is decentralized: the ISO does not know the aggregators' states (i.e., their energy consumption requests from customers and the amount of energy in their storage), and each aggregator knows neither the other aggregators' states nor the ISO's state (i.e., the energy generation costs and the status of the transmission lines). Second, the coupling among the aggregators is unknown to them. Specifically, each aggregator's energy purchase affects the price, and hence the payments of the other aggregators. However, none of them knows how its decision influences the price, because the price is determined by the ISO based on its state. We propose a design framework in which the ISO provides each aggregator with a conjectured future price, and each aggregator distributively minimizes its own long-term cost based on its conjectured price as well as its local information. The proposed framework can achieve the social optimum despite being decentralized and involving complex coupling among the various entities.
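
    To make the conjectured-price idea concrete, here is a deliberately simplified, myopic toy: the ISO conjectures a price, each aggregator best-responds with a purchase, and the ISO moves the conjecture toward the marginal generation cost of the total load until the two are consistent. The linear cost model, the one-shot best response, and all names are our illustrative assumptions; in the paper's design each entity instead solves a long-term (foresighted) problem.

```python
import numpy as np

def conjectured_price_loop(demands, gen_cost_slope=0.1, n_iters=100, lr=0.5):
    """Toy fixed-point iteration for the conjectured-price framework."""
    price = 1.0
    for _ in range(n_iters):
        # Each aggregator buys less as the conjectured price rises
        # (a stand-in for its private cost-minimization problem).
        purchases = np.maximum(demands - price, 0.0)
        marginal_cost = gen_cost_slope * purchases.sum()
        price += lr * (marginal_cost - price)      # move toward consistency
    return price, purchases

price, purchases = conjectured_price_loop(np.array([3.0, 4.0, 5.0]))
```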

    f-Divergence constrained policy improvement

    To ensure stability of learning, state-of-the-art generalized policy iteration algorithms augment the policy improvement step with a trust region constraint bounding the information loss. The size of the trust region is commonly determined by the Kullback-Leibler (KL) divergence, which not only captures the notion of distance well but also yields closed-form solutions. In this paper, we consider a more general class of f-divergences and derive the corresponding policy update rules. The generic solution is expressed through the derivative of the convex conjugate function to f and includes the KL solution as a special case. Within the class of f-divergences, we further focus on a one-parameter family of $\alpha$-divergences to study effects of the choice of divergence on policy improvement. Previously known as well as new policy updates emerge for different values of $\alpha$. We show that every type of policy update comes with a compatible policy evaluation resulting from the chosen f-divergence. Interestingly, the mean-squared Bellman error minimization is closely related to policy evaluation with the Pearson $\chi^2$-divergence penalty, while the KL divergence results in the soft-max policy update and a log-sum-exp critic. We carry out asymptotic analysis of the solutions for different values of $\alpha$ and demonstrate the effects of using different divergence functions on a multi-armed bandit problem and on common standard reinforcement learning problems.
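
    As a sketch of the shape such updates take (the notation below is assumed for illustration, not quoted from the paper), write $A(s,a)$ for an advantage estimate, $\eta$ for the trust-region multiplier, $\lambda(s)$ for the per-state normalizer, and $f^*$ for the convex conjugate of $f$; a Lagrangian treatment of the constrained improvement step then gives an update of the form

```latex
\pi_{\mathrm{new}}(a \mid s) \;\propto\; \pi(a \mid s)\,
  (f^{*})'\!\left( \frac{A(s,a) - \lambda(s)}{\eta} \right)
```

    For the KL case $f(x) = x \log x$, one has $(f^{*})'(y) = e^{y-1}$, recovering the soft-max update $\pi_{\mathrm{new}} \propto \pi \, e^{A/\eta}$ mentioned in the abstract.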

    Deploy-As-You-Go Wireless Relay Placement: An Optimal Sequential Decision Approach using the Multi-Relay Channel Model

    We use information-theoretic achievable rate formulas for the multi-relay channel to study the problem of as-you-go deployment of relay nodes. The achievable rate formulas are for full-duplex radios at the relays and for decode-and-forward relaying. Deployment is done along the straight line joining a source node and a sink node at an unknown distance from the source. The problem is for a deployment agent to walk from the source to the sink, deploying relays along the way, given that the distance to the sink is exponentially distributed with known mean. As a precursor, we apply the multi-relay channel achievable rate formula to obtain the optimal power allocation to relays placed along a line at fixed locations. This permits us to obtain the optimal placement of a given number of nodes when the distance between the source and sink is given. Numerical work suggests that, at low attenuation, the relays are mostly clustered near the source in order to be able to cooperate, whereas at high attenuation they are uniformly placed and work as repeaters. We also prove that the effect of path loss can be entirely mitigated if a large enough number of relays are placed uniformly between the source and the sink. The structure of the optimal power allocation for a given placement of the nodes then motivates us to formulate the problem of as-you-go placement of relays along a line of exponentially distributed length, with the exponential path-loss model, so as to minimize a cost function that is additive over hops. The hop cost trades off a capacity-limiting term, motivated by the optimal power allocation solution, against the cost of adding a relay node. We formulate the problem as a total-cost Markov decision process, establish results for the value function, and provide insights into the placement policy and the performance of the deployed network via numerical exploration.
    Comment: 21 pages. arXiv admin note: substantial text overlap with arXiv:1204.432
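
    The as-you-go formulation can be sketched as a small total-cost MDP. The toy below replaces the exponential line length with a geometric per-step probability of finding the sink and assumes a hop cost $e^{\rho d}$ plus a per-relay price; these are illustrative stand-ins for, not reproductions of, the paper's cost terms.

```python
import numpy as np

def as_you_go_values(p_sink=0.05, rho=0.4, relay_cost=2.0, max_gap=60,
                     n_iters=2000):
    """Value iteration for a toy as-you-go relay placement MDP.

    State d = steps walked since the last placed node. At each state the
    agent either places a relay (pay relay_cost plus the hop cost
    exp(rho * d), reset d to 0) or walks one more step, after which the
    sink is found with probability p_sink and the final hop cost
    exp(rho * (d + 1)) is paid.
    """
    hop = lambda d: np.exp(rho * d)
    d = np.arange(max_gap + 1)
    nxt = np.minimum(d + 1, max_gap)               # clamp the gap
    V = np.zeros(max_gap + 1)
    for _ in range(n_iters):
        place = relay_cost + hop(d) + V[0]         # place a relay here
        walk = p_sink * hop(d + 1) + (1 - p_sink) * V[nxt]
        V = np.minimum(place, walk)                # Bellman update
    return V  # greedy comparison of the two terms yields the policy
```

    In this toy the greedy comparison typically yields a threshold rule: walk while the gap is small and place a relay once $e^{\rho d}$ starts to dominate the relay cost.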

    Sensor Management for Tracking in Sensor Networks

    We study the problem of tracking an object moving through a network of wireless sensors. In order to conserve energy, the sensors may be put into a sleep mode with a timer that determines their sleep duration. It is assumed that an asleep sensor cannot be communicated with or woken up, and hence the sleep duration needs to be determined at the time the sensor goes to sleep, based on all the information available to the sensor. Having sleeping sensors in the network could result in degraded tracking performance; there is therefore a tradeoff between energy usage and tracking performance. We design sleeping policies that attempt to optimize this tradeoff and characterize their performance. As an extension to our previous work in this area [1], we consider generalized models for object movement, object sensing, and tracking cost. For discrete state spaces and continuous Gaussian observations, we derive a lower bound on the optimal energy-tracking tradeoff. It is shown that, in the low tracking error regime, the generated policies approach the derived lower bound.
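
    A minimal sketch of the sleep-duration decision on a discrete state space is given below. The fixed scoring horizon, the cost terms, and the function name are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def best_sleep_time(P, belief, sensor_state, max_sleep=20,
                    energy_cost=1.0, miss_cost=5.0):
    """Pick a sleep duration trading energy against tracking misses.

    The object moves on a discrete chain with kernel P; `belief` is the
    current distribution over its location. Over a horizon of max_sleep
    slots, sleeping the first u slots saves energy (paid only while
    awake) but risks missing the object if it visits sensor_state while
    the sensor sleeps.
    """
    costs = []
    b = belief.copy()
    expected_misses = 0.0
    for u in range(max_sleep + 1):
        awake_energy = energy_cost * (max_sleep - u)   # slots spent awake
        costs.append(awake_energy + miss_cost * expected_misses)
        b = b @ P                                      # propagate one slot
        expected_misses += b[sensor_state]             # miss chance, slot u+1
    return int(np.argmin(costs))                       # best sleep duration
```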