
    Trajectory-Based Off-Policy Deep Reinforcement Learning

    Policy gradient methods are powerful reinforcement learning algorithms and have been demonstrated to solve many complex tasks. However, these methods are also data-inefficient, suffer from high-variance gradient estimates, and frequently get stuck in local optima. This work addresses these weaknesses by combining recent improvements in the reuse of off-policy data and exploration in parameter space with deterministic behavioral policies. The resulting objective is amenable to standard neural network optimization strategies such as stochastic gradient descent or stochastic gradient Hamiltonian Monte Carlo. Incorporating previous rollouts via importance sampling greatly improves data efficiency, while stochastic optimization schemes facilitate the escape from local optima. We evaluate the proposed approach on a series of continuous control benchmark tasks. The results show that the proposed algorithm successfully and reliably learns solutions using fewer system interactions than standard policy gradient methods. Comment: Includes appendix. Accepted for ICML 201
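As a rough illustration of how earlier rollouts can be reused through importance sampling when exploration happens in parameter space, the sketch below re-weights returns collected under an old parameter distribution to estimate the expected return under a new one. It is not the paper's objective: the isotropic Gaussian over policy parameters, the self-normalized estimator, and the names `gaussian_logpdf`, `is_return_estimate`, and `toy_return` are all assumptions made for this example, and `toy_return` merely stands in for the return of a rollout of a deterministic policy.

```python
import math
import random


def gaussian_logpdf(theta, mean, std):
    """Log-density of an isotropic Gaussian over a parameter vector."""
    return sum(
        -0.5 * ((t - m) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for t, m in zip(theta, mean)
    )


def is_return_estimate(thetas, returns, old_mean, new_mean, std):
    """Self-normalized importance-sampling estimate of the expected return
    under N(new_mean, std^2 I), using samples drawn from N(old_mean, std^2 I)."""
    log_w = [
        gaussian_logpdf(th, new_mean, std) - gaussian_logpdf(th, old_mean, std)
        for th in thetas
    ]
    max_lw = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - max_lw) for lw in log_w]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, returns)) / total


def toy_return(theta):
    """Hypothetical stand-in for a rollout return; best at theta = (1, 1)."""
    return -sum((t - 1.0) ** 2 for t in theta)


rng = random.Random(0)
old_mean = [0.0, 0.0]
std = 0.5

# "Previous rollouts": parameters sampled around old_mean and their returns.
thetas = [[rng.gauss(m, std) for m in old_mean] for _ in range(2000)]
returns = [toy_return(th) for th in thetas]

# Evaluate two candidate means without collecting any new rollouts.
est_old = is_return_estimate(thetas, returns, old_mean, old_mean, std)
est_new = is_return_estimate(thetas, returns, old_mean, [0.3, 0.3], std)
```

In this toy setting the candidate mean closer to the optimum receives the higher estimated return, which is the signal a gradient-based optimizer would follow. The estimate degrades as the new distribution moves away from the one that generated the data, because fewer samples carry meaningful weight.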

    Distributive Network Utility Maximization (NUM) over Time-Varying Fading Channels

    Distributed network utility maximization (NUM) has attracted growing interest over the past few years. Distributed solutions (e.g., the primal-dual gradient method) have been intensively investigated under fading channels. Because such distributed solutions involve iterative updating and explicit message passing, it is unrealistic to assume that the wireless channel remains unchanged during the iterations. Unfortunately, the behavior of those distributed solutions under time-varying channels is in general unknown. In this paper, we investigate the convergence behavior and tracking errors of the iterative primal-dual scaled gradient algorithm (PDSGA) with dynamic scaling matrices (DSC) for solving distributive NUM problems under time-varying fading channels. We also study a specific application example, namely the multi-commodity flow control and multi-carrier power allocation problem in multi-hop ad hoc networks. Our analysis shows that the PDSGA converges to a limit region rather than a single point under finite state Markov chain (FSMC) fading channels. We also show that the order of growth of the tracking errors is given by O(T/N), where T and N are the update interval and the average sojourn time of the FSMC, respectively. Based on this analysis, we derive a low-complexity distributive adaptation algorithm for determining the adaptive scaling matrices, which can be implemented distributively at each transmitter. The numerical results show the superior performance of the proposed dynamic scaling matrix algorithm over several baseline schemes, such as the regular primal-dual gradient algorithm.
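To make the baseline concrete, here is a minimal sketch of a plain (unscaled) primal-dual gradient iteration on a toy NUM problem: two users sharing one link, each with utility log(x_i), subject to x_1 + x_2 <= c. It is not the PDSGA with dynamic scaling matrices analyzed in the paper; the problem instance, the step size, and the function name `primal_dual_num` are assumptions chosen for illustration. The capacity is passed as one value per iteration so that a time-varying channel can be imitated by changing it between updates.

```python
def primal_dual_num(capacities, step=0.05, n_users=2):
    """Plain primal-dual gradient iteration for
    maximize sum(log(x_i)) subject to sum(x_i) <= c,
    where c may change from one iteration to the next."""
    x = [0.5] * n_users  # primal variables: user rates
    lam = 0.0            # dual variable: link price
    for c in capacities:
        # Each user only needs its own rate and the current link price.
        grad = [1.0 / xi - lam for xi in x]
        excess = sum(x) - c  # constraint violation seen by the link
        # Gradient ascent on the rates, kept strictly positive.
        x = [max(1e-6, xi + step * g) for xi, g in zip(x, grad)]
        # Projected gradient step on the price, kept nonnegative.
        lam = max(0.0, lam + step * excess)
    return x, lam


# Static channel: capacity fixed at 2.0 for every iteration.
rates, price = primal_dual_num([2.0] * 2000)
```

With a fixed capacity of 2.0 the iteration settles at equal rates of 1.0 and a price of 1.0, the optimum of this toy problem. Passing a sequence that switches between capacity values, for example `[2.0] * 50 + [1.0] * 50` repeated, gives a feel for why the iterates keep chasing a moving optimum instead of settling on a single point.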