79 research outputs found
P3O: Policy-on Policy-off Policy Optimization
On-policy reinforcement learning (RL) algorithms have high sample complexity
while off-policy algorithms are difficult to tune. Merging the two holds the
promise to develop efficient algorithms that generalize across diverse
environments. It is however challenging in practice to find suitable
hyper-parameters that govern this trade-off. This paper develops a simple
algorithm named P3O that interleaves off-policy updates with on-policy updates.
P3O uses the effective sample size between the behavior policy and the target
policy to control how far they can be from each other and does not introduce
any additional hyper-parameters. Extensive experiments on the Atari-2600 and
MuJoCo benchmark suites show that this simple technique is effective in
reducing the sample complexity of state-of-the-art algorithms. Code to
reproduce experiments in this paper is at https://github.com/rasoolfa/P3O.
Comment: UAI 2019 conference paper. Code: https://github.com/rasoolfa/P3O
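The control signal this abstract relies on, the effective sample size (ESS) between the behavior and target policies, is a standard importance-weight diagnostic. A minimal sketch of how it is typically computed is given below; the function name and the gating use at the end are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def effective_sample_size(logp_target, logp_behavior):
        """Normalized effective sample size (ESS) of importance weights.

        Generic estimate: w_i = pi_target(a_i|s_i) / pi_behavior(a_i|s_i),
        ESS = (sum w)^2 / (n * sum w^2). A value of 1.0 means the data is
        effectively on-policy; values near 0 mean the policies have drifted.
        """
        w = np.exp(logp_target - logp_behavior)          # importance weights
        return (w.sum() ** 2) / (len(w) * (w ** 2).sum())

    # Example: scale how much off-policy correction to apply by the ESS.
    logp_b = np.log(np.random.uniform(0.1, 1.0, size=256))
    logp_t = logp_b + np.random.normal(0.0, 0.3, size=256)
    off_policy_weight = effective_sample_size(logp_t, logp_b)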
Enhanced Experience Replay Generation for Efficient Reinforcement Learning
Applying deep reinforcement learning (RL) on real systems suffers from slow
data sampling. We propose an enhanced generative adversarial network (EGAN) to
initialize an RL agent in order to achieve faster learning. The EGAN utilizes
the relation between states and actions to enhance the quality of data samples
generated by a GAN. Pre-training the agent with the EGAN yields a steeper
learning curve, with about a 20% improvement in training time at the beginning
of learning compared to no pre-training, and about a 5% improvement with
smaller variance compared to pre-training with a plain GAN. For real-time
systems with sparse and slow data sampling, the EGAN could be used to speed up
the early phases of the training process.
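The usage described here, seeding learning with generated samples before real interaction, can be sketched generically; the generator, buffer, and agent interfaces below are hypothetical placeholders rather than the paper's EGAN code.

    # Illustrative sketch: pre-fill a replay buffer with synthetic transitions
    # from an already-trained generator, then warm-start the agent on them.
    def pretrain_with_synthetic_data(agent, generator, buffer,
                                     n_synthetic=10_000, n_updates=500):
        for s, a, r, s_next, done in generator.sample(n_synthetic):
            buffer.add(s, a, r, s_next, done)      # seed buffer with generated data
        for _ in range(n_updates):
            agent.update(buffer.sample_batch(64))  # warm-start value/policy networks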
A short variational proof of equivalence between policy gradients and soft Q learning
Two main families of reinforcement learning algorithms, Q-learning and policy
gradients, have recently been proven to be equivalent when using a softmax
relaxation on one part, and an entropic regularization on the other. We relate
this result to the well-known convex duality of Shannon entropy and the softmax
function. Such a result is also known as the Donsker-Varadhan formula. This
provides a short proof of the equivalence. We then interpret this duality
further, and use ideas of convex analysis to prove a new policy inequality
relative to soft Q-learning.
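The duality invoked here is the standard convex-conjugate pair between the log-sum-exp (softmax) function and Shannon entropy; a textbook statement, not the paper's exact formulation, is:

    \log \sum_{a} e^{Q(s,a)}
      \;=\; \max_{\pi \in \Delta(\mathcal{A})}
            \Big\{ \mathbb{E}_{a \sim \pi}[Q(s,a)] + \mathcal{H}(\pi) \Big\},
    \qquad
    \pi^{*}(a \mid s) \;=\; \frac{e^{Q(s,a)}}{\sum_{a'} e^{Q(s,a')}},

and, in its Donsker-Varadhan form, for a reference distribution \mu and any
bounded f,

    \log \mathbb{E}_{\mu}\big[e^{f}\big]
      \;=\; \sup_{\nu} \Big\{ \mathbb{E}_{\nu}[f] - \mathrm{KL}(\nu \,\|\, \mu) \Big\}.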
Deep Reinforcement Learning for Autonomous Driving
Reinforcement learning has steadily improved and now outperforms humans in
many traditional games since the resurgence of deep neural networks. However,
this success is not easily transferred to autonomous driving, because
real-world state spaces are extremely complex, action spaces are continuous,
and fine control is required. Moreover, autonomous driving vehicles must also
maintain functional safety in complex environments. To address these
challenges, we first adopt the deep deterministic policy gradient (DDPG)
algorithm, which has the capacity to handle complex state and action spaces in
continuous domains. We then choose The Open Racing Car Simulator (TORCS) as
our environment to avoid physical damage. Meanwhile, we select a set of
appropriate sensor inputs from TORCS and design our own reward function. To
fit the DDPG algorithm to TORCS, we design our network architectures for both
the actor and the critic within the DDPG paradigm. To demonstrate the
effectiveness of our model, we evaluate it on different modes in TORCS and
show both quantitative and qualitative results.
Comment: no time for further improvement
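For reference, the core DDPG update this abstract builds on can be sketched in a few lines; this is a generic actor-critic step with placeholder networks and optimizers, not the paper's TORCS-specific architecture, sensors, or reward.

    import torch
    import torch.nn as nn

    # Generic DDPG update step (illustrative sketch).
    # actor: s -> a (deterministic); critic: (s, a) -> Q(s, a);
    # actor_t / critic_t are slowly-updated target copies, as in standard DDPG.
    def ddpg_update(batch, actor, critic, actor_t, critic_t,
                    actor_opt, critic_opt, gamma=0.99, tau=0.005):
        s, a, r, s2, done = batch                     # tensors from a replay buffer
        with torch.no_grad():
            y = r + gamma * (1 - done) * critic_t(s2, actor_t(s2))  # TD target
        critic_loss = nn.functional.mse_loss(critic(s, a), y)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        actor_loss = -critic(s, actor(s)).mean()      # deterministic policy gradient
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Polyak averaging of the target networks.
        for p, pt in zip(critic.parameters(), critic_t.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)
        for p, pt in zip(actor.parameters(), actor_t.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)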
Similarities between policy gradient methods (PGM) in Reinforcement learning (RL) and supervised learning (SL)
Reinforcement learning (RL) is about sequential decision making and is
traditionally opposed to supervised learning (SL) and unsupervised learning
(USL). In RL, given the current state, the agent makes a decision that may
influence the next state, as opposed to SL (and USL) where the next state
remains the same regardless of the decisions taken, whether in batch or online
learning. Although this difference between SL and RL is fundamental, there are
connections that have been overlooked. In particular, we prove in this paper
that policy gradient methods can be cast as a supervised learning problem
where true labels are replaced with discounted rewards. We provide a new proof
of policy gradient methods (PGM) that emphasizes the tight link with
cross-entropy and supervised learning. We provide a simple experiment in which
we interchange labels and pseudo-rewards. We conclude that other relationships
with SL could be established if we modify the reward function wisely.
Comment: 6 pages, 1 figure
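The correspondence claimed here can be made concrete: the policy gradient (REINFORCE) loss is a cross-entropy in which the sampled actions play the role of labels and the discounted returns replace the unit label weights. A minimal sketch under that reading (function and tensor names are illustrative, not the authors' experiment):

    import torch
    import torch.nn.functional as F

    # Policy-gradient loss written as a return-weighted cross-entropy.
    # logits:  policy outputs for visited states, shape (T, n_actions)
    # actions: sampled actions, shape (T,)  -- the SL "labels"
    # returns: discounted rewards-to-go, shape (T,) -- replacing label weight 1
    def pg_as_weighted_cross_entropy(logits, actions, returns):
        ce = F.cross_entropy(logits, actions, reduction="none")  # -log pi(a_t|s_t)
        return (returns * ce).mean()  # reduces to plain SL loss when returns == 1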
Gaussian Processes for Individualized Continuous Treatment Rule Estimation
Individualized treatment rule (ITR) recommends treatment on the basis of
individual patient characteristics and the previous history of applied
treatments and their outcomes. Despite the fact that there are many ways to
estimate an ITR with binary treatments, algorithms for continuous treatments
have only just started to emerge. We propose a novel approach to continuous
ITR estimation based on explicit modelling of uncertainty in the subject's
outcome as well as direct estimation of the mean outcome using Gaussian
process regression. Our method incorporates two intuitively appealing
properties: it is more inclined to recommend a treatment whose outcome has
higher expected value and lower variance. Experiments show that this direct
incorporation of uncertainty into the ITR estimation process allows selecting
better treatments than the standard indirect approach that models only the
average. Compared to the competitors (including OWL), the proposed method
shows improved performance in terms of value function maximization, has better
interpretability, and could be more easily generalized to the setting of
multiple interdependent continuous treatments.
Comment: 26 pages, 2 figures, presented at American Statistical Association
Joint Statistical Meetings 201
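The selection rule described, preferring treatments whose predicted outcome has high mean and low variance, can be sketched with an off-the-shelf GP regressor; the trade-off weight lam, the candidate dose grid, and the function name are illustrative assumptions rather than the authors' method.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    # Sketch: pick a continuous treatment (dose) trading off the GP's predicted
    # mean outcome against its predictive uncertainty.
    # The GP is assumed fitted on X = [patient covariates, dose], y = outcome.
    def recommend_dose(gp: GaussianProcessRegressor, covariates, dose_grid, lam=1.0):
        X_cand = np.column_stack([np.tile(covariates, (len(dose_grid), 1)), dose_grid])
        mean, std = gp.predict(X_cand, return_std=True)
        score = mean - lam * std          # favor high expected value, low variance
        return dose_grid[np.argmax(score)]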
Experience Replay for Continual Learning
Continual learning is the problem of learning new tasks or knowledge while
protecting old knowledge and ideally generalizing from old experience to learn
new tasks faster. Neural networks trained by stochastic gradient descent often
degrade on old tasks when trained successively on new tasks with different data
distributions. This phenomenon, referred to as catastrophic forgetting, is
considered a major hurdle to learning with non-stationary data or sequences of
new tasks, and prevents networks from continually accumulating knowledge and
skills. We examine this issue in the context of reinforcement learning, in a
setting where an agent is exposed to tasks in a sequence. Unlike most other
work, we do not provide an explicit indication to the model of task boundaries,
which is the most general circumstance for a learning agent exposed to
continuous experience. While various methods to counteract catastrophic
forgetting have recently been proposed, we explore a straightforward, general,
and seemingly overlooked solution - that of using experience replay buffers for
all past events - with a mixture of on- and off-policy learning, leveraging
behavioral cloning. We show that this strategy can still learn new tasks
quickly yet can substantially reduce catastrophic forgetting in both Atari and
DMLab domains, even matching the performance of methods that require task
identities. When buffer storage is constrained, we confirm that a simple
mechanism for randomly discarding data allows a limited size buffer to perform
almost as well as an unbounded one.
Comment: NeurIPS 201
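One natural reading of the "simple mechanism for randomly discarding data" is reservoir sampling, which keeps a uniform random subset of everything seen so far in a fixed-size buffer. The sketch below shows that mechanism under this assumption; it is not necessarily the exact rule used in the paper.

    import random

    class ReservoirBuffer:
        """Fixed-capacity buffer keeping a uniform random sample of all items
        ever added (reservoir sampling), so old tasks stay represented."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = []
            self.n_seen = 0

        def add(self, item):
            self.n_seen += 1
            if len(self.items) < self.capacity:
                self.items.append(item)
            else:
                j = random.randrange(self.n_seen)   # kept with prob capacity/n_seen
                if j < self.capacity:
                    self.items[j] = item

        def sample(self, k):
            return random.sample(self.items, min(k, len(self.items)))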
Joint Multi-Dimension Pruning
We present joint multi-dimension pruning (named JointPruning), a new
perspective on pruning a network along three crucial dimensions
simultaneously: spatial, depth, and channel. The joint strategy enables
searching for a better configuration than previous studies that focused on a
single dimension, as our method is optimized collaboratively across the three
dimensions in a single end-to-end training. Moreover, each dimension we
consider can achieve better performance by cooperating with the other two. Our
method is realized via an adapted stochastic gradient estimation. Extensive
experiments on the large-scale ImageNet dataset across a variety of network
architectures (MobileNet V1&V2 and ResNet) demonstrate the effectiveness of
our proposed method. For instance, we achieve significant margins of 2.5% and
2.6% improvement over the state-of-the-art approach on the already compact
MobileNet V1&V2 under an extremely large compression ratio.
An Information-Theoretic Optimality Principle for Deep Reinforcement Learning
We methodologically address the problem of Q-value overestimation in deep
reinforcement learning to handle high-dimensional state spaces efficiently. By
adapting concepts from information theory, we introduce an intrinsic penalty
signal encouraging reduced Q-value estimates. The resultant algorithm
encompasses a wide range of learning outcomes containing deep Q-networks as a
special case. Different learning outcomes can be demonstrated by tuning a
Lagrange multiplier accordingly. We furthermore propose a novel scheduling
scheme for this Lagrange multiplier to ensure efficient and robust learning. In
experiments on Atari, our algorithm outperforms other algorithms (e.g. deep and
double deep Q-networks) in terms of both game-play performance and sample
complexity. These results remain valid under the recently proposed dueling
architecture.
Comment: Presented at the NIPS Deep Reinforcement Learning Workshop, Montreal,
Canada, 201
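One way to read an "intrinsic penalty signal encouraging reduced Q-value estimates" that contains deep Q-networks as a special case is a log-sum-exp (free-energy) backup controlled by the Lagrange multiplier; the form below illustrates that family under this assumption and is not necessarily the paper's exact operator:

    y \;=\; r + \gamma \,\frac{1}{\beta}
        \log \frac{1}{|\mathcal{A}|} \sum_{a'} e^{\beta\, Q^{-}(s', a')},

which approaches the standard DQN target r + \gamma \max_{a'} Q^{-}(s', a') as
\beta \to \infty, and a uniform (maximally penalized) average of the Q-values
as \beta \to 0.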
Equivalence Between Policy Gradients and Soft Q-Learning
Two of the leading approaches for model-free reinforcement learning are
policy gradient methods and -learning methods. -learning methods can be
effective and sample-efficient when they work, however, it is not
well-understood why they work, since empirically, the -values they estimate
are very inaccurate. A partial explanation may be that -learning methods are
secretly implementing policy gradient updates: we show that there is a precise
equivalence between -learning and policy gradient methods in the setting of
entropy-regularized reinforcement learning, that "soft" (entropy-regularized)
-learning is exactly equivalent to a policy gradient method. We also point
out a connection between -learning methods and natural policy gradient
methods. Experimentally, we explore the entropy-regularized versions of
-learning and policy gradients, and we find them to perform as well as (or
slightly better than) the standard variants on the Atari benchmark. We also
show that the equivalence holds in practical settings by constructing a
-learning method that closely matches the learning dynamics of A3C without
using a target network or -greedy exploration schedule
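The equivalence rests on the standard soft (entropy-regularized) relations between Q-values, the value function, and the policy; with temperature \tau, a textbook statement (not the paper's full derivation) is:

    \pi(a \mid s) \;=\; \exp\!\big( (Q(s,a) - V(s)) / \tau \big),
    \qquad
    V(s) \;=\; \tau \log \sum_{a} \exp\!\big( Q(s,a) / \tau \big),

so gradient steps on the soft Q-learning loss and on the entropy-regularized
policy gradient objective move the same underlying quantities.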
- …