79 research outputs found
P3O: Policy-on Policy-off Policy Optimization
On-policy reinforcement learning (RL) algorithms have high sample complexity
while off-policy algorithms are difficult to tune. Merging the two holds the
promise to develop efficient algorithms that generalize across diverse
environments. It is however challenging in practice to find suitable
hyper-parameters that govern this trade-off. This paper develops a simple
algorithm named P3O that interleaves off-policy updates with on-policy updates.
P3O uses the effective sample size between the behavior policy and the target
policy to control how far they can be from each other and does not introduce
any additional hyper-parameters. Extensive experiments on the Atari-2600 and
MuJoCo benchmark suites show that this simple technique is effective in
reducing the sample complexity of state-of-the-art algorithms. Code to
reproduce experiments in this paper is at https://github.com/rasoolfa/P3O.
Comment: UAI 2019 conference paper. Code: https://github.com/rasoolfa/P3O
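The control signal this abstract relies on, the effective sample size (ESS) between the behavior and target policies, is a standard importance-weight diagnostic. A minimal sketch of how it is typically computed is given below; the function name and the gating use at the end are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def effective_sample_size(logp_target, logp_behavior):
        """Normalized effective sample size (ESS) of importance weights.

        Generic estimate: w_i = pi_target(a_i|s_i) / pi_behavior(a_i|s_i),
        ESS = (sum w)^2 / (n * sum w^2). A value of 1.0 means the data is
        effectively on-policy; values near 0 mean the policies have drifted.
        """
        w = np.exp(logp_target - logp_behavior)          # importance weights
        return (w.sum() ** 2) / (len(w) * (w ** 2).sum())

    # Example: scale how much off-policy correction to apply by the ESS.
    logp_b = np.log(np.random.uniform(0.1, 1.0, size=256))
    logp_t = logp_b + np.random.normal(0.0, 0.3, size=256)
    off_policy_weight = effective_sample_size(logp_t, logp_b)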
Enhanced Experience Replay Generation for Efficient Reinforcement Learning
Applying deep reinforcement learning (RL) on real systems suffers from slow
data sampling. We propose an enhanced generative adversarial network (EGAN) to
initialize an RL agent in order to achieve faster learning. The EGAN utilizes
the relation between states and actions to enhance the quality of data samples
generated by a GAN. Pre-training the agent with the EGAN yields a steeper
learning curve, with about a 20% improvement in training time at the beginning
of learning compared to no pre-training, and about a 5% improvement with
smaller variance compared to pre-training with a plain GAN. For real-time
systems with sparse and slow data sampling, the EGAN could be used to speed up
the early phases of the training process.
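The usage described here, seeding learning with generated samples before real interaction, can be sketched generically; the generator, buffer, and agent interfaces below are hypothetical placeholders rather than the paper's EGAN code.

    # Illustrative sketch: pre-fill a replay buffer with synthetic transitions
    # from an already-trained generator, then warm-start the agent on them.
    def pretrain_with_synthetic_data(agent, generator, buffer,
                                     n_synthetic=10_000, n_updates=500):
        for s, a, r, s_next, done in generator.sample(n_synthetic):
            buffer.add(s, a, r, s_next, done)      # seed buffer with generated data
        for _ in range(n_updates):
            agent.update(buffer.sample_batch(64))  # warm-start value/policy networks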
A short variational proof of equivalence between policy gradients and soft Q learning
Two main families of reinforcement learning algorithms, Q-learning and policy
gradients, have recently been proven to be equivalent when using a softmax
relaxation on one part, and an entropic regularization on the other. We relate
this result to the well-known convex duality of Shannon entropy and the softmax
function. Such a result is also known as the Donsker-Varadhan formula. This
provides a short proof of the equivalence. We then interpret this duality
further, and use ideas of convex analysis to prove a new policy inequality
relative to soft Q-learning.
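The duality invoked here is the standard convex-conjugate pair between the log-sum-exp (softmax) function and Shannon entropy; a textbook statement, not the paper's exact formulation, is:

    \log \sum_{a} e^{Q(s,a)}
      \;=\; \max_{\pi \in \Delta(\mathcal{A})}
            \Big\{ \mathbb{E}_{a \sim \pi}[Q(s,a)] + \mathcal{H}(\pi) \Big\},
    \qquad
    \pi^{*}(a \mid s) \;=\; \frac{e^{Q(s,a)}}{\sum_{a'} e^{Q(s,a')}},

and, in its Donsker-Varadhan form, for a reference distribution \mu and any
bounded f,

    \log \mathbb{E}_{\mu}\big[e^{f}\big]
      \;=\; \sup_{\nu} \Big\{ \mathbb{E}_{\nu}[f] - \mathrm{KL}(\nu \,\|\, \mu) \Big\}.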
Deep Reinforcement Learning for Autonomous Driving
Reinforcement learning has steadily improved and now outperforms humans in
many traditional games since the resurgence of deep neural networks. However,
this success is not easily transferred to autonomous driving, because
real-world state spaces are extremely complex, action spaces are continuous,
and fine control is required. Moreover, autonomous driving vehicles must also
maintain functional safety in complex environments. To address these
challenges, we first adopt the deep deterministic policy gradient (DDPG)
algorithm, which has the capacity to handle complex state and action spaces in
continuous domains. We then choose The Open Racing Car Simulator (TORCS) as
our environment to avoid physical damage. Meanwhile, we select a set of
appropriate sensor inputs from TORCS and design our own reward function. To
fit the DDPG algorithm to TORCS, we design our network architectures for both
the actor and the critic within the DDPG paradigm. To demonstrate the
effectiveness of our model, we evaluate it on different modes in TORCS and
show both quantitative and qualitative results.
Comment: no time for further improvement
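For reference, the core DDPG update this abstract builds on can be sketched in a few lines; this is a generic actor-critic step with placeholder networks and optimizers, not the paper's TORCS-specific architecture, sensors, or reward.

    import torch
    import torch.nn as nn

    # Generic DDPG update step (illustrative sketch).
    # actor: s -> a (deterministic); critic: (s, a) -> Q(s, a);
    # actor_t / critic_t are slowly-updated target copies, as in standard DDPG.
    def ddpg_update(batch, actor, critic, actor_t, critic_t,
                    actor_opt, critic_opt, gamma=0.99, tau=0.005):
        s, a, r, s2, done = batch                     # tensors from a replay buffer
        with torch.no_grad():
            y = r + gamma * (1 - done) * critic_t(s2, actor_t(s2))  # TD target
        critic_loss = nn.functional.mse_loss(critic(s, a), y)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        actor_loss = -critic(s, actor(s)).mean()      # deterministic policy gradient
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Polyak averaging of the target networks.
        for p, pt in zip(critic.parameters(), critic_t.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)
        for p, pt in zip(actor.parameters(), actor_t.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)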
Similarities between policy gradient methods (PGM) in Reinforcement learning (RL) and supervised learning (SL)
Reinforcement learning (RL) is about sequential decision making and is
traditionally opposed to supervised learning (SL) and unsupervised learning
(USL). In RL, given the current state, the agent makes a decision that may
influence the next state, as opposed to SL (and USL) where the next state
remains the same regardless of the decisions taken, whether in batch or online
learning. Although this difference between SL and RL is fundamental, there are
connections that have been overlooked. In particular, we prove in this paper
that policy gradient methods can be cast as a supervised learning problem
where true labels are replaced with discounted rewards. We provide a new proof
of policy gradient methods (PGM) that emphasizes the tight link with
cross-entropy and supervised learning. We provide a simple experiment in which
we interchange labels and pseudo-rewards. We conclude that other relationships
with SL could be established if we modify the reward function wisely.
Comment: 6 pages, 1 figure
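The correspondence claimed here can be made concrete: the policy gradient (REINFORCE) loss is a cross-entropy in which the sampled actions play the role of labels and the discounted returns replace the unit label weights. A minimal sketch under that reading (function and tensor names are illustrative, not the authors' experiment):

    import torch
    import torch.nn.functional as F

    # Policy-gradient loss written as a return-weighted cross-entropy.
    # logits:  policy outputs for visited states, shape (T, n_actions)
    # actions: sampled actions, shape (T,)  -- the SL "labels"
    # returns: discounted rewards-to-go, shape (T,) -- replacing label weight 1
    def pg_as_weighted_cross_entropy(logits, actions, returns):
        ce = F.cross_entropy(logits, actions, reduction="none")  # -log pi(a_t|s_t)
        return (returns * ce).mean()  # reduces to plain SL loss when returns == 1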
Gaussian Processes for Individualized Continuous Treatment Rule Estimation
Individualized treatment rule (ITR) recommends treatment on the basis of
individual patient characteristics and the previous history of applied
treatments and their outcomes. Despite the fact that there are many ways to
estimate an ITR with binary treatments, algorithms for continuous treatments
have only just started to emerge. We propose a novel approach to continuous
ITR estimation based on explicit modelling of uncertainty in the subject's
outcome as well as direct estimation of the mean outcome using Gaussian
process regression. Our method incorporates two intuitively appealing
properties: it is more inclined to recommend a treatment whose outcome has
higher expected value and lower variance. Experiments show that this direct
incorporation of uncertainty into the ITR estimation process allows selecting
better treatments than the standard indirect approach that models only the
average. Compared to the competitors (including OWL), the proposed method
shows improved performance in terms of value function maximization, has better
interpretability, and could be more easily generalized to the setting of
multiple interdependent continuous treatments.
Comment: 26 pages, 2 figures, presented at American Statistical Association
Joint Statistical Meetings 201
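The selection rule described, preferring treatments whose predicted outcome has high mean and low variance, can be sketched with an off-the-shelf GP regressor; the trade-off weight lam, the candidate dose grid, and the function name are illustrative assumptions rather than the authors' method.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    # Sketch: pick a continuous treatment (dose) trading off the GP's predicted
    # mean outcome against its predictive uncertainty.
    # The GP is assumed fitted on X = [patient covariates, dose], y = outcome.
    def recommend_dose(gp: GaussianProcessRegressor, covariates, dose_grid, lam=1.0):
        X_cand = np.column_stack([np.tile(covariates, (len(dose_grid), 1)), dose_grid])
        mean, std = gp.predict(X_cand, return_std=True)
        score = mean - lam * std          # favor high expected value, low variance
        return dose_grid[np.argmax(score)]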
Experience Replay for Continual Learning
Continual learning is the problem of learning new tasks or knowledge while
protecting old knowledge and ideally generalizing from old experience to learn
new tasks faster. Neural networks trained by stochastic gradient descent often
degrade on old tasks when trained successively on new tasks with different data
distributions. This phenomenon, referred to as catastrophic forgetting, is
considered a major hurdle to learning with non-stationary data or sequences of
new tasks, and prevents networks from continually accumulating knowledge and
skills. We examine this issue in the context of reinforcement learning, in a
setting where an agent is exposed to tasks in a sequence. Unlike most other
work, we do not provide an explicit indication to the model of task boundaries,
which is the most general circumstance for a learning agent exposed to
continuous experience. While various methods to counteract catastrophic
forgetting have recently been proposed, we explore a straightforward, general,
and seemingly overlooked solution - that of using experience replay buffers for
all past events - with a mixture of on- and off-policy learning, leveraging
behavioral cloning. We show that this strategy can still learn new tasks
quickly yet can substantially reduce catastrophic forgetting in both Atari and
DMLab domains, even matching the performance of methods that require task
identities. When buffer storage is constrained, we confirm that a simple
mechanism for randomly discarding data allows a limited size buffer to perform
almost as well as an unbounded one.
Comment: NeurIPS 201
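One natural reading of the "simple mechanism for randomly discarding data" is reservoir sampling, which keeps a uniform random subset of everything seen so far in a fixed-size buffer. The sketch below shows that mechanism under this assumption; it is not necessarily the exact rule used in the paper.

    import random

    class ReservoirBuffer:
        """Fixed-capacity buffer keeping a uniform random sample of all items
        ever added (reservoir sampling), so old tasks stay represented."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = []
            self.n_seen = 0

        def add(self, item):
            self.n_seen += 1
            if len(self.items) < self.capacity:
                self.items.append(item)
            else:
                j = random.randrange(self.n_seen)   # kept with prob capacity/n_seen
                if j < self.capacity:
                    self.items[j] = item

        def sample(self, k):
            return random.sample(self.items, min(k, len(self.items)))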
Joint Multi-Dimension Pruning
We present joint multi-dimension pruning (named JointPruning), a new
perspective on pruning a network along three crucial dimensions
simultaneously: spatial, depth, and channel. The joint strategy enables
searching for a better configuration than previous studies that focused on a
single dimension, as our method is optimized collaboratively across the three
dimensions in a single end-to-end training. Moreover, each dimension we
consider can achieve better performance by cooperating with the other two. Our
method is realized via an adapted stochastic gradient estimation. Extensive
experiments on the large-scale ImageNet dataset across a variety of network
architectures (MobileNet V1&V2 and ResNet) demonstrate the effectiveness of
our proposed method. For instance, we achieve significant margins of 2.5% and
2.6% improvement over the state-of-the-art approach on the already compact
MobileNet V1&V2 under an extremely large compression ratio.
An Information-Theoretic Optimality Principle for Deep Reinforcement Learning
We methodologically address the problem of Q-value overestimation in deep
reinforcement learning to handle high-dimensional state spaces efficiently. By
adapting concepts from information theory, we introduce an intrinsic penalty
signal encouraging reduced Q-value estimates. The resultant algorithm
encompasses a wide range of learning outcomes containing deep Q-networks as a
special case. Different learning outcomes can be demonstrated by tuning a
Lagrange multiplier accordingly. We furthermore propose a novel scheduling
scheme for this Lagrange multiplier to ensure efficient and robust learning. In
experiments on Atari, our algorithm outperforms other algorithms (e.g. deep and
double deep Q-networks) in terms of both game-play performance and sample
complexity. These results remain valid under the recently proposed dueling
architecture.
Comment: Presented at the NIPS Deep Reinforcement Learning Workshop, Montreal,
Canada, 201
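One way to read an "intrinsic penalty signal encouraging reduced Q-value estimates" that contains deep Q-networks as a special case is a log-sum-exp (free-energy) backup controlled by the Lagrange multiplier; the form below illustrates that family under this assumption and is not necessarily the paper's exact operator:

    y \;=\; r + \gamma \,\frac{1}{\beta}
        \log \frac{1}{|\mathcal{A}|} \sum_{a'} e^{\beta\, Q^{-}(s', a')},

which approaches the standard DQN target r + \gamma \max_{a'} Q^{-}(s', a') as
\beta \to \infty, and a uniform (maximally penalized) average of the Q-values
as \beta \to 0.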
Equivalence Between Policy Gradients and Soft Q-Learning
Two of the leading approaches for model-free reinforcement learning are
policy gradient methods and -learning methods. -learning methods can be
effective and sample-efficient when they work, however, it is not
well-understood why they work, since empirically, the -values they estimate
are very inaccurate. A partial explanation may be that -learning methods are
secretly implementing policy gradient updates: we show that there is a precise
equivalence between -learning and policy gradient methods in the setting of
entropy-regularized reinforcement learning, that "soft" (entropy-regularized)
-learning is exactly equivalent to a policy gradient method. We also point
out a connection between -learning methods and natural policy gradient
methods. Experimentally, we explore the entropy-regularized versions of
-learning and policy gradients, and we find them to perform as well as (or
slightly better than) the standard variants on the Atari benchmark. We also
show that the equivalence holds in practical settings by constructing a
-learning method that closely matches the learning dynamics of A3C without
using a target network or -greedy exploration schedule
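The equivalence rests on the standard soft (entropy-regularized) relations between Q-values, the value function, and the policy; with temperature \tau, a textbook statement (not the paper's full derivation) is:

    \pi(a \mid s) \;=\; \exp\!\big( (Q(s,a) - V(s)) / \tau \big),
    \qquad
    V(s) \;=\; \tau \log \sum_{a} \exp\!\big( Q(s,a) / \tau \big),

so gradient steps on the soft Q-learning loss and on the entropy-regularized
policy gradient objective move the same underlying quantities.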
- …