Is the Bellman residual a bad proxy?
This paper aims at theoretically and empirically comparing two standard
optimization criteria for Reinforcement Learning: i) maximization of the mean
value and ii) minimization of the Bellman residual. For that purpose, we place
ourselves in the framework of policy search algorithms, which are usually
designed to maximize the mean value, and derive a method that minimizes the
residual over policies. A theoretical analysis
shows how good a proxy this is for policy optimization, and notably that it is
better than its value-based counterpart. We also propose experiments on
randomly generated generic Markov decision processes, specifically designed for
studying the influence of the involved concentrability coefficient. They show
that the Bellman residual is generally a bad proxy to policy optimization and
that directly maximizing the mean value is much better, despite the current
lack of deep theoretical analysis. This might seem obvious, as directly
addressing the problem of interest is usually better, but given the prevalence
of (projected) Bellman residual minimization in value-based reinforcement
learning, we believe that this question is worth considering.
Comment: Final NIPS 2017 version (title, among other things, changed)
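As a toy illustration of the two criteria this abstract contrasts, the sketch below (a hypothetical 2-state, 2-action MDP; all transition probabilities, rewards, and the initial distribution are made up) computes the mean value rho^T v_pi exactly and evaluates the Bellman residual ||T_pi v - v|| at a value estimate:

```python
import numpy as np

gamma = 0.9
# P[a] is the transition matrix under action a; r[a] the reward vector.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
rho = np.array([0.5, 0.5])           # initial-state distribution

def value_of(pi):
    """Exact v_pi via the linear system (I - gamma P_pi) v = r_pi."""
    P_pi = P[pi, np.arange(2)]       # row s follows action pi[s]
    r_pi = r[pi, np.arange(2)]
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

def bellman_residual(pi, v):
    """||T_pi v - v|| for an arbitrary value estimate v."""
    P_pi = P[pi, np.arange(2)]
    r_pi = r[pi, np.arange(2)]
    return np.linalg.norm(r_pi + gamma * P_pi @ v - v)

pi = np.array([0, 1])                # a deterministic policy
v_pi = value_of(pi)
print("mean value:", rho @ v_pi)     # criterion (i): maximize this
print("residual at v_pi:", bellman_residual(pi, v_pi))  # ~0 by definition
```

The residual vanishes at the true value function of the policy, which is why, as the paper argues, the interesting question is how well minimizing it over policies proxies for maximizing the mean value.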
Specialized Deep Residual Policy Safe Reinforcement Learning-Based Controller for Complex and Continuous State-Action Spaces
Traditional controllers have limitations as they rely on prior knowledge
about the physics of the problem, require modeling of dynamics, and struggle to
adapt to abnormal situations. Deep reinforcement learning has the potential to
address these problems by learning optimal control policies through exploration
in an environment. For safety-critical environments, it is impractical to
explore randomly, and replacing conventional controllers with black-box models
is also undesirable. Moreover, exploration is expensive in continuous state and
action spaces unless the search space is constrained. To address these
challenges, we
propose a specialized deep residual policy safe reinforcement learning with a
cycle of learning approach adapted for complex and continuous state-action
spaces. Residual policy learning allows learning a hybrid control architecture
where the reinforcement learning agent acts in synchronous collaboration with
the conventional controller. The cycle of learning initializes the policy from
the expert trajectory and guides the exploration around it. Further, the
specialization through the input-output hidden Markov model helps to optimize
a policy within the region of interest (such as an abnormality), where the
reinforcement learning agent is required and therefore activated. The proposed
solution is validated on the Tennessee Eastman process control
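The core idea of the residual policy architecture described above can be sketched in a few lines (the controller gain, the linear residual, and the function names are hypothetical stand-ins, not the paper's implementation): the executed action is the conventional controller's output plus a learned correction.

```python
def base_controller(state, setpoint=0.0, kp=1.5):
    """Conventional proportional controller (stands in for the expert)."""
    return kp * (setpoint - state)

def residual_policy(state, theta):
    """Learned correction; here a toy linear policy with parameter theta."""
    return theta * state

def hybrid_action(state, theta):
    # Synchronous collaboration: base action plus learned residual.
    return base_controller(state) + residual_policy(state, theta)

# With theta = 0 the hybrid policy reduces to the safe base controller,
# which is why exploration can safely start from the expert trajectory.
assert hybrid_action(2.0, theta=0.0) == base_controller(2.0)
```

This additive structure is what keeps the black-box agent from fully replacing the conventional controller: the learned part only needs to model the residual between the base behavior and the desired one.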
Reinforcement Learning Experience Reuse with Policy Residual Representation
Experience reuse is key to sample-efficient reinforcement learning. One of
the critical issues is how the experience is represented and stored.
Previously, experience has been stored in the form of features, individual
models, or an average model, each lying at a different granularity. However,
new tasks may require experience across multiple granularities. In this paper,
we propose the policy residual representation (PRR) network, which can extract
and store multiple levels of experience. The PRR network is trained on a set of
tasks with a multi-level architecture, where a module in each level corresponds
to a subset of the tasks. Therefore, the PRR network represents the experience
in a spectrum-like way. When training on a new task, PRR can provide different
levels of experience for accelerating the learning. We experiment with the PRR
network on a set of grid world navigation tasks, locomotion tasks, and fighting
tasks in a video game. The results show that the PRR network leads to better
reuse of experience and thus outperforms some state-of-the-art approaches.
Comment: Conference version appears in IJCAI 201
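A minimal sketch of the spectrum-like, multi-level structure the abstract describes (the level names, shapes, and linear modules are invented for illustration): a shared bottom level covers all tasks, a middle level covers a task subset, and a top level is task-specific, with the policy output summing the levels' residual contributions.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions = 4, 3
levels = {
    "all_tasks":   rng.normal(size=(obs_dim, n_actions)),  # coarse, shared
    "task_subset": rng.normal(size=(obs_dim, n_actions)),  # medium granularity
    "this_task":   np.zeros((obs_dim, n_actions)),         # fine, trained last
}

def policy_logits(obs):
    # Each level adds a residual on top of the coarser levels below it.
    return sum(obs @ W for W in levels.values())

obs = rng.normal(size=obs_dim)
print(policy_logits(obs).shape)  # (3,)
```

On a new task, only the finest level would need to be trained from scratch, while the coarser levels transfer experience at whatever granularity matches the task.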
SkipNet: Learning Dynamic Routing in Convolutional Networks
While deeper convolutional networks are needed to achieve maximum accuracy in
visual perception tasks, for many inputs shallower networks are sufficient. We
exploit this observation by learning to skip convolutional layers on a
per-input basis. We introduce SkipNet, a modified residual network that uses a
gating network to selectively skip convolutional blocks based on the
activations of the previous layer. We formulate the dynamic skipping problem in
the context of sequential decision making and propose a hybrid learning
algorithm that combines supervised learning and reinforcement learning to
address the challenges of non-differentiable skipping decisions. We show
SkipNet reduces computation by 30-90% while preserving the accuracy of the
original model on four benchmark datasets and outperforms the state-of-the-art
dynamic networks and static compression methods. We also qualitatively evaluate
the gating policy to reveal a relationship between image scale and saliency and
the number of layers skipped.
Comment: ECCV 2018 camera-ready version. Code is available at
https://github.com/ucbdrive/skipne
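The per-input skipping mechanism can be sketched as follows (a toy gate and simplified residual block, not SkipNet's actual architecture): a small gating function reads the previous activation and decides whether the next residual block executes or is bypassed by the identity shortcut.

```python
import numpy as np

def residual_block(x, W):
    return x + np.tanh(x @ W)            # simplified residual block

def gate(x, w_gate, threshold=0.5):
    # Scalar sigmoid gate computed from the previous layer's activations.
    score = 1.0 / (1.0 + np.exp(-(x @ w_gate)))
    return score > threshold             # True -> execute the block

rng = np.random.default_rng(0)
x = rng.normal(size=4)
blocks = [rng.normal(size=(4, 4)) for _ in range(3)]
gates = [rng.normal(size=4) for _ in range(3)]

executed = 0
for W, w_g in zip(blocks, gates):
    if gate(x, w_g):
        x = residual_block(x, W)
        executed += 1                    # block runs
    # else: identity shortcut, computation saved
print("blocks executed:", executed, "of", len(blocks))
```

The hard threshold is what makes the decision non-differentiable, motivating the hybrid supervised/reinforcement learning scheme the abstract mentions.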
BlockDrop: Dynamic Inference Paths in Residual Networks
Very deep convolutional neural networks offer excellent recognition results,
yet their computational expense limits their impact for many real-world
applications. We introduce BlockDrop, an approach that learns to dynamically
choose which layers of a deep network to execute during inference so as to best
reduce total computation without degrading prediction accuracy. Exploiting the
robustness of Residual Networks (ResNets) to layer dropping, our framework
selects on-the-fly which residual blocks to evaluate for a given novel image.
In particular, given a pretrained ResNet, we train a policy network in an
associative reinforcement learning setting for the dual reward of utilizing a
minimal number of blocks while preserving recognition accuracy. We conduct
extensive experiments on CIFAR and ImageNet. The results provide strong
quantitative and qualitative evidence that these learned policies not only
accelerate inference but also encode meaningful visual information. Built upon
a ResNet-101 model, our method achieves a speedup of 20\% on average, going as
high as 36\% for some images, while maintaining the same 76.4\% top-1 accuracy
on ImageNet.
Comment: CVPR 201
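The dual reward described above can be sketched like this (the exact penalty constant and functional form are assumptions for illustration): the policy network emits one keep/drop decision per residual block, and the reward favors correct predictions that evaluate few blocks.

```python
import numpy as np

def reward(correct, keep_mask, penalty=-1.0):
    usage = keep_mask.mean()             # fraction of blocks evaluated
    if correct:
        return 1.0 - usage ** 2          # fewer blocks -> higher reward
    return penalty                       # wrong predictions are penalized

keep_all = np.ones(10)
keep_half = np.array([1, 0] * 5, dtype=float)
print(reward(True, keep_all))    # 0.0
print(reward(True, keep_half))   # 0.75
print(reward(False, keep_half))  # -1.0
```

Because an incorrect prediction is penalized regardless of savings, the learned policy only drops blocks where the ResNet's robustness to layer dropping leaves accuracy intact.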