Bayesian Reinforcement Learning via Deep, Sparse Sampling
We address the problem of Bayesian reinforcement learning using efficient
model-based online planning. We propose an optimism-free Bayes-adaptive
algorithm that induces deeper and sparser exploration, with a theoretical
bound on its performance relative to the Bayes-optimal policy at lower
computational complexity. The main novelty is the use of a candidate policy
generator to produce long-term options in the planning tree (over beliefs),
which allows us to build much sparser and deeper trees. Experimental results
on different environments show that, in comparison to the state of the art,
our algorithm is both computationally more efficient and obtains
significantly higher reward in discrete environments.
Comment: Published in AISTATS 202
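As a rough illustration of the planning idea, the following minimal Python sketch evaluates a handful of candidate policies (long-term options) by rolling each out for many steps under models sampled from the belief, so the tree is sparse in breadth but deep in horizon. The toy two-armed bandit, the policy generator, and all function names are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): sparse belief-tree planning with
# long-horizon options on a hypothetical two-armed Bernoulli bandit.
import numpy as np

rng = np.random.default_rng(0)

def sample_model(belief):
    """Sample reward parameters from the current belief (a Beta posterior)."""
    return rng.beta(belief[:, 0], belief[:, 1])  # sampled mean reward per arm

def candidate_policies(belief, k=3):
    """Hypothetical policy generator: each candidate commits to one arm."""
    return [lambda s, arm=a: arm for a in rng.choice(2, size=k)]

def rollout(policy, model, horizon):
    """Discounted return of running `policy` for `horizon` steps under `model`."""
    ret, gamma = 0.0, 0.99
    for t in range(horizon):
        a = policy(None)
        ret += gamma**t * rng.binomial(1, model[a])
    return ret

def sparse_sample(belief, width=5, horizon=20):
    """Evaluate a few long-term options instead of branching over primitive
    actions: the tree is sparser (few candidates) and deeper (long rollouts)."""
    scores = []
    for pi in candidate_policies(belief):
        returns = [rollout(pi, sample_model(belief), horizon) for _ in range(width)]
        scores.append(np.mean(returns))
    return int(np.argmax(scores))

belief = np.ones((2, 2))  # uniform Beta prior over two arms
print("chosen option:", sparse_sample(belief))
```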
VIME: Variational Information Maximizing Exploration
Scalable and effective exploration remains a key challenge in reinforcement
learning (RL). While there are methods with optimality guarantees in the
setting of discrete state and action spaces, these methods cannot be applied in
high-dimensional deep RL scenarios. As such, most contemporary RL relies on
simple heuristics such as epsilon-greedy exploration or adding Gaussian noise
to the controls. This paper introduces Variational Information Maximizing
Exploration (VIME), an exploration strategy based on maximization of
information gain about the agent's belief of the environment dynamics. We
propose a practical implementation using variational inference in Bayesian
neural networks, which efficiently handles continuous state and action spaces. VIME
modifies the MDP reward function, and can be applied with several different
underlying RL algorithms. We demonstrate that VIME achieves significantly
better performance compared to heuristic exploration methods across a variety
of continuous control tasks and algorithms, including tasks with very sparse
rewards.
Comment: Published in Advances in Neural Information Processing Systems 29 (NIPS), pages 1109-111
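The core mechanism, an intrinsic reward proportional to the information gained about the dynamics, can be sketched with a tiny conjugate Bayesian model standing in for the paper's variational BNN. The diagonal-Gaussian regressor, the toy features, and the bonus weight 0.1 below are illustrative assumptions, not VIME's actual implementation.

```python
# Minimal sketch (assumptions flagged above): VIME-style intrinsic reward as
# the KL divergence between the dynamics-model posterior before and after an
# update, here on a tiny Bayesian linear model instead of a variational BNN.
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL( N(mu1, var1) || N(mu0, var0) ) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var0 / var1) + (var1 + (mu1 - mu0) ** 2) / var0 - 1.0)

class BayesLinearDynamics:
    """Bayesian linear model y = w^T x with a diagonal Gaussian posterior."""
    def __init__(self, dim, noise_var=0.1):
        self.mu, self.var = np.zeros(dim), np.ones(dim)
        self.noise_var = noise_var

    def update(self, x, y):
        """Mean-field conjugate update; returns the info gain KL(new || old)."""
        mu0, var0 = self.mu.copy(), self.var.copy()
        precision = 1.0 / var0 + x**2 / self.noise_var
        self.var = 1.0 / precision
        self.mu = self.var * (mu0 / var0 + x * y / self.noise_var)
        return kl_diag_gauss(mu0, var0, self.mu, self.var)

model = BayesLinearDynamics(dim=3)
rng = np.random.default_rng(1)
for step in range(3):
    x = rng.normal(size=3)              # (state, action) features
    y = x @ np.array([0.5, -1.0, 0.2])  # observed next-state target
    info_gain = model.update(x, y)
    r_total = 0.0 + 0.1 * info_gain     # extrinsic reward + eta * intrinsic bonus
    print(f"step {step}: intrinsic bonus = {info_gain:.4f}")
```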
Efficient Exploration in Continuous-time Model-based Reinforcement Learning
Reinforcement learning algorithms typically consider discrete-time dynamics,
even though the underlying systems are often continuous in time. In this paper,
we introduce a model-based reinforcement learning algorithm that represents
continuous-time dynamics using nonlinear ordinary differential equations
(ODEs). We capture epistemic uncertainty using well-calibrated probabilistic
models, and use the optimistic principle for exploration. Our regret bounds
surface the importance of the measurement selection strategy (MSS): in
continuous time we must decide not only how to explore, but also when to
observe the underlying system. Our analysis demonstrates that the regret is
sublinear when modeling ODEs with Gaussian Processes (GP) for common choices of
MSS, such as equidistant sampling. Additionally, we propose an adaptive,
data-dependent, practical MSS that, when combined with GP dynamics, also
achieves sublinear regret with significantly fewer samples. We showcase the
benefits of continuous-time modeling over its discrete-time counterpart, as
well as our proposed adaptive MSS over standard baselines, on several
applications.
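A minimal sketch of the measurement-selection idea, under hypothetical assumptions (a 1-D RBF GP over time on the unit interval): the adaptive MSS repeatedly observes where the GP posterior variance over the unknown ODE signal is largest, and is contrasted with the equidistant baseline.

```python
# Minimal sketch (illustrative, not the paper's method): an adaptive MSS picks
# the next observation time where the GP posterior variance is largest.
import numpy as np

def rbf(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def gp_posterior_var(t_obs, t_query, noise=1e-4):
    """Posterior variance of a zero-mean unit-scale RBF GP at `t_query`."""
    K = rbf(t_obs, t_obs) + noise * np.eye(len(t_obs))
    k_star = rbf(t_query, t_obs)
    return 1.0 - np.einsum("ij,jk,ik->i", k_star, np.linalg.inv(K), k_star)

grid = np.linspace(0.0, 1.0, 200)
t_obs = np.array([0.0, 1.0])          # start with the interval endpoints

for _ in range(8):                     # adaptive MSS: observe where unsure
    var = gp_posterior_var(t_obs, grid)
    t_next = grid[np.argmax(var)]      # most-uncertain time gets the sample
    t_obs = np.sort(np.append(t_obs, t_next))

print("adaptive measurement times:", np.round(t_obs, 2))
print("equidistant baseline:      ", np.round(np.linspace(0, 1, len(t_obs)), 2))
```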
DJ-MC: A Reinforcement-Learning Agent for Music Playlist Recommendation
In recent years, there has been growing focus on the study of automated
recommender systems. Music recommendation serves as a prominent domain for
such work, from both an academic and a commercial perspective. A
fundamental aspect of music perception is that music is experienced in temporal
context and in sequence. In this work we present DJ-MC, a novel
reinforcement-learning framework for music recommendation that does not
recommend songs individually but rather song sequences, or playlists, based on
a model of preferences for both songs and song transitions. The model is
learned online and is uniquely adapted for each listener. To reduce exploration
time, DJ-MC exploits user feedback to initialize a model, which it subsequently
updates by reinforcement. We evaluate our framework with human participants
using both real song and playlist data. Our results indicate that DJ-MC's
ability to recommend sequences of songs provides a significant improvement over
more straightforward approaches, which do not take transitions into account.
Comment: Updated to the most recent and complete version (presented at AAMAS 2015); updated author list. In Autonomous Agents and Multiagent Systems (AAMAS) 2015, Istanbul, Turkey, May 201
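A minimal sketch of the abstract's preference model, with hypothetical song features and weights: the reward of appending a song decomposes into a song term and a transition term, and a greedy planner (a simple stand-in for DJ-MC's actual planning procedure) builds a playlist by maximizing it.

```python
# Minimal sketch (hypothetical features and weights): a DJ-MC-style playlist
# MDP in which the state is the listening history and the reward of appending
# song `nxt` after song `prev` splits into a song term and a transition term.
import numpy as np

rng = np.random.default_rng(3)
n_songs, dim = 20, 5
features = rng.normal(size=(n_songs, dim))  # per-song descriptors
w_song = rng.normal(size=dim)               # learned song preferences
w_trans = rng.normal(size=dim)              # learned transition preferences

def reward(prev, nxt):
    """Song utility plus transition utility, as in the abstract's model."""
    song_term = features[nxt] @ w_song
    trans_term = np.abs(features[nxt] - features[prev]) @ w_trans
    return song_term + trans_term

def plan_playlist(start, length=5):
    """Greedy planner: repeatedly pick the unused song with highest reward."""
    playlist = [start]
    for _ in range(length - 1):
        candidates = [s for s in range(n_songs) if s not in playlist]
        playlist.append(max(candidates, key=lambda s: reward(playlist[-1], s)))
    return playlist

print("planned playlist:", plan_playlist(start=0))
```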
Closed-Loop Learning of Visual Control Policies
In this paper we present a general, flexible framework for learning mappings
from images to actions by interacting with the environment. The basic idea is
to introduce a feature-based image classifier in front of a reinforcement
learning algorithm. The classifier partitions the visual space according to the
presence or absence of a few highly informative local descriptors that are
incrementally selected in a sequence of attempts to remove perceptual aliasing.
We also address the problem of fighting overfitting in such a greedy algorithm.
Finally, we show how high-level visual features can be generated when the power
of local descriptors is insufficient for completely disambiguating the aliased
states. This is done by building a hierarchy of composite features that consist
of recursive spatial combinations of visual features. We demonstrate the
efficacy of our algorithms by solving three visual navigation tasks and a
visual version of the classical Car on the Hill control problem.
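A minimal sketch of the state construction, with made-up descriptors and thresholds: an image's local patches are tested against a few selected descriptors, the resulting presence/absence bits index a discrete state, and that state feeds an ordinary tabular Q-function.

```python
# Minimal sketch (illustrative, not the paper's system): an image is mapped to
# a discrete state by testing for the presence of a few informative local
# descriptors; the resulting bit pattern indexes a tabular Q-function.
import numpy as np

rng = np.random.default_rng(4)
descriptors = rng.normal(size=(3, 16))  # 3 hypothetical selected descriptors

def image_to_state(patches, threshold=0.2):
    """Bit i is set iff some patch matches descriptor i closely enough."""
    bits = 0
    for i, d in enumerate(descriptors):
        sims = patches @ d / (np.linalg.norm(patches, axis=1) * np.linalg.norm(d))
        if sims.max() > threshold:
            bits |= 1 << i
    return bits                          # one of 2**3 discrete visual states

patches = rng.normal(size=(10, 16))      # local patches from one image
q_table = np.zeros((8, 4))               # 8 visual states x 4 actions
state = image_to_state(patches)
print("visual state:", state, "-> greedy action:", int(q_table[state].argmax()))
```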
A Reinforcement Learning Control in Hot Stamping for Cycle Time Optimization
Hot stamping is an increasingly in-demand hot metal forming technology that
produces ultra-high-strength parts with complex shapes. A major concern in
these systems is how to shorten production times to improve production Key
Performance Indicators. In this work, we present a
Reinforcement Learning approach that can obtain an optimal behavior strategy for dynamically
managing the cycle time in hot stamping to optimize manufacturing production while maintaining
the quality of the final product. Results are compared with the business-as-usual cycle time control
approach and the optimal solution obtained by the execution of a dynamic programming algorithm.
Reinforcement Learning control outperforms the business-as-usual behavior by reducing the cycle
time and the total batch time in non-stable temperature phases.
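A minimal sketch of such a controller, in which every element (states, actions, dynamics, and rewards) is hypothetical: tabular Q-learning adjusts the cycle time per discretized temperature phase, trading shorter cycles against a quality penalty in hot phases.

```python
# Minimal sketch (all dynamics, states, and rewards are made up): tabular
# Q-learning that adjusts the cycle time per discretized temperature phase,
# rewarding shorter cycles while penalizing quality violations.
import numpy as np

rng = np.random.default_rng(5)
n_phases, n_actions = 5, 3           # temp phases x {shorten, keep, lengthen}
Q = np.zeros((n_phases, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(phase, action):
    """Toy environment: shorter cycles pay off unless the phase runs too hot."""
    cycle_time = 10.0 + (action - 1) * 2.0      # action 0 shortens, 2 lengthens
    quality_penalty = 5.0 if (phase >= 3 and action == 0) else 0.0
    reward = -cycle_time - quality_penalty
    return min(phase + 1, n_phases - 1), reward

for episode in range(500):
    phase = 0
    for _ in range(n_phases):
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[phase].argmax())
        nxt, r = step(phase, a)
        Q[phase, a] += alpha * (r + gamma * Q[nxt].max() - Q[phase, a])
        phase = nxt

print("greedy cycle-time action per phase:", Q.argmax(axis=1))
```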