4,198 research outputs found
Batch Reinforcement Learning on the Industrial Benchmark: First Experiences
The Particle Swarm Optimization Policy (PSO-P) has been recently introduced
and proven to produce remarkable results on interacting with academic
reinforcement learning benchmarks in an off-policy, batch-based setting. To
further investigate the properties and feasibility on real-world applications,
this paper investigates PSO-P on the so-called Industrial Benchmark (IB), a
novel reinforcement learning (RL) benchmark that aims at being realistic by
including a variety of aspects found in industrial applications, like
continuous state and action spaces, a high dimensional, partially observable
state space, delayed effects, and complex stochasticity. The experimental
results of PSO-P on IB are compared to results of closed-form control policies
derived from the model-based Recurrent Control Neural Network (RCNN) and the
model-free Neural Fitted Q-Iteration (NFQ). Experiments show that PSO-P is not
only of interest for academic benchmarks, but also for real-world industrial
applications, since it also yielded the best performing policy in our IB
setting. Compared to other well established RL techniques, PSO-P produced
outstanding results in performance and robustness, requiring only a relatively
low amount of effort in finding adequate parameters or making complex design
decisions
Deep Variational Reinforcement Learning for POMDPs
Many real-world sequential decision making problems are partially observable
by nature, and the environment model is typically unknown. Consequently, there
is great need for reinforcement learning methods that can tackle such problems
given only a stream of incomplete and noisy observations. In this paper, we
propose deep variational reinforcement learning (DVRL), which introduces an
inductive bias that allows an agent to learn a generative model of the
environment and perform inference in that model to effectively aggregate the
available information. We develop an n-step approximation to the evidence lower
bound (ELBO), allowing the model to be trained jointly with the policy. This
ensures that the latent state representation is suitable for the control task.
In experiments on Mountain Hike and flickering Atari we show that our method
outperforms previous approaches relying on recurrent neural networks to encode
the past
- …