Online Meta-learning by Parallel Algorithm Competition
The efficiency of reinforcement learning algorithms depends critically on a
few meta-parameters that modulate the learning updates and the trade-off
between exploration and exploitation. The adaptation of the meta-parameters is
an open question in reinforcement learning, which arguably has become more of
an issue recently with the success of deep reinforcement learning in
high-dimensional state spaces. The long learning times in domains such as Atari
2600 video games make it infeasible to perform comprehensive searches of
appropriate meta-parameter values. We propose the Online Meta-learning by
Parallel Algorithm Competition (OMPAC) method. In the OMPAC method, several
instances of a reinforcement learning algorithm are run in parallel with small
differences in the initial values of the meta-parameters. After a fixed number
of episodes, the instances are selected based on their performance in the task
at hand. Before continuing the learning, Gaussian noise is added to the
meta-parameters with a predefined probability. We validate the OMPAC method by
improving the state-of-the-art results in stochastic SZ-Tetris and in standard
Tetris with a smaller, 10×10, board, by 31% and 84%, respectively, and
by improving the results for deep Sarsa(λ) agents in three Atari 2600
games by 62% or more. The experiments also show the ability of the OMPAC method
to adapt the meta-parameters according to the learning progress in different
tasks.
Comment: 15 pages, 10 figures. arXiv admin note: text overlap with
arXiv:1702.0311
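The selection-and-perturbation loop described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `ompac_step`, the rank-proportional selection scheme, and the default values of `noise_prob` and `noise_scale` are all assumptions made here, since the abstract does not specify them. Only the meta-parameters are modeled; a real run would also need to carry over each selected instance's learned state.

```python
import random


def ompac_step(population, scores, noise_prob=0.5, noise_scale=0.05, rng=random):
    """One hypothetical selection-and-perturbation round.

    population: list of dicts mapping meta-parameter names to float values.
    scores: one performance score per instance (higher is better).
    """
    n = len(population)
    # Assumed selection scheme: sampling weight proportional to rank,
    # so the best-scoring instance is the most likely to be copied.
    order = sorted(range(n), key=lambda i: scores[i])  # worst to best
    weights = [0.0] * n
    for rank, i in enumerate(order):
        weights[i] = rank + 1
    chosen = rng.choices(range(n), weights=weights, k=n)

    new_population = []
    for i in chosen:
        params = dict(population[i])  # copy so the parent stays unchanged
        for name in params:
            # Add Gaussian noise to each meta-parameter with a fixed probability.
            if rng.random() < noise_prob:
                params[name] += rng.gauss(0.0, noise_scale)
        new_population.append(params)
    return new_population
```

In use, each instance would train for a fixed number of episodes, report a score, and then `ompac_step` would produce the meta-parameters for the next round.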
Dealing with uncertainty: balancing exploration and exploitation in deep recurrent reinforcement learning
Incomplete knowledge of the environment leads an agent to make decisions
under uncertainty. A major dilemma in Reinforcement Learning (RL) is that an
autonomous agent must balance two contrasting needs when making its decisions:
exploiting its current knowledge of the environment to maximize the cumulative
reward, and exploring actions that improve that knowledge, hopefully leading to
higher reward values (the exploration-exploitation trade-off). A second
relevant issue is the full observability of the states, which cannot be assumed
in all applications, for instance when 2D images are used as input to an RL
approach that must find the best actions within a 3D simulation environment.
In this work, we address these issues by deploying and testing several
techniques to balance the exploration-exploitation trade-off on partially
observable systems for predicting steering wheels in autonomous driving
scenarios. More precisely, the final aim is to investigate the effects of using
both adaptive and deterministic exploration strategies coupled with a Deep
Recurrent Q-Network. Additionally, we adapted and evaluated the impact of a
modified quadratic loss function to improve the learning phase of the
underlying Convolutional Recurrent Neural Network. We show that adaptive
methods better approximate the trade-off between exploration and exploitation
and, in general, Softmax and Max-Boltzmann strategies outperform epsilon-greedy
techniques.
Comment: 28 pages, added references, corrected typos, revised argument in
section 3, results unchanged, redefinition of the article structure
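For readers unfamiliar with the three exploration strategies named in this abstract, the sketch below shows their textbook forms over a list of Q-values. It is an illustration only: the function names, the `temperature` and `epsilon` parameters, and the use of plain Python lists are assumptions made here, and the paper's adaptive variants are not reproduced.

```python
import math
import random


def softmax_probs(q_values, temperature=1.0):
    """Boltzmann distribution over actions (max subtracted for numerical stability)."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(q_values):
    """Index of the highest Q-value (first one on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


def softmax_action(q_values, temperature=1.0, rng=random):
    """Sample an action from the Boltzmann distribution."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def max_boltzmann(q_values, epsilon, temperature=1.0, rng=random):
    """With probability epsilon sample from the Boltzmann distribution, else act greedily."""
    if rng.random() < epsilon:
        return softmax_action(q_values, temperature, rng)
    return greedy_action(q_values)
```

The difference between epsilon-greedy and Max-Boltzmann lies only in the exploratory branch: the former explores uniformly, while the latter weights exploratory actions by their estimated value.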
Effective Multi-Agent Deep Reinforcement Learning Control with Relative Entropy Regularization
In this paper, a novel Multi-agent Reinforcement Learning (MARL) approach,
Multi-Agent Continuous Dynamic Policy Gradient (MACDPP), is proposed to tackle
the issues of limited capability and sample efficiency in various scenarios
controlled by multiple agents. It alleviates the inconsistency of multiple
agents' policy updates by introducing the relative entropy regularization to
the Centralized Training with Decentralized Execution (CTDE) framework with the
Actor-Critic (AC) structure. Evaluated on multi-agent cooperation and
competition tasks and traditional control tasks including OpenAI benchmarks and
robot arm manipulation, MACDPP demonstrates significant superiority in learning
capability and sample efficiency compared with both related multi-agent and
widely implemented single-agent baselines and therefore expands the potential
of MARL in effectively learning challenging control scenarios.
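To illustrate what relative entropy (KL divergence) regularization does to a policy update in general, the sketch below uses a single agent with a small discrete action set. It is not the MACDPP algorithm, which targets continuous multi-agent control; the function names and the parameter `eta` are assumptions made here. It relies on the standard result that maximizing expected value minus (1/eta) times the KL divergence from the old policy yields a new policy proportional to the old policy times exp(eta * Q).

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_regularized_update(old_policy, q_values, eta):
    """Closed-form maximizer of  sum_a pi(a) Q(a) - (1/eta) * KL(pi || old_policy).

    Small eta keeps the new policy close to the old one; large eta
    moves it toward the greedy policy.
    """
    m = max(q_values)  # subtracting the max leaves the result unchanged but avoids overflow
    unnorm = [p * math.exp(eta * (q - m)) for p, q in zip(old_policy, q_values)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

The penalty bounds how far each update can move the policy, which is the sense in which such a term can smooth out inconsistent updates.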