Functions that Emerge through End-to-End Reinforcement Learning - The Direction for Artificial General Intelligence -
Recently, triggered by Google DeepMind's impressive results in video games and
the game of Go, end-to-end reinforcement learning (RL) has been attracting
attention. Although it is little known, the author's group has advocated this
framework for around 20 years and has already shown various functions that
emerge in a neural network (NN) through RL. In this paper, those functions are
introduced again.
"Function Modularization" approach is deeply penetrated subconsciously. The
inputs and outputs for a learning system can be raw sensor signals and motor
commands. "State space" or "action space" generally used in RL show the
existence of functional modules. That has limited reinforcement learning to
learning only for the action-planning module. In order to extend reinforcement
learning to learning of the entire function on a huge degree of freedom of a
massively parallel learning system and to explain or develop human-like
intelligence, the author has believed that end-to-end RL from sensors to motors
using a recurrent NN (RNN) becomes an essential key. Especially in the higher
functions, this approach is very effective by being free from the need to
decide their inputs and outputs.
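To make the setting concrete, below is a minimal sketch of such a
sensor-to-motor setup: a recurrent network mapping raw sensor signals directly
to motor commands, with no hand-designed state or action modules in between.
The layer sizes and weight scales are illustrative assumptions, not the
author's actual architecture.

```python
import numpy as np

# Minimal sketch of an end-to-end sensor-to-motor recurrent network.
# All sizes and initialisations are illustrative assumptions.
rng = np.random.default_rng(0)
n_sensor, n_hidden, n_motor = 64, 32, 4
W_in = rng.normal(size=(n_hidden, n_sensor)) * 0.1   # raw sensors -> hidden
W_rec = rng.normal(size=(n_hidden, n_hidden)) * 0.1  # recurrent dynamics
W_out = rng.normal(size=(n_motor, n_hidden)) * 0.1   # hidden -> motor commands

def step(sensors, hidden):
    hidden = np.tanh(W_in @ sensors + W_rec @ hidden)
    motors = np.tanh(W_out @ hidden)
    return motors, hidden

motors, hidden = step(rng.normal(size=n_sensor), np.zeros(n_hidden))
```

In this framing, RL would adjust all three weight matrices at once, so no
boundary between recognition, memory, and action planning is imposed by hand.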
The functions whose emergence we have confirmed through RL using an NN cover a
broad range, from real-robot learning with raw camera-pixel inputs to the
acquisition of dynamic functions in an RNN. They are: (1) image recognition,
(2) color constancy (optical illusion), (3) sensor motion (active
recognition), (4) hand-eye coordination and hand-reaching movement,
(5) explanation of brain activities, (6) communication, (7) knowledge
transfer, (8) memory, (9) selective attention, (10) prediction, and
(11) exploration. End-to-end RL enables the emergence of very flexible,
comprehensive functions that consider many things in parallel, although it is
difficult to draw the boundary of each function clearly.
Comment: The Multi-disciplinary Conference on Reinforcement Learning and
Decision Making (RLDM) 2017, 5 pages, 4 figures
New Reinforcement Learning Using a Chaotic Neural Network for Emergence of "Thinking" - "Exploration" Grows into "Thinking" through Learning -
Expectations for the emergence of higher functions are growing in the
framework of end-to-end reinforcement learning using a recurrent neural
network. However, the emergence of "thinking", a typical higher function, is
difficult to realize because "thinking" needs non-fixed-point, flow-type
attractors with both convergence and transition dynamics. Furthermore, to
introduce "inspiration" or "discovery" into "thinking", transitions that are
unexpected yet not completely random are also required.
By analogy to "chaotic itinerancy", we have hypothesized that "exploration"
grows into "thinking" through learning by forming flow-type attractors on
chaotic, random-like dynamics. It is expected that if rational dynamics are
learned in a chaotic neural network (ChNN), rational state transitions,
inspiration-like state transitions, and random-like exploration for unknown
situations can coexist.
Based on this idea, we have proposed a new reinforcement learning method that
uses a ChNN as an actor. The positioning of exploration is completely
different from the conventional one: the chaotic dynamics inside the ChNN
produce the exploration factors by themselves. Since no external random
numbers are used for stochastic action selection, the exploration factors
cannot be isolated from the output, and so the learning method must also be
completely different from the conventional one.
At each non-feedback connection, a variable called a causality trace takes in
and maintains the input passing through the connection according to the change
in the connection's output. The weight is then updated using the trace and the
TD error.
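The update rule is only outlined here, so the following is a hypothetical
sketch of how such a causality-trace update might look; the exact accumulation
rule and the names update_causality_traces and update_weights are assumptions.

```python
import numpy as np

# Hypothetical causality-trace update: each trace absorbs the input through
# its connection in proportion to how much the connection's output changed,
# and weights are then moved along trace * TD error.
def update_causality_traces(c, x, y, y_prev):
    gain = np.clip(np.abs(y - y_prev), 0.0, 1.0)[:, None]  # per-output change
    return (1.0 - gain) * c + gain * x[None, :]            # take in the input

def update_weights(w, c, td_error, lr=0.01):
    return w + lr * td_error * c                           # TD-error-gated step

c = np.zeros((4, 8))                 # traces for an 8-input, 4-output layer
w = np.zeros((4, 8))
x, y, y_prev = np.random.rand(8), np.random.rand(4), np.random.rand(4)
c = update_causality_traces(c, x, y, y_prev)
w = update_weights(w, c, td_error=0.5)
```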
In this paper, as a result on a recent simple task designed to check whether
the new learning works, it is shown that after learning, a robot with two
wheels and two visual sensors reaches a target while avoiding an obstacle,
though there is still much room for improvement.
Comment: The Multi-disciplinary Conference on Reinforcement Learning and
Decision Making (RLDM) 2017, 5 pages, 6 figures
GalaxyNet: Connecting galaxies and dark matter haloes with deep neural networks and reinforcement learning in large volumes
We present the novel wide & deep neural network GalaxyNet, which connects the
properties of galaxies and dark matter haloes, and is directly trained on
observed galaxy statistics using reinforcement learning. A feature-importance
analysis with random forests shows that the most important halo properties for
predicting stellar mass and star formation rate (SFR) are halo mass, growth
rate, and the scale factor at the time the mass peaks. We train different
models with supervised learning to find the optimal network architecture.
GalaxyNet is then trained with a reinforcement learning approach: for a fixed
set of weights and biases, we compute the galaxy properties for all haloes and
then derive mock statistics (stellar mass functions, cosmic and specific SFRs,
quenched fractions, and clustering). Comparing these statistics to observations
we get the model loss, which is minimised with particle swarm optimisation.
GalaxyNet reproduces the observed data very accurately
(), and predicts a stellar-to-halo mass relation with a
lower normalisation and shallower low-mass slope at high redshift than
empirical models. We find that at low mass, the galaxies with the highest SFRs
are satellites, although most satellites are quenched. The normalisation of the
instantaneous conversion efficiency increases with redshift, but stays constant
above . Finally, we use GalaxyNet to populate a cosmic volume of
with galaxies and predict the BAO signal, the bias, and
the clustering of active and passive galaxies up to , which can be tested
with next-generation surveys such as LSST and Euclid.
Comment: 21 pages, 21 figures, 6 tables, submitted
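The outer loop described above (fixed weights, mock statistics, loss, particle
swarm optimisation) can be sketched as follows. The network, the mock
statistics, and all numbers here are toy stand-ins; the paper's statistics and
architecture are far richer.

```python
import numpy as np

# Toy sketch of the GalaxyNet training loop: evaluate a fixed set of weights
# on all haloes, summarise the mock galaxies, compare to observations, and
# let particle swarm optimisation (PSO) move the weights.

def galaxy_net(params, halos):
    """Tiny stand-in network: halo features -> (stellar mass, SFR)."""
    w1, w2 = params[:30].reshape(3, 10), params[30:].reshape(10, 2)
    return np.tanh(halos @ w1) @ w2

def model_loss(params, halos, observed_stats):
    mock_stats = galaxy_net(params, halos).mean(axis=0)  # toy summary stats
    return float(np.sum((mock_stats - observed_stats) ** 2))

def particle_swarm(halos, observed, n=32, dim=50, steps=200):
    pos, vel = np.random.randn(n, dim), np.zeros((n, dim))
    pbest = pos.copy()
    pval = np.array([model_loss(p, halos, observed) for p in pos])
    for _ in range(steps):
        g = pbest[pval.argmin()]                         # global best weights
        r1, r2 = np.random.rand(2, n, 1)
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (g - pos)
        pos = pos + vel
        val = np.array([model_loss(p, halos, observed) for p in pos])
        better = val < pval
        pbest[better], pval[better] = pos[better], val[better]
    return pbest[pval.argmin()]

halos = np.random.randn(1000, 3)   # e.g. mass, growth rate, peak scale factor
observed = np.array([0.1, 0.05])   # toy "observed" summaries
best_weights = particle_swarm(halos, observed)
```

PSO needs only loss evaluations, which is why it suits a loss built from
population-level statistics that are not differentiable per galaxy.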
Hierarchical Reinforcement Learning for Quadruped Locomotion
Legged locomotion is a challenging task for learning algorithms, especially
when the task requires a diverse set of primitive behaviors. To solve these
problems, we introduce a hierarchical framework to automatically decompose
complex locomotion tasks. A high-level policy issues commands in a latent space
and also selects for how long the low-level policy will execute the latent
command. Concurrently, the low-level policy uses the latent command and only
the robot's on-board sensors to control the robot's actuators. Our approach
allows the high-level policy to run at a lower frequency than the low-level
one. We test our framework on a path-following task for a dynamic quadruped
robot and we show that steering behaviors automatically emerge in the latent
command space as low-level skills are needed for this task. We then show
efficient adaptation of the trained policy to a different task by transfer of
the trained low-level policy. Finally, we validate the policies on a real
quadruped robot. To the best of our knowledge, this is the first application of
end-to-end hierarchical learning to a real robotic locomotion task.
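The control flow of this hierarchy can be sketched as follows; both policies
are stand-in linear maps, and the sizes and duration encoding are assumptions
that only mirror the interface described above (a latent command plus how long
to hold it).

```python
import numpy as np

# Sketch of the two-level control loop: the high level picks a latent command
# and a duration at a low rate; the low level maps on-board sensor readings
# plus the latent command to actuator targets at every control step.
rng = np.random.default_rng(0)
W_high = rng.normal(size=(16, 9)) * 0.3   # state -> 8-dim latent + duration
W_low = rng.normal(size=(24, 12)) * 0.3   # [sensors, latent] -> 12 actuators

def high_level(state):
    out = state @ W_high
    duration = int(2 + 6 / (1 + np.exp(-out[8])))   # hold command 2-8 steps
    return out[:8], duration

def low_level(sensors, latent):
    return np.tanh(np.concatenate([sensors, latent]) @ W_low)

state, sensors = rng.normal(size=16), rng.normal(size=16)
latent, steps_left = None, 0
for t in range(100):
    if steps_left == 0:                    # high level runs at lower frequency
        latent, steps_left = high_level(state)
    action = low_level(sensors, latent)    # low level runs every step
    steps_left -= 1
    # an environment step would update `state` and `sensors` here
```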
Emergence of Locomotion Behaviours in Rich Environments
The reinforcement learning paradigm allows, in principle, for complex
behaviours to be learned directly from simple reward signals. In practice,
however, it is common to carefully hand-design the reward function to encourage
a particular solution, or to derive it from demonstration data. In this paper
we explore how a rich environment can help to promote the learning of complex
behaviour. Specifically, we train agents in diverse environmental contexts, and
find that this encourages the emergence of robust behaviours that perform well
across a suite of tasks. We demonstrate this principle for locomotion --
behaviours that are known for their sensitivity to the choice of reward. We
train several simulated bodies on a diverse set of challenging terrains and
obstacles, using a simple reward function based on forward progress. Using a
novel scalable variant of policy gradient reinforcement learning, our agents
learn to run, jump, crouch and turn as required by the environment without
explicit reward-based guidance. A visual depiction of highlights of the
learned behaviour can be viewed at https://youtu.be/hx_bgoTF7bs.
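The kind of simple forward-progress reward referred to above might look like
the sketch below; the control penalty and termination cost are common
additions and are assumptions here, not terms taken from the paper.

```python
# Hedged sketch of a forward-progress reward: the signal is essentially
# "how far did the body move forward", with no task-specific shaping.
def reward(x_before, x_after, dt, ctrl, alive):
    velocity = (x_after - x_before) / dt          # progress along the course
    ctrl_cost = 1e-3 * sum(u * u for u in ctrl)   # assumed mild actuation cost
    return velocity - ctrl_cost + (0.0 if alive else -1.0)
```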
Improving Coordination in Small-Scale Multi-Agent Deep Reinforcement Learning through Memory-driven Communication
Deep reinforcement learning algorithms have recently been used to train
multiple interacting agents in a centralised manner whilst keeping their
execution decentralised. When the agents can only acquire partial observations
and are faced with tasks requiring coordination and synchronisation skills,
inter-agent communication plays an essential role. In this work, we propose a
framework for multi-agent training using deep deterministic policy gradients
that enables concurrent, end-to-end learning of an explicit communication
protocol through a memory device. During training, the agents learn to perform
read and write operations enabling them to infer a shared representation of the
world. We empirically demonstrate that concurrent learning of the communication
device and individual policies can improve inter-agent coordination and
performance in small-scale systems. Our experimental results show that the
proposed method achieves superior performance in scenarios with up to six
agents. We illustrate how different communication patterns can emerge on six
different tasks of increasing complexity. Furthermore, we study the effects of
corrupting the communication channel, provide a visualisation of the
time-varying memory content as the underlying task is being solved and validate
the building blocks of the proposed memory device through ablation studies.
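A toy sketch of such a memory device is given below: each agent reads the
shared content, conditions its action on it, and writes back a gated update.
The gating form and names here are assumptions; in the paper the read and
write operations are learned end-to-end together with the policies.

```python
import numpy as np

# Toy shared memory: agents read a common vector and write gated updates,
# which over training can come to act as a learned communication channel.
class SharedMemory:
    def __init__(self, size):
        self.m = np.zeros(size)

    def read(self):
        return self.m.copy()

    def write(self, message, gate):
        # gate in [0, 1]: how much of the old content is overwritten
        self.m = (1.0 - gate) * self.m + gate * message

memory = SharedMemory(8)
for agent_obs in (np.random.randn(8), np.random.randn(8)):
    context = memory.read()                        # shared world representation
    policy_input = np.concatenate([agent_obs, context])  # fed to each actor
    memory.write(message=np.tanh(agent_obs), gate=0.5)
```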
Latent Space Policies for Hierarchical Reinforcement Learning
We address the problem of learning hierarchical deep neural network policies
for reinforcement learning. In contrast to methods that explicitly restrict or
cripple lower layers of a hierarchy to force them to use higher-level
modulating signals, each layer in our framework is trained to directly solve
the task, but acquires a range of diverse strategies via a maximum entropy
reinforcement learning objective. Each layer is also augmented with latent
random variables, which are sampled from a prior distribution during the
training of that layer. The maximum entropy objective causes these latent
variables to be incorporated into the layer's policy, and the higher level
layer can directly control the behavior of the lower layer through this latent
space. Furthermore, by constraining the mapping from latent variables to
actions to be invertible, higher layers retain full expressivity: neither the
higher layers nor the lower layers are constrained in their behavior. Our
experimental evaluation demonstrates that we can improve on the performance of
single-layer policies on standard benchmark tasks simply by adding additional
layers, and that our method can solve more complex sparse-reward tasks by
learning higher-level policies on top of high-entropy skills optimized for
simple low-level objectives.
Comment: ICML 2018; Videos: https://sites.google.com/view/latent-space-deep-rl
Code: https://github.com/haarnoja/sa
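The key structural idea, that a higher layer steers a lower layer through a
latent space whose mapping to actions is invertible, can be illustrated with
the toy sketch below. The fixed affine map stands in for the learned
invertible transformations; it is an assumption made for brevity, not the
paper's parameterisation.

```python
import numpy as np

# Toy two-layer latent-space policy: the lower layer maps latent z to an
# action through an invertible (here: fixed affine) transformation, so the
# higher layer retains full expressivity by choosing z.
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.3],
              [0.0, 1.0]])                 # invertible by construction

def lower_layer(state, z):
    shift = np.tanh(state[:2])             # state-dependent offset
    return A @ z + shift                   # bijective in z for a fixed state

def higher_layer(state):
    return rng.normal(size=2)              # a trained policy over z in practice

state = rng.normal(size=4)
z = higher_layer(state)                    # while training layer 1, z ~ prior
action = lower_layer(state, z)
```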
Social Learning Methods in Board Games
This paper discusses the effects of social learning in the training of
game-playing agents. The training of agents in a social context instead of a
self-play environment is investigated. Agents that use reinforcement learning
algorithms are trained in social settings. This mimics the way in which
players of board games such as Scrabble and chess mentor each other in their
clubs. A Round Robin tournament and a modified Swiss tournament setting are
used for the training. The agents trained in social settings are compared to
self-play agents, and the results indicate that more robust agents emerge from
the social training setting. Games with larger state spaces can benefit from
such settings, as a diverse set of agents holds multiple strategies,
increasing the chances of obtaining more experienced players at the end of
training. The socially trained agents exhibit stronger play than self-play
agents. The modified Swiss playing style spawns a larger number of
better-playing agents as the population size increases.
Comment: 6 pages
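As a rough illustration of the social setting, a round-robin training epoch
simply pairs every agent with every other agent; play_and_learn is a
placeholder for a single reinforcement-learning game between two agents.

```python
from itertools import combinations

# Round-robin social training: each agent learns from games against every
# other member of the population rather than from self-play alone.
def round_robin_epoch(agents, play_and_learn):
    for a, b in combinations(agents, 2):   # every pairing exactly once
        play_and_learn(a, b)
```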
Learning through Probing: a decentralized reinforcement learning architecture for social dilemmas
Multi-agent reinforcement learning has received significant interest in
recent years, notably due to the advancements made in deep reinforcement
learning, which have allowed for the development of new architectures and
learning algorithms. Using social dilemmas as the training ground, we present a
novel learning architecture, Learning through Probing (LTP), where agents
utilize a probing mechanism to incorporate how their opponent's behavior
changes when an agent takes an action. We use distinct training phases and
adjust rewards according to the overall outcome of the experiences, accounting
for changes to the opponent's behavior. We introduce a parameter eta to
determine the significance of these future changes to opponent behavior. When
applied to the Iterated Prisoner's Dilemma (IPD), LTP agents demonstrate that
they can learn to cooperate with each other, achieving higher average
cumulative rewards than other reinforcement learning methods while also
maintaining good performance in playing against static agents that are present
in Axelrod tournaments. We compare this method with traditional reinforcement
learning algorithms and agent-tracking techniques to highlight key differences
and potential applications. We also draw attention to the differences between
solving games and societal-like interactions and analyze the training of
Q-learning agents in makeshift societies. This is to emphasize how cooperation
may emerge in societies and demonstrate this using environments where
interactions with opponents are determined through a random encounter format of
the IPD.
Comment: 9 pages, 4 figures
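The abstract only states that rewards are adjusted for changes to the
opponent's behavior, weighted by eta, so the following one-liner is a
hypothetical form of that adjustment rather than the paper's exact rule.

```python
# Hypothetical probing-adjusted reward: the estimated future effect of the
# opponent's behavioural shift is folded into the immediate reward, scaled
# by eta. Names and the additive form are assumptions.
def adjusted_reward(immediate_reward, opponent_value_shift, eta):
    return immediate_reward + eta * opponent_value_shift
```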
Universal Planning Networks
A key challenge in complex visuomotor control is learning abstract
representations that are effective for specifying goals, planning, and
generalization. To this end, we introduce universal planning networks (UPN).
UPNs embed differentiable planning within a goal-directed policy. This planning
computation unrolls a forward model in a latent space and infers an optimal
action plan through gradient descent trajectory optimization. The
plan-by-gradient-descent process and its underlying representations are learned
end-to-end to directly optimize a supervised imitation learning objective. We
find that the representations learned are not only effective for goal-directed
visual imitation via gradient-based trajectory optimization, but can also
provide a metric for specifying goals using images. The learned representations
can be leveraged to specify distance-based rewards to reach new target states
for model-free reinforcement learning, resulting in substantially more
effective learning when solving new tasks described via image-based goals. We
were able to achieve successful transfer of visuomotor planning strategies
across robots with significantly different morphologies and actuation
capabilities.
Comment: Videos available at https://sites.google.com/view/upn-public/hom
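Plan-by-gradient-descent in a latent space can be sketched as below. A fixed
linear latent model keeps the inner gradient analytic; UPN instead learns the
encoder and forward model end-to-end and differentiates through the unrolled
plan with automatic differentiation, so this is a simplified illustration, not
the paper's implementation.

```python
import numpy as np

# Gradient-descent planning through a linear latent model z' = A z + B u:
# roll the model forward, measure the terminal latent-space error, and move
# the action sequence down the analytic gradient of that error.
rng = np.random.default_rng(0)
A = np.eye(4) * 0.95                        # latent dynamics (assumed, fixed)
B = rng.normal(size=(4, 2)) * 0.1           # action effect in latent space

def plan(z0, z_goal, horizon=10, iters=50, lr=0.5):
    actions = np.zeros((horizon, 2))
    for _ in range(iters):
        z = z0
        for u in actions:                   # unroll the forward model
            z = A @ z + B @ u
        err = z - z_goal                    # terminal latent error
        grad = np.stack([
            2.0 * B.T @ np.linalg.matrix_power(A, horizon - 1 - t).T @ err
            for t in range(horizon)])
        actions -= lr * grad                # inner gradient-descent step
    return actions

plan(rng.normal(size=4), np.zeros(4))
```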