M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search
Learning to walk over a graph towards a target node for a given query and a
source node is an important problem in applications such as knowledge base
completion (KBC). It can be formulated as a reinforcement learning (RL) problem
with a known state transition model. To overcome the challenge of sparse
rewards, we develop a graph-walking agent called M-Walk, which consists of a
deep recurrent neural network (RNN) and Monte Carlo Tree Search (MCTS). The RNN
encodes the state (i.e., history of the walked path) and maps it separately to
a policy and Q-values. In order to effectively train the agent from sparse
rewards, we combine MCTS with the neural policy to generate trajectories
yielding more positive rewards. From these trajectories, the network is
improved in an off-policy manner using Q-learning, which modifies the RNN
policy via parameter sharing. Our proposed RL algorithm repeatedly applies this
policy-improvement step to learn the model. At test time, MCTS is combined with
the neural policy to predict the target node. Experimental results on several
graph-walking benchmarks show that M-Walk is able to learn better policies than
other RL-based methods, which are mainly based on policy gradients. M-Walk also
outperforms traditional KBC baselines.
Comment: Yelong Shen, Jianshu Chen and Po-Sen Huang contributed equally to the
paper. Published at the 32nd Conference on Neural Information Processing
Systems (NeurIPS 2018), Montréal, Canada.
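As a rough illustration of the search pattern this abstract describes (not the authors' code; the environment interface, policy_fn, and all names below are assumed for the sketch), a PUCT-style tree search can use the learned policy as an edge prior and back up the sparse terminal reward:

```python
import math

class Node:
    """One search-tree node; children are keyed by graph edge (action)."""
    def __init__(self, prior=1.0):
        self.prior, self.visits, self.value_sum = prior, 0, 0.0
        self.children = {}

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def mcts_search(root_state, root, policy_fn, env, n_sims=50, c_puct=1.0):
    # Assumed interfaces: policy_fn(state) -> {action: prob} from the RNN;
    # env.step(state, a) -> next_state; env.terminal_reward(state) -> 0 or 1.
    for _ in range(n_sims):
        node, state, path = root, root_state, []
        # Selection: descend with PUCT until reaching an unexpanded leaf.
        while node.children:
            total = math.sqrt(node.visits + 1)
            action, node = max(
                node.children.items(),
                key=lambda it: it[1].q()
                + c_puct * it[1].prior * total / (1 + it[1].visits))
            state = env.step(state, action)
            path.append(node)
        # Expansion: attach children weighted by the neural policy prior.
        for a, p in policy_fn(state).items():
            node.children[a] = Node(prior=p)
        # Backup: propagate the sparse terminal reward along the path.
        value = env.terminal_reward(state)
        for n in [root] + path:
            n.visits += 1
            n.value_sum += value
    return root  # positive-reward trajectories feed the Q-learning updates
```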
Feedback Control For Cassie With Deep Reinforcement Learning
Bipedal locomotion skills are challenging to develop. Control strategies
often use local linearization of the dynamics in conjunction with reduced-order
abstractions to yield tractable solutions. In these model-based control
strategies, the controller is often not fully aware of many details, including
torque limits, joint limits, and other non-linearities that are necessarily
excluded from the control computations for simplicity. Deep reinforcement
learning (DRL) offers a promising model-free approach for controlling bipedal
locomotion which can more fully exploit the dynamics. However, current results
in the machine learning literature are often based on ad-hoc simulation models
that are not based on corresponding hardware. Thus it remains unclear how well
DRL will succeed on realizable bipedal robots. In this paper, we demonstrate
the effectiveness of DRL using a realistic model of Cassie, a bipedal robot. By
formulating a feedback control problem as finding the optimal policy for a
Markov Decision Process, we are able to learn robust walking controllers that
imitate a reference motion with DRL. Controllers for different walking speeds
are learned by imitating simple time-scaled versions of the original reference
motion. Controller robustness is demonstrated through several challenging
tests, including sensory delay, walking blindly on irregular terrain and
unexpected pushes at the pelvis. We also show we can interpolate between
individual policies and that robustness can be improved with an interpolated
policy.
Comment: 6 pages, 4 figures, accepted for IROS 2018.
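As a hedged sketch of the reference-imitation pattern such controllers commonly use (the weights, kernel scales, and time-scaling scheme below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def imitation_reward(q, q_ref, qd, qd_ref, w_pose=0.7, w_vel=0.3):
    """Reward the policy for tracking a reference motion: a weighted sum of
    Gaussian kernels on joint-pose and joint-velocity error.
    q, qd: measured joint positions/velocities; *_ref: reference values."""
    pose_err = np.sum((q - q_ref) ** 2)
    vel_err = np.sum((qd - qd_ref) ** 2)
    return w_pose * np.exp(-2.0 * pose_err) + w_vel * np.exp(-0.1 * vel_err)

def scaled_reference(ref_traj, t, speed_scale, dt=0.03):
    """Time-scaled version of the reference clip, used to learn a different
    walking speed: sample the same clip at a stretched phase variable."""
    phase = (t * speed_scale * dt) % 1.0   # normalized gait phase in [0, 1)
    idx = int(phase * (len(ref_traj) - 1))
    return ref_traj[idx]
```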
Graph-based State Representation for Deep Reinforcement Learning
Deep RL approaches build much of their success on the ability of the deep
neural network to generate useful internal representations. Nevertheless, they
suffer from high sample complexity, and starting with a good input
representation can have a significant impact on performance. In this paper,
we exploit the fact that the underlying Markov decision process (MDP)
represents a graph, which enables us to incorporate the topological information
for effective state representation learning.
Motivated by the recent success of node representations for several graph
analytical tasks, we specifically investigate the capability of node
representation learning methods to effectively encode the topology of the
underlying MDP in Deep RL. To this end, we perform a comparative analysis of
several models chosen from 4 different classes of representation learning
algorithms for policy learning in grid-world navigation tasks, which are
representative of a large class of RL problems. We find that all embedding
methods outperform the commonly used matrix representation of grid-world
environments in all of the studied cases. Moreover, graph convolution based
methods are outperformed by simpler random-walk-based methods and graph linear
autoencoders.
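Of the compared families, the graph linear autoencoder is the simplest to make concrete: its reconstruction optimum is a truncated SVD of the adjacency matrix, so the embeddings can be sketched with plain numpy (a minimal illustration of the idea, not the evaluated implementation):

```python
import numpy as np

def grid_adjacency(h, w):
    """4-connected grid-world as an adjacency matrix (states = cells)."""
    n = h * w
    A = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            for dr, dc in ((1, 0), (0, 1)):
                rr, cc = r + dr, c + dc
                if rr < h and cc < w:
                    A[i, rr * w + cc] = A[rr * w + cc, i] = 1.0
    return A

def linear_autoencoder_embeddings(A, dim=8):
    """A linear autoencoder that reconstructs A from low-rank codes has the
    truncated SVD as its optimum, so we compute it directly."""
    U, S, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])   # one row per state

emb = linear_autoencoder_embeddings(grid_adjacency(5, 5), dim=4)
# Feed emb[state_index] to the policy network instead of the raw matrix view.
```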
Coordinating Disaster Emergency Response with Heuristic Reinforcement Learning
A crucial and time-sensitive task when any disaster occurs is to rescue
victims and distribute resources to the right groups and locations. This task
is challenging in populated urban areas, due to the huge burst of help requests
generated in a very short period. To improve the efficiency of the emergency
response in the immediate aftermath of a disaster, we propose a heuristic
multi-agent reinforcement learning scheduling algorithm, named ResQ, which
can effectively schedule the rapid deployment of volunteers to rescue victims
in dynamic settings. The core concept is to quickly identify victims and
volunteers from social network data and then schedule rescue parties with an
adaptive learning algorithm. This framework performs two key functions: 1)
identify trapped victims and rescue volunteers, and 2) optimize the volunteers'
rescue strategy in a complex time-sensitive environment. The proposed ResQ
algorithm can speed up the training process through a heuristic function
which reduces the state-action space by identifying a preferred subset of
actions over others. Experimental results show that the proposed heuristic
multi-agent reinforcement learning based scheduling outperforms several
state-of-the-art methods in terms of both reward rate and response time.
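One common way to realize such a heuristic is action masking, sketched here under assumed inputs (the mask would encode domain rules such as a dispatch radius; this is an illustration of the mechanism, not the ResQ code):

```python
import numpy as np

def masked_greedy_action(q_values, feasible_mask):
    """Restrict the greedy choice to heuristically feasible actions, e.g.
    'dispatch only to victims within radius r'. This shrinks the effective
    state-action space the learner must cover, speeding up training."""
    q = np.where(feasible_mask, q_values, -np.inf)
    return int(np.argmax(q))

# Example: 5 candidate dispatch actions; the heuristic rules out two.
q = np.array([0.2, 1.4, -0.3, 0.9, 0.5])
mask = np.array([True, False, True, True, False])
print(masked_greedy_action(q, mask))  # -> 3 (best among feasible actions)
```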
Visual Imitation Learning with Recurrent Siamese Networks
It would be desirable for a reinforcement learning (RL) based agent to learn
behaviour by merely watching a demonstration. However, defining rewards that
facilitate this goal within the RL paradigm remains a challenge. Here we
address this problem with Siamese networks, trained to compute distances
between observed behaviours and the agent's behaviours. Given a desired motion,
such Siamese networks can be used to provide a reward signal to an RL agent via
the distance between the desired motion and the agent's motion. We experiment
with an RNN-based comparator model that can compute distances in space and time
between motion clips while training an RL policy to minimize this distance.
Through experimentation, we have also found that the inclusion of
multi-task data and an additional image encoding loss helps enforce
temporal consistency. These two components appear to balance reward for
matching a specific instance of behaviour versus that behaviour in general.
Furthermore, we focus here on a particularly challenging form of this problem
where only a single demonstration is provided for a given task -- the one-shot
learning setting. We demonstrate our approach on humanoid agents in both 2D
and 3D.
Comment: Preprint.
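A minimal sketch of the reward construction (the encoder sizes and the exponential kernel are assumptions for illustration, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class RecurrentSiameseEncoder(nn.Module):
    """Shared GRU encoder: the same weights embed both the demonstration
    clip and the agent's clip, so their distances are comparable."""
    def __init__(self, obs_dim, hidden=128, embed=32):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, embed)

    def forward(self, clip):                 # clip: (batch, time, obs_dim)
        _, h = self.rnn(clip)
        return self.head(h[-1])              # final hidden state -> embedding

def imitation_reward(encoder, demo_clip, agent_clip):
    """Reward grows as the agent's motion embedding approaches the demo's."""
    with torch.no_grad():
        d = torch.norm(encoder(demo_clip) - encoder(agent_clip), dim=-1)
    return torch.exp(-d)                     # in (0, 1]; 1 = perfect match
```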
Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
A grand goal in AI is to build a robot that can accurately navigate based on
natural language instructions, which requires the agent to perceive the scene,
understand and ground language, and act in the real-world environment. One key
challenge here is to learn to navigate in new environments that are unseen
during training. Most of the existing approaches perform dramatically worse in
unseen environments as compared to seen ones. In this paper, we present a
generalizable navigational agent. Our agent is trained in two stages. The first
stage is training via mixed imitation and reinforcement learning, combining the
benefits from both off-policy and on-policy optimization. The second stage is
fine-tuning via newly-introduced 'unseen' triplets (environment, path,
instruction). To generate these unseen triplets, we propose a simple but
effective 'environmental dropout' method to mimic unseen environments, which
overcomes the problem of limited seen environment variability. Next, we apply
semi-supervised learning (via back-translation) on these dropped-out
environments to generate new paths and instructions. Empirically, we show that
our agent is substantially better at generalizability when fine-tuned with
these triplets, outperforming the state-of-the-art approaches by a large
margin on the private unseen test set of the Room-to-Room task, and achieving
the top rank on the leaderboard.
Comment: NAACL 2019 (12 pages).
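A hedged sketch of the consistency property that distinguishes environmental dropout from ordinary dropout: one mask is sampled per synthetic environment and reused across all of its views (the feature dimension and drop rate below are illustrative assumptions):

```python
import numpy as np

def make_environment_dropout(feat_dim, drop_prob=0.5, seed=None):
    """Sample ONE mask per synthetic 'new' environment and reuse it for
    every viewpoint in that environment, so the distortion is consistent
    (per-sample dropout would merely add noise, not mimic a new world)."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(feat_dim) > drop_prob).astype(np.float32)
    scale = 1.0 / (1.0 - drop_prob)          # inverted-dropout rescaling

    def apply(features):                     # features: (..., feat_dim)
        return features * mask * scale
    return apply

# One dropped-out copy of the training environment per mask; back-translation
# then generates new (path, instruction) pairs inside each copy.
env_view = make_environment_dropout(feat_dim=2048, seed=0)
```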
Realizing Learned Quadruped Locomotion Behaviors through Kinematic Motion Primitives
Humans and animals are believed to use a very minimal set of trajectories to
perform a wide variety of tasks, including walking. Our main objective in this
paper is two-fold: 1) obtain an effective tool to realize these basic motion
patterns for quadrupedal walking, called kinematic motion primitives (kMPs),
via trajectories learned from deep reinforcement learning (D-RL), and 2)
realize a set of behaviors, namely trot, walk, gallop and bound, from these
kinematic motion primitives on our custom four-legged robot, called
`Stoch'. D-RL is a data-driven approach, which has been shown to be very
effective for realizing all kinds of robust locomotion behaviors, both in
simulation and in experiment. On the other hand, kMPs are known to capture the
underlying structure of walking and yield a set of derived behaviors. We first
generate walking gaits from D-RL, which uses policy gradient based approaches.
We then analyze the resulting walking by using principal component analysis. We
observe that the kMPs extracted from PCA followed a similar pattern
irrespective of the type of gaits generated. Leveraging this underlying
structure, we then realize walking in Stoch by a straightforward reconstruction
of joint trajectories from kMPs. This type of methodology improves the
transferability of these gaits to real hardware, lowers the computational
overhead on-board, and also avoids multiple training iterations by generating a
set of derived behaviors from a single learned gait.
Comment: Accepted by ICRA 2019. Supplementary video:
https://youtu.be/kiLKSqI4Kh
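The PCA step is straightforward to make concrete; a minimal numpy sketch (dimensions illustrative, not the paper's code):

```python
import numpy as np

def extract_kmps(joint_traj, n_primitives=4):
    """joint_traj: (timesteps, n_joints) gait recorded from the D-RL policy.
    PCA over the trajectory yields a small basis (the kMPs) plus per-step
    weights."""
    mean = joint_traj.mean(axis=0)
    centered = joint_traj - mean
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    kmps = Vt[:n_primitives]                 # (n_primitives, n_joints)
    weights = centered @ kmps.T              # (timesteps, n_primitives)
    return mean, kmps, weights

def reconstruct_gait(mean, kmps, weights):
    """Joint trajectories rebuilt from a handful of primitives: this compact
    representation is what gets transferred to the robot."""
    return mean + weights @ kmps
```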
Emergent Complexity via Multi-Agent Competition
Reinforcement learning algorithms can train agents that solve problems in
complex, interesting environments. Normally, the complexity of the trained
agent is closely related to the complexity of the environment. This suggests
that a highly capable agent requires a complex environment for training. In
this paper, we point out that a competitive multi-agent environment trained
with self-play can produce behaviors that are far more complex than the
environment itself. We also point out that such environments come with a
natural curriculum, because for any skill level, an environment full of agents
of this level will have the right level of difficulty. This work introduces
several competitive multi-agent environments where agents compete in a 3D world
with simulated physics. The trained agents learn a wide variety of complex and
interesting skills, even though the environments themselves are relatively
simple. The skills include behaviors such as running, blocking, ducking,
tackling, fooling opponents, kicking, and defending using both arms and legs. A
highlight of the learned behaviors can be found here: https://goo.gl/eR7fbX
Comment: Published as a conference paper at ICLR 2018.
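A minimal sketch of the self-play loop with opponent sampling that yields this natural curriculum (the agent and environment interfaces are assumed for illustration):

```python
import copy
import random

def self_play_training(agent, env, n_iters=1000, snapshot_every=50):
    """Train against a pool of past selves: each episode samples an older
    snapshot as the opponent, so the difficulty stays matched to the agent's
    current skill level (the 'natural curriculum')."""
    opponent_pool = [copy.deepcopy(agent)]
    for it in range(n_iters):
        opponent = random.choice(opponent_pool)
        trajectory = env.play_episode(agent, opponent)   # assumed env API
        agent.update(trajectory)                         # e.g. a PPO step
        if (it + 1) % snapshot_every == 0:
            opponent_pool.append(copy.deepcopy(agent))
    return agent
```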
Transfer Learning for Prosthetics Using Imitation Learning
In this paper, we apply reinforcement learning (RL) techniques to train a
realistic biomechanical model to work with different people and in different
walking environments. We benchmark three RL algorithms: Deep Deterministic
Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO) and Proximal
Policy Optimization (PPO) in the OpenSim environment. We also apply imitation
learning to a prosthetics domain to reduce the training time needed to design
customized prosthetics. We use the DDPG algorithm to train an original expert
agent. We then propose a modification to the Dataset Aggregation (DAgger)
algorithm to reuse the expert knowledge and train a new target agent to
replicate that behaviour in fewer than 5 iterations, compared to the 100
iterations taken by the expert agent, reducing training time by 95%.
Our modifications to the DAgger algorithm improve the balance between
exploiting the expert policy and exploring the environment. We show empirically
that these improve the convergence time of the target agent, particularly when
there is some degree of variation between the expert and naive agent.
Comment: Workshop paper, Black in AI, NeurIPS 2018.
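A hedged sketch of the classic DAgger mixture with a decaying expert coefficient, the kind of exploit/explore balance the modification targets (the expert, learner, and env interfaces are assumed; this is not the authors' exact modification):

```python
import random

def dagger_with_decay(expert, learner, env, n_iters=5, beta0=1.0, decay=0.5):
    """Each iteration rolls out a mixture policy: with probability beta act
    with the expert (exploit its knowledge), otherwise with the learner
    (explore states the learner will actually visit). The expert labels every
    visited state, and beta decays so control hands over quickly."""
    dataset, beta = [], beta0
    for _ in range(n_iters):
        state, done = env.reset(), False     # assumed environment API
        while not done:
            action = expert.act(state) if random.random() < beta \
                     else learner.act(state)
            dataset.append((state, expert.act(state)))   # expert label
            state, done = env.step(action)
        learner.fit(dataset)                 # supervised update on all labels
        beta *= decay                        # shift control to the learner
    return learner
```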
Planning to Explore via Self-Supervised World Models
Reinforcement learning allows solving complex tasks; however, the learning
tends to be task-specific and sample efficiency remains a challenge. We
present Plan2Explore, a self-supervised reinforcement learning agent that
tackles both these challenges through a new approach to self-supervised
exploration and fast adaptation to new tasks, which need not be known during
exploration. During exploration, unlike prior methods which retrospectively
compute the novelty of observations after the agent has already reached them,
our agent acts efficiently by leveraging planning to seek out expected future
novelty. After exploration, the agent quickly adapts to multiple downstream
tasks in a zero-shot or few-shot manner. We evaluate on challenging control
tasks
from high-dimensional image inputs. Without any training supervision or
task-specific interaction, Plan2Explore outperforms prior self-supervised
exploration methods and, in fact, almost matches the performance of an oracle
which
has access to rewards. Videos and code at
https://ramanans1.github.io/plan2explore/
Comment: Accepted at ICML 2020.
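Plan2Explore's expected-novelty signal is the disagreement of an ensemble of learned one-step models; a minimal sketch of that intrinsic reward (the ensemble interface is assumed for illustration):

```python
import numpy as np

def disagreement_reward(ensemble, state, action):
    """Intrinsic reward = variance across an ensemble of learned one-step
    models. High disagreement marks transitions whose dynamics are still
    unknown, and a planner can chase this signal BEFORE visiting them,
    rather than scoring novelty retrospectively."""
    preds = np.stack([m.predict(state, action) for m in ensemble])
    return preds.var(axis=0).mean()   # mean variance across feature dims
```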