Unsupervised adaptation of brain machine interface decoders
The performance of neural decoders can degrade over time due to
nonstationarities in the relationship between neuronal activity and behavior.
In this case, brain-machine interfaces (BMIs) require adaptation of their
decoders to maintain high performance over time. One way to achieve this is
to use periodic calibration phases, during which the BMI system (or an
external human demonstrator) instructs the user to perform certain movements or
behaviors. This approach has two disadvantages: (i) calibration phases
interrupt the autonomous operation of the BMI, and (ii) between two calibration
phases the BMI's performance may not remain stable but may continuously
decrease. A better alternative would be a BMI decoder that continuously adapts
in an unsupervised manner during autonomous BMI operation, i.e., without
knowing the user's movement intentions.
In this article, we present an efficient method for such unsupervised
training of BMI systems for continuous movement control. The proposed method
utilizes a cost function derived from neuronal recordings, which guides a
learning algorithm to evaluate the decoding parameters. We verify the
performance of our adaptive method by simulating a BMI user with an optimal
feedback control model and its interaction with our adaptive BMI decoder. The
simulation results show that the cost function and the algorithm yield fast and
precise trajectories towards targets at random orientations on a 2-dimensional
computer screen. For initially unknown and non-stationary tuning parameters,
our unsupervised method is still able to generate precise trajectories and to
keep its performance stable in the long term. The algorithm can optionally
also use neuronal error signals instead of, or in conjunction with, the
proposed unsupervised adaptation.
Comment: 28 pages, 13 figures, submitted to Frontiers in Neuroprosthetics
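The abstract leaves the cost function and the learning algorithm unspecified, so the following is only a minimal sketch of the adaptation loop it describes: a linear decoder whose parameters are updated online by gradient descent on a cost computed purely from neural recordings. The placeholder cost, the Poisson-simulated firing rates, and the learning rate are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_dims = 20, 2
W = rng.normal(scale=0.1, size=(n_dims, n_neurons))  # linear decoder weights

def decode(W, rates):
    """Map firing rates to a 2-D cursor velocity."""
    return W @ rates

def neural_cost(W, rates):
    """Placeholder unsupervised cost computed from neural data alone; a
    stand-in for the (unspecified) cost function derived in the paper."""
    v = decode(W, rates)
    target_speed = rates.mean()  # illustrative rate-derived proxy signal
    return (np.linalg.norm(v) - target_speed) ** 2

def adapt_step(W, rates, lr=1e-3, eps=1e-5):
    """One unsupervised update: numerical gradient descent on the cost."""
    grad = np.zeros_like(W)
    base = neural_cost(W, rates)
    for idx in np.ndindex(*W.shape):
        W_pert = W.copy()
        W_pert[idx] += eps
        grad[idx] = (neural_cost(W_pert, rates) - base) / eps
    return W - lr * grad

# Continuous adaptation during (simulated) autonomous BMI operation:
# no calibration phases, no knowledge of the user's intended movement.
for t in range(1000):
    rates = rng.poisson(lam=5.0, size=n_neurons).astype(float)
    W = adapt_step(W, rates)
```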
Never Stop Learning: The Effectiveness of Fine-Tuning in Robotic Reinforcement Learning
One of the great promises of robot learning systems is that they will be able
to learn from their mistakes and continuously adapt to ever-changing
environments. Despite this potential, most robot learning systems today are
deployed as fixed policies that are not adapted after deployment. Can we
efficiently adapt previously learned behaviors to new
environments, objects and percepts in the real world? In this paper, we present
a method and empirical evidence towards a robot learning framework that
facilitates continuous adaptation. In particular, we demonstrate how to adapt
vision-based robotic manipulation policies to new variations by fine-tuning via
off-policy reinforcement learning, including changes in background, object
shape and appearance, lighting conditions, and robot morphology. Further, this
adaptation uses less than 0.2% of the data necessary to learn the task from
scratch. We find that our approach of adapting pre-trained policies leads to
substantial performance gains over the course of fine-tuning, and that
pre-training via RL is essential: training from scratch or adapting from
supervised ImageNet features are both unsuccessful with such small amounts of
data. We also find that these positive results hold in a limited continual
learning setting, in which we repeatedly fine-tune a single lineage of policies
using data from a succession of new tasks. Our empirical conclusions are
consistently supported by experiments on simulated manipulation tasks, and by
52 unique fine-tuning experiments on a real robotic grasping system pre-trained
on 580,000 grasps.
Comment: 8.5 pages, 9 figures. See video overview and experiments at https://youtu.be/pPDVewcSpdc and project website at https://ryanjulian.me/continual-fine-tunin
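As a rough illustration of the fine-tuning recipe described above (the paper builds on QT-Opt), here is a hedged PyTorch sketch: a pretrained Q-network is loaded and updated with off-policy TD learning on small batches of new-environment data. The architecture, the checkpoint path, the 64x64 input size, and the random-sampling maximization over next actions (a crude stand-in for QT-Opt's CEM optimizer) are all assumptions.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Hypothetical vision-based Q-network for 64x64 RGB observations."""
    def __init__(self, action_dim=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(), nn.Flatten())
        self.head = nn.Sequential(
            nn.Linear(32 * 13 * 13 + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs, act):
        return self.head(torch.cat([self.conv(obs), act], dim=-1)).squeeze(-1)

q = QNet()
# q.load_state_dict(torch.load("pretrained_grasping_q.pt"))  # assumed checkpoint
opt = torch.optim.Adam(q.parameters(), lr=1e-4)  # small LR for fine-tuning

def fine_tune_step(obs, act, reward, next_obs, done, gamma=0.9, n_samples=16):
    """One off-policy TD update on a small batch from the new environment."""
    with torch.no_grad():
        cand = torch.rand(n_samples, obs.shape[0], 4) * 2 - 1  # candidate actions
        next_q = torch.stack([q(next_obs, a) for a in cand]).max(dim=0).values
        target = reward + gamma * (1 - done) * next_q
    loss = nn.functional.mse_loss(q(obs, act), target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```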
Collaborative Evolutionary Reinforcement Learning
Deep reinforcement learning algorithms have been successfully applied to a
range of challenging control tasks. However, these methods typically struggle
with achieving effective exploration and are extremely sensitive to the choice
of hyperparameters. One reason is that most approaches use a noisy version of
their operating policy to explore - thereby limiting the range of exploration.
In this paper, we introduce Collaborative Evolutionary Reinforcement Learning
(CERL), a scalable framework that comprises a portfolio of policies that
simultaneously explore and exploit diverse regions of the solution space. A
collection of learners - typically proven algorithms like TD3 - optimize over
varying time-horizons leading to this diverse portfolio. All learners
contribute to and use a shared replay buffer to achieve greater sample
efficiency. Computational resources are dynamically distributed to favor the
best learners as a form of online algorithm selection. Neuroevolution binds
this entire process to generate a single emergent learner that exceeds the
capabilities of any individual learner. Experiments in a range of continuous
control benchmarks demonstrate that the emergent learner significantly
outperforms its composite learners while remaining overall more
sample-efficient - notably solving the Mujoco Humanoid benchmark where all of
its composite learners (TD3) fail entirely in isolation.
Comment: Added link to public GitHub repo. Minor editorial changes. Order of authors modified to reflect ICML submission
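Two of the mechanisms named in the abstract, the shared replay buffer and the dynamic distribution of compute toward the best learners, can be sketched as follows; the softmax weighting stands in for the paper's actual allocation rule and is an assumption.

```python
import math
import random
from collections import deque

class SharedReplayBuffer:
    """Filled by all learners and the evolved population; sampled by every
    learner, which is where the shared sample efficiency comes from."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def allocate_workers(recent_returns, total_workers=10):
    """Online algorithm selection sketch: give more rollout workers to
    learners with higher recent return."""
    weights = [math.exp(r) for r in recent_returns]
    z = sum(weights)
    alloc = [int(total_workers * w / z) for w in weights]
    best = max(range(len(alloc)), key=lambda i: recent_returns[i])
    alloc[best] += total_workers - sum(alloc)  # hand rounding remainder to the best
    return alloc

# e.g. two TD3 learners optimizing over different time-horizons
print(allocate_workers([1.2, 0.4]))  # most workers go to learner 0
```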
Multi-task Deep Reinforcement Learning with PopArt
The reinforcement learning community has made great strides in designing
algorithms capable of exceeding human performance on specific tasks. These
algorithms are mostly trained on one task at a time, with each new task
requiring training a brand-new agent instance. This means the learning algorithm is general,
but each solution is not; each agent can only solve the one task it was trained
on. In this work, we study the problem of learning to master not one but
multiple sequential-decision tasks at once. A general issue in multi-task
learning is that a balance must be found between the needs of multiple tasks
competing for the limited resources of a single learning system. Many learning
algorithms can get distracted by certain tasks in the set of tasks to solve.
Such tasks appear more salient to the learning process, for instance because of
the density or magnitude of the in-task rewards. This causes the algorithm to
focus on those salient tasks at the expense of generality. We propose to
automatically adapt the contribution of each task to the agent's updates, so
that all tasks have a similar impact on the learning dynamics. This resulted in
state-of-the-art performance on learning to play all games in a set of 57
diverse Atari games. Excitingly, our method learned a single trained policy -
with a single set of weights - that exceeds median human performance. To our
knowledge, this was the first time a single agent surpassed human-level
performance on this multi-task domain. The same approach also demonstrated
state-of-the-art performance on a set of 30 tasks in the 3D reinforcement
learning platform DeepMind Lab.
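The adaptive rescaling the abstract refers to is PopArt: value targets are normalized per task with running statistics, and whenever those statistics change, the value head's weights and bias are rescaled so its unnormalized predictions are preserved. Below is a minimal NumPy sketch following the published update rule; the step size beta and the variance floor are assumed values.

```python
import numpy as np

class PopArt:
    """Per-task adaptive normalization of value targets (PopArt-style)."""
    def __init__(self, n_tasks, beta=3e-4):
        self.mu = np.zeros(n_tasks)   # running mean of returns per task
        self.nu = np.ones(n_tasks)    # running second moment per task
        self.beta = beta

    def sigma(self):
        return np.sqrt(np.maximum(self.nu - self.mu ** 2, 1e-4))

    def update(self, task, target, w, b):
        """Update one task's statistics, then rescale that task's output
        head so unnormalized value predictions are unchanged."""
        old_mu, old_sigma = self.mu[task], self.sigma()[task]
        self.mu[task] += self.beta * (target - self.mu[task])
        self.nu[task] += self.beta * (target ** 2 - self.nu[task])
        new_sigma = self.sigma()[task]
        w[task] *= old_sigma / new_sigma                      # preserve outputs
        b[task] = (old_sigma * b[task] + old_mu - self.mu[task]) / new_sigma
        return w, b

    def normalize(self, task, target):
        """Normalized target: every task now contributes at a similar scale."""
        return (target - self.mu[task]) / self.sigma()[task]
```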
Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control
We propose a plan online and learn offline (POLO) framework for the setting
where an agent, with an internal model, needs to continually act and learn in
the world. Our work builds on the synergistic relationship between local
model-based control, global value function learning, and exploration. We study
how local trajectory optimization can cope with approximation errors in the
value function, and can stabilize and accelerate value function learning.
Conversely, we also study how approximate value functions can help reduce the
planning horizon and allow for better policies beyond local solutions. Finally,
we also demonstrate how trajectory optimization can be used to perform
temporally coordinated exploration in conjunction with estimating uncertainty
in value function approximation. This exploration is critical for fast and
stable learning of the value function. Combining these components enables
solutions to complex simulated control tasks, like humanoid locomotion and
dexterous in-hand manipulation, in the equivalent of a few minutes of
experience in the real world.
Comment: The first two authors contributed equally. Accepted at ICLR 2019. Supplementary videos available at: https://sites.google.com/view/polo-mp
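The core loop the abstract describes, local trajectory optimization with a learned value function supplying the terminal score so that short planning horizons suffice, can be sketched with random-shooting MPC. The 2-D toy action space, the candidate count, and the stand-in models passed as arguments are assumptions, not the paper's setup.

```python
import numpy as np

def plan(state, dynamics, reward, value_fn, horizon=10, n_candidates=64, rng=None):
    """Random-shooting MPC: roll short action sequences through the internal
    model and score them by reward-to-go plus a learned terminal value, which
    stands in for everything beyond the planning horizon."""
    rng = rng or np.random.default_rng()
    best_score, best_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 2))
        s, score = state, 0.0
        for a in actions:
            score += reward(s, a)
            s = dynamics(s, a)
        score += value_fn(s)  # approximate value shortens the needed horizon
        if score > best_score:
            best_score, best_action = score, actions[0]
    return best_action

# Toy usage: steer a 2-D point toward the origin.
dyn = lambda s, a: s + 0.1 * a
rew = lambda s, a: -float(np.linalg.norm(s))
val = lambda s: -10.0 * float(np.linalg.norm(s))  # stand-in for the learned V
a0 = plan(np.array([1.0, -1.0]), dyn, rew, val)
```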
Adaptive Online Planning for Continual Lifelong Learning
We study learning control in an online reset-free lifelong learning scenario,
where mistakes can compound catastrophically into the future and the underlying
dynamics of the environment may change. Traditional model-free policy learning
methods have achieved successes in difficult tasks due to their broad
flexibility, but struggle in this setting, as they can activate failure modes
early in their lifetimes that are difficult to recover from, and they face
performance degradation as the dynamics change. On the other hand, model-based
planning methods learn and adapt quickly, but require prohibitive levels of
computational resources. We present a new algorithm, Adaptive Online Planning
(AOP), that achieves strong performance in this setting by combining
model-based planning with model-free learning. By approximating the uncertainty
of the model-free components and the planner performance, AOP is able to call
upon more extensive planning only when necessary, leading to reduced
computation times, while still gracefully adapting behaviors in the face of
unpredictable changes in the world -- even when traditional RL fails.
Comment: Originally published in NeurIPS Deep RL 201
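The gating idea, invoking extensive planning only when the model-free components look unreliable, can be sketched in a few lines. The ensemble-disagreement trigger and threshold below are illustrative proxies for the paper's uncertainty estimates.

```python
import numpy as np

def act(state, policy, planner, value_ensemble, threshold=0.5):
    """Call the full trajectory optimizer only under high uncertainty;
    otherwise fall back to the cheap model-free policy."""
    values = np.array([v(state) for v in value_ensemble])
    if values.std() > threshold:  # high disagreement: don't trust the policy
        return planner(state)     # extensive (expensive) planning
    return policy(state)          # cheap model-free action
```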
A Lifelong Learning Approach to Mobile Robot Navigation
This paper presents a self-improving lifelong learning framework for a mobile
robot navigating in different environments. Classical static navigation methods
require environment-specific in-situ system adjustment, e.g. from human
experts, or may repeat their mistakes regardless of how many times they have
navigated in the same environment. Having the potential to improve with
experience, learning-based navigation is highly dependent on access to training
resources, e.g. sufficient memory and fast computation, and is prone to
forgetting previously learned capability, especially when facing different
environments. In this work, we propose Lifelong Learning for Navigation (LLfN)
which (1) improves a mobile robot's navigation behavior purely based on its own
experience, and (2) retains the robot's capability to navigate in previous
environments after learning in new ones. LLfN is implemented and tested
entirely onboard a physical robot with a limited memory and computation budget.
Comment: Accepted by IEEE Robotics and Automation Letters (RA-L)
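The abstract names two requirements: learning from the robot's own experience and retaining earlier navigation skill under a tight memory budget. A minimal rehearsal-style sketch of that trade-off follows; the network sizes, buffer handling, and loss weighting are assumptions rather than the LLfN implementation.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
memory = []  # a few (obs, act) exemplars per previous environment

def update(new_obs, new_act, lam=0.5):
    """Fit the robot's own recent experience while rehearsing a stored
    exemplar from an earlier environment to resist forgetting."""
    loss = nn.functional.mse_loss(policy(new_obs), new_act)
    if memory:
        i = torch.randint(len(memory), (1,)).item()
        old_obs, old_act = memory[i]
        loss = loss + lam * nn.functional.mse_loss(policy(old_obs), old_act)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```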
IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse
Humans have the ability to reuse previously learned policies to solve new
tasks quickly, and reinforcement learning (RL) agents can do the same by
transferring knowledge from source policies to a related target task. Transfer
RL methods can reshape the policy optimization objective (optimization
transfer) or influence the behavior policy (behavior transfer) using source
policies. However, selecting the appropriate source policy with limited samples
to guide target policy learning has been a challenge. Previous methods
introduce additional components, such as hierarchical policies or estimations
of source policies' value functions, which can lead to non-stationary policy
optimization or heavy sampling costs, diminishing transfer effectiveness. To
address this challenge, we propose a novel transfer RL method that selects the
source policy without training extra components. Our method utilizes the Q
function in the actor-critic framework to guide policy selection, choosing the
source policy with the largest one-step improvement over the current target
policy. We integrate optimization transfer and behavior transfer (IOB) by
regularizing the learned policy to mimic the guidance policy and combining them
as the behavior policy. This integration significantly enhances transfer
effectiveness, surpassing state-of-the-art transfer RL baselines in benchmark
tasks and improving both final performance and knowledge transferability in
continual learning scenarios. Additionally, we show that our optimization
transfer technique is guaranteed to improve target policy learning.
Comment: 26 pages, 9 figures
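The selection rule stated in the abstract, pick the source policy whose action the current critic scores highest, plus the regularization that ties optimization and behavior transfer together, can be sketched as follows. The call signatures and the weighting lam are assumed interfaces, not the paper's code.

```python
import torch

def select_guidance(state, source_policies, target_policy, q_fn):
    """Choose the policy (source or current target) whose proposed action
    has the highest Q-value under the current critic: the largest one-step
    improvement over the current target policy."""
    candidates = list(source_policies) + [target_policy]
    actions = [pi(state) for pi in candidates]
    q_vals = torch.stack([q_fn(state, a) for a in actions])
    best = int(torch.argmax(q_vals))
    return candidates[best], actions[best]

def policy_loss(state, target_policy, guidance_action, q_fn, lam=0.1):
    """Optimization-transfer sketch: maximize Q while regularizing the
    learned policy to mimic the guidance policy's action."""
    a = target_policy(state)
    return -q_fn(state, a) + lam * torch.nn.functional.mse_loss(a, guidance_action)
```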
Progressive Neural Networks
Learning to solve complex sequences of tasks--while both leveraging transfer
and avoiding catastrophic forgetting--remains a key obstacle to achieving
human-level intelligence. The progressive networks approach represents a step
forward in this direction: they are immune to forgetting and can leverage prior
knowledge via lateral connections to previously learned features. We evaluate
this architecture extensively on a wide variety of reinforcement learning tasks
(Atari and 3D maze games), and show that it outperforms common baselines based
on pretraining and finetuning. Using a novel sensitivity measure, we
demonstrate that transfer occurs at both low-level sensory and high-level
control layers of the learned policy.
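A minimal sketch of the architecture: each new task gets a fresh column, earlier columns are frozen (hence no forgetting), and lateral connections let the new column reuse their features. Dense layers stand in for the paper's convolutional columns, and the single-lateral placement is a simplification.

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, n_prev_columns):
        super().__init__()
        self.l1 = nn.Linear(in_dim, hidden)
        self.l2 = nn.Linear(hidden, out_dim)
        # one lateral adapter per previously trained (frozen) column
        self.laterals = nn.ModuleList(
            nn.Linear(hidden, out_dim, bias=False) for _ in range(n_prev_columns))

    def forward(self, x, prev_hiddens):
        h = torch.relu(self.l1(x))
        out = self.l2(h)
        for lat, ph in zip(self.laterals, prev_hiddens):
            out = out + lat(ph)  # transfer via frozen prior features
        return out

# Column 2 reuses column 1's (frozen) hidden features through its lateral.
col1 = ProgressiveColumn(4, 16, 2, n_prev_columns=0)
col2 = ProgressiveColumn(4, 16, 2, n_prev_columns=1)
x = torch.randn(1, 4)
h1 = torch.relu(col1.l1(x)).detach()  # frozen: no gradients reach column 1
out2 = col2(x, [h1])
```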
Neural-encoding Human Experts' Domain Knowledge to Warm Start Reinforcement Learning
Deep reinforcement learning has been successful in a variety of tasks, such
as game playing and robotic manipulation. However, attempting to learn
\textit{tabula rasa} disregards the logical structure of many domains as well
as the wealth of readily available knowledge from domain experts that could
help "warm start" the learning process. We present a novel reinforcement
learning technique that allows for intelligent initialization of a neural
network's weights and architecture. Our approach permits encoding domain
knowledge directly into a neural decision tree, and improves upon that
knowledge with policy gradient updates. We empirically validate our approach on
two OpenAI Gym tasks and two modified StarCraft 2 tasks, showing that our novel
architecture outperforms multilayer-perceptron and recurrent architectures. Our
knowledge-based framework finds superior policies compared to imitation
learning-based and prior knowledge-based approaches. Importantly, we
demonstrate that our approach can be used by untrained humans to initially
provide a >80% increase in expected reward relative to baselines prior to
training (p < 0.001), which results in a >60% increase in expected reward after
policy optimization (p = 0.011).
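One way to read "encoding domain knowledge directly into a neural decision tree" is sketched below: an expert rule of the form "if feature[k] > t, go left" initializes the weights of a differentiable decision node, which policy-gradient updates can then refine. This encoding is an illustrative assumption, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class SoftDecisionNode(nn.Module):
    """Differentiable decision node warm-started from an expert rule
    'if x[k] > t: left else: right'."""
    def __init__(self, n_features, k, t, sharpness=10.0):
        super().__init__()
        w = torch.zeros(n_features)
        w[k] = sharpness              # focus the node on the expert's feature
        self.w = nn.Parameter(w)      # later refined by policy gradients
        self.b = nn.Parameter(torch.tensor(-sharpness * t))

    def forward(self, x):
        # Routing probability toward the 'left' branch; differentiable, so
        # RL can improve upon the initial expert knowledge.
        return torch.sigmoid(x @ self.w + self.b)

# Hypothetical rule: "if enemy_distance (feature 2) > 0.7, retreat (left)".
node = SoftDecisionNode(n_features=4, k=2, t=0.7)
x = torch.tensor([[0.1, 0.3, 0.9, 0.0]])
print(node(x))  # high probability: routes to the expert's branch before training
```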