System Design for an Integrated Lifelong Reinforcement Learning Agent for Real-Time Strategy Games
As Artificial and Robotic Systems are increasingly deployed and relied upon
for real-world applications, it is important that they exhibit the ability to
continually learn and adapt in dynamically-changing environments, becoming
Lifelong Learning Machines. Continual/lifelong learning (LL) involves
minimizing catastrophic forgetting of old tasks while maximizing a model's
capability to learn new tasks. This paper addresses the challenging lifelong
reinforcement learning (L2RL) setting. Pushing the state-of-the-art forward in
L2RL and making L2RL useful for practical applications requires more than
developing individual L2RL algorithms; it requires making progress at the
systems-level, especially research into the non-trivial problem of how to
integrate multiple L2RL algorithms into a common framework. In this paper, we
introduce the Lifelong Reinforcement Learning Components Framework (L2RLCF),
which standardizes L2RL systems and assimilates different continual learning
components (each addressing different aspects of the lifelong learning problem)
into a unified system. As an instantiation of L2RLCF, we develop a standard API
allowing easy integration of novel lifelong learning components. We describe a
case study that demonstrates how multiple independently-developed LL components
can be integrated into a single realized system. We also introduce an
evaluation environment in order to measure the effect of combining various
system components. Our evaluation environment employs different LL scenarios
(sequences of tasks) consisting of StarCraft II minigames and allows for the
fair, comprehensive, and quantitative comparison of different combinations of
components within a challenging common evaluation environment.Comment: The Second International Conference on AIML Systems, October 12--15,
2022, Bangalore, Indi
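The paper's actual API is not reproduced in the abstract; the following is a minimal sketch of what a plug-in interface for lifelong-learning components could look like. All class and hook names are hypothetical illustrations, not L2RLCF's real interface.

```python
# Hypothetical sketch of a plug-in interface for lifelong-RL components.
# All names (LLComponent, ComponentStack, the hook methods) are illustrative
# assumptions, not the paper's actual API.
from abc import ABC, abstractmethod


class LLComponent(ABC):
    """A continual-learning component that hooks into the agent loop."""

    @abstractmethod
    def on_task_start(self, task_id: int) -> None: ...

    @abstractmethod
    def on_transition(self, obs, action, reward, next_obs, done) -> None: ...

    @abstractmethod
    def loss_terms(self) -> dict:
        """Extra loss terms, e.g. a regularizer against forgetting."""


class ComponentStack:
    """Dispatches the same event stream to every registered component,
    so independently developed components compose without touching the
    agent's training loop."""

    def __init__(self, components: list[LLComponent]):
        self.components = components

    def on_task_start(self, task_id: int) -> None:
        for c in self.components:
            c.on_task_start(task_id)

    def on_transition(self, *transition) -> None:
        for c in self.components:
            c.on_transition(*transition)

    def loss_terms(self) -> dict:
        terms = {}
        for c in self.components:
            terms.update(c.loss_terms())
        return terms
```

The design point such a framework standardizes is that every component observes the same events and contributes losses additively, which is what makes a case study combining independently developed components feasible.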
Lifelong Reinforcement Learning On Mobile Robots
Machine learning has shown tremendous growth in the past decades, unlocking new capabilities in a variety of fields including computer vision, natural language processing, and robotic control. While the sophistication of individual problems a learning system can handle has greatly advanced, the ability of a system to extend beyond an individual problem to adapt and solve new problems has progressed more slowly. This thesis explores the problem of progressive learning. The goal is to develop methodologies that accumulate, transfer, and adapt knowledge in applied settings where the system is faced with the ambiguity and resource limitations of operating in the physical world.
There are undoubtedly many challenges to designing such a system; this thesis examines the part of the problem concerned with how knowledge from previous tasks can benefit an agent in the reinforcement learning setting, where the agent receives rewards for positive actions. Reinforcement learning is particularly difficult when training on physical systems, such as mobile robots, where repeated trials can damage the system and unrestricted exploration often carries safety risks. I investigate how knowledge can be efficiently accumulated and applied to future reinforcement learning problems on mobile robots in order to reduce sample complexity and enable systems to adapt to novel settings. Doing so involves mathematical models that combine knowledge from multiple tasks, methods for restructuring optimization and data collection to handle sequential updates, and data selection strategies that address resource limitations.
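The abstract names data selection under resource limits as one ingredient without specifying a rule; one standard bounded-memory strategy is reservoir sampling, sketched below purely as an illustration of the genre (it is not claimed to be the thesis's method).

```python
import random


class ReservoirBuffer:
    """Keeps a uniform random sample of a data stream under a fixed memory
    budget, a common data-selection strategy when storage is limited."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, transition) -> None:
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            # Replace a stored item with probability capacity / n_seen,
            # keeping every item seen so far equally likely to remain.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = transition


buf = ReservoirBuffer(capacity=3)
for t in range(10):
    buf.add(("obs", t))
print(buf.data)  # a uniform sample of the 10 transitions
```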
Task-Agnostic Continual Reinforcement Learning: In Praise of a Simple Baseline
We study methods for task-agnostic continual reinforcement learning (TACRL).
TACRL is a setting that combines the difficulties of partially-observable RL (a
consequence of task agnosticism) and the difficulties of continual learning
(CL), i.e., learning on a non-stationary sequence of tasks. We compare TACRL
methods with their soft upper bounds prescribed by previous literature:
multi-task learning (MTL) methods which do not have to deal with non-stationary
data distributions, as well as task-aware methods, which are allowed to operate
under full observability. We consider a previously unexplored and
straightforward baseline for TACRL, replay-based recurrent RL (3RL), in which
we augment an RL algorithm with recurrent mechanisms to mitigate partial observability and with experience replay mechanisms to mitigate catastrophic forgetting in CL.
By studying empirical performance on a sequence of RL tasks, we find, surprisingly, that 3RL can match and even surpass its MTL and task-aware soft upper bounds. We lay out hypotheses that could explain this inflection point for continual and task-agnostic learning research. Our hypotheses are
empirically tested in continuous control tasks via a large-scale study of the
popular multi-task and continual learning benchmark Meta-World. By analyzing
different training statistics including gradient conflict, we find evidence
that 3RL's outperformance stems from its ability to quickly infer how new tasks
relate to previous ones, enabling forward transfer.
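Since 3RL is described as a straightforward combination of two standard mechanisms, a minimal sketch is easy to give. The sketch below assumes a PyTorch-style recurrent actor and elides the base RL algorithm and training loop.

```python
import torch
import torch.nn as nn


class RecurrentActor(nn.Module):
    """A GRU encoder over observation histories mitigates partial
    observability; training on replayed sequences from earlier tasks
    mitigates catastrophic forgetting."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq, h0=None):
        out, h = self.gru(obs_seq, h0)   # (batch, time, hidden)
        return self.head(out), h         # action logits per step


# Experience replay over *sequences*: old-task trajectories keep appearing
# in the training batches, which is what counters forgetting.
replay = []                              # sequences from all tasks so far
actor = RecurrentActor(obs_dim=8, act_dim=4)
seq = torch.randn(16, 20, 8)             # 16 trajectories, 20 steps each
replay.append(seq)
logits, _ = actor(torch.cat(replay))     # train on a mix of all tasks
```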
How to Reuse and Compose Knowledge for a Lifetime of Tasks: A Survey on Continual Learning and Functional Composition
A major goal of artificial intelligence (AI) is to create an agent capable of
acquiring a general understanding of the world. Such an agent would require the
ability to continually accumulate and build upon its knowledge as it encounters
new experiences. Lifelong or continual learning addresses this setting, whereby
an agent faces a continual stream of problems and must strive to capture the
knowledge necessary for solving each new task it encounters. If the agent is
capable of accumulating knowledge in some form of compositional representation,
it could then selectively reuse and combine relevant pieces of knowledge to
construct novel solutions. Despite the intuitive appeal of this simple idea,
the literatures on lifelong learning and compositional learning have proceeded
largely separately. In an effort to promote developments that bridge between
the two fields, this article surveys their respective research landscapes and
discusses existing and future connections between them.
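The compositional idea the survey centers on can be made concrete with a toy sketch. The components below are plain functions purely for illustration; the surveyed systems must learn both the components and how to compose them.

```python
# Toy illustration of functional composition: a library of learned
# components is reused in different orders to solve structurally
# related tasks.
library = {
    "scale": lambda x: 2.0 * x,
    "shift": lambda x: x + 1.0,
    "square": lambda x: x * x,
}


def compose(names):
    """Build a task solution by chaining components from the library."""
    def solution(x):
        for name in names:
            x = library[name](x)
        return x
    return solution


task_a = compose(["scale", "shift"])   # f(x) = 2x + 1
task_b = compose(["shift", "square"])  # g(x) = (x + 1)^2, reuses "shift"
print(task_a(3.0), task_b(3.0))        # 7.0 16.0
```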
Continual Task Allocation in Meta-Policy Network via Sparse Prompting
How can we train a generalizable meta-policy by continually learning a sequence of tasks? This is a natural human skill, yet it is challenging for current reinforcement learning: the agent is expected to quickly adapt to new tasks (plasticity) while retaining the common knowledge from previous tasks (stability). We address this with "Continual Task Allocation via Sparse Prompting
(CoTASP)", which learns over-complete dictionaries to produce sparse masks as
prompts extracting a sub-network for each task from a meta-policy network. By
optimizing the sub-network and prompts alternatively, CoTASP updates the
meta-policy via training a task-specific policy. The dictionary is then updated
to align the optimized prompts with tasks' embedding, thereby capturing their
semantic correlations. Hence, relevant tasks share more neurons in the
meta-policy network via similar prompts while cross-task interference causing
forgetting is effectively restrained. Given a trained meta-policy with updated
dictionaries, new task adaptation reduces to highly efficient sparse prompting
and sub-network finetuning. In experiments, CoTASP achieves a promising
plasticity-stability trade-off without storing or replaying any past tasks'
experiences, and it outperforms existing continual and multi-task RL methods in performance on all seen tasks, in forgetting reduction, and in generalization to unseen tasks.
Comment: Accepted by ICML 2023
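A rough sketch of the sub-network idea follows, with a toy stand-in for the sparse-coding step. CoTASP's actual dictionary learning and prompt optimization are more involved, so every function below is an illustrative assumption.

```python
import torch
import torch.nn as nn


class MaskedLinear(nn.Module):
    """A shared layer whose active neurons are selected per task by a
    sparse binary mask, so related tasks can share units while unrelated
    tasks use mostly disjoint ones."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)

    def forward(self, x, mask):
        # mask: (out_dim,) with entries in {0, 1}; zeroed rows are inactive.
        return x @ (self.weight * mask.unsqueeze(1)).T


def prompt_to_mask(task_embedding, dictionary, sparsity=0.5):
    """Toy stand-in for sparse coding: activate the neurons whose
    dictionary atoms best match the task embedding."""
    scores = dictionary @ task_embedding        # (out_dim,)
    k = int(sparsity * scores.numel())
    mask = torch.zeros_like(scores)
    mask[scores.topk(k).indices] = 1.0
    return mask


layer = MaskedLinear(in_dim=4, out_dim=8)
dictionary = torch.randn(8, 16)                 # one atom per neuron
mask = prompt_to_mask(torch.randn(16), dictionary)
out = layer(torch.randn(2, 4), mask)            # task-specific forward pass
```

Similar task embeddings yield overlapping masks, which is the mechanism by which semantically related tasks share neurons.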
The Effectiveness of World Models for Continual Reinforcement Learning
World models power some of the most efficient reinforcement learning
algorithms. In this work, we showcase that they can be harnessed for continual
learning, a setting in which the agent faces changing environments. World models
typically employ a replay buffer for training, which can be naturally extended
to continual learning. We systematically study how different selective
experience replay methods affect performance, forgetting, and transfer. We also
provide recommendations regarding various modeling options for using world
models. We call the best set of choices Continual-Dreamer; it is task-agnostic and utilizes the world model for continual exploration.
Continual-Dreamer is sample efficient and outperforms state-of-the-art
task-agnostic continual reinforcement learning methods on the MiniGrid and MiniHack benchmarks.
Comment: Accepted at CoLLAs 2023; 21 pages, 15 figures
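To illustrate what "selective experience replay" varies, here are two simple sampling strategies over a replay buffer. These are generic illustrations, not the specific selection methods the paper evaluates.

```python
import numpy as np

rng = np.random.default_rng(0)


def uniform_replay(buffer, batch_size):
    """Baseline: sample transitions uniformly from the whole buffer."""
    idx = rng.integers(len(buffer), size=batch_size)
    return [buffer[i] for i in idx]


def reward_prioritized_replay(buffer, batch_size):
    """One simple 'selective' variant: bias sampling toward high-reward
    transitions (illustrative; many other selection criteria exist)."""
    rewards = np.array([t["reward"] for t in buffer])
    p = np.exp(rewards) / np.exp(rewards).sum()  # softmax priorities
    idx = rng.choice(len(buffer), size=batch_size, p=p)
    return [buffer[i] for i in idx]


buffer = [{"obs": i, "reward": float(i % 5)} for i in range(100)]
batch = reward_prioritized_replay(buffer, batch_size=8)
```

The world model trains on whatever the sampler returns, so the choice of selection rule directly shapes forgetting and transfer.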
Offline Experience Replay for Continual Offline Reinforcement Learning
The capability to continually learn new skills from a sequence of pre-collected offline datasets is desirable for an agent. However, consecutively learning a sequence of offline tasks is likely to cause catastrophic forgetting under resource-limited scenarios. In this paper, we formulate a new setting, continual offline reinforcement learning (CORL), where an agent learns a sequence of offline reinforcement learning tasks and pursues good performance on all learned tasks with a small replay buffer, without exploring the environment of any of the sequential tasks. To learn consistently across all sequential tasks, an agent must acquire new knowledge while preserving old knowledge, entirely offline. To this end, we applied continual learning algorithms and experimentally found experience replay (ER) to be the most suitable algorithm for the CORL problem. However, we
observe that introducing ER into CORL encounters a new distribution shift
problem: the mismatch between the experiences in the replay buffer and
trajectories from the learned policy. To address such an issue, we propose a
new model-based experience selection (MBES) scheme to build the replay buffer,
where a transition model is learned to approximate the state distribution. This model is used to bridge the distribution gap between the replay buffer and the learned policy by selecting for storage the offline data that most closely resemble trajectories from the learned policy. Moreover, to enhance the agent's ability to learn new tasks, we retrofit the experience replay method with a new dual behavior cloning (DBC) architecture that prevents the behavior-cloning loss from disturbing the Q-learning process. In general, we call our
algorithm offline experience replay (OER). Extensive experiments demonstrate
that our OER method outperforms SOTA baselines in widely-used MuJoCo environments.
Comment: 9 pages, 4 figures
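The intuition behind MBES, retaining the offline data that best matches the current policy's behavior, can be sketched as follows. The scoring rule here is an illustrative simplification; OER's actual scheme scores candidates with a learned transition model.

```python
import numpy as np


def select_for_replay(offline_transitions, policy_action, keep: int):
    """Sketch of model-based experience selection: keep the offline
    transitions whose logged actions are closest to what the current
    policy would do in the same state, reducing the mismatch between
    the replay buffer and the learned policy (illustrative only)."""
    scores = []
    for t in offline_transitions:
        # distance between the logged action and the policy's action
        gap = np.linalg.norm(t["action"] - policy_action(t["state"]))
        scores.append(-gap)                  # higher score = closer match
    order = np.argsort(scores)[::-1]
    return [offline_transitions[i] for i in order[:keep]]


# Toy usage with a linear stand-in policy
policy = lambda s: 0.5 * s
data = [{"state": np.ones(3) * i, "action": np.ones(3) * (0.5 * i + n)}
        for i in range(5) for n in (0.0, 1.0)]
kept = select_for_replay(data, policy, keep=5)  # keeps the n == 0.0 entries
```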
Lifelong Machine Learning Of Functionally Compositional Structures
A hallmark of human intelligence is the ability to construct self-contained chunks of knowledge and reuse them in novel combinations for solving different yet structurally related problems. Learning such compositional structures has been a significant challenge for artificial systems, due to the underlying combinatorial search. To date, research into compositional learning has largely proceeded separately from work on lifelong or continual learning. This dissertation integrated these two lines of work to present a general-purpose framework for lifelong learning of functionally compositional structures. The framework separates the learning into two stages: learning how to best combine existing components to assimilate a novel problem, and learning how to adapt the set of existing components to accommodate the new problem. This separation explicitly handles the trade-off between the stability required to remember how to solve earlier tasks and the flexibility required to solve new tasks.

This dissertation instantiated the framework into various supervised and reinforcement learning (RL) algorithms. Empirical evaluations on a range of supervised learning benchmarks compared the proposed algorithms against well-established techniques, and found that 1) compositional models enable improved lifelong learning when the tasks are highly diverse by balancing the incorporation of new knowledge and the retention of past knowledge, 2) the separation of the learning into stages permits lifelong learning of compositional knowledge, and 3) the components learned by the proposed methods represent self-contained and reusable functions. Similar evaluations on existing and new RL benchmarks demonstrated that 1) algorithms under the framework accelerate the discovery of high-performing policies in a variety of domains, including robotic manipulation, and 2) these algorithms retain, and often improve, knowledge that enables them to solve tasks learned in the past.

The dissertation extended one lifelong compositional RL algorithm to the nonstationary setting, where the distribution over tasks varies over time, and found that modularity permits individually tracking changes to different elements in the environment. The final contribution of this dissertation was a new benchmark for evaluating approaches to compositional RL, which exposed that existing methods struggle to discover the compositional properties of the environment.
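The framework's two-stage separation can be sketched in a few lines. The component and combiner architectures below are stand-ins, not the dissertation's actual models.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the two-stage split (not the dissertation's exact
# algorithm): small shared component networks plus a per-task combiner.
components = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])
combiner = nn.Linear(3 * 4, 1)           # per-task combination of components


def forward(x):
    feats = torch.cat([c(x) for c in components], dim=-1)
    return combiner(feats)


def train(params, steps, lr):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        x = torch.randn(32, 4)
        y = x.sum(dim=-1, keepdim=True)  # stand-in regression task
        loss = nn.functional.mse_loss(forward(x), y)
        opt.zero_grad(); loss.backward(); opt.step()


# Stage 1 (assimilation): adapt only the combiner; frozen components
# preserve what earlier tasks learned (stability).
components.requires_grad_(False)
train(combiner.parameters(), steps=200, lr=1e-2)

# Stage 2 (accommodation): gently update the shared components to
# incorporate what the new task revealed (flexibility).
components.requires_grad_(True)
train(components.parameters(), steps=50, lr=1e-3)
```

Keeping the two stages distinct is what lets the learner trade stability against flexibility explicitly rather than entangling both in one optimization.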