
    Bridging adaptive management and reinforcement learning for more robust decisions

    From out-competing grandmasters in chess to informing high-stakes healthcare decisions, emerging methods from artificial intelligence are increasingly capable of making complex and strategic decisions in diverse, high-dimensional, and uncertain situations. But can these methods help us devise robust strategies for managing environmental systems under great uncertainty? Here we explore how reinforcement learning, a subfield of artificial intelligence, approaches decision problems through a lens similar to adaptive environmental management: learning through experience to gradually improve decisions with updated knowledge. We review where reinforcement learning (RL) holds promise for improving evidence-informed adaptive management decisions even when classical optimization methods are intractable. For example, model-free deep RL might help identify quantitative decision strategies even when models are nonidentifiable. Finally, we discuss technical and social issues that arise when applying reinforcement learning to adaptive management problems in the environmental domain. Our synthesis suggests that environmental management and computer science can learn from one another about the practices, promises, and perils of experience-based decision-making. (Comment: In press at Philosophical Transactions of the Royal Society)
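
    To make the analogy concrete, the toy loop below casts a classic adaptive-management problem, harvesting a population whose growth model is uncertain, as the same act-observe-update cycle an RL agent runs. The two rival models, the noise level, and the threshold rule are illustrative assumptions, not taken from the paper.

        import numpy as np

        rng = np.random.default_rng(0)
        growth_rates = np.array([1.1, 1.4])   # two rival models of population growth
        belief = np.array([0.5, 0.5])         # manager's prior over the two models
        pop = 100.0                           # managed population size

        for year in range(20):
            # Act: harvest more aggressively only if the fast-growth model is likely.
            harvest = 0.2 * pop if belief[1] > 0.6 else 0.05 * pop
            next_pop = growth_rates[1] * (pop - harvest)   # nature follows model 1
            obs = next_pop + rng.normal(0.0, 5.0)          # noisy monitoring data
            # Learn: Bayesian update of the belief given the observation.
            lik = np.exp(-0.5 * ((obs - growth_rates * (pop - harvest)) / 5.0) ** 2)
            belief = belief * lik / (belief * lik).sum()
            pop = next_pop

        print(belief)  # mass concentrates on the model consistent with experience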

    Tackling Non-Stationarity in Reinforcement Learning via Causal-Origin Representation

    In real-world scenarios, the application of reinforcement learning is significantly challenged by complex non-stationarity. Most existing methods attempt to model changes in the environment explicitly, often requiring impractical prior knowledge. In this paper, we propose a new perspective, positing that non-stationarity can propagate and accumulate through complex causal relationships during state transitions, thereby compounding its complexity and degrading policy learning. We believe that this challenge can be more effectively addressed by tracing the causal origin of non-stationarity. To this end, we introduce the Causal-Origin REPresentation (COREP) algorithm. COREP primarily employs a guided updating mechanism to learn a stable graph representation for states, termed the causal-origin representation. By leveraging this representation, the learned policy exhibits impressive resilience to non-stationarity. We supplement our approach with a theoretical analysis grounded in a causal interpretation of non-stationary reinforcement learning, supporting the validity of the causal-origin representation. Experimental results further demonstrate the superior performance of COREP over existing methods in tackling non-stationarity.
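
    The abstract does not spell out the architecture, so the sketch below is only one plausible reading of a "guided updating mechanism" over a graph representation: a fast graph encoder produces the state representation the policy consumes, while a slowly Polyak-averaged copy acts as the stabilizing guide. The module names, the soft adjacency, and the update rate tau are assumptions for illustration, not COREP's actual code.

        import torch
        import torch.nn as nn

        class GraphStateEncoder(nn.Module):
            """Encode a state vector via a soft adjacency over its dimensions."""
            def __init__(self, state_dim: int, embed_dim: int = 32):
                super().__init__()
                self.node_embed = nn.Linear(1, embed_dim)  # each state dim becomes a node
                self.out = nn.Linear(state_dim * embed_dim, embed_dim)

            def forward(self, state: torch.Tensor) -> torch.Tensor:
                nodes = self.node_embed(state.unsqueeze(-1))  # (batch, dim, embed)
                # Soft adjacency from pairwise similarities: a learned stand-in
                # for the causal graph among state variables.
                adj = torch.softmax(nodes @ nodes.transpose(1, 2), dim=-1)
                mixed = adj @ nodes                     # one hop of message passing
                return self.out(mixed.flatten(1))       # "causal-origin" representation

        def guided_update(fast: nn.Module, slow: nn.Module, tau: float = 0.005) -> None:
            """Polyak-average the fast encoder into its slow guide copy."""
            with torch.no_grad():
                for p_fast, p_slow in zip(fast.parameters(), slow.parameters()):
                    p_slow.mul_(1.0 - tau).add_(tau * p_fast)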

    An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable Context

    One of the key challenges in deploying RL to real-world applications is adapting to variations of unknown environment contexts, such as changing terrains in robotic tasks and fluctuating bandwidth in congestion control. Existing works on adaptation to unknown environment contexts either assume the context is the same for the whole episode or assume the context variables are Markovian. However, in many real-world applications, the environment context usually stays stable for a stochastic period and then changes in an abrupt and unpredictable manner within an episode, resulting in a segment structure, which existing works fail to address. To leverage the segment structure of piecewise stable context in real-world applications, in this paper we propose a Segmented Context Belief Augmented Deep (SeCBAD) RL method. Our method can jointly infer the belief distribution over the latent context with the posterior over segment length, and perform more accurate belief context inference with observed data within the current context segment. The inferred belief context can be leveraged to augment the state, leading to a policy that can adapt to abrupt variations in context. We demonstrate empirically that SeCBAD can infer context segment length accurately and outperform existing methods on a toy grid world environment and MuJoCo tasks with piecewise-stable context. (Comment: NeurIPS 202)
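
    The joint inference over segment length and context resembles Bayesian online changepoint detection. As a minimal sketch of that idea, the filter below maintains a posterior over the current segment length under a constant hazard rate: at each step the segment either grows by one or resets because the context changed abruptly. The recursion, the hazard value, and the stand-in predictive likelihood are assumptions; SeCBAD itself infers a learned latent context with variational methods.

        import numpy as np

        def update_segment_posterior(posterior: np.ndarray, pred_lik: np.ndarray,
                                     hazard: float = 0.05) -> np.ndarray:
            """One filtering step over segment length (index = current length)."""
            grow = posterior * pred_lik * (1.0 - hazard)    # segment continues
            change = (posterior * pred_lik * hazard).sum()  # abrupt context change
            new = np.concatenate(([change], grow))          # index 0 = fresh segment
            return new / new.sum()

        # Usage: condition the policy on a belief computed from the data that the
        # posterior says belongs to the current segment.
        posterior = np.array([1.0])
        rng = np.random.default_rng(0)
        for _ in range(10):
            pred_lik = rng.uniform(0.5, 1.0, size=posterior.shape)  # stand-in evidence
            posterior = update_segment_posterior(posterior, pred_lik)
        print(posterior.argmax())  # most probable current segment length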

    Reinforcement Learning in Presence of Discrete Markovian Context Evolution

    We consider a context-dependent Reinforcement Learning (RL) setting, which is characterized by: a) an unknown finite number of not directly observable contexts; b) abrupt (discontinuous) context changes occurring during an episode; and c) Markovian context evolution. We argue that this challenging case is often met in applications, and we tackle it using a Bayesian model-based approach and variational inference. We adapt a sticky Hierarchical Dirichlet Process (HDP) prior for model learning, which is arguably best-suited for infinite Markov chain modeling. We then derive a context distillation procedure, which identifies and removes spurious contexts in an unsupervised fashion. We argue that the combination of these two components allows the number of contexts to be inferred from data, thereby sidestepping the assumption that the context cardinality is known. We then derive a representation of the optimal policy that enables efficient policy learning using off-the-shelf RL algorithms. Finally, we demonstrate empirically (using the gym environments cart-pole swing-up, drone, and intersection) that our approach succeeds where state-of-the-art methods of other frameworks fail, and we elaborate on the reasons for such failures.
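
    As a hedged illustration of the "sticky" prior, the snippet below draws a context transition matrix from a truncated sticky-HDP-style construction: every row is a Dirichlet draw around a shared base measure, with extra concentration kappa on the self-transition, which biases the chain toward staying in its current context. The truncation to a finite K and the parameter values are simplifying assumptions, not the paper's inference procedure.

        import numpy as np

        def sticky_transition_matrix(K: int, alpha: float = 1.0, kappa: float = 10.0,
                                     rng=np.random.default_rng(0)) -> np.ndarray:
            beta = rng.dirichlet(np.full(K, alpha))          # shared base measure
            rows = []
            for j in range(K):
                conc = alpha * beta + kappa * np.eye(K)[j]   # self-transition bonus
                rows.append(rng.dirichlet(conc))
            return np.stack(rows)

        P = sticky_transition_matrix(K=5)
        print(np.diag(P))  # self-transition probabilities dominate each row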

    Tempo Adaptation in Non-stationary Reinforcement Learning

    We first raise and tackle a "time synchronization" issue between the agent and the environment in non-stationary reinforcement learning (RL), a crucial factor hindering its real-world applications. In reality, environmental changes occur over wall-clock time ($\mathfrak{t}$) rather than episode progress ($k$), where wall-clock time signifies the actual elapsed time within the fixed duration $\mathfrak{t} \in [0, T]$. In existing works, at episode $k$, the agent rolls out a trajectory and trains a policy before transitioning to episode $k+1$. In the context of the time-desynchronized environment, however, the agent at time $\mathfrak{t}_k$ allocates $\Delta\mathfrak{t}$ for trajectory generation and training, and subsequently moves to the next episode at $\mathfrak{t}_{k+1} = \mathfrak{t}_k + \Delta\mathfrak{t}$. Despite a fixed total number of episodes ($K$), the agent accumulates different trajectories influenced by the choice of interaction times ($\mathfrak{t}_1, \mathfrak{t}_2, \ldots, \mathfrak{t}_K$, denoted $\{\mathfrak{t}\}_{1:K}$), significantly impacting the sub-optimality gap of the policy. We propose a Proactively Synchronizing Tempo (ProST) framework that computes an optimal $\{\mathfrak{t}\}_{1:K}$. Our main contribution is to show that the optimal $\{\mathfrak{t}\}_{1:K}$ trades off the policy training time (agent tempo) against how fast the environment changes (environment tempo). Theoretically, this work establishes the optimal $\{\mathfrak{t}\}_{1:K}$ as a function of the degree of the environment's non-stationarity while also achieving sublinear dynamic regret. Our experimental evaluation on various high-dimensional non-stationary environments shows that the ProST framework achieves a higher online return at the optimal $\{\mathfrak{t}\}_{1:K}$ than existing methods. (Comment: 53 pages. To be published in Neural Information Processing Systems (NeurIPS), 202)
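
    The tempo trade-off admits a deliberately simplified cost model: training longer within an episode shrinks optimization error, but the environment drifts further in the meantime. In the toy below, the $c_1/\Delta\mathfrak{t}$ training term and the linear drift term are assumptions for illustration (not the paper's bound); balancing them gives an interior optimal tempo, here $\sqrt{c_1/c_2}$.

        import numpy as np

        def suboptimality(dt: float, c1: float = 1.0, c2: float = 0.5) -> float:
            # c1 / dt: error from too little training; c2 * dt: error from drift.
            return c1 / dt + c2 * dt

        dts = np.linspace(0.1, 5.0, 500)
        best = dts[np.argmin([suboptimality(d) for d in dts])]
        print(best, np.sqrt(1.0 / 0.5))  # grid optimum matches the analytic sqrt(c1/c2)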

    Towards Continual Reinforcement Learning: A Review and Perspectives

    In this article, we aim to provide a literature review of different formulations and approaches to continual reinforcement learning (RL), also known as lifelong or non-stationary RL. We begin by discussing our perspective on why RL is a natural fit for studying continual learning. We then provide a taxonomy of different continual RL formulations and mathematically characterize the non-stationary dynamics of each setting. We go on to discuss evaluation of continual RL agents, providing an overview of benchmarks used in the literature and important metrics for understanding agent performance. Finally, we highlight open problems and challenges in bridging the gap between the current state of continual RL and findings in neuroscience. While still in its early days, the study of continual RL has the promise to develop better incremental reinforcement learners that can function in increasingly realistic applications where non-stationarity plays a vital role. These include applications in fields such as healthcare, education, logistics, and robotics. (Comment: Preprint, 52 pages, 8 figures)

    Delays in Reinforcement Learning

    Delays are inherent to most dynamical systems. Besides shifting a process in time, delays can significantly affect its performance. For this reason, it is usually valuable to study the delay and account for it. Because they are dynamical systems, it is no surprise that sequential decision-making problems such as Markov decision processes (MDPs) can also be affected by delays. These processes are the foundational framework of reinforcement learning (RL), a paradigm whose goal is to create artificial agents capable of learning to maximise their utility by interacting with their environment. RL has achieved strong, sometimes astonishing, empirical results, but delays are seldom explicitly accounted for, and the understanding of their impact on the MDP is limited. In this dissertation, we propose to study delays in the agent's observation of the state of the environment or in the execution of the agent's actions. We repeatedly change our point of view on the problem to reveal some of its structure and peculiarities. A wide spectrum of delays is considered, and potential solutions are presented. This dissertation also aims to draw links between celebrated frameworks of the RL literature and that of delays.
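
    A standard construction this literature builds on, sketched here as an assumption rather than the dissertation's own method, restores the Markov property under a d-step observation delay by augmenting the last observed state with the d actions taken since. The wrapper below assumes a Gymnasium-style reset/step API.

        from collections import deque

        class DelayedObservationWrapper:
            """Expose (s_{t-d}, a_{t-d}, ..., a_{t-1}) as the agent's observation."""
            def __init__(self, env, delay: int):
                self.env, self.delay = env, delay
                self.obs_buffer = deque()
                self.action_buffer = deque(maxlen=delay)

            def reset(self):
                obs, info = self.env.reset()
                # Pad so the first d steps re-observe the initial state.
                self.obs_buffer = deque([obs] * (self.delay + 1))
                self.action_buffer.clear()
                return (self.obs_buffer.popleft(), tuple(self.action_buffer)), info

            def step(self, action):
                obs, reward, terminated, truncated, info = self.env.step(action)
                self.obs_buffer.append(obs)          # fresh state enters the queue
                self.action_buffer.append(action)    # remember actions taken meanwhile
                delayed = self.obs_buffer.popleft()  # state observed d steps late
                return (delayed, tuple(self.action_buffer)), reward, terminated, truncated, info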