24 research outputs found

Online Meta-Critic Learning for Off-Policy Actor-Critic Methods

Authors: Hospedales, Timothy M.; Li, Yiying; Wang, Huaimin; Yang, Yongxin; Zhou, Wei
Publication date: 01/11/2020

Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks. Normally, the critic's action-value function is updated using temporal-difference learning, and the critic in turn provides a loss for the actor that trains it to take actions with higher expected return. In this paper, we introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor that accelerates and improves actor-critic learning. Compared to the vanilla critic, the meta-critic network is explicitly trained to accelerate the learning process; and compared to existing meta-learning algorithms, the meta-critic is rapidly learned online for a single task, rather than slowly over a family of tasks. Crucially, our meta-critic framework is designed for off-policy learners, which currently provide state-of-the-art reinforcement learning sample efficiency. We demonstrate that online meta-critic learning leads to improvements in a variety of continuous control environments when combined with the contemporary Off-PAC methods DDPG, TD3 and the state-of-the-art SAC.

Comment: NeurIPS 202

arXiv.org e-Print Archive

Edinburgh Research Explorer
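
The actor-critic setup described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the scalar `aux_loss` standing in for the meta-critic's output, and the default `gamma` are all assumptions made for illustration. It shows a one-step temporal-difference target for the critic, the usual actor loss derived from the critic, and an additional loss term added to the actor objective.

```python
def td_target(reward, next_q, gamma=0.99, done=False):
    """One-step TD target: r + gamma * Q(s', a'), with no bootstrap at terminal states."""
    return reward + (0.0 if done else gamma * next_q)


def critic_td_loss(q_value, reward, next_q, gamma=0.99, done=False):
    """Squared TD error for a single transition."""
    delta = td_target(reward, next_q, gamma, done) - q_value
    return delta ** 2


def actor_loss(q_values):
    """Vanilla actor loss: negative mean of the critic's estimates for the actor's actions."""
    return -sum(q_values) / len(q_values)


def actor_loss_with_aux(q_values, aux_loss):
    """Actor loss plus an additional term.

    `aux_loss` is a hypothetical placeholder for the value a meta-critic
    would output; here it is just a number supplied by the caller.
    """
    return actor_loss(q_values) + aux_loss


if __name__ == "__main__":
    print(td_target(1.0, 2.0, gamma=0.5))
    print(critic_td_loss(1.0, 1.0, 2.0, gamma=0.5))
    print(actor_loss_with_aux([1.0, 3.0], aux_loss=0.5))
```

In a real learner the Q-values would come from neural networks and the additional loss would itself be produced by a trained network; plain numbers are used here only to keep the arithmetic visible.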

Discovering Object-Centric Generalized Value Functions From Pixels

Authors: Kahou, Samira Ebrahimi; Khetarpal, Khimya; Nath, Somjit; Subbaraj, Gopeshh Raaj
Publication date: 27/06/2023

Deep Reinforcement Learning has shown significant progress in extracting useful representations from high-dimensional inputs, albeit using hand-crafted auxiliary tasks and pseudo-rewards. Automatically learning such representations in an object-centric manner geared towards control and fast adaptation remains an open research problem. In this paper, we introduce a method that tries to discover meaningful features from objects, translating them to temporally coherent "question" functions and leveraging the subsequently learned general value functions for control. We compare our approach with state-of-the-art techniques alongside other ablations and show competitive performance in both stationary and non-stationary settings. Finally, we also investigate the discovered general value functions and, through qualitative analysis, show that the learned representations are not only interpretable but also centered around objects that are invariant to changes across tasks, facilitating fast adaptation.

Comment: Accepted at ICML 202

arXiv.org e-Print Archive
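
As a rough illustration of what a general value function predicts, the sketch below computes the discounted sum of a cumulant signal under a per-step continuation factor, which is the standard target a general value function is trained to estimate. It does not reproduce the paper's discovery method: the function name `gvf_returns`, the example cumulant sequence, and the continuation values are assumptions chosen for illustration.

```python
def gvf_returns(cumulants, continuations, bootstrap=0.0):
    """Compute G_t = c_t + gamma_t * G_{t+1} for every step, working backwards.

    cumulants:      per-step signal the "question" asks about (hypothetical values)
    continuations:  per-step continuation factor in [0, 1]; 0 ends the prediction
    bootstrap:      value assumed after the final step
    """
    returns = []
    g = bootstrap
    for c, gamma in zip(reversed(cumulants), reversed(continuations)):
        g = c + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


if __name__ == "__main__":
    # Cumulant fires at steps 0 and 2; continuation drops to 0 at the last step.
    print(gvf_returns([1.0, 0.0, 1.0], [0.5, 0.5, 0.0]))
```

With an ordinary reward as the cumulant and a constant continuation factor, this reduces to the familiar discounted return.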