
    What is Essential for Unseen Goal Generalization of Offline Goal-conditioned RL?

    Offline goal-conditioned RL (GCRL) offers a way to train general-purpose agents from fully offline datasets. In addition to being conservative within the dataset, the ability to generalize to unseen goals is another fundamental challenge for offline GCRL. To the best of our knowledge, however, this problem has not been well studied. In this paper, we study out-of-distribution (OOD) generalization of offline GCRL both theoretically and empirically to identify the important factors. Across a number of experiments, we observe that weighted imitation learning enjoys better generalization than pessimism-based offline RL methods. Based on this insight, we derive a theory for OOD generalization, which characterizes several important design choices. We then propose a new offline GCRL method, Generalizable Offline goAl-condiTioned RL (GOAT), by combining the findings from our theoretical and empirical studies. On a new benchmark containing 9 independent and identically distributed (IID) tasks and 17 OOD tasks, GOAT outperforms current state-of-the-art methods by a large margin.
    Comment: Accepted by Proceedings of the 40th International Conference on Machine Learning, 2023
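
    As a rough illustration of the weighted imitation learning recipe that the abstract contrasts with pessimism-based offline RL, the sketch below shows a goal-conditioned, advantage-weighted behavior cloning loss. It is an assumption-laden sketch, not the GOAT implementation; `policy`, `value_fn`, and the batch field names are hypothetical.

```python
# A rough sketch of goal-conditioned, advantage-weighted imitation learning
# (illustrative only, not the official GOAT code). `policy` is assumed to
# return a per-dimension torch distribution; `value_fn` is a goal-conditioned
# value network; the batch keys are hypothetical.
import torch

def weighted_imitation_loss(policy, value_fn, batch, beta=1.0):
    """Advantage-weighted behavior cloning on (state, action, goal) data."""
    obs, act, goal = batch["obs"], batch["act"], batch["goal"]
    reward, next_obs = batch["reward"], batch["next_obs"]
    with torch.no_grad():
        # One-step goal-conditioned advantage estimate: r + V(s', g) - V(s, g).
        adv = reward + value_fn(next_obs, goal) - value_fn(obs, goal)
        # Exponentiated-advantage weights, clipped for numerical stability.
        weight = torch.clamp(torch.exp(adv / beta), max=10.0)
    # Weighted negative log-likelihood of the dataset actions.
    log_prob = policy(obs, goal).log_prob(act).sum(dim=-1)
    return -(weight * log_prob).mean()
```

    Because the update stays a supervised objective on dataset actions (only reweighted), it retains the imitation-learning character that the abstract associates with better unseen-goal generalization.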

    Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions

    In this work, we present a scalable reinforcement learning method for training multi-task policies from large offline datasets that can leverage both human demonstrations and autonomously collected data. Our method uses a Transformer to provide a scalable representation for Q-functions trained via offline temporal difference backups. We therefore refer to the method as Q-Transformer. By discretizing each action dimension and representing the Q-value of each action dimension as separate tokens, we can apply effective high-capacity sequence modeling techniques for Q-learning. We present several design decisions that enable good performance with offline RL training, and show that Q-Transformer outperforms prior offline RL algorithms and imitation learning techniques on a large, diverse real-world robotic manipulation task suite. The project's website and videos can be found at https://q-transformer.github.io
    Comment: See website at https://q-transformer.github.io
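
    The sketch below illustrates the per-dimension discretization and autoregressive treatment of Q-values described above. It is a minimal, hypothetical rendering rather than the released Q-Transformer code; `q_model` (a head returning per-bin Q-values given the state encoding and previously chosen bins) is an assumed interface.

```python
# A minimal, hypothetical sketch of per-dimension action discretization and
# autoregressive greedy decoding over Q-value tokens (not the released
# Q-Transformer code). `q_model` is an assumed interface: given a state
# encoding and the bins already chosen, it returns per-bin Q-values for the
# next action dimension.
import torch

NUM_BINS = 256  # each continuous action dimension is discretized into bins

def discretize(action, low, high, num_bins=NUM_BINS):
    """Map continuous actions in [low, high] to integer bin indices (tokens)."""
    norm = (action - low) / (high - low)
    return torch.clamp((norm * num_bins).long(), 0, num_bins - 1)

def greedy_action_tokens(q_model, state_emb, action_dim):
    """Pick the argmax bin for each action dimension, one dimension at a time."""
    tokens = []
    for _ in range(action_dim):
        q_values = q_model(state_emb, tokens)  # per-bin Q-values, shape [num_bins]
        tokens.append(int(torch.argmax(q_values)))
    return tokens
```

    Treating each dimension's bins as separate tokens is what lets a standard autoregressive sequence model score and maximize Q-values one dimension at a time instead of over the full combinatorial action space.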

    Imitation learning based on entropy-regularized forward and inverse reinforcement learning

    This paper proposes Entropy-Regularized Imitation Learning (ERIL), which combines forward and inverse reinforcement learning under the framework of the entropy-regularized Markov decision process. ERIL minimizes the reverse Kullback-Leibler (KL) divergence between two probability distributions induced by a learner and an expert. Inverse reinforcement learning (RL) in ERIL evaluates the log-ratio between the two distributions using the density ratio trick, which is widely used in generative adversarial networks. More specifically, the log-ratio is estimated by building two binary discriminators. The first discriminator is a state-only function, and it tries to distinguish states generated by the forward RL step from the expert's states. The second discriminator is a function of the current state, the action, and the transitioned state, and it distinguishes the generated experiences from those provided by the expert. Since the second discriminator shares the hyperparameters of the forward RL step, it can be used to control the discriminator's ability. The forward RL step minimizes the reverse KL estimated by the inverse RL step. We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy under entropy regularization. Consequently, a new policy is derived from an algorithm that resembles Dynamic Policy Programming and Soft Actor-Critic. Our experimental results on MuJoCo-simulated environments show that ERIL is more sample-efficient than previous methods. We further apply the method to human behaviors in performing a pole-balancing task and show that the estimated reward functions reveal how each subject achieves the goal.
    Comment: 33 pages, 10 figures
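
    A minimal sketch of the density ratio trick mentioned above: a binary discriminator trained to separate expert transitions from learner-generated ones recovers the log density ratio as its logit. Module and batch names here are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of the density ratio trick: a binary discriminator trained to
# separate expert transitions from learner transitions recovers the log-ratio
# of the two densities as its logit. Names are illustrative assumptions, not
# the ERIL reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionDiscriminator(nn.Module):
    """D(s, a, s') -> logit; at optimum the logit equals log(p_expert / p_learner)."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act, next_obs):
        return self.net(torch.cat([obs, act, next_obs], dim=-1)).squeeze(-1)

def discriminator_loss(disc, expert_batch, learner_batch):
    """Binary cross-entropy: expert transitions labeled 1, learner transitions 0."""
    expert_logit = disc(*expert_batch)    # each batch is an (obs, act, next_obs) tuple
    learner_logit = disc(*learner_batch)
    return (
        F.binary_cross_entropy_with_logits(expert_logit, torch.ones_like(expert_logit))
        + F.binary_cross_entropy_with_logits(learner_logit, torch.zeros_like(learner_logit))
    )
```

    The transition-level discriminator described in the abstract corresponds to this (s, a, s') form; the state-only discriminator would simply drop the action and next-state inputs.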