What is Essential for Unseen Goal Generalization of Offline Goal-conditioned RL?
Offline goal-conditioned RL (GCRL) offers a way to train general-purpose
agents from fully offline datasets. Beyond remaining conservative within
the dataset, the ability to generalize to unseen goals is another
fundamental challenge for offline GCRL. However, to the best of our knowledge,
this problem has not been well studied yet. In this paper, we study
out-of-distribution (OOD) generalization of offline GCRL both theoretically and
empirically to identify factors that are important. In a number of experiments,
we observe that weighted imitation learning enjoys better generalization than
pessimism-based offline RL methods. Based on this insight, we derive a theory
for OOD generalization, which characterizes several important design choices.
We then propose a new offline GCRL method, Generalizable Offline
goAl-condiTioned RL (GOAT), by combining the findings from our theoretical and
empirical studies. On a new benchmark containing 9 independent identically
distributed (IID) tasks and 17 OOD tasks, GOAT outperforms current
state-of-the-art methods by a large margin.Comment: Accepted by Proceedings of the 40th International Conference on
Machine Learning, 202
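To make the weighted imitation learning family the abstract highlights concrete, here is a minimal sketch of goal-conditioned advantage-weighted regression in PyTorch. The network shape, the exponential weight with temperature beta, and the weight clipping are illustrative assumptions, not GOAT's exact design.

```python
# A minimal sketch of goal-conditioned weighted imitation learning, the
# family the abstract finds to generalize better than pessimism-based
# offline RL. All design details here are illustrative assumptions.
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

def weighted_imitation_loss(policy, state, goal, action, advantage, beta=1.0):
    """Exponentially advantage-weighted regression toward dataset actions.

    Transitions with higher estimated advantage A(s, a, g) contribute more,
    so the policy imitates the better-than-average behavior in the dataset.
    """
    weight = torch.exp(advantage / beta).clamp(max=10.0)  # clipped for stability
    per_sample = ((policy(state, goal) - action) ** 2).sum(dim=-1)
    return (weight.detach() * per_sample).mean()

# Usage with random placeholder data:
policy = GoalConditionedPolicy(state_dim=10, goal_dim=3, action_dim=4)
s, g, a = torch.randn(32, 10), torch.randn(32, 3), torch.randn(32, 4)
adv = torch.randn(32)  # in practice, estimated from a learned value function
loss = weighted_imitation_loss(policy, s, g, a, adv)
loss.backward()
```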
State-based Policy Representation for Deep Policy Learning
Reinforcement Learning has achieved notable success in many fields, such as video game playing, continuous control, and the game of Go. On the other hand, current approaches usually suffer from high sample complexity and lack transferability to similar tasks. Imitation learning, also known as "learning from demonstrations", can mitigate the former problem by providing successful experiences. However, current methods usually assume the expert and the imitator are identical, which limits flexibility and robustness when the dynamics change.

Generalizability is at the core of artificial intelligence. An agent should be able to apply its knowledge to novel tasks after training in similar environments or being provided with related demonstrations. Given the current observation, it should have the ability to predict what can happen (modeling) and what needs to happen (planning). This raises challenges in how to represent knowledge and how to utilize it by learning from interactions or demonstrations.

In this thesis, we systematically study two important problems, the universal goal-reaching problem and the cross-morphology imitation learning problem, which are representative challenges in the fields of reinforcement learning and imitation learning. Laying out the research work that addresses these challenging tasks unfolds our roadmap towards the holy-grail goal: making the agent generalizable by learning from observations and modeling the world.
Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions
In this work, we present a scalable reinforcement learning method for
training multi-task policies from large offline datasets that can leverage both
human demonstrations and autonomously collected data. Our method uses a
Transformer to provide a scalable representation for Q-functions trained via
offline temporal difference backups. We therefore refer to the method as
Q-Transformer. By discretizing each action dimension and representing the
Q-value of each action dimension as separate tokens, we can apply effective
high-capacity sequence modeling techniques for Q-learning. We present several
design decisions that enable good performance with offline RL training, and
show that Q-Transformer outperforms prior offline RL algorithms and imitation
learning techniques on a large diverse real-world robotic manipulation task
suite. The project's website and videos can be found at
https://q-transformer.github.io
Comment: See website at https://q-transformer.github.io
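To make the per-dimension action tokenization concrete, below is a minimal sketch of discretizing each action dimension into bins and decoding Q-values autoregressively, one dimension at a time. The bin count, encoder architecture, and greedy_action helper are illustrative assumptions rather than the paper's exact implementation.

```python
# A minimal sketch of per-dimension action discretization: each continuous
# action dimension is binned, and a sequence model predicts Q-values for one
# dimension at a time, conditioned on the bins chosen for earlier dimensions.
# Architecture and bin counts are illustrative assumptions.
import torch
import torch.nn as nn

NUM_BINS = 256       # discretization resolution per action dimension
ACTION_DIMS = 4      # number of action dimensions handled autoregressively

class PerDimensionQHead(nn.Module):
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, hidden)
        self.bin_embed = nn.Embedding(NUM_BINS, hidden)
        # A small Transformer encoder stands in for the paper's sequence
        # model over observation and action-dimension tokens.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.q_head = nn.Linear(hidden, NUM_BINS)  # one Q-value per bin

    def forward(self, obs, prev_bins):
        # prev_bins: (batch, t) bin indices already chosen for dims 0..t-1
        tokens = [self.obs_proj(obs).unsqueeze(1)]
        if prev_bins.numel() > 0:
            tokens.append(self.bin_embed(prev_bins))
        h = self.encoder(torch.cat(tokens, dim=1))
        return self.q_head(h[:, -1])  # Q-values for the next action dimension

@torch.no_grad()
def greedy_action(model, obs):
    """Decode one action by maximizing Q over bins, one dimension at a time."""
    bins = torch.zeros(obs.shape[0], 0, dtype=torch.long)
    for _ in range(ACTION_DIMS):
        q = model(obs, bins)                       # (batch, NUM_BINS)
        bins = torch.cat([bins, q.argmax(-1, keepdim=True)], dim=1)
    return bins / (NUM_BINS - 1) * 2.0 - 1.0       # map bin indices to [-1, 1]

model = PerDimensionQHead(obs_dim=16)
action = greedy_action(model, torch.randn(2, 16))
```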
Imitation learning based on entropy-regularized forward and inverse reinforcement learning
This paper proposes Entropy-Regularized Imitation Learning (ERIL), which is a
combination of forward and inverse reinforcement learning (RL) under the framework
of the entropy-regularized Markov decision process. ERIL minimizes the reverse
Kullback-Leibler (KL) divergence between two probability distributions induced
by a learner and an expert. The inverse RL step in ERIL
evaluates the log-ratio between the two distributions using the density-ratio
trick, which is widely used in generative adversarial networks. More
specifically, the log-ratio is estimated by building two binary discriminators.
The first discriminator is a state-only function, and it tries to distinguish
the state generated by the forward RL step from the expert's state. The second
discriminator is a function of the current state, action, and next state,
and it distinguishes the generated experiences from those provided by the
expert. Since the second discriminator shares its hyperparameters with the
forward RL step, they can be used to control the discriminator's ability. The
forward RL minimizes the reverse KL estimated by the inverse RL. We show that
minimizing the reverse KL divergence is equivalent to finding an optimal policy
under entropy regularization. Consequently, a new policy is derived from an
algorithm that resembles Dynamic Policy Programming and Soft Actor-Critic. Our
experimental results on MuJoCo-simulated environments show that ERIL is more
sample-efficient than previous methods. We further apply the method to
human behaviors in performing a pole-balancing task and show that the estimated
reward functions reveal how every subject achieves the goal.
Comment: 33 pages, 10 figures
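The density-ratio trick mentioned in the abstract can be illustrated with a small sketch: a binary discriminator trained to separate expert from learner samples yields the log-ratio log D(x) - log(1 - D(x)), which for a logit-output network is exactly the logit. The architecture below is an assumption for illustration; ERIL's actual discriminators (state-only and state-action-next-state) follow this pattern but differ in detail.

```python
# A minimal sketch of the density-ratio trick: a binary discriminator
# trained to separate expert (label 1) from learner (label 0) samples
# gives log(p_expert / p_learner) via log D - log(1 - D). The network
# here is an illustrative assumption, not ERIL's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)  # logits; sigmoid gives D(x) = P(expert | x)

def discriminator_loss(disc, expert_x, learner_x):
    """Binary cross-entropy: expert samples labeled 1, learner samples 0."""
    logits_e = disc(expert_x)
    logits_l = disc(learner_x)
    return (F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e))
            + F.binary_cross_entropy_with_logits(logits_l, torch.zeros_like(logits_l)))

def log_density_ratio(disc, x):
    """log(p_expert(x) / p_learner(x)) for an optimal discriminator.

    With logits z, D = sigmoid(z), so log D - log(1 - D) = z exactly.
    """
    return disc(x)

disc = Discriminator(in_dim=8)
expert, learner = torch.randn(64, 8), torch.randn(64, 8)
loss = discriminator_loss(disc, expert, learner)
loss.backward()
ratio = log_density_ratio(disc, learner)  # reward/penalty signal for forward RL
```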