Interacting with the actual environment to acquire data is often costly and
time-consuming in robotic tasks. Model-based offline reinforcement learning
(RL) provides a feasible solution. On the one hand, it eliminates the
requirement of interacting with the actual environment. On the other hand, it
learns the transition dynamics and reward function from the offline datasets
and generates simulated rollouts to accelerate training. Previous model-based
offline RL methods adopt probabilistic ensemble neural networks (NNs) to model
aleatoric and epistemic uncertainty. However, this substantially increases
training time and computing resource requirements.
Furthermore, these methods are susceptible to the accumulated errors of the
environment dynamics models when simulating long-term rollouts. To solve
the above problems, we propose an uncertainty-aware sequence modeling
architecture called Environment Transformer. It models the probability
distribution of the environment dynamics and reward function to capture
aleatoric uncertainty and treats epistemic uncertainty as a learnable noise
parameter. Benefiting from the accurate modeling of the transition dynamics and
reward function, Environment Transformer can be combined with arbitrary
planning, dynamic programming, or policy optimization algorithms for offline
RL. In this work, we apply Conservative Q-Learning (CQL) to learn a
conservative Q-function. Through simulation experiments, we demonstrate that
our method achieves or exceeds state-of-the-art performance on widely studied
offline RL benchmarks. Moreover, we show that Environment Transformer's
simulated rollout quality, sample efficiency, and long-term rollout simulation
capability are superior to those of previous model-based offline RL methods.
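
As a rough illustration of the idea described above, the sketch below shows one way such an uncertainty-aware environment model could be structured in PyTorch: a causal Transformer over (state, action) histories with Gaussian output heads for the next state and reward (capturing aleatoric uncertainty) and a learnable noise parameter added to the predicted variance (standing in for epistemic uncertainty). The class, method, and hyperparameter names are illustrative assumptions, not the paper's released code.

```python
# Illustrative sketch only; names and hyperparameters are assumptions, not the authors' code.
import torch
import torch.nn as nn

class EnvTransformer(nn.Module):
    """Sequence model of the environment: given a history of (state, action)
    tokens, predict a Gaussian over the next state and reward.

    - Aleatoric uncertainty: the per-dimension variance predicted by the network.
    - Epistemic uncertainty: a learnable noise parameter added to that variance.
    """

    def __init__(self, state_dim, action_dim, d_model=128, n_layers=3, n_heads=4, max_len=64):
        super().__init__()
        self.embed = nn.Linear(state_dim + action_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Heads: mean and log-variance of [next state, reward] (aleatoric uncertainty).
        self.mean_head = nn.Linear(d_model, state_dim + 1)
        self.logvar_head = nn.Linear(d_model, state_dim + 1)
        # Learnable epistemic noise (log-scale), one parameter per output dimension.
        self.epistemic_logvar = nn.Parameter(torch.full((state_dim + 1,), -3.0))

    def forward(self, states, actions):
        # states: (B, T, state_dim), actions: (B, T, action_dim)
        x = self.embed(torch.cat([states, actions], dim=-1)) + self.pos[:, :states.size(1)]
        mask = nn.Transformer.generate_square_subsequent_mask(states.size(1)).to(x.device)
        h = self.encoder(x, mask=mask)               # causal attention over the history
        mean = self.mean_head(h)                     # predicted next state and reward
        logvar = self.logvar_head(h)                 # aleatoric (data) uncertainty
        total_var = logvar.exp() + self.epistemic_logvar.exp()  # add epistemic noise
        return mean, total_var

def nll_loss(mean, var, target):
    # Gaussian negative log-likelihood used to fit dynamics and reward jointly.
    return 0.5 * (((target - mean) ** 2) / var + var.log()).mean()

# Example usage (shapes only):
# model = EnvTransformer(state_dim=17, action_dim=6)
# mean, var = model(states, actions)  # distribution over next state and reward
# loss = nll_loss(mean, var, torch.cat([next_states, rewards], dim=-1))
```

Rollouts sampled from a fitted model of this kind could then be mixed with the offline dataset when training the conservative Q-function with CQL, which is the role the abstract assigns to the Environment Transformer.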