MuZero Unplugged presents a promising approach for offline policy learning
from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned
model and leverages the Reanalyze algorithm to learn purely from offline data.
To perform well, MCTS requires an accurate learned model and a large number of
simulations, which incurs a high computational cost. This paper investigates
several hypotheses about where MuZero Unplugged may not work well in offline
RL settings, including 1) learning with limited data coverage; 2) learning
from offline data of stochastic environments; 3) learning with improperly
parameterized models given the offline data; and 4) learning under a low
compute budget. We propose a
regularized one-step look-ahead approach to tackle the above issues. Instead of
planning with the expensive MCTS, we use the learned model to construct an
advantage estimate based on a one-step rollout. The policy is improved in the
direction that maximizes the estimated advantage, with regularization toward
the behavior in the dataset. We conduct extensive empirical studies with
BSuite environments to verify the hypotheses and then run our algorithm on the
RL Unplugged Atari benchmark. Experimental results show that our proposed
approach achieves stable performance even with an inaccurate learned model. On
the large-scale Atari benchmark, the proposed method outperforms MuZero
Unplugged by 43%. Most significantly, it uses only 5.6% of the wall-clock time
(i.e., 1 hour versus 17.8 hours) of MuZero Unplugged to achieve a 150% IQM
normalized score with the same hardware and software stacks. Our implementation
is open-sourced at https://github.com/sail-sg/rosmo.
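To make the described update concrete, the following is a minimal JAX sketch of regularized one-step look-ahead policy improvement, written as one plausible reading of the abstract rather than the authors' actual implementation. The model interface (`reward_fn`, `next_state_fn`, `value_fn`), the temperature `beta`, the regularization weight `alpha`, and the discount default are all illustrative assumptions.

```python
# A minimal sketch of one-step look-ahead advantage estimation and a
# behavior-regularized policy loss, in JAX. Illustrative only: the model
# functions and hyperparameters below are hypothetical, not the paper's API.
import jax
import jax.numpy as jnp


def one_step_advantages(state, num_actions, reward_fn, next_state_fn,
                        value_fn, gamma=0.997):
    """Estimate A(s, a) = r(s, a) + gamma * V(s') - V(s) for every action,
    using a single rollout of the learned model per action (no tree search)."""
    actions = jnp.arange(num_actions)
    # One-step model rollout for each action.
    rewards = jax.vmap(lambda a: reward_fn(state, a))(actions)
    next_states = jax.vmap(lambda a: next_state_fn(state, a))(actions)
    next_values = jax.vmap(value_fn)(next_states)
    q_values = rewards + gamma * next_values
    # Advantage relative to the current state's value estimate.
    return q_values - value_fn(state)


def policy_loss(policy_logits, advantages, behavior_action,
                alpha=1.0, beta=1.0):
    """Move the policy toward high-advantage actions while regularizing
    toward the action observed in the offline dataset."""
    log_probs = jax.nn.log_softmax(policy_logits)
    # Improvement target: reweight the current policy by exp(advantage / beta).
    target = jax.nn.softmax(
        jax.lax.stop_gradient(log_probs + advantages / beta))
    improvement = -jnp.sum(target * log_probs)  # cross-entropy to the target
    regularizer = -log_probs[behavior_action]   # behavior-cloning term
    return improvement + alpha * regularizer
```

A design note consistent with the abstract's claims: because the model is unrolled for only a single step, errors in the learned dynamics cannot compound along deep search paths as they can in MCTS, which is one way to understand the stable performance reported under an inaccurate learned model.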