Despite some successful applications of goal-driven navigation, existing deep
reinforcement learning (DRL)-based approaches notoriously suffer from poor
data efficiency. One of the reasons is that the goal information is
decoupled from the perception module and introduced directly as a condition of
decision-making, so the goal-irrelevant features of the scene
representation play an adversarial role during the learning process. In light
of this, we present a novel Goal-guided Transformer-enabled reinforcement
learning (GTRL) approach that takes the physical goal states as an input to
the scene encoder, guiding the scene representation to couple with the goal
information and thereby realizing efficient autonomous navigation. More specifically,
we propose a novel variant of the Vision Transformer as the backbone of the
perception system, namely the Goal-guided Transformer (GoT), and pre-train it with
expert priors to boost data efficiency. Subsequently, a reinforcement
learning algorithm is instantiated for the decision-making system, taking the
goal-oriented scene representation from the GoT as the input and generating
decision commands. As a result, our approach encourages the scene representation
to concentrate mainly on goal-relevant features, which substantially enhances
the data efficiency of the DRL training process, leading to superior navigation
performance. Both simulation and real-world experimental results demonstrate the
superiority of our approach in terms of data efficiency, performance,
robustness, and sim-to-real generalization, compared with other
state-of-the-art (SOTA) baselines. The demonstration video
(https://www.youtube.com/watch?v=aqJCHcsj4w0) and the source code
(https://github.com/OscarHuangWind/DRL-Transformer-SimtoReal-Navigation) are
also provided.
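
For readers who prefer a concrete picture, the following minimal Python/PyTorch sketch illustrates the core idea described above: the goal state is embedded as an extra token and attended to jointly with the image patch tokens, so the transformer produces a goal-conditioned scene representation that a policy head then maps to decision commands. All class names, dimensions, and the two-dimensional velocity output are illustrative assumptions, not the released implementation (see the GitHub link above for the authors' code).

# Minimal sketch (not the authors' implementation) of the goal-guided encoding idea:
# the goal state is injected as a token so attention yields a goal-conditioned feature.
import torch
import torch.nn as nn

class GoalGuidedEncoder(nn.Module):
    """Hypothetical goal-guided ViT-style scene encoder (positional embeddings omitted)."""
    def __init__(self, in_ch=3, patch=16, dim=128, goal_dim=2, depth=4, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.goal_embed = nn.Linear(goal_dim, dim)  # goal (e.g., relative polar coordinates) -> token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img, goal):
        # img: (B, C, H, W), goal: (B, goal_dim)
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)   # (B, N, dim) patch tokens
        goal_tok = self.goal_embed(goal).unsqueeze(1)               # (B, 1, dim) goal token
        feats = self.encoder(torch.cat([goal_tok, tokens], dim=1))  # joint goal-image attention
        return feats[:, 0]                                          # goal-conditioned summary feature

class Policy(nn.Module):
    """Hypothetical actor head mapping the goal-oriented feature to velocity commands."""
    def __init__(self, dim=128, act_dim=2):
        super().__init__()
        self.encoder = GoalGuidedEncoder(dim=dim)
        self.actor = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                                   nn.Linear(64, act_dim), nn.Tanh())

    def forward(self, img, goal):
        return self.actor(self.encoder(img, goal))  # e.g., (linear, angular) velocity in [-1, 1]

# Usage example with random inputs:
# commands = Policy()(torch.rand(1, 3, 64, 64), torch.rand(1, 2))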