Sim2Real transfer has gained popularity because it enables skills learned in
inexpensive simulators to be carried over to the real world. This paper
presents a novel system that fuses the components of a traditional
\textit{World Model} into a robust system, trained entirely within a
simulator, that transfers \textit{Zero-Shot} to the real world. To facilitate
transfer, we use an intermediary representation based on \textit{Bird's Eye
View (BEV)} images. Our robot thus learns to navigate in a simulator by first
learning to translate complex \textit{First-Person View (FPV)} RGB images into
BEV representations, and then learning to navigate using those
representations. When deployed in the real world, the robot reuses the
perception model to translate FPV RGB images into embeddings that are consumed
by the downstream policy.
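As a concrete, deliberately simplified illustration of this two-stage design,
the following PyTorch sketch shows the deployment path from an FPV frame to a
drive command. The module names, layer sizes, and the two-dimensional velocity
action are illustrative assumptions, not the paper's exact architecture.
\begin{verbatim}
import torch
import torch.nn as nn

class FPVToBEVEncoder(nn.Module):
    """Perception model (hypothetical): FPV RGB -> BEV-style embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, fpv_rgb):        # (B, 3, H, W)
        return self.backbone(fpv_rgb)  # (B, embed_dim)

class NavigationPolicy(nn.Module):
    """Downstream policy (hypothetical): embedding -> drive command."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 2),  # assumed action: (linear vel, angular vel)
        )

    def forward(self, z):
        return self.head(z)

# Deployment: only the frozen perception model and policy run on the robot.
encoder, policy = FPVToBEVEncoder(), NavigationPolicy()
frame = torch.randn(1, 3, 128, 128)   # stand-in for a camera frame
action = policy(encoder(frame))       # shape (1, 2)
\end{verbatim}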
State-checking modules based on \textit{Anchor images} and a \textit{Mixture
Density LSTM} not only interpolate uncertain and missing observations but also
make the model more robust when exposed to the real-world environment.
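A mixture-density recurrent head of the kind named above can be pictured as
follows: an LSTM summarizes the embedding history and emits Gaussian-mixture
parameters over the next latent state, from which a missing or low-confidence
observation can be imputed by sampling. All names and dimensions
(MixtureDensityLSTM, latent_dim, n_mixtures) are assumptions for illustration,
not the paper's implementation.
\begin{verbatim}
import torch
import torch.nn as nn

class MixtureDensityLSTM(nn.Module):
    """Hypothetical state-checker: LSTM output parameterizes a Gaussian
    mixture over the next latent state, so dropped or uncertain
    observations can be replaced by a sample from the mixture."""
    def __init__(self, latent_dim=256, hidden_dim=512, n_mixtures=5):
        super().__init__()
        self.lstm = nn.LSTM(latent_dim, hidden_dim, batch_first=True)
        # Per mixture component: weight logit, mean, log-std per latent dim.
        self.pi = nn.Linear(hidden_dim, n_mixtures)
        self.mu = nn.Linear(hidden_dim, n_mixtures * latent_dim)
        self.log_sigma = nn.Linear(hidden_dim, n_mixtures * latent_dim)
        self.n_mixtures, self.latent_dim = n_mixtures, latent_dim

    def forward(self, z_seq):                      # (B, T, latent_dim)
        h, _ = self.lstm(z_seq)
        h_last = h[:, -1]                          # summary of the history
        pi = torch.softmax(self.pi(h_last), dim=-1)
        mu = self.mu(h_last).view(-1, self.n_mixtures, self.latent_dim)
        sigma = self.log_sigma(h_last).view(
            -1, self.n_mixtures, self.latent_dim).exp()
        return pi, mu, sigma

    @torch.no_grad()
    def impute(self, z_seq):
        """Sample a plausible latent when the observation is missing."""
        pi, mu, sigma = self.forward(z_seq)
        k = torch.multinomial(pi, 1).squeeze(-1)   # pick a component
        idx = k.view(-1, 1, 1).expand(-1, 1, self.latent_dim)
        mean = mu.gather(1, idx).squeeze(1)
        std = sigma.gather(1, idx).squeeze(1)
        return mean + std * torch.randn_like(std)
\end{verbatim}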
We trained the model on data collected with a \textit{Differential drive}
robot in the CARLA simulator. We demonstrate our methodology's effectiveness
by deploying the trained models on a real-world \textit{Differential drive}
robot. Lastly, we release a comprehensive codebase, dataset, and models for
training and deployment that are available to
the public.