We propose a framework that incrementally collects a landmark-based graph
memory and uses the collected memory for image goal navigation. Given a target
image to search for, an embodied robot uses semantic memory to find the target
in an unknown environment. The semantic graph memory is collected from a
panoramic observation of an RGB-D camera without knowing the robot's pose. In
this paper, we present a topological semantic graph memory (TSGM), which
consists of (1) a graph builder that takes the observed RGB-D image to
construct a topological semantic graph, (2) a cross graph mixer module that
takes the collected nodes to extract contextual information, and (3) a memory
decoder that takes the contextual memory as input to select an action toward the
target. On the task of image goal navigation, TSGM significantly outperforms
competitive baselines by 5.0-9.0% in success rate and 7.0-23.5% in SPL,
indicating that TSGM finds efficient paths. Additionally, we demonstrate
our method on a mobile robot in real-world image goal scenarios.
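
For readers unfamiliar with the two reported metrics, the sketch below computes success rate and SPL (success weighted by path length) under the commonly used definition, SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is the success indicator, l_i the shortest-path length, and p_i the length of the path the agent actually took. The function name `success_rate_and_spl` and the episode tuple layout are hypothetical and chosen for illustration; this is not taken from the TSGM implementation.

```python
from typing import Sequence, Tuple

# Hypothetical episode record: (success, shortest_path_length, agent_path_length)
Episode = Tuple[bool, float, float]


def success_rate_and_spl(episodes: Sequence[Episode]) -> Tuple[float, float]:
    """Return (success rate, SPL) averaged over all episodes."""
    if not episodes:
        raise ValueError("at least one episode is required")

    n = len(episodes)
    num_success = 0
    spl_sum = 0.0
    for success, shortest, taken in episodes:
        if not success:
            continue  # failed episodes contribute 0 to both metrics
        num_success += 1
        denom = max(taken, shortest)
        # Degenerate case (agent starts at the goal): count as fully efficient.
        spl_sum += shortest / denom if denom > 0 else 1.0

    return num_success / n, spl_sum / n


# Illustrative values only: one optimal success, one success with a path
# twice the shortest length, and one failure.
episodes = [(True, 5.0, 5.0), (True, 5.0, 10.0), (False, 5.0, 3.0)]
sr, spl = success_rate_and_spl(episodes)
```

Because each successful episode is weighted by the ratio of shortest to actual path length, an agent that succeeds but wanders scores lower on SPL than on success rate, which is why a gain in SPL is read as evidence of efficient paths.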