As Large Language Models (LLMs) have become popular, an important trend has
emerged of using multimodality to augment their generation ability, enabling
LLMs to better interact with the world. However, there is no unified
understanding of at which stage and how to incorporate different modalities. In
this survey, we review methods that assist and augment generative models by
retrieving multimodal knowledge, whose formats range from images, code,
tables, and graphs to audio. Such methods offer a promising solution to important
concerns such as factuality, reasoning, interpretability, and robustness. By
providing an in-depth review, this survey is expected to give scholars a
deeper understanding of these methods' applications and to encourage them to
adapt existing techniques to the fast-growing field of LLMs.