This paper proposes the Multi-modAl Retrieval model via Visual modulE pLugin
(MARVEL) to learn an embedding space for queries and multi-modal documents to
conduct retrieval. MARVEL encodes queries and multi-modal documents with a
unified encoder model, which helps to alleviate the modality gap between images
and texts. Specifically, we enable the image understanding ability of a
well-trained dense retriever, T5-ANCE, by incorporating the image features
encoded by the visual module as its inputs. To facilitate the multi-modal
retrieval tasks, we build the ClueWeb22-MM dataset based on the ClueWeb22
dataset, which regards anchor texts as queries and extracts the related text and
image documents from the anchor-linked web pages. Our experiments show that MARVEL
significantly outperforms state-of-the-art methods on the multi-modal
retrieval datasets WebQA and ClueWeb22-MM. Our further analyses show that the
visual module plugin is an effective approach for enabling image understanding
in an existing dense retrieval model. Moreover, the language model is able to
extract image semantics from the image encoder and adapt the image features to
its own input space. All code is available at https://github.com/OpenMatch/MARVEL.
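
The sketch below illustrates the visual-module-plugin idea described above: image features from a vision encoder are projected into the language model's input embedding space and prepended to the token embeddings, so one unified encoder produces dense embeddings for both text and image documents. This is not the authors' implementation; the model checkpoints (t5-base as a stand-in for T5-ANCE, a CLIP vision encoder), the linear projection layer, and the mean-pooling strategy are illustrative assumptions.

```python
# Minimal sketch of a visual module plugin for a T5-based dense retriever.
from typing import Optional

import torch
import torch.nn as nn
from PIL import Image
from transformers import (AutoTokenizer, CLIPImageProcessor,
                          CLIPVisionModel, T5EncoderModel)

text_model = T5EncoderModel.from_pretrained("t5-base")  # stand-in for T5-ANCE
tokenizer = AutoTokenizer.from_pretrained("t5-base")
vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Assumed adapter: maps visual hidden states into the T5 embedding dimension.
proj = nn.Linear(vision_model.config.hidden_size, text_model.config.d_model)


def encode(text: str, image: Optional[Image.Image] = None) -> torch.Tensor:
    """Return one dense embedding for a text-only or image+caption input."""
    tok = tokenizer(text, return_tensors="pt")
    embeds = text_model.get_input_embeddings()(tok.input_ids)  # [1, T, d_model]
    attn = tok.attention_mask

    if image is not None:
        pixels = processor(images=image, return_tensors="pt").pixel_values
        with torch.no_grad():
            vis = vision_model(pixel_values=pixels).last_hidden_state  # [1, P, d_vis]
        vis_embeds = proj(vis)                                          # [1, P, d_model]
        # Prepend projected image "tokens" to the text token embeddings.
        embeds = torch.cat([vis_embeds, embeds], dim=1)
        attn = torch.cat(
            [torch.ones(vis_embeds.shape[:2], dtype=attn.dtype), attn], dim=1
        )

    out = text_model(inputs_embeds=embeds, attention_mask=attn).last_hidden_state
    # Mean pooling over non-padded positions as the query/document embedding.
    mask = attn.unsqueeze(-1).float()
    return (out * mask).sum(dim=1) / mask.sum(dim=1)
```

In this sketch, queries are encoded with `encode(text)` and image documents with `encode(caption, image)`, so both live in the same embedding space and can be matched by dot product, mirroring the unified-encoder setup the abstract describes.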