Visual Word Sense Disambiguation (VWSD) is a novel challenging task with the
goal of retrieving an image among a set of candidates, which better represents
the meaning of an ambiguous word within a given context. In this paper, we make
a substantial step towards unveiling this interesting task by applying a
varying set of approaches. Since VWSD is primarily a text-image retrieval task,
we explore the latest transformer-based methods for multimodal retrieval.
Additionally, we utilize Large Language Models (LLMs) as knowledge bases to
enhance the given phrases and resolve ambiguity related to the target word. We
also study VWSD as a unimodal problem by converting to text-to-text and
image-to-image retrieval, as well as question-answering (QA), to fully explore
the capabilities of relevant models. To tap into the implicit knowledge of
LLMs, we experiment with Chain-of-Thought (CoT) prompting to guide explainable
answer generation. On top of all, we train a learn to rank (LTR) model in order
to combine our different modules, achieving competitive ranking results.
Extensive experiments on VWSD demonstrate valuable insights to effectively
drive future directions.Comment: Conference on Empirical Methods in Natural Language Processing
(EMNLP) 202