Chats emerge as an effective user-friendly approach for information
retrieval, and are successfully employed in many domains, such as customer
service, healthcare, and finance. However, existing image retrieval approaches
typically address the case of a single query-to-image round, and the use of
chats for image retrieval has been mostly overlooked. In this work, we
introduce ChatIR: a chat-based image retrieval system that engages in a
conversation with the user to elicit information, in addition to an initial
query, in order to clarify the user's search intent. Motivated by the
capabilities of today's foundation models, we leverage Large Language Models to
generate follow-up questions to an initial image description. These questions
form a dialog with the user in order to retrieve the desired image from a large
corpus. In this study, we explore the capabilities of such a system tested on a
large dataset and reveal that engaging in a dialog yields significant gains in
image retrieval. We start by building an evaluation pipeline from an existing
manually generated dataset and explore different modules and training
strategies for ChatIR. Our comparison includes strong baselines derived from
related applications trained with Reinforcement Learning. Our system is capable
of retrieving the target image from a pool of 50K images with over 78% success
rate after 5 dialogue rounds, compared to 75% when questions are asked by
humans, and 64% for a single shot text-to-image retrieval. Extensive
evaluations reveal the strong capabilities and examine the limitations of
CharIR under different settings