Referential ambiguities arise in dialogue when a referring expression does
not uniquely identify the intended referent for the addressee. Addressees
usually detect such ambiguities immediately and work with the speaker to repair
them through meta-communicative Clarificational Exchanges (CEs): a Clarification
Request (CR) and a response. Here, we argue that the ability to generate and
respond to CRs imposes specific constraints on the architecture and objective
functions of multi-modal, visually grounded dialogue models. We use the SIMMC
2.0 dataset to evaluate the ability of different state-of-the-art model
architectures to process CEs, with a metric that probes the contextual updates
that arise from them in the model. We find that language-based models are able
to encode simple multi-modal semantic information and process some CEs,
excelling with those related to the dialogue history, whilst multi-modal models
can use additional learning objectives to obtain disentangled object
representations, which prove crucial for handling complex referential
ambiguities across modalities.

Comment: Accepted at SIGDIAL'23 (upcoming). Repository with code and
experiments available at https://github.com/JChiyah/what-are-you-referring-t