72 research outputs found
DisCLIP: Open-Vocabulary Referring Expression Generation
Referring Expressions Generation (REG) aims to produce textual descriptions
that unambiguously identify specific objects within a visual scene.
Traditionally, this has been achieved through supervised learning methods,
which perform well on specific data distributions but often struggle to
generalize to new images and concepts. To address this issue, we present a
novel approach for REG, named DisCLIP, short for discriminative CLIP. We build
on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a
contextual description of a target concept in an image while avoiding other
distracting concepts. Notably, this optimization happens at inference time and
does not require additional training or tuning of learned parameters. We
measure the quality of the generated text by evaluating the capability of a
receiver model to accurately identify the described object within the scene. To
achieve this, we use a frozen zero-shot comprehension module as a critic of
our generated referring expressions. We evaluate DisCLIP on multiple referring
expression benchmarks through human evaluation and show that it significantly
outperforms previous methods on out-of-domain datasets. Our results highlight
the potential of using pre-trained visual-semantic models for generating
high-quality contextual descriptions.
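The abstract above gives no formula, so the following is only a minimal sketch of the general idea of discriminative scoring: reward a candidate description for matching the target and penalize it for matching distractors. The function names, the `weight` parameter, and the hand-written 3-dimensional vectors are assumptions made for illustration; a real system would take its embeddings from CLIP, and this is not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def discriminative_score(text_vec, target_vec, distractor_vecs, weight=1.0):
    """Similarity to the target minus weighted mean similarity to distractors.

    `weight` is a hypothetical trade-off parameter, not taken from the paper.
    """
    target_sim = cosine(text_vec, target_vec)
    if not distractor_vecs:
        return target_sim
    mean_distractor_sim = sum(
        cosine(text_vec, d) for d in distractor_vecs
    ) / len(distractor_vecs)
    return target_sim - weight * mean_distractor_sim

def pick_expression(candidates, target_vec, distractor_vecs, weight=1.0):
    """Return the candidate text with the highest discriminative score."""
    return max(
        candidates,
        key=lambda text: discriminative_score(
            candidates[text], target_vec, distractor_vecs, weight
        ),
    )

# Toy embeddings with made-up axes (dog, black, brown); stand-ins for CLIP vectors.
target = [1.0, 0.0, 1.0]      # image region: a brown dog
distractor = [1.0, 1.0, 0.0]  # image region: a black dog
candidates = {
    "a dog": [1.0, 0.0, 0.0],
    "the brown dog": [1.0, 0.0, 1.0],
}
best = pick_expression(candidates, target, [distractor])
```

In this toy scene "a dog" matches the target and the distractor equally well, so its score is zero, while "the brown dog" is preferred because it separates the two regions.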
Dealing with Semantic Underspecification in Multimodal NLP
Intelligent systems that aim at mastering language as humans do must deal
with its semantic underspecification, namely, the possibility for a linguistic
signal to convey only part of the information needed for communication to
succeed. Consider the usages of the pronoun they, which can leave the gender
and number of its referent(s) underspecified. Semantic underspecification is
not a bug but a crucial language feature that boosts its storage and processing
efficiency. Indeed, human speakers can quickly and effortlessly integrate
semantically-underspecified linguistic signals with a wide range of
non-linguistic information, e.g., the multimodal context, social or cultural
conventions, and shared knowledge. Standard NLP models have, in principle, no
or limited access to such extra information, while multimodal systems grounding
language into other modalities, such as vision, are naturally equipped to
account for this phenomenon. However, we show that they struggle with it, which
could negatively affect their performance and lead to harmful consequences when
used for applications. In this position paper, we argue that our community
should be aware of semantic underspecification if it aims to develop language
technology that can successfully interact with human users. We discuss some
applications where mastering it is crucial and outline a few directions toward
achieving this goal.
Comment: To appear in the Proceedings of ACL 2023 (main conference). 13 pages, 3 figures
Communication-based Evaluation for Natural Language Generation
Natural language generation (NLG) systems are commonly evaluated using n-gram overlap measures (e.g. BLEU, ROUGE). These measures do not directly capture semantics or speaker intentions, and so they often turn out to be misaligned with our true goals for NLG. In this work, we argue instead for communication-based evaluations: assuming the purpose of an NLG system is to convey information to a reader/listener, we can directly evaluate its effectiveness at this task using the Rational Speech Acts model of pragmatic language use. We illustrate with a color reference dataset that contains descriptions in pre-defined quality categories, showing that our method better aligns with these quality categories than do any of the prominent n-gram overlap methods.
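As a rough illustration of communication-based scoring, the sketch below implements the standard Rational Speech Acts recursion (literal listener, pragmatic speaker, pragmatic listener) over a tiny hand-made color lexicon, and scores a description by the probability that the listener picks the intended referent. The lexicon, the uniform prior, the `alpha` default, and the function names are all assumptions for illustration; the paper's actual models and data are not reproduced here.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("utterance or referent has no support")
    return {key: value / total for key, value in weights.items()}

def literal_listener(utterance, lexicon):
    """L0(r | u): uniform over referents for which the utterance is true."""
    return normalize({r: float(utterance in words) for r, words in lexicon.items()})

def pragmatic_speaker(referent, lexicon, utterances, alpha=1.0):
    """S1(u | r), proportional to L0(r | u) ** alpha (no utterance cost)."""
    return normalize(
        {u: literal_listener(u, lexicon)[referent] ** alpha for u in utterances}
    )

def pragmatic_listener(utterance, lexicon, utterances, alpha=1.0):
    """L1(r | u), proportional to S1(u | r) under a uniform prior on referents."""
    return normalize(
        {r: pragmatic_speaker(r, lexicon, utterances, alpha)[utterance] for r in lexicon}
    )

def communication_score(utterance, target, lexicon, utterances, alpha=1.0):
    """Probability that the pragmatic listener recovers the intended target."""
    return pragmatic_listener(utterance, lexicon, utterances, alpha)[target]

# Hypothetical color patches and the words that are literally true of each.
lexicon = {
    "sky": {"blue", "light"},
    "navy": {"blue", "dark"},
    "teal": {"green", "dark"},
}
utterances = ["blue", "light", "dark", "green"]

score_blue_for_navy = communication_score("blue", "navy", lexicon, utterances)
score_light_for_sky = communication_score("light", "sky", lexicon, utterances)
```

Unlike an n-gram overlap measure, this score depends only on whether the description lets a listener find the target: "light" identifies the sky patch with certainty, while "blue" for the navy patch remains ambiguous between two patches.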