Dealing with Semantic Underspecification in Multimodal NLP
Intelligent systems that aim at mastering language as humans do must deal
with its semantic underspecification, namely, the possibility for a linguistic
signal to convey only part of the information needed for communication to
succeed. Consider the usages of the pronoun they, which can leave the gender
and number of its referent(s) underspecified. Semantic underspecification is
not a bug but a crucial language feature that boosts its storage and processing
efficiency. Indeed, human speakers can quickly and effortlessly integrate
semantically-underspecified linguistic signals with a wide range of
non-linguistic information, e.g., the multimodal context, social or cultural
conventions, and shared knowledge. Standard NLP models have, in principle, no
or limited access to such extra information, while multimodal systems grounding
language into other modalities, such as vision, are naturally equipped to
account for this phenomenon. However, we show that they struggle with it, which
could negatively affect their performance and lead to harmful consequences when
used for applications. In this position paper, we argue that our community
should be aware of semantic underspecification if it aims to develop language
technology that can successfully interact with human users. We discuss some
applications where mastering it is crucial and outline a few directions toward
achieving this goal.
Comment: To appear in the Proceedings of ACL 2023 (main conference). 13 pages,
3 figure
A Psycholinguistic Analysis of BERT's Representations of Compounds
This work studies the semantic representations learned by BERT for compounds,
that is, expressions such as sunlight or bodyguard. We build on recent studies
that explore semantic information in Transformers at the word level and test
whether BERT aligns with human semantic intuitions when dealing with
expressions (e.g., sunlight) whose overall meaning depends -- to varying
extents -- on the semantics of the constituent words (sun, light). We leverage a
dataset that includes human judgments on two psycholinguistic measures of
compound semantic analysis: lexeme meaning dominance (LMD; quantifying the
weight of each constituent toward the compound meaning) and semantic
transparency (ST; evaluating the extent to which the compound meaning is
recoverable from the constituents' semantics). We show that BERT-based measures
moderately align with human intuitions, especially when using contextualized
representations, and that LMD is overall more predictable than ST. Contrary to
the results reported for 'standard' words, higher, more contextualized layers
are the best at representing compound meaning. These findings shed new light on
the abilities of BERT in dealing with fine-grained semantic phenomena.
Moreover, they can provide insights into how speakers represent compounds.
Comment: To appear in the Proceedings of EACL 2023 (main conference
Is the Red Square Big? MALeViC: Modeling Adjectives Leveraging Visual Contexts
This work aims at modeling how the meaning of gradable adjectives of size
('big', 'small') can be learned from visually-grounded contexts. Inspired by
cognitive and linguistic evidence showing that the use of these expressions
relies on setting a threshold that is dependent on a specific context, we
investigate the ability of multi-modal models to assess whether an object is
'big' or 'small' in a given visual scene. In contrast with the standard
computational approach that simplistically treats gradable adjectives as
'fixed' attributes, we pose the problem as relational: to be successful, a
model has to consider the full visual context. By means of four main tasks, we
show that state-of-the-art models (but not a relatively strong baseline) can
learn the function subtending the meaning of size adjectives, though their
performance is found to decrease while moving from simple to more complex
tasks. Crucially, models fail to develop abstract representations of
gradable adjectives that can be used compositionally.
Comment: Accepted at EMNLP-IJCNLP 201
When Language Models Fall in Love: Animacy Processing in Transformer Language Models
Animacy - whether an entity is alive and sentient - is fundamental to
cognitive processing, impacting areas such as memory, vision, and language.
However, animacy is not always expressed directly in language: in English it
often manifests indirectly, in the form of selectional constraints on verbs and
adjectives. This poses a potential issue for transformer language models (LMs):
they often train only on text, and thus lack access to extralinguistic
information from which humans learn about animacy. We ask: how does this impact
LMs' animacy processing - do they still behave as humans do? We answer this
question using open-source LMs. Like previous studies, we find that LMs behave
much like humans when presented with entities whose animacy is typical.
However, we also show that even when presented with stories about atypically
animate entities, such as a peanut in love, LMs adapt: they treat these
entities as animate, though they do not adapt as well as humans. Even when the
context indicating atypical animacy is very short, LMs pick up on subtle clues
and change their behavior. We conclude that despite the limited signal through
which LMs can learn about animacy, they are indeed sensitive to the relevant
lexical semantic nuances available in English.
Comment: To appear at EMNLP 202
GROOViST: A Metric for Grounding Objects in Visual Storytelling
A proper evaluation of stories generated for a sequence of images -- the task
commonly referred to as visual storytelling -- must consider multiple aspects,
such as coherence, grammatical correctness, and visual grounding. In this work,
we focus on evaluating the degree of grounding, that is, the extent to which a
story is about the entities shown in the images. We analyze current metrics,
both those designed for this purpose and those for general vision-text alignment. Given
their observed shortcomings, we propose a novel evaluation tool, GROOViST, that
accounts for cross-modal dependencies, temporal misalignments (the fact that
the order in which entities appear in the story and the image sequence may not
match), and human intuitions on visual grounding. An additional advantage of
GROOViST is its modular design, where the contribution of each component can be
assessed and interpreted individually.
Comment: In EMNLP 2023 main conference proceedings (to appear
The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models
Despite the impressive performance achieved by pre-trained
language-and-vision models in downstream tasks, it remains an open question
whether this reflects a proper understanding of image-text interaction. In this
work, we explore to what extent they handle basic linguistic constructions --
active-passive voice, coordination, and relative clauses -- that even preschool
children can typically master. We present BLA, a novel, automatically
constructed benchmark to evaluate multimodal models on these Basic Language
Abilities. We show that different types of Transformer-based systems, such as
CLIP, ViLBERT, and BLIP2, generally struggle with BLA in a zero-shot setting,
in line with previous findings. Our experiments, in particular, show that most
of the tested models only marginally benefit when fine-tuned or prompted with
construction-specific samples. Yet, the generative BLIP2 shows promising
trends, especially in an in-context learning setting. This opens the door to
using BLA not only as an evaluation benchmark but also to improve models' basic
language abilities.
Comment: This is the camera-ready version of the paper that will be published
in the Proceedings of EMNLP 2023 (Singapore, 6-10 December 2023
Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze
When speakers describe an image, they tend to look at objects before
mentioning them. In this paper, we investigate such sequential cross-modal
alignment by modelling the image description generation process
computationally. We take as our starting point a state-of-the-art image
captioning system and develop several model variants that exploit information
from human gaze patterns recorded during language production. In particular, we
propose the first approach to image description generation where visual
processing is modelled sequentially. Our experiments and analyses
confirm that better descriptions can be obtained by exploiting gaze-driven
attention and shed light on human cognitive processes by comparing different
ways of aligning the gaze modality with language production. We find that
processing gaze data sequentially leads to descriptions that are better aligned
to those produced by speakers, more diverse, and more natural, particularly
when gaze is encoded with a dedicated recurrent component.
Comment: In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP 2020
Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts
Dialogue participants often refer to entities or situations repeatedly within
a conversation, which contributes to its cohesiveness. Subsequent references
exploit the common ground accumulated by the interlocutors and hence have
several interesting properties, namely, they tend to be shorter and reuse
expressions that were effective in previous mentions. In this paper, we tackle
the generation of first and subsequent references in visually grounded
dialogue. We propose a generation model that produces referring utterances
grounded in both the visual and the conversational context. To assess the
referring effectiveness of its output, we also implement a reference resolution
system. Our experiments and analyses show that the model produces better, more
effective referring utterances than a model not grounded in the dialogue
context, and generates subsequent references that exhibit linguistic patterns
akin to humans.
Comment: In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP 2020