Visually Grounded Meaning Representations
In this paper we address the problem of grounding distributional representations of lexical meaning. We introduce a new
model which uses stacked autoencoders to learn higher-level representations from textual and visual input. The visual modality is
encoded via vectors of attributes obtained automatically from images. We create a new large-scale taxonomy of 600 visual attributes
representing more than 500 concepts and 700K images. We use this dataset to train attribute classifiers and integrate their predictions
with text-based distributional models of word meaning. We evaluate our model on its ability to simulate word similarity judgments and
concept categorization. On both tasks, our model yields a better fit to behavioral data than baselines and related models which
either rely on a single modality or do not make use of attribute-based input.
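As a rough illustration of the attribute-classifier step described above (a sketch only, with hypothetical precomputed image features and attribute labels rather than the authors' pipeline), one binary classifier can be trained per visual attribute and its predicted probabilities averaged over a concept's images to give that concept's visual attribute vector:

```python
# Illustrative sketch: per-attribute classifiers over hypothetical precomputed
# image features; not the authors' exact pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attribute_classifiers(img_feats, attr_labels):
    """img_feats: (n_images, d) array; attr_labels: (n_images, n_attrs) 0/1 matrix."""
    clfs = []
    for a in range(attr_labels.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(img_feats, attr_labels[:, a])   # one binary classifier per attribute
        clfs.append(clf)
    return clfs

def concept_visual_vector(clfs, concept_img_feats):
    """Average predicted attribute probabilities over all images of a concept."""
    probs = np.column_stack([c.predict_proba(concept_img_feats)[:, 1] for c in clfs])
    return probs.mean(axis=0)   # one score per visual attribute
```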
Learning visually grounded meaning representations
Humans possess a rich semantic knowledge of words and concepts which captures the
perceivable physical properties of their real-world referents and their relations. Encoding
this knowledge or some of its aspects is the goal of computational models of
semantic representation and has been the subject of considerable research in cognitive
science, natural language processing, and related areas. Existing models have
placed emphasis on different aspects of meaning, depending ultimately on the task at
hand. Typically, such models have been used in tasks addressing the simulation of behavioural
phenomena, e.g., lexical priming or categorisation, as well as in natural language
applications, such as information retrieval, document classification, or semantic
role labelling. A major strand of research popular across disciplines focuses on models
which induce semantic representations from text corpora. These models are based on
the hypothesis that the meaning of words is established by their distributional relation
to other words (Harris, 1954). Despite their widespread use, distributional models of
word meaning have been criticised as ‘disembodied’ in that they are not grounded in
perception and action (Perfetti, 1998; Barsalou, 1999; Glenberg and Kaschak, 2002).
This lack of grounding contrasts with many experimental studies suggesting that meaning
is acquired not only from exposure to the linguistic environment but also from our
interaction with the physical world (Landau et al., 1998; Bornstein et al., 2004). This
criticism has led to the emergence of new models aiming at inducing perceptually
grounded semantic representations. Essentially, existing approaches learn meaning
representations from multiple views corresponding to different modalities, i.e. linguistic
and perceptual input. To approximate the perceptual modality, previous work has
relied largely on semantic attributes collected from humans (e.g., is round, is sour), or
on automatically extracted image features. Semantic attributes have a long-standing
tradition in cognitive science and are thought to represent salient psychological aspects
of word meaning including multisensory information. However, their elicitation
from human subjects limits the scope of computational models to a small number of
concepts for which attributes are available.
In this thesis, we present an approach which draws inspiration from the successful
application of attribute classifiers in image classification, and represent images, and
the concepts they depict, by automatically predicted visual attributes. To this
end, we create a dataset comprising nearly 700K images and a taxonomy of 636 visual
attributes and use it to train attribute classifiers. We show that their predictions
can act as a substitute for human-produced attributes without any critical information
loss. In line with the attribute-based approximation of the visual modality, we represent
the linguistic modality by textual attributes which we obtain with an off-the-shelf
distributional model. Having first established this core contribution of a novel modelling
framework for grounded meaning representations based on semantic attributes,
we show that these can be integrated into existing approaches to perceptually grounded
representations. We then introduce a model formulated as a stacked autoencoder
(a variant of multilayer neural networks) that learns higher-level meaning representations
by mapping words and images, represented by attributes, into a common
embedding space. In contrast to most previous approaches to multimodal learning using
different variants of deep networks and data sources, our model is defined at a finer
level of granularity—it computes representations for individual words and is unique in
its use of attributes as a means of representing the textual and visual modalities.
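As a rough sketch of this kind of architecture (not the thesis model itself; layer sizes, activations, and the training loop below are placeholders), a bimodal autoencoder can encode the two attribute vectors separately, join them in a shared embedding layer, and reconstruct both modalities from that embedding:

```python
# Minimal bimodal autoencoder sketch in PyTorch; dimensions are placeholders,
# not the configuration used in the thesis.
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, d_text, d_vis, d_hidden=500, d_embed=300):
        super().__init__()
        self.enc_text = nn.Sequential(nn.Linear(d_text, d_hidden), nn.Sigmoid())
        self.enc_vis = nn.Sequential(nn.Linear(d_vis, d_hidden), nn.Sigmoid())
        # joint layer maps the two modality codes into a common embedding space
        self.enc_joint = nn.Sequential(nn.Linear(2 * d_hidden, d_embed), nn.Sigmoid())
        self.dec_joint = nn.Sequential(nn.Linear(d_embed, 2 * d_hidden), nn.Sigmoid())
        self.dec_text = nn.Linear(d_hidden, d_text)
        self.dec_vis = nn.Linear(d_hidden, d_vis)

    def forward(self, x_text, x_vis):
        h = torch.cat([self.enc_text(x_text), self.enc_vis(x_vis)], dim=-1)
        z = self.enc_joint(h)                      # shared multimodal embedding
        h_t, h_v = self.dec_joint(z).chunk(2, dim=-1)
        return self.dec_text(h_t), self.dec_vis(h_v), z

model = BimodalAutoencoder(d_text=1000, d_vis=636)
x_t, x_v = torch.rand(8, 1000), torch.rand(8, 636)
rec_t, rec_v, z = model(x_t, x_v)
loss = nn.functional.mse_loss(rec_t, x_t) + nn.functional.mse_loss(rec_v, x_v)
```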
We evaluate the effectiveness of the representations learnt by our model by assessing
its ability to account for human behaviour on three semantic tasks, namely word
similarity, concept categorisation, and typicality of category members. With respect to
the word similarity task, we focus on the model’s ability to capture similarity in both
the meaning and appearance of the words’ referents. Since existing benchmark datasets
on word similarity do not distinguish between these two dimensions and often contain
abstract words, we create a new dataset in a large-scale experiment where participants
are asked to give two ratings per word pair expressing their semantic and visual
similarity, respectively. Experimental results show that our model learns meaningful
representations which are more accurate than models based on individual modalities or
different modality integration mechanisms. The presented model is furthermore able to
predict textual attributes for new concepts given their visual attribute predictions only,
which we demonstrate by comparing model output with human-generated attributes.
Finally, we show the model’s effectiveness in an image-based task on visual category
learning, in which images are used as a stand-in for real-world objects.
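To make the evaluation protocol concrete, word-similarity fit is typically measured by correlating cosine similarities between learned embeddings with the human ratings using Spearman's rho; a minimal sketch with placeholder data structures, not the thesis's actual evaluation code:

```python
# Evaluation sketch: correlate model similarities with human similarity ratings.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_similarity(embeddings, pairs):
    """embeddings: dict word -> vector; pairs: list of (word1, word2, human_rating)."""
    model_scores = [cosine(embeddings[w1], embeddings[w2]) for w1, w2, _ in pairs]
    human_scores = [r for _, _, r in pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho   # higher rho = better fit to the behavioural data
```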
Learning grounded word meaning representations on similarity graphs
This paper introduces a novel approach to learning visually grounded meaning representations of words as low-dimensional node embeddings on an underlying graph hierarchy. The lower level of the hierarchy models modality-specific word representations through dedicated but communicating graphs, while the higher level puts these representations together on a single graph to learn a representation jointly from both modalities. The topology of each graph models similarity relations among words and is estimated jointly with the graph embedding. The assumption underlying this model is that words sharing similar meanings correspond to communities in an underlying similarity graph in a low-dimensional space. We name this model Hierarchical Multi-Modal Similarity Graph Embedding (HM-SGE). Experimental results validate the ability of HM-SGE to simulate human similarity judgements and concept categorization, outperforming the state of the art.
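A greatly simplified stand-in for the graph-based idea (not the HM-SGE algorithm): build a k-nearest-neighbour similarity graph per modality from word vectors, combine the graphs, and embed the nodes in a low-dimensional space. The naive combination step and the use of a spectral embedding here are illustrative simplifications:

```python
# Simplified stand-in for the graph-based idea (NOT the HM-SGE model):
# a k-NN similarity graph per modality, combined and embedded spectrally.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import SpectralEmbedding

def knn_graph(vectors, k=10):
    sim = cosine_similarity(vectors)
    np.fill_diagonal(sim, 0.0)
    adj = np.zeros_like(sim)
    for i, row in enumerate(sim):
        nn_idx = np.argsort(row)[-k:]          # keep the k most similar words
        adj[i, nn_idx] = row[nn_idx]
    return np.maximum(adj, adj.T)              # symmetrise the graph

def embed_words(text_vecs, visual_vecs, dim=20):
    joint = knn_graph(text_vecs) + knn_graph(visual_vecs)   # naive combination
    return SpectralEmbedding(n_components=dim,
                             affinity="precomputed").fit_transform(joint)
```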
Limitations of Cross-Lingual Learning from Image Search
Cross-lingual representation learning is an important step in making NLP
scale to all the world's languages. Recent work on bilingual lexicon induction
suggests that it is possible to learn cross-lingual representations of words
based on similarities between images associated with these words. However, that
work focused on the translation of selected nouns only. In our work, we
investigate whether the meaning of other parts-of-speech, in particular
adjectives and verbs, can be learned in the same way. We also experiment with
combining the representations learned from visual data with embeddings learned
from textual data. Our experiments across five language pairs indicate that
previous work does not scale to the problem of learning cross-lingual
representations beyond simple nouns.
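One way to make the image-based approach concrete (a sketch assuming that CNN features for each word's image search results have already been extracted; the feature extraction itself is hypothetical here): score candidate translations by a set-to-set similarity over the two words' image sets, for example the mean of best-matching cosines:

```python
# Sketch of image-based translation scoring over precomputed image features.
import numpy as np

def set_similarity(imgs_a, imgs_b):
    """Average, over images of word A, of the best-matching image of word B."""
    a = imgs_a / np.linalg.norm(imgs_a, axis=1, keepdims=True)
    b = imgs_b / np.linalg.norm(imgs_b, axis=1, keepdims=True)
    sims = a @ b.T                      # pairwise cosine similarities
    return sims.max(axis=1).mean()

def translate(source_word, src_images, tgt_images):
    """Pick the target-language word whose image set best matches the source word's."""
    scores = {w: set_similarity(src_images[source_word], feats)
              for w, feats in tgt_images.items()}
    return max(scores, key=scores.get)
```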
A Neurobiologically Motivated Analysis of Distributional Semantic Models
The pervasive use of distributional semantic models or word embeddings in a
variety of research fields is due to their remarkable ability to represent the
meanings of words for both practical application and cognitive modeling.
However, little is known about what kind of information is encoded in
text-based word vectors. This lack of understanding is particularly problematic
when word vectors are regarded as a model of semantic representation for
abstract concepts. This paper attempts to reveal the internal information of
distributional word vectors through an analysis using Binder et al.'s (2016)
brain-based vectors, which are explicitly structured conceptual representations based on
neurobiologically motivated attributes. In the analysis, a mapping from
text-based vectors to brain-based vectors is trained and prediction performance
is evaluated by comparing the estimated and original brain-based vectors. The
analysis demonstrates that social and cognitive information is better encoded
in text-based word vectors, but emotional information is not. This result is
discussed in terms of embodied theories for abstract concepts.
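The mapping analysis can be sketched as follows (placeholder data; the choice of ridge regression and per-word Pearson correlation is one plausible reading of the setup, not necessarily the paper's exact configuration):

```python
# Sketch: learn a linear map from text-based word vectors to Binder-style
# attribute vectors and score held-out predictions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from scipy.stats import pearsonr

def evaluate_mapping(X, Y, folds=10):
    """X: (n_words, d_text) text-based vectors; Y: (n_words, n_attrs) brain-based vectors."""
    Y_hat = cross_val_predict(Ridge(alpha=1.0), X, Y, cv=folds)
    # correlate predicted and original brain-based vectors word by word
    return np.mean([pearsonr(y, y_hat)[0] for y, y_hat in zip(Y, Y_hat)])
```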
Learning Multimodal Word Representation via Dynamic Fusion Methods
Multimodal models have been shown to outperform text-based models at
learning semantic word representations. Almost all previous multimodal models
treat the representations from different modalities equally. However,
it is obvious that information from different modalities contributes
differently to the meaning of words. This motivates us to build a multimodal
model that can dynamically fuse the semantic representations from different
modalities according to different types of words. To that end, we propose three
novel dynamic fusion methods to assign importance weights to each modality, in
which weights are learned under the weak supervision of word association pairs.
Extensive experiments demonstrate that the proposed methods outperform strong unimodal baselines and state-of-the-art multimodal models.
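One plausible reading of such a dynamic fusion mechanism (a sketch, not necessarily the paper's exact formulation) is a gating network that predicts per-word modality weights, trained so that the fused vectors of weakly supervised word association pairs end up close together:

```python
# Sketch of a gated fusion layer with learned per-word modality weights (PyTorch).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d_text, d_img, d_out=300):
        super().__init__()
        self.proj_text = nn.Linear(d_text, d_out)
        self.proj_img = nn.Linear(d_img, d_out)
        self.gate = nn.Sequential(nn.Linear(d_text + d_img, 2), nn.Softmax(dim=-1))

    def forward(self, x_text, x_img):
        w = self.gate(torch.cat([x_text, x_img], dim=-1))   # per-word modality weights
        return w[..., :1] * self.proj_text(x_text) + w[..., 1:] * self.proj_img(x_img)

# Weak supervision: push fused vectors of associated word pairs closer together.
fusion = GatedFusion(d_text=300, d_img=128)
a_text, a_img = torch.rand(16, 300), torch.rand(16, 128)   # word A of each pair
b_text, b_img = torch.rand(16, 300), torch.rand(16, 128)   # associated word B
loss = 1 - nn.functional.cosine_similarity(fusion(a_text, a_img),
                                            fusion(b_text, b_img)).mean()
```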
Using Sparse Semantic Embeddings Learned from Multimodal Text and Image Data to Model Human Conceptual Knowledge
Distributional models provide a convenient way to model semantics using dense
embedding spaces derived from unsupervised learning algorithms. However, the
dimensions of dense embedding spaces are not designed to resemble human
semantic knowledge. Moreover, embeddings are often built from a single source
of information (typically text data), even though neurocognitive research
suggests that semantics is deeply linked to both language and perception. In
this paper, we combine multimodal information from both text and image-based
representations derived from state-of-the-art distributional models to produce
sparse, interpretable vectors using Joint Non-Negative Sparse Embedding.
Through in-depth analyses comparing these sparse models to human-derived
behavioural and neuroimaging data, we demonstrate their ability to predict
interpretable linguistic descriptions of human ground-truth semantic knowledge.
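A heavily simplified stand-in for the joint factorization idea (not the JNNSE algorithm itself): learn one set of shared non-negative codes over the concatenated text and image matrices. The shift to non-negativity and the use of plain NMF, without an explicit sparsity penalty, are simplifications:

```python
# Simplified stand-in for joint non-negative sparse embedding (not JNNSE):
# shared non-negative codes over concatenated text and image matrices.
import numpy as np
from sklearn.decomposition import NMF

def joint_codes(text_mat, img_mat, n_dims=50):
    """text_mat, img_mat: (n_words, d_text) and (n_words, d_img), rows aligned by word."""
    X = np.hstack([text_mat, img_mat])
    X = X - X.min()                      # crude shift to non-negativity (simplification)
    model = NMF(n_components=n_dims, init="nndsvda", max_iter=500)
    W = model.fit_transform(X)           # (n_words, n_dims) shared non-negative codes
    return W / (W.max(axis=0, keepdims=True) + 1e-9)
```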
Don't Blame Distributional Semantics if it can't do Entailment
Distributional semantics has had enormous empirical success in Computational
Linguistics and Cognitive Science in modeling various semantic phenomena, such
as semantic similarity, and distributional models are widely used in
state-of-the-art Natural Language Processing systems. However, the theoretical
status of distributional semantics within a broader theory of language and
cognition is still unclear: What does distributional semantics model? Can it
be, on its own, a fully adequate model of the meanings of linguistic
expressions? The standard answer is that distributional semantics is not fully
adequate in this regard, because it falls short on some of the central aspects
of formal semantic approaches: truth conditions, entailment, reference, and
certain aspects of compositionality. We argue that this standard answer rests
on a misconception: these aspects do not belong in a theory of expression
meaning; they are instead aspects of speaker meaning, i.e., communicative
intentions in a particular context. In a slogan: words do not refer, speakers
do. Clearing this up enables us to argue that distributional semantics on its
own is an adequate model of expression meaning. Our proposal sheds light on the
role of distributional semantics in a broader theory of language and cognition,
its relationship to formal semantics, and its place in computational models.