Learning Multi-Modal Word Representation Grounded in Visual Context
Representing the semantics of words is a long-standing problem for the
natural language processing community. Most methods compute word semantics
given their textual context in large corpora. More recently, researchers have
attempted to integrate perceptual and visual features. Most of these works
consider the visual appearance of objects to enhance word representations but
they ignore the visual environment and context in which objects appear. We
propose to unify text-based techniques with vision-based techniques by
simultaneously leveraging textual and visual context to learn multimodal word
embeddings. We explore various choices for what can serve as a visual context
and present an end-to-end method to integrate visual context elements in a
multimodal skip-gram model. We provide experiments and an extensive analysis
of the obtained results.
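As a rough sketch of the kind of objective described above, the snippet below extends a skip-gram loss with negative sampling by a term that pulls a word's embedding toward a visual-context vector. This is not the authors' implementation: the tensor names, the projected visual feature, and the weight alpha are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not the paper's code): skip-gram with negative sampling,
# plus a term encouraging agreement with a visual-context feature vector.

def multimodal_skipgram_loss(word_emb, ctx_emb, neg_emb, visual_ctx, alpha=0.5):
    """word_emb:   (d,)   embedding of the target word
    ctx_emb:    (d,)   embedding of one observed textual context word
    neg_emb:    (k, d) embeddings of k negative samples
    visual_ctx: (d,)   projected feature vector of the visual context
    alpha:      weight of the visual term (a free hyperparameter here)"""
    pos = F.logsigmoid(word_emb @ ctx_emb)               # observed context
    neg = F.logsigmoid(-(neg_emb @ word_emb)).sum()      # negative samples
    text_loss = -(pos + neg)                             # standard skip-gram loss
    visual_loss = -F.logsigmoid(word_emb @ visual_ctx)   # visual-context term
    return text_loss + alpha * visual_loss
```

Training end to end would then amount to minimizing this loss over (word, textual context, visual context) triples, with visual_ctx produced by a learned projection of image features.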
From Words to Behaviour via Semantic Networks
The contents and structure of semantic networks have
been the focus of much recent research, with major
advances in the development of distributional models. In
parallel, connectionist modeling has extended our
knowledge of the processes engaged in semantic
activation. However, these two lines of investigation have
rarely been brought together. Here, starting from a standard
textual model of semantics, we allow activation to spread
throughout its associated semantic network, as dictated by
the patterns of semantic similarity between words. We
find that the activation profile of the network, measured
at various time points, can successfully account for
response times in the lexical decision task, as well as for
subjective concreteness and imageability ratings.
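The spreading process this abstract describes is easy to picture in code: activation starts at a seed word and flows along similarity-weighted edges, with a snapshot taken at each time point. The sketch below is an illustrative reconstruction, not the paper's model; the decay parameter and the row normalization are assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's model): spreading activation over a
# semantic network whose edge weights are word-word similarities.

def spread_activation(similarity, seed_index, steps=5, decay=0.8):
    """similarity: (n, n) nonnegative similarity matrix with nonzero rows
    seed_index: index of the initially activated word
    steps:      number of spreading iterations (time points)
    decay:      fraction of activation retained per step (assumed)"""
    W = similarity / similarity.sum(axis=1, keepdims=True)  # row-stochastic
    a = np.zeros(similarity.shape[0])
    a[seed_index] = 1.0
    profile = [a.copy()]
    for _ in range(steps):
        a = decay * (W.T @ a)     # activation flows along weighted edges
        profile.append(a.copy())  # snapshot of the network at this time point
    return profile
```

The activation profile returned here, read off at successive time points, is the kind of quantity the paper relates to lexical-decision response times and to concreteness and imageability ratings.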
Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs,
which play a crucial role in the compositional power of language.
Comment: The paper has been published in the Proceedings of the 27th
International Conference on Computational Linguistics (COLING 2018). Please
refer to that version for citations:
https://www.aclweb.org/anthology/papers/C/C18/C18-1197
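To make "combining multimodal representations" concrete, the sketch below shows three common fusion schemes of the kind such surveys compare: concatenation, a fixed-weight mixture, and a learned gate. It is a generic illustration, not code from the survey; all names are assumptions.

```python
import torch

# Illustrative fusion schemes for a textual vector t and a visual vector v,
# both of shape (dim,). Not taken from the survey.

def concat_fusion(t, v):
    return torch.cat([t, v], dim=-1)   # early fusion by concatenation

def additive_fusion(t, v, w=0.5):
    return w * t + (1 - w) * v         # fixed-weight mixture

class GatedFusion(torch.nn.Module):
    """Learned gate choosing, per dimension, how much of each modality to keep."""
    def __init__(self, dim):
        super().__init__()
        self.gate = torch.nn.Linear(2 * dim, dim)

    def forward(self, t, v):
        g = torch.sigmoid(self.gate(torch.cat([t, v], dim=-1)))
        return g * t + (1 - g) * v
```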
Multimodal Variational Autoencoders for Semi-Supervised Learning: In Defense of Product-of-Experts
Multimodal generative models should be able to learn a meaningful latent
representation that enables a coherent joint generation of all modalities
(e.g., images and text). Many applications also require the ability to
accurately sample modalities conditioned on observations of a subset of the
modalities. Often not all modalities may be observed for all training data
points, so semi-supervised learning should be possible. In this study, we
evaluate a family of product-of-experts (PoE) based variational autoencoders
that have these desired properties. We include a novel PoE-based architecture
and training procedure. An empirical evaluation shows that the PoE-based models
can outperform an additive mixture-of-experts (MoE) approach. Our experiments
support the intuition that PoE models are more suited for a conjunctive
combination of modalities, while MoEs are more suited for a disjunctive fusion.
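The conjunctive behaviour attributed to PoE models here can be seen in closed form: for Gaussian experts, precisions add, so each observed modality can only sharpen the joint posterior. The sketch below is a generic product of Gaussians, not the paper's architecture; including a standard normal prior expert, as is common in PoE-based VAEs, is an assumed choice.

```python
import numpy as np

# Minimal sketch (not the paper's code): closed-form product of Gaussian
# experts, one expert per observed modality. Precisions (1/sigma^2) add,
# and means are combined by precision weighting.

def product_of_gaussians(mus, logvars):
    """mus, logvars: lists of (d,) arrays, one pair per observed modality.
    A prior expert N(0, I) can be included as mu=zeros(d), logvar=zeros(d)."""
    precisions = [np.exp(-lv) for lv in logvars]            # 1 / sigma^2
    joint_prec = sum(precisions)                            # precisions add
    joint_mu = sum(p * m for p, m in zip(precisions, mus)) / joint_prec
    joint_logvar = -np.log(joint_prec)                      # var = 1 / precision
    return joint_mu, joint_logvar
```

Because a missing modality simply contributes no expert to the product, this factorization is also what makes conditioning on an arbitrary subset of modalities (and hence semi-supervised training) straightforward in PoE-based models.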