3,260 research outputs found
Multimodal Convolutional Neural Networks for Matching Image and Sentence
In this paper, we propose multimodal convolutional neural networks (m-CNNs)
for matching image and sentence. Our m-CNN provides an end-to-end framework
with convolutional architectures to exploit image representation, word
composition, and the matching relations between the two modalities. More
specifically, it consists of one image CNN encoding the image content, and one
matching CNN learning the joint representation of image and sentence. The
matching CNN composes words to different semantic fragments and learns the
inter-modal relations between image and the composed fragments at different
levels, thus fully exploit the matching relations between image and sentence.
Experimental results on benchmark databases of bidirectional image and sentence
retrieval demonstrate that the proposed m-CNNs can effectively capture the
information necessary for image and sentence matching. Specifically, our
proposed m-CNNs for bidirectional image and sentence retrieval on Flickr30K and
Microsoft COCO databases achieve the state-of-the-art performances.Comment: Accepted by ICCV 201
Weakly-supervised Visual Grounding of Phrases with Linguistic Structures
We propose a weakly-supervised approach that takes image-sentence pairs as
input and learns to visually ground (i.e., localize) arbitrary linguistic
phrases, in the form of spatial attention masks. Specifically, the model is
trained with images and their associated image-level captions, without any
explicit region-to-phrase correspondence annotations. To this end, we introduce
an end-to-end model which learns visual groundings of phrases with two types of
carefully designed loss functions. In addition to the standard discriminative
loss, which enforces that attended image regions and phrases are consistently
encoded, we propose a novel structural loss which makes use of the parse tree
structures induced by the sentences. In particular, we ensure complementarity
among the attention masks that correspond to sibling noun phrases, and
compositionality of attention masks among the children and parent phrases, as
defined by the sentence parse tree. We validate the effectiveness of our
approach on the Microsoft COCO and Visual Genome datasets.Comment: CVPR 201
Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs
which play a crucial role for the compositional power of language.Comment: The paper has been published in the Proceedings of the 27 Conference
of Computational Linguistics. Please refer to this version for citations:
https://www.aclweb.org/anthology/papers/C/C18/C18-1197
- …