Learning to Compose and Reason with Language Tree Structures for Visual Grounding
Grounding natural language in images, such as localizing "the black dog on
the left of the tree", is one of the core problems in artificial intelligence,
as it needs to comprehend the fine-grained and compositional language space.
However, existing solutions rely on the association between holistic
language features and visual features, while neglecting the
compositional reasoning implied in the language. In this paper, we propose a
natural language grounding model that can automatically compose a binary tree
structure for parsing the language and then perform visual reasoning along the
tree in a bottom-up fashion. We call our model RVG-TREE: Recursive Grounding
Tree, which is inspired by the intuition that any language expression can be
recursively decomposed into two constituent parts, and the grounding confidence
score can be recursively accumulated by calculating their grounding scores
returned by sub-trees. RVG-TREE can be trained end-to-end by using the
Straight-Through Gumbel-Softmax estimator, which allows gradients from the
continuous score functions to pass through the discrete tree construction.
Experiments on several benchmarks show that our model achieves
state-of-the-art performance with more explainable reasoning.
Comment: Accepted to IEEE Transactions on Pattern Analysis and Machine
Intelligence (T-PAMI).
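The Straight-Through Gumbel-Softmax estimator the abstract relies on can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names are mine, and the straight-through gradient trick itself only applies inside an autodiff framework (noted in a comment), since NumPy has no backward pass.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Draw a soft (differentiable) sample from a categorical distribution
    by perturbing the logits with Gumbel(0, 1) noise and applying a
    temperature-scaled softmax."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())          # numerically stable softmax
    return e / e.sum()

def straight_through(soft):
    """Discretize the soft sample to a one-hot vector for the forward pass.
    In an autodiff framework (e.g. PyTorch: hard + soft - soft.detach()),
    gradients would flow through the soft sample instead."""
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0
    return hard

logits = np.log(np.array([0.1, 0.6, 0.3]))
soft = gumbel_softmax_sample(logits, tau=0.5, rng=np.random.default_rng(0))
hard = straight_through(soft)
```

The discrete `hard` sample is what drives a hard decision (here it would pick which tree split to take), while the `soft` distribution carries the gradient signal during training.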
Multimodal Convolutional Neural Networks for Matching Image and Sentence
In this paper, we propose multimodal convolutional neural networks (m-CNNs)
for matching image and sentence. Our m-CNN provides an end-to-end framework
with convolutional architectures to exploit image representation, word
composition, and the matching relations between the two modalities. More
specifically, it consists of one image CNN encoding the image content, and one
matching CNN learning the joint representation of image and sentence. The
matching CNN composes words into different semantic fragments and learns the
inter-modal relations between the image and the composed fragments at different
levels, thus fully exploiting the matching relations between image and sentence.
Experimental results on benchmark databases of bidirectional image and sentence
retrieval demonstrate that the proposed m-CNNs can effectively capture the
information necessary for image and sentence matching. Specifically, our
proposed m-CNNs achieve state-of-the-art performance on bidirectional image
and sentence retrieval on the Flickr30K and Microsoft COCO databases.
Comment: Accepted by ICCV 201
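The core idea of the matching CNN (composing adjacent words into semantic fragments and scoring them against the image representation) can be sketched as below. This is a toy NumPy illustration under my own assumptions (random weights, a single composition level, cosine similarity with max-pooling), not the paper's actual architecture.

```python
import numpy as np

def compose_fragments(words, W, window=3):
    """Slide a window over word vectors and compose each window into one
    semantic fragment: one convolution 'layer' of a matching CNN."""
    n, d = words.shape
    frags = []
    for i in range(n - window + 1):
        x = words[i:i + window].reshape(-1)      # concatenate the window
        frags.append(np.maximum(W @ x, 0.0))     # linear map + ReLU
    return np.stack(frags)

def match_score(image_vec, fragments):
    """Score the image against every fragment by cosine similarity and
    max-pool: the best-matching fragment determines the score."""
    sims = fragments @ image_vec / (
        np.linalg.norm(fragments, axis=1) * np.linalg.norm(image_vec) + 1e-8)
    return sims.max()

rng = np.random.default_rng(0)
d = 8
words = rng.normal(size=(6, d))     # toy sentence: 6 word vectors
W = rng.normal(size=(d, 3 * d))     # composition weights for window=3
img = rng.normal(size=d)            # stand-in for the image CNN output
frags = compose_fragments(words, W)
score = match_score(img, frags)
```

Stacking several such composition levels (word, phrase, sentence) and learning `W` jointly with the image CNN is what lets the model relate the image to language fragments at multiple granularities.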
Learning to Reason: End-to-End Module Networks for Visual Question Answering
Natural language questions are inherently compositional, and many are most
easily answered by reasoning about their decomposition into modular
sub-problems. For example, to answer "is there an equal number of balls and
boxes?" we can look for balls, look for boxes, count them, and compare the
results. The recently proposed Neural Module Network (NMN) architecture
implements this approach to question answering by parsing questions into
linguistic substructures and assembling question-specific deep networks from
smaller modules that each solve one subtask. However, existing NMN
implementations rely on brittle off-the-shelf parsers, and are restricted to
the module configurations proposed by these parsers rather than learning them
from data. In this paper, we propose End-to-End Module Networks (N2NMNs), which
learn to reason by directly predicting instance-specific network layouts
without the aid of a parser. Our model learns to generate network structures
(by imitating expert demonstrations) while simultaneously learning network
parameters (using the downstream task loss). Experimental results on the new
CLEVR dataset targeted at compositional question answering show that N2NMNs
achieve an error reduction of nearly 50% relative to state-of-the-art
attentional approaches, while discovering interpretable network architectures
specialized for each question.
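The module-composition idea behind N2NMNs can be illustrated with the abstract's own example question. The sketch below uses symbolic stand-in modules and a postfix layout executed over a stack; the module names and the toy scene are my assumptions, standing in for the learned neural modules and predicted layouts.

```python
# Toy scene: a list of object labels standing in for image content.
SCENE = ["ball", "ball", "box", "box", "tree"]

def find(label):
    """Stand-in for an attention module: mark objects matching `label`."""
    return [1 if obj == label else 0 for obj in SCENE]

def count(mask):
    """Stand-in for a counting module over an attention mask."""
    return sum(mask)

def compare_eq(a, b):
    """Stand-in for a comparison module over two counts."""
    return a == b

def execute(layout):
    """Run a predicted module layout given in postfix order, composing
    module outputs on a stack (as a layout predictor would assemble
    question-specific networks from modules)."""
    stack = []
    for token in layout:
        if token.startswith("find:"):
            stack.append(find(token.split(":", 1)[1]))
        elif token == "count":
            stack.append(count(stack.pop()))
        elif token == "compare_eq":
            b, a = stack.pop(), stack.pop()
            stack.append(compare_eq(a, b))
    return stack.pop()

# "is there an equal number of balls and boxes?"
layout = ["find:ball", "count", "find:box", "count", "compare_eq"]
answer = execute(layout)
```

In the actual model the layout is predicted per question by a learned sequence model rather than hand-written, and each module is a small neural network instead of a symbolic function.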
Embodying Artifact Production Knowledge
On a modified view of embodied cognition, I argue that the conceptual structure of some present-day abstract artifact concepts, such as PIECE OF MUSIC or PIECE OF ART, can be effectively explained once it is taken into account that "visual recordings" of the first observed result objects played a major role in the development of abstract artifact concepts.