2,705 research outputs found
Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datatsets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models.Comment: 25 page
Textbook Question Answering with Multi-modal Context Graph Understanding and Self-supervised Open-set Comprehension
In this work, we introduce a novel algorithm for solving the textbook
question answering (TQA) task which describes more realistic QA problems
compared to other recent tasks. We mainly focus on two related issues with
analysis of the TQA dataset. First, solving the TQA problems requires to
comprehend multi-modal contexts in complicated input data. To tackle this issue
of extracting knowledge features from long text lessons and merging them with
visual features, we establish a context graph from texts and images, and
propose a new module f-GCN based on graph convolutional networks (GCN). Second,
scientific terms are not spread over the chapters and subjects are split in the
TQA dataset. To overcome this so called "out-of-domain" issue, before learning
QA problems, we introduce a novel self-supervised open-set learning process
without any annotations. The experimental results show that our model
significantly outperforms prior state-of-the-art methods. Moreover, ablation
studies validate that both methods of incorporating f-GCN for extracting
knowledge from multi-modal contexts and our newly proposed self-supervised
learning process are effective for TQA problems.Comment: ACL2019 Camera-read
Linguistic Structure Guided Context Modeling for Referring Image Segmentation
Referring image segmentation aims to predict the foreground mask of the
object referred by a natural language sentence. Multimodal context of the
sentence is crucial to distinguish the referent from the background. Existing
methods either insufficiently or redundantly model the multimodal context. To
tackle this problem, we propose a "gather-propagate-distribute" scheme to model
multimodal context by cross-modal interaction and implement this scheme as a
novel Linguistic Structure guided Context Modeling (LSCM) module. Our LSCM
module builds a Dependency Parsing Tree suppressed Word Graph (DPT-WG) which
guides all the words to include valid multimodal context of the sentence while
excluding disturbing ones through three steps over the multimodal feature,
i.e., gathering, constrained propagation and distributing. Extensive
experiments on four benchmarks demonstrate that our method outperforms all the
previous state-of-the-arts.Comment: Accepted by ECCV 2020. Code is available at
https://github.com/spyflying/LSCM-Refse
Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement
We present a new method, PARsing And visual GrOuNding (ParaGon), for
grounding natural language in object placement tasks. Natural language
generally describes objects and spatial relations with compositionality and
ambiguity, two major obstacles to effective language grounding. For
compositionality, ParaGon parses a language instruction into an object-centric
graph representation to ground objects individually. For ambiguity, ParaGon
uses a novel particle-based graph neural network to reason about object
placements with uncertainty. Essentially, ParaGon integrates a parsing
algorithm into a probabilistic, data-driven learning framework. It is fully
differentiable and trained end-to-end from data for robustness against complex,
ambiguous language input.Comment: To appear in ICRA 202
Deriving Verb Predicates By Clustering Verbs with Arguments
Hand-built verb clusters such as the widely used Levin classes (Levin, 1993)
have proved useful, but have limited coverage. Verb classes automatically
induced from corpus data such as those from VerbKB (Wijaya, 2016), on the other
hand, can give clusters with much larger coverage, and can be adapted to
specific corpora such as Twitter. We present a method for clustering the
outputs of VerbKB: verbs with their multiple argument types, e.g.
"marry(person, person)", "feel(person, emotion)." We make use of a novel
low-dimensional embedding of verbs and their arguments to produce high quality
clusters in which the same verb can be in different clusters depending on its
argument type. The resulting verb clusters do a better job than hand-built
clusters of predicting sarcasm, sentiment, and locus of control in tweets
The Grail theorem prover: Type theory for syntax and semantics
As the name suggests, type-logical grammars are a grammar formalism based on
logic and type theory. From the prespective of grammar design, type-logical
grammars develop the syntactic and semantic aspects of linguistic phenomena
hand-in-hand, letting the desired semantics of an expression inform the
syntactic type and vice versa. Prototypical examples of the successful
application of type-logical grammars to the syntax-semantics interface include
coordination, quantifier scope and extraction.This chapter describes the Grail
theorem prover, a series of tools for designing and testing grammars in various
modern type-logical grammars which functions as a tool . All tools described in
this chapter are freely available
- …