Scene Graph Generation from Objects, Phrases and Region Captions
Object detection, scene graph generation and region captioning, which are
three scene understanding tasks at different semantic levels, are tied
together: scene graphs are generated on top of objects detected in an image
with their pairwise relationships predicted, while region captioning gives a
language description of the objects, their attributes, relations, and other
context information. In this work, to leverage the mutual connections across
semantic levels, we propose a novel neural network model, termed the Multi-level
Scene Description Network (MSDN), to solve the three vision tasks
jointly in an end-to-end manner. Objects, phrases, and caption regions are
first aligned with a dynamic graph based on their spatial and semantic
connections. Then a feature refining structure is used to pass messages across
the three levels of semantic tasks through the graph. We benchmark the learned
model on the three tasks and show that joint learning across them with our
proposed method brings mutual improvements over previous models.
In particular, on the scene graph generation task, our proposed method
outperforms the state-of-the-art method by a margin of more than 3%.
Comment: accepted by ICCV 2017
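As a rough illustration of the message-passing idea described above, the following minimal example (our own sketch, not the released MSDN code; all names, shapes, and the update rule are assumptions) refines object, phrase, and caption-region features over a simple alignment graph:

import numpy as np

def refine(obj_feats, phrase_feats, region_feats, obj2phr, phr2reg, steps=2):
    """Toy message passing across the three semantic levels.
    obj2phr[i, j] = 1 if object i participates in phrase j;
    phr2reg[j, k] = 1 if phrase j falls inside caption region k."""
    for _ in range(steps):
        # messages flow up (objects -> phrases -> regions) and back down
        phrase_feats = phrase_feats + obj2phr.T @ obj_feats + phr2reg @ region_feats
        obj_feats = obj_feats + obj2phr @ phrase_feats
        region_feats = region_feats + phr2reg.T @ phrase_feats
    return obj_feats, phrase_feats, region_feats

# toy sizes: 4 objects, 3 phrases, 2 caption regions, 5-d features
o, p, r = np.random.randn(4, 5), np.random.randn(3, 5), np.random.randn(2, 5)
o2p = np.random.randint(0, 2, (4, 3)).astype(float)
p2r = np.random.randint(0, 2, (3, 2)).astype(float)
print([f.shape for f in refine(o, p, r, o2p, p2r)])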
Scene Graph Generation via Conditional Random Fields
Despite the great success object detection and segmentation models have
achieved in recognizing individual objects in images, performance on cognitive
tasks such as image captioning, semantic image retrieval, and visual QA is far
from satisfactory. To achieve better performance on these cognitive tasks,
merely recognizing individual object instances is insufficient. Instead, the
interactions between object instances need to be captured in order to
facilitate reasoning and understanding of the visual scenes in an image. Scene
graph, a graph representation of images that captures object instances and
their relationships, offers a comprehensive understanding of an image. However,
existing techniques on scene graph generation fail to distinguish subjects and
objects in the visual scenes of images and thus do not perform well on
real-world datasets that contain ambiguous object instances. In this work, we
propose a novel scene graph generation model for predicting object instances
and their corresponding relationships in an image. Our model, SG-CRF, learns the
sequential order of subject and object in a relationship triplet, and the
semantic compatibility of object instance nodes and relationship nodes in a
scene graph efficiently. Experiments empirically show that SG-CRF outperforms
state-of-the-art methods on three different datasets, i.e., CLEVR, VRD,
and Visual Genome, raising the Recall@100 from 24.99% to 49.95%, from 41.92% to
50.47%, and from 54.69% to 54.77%, respectively.
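To make the role of the learned compatibilities concrete, here is a minimal sketch (our own, not the paper's code; the scores and matrices are illustrative assumptions) of scoring an ordered <subject, predicate, object> triplet with unary terms plus pairwise compatibility terms, the kind of energy a CRF-style scene graph model optimizes:

import numpy as np

def triplet_score(subj_scores, pred_scores, obj_scores,
                  subj_pred_compat, pred_obj_compat,
                  subj, pred, obj):
    """Higher is better; the ordered compatibility matrices let the model
    prefer riding(man, horse) over riding(horse, man)."""
    unary = subj_scores[subj] + pred_scores[pred] + obj_scores[obj]
    pairwise = subj_pred_compat[subj, pred] + pred_obj_compat[pred, obj]
    return unary + pairwise

# toy vocabulary: instances 0=man, 1=horse; predicate 0=riding
rng = np.random.default_rng(0)
node = rng.random(2)           # detector confidences for the two instances
pred = np.array([0.9])         # relationship detector confidence for "riding"
sp = np.array([[0.8], [0.1]])  # learned (subject, predicate) compatibility
po = np.array([[0.1, 0.9]])    # learned (predicate, object) compatibility

print(triplet_score(node, pred, node, sp, po, subj=0, pred=0, obj=1))  # man riding horse
print(triplet_score(node, pred, node, sp, po, subj=1, pred=0, obj=0))  # horse riding man (lower)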
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
The Flickr30k dataset has become a standard benchmark for sentence-based
image description. This paper presents Flickr30k Entities, which augments the
158k captions from Flickr30k with 244k coreference chains, linking mentions of
the same entities across different captions for the same image, and associating
them with 276k manually annotated bounding boxes. Such annotations are
essential for continued progress in automatic image description and grounded
language understanding. They enable us to define a new benchmark for
localization of textual entity mentions in an image. We present a strong
baseline for this task that combines an image-text embedding, detectors for
common objects, a color classifier, and a bias towards selecting larger
objects. While our baseline rivals more complex state-of-the-art models in
accuracy, we show that its gains cannot be easily parlayed into improvements on
tasks such as image-sentence retrieval, thus underlining the limitations of
current methods and the need for further research.
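A minimal sketch of such a cue-combination baseline (our own illustration, not the authors' released code; weights and helper names are assumptions) might score each candidate box for a phrase as a weighted sum of an embedding similarity, detector and color-classifier scores, and a bias towards larger boxes:

def score_box(box, phrase, embed_sim, detector_score, color_score, image_area,
              w_embed=1.0, w_det=0.5, w_color=0.5, w_size=0.2):
    """Combine the cues for one candidate box; all cue functions are assumed callables."""
    x1, y1, x2, y2 = box
    size_bias = (x2 - x1) * (y2 - y1) / image_area  # bias towards larger objects
    return (w_embed * embed_sim(phrase, box)
            + w_det * detector_score(phrase, box)
            + w_color * color_score(phrase, box)
            + w_size * size_bias)

def localize(phrase, candidate_boxes, **cues):
    # pick the highest-scoring candidate box for the textual entity mention
    return max(candidate_boxes, key=lambda b: score_box(b, phrase, **cues))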
A Comprehensive Survey of Deep Learning for Image Captioning
Generating a description of an image is called image captioning. Image
captioning requires recognizing the important objects, their attributes, and
their relationships in an image. It also needs to generate syntactically and
semantically correct sentences. Deep learning-based techniques are capable of
handling the complexities and challenges of image captioning. In this survey
paper, we aim to present a comprehensive review of existing deep learning-based
image captioning techniques. We discuss the foundation of the techniques to
analyze their performances, strengths and limitations. We also discuss the
datasets and the evaluation metrics popularly used in deep learning based
automatic image captioning.
Comment: 36 pages, accepted as a journal paper in ACM Computing Surveys
(October 2018)
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Despite progress in perceptual tasks such as image classification, computers
still perform poorly on cognitive tasks such as image description and question
answering. Cognition is core to tasks that involve not just recognizing, but
reasoning about our visual world. However, models used to tackle the rich
content in images for cognitive tasks are still being trained using the same
datasets designed for perceptual tasks. To achieve success at cognitive tasks,
models need to understand the interactions and relationships between objects in
an image. When asked "What vehicle is the person riding?", computers will need
to identify the objects in an image as well as the relationships riding(man,
carriage) and pulling(horse, carriage) in order to answer correctly that "the
person is riding a horse-drawn carriage".
In this paper, we present the Visual Genome dataset to enable the modeling of
such relationships. We collect dense annotations of objects, attributes, and
relationships within each image to learn these models. Specifically, our
dataset contains over 100K images where each image has an average of 21
objects, 18 attributes, and 18 pairwise relationships between objects. We
canonicalize the objects, attributes, relationships, and noun phrases in region
descriptions and question answer pairs to WordNet synsets. Together, these
annotations represent the densest and largest dataset of image descriptions,
objects, attributes, relationships, and question answers.
Comment: 44 pages, 37 figures
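For illustration, one image's annotations could be organized along these lines (a hypothetical structure; the field names do not reflect the official Visual Genome JSON schema):

annotation = {
    "image_id": 1,
    "objects": [
        {"id": 1, "name": "man",      "synset": "man.n.01",      "bbox": [40, 60, 120, 300]},
        {"id": 2, "name": "carriage", "synset": "carriage.n.02", "bbox": [150, 100, 420, 320]},
        {"id": 3, "name": "horse",    "synset": "horse.n.01",    "bbox": [90, 110, 260, 330]},
    ],
    "attributes": [{"object_id": 3, "attribute": "brown"}],
    "relationships": [
        {"subject_id": 1, "predicate": "riding",  "object_id": 2},  # riding(man, carriage)
        {"subject_id": 3, "predicate": "pulling", "object_id": 2},  # pulling(horse, carriage)
    ],
    "regions": [{"phrase": "a man riding a horse-drawn carriage", "bbox": [40, 60, 420, 330]}],
    "qas": [{"question": "What vehicle is the person riding?",
             "answer": "a horse-drawn carriage"}],
}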
Visual Relationship Detection using Scene Graphs: A Survey
Understanding a scene by decoding the visual relationships depicted in an
image has been a long studied problem. While the recent advances in deep
learning and the usage of deep neural networks have achieved near-human
accuracy on many tasks, there still exists a substantial gap between human- and
machine-level performance when it comes to various visual relationship
detection tasks. Building on earlier tasks like object recognition,
segmentation, and captioning, which focused on a relatively coarse level of
image understanding, newer tasks have been introduced recently to deal with a finer
level of image understanding. A Scene Graph is one such technique to better
represent a scene and the various relationships present in it. With its wide
range of applications in tasks like Visual Question Answering,
Semantic Image Retrieval, Image Generation, among many others, it has proved to
be a useful tool for deeper and better visual relationship understanding. In
this paper, we present a detailed survey of the various techniques for scene
graph generation, their efficacy in representing visual relationships, and how
scene graphs have been used to solve various downstream tasks. We also attempt
to analyze the directions in which the field might advance in the future. As
one of the first papers to give a detailed survey on this topic, we also hope
to give a succinct introduction to scene graphs and to guide practitioners in
developing approaches for their applications.
Generating Multi-Sentence Lingual Descriptions of Indoor Scenes
This paper proposes a novel framework for generating lingual descriptions of
indoor scenes. While substantial efforts have been made to tackle this
problem, previous approaches have focused primarily on generating a single
sentence for each image, which is not sufficient for describing complex scenes. We
attempt to go beyond this, by generating coherent descriptions with multiple
sentences. Our approach is distinguished from conventional ones in several
aspects: (1) a 3D visual parsing system that jointly infers objects,
attributes, and relations; (2) a generative grammar learned automatically from
training text; and (3) a text generation algorithm that takes into account the
coherence among sentences. Experiments on the augmented NYU-v2 dataset show
that our framework can generate natural descriptions with substantially higher
ROUGE scores than those produced by the baseline.
From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge
In this paper we propose the construction of linguistic descriptions of
images. This is achieved through the extraction of scene description graphs
(SDGs) from visual scenes using an automatically constructed knowledge base.
SDGs are constructed using both vision and reasoning. Specifically, commonsense
reasoning is applied on (a) detections obtained from existing perception
methods on given images, (b) a "commonsense" knowledge base constructed using
natural language processing of image annotations and (c) lexical ontological
knowledge from resources such as WordNet. Amazon Mechanical Turk (AMT)-based
evaluations on Flickr8k, Flickr30k and MS-COCO datasets show that in most
cases, sentences auto-constructed from SDGs obtained by our method give a more
relevant and thorough description of an image than a recent state-of-the-art
image-captioning-based approach. Our Image-Sentence Alignment Evaluation results
are also comparable to those of recent state-of-the-art approaches.
Exploring Visual Relationship for Image Captioning
It is widely believed that modeling relationships between objects would
be helpful for representing and eventually describing an image. Nevertheless,
there has been little evidence in support of this idea for image description
generation. In this paper, we introduce a new design to explore the connections
between objects for image captioning under the umbrella of an attention-based
encoder-decoder framework. Specifically, we present a Graph Convolutional
Networks plus Long Short-Term Memory (dubbed GCN-LSTM) architecture that
integrates both semantic and spatial object relationships into the image
encoder. Technically, we build graphs over the detected objects in an image
based on their spatial and semantic connections. The representation of each
object region is then refined by leveraging the graph structure through GCN.
With the learnt region-level features, our GCN-LSTM capitalizes on an
LSTM-based captioning framework with an attention mechanism for sentence
generation. Extensive experiments are conducted on the COCO image captioning
dataset, and superior results are reported when compared to state-of-the-art
approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1%
to 128.7% on the COCO testing set.
Comment: ECCV 2018
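As a rough sketch of the refinement step described above (our own assumption-laden example, not the authors' implementation; the decoder is omitted), region features from detected objects can be propagated over a relation graph before being handed to an attention-based LSTM decoder:

import numpy as np

def gcn_layer(region_feats, adjacency, weights):
    """region_feats: (N, d) detected-object features; adjacency: (N, N) graph
    over objects; weights: (d, d) learned projection. Returns refined features."""
    deg = adjacency.sum(axis=1, keepdims=True) + 1e-6
    neighborhood = (adjacency / deg) @ region_feats                 # average neighbor features
    return np.maximum(neighborhood @ weights + region_feats, 0.0)   # ReLU with residual

# toy example: 3 detected regions, 4-d features, a fully connected relation graph
feats = np.random.randn(3, 4)
adj = np.ones((3, 3)) - np.eye(3)
refined = gcn_layer(feats, adj, np.eye(4))
print(refined.shape)  # (3, 4) region-level features fed to the LSTM captioner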
Multi-task Learning of Hierarchical Vision-Language Representation
It is still challenging to build an AI system that can perform tasks that
involve vision and language at a human level. So far, researchers have tackled
individual tasks separately, designing networks for each of them and training
them on dedicated datasets. Although this approach has seen a
certain degree of success, it comes with difficulties in understanding the
relations among different tasks and in transferring the knowledge learned for a
task to others. We propose a multi-task learning approach that enables learning
a vision-language representation that is shared by many tasks from their diverse
datasets. The representation is hierarchical, and prediction for each task is
computed from the representation at its corresponding level of the hierarchy.
We show through experiments that our method consistently outperforms previous
single-task-learning methods on image caption retrieval, visual question
answering, and visual grounding. We also analyze the learned hierarchical
representation by visualizing the attention maps generated in our network.
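The general idea can be sketched as a shared hierarchical encoder whose intermediate levels feed task-specific heads (a minimal illustration with assumed layer sizes and task-to-level assignments, not the paper's architecture):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def shared_encoder(x, layer_weights):
    """Return the representation at every level of the shared hierarchy."""
    levels = []
    h = x
    for w in layer_weights:
        h = relu(h @ w)
        levels.append(h)
    return levels

# three shared levels; each task reads the level it is assigned to
weights = [np.random.randn(8, 8) for _ in range(3)]
heads = {"grounding": (0, np.random.randn(8, 4)),
         "retrieval": (1, np.random.randn(8, 4)),
         "vqa":       (2, np.random.randn(8, 4))}

levels = shared_encoder(np.random.randn(1, 8), weights)
predictions = {task: levels[lvl] @ w_head for task, (lvl, w_head) in heads.items()}
print({t: p.shape for t, p in predictions.items()})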