Image Description using Visual Dependency Representations
Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions, on both automatic measures and on human judgements.
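The abstract does not spell the representation out, but a visual dependency representation can be pictured as labeled arcs between annotated image regions, over which a template realizer walks. A minimal illustrative sketch in Python; the types, the single template, and all names are hypothetical, not the paper's models:

```python
from dataclasses import dataclass

# Hypothetical sketch: a visual dependency representation as labeled
# arcs between annotated image regions, plus a naive template realizer.

@dataclass
class Region:
    label: str    # object class of the region, e.g. "person"
    bbox: tuple   # (x, y, w, h) in image coordinates

@dataclass
class DependencyArc:
    head: Region        # governing region
    dependent: Region   # dependent region
    relation: str       # spatial relation label, e.g. "on", "beside"

def describe(arc: DependencyArc) -> str:
    """Fill a fixed template from one dependency arc."""
    return f"A {arc.dependent.label} is {arc.relation} a {arc.head.label}."

person = Region("person", (10, 20, 50, 100))
horse = Region("horse", (60, 30, 80, 90))
print(describe(DependencyArc(head=horse, dependent=person, relation="on")))
# -> "A person is on a horse."
```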
Scene Graph Generation with External Knowledge and Image Reconstruction
Scene graph generation has received growing attention with the advancements
in image understanding tasks such as object detection, attributes and
relationship prediction, etc. However, existing datasets are biased in terms
of object and relationship labels, or often come with noisy and missing
annotations, which makes the development of a reliable scene graph prediction
model very challenging. In this paper, we propose a novel scene graph
generation algorithm with external knowledge and image reconstruction loss to
overcome these dataset issues. In particular, we extract commonsense knowledge
from an external knowledge base to refine object and phrase features for
improving generalizability in scene graph generation. To address the bias of
noisy object annotations, we introduce an auxiliary image reconstruction path
to regularize the scene graph generation network. Extensive experiments show
that our framework can generate better scene graphs, achieving state-of-the-art
performance on two benchmark datasets: Visual Relationship Detection and Visual
Genome.
Comment: 10 pages, 5 figures, Accepted in CVPR 2019
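The refinement step is only named here; one plausible reading is to retrieve commonsense facts for an object's predicted label and mix their embedding into the object feature. A toy sketch under that assumption (the knowledge base, encoder, and gating scheme are all placeholders, not the paper's architecture):

```python
import numpy as np

# Illustrative sketch (not the paper's architecture): refine an object
# feature with the mean embedding of commonsense facts retrieved for its
# predicted label, mixing the two sources with a fixed gate.

KB = {  # toy stand-in for an external knowledge base such as ConceptNet
    "horse": ["horse IsA animal", "person CapableOf ride horse"],
}

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Placeholder text encoder; a real system would use a learned one."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

def refine(obj_feat: np.ndarray, label: str, gate: float = 0.5) -> np.ndarray:
    facts = KB.get(label, [])
    if not facts:
        return obj_feat  # no knowledge retrieved: keep the visual feature
    knowledge = np.mean([embed(f, obj_feat.size) for f in facts], axis=0)
    return gate * obj_feat + (1.0 - gate) * knowledge

refined = refine(embed("horse region feature"), "horse")
print(refined.shape)  # (8,)
```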
A Novel Framework for Robustness Analysis of Visual QA Models
Deep neural networks have been playing an essential role in many computer
vision tasks including Visual Question Answering (VQA). Until recently, the
study of their accuracy was the main focus of research but now there is a trend
toward assessing the robustness of these models against adversarial attacks by
evaluating their tolerance to varying noise levels. In VQA, adversarial attacks
can target the image and/or the main question, yet the latter has received
little systematic analysis. In this work, we propose a flexible framework
that focuses on the language part of VQA, using semantically relevant
questions, dubbed basic questions, as controllable noise to evaluate the
robustness of VQA models. We hypothesize that the level of noise is positively
correlated with the similarity of a basic question to the main question. Hence,
to apply noise on any given main question, we rank a pool of basic questions
based on their similarity by casting this ranking task as a LASSO optimization
problem. Then, we propose a novel robustness measure, R_score, and two
large-scale basic question datasets (BQDs) in order to standardize robustness
analysis for VQA models.
Comment: Accepted by the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) as an oral paper
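The LASSO-based ranking can be sketched concretely: encode the main question and the pool of basic questions, regress the main-question embedding on the basic-question embeddings with an L1 penalty, and rank by coefficient magnitude. A minimal sketch with scikit-learn, using random placeholder embeddings rather than a real question encoder:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch of the ranking step: express the main question's embedding as a
# sparse combination of basic-question embeddings; larger coefficients
# mean higher similarity, hence stronger "noise" when appended.

rng = np.random.default_rng(0)
dim, pool_size = 64, 200
basic_embs = rng.standard_normal((pool_size, dim))  # placeholder encoder output
main_emb = rng.standard_normal(dim)

# Solve: min_x ||q - B^T x||^2 + alpha * ||x||_1
model = Lasso(alpha=0.05, positive=True, max_iter=10000)
model.fit(basic_embs.T, main_emb)  # columns (features) = basic questions

ranking = np.argsort(-model.coef_)  # most similar basic questions first
print(ranking[:3], model.coef_[ranking[:3]])
```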
The Long-Short Story of Movie Description
Generating descriptions for videos has many applications including assisting
blind people and human-robot interaction. The recent advances in image
captioning as well as the release of large-scale movie description datasets
such as MPII Movie Description allow us to study this task in more depth. Many of
the proposed methods for image captioning rely on pre-trained object-classifier
CNNs and Long Short-Term Memory recurrent networks (LSTMs) for generating
descriptions. While image description focuses on objects, we argue that it is
important to distinguish verbs, objects, and places in the challenging setting
of movie description. In this work we show how to learn robust visual
classifiers from the weak annotations of the sentence descriptions. Based on
these visual classifiers we learn how to generate a description using an LSTM.
We explore different design choices to build and train the LSTM and achieve the
best performance to date on the challenging MPII-MD dataset. We compare and
analyze our approach and prior work along various dimensions to better
understand the key challenges of the movie description task.
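The weak-supervision step is only summarized above; a common reading is to mine verb, object, and place words from the descriptions and train one classifier per word on clip features. A hypothetical sketch along those lines (vocabulary, features, and the linear-classifier choice are assumptions, with synthetic data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative sketch: treat each mined word (verb/object/place) as a
# binary weak label for the clip it describes, and train one classifier
# per word on pooled visual features. All data here is synthetic.

VOCAB = ["walk", "car", "kitchen"]  # mined from sentence descriptions

rng = np.random.default_rng(1)
clip_feats = rng.standard_normal((500, 128))  # placeholder CNN clip features
weak_labels = rng.integers(0, 2, size=(500, len(VOCAB)))  # 1 if word in sentence

classifiers = {}
for j, word in enumerate(VOCAB):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(clip_feats, weak_labels[:, j])  # sentence-derived weak supervision
    classifiers[word] = clf

# Classifier scores would then feed the LSTM description generator.
scores = {w: c.predict_proba(clip_feats[:1])[0, 1] for w, c in classifiers.items()}
print(scores)
```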
Detecting Visual Relationships with Deep Relational Networks
Relationships among objects play a crucial role in image understanding.
Despite the great success of deep learning techniques in recognizing individual
objects, reasoning about the relationships among objects remains a challenging
task. Previous methods often treat this as a classification problem,
considering each type of relationship (e.g. "ride") or each distinct visual
phrase (e.g. "person-ride-horse") as a category. Such approaches are faced with
significant difficulties caused by the high diversity of visual appearance for
each kind of relationship or by the large number of distinct visual phrases. We
propose an integrated framework to tackle this problem. At the heart of this
framework is the Deep Relational Network, a novel formulation designed
specifically for exploiting the statistical dependencies between objects and
their relationships. On two large datasets, the proposed method achieves
substantial improvement over the state of the art.
Comment: To appear in CVPR 2017 as an oral paper
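The abstract does not define how the statistical dependencies are exploited; one way to picture the idea is a round of message passing in which subject, object, and predicate beliefs reweight each other through co-occurrence statistics. A toy sketch of that intuition, not the paper's DR-Net:

```python
import numpy as np

# Toy sketch (not the paper's DR-Net): one round of message passing that
# lets subject/object class beliefs and predicate beliefs influence each
# other through (subject, predicate, object) co-occurrence statistics.

rng = np.random.default_rng(2)
n_obj, n_rel = 10, 5
subj = rng.dirichlet(np.ones(n_obj))    # initial subject class beliefs
obj = rng.dirichlet(np.ones(n_obj))     # initial object class beliefs
rel = rng.dirichlet(np.ones(n_rel))     # initial predicate beliefs
co = rng.random((n_obj, n_rel, n_obj))  # placeholder co-occurrence table

def normalize(p: np.ndarray) -> np.ndarray:
    return p / p.sum()

# Refine the predicate belief using the current subject/object beliefs,
# then refine the subject belief in turn (and so on, iteratively).
rel = normalize(rel * np.einsum("s,sro,o->r", subj, co, obj))
subj = normalize(subj * np.einsum("r,sro,o->s", rel, co, obj))
print(rel.round(3), subj.round(3))
```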
Deep Reinforcement Learning-based Image Captioning with Embedding Reward
Image captioning is a challenging problem owing to the complexity in
understanding the image content and diverse ways of describing it in natural
language. Recent advances in deep neural networks have substantially improved
the performance of this task. Most state-of-the-art approaches follow an
encoder-decoder framework, which generates captions using a sequential
recurrent prediction model. However, in this paper, we introduce a novel
decision-making framework for image captioning. We utilize a "policy network"
and a "value network" to collaboratively generate captions. The policy network
serves as local guidance by providing the confidence of predicting the next
word according to the current state. Additionally, the value network serves as
global, lookahead guidance by evaluating all possible extensions of the
current state. In essence, it adjusts the goal of predicting the correct words
towards the goal of generating captions similar to the ground truth captions.
We train both networks using an actor-critic reinforcement learning model, with
a novel reward defined by visual-semantic embedding. Extensive experiments and
analyses on the Microsoft COCO dataset show that the proposed framework
outperforms state-of-the-art approaches across different evaluation metrics.
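The embedding reward can be pictured as a similarity score between the image and the generated caption in a shared visual-semantic space. A toy sketch with random placeholder encoders rather than the paper's trained embedding:

```python
import numpy as np

# Toy sketch of a visual-semantic embedding reward: the reward for a
# generated caption is its cosine similarity to the image in a shared
# embedding space. Both projections here are random placeholders.

rng = np.random.default_rng(3)
W_img = rng.standard_normal((64, 2048))  # image-feature projection (placeholder)
W_txt = rng.standard_normal((64, 300))   # sentence-feature projection (placeholder)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_reward(img_feat: np.ndarray, sent_feat: np.ndarray) -> float:
    """Scalar reward for a finished caption, used as the actor-critic return."""
    return cosine(W_img @ img_feat, W_txt @ sent_feat)

img = rng.standard_normal(2048)        # e.g. a CNN image feature
caption = rng.standard_normal(300)     # e.g. a pooled sentence embedding
print(embedding_reward(img, caption))  # scalar in [-1, 1]
```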