Deep Aesthetic Quality Assessment with Semantic Information
Human beings often assess the aesthetic quality of an image together with identifying its semantic content. This paper addresses the correlation between automatic aesthetic quality assessment and semantic recognition. We cast the assessment problem as the main task within a multi-task deep model, and argue that the semantic recognition task offers the key to addressing this problem. Based on convolutional neural networks, we employ a single, simple multi-task framework to efficiently utilize the supervision of aesthetic and semantic labels. A correlation item between the two tasks is further introduced into the framework by incorporating inter-task relationship learning. This item not only provides useful insight into the correlation but also improves the assessment accuracy of the aesthetic task. In particular, an effective strategy is developed to keep a balance between the two tasks, which facilitates optimizing the parameters of the framework. Extensive experiments on the challenging AVA and Photo.net datasets validate the importance of semantic recognition in aesthetic quality assessment, and demonstrate that multi-task deep models can discover an effective aesthetic representation that achieves state-of-the-art results.
Comment: 13 pages, 10 figures
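As a rough picture of the kind of objective described above, the following is a minimal PyTorch sketch of a shared backbone with an aesthetic head (main task) and a semantic head (auxiliary task), trained with a balanced multi-task loss plus an inter-task correlation term. The tiny backbone, the particular correlation penalty, and the weights lam and gamma are illustrative assumptions, not the authors' exact formulation.

    import torch
    import torch.nn as nn

    class MultiTaskAestheticNet(nn.Module):
        """Shared CNN features feeding an aesthetic head and a semantic head."""
        def __init__(self, num_semantic=10, feat_dim=512):  # num_semantic: placeholder
            super().__init__()
            self.backbone = nn.Sequential(            # stand-in for a full CNN
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim), nn.ReLU())
            self.aesthetic_head = nn.Linear(feat_dim, 2)       # high vs. low quality
            self.semantic_head = nn.Linear(feat_dim, num_semantic)

        def forward(self, x):
            f = self.backbone(x)
            return self.aesthetic_head(f), self.semantic_head(f)

    def multi_task_loss(model, images, aes_labels, sem_labels, lam=0.1, gamma=1e-3):
        # sem_labels: multi-hot float tensor, one row per image
        aes_logits, sem_logits = model(images)
        loss_aes = nn.functional.cross_entropy(aes_logits, aes_labels)
        loss_sem = nn.functional.binary_cross_entropy_with_logits(sem_logits, sem_labels)
        # Illustrative correlation item: pull the two heads' mean weight
        # vectors together so the tasks share feature structure.
        corr = (model.aesthetic_head.weight.mean(0)
                - model.semantic_head.weight.mean(0)).pow(2).sum()
        return loss_aes + lam * loss_sem + gamma * corr   # lam balances the tasks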
Automatic Attribute Discovery with Neural Activations
How can a machine learn to recognize visual attributes emerging from an online community without a definitive supervised dataset? This paper proposes an automatic approach to discovering and analyzing visual attributes from a noisy collection of image-text data on the Web. Our approach is based on the relationship between attributes and neural activations in a deep network. We characterize the visual property of an attribute word as a divergence within a weakly-annotated set of images. We show that neural activations are useful for discovering attributes and for learning classifiers that agree well with human perception from noisy real-world Web data. Our empirical study suggests that the layered structure of deep neural networks also gives insight into the perceptual depth of a given word. Finally, we demonstrate that highly-activating neurons can be used to find semantically relevant regions.
Comment: ECCV 201
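One plausible reading of characterizing an attribute word "as a divergence within a weakly-annotated set of images" is a per-neuron comparison between activations of images whose text mentions the word and a reference set. The NumPy sketch below scores neurons by KL divergence over activation histograms; the histogramming and aggregation are assumptions for illustration, not the paper's exact procedure.

    import numpy as np

    def attribute_visualness(pos_acts, ref_acts, bins=32, eps=1e-8):
        """pos_acts, ref_acts: [num_images, num_neurons] activations from one
        layer, for weakly-tagged images vs. a reference set. Returns a
        per-neuron KL divergence; a large total suggests a visual attribute."""
        lo = min(pos_acts.min(), ref_acts.min())
        hi = max(pos_acts.max(), ref_acts.max())
        scores = np.zeros(pos_acts.shape[1])
        for j in range(pos_acts.shape[1]):
            p, _ = np.histogram(pos_acts[:, j], bins=bins, range=(lo, hi))
            q, _ = np.histogram(ref_acts[:, j], bins=bins, range=(lo, hi))
            p = (p + eps) / (p + eps).sum()
            q = (q + eps) / (q + eps).sum()
            scores[j] = np.sum(p * np.log(p / q))   # KL(p || q) for neuron j
        return scores

    # Usage: rank candidate words by attribute_visualness(...).sum(); words
    # that shift many neurons away from the reference are likely visual.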
Visual Relationship Detection using Scene Graphs: A Survey
Understanding a scene by decoding the visual relationships depicted in an image is a long-studied problem. While recent advances in deep learning and the use of deep neural networks have achieved near-human accuracy on many tasks, a considerable gap remains between human- and machine-level performance on various visual relationship detection tasks. Building on earlier tasks such as object recognition, segmentation, and captioning, which focused on relatively coarse image understanding, newer tasks have been introduced to deal with a finer level of image understanding. A scene graph is one such technique for representing a scene and the various relationships present in it. With its wide range of applications in tasks such as Visual Question Answering, Semantic Image Retrieval, and Image Generation, among many others, it has proved to be a useful tool for deeper and better visual relationship understanding. In this paper, we present a detailed survey of the various techniques for scene graph generation, their efficacy in representing visual relationships, and how scene graphs have been used to solve downstream tasks. We also analyze the directions in which the field might advance. As one of the first papers to give a detailed survey of this topic, we also hope to provide a succinct introduction to scene graphs and to guide practitioners in developing approaches for their applications.
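For readers new to the representation, a scene graph encodes objects (optionally with attributes) as nodes and labelled, directed relationships as edges. The minimal Python structure below is a generic illustration of that data structure, not any particular paper's format.

    from dataclasses import dataclass, field

    @dataclass
    class SceneGraph:
        objects: dict = field(default_factory=dict)    # node id -> {label, attributes}
        relations: list = field(default_factory=list)  # (subject id, predicate, object id)

    g = SceneGraph()
    g.objects[0] = {"label": "man", "attributes": ["standing"]}
    g.objects[1] = {"label": "horse", "attributes": ["brown"]}
    g.relations.append((0, "riding", 1))   # the triple <man, riding, horse>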
Exploring Visual Relationship for Image Captioning
It is widely believed that modeling the relationships between objects is helpful for representing, and eventually describing, an image. Nevertheless, there has been little evidence supporting this idea for image description generation. In this paper, we introduce a new design that explores the connections between objects for image captioning under the umbrella of an attention-based encoder-decoder framework. Specifically, we present a Graph Convolutional Networks plus Long Short-Term Memory (dubbed GCN-LSTM) architecture that integrates both semantic and spatial object relationships into the image encoder. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representation of each region proposal is then refined by leveraging the graph structure through a GCN. With the learnt region-level features, GCN-LSTM capitalizes on an LSTM-based captioning framework with an attention mechanism for sentence generation. Extensive experiments are conducted on the COCO image captioning dataset, and superior results are reported compared to state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on the COCO test set.
Comment: ECCV 201
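A minimal sketch of the refinement step described above, assuming region features from an object detector and a precomputed 0/1 relation graph. The real GCN-LSTM learns separate transformations per relation type and direction; the single-weight layer here is a simplification.

    import torch
    import torch.nn as nn

    class RelationGCN(nn.Module):
        """One graph-convolution step: each region feature is refined by
        aggregating its neighbours along semantic/spatial relation edges;
        the refined features then feed an attention LSTM decoder (not shown)."""
        def __init__(self, dim=2048):
            super().__init__()
            self.w_self = nn.Linear(dim, dim)
            self.w_neigh = nn.Linear(dim, dim)

        def forward(self, regions, adj):
            # regions: [N, dim] detected-object features; adj: [N, N] relation graph
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
            neigh = adj @ regions / deg              # mean over connected regions
            return torch.relu(self.w_self(regions) + self.w_neigh(neigh))

    regions = torch.randn(36, 2048)                  # e.g. 36 region proposals
    adj = (torch.rand(36, 36) > 0.8).float()         # stand-in relation graph
    refined = RelationGCN()(regions, adj)            # [36, 2048], relation-aware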
A Picture Tells a Thousand Words -- About You! User Interest Profiling from User Generated Visual Content
Inference of online social network users' attributes and interests has been an active research topic. Accurate identification of users' attributes and interests is crucial for improving the performance of personalization and recommender systems. Most existing work has focused on textual content generated by users and has successfully used it to predict users' interests and other identifying attributes. However, little attention has been paid to user-generated visual content (images), which has become increasingly popular and pervasive. We posit that the images users post on online social networks reflect the topics they are interested in, and we propose an approach to infer user attributes from those images. We analyze the content of individual images and then aggregate the image-level knowledge to infer a user-level interest distribution. We employ image-level similarity to propagate label information between images, and we utilize the image category information derived from the user-created organization structure to further propagate category-level knowledge across all images. A real-life social network dataset created from Pinterest is used for evaluation, and the experimental results demonstrate the effectiveness of our proposed approach.
Comment: 7 pages, 6 figures, 4 tables
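The propagation step can be pictured as label spreading on an image-similarity graph followed by user-level averaging. The NumPy sketch below is a generic label-propagation scheme under that reading; the smoothing weight alpha and the iteration count are illustrative, not the paper's settings.

    import numpy as np

    def user_interest_distribution(sim, labels, alpha=0.5, iters=20):
        """sim: [N, N] pairwise similarities between one user's images;
        labels: [N, K] initial per-image topic scores (e.g. classifier output).
        Returns a K-dim user-level interest distribution."""
        W = sim / sim.sum(axis=1, keepdims=True).clip(min=1e-8)   # row-normalise
        F = labels.copy()
        for _ in range(iters):
            F = alpha * (W @ F) + (1 - alpha) * labels   # smooth, but stay anchored
        F = F / F.sum(axis=1, keepdims=True).clip(min=1e-8)
        return F.mean(axis=0)    # aggregate image-level knowledge to the user level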
Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Vision-to-language tasks aim to integrate computer vision and natural language processing, and have attracted the attention of many researchers. Typical approaches encode an image into feature representations and decode them into natural language sentences, but they neglect high-level semantic concepts and the subtle relationships between image regions and natural language elements. To make full use of this information, this paper exploits text-guided attention and semantic-guided attention (SA) to find more correlated spatial information and reduce the semantic gap between vision and language. Our method includes two levels of attention networks. One is the text-guided attention network, which is used to select text-related regions. The other is the SA network, which is used to highlight concept-related regions and region-related concepts. Finally, all of this information is incorporated to generate captions or answers. Experiments on image captioning and visual question answering have been carried out, and the results show the excellent performance of the proposed approach.
Comment: 15 pages, 6 figures, 50 references
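A minimal sketch of the first of the two attention levels: scoring image regions against a text embedding and pooling them into a text-conditioned visual summary. The SA network would apply analogous scoring with semantic-concept embeddings. The dimensions and the additive scoring function are assumptions for illustration.

    import torch
    import torch.nn as nn

    class TextGuidedAttention(nn.Module):
        """Select text-related regions: score each region against the text
        embedding, then return the attention-weighted region summary."""
        def __init__(self, region_dim=2048, text_dim=512, hid=512):
            super().__init__()
            self.proj_r = nn.Linear(region_dim, hid)
            self.proj_t = nn.Linear(text_dim, hid)
            self.score = nn.Linear(hid, 1)

        def forward(self, regions, text):
            # regions: [N, region_dim]; text: [text_dim]
            h = torch.tanh(self.proj_r(regions) + self.proj_t(text))   # [N, hid]
            attn = torch.softmax(self.score(h).squeeze(-1), dim=0)     # [N] weights
            return attn @ regions        # [region_dim] text-conditioned summary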
Siamese Attentional Keypoint Network for High Performance Visual Tracking
In this paper, we investigate the impact of three main aspects of visual tracking, i.e., the backbone network, the attentional mechanism, and the detection component, and propose a Siamese Attentional Keypoint Network, dubbed SATIN, for efficient tracking and accurate localization. Firstly, a new Siamese lightweight hourglass network is specially designed for visual tracking. It takes advantage of repeated bottom-up and top-down inference to capture more global and local contextual information at multiple scales. Secondly, a novel cross-attentional module is utilized to leverage both channel-wise and spatial intermediate attentional information, which can enhance both the discriminative and localization capabilities of feature maps. Thirdly, a keypoint detection approach is designed to track any target object by detecting the top-left corner point, the centroid point, and the bottom-right corner point of its bounding box. Our SATIN tracker therefore not only has a strong capability to learn effective object representations, but is also efficient in computation and memory storage during both the training and testing stages. To the best of our knowledge, we are the first to propose this approach. Without bells and whistles, experimental results demonstrate that our approach achieves state-of-the-art performance on several recent benchmark datasets, at a speed far exceeding 27 frames per second.
Comment: Accepted by Knowledge-Based Systems
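The keypoint formulation can be made concrete with a simplified decoder: take the peak of each of the three predicted heatmaps and use the centroid as a consistency check on the corner pair. The threshold and the argmax-only decoding below are assumptions; the tracker's actual grouping and offset regression are more involved.

    import torch

    def decode_box(tl_heat, ct_heat, br_heat, tol=0.1):
        """Recover a box from top-left, centroid, and bottom-right heatmaps
        (each [H, W]); reject corner pairs whose midpoint strays from the
        centroid peak."""
        def peak(h):                       # (y, x) of the heatmap maximum
            idx = h.flatten().argmax()
            return idx // h.shape[1], idx % h.shape[1]

        (ty, tx), (cy, cx), (by, bx) = peak(tl_heat), peak(ct_heat), peak(br_heat)
        my, mx = (ty + by) / 2, (tx + bx) / 2          # corner-pair midpoint
        if abs(my - cy) + abs(mx - cx) < tol * max(tl_heat.shape):
            return tx, ty, bx, by          # (x1, y1, x2, y2)
        return None                        # inconsistent detection

    box = decode_box(torch.rand(64, 64), torch.rand(64, 64), torch.rand(64, 64))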
Attributes for Improved Attributes: A Multi-Task Network for Attribute Classification
Attributes, or semantic features, have gained popularity in the past few
years in domains ranging from activity recognition in video to face
verification. Improving the accuracy of attribute classifiers is an important first step in any application that uses these attributes. In most work to date, attributes have been considered to be independent. However, we know this not to be the case: many attributes are strongly related, such as heavy makeup and wearing lipstick. We propose to take advantage of attribute relationships in three ways: by using a multi-task deep convolutional neural network (MCNN) that shares the lowest layers amongst all attributes, by sharing the higher layers for related attributes, and by building an auxiliary network on top of the MCNN which utilizes the scores from all attributes to improve the final classification of each attribute. We demonstrate the effectiveness of our method by producing results on two challenging publicly available datasets.
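A minimal PyTorch sketch of the three ideas: shared lowest layers, per-group branches for related attributes, and an auxiliary network that refines each attribute's score from the full score vector. The layer sizes and the grouping are placeholders, not the paper's configuration.

    import torch
    import torch.nn as nn

    class MCNN(nn.Module):
        def __init__(self, groups=(5, 3)):   # attribute counts per related group
            super().__init__()
            self.shared = nn.Sequential(     # lowest layers, shared by everything
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.branches = nn.ModuleList(   # higher layers, shared within a group
                nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, n))
                for n in groups)
            total = sum(groups)
            self.aux = nn.Sequential(        # auxiliary net over all scores
                nn.Linear(total, total), nn.ReLU(), nn.Linear(total, total))

        def forward(self, x):
            f = self.shared(x)
            scores = torch.cat([branch(f) for branch in self.branches], dim=1)
            return self.aux(scores)          # refined per-attribute logits

    logits = MCNN()(torch.randn(2, 3, 64, 64))   # [2, 8] for two groups of 5 and 3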
Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions
Visual Question Answering (VQA) has attracted attention from both the computer vision and natural language processing communities. Most existing approaches adopt the pipeline of representing an image via pre-trained CNNs and then using the uninterpretable CNN features, in conjunction with the question, to predict the answer. Although such end-to-end models might report promising performance, they rarely provide any insight, apart from the answer, into the VQA process. In this work, we propose to break up end-to-end VQA into two steps, explaining and reasoning, in an attempt to make VQA more explainable by shedding light on the intermediate results between these two steps. To that end, we first extract attributes and generate descriptions as explanations for an image using pre-trained attribute detectors and image captioning models, respectively. Next, a reasoning module utilizes these explanations in place of the image to infer an answer to the question. The advantages of such a breakdown include: (1) the attributes and captions reflect what the system extracts from the image, and thus can provide some explanation for the predicted answer; (2) these intermediate results can help us identify the limitations of both the image understanding part and the answer inference part when the predicted answer is wrong. We conduct extensive experiments on a popular VQA dataset and dissect all results according to several measurements of explanation quality. Our system achieves performance comparable to the state-of-the-art, yet with the added benefits of explainability and the inherent ability to further improve with higher-quality explanations.
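The two-step pipeline is easy to state in code: the reasoning module consumes only the textual explanations, never the pixels, which is what makes failures diagnosable. The callables below are hypothetical placeholders for the pre-trained components the abstract mentions.

    def explainable_vqa(image, question, attribute_detector, captioner, reasoner):
        """Step 1 (explain): turn the image into attributes and captions.
        Step 2 (reason): answer the question from those texts alone."""
        attributes = attribute_detector(image)   # e.g. ["dog", "frisbee", "grass"]
        captions = captioner(image)              # e.g. ["a dog catches a frisbee"]
        answer = reasoner(attributes, captions, question)   # no pixel access here
        return answer, {"attributes": attributes, "captions": captions}

    # If the answer is wrong, inspect the returned explanations: a bad
    # attribute list implicates image understanding; a good one, reasoning.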
Hybrid Knowledge Routed Modules for Large-scale Object Detection
The dominant object detection approaches treat the recognition of each region separately and overlook crucial semantic correlations between objects in a scene. This paradigm leads to a substantial performance drop in the face of heavy long-tail problems, where very few samples are available for rare classes and plenty of confusing categories exist. We exploit diverse human commonsense knowledge to reason over large-scale object categories and reach semantic coherency within an image. In particular, we present Hybrid Knowledge Routed Modules (HKRM) that incorporate reasoning routed by two kinds of knowledge: an explicit knowledge module for structured constraints summarized from linguistic knowledge about concepts (e.g. shared attributes, relationships), and an implicit knowledge module that depicts implicit constraints (e.g. common spatial layouts). By functioning over a region-to-region graph, both modules can be individualized and adapted to coordinate with the visual patterns in each image, guided by the specific knowledge form. HKRM is lightweight, general-purpose, and extensible, easily incorporating multiple forms of knowledge to endow any detection network with the ability of global semantic reasoning. Experiments on large-scale object detection benchmarks show that HKRM obtains around 34.5% improvement on VisualGenome (1000 categories) and 30.4% on ADE in terms of mAP. Code and trained models can be found at https://github.com/chanyn/HKRM.
Comment: 9 pages, 5 figures
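One way to picture a knowledge-routed module: form edge weights over the region-to-region graph by mixing a knowledge prior (explicit: linguistic similarity of predicted classes; implicit: a function of geometry) with learned visual compatibility, then propagate region features along those edges. This is a loose sketch under that reading, not the authors' exact HKRM formulation.

    import torch
    import torch.nn as nn

    class KnowledgeRoutedModule(nn.Module):
        def __init__(self, dim=1024):
            super().__init__()
            self.edge = nn.Linear(2 * dim, 1)    # learned visual compatibility
            self.update = nn.Linear(dim, dim)

        def forward(self, regions, prior):
            # regions: [N, dim] region features; prior: [N, N] non-negative
            # knowledge graph (e.g. shared attributes between predicted classes)
            n = regions.size(0)
            pairs = torch.cat([regions.unsqueeze(1).expand(n, n, -1),
                               regions.unsqueeze(0).expand(n, n, -1)], dim=-1)
            compat = self.edge(pairs).squeeze(-1)                    # [N, N]
            adj = torch.softmax(compat + prior.log().clamp(min=-10), dim=1)
            return regions + torch.relu(self.update(adj @ regions)) # routed features

    out = KnowledgeRoutedModule()(torch.randn(8, 1024), torch.rand(8, 8))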