Understanding Convolutional Neural Networks in Terms of Category-Level Attributes
Abstract. It has recently been reported that convolutional neural networks (CNNs) perform well in many image recognition tasks, significantly outperforming previous approaches not based on neural networks, particularly for object category recognition. This performance is arguably owing to their ability to discover better image features for recognition tasks through learning, resulting in the acquisition of better internal representations of the inputs. However, despite the good performance, it remains an open question why CNNs work so well and how they learn such good representations. In this study, we conjecture that the learned representation can be interpreted as category-level attributes with good properties. We conducted several experiments using the AwA (Animals with Attributes) dataset and a CNN trained for ILSVRC-2012 in a fully supervised setting to examine this conjecture. We report that there exist units in the CNN that can predict some of the 85 semantic attributes fairly accurately, along with the detailed observation that this holds only for visual attributes and not for non-visual ones. It is natural to think that the CNN may discover not only semantic attributes but also non-semantic ones (or ones that are difficult to express in words). To explore this possibility, we perform zero-shot learning by regarding the activation patterns of upper layers as attributes describing the categories. The result shows that this outperforms the state of the art by a significant margin.
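As a rough illustration of the zero-shot scheme this abstract describes, treating upper-layer activation patterns as attribute vectors and matching them against per-class attribute signatures, here is a minimal Python sketch. The cosine-similarity matching rule and all names are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

def zero_shot_predict(unit_activations, class_signatures):
    """Assign the unseen class whose attribute signature best matches the
    CNN's upper-layer activation pattern (cosine similarity).  Both the
    matching rule and all names here are illustrative assumptions."""
    a = unit_activations / np.linalg.norm(unit_activations)
    scores = {cls: float(a @ (sig / np.linalg.norm(sig)))
              for cls, sig in class_signatures.items()}
    return max(scores, key=scores.get)

# Toy usage: two unseen classes described by 5-dim attribute signatures.
rng = np.random.default_rng(0)
signatures = {"zebra": rng.random(5), "whale": rng.random(5)}
print(zero_shot_predict(rng.random(5), signatures))
```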
Density estimation using Real NVP
Unsupervised learning of probabilistic models is a central yet challenging
problem in machine learning. Specifically, designing models with tractable
learning, sampling, inference and evaluation is crucial in solving this task.
We extend the space of such models using real-valued non-volume preserving
(real NVP) transformations, a set of powerful invertible and learnable
transformations, resulting in an unsupervised learning algorithm with exact
log-likelihood computation, exact sampling, exact inference of latent
variables, and an interpretable latent space. We demonstrate its ability to
model natural images on four datasets through sampling, log-likelihood
evaluation and latent variable manipulations.
Comment: 10 pages of main content, 3 pages of bibliography, 18 pages of appendix. Accepted at ICLR 2017.
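The core building block behind real NVP is the affine coupling layer, which keeps the transformation invertible while making the Jacobian log-determinant exact and cheap. Below is a minimal PyTorch sketch of such a layer; the hidden-network size and the tanh-bounded scale are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal real NVP affine coupling layer (illustrative sketch).

    Half of the input passes through unchanged and parameterizes an
    affine (scale-and-shift) transform of the other half, so the
    Jacobian is triangular and its log-determinant is cheap to compute.
    """

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        # Small network predicting log-scale and shift from the fixed half.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)          # keep scales bounded for stability
        y2 = x2 * torch.exp(log_s) + t     # invertible affine transform
        log_det = log_s.sum(dim=1)         # exact log |det J|
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)  # exact inverse, no iteration
        return torch.cat([y1, x2], dim=1)
```

Stacking several such layers with alternating partitions (and, for images, the masked convolutions the paper describes) yields a model whose exact log-likelihood follows from the change-of-variables formula.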
Large Scale Holistic Video Understanding
Video recognition has been advanced in recent years by benchmarks with rich
annotations. However, research is still mainly limited to human action or
sports recognition - focusing on a highly specific video understanding task and
thus leaving a significant gap towards describing the overall content of a
video. We fill this gap by presenting a large-scale "Holistic Video
Understanding Dataset"~(HVU). HVU is organized hierarchically in a semantic
taxonomy that focuses on multi-label and multi-task video understanding as a
comprehensive problem that encompasses the recognition of multiple semantic
aspects in the dynamic scene. HVU contains approximately 572k videos in total with 9
million annotations for training, validation, and test set spanning over 3142
labels. HVU encompasses semantic aspects defined on categories of scenes, objects, actions, events, attributes, and concepts, which naturally capture real-world scenarios.
We demonstrate the generalization capability of HVU on three challenging tasks: 1) video classification, 2) video captioning, and 3) video clustering. In particular, for video classification, we introduce a new
spatio-temporal deep neural network architecture called "Holistic Appearance
and Temporal Network"~(HATNet) that builds on fusing 2D and 3D architectures
into one by combining intermediate representations of appearance and temporal
cues. HATNet focuses on the multi-label and multi-task learning problem and is
trained in an end-to-end manner. Via our experiments, we validate the idea that
holistic representation learning is complementary, and can play a key role in
enabling many real-world applications.
Comment: ECCV 2020.
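To make the fusion idea concrete, combining intermediate 2D (appearance) and 3D (temporal) representations, here is a minimal hypothetical PyTorch sketch; the actual HATNet architecture is specified in the paper, not here.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Illustrative sketch of fusing 2D (appearance) and 3D (temporal)
    intermediate features, loosely in the spirit of HATNet; all module
    choices and sizes are assumptions for illustration."""

    def __init__(self, channels):
        super().__init__()
        self.conv2d = nn.Conv2d(channels, channels, 3, padding=1)  # per-frame appearance cues
        self.conv3d = nn.Conv3d(channels, channels, 3, padding=1)  # spatio-temporal cues
        self.merge = nn.Conv3d(2 * channels, channels, 1)          # 1x1x1 fusion

    def forward(self, x):                      # x: (batch, C, T, H, W)
        b, c, t, h, w = x.shape
        # Apply the 2D branch frame by frame by folding time into the batch.
        app = self.conv2d(x.transpose(1, 2).reshape(b * t, c, h, w))
        app = app.reshape(b, t, c, h, w).transpose(1, 2)
        tem = self.conv3d(x)
        return self.merge(torch.cat([app, tem], dim=1))
```

A multi-label head on top of such features would typically use per-label sigmoids trained with a binary cross-entropy loss, matching the multi-label setting the abstract describes.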
From BoW to CNN: Two Decades of Texture Representation for Texture Classification
Texture is a fundamental characteristic of many types of images, and texture
representation is one of the essential and challenging problems in computer
vision and pattern recognition which has attracted extensive research
attention. Since 2000, texture representations based on Bag of Words (BoW) and
on Convolutional Neural Networks (CNNs) have been extensively studied with
impressive performance. Given this period of remarkable evolution, this paper
aims to present a comprehensive survey of advances in texture representation
over the last two decades. More than 200 major publications are cited in this
survey covering different aspects of the research, which includes (i) problem
description; (ii) recent advances in the broad categories of BoW-based,
CNN-based and attribute-based methods; and (iii) evaluation issues,
specifically benchmark datasets and state-of-the-art results. Reflecting on what has been achieved so far, the survey discusses open challenges and directions for future research.
Comment: Accepted by IJCV.
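For readers new to the BoW side of this history, a minimal bag-of-words texture pipeline (build a codebook by k-means, then describe each image as a normalized codeword histogram) can be sketched as follows; the descriptor choice and codebook size are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_texture_descriptor(patches_per_image, k=64, seed=0):
    """Minimal Bag-of-Words texture pipeline (illustrative sketch):
    1. cluster local descriptors from all images into a codebook;
    2. describe each image as a normalized histogram of codeword counts.

    `patches_per_image` is a list of (n_i, d) arrays of local descriptors
    (e.g. filter-bank responses or raw patches), one array per image.
    """
    all_patches = np.vstack(patches_per_image)
    codebook = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(all_patches)

    histograms = []
    for patches in patches_per_image:
        words = codebook.predict(patches)                 # assign to nearest codeword
        hist = np.bincount(words, minlength=k).astype(float)
        histograms.append(hist / hist.sum())              # L1-normalize
    return np.stack(histograms)                           # (n_images, k) descriptors

# Toy usage with random "patches"; real inputs would be local texture descriptors.
rng = np.random.default_rng(0)
images = [rng.normal(size=(200, 16)) for _ in range(5)]
print(bow_texture_descriptor(images).shape)               # (5, 64)
```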
A Picture Tells a Thousand Words -- About You! User Interest Profiling from User Generated Visual Content
Inference of online social network users' attributes and interests has been
an active research topic. Accurate identification of users' attributes and
interests is crucial for improving the performance of personalization and
recommender systems. Most of the existing works have focused on textual content
generated by the users and have successfully used it for predicting users'
interests and other identifying attributes. However, little attention has been
paid to user-generated visual content (images), which is becoming increasingly popular and pervasive. We posit that images posted by users on
online social networks are a reflection of topics they are interested in and
propose an approach to infer user attributes from images posted by them. We
analyze the content of individual images and then aggregate the image-level
knowledge to infer user-level interest distribution. We employ image-level
similarity to propagate the label information between images, as well as
utilize the image category information derived from the user created
organization structure to further propagate the category-level knowledge for
all images. A real-life social network dataset created from Pinterest is used
for evaluation and the experimental results demonstrate the effectiveness of
our proposed approach.
Comment: 7 pages, 6 figures, 4 tables.
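The propagation step the abstract mentions can be illustrated with a standard similarity-graph label propagation update (in the style of Zhou et al.'s local-and-global-consistency method); the paper's exact scheme may differ.

```python
import numpy as np

def propagate_labels(W, Y, alpha=0.85, iters=50):
    """Standard graph label propagation (illustrative; the paper's exact
    scheme may differ).  W is an (n, n) nonnegative image-similarity
    matrix, Y an (n, c) matrix of initial label scores (zero rows for
    unlabeled images).  Each step mixes neighbours' labels with the
    initial evidence."""
    # Symmetrically normalize the similarity matrix: S = D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    S = (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

    F = Y.copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y   # propagate, keep anchor labels
    return F / np.maximum(F.sum(axis=1, keepdims=True), 1e-12)  # per-image distribution
```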
Visual Affordance and Function Understanding: A Survey
Robots are increasingly prominent in the manufacturing, entertainment and
healthcare industries. Robot vision aims to equip robots with the ability to
discover information, understand it and interact with the environment. These
capabilities require an agent to effectively understand object affordances and
functionalities in complex visual domains. In this literature survey, we first
focus on visual affordances and summarize the state of the art as well as open
problems and research gaps. Specifically, we discuss sub-problems such as
affordance detection, categorization, segmentation and high-level reasoning.
Furthermore, we cover functional scene understanding and the prevalent
functional descriptors used in the literature. The survey also provides
necessary background to the problem, sheds light on its significance and
highlights the existing challenges for affordance and functionality learning.
Comment: 26 pages, 22 images.
Deep Structured Scene Parsing by Learning with Image Descriptions
This paper addresses a fundamental problem of scene understanding: How to
parse the scene image into a structured configuration (i.e., a semantic object
hierarchy with object interaction relations) that finely accords with human
perception. We propose a deep architecture consisting of two networks: i) a
convolutional neural network (CNN) extracting the image representation for
pixelwise object labeling and ii) a recursive neural network (RNN) discovering
the hierarchical object structure and the inter-object relations. Rather than
relying on elaborate user annotations (e.g., manually labeled semantic maps and relations), we train our deep model in a weakly-supervised manner by leveraging the descriptive sentences of the training images. Specifically, we decompose each sentence into a semantic tree consisting of nouns and verb phrases, and use these trees to guide the discovery of the configurations of the training images. Once these scene configurations are determined, the parameters of both the CNN and RNN are updated accordingly by back-propagation.
The entire model training is accomplished through an Expectation-Maximization
method. Extensive experiments suggest that our model is capable of producing
meaningful and structured scene configurations and achieving more favorable
scene labeling performance on PASCAL VOC 2012 over other state-of-the-art
weakly-supervised methods.
Comment: Discovering a semantic object hierarchy with object interaction relations. Published in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016 (oral).
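The alternating training the abstract describes can be summarized with the following EM-style skeleton; the methods on `model` are hypothetical placeholders standing in for the paper's actual inference and update steps.

```python
# Illustrative EM-style training skeleton for weakly-supervised scene parsing.
# The `infer_configuration` and `update_parameters` methods are hypothetical
# placeholders, not the paper's API.

def train_em(model, images, sentence_trees, epochs=10):
    """Alternate between (E) inferring latent scene configurations that
    best agree with the sentence-derived semantic trees, and (M) updating
    the CNN/RNN parameters against those inferred configurations."""
    for _ in range(epochs):
        # E-step: fix parameters, estimate latent structured configurations.
        configs = [model.infer_configuration(img, tree)
                   for img, tree in zip(images, sentence_trees)]
        # M-step: fix configurations, update parameters by back-propagation.
        for img, cfg in zip(images, configs):
            model.update_parameters(img, cfg)
    return model
```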
Visual Relationship Detection using Scene Graphs: A Survey
Understanding a scene by decoding the visual relationships depicted in an
image has been a long-studied problem. While recent advances in deep learning and deep neural networks have achieved near-human accuracy on many tasks, a considerable gap remains between human and machine performance on visual relationship detection. Building on earlier tasks such as object recognition, segmentation and captioning, which focused on relatively coarse image understanding, newer tasks have recently been introduced to address a finer level of image understanding. A scene graph is one such technique for better representing a scene and the various relationships present in it. With its wide range of applications in tasks such as Visual Question Answering, semantic image retrieval and image generation, among many others, it has proved to be a useful tool for deeper and better visual relationship understanding. In this paper, we present a detailed survey of the various techniques for scene graph generation, their efficacy in representing visual relationships, and how scene graphs have been used to solve downstream tasks. We also analyze the directions in which the field might advance. As one of the first papers to give a detailed survey of this topic, we also hope to provide a succinct introduction to scene graphs and to guide practitioners in developing approaches for their applications.
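For readers new to the representation, a scene graph is simply objects as nodes and (subject, predicate, object) triples as directed edges. A minimal illustrative container in Python:

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Minimal scene-graph container (illustrative): objects are nodes,
    and each relationship is a directed (subject, predicate, object) triple."""
    objects: list = field(default_factory=list)    # e.g. ["man", "horse", "hat"]
    relations: list = field(default_factory=list)  # (subj_idx, predicate, obj_idx)

    def add_object(self, label):
        self.objects.append(label)
        return len(self.objects) - 1

    def add_relation(self, subj, predicate, obj):
        self.relations.append((subj, predicate, obj))

# "A man wearing a hat rides a horse" as a scene graph.
g = SceneGraph()
man, horse, hat = g.add_object("man"), g.add_object("horse"), g.add_object("hat")
g.add_relation(man, "riding", horse)
g.add_relation(man, "wearing", hat)
print(g.relations)   # [(0, 'riding', 1), (0, 'wearing', 2)]
```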
Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datasets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models.
Comment: 25 pages.
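The common joint-embedding baseline described above, where CNN image features and an RNN question encoding are mapped into a shared space and fused, can be sketched as follows; all dimensions and the element-wise fusion are placeholder assumptions, not any single paper's architecture.

```python
import torch
import torch.nn as nn

class JointEmbeddingVQA(nn.Module):
    """Illustrative sketch of the common VQA baseline the survey describes:
    a CNN image feature and an RNN question encoding are mapped into a
    shared space, fused, and classified over a fixed answer vocabulary.
    All sizes are placeholder assumptions."""

    def __init__(self, vocab_size, n_answers, img_dim=2048, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.rnn = nn.LSTM(300, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)   # map image features to shared space
        self.classify = nn.Linear(hidden, n_answers)

    def forward(self, img_feat, question_tokens):
        _, (h, _) = self.rnn(self.embed(question_tokens))
        q = h[-1]                                    # final hidden state encodes the question
        v = torch.tanh(self.img_proj(img_feat))
        return self.classify(q * v)                  # element-wise fusion, then answer scores
```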
A Comprehensive Survey on Graph Neural Networks
Deep learning has revolutionized many machine learning tasks in recent years,
ranging from image classification and video processing to speech recognition
and natural language understanding. The data in these tasks are typically
represented in the Euclidean space. However, there is an increasing number of
applications where data are generated from non-Euclidean domains and are
represented as graphs with complex relationships and interdependency between
objects. The complexity of graph data has imposed significant challenges on
existing machine learning algorithms. Recently, many studies on extending deep
learning approaches for graph data have emerged. In this survey, we provide a
comprehensive overview of graph neural networks (GNNs) in data mining and
machine learning fields. We propose a new taxonomy to divide the
state-of-the-art graph neural networks into four categories, namely recurrent
graph neural networks, convolutional graph neural networks, graph autoencoders,
and spatial-temporal graph neural networks. We further discuss the applications
of graph neural networks across various domains and summarize the open source
codes, benchmark data sets, and model evaluation of graph neural networks.
Finally, we propose potential research directions in this rapidly growing
field.
Comment: Minor revision (updated tables and references).
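As a concrete instance of the convolutional GNN category in this taxonomy, a single graph convolutional layer (in the Kipf and Welling style) can be written in a few lines of NumPy; the shapes and toy graph below are illustrative.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolutional layer, an instance of the survey's
    "convolutional graph neural networks" category (illustrative sketch):
        H' = ReLU( D^{-1/2} (A + I) D^{-1/2} H W )
    A: (n, n) adjacency, H: (n, d) node features, W: (d, d') weights."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    S = (A_hat * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return np.maximum(S @ H @ W, 0.0)              # ReLU

# Toy 3-node graph: nodes 0-1 and 1-2 connected.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.random.default_rng(0).normal(size=(3, 4))
W = np.random.default_rng(1).normal(size=(4, 2))
print(gcn_layer(A, H, W).shape)                    # (3, 2)
```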