Weakly Supervised Object Discovery by Generative Adversarial & Ranking Networks
Deep generative adversarial networks (GANs) have recently been shown to be promising for different computer vision applications, such as image editing, synthesizing high-resolution images, and generating videos. These networks and the corresponding learning scheme can handle various visual space mappings. We approach GANs with a novel training method and learning objective to discover multiple object instances in three cases: 1) synthesizing a picture of a specific object within a cluttered scene; 2) localizing different categories in images for weakly supervised object detection; and 3) improving object discovery in object detection pipelines. A crucial advantage of our method is that it learns a new deep similarity metric to distinguish multiple objects in one image. We demonstrate that the network can act as an encoder-decoder generating parts of an image which contain an object, or as a modified deep CNN to represent images for object detection in supervised and weakly supervised schemes. Our ranking GAN offers a novel way to search through images for object-specific patterns. We have conducted experiments for different scenarios and demonstrate the method's performance for object synthesis and weakly supervised object detection and classification on the MS-COCO and PASCAL VOC datasets.
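Below is a minimal, hypothetical sketch of the kind of margin-based ranking objective such a deep similarity metric could use; the PyTorch formulation and all names are illustrative assumptions, not the authors' code.

```python
# Hypothetical margin-based ranking loss over image-patch embeddings,
# in the spirit of the similarity metric described above.
import torch
import torch.nn.functional as F

def ranking_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor closer to a patch of the same object (positive)
    than to a patch of a different object (negative)."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with random 128-d embeddings for a batch of 8 patches.
a, p, n = (torch.randn(8, 128) for _ in range(3))
print(ranking_loss(a, p, n).item())
```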
SCH-GAN: Semi-supervised Cross-modal Hashing by Generative Adversarial Network
Cross-modal hashing aims to map heterogeneous multimedia data into a common
Hamming space, which can realize fast and flexible retrieval across different
modalities. Supervised cross-modal hashing methods have achieved considerable
progress by incorporating semantic side information. However, they mainly have
two limitations: (1) They rely heavily on large-scale labeled cross-modal training data, which are labor-intensive and hard to obtain. (2) They ignore the rich information contained in the large amount of unlabeled data across different modalities, especially the margin examples that are easily retrieved incorrectly, which can help to model the correlations. To address these problems, in this paper we propose a novel Semi-supervised Cross-modal Hashing approach by Generative Adversarial Network (SCH-GAN). We aim to take advantage of GAN's ability to model data distributions to promote cross-modal hashing learning in an adversarial way. The main contributions can be summarized as follows: (1) We propose a novel generative adversarial network for cross-modal hashing. In our proposed SCH-GAN, the generative model tries to select margin examples of one modality from unlabeled data given a query of another modality, while the discriminative model tries to distinguish the selected examples from true positive examples of the query. These two models play a minimax game so that the generative model can promote the hashing performance of the discriminative model. (2) We propose a reinforcement-learning-based algorithm to drive the training of the proposed SCH-GAN. The generative model takes the correlation score predicted by the discriminative model as a reward, and tries to select examples close to the margin to promote the discriminative model by maximizing the margin between positive and negative data. Experiments on three widely used datasets verify the effectiveness of our proposed approach.
Comment: 12 pages, submitted to IEEE Transactions on Cybernetics
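As a rough illustration of the reward-driven training described in contribution (2), the sketch below shows a generic REINFORCE-style update in which a generator's selection policy is rewarded by a discriminator's score; all names and shapes are assumptions, not the SCH-GAN implementation.

```python
# Generic policy-gradient step: sample a candidate from the generator's
# selection distribution and reward it with the discriminator's score.
import torch

def generator_step(gen_scores, disc_scores, optimizer):
    """gen_scores: (N,) unnormalized selection scores over candidates.
    disc_scores: (N,) discriminator correlation scores used as reward."""
    probs = torch.softmax(gen_scores, dim=0)
    idx = torch.multinomial(probs, num_samples=1)   # sample a candidate
    reward = disc_scores[idx].detach()              # reward from the discriminator
    loss = -(torch.log(probs[idx]) * reward).sum()  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Toy usage: the "generator" is just a learnable score vector here.
scores = torch.nn.Parameter(torch.randn(16))
opt = torch.optim.SGD([scores], lr=0.1)
generator_step(scores, torch.rand(16), opt)
```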
Unsupervised Object Matching for Relational Data
We propose an unsupervised object matching method for relational data, which
finds matchings between objects in different relational datasets without
correspondence information. For example, the proposed method matches documents
in different languages in multi-lingual document-word networks without dictionaries or alignment information. The proposed method assumes that each
object has latent vectors, and the probability of neighbor objects is modeled
by the inner-product of the latent vectors, where the neighbors are generated
by short random walks over the relations. The latent vectors are estimated by
maximizing the likelihood of the neighbors for each dataset. The estimated
latent vectors contain hidden structural information of each object in the
given relational dataset. Then, the proposed method linearly projects the
latent vectors for all the datasets onto a common latent space shared across
all datasets by matching the distributions while preserving the structural
information. The projection matrix is estimated by minimizing the distance
between the latent vector distributions with an orthogonality regularizer. To
represent the distributions effectively, we use the kernel embedding of distributions, which holds high-order moment information about a distribution as an element in a reproducing kernel Hilbert space and enables us to calculate the distance between the distributions without density estimation. The structural information encoded in the latent vectors is preserved by using the orthogonality regularizer. We demonstrate the effectiveness of the proposed method with experiments using real-world multi-lingual document-word relational datasets and multiple user-item relational datasets.
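A toy sketch of the matching objective this abstract describes, assuming an RBF kernel for the kernel embedding (i.e., an MMD-style distance) and a squared-error orthogonality penalty; this is a generic stand-in, not the paper's estimator.

```python
# Distance between kernel mean embeddings (MMD with an RBF kernel) of
# two projected latent-vector sets, plus an orthogonality penalty on W.
import torch

def rbf(x, y, gamma=1.0):
    return torch.exp(-gamma * torch.cdist(x, y) ** 2)

def mmd(x, y, gamma=1.0):
    return (rbf(x, x, gamma).mean() + rbf(y, y, gamma).mean()
            - 2 * rbf(x, y, gamma).mean())

def objective(z_src, z_tgt, W, lam=0.1):
    proj = z_src @ W                                        # project source latents
    ortho = ((W.T @ W - torch.eye(W.shape[1])) ** 2).sum()  # keep structure
    return mmd(proj, z_tgt) + lam * ortho                   # match distributions

# Toy usage with random 8-d latent vectors from two datasets.
z_src, z_tgt = torch.randn(100, 8), torch.randn(120, 8)
W = torch.randn(8, 8, requires_grad=True)
print(objective(z_src, z_tgt, W).item())
```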
Thinking Outside the Pool: Active Training Image Creation for Relative Attributes
Current wisdom suggests more labeled image data is always better, and
obtaining labels is the bottleneck. Yet curating a pool of sufficiently diverse
and informative images is itself a challenge. In particular, training image
curation is problematic for fine-grained attributes, where the subtle visual
differences of interest may be rare within traditional image sources. We
propose an active image generation approach to address this issue. The main
idea is to jointly learn the attribute ranking task while also learning to
generate novel realistic image samples that will benefit that task. We
introduce an end-to-end framework that dynamically "imagines" image pairs that
would confuse the current model, presents them to human annotators for
labeling, then improves the predictive model with the new examples. With
results on two datasets, we show that by thinking outside the pool of real
images, our approach gains generalization accuracy for challenging fine-grained
attribute comparisons.
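Schematically, the active loop this abstract describes might look like the following; every callable is a hypothetical stand-in for the real generator, annotator, and ranker, not the authors' framework.

```python
# Skeleton of an active image-generation loop: imagine a confusing
# pair, get a human label, retrain, repeat.
import random

def active_loop(ranker_update, imagine_pair, annotate, rounds=3):
    labeled = []
    for _ in range(rounds):
        pair = imagine_pair()                   # synthesize a pair the model finds confusing
        labeled.append((pair, annotate(pair)))  # human provides the relative label
        ranker_update(labeled)                  # retrain the attribute ranker
    return labeled

# Dummy callables standing in for the real components.
data = active_loop(lambda d: None,
                   lambda: ("imgA", "imgB"),
                   lambda pair: random.choice([0, 1]))
print(len(data))  # 3 newly labeled pairs
```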
Representation Learning by Reconstructing Neighborhoods
Since its introduction, unsupervised representation learning has attracted a
lot of attention from the research community, as it is demonstrated to be
highly effective and easy to apply in tasks such as dimensionality reduction,
clustering, visualization, information retrieval, and semi-supervised learning.
In this work, we propose a novel unsupervised representation learning framework
called neighbor-encoder, in which domain knowledge can be easily incorporated
into the learning process without modifying the general encoder-decoder architecture of the classic autoencoder. In contrast to the autoencoder, which reconstructs the input data itself, the neighbor-encoder reconstructs the input data's neighbors. As the proposed representation learning problem is
essentially a neighbor reconstruction problem, domain knowledge can be easily
incorporated in the form of an appropriate definition of similarity between
objects. Based on that observation, our framework can leverage any
off-the-shelf similarity search algorithms or side information to find the
neighbor of an input object. Applications of other algorithms (e.g.,
association rule mining) in our framework are also possible, given that the
appropriate definition of neighbor can vary in different contexts. We have
demonstrated the effectiveness of our framework in many diverse domains,
including images, text, and time series, and for various data mining tasks
including classification, clustering, and visualization. Experimental results
show that neighbor-encoder not only outperforms autoencoder in most of the
scenarios we consider, but also achieves the state-of-the-art performance on
text document clustering.
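The contrast with the classic autoencoder can be shown in a few lines; the architecture, sizes, and neighbor definition below are illustrative assumptions, not the paper's model.

```python
# Same encoder-decoder as an autoencoder, but the reconstruction target
# is a neighbor of x rather than x itself.
import torch
import torch.nn as nn

class NeighborEncoder(nn.Module):
    def __init__(self, dim=64, hidden=16):
        super().__init__()
        self.enc = nn.Linear(dim, hidden)
        self.dec = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

model = NeighborEncoder()
x = torch.randn(32, 64)                      # input objects
neighbor = x + 0.1 * torch.randn(32, 64)     # stand-in for a retrieved neighbor
loss = nn.functional.mse_loss(model(x), neighbor)  # reconstruct the neighbor
print(loss.item())
```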
Transitive Invariance for Self-supervised Visual Representation Learning
Learning visual representations with self-supervised learning has become
popular in computer vision. The idea is to design auxiliary tasks where labels
are free to obtain. Most of these tasks end up providing data to learn specific
kinds of invariance useful for recognition. In this paper, we propose to
exploit different self-supervised approaches to learn representations invariant
to (i) inter-instance variations (two objects in the same class should have
similar features) and (ii) intra-instance variations (viewpoint, pose,
deformations, illumination, etc.). Instead of combining the two approaches with multi-task learning, we argue for organizing and reasoning over the data with multiple variations. Specifically, we propose to generate a graph with millions of
objects mined from hundreds of thousands of videos. The objects are connected
by two types of edges which correspond to two types of invariance: "different
instances but a similar viewpoint and category" and "different viewpoints of
the same instance". By applying simple transitivity on the graph with these
edges, we can obtain pairs of images exhibiting richer visual invariance. We
use this data to train a Triplet-Siamese network with VGG16 as the base
architecture and apply the learned representations to different recognition
tasks. For object detection, we achieve 63.2% mAP on PASCAL VOC 2007 using Fast R-CNN (compared to 67.3% with ImageNet pre-training). For the challenging COCO dataset, our method is surprisingly close (23.5%) to the ImageNet-supervised counterpart (24.4%) using the Faster R-CNN framework. We also show that our network can perform significantly better than the ImageNet network in the surface normal estimation task.
Comment: ICCV 2017
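A toy illustration of the transitivity step: composing an inter-instance edge (A, B) with an intra-instance edge (B, B') yields the richer positive pair (A, B'). This is a schematic reading of the graph construction, not the authors' pipeline.

```python
# Compose the two edge types to mine positive pairs that differ in both
# instance and viewpoint.
def transitive_pairs(inter_edges, intra_edges):
    intra = {}
    for b, b2 in intra_edges:               # "different viewpoints of the same instance"
        intra.setdefault(b, []).append(b2)
    pairs = []
    for a, b in inter_edges:                # "different instances, similar viewpoint/category"
        for b2 in intra.get(b, []):
            pairs.append((a, b2))           # same category, new instance and viewpoint
    return pairs

print(transitive_pairs([("A", "B")], [("B", "B_prime")]))  # [('A', 'B_prime')]
```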
Sparse Label Smoothing Regularization for Person Re-Identification
Person re-identification (re-id) is a cross-camera retrieval task which
establishes a correspondence between images of a person from multiple cameras.
Deep Learning methods have been successfully applied to this problem and have
achieved impressive results. However, these methods require a large amount of
labeled training data. Currently, labeled datasets in person re-id are limited in scale, and manual acquisition of such large-scale datasets from surveillance cameras is a tedious and labor-intensive task. In this paper, we propose a framework that performs intelligent data augmentation and assigns partially smoothed labels to generated data. Our approach first exploits the
clustering property of existing person re-id datasets to create groups of
similar objects that model cross-view variations. Each group is then used to
generate realistic images through adversarial training. Our aim is to emphasize
feature similarity between generated samples and the original samples. Finally,
we assign a non-uniform label distribution to the generated samples and define
a regularized loss function for training. The proposed approach tackles two
problems (1) how to efficiently use the generated data and (2) how to address
the over-smoothness problem found in current regularization methods. Extensive
experiments on four large-scale datasets show that our regularization method significantly improves the re-id accuracy compared to existing methods.
Comment: 13 pages, 6 figures
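One plausible reading of the non-uniform label assignment is sketched below: probability mass is spread only over the identities of the generated sample's cluster, zero elsewhere. The exact distribution used in the paper may differ.

```python
# Sparse smoothed label for a generated sample: uniform mass over the
# identities in its cluster rather than over all classes.
import torch

def sparse_smooth_label(num_classes, cluster_ids):
    """Uniform mass over the cluster's identities, zero elsewhere."""
    label = torch.zeros(num_classes)
    label[torch.tensor(cluster_ids)] = 1.0 / len(cluster_ids)
    return label

# Generated sample from a 3-identity cluster in a 10-class problem.
y = sparse_smooth_label(10, [2, 3, 7])
logits = torch.randn(10)
loss = -(y * torch.log_softmax(logits, dim=0)).sum()  # cross-entropy vs. soft label
print(loss.item())
```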
Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog
This paper presents a new model for visual dialog, Recurrent Dual Attention
Network (ReDAN), using multi-step reasoning to answer a series of questions
about an image. In each question-answering turn of a dialog, ReDAN infers the
answer progressively through multiple reasoning steps. In each step of the
reasoning process, the semantic representation of the question is updated based
on the image and the previous dialog history, and the recurrently-refined
representation is used for further reasoning in the subsequent step. On the
VisDial v1.0 dataset, the proposed ReDAN model achieves a new state-of-the-art NDCG score of 64.47%. Visualization of the reasoning process further demonstrates that ReDAN can locate context-relevant visual and textual clues via iterative refinement, which can lead to the correct answer step by step.
Comment: Accepted to ACL 2019
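The multi-step refinement can be caricatured in a few lines of PyTorch; the attention form and dimensions below are assumptions for illustration, not the ReDAN architecture.

```python
# Iteratively refine the question representation by attending to image
# regions and dialog-history features.
import torch

def attend(query, feats):
    w = torch.softmax(feats @ query, dim=0)  # attention weights over regions/turns
    return feats.T @ w                       # attended summary vector

q = torch.randn(128)        # initial question representation
img = torch.randn(36, 128)  # image region features
hist = torch.randn(5, 128)  # dialog history features
for _ in range(3):          # multi-step reasoning
    q = q + attend(q, img) + attend(q, hist)
```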
Deep Learning on Graphs: A Survey
Deep learning has been shown to be successful in a number of domains, ranging from acoustics and images to natural language processing. However, applying deep
learning to the ubiquitous graph data is non-trivial because of the unique
characteristics of graphs. Recently, substantial research efforts have been
devoted to applying deep learning methods to graphs, resulting in beneficial
advances in graph analysis techniques. In this survey, we comprehensively
review the different types of deep learning methods on graphs. We divide the
existing methods into five categories based on their model architectures and
training strategies: graph recurrent neural networks, graph convolutional
networks, graph autoencoders, graph reinforcement learning, and graph
adversarial methods. We then provide a comprehensive overview of these methods
in a systematic manner mainly by following their development history. We also
analyze the differences and compositions of different methods. Finally, we
briefly outline the applications in which they have been used and discuss potential future research directions.
Comment: Accepted by IEEE Transactions on Knowledge and Data Engineering. 24 pages, 11 figures
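As a concrete instance of the graph convolutional family the survey covers, here is a minimal layer following the well-known Kipf & Welling propagation rule H' = sigma(D^{-1/2}(A+I)D^{-1/2} H W); a generic example, not code from the survey.

```python
# Minimal graph convolutional layer with self-loops and symmetric
# degree normalization.
import torch

def gcn_layer(A, H, W):
    A_hat = A + torch.eye(A.shape[0])     # add self-loops
    deg = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(deg.pow(-0.5))
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

A = torch.tensor([[0., 1.], [1., 0.]])    # toy 2-node graph
H = torch.randn(2, 8)                     # node features
W = torch.randn(8, 4)                     # layer weights
print(gcn_layer(A, H, W).shape)           # torch.Size([2, 4])
```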
3D-A-Nets: 3D Deep Dense Descriptor for Volumetric Shapes with Adversarial Networks
Recently, researchers have been shifting their focus from hand-crafted 3D shape descriptors to learned ones, to better address the challenging issues of deformation and structural variation inherently present in 3D objects. 3D geometric data are often transformed to 3D voxel grids with a regular format in order to be better fed to a deep neural net architecture. However, the computational intractability of directly applying 3D convolutional nets to 3D volumetric data severely limits the efficiency (i.e., slow processing) and effectiveness (i.e., unsatisfactory accuracy) of processing 3D geometric data. In this paper, powered by a novel design of adversarial networks (3D-A-Nets), we have developed a novel 3D deep dense shape descriptor (3D-DDSD) to address the challenging issues of efficient and effective 3D volumetric data processing. We developed a new definition of a 2D multilayer dense representation (MDR) of 3D volumetric data to extract concise but geometrically informative shape descriptions, and a novel design of adversarial networks that jointly trains a convolutional neural network (CNN), a recurrent neural network (RNN), and an adversarial discriminator. More specifically, the generator network produces 3D shape features that encourage the clustering of samples from the same category with the correct class label, whereas the discriminator network discourages the clustering by assigning them misleading adversarial class labels. By addressing the challenges posed by the computational inefficiency of directly applying CNNs to 3D volumetric data, 3D-A-Nets can learn a high-quality 3D-DDSD which demonstrates superior performance on 3D shape classification and retrieval over other state-of-the-art techniques by a large margin.
Comment: 8 pages, 8 figures
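One plausible reading of the 2D multilayer dense representation is slicing the voxel grid into a sequence of 2D layers that an RNN consumes, sketched below; the paper's actual construction may differ.

```python
# Turn a 3D occupancy grid into a sequence of 2D slices and feed the
# flattened slices to a recurrent network.
import torch

voxels = torch.rand(32, 32, 32)          # a 32^3 occupancy grid
slices = voxels.unbind(dim=0)            # 32 two-dimensional 32x32 layers
seq = torch.stack(slices).flatten(1)     # (32, 1024) slice sequence
rnn = torch.nn.GRU(input_size=1024, hidden_size=256)
out, h = rnn(seq.unsqueeze(1))           # per-slice features, batch of 1
print(out.shape)                         # torch.Size([32, 1, 256])
```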