Affinity Graph Supervision for Visual Recognition
Affinity graphs are widely used in deep architectures, including graph
convolutional neural networks and attention networks. Thus far, the literature
has focused on abstracting features from such graphs, while the learning of the
affinities themselves has been overlooked. Here we propose a principled method
to directly supervise the learning of weights in affinity graphs, to exploit
meaningful connections between entities in the data source. Applied to a visual
attention network, our affinity supervision improves relationship recovery
between objects, even without the use of manually annotated relationship
labels. We further show that affinity learning between objects boosts scene
categorization performance and that the supervision of affinity can also be
applied to graphs built from mini-batches, for neural network training. In an
image classification task we demonstrate consistent improvement over the
baseline, with diverse network architectures and datasets.
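To make the idea concrete, here is a minimal sketch of directly supervising affinity weights, assuming entities come with class labels and that the loss concentrates affinity "mass" on same-class pairs; the tensor shapes, softmax normalization, and exact loss form are illustrative assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def affinity_supervision_loss(features, labels):
    """features: (N, D) entity embeddings; labels: (N,) class ids."""
    sim = features @ features.t()                    # pairwise similarities
    aff = F.softmax(sim, dim=1)                      # row-normalized affinity graph
    same = (labels[:, None] == labels[None, :]).float()
    same.fill_diagonal_(0)                           # ignore self-affinity
    mass = (aff * same).sum(dim=1).clamp_min(1e-8)   # affinity mass on same-class pairs
    return -mass.log().mean()                        # concentrate affinity on matching entities

feats = torch.randn(8, 16, requires_grad=True)
loss = affinity_supervision_loss(feats, torch.randint(0, 3, (8,)))
loss.backward()
```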
Adaptive Affinity Fields for Semantic Segmentation
Semantic segmentation has made much progress with increasingly powerful
pixel-wise classifiers and incorporating structural priors via Conditional
Random Fields (CRF) or Generative Adversarial Networks (GAN). We propose a
simpler alternative that learns to verify the spatial structure of segmentation
during training only. Unlike existing approaches that enforce semantic labels
on individual pixels and match labels between neighbouring pixels, we propose
the concept of Adaptive Affinity Fields (AAF) to capture and match the semantic
relations between neighbouring pixels in the label space. We use adversarial
learning to select the optimal affinity field size for each semantic category.
It is formulated as a minimax problem, optimizing our segmentation neural
network in a best worst-case learning scenario. AAF is versatile for
representing structures as a collection of pixel-centric relations, easier to
train than GAN and more efficient than CRF without run-time inference. Our
extensive evaluations on PASCAL VOC 2012, Cityscapes, and GTA5 datasets
demonstrate its above-par segmentation performance and robust generalization
across domains.
Comment: To appear in European Conference on Computer Vision (ECCV) 2018
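The label-space affinity idea can be sketched as a pairwise loss at a single field size, written in PyTorch below. The single 4-neighborhood shift, the KL divergence, and the fixed margin are illustrative assumptions; the paper additionally selects the field size per category adversarially, which is omitted here.

```python
import torch
import torch.nn.functional as F

def affinity_field_loss(logits, labels, shift=1, margin=1.0):
    """logits: (B, C, H, W) predictions; labels: (B, H, W) ground-truth class ids."""
    logp = F.log_softmax(logits, dim=1)
    # divergence between each pixel and its neighbor `shift` pixels to the right
    kl = F.kl_div(logp[..., shift:], logp[..., :-shift].exp(),
                  reduction="none").sum(dim=1)       # (B, H, W - shift)
    same = (labels[..., shift:] == labels[..., :-shift]).float()
    # pull predictions together inside a region, push them apart across boundaries
    return (same * kl + (1 - same) * F.relu(margin - kl)).mean()

logits = torch.randn(2, 5, 16, 16, requires_grad=True)
labels = torch.randint(0, 5, (2, 16, 16))
affinity_field_loss(logits, labels).backward()
```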
Exploring Object Relation in Mean Teacher for Cross-Domain Detection
Rendering synthetic data (e.g., 3D CAD-rendered images) to generate
annotations for learning deep models in vision tasks has attracted increasing
attention in recent years. However, simply applying the models learnt on
synthetic images may lead to high generalization error on real images due to
domain shift. To address this issue, recent progress in cross-domain
recognition has featured the Mean Teacher, which directly simulates
unsupervised domain adaptation as semi-supervised learning. The domain gap is
thus naturally bridged with consistency regularization in a teacher-student
scheme. In this work, we advance this Mean Teacher paradigm to be applicable
for cross-domain detection. Specifically, we present Mean Teacher with Object
Relations (MTOR) that novelly remolds Mean Teacher under the backbone of Faster
R-CNN by integrating the object relations into the measure of consistency cost
between teacher and student modules. Technically, MTOR firstly learns
relational graphs that capture similarities between pairs of regions for
teacher and student respectively. The whole architecture is then optimized with
three consistency regularizations: 1) region-level consistency to align the
region-level predictions between teacher and student, 2) inter-graph
consistency for matching the graph structures between teacher and student, and
3) intra-graph consistency to enhance the similarity between regions of the same
class within the graph of student. Extensive experiments are conducted on the
transfers across Cityscapes, Foggy Cityscapes, and SIM10k, and superior results
are reported when comparing to state-of-the-art approaches. More remarkably, we
obtain a new single-model record of 22.8% mAP on the Syn2Real detection dataset.
Comment: CVPR 2019; The codes and model of our MTOR are publicly available at:
https://github.com/caiqi/mean-teacher-cross-domain-detection
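A compact sketch of the three consistency terms follows, assuming region features and classification scores have already been extracted for the same proposals in the teacher and student branches; the equal weighting, cosine-similarity graphs, and pseudo-labeling by teacher argmax are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def region_graph(feats):
    f = F.normalize(feats, dim=1)
    return f @ f.t()                                  # (R, R) region similarity graph

def mtor_consistency(t_feats, s_feats, t_scores, s_scores):
    region = F.mse_loss(s_scores, t_scores)           # 1) region-level consistency
    g_t, g_s = region_graph(t_feats), region_graph(s_feats)
    inter = F.mse_loss(g_s, g_t)                      # 2) inter-graph consistency
    cls = t_scores.argmax(dim=1)                      # teacher pseudo classes
    same = (cls[:, None] == cls[None, :]).float()
    intra = ((1 - g_s) * same).sum() / same.sum()     # 3) intra-graph consistency
    return region + inter + intra

R, C, D = 6, 9, 32
t_feats, s_feats = torch.randn(R, D), torch.randn(R, D, requires_grad=True)
t_scores, s_scores = torch.randn(R, C).softmax(1), torch.randn(R, C).softmax(1)
mtor_consistency(t_feats, s_feats, t_scores, s_scores).backward()
```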
Multiple-Human Parsing in the Wild
Human parsing is attracting increasing research attention. In this work, we
aim to push the frontier of human parsing by introducing the problem of
multi-human parsing in the wild. Existing works on human parsing mainly tackle
single-person scenarios, a setting that deviates from real-world applications
where multiple persons appear simultaneously, with interaction and occlusion. To
address the multi-human parsing problem, we introduce a new multi-human parsing
(MHP) dataset and a novel multi-human parsing model named MH-Parser. The MHP
dataset contains multiple persons captured in real-world scenes with
pixel-level fine-grained semantic annotations in an instance-aware setting. The
MH-Parser generates global parsing maps and person instance masks
simultaneously in a bottom-up fashion with the help of a new Graph-GAN model.
We envision that the MHP dataset will serve as a valuable data resource to
develop new multi-human parsing models, and the MH-Parser offers a strong
baseline to drive future research for multi-human parsing in the wild.
Comment: The first two authors are with equal contribution
Learning Correspondence from the Cycle-Consistency of Time
We introduce a self-supervised method for learning visual correspondence from
unlabeled video. The main idea is to use cycle-consistency in time as a free
supervisory signal for learning visual representations from scratch. At
training time, our model learns a feature map representation to be useful for
performing cycle-consistent tracking. At test time, we use the acquired
representation to find nearest neighbors across space and time. We demonstrate
the generalizability of the representation -- without finetuning -- across a
range of visual correspondence tasks, including video object segmentation,
keypoint tracking, and optical flow. Our approach outperforms previous
self-supervised methods and performs competitively with strongly supervised
methods.
Comment: CVPR 2019 Oral. Project page: http://ajabri.github.io/timecycle
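The cycle-consistency signal can be sketched with a soft-attention tracker over dense features: track forward from frame 1 to frame 2 and back, and require each location to return to itself. The flattened feature shapes and the temperature are assumptions; the paper's actual tracker operates on patches and longer cycles.

```python
import torch
import torch.nn.functional as F

def cycle_loss(f1, f2, temperature=0.07):
    """f1, f2: (N, D) features for N spatial locations in two frames."""
    f1, f2 = F.normalize(f1, dim=1), F.normalize(f2, dim=1)
    fwd = F.softmax(f1 @ f2.t() / temperature, dim=1)   # track frame1 -> frame2
    bwd = F.softmax(f2 @ f1.t() / temperature, dim=1)   # track frame2 -> frame1
    cycle = fwd @ bwd                                   # round trip; rows sum to 1
    target = torch.arange(f1.size(0))                   # each location should return home
    return F.nll_loss(cycle.clamp_min(1e-8).log(), target)

f1 = torch.randn(64, 128, requires_grad=True)
f2 = torch.randn(64, 128, requires_grad=True)
cycle_loss(f1, f2).backward()
```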
Weakly Supervised Few-shot Object Segmentation using Co-Attention with Visual and Semantic Embeddings
Significant progress has been made recently in developing few-shot object
segmentation methods. Learning has been shown to be successful in few-shot
segmentation settings using pixel-level, scribble, and bounding-box
supervision. This paper takes another approach, i.e., only requiring an
image-level label for few-shot object segmentation. We propose a novel
multi-modal interaction module for few-shot object segmentation that utilizes a
co-attention mechanism using both visual and word embeddings. Our model using
image-level labels achieves 4.8% improvement over previously proposed
image-level few-shot object segmentation. It also outperforms state-of-the-art
methods that use weak bounding box supervision on PASCAL-5i. Our results show
that few-shot segmentation benefits from utilizing word embeddings, and that we
are able to perform few-shot segmentation using stacked joint visual semantic
processing with weak image-level labels. We further propose a novel setup,
Temporal Object Segmentation for Few-shot Learning (TOSFL) for videos. TOSFL
can be used on a variety of public video data such as YouTube-VOS, as
demonstrated in both instance-level and category-level TOSFL experiments.
Comment: Accepted to IJCAI'20. The first three authors listed contributed equally
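A minimal sketch of the multi-modal interaction: a class word embedding is projected into the visual space and conditions both the support and query streams before spatial co-attention. The dimensions, the additive fusion, and the single attention direction are simplifying assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, vis_dim=256, word_dim=300):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, vis_dim)  # map word into visual space

    def forward(self, support, query, word):
        """support, query: (N, vis_dim) flattened spatial features; word: (word_dim,)."""
        w = self.word_proj(word)                       # (vis_dim,)
        s, q = support + w, query + w                  # condition both streams on the word
        attn = torch.softmax(q @ s.t(), dim=1)         # query locations attend to support
        return q + attn @ s                            # word- and support-aware query features

coatt = CoAttention()
out = coatt(torch.randn(64, 256), torch.randn(64, 256), torch.randn(300))
```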
Unsupervised High-level Feature Learning by Ensemble Projection for Semi-supervised Image Classification and Image Clustering
This paper investigates the problem of image classification with limited or
no annotations, but abundant unlabeled data. The setting exists in many tasks
such as semi-supervised image classification, image clustering, and image
retrieval. Unlike previous methods, which develop or learn sophisticated
regularizers for classifiers, our method learns a new image representation by
exploiting the distribution patterns of all available data for the task at
hand. In particular, a rich set of visual prototypes is sampled from all
available data and taken as surrogate classes to train discriminative
classifiers; images are projected via the classifiers; and the projected
values, i.e., similarities to the prototypes, are stacked to build the new
feature vector. Since the training set is noisy, in the spirit of ensemble
learning we create a set of diverse such training sets, leading to diverse
classifiers. The method is dubbed Ensemble Projection (EP). EP captures not
only the characteristics of individual images, but also the relationships among
images. It is conceptually simple and computationally efficient, yet effective
and flexible. Experiments on eight standard datasets show that: (1) EP
outperforms previous methods for semi-supervised image classification; (2) EP
produces promising results for self-taught image classification, where
unlabeled samples are a random collection of images rather than being from the
same distribution as the labeled ones; and (3) EP improves over the original
features for image clustering. The code of the method is available on the
project page.
Comment: 22 pages, 8 figures
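Ensemble Projection can be sketched in a few lines: sample prototypes as surrogate classes, train a discriminative classifier per sampled set, and stack the predicted class probabilities as the new representation. The uniform prototype sampling and logistic-regression classifier below are stand-ins for the paper's diversity-driven sampling scheme.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ensemble_projection(X, n_sets=5, n_prototypes=10, seed=0):
    """X: (N, D) image features -> (N, n_sets * n_prototypes) EP representation."""
    rng = np.random.default_rng(seed)
    parts = []
    for _ in range(n_sets):
        idx = rng.choice(len(X), n_prototypes, replace=False)  # surrogate classes
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[idx], np.arange(n_prototypes))       # one prototype per class
        parts.append(clf.predict_proba(X))             # similarities to the prototypes
    return np.hstack(parts)                            # stacked projected values

X = np.random.default_rng(1).normal(size=(100, 32))
Z = ensemble_projection(X)                             # (100, 50) EP features
```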
Learning Latent Representations in Neural Networks for Clustering through Pseudo Supervision and Graph-based Activity Regularization
In this paper, we propose a novel unsupervised clustering approach exploiting
the hidden information that is indirectly introduced through a pseudo
classification objective. Specifically, we randomly assign a pseudo
parent-class label to each observation which is then modified by applying the
domain specific transformation associated with the assigned label. Generated
pseudo observation-label pairs are subsequently used to train a neural network
with Auto-clustering Output Layer (ACOL) that introduces multiple softmax nodes
for each pseudo parent-class. Owing to the unsupervised objective based on
Graph-based Activity Regularization (GAR) terms, the softmax duplicates of each
parent class specialize as the hidden information captured through the
domain-specific transformations propagates during training. Ultimately, we
obtain a k-means-friendly latent representation. Furthermore, we
demonstrate how the chosen transformation type impacts performance and helps
propagate the latent information that is useful in revealing unknown clusters.
Our results show state-of-the-art performance for unsupervised clustering tasks
on MNIST, SVHN and USPS datasets, with the highest accuracies reported to date
in the literature.
Comment: To appear in proceedings of the International Conference on Learning Representations 2018
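The pseudo-supervision setup can be sketched as follows, using image rotations as the domain-specific transformations and a fixed number of softmax duplicates per pseudo parent class; the GAR regularization terms themselves are omitted, and the rotation choice and log-sum-exp pooling of duplicates are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_PARENTS, DUPLICATES = 4, 3     # 4 rotations; 3 softmax duplicates per parent class

def pseudo_batch(images):
    """Rotate each image by a random multiple of 90 degrees; the rotation id is the label."""
    labels = torch.randint(0, N_PARENTS, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels

class ACOLHead(nn.Module):
    def __init__(self, in_dim=128):
        super().__init__()
        self.fc = nn.Linear(in_dim, N_PARENTS * DUPLICATES)

    def forward(self, z):
        logits = self.fc(z).view(-1, N_PARENTS, DUPLICATES)
        return torch.logsumexp(logits, dim=2)          # pool duplicates into parent scores

images = torch.randn(8, 3, 32, 32)
x, y = pseudo_batch(images)
embeddings = torch.randn(8, 128)                       # stand-in for a backbone forward pass
loss = F.cross_entropy(ACOLHead()(embeddings), y)      # pseudo-classification objective
```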
Coarse-to-Fine Annotation Enrichment for Semantic Segmentation Learning
Rich high-quality annotated data is critical for semantic segmentation
learning, yet acquiring dense and pixel-wise ground-truth is both labor- and
time-consuming. Coarse annotations (e.g., scribbles, coarse polygons) offer an
economical alternative, yet training on them alone can hardly produce
satisfactory performance. In order to generate high-quality
annotated data with a low time cost for accurate segmentation, in this paper,
we propose a novel annotation enrichment strategy, which expands existing
coarse annotations of training data to a finer scale. Extensive experiments on
the Cityscapes and PASCAL VOC 2012 benchmarks have shown that the neural
networks trained with the enriched annotations from our framework yield a
significant improvement over those trained with the original coarse labels,
and are highly competitive with the performance obtained using human-annotated
dense annotations. The proposed method also outperforms other
state-of-the-art weakly-supervised segmentation methods.
Comment: CIKM 2018 International Conference on Information and Knowledge Management
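As a rough illustration of expanding coarse annotations, the sketch below propagates scribble labels to nearby unlabeled pixels by nearest-neighbor search in a joint color-position space. This is a plainly swapped-in stand-in to convey the idea of enrichment, not the algorithm proposed in the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def enrich(image, coarse, spatial_weight=0.5, max_dist=20.0):
    """image: (H, W, 3) floats in [0, 1]; coarse: (H, W) labels, -1 = unlabeled."""
    H, W, _ = image.shape
    yx = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), axis=-1)
    feat = np.concatenate([image, spatial_weight * yx], axis=-1).reshape(-1, 5)
    flat = coarse.reshape(-1).copy()
    labeled = flat >= 0
    dist, idx = cKDTree(feat[labeled]).query(feat[~labeled])  # nearest annotated pixel
    near = dist < max_dist                                    # leave far pixels unlabeled
    flat[np.flatnonzero(~labeled)[near]] = flat[labeled][idx[near]]
    return flat.reshape(H, W)

image = np.random.rand(16, 16, 3)
coarse = -np.ones((16, 16), dtype=int)
coarse[2, 2], coarse[12, 12] = 0, 1                           # two scribble pixels
fine = enrich(image, coarse)
```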
Taskonomy: Disentangling Task Transfer Learning
Do visual tasks have a relationship, or are they unrelated? For instance,
could having surface normals simplify estimating the depth of an image?
Intuition answers these questions positively, implying the existence of a
structure among visual tasks. Knowing this structure has notable value; it is the
concept underlying transfer learning and provides a principled way for
identifying redundancies across tasks, e.g., to seamlessly reuse supervision
among related tasks or solve many tasks in one system without piling up the
complexity.
We propose a fully computational approach for modeling the structure of the
space of visual tasks. This is done via finding (first and higher-order)
transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D,
and semantic tasks in a latent space. The product is a computational taxonomic
map for task transfer learning. We study the consequences of this structure,
e.g., nontrivial emergent relationships, and exploit them to reduce the demand
for labeled data. For example, we show that the total number of labeled
datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3
(compared to training independently) while keeping the performance nearly the
same. We provide a set of tools for computing and probing this taxonomical
structure including a solver that users can employ to devise efficient
supervision policies for their use cases.
Comment: CVPR 2018 (Oral). See project website and live demos at http://taskonomy.vision
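The taxonomy-reading step can be illustrated with a toy affinity matrix: given normalized transfer performances between tasks (made-up numbers below), greedily pick a budget of source tasks that maximizes each target's best available transfer. The paper's actual solver formulates this as a Binary Integer Program over first- and higher-order transfers; the greedy selection here is only a simplified stand-in.

```python
import numpy as np

tasks = ["depth", "normals", "edges", "segmentation"]
# affinity[i, j]: how well a model trained on task i transfers to task j (made up)
affinity = np.array([[1.0, 0.8, 0.3, 0.5],
                     [0.7, 1.0, 0.4, 0.6],
                     [0.2, 0.3, 1.0, 0.4],
                     [0.5, 0.6, 0.5, 1.0]])

def pick_sources(aff, budget=2):
    """Greedily pick `budget` source tasks maximizing each target's best transfer."""
    aff = aff.copy()
    np.fill_diagonal(aff, 0.0)                        # a task cannot be its own source
    chosen, best = [], np.zeros(aff.shape[1])
    for _ in range(budget):
        gain = np.maximum(aff, best).sum(axis=1)      # per-target coverage if source added
        for i in chosen:
            gain[i] = -np.inf                         # do not pick a source twice
        pick = int(gain.argmax())
        chosen.append(pick)
        best = np.maximum(best, aff[pick])
    return [tasks[i] for i in chosen]

print(pick_sources(affinity))                         # -> ['normals', 'depth']
```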