Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination
Neural net classifiers trained on data with annotated class labels can also
capture apparent visual similarity among categories without being directed to
do so. We study whether this observation can be extended beyond the
conventional domain of supervised learning: Can we learn a good feature
representation that captures apparent similarity among instances, instead of
classes, by merely asking the feature to be discriminative of individual
instances? We formulate this intuition as a non-parametric classification
problem at the instance-level, and use noise-contrastive estimation to tackle
the computational challenges imposed by the large number of instance classes.
Our experimental results demonstrate that, under unsupervised learning
settings, our method surpasses the state-of-the-art on ImageNet classification
by a large margin. Our method is also remarkable for consistently improving
test performance with more training data and better network architectures. By
fine-tuning the learned feature, we further obtain competitive results for
semi-supervised learning and object detection tasks. Our non-parametric model
is highly compact: With 128 features per image, our method requires only 600MB
storage for a million images, enabling fast nearest neighbour retrieval at
run time.
Comment: CVPR 2018 spotlight paper. Code:
https://github.com/zhirongw/lemniscate.pytorc
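A minimal sketch of the instance-level, non-parametric classification idea described above, assuming L2-normalised features compared by dot product and a softmax temperature. The function names, the temperature value, and the float32 storage assumption are illustrative choices made here, not details taken from the abstract, and the noise-contrastive estimation step is not shown.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def instance_probabilities(query, memory_bank, temperature=0.07):
    """Softmax over stored instance features: P(instance i | query).

    Every stored image is its own class; the class "weights" are simply
    the stored feature vectors. The temperature value is an assumption.
    """
    q = l2_normalize(query)
    logits = [
        sum(a * b for a, b in zip(l2_normalize(m), q)) / temperature
        for m in memory_bank
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def bank_size_bytes(num_images, dim=128, bytes_per_value=4):
    # Raw storage for the feature bank, assuming float32 values.
    return num_images * dim * bytes_per_value

bank = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
probs = instance_probabilities([0.9, 0.1], bank)
million_image_bytes = bank_size_bytes(1_000_000)
```

Under the float32 assumption, 128 values per image for a million images come to 512,000,000 bytes of raw features, the same order of magnitude as the 600MB quoted in the abstract.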
Prototypical Contrastive Learning of Unsupervised Representations
This paper presents Prototypical Contrastive Learning (PCL), an unsupervised
representation learning method that addresses the fundamental limitations of
instance-wise contrastive learning. PCL not only learns low-level features for
the task of instance discrimination, but more importantly, it implicitly
encodes semantic structures of the data into the learned embedding space.
Specifically, we introduce prototypes as latent variables to help find the
maximum-likelihood estimation of the network parameters in an
Expectation-Maximization framework. We iteratively perform the E-step, finding
the distribution of prototypes via clustering, and the M-step, optimizing the
network via contrastive learning. We propose the ProtoNCE loss, a generalized
version of the InfoNCE loss for contrastive learning, which encourages
representations to be closer to their assigned prototypes. PCL outperforms
state-of-the-art instance-wise contrastive learning methods on multiple
benchmarks with substantial improvement in low-resource transfer learning. Code
and pretrained models are available at https://github.com/salesforce/PCL
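A rough sketch of how a prototype term might be added to an InfoNCE-style loss, in the spirit of the ProtoNCE description above. It assumes a single fixed temperature and one set of prototypes with a known cluster assignment; these simplifications, and all names below, are assumptions made here and may differ from the loss defined in the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nce_term(anchor, positive, negatives, temperature):
    """Negative log-softmax of the positive against positive plus negatives."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[0]

def proto_nce_sketch(anchor, positive_view, negative_views,
                     prototypes, assigned, temperature=0.1):
    # Instance-wise term: pull the anchor towards its augmented view.
    instance_loss = nce_term(anchor, positive_view, negative_views, temperature)
    # Prototype term: pull the anchor towards its assigned prototype
    # and away from the other prototypes.
    others = [p for k, p in enumerate(prototypes) if k != assigned]
    prototype_loss = nce_term(anchor, prototypes[assigned], others, temperature)
    return instance_loss + prototype_loss

prototypes = [[1.0, 0.0], [0.0, 1.0]]
loss_near = proto_nce_sketch([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], prototypes, assigned=0)
loss_far = proto_nce_sketch([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], prototypes, assigned=1)
```

In this toy case the loss is small when the anchor lies on its assigned prototype and large when it is assigned to the distant one, which is the behaviour the abstract attributes to the prototype term.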
Learning Spatiotemporal Features via Video and Text Pair Discrimination
Current video representations heavily rely on learning from manually
annotated video datasets, which are time-consuming and expensive to acquire. We
observe that videos are naturally accompanied by abundant text information such as
YouTube titles and Instagram captions. In this paper, we leverage this
visual-textual connection to learn spatiotemporal features in an efficient
weakly-supervised manner. We present a general cross-modal pair discrimination
(CPD) framework to capture this correlation between a video and its associated
text. Specifically, we adopt noise-contrastive estimation to tackle the
computational issue imposed by the huge number of pair instance classes and
design a practical curriculum learning strategy. We train our CPD models on
both a standard video dataset (Kinetics-210k) and an uncurated web video dataset
(Instagram-300k) to demonstrate its effectiveness. Without further fine-tuning,
the learnt models obtain competitive results for action classification on
Kinetics under the linear classification protocol. Moreover, our visual model
provides an effective initialization to fine-tune on downstream tasks, which
yields a remarkable performance gain for action recognition on UCF101 and
HMDB51, compared with the existing state-of-the-art self-supervised training
methods. In addition, our CPD model yields a new state of the art for zero-shot
action recognition on UCF101 by directly utilizing the learnt visual-textual
embeddings. The code will be made available at
https://github.com/MCG-NJU/CPD-Video.
Comment: Technical Repor
Self-Supervised Similarity Learning for Digital Pathology
Using features extracted from networks pretrained on ImageNet is a common
practice in applications of deep learning for digital pathology. However, it
presents the downside of missing domain specific image information. In digital
pathology, supervised training data is expensive and difficult to collect. We
propose a self-supervised method for feature extraction by similarity learning
on whole slide images (WSI) that is simple to implement and allows creation of
robust and compact image descriptors. We train a siamese network, exploiting
image spatial continuity and assuming spatially adjacent tiles in the image are
more similar to each other than distant tiles. Our network outputs feature
vectors of length 128, which allows dramatically lower memory storage and
faster processing than networks pretrained on ImageNet. We apply the method on
digital pathology WSIs from the Camelyon16 train set and assess and compare our
method by measuring image retrieval of tumor tiles and descriptor pair distance
ratio for distant/near tiles in the Camelyon16 test set. We show that our
method yields better retrieval task results than existing ImageNet-based and
generic self-supervised feature extraction methods. To the best of our
knowledge, this is also the first published method for self-supervised learning
tailored for digital pathology.
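A small sketch of the spatial-continuity assumption described above: tile pairs are labelled similar when they are adjacent on the slide grid and dissimilar otherwise, and a siamese network is then trained on those labels. The grid-coordinate representation, the `near_radius` parameter, and the use of the classical margin-based contrastive loss are assumptions made here; the abstract does not state which loss the authors use.

```python
def pair_label(tile_a, tile_b, near_radius=1):
    """Return 1 if two (row, col) tiles are within near_radius grid steps, else 0."""
    (row_a, col_a), (row_b, col_b) = tile_a, tile_b
    grid_gap = max(abs(row_a - row_b), abs(col_a - col_b))
    return 1 if grid_gap <= near_radius else 0

def contrastive_loss(distance, label, margin=1.0):
    """Margin-based contrastive loss on a descriptor distance.

    Similar pairs (label 1) are penalised for being far apart;
    dissimilar pairs (label 0) are penalised for being closer than the margin.
    """
    similar_term = label * distance ** 2
    dissimilar_term = (1 - label) * max(0.0, margin - distance) ** 2
    return similar_term + dissimilar_term

near_pair = pair_label((0, 0), (0, 1))
far_pair = pair_label((0, 0), (5, 5))
loss_far_but_separated = contrastive_loss(2.0, far_pair)
```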
Unsupervised Semantic-based Aggregation of Deep Convolutional Features
In this paper, we propose a simple but effective semantic-based aggregation
(SBA) method. The proposed SBA utilizes the discriminative filters of deep
convolutional layers as semantic detectors. Moreover, we propose an effective
unsupervised strategy that selects some semantic detectors to generate
"probabilistic proposals", which highlight certain discriminative patterns of
objects and suppress background noise. The final global SBA
representation could then be acquired by aggregating the regional
representations weighted by the selected "probabilistic proposals"
corresponding to various semantic content. Our unsupervised SBA is easy to
generalize and achieves excellent performance on various tasks. We conduct
comprehensive experiments and show that our unsupervised SBA outperforms the
state-of-the-art unsupervised and supervised aggregation methods on image
retrieval, place recognition and cloud classification.
Comment: 10 pages. arXiv admin note: text overlap with arXiv:1705.0124
Supervised Dictionary Learning and Sparse Representation-A Review
Dictionary learning and sparse representation (DLSR) is a recent and
successful mathematical model for data representation that achieves
state-of-the-art performance in various fields such as pattern recognition,
machine learning, computer vision, and medical imaging. The original
formulation for DLSR is based on the minimization of the reconstruction error
between the original signal and its sparse representation in the space of the
learned dictionary. Although this formulation is optimal for solving problems
such as denoising, inpainting, and coding, it may not lead to an optimal solution
in classification tasks, where the ultimate goal is to make the learned
dictionary and corresponding sparse representation as discriminative as
possible. This motivated the emergence of a new category of techniques, which
is appropriately called supervised dictionary learning and sparse
representation (S-DLSR), which yields a dictionary and sparse representation
better suited to classification tasks. Despite many research efforts on
S-DLSR, the literature lacks a comprehensive view of these techniques, their
connections, advantages and shortcomings. In this paper, we address this gap
and provide a review of the recently proposed algorithms for S-DLSR. We first
present a taxonomy of these algorithms into six categories based on the
approach taken to include label information into the learning of the dictionary
and/or sparse representation. For each category, we draw connections between
the algorithms in this category and present a unified framework for them. We
then provide guidelines for applied researchers on how to represent and learn
the building blocks of an S-DLSR solution based on the problem at hand. This
review provides a broad, yet deep, view of the state-of-the-art methods for
S-DLSR and supports further research and development in this
emerging area.
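The reconstruction-based formulation mentioned above is commonly written as follows. The symbols are notation chosen here, not taken from the abstract: the columns of X are the signals, D is the dictionary, the columns a_i of A are the sparse codes, and λ trades reconstruction error against sparsity. The ℓ1 penalty is one common sparsity-inducing choice and is an assumption.

```latex
\min_{D,\,A}\; \frac{1}{2}\,\lVert X - D A \rVert_F^{2}
  \;+\; \lambda \sum_{i=1}^{n} \lVert a_i \rVert_1,
\qquad X \in \mathbb{R}^{d \times n},\;
       D \in \mathbb{R}^{d \times k},\;
       A = [a_1, \dots, a_n] \in \mathbb{R}^{k \times n}
```

Nothing in this objective refers to class labels, which is why the supervised variants surveyed in the review add label information to the learning of D, of A, or of both.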
Local Label Propagation for Large-Scale Semi-Supervised Learning
A significant issue in training deep neural networks to solve supervised
learning tasks is the need for large numbers of labelled datapoints. The goal
of semi-supervised learning is to leverage ubiquitous unlabelled data, together
with small quantities of labelled data, to achieve high task performance.
Though substantial recent progress has been made in developing semi-supervised
algorithms that are effective for comparatively small datasets, many of these
techniques do not scale readily to the large (unlabelled) datasets
characteristic of real-world applications. In this paper we introduce a novel
approach to scalable semi-supervised learning, called Local Label Propagation
(LLP). Extending ideas from recent work on unsupervised embedding learning, LLP
first embeds datapoints, labelled and otherwise, in a common latent space using
a deep neural network. It then propagates pseudolabels from known to unknown
datapoints in a manner that depends on the local geometry of the embedding,
taking into account both inter-point distance and local data density as a
weighting on propagation likelihood. The parameters of the deep embedding are
then trained to simultaneously maximize pseudolabel categorization performance
as well as a metric of the clustering of datapoints within each pseudolabel
group, iteratively alternating stages of network training and label
propagation. We illustrate the utility of the LLP method on the ImageNet
dataset, achieving results that outperform previous state-of-the-art scalable
semi-supervised learning algorithms by large margins, consistently across a
wide variety of training regimes. We also show that the feature representation
learned with LLP transfers well to scene recognition in the Places 205 dataset.
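A toy sketch of the propagation step described above, assuming that each labelled neighbour votes with a weight that decays with its distance to the query and is divided by the local density around that neighbour. The exact weighting, the density estimate, and all names are assumptions made here; the abstract only says that both inter-point distance and local data density are taken into account.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def local_density(point, all_points, k):
    """Inverse mean distance to the k nearest other points (higher means denser)."""
    dists = sorted(distance(point, p) for p in all_points if p != point)[:k]
    return 1.0 / (sum(dists) / len(dists) + 1e-12)

def propagate_label(query, labelled, all_points, k=2):
    """Vote among the k nearest labelled embeddings.

    labelled is a list of (embedding, label) pairs.
    Returns (pseudolabel, confidence in [0, 1]).
    """
    nearest = sorted(labelled, key=lambda item: distance(query, item[0]))[:k]
    votes = {}
    for emb, label in nearest:
        # Closer neighbours count more; neighbours in dense regions count less.
        weight = math.exp(-distance(query, emb)) / local_density(emb, all_points, k)
        votes[label] = votes.get(label, 0.0) + weight
    best = max(votes, key=votes.get)
    return best, votes[best] / sum(votes.values())

labelled = [([0.0, 0.0], "a"), ([0.1, 0.0], "a"), ([1.0, 1.0], "b")]
unlabelled = [[0.2, 0.1], [0.9, 1.0]]
all_points = [emb for emb, _ in labelled] + unlabelled
label_1, conf_1 = propagate_label(unlabelled[0], labelled, all_points)
label_2, conf_2 = propagate_label(unlabelled[1], labelled, all_points)
```

In the method described above, the resulting pseudolabels would then feed the next round of embedding training; that alternation is not shown here.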
Improving Generalization via Scalable Neighborhood Component Analysis
Current major approaches to visual recognition follow an end-to-end
formulation that classifies an input image into one of the pre-determined set
of semantic categories. Parametric softmax classifiers are a common choice for
such a closed world with fixed categories, especially when big labeled data is
available during training. However, this becomes problematic for open-set
scenarios where new categories are encountered with very few examples for
learning a generalizable parametric classifier. We adopt a non-parametric
approach for visual recognition by optimizing feature embeddings instead of
parametric classifiers. We use a deep neural network to learn the visual
feature that preserves the neighborhood structure in the semantic space, based
on the Neighborhood Component Analysis (NCA) criterion. Because NCA is limited
by computational bottlenecks, we devise a mechanism that uses augmented memory to
scale NCA to large datasets and very deep networks. Our experiments deliver
not only remarkable performance on ImageNet classification for such a simple
non-parametric method, but most importantly a more generalizable feature
representation for sub-category discovery and few-shot recognition.
Comment: To appear in ECCV 201
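A minimal sketch of the classical NCA criterion referred to above, assuming squared Euclidean distances between feature vectors. It computes the leave-one-out probability that a point selects a neighbour of its own class; the function name and toy data are illustrative, and the augmented-memory mechanism that makes this scalable is not shown.

```python
import math

def nca_correct_probability(index, features, labels):
    """Probability that point `index` stochastically picks a same-class neighbour."""
    anchor = features[index]
    weights = []
    for j, other in enumerate(features):
        if j == index:
            continue  # leave-one-out: a point never selects itself
        sq_dist = sum((a - b) ** 2 for a, b in zip(anchor, other))
        weights.append((math.exp(-sq_dist), labels[j]))
    total = sum(w for w, _ in weights)
    same_class = sum(w for w, lab in weights if lab == labels[index])
    return same_class / total

features = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]]
labels = [0, 0, 1]
p_clustered = nca_correct_probability(0, features, labels)
p_isolated = nca_correct_probability(2, features, labels)
```

Training an embedding under this criterion means maximising that probability over all points, which requires features for the whole dataset and is the bottleneck the abstract mentions.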
Local Aggregation for Unsupervised Learning of Visual Embeddings
Unsupervised approaches to learning in neural networks are of substantial
interest for furthering artificial intelligence, both because they would enable
the training of networks without the need for large numbers of expensive
annotations, and because they would be better models of the kind of
general-purpose learning deployed by humans. However, unsupervised networks
have long lagged behind the performance of their supervised counterparts,
especially in the domain of large-scale visual recognition. Recent developments
in training deep convolutional embeddings to maximize non-parametric instance
separation and clustering objectives have shown promise in closing this gap.
Here, we describe a method that trains an embedding function to maximize a
metric of local aggregation, causing similar data instances to move together in
the embedding space, while allowing dissimilar instances to separate. This
aggregation metric is dynamic, allowing soft clusters of different scales to
emerge. We evaluate our procedure on several large-scale visual recognition
datasets, achieving state-of-the-art unsupervised transfer learning performance
on object recognition in ImageNet, scene recognition in Places 205, and object
detection in PASCAL VOC.
Unsupervised Feature Learning by Cross-Level Instance-Group Discrimination
Unsupervised feature learning has made great strides with contrastive
learning based on instance discrimination and invariant mapping, as benchmarked
on curated class-balanced datasets. However, natural data could be highly
correlated and long-tail distributed. Natural between-instance similarity
conflicts with the presumed instance distinction, causing unstable training and
poor performance.
Our idea is to discover and integrate between-instance similarity into
contrastive learning, not directly by instance grouping, but by cross-level
discrimination (CLD) between instances and local instance groups. While
invariant mapping of each instance is imposed by attraction within its
augmented views, between-instance similarity emerges from common repulsion
against instance groups.
Our batch-wise and cross-view comparisons also greatly improve the
positive/negative sample ratio of contrastive learning and achieve better
invariant mapping. To effect both grouping and discrimination objectives, we
impose them on features separately derived from a shared representation. In
addition, we propose normalized projection heads and unsupervised
hyper-parameter tuning for the first time.
Our extensive experimentation demonstrates that CLD is a lean and powerful
add-on to existing methods (e.g., NPID, MoCo, InfoMin, BYOL) on highly
correlated, long-tail, or balanced datasets. It not only achieves new
state-of-the-art on self-supervision, semi-supervision, and transfer learning
benchmarks, but also beats every reported MoCo v2 and SimCLR result, including
those attained with much larger compute. CLD effectively extends unsupervised
learning to natural data and brings it closer to real-world applications.
Comment: 10 page