Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination
Neural net classifiers trained on data with annotated class labels can also
capture apparent visual similarity among categories without being directed to
do so. We study whether this observation can be extended beyond the
conventional domain of supervised learning: Can we learn a good feature
representation that captures apparent similarity among instances, instead of
classes, by merely asking the feature to be discriminative of individual
instances? We formulate this intuition as a non-parametric classification
problem at the instance-level, and use noise-contrastive estimation to tackle
the computational challenges imposed by the large number of instance classes.
Our experimental results demonstrate that, under unsupervised learning
settings, our method surpasses the state-of-the-art on ImageNet classification
by a large margin. Our method is also remarkable for consistently improving
test performance with more training data and better network architectures. By
fine-tuning the learned feature, we further obtain competitive results for
semi-supervised learning and object detection tasks. Our non-parametric model
is highly compact: With 128 features per image, our method requires only 600MB
storage for a million images, enabling fast nearest neighbour retrieval at the
run time. Comment: CVPR 2018 spotlight paper. Code:
https://github.com/zhirongw/lemniscate.pytorc
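The non-parametric instance-level classifier described in this abstract can be illustrated with a minimal sketch. It assumes a small in-memory list of stored features (called `memory_bank` here, a hypothetical name) and a temperature of 0.07 chosen for illustration. It computes the full softmax over instances, whereas the abstract states that noise-contrastive estimation is used to avoid that cost at scale.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def instance_probabilities(query, memory_bank, temperature=0.07):
    """Softmax over instances: each stored feature acts as its own class.

    `temperature` is an illustrative assumption, not a value taken from
    the abstract.
    """
    q = l2_normalize(query)
    logits = [
        sum(a * b for a, b in zip(q, l2_normalize(stored))) / temperature
        for stored in memory_bank
    ]
    # Subtract the maximum logit for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy memory bank with three stored instance features.
memory_bank = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = instance_probabilities([0.9, 0.1], memory_bank)
best_instance = probs.index(max(probs))
```

In this toy example the query is closest to the first stored feature, so that instance receives the highest probability.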
Prototypical Contrastive Learning of Unsupervised Representations
This paper presents Prototypical Contrastive Learning (PCL), an unsupervised
representation learning method that addresses the fundamental limitations of
instance-wise contrastive learning. PCL not only learns low-level features for
the task of instance discrimination, but more importantly, it implicitly
encodes semantic structures of the data into the learned embedding space.
Specifically, we introduce prototypes as latent variables to help find the
maximum-likelihood estimation of the network parameters in an
Expectation-Maximization framework. We iteratively perform the E-step, finding
the distribution of prototypes via clustering, and the M-step, optimizing the
network via contrastive learning. We propose the ProtoNCE loss, a generalized
version of the InfoNCE loss for contrastive learning, which encourages
representations to be closer to their assigned prototypes. PCL outperforms
state-of-the-art instance-wise contrastive learning methods on multiple
benchmarks with substantial improvement in low-resource transfer learning. Code
and pretrained models are available at https://github.com/salesforce/PCL
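As a rough illustration of the idea of pulling representations toward their assigned prototypes, the sketch below pairs a nearest-prototype assignment with a softmax-style prototype loss. It is not the ProtoNCE loss itself: the function names, the fixed temperature of 0.1, and the omission of any instance-wise term are all simplifying assumptions made for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def assign_prototypes(embeddings, prototypes):
    """Assign each embedding to the index of its most similar prototype."""
    return [
        max(range(len(prototypes)), key=lambda k: dot(e, prototypes[k]))
        for e in embeddings
    ]


def prototype_contrastive_loss(embeddings, prototypes, assignments,
                               temperature=0.1):
    """Mean cross-entropy of each embedding against its assigned prototype.

    Simplified stand-in for a prototype-level contrastive term; the
    temperature is an illustrative assumption.
    """
    total = 0.0
    for e, k in zip(embeddings, assignments):
        logits = [dot(e, p) / temperature for p in prototypes]
        max_logit = max(logits)
        log_norm = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_norm - logits[k]
    return total / len(embeddings)


embeddings = [[1.0, 0.0], [0.0, 1.0]]
prototypes = [[0.9, 0.1], [0.1, 0.9]]
assignments = assign_prototypes(embeddings, prototypes)
loss_assigned = prototype_contrastive_loss(embeddings, prototypes, assignments)
loss_swapped = prototype_contrastive_loss(embeddings, prototypes, [1, 0])
```

The loss is lower when each embedding is scored against its nearest prototype than when the assignments are swapped, which is the behaviour the clustering step is meant to exploit.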
Learning Spatiotemporal Features via Video and Text Pair Discrimination
Current video representations heavily rely on learning from manually
annotated video datasets which are time-consuming and expensive to acquire. We
observe that videos are naturally accompanied by abundant text information such as
YouTube titles and Instagram captions. In this paper, we leverage this
visual-textual connection to learn spatiotemporal features in an efficient
weakly-supervised manner. We present a general cross-modal pair discrimination
(CPD) framework to capture this correlation between a video and its associated
text. Specifically, we adopt noise-contrastive estimation to tackle the
computational issue imposed by the huge amount of pair instance classes and
design a practical curriculum learning strategy. We train our CPD models on
both a standard video dataset (Kinetics-210k) and an uncurated web video dataset
(Instagram-300k) to demonstrate its effectiveness. Without further fine-tuning,
the learnt models obtain competitive results for action classification on
Kinetics under the linear classification protocol. Moreover, our visual model
provides an effective initialization to fine-tune on downstream tasks, which
yields a remarkable performance gain for action recognition on UCF101 and
HMDB51, compared with the existing state-of-the-art self-supervised training
methods. In addition, our CPD model yields a new state of the art for zero-shot
action recognition on UCF101 by directly utilizing the learnt visual-textual
embeddings. The code will be made available at
https://github.com/MCG-NJU/CPD-Video. Comment: Technical Report
Unsupervised Semantic-based Aggregation of Deep Convolutional Features
In this paper, we propose a simple but effective semantic-based aggregation
(SBA) method. The proposed SBA utilizes the discriminative filters of deep
convolutional layers as semantic detectors. Moreover, we propose an effective
unsupervised strategy that selects some semantic detectors to generate
"probabilistic proposals", which highlight certain discriminative patterns of
objects and suppress background noise. The final global SBA
representation could then be acquired by aggregating the regional
representations weighted by the selected "probabilistic proposals"
corresponding to various semantic content. Our unsupervised SBA is easy to
generalize and achieves excellent performance on various tasks. We conduct
comprehensive experiments and show that our unsupervised SBA outperforms the
state-of-the-art unsupervised and supervised aggregation methods on image
retrieval, place recognition and cloud classification. Comment: 10 pages. arXiv admin note: text overlap with arXiv:1705.0124
Supervised Dictionary Learning and Sparse Representation-A Review
Dictionary learning and sparse representation (DLSR) is a recent and
successful mathematical model for data representation that achieves
state-of-the-art performance in various fields such as pattern recognition,
machine learning, computer vision, and medical imaging. The original
formulation for DLSR is based on the minimization of the reconstruction error
between the original signal and its sparse representation in the space of the
learned dictionary. Although this formulation is optimal for solving problems
such as denoising, inpainting, and coding, it may not lead to an optimal solution
in classification tasks, where the ultimate goal is to make the learned
dictionary and corresponding sparse representation as discriminative as
possible. This motivated the emergence of a new category of techniques, which
is appropriately called supervised dictionary learning and sparse
representation (S-DLSR), leading to a better dictionary and sparse
representation in classification tasks. Despite many research efforts for
S-DLSR, the literature lacks a comprehensive view of these techniques, their
connections, advantages and shortcomings. In this paper, we address this gap
and provide a review of the recently proposed algorithms for S-DLSR. We first
present a taxonomy of these algorithms into six categories based on the
approach taken to include label information into the learning of the dictionary
and/or sparse representation. For each category, we draw connections between
the algorithms in this category and present a unified framework for them. We
then provide guidelines for applied researchers on how to represent and learn
the building blocks of an S-DLSR solution based on the problem at hand. This
review provides a broad, yet deep, view of the state-of-the-art methods for
S-DLSR and supports further research and development in this
emerging area.
Unsupervised Feature Learning by Cross-Level Instance-Group Discrimination
Unsupervised feature learning has made great strides with contrastive
learning based on instance discrimination and invariant mapping, as benchmarked
on curated class-balanced datasets. However, natural data could be highly
correlated and long-tail distributed. Natural between-instance similarity
conflicts with the presumed instance distinction, causing unstable training and
poor performance.
Our idea is to discover and integrate between-instance similarity into
contrastive learning, not directly by instance grouping, but by cross-level
discrimination (CLD) between instances and local instance groups. While
invariant mapping of each instance is imposed by attraction within its
augmented views, between-instance similarity emerges from common repulsion
against instance groups.
Our batch-wise and cross-view comparisons also greatly improve the
positive/negative sample ratio of contrastive learning and achieve better
invariant mapping. To effect both grouping and discrimination objectives, we
impose them on features separately derived from a shared representation. In
addition, we propose normalized projection heads and unsupervised
hyper-parameter tuning for the first time.
Our extensive experimentation demonstrates that CLD is a lean and powerful
add-on to existing methods (e.g., NPID, MoCo, InfoMin, BYOL) on highly
correlated, long-tail, or balanced datasets. It not only achieves new
state-of-the-art on self-supervision, semi-supervision, and transfer learning
benchmarks, but also beats every reported result of MoCo v2 and SimCLR, even
though those were attained with much larger compute. CLD effectively extends unsupervised
learning to natural data and brings it closer to real-world applications. Comment: 10 pages
Data-Efficient Image Recognition with Contrastive Predictive Coding
Human observers can learn to recognize new categories of images from a
handful of examples, yet doing so with artificial ones remains an open
challenge. We hypothesize that data-efficient recognition is enabled by
representations which make the variability in natural signals more predictable.
We therefore revisit and improve Contrastive Predictive Coding, an unsupervised
objective for learning such representations. This new implementation produces
features which support state-of-the-art linear classification accuracy on the
ImageNet dataset. When used as input for non-linear classification with deep
neural networks, this representation allows us to use 2-5x fewer labels than
classifiers trained directly on image pixels. Finally, this unsupervised
representation substantially improves transfer learning to object detection on
the PASCAL VOC dataset, surpassing fully supervised pre-trained ImageNet
classifiers.
Improving Generalization via Scalable Neighborhood Component Analysis
Current major approaches to visual recognition follow an end-to-end
formulation that classifies an input image into one of the pre-determined set
of semantic categories. Parametric softmax classifiers are a common choice for
such a closed world with fixed categories, especially when big labeled data is
available during training. However, this becomes problematic for open-set
scenarios where new categories are encountered with very few examples for
learning a generalizable parametric classifier. We adopt a non-parametric
approach for visual recognition by optimizing feature embeddings instead of
parametric classifiers. We use a deep neural network to learn the visual
feature that preserves the neighborhood structure in the semantic space, based
on the Neighborhood Component Analysis (NCA) criterion. Because NCA has
computational bottlenecks, we devise a mechanism that uses augmented memory to
scale NCA for large datasets and very deep networks. Our experiments deliver
not only remarkable performance on ImageNet classification for such a simple
non-parametric method, but most importantly a more generalizable feature
representation for sub-category discovery and few-shot recognition. Comment: To appear in ECCV 201
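The NCA criterion mentioned in this abstract can be sketched in a few lines. The version below is a plain, exhaustive formulation under stated assumptions: similarities are dot products scaled by a temperature of 0.1 (an illustrative choice), and every pair is compared directly, without the augmented memory the abstract describes for scaling.

```python
import math


def nca_loss(embeddings, labels, temperature=0.1):
    """Negative log-probability that each point picks a same-label neighbour.

    Each point i selects neighbour j (j != i) with probability proportional
    to exp(similarity(i, j) / temperature). The temperature is an
    illustrative assumption.
    """
    n = len(embeddings)
    losses = []
    for i in range(n):
        # Similarities to every other point; the point itself is excluded.
        sims = {
            j: sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            / temperature
            for j in range(n)
            if j != i
        }
        max_sim = max(sims.values())
        weights = {j: math.exp(s - max_sim) for j, s in sims.items()}
        same = sum(w for j, w in weights.items() if labels[j] == labels[i])
        if same == 0.0:
            # No same-label neighbour (or negligible weight): skip this point.
            continue
        losses.append(-math.log(same / sum(weights.values())))
    return sum(losses) / len(losses) if losses else 0.0


points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_consistent = nca_loss(points, [0, 0, 1, 1])
loss_mixed = nca_loss(points, [0, 1, 0, 1])
```

When labels agree with the neighbourhood structure of the embedding the loss is small, and it grows when same-label points are far apart.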
Transfer Adaptation Learning: A Decade Survey
The world we see is ever-changing and it always changes with people, things,
and the environment. A domain refers to the state of the world at a
certain moment. A research problem is characterized as transfer adaptation
learning (TAL) when it needs knowledge correspondence between different
moments/domains. Conventional machine learning aims to find a model with the
minimum expected risk on test data by minimizing the regularized empirical risk
on the training data, which, however, supposes that the training and test data
share a similar joint probability distribution. TAL aims to build models that can
perform tasks in a target domain by learning knowledge from a semantically related
but differently distributed source domain. It is an active research field of
increasing influence and importance, with a rapidly growing number of
publications. This paper surveys the advances of TAL methodologies in the past decade,
and the technical challenges and essential problems of TAL have been observed
and discussed with deep insights and new perspectives. Broader solutions of
transfer adaptation learning being created by researchers are identified, i.e.,
instance re-weighting adaptation, feature adaptation, classifier adaptation,
deep network adaptation and adversarial adaptation, which are beyond the early
semi-supervised and unsupervised split. The survey helps researchers rapidly
but comprehensively understand and identify the research foundation, research
status, theoretical limitations, future challenges and under-studied issues
(universality, interpretability, and credibility) to be broken in the field
toward universal representation and safe applications in open-world scenarios. Comment: 26 pages, 4 figures
Local Aggregation for Unsupervised Learning of Visual Embeddings
Unsupervised approaches to learning in neural networks are of substantial
interest for furthering artificial intelligence, both because they would enable
the training of networks without the need for large numbers of expensive
annotations, and because they would be better models of the kind of
general-purpose learning deployed by humans. However, unsupervised networks
have long lagged behind the performance of their supervised counterparts,
especially in the domain of large-scale visual recognition. Recent developments
in training deep convolutional embeddings to maximize non-parametric instance
separation and clustering objectives have shown promise in closing this gap.
Here, we describe a method that trains an embedding function to maximize a
metric of local aggregation, causing similar data instances to move together in
the embedding space, while allowing dissimilar instances to separate. This
aggregation metric is dynamic, allowing soft clusters of different scales to
emerge. We evaluate our procedure on several large-scale visual recognition
datasets, achieving state-of-the-art unsupervised transfer learning performance
on object recognition in ImageNet, scene recognition in Places 205, and object
detection in PASCAL VOC.