Human-like Clustering with Deep Convolutional Neural Networks
Classification and clustering have been studied separately in machine
learning and computer vision. Inspired by the recent success of deep learning
models in solving various vision problems (e.g., object recognition, semantic
segmentation) and the fact that humans serve as the gold standard in assessing
clustering algorithms, here, we advocate for a unified treatment of the two
problems and suggest that hierarchical frameworks that progressively build
complex patterns on top of the simpler ones (e.g., convolutional neural
networks) offer a promising solution. We do not dwell much on the learning
mechanisms in these frameworks, as they are still a matter of debate with
respect to biological constraints. Instead, we emphasize the compositionality
of real-world structures and objects. In particular, we show that CNNs,
trained end-to-end using backpropagation with noisy labels, are able to cluster
data points belonging to several overlapping shapes, and do so much better than
state-of-the-art algorithms. The main takeaway from our study is that
mechanisms of human vision, particularly the hierarchical organization of the
visual ventral stream, should be taken into account in clustering algorithms
(e.g., for learning representations in an unsupervised manner or with minimal
supervision) to reach human-level clustering performance. This by no means
suggests that other methods lack merit. For example, methods relying on
pairwise affinities (e.g., spectral clustering) have been very successful in
many scenarios but still fail in some cases (e.g., overlapping clusters).
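The claim that a hierarchical network trained with noisy labels can pull apart overlapping shapes can be illustrated with a toy experiment. The sketch below is a minimal illustration, not the paper's architecture or data: two overlapping rings are rendered into an image and a small fully-convolutional network learns to assign each occupied pixel to one of the two shapes despite a fraction of flipped labels. The shape generator, network size, and noise rate are assumptions made for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

def two_overlapping_rings(n=2000, noise=0.05):
    """Sample points from two horizontally overlapping rings with ground-truth labels."""
    ang = np.random.rand(2 * n) * 2 * np.pi
    r = 1.0 + noise * np.random.randn(2 * n)
    pts = np.stack([r * np.cos(ang), r * np.sin(ang)], axis=1)
    pts[:n, 0] -= 0.5          # ring 0 shifted left
    pts[n:, 0] += 0.5          # ring 1 shifted right
    labels = np.concatenate([np.zeros(n, int), np.ones(n, int)])
    return pts, labels

def rasterize(pts, labels, size=64, flip_rate=0.2):
    """Render points into a 1-channel image and a per-pixel label map with label noise."""
    img = np.zeros((size, size), np.float32)
    lab = np.full((size, size), -1, np.int64)                 # -1 = background, ignored
    xy = ((pts + 2.0) / 4.0 * (size - 1)).astype(int).clip(0, size - 1)
    noisy = np.where(np.random.rand(len(labels)) < flip_rate, 1 - labels, labels)
    img[xy[:, 1], xy[:, 0]] = 1.0
    lab[xy[:, 1], xy[:, 0]] = noisy
    return img, lab

# tiny fully-convolutional network: point image -> per-pixel cluster logits
net = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 2, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-1)                # skip background pixels

for step in range(200):
    pts, labels = two_overlapping_rings()
    img, lab = rasterize(pts, labels)
    x = torch.from_numpy(img)[None, None]                     # (1, 1, H, W)
    y = torch.from_numpy(lab)[None]                           # (1, H, W)
    loss = loss_fn(net(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```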
Self-Supervised Learning of Face Representations for Video Face Clustering
Analyzing the story behind TV series and movies often requires understanding
who the characters are and what they are doing. With improving deep face
models, this may seem like a solved problem. However, as face detectors get
better, clustering/identification needs to be revisited to address increasing
diversity in facial appearance. In this paper, we address video face clustering
using unsupervised methods. Our emphasis is on distilling the essential
information, identity, from the representations obtained using deep pre-trained
face networks. We propose a self-supervised Siamese network that can be trained
without the need for video/track based supervision, and thus can also be
applied to image collections. We evaluate our proposed method on three video
face clustering datasets. The experiments show that our methods outperform
current state-of-the-art methods on all datasets. Video face clustering lacks
a common benchmark, as current works are often evaluated with different
metrics and/or different sets of face tracks.
Comment: To appear at International Conference on Automatic Face and Gesture
Recognition (2019) as an Oral. The datasets and code are available at
https://github.com/vivoutlaw/SSIA
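One plausible way to realize the described idea without video or track supervision is sketched below: pairs are mined from pretrained face features (nearest neighbours as pseudo-positives, distant samples as pseudo-negatives), a small Siamese embedding is trained with a contrastive loss, and the refined embeddings are clustered. The mining rule, embedding network, margin, and cluster count are illustrative assumptions, not the authors' exact scheme.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

def mine_pairs(feats):
    """Pseudo-positive = nearest neighbour, pseudo-negative = farthest sample."""
    d = pairwise_distances(feats)
    neg = d.argmax(1)                        # farthest sample as pseudo-negative
    np.fill_diagonal(d, np.inf)
    pos = d.argmin(1)                        # nearest neighbour as pseudo-positive
    return torch.from_numpy(pos), torch.from_numpy(neg)

def contrastive(a, b, same, margin=1.0):
    """Standard contrastive loss: pull positives together, push negatives past a margin."""
    dist = (a - b).pow(2).sum(1).sqrt()
    return (same * dist.pow(2) + (1 - same) * (margin - dist).clamp(min=0).pow(2)).mean()

embed = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(embed.parameters(), lr=1e-3)

feats = np.random.randn(500, 256).astype(np.float32)   # stand-in for pretrained face features
x = torch.from_numpy(feats)
pos, neg = mine_pairs(feats)

for _ in range(100):
    za, zp, zn = embed(x), embed(x[pos]), embed(x[neg])
    loss = contrastive(za, zp, torch.ones(len(x))) + contrastive(za, zn, torch.zeros(len(x)))
    opt.zero_grad(); loss.backward(); opt.step()

# cluster the refined embeddings (the number of identities is assumed known here)
with torch.no_grad():
    ids = AgglomerativeClustering(n_clusters=10).fit_predict(embed(x).numpy())
```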
What is the right way to represent document images?
In this article we study the problem of document image representation based
on visual features. We propose a comprehensive experimental study that compares
three types of visual document image representations: (1) traditional so-called
shallow features, such as the RunLength and the Fisher-Vector descriptors, (2)
deep features based on Convolutional Neural Networks, and (3) features
extracted from hybrid architectures that take inspiration from the two previous
ones.
We evaluate these features in several tasks (i.e. classification, clustering,
and retrieval) and in different setups (e.g. domain transfer) using several
public and in-house datasets. Our results show that deep features generally
outperform other types of features when there is no domain shift and the new
task is closely related to the one used to train the model. However, when a
large domain or task shift is present, the Fisher-Vector shallow features
generalize better and often obtain the best results.
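The domain-transfer setup mentioned above boils down to a simple protocol: extract a fixed representation, train a light classifier on one document collection, and test it on another. The sketch below uses random stand-in feature matrices and a linear SVM purely to make that protocol concrete; the actual study would plug in RunLength/Fisher-Vector and CNN descriptors in place of the random arrays.

```python
import numpy as np
from sklearn.svm import LinearSVC

def transfer_accuracy(X_src, y_src, X_tgt, y_tgt):
    """Fit a linear classifier on the source domain, score it on the shifted target domain."""
    return LinearSVC(C=1.0).fit(X_src, y_src).score(X_tgt, y_tgt)

rng = np.random.default_rng(0)
# random stand-ins; in practice these would be RunLength/Fisher-Vector or CNN descriptors
X_src, y_src = rng.normal(size=(400, 128)), rng.integers(0, 5, 400)
X_tgt, y_tgt = rng.normal(size=(200, 128)) + 0.5, rng.integers(0, 5, 200)   # simulated shift
print("transfer accuracy:", transfer_accuracy(X_src, y_src, X_tgt, y_tgt))
# repeating this with each feature type makes the shallow-vs-deep comparison explicit
```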
A survey on trajectory clustering analysis
This paper comprehensively surveys the development of trajectory clustering.
Considering the critical role of trajectory data mining in modern intelligent
systems for surveillance security, abnormal behavior detection, crowd behavior
analysis, and traffic control, trajectory clustering has attracted growing
attention. Existing trajectory clustering methods can be grouped into three
categories: unsupervised, supervised, and semi-supervised algorithms. Despite
a certain level of development, the success of trajectory clustering remains
limited by complex conditions such as varied application scenarios and data
dimensions. This paper provides a holistic understanding of and deep insight
into trajectory clustering, and presents a comprehensive analysis of
representative methods and promising future directions.
Unsupervised Category Discovery via Looped Deep Pseudo-Task Optimization Using a Large Scale Radiology Image Database
Obtaining semantic labels for a large-scale radiology image database (215,786
key images from 61,845 unique patients) is a prerequisite for training highly
effective deep convolutional neural network (CNN) models for image recognition,
yet it remains a bottleneck. However, conventional methods for collecting image labels
(e.g., Google search followed by crowd-sourcing) are not applicable due to the
formidable difficulties of medical annotation tasks for those who are not
clinically trained. This type of image labeling task remains non-trivial even
for radiologists due to uncertainty and possible drastic inter-observer
variation or inconsistency.
In this paper, we present a looped deep pseudo-task optimization procedure
for automatic category discovery of visually coherent and clinically semantic
(concept) clusters. Our system can be initialized by domain-specific (CNN
trained on radiology images and text report derived labels) or generic
(ImageNet-based) CNN models. Afterwards, a sequence of pseudo-tasks is
exploited, alternating between looped deep image feature clustering (to refine
image labels) and deep CNN training/classification using the new labels (to
obtain more task-representative deep features). Our method is conceptually
simple and based on the hypothesized "convergence" of better labels leading to
better-trained CNN models, which in turn feed more effective deep image
features to facilitate more meaningful clustering/labels. We have empirically
validated the convergence and
demonstrated promising quantitative and qualitative results. Category labels of
significantly higher quality than those in previous work are discovered. This
allows for further investigation of the hierarchical semantic nature of the
given large-scale radiology image database.
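The looped procedure can be summarized as: extract deep features, cluster them to obtain pseudo-labels, retrain the network on those labels, and repeat until the assignments stop changing. The sketch below captures that loop in miniature; the linear stand-in backbone, the cluster count, and the convergence test are illustrative assumptions rather than the authors' actual system.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())    # stand-in for a CNN trunk
head = nn.Linear(256, 50)                                    # 50 assumed pseudo-categories
opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(2000, 512)            # stand-in for pooled image features from the database
prev = None
for loop in range(10):
    with torch.no_grad():
        feats = backbone(x).numpy()
    labels = KMeans(n_clusters=50, n_init=10).fit_predict(feats)   # refine pseudo-labels
    if prev is not None and adjusted_rand_score(prev, labels) > 0.99:
        break                          # label assignments have (approximately) converged
    prev = labels
    y = torch.from_numpy(labels).long()
    for _ in range(50):                # retrain the classifier on the current pseudo-labels
        loss = nn.functional.cross_entropy(head(backbone(x)), y)
        opt.zero_grad(); loss.backward(); opt.step()
```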
Unsupervised Person Re-identification: Clustering and Fine-tuning
The superiority of deeply learned pedestrian representations has been
reported in very recent literature of person re-identification (re-ID). In this
paper, we consider the more pragmatic issue of learning a deep feature with no
or only a few labels. We propose a progressive unsupervised learning (PUL)
method to transfer pretrained deep representations to unseen domains. Our
method is easy to implement and can be viewed as an effective baseline for
unsupervised re-ID feature learning. Specifically, PUL iterates between 1)
pedestrian clustering and 2) fine-tuning of the convolutional neural network
(CNN) to improve the original model trained on the irrelevant labeled dataset.
Since the clustering results can be very noisy, we add a selection operation
between the clustering and fine-tuning. At the beginning, when the model is
weak, the CNN is fine-tuned on a small number of reliable examples that lie
near cluster centroids in the feature space. As the model becomes stronger in
subsequent iterations, more images are adaptively selected as CNN training
samples. Progressively, pedestrian clustering and the CNN model are improved
simultaneously until the algorithm converges. This process is naturally
formulated as self-paced learning. We then point out promising directions that
may lead to further improvement. Extensive experiments on three large-scale
re-ID datasets demonstrate that PUL outputs discriminative features that
improve the re-ID accuracy.
Comment: Add more results, parameter analysis and comparison
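The clustering / selection / fine-tuning loop described above is sketched below in miniature (not the authors' full re-ID pipeline): current features are clustered with k-means, only samples close to their centroid are kept as reliable, the model is fine-tuned on those, and the selection threshold is loosened each round. The distance-threshold schedule, feature dimensions, and model are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

model = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 100))
feat_layer = model[:2]                        # penultimate activations act as the feature
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(3000, 2048)                   # stand-in for pretrained pedestrian features
for it in range(8):
    with torch.no_grad():
        f = feat_layer(x).numpy()
    km = KMeans(n_clusters=100, n_init=10).fit(f)
    dist = np.linalg.norm(f - km.cluster_centers_[km.labels_], axis=1)
    thresh = np.quantile(dist, 0.3 + 0.05 * it)    # self-paced: admit more samples each round
    keep = dist < thresh
    xs = x[torch.from_numpy(keep)]
    ys = torch.from_numpy(km.labels_[keep]).long()
    for _ in range(50):                       # fine-tune on the reliable subset only
        loss = nn.functional.cross_entropy(model(xs), ys)
        opt.zero_grad(); loss.backward(); opt.step()
```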
Unsupervised learning from videos using temporal coherency deep networks
In this work we address the challenging problem of unsupervised learning from
videos. Existing methods utilize the spatio-temporal continuity in contiguous
video frames as regularization for the learning process. Typically, this
temporal coherence of close frames is used as a free form of annotation,
encouraging the learned representations to exhibit small differences between
these frames. However, this type of approach fails to capture the dissimilarity
between videos with different content, and hence learns less discriminative
features. Here we propose two Siamese architectures for convolutional neural
networks, together with corresponding novel loss functions, that learn from
unlabeled videos by jointly exploiting the local temporal coherence between
contiguous frames and a global discriminative margin separating representations
of different videos. An extensive experimental evaluation is presented, where we
validate the proposed models on various tasks. First, we show how the learned
features can be used to discover actions and scenes in video collections.
Second, we show the benefits of such unsupervised learning from unlabeled
videos alone: the learned representations can be used directly as a prior for
supervised recognition of actions and objects in images, and our results
further show that our features can even surpass a traditional, heavily
supervised pre-training plus fine-tuning strategy.
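A loss in the spirit described above can be written down compactly: embeddings of temporally adjacent frames are pulled together, while embeddings of frames from different videos are pushed apart by at least a margin. The sketch below is one such formulation under assumed input features and margin, not the paper's exact pair of architectures or loss functions.

```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 128))

def coherence_margin_loss(frame_t, frame_t1, frame_other, margin=1.0):
    """frame_t / frame_t1: adjacent frames of one video; frame_other: frames of other videos."""
    za, zp, zn = embed(frame_t), embed(frame_t1), embed(frame_other)
    coherence = (za - zp).pow(2).sum(1)                          # local term: adjacent frames stay close
    separation = (margin - (za - zn).pow(2).sum(1).sqrt()).clamp(min=0).pow(2)  # global margin term
    return (coherence + separation).mean()

# toy batch: pooled frame features for adjacent frames and frames from unrelated videos
a, b, c = torch.randn(32, 1024), torch.randn(32, 1024), torch.randn(32, 1024)
loss = coherence_margin_loss(a, b, c)
loss.backward()
```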
Unsupervised Learning using Pretrained CNN and Associative Memory Bank
Deep convolutional features extracted from a comprehensive labeled dataset
contain rich representations that can be used effectively in a new domain.
Although such generic features achieve good results in many visual tasks,
fine-tuning is required for pretrained deep CNN models to become more
effective and provide state-of-the-art performance. Fine-tuning with the
backpropagation algorithm in a supervised setting is a time- and
resource-consuming process. In this paper, we present a new architecture and
an approach for unsupervised object recognition that addresses the
above-mentioned fine-tuning problem of pretrained CNN-based supervised deep
learning approaches while allowing automated feature extraction. Unlike
existing works, our approach is applicable to general object recognition
tasks. It uses a
pretrained (on a related domain) CNN model for automated feature extraction
pipelined with a Hopfield-network-based associative memory bank that stores
patterns for classification. The use of an associative memory bank in our
framework eliminates backpropagation while providing competitive performance
on an unseen dataset.
Comment: Paper was accepted at the 2018 International Joint Conference on
Neural Networks (IJCNN 2018).
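The role of the associative memory bank can be illustrated with a classical binary Hopfield network: binarized feature vectors are stored with a Hebbian rule, and a corrupted probe is driven back to the nearest stored pattern, which then serves as the class decision. This is a generic sketch of the idea, not the authors' architecture; the binarization rule, pattern set, and retrieval schedule are assumptions.

```python
import numpy as np

class HopfieldMemory:
    def __init__(self, dim):
        self.W = np.zeros((dim, dim))

    def store(self, patterns):
        """Hebbian outer-product rule over +/-1 patterns."""
        for p in patterns:
            self.W += np.outer(p, p)
        np.fill_diagonal(self.W, 0)

    def recall(self, probe, steps=20):
        s = probe.copy()
        for _ in range(steps):                     # synchronous updates until stable
            nxt = np.sign(self.W @ s)
            nxt[nxt == 0] = 1
            if np.array_equal(nxt, s):
                break
            s = nxt
        return s

def binarize(feat):
    """Binarize a real-valued CNN feature vector around its median."""
    return np.where(feat > np.median(feat), 1.0, -1.0)

rng = np.random.default_rng(0)
protos = [binarize(rng.normal(size=256)) for _ in range(5)]   # stand-ins for class features
mem = HopfieldMemory(256)
mem.store(protos)

probe = protos[2].copy()
probe[:40] *= -1                                   # corrupt part of the stored pattern
recovered = mem.recall(probe)
pred = int(np.argmax([p @ recovered for p in protos]))        # nearest stored pattern = class
print(pred)                                        # should typically recover pattern 2
```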
Unsupervised learning of object semantic parts from internal states of CNNs by population encoding
We address the key question of how object part representations can be found
from the internal states of CNNs that are trained for high-level tasks, such as
object classification. This work provides a new unsupervised method to learn
semantic parts and gives new understanding of the internal representations of
CNNs. Our technique is based on the hypothesis that semantic parts are
represented by populations of neurons rather than by single filters. We propose
a clustering technique to extract part representations, which we call Visual
Concepts. We show that visual concepts are semantically coherent in that they
represent semantic parts, and visually coherent in that corresponding image
patches appear very similar. Also, visual concepts provide full spatial
coverage of the parts of an object, rather than a few sparse parts as is
typically found in keypoint annotations. Furthermore, we treat each single
visual concept as a part detector and evaluate it for keypoint detection using
the
PASCAL3D+ dataset and for part detection using our newly annotated ImageNetPart
dataset. The experiments demonstrate that visual concepts can be used to detect
parts. We also show that some visual concepts respond to several semantic
parts, provided these parts are visually similar. Thus visual concepts have
two essential properties: semantic meaning and detection capability. Note that
our ImageNetPart dataset gives rich part annotations that cover the whole
object, making it useful for other part-related applications.
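The population-encoding hypothesis lends itself to a simple procedure: collect the vector of channel responses at every spatial position of an intermediate layer over many images, cluster those vectors, and treat each centroid as a candidate visual concept that can then act as a detector. The sketch below shows that clustering step with an untrained VGG backbone and assumed layer and cluster choices; it is an illustration of the general pipeline, not the paper's full method.

```python
import torch
import torchvision.models as models
from sklearn.cluster import KMeans

# intermediate layers of a VGG-style backbone (pretrained weights would normally be loaded)
backbone = models.vgg16(weights=None).features[:17]
backbone.eval()

def population_vectors(images):
    """One C-dimensional population response per spatial position of the feature map."""
    with torch.no_grad():
        fmap = backbone(images)                          # (N, C, H, W)
    n, c, h, w = fmap.shape
    return fmap.permute(0, 2, 3, 1).reshape(-1, c).numpy(), (h, w)

images = torch.randn(8, 3, 224, 224)                     # stand-in for real image batches
vecs, _ = population_vectors(images)
km = KMeans(n_clusters=64, n_init=10).fit(vecs)          # 64 candidate visual concepts

# detection: which concept responds at each position of a new image
new_vecs, (h, w) = population_vectors(torch.randn(1, 3, 224, 224))
concept_map = km.predict(new_vecs).reshape(h, w)
```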
Semi-supervised Clustering for Short Text via Deep Representation Learning
In this work, we propose a semi-supervised method for short text clustering,
where we represent texts as distributed vectors with neural networks, and use a
small amount of labeled data to specify our intention for clustering. We design
a novel objective to combine the representation learning process and the
k-means clustering process together, and optimize the objective with both
labeled data and unlabeled data iteratively until convergence through three
steps: (1) assign each short text to its nearest centroid based on its
representation from the current neural networks; (2) re-estimate the cluster
centroids based on cluster assignments from step (1); (3) update neural
networks according to the objective by keeping centroids and cluster
assignments fixed. Experimental results on four datasets show that our method
works significantly better than several other text clustering methods.
Comment: In Proceedings of CoNLL 201
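The three-step alternation can be made concrete with a toy encoder standing in for the paper's text network, as sketched below: assign each embedding to its nearest centroid, re-estimate centroids from the current assignments (labeled texts keep their given cluster), then update the encoder toward the fixed centroids. The encoder, dimensions, and loss form are illustrative assumptions.

```python
import torch
import torch.nn as nn

K, D = 4, 300                                  # assumed cluster count and input dimension
encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

x_unlab = torch.randn(500, D)                  # stand-in for unlabeled short-text vectors
x_lab = torch.randn(40, D)                     # small labeled set specifying the intention
y_lab = torch.randint(0, K, (40,))

centroids = torch.randn(K, 64)
for it in range(20):
    with torch.no_grad():
        z = encoder(x_unlab)
        assign = torch.cdist(z, centroids).argmin(1)       # step 1: nearest-centroid assignment
        z_all = torch.cat([z, encoder(x_lab)])
        y_all = torch.cat([assign, y_lab])
        for k in range(K):                                  # step 2: re-estimate centroids
            if (y_all == k).any():
                centroids[k] = z_all[y_all == k].mean(0)
    for _ in range(30):                                     # step 3: update encoder,
        loss = ((encoder(x_unlab) - centroids[assign]) ** 2).sum(1).mean() \
             + ((encoder(x_lab) - centroids[y_lab]) ** 2).sum(1).mean()   # centroids held fixed
        opt.zero_grad(); loss.backward(); opt.step()
```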