Fast and Accurate Person Re-Identification with RMNet
In this paper we introduce a new neural network architecture designed for use
in embedded vision applications. It merges the best practices of network
architectures such as MobileNets and ResNets into our architecture, which we
name RMNet. We also focus on the key aspects of building mobile architectures
that must operate within a limited computation budget. Additionally, to
demonstrate the effectiveness of our architecture, we evaluate the RMNet
backbone on the person re-identification task. The proposed approach is among
the top three state-of-the-art solutions on the Market-1501 challenge, while
significantly outperforming them in inference speed.
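For illustration, a minimal PyTorch sketch of the kind of building block such a MobileNet/ResNet hybrid implies: a ResNet-style identity shortcut wrapped around a MobileNet-style depthwise separable convolution. The block layout and channel counts are assumptions for exposition, not the authors' exact RMNet design.

import torch
import torch.nn as nn

class MobileResBlock(nn.Module):
    # Hypothetical block: depthwise + pointwise convolution (MobileNet idea)
    # inside an identity shortcut (ResNet idea).
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn(self.pointwise(self.depthwise(x))))
        return x + out  # residual connection preserves the input signal

x = torch.randn(1, 32, 64, 64)
print(MobileResBlock(32)(x).shape)  # torch.Size([1, 32, 64, 64])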
Improved Hard Example Mining by Discovering Attribute-based Hard Person Identity
In this paper, we propose Hard Person Identity Mining (HPIM), which refines
hard example mining to improve exploration efficacy in person
re-identification. It is motivated by the following observation: the more
attributes two people share, the more difficult it is to separate their
identities. Based on this observation, we develop HPIM via a transferred
attribute describer, a deep multi-attribute classifier trained on noisy source
person-attribute datasets. We encode each image in the target person re-ID
dataset into a probabilistic attribute description. In this attribute code
space, we then treat each person as a distribution that generates
view-specific attribute codes under different practical scenarios. We estimate
person-specific statistical moments from the zeroth to higher orders, which
are used to calculate the central moment discrepancies between persons. These
discrepancies provide a basis for choosing hard identities to organize
suitable mini-batches, without being affected by the changing person
representations during metric learning. HPIM thus serves as a complementary
tool to hard example mining, helping to explore global rather than merely
local hard-example constraints within mini-batches built from randomly sampled
identities. Extensive experiments on two person re-identification benchmarks
validate the effectiveness of the proposed algorithm.
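As a rough illustration of the central-moment comparison described above, the NumPy sketch below estimates per-person moments of attribute codes and measures a pairwise discrepancy; smaller discrepancies would flag harder identity pairs. The moment orders and the aggregation are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def central_moments(codes, max_order=3):
    # codes: (n_images, n_attributes) attribute probabilities for one person.
    # The trivial zeroth moment is omitted; this sketch starts from the mean.
    mean = codes.mean(axis=0)
    moments = [mean]
    for k in range(2, max_order + 1):
        moments.append(((codes - mean) ** k).mean(axis=0))
    return np.concatenate(moments)

def moment_discrepancy(codes_a, codes_b, max_order=3):
    # Small discrepancy -> similar attribute statistics -> hard identity pair.
    return np.linalg.norm(central_moments(codes_a, max_order)
                          - central_moments(codes_b, max_order))

person_a = np.random.rand(10, 8)  # 10 images, 8 attribute probabilities each
person_b = np.random.rand(12, 8)
print(moment_discrepancy(person_a, person_b))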
In Defense of the Triplet Loss for Person Re-Identification
In the past few years, the field of computer vision has gone through a
revolution fueled mainly by the advent of large datasets and the adoption of
deep convolutional neural networks for end-to-end learning. The person
re-identification subfield is no exception to this. Unfortunately, a prevailing
belief in the community seems to be that the triplet loss is inferior to using
surrogate losses (classification, verification) followed by a separate metric
learning step. We show that, for models trained from scratch as well as
pretrained ones, using a variant of the triplet loss to perform end-to-end deep
metric learning outperforms most other published methods by a large margin.
Comment: Lucas Beyer and Alexander Hermans contributed equally. Updates: minor fixes, new SOTA comparisons, added CUHK03 results.
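The variant in question is often implemented as "batch-hard" mining: for each anchor, the farthest positive and the nearest negative within the mini-batch form the triplet. A minimal PyTorch sketch along those lines; the margin value and the Euclidean distance are illustrative choices.

import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    dist = torch.cdist(embeddings, embeddings)        # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    # Hardest positive: the farthest sample sharing the anchor's label.
    hardest_pos = (dist * same.float()).max(dim=1).values
    # Hardest negative: the nearest sample with a different label.
    dist_neg = dist.clone()
    dist_neg[same] = float('inf')
    hardest_neg = dist_neg.min(dim=1).values
    return torch.relu(hardest_pos - hardest_neg + margin).mean()

emb = torch.randn(8, 128)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])  # P identities x K images
print(batch_hard_triplet_loss(emb, labels))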
Neural Signatures for Licence Plate Re-identification
The problem of vehicle licence plate re-identification is generally considered
a one-shot image retrieval problem. The objective of this task is to learn a
feature representation (called a "signature") for licence plates. Incoming
licence plate images are converted to signatures and matched against a
previously collected template database through a distance measure. The input
image is then recognized as the template whose signature is "nearest" to the
input signature. For our problem, the template database is restricted to
contain only a single signature per unique licence plate.
We measure the performance of deep convolutional net-based features adapted
from face recognition on this task. In addition, we also test a hybrid approach
combining the Fisher vector with a neural network-based embedding called "f2nn"
trained with the triplet loss function. We find that the hybrid approach
performs comparably while providing computational benefits. The signature
generated by the hybrid approach also shows higher generalizability to
datasets more dissimilar to the training corpus.
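The retrieval step described above amounts to nearest-neighbour search over one signature per template. A minimal NumPy sketch, assuming signatures are compared by cosine similarity; the embedding function producing the signatures is taken as given.

import numpy as np

def match_plate(query_signature, template_signatures, template_ids):
    # L2-normalize so that a dot product equals cosine similarity.
    q = query_signature / np.linalg.norm(query_signature)
    t = template_signatures / np.linalg.norm(template_signatures,
                                             axis=1, keepdims=True)
    sims = t @ q                               # similarity to every template
    return template_ids[int(np.argmax(sims))]  # "nearest" template wins

templates = np.random.randn(100, 64)           # one signature per unique plate
ids = np.array(["plate_%d" % i for i in range(100)])
query = templates[42] + 0.05 * np.random.randn(64)  # noisy view of plate 42
print(match_plate(query, templates, ids))      # expected: plate_42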
Joint Discriminative and Generative Learning for Person Re-identification
Person re-identification (re-id) remains challenging due to significant
intra-class variations across different cameras. Recently, there has been a
growing interest in using generative models to augment training data and
enhance the invariance to input changes. The generative pipelines in existing
methods, however, stay relatively separate from the discriminative re-id
learning stages. Accordingly, re-id models are often trained in a
straightforward manner on the generated data. In this paper, we seek to improve
learned re-id embeddings by better leveraging the generated data. To this end,
we propose a joint learning framework that couples re-id learning and data
generation end-to-end. Our model involves a generative module that separately
encodes each person into an appearance code and a structure code, and a
discriminative module that shares the appearance encoder with the generative
module. By switching the appearance or structure codes, the generative module
is able to generate high-quality cross-id composed images, which are fed back
online to the appearance encoder and used to improve the discriminative module.
The proposed joint learning framework renders significant improvement over the
baseline without using generated data, leading to the state-of-the-art
performance on several benchmark datasets.
Comment: CVPR 2019 (Oral).
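As a toy illustration of the code-swapping idea, the sketch below keeps one image's appearance code and another's structure code before decoding. The encoders and decoder here are stand-in linear layers, not the paper's actual generative module.

import torch
import torch.nn as nn

# Hypothetical stand-ins for the appearance/structure encoders and decoder.
appearance_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 32, 128))
structure_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 32, 128))
decoder = nn.Linear(256, 3 * 64 * 32)

def cross_id_generate(img_a, img_b):
    # Keep person A's appearance but person B's structure (pose, layout).
    a_code = appearance_enc(img_a)
    s_code = structure_enc(img_b)
    return decoder(torch.cat([a_code, s_code], dim=1)).view(-1, 3, 64, 32)

img_a = torch.randn(4, 3, 64, 32)  # identity source
img_b = torch.randn(4, 3, 64, 32)  # structure source
composed = cross_id_generate(img_a, img_b)
# Composed images would be fed back through appearance_enc, which is
# shared with the discriminative re-id module.
print(composed.shape)  # torch.Size([4, 3, 64, 32])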
Generalization in Metric Learning: Should the Embedding Layer be the Embedding Layer?
This work studies deep metric learning under small- to medium-scale data, as
we believe that better generalization could be a contributing factor to the
improvement of previous fine-grained image retrieval methods and should be
considered when designing future techniques. In particular, we investigate
using other layers in a deep metric learning system (besides the embedding
layer) for feature extraction and analyze how well they perform on training
data and generalize to testing data. From this study, we suggest a new
regularization practice where one can add or choose a more optimal layer for
feature extraction. State-of-the-art performance is demonstrated on three
fine-grained image retrieval benchmarks: Cars-196, CUB-200-2011, and Stanford
Online Products.
Comment: new version for WACV.
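The question in the title can be made concrete: at test time, features may be read from a layer other than the trained embedding head. A toy PyTorch sketch; the layer sizes are arbitrary assumptions.

import torch
import torch.nn as nn

# Toy metric-learning net: backbone -> penultimate layer -> embedding head.
backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
penultimate = nn.Sequential(nn.Linear(256, 256), nn.ReLU())
embedding = nn.Linear(256, 64)  # the layer usually used for retrieval

def extract(x, layer="embedding"):
    h = penultimate(backbone(x))
    # The study asks whether pre-embedding activations generalize better
    # to unseen test classes than the embedding output itself.
    return embedding(h) if layer == "embedding" else h

x = torch.randn(10, 512)
print(extract(x, "embedding").shape, extract(x, "penultimate").shape)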
Directional Statistics-based Deep Metric Learning for Image Classification and Retrieval
Deep distance metric learning (DDML), which is proposed to learn image
similarity metrics in an end-to-end manner based on a convolutional neural
network, has achieved encouraging results in many computer vision tasks.
L2-normalization in the embedding space has been used to improve the
performance of several DDML methods. However, the commonly used Euclidean
distance is no longer an accurate metric for an L2-normalized embedding space,
i.e., a hyper-sphere. Another challenge of current DDML methods is that their
loss functions are usually based on rigid data formats, such as the triplet
tuple. Thus, an extra process is needed to prepare data in specific formats. In
addition, their losses are computed from a limited number of samples, which
leads to a lack of a global view of the embedding space. In this paper, we
replace the Euclidean distance with the cosine similarity to better utilize
L2-normalization, which is able to attenuate the curse of dimensionality.
More specifically, a novel loss function based on the von Mises-Fisher
distribution is proposed to learn a compact hyper-spherical embedding space.
Moreover, a new efficient learning algorithm is developed to better capture the
global structure of the embedding space. Experiments for both classification
and retrieval tasks on several standard datasets show that our method achieves
state-of-the-art performance with a simpler training procedure. Furthermore, we
demonstrate that, even with a small number of convolutional layers, our model
can still obtain significantly better classification performance than the
widely used softmax loss.
Comment: codes will come soon.
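A generic sketch of a von Mises-Fisher-flavoured objective on the unit hypersphere: the vMF log-likelihood is proportional to the cosine similarity between an embedding and a class mean direction, scaled by a concentration parameter kappa. This cosine-softmax formulation is illustrative, not necessarily the paper's exact loss.

import torch
import torch.nn.functional as F

def vmf_style_loss(embeddings, class_means, labels, kappa=16.0):
    # Work on the unit hypersphere: cosine similarity replaces Euclidean distance.
    z = F.normalize(embeddings, dim=1)
    mu = F.normalize(class_means, dim=1)
    # vMF log-density ~ kappa * mu^T z (up to a normalizing constant);
    # a softmax over classes turns these into posterior probabilities.
    logits = kappa * z @ mu.t()
    return F.cross_entropy(logits, labels)

emb = torch.randn(8, 32, requires_grad=True)
means = torch.randn(5, 32)          # one mean direction per class
labels = torch.randint(0, 5, (8,))
print(vmf_style_loss(emb, means, labels))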
Multiscale CNN based Deep Metric Learning for Bioacoustic Classification: Overcoming Training Data Scarcity Using Dynamic Triplet Loss
This paper proposes multiscale convolutional neural network (CNN)-based deep
metric learning for bioacoustic classification, under low training data
conditions. The proposed CNN is characterized by the utilization of four
different filter sizes at each level to analyze input feature maps. This
multiscale nature helps in describing different bioacoustic events effectively:
smaller filters help in learning the finer details of bioacoustic events,
whereas larger filters help in analyzing a larger context, leading to global
details. A dynamic triplet loss is employed in the proposed CNN architecture to
learn a transformation from the input space to the embedding space, where
classification is performed. The triplet loss helps in learning this
transformation by analyzing three examples, referred to as triplets, at a time
where intra-class distance is minimized while maximizing the inter-class
separation by a dynamically increasing margin. The number of possible triplets
increases cubically with the dataset size, making triplet loss more suitable
than the softmax cross-entropy loss in low training data conditions.
Experiments on three different publicly available datasets show that the
proposed framework performs better than existing bioacoustic classification
frameworks. Experimental results also confirm the superiority of the triplet
loss over the cross-entropy loss in low training data conditions.
Comment: Under review at JASA. Preliminary version of the paper. We are still working on getting better performance out of the comparative method.
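A minimal PyTorch sketch of a triplet loss whose margin grows on a schedule, in the spirit of the dynamically increasing margin described above; the schedule constants are illustrative assumptions.

import torch
import torch.nn.functional as F

def dynamic_triplet_loss(anchor, positive, negative, epoch,
                         base_margin=0.2, growth=0.05, max_margin=1.0):
    # The margin widens as training progresses, demanding ever larger
    # inter-class separation.
    margin = min(base_margin + growth * epoch, max_margin)
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.relu(d_pos - d_neg + margin).mean()

a, p, n = (torch.randn(16, 64) for _ in range(3))
for epoch in (0, 10):
    print(epoch, dynamic_triplet_loss(a, p, n, epoch).item())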
Towards Learning a Universal Non-Semantic Representation of Speech
The ultimate goal of transfer learning is to reduce labeled data requirements
by exploiting a pre-existing embedding model trained for different datasets or
tasks. The visual and language communities have established benchmarks to
compare embeddings, but the speech community has yet to do so. This paper
proposes a benchmark for comparing speech representations on non-semantic
tasks, and proposes a representation based on an unsupervised triplet-loss
objective. The proposed representation outperforms other representations on the
benchmark, and even exceeds state-of-the-art performance on a number of
transfer learning tasks. The embedding is trained on a publicly available
dataset, and it is tested on a variety of low-resource downstream tasks,
including personalization tasks and the medical domain. The benchmark, models,
and evaluation code are publicly released.
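As a rough sketch of how an unsupervised triplet objective can be built from raw audio, the snippet below treats temporally nearby windows of the same recording as anchor/positive and a window from a different recording as negative. The window length and offset are assumptions, not the paper's exact recipe.

import numpy as np

def sample_temporal_triplet(clip, win=16000, gap=16000):
    # Self-supervised heuristic: windows close in time within one recording
    # act as anchor and positive.
    start = np.random.randint(0, len(clip) - win - gap)
    anchor = clip[start:start + win]
    positive = clip[start + gap:start + gap + win]
    return anchor, positive

clips = [np.random.randn(5 * 16000) for _ in range(2)]  # two fake 5 s clips
anchor, positive = sample_temporal_triplet(clips[0])
negative = clips[1][:16000]  # a window from a different recording
print(anchor.shape, positive.shape, negative.shape)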
Self-Supervised Learning of Face Representations for Video Face Clustering
Analyzing the story behind TV series and movies often requires understanding
who the characters are and what they are doing. With improving deep face
models, this may seem like a solved problem. However, as face detectors get
better, clustering/identification needs to be revisited to address increasing
diversity in facial appearance. In this paper, we address video face clustering
using unsupervised methods. Our emphasis is on distilling the essential
information, identity, from the representations obtained using deep pre-trained
face networks. We propose a self-supervised Siamese network that can be trained
without the need for video/track based supervision, and thus can also be
applied to image collections. We evaluate our proposed method on three video
face clustering datasets. The experiments show that our methods outperform
current state-of-the-art methods on all datasets. Video face clustering lacks
a common benchmark, as current works are often evaluated with different
metrics and/or different sets of face tracks.
Comment: To appear at the International Conference on Automatic Face and Gesture Recognition (2019) as an oral. The datasets and code are available at https://github.com/vivoutlaw/SSIA
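A heuristic reading of how training pairs might be generated without video/track supervision: mine nearest and farthest neighbours in the pre-trained descriptor space as pseudo-positives and pseudo-negatives for Siamese training. A NumPy sketch; k and the distance measure are assumptions, not the authors' exact procedure.

import numpy as np

def pseudo_pairs(features, k=5):
    # Pairwise distances between pre-trained face descriptors.
    d = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude self-matches
    order = d.argsort(axis=1)             # ascending distance; self is last
    positives = order[:, :k]              # k closest faces per descriptor
    negatives = order[:, -(k + 1):-1]     # k farthest faces, skipping self
    return positives, negatives

feats = np.random.randn(50, 128)  # descriptors from a deep face network
pos, neg = pseudo_pairs(feats)
print(pos.shape, neg.shape)  # (50, 5) (50, 5)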