Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining
targeted to this task, that is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas.
Comment: To appear in ECCV 201
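The curriculum for hard negative mining mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `select_negative` and `curriculum_tau`, the linear schedule, and the `start`/`end` values are all assumptions made for illustration. The idea shown is only that negatives are ranked by distance to the anchor and that the selected rank moves from easy towards hard as training progresses.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_negative(anchor, candidates, tau):
    """Pick one negative at hardness fraction tau in [0, 1].

    tau = 0 returns the easiest negative (farthest from the anchor),
    tau = 1 returns the hardest one (closest to the anchor).
    This selection rule is a hypothetical stand-in, not the paper's.
    """
    if not candidates:
        raise ValueError("need at least one candidate negative")
    tau = min(max(tau, 0.0), 1.0)
    # Sort from easiest (largest distance) to hardest (smallest distance).
    ranked = sorted(candidates, key=lambda c: euclidean(anchor, c), reverse=True)
    index = int(round(tau * (len(ranked) - 1)))
    return ranked[index]


def curriculum_tau(epoch, total_epochs, start=0.3, end=0.8):
    """Linearly raise the hardness fraction over training (assumed schedule)."""
    if total_epochs <= 1:
        return end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + frac * (end - start)


if __name__ == "__main__":
    face = (0.0, 0.0)                                  # anchor embedding from one modality
    voices = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]      # non-matching embeddings from the other
    for epoch in range(0, 10, 3):
        tau = curriculum_tau(epoch, 10)
        print(epoch, round(tau, 2), select_negative(face, voices, tau))
```

In a real training loop the anchor and candidates would be face and voice embeddings from the same mini-batch, and the chosen negative would feed a contrastive or triplet-style loss.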
Deep Adaptive Feature Embedding with Local Sample Distributions for Person Re-identification
Person re-identification (re-id) aims to match pedestrians observed by
disjoint camera views. It attracts increasing attention in computer vision due
to its importance to surveillance systems. To combat the major challenge of
cross-view visual variations, deep embedding approaches have been proposed that
learn a compact feature space from images in which Euclidean distances
correspond to a cross-view similarity metric. However, the global Euclidean
distance cannot faithfully characterize the ideal similarity in a complex
visual feature space because features of pedestrian images exhibit unknown
distributions due to large variations in poses, illumination and occlusion.
Moreover, intra-personal training samples within a local range provide robust
guidance for deep embedding against uncontrolled variations, yet they cannot be
captured by a global Euclidean distance. In this paper, we study the problem of
person re-id by proposing a novel sampling strategy to mine suitable positives
(i.e. intra-class) within a local range to improve the deep embedding in the
context of large intra-class variations. Our method is capable of learning a
deep similarity metric adaptive to local sample structure by minimizing each
sample's local distances while propagating through the relationship between
samples to attain the whole intra-class minimization. To this end, a novel
objective function is proposed to jointly optimize similarity metric learning,
local positive mining and robust deep embedding. This yields local
discriminations by selecting local-ranged positive samples, and the learned
features are robust to dramatic intra-class variations. Experiments on
benchmarks show state-of-the-art results achieved by our method.
Comment: Published in Pattern Recognitio
Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
The goal of this work is to train discriminative cross-modal embeddings
without access to manually annotated data. Recent advances in self-supervised
learning have shown that effective representations can be learnt from natural
cross-modal synchrony. We build on earlier work to train embeddings that are
more discriminative for uni-modal downstream tasks. To this end, we propose a
novel training strategy that not only optimises metrics across modalities, but
also enforces intra-class feature separation within each of the modalities. The
effectiveness of the method is demonstrated on two downstream tasks: lip
reading using the features trained on audio-visual synchronisation, and speaker
recognition using the features trained for cross-modal biometric matching. The
proposed method outperforms state-of-the-art self-supervised baselines by a
significant margin.
Comment: Under submission as a conference pape
Ranked List Loss for Deep Metric Learning
The objective of deep metric learning (DML) is to learn embeddings that can
capture semantic similarity and dissimilarity information among data points.
Existing pairwise or tripletwise loss functions used in DML are known to suffer
from slow convergence due to a large proportion of trivial pairs or triplets as
the model improves. To address this, ranking-motivated structured losses have
recently been proposed to incorporate multiple examples and exploit the structured
information among them. They converge faster and achieve state-of-the-art
performance. In this work, we unveil two limitations of existing
ranking-motivated structured losses and propose a novel ranked list loss to
solve both of them. First, given a query, only a fraction of data points is
incorporated to build the similarity structure. Consequently, some useful
examples are ignored and the structure is less informative. To address this, we
propose to build a set-based similarity structure by exploiting all instances
in the gallery. The learning setting can be interpreted as few-shot retrieval:
given a mini-batch, every example is iteratively used as a query, and the
remaining examples compose the gallery to search, i.e., the support set in the
few-shot setting. These remaining examples are split into a positive set and a
negative set. For every
mini-batch, the learning objective of ranked list loss is to make the query
closer to the positive set than to the negative set by a margin. Second,
previous methods aim to pull positive pairs as close as possible in the
embedding space. As a result, the intraclass data distribution tends to be
extremely compressed. In contrast, we propose to learn a hypersphere for each
class in order to preserve useful similarity structure inside it, which
functions as regularisation. Extensive experiments demonstrate the superiority
of our proposal over state-of-the-art methods.
Comment: Accepted to T-PAMI. To read the official version, please go
to IEEE Xplore. Fine-grained image retrieval task. Our source code is
available online: https://github.com/XinshaoAmosWang/Ranked-List-Loss-for-DM
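The set-based objective described in this abstract can be sketched roughly as follows. This is a simplified illustration and not the authors' released code: the parameter names `alpha` (distance beyond which negatives are pushed) and `margin` (gap between the negative boundary and the positive boundary), their default values, and the plain averaging over violating pairs are assumptions. Positives are only pulled to within `alpha - margin` of the query rather than to zero distance, which is one way to express the per-class hypersphere idea.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ranked_list_loss(embeddings, labels, alpha=1.2, margin=0.4):
    """Simplified ranked-list-style loss over one mini-batch.

    Every example acts as the query once; all other examples form the gallery
    and are split into a positive set (same label) and a negative set.
    alpha and margin are assumed hyperparameters for this sketch.
    """
    n = len(embeddings)
    if n == 0:
        return 0.0
    total = 0.0
    for q in range(n):
        pos_terms, neg_terms = [], []
        for j in range(n):
            if j == q:
                continue
            d = euclidean(embeddings[q], embeddings[j])
            if labels[j] == labels[q]:
                # Positives are penalised only outside the radius alpha - margin.
                pos_terms.append(max(0.0, d - (alpha - margin)))
            else:
                # Negatives are penalised only inside the boundary alpha.
                neg_terms.append(max(0.0, alpha - d))
        # Average over the violating (non-trivial) examples only.
        pos_active = [t for t in pos_terms if t > 0.0]
        neg_active = [t for t in neg_terms if t > 0.0]
        if pos_active:
            total += sum(pos_active) / len(pos_active)
        if neg_active:
            total += sum(neg_active) / len(neg_active)
    return total / n


if __name__ == "__main__":
    batch = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
    print(ranked_list_loss(batch, [0, 0, 1, 1]))   # well separated classes
    print(ranked_list_loss([(0.0, 0.0)] * 4, [0, 0, 1, 1]))  # collapsed embeddings
```

With well-separated classes no pair violates its boundary and the loss vanishes; when all embeddings collapse to one point, only the negative term contributes.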