3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition
Audio-visual recognition (AVR) has been considered a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method for speaker verification in multi-speaker scenarios. The approach
of AVR systems is to leverage the extracted information from one modality to
improve the recognition ability of the other modality by complementing the
missing information. The essential problem is to find the correspondence
between the audio and visual streams, which is the goal of this work. We
propose the use of a coupled 3D Convolutional Neural Network (3D-CNN)
architecture that can map both modalities into a representation space to
evaluate the correspondence of audio-visual streams using the learned
multimodal features. The proposed architecture incorporates spatial and temporal information jointly to effectively find the correlation between the temporal information of the two modalities. By using a relatively small network architecture and a much smaller training dataset, our proposed method surpasses the performance of existing methods for audio-visual matching that use 3D CNNs for feature representation. We also demonstrate that an effective pair selection method can significantly increase the performance. The proposed method achieves relative improvements of over 20% on the Equal Error Rate (EER) and over 7% on the Average Precision (AP) compared to the state-of-the-art method.
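A minimal sketch of the coupled two-branch idea, written in PyTorch: two small 3D-CNNs map an audio volume and a visual volume into a shared embedding space, and a contrastive loss scores whether the streams correspond. The input shapes, layer widths, and loss are illustrative assumptions, not the exact architecture of the paper.

    # Hypothetical coupled 3D-CNN sketch; shapes and loss are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Stream3D(nn.Module):
        """One branch: a small 3D-CNN embedding a spatio-temporal volume."""
        def __init__(self, in_channels, embed_dim=64):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d(2),
                nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),      # global pooling over time and space
            )
            self.fc = nn.Linear(32, embed_dim)

        def forward(self, x):
            h = self.features(x).flatten(1)
            return F.normalize(self.fc(h), dim=1)   # unit-length embedding

    class CoupledAVNet(nn.Module):
        """Two coupled branches mapped into a shared representation space."""
        def __init__(self, embed_dim=64):
            super().__init__()
            self.audio = Stream3D(in_channels=1, embed_dim=embed_dim)
            self.visual = Stream3D(in_channels=3, embed_dim=embed_dim)

        def forward(self, audio_vol, visual_vol):
            return self.audio(audio_vol), self.visual(visual_vol)

    def contrastive_loss(za, zv, match, margin=1.0):
        """match = 1 for corresponding audio/visual pairs, 0 otherwise."""
        d = (za - zv).pow(2).sum(dim=1).sqrt()
        return (match * d.pow(2) + (1 - match) * F.relu(margin - d).pow(2)).mean()

    # toy usage: 4 audio volumes (1x20x40x40) and visual volumes (3x9x60x60)
    model = CoupledAVNet()
    za, zv = model(torch.randn(4, 1, 20, 40, 40), torch.randn(4, 3, 9, 60, 60))
    loss = contrastive_loss(za, zv, torch.tensor([1., 0., 1., 0.]))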
Nested Invariance Pooling and RBM Hashing for Image Instance Retrieval
The goal of this work is the computation of very compact binary hashes for
image instance retrieval. Our approach has two novel contributions. The first
one is Nested Invariance Pooling (NIP), a method inspired from i-theory, a
mathematical theory for computing group invariant transformations with
feed-forward neural networks. NIP is able to produce compact and
well-performing descriptors with visual representations extracted from
convolutional neural networks. We specifically incorporate scale, translation
and rotation invariances but the scheme can be extended to any arbitrary sets
of transformations. We also show that using moments of increasing order
throughout nesting is important. The NIP descriptors are then hashed to the
target code size (32-256 bits) with a Restricted Boltzmann Machine with a novel
batch-level regularization scheme specifically designed for the purpose of
hashing (RBMH). A thorough empirical evaluation against the state-of-the-art shows that the results obtained both with the NIP descriptors and the NIP+RBMH hashes are consistently outstanding across a wide range of datasets.
Comment: Image Instance Retrieval, CNN, Invariant Representation, Hashing, Unsupervised Learning, Regularization. arXiv admin note: text overlap with arXiv:1601.0209
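A hedged NumPy sketch of the nested pooling idea: CNN activations computed for several transformed copies of the image are pooled group by group, using moments of increasing order at each nesting level. The specific transformation groups, nesting order, and moment orders are illustrative assumptions, and the RBMH hashing stage is not shown.

    # Hedged sketch of nested invariance pooling over CNN feature maps.
    import numpy as np

    def moment_pool(x, axis, order):
        """Pool along `axis` with an order-p moment; order=np.inf means max-pooling."""
        if np.isinf(order):
            return x.max(axis=axis)
        return (np.abs(x) ** order).mean(axis=axis) ** (1.0 / order)

    def nip_descriptor(feat_stack):
        """
        feat_stack: (S, R, H, W, C) CNN activations for S scales and R rotations
        of the input image (H x W spatial grid, C channels). Returns a
        C-dimensional descriptor invariant to the pooled transformations.
        """
        d = moment_pool(feat_stack, axis=(2, 3), order=1)   # translations: mean
        d = moment_pool(d, axis=1, order=2)                  # rotations: 2nd moment
        d = moment_pool(d, axis=0, order=np.inf)             # scales: max
        return d / (np.linalg.norm(d) + 1e-12)

    # toy usage: 3 scales, 4 rotations, 7x7 feature map with 256 channels
    desc = nip_descriptor(np.random.rand(3, 4, 7, 7, 256))
    print(desc.shape)   # (256,)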
Learning Locality-Constrained Collaborative Representation for Face Recognition
The low-dimensional manifold model and sparse representation are two well-known concise models which suggest that each data point can be described by a few
characteristics. Manifold learning is usually investigated for dimension
reduction by preserving some expected local geometric structures from the
original space to a low-dimensional one. The structures are generally
determined by using pairwise distance, e.g., Euclidean distance. Alternatively,
sparse representation denotes a data point as a linear combination of the
points from the same subspace. In practical applications, however, the nearby
points in terms of pairwise distance may not belong to the same subspace, and
vice versa. Consequently, it is interesting and important to explore how to get
a better representation by integrating these two models together. To this end,
this paper proposes a novel coding algorithm, called Locality-Constrained
Collaborative Representation (LCCR), which improves the robustness and
discrimination of data representation by introducing a kind of local
consistency. The locality term derives from the biological observation that similar inputs have similar codes. The objective function of LCCR has an
analytical solution, and it does not involve local minima. The empirical
studies based on four public facial databases, ORL, AR, Extended Yale B, and Multi-PIE, show that LCCR is promising in recognizing human faces from frontal views with varying expression and illumination, as well as various corruptions and occlusions.
Comment: 16 pages, v
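A minimal NumPy sketch of the general idea of combining a collaborative (ridge) coding term with a locality penalty that down-weights coefficients of samples far from the query; the exact LCCR objective and locality weights in the paper differ, so this is only an illustration of a closed-form locality-constrained code followed by residual-based classification.

    # Illustrative locality-constrained collaborative coding (not the exact LCCR objective).
    import numpy as np

    def lccr_like_code(X, y, lam=1e-2, gamma=1e-2):
        """
        X: (d, n) dictionary of training samples (columns), y: (d,) query.
        Coefficients of atoms far from y (in Euclidean distance) are penalized more,
        so nearby samples dominate the representation; the solution is analytical.
        """
        dist = np.linalg.norm(X - y[:, None], axis=0)        # distance to each atom
        p = dist / (dist.max() + 1e-12)                       # normalized locality weights
        A = X.T @ X + lam * np.eye(X.shape[1]) + gamma * np.diag(p ** 2)
        return np.linalg.solve(A, X.T @ y)                    # closed-form coefficients

    def classify(X, labels, y, **kw):
        """Assign y to the class with the smallest reconstruction residual."""
        c = lccr_like_code(X, y, **kw)
        best, best_r = None, np.inf
        for cls in np.unique(labels):
            mask = labels == cls
            r = np.linalg.norm(y - X[:, mask] @ c[mask])
            if r < best_r:
                best, best_r = cls, r
        return best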
Dual-reference Face Retrieval
Face retrieval has received much attention over the past few decades, and
many efforts have been made in retrieving face images against pose,
illumination, and expression variations. However, conventional methods fail to meet the requirements of a promising and novel task: retrieving a person's face image at a specific age, especially when the specific 'age' is not given as a numeral, i.e. 'retrieving someone's image at a similar age period to that shown in another person's image'. To tackle this problem, we propose a
dual reference face retrieval framework in this paper, where the system takes
two inputs: an identity reference image which indicates the target identity and
an age reference image which reflects the target age. In our framework, the raw
images are first projected on a joint manifold, which preserves both the age
and identity locality. Then two similarity metrics of age and identity are
exploited and optimized by utilizing our proposed quartet-based model. The
experiments show promising results, outperforming hierarchical methods.
Comment: Accepted at AAAI 201
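A small NumPy sketch of the retrieval step only: gallery images are ranked by fusing an identity similarity to the identity reference with an age similarity to the age reference. The embedding functions and the weighted-sum fusion are assumptions for illustration; the paper learns both metrics jointly on a shared manifold with a quartet-based model, which is not reproduced here.

    # Hypothetical dual-reference ranking step (embeddings and fusion are assumed).
    import numpy as np

    def retrieve(gallery_id_emb, gallery_age_emb, id_ref, age_ref, alpha=0.5, top_k=5):
        """
        gallery_id_emb:  (n, d1) identity embeddings of gallery images
        gallery_age_emb: (n, d2) age embeddings of gallery images
        id_ref:  (d1,) identity-reference embedding
        age_ref: (d2,) age-reference embedding
        """
        def cos(A, b):
            return (A @ b) / (np.linalg.norm(A, axis=1) * np.linalg.norm(b) + 1e-12)
        score = alpha * cos(gallery_id_emb, id_ref) + (1 - alpha) * cos(gallery_age_emb, age_ref)
        return np.argsort(-score)[:top_k]   # indices of the best-matching gallery images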
Collaborative Discriminant Locality Preserving Projections With its Application to Face Recognition
We present a novel Discriminant Locality Preserving Projections (DLPP)
algorithm named Collaborative Discriminant Locality Preserving Projection
(CDLPP). In our algorithm, the discriminating power of DLPP is further exploited from two aspects. On the one hand, the global optimum of class scattering is guaranteed by using the between-class scatter matrix to replace the original denominator of DLPP. On the other hand, motivated by collaborative representation, an L2-norm constraint is imposed on the projections to
discover the collaborations of dimensions in the sample space. We apply our
algorithm to face recognition. Three popular face databases, namely AR, ORL and
LFW-A, are employed for evaluating the performance of CDLPP. Extensive
experimental results demonstrate that CDLPP significantly improves the
discriminating power of DLPP and outperforms the state-of-the-art methods.
Comment: second versio
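A rough sketch of one plausible reading of this recipe: a locality-preserving within-class scatter and the between-class scatter are combined in a generalized eigenvalue problem, with a ridge term standing in for the L2-norm constraint on the projections. The exact CDLPP objective differs; graph construction and parameters here are assumptions.

    # Generic DLPP-style projection sketch (not the exact CDLPP objective).
    import numpy as np
    from scipy.linalg import eigh

    def cdlpp_like_projection(X, labels, n_dims=50, t=1.0, ridge=1e-3):
        """X: (n, d) samples by rows, labels: (n,) class labels (NumPy array)."""
        n, d = X.shape
        # locality-preserving within-class scatter from a heat-kernel graph
        W = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    W[i, j] = np.exp(-np.sum((X[i] - X[j]) ** 2) / t)
        L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian
        S_local = X.T @ L @ X
        # between-class scatter
        mu = X.mean(axis=0)
        S_b = np.zeros((d, d))
        for cls in np.unique(labels):
            Xc = X[labels == cls]
            diff = (Xc.mean(axis=0) - mu)[:, None]
            S_b += len(Xc) * diff @ diff.T
        # maximize between-class scatter against the regularized local scatter
        vals, vecs = eigh(S_b, S_local + ridge * np.eye(d))
        return vecs[:, np.argsort(vals)[::-1][:n_dims]]  # top generalized eigenvectors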
Learning Spread-out Local Feature Descriptors
We propose a simple, yet powerful regularization technique that can be used
to significantly improve both the pairwise and triplet losses in learning local
feature descriptors. The idea is that in order to fully utilize the expressive
power of the descriptor space, good local feature descriptors should be
sufficiently "spread-out" over the space. In this work, we propose a
regularization term to maximize the spread in feature descriptor inspired by
the property of uniform distribution. We show that the proposed regularization
with triplet loss outperforms existing Euclidean distance based descriptor
learning techniques by a large margin. As an extension, the proposed
regularization technique can also be used to improve image-level deep feature
embedding.
Comment: ICCV 2017. 9 pages, 7 figure
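A PyTorch sketch of a "spread-out" style regularizer on non-matching descriptor pairs: for L2-normalized descriptors, the mean inner product is pushed toward 0 and the second moment toward 1/d, which matches a uniform distribution on the unit sphere. The pairing strategy and weighting are assumptions; it would typically be added to a triplet loss.

    # Spread-out regularization sketch for L2-normalized descriptors.
    import torch
    import torch.nn.functional as F

    def spread_out_regularizer(desc_a, desc_b):
        """
        desc_a, desc_b: (n, d) L2-normalized descriptors forming non-matching pairs
        (e.g., anchors paired with shuffled negatives from the same batch).
        """
        d = desc_a.shape[1]
        dots = (desc_a * desc_b).sum(dim=1)               # pairwise inner products
        m1 = dots.mean() ** 2                              # squared mean, pushed to 0
        m2 = F.relu(dots.pow(2).mean() - 1.0 / d)          # second moment, pushed to 1/d
        return m1 + m2

    # usage inside a training step (lam is a tunable weight):
    # loss = triplet_loss(anchor, pos, neg) + lam * spread_out_regularizer(anchor, neg)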
Nonlinear Local Metric Learning for Person Re-identification
Person re-identification aims at matching pedestrians observed from
non-overlapping camera views. Feature descriptor and metric learning are two
significant problems in person re-identification. A discriminative metric
learning method should be capable of exploiting complex nonlinear
transformations due to the large variations in feature space. In this paper, we
propose a nonlinear local metric learning (NLML) method to improve the
state-of-the-art performance of person re-identification on public datasets.
Motivated by the fact that local metric learning has been introduced to handle data that varies locally, and that deep neural networks have shown outstanding capability in exploiting the nonlinearity of samples, we utilize the merits of
both local metric learning and deep neural network to learn multiple sets of
nonlinear transformations. By enforcing a margin between the distances of
positive pedestrian image pairs and distances of negative pairs in the
transformed feature subspace, discriminative information can be effectively
exploited in the developed neural networks. Our experiments show that the
proposed NLML method achieves the state-of-the-art results on the widely used
VIPeR, GRID, and CUHK01 datasets.
Comment: Submitted to CVPR 201
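A minimal PyTorch sketch of the core ingredient: a nonlinear transformation trained with a margin between positive-pair and negative-pair distances. NLML learns multiple such local networks; here a single MLP stands in for one local metric, and the architecture and margin value are assumptions.

    # One "local metric" sketch: an MLP trained with a margin-based pair loss.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MetricNet(nn.Module):
        def __init__(self, in_dim, out_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, 256), nn.Tanh(),
                nn.Linear(256, out_dim),
            )

        def forward(self, x):
            return self.net(x)

    def margin_pair_loss(f, xa, xb, same_id, margin=2.0):
        """same_id = 1 for images of the same pedestrian, 0 otherwise."""
        d = (f(xa) - f(xb)).pow(2).sum(dim=1)
        return (same_id * d + (1 - same_id) * F.relu(margin - d)).mean()

    # toy usage: 8 feature pairs of dimension 500
    f = MetricNet(500)
    loss = margin_pair_loss(f, torch.randn(8, 500), torch.randn(8, 500),
                            torch.tensor([1., 1., 0., 0., 1., 0., 1., 0.]))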
A Deep Hashing Learning Network
Hashing-based methods seek compact and efficient binary codes that preserve
the neighborhood structure in the original data space. For most existing
hashing methods, an image is first encoded as a vector of hand-crafted visual features, followed by a hash projection and quantization step to obtain the compact binary vector. Most hand-crafted features encode only low-level information of the input and may not preserve the semantic similarities of image pairs. Meanwhile, the hash function learning process is independent of the feature representation, so the features may not be optimal for the hashing projection. In this paper, we propose a supervised hashing method based on a well-designed deep convolutional neural network, which tries to learn hash codes and compact representations of data simultaneously. The proposed model learns the binary codes by adding a compact sigmoid layer before the loss layer. Experiments on several image datasets show that the proposed model outperforms other state-of-the-art methods.
Comment: 7 pages, 5 figure
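A hedged PyTorch sketch of this idea: a CNN ends in a compact sigmoid layer whose activations are trained jointly with a classification loss and thresholded into binary codes at test time. The backbone, code length, and loss head are illustrative assumptions, not the paper's exact network.

    # Supervised deep hashing sketch: sigmoid code layer before the loss layer.
    import torch
    import torch.nn as nn

    class DeepHashNet(nn.Module):
        def __init__(self, n_bits=48, n_classes=10):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            )
            self.hash_layer = nn.Sequential(nn.Linear(64, n_bits), nn.Sigmoid())
            self.classifier = nn.Linear(n_bits, n_classes)   # loss layer sits on top

        def forward(self, x):
            h = self.hash_layer(self.backbone(x).flatten(1))  # values in (0, 1)
            return self.classifier(h), h

        @torch.no_grad()
        def binary_code(self, x):
            _, h = self.forward(x)
            return (h > 0.5).to(torch.uint8)                  # compact binary code

    model = DeepHashNet()
    logits, codes = model(torch.randn(2, 3, 64, 64))
    bits = model.binary_code(torch.randn(2, 3, 64, 64))      # shape (2, 48)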
Transfer Metric Learning: Algorithms, Applications and Outlooks
Distance metric learning (DML) aims to find an appropriate way to reveal the
underlying data relationship. It is critical in many machine learning, pattern
recognition and data mining algorithms, and usually requires a large amount of label information (such as class labels or pair/triplet constraints) to achieve satisfactory performance. However, the label information may be insufficient in real-world applications due to the high labeling cost, and DML may fail in this
case. Transfer metric learning (TML) is able to mitigate this issue for DML in
the domain of interest (target domain) by leveraging knowledge/information from
other related domains (source domains). Although it has achieved a certain level of development, TML still has limited success in various aspects such as selective
transfer, theoretical understanding, handling complex data, big data and
extreme cases. In this survey, we present a systematic review of the TML
literature. In particular, we group TML into different categories according to
different settings and metric transfer strategies, such as direct metric
approximation, subspace approximation, distance approximation, and distribution
approximation. A summary and insightful discussion of the various TML approaches and their applications will be presented. Finally, we indicate some challenges and provide possible future directions.
Comment: 14 pages, 5 figure
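A small NumPy sketch in the spirit of the "direct metric approximation" strategy mentioned above: a target Mahalanobis metric is fit to a few labeled target-domain pairs while being regularized toward a source metric. The pair loss, step sizes, and projection are generic choices for illustration, not a specific algorithm from the survey.

    # Generic transfer-metric sketch: regularize the target metric toward the source metric.
    import numpy as np

    def mahalanobis_sq(M, x, y):
        d = x - y
        return d @ M @ d

    def transfer_metric(pairs, sims, M_s, lam=0.1, lr=1e-3, epochs=100, margin=1.0):
        """
        pairs: list of (x, y) target-domain feature pairs, sims: +1 similar / -1 dissimilar,
        M_s: (d, d) source metric used as a prior for the target metric.
        """
        M = M_s.copy()
        for _ in range(epochs):
            G = 2 * lam * (M - M_s)                  # gradient of the transfer regularizer
            for (x, y), s in zip(pairs, sims):
                d = x - y
                # hinge: similar pairs closer than 1, dissimilar farther than 1 + margin
                viol = s * (mahalanobis_sq(M, x, y) - (1.0 + (s < 0) * margin))
                if viol > 0:
                    G += s * np.outer(d, d)
            M -= lr * G
            # project back onto the PSD cone so M stays a valid metric
            w, V = np.linalg.eigh(M)
            M = (V * np.clip(w, 0, None)) @ V.T
        return M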
Unsupervised learning from videos using temporal coherency deep networks
In this work we address the challenging problem of unsupervised learning from
videos. Existing methods utilize the spatio-temporal continuity in contiguous
video frames as regularization for the learning process. Typically, this
temporal coherence of close frames is used as a free form of annotation,
encouraging the learned representations to exhibit small differences between
these frames. But this type of approach fails to capture the dissimilarity
between videos with different content, hence learning less discriminative
features. Here we propose two Siamese Convolutional Neural Network architectures, with corresponding novel loss functions, to learn from unlabeled videos; these jointly exploit the local temporal coherence between contiguous frames and a global discriminative margin that separates representations of different videos. An extensive experimental evaluation is presented, where we
validate the proposed models on various tasks. First, we show how the learned
features can be used to discover actions and scenes in video collections.
Second, we show the benefits of such unsupervised learning from unlabeled videos alone, which can be used directly as a prior for the supervised recognition tasks of actions and objects in images; our results further show that our features can even surpass a traditional, heavily supervised pre-training plus fine-tuning strategy.
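A hedged PyTorch sketch of the two objectives described above: a local temporal-coherence term pulls embeddings of contiguous frames of the same video together, and a global margin term pushes frames from different videos apart. The shared encoder, margin, and frame shapes are illustrative assumptions.

    # Unsupervised temporal-coherence plus discriminative-margin sketch.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    encoder = nn.Sequential(                       # shared Siamese branch
        nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
        nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, 128),
    )

    def coherence_loss(frame_t, frame_t1):
        """Contiguous frames of the same video should map to nearby points."""
        return (encoder(frame_t) - encoder(frame_t1)).pow(2).sum(dim=1).mean()

    def discriminative_loss(frame_a, frame_b, margin=1.0):
        """Frames from different videos should be at least `margin` apart."""
        d = (encoder(frame_a) - encoder(frame_b)).pow(2).sum(dim=1).sqrt()
        return F.relu(margin - d).pow(2).mean()

    # one unsupervised training step on unlabeled video frames
    x_t, x_t1 = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)   # same video
    y = torch.randn(4, 3, 64, 64)                                       # other videos
    loss = coherence_loss(x_t, x_t1) + discriminative_loss(x_t, y)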