990 research outputs found

    3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition

    Full text link
    Audio-visual recognition (AVR) has been considered as a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method used for speaker verification in multi-speaker scenarios. The approach of AVR systems is to leverage the extracted information from one modality to improve the recognition ability of the other modality by complementing the missing information. The essential problem is to find the correspondence between the audio and visual streams, which is the goal of this work. We propose the use of a coupled 3D Convolutional Neural Network (3D-CNN) architecture that can map both modalities into a representation space to evaluate the correspondence of audio-visual streams using the learned multimodal features. The proposed architecture will incorporate both spatial and temporal information jointly to effectively find the correlation between temporal information for different modalities. By using a relatively small network architecture and much smaller dataset for training, our proposed method surpasses the performance of the existing similar methods for audio-visual matching which use 3D CNNs for feature representation. We also demonstrate that an effective pair selection method can significantly increase the performance. The proposed method achieves relative improvements over 20% on the Equal Error Rate (EER) and over 7% on the Average Precision (AP) in comparison to the state-of-the-art method

    Nested Invariance Pooling and RBM Hashing for Image Instance Retrieval

    Full text link
    The goal of this work is the computation of very compact binary hashes for image instance retrieval. Our approach has two novel contributions. The first one is Nested Invariance Pooling (NIP), a method inspired from i-theory, a mathematical theory for computing group invariant transformations with feed-forward neural networks. NIP is able to produce compact and well-performing descriptors with visual representations extracted from convolutional neural networks. We specifically incorporate scale, translation and rotation invariances but the scheme can be extended to any arbitrary sets of transformations. We also show that using moments of increasing order throughout nesting is important. The NIP descriptors are then hashed to the target code size (32-256 bits) with a Restricted Boltzmann Machine with a novel batch-level regularization scheme specifically designed for the purpose of hashing (RBMH). A thorough empirical evaluation with state-of-the-art shows that the results obtained both with the NIP descriptors and the NIP+RBMH hashes are consistently outstanding across a wide range of datasets.Comment: Image Instance Retrieval, CNN, Invariant Representation, Hashing, Unsupervised Learning, Regularization. arXiv admin note: text overlap with arXiv:1601.0209

    Learning Locality-Constrained Collaborative Representation for Face Recognition

    Full text link
    The model of low-dimensional manifold and sparse representation are two well-known concise models that suggest each data can be described by a few characteristics. Manifold learning is usually investigated for dimension reduction by preserving some expected local geometric structures from the original space to a low-dimensional one. The structures are generally determined by using pairwise distance, e.g., Euclidean distance. Alternatively, sparse representation denotes a data point as a linear combination of the points from the same subspace. In practical applications, however, the nearby points in terms of pairwise distance may not belong to the same subspace, and vice versa. Consequently, it is interesting and important to explore how to get a better representation by integrating these two models together. To this end, this paper proposes a novel coding algorithm, called Locality-Constrained Collaborative Representation (LCCR), which improves the robustness and discrimination of data representation by introducing a kind of local consistency. The locality term derives from a biologic observation that the similar inputs have similar code. The objective function of LCCR has an analytical solution, and it does not involve local minima. The empirical studies based on four public facial databases, ORL, AR, Extended Yale B, and Multiple PIE, show that LCCR is promising in recognizing human faces from frontal views with varying expression and illumination, as well as various corruptions and occlusions.Comment: 16 pages, v

    Dual-reference Face Retrieval

    Full text link
    Face retrieval has received much attention over the past few decades, and many efforts have been made in retrieving face images against pose, illumination, and expression variations. However, the conventional works fail to meet the requirements of a potential and novel task --- retrieving a person's face image at a specific age, especially when the specific 'age' is not given as a numeral, i.e. 'retrieving someone's image at the similar age period shown by another person's image'. To tackle this problem, we propose a dual reference face retrieval framework in this paper, where the system takes two inputs: an identity reference image which indicates the target identity and an age reference image which reflects the target age. In our framework, the raw images are first projected on a joint manifold, which preserves both the age and identity locality. Then two similarity metrics of age and identity are exploited and optimized by utilizing our proposed quartet-based model. The experiments show promising results, outperforming hierarchical methods.Comment: Accepted at AAAI 201

    Collaborative Discriminant Locality Preserving Projections With its Application to Face Recognition

    Full text link
    We present a novel Discriminant Locality Preserving Projections (DLPP) algorithm named Collaborative Discriminant Locality Preserving Projection (CDLPP). In our algorithm, the discriminating power of DLPP are further exploited from two aspects. On the one hand, the global optimum of class scattering is guaranteed via using the between-class scatter matrix to replace the original denominator of DLPP. On the other hand, motivated by collaborative representation, an L2L_2-norm constraint is imposed to the projections to discover the collaborations of dimensions in the sample space. We apply our algorithm to face recognition. Three popular face databases, namely AR, ORL and LFW-A, are employed for evaluating the performance of CDLPP. Extensive experimental results demonstrate that CDLPP significantly improves the discriminating power of DLPP and outperforms the state-of-the-arts.Comment: second versio

    Learning Spread-out Local Feature Descriptors

    Full text link
    We propose a simple, yet powerful regularization technique that can be used to significantly improve both the pairwise and triplet losses in learning local feature descriptors. The idea is that in order to fully utilize the expressive power of the descriptor space, good local feature descriptors should be sufficiently "spread-out" over the space. In this work, we propose a regularization term to maximize the spread in feature descriptor inspired by the property of uniform distribution. We show that the proposed regularization with triplet loss outperforms existing Euclidean distance based descriptor learning techniques by a large margin. As an extension, the proposed regularization technique can also be used to improve image-level deep feature embedding.Comment: ICCV 2017. 9 pages, 7 figure

    Nonlinear Local Metric Learning for Person Re-identification

    Full text link
    Person re-identification aims at matching pedestrians observed from non-overlapping camera views. Feature descriptor and metric learning are two significant problems in person re-identification. A discriminative metric learning method should be capable of exploiting complex nonlinear transformations due to the large variations in feature space. In this paper, we propose a nonlinear local metric learning (NLML) method to improve the state-of-the-art performance of person re-identification on public datasets. Motivated by the fact that local metric learning has been introduced to handle the data which varies locally and deep neural network has presented outstanding capability in exploiting the nonlinearity of samples, we utilize the merits of both local metric learning and deep neural network to learn multiple sets of nonlinear transformations. By enforcing a margin between the distances of positive pedestrian image pairs and distances of negative pairs in the transformed feature subspace, discriminative information can be effectively exploited in the developed neural networks. Our experiments show that the proposed NLML method achieves the state-of-the-art results on the widely used VIPeR, GRID, and CUHK 01 datasets.Comment: Submitted to CVPR 201

    A Deep Hashing Learning Network

    Full text link
    Hashing-based methods seek compact and efficient binary codes that preserve the neighborhood structure in the original data space. For most existing hashing methods, an image is first encoded as a vector of hand-crafted visual feature, followed by a hash projection and quantization step to get the compact binary vector. Most of the hand-crafted features just encode the low-level information of the input, the feature may not preserve the semantic similarities of images pairs. Meanwhile, the hashing function learning process is independent with the feature representation, so the feature may not be optimal for the hashing projection. In this paper, we propose a supervised hashing method based on a well designed deep convolutional neural network, which tries to learn hashing code and compact representations of data simultaneously. The proposed model learn the binary codes by adding a compact sigmoid layer before the loss layer. Experiments on several image data sets show that the proposed model outperforms other state-of-the-art methods.Comment: 7 pages, 5 figure

    Transfer Metric Learning: Algorithms, Applications and Outlooks

    Full text link
    Distance metric learning (DML) aims to find an appropriate way to reveal the underlying data relationship. It is critical in many machine learning, pattern recognition and data mining algorithms, and usually require large amount of label information (such as class labels or pair/triplet constraints) to achieve satisfactory performance. However, the label information may be insufficient in real-world applications due to the high-labeling cost, and DML may fail in this case. Transfer metric learning (TML) is able to mitigate this issue for DML in the domain of interest (target domain) by leveraging knowledge/information from other related domains (source domains). Although achieved a certain level of development, TML has limited success in various aspects such as selective transfer, theoretical understanding, handling complex data, big data and extreme cases. In this survey, we present a systematic review of the TML literature. In particular, we group TML into different categories according to different settings and metric transfer strategies, such as direct metric approximation, subspace approximation, distance approximation, and distribution approximation. A summarization and insightful discussion of the various TML approaches and their applications will be presented. Finally, we indicate some challenges and provide possible future directions.Comment: 14 pages, 5 figure

    Unsupervised learning from videos using temporal coherency deep networks

    Full text link
    In this work we address the challenging problem of unsupervised learning from videos. Existing methods utilize the spatio-temporal continuity in contiguous video frames as regularization for the learning process. Typically, this temporal coherence of close frames is used as a free form of annotation, encouraging the learned representations to exhibit small differences between these frames. But this type of approach fails to capture the dissimilarity between videos with different content, hence learning less discriminative features. We here propose two Siamese architectures for Convolutional Neural Networks, and their corresponding novel loss functions, to learn from unlabeled videos, which jointly exploit the local temporal coherence between contiguous frames, and a global discriminative margin used to separate representations of different videos. An extensive experimental evaluation is presented, where we validate the proposed models on various tasks. First, we show how the learned features can be used to discover actions and scenes in video collections. Second, we show the benefits of such an unsupervised learning from just unlabeled videos, which can be directly used as a prior for the supervised recognition tasks of actions and objects in images, where our results further show that our features can even surpass a traditional and heavily supervised pre-training plus fine-tunning strategy
    • …
    corecore