
    The Role of Riemannian Manifolds in Computer Vision: From Coding to Deep Metric Learning

    Many tasks in computer vision and machine learning benefit from representations of data that are compact yet discriminative, informative, and robust to noisy measurements. Two notable representations are offered by Region Covariance Descriptors (RCovDs) and linear subspaces, which are naturally analyzed through the manifold of Symmetric Positive Definite (SPD) matrices and the Grassmann manifold, respectively, two widely used types of Riemannian manifolds in computer vision. As our first objective, we examine image- and video-based recognition applications where the local descriptors have the aforementioned Riemannian structures, namely the SPD or linear subspace structure. Initially, we provide a solution to compute a Riemannian version of the conventional Vector of Locally Aggregated Descriptors (VLAD), using the geodesic distance of the underlying manifold as the nearness measure (one such distance is sketched after this abstract). Next, taking a closer look at the resulting codes, we formulate a new concept which we name Local Difference Vectors (LDVs). LDVs enable us to elegantly extend our Riemannian coding techniques to any arbitrary metric, and to provide intrinsic solutions to Riemannian sparse coding and its variants when local structured descriptors are considered.

    We then turn our attention to two special types of covariance descriptors, namely infinite-dimensional RCovDs and rank-deficient covariance matrices, for which the underlying Riemannian structure, i.e., the manifold of SPD matrices, is largely out of reach. Generally speaking, infinite-dimensional RCovDs offer better discriminatory power than their low-dimensional counterparts. To overcome this difficulty, we propose to approximate the infinite-dimensional RCovDs by making use of two feature mappings, namely random Fourier features and the Nyström method. As for the rank-deficient covariance matrices, unlike most existing approaches that employ inference tools with predefined regularizers, we derive positive definite kernels that can be decomposed into kernels on the cone of SPD matrices and kernels on the Grassmann manifold, and show their effectiveness for the image set classification task.

    Furthermore, inspired by the attractive properties of Riemannian optimization techniques, we extend the recently introduced Keep It Simple and Straightforward MEtric learning (KISSME) method to scenarios where the input data is non-linearly distributed. To this end, we make use of infinite-dimensional covariance matrices and propose techniques for projecting onto the positive cone in a Reproducing Kernel Hilbert Space (RKHS). We also address the sensitivity of KISSME to the input dimensionality: the algorithm depends heavily on Principal Component Analysis (PCA) as a preprocessing step, which can cause difficulties when the dimensionality is not meticulously set. To address this, building on the KISSME algorithm, we develop a Riemannian framework that jointly learns a dimensionality-reducing mapping and a metric in the induced space. Lastly, in line with the recent trend in metric learning, we devise end-to-end learning of a generic deep network for metric learning using our derivations.
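    The coding schemes above rest on a geodesic-style distance between SPD descriptors. As a rough illustration only, here is a minimal numpy sketch of the Log-Euclidean distance between two region covariance descriptors, one standard metric on the SPD manifold; the function names and toy data are ours, not the thesis code.

    import numpy as np

    def spd_log(X):
        # Matrix logarithm of an SPD matrix via eigendecomposition (result stays symmetric).
        w, V = np.linalg.eigh(X)
        return (V * np.log(w)) @ V.T

    def log_euclidean_distance(X, Y):
        # Log-Euclidean geodesic-style distance between two SPD matrices.
        return np.linalg.norm(spd_log(X) - spd_log(Y), ord="fro")

    # Toy usage: covariance descriptors of two patches of 5-dimensional features,
    # regularized slightly to keep them safely positive definite.
    rng = np.random.default_rng(0)
    C1 = np.cov(rng.standard_normal((50, 5)), rowvar=False) + 1e-6 * np.eye(5)
    C2 = np.cov(rng.standard_normal((50, 5)), rowvar=False) + 1e-6 * np.eye(5)
    print(log_euclidean_distance(C1, C2))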

    A survey of face recognition techniques under occlusion

    The limited capacity to recognize faces under occlusion is a long-standing problem that presents a unique challenge for face recognition systems, and even for humans. Occlusion has received less research attention than other challenges such as pose variation and differing expressions. Nevertheless, occluded face recognition is imperative for exploiting the full potential of face recognition in real-world applications. In this paper, we restrict the scope to occluded face recognition. First, we explore what the occlusion problem is and what inherent difficulties can arise. As part of this review, we introduce face detection under occlusion, a preliminary step in face recognition. Second, we present how existing face recognition methods cope with the occlusion problem and classify them into three categories: 1) occlusion-robust feature extraction approaches, 2) occlusion-aware face recognition approaches, and 3) occlusion-recovery-based face recognition approaches. Furthermore, we analyze the motivations, innovations, pros and cons, and performance of representative approaches for comparison. Finally, future challenges and method trends of occluded face recognition are thoroughly discussed.

    Dictionary Representation of Deep Features for Occlusion-Robust Face Recognition

    Deep learning has achieved exciting results in face recognition; however, accuracy remains unsatisfactory for occluded faces. To improve robustness to occlusion, this paper proposes a novel deep dictionary representation-based classification scheme, in which a convolutional neural network serves as the feature extractor, followed by a dictionary that linearly codes the extracted deep features. The dictionary is composed of a gallery part, consisting of the deep features of the training samples, and an auxiliary part, consisting of mapping vectors acquired from subjects either inside or outside the training set and associated with the occlusion patterns of the testing face samples. A squared Euclidean norm is used to regularize the coding coefficients. The proposed scheme is computationally efficient and robust to large contiguous occlusion. In addition, it is generic for both occluded and non-occluded face images and works with a single training sample per subject. Extensive experimental evaluations demonstrate the superior performance of the proposed approach over other state-of-the-art algorithms.
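    Because the coding coefficients are regularized with a squared Euclidean norm, the coding step has a closed-form ridge-regression solution, as in collaborative representation. A minimal sketch under that reading, with dictionary columns holding gallery/auxiliary deep features and illustrative names throughout:

    import numpy as np

    def l2_code(D, y, lam=0.1):
        # argmin_x ||y - D x||^2 + lam * ||x||^2, solved in closed form.
        k = D.shape[1]
        return np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ y)

    def classify_by_residual(D, labels, y, lam=0.1):
        # Assign y to the class whose atoms best reconstruct it.
        x = l2_code(D, y, lam)
        best, best_err = None, np.inf
        for c in np.unique(labels):
            mask = labels == c
            err = np.linalg.norm(y - D[:, mask] @ x[mask])
            if err < best_err:
                best, best_err = c, err
        return best

    # Toy usage: 6 gallery atoms (128-dim deep features) from 3 subjects.
    rng = np.random.default_rng(0)
    D = rng.standard_normal((128, 6))
    labels = np.array([0, 0, 1, 1, 2, 2])
    y = D[:, 2] + 0.1 * rng.standard_normal(128)  # noisy copy of a class-1 atom
    print(classify_by_residual(D, labels, y))     # expected: 1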

    Advanced Biometrics with Deep Learning

    Biometrics, such as fingerprint, iris, face, handprint, hand vein, speech, and gait recognition, have become a commonplace means of identity management in a wide range of applications. Biometric systems follow a typical pipeline composed of separate preprocessing, feature extraction, and classification stages. Deep learning, as a data-driven representation learning approach, has been shown to be a promising alternative to conventional data-agnostic, handcrafted preprocessing and feature extraction for biometric systems. Furthermore, deep learning offers an end-to-end learning paradigm that unifies preprocessing, feature extraction, and recognition, based solely on the biometric data. This Special Issue has collected 12 high-quality, state-of-the-art research papers that deal with challenging issues in advanced biometric systems based on deep learning. The 12 papers can be divided into four categories according to biometric modality: face biometrics, medical electronic signals (EEG and ECG), voiceprint, and others.

    A Survey of Face Recognition

    Recent years have witnessed the breakthrough of face recognition (FR) with deep convolutional neural networks. Dozens of papers in the field of FR are published every year. Some of them have been deployed in industry and play an important role in everyday life, e.g., device unlocking and mobile payment. This paper provides an introduction to face recognition, including its history, pipeline, algorithms based on conventional hand-designed features or deep learning, mainstream training and evaluation datasets, and related applications. We analyze and compare as many state-of-the-art works as possible, and we also carefully designed a set of experiments to study the effect of backbone size and data distribution. This survey accompanies the tutorial "The Practical Face Recognition Technology in the Industrial World" at FG2023.

    Sparse Representations and Feature Learning for Image Set Classification and Correspondence Estimation

    The use of effective features is a key component in solving many computer vision tasks including, but not limited to, image (set) classification and correspondence estimation. Many research directions have focused on finding good features for the task under consideration, traditionally by hand-crafting and more recently by machine learning. In our work, we present algorithms for feature extraction and sparse representation for the classification of image sets. In addition, we present an approach to deep metric learning for correspondence estimation.

    We start by benchmarking various image set classification methods on a mobile video dataset that we have collected and made public. The videos were acquired under three different ambient conditions to capture the type of variations caused by the 'mobility' of the devices. An inspection of these videos reveals a combination of favorable and challenging properties unique to smartphone face videos. Besides mobility, the dataset has other challenges, including partial faces, occasional pose changes, blur, and fiducial point localization errors. Based on the evaluation, the recognition rates drop dramatically when enrollment and test videos come from different sessions.

    We then present Bayesian Representation-based Classification (BRC), an approach based on sparse Bayesian regression and subspace clustering for image set classification. A Bayesian statistical framework is used to compare BRC with similar existing approaches such as Collaborative Representation-based Classification (CRC) and Sparse Representation-based Classification (SRC), where it is shown that BRC employs precision hyperpriors that are more non-informative than those of CRC/SRC. Furthermore, we present a robust probe image set handling strategy that balances the trade-off between efficiency and accuracy. Experiments on three datasets illustrate the effectiveness of our algorithm compared to state-of-the-art set-based methods.

    We then propose to represent image sets as dictionaries of hand-crafted descriptors based on Symmetric Positive Definite (SPD) matrices, which are more robust to local deformations and fiducial point location errors. We learn a tangent map for transforming the SPD matrix logarithms into a lower-dimensional Log-Euclidean space, such that the transformed gallery atoms adhere to a more discriminative subspace structure (the mapping step is sketched below). A query image set is then classified by first mapping its SPD descriptors into the computed Log-Euclidean tangent space and then using the sparse representation over the tangent space to decide a label for the image set. Experiments on four public datasets show that representation-based classification based on the proposed features outperforms many state-of-the-art methods.

    We then present Nonlinear Subspace Feature Enhancement (NSFE), an approach for nonlinearly embedding image sets into a space where they adhere to a more discriminative subspace structure. We describe how the structured loss function of NSFE can be optimized in a batch-by-batch fashion by a two-step alternating algorithm. The algorithm makes very few assumptions about the form of the embedding to be learned and is compatible with stochastic gradient descent and back-propagation. We evaluate NSFE with different types of input features and nonlinear embeddings and show that NSFE compares favorably to state-of-the-art image set classification methods.
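    The tangent mapping referred to above can be pictured as follows: each SPD descriptor is sent into the Log-Euclidean tangent space by a matrix logarithm, flattened with sqrt(2)-weighted off-diagonals (so that Euclidean inner products of the vectors match Frobenius inner products of the logs), and then linearly projected. A minimal sketch in which a random matrix stands in for the learned, discriminative tangent map:

    import numpy as np

    def log_euclidean_vector(S):
        # Matrix log of an SPD matrix, flattened to its upper triangle with
        # off-diagonal entries scaled by sqrt(2) to preserve inner products.
        w, V = np.linalg.eigh(S)
        L = (V * np.log(w)) @ V.T
        iu = np.triu_indices(L.shape[0])
        v = L[iu]  # fancy indexing copies, so in-place scaling below is safe
        v[iu[0] != iu[1]] *= np.sqrt(2.0)
        return v

    # A learned projection would replace this random matrix W.
    n, d = 10, 20
    rng = np.random.default_rng(1)
    W = rng.standard_normal((d, n * (n + 1) // 2))
    S = np.cov(rng.standard_normal((100, n)), rowvar=False) + 1e-6 * np.eye(n)
    z = W @ log_euclidean_vector(S)  # low-dimensional tangent-space feature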
    Finally, we propose a hierarchical approach to deep metric learning and descriptor matching for the task of point correspondence estimation. Our idea is motivated by the observation that existing metric learning approaches, which supervise and match with only the deepest layer, produce features that are in some respects inferior to shallower features. Instead, the best matching performance, as we empirically show, is obtained by combining the high invariance of deeper features with the geometric sensitivity and higher precision of shallower features. We compare our method to state-of-the-art networks as well as to fusion baselines inspired by existing semantic segmentation networks, and we empirically show that our method is more accurate and better suited to correspondence estimation.
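    A toy illustration of that fusion idea: combine a spatially precise shallow feature map with an upsampled, more invariant deep feature map into a single per-pixel descriptor for matching. The shapes, nearest-neighbor upsampling, and weighting below are our own illustrative assumptions, not the paper's architecture:

    import numpy as np

    def l2n(f, eps=1e-8):
        # L2-normalize descriptors along the channel axis.
        return f / (np.linalg.norm(f, axis=-1, keepdims=True) + eps)

    def fuse(shallow, deep, w_deep=0.5):
        # Upsample the coarse deep map to the shallow map's resolution
        # (nearest neighbor for brevity) and concatenate per-pixel descriptors.
        H, W, _ = shallow.shape
        h, w, _ = deep.shape
        ys = np.arange(H) * h // H
        xs = np.arange(W) * w // W
        deep_up = deep[ys][:, xs]
        return np.concatenate([l2n(shallow), w_deep * l2n(deep_up)], axis=-1)

    shallow = np.random.rand(64, 64, 32)
    deep = np.random.rand(16, 16, 128)
    desc = fuse(shallow, deep)  # (64, 64, 160) fused descriptors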

    Face Recognition from Weakly Labeled Data

    Recognizing the identity of a face or a person in the media usually requires large amounts of training data to design robust classifiers, which demands a great deal of human annotation effort. Alternatively, weakly labeled data is publicly available, but its labels can be ambiguous or noisy. For instance, names in the caption of a news photo provide possible candidates for the faces appearing in the image, and names in screenplays are only weakly associated with faces in the videos. Since weakly labeled data is not explicitly labeled by humans, robust learning methods that use it should suppress the impact of noisy instances or automatically resolve the ambiguities in the noisy labels.

    We propose a method for character identification in a TV series. The proposed method uses labels extracted automatically by associating faces with names in the transcripts. Such weakly labeled data often carries erroneous labels resulting from face detection and synchronization errors. Our approach achieves robustness to noisy labeling by utilizing several features: we construct track nodes from face and person tracks and exploit information from facial and clothing appearance. We discover the video structure for effective inference by constructing a minimum-distance spanning tree (MST) from the track nodes (see the sketch after this abstract), so that track nodes of similar appearance become adjacent to each other and are likely to share the same identity. The non-local cost aggregation step thus serves as a noise suppression step that reliably recognizes the identities of the characters in the video.

    Another type of weakly labeled data results from labeling ambiguities: a training sample can have more than one label, typically one of which is the true label. For instance, a news photo is usually accompanied by a caption, and the names provided in the caption can be used as candidate labels for the faces appearing in the photo. Learning an effective subject classifier from such ambiguously labeled data is called ambiguously labeled learning. We propose a matrix completion framework for predicting the actual labels from the ambiguously labeled instances, together with a standard supervised classifier that subsequently learns from the disambiguated labels to classify new data. We generalize this matrix completion framework to handle labeling imbalance, so as to avoid domination by frequent labels. In addition, an iterative candidate elimination step is integrated with the proposed approach to improve the ambiguity resolution.

    Recently, video-based face recognition techniques have received significant attention, since faces in a video provide diverse exemplars for constructing a robust representation of the target (i.e., the subject of interest). Nevertheless, the target face in the video is usually annotated with minimal human effort (i.e., a single bounding box in a video frame). Although face tracking techniques can be used to associate faces within a single video shot, they are ineffective for associating faces across multiple video shots. To fully utilize the faces of a target in multiple-shot videos, we propose a target face association (TFA) method to obtain a set of images of the target face; these associated images are then utilized to construct a robust representation of the target and improve the performance of video-based face recognition. One of the most important applications of video-based face recognition is outdoor video surveillance using a camera network.
    Face recognition in outdoor environments is a challenging task due to illumination changes, pose variations, and occlusions. We present a taxonomy of camera networks and discuss several techniques for continuous tracking of faces acquired by an outdoor camera network, as well as a face matching algorithm. Finally, we demonstrate a real-time video surveillance system that uses pan-tilt-zoom (PTZ) cameras to perform pedestrian tracking, localization, face detection, and face recognition.
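    The MST construction mentioned above can be sketched directly with scipy: build a complete graph over track nodes weighted by appearance distance, keep only the minimum spanning tree, and propagate identities along its edges. The toy descriptors below stand in for the face-plus-clothing appearance features:

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    # Toy appearance descriptors for 8 track nodes.
    rng = np.random.default_rng(2)
    tracks = rng.standard_normal((8, 32))

    # Dense pairwise distances -> MST; its nonzero entries are the kept edges.
    dists = squareform(pdist(tracks))
    mst = minimum_spanning_tree(dists)
    edges = np.transpose(mst.nonzero())
    print(edges)  # pairs of track nodes linked for label propagation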

    Exploiting Spatio-Temporal Coherence for Video Object Detection in Robotics

    This paper proposes a method to enhance video object detection for indoor environments in robotics. Concretely, it exploits knowledge about the camera motion between frames to propagate previously detected objects to successive frames. The proposal is rooted in the concepts of planar homography, used to propose regions of interest where objects are likely to be found, and recursive Bayesian filtering, used to integrate observations over time. The proposal is evaluated on six virtual indoor environments, covering the detection of nine object classes over a total of ∼7k frames. Results show that our proposal improves recall and F1-score by factors of 1.41 and 1.27, respectively, and achieves a significant reduction (58.8%) of the object categorization entropy when compared to a two-stage video object detection method used as a baseline, at the cost of a small time overhead (120 ms) and a slight loss of precision (0.92).
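    The two ingredients combine roughly as follows: a planar homography propagates the previous frame's detections into the current frame as region proposals, and a recursive Bayesian update fuses the detector's class scores over time. A minimal numpy sketch, assuming the inter-frame homography H is given (e.g., estimated from the known camera motion):

    import numpy as np

    def warp_box(H, box):
        # Propagate a bounding box (x1, y1, x2, y2) through homography H and
        # return the axis-aligned hull of the warped corners.
        x1, y1, x2, y2 = box
        pts = np.array([[x1, y1, 1], [x2, y1, 1], [x2, y2, 1], [x1, y2, 1]]).T
        q = H @ pts
        q = q[:2] / q[2]
        return q[0].min(), q[1].min(), q[0].max(), q[1].max()

    def bayes_update(prior, likelihood):
        # One step of recursive Bayesian filtering over class probabilities.
        post = prior * likelihood
        return post / post.sum()

    H = np.eye(3); H[0, 2] = 4.0           # toy homography: pure 4-pixel shift
    print(warp_box(H, (10, 10, 50, 40)))   # -> (14.0, 10.0, 54.0, 40.0)

    belief = np.full(9, 1 / 9)             # uniform prior over 9 classes
    scores = np.array([0.05] * 8 + [0.6])  # per-class detector scores
    belief = bayes_update(belief, scores)  # posterior after one frame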