
    Robust Face Recognition via Multimodal Deep Face Representation

    © 2015 IEEE. Face images appearing in multimedia applications, e.g., social networks and digital entertainment, usually exhibit dramatic pose, illumination, and expression variations, resulting in considerable performance degradation for traditional face recognition algorithms. This paper proposes a comprehensive deep learning framework to jointly learn face representation using multimodal information. The proposed deep learning structure is composed of a set of elaborately designed convolutional neural networks (CNNs) and a three-layer stacked auto-encoder (SAE). The set of CNNs extracts complementary facial features from multimodal data. The extracted features are then concatenated to form a high-dimensional feature vector, whose dimension is compressed by the SAE. All of the CNNs are trained using a subset of 9,000 subjects from the publicly available CASIA-WebFace database, which ensures the reproducibility of this work. Using the proposed single CNN architecture and limited training data, a 98.43% verification rate is achieved on the LFW database. Benefitting from the complementary information contained in multimodal data, our small ensemble system achieves a recognition rate higher than 99.0% on LFW using a publicly available training set.
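
The concatenate-then-compress step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the feature sizes, the layer widths, the sigmoid activation, and the random (untrained) weights are all hypothetical, and none of the function names come from the paper.

```python
import math
import random


def concatenate(features):
    """Join per-network feature vectors into one long vector."""
    return [value for vector in features for value in vector]


def dense_sigmoid(x, weights, biases):
    """One fully connected layer followed by a sigmoid non-linearity."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + b)))
        for row, b in zip(weights, biases)
    ]


def random_layer(n_in, n_out, rng):
    """Untrained placeholder weights; a real SAE would learn these."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def encode(x, layers):
    """Pass x through each encoder layer in turn."""
    for weights, biases in layers:
        x = dense_sigmoid(x, weights, biases)
    return x


rng = random.Random(0)
# Assumption: three networks, each emitting an 8-dimensional feature vector.
cnn_features = [[rng.random() for _ in range(8)] for _ in range(3)]
joint = concatenate(cnn_features)  # 3 * 8 = 24 values
# Assumption: three encoder layers shrinking 24 -> 16 -> 8 -> 4.
sizes = [24, 16, 8, 4]
layers = [random_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
compact = encode(joint, layers)
print(len(joint), len(compact))
```

The sketch only shows the data flow (several feature vectors in, one short code out); training the encoder and the networks themselves is outside its scope.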

    Information embedding and retrieval in 3D printed objects

    Deep learning and convolutional neural networks have become the main tools of computer vision. These techniques are good at using supervised learning to learn complex representations from data. In particular, under limited settings, image recognition models now perform better than the human baseline. However, computer vision science aims to build machines that can see. This requires models to extract more valuable information from images and videos than recognition alone. Generally, it is much more challenging to apply these deep learning models to computer vision problems other than recognition. This thesis presents end-to-end deep learning architectures for a new computer vision field: watermark retrieval from 3D printed objects. As it is a new area, there is no established state of the art on challenging benchmarks. Hence, we first define the problems and introduce a traditional approach, the Local Binary Pattern method, to set our baseline for further study. Our neural networks are straightforward but useful, and they outperform traditional approaches. Moreover, these networks generalize well. However, because our research field is new, the problems we face include not only various unpredictable parameters but also limited and low-quality training data. To address this, we make two observations: (i) we do not need to learn everything from scratch, since we know a lot about the image segmentation area, and (ii) we cannot learn everything from data, so our models should be aware of which key features they should learn. This thesis explores these ideas and goes beyond them. First, we show how to use end-to-end deep learning models to learn to retrieve watermark bumps and tackle covariates from a small number of training images. Second, we introduce ideas from synthetic image data and domain randomization to augment the training data and to understand the various covariates that may affect the retrieval of real-world 3D watermark bumps. We also show how illumination in synthetic image data affects, and can even improve, retrieval accuracy in real-world recognition applications.
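
For readers unfamiliar with the Local Binary Pattern baseline mentioned in this abstract, here is a minimal sketch of the basic 8-neighbour operator. The thesis does not say which LBP variant it uses, so the neighbour order, the `>=` comparison, and the 256-bin histogram are assumptions chosen for illustration.

```python
def lbp_code(image, row, col):
    """Basic 8-neighbour LBP code for the pixel at (row, col)."""
    center = image[row][col]
    # Neighbours visited clockwise, starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Histogram of LBP codes over all interior pixels (borders skipped)."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


patch = [
    [5, 9, 1],
    [4, 6, 7],
    [2, 3, 8],
]
print(lbp_code(patch, 1, 1))  # neighbours 9, 7 and 8 set bits 1, 3 and 4 -> 26
```

The histogram of such codes over a region is the usual texture descriptor that a classical classifier would then consume.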

    Multimodal Approaches to Computer Vision Problems

    The goal of computer vision research is to automatically extract high-level information from images and videos. The vast majority of this research focuses specifically on visible light imagery. In this dissertation, we present approaches to computer vision problems that incorporate data obtained from alternative modalities including thermal infrared imagery, near-infrared imagery, and text. We consider approaches where other modalities are used in place of visible imagery as well as approaches that use other modalities to improve the performance of traditional computer vision algorithms. The bulk of this dissertation focuses on Heterogeneous Face Recognition (HFR). HFR is a variant of face recognition where the probe and gallery face images are obtained with different sensing modalities. We also present a method to incorporate text information into human activity recognition algorithms. We first present a kernel task-driven coupled dictionary model to represent the data across multiple domains for thermal infrared HFR. We extend a linear coupled dictionary model to use the kernel method to process the signals in a high dimensional space; this effectively enables the dictionaries to represent the data non-linearly in the original feature space. We further improve the model by making the dictionaries task-driven. This allows us to tune the dictionaries to perform well on the classification task at hand rather than the standard reconstruction task. We show that our algorithms outperform algorithms based on standard coupled dictionaries on three datasets for thermal infrared to visible face recognition. Next, we present a deep learning-based approach to near-infrared (NIR) HFR. Most approaches to HFR involve modeling the relationship between corresponding images from the visible and sensing domains. Due to data constraints, this is typically done at the patch level and/or with shallow models to prevent overfitting. 
In this approach, rather than modeling local patches or using a simple model, we use a complex, deep model to learn the relationship between the entirety of cross-modal face images. We describe a deep convolutional neural network-based method that leverages a large visible image face dataset to prevent overfitting. We present experimental results on two benchmark data sets showing its effectiveness. Third, we present a model order selection algorithm for deep neural networks. In recent years, deep learning has emerged as a dominant methodology in machine learning. While it has been shown to produce state-of-the-art results for a variety of applications, one aspect of deep networks that has not been extensively researched is how to determine the optimal network structure. This problem is generally solved by ad hoc methods. In this work we address a sub-problem of this task: determining the breadth (number of nodes) of each layer. We show how to use group-sparsity-inducing regularization to automatically select these hyper-parameters. We demonstrate the proposed method by using it to reduce the size of networks while maintaining performance for our NIR HFR deep-learning algorithm. Additionally, we demonstrate the generality of our algorithm by applying it to image classification tasks. Finally, we present a method to improve activity recognition algorithms through the use of multitask learning and information extracted from a large text corpus. Current state-of-the-art deep learning approaches are limited by the size and scope of the data set they use to train the networks. We present a multitask learning approach to expand the training data set. Specifically, we train the neural networks to recognize objects in addition to activities. This allows us to expand our training set with large, publicly available object recognition data sets and thus use deeper, state-of-the-art network architectures. 
Additionally, when learning about the target activities, the algorithms are limited to the information contained in the training set. It is virtually impossible to capture all variations of the target activities in a training set. In this work, we extract information about the target activities from a large text corpus. We incorporate this information into the training algorithm by using it to select relevant object recognition classes for the multitask learning approach. We present experimental results on a benchmark activity recognition data set showing the effectiveness of our approach.
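
The idea of using group-sparsity-inducing regularization to choose the number of nodes per layer can be illustrated with a group-lasso penalty. The sketch below is not the dissertation's algorithm: treating each node's weights as one group (one row per node), the block soft-thresholding step, and all names and numbers are assumptions made for illustration.

```python
import math


def group_lasso_penalty(weights, strength):
    """Sum of L2 norms, one group per node (assumed: one row per node)."""
    return strength * sum(math.sqrt(sum(w * w for w in row)) for row in weights)


def shrink_groups(weights, threshold):
    """Block soft-thresholding: the proximal step for the penalty above."""
    shrunk = []
    for row in weights:
        norm = math.sqrt(sum(w * w for w in row))
        # Groups whose norm is at most the threshold collapse to zero.
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        shrunk.append([scale * w for w in row])
    return shrunk


def surviving_nodes(weights):
    """Indices of nodes that still have at least one non-zero weight."""
    return [i for i, row in enumerate(weights) if any(w != 0.0 for w in row)]


layer = [
    [3.0, 4.0],  # norm 5.0
    [0.3, 0.4],  # norm 0.5, small enough to be removed
    [0.0, 2.0],  # norm 2.0
]
penalty = group_lasso_penalty(layer, 0.1)
pruned = shrink_groups(layer, 1.0)
print(surviving_nodes(pruned))
```

Because the penalty acts on whole groups, a node is either kept or zeroed out entirely, which is what lets the surviving-node count act as a learned layer width.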

    Matching software-generated sketches to face photographs with a very deep CNN, morphed faces, and transfer learning

    Sketches obtained from eyewitness descriptions of criminals have proven to be useful in apprehending criminals, particularly when there is a lack of evidence. Automated methods to identify subjects depicted in sketches have been proposed in the literature, but their performance is still unsatisfactory when using software-generated sketches and when tested using extensive galleries with a large number of subjects. Despite the success of deep learning in several applications including face recognition, little work has been done in applying it to face photograph-sketch recognition. This is mainly a consequence of the need to ensure robust training of deep networks by using a large number of images, yet limited quantities are publicly available. Moreover, most algorithms have not been designed to operate on software-generated face composite sketches, which are used by numerous law enforcement agencies worldwide. This paper aims to tackle these issues with the following contributions: 1) a very deep convolutional neural network is utilised to determine the identity of a subject in a composite sketch by comparing it to face photographs and is trained by applying transfer learning to a state-of-the-art model pretrained for face photograph recognition; 2) a 3-D morphable model is used to synthesise both photographs and sketches to augment the available training data, an approach that is shown to significantly aid performance; and 3) the UoM-SGFS database is extended to contain twice the number of subjects, now having 1200 sketches of 600 subjects. An extensive evaluation of popular and state-of-the-art algorithms is also performed due to the lack of such information in the literature, where it is demonstrated that the proposed approach comprehensively outperforms state-of-the-art methods on all publicly available composite sketch datasets. Peer-reviewed.

    Facial Landmark Feature Fusion in Transfer Learning of Child Facial Expressions

    Automatic classification of child facial expressions is challenging due to the scarcity of image samples with annotations. Transfer learning of deep convolutional neural networks (CNNs), pretrained on adult facial expressions, can be effectively finetuned for child facial expression classification using limited facial images of children. Recent work inspired by facial age estimation and age-invariant face recognition proposes a fusion of facial landmark features with deep representation learning to augment facial expression classification performance. We hypothesize that deep transfer learning of child facial expressions may also benefit from fusing facial landmark features. Our proposed model architecture integrates two input branches: a CNN branch for image feature extraction and a fully connected branch for processing landmark-based features. The model-derived features of these two branches are concatenated into a latent feature vector for downstream expression classification. The architecture is trained on an adult facial expression classification task. Then, the trained model is finetuned to perform child facial expression classification. The combined feature fusion and transfer learning approach is compared against multiple models: training on adult expressions only (adult baseline), child expressions only (child baseline), and transfer learning from adult to child data. We also evaluate the effect of feature fusion without transfer learning on classification performance. Training on child data, we find that feature fusion improves the 10-fold cross-validation mean accuracy from 80.32% to 83.72% with similar variance. The proposed fine-tuning with landmark feature fusion of child expressions yields the best mean accuracy of 85.14%, a more than 30% improvement over the adult baseline and nearly 5% improvement over the child baseline.
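
The 10-fold cross-validation figures quoted in this abstract are a mean and a variance over per-fold accuracies. A minimal sketch of that bookkeeping follows; the contiguous fold split, the sample count, and the `dummy_evaluate` placeholder are assumptions, and a real study would typically shuffle the data and train an actual model inside each fold.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, near-equal test folds."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n_samples, k, evaluate):
    """Return the mean and population variance of the per-fold scores."""
    scores = []
    for test_idx in k_fold_indices(n_samples, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        scores.append(evaluate(train_idx, test_idx))
    return statistics.mean(scores), statistics.pvariance(scores)


def dummy_evaluate(train_idx, test_idx):
    # Hypothetical stand-in for "train on train_idx, score on test_idx":
    # it simply reports the share of even indices in the test fold.
    return sum(1 for i in test_idx if i % 2 == 0) / len(test_idx)


mean_acc, var_acc = cross_validate(25, 10, dummy_evaluate)
print(round(mean_acc, 4), round(var_acc, 4))
```

Comparing two models on the same folds, as the abstract does for the fused and non-fused variants, then amounts to comparing these two summary numbers.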

    Facial Expression Recognition from World Wild Web

    Recognizing facial expressions in a wild setting has remained a challenging task in computer vision. The World Wide Web is a good source of facial images, most of which are captured in uncontrolled conditions. In fact, the Internet is a World Wild Web of facial images with expressions. This paper presents the results of a new study on collecting, annotating, and analyzing wild facial expressions from the web. Three search engines were queried using 1250 emotion-related keywords in six different languages, and the retrieved images were mapped by two annotators to six basic expressions and neutral. Deep neural networks and noise modeling were used in three different training scenarios to find how accurately facial expressions can be recognized when trained on noisy images collected from the web using query terms (e.g., happy face, laughing man, etc.). The results of our experiments show that deep neural networks can recognize wild facial expressions with an accuracy of 82.12%.