711 research outputs found
Expanded Parts Model for Semantic Description of Humans in Still Images
We introduce an Expanded Parts Model (EPM) for recognizing human attributes
(e.g. young, short hair, wearing suit) and actions (e.g. running, jumping) in
still images. An EPM is a collection of part templates which are learnt
discriminatively to explain specific scale-space regions in the images (in
human centric coordinates). This is in contrast to current models which consist
of a relatively few (i.e. a mixture of) 'average' templates. EPM uses only a
subset of the parts to score an image and scores the image sparsely in space,
i.e. it ignores redundant and random background in an image. To learn our
model, we propose an algorithm which automatically mines parts and learns
corresponding discriminative templates together with their respective locations
from a large number of candidate parts. We validate our method on three recent
challenging datasets of human attributes and actions. We obtain convincing
qualitative and state-of-the-art quantitative results on the three datasets.Comment: Accepted for publication in IEEE Transactions on Pattern Analysis and
Machine Intelligence (TPAMI
Sparse and Deep Representations for Face Recognition and Object Detection
Face recognition and object detection are two very fundamental visual recognition applications in computer vision. How to learn “good” feature representations using machine learning has become the cornerstone of perception-based systems. A good feature representation is often the one that is robust and discriminative to multiple instances of the same category. Starting from features such as intensity, histogram etc. in the image, followed by hand-crafted features, to the most recent sophisticated deep feature representations, we have witnessed the remarkable improvement in the ability of a feature learning algorithm to perform pattern recognition tasks such as face recognition and object detection. One of the conventional feature learning methods, dictionary learning has been proposed to learn discriminative and sparse representations for visual recognition. These dictionary learning methods can learn both representative and discriminative dictionaries, and the associated sparse representations are effective for vision tasks such as face recognition. More recently, deep features have been widely adopted by the computer vision community owing to the powerful deep neural network, which is capable of distilling information from high dimensional input spaces to a low dimensional semantic space. The research problems which comprise this dissertation lie at the cross section of conventional feature and deep feature learning approaches. Thus, in this dissertation, we study both sparse and deep representations for face recognition and object detection.
First, we begin by studying the topic of spare representations. We present a simple thresholded feature learning algorithm under sparse support recovery. We show that under certain conditions, the thresholded feature exactly recovers the nonzero support of the sparse code. Secondly, based on the theoretical guarantees, we derive the model and algorithm named Dictionary Learning for Thresholded Features (DLTF), to learn the dictionary that is optimized for the thresholded feature. The DLTF dictionaries are specifically designed for using the thresholded feature at inference, which prioritize simplicity, efficiency, general usability and theoretical guarantees. Both synthetic simulations and real-data experiments (i.e. image clustering and unsupervised hashing) verify the competitive quantitative results and remarkable efficiency of applying thresholded features with DLTF dictionaries.
Continuing our focus on investigating the sparse representation and its application to computer vision tasks, we address the sparse representations for unconstrained face verification/recognition problem. In the first part, we address the video-based face recognition problem since it brings more challenges due to the fact that the videos are often acquired under significant variations in poses, expressions, lighting conditions and backgrounds. In order to extract representations that are robust to these variations, we propose a structured dictionary learning framework. Specifically, we employ dictionary learning and low-rank approximation methods to preserve the invariant structure of face images in videos. The learned structured dictionary is both discriminative and reconstructive. We demonstrate the effectiveness of our approach through extensive experiments on three video-based face recognition datasets.
Recently, template-based face verification has gained more popularity. Unlike traditional verification tasks, which evaluate on image-to-image or video-to-video pairs, template-based face verification/recognition methods can exploit training and/or gallery data containing a mixture of both images or videos from the person of interest. In the second part, we propose a regularized sparse coding approach for template-based face verification. First, we construct a reference dictionary, which represents the training set. Then we learn the discriminative sparse codes of the templates for verification through the proposed template regularized sparse coding approach. Finally, we measure the similarity between templates.
However, in real world scenarios, training and test data are sampled from different distributions. Therefore, we also extend the dictionary learning techniques to tackle the domain adaptation problem, where the data from the training set (source domain) and test set (target domain) have different underlying distributions (domain shift). We propose a domain-adaptive dictionary learning framework to model the domain shift by generating a set of intermediate domains. These intermediate domains bridge the gap between the source and target domains. Specifically, we not only learn a common dictionary to encode the domain-shared features but also learn a set of domain specific dictionaries to model the domain shift. This separation enables us to learn more compact and reconstructive dictionaries for domain adaptation. The domain-adaptive features for recognition are finally derived by aligning all the recovered feature representations of both source and target along the domain path. We evaluate our approach on both cross-domain face recognition and object classification tasks.
Finally, we study another fundamental problem in computer vision: generic object detection. Object detection has become one of the most valuable pattern recognition tasks, with great benefits in scene understanding, face recognition, action recognition, robotics and self-driving vehicles, etc. We propose a novel object detector named "Deep Regionlets" by blending deep learning and the traditional regionlet method. The proposed framework "Deep Regionlets" is able to address the limitations of traditional regionlet methods, leading to significant precision improvement by exploiting the power of deep convolutional neural networks. Furthermore, we conduct a detailed analysis of our approach to understand its merits and properties. Extensive experiments on two detection benchmark datasets show that the proposed deep regionlet approach outperforms several state-of-the-art competitors
Regularized Robust Coding for Face Recognition
Recently the sparse representation based classification (SRC) has been
proposed for robust face recognition (FR). In SRC, the testing image is coded
as a sparse linear combination of the training samples, and the representation
fidelity is measured by the l2-norm or l1-norm of the coding residual. Such a
sparse coding model assumes that the coding residual follows Gaussian or
Laplacian distribution, which may not be effective enough to describe the
coding residual in practical FR systems. Meanwhile, the sparsity constraint on
the coding coefficients makes SRC's computational cost very high. In this
paper, we propose a new face coding model, namely regularized robust coding
(RRC), which could robustly regress a given signal with regularized regression
coefficients. By assuming that the coding residual and the coding coefficient
are respectively independent and identically distributed, the RRC seeks for a
maximum a posterior solution of the coding problem. An iteratively reweighted
regularized robust coding (IR3C) algorithm is proposed to solve the RRC model
efficiently. Extensive experiments on representative face databases demonstrate
that the RRC is much more effective and efficient than state-of-the-art sparse
representation based methods in dealing with face occlusion, corruption,
lighting and expression changes, etc
Template Adaptation for Face Verification and Identification
Face recognition performance evaluation has traditionally focused on
one-to-one verification, popularized by the Labeled Faces in the Wild dataset
for imagery and the YouTubeFaces dataset for videos. In contrast, the newly
released IJB-A face recognition dataset unifies evaluation of one-to-many face
identification with one-to-one face verification over templates, or sets of
imagery and videos for a subject. In this paper, we study the problem of
template adaptation, a form of transfer learning to the set of media in a
template. Extensive performance evaluations on IJB-A show a surprising result,
that perhaps the simplest method of template adaptation, combining deep
convolutional network features with template specific linear SVMs, outperforms
the state-of-the-art by a wide margin. We study the effects of template size,
negative set construction and classifier fusion on performance, then compare
template adaptation to convolutional networks with metric learning, 2D and 3D
alignment. Our unexpected conclusion is that these other methods, when combined
with template adaptation, all achieve nearly the same top performance on IJB-A
for template-based face verification and identification
Face Identification and Clustering
In this thesis, we study two problems based on clustering algorithms. In the
first problem, we study the role of visual attributes using an agglomerative
clustering algorithm to whittle down the search area where the number of
classes is high to improve the performance of clustering. We observe that as we
add more attributes, the clustering performance increases overall. In the
second problem, we study the role of clustering in aggregating templates in a
1:N open set protocol using multi-shot video as a probe. We observe that by
increasing the number of clusters, the performance increases with respect to
the baseline and reaches a peak, after which increasing the number of clusters
causes the performance to degrade. Experiments are conducted using recently
introduced unconstrained IARPA Janus IJB-A, CS2, and CS3 face recognition
datasets
- …