Robust face recognition
University of Technology Sydney, Faculty of Engineering and Information Technology.
Face recognition is one of the most important and promising biometric techniques. In face recognition, a similarity score is automatically computed between face images to decide whether they belong to the same identity. Due to its non-invasive nature and ease of use, face recognition has shown great potential in many real-world applications, e.g., video surveillance, access control systems, forensics and security, and social networks. This thesis addresses key challenges inherent in real-world face recognition systems, including pose and illumination variations, occlusion, and image blur. To tackle these challenges, a series of robust face recognition algorithms are proposed. These can be summarized as follows:
In Chapter 2, we present a novel, manually designed face image descriptor named “Dual-Cross Patterns” (DCP). DCP efficiently encodes the second-order statistics of facial textures in the most informative directions within a face image. It proves to be more descriptive and discriminative than previous descriptors. We further extend DCP into a comprehensive face representation scheme named “Multi-Directional Multi-Level Dual-Cross Patterns” (MDML-DCPs). MDML-DCPs efficiently encodes the invariant characteristics of a face image from multiple levels into patterns that are highly discriminative of inter-personal differences yet robust to intra-personal variations. MDML-DCPs achieves the best performance on the challenging FERET, FRGC 2.0, CAS-PEAL-R1, and LFW databases.
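To make the encoding concrete, below is a minimal sketch of a DCP-style code map. It follows the two-bit-per-direction scheme (comparing an inner and an outer sample along each of eight directions against the centre pixel), but the radii are illustrative assumptions, and the landmark-aligned sampling and regional histogram pooling of the full descriptor are omitted.

```python
import numpy as np

def dcp_code_maps(img, r_in=4, r_ex=6):
    """Simplified Dual-Cross Pattern sketch (radii are illustrative).
    For each pixel O and direction i, sample an inner point A_i and an
    outer point B_i and encode two bits:
        DCP_i = S(I(B_i) - I(A_i)) * 2 + S(I(A_i) - I(O)),
    where S is the unit step function. Directions {0,2,4,6} and
    {1,3,5,7} are fused into two 8-bit code maps."""
    h, w = img.shape
    p = np.pad(img.astype(np.int32), r_ex, mode="edge")
    ys, xs = np.mgrid[0:h, 0:w]
    ys, xs = ys + r_ex, xs + r_ex
    center = p[ys, xs]
    codes = []
    for k in range(8):
        dy, dx = np.sin(k * np.pi / 4), np.cos(k * np.pi / 4)
        a = p[ys + int(round(r_in * dy)), xs + int(round(r_in * dx))]
        b = p[ys + int(round(r_ex * dy)), xs + int(round(r_ex * dx))]
        codes.append(((b >= a).astype(np.int32) << 1) | (a >= center).astype(np.int32))
    dcp1 = sum(codes[i] << (2 * j) for j, i in enumerate([0, 2, 4, 6]))
    dcp2 = sum(codes[i] << (2 * j) for j, i in enumerate([1, 3, 5, 7]))
    return dcp1, dcp2  # histogram these per region to build the final feature
```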
In Chapter 3, we develop a deep learning-based face image descriptor named “Multimodal Deep Face Representation” (MM-DFR) to automatically learn face representations from multimodal image data. In brief, convolutional neural networks (CNNs) are designed to extract complementary information from the original holistic face image, the frontal pose image rendered by 3D modeling, and uniformly sampled image patches. The recognition ability of each CNN is optimized by carefully integrating a number of published or newly developed techniques. A feature-level fusion approach using stacked auto-encoders is designed to fuse the features extracted by the set of CNNs, which is advantageous for non-linear dimensionality reduction. MM-DFR achieves a recognition rate of over 99% on LFW using publicly available training data.
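As a rough illustration of the fusion stage, the sketch below compresses concatenated per-CNN features with a two-layer stacked auto-encoder; the layer widths and activations are assumptions made for illustration, not the values used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAEFusion(nn.Module):
    """Feature-level fusion by non-linear dimension reduction.
    Widths (4096 -> 1024 -> 512) are illustrative assumptions."""
    def __init__(self, in_dim=4096, mid_dim=1024, out_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, mid_dim), nn.Sigmoid(),
            nn.Linear(mid_dim, out_dim), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(out_dim, mid_dim), nn.Sigmoid(),
            nn.Linear(mid_dim, in_dim),
        )

    def forward(self, cnn_features):
        # cnn_features: list of per-CNN embeddings, each of shape (batch, d_i)
        x = torch.cat(cnn_features, dim=1)   # concatenate multimodal features
        z = self.encoder(x)                  # compact fused representation
        recon = self.decoder(z)
        return z, F.mse_loss(recon, x)       # reconstruction loss for pre-training
```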
In Chapter 4, based on our research on handcrafted face image descriptors, we propose a powerful pose-invariant face recognition (PIFR) framework capable of handling the full range of pose variations within ±90° of yaw. The framework has two parts: the first is Patch-based Partial Representation (PBPR), and the second is Multi-task Feature Transformation Learning (MtFTL). PBPR transforms the original PIFR problem into a partial frontal face recognition problem. A robust patch-based face representation scheme is developed to represent the synthesized partial frontal faces. For each patch, a transformation dictionary is learnt under the MtFTL scheme. The transformation dictionary transforms the features of different poses into a discriminative subspace in which face matching is performed. The PBPR-MtFTL framework outperforms previous state-of-the-art PIFR methods on the FERET, CMU-PIE, and Multi-PIE databases.
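The matching step can be pictured with the sketch below, a hypothetical simplification in which `W` maps a pose label to a learnt projection for one patch, and the frontal gallery features are assumed to already lie in the shared discriminative subspace.

```python
import numpy as np

def cross_pose_scores(probe_feat, probe_pose, gallery_feats, W):
    """probe_feat: (d,) patch feature of a non-frontal probe;
    gallery_feats: (n, d') frontal gallery patch features;
    W: dict pose -> (d', d) learnt transformation (an assumption for
    illustration; MtFTL learns these jointly across poses)."""
    x = W[probe_pose] @ probe_feat                       # project into shared subspace
    x = x / np.linalg.norm(x)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    return g @ x                                         # cosine similarity per subject
```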
In Chapter 5, based on our research on deep learning-based face image descriptors, we design a novel framework named Trunk-Branch Ensemble CNN (TBE-CNN) to handle challenges in video-based face recognition (VFR) under surveillance circumstances. Three major challenges are considered: image blur, occlusion, and pose variation. First, to learn blur-robust face representations, we artificially blur training data composed of clear still images to account for a shortfall in real-world video training data. Second, to enhance the robustness of CNN features to pose variations and occlusion, we propose the TBE-CNN architecture, which efficiently extracts complementary information from holistic face images and patches cropped around facial components. Third, to further promote the discriminative power of the representations learnt by TBE-CNN, we propose an improved triplet loss function. With the proposed techniques, TBE-CNN achieves state-of-the-art performance on three popular video face databases: PaSC, COX Face, and YouTube Faces.
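The blur augmentation in the first step can be sketched as follows; the kernel sizes, the Gaussian/linear-motion mix, and the 50/50 choice are illustrative assumptions rather than the exact settings used for TBE-CNN.

```python
import numpy as np
import cv2

def random_blur(img, max_ksize=9):
    """Degrade a clear still image with random Gaussian or linear motion
    blur so training inputs resemble surveillance video frames."""
    k = int(np.random.choice(np.arange(3, max_ksize + 1, 2)))  # odd kernel size
    if np.random.rand() < 0.5:
        return cv2.GaussianBlur(img, (k, k), 0)
    kernel = np.zeros((k, k), np.float32)
    kernel[k // 2, :] = 1.0                     # horizontal motion-blur line
    angle = np.random.uniform(0.0, 180.0)       # random motion direction
    rot = cv2.getRotationMatrix2D(((k - 1) / 2, (k - 1) / 2), angle, 1.0)
    kernel = cv2.warpAffine(kernel, rot, (k, k))
    kernel /= max(kernel.sum(), 1e-6)           # normalize to preserve brightness
    return cv2.filter2D(img, -1, kernel)
```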
CPP-Net: Context-aware Polygon Proposal Network for Nucleus Segmentation
Nucleus segmentation is a challenging task due to the crowded distribution
and blurry boundaries of nuclei. Recent approaches represent nuclei by means of
polygons to differentiate between touching and overlapping nuclei and have
accordingly achieved promising performance. Each polygon is represented by a
set of centroid-to-boundary distances, which are in turn predicted by features
of the centroid pixel for a single nucleus. However, using the centroid pixel
alone does not provide sufficient contextual information for robust prediction.
To handle this problem, we propose a Context-aware Polygon Proposal Network
(CPP-Net) for nucleus segmentation. First, we sample a point set rather than
one single pixel within each cell for distance prediction. This strategy
substantially enhances contextual information and thereby improves the
robustness of the prediction. Second, we propose a Confidence-based Weighting
Module, which adaptively fuses the predictions from the sampled point set.
Third, we introduce a novel Shape-Aware Perceptual (SAP) loss that constrains
the shape of the predicted polygons. Here, the SAP loss is based on an
additional network that is pre-trained by means of mapping the centroid
probability map and the pixel-to-boundary distance maps to a different nucleus
representation. Extensive experiments justify the effectiveness of each
component in the proposed CPP-Net. Finally, CPP-Net is found to achieve
state-of-the-art performance on three publicly available databases, namely
DSB2018, BBBC06, and PanNuke. The code for this paper will be released.
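A minimal sketch of the confidence-based fusion is given below; the tensor shapes and the softmax-over-points normalization are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ConfidenceWeighting(nn.Module):
    """Fuse centroid-to-boundary distance predictions from a sampled
    point set. dists: (B, K, R, H, W) distances predicted at each of K
    sampled points for each of R rays; conf: matching confidence logits
    (shapes are illustrative assumptions)."""
    def forward(self, dists, conf):
        weights = torch.softmax(conf, dim=1)   # normalize over the K sampled points
        return (weights * dists).sum(dim=1)    # fused distances: (B, R, H, W)
```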
DARC: Distribution-Aware Re-Coloring Model for Generalizable Nucleus Segmentation
Nucleus segmentation is usually the first step in pathological image analysis
tasks. Generalizable nucleus segmentation refers to the problem of training a
segmentation model that is robust to domain gaps between the source and target
domains. The domain gaps are usually believed to be caused by the varied image
acquisition conditions, e.g., different scanners, tissues, or staining
protocols. In this paper, we argue that domain gaps can also be caused by
different foreground (nucleus)-background ratios, as this ratio significantly
affects feature statistics that are critical to normalization layers. We
propose a Distribution-Aware Re-Coloring (DARC) model that handles the above
challenges from two perspectives. First, we introduce a re-coloring method that
alleviates the dramatic image color variations between different domains. Second, we
propose a new instance normalization method that is robust to the variation in
foreground-background ratios. We evaluate the proposed methods on two
H&E-stained image datasets, CoNSeP and CPM17, and two IHC-stained image
datasets, DeepLIIF and BC-DeepLIIF. Extensive experimental results
justify the effectiveness of the proposed DARC model. Code is available at
\url{https://github.com/csccsccsccsc/DARC}.
Comment: Accepted by MICCAI 2023
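One way to picture a normalization that is robust to the foreground-background ratio is to compute instance statistics over foreground pixels only, as sketched below; this is an illustrative simplification, not the exact DARC formulation.

```python
import torch

def masked_instance_norm(x, fg_mask, eps=1e-5):
    """x: (B, C, H, W) features; fg_mask: (B, 1, H, W) in {0, 1}.
    Restricting the mean/variance to foreground pixels removes their
    dependence on how much of the image the nuclei occupy."""
    n = fg_mask.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
    mean = (x * fg_mask).sum(dim=(2, 3), keepdim=True) / n
    var = (((x - mean) ** 2) * fg_mask).sum(dim=(2, 3), keepdim=True) / n
    return (x - mean) / (var + eps).sqrt()
```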
Category-Level 6D Object Pose and Size Estimation using Self-Supervised Deep Prior Deformation Networks
It is difficult to precisely annotate object instances and their semantics in
3D space, and as such, synthetic data are extensively used for these tasks,
e.g., category-level 6D object pose and size estimation. However, the ease of
annotation in synthetic domains comes at the cost of a synthetic-to-real
(Sim2Real) domain gap. In this work, we address this issue in the setting of
Sim2Real unsupervised domain adaptation for category-level 6D object pose and
size estimation. We propose a method built upon a novel
Deep Prior Deformation Network, shortened as DPDN. DPDN learns to deform
features of categorical shape priors to match those of object observations, and
is thus able to establish deep correspondence in the feature space for direct
regression of object poses and sizes. To reduce the Sim2Real domain gap, we
formulate a novel self-supervised objective upon DPDN via consistency learning.
More specifically, we apply two rigid transformations to each object
observation in parallel and feed them into DPDN to yield dual sets of
predictions. On top of this parallel learning, an inter-consistency term
enforces cross consistency between the dual predictions, improving the
sensitivity of DPDN to pose changes, while individual intra-consistency terms
enforce self-adaptation within each branch. We train DPDN on the training sets
of both the synthetic CAMERA25 and the real-world REAL275 datasets; our results
outperform existing methods on the REAL275 test set under both the unsupervised
and supervised settings. Ablation studies also verify the efficacy of our
designs. Our code is released publicly at
https://github.com/JiehongLin/Self-DPDN.
Comment: Accepted by ECCV 2022
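The inter-consistency idea can be sketched as follows; `model` is assumed to map an observed point cloud directly to a predicted rigid pose (R, t), which is a simplification of DPDN's deformation-and-correspondence pipeline, and the intra-consistency terms are omitted.

```python
import torch

def inter_consistency(model, points, R1, t1, R2, t2):
    """points: (N, 3) object observation; (R1, t1), (R2, t2): two random
    rigid transformations. If the poses predicted for the two transformed
    observations are correct, they must differ exactly by the known
    relative transform R12 = R2 @ R1.T."""
    p1 = points @ R1.T + t1    # first transformed observation
    p2 = points @ R2.T + t2    # second transformed observation
    Rp1, tp1 = model(p1)       # predicted pose for the first view
    Rp2, tp2 = model(p2)       # predicted pose for the second view
    R12 = R2 @ R1.T
    rot_term = torch.norm(R12 @ Rp1 - Rp2)
    trans_term = torch.norm(R12 @ (tp1 - t1) + t2 - tp2)
    return rot_term + trans_term
```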
Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection
Human-Object Interaction (HOI) detection is a core task for high-level image
understanding. Recently, Detection Transformer (DETR)-based HOI detectors have
become popular due to their superior performance and efficient structure.
However, these approaches typically adopt fixed HOI queries for all testing
images, which makes them vulnerable to changes in object locations within a
specific image. Accordingly, in this paper, we propose to enhance DETR's robustness by
mining hard-positive queries, which are forced to make correct predictions
using partial visual cues. First, we explicitly compose hard-positive queries
according to the ground-truth (GT) position of labeled human-object pairs for
each training image. Specifically, we shift the GT bounding boxes of each
labeled human-object pair so that the shifted boxes cover only a certain
portion of the GT ones. We encode the coordinates of the shifted boxes for each
labeled human-object pair into an HOI query. Second, we implicitly construct
another set of hard-positive queries by masking the top scores in
cross-attention maps of the decoder layers. The masked attention maps then
cover only part of the important cues for HOI prediction. Finally, an alternating
strategy is proposed that efficiently combines both types of hard-positive queries. In
each iteration, both DETR's learnable queries and one selected type of
hard-positive queries are adopted for loss computation. Experimental results
show that our proposed approach can be widely applied to existing DETR-based
HOI detectors. Moreover, we consistently achieve state-of-the-art performance
on three benchmarks: HICO-DET, V-COCO, and HOI-A. Code is available at
https://github.com/MuchHair/HQM.
Comment: Accepted by ECCV 2022
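The explicit hard-positive composition can be sketched as follows; the jitter bound is an illustrative assumption standing in for the paper's constraint that shifted boxes cover only a certain portion of the GT boxes.

```python
import torch

def shift_gt_boxes(boxes, max_ratio=0.3):
    """boxes: (N, 4) GT boxes as normalized (cx, cy, w, h). Each box is
    shifted by up to max_ratio of its own size so that it overlaps the
    original only partially; the shifted coordinates are then encoded
    into hard-positive HOI queries."""
    cx, cy, w, h = boxes.unbind(-1)
    dx = (torch.rand_like(cx) * 2 - 1) * max_ratio * w   # horizontal shift
    dy = (torch.rand_like(cy) * 2 - 1) * max_ratio * h   # vertical shift
    shifted = torch.stack([cx + dx, cy + dy, w, h], dim=-1)
    return shifted.clamp(0.0, 1.0)
```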