
    Deep face tracking and parsing in the wild

    Face analysis has been a long-standing research direction in the field of computer vision and pattern recognition. A complete face analysis system involves solving several tasks, including face detection, face tracking, face parsing and face recognition. Recently, performance on all of these tasks has improved significantly thanks to the adoption of Deep Convolutional Neural Networks (DCNNs). However, existing face analysis algorithms mainly focus on facial images captured in constrained laboratory environments, and their performance on real-world images remains less explored. Compared with the lab environment, in-the-wild settings involve far greater diversity in face sizes, poses, facial expressions, background clutter, lighting conditions and imaging quality. This thesis investigates two fundamental face analysis tasks under in-the-wild settings: face tracking and face parsing. Both serve as important prerequisites for downstream face analysis applications, yet in-the-wild datasets remain scarce in both fields and models have not been rigorously evaluated in such settings. In this thesis, we aim to bridge this data gap, evaluate existing methods in these settings, and develop accurate, robust and efficient deep learning-based methods for the two tasks.

    For face tracking in the wild, we introduce the first in-the-wild face tracking dataset, MobiFace, which consists of 80 videos captured by mobile phones during mobile live-streaming. The live-streaming environment is fully unconstrained, and the interactions between users and mobile phones are natural and spontaneous. Next, we evaluate existing tracking methods, including generic object trackers and dedicated face trackers. The results show that MobiFace presents unique challenges for face tracking in the wild that cannot be readily solved by existing methods. Finally, we present a DCNN-based framework, FT-RCNN, that significantly outperforms other methods on face tracking in the wild.

    For face parsing in the wild, we introduce the first large-scale in-the-wild face dataset, iBugMask, which contains 21,866 training images and 1,000 testing images. Unlike existing datasets, the images in iBugMask are captured in fully unconstrained environments and are neither cropped nor preprocessed in any way. Manually annotated per-pixel labels for eleven facial regions are provided for each target face. Next, we benchmark existing parsing methods, and the results show that iBugMask is extremely challenging for all of them. Through rigorous benchmarking, we observe that pre-processing facial images with bounding boxes introduces bias into in-the-wild face parsing: when cropping the face with a bounding box, a cropping margin has to be hand-picked, and if face alignment is used, fiducial landmarks are required and a predefined alignment template has to be selected. These additional hyper-parameters must be chosen carefully and can have a significant impact on face parsing performance. To address this, we propose the Region-of-Interest (RoI) Tanh-polar transform, which warps the whole image into a fixed-size representation. Moreover, the RoI Tanh-polar transform is differentiable and allows for rotation equivariance in DCNNs. We show that, when coupled with a simple Fully Convolutional Network, our RoI Tanh-polar Transformer Network achieves state-of-the-art results on face parsing in the wild.
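    The core of the RoI Tanh-polar idea admits a compact implementation on top of standard differentiable grid sampling. Below is a minimal PyTorch sketch, not the exact published mapping: the function name roi_tanh_polar_grid and the (cx, cy, w, h) RoI convention are assumptions for illustration. Each output row indexes a normalised radius and each column an angle, so an in-plane rotation of the face becomes a cyclic shift along the angle axis, which is the property that enables rotation equivariance; because grid construction and sampling are both differentiable, the warp can sit in front of an FCN and train end-to-end.

        import math
        import torch
        import torch.nn.functional as F

        def roi_tanh_polar_grid(roi, out_h, out_w, image_size):
            # Build a sampling grid that warps the WHOLE image into a fixed-size
            # tanh-polar representation centred on the target face's RoI.
            # roi: hypothetical (cx, cy, w, h) box in pixels; image_size: (H, W).
            H, W = image_size
            cx, cy, rw, rh = roi
            # Rows index normalised radius t in [0, 1); columns index angle theta.
            t = torch.linspace(0.0, 1.0 - 1e-3, out_h).view(-1, 1)   # stop short of atanh(1) = inf
            theta = torch.linspace(-math.pi, math.pi, out_w + 1)[:-1].view(1, -1)
            # Inverse tanh warp: the RoI keeps high resolution, while context
            # further from the face is logarithmically compressed.
            r = torch.atanh(t)
            xs = cx + r * (rw / 2.0) * torch.cos(theta)              # (out_h, out_w)
            ys = cy + r * (rh / 2.0) * torch.sin(theta)
            # grid_sample expects source coordinates normalised to [-1, 1].
            gx = 2.0 * xs / (W - 1) - 1.0
            gy = 2.0 * ys / (H - 1) - 1.0
            return torch.stack([gx, gy], dim=-1).unsqueeze(0)        # (1, out_h, out_w, 2)

        # Usage: warp an image so a plain FCN sees the face and its full
        # context at a fixed size, regardless of the original image size.
        image = torch.rand(1, 3, 512, 512)
        grid = roi_tanh_polar_grid((256.0, 256.0, 200.0, 260.0), 512, 512, (512, 512))
        warped = F.grid_sample(image, grid, align_corners=True)      # (1, 3, 512, 512)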
    This thesis contributes towards in-the-wild face tracking and face parsing by providing novel datasets and proposing effective frameworks. Both tasks can benefit real-world downstream applications such as facial age estimation, facial expression recognition and lip-reading. The proposed RoI Tanh-polar transform also offers a new perspective on how to preprocess face images and make DCNNs truly end-to-end for real-world face analysis applications.

    Dynamic face parsing in the wild

    Landmark-based facial descriptors are widely utilised in face video analysis to obtain useful facial dynamics. Such a sparse facial representation, however, cannot capture the full dynamics of individual facial components such as the eyes and mouth, yet these dynamics can be essential for recognising higher-level attributes such as facial expressions, emotions and identity. A dense facial descriptor, such as the segmentation masks generated by face parsing, can effectively overcome these limitations by providing pixel-wise semantic information that is generally more discriminative and more desirable for facial analysis tasks.

    Recently, Deep Convolutional Neural Networks (DCNNs) have made impressive progress in semantic image segmentation, the task of performing per-pixel classification. These deep segmentation models can naturally generate pixel-level predictions for facial images; however, face parsing in the wild remains challenging. A model's ability to accurately segment different facial regions is crucial for generating high-quality face masks. Moreover, adapting segmentation models designed for static images to the continuous setting of face videos requires careful consideration, and meeting real-time requirements in realistic scenarios demands acceleration.

    This thesis investigates different aspects of in-the-wild face parsing and proposes several novel approaches to constructing robust face segmentation masks. To increase the robustness of eye segmentation in low-quality video scenarios, we encode eye shape priors into the training procedure of the deep segmentation model. Additionally, we enhance the segmentation model's sensitivity to semantic facial contours by introducing Dilated Convolutions with Lateral Inhibitions, a convolutional operator biologically inspired by the human visual system. To exploit information from both the temporal and spatial domains of face videos, we propose a ConvLSTM-FCN model that generates temporally smoothed face segmentation masks that are more tolerant of video variations. Finally, we accelerate dynamic face parsing by using Reinforcement Learning to learn a globally optimised key-frame scheduler.

    This thesis contributes towards in-the-wild face parsing on several fronts, from improving the fundamental network architectures to optimising performance in realistic scenarios. It can benefit downstream tasks that require detailed facial dynamics, such as facial expression recognition and lip-reading, and may also inspire future work on semantic image/video segmentation and on face parsing with higher visual quality or better efficiency.
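    As a concrete illustration of the temporal component, the following is a minimal PyTorch sketch of a ConvLSTM-FCN under stated assumptions: the two-layer encoder is a toy stand-in for whatever per-frame backbone the thesis actually uses, and the class names and the eleven-class output are illustrative. A ConvLSTM replaces the fully connected gates of a standard LSTM with convolutions, so the recurrent state keeps its spatial layout and can smooth per-pixel predictions across frames.

        import torch
        import torch.nn as nn

        class ConvLSTMCell(nn.Module):
            # An LSTM cell whose gates are 2-D convolutions, so the hidden
            # and cell states are feature maps rather than flat vectors.
            def __init__(self, in_ch, hid_ch, k=3):
                super().__init__()
                self.hid_ch = hid_ch
                self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

            def forward(self, x, state):
                h, c = state
                i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
                i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
                c = f * c + i * torch.tanh(g)
                h = o * torch.tanh(c)
                return h, c

        class ConvLSTMFCN(nn.Module):
            # Per-frame encoder -> ConvLSTM temporal smoothing -> 1x1 classifier.
            def __init__(self, n_classes=11, feat_ch=64):
                super().__init__()
                self.encoder = nn.Sequential(          # toy stand-in backbone
                    nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
                self.convlstm = ConvLSTMCell(feat_ch, feat_ch)
                self.classifier = nn.Conv2d(feat_ch, n_classes, 1)

            def forward(self, clip):                   # clip: (B, T, 3, H, W)
                B, T, _, H, W = clip.shape
                h = clip.new_zeros(B, self.convlstm.hid_ch, H, W)
                c = torch.zeros_like(h)
                masks = []
                for t in range(T):
                    feat = self.encoder(clip[:, t])    # per-frame features
                    h, c = self.convlstm(feat, (h, c)) # carry state across frames
                    masks.append(self.classifier(h))
                return torch.stack(masks, dim=1)       # (B, T, n_classes, H, W)

        # Usage: a clip of 5 frames yields 5 temporally consistent masks.
        # ConvLSTMFCN()(torch.rand(2, 5, 3, 128, 128)) -> (2, 5, 11, 128, 128)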