Augmented Deep Representations for Unconstrained Still/Video-based Face Recognition
Face recognition is one of the active areas of research in computer vision and biometrics. Many approaches have been proposed in the literature that demonstrate impressive performance, especially those based on deep learning. However, unconstrained face recognition with large pose, illumination, occlusion and other variations is still an unsolved problem. Unconstrained video-based face recognition is even more challenging due to the large volume of data to be processed, lack of labeled training data and significant intra/inter-video variations on scene, blur, video quality, etc. Although Deep Convolutional Neural Networks (DCNNs) have provided discriminant representations for faces and achieved performance surpassing humans in controlled scenarios, modifications are necessary for face recognition in unconstrained conditions. In this dissertation, we propose several methods that improve unconstrained face recognition performance by augmenting the representation provided by the deep networks using correlation or contextual information in the data.
For unconstrained still face recognition, we present FV-DCNN, an encoding approach that combines Fisher vector (FV) encoding with DCNN representations. The feature maps from the last convolutional layer of the deep network are encoded by FV into a robust representation that exploits the correlation between facial parts within each face. A VLAD-based encoding method, VLAD-DCNN, is also proposed as an extension. Extensive evaluations on three challenging face recognition datasets show that the proposed FV-DCNN and VLAD-DCNN perform comparably to or better than many state-of-the-art face verification methods.
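As a rough illustration of the VLAD idea mentioned above, the sketch below encodes a set of local descriptors against a fixed codebook. It is a generic VLAD with hard assignment and a single global L2 normalization, not the dissertation's VLAD-DCNN: the function name `vlad_encode`, the toy codebook, and the normalization choice are all assumptions, and in practice the descriptors would come from the network's last convolutional feature maps and the centers from a clustering step such as k-means.

```python
import math

def vlad_encode(descriptors, centers):
    """Generic VLAD sketch (hypothetical helper, not the authors' code).

    Each descriptor is hard-assigned to its nearest center, the residuals
    (descriptor - center) are summed per center, and the concatenated
    result is L2-normalized.
    """
    dim = len(centers[0])
    residual_sums = [[0.0] * dim for _ in centers]
    for d in descriptors:
        # Nearest center by squared Euclidean distance.
        k = min(
            range(len(centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(d, centers[i])),
        )
        for j in range(dim):
            residual_sums[k][j] += d[j] - centers[k][j]
    flat = [v for row in residual_sums for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

# Toy data: two 2-D centers and three local descriptors (assumed values).
centers = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]]
code = vlad_encode(descriptors, centers)
```

The output length is the number of centers times the descriptor dimension, so images with different numbers of local descriptors map to vectors of the same size.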
For the more challenging video-based face recognition task, we first propose an automatic system and model the video-to-video similarity as subspace-to-subspace similarity, where the subspaces characterize the correlation between deep representations of faces in videos. In the system, a quality-aware subspace-to-subspace similarity is introduced, where subspaces are learned using quality-aware principal component analysis. Subspaces along with quality-aware exemplars of templates are used to produce the similarity scores between video pairs by a quality-aware principal angle-based subspace-to-subspace similarity metric. The method is evaluated on four video datasets. The experimental results demonstrate the superior performance of the proposed method.
To utilize the temporal information in videos, a hybrid dictionary learning method is also proposed for video-based face recognition. The proposed unsupervised approach effectively models the temporal correlation between deep representations of video faces using dynamical dictionaries. A practical iterative optimization algorithm is introduced to learn the dynamical dictionary. Experiments on three video-based face recognition datasets demonstrate that the proposed method can effectively learn robust and discriminative representations for videos and improve face recognition performance.
Finally, to leverage contextual information in videos, we present the Uncertainty-Gated Graph (UGG) for unconstrained video-based face recognition. It utilizes contextual information between faces by conducting graph-based identity propagation between sample tracklets, where identity information is initialized by the deep representations of video faces. UGG explicitly models the uncertainty of the contextual connections between tracklets by adaptively updating the weights of the edge gates according to the identity distributions of the nodes during inference. UGG is a generic graphical model that can be applied at inference time only or with end-to-end training. We demonstrate the effectiveness of UGG with state-of-the-art results on the recently released challenging Cast Search in Movies and IARPA Janus Surveillance Video Benchmark datasets.
Dynamic Switching State Systems for Visual Tracking
This work addresses the problem of how to capture the dynamics of maneuvering objects for visual tracking. To this end, the perspective of recursive Bayesian filters and the perspective of deep learning approaches for state estimation are considered, and their functional viewpoints are brought together.
Cluster and Aggregate: Face Recognition with Large Probe Set
Feature fusion plays a crucial role in unconstrained face recognition, where inputs (probes) consist of a set of low-quality images whose individual qualities vary. Advances in attention and recurrent modules have led to feature fusion that can model the relationship among the images in the input set. However, attention mechanisms cannot scale to large input sets due to their quadratic complexity, and recurrent modules suffer from input order sensitivity. We propose a two-stage feature fusion paradigm, Cluster and Aggregate, that can both scale to large input sets and maintain the ability to perform sequential inference with order invariance. Specifically, the Cluster stage is a linear assignment of inputs to global cluster centers, and the Aggregation stage is a fusion over the clustered features. The clustered features play an integral role when the inputs are sequential, as they can serve as a summarization of past features. By leveraging the order invariance of the incremental averaging operation, we design an update rule that achieves batch-order invariance, which guarantees that the contributions of early images in the sequence do not diminish as time steps increase. Experiments on the IJB-B and IJB-S benchmark datasets show the superiority of the proposed two-stage paradigm in unconstrained face recognition. Code and pretrained models are available at https://github.com/mk-minchul/caface
Comment: To appear in NeurIPS 202
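The order-invariance property of incremental averaging that the abstract relies on can be checked with a small sketch. This shows only a plain running mean over raw feature vectors; the paper's actual update rule over clustered features is not reproduced here, and the names `update_mean` and `fuse` as well as the toy features are hypothetical.

```python
def update_mean(mean, count, batch):
    """Fold one batch of feature vectors into a running mean."""
    if not batch:
        return mean, count
    new_count = count + len(batch)
    dim = len(mean)
    batch_sum = [sum(x[j] for x in batch) for j in range(dim)]
    # Re-weight the old mean by how many samples it already summarizes.
    new_mean = [(mean[j] * count + batch_sum[j]) / new_count for j in range(dim)]
    return new_mean, new_count

def fuse(batches, dim):
    """Process batches sequentially, keeping only the mean and a count."""
    mean, count = [0.0] * dim, 0
    for batch in batches:
        mean, count = update_mean(mean, count, batch)
    return mean

# Toy 2-D features (assumed values).
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
one_shot = fuse([features], 2)
streamed = fuse([features[:1], features[1:3], features[3:]], 2)
reordered = fuse([features[3:], features[:1], features[1:3]], 2)
```

All three results agree up to floating-point rounding, so an early image contributes as much as a late one regardless of how the sequence is split into batches.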
Improving Visual Recognition With Unlabeled Data
The success of deep neural networks has resulted in computer vision systems that obtain high accuracy on a wide variety of tasks such as image classification, object detection, semantic segmentation, etc. However, most state-of-the-art vision systems are dependent upon large amounts of labeled training data, which is not a scalable solution in the long run. This work focuses on improving existing models for visual object recognition and detection without being dependent on such large-scale human-annotated data. We first show how large numbers of hard examples (cases where an existing model makes a mistake) can be obtained automatically from unlabeled video sequences by exploiting temporal consistency cues in the output of a pre-trained object detector. These examples can strongly influence a model's parameters when the network is re-trained to correct them, resulting in improved performance on several object detection tasks. Further, such hard examples from unlabeled videos can be used to address the problem of unsupervised domain adaptation. We focus on the automatic adaptation of an existing object detector to a new domain with no labeled data, assuming that a large number of unlabeled videos are readily available. Our approach is evaluated on challenging face and pedestrian detection tasks involving large domain shifts, showing improved performance with minimal dependence on hyper-parameters. Finally, we address the problem of face recognition, which has achieved high accuracy by employing deep neural networks trained on massive labeled datasets. Further improvements through supervised learning require significantly larger datasets and hence massive annotation efforts. We improve upon the performance of face recognition models trained on large-scale labeled datasets by using unlabeled faces as additional training data.
We present insights and recipes for training deep face recognition models with labeled and unlabeled data at scale, addressing real-world challenges such as overlapping identities between the labeled and unlabeled datasets, as well as label noise introduced by clustering errors.
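One simple way to picture the temporal consistency cue described above is to look for frames where a detector's output disagrees with both neighboring frames. The sketch below is an assumed simplification, not the procedure from the thesis: it takes hypothetical per-frame confidence scores for one tracked image region, and the function name `find_hard_examples` and the 0.5 threshold are illustrative choices.

```python
def find_hard_examples(scores, threshold=0.5):
    """Flag frames whose detection status disagrees with both neighbors.

    Returns (hard_positives, hard_negatives) as lists of frame indices:
    a hard positive is a miss surrounded by detections (likely a false
    negative), a hard negative is a detection surrounded by misses
    (likely a false positive).
    """
    detected = [s >= threshold for s in scores]
    hard_pos, hard_neg = [], []
    for t in range(1, len(scores) - 1):
        prev_hit, hit, next_hit = detected[t - 1], detected[t], detected[t + 1]
        if not hit and prev_hit and next_hit:
            hard_pos.append(t)
        elif hit and not prev_hit and not next_hit:
            hard_neg.append(t)
    return hard_pos, hard_neg

# Assumed detector confidences for one tracked region over ten frames.
scores = [0.9, 0.8, 0.2, 0.85, 0.9, 0.1, 0.1, 0.7, 0.1, 0.1]
hard_pos, hard_neg = find_hard_examples(scores)
```

Frames flagged this way need no human labels, which is what makes the idea attractive for re-training on unlabeled video.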
Understanding Complex Human Behaviour in Images and Videos.
Understanding human motions and activities in images and videos is an important problem in many application domains, including surveillance, robotics, video indexing, and sports analysis. Although much progress has been made in classifying a single person's activities in simple videos, little effort has been made toward interpreting the behaviors of multiple people in natural videos. In this thesis, I present my research toward understanding the behaviors of multiple people in natural images and videos. I identify four major challenges in this problem: i) identifying individual properties of people in videos, ii) modeling and recognizing the behavior of multiple people, iii) understanding human activities at multiple levels of resolution, and iv) learning characteristic patterns of interactions between people, or between people and their surrounding environment. I discuss how we solve these challenging problems using various computer vision and machine learning technologies. I conclude with final remarks, observations, and possible future research directions.
PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/99956/1/wgchoi_1.pd
Towards accurate multi-person pose estimation in the wild
In this thesis we are concerned with the problem of articulated human pose estimation and pose tracking in images and video sequences. Human pose estimation is the task of localising the major joints of a human skeleton in natural images. It is one of the most important visual recognition tasks in scenes containing humans, with numerous applications in robotics, virtual and augmented reality, gaming and healthcare, among others. Articulated human pose tracking requires tracking multiple persons in a video sequence while simultaneously estimating full body poses. This task is important for analysing surveillance footage, activity recognition, sports analytics, etc. Most prior work focused on pose estimation of single pre-localised humans, whereas here we address the case of multiple people in real-world images, which entails several challenges such as person-person overlaps in highly crowded scenes, an unknown number of people, or people entering and leaving video sequences. The first contribution is a multi-person pose estimation algorithm based on the bottom-up detection-by-grouping paradigm. Unlike the widespread top-down approaches, our method detects body joints and pairwise relations between them in a single forward pass of a convolutional neural network. Multi-person parsing is performed by optimizing a joint objective based on a multicut graph partitioning framework. Secondly, we extend our pose estimation approach to articulated multi-person pose tracking in videos. Our approach performs multi-target tracking and pose estimation in a holistic manner by optimising a single objective. We further simplify and refine the formulation, which allows us to reach close to real-time performance. Thirdly, we propose a large-scale dataset and a benchmark for articulated multi-person tracking. It is the first dataset of video sequences comprising complex multi-person scenes and fully annotated tracks with 2D keypoints.
Our fourth contribution is a method for estimating 3D body pose using on-body wearable cameras. Our approach uses a pair of downward facing, head-mounted cameras and captures an entire body. This egocentric approach is free of limitations of traditional setups with external cameras and can estimate body poses in very crowded environments. Our final contribution goes beyond human pose estimation and is in the field of deep learning of 3D object shapes. In particular, we address the case of reconstructing 3D objects from weak supervision. Our approach represents objects as 3D point clouds and is able to learn them with 2D supervision only and without requiring camera pose information at training time. We design a differentiable renderer of point clouds as well as a novel loss formulation for dealing with camera pose ambiguity.