
    Infrared face recognition: a comprehensive review of methodologies and databases

    Automatic face recognition is an area with immense practical potential, spanning a wide range of commercial and law-enforcement applications, so it is unsurprising that it remains one of the most active research areas of computer vision. Even after over three decades of intense research, the state of the art in face recognition continues to improve, benefiting from advances in a range of fields such as image processing, pattern recognition, computer graphics, and physiology. Systems based on visible-spectrum images, the most researched face recognition modality, have reached a significant level of maturity with some practical success. However, they continue to face challenges in the presence of illumination, pose, and expression changes, as well as facial disguises, all of which can significantly decrease recognition accuracy. Among the various approaches proposed to overcome these limitations, infrared (IR) imaging has emerged as a particularly promising research direction. This paper presents a comprehensive and timely review of the literature on this subject. Our key contributions are: (i) a summary of the inherent properties of infrared imaging that make this modality promising in the context of face recognition; (ii) a systematic review of the most influential approaches, with a focus on emerging common trends as well as key differences between alternative methodologies; (iii) a description of the main databases of infrared facial images available to researchers; and (iv) a discussion of the most promising avenues for future research. Comment: Pattern Recognition, 2014. arXiv admin note: substantial text overlap with arXiv:1306.160

    Statistical part-based models for object detection in large 3D scans

    3D scanning technology has matured to a point where very large-scale acquisition of high-resolution geometry has become feasible. However, having large quantities of 3D data poses new technical challenges. Many applications of practical use require an understanding of the semantics of the acquired geometry, so scene understanding plays a key role in many applications. This thesis is concerned with two core topics: 3D object detection and semantic alignment. We address the problem of efficiently detecting large quantities of objects in 3D scans according to object categories learned from sparse user annotation. Objects are modeled by a collection of smaller sub-parts and a graph structure representing part dependencies. The thesis introduces two novel approaches: a part-based chain-structured Markov model and a general part-based full correlation model. Both models come with efficient detection schemes that allow for interactive run-times.
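    To illustrate why a chain-structured part model admits efficient detection, here is a minimal dynamic-programming sketch; the unary part scores, pairwise compatibilities, and candidate placements are hypothetical stand-ins for the learned models in the thesis, not its actual formulation:

```python
import numpy as np

def chain_part_detection(unary, pairwise):
    """Find the best joint placement of K parts over C candidate
    locations for a chain-structured part model.

    unary:    (K, C) array, unary[k, c] = score of part k at candidate c
    pairwise: (K-1, C, C) array, pairwise[k, i, j] = compatibility of
              part k at location i with part k+1 at location j
    Returns the optimal placement (one candidate per part) and its score.
    Runs in O(K * C^2) instead of O(C^K) exhaustive search.
    """
    K, C = unary.shape
    score = unary[0].copy()              # best chain score ending at each candidate
    backptr = np.zeros((K, C), dtype=int)
    for k in range(1, K):
        # total[i, j]: best chain with part k-1 at i followed by part k at j
        total = score[:, None] + pairwise[k - 1] + unary[k][None, :]
        backptr[k] = np.argmax(total, axis=0)
        score = np.max(total, axis=0)
    # Backtrack the optimal placement from the best final candidate.
    placement = [int(np.argmax(score))]
    for k in range(K - 1, 0, -1):
        placement.append(int(backptr[k, placement[-1]]))
    return placement[::-1], float(np.max(score))

# Toy usage: 3 parts, 4 candidate locations each.
rng = np.random.default_rng(0)
parts, best = chain_part_detection(rng.normal(size=(3, 4)),
                                   rng.normal(size=(2, 4, 4)))
```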

    On the 3D point cloud for human-pose estimation

    This thesis investigates methodologies for estimating a human pose from a 3D point cloud captured by a static depth sensor. Human-pose estimation (HPE) is important for a range of applications, such as human-robot interaction, healthcare, and surveillance. Yet HPE is challenging because of the uncertainty in sensor measurements and the complexity of human poses. In this research, we focus on two crucial components of the estimation process: human-pose feature extraction and human-pose modeling. In feature extraction, the main challenge is reducing feature ambiguity. We propose a 3D-point-cloud feature called the viewpoint and shape feature histogram (VISH), which reduces feature ambiguity by capturing geometric properties of the 3D point cloud of a human. The feature extraction consists of three steps: 3D-point-cloud pre-processing, hierarchical structuring, and feature extraction. In the pre-processing step, 3D points corresponding to a human are extracted and outliers from the environment are removed, so that only the 3D points of interest are retained; this step is important because it reduces the number of 3D points to those that correspond to the human body. In the hierarchical structuring step, the pre-processed 3D point cloud is partitioned and replicated into a tree structure as nodes. A viewpoint feature histogram (VFH) and shape features are extracted from each node in the tree to provide a descriptor for that node. Because the features are histogram-based, coarse-level details dominate in large regions and fine-level details dominate in small regions; the features from the point cloud in the tree therefore capture coarse-to-fine information, reducing feature ambiguity. In human-pose modeling, the main challenges are reducing the dimensionality of the human-pose space and designing appropriate factors that represent the underlying probability distributions for estimating human poses. To reduce the dimensionality, we propose a non-parametric action-mixture model (AMM). It represents the high-dimensional human-pose space using low-dimensional manifolds when searching for human poses. In each manifold, a probability distribution is estimated based on feature similarity. The distributions in the manifolds are then redistributed according to the stationary distribution of a Markov chain that models the frequency of human actions. After the redistribution, the manifolds are combined according to a probability distribution determined by action classification. Experiments were conducted using VISH features as input to the AMM. The results showed that the overall error and standard deviation of the AMM were reduced by about 7.9% and 7.1%, respectively, compared with a model without action classification. To design appropriate factors, we consider the AMM as a Bayesian network and propose a mapping that converts the Bayesian network to a neural network, called the NN-AMM. The proposed mapping consists of two steps: structure identification and parameter learning. In structure identification, we have developed a bottom-up approach that builds a neural network while preserving the Bayesian-network structure. In parameter learning, we have created a part-based approach that learns synaptic weights by decomposing the neural network into parts.
Based on the concept of distributed representation, the NN-AMM is further modified into a scalable neural network called the NND-AMM. A neural-network-based system is then built that uses VISH features to represent the 3D-point-cloud input and the NND-AMM to estimate 3D human poses. The results showed that the proposed mapping can be used to design AMM factors automatically. The NND-AMM provides more accurate human-pose estimates with fewer hidden neurons than either the AMM or the NN-AMM. Both the NN-AMM and the NND-AMM can adapt to different types of input, demonstrating the advantage of using neural networks to design factors.
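To make the AMM's redistribution step concrete, here is a minimal sketch in which per-manifold pose distributions are reweighted by the stationary distribution of an action-transition Markov chain and then mixed by action-classification probabilities; the transition matrix, distributions, and mixing are toy stand-ins, and the thesis's exact estimation procedure may differ:

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a row-stochastic transition matrix P,
    i.e. the left eigenvector pi with pi @ P = pi, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    return pi / pi.sum()

def combine_manifolds(manifold_dists, P, action_probs):
    """manifold_dists: (A, N) per-action-manifold distributions over N poses.
    P: (A, A) Markov chain over A actions; action_probs: (A,) classifier output.
    Returns a single distribution over the N candidate poses."""
    pi = stationary_distribution(P)               # long-run action frequencies
    redistributed = manifold_dists * pi[:, None]  # reweight each manifold
    combined = action_probs @ redistributed       # mix by action classification
    return combined / combined.sum()              # normalize to a distribution

# Toy usage: 3 actions, 5 candidate poses.
P = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])
dists = np.full((3, 5), 0.2)
pose_dist = combine_manifolds(dists, P, np.array([0.5, 0.3, 0.2]))
```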

    Structured Understanding of Visual Data Using Partial Information: Sparsity, Randomness, Relatedness, and Deep Networks

    Ph.D. dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2019 (advisor: Songhwai Oh). For a deeper understanding of visual data, the relationship between local parts and the global scene has to be carefully examined. Examples of such relationships in vision problems include, but are not limited to, detecting a region of interest in a scene, classifying an image based on limited visual cues, and synthesizing new images conditioned on local or global inputs. In this thesis, we aim to learn this relationship and demonstrate its importance by showing that it is one of the critical keys to addressing the four challenging vision problems mentioned above. For each problem, we construct deep neural networks suited to each task. The first problem considered in the thesis is object detection. It requires not only finding local patches that look like target objects conditioned on the context of the input scene, but also comparing the local patches themselves so as to assign a single detection to each object. To this end, we introduce the individualness of detection candidates as a complement to objectness for object detection. The individualness assigns a single detection to each object out of raw detection candidates given by either object proposals or sliding windows. We show that conventional approaches, such as non-maximum suppression, are sub-optimal since they suppress nearby detections using only detection scores. We use a determinantal point process combined with the individualness to optimally select the final detections. It models each detection using its quality and its similarity to other detections based on the individualness. Detections with high detection scores and low correlations are then selected by measuring their probability using the determinant of a matrix composed of quality terms on the diagonal entries and similarities on the off-diagonal entries. For concreteness, we focus on the pedestrian detection problem, as it is one of the most challenging problems due to frequent occlusions and unpredictable human motions. Experimental results demonstrate that the proposed algorithm performs favorably against existing methods, including non-maximum suppression and a quadratic unconstrained binary optimization based method.
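    To make the selection step concrete, here is a minimal sketch of determinantal point process selection with quality terms on the diagonal and similarities off the diagonal, as described above; the greedy determinant maximization and the toy scores and features are illustrative stand-ins for the thesis's actual mode-finding procedure and individualness features:

```python
import numpy as np

def dpp_select(quality, features, k):
    """Greedily select up to k detections under a DPP.

    quality:  (N,) detection scores q_i.
    features: (N, D) unit-normalized 'individualness' descriptors phi_i.
    The DPP kernel is L_ij = q_i * (phi_i . phi_j) * q_j: quality on the
    diagonal, similarity off the diagonal, so subsets with high quality and
    low mutual similarity get a large determinant det(L_S).
    """
    S = features @ features.T
    L = quality[:, None] * S * quality[None, :]
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(len(quality)):
            if i in selected:
                continue
            idx = selected + [i]
            det = np.linalg.det(L[np.ix_(idx, idx)])
            if det > best_det:
                best, best_det = i, det
        if best is None:      # every remaining candidate is near-redundant
            break
        selected.append(best)
    return selected

# Toy usage: two overlapping detections and one distinct detection.
q = np.array([0.9, 0.85, 0.7])
phi = np.array([[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]])
phi /= np.linalg.norm(phi, axis=1, keepdims=True)
print(dpp_select(q, phi, k=2))   # expect indices 0 and 2, not the duplicate
```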
For the second problem, we classify images based on observations of local patches. More specifically, we consider the problem of estimating the head pose and body orientation of a person from a low-resolution image. In this setting, it is difficult to reliably extract facial features or detect body parts. We propose a convolutional random projection forest (CRPforest) algorithm for these tasks. A convolutional random projection network (CRPnet) is used at each node of the forest. It maps an input image to a high-dimensional feature space using a rich filter bank. The filter bank is designed to generate sparse responses so that they can be computed efficiently via compressive sensing: a sparse random projection matrix can capture most of the essential information in the filter bank without using all of its filters. The CRPnet is therefore fast, requiring 0.04 ms to process an image of 50×50 pixels thanks to its small number of convolutions (e.g., 0.01% of a neural-network layer), at the expense of less than 2% accuracy. The overall forest estimates head and body pose well on benchmark datasets, e.g., over 98% on the HIIT dataset, while requiring only 3.8 ms without using a GPU. Extensive experiments on challenging datasets show that the proposed algorithm performs favorably against state-of-the-art methods on low-resolution images with noise, occlusion, and motion blur. We then shift our attention to image synthesis based on the local-global relationship. Learning how to synthesize and place object instances into an image (semantic map) based on the scene context is a challenging and interesting problem in vision and learning. On one hand, solving this problem requires jointly deciding (a) how to generate an object mask of a certain class at a plausible scale, location, and shape, and (b) how to insert the object instance mask into an existing scene so that the synthesized content is semantically realistic. On the other hand, such a model can synthesize realistic outputs that potentially facilitate numerous image editing and scene parsing tasks. In this work, we propose an end-to-end trainable neural network that can synthesize and insert object instances into an image via a semantic map. The proposed network contains two generative modules that determine where the inserted object should be (i.e., its location and scale) and what its shape (and pose) should look like. The two modules are connected by a spatial transformation network and jointly trained and optimized in a purely data-driven way. Specifically, we propose a novel network architecture with parallel supervised and unsupervised paths to guarantee diverse results. We show that the proposed architecture learns the context-aware distribution of the location and shape of object instances to be inserted, and that it can generate realistic and statistically meaningful object instances that simultaneously address the where and what sub-problems. As the final topic of the thesis, we introduce a new vision problem: generating an image from a small number of key local patches without any geometric prior. Key local patches are defined as informative regions of the target object or scene. This is a challenging problem since it requires generating realistic images and predicting the locations of parts at the same time. We construct adversarial networks to tackle this problem. A generator network produces a fake image as well as a mask, based on the encoder-decoder framework, while a discriminator network aims to detect fake images. The network is trained with three losses that account for spatial, appearance, and adversarial information: the spatial loss determines whether the locations of the predicted parts are correct; the appearance loss ensures input patches are restored in the output image without much modification; and the adversarial loss ensures output images are realistic. The proposed network is trained without supervisory signals, since no labels of key parts are required. Experimental results on seven datasets demonstrate that the proposed algorithm performs favorably on challenging objects and scenes.
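To illustrate the three-loss training described above, here is a minimal sketch of a combined generator objective; the loss weights, tensor layouts, and normalization are hypothetical choices for illustration, not the thesis's actual configuration:

```python
import numpy as np

def generator_loss(pred_mask, true_mask, fake_img, input_patches,
                   patch_mask, disc_score_fake, w_spatial=10.0, w_app=1.0):
    """Combine the three losses used to train the patch-to-image generator.

    pred_mask / true_mask: (H, W) predicted vs. correct part locations.
    fake_img:              (H, W, 3) generated image in [0, 1].
    input_patches:         (H, W, 3) input patches placed on a canvas.
    patch_mask:            (H, W, 1) 1 where an input patch is present.
    disc_score_fake:       discriminator's probability that fake_img is real.
    """
    # Spatial loss: are the predicted part locations correct?
    spatial = np.mean((pred_mask - true_mask) ** 2)
    # Appearance loss: input patches should survive into the output image.
    app = (np.sum(patch_mask * np.abs(fake_img - input_patches))
           / (3.0 * patch_mask.sum() + 1e-8))
    # Adversarial loss: push the generator to fool the discriminator.
    adv = -np.log(disc_score_fake + 1e-8)
    return w_spatial * spatial + w_app * app + adv
```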
์ด ๋…ผ๋ฌธ์—์„œ๋Š”, ๊ทธ ์—ฐ๊ด€์„ฑ์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ์•ž์„œ ์–ธ๊ธ‰๋œ ๋‹ค์–‘ํ•œ ๋ฌธ์ œ๋“ค์„ ํ‘ธ๋Š”๋ฐ ์ค‘์š”ํ•œ ์—ด์‡ ๊ฐ€ ๋œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ๊ณ ์ž ํ•œ๋‹ค. ์ด์— ๋”ํ•ด์„œ, ๊ฐ๊ฐ์˜ ๋ฌธ์ œ์— ์•Œ๋งž๋Š” ๋”ฅ ๋„คํŠธ์›Œํฌ์˜ ๋””์ž์ธ ๋˜ํ•œ ํ† ์˜ํ•˜๊ณ ์ž ํ•œ๋‹ค. ์ฒซ ์ฃผ์ œ๋กœ, ๋ฌผ์ฒด ๊ฒ€์ถœ ๋ฐฉ์‹์— ๋Œ€ํ•ด ๋ถ„์„ํ•˜๊ณ ์ž ํ•œ๋‹ค. ์ด ๋ฌธ์ œ๋Š” ํƒ€๊ฒŸ ๋ฌผ์ฒด์™€ ๋น„์Šทํ•˜๊ฒŒ ์ƒ๊ธด ์˜์—ญ์„ ์ฐพ์•„์•ผ ํ•  ๋ฟ ์•„๋‹ˆ๋ผ, ์ฐพ์•„์ง„ ์˜์—ญ๋“ค ์‚ฌ์ด์— ์—ฐ๊ด€์„ฑ์„ ๋ถ„์„ํ•จ์œผ๋กœ์จ ๊ฐ ๋ฌผ์ฒด ๋งˆ๋‹ค ๋‹จ ํ•˜๋‚˜์˜ ๊ฒ€์ถœ ๊ฒฐ๊ณผ๋ฅผ ํ• ๋‹น์‹œ์ผœ์•ผ ํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” objectness์— ๋Œ€ํ•œ ๋ณด์™„์œผ๋กœ์จ individualness๋ผ๋Š” ๊ฐœ๋…์„ ์ œ์•ˆ ํ•˜์˜€๋‹ค. ์ด๋Š” ์ž„์˜์˜ ๋ฐฉ์‹์œผ๋กœ ์–ป์–ด์ง„ ํ›„๋ณด ๋ฌผ์ฒด ์˜์—ญ ์ค‘ ํ•˜๋‚˜์”ฉ์„ ๋ฌผ์ฒด ๋งˆ๋‹ค ํ• ๋‹นํ•˜๋Š”๋ฐ ์“ฐ์ด๋Š”๋ฐ, ์ด๊ฒƒ์€ ๊ฒ€์ถœ ์Šค์ฝ”์–ด๋งŒ์„ ๋ฐ”ํƒ•์œผ๋กœ ํ›„์ฒ˜๋ฆฌ๋ฅผ ํ•˜๋Š” ๊ธฐ์กด์˜ non-maximum suppression ๋“ฑ์˜ ๋ฐฉ์‹์ด sub-optimal ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ๋ฐ–์— ์—†๊ธฐ ๋•Œ๋ฌธ์— ์ด๋ฅผ ๊ฐœ์„ ํ•˜๊ณ ์ž ๋„์ž…ํ•˜์˜€๋‹ค. ์šฐ๋ฆฌ๋Š” ํ›„๋ณด ๋ฌผ์ฒด ์˜์—ญ์œผ๋กœ๋ถ€ํ„ฐ ์ตœ์ ์˜ ์˜์—ญ๋“ค์„ ์„ ํƒํ•˜๊ธฐ ์œ„ํ•ด์„œ, determinantal point process๋ผ๋Š” random process์˜ ์ผ์ข…์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์ด๊ฒƒ์€ ๋จผ์ € ๊ฐ๊ฐ์˜ ๊ฒ€์ถœ ๊ฒฐ๊ณผ๋ฅผ ๊ทธ๊ฒƒ์˜ quality(๊ฒ€์ถœ ์Šค์ฝ”์–ด)์™€ ๋‹ค๋ฅธ ๊ฒ€์ถœ ๊ฒฐ๊ณผ๋“ค ์‚ฌ์ด์— individualness๋ฅผ ๋ฐ”ํƒ•์œผ ๋กœ ๊ณ„์‚ฐ๋œ similarity(์ƒ๊ด€ ๊ด€๊ณ„)๋ฅผ ์ด์šฉํ•ด ๋ชจ๋ธ๋ง ํ•œ๋‹ค. ๊ทธ ํ›„, ๊ฐ๊ฐ์˜ ๊ฒ€์ถœ ๊ฒฐ๊ณผ๊ฐ€ ์„ ํƒ๋  ํ™•๋ฅ ์„ quality์™€ similarity์— ๊ธฐ๋ฐ˜ํ•œ ์ปค๋„์˜ determinant๋กœ ํ‘œํ˜„ํ•œ๋‹ค. ๊ทธ ์ปค๋„์— diagonal ๋ถ€๋ถ„์—๋Š” quality๊ฐ€ ๋“ค์–ด๊ฐ€๊ณ , off-diagonal์—๋Š” similarity๊ฐ€ ๋Œ€์ž… ๋œ๋‹ค. ๋”ฐ๋ผ์„œ, ์–ด๋–ค ๊ฒ€์ถœ ํ›„๋ณด๊ฐ€ ์ตœ์ข… ๊ฒ€์ถœ ๊ฒฐ๊ณผ๋กœ ์„ ํƒ๋  ํ™•๋ฅ ์ด ๋†’์•„์ง€๊ธฐ ์œ„ํ•ด์„œ๋Š”, ๋†’์€ quality๋ฅผ ๊ฐ€์ง๊ณผ ๋™์‹œ์— ๋‹ค๋ฅธ ๊ฒ€์ถœ ๊ฒฐ๊ณผ๋“ค๊ณผ ๋‚ฎ์€ similarity๋ฅผ ๊ฐ€์ ธ์•ผ ํ•œ๋‹ค. ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋ณดํ–‰์ž ๊ฒ€์ถœ์— ์ง‘์ค‘ํ•˜์˜€๋Š”๋ฐ, ์ด๋Š” ๋ณดํ–‰์ž ๊ฒ€์ถœ์ด ์ค‘์š”ํ•œ ๋ฌธ์ œ์ด๋ฉด์„œ๋„, ๋‹ค๋ฅธ ๋ฌผ์ฒด๋“ค์— ๋น„ํ•ด ์ž์ฃผ ๊ฐ€๋ ค์ง€๊ณ  ๋‹ค์–‘ํ•œ ์›€์ง์ž„์„ ๋ณด์ด๋Š” ๊ฒ€์ถœ์ด ์–ด๋ ค์šด ๋ฌผ์ฒด์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•์ด non-maximum suppression ํ˜น์€ quadratic unconstrained binary optimization ๋ฐฉ์‹๋“ค ๋ณด๋‹ค ์šฐ์ˆ˜ํ•จ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ๋‹ค์Œ ๋ฌธ์ œ๋กœ๋Š”, ๋ถ€๋ถ„ ์ •๋ณด๋ฅผ ์ด์šฉํ•ด์„œ ์ „์ฒด ์ด๋ฏธ์ง€๋ฅผ classifyํ•˜๋Š” ๊ฒƒ์„ ๊ณ ๋ คํ•œ๋‹ค. ๋‹ค์–‘ํ•œ classification ๋ฌธ์ œ ์ค‘์—, ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์ €ํ•ด์ƒ๋„ ์ด๋ฏธ์ง€๋กœ๋ถ€ํ„ฐ ์‚ฌ๋žŒ์˜ ๋จธ๋ฆฌ์™€ ๋ชธ์ด ํ–ฅํ•˜๋Š” ๋ฐฉํ–ฅ์„ ์•Œ์•„๋‚ด๋Š” ๋ฌธ์ œ์— ์ง‘์ค‘ํ•˜์˜€๋‹ค. ์ด ๊ฒฝ์šฐ์—๋Š”, ๋ˆˆ, ์ฝ”, ์ž… ๋“ฑ์„ ์ฐพ๊ฑฐ๋‚˜, ๋ชธ์˜ ํŒŒํŠธ๋ฅผ ์ •ํ™•ํžˆ ์•Œ์•„๋‚ด๋Š” ๊ฒƒ์ด ์–ด๋ ต๋‹ค. ์ด๋ฅผ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” convolutional random projection forest (CRPforest)๋ผ๋Š” ๋ฐฉ์‹์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ด forest์— ๊ฐ๊ฐ์˜ node ์•ˆ์—๋Š” convolutional random projection network (CRPnet)์ด ๋“ค์–ด์žˆ๋Š”๋ฐ, ์ด๋Š” ๋‹ค์–‘ํ•œ ํ•„ํ„ฐ๋ฅผ ์ด์šฉํ•ด์„œ ์ธํ’‹ ์ด๋ฏธ์ง€๋ฅผ ๋†’์€ ์ฐจ์›์œผ๋กœ mapping ํ•œ๋‹ค. ์ด๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๋‹ค๋ฃจ๊ธฐ ์œ„ํ•ด sparseํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ํ•„ํ„ฐ๋“ค์„ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ, ์••์ถ• ์„ผ์‹ฑ ๊ฐœ๋…์„ ๋„์ž… ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์˜€๋‹ค. ์ฆ‰, ์‹ค์ œ๋กœ๋Š” ์ ์€ ์ˆ˜์˜ ํ•„ํ„ฐ๋งŒ์„ ์‚ฌ์šฉํ•ด์„œ ์ „์ฒด ์ด๋ฏธ์ง€์˜ ์ค‘์š”ํ•œ ์ •๋ณด๋ฅผ ๋ชจ๋‘ ๋‹ด๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ CRPnet์€ 50ร—50 ํ”ฝ์…€ ์ด๋ฏธ์ง€์—์„œ 0.04ms ๋งŒ์— ๋™์ž‘ ํ•  ์ˆ˜ ์žˆ์„ ์ •๋„๋กœ ๋งค์šฐ ๋น ๋ฅด๋ฉฐ, ๋™์‹œ์— ์„ฑ๋Šฅ ํ•˜๋ฝ์€ 2% ์ •๋„๋กœ ๋ฏธ๋ฏธํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. 
    Contents:
    1 Introduction
      1.1 Organization of the Dissertation
    2 Related Work
      2.1 Detection methods
      2.2 Orientation estimation methods
      2.3 Instance synthesis methods
      2.4 Image generation methods
    3 Pedestrian detection
      3.1 Introduction
      3.2 Proposed Algorithm
        3.2.1 Determinantal Point Process Formulation
        3.2.2 Quality Term
        3.2.3 Individualness and Diversity Feature
        3.2.4 Mode Finding
        3.2.5 Relationship to Quadratic Unconstrained Binary Optimization
      3.3 Experiments
        3.3.1 Experimental Settings
        3.3.2 Evaluation Results
        3.3.3 DET curves
        3.3.4 Sensitivity analysis
        3.3.5 Effectiveness of the quality and similarity term design
      3.4 Summary
    4 Head and body orientation estimation
      4.1 Introduction
      4.2 Algorithmic Overview
      4.3 Rich Filter Bank
        4.3.1 Compressed Filter Bank
        4.3.2 Box Filter Bank
      4.4 Convolutional Random Projection Net
        4.4.1 Input Layer
        4.4.2 Convolutional and ReLU Layers
        4.4.3 Random Projection Layer
        4.4.4 Fully-Connected and Output Layers
      4.5 Convolutional Random Projection Forest
      4.6 Experimental Results
        4.6.1 Evaluation Datasets
        4.6.2 CRPnet Characteristics
        4.6.3 Head and Body Orientation Estimation
        4.6.4 Analysis of the Proposed Algorithm
        4.6.5 Classification Examples
        4.6.6 Regression Examples
        4.6.7 Experiments on the Original Datasets
        4.6.8 Dataset Corrections
      4.7 Summary
    5 Instance synthesis and placement
      5.1 Introduction
      5.2 Approach
        5.2.1 The where module: learning a spatial distribution of object instances
        5.2.2 The what module: learning a shape distribution of object instances
        5.2.3 The complete pipeline
      5.3 Experimental Results
      5.4 Summary
    6 Image generation
      6.1 Introduction
      6.2 Proposed Algorithm
        6.2.1 Key Part Detection
        6.2.2 Part Encoding Network
        6.2.3 Mask Prediction Network
        6.2.4 Image Generation Network
        6.2.5 Real-Fake Discriminator Network
      6.3 Experiments
        6.3.1 Datasets
        6.3.2 Image Generation Results
        6.3.3 Experimental Details
        6.3.4 Image Generation from Local Patches
        6.3.5 Part Combination
        6.3.6 Unsupervised Feature Learning
        6.3.7 An Alternative Objective Function
        6.3.8 An Alternative Network Structure
        6.3.9 Different Number of Input Patches
        6.3.10 Smaller Size of Input Patches
        6.3.11 Degraded Input Patches
        6.3.12 User Study
        6.3.13 Failure cases
      6.4 Summary
    7 Conclusion and Future Work

    Fine-Scaled 3D Geometry Recovery from Single RGB Images

    3D geometry recovery from single RGB images is a highly ill-posed and inherently ambiguous problem, and it has been a challenging research topic in computer vision for several decades. When fine-scaled 3D geometry is required, the problem becomes even more difficult. The objective is to recover geometric information from a single photograph of an object or of a scene with multiple objects; the recovered geometry can take different representations, such as surface meshes, voxels, depth maps, or 3D primitives. In this thesis, we investigate fine-scaled 3D geometry recovery from single RGB images for three categories: facial wrinkles, indoor scenes, and man-made objects. Since each category has its own features, styles, and representational variations, we propose a different strategy for each. We present a lightweight non-parametric method to generate wrinkles from monocular Kinect RGB images. The method is lightweight in that it can generate plausible wrinkles using exemplars from a single high-quality textured 3D face model. Local geometric patches from the source can be copied to synthesize different wrinkles on the blendshapes of specific users in an offline stage. During online tracking, facial animations with high-quality wrinkle details can be recovered in real time as a linear combination of these personalized wrinkled blendshapes. For indoor scenes, we propose a fast-to-train, multi-scale, two-streamed CNN that predicts both a dense depth map and depth gradients for single indoor scene images. The depth and depth gradients are then fused into a more accurate and detailed depth map. We introduce a novel set loss over multiple related images: by regularizing the estimation across a common set of images, the network is less prone to overfitting and achieves better accuracy than competing methods. A fine-scaled 3D point cloud can then be produced by re-projecting the depth map to 3D using the known camera parameters. To handle highly structured man-made objects, we introduce a novel neural network architecture for recovering 3D shape from a single image. We develop a convolutional encoder that maps a given image to a compact code, and an associated recursive decoder that maps this code back to a full hierarchy, resulting in a set of bounding boxes representing the estimated shape. Finally, we train a second network to predict the fine-scaled geometry in each bounding box at the voxel level. The per-box volumes are then embedded into a global volume, from which we reconstruct the final meshed model. Experiments on a variety of datasets show that our approaches successfully estimate fine-scaled geometry from single RGB images for each category, and surpass state-of-the-art performance in recovering faithful 3D local details as high-resolution mesh surfaces or point clouds.
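    The re-projection step mentioned above is standard pinhole geometry; a minimal sketch, with hypothetical camera intrinsics:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Re-project a depth map (H, W), in meters, to an (H*W, 3) point cloud
    using pinhole intrinsics: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Toy usage with hypothetical Kinect-like intrinsics and a flat 2 m scene.
cloud = depth_to_point_cloud(np.full((480, 640), 2.0),
                             fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```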

    RECOGNITION OF FACES FROM SINGLE AND MULTI-VIEW VIDEOS

    Face recognition has been an active research field for decades. In recent years, with videos playing an increasingly important role in our everyday life, video-based face recognition has begun to attract considerable research interest. This opens up a wide range of potential application areas, including TV/movie search and parsing, video surveillance, access control, etc. Preliminary research results in this field suggest that by exploiting the abundant spatio-temporal information contained in videos, we can greatly improve the accuracy and robustness of a visual recognition system. On the other hand, as this research area is still in its infancy, developing an end-to-end face processing pipeline that can robustly detect, track, and recognize faces remains a challenging task. The goal of this dissertation is to study some of the related problems under different settings. We first address the video-based face association problem, in which one attempts to extract face tracks of multiple subjects while maintaining label consistency. Traditional tracking algorithms have difficulty with this task, especially when challenging nuisance factors such as motion blur, low resolution, or significant camera motion are present. We demonstrate that contextual features, in addition to face appearance itself, play an important role in this case. We propose principled methods that combine multiple features using Conditional Random Fields and Max-Margin Markov networks to infer labels for the detected faces. Unlike many existing approaches, our algorithms work in online mode and hence have a wider range of applications. We address issues such as parameter learning, inference, and the handling of false positives/negatives that arise in the proposed approach. Finally, we evaluate our approach on several public databases. We next propose a novel video-based face recognition framework, addressing the problem from two aspects. To handle pose variations, we learn a structural-SVM-based detector that simultaneously localizes the face fiducial points and estimates the face pose; by adopting a different optimization criterion from existing algorithms, we improve localization accuracy. To model other face variations, we use intra-personal/extra-personal dictionaries. Intra-personal/extra-personal modeling of human faces has been shown to work successfully in the Bayesian face recognition framework, and it has additional advantages in scalability and generalization, which are of critical importance to real-world applications. Combining intra-personal/extra-personal models with dictionary learning enables us to achieve state-of-the-art performance on unconstrained video data, even when the training data come from a different database. Finally, we present an approach for video-based face recognition using camera networks. The focus is on handling pose variations by exploiting the strength of the multi-view camera network. Rather than taking the typical approach of modeling these variations, which eventually requires explicit knowledge of pose parameters, we rely on a pose-robust feature that eliminates the need for pose estimation. The pose-robust feature is developed using Spherical Harmonic (SH) representation theory and is extracted from the surface texture map of a spherical model that approximates the subject's head. Feature vectors extracted from a video are modeled as an ensemble of instances of a probability distribution in a Reproducing Kernel Hilbert Space (RKHS). The ensemble similarity measure in the RKHS improves both the robustness and the accuracy of the recognition system. The proposed approach outperforms traditional algorithms on a multi-view video database collected using a camera network.
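    As an illustration of ensemble similarity between feature sets in an RKHS, here is a minimal sketch using the inner product of kernel mean embeddings; the RBF kernel, bandwidth, and cosine-style normalization are assumptions chosen for illustration rather than the dissertation's exact probabilistic similarity measure:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2) for all pairs of rows."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def ensemble_similarity(X, Y, gamma=1.0):
    """Inner product of the RKHS mean embeddings of two feature ensembles:
    <mu_X, mu_Y> = mean_{i,j} k(x_i, y_j), normalized like a cosine so the
    score lies in (0, 1] and equals 1 when the ensembles coincide."""
    kxy = rbf_kernel(X, Y, gamma).mean()
    kxx = rbf_kernel(X, X, gamma).mean()
    kyy = rbf_kernel(Y, Y, gamma).mean()
    return kxy / np.sqrt(kxx * kyy)

# Toy usage: two ensembles of pose-robust feature vectors from two videos.
rng = np.random.default_rng(0)
sim = ensemble_similarity(rng.normal(size=(30, 16)),
                          rng.normal(size=(25, 16)) + 0.1)
```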