99,504 research outputs found

Expressive Whole-Body 3D Multi-Person Pose and Shape Estimation from a Single Image

Author: 문경식
Publication venue: Seoul National University Graduate School
Publication date: 01/02/2021

Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. 이경무.
Humans are the most central and interesting subjects in our lives, and both industry and academia have proposed many human-centric techniques and studies, such as motion capture and human-computer interaction. Recovering accurate 3D human geometry (i.e., 3D human pose and shape) is a key component of these techniques and studies. With the rapid spread of cameras, a single RGB image has become a popular input, and many 3D human pose and shape estimation methods based on a single RGB image have been proposed. The 3D pose and shape of the whole body, which includes the hands and face, provide expressive and rich information, including human intention and feeling. Unfortunately, recovering the whole-body 3D pose and shape is very challenging, so only a few works, called expressive methods, have attempted it. Instead of directly solving expressive 3D pose and shape estimation, most of the literature recovers the 3D pose and shape of each part (i.e., body, hands, and face) separately; these are called part-specific methods. There are several further simplifications. For example, many works estimate only 3D pose without shape, because additional 3D shape estimation makes the problem much harder. In addition, most works assume a single person and do not consider the multi-person case. The current literature can therefore be categorized along three axes: 1) part-specific methods versus expressive methods, 2) 3D human pose estimation methods versus 3D human pose and shape estimation methods, and 3) single-person methods versus multi-person methods. The difficulty increases as the outputs become richer: from part-specific to expressive, from 3D pose estimation to 3D pose and shape estimation, and from the single-person case to the multi-person case.
This dissertation introduces three approaches towards expressive 3D multi-person pose and shape estimation from a single image, so that the final output provides the richest information. The first approach addresses 3D multi-person body pose estimation, the second 3D multi-person body pose and shape estimation, and the third expressive 3D multi-person pose and shape estimation. Each approach tackles critical limitations of previous state-of-the-art methods, bringing the literature closer to real-world use. First, a 3D multi-person body pose estimation framework is introduced. In contrast to the single-person case, the multi-person case additionally requires the camera-relative 3D position of each person. Estimating the camera-relative 3D position from a single image involves high depth ambiguity. The proposed framework combines a deep image feature with the pinhole camera model to recover the camera-relative 3D position. The framework can be combined with any single-person 3D pose and shape estimation method to obtain multi-person 3D pose and shape. Therefore, the following two approaches focus on the single-person case and can be easily extended to the multi-person case by using the framework of the first approach. Second, a 3D multi-person body pose and shape estimation method is introduced. It extends the first approach to additionally predict accurate 3D shape, and it significantly outperforms previous state-of-the-art methods by proposing a new target representation, the lixel-based 1D heatmap. Finally, an expressive 3D multi-person pose and shape estimation method is introduced. It integrates the part-specific 3D pose and shape of the above approaches and can therefore provide expressive 3D human pose and shape. In addition, it boosts the accuracy of the estimated 3D pose and shape by proposing a 3D positional pose-guided 3D rotational pose prediction system.
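The abstract does not spell out how the pinhole camera model is used, so the following is only a minimal sketch under stated assumptions. It assumes the distance to a person can be approximated from the ratio between an assumed real-world area and the area of the person's bounding box in pixels, and then back-projects a pixel to camera coordinates. The function names, the area-ratio formulation, and the example numbers are illustrative, not taken from the dissertation.

```python
import math


def depth_from_area(fx, fy, real_area_mm2, bbox_area_px2):
    """Approximate camera distance (mm) from the pinhole model.

    Assumption: an object of real area A_real at distance Z projects to
    roughly A_img = fx * fy * A_real / Z**2 pixels, so
    Z = sqrt(fx * fy * A_real / A_img).
    """
    return math.sqrt(fx * fy * real_area_mm2 / bbox_area_px2)


def back_project(u, v, depth_mm, fx, fy, cx, cy):
    """Map pixel (u, v) at a given depth to camera-relative (X, Y, Z) in mm."""
    x = (u - cx) * depth_mm / fx
    y = (v - cy) * depth_mm / fy
    return (x, y, depth_mm)


# Hypothetical numbers: focal length 1000 px, a 2 m x 2 m reference area,
# and a 400 x 400 px bounding box.
z = depth_from_area(1000, 1000, 2000 * 2000, 400 * 400)
root = back_project(1160, 540, z, 1000, 1000, 960, 540)
print(z, root)
```

A purely geometric estimate like this breaks down when the box is cropped or the person is unusually small or large, which is presumably where a learned image feature helps; the sketch does not model that correction.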
The proposed approaches successfully overcome the limitations of the previous state-of-the-art methods. Extensive experimental results demonstrate the superiority of the proposed approaches both qualitatively and quantitatively.

Table of contents:
1 Introduction
  1.1 Background and Research Issues
  1.2 Outline of the Dissertation
2 3D Multi-Person Pose Estimation
  2.1 Introduction
  2.2 Related works
  2.3 Overview of the proposed model
  2.4 DetectNet
  2.5 PoseNet
    2.5.1 Model design
    2.5.2 Loss function
  2.6 RootNet
    2.6.1 Model design
    2.6.2 Camera normalization
    2.6.3 Network architecture
    2.6.4 Loss function
  2.7 Implementation details
  2.8 Experiment
    2.8.1 Dataset and evaluation metric
    2.8.2 Experimental protocol
    2.8.3 Ablation study
    2.8.4 Comparison with state-of-the-art methods
    2.8.5 Running time of the proposed framework
    2.8.6 Qualitative results
  2.9 Conclusion
3 3D Multi-Person Pose and Shape Estimation
  3.1 Introduction
  3.2 Related works
  3.3 I2L-MeshNet
    3.3.1 PoseNet
    3.3.2 MeshNet
    3.3.3 Final 3D human pose and mesh
    3.3.4 Loss functions
  3.4 Implementation details
  3.5 Experiment
    3.5.1 Datasets and evaluation metrics
    3.5.2 Ablation study
    3.5.3 Comparison with state-of-the-art methods
  3.6 Conclusion
4 Expressive 3D Multi-Person Pose and Shape Estimation
  4.1 Introduction
  4.2 Related works
  4.3 Pose2Pose
    4.3.1 PositionNet
    4.3.2 RotationNet
  4.4 Expressive 3D human pose and mesh estimation
    4.4.1 Body part
    4.4.2 Hand part
    4.4.3 Face part
    4.4.4 Training the networks
    4.4.5 Integration of all parts in the testing stage
  4.5 Implementation details
  4.6 Experiment
    4.6.1 Training sets and evaluation metrics
    4.6.2 Ablation study
    4.6.3 Comparison with state-of-the-art methods
    4.6.4 Running time
  4.7 Conclusion
5 Conclusion and Future Work
  5.1 Summary and Contributions of the Dissertation
  5.2 Future Directions
    5.2.1 Global Context-Aware 3D Multi-Person Pose Estimation
    5.2.2 Unified Framework for Expressive 3D Human Pose and Shape Estimation
    5.2.3 Enhancing Appearance Diversity of Images Captured from Multi-View Studio
    5.2.4 Extension to the video for temporally consistent estimation
    5.2.5 3D clothed human shape estimation in the wild
    5.2.6 Robust human action recognition from a video
Bibliography
Abstract in Korean
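The abstract names a lixel-based 1D heatmap as the target representation but gives no formula. As a rough illustration only, the sketch below assumes that a coordinate is read out of a 1D heatmap with soft-argmax, i.e. the softmax-weighted average of bin indices. Whether the dissertation uses exactly this readout is an assumption, and `soft_argmax_1d` is a hypothetical helper name.

```python
import math


def soft_argmax_1d(logits):
    """Return the expected bin index under a softmax over a 1D heatmap.

    Unlike a hard argmax, the result is continuous, so the coordinate is
    not restricted to bin centres.
    """
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return sum(index * w / total for index, w in enumerate(weights))


# A heatmap peaked between bins 2 and 3 yields a coordinate between them.
print(soft_argmax_1d([0.0, 1.0, 6.0, 6.0, 1.0, 0.0]))
```

Predicting one such 1D heatmap per axis keeps memory linear in the resolution, whereas a dense 3D heatmap grows cubically; that trade-off is a general property of the representation, not a claim about the reported results.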

SNU Open Repository and Archive

DEEP NEURAL NETWORKS AND REGRESSION MODELS FOR OBJECT DETECTION AND POSE ESTIMATION

Author: Hara Kota
Publication date: 01/01/2016

Estimating the pose, orientation, and location of objects has been a central problem in computer vision for decades. In this dissertation, we propose new approaches to these problems using deep neural networks as well as tree-based regression models. For the first topic, we look at the human body pose estimation problem and propose a novel regression-based approach. The goal of human body pose estimation is to predict the locations of body joints given an image of a person. Due to significant variations introduced by pose, clothing, and body styles, it is extremely difficult to address this task by a standard application of regression. We therefore divide the whole-body pose estimation problem into a set of local pose estimation problems by introducing a dependency graph that describes the dependencies among body joints. For each local pose estimation problem, we train a boosted regression tree model and estimate the pose by progressively applying the regression along the paths of the dependency graph, starting from the root node. Our next work improves the traditional regression tree method and demonstrates its effectiveness for pose/orientation estimation tasks. The main issues of traditional regression tree training are that 1) node splitting is limited to binary splitting, 2) the form of the splitting function is limited to thresholding on a single dimension of the input vector, and 3) the best splitting function is found by exhaustive search. We propose a novel node splitting algorithm for regression tree training that does not have these issues. The algorithm proceeds by first applying k-means clustering in the output space, then conducting multi-class classification with a support vector machine (SVM), and finally determining the constant estimate at each leaf node.
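The node splitting steps described above can be sketched roughly as follows. This is a simplified illustration, not the dissertation's implementation: it assumes scalar targets with Euclidean distance (orientation targets would need circular handling), uses a plain Lloyd's k-means with a deterministic initialisation, and leaves out the SVM step. In the full scheme, the cluster labels produced here would serve as class labels for a multi-class classifier that routes samples to child nodes. All function names are hypothetical.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's k-means on scalar targets; returns (labels, centers)."""
    ordered = sorted(values)
    # Deterministic initialisation: spread centers over the sorted targets.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    def assign(value):
        return min(range(k), key=lambda c: abs(value - centers[c]))

    for _ in range(iterations):
        labels = [assign(v) for v in values]
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    return [assign(v) for v in values], centers


def split_node(targets, k):
    """Cluster targets into k children and compute each child's constant estimate."""
    labels, _ = kmeans_1d(targets, k)
    children = {}
    for index, label in enumerate(labels):
        children.setdefault(label, []).append(index)
    estimates = {
        label: sum(targets[i] for i in indices) / len(indices)
        for label, indices in children.items()
    }
    return labels, estimates


labels, estimates = split_node([0.0, 1.0, 2.0, 88.0, 90.0, 92.0], k=2)
print(labels, estimates)
```

Because the clusters are formed in the output space, the number of children is set by k rather than fixed at two, and no exhaustive search over input thresholds is needed.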
We apply the regression forest that includes our regression tree models to head pose estimation, car orientation estimation, and pedestrian orientation estimation tasks and demonstrate its superiority over various standard regression methods. Next, we turn our attention to the role of pose information in the object detection task. In particular, we focus on the detection of fashion items that a person is wearing or carrying. The locations of these items are strongly correlated with the pose of the person. To address this task, we first generate a set of candidate bounding boxes using an object proposal algorithm. For each candidate bounding box, image features are extracted by a deep convolutional neural network pre-trained on a large image dataset, and the detection scores are generated by SVMs. We introduce a pose-dependent prior on the geometry of the bounding boxes and combine it with the SVM scores. We demonstrate that the proposed algorithm achieves a significant improvement in detection performance. Lastly, we address the object detection task by exploring a way to incorporate an attention mechanism into the detection algorithm. Humans are capable of allocating multiple fixation points, each of which attends to a different location and scale in the scene. However, such a mechanism is missing from current state-of-the-art object detection methods. Inspired by the human vision system, we propose a novel deep network architecture that imitates this attention mechanism. To detect objects in an image, the network adaptively places a sequence of glimpses at different locations in the image. Evidence of the presence of an object and its location is extracted from these glimpses and then fused to estimate the object class and bounding box coordinates. Due to the lack of ground truth annotations for the visual attention mechanism, we train our network using a reinforcement learning algorithm.
Experimental results on standard object detection benchmarks show that the proposed network consistently outperforms the baseline networks that do not employ the attention mechanism.

Digital Repository at the University of Maryland

Adversarial PoseNet: A Structure-aware Convolutional Network for Human Pose Estimation

Author: Chen Yu, Liu Lingqiao, Shen Chunhua, Wei Xiu-Shen, Yang Jian
Publication date: 01/01/2017

For human pose estimation in monocular images, joint occlusions and overlapping human bodies often result in deviated pose predictions. Under these circumstances, biologically implausible pose predictions may be produced. In contrast, human vision is able to predict poses by exploiting geometric constraints of joint inter-connectivity. To address the problem by incorporating priors about the structure of human bodies, we propose a novel structure-aware convolutional network that implicitly takes such priors into account during training of the deep network. Explicit learning of such constraints is typically challenging. Instead, we design discriminators to distinguish real poses from fake ones (such as biologically implausible ones). If the pose generator (G) generates results that the discriminator fails to distinguish from real ones, the network has successfully learned the priors.
Comment: Fixed typos. 14 pages. Demonstration videos are http://v.qq.com/x/page/c039862eira.html, http://v.qq.com/x/page/f0398zcvkl5.html, http://v.qq.com/x/page/w0398ei9m1r.htm
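The generator and discriminator interplay described in the abstract can be illustrated with the standard adversarial objective. This is a generic sketch, assuming the discriminator outputs a probability that a pose is real; the abstract does not state the exact losses used in the paper, and the function names are hypothetical.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy for the discriminator.

    d_real: discriminator output on a ground-truth pose (ideally near 1).
    d_fake: discriminator output on a generated pose (ideally near 0).
    """
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))


def generator_loss(d_fake, eps=1e-12):
    """Generator term: small when the discriminator accepts the generated pose."""
    return -math.log(d_fake + eps)


# An implausible pose that the discriminator rejects is costly for the generator.
print(generator_loss(0.05), generator_loss(0.95))
```

In this setup the structural prior is never written down explicitly: the generator is pushed toward plausible poses only through the discriminator's feedback, which matches the implicit learning the abstract describes.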

arXiv.org e-Print Archive

Crossref

Adelaide Research & Scholarship