90,038 research outputs found

Cascaded 3D Full-body Pose Regression from Single Depth Image at 100 FPS

Author: Su Le
Xia Shihong
Zhang Zihao
Publication date: 25/01/2018

Real-time live applications in virtual reality are increasing, and capturing and retargeting 3D human pose plays an important role in them. However, estimating accurate 3D pose from consumer imaging devices such as depth cameras remains challenging. This paper presents a novel cascaded 3D full-body pose regression method that estimates accurate pose from a single depth image at 100 fps. The key idea is to train cascaded regressors, based on the Gradient Boosting algorithm, from a pre-recorded human motion capture database. By incorporating a hierarchical kinematics model of human pose into the learning procedure, we can directly estimate accurate 3D joint angles instead of joint positions. The biggest advantage of this model is that bone lengths are preserved throughout the 3D pose estimation procedure, which leads to more effective features and higher pose estimation accuracy. Our method can be used as an initialization procedure when combined with tracking methods. We demonstrate the power of our method on a wide range of synthesized human motion data from the CMU mocap database, the Human3.6M dataset, and real human movement data captured in real time. In our comparison against previous 3D pose estimation methods and commercial systems such as Kinect 2017, we achieve state-of-the-art accuracy.
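The abstract's point that regressing joint angles preserves bone lengths can be illustrated with a small forward-kinematics sketch. This is not the paper's cascaded regressor: the planar chain, the function names, and the bone lengths below are all hypothetical and chosen only to show that any set of predicted angles yields a skeleton with the original bone lengths.

```python
import math

def forward_kinematics(bone_lengths, joint_angles, root=(0.0, 0.0)):
    """Return joint positions of a planar kinematic chain.

    Each angle is relative to the parent bone, so the heading is
    accumulated from the root outward (a hierarchical model).
    """
    positions = [root]
    x, y = root
    heading = 0.0
    for length, angle in zip(bone_lengths, joint_angles):
        heading += angle
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        positions.append((x, y))
    return positions

def measured_bone_lengths(positions):
    """Distances between consecutive joints of a posed chain."""
    return [math.dist(a, b) for a, b in zip(positions, positions[1:])]

# Hypothetical bone lengths (metres) and an arbitrary "predicted" pose.
lengths = [0.30, 0.25, 0.10]
pose = forward_kinematics(lengths, [math.pi / 2, -math.pi / 4, 0.3])
recovered = measured_bone_lengths(pose)
```

Whatever angles are passed in, `recovered` matches `lengths` up to floating-point error, which is the property that regressing joint positions directly does not guarantee.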

arXiv.org e-Print Archive

Crossref

Scene-aware Egocentric 3D Human Pose Estimation

Author: Liu Lingjie
Luvizon Diogo
Sarkar Kripasindhu
Theobalt Christian
Wang Jian
Xu Weipeng
Publication date: 25/09/2023

Egocentric 3D human pose estimation with a single head-mounted fisheye camera has recently attracted attention due to its numerous applications in virtual and augmented reality. Existing methods still struggle with challenging poses where the human body is highly occluded or is closely interacting with the scene. To address this issue, we propose a scene-aware egocentric pose estimation method that guides the prediction of the egocentric pose with scene constraints. To this end, we propose an egocentric depth estimation network that predicts the scene depth map from a wide-view egocentric fisheye camera while mitigating the occlusion caused by the human body with a depth-inpainting network. Next, we propose a scene-aware pose estimation network that projects the 2D image features and the estimated depth map of the scene into a voxel space and regresses the 3D pose with a V2V network. The voxel-based feature representation provides a direct geometric connection between 2D image features and scene geometry, and further helps the V2V network constrain the predicted pose based on the estimated scene geometry. To enable the training of the aforementioned networks, we also generated a synthetic dataset, called EgoGTA, and an in-the-wild dataset based on EgoPW, called EgoPW-Scene. Experimental results on our new evaluation sequences show that the predicted 3D egocentric poses are accurate and physically plausible in terms of human-scene interaction, demonstrating that our method outperforms state-of-the-art methods both quantitatively and qualitatively.

arXiv.org e-Print Archive

Simultaneous Hand Pose and Skeleton Bone-Lengths Estimation from a Single Depth Image

Author: Elhayek Ahmed
Malik Jameel
Stricker Didier
Publication date: 08/12/2017

Articulated hand pose estimation is a challenging task for human-computer interaction. State-of-the-art hand pose estimation algorithms work only with one or a few subjects for which they have been calibrated or trained. In particular, hybrid methods based on learning followed by model fitting, or on model-based deep learning, do not explicitly consider varying hand shapes and sizes. In this work, we introduce a novel hybrid algorithm for simultaneously estimating the 3D hand pose and the bone lengths of the hand skeleton from a single depth image. The proposed CNN architecture learns hand pose parameters and the scale parameters associated with the bone lengths simultaneously. Subsequently, a new hybrid forward kinematics layer employs both sets of parameters to estimate the 3D joint positions of the hand. For end-to-end training, we combine three public datasets (NYU, ICVL, and MSRA-2015) in one unified format to achieve large variation in hand shapes and sizes. Among hybrid methods, our method shows improved accuracy over the state of the art on the combined dataset and on the ICVL dataset, both of which contain multiple subjects. Our algorithm is also demonstrated to work well with unseen images.
Comment: This paper was accepted and presented at the 3DV-2017 conference held in Qingdao, China. http://irc.cs.sdu.edu.cn/3dv

arXiv.org e-Print Archive

Crossref

An investigation into image-based indoor localization using deep learning

Author: Li Qing

Localization is one of the fundamental technologies for many applications such as location-based services (LBS), robotics, virtual reality (VR), autonomous driving, and pedestrian navigation. Traditional methods based on wireless signals and inertial measurement units (IMUs) have inherent disadvantages which limit their applications. Although image-based localization methods seem to be promising supplements to previous methods, their application in the indoor scenario faces many challenges. Compared to outdoor environments, indoor environments are more dynamic, which adds difficulty to map construction. Also, indoor scenes tend to be more similar to each other, which makes it difficult to distinguish different places with a similar appearance. Besides, how to utilize widely available 3D indoor structures to enhance localization performance remains to be well explored. Deep learning techniques have achieved significant progress in many computer vision tasks such as image classification, object detection, and monocular depth prediction, among others. However, their application to indoor image-based localization has not yet been well studied. In this thesis, we investigate image-based indoor localization through deep learning techniques. We study the problem from two perspectives: topological localization and metric localization. Topological localization tries to obtain a coarse location, whilst metric localization aims to provide an accurate pose, which includes both position and orientation. We also study indoor image localization with the assistance of 3D maps by taking advantage of the availability of many 3D maps of indoor scenes. We have made the following contributions: Our first contribution is an indoor topological localization framework inspired by the human self-localization strategy. In this framework, we propose a novel topological map representation that is robust to environmental changes.
Unlike previous topological maps, which are constructed by dividing the indoor scenes geometrically, with each region represented by the aggregation of features derived from the whole region, our topological map is constructed based on the fixed indoor elements, and each node is represented by their semantic attributes. Besides, an effective landmark detector is devised to extract semantic information about the objects of interest from smartphone video. We also present a new localization algorithm that matches the detected semantic landmark sequence against the proposed semantic topological map through their semantic and contextual information. Experiments are conducted on two test sites, and the results show that our landmark detector is capable of accurately detecting the landmarks and that the localization algorithm can perform localization accurately. The second contribution is that we advocate a direct learning-based method using convolutional neural networks (CNNs) to exploit the relative geometry constraints between images for image-based metric localization. We have developed a new convolutional neural network to predict the global poses and the relative pose of two images simultaneously. This multi-task learning strategy allows mutual regularization between the global pose regression and the relative pose regression. Furthermore, we designed a new loss function that embeds the relative pose information to distinguish the poses of similar images of different locations. We conduct extensive experiments to validate the effectiveness of the proposed method on two image localization benchmarks and achieve state-of-the-art performance compared to the other learning-based methods. Our third contribution is a single image localization framework in a 3D map. To the best of our knowledge, it is the first approach to localize a single image in a 3D map.
The framework includes four main steps: pose initialization, depth inference, local map extraction, and pose correction. The pose initialization step estimates the coarse pose with the learning-based pose regression approach. The depth inference step predicts the dense depth map from the single image. The local map extraction step extracts a local map from the global 3D map to increase efficiency. Given the local map and the generated point cloud, the Iterative Closest Point (ICP) algorithm is used to align the point cloud to the local map and then compute the pose correction of the coarse pose. As the key to the method is to accurately predict the depth from the images, a novel 3D map guided single image depth prediction approach is proposed. The proposed method utilizes both the 3D map and the RGB image: we use the RGB image to estimate a dense depth map and employ the 3D map to guide the depth estimation. We show that our new method significantly outperforms current RGB image-based depth estimation methods on both indoor and outdoor datasets. We also show that utilizing the depth map predicted by the new method for single indoor image localization can improve both position and orientation accuracy over state-of-the-art methods.
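The pose correction step described above aligns a point cloud to a local map with ICP. The following is a minimal 2D sketch of that idea, not the thesis implementation: the function names, the toy map, and the simulated coarse-pose error are hypothetical, and real systems work on 3D points with spatial indexing and outlier rejection.

```python
import math

def rigid_fit_2d(src, dst):
    """Least-squares rotation and translation mapping paired src points onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay, bx, by = px - sx, py - sy, qx - dx, qy - dy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)  # optimal rotation angle in closed form
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def transform(points, theta, tx, ty):
    """Rotate points by theta about the origin, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def icp_2d(cloud, local_map, iterations=10):
    """Alternate nearest-neighbour matching and rigid refitting."""
    for _ in range(iterations):
        matches = [min(local_map, key=lambda q: math.dist(p, q)) for p in cloud]
        cloud = transform(cloud, *rigid_fit_2d(cloud, matches))
    return cloud

# Toy L-shaped map; the "coarse pose" error is a small rotation plus shift.
local_map = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
coarse_cloud = transform(local_map, 0.1, 0.1, -0.05)
aligned = icp_2d(coarse_cloud, local_map)
```

ICP only converges to the right alignment when the initial error is small relative to the point spacing, which is why a coarse pose from the regression step is needed before the correction.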

Nottingham ePrints

Learning Robust Features and Latent Representations for Single View 3D Pose Estimation of Humans and Objects

Author: Tekin Bugra
Publication venue: Lausanne, EPFL
Publication date: 13/09/2018

Estimating the 3D poses of rigid and articulated bodies is one of the fundamental problems of Computer Vision. It has a broad range of applications including augmented reality, surveillance, animation and human-computer interaction. Despite the ever-growing demand driven by these applications, predicting 3D pose from a 2D image is a challenging and ill-posed problem due to the loss of depth information during projection from 3D to 2D. Although there have been years of research on the 3D pose estimation problem, it remains unsolved. In this thesis, we propose a variety of ways to tackle the 3D pose estimation problem for both articulated human bodies and rigid objects by learning robust features and latent representations. First, we present a novel video-based approach that exploits spatiotemporal features for 3D human pose estimation in a discriminative regression scheme. While early approaches typically account for motion information by temporally regularizing noisy pose estimates in individual frames, we demonstrate that taking into account motion information very early in the modeling process with spatiotemporal features yields significant performance improvements. We further propose a CNN-based motion compensation approach that stabilizes and centralizes the human body in the bounding boxes of consecutive frames to increase the reliability of spatiotemporal features. This then allows us to effectively overcome ambiguities and improve pose estimation accuracy. Second, we develop a novel Deep Learning framework for structured prediction of 3D human pose. Our approach relies on an auto-encoder to learn a high-dimensional latent pose representation that accounts for joint dependencies. We combine traditional CNNs for supervised learning with auto-encoders for structured learning and demonstrate that our approach outperforms the existing ones both in terms of structure preservation and prediction accuracy.
Third, we propose a 3D human pose estimation approach that relies on a two-stream neural network architecture to simultaneously exploit 2D joint location heatmaps and image features. We show that the 2D pose of a person, predicted in terms of heatmaps by a fully convolutional network, provides valuable cues to disambiguate challenging poses and results in increased pose estimation accuracy. We further introduce a novel and generic trainable fusion scheme, which automatically learns where and how to fuse the features extracted from the two different input modalities that a two-stream neural network operates on. Our trainable fusion framework selects the optimal network architecture on the fly and improves upon standard hard-coded network architectures. Fourth, we propose an efficient approach to estimate the 3D pose of objects from a single RGB image. Existing methods typically detect 2D bounding boxes and then predict the object pose using a pipelined approach. The redundancy in different parts of the architecture makes such methods computationally expensive. Moreover, the final pose estimation accuracy depends on the accuracy of the intermediate 2D object detection step. In our method, the object is classified and its pose is regressed in a single shot from the full image using a single, compact fully convolutional neural network. Our approach achieves state-of-the-art accuracy without requiring any costly pose refinement step and runs in real time at 50 fps on a modern GPU, which is at least 5X faster than the state of the art.

Infoscience - École polytechnique fédérale de Lausanne

Marker-free human motion capture in dynamic cluttered environments from a single view-point

Author: Grest Daniel
Publication date: 01/01/2007

Human Motion Capture is a widely used technique to obtain motion data for the animation of virtual characters. Commercial optical motion capture systems are marker-based. This thesis is about marker-free motion capture. The pose and motion estimation of an observed person is carried out in an optimization framework for articulated objects. The motion function is formulated with kinematic chains consisting of rotations around arbitrary axes in 3D space. This formulation leads to a Nonlinear Least Squares problem, which is solved with gradient-based methods. With the formulation in this thesis, the necessary derivatives can be derived analytically. This speeds up processing and increases accuracy. Different gradient-based methods are compared for solving the Nonlinear Least Squares problem, which allows the integration of second-order motion derivatives as well. The pose estimation requires correspondences between a known model of the person and the observed data. To obtain this model, a new method is developed, which fits a template model to a specific person from 6 posture images taken by a single camera. Various types of correspondences are integrated in the optimization simultaneously without making approximations to the motion or optimization function, namely 3D-3D correspondences from stereo algorithms and 3D-2D correspondences from image silhouettes and 2D point tracking. Of major importance for the developed methods are the processing time and the robustness to cluttered and dynamic backgrounds. Experiments show that complex motion with 24 degrees of freedom is trackable from a single stereo view until body parts get totally occluded. Further methods are developed to estimate pose from a single camera view with a cluttered dynamic background. Similar to other work on 2D-3D pose estimation, correspondences between the model and the image silhouette of the person are established by analyzing the gray value gradient near the predicted model silhouette.
To increase the accuracy of silhouette correspondences, color histograms for each body part are combined with image gradient search. The combination of 3D depth data and 2D image data is tested with depth data from a PMD camera (Photonic Mixer Device), which measures the depth to scene points by the time of flight of light
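The formulation above (kinematic chain, Nonlinear Least Squares, analytic derivatives) can be sketched on a toy problem. This is an illustrative Gauss-Newton solver under stated assumptions, not the thesis code: the two-bone planar chain, the bone lengths `L1` and `L2`, and the synthetic observations are hypothetical.

```python
import math

L1, L2 = 0.30, 0.25  # hypothetical bone lengths in metres

def joints(t1, t2):
    """Elbow and hand positions of a planar two-bone chain rooted at the origin."""
    ex, ey = L1 * math.cos(t1), L1 * math.sin(t1)
    hx, hy = ex + L2 * math.cos(t1 + t2), ey + L2 * math.sin(t1 + t2)
    return ex, ey, hx, hy

def jacobian(t1, t2):
    """Analytic derivatives of (ex, ey, hx, hy) with respect to (t1, t2)."""
    s1, c1 = math.sin(t1), math.cos(t1)
    s12, c12 = math.sin(t1 + t2), math.cos(t1 + t2)
    return [
        (-L1 * s1, 0.0),
        (L1 * c1, 0.0),
        (-L1 * s1 - L2 * s12, -L2 * s12),
        (L1 * c1 + L2 * c12, L2 * c12),
    ]

def gauss_newton(observed, t1, t2, iterations=20):
    """Minimise the squared distance between model joints and observed joints."""
    for _ in range(iterations):
        r = [p - o for p, o in zip(joints(t1, t2), observed)]
        J = jacobian(t1, t2)
        # Normal equations (J^T J) d = J^T r, solved in closed form for 2x2.
        a = sum(j[0] * j[0] for j in J)
        b = sum(j[0] * j[1] for j in J)
        c = sum(j[1] * j[1] for j in J)
        g1 = sum(j[0] * ri for j, ri in zip(J, r))
        g2 = sum(j[1] * ri for j, ri in zip(J, r))
        det = a * c - b * b
        t1 -= (c * g1 - b * g2) / det
        t2 -= (a * g2 - b * g1) / det
    return t1, t2

true_angles = (0.7, -0.4)
observed = joints(*true_angles)          # synthetic 2D correspondences
estimate = gauss_newton(observed, 0.3, 0.3)
```

Because the Jacobian is written out analytically, no finite differences are needed, which mirrors the speed and accuracy argument made in the abstract.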

MACAU: Open Access Repository of Kiel University

Probabilistic 3D Pose Reconstruction and Action Recognition

Author: Jungchan Cho
Publication venue: Seoul National University Graduate School
Publication date: 01/02/2016

Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2016. Songhwai Oh (오성회).
These days, computer vision technology has become popular and plays an important role in intelligent systems, such as augmented reality and video and image analysis, to name a few. Although cost-effective depth cameras, like the Microsoft Kinect, have recently been developed, most computer vision algorithms assume that observations are obtained from RGB cameras, which make 2D observations. If, somehow, we can estimate 3D information from 2D observations, it might give better solutions for many computer vision problems. In this dissertation, we focus on estimating 3D information from 2D observations, which is well known as non-rigid structure from motion (NRSfM). More formally, NRSfM finds the three-dimensional structure of an object by analyzing image streams with the assumption that an object lies in a low-dimensional space. However, a human body observed over long periods of time can have complex shape variations, which makes the problem challenging for NRSfM due to the increased degrees of freedom. In order to handle complex shape variations, we propose a Procrustean normal distribution mixture model (PNDMM) by extending a recently proposed Procrustean normal distribution (PND), which captures the distribution of non-rigid variations of an object by excluding the effects of rigid motion. Unlike existing methods, which use a single model to solve an NRSfM problem, the proposed PNDMM decomposes complex shape variations into a collection of simpler ones, so that model learning can be more tractable and accurate. We perform experiments showing that the proposed method outperforms existing methods on highly complex and long human motion sequences. In addition, we extend the PNDMM to a single view 3D human pose estimation problem. While recovering the 3D structure of a human body from an image is important, it is a highly ambiguous problem due to the deformation of an articulated human body.
Moreover, before estimating a 3D human pose from a 2D human pose, it is important to obtain an accurate 2D human pose. In order to address inaccuracy of 2D pose estimation on a single image and 3D human pose ambiguities, we estimate multiple 2D and 3D human pose candidates and select the best one which can be explained by a 2D human pose detector and a 3D shape model. We also introduce a model transformation which is incorporated into the 3D shape prior model, such that the proposed method can be applied to a novel test image. Experimental results show that the proposed method can provide good 3D reconstruction results when tested on a novel test image, despite inaccuracies of 2D part detections and 3D shape ambiguities. Finally, we handle an action recognition problem from a video clip. Current studies show that high-level features obtained from estimated 2D human poses enable action recognition performance beyond current state-of-the-art methods using low- and mid-level features based on appearance and motion, despite inaccuracy of human pose estimation. Based on these findings, we propose an action recognition method using estimated 3D human pose information since the proposed PNDMM is able to reconstruct 3D shapes from 2D shapes. Experimental results show that 3D pose based descriptors are better than 2D pose based descriptors for action recognition, regardless of classification methods. 
Considering the fact that we use simple 3D pose descriptors based on a 3D shape model which is learned from 2D shapes, results reported in this dissertation are promising and obtaining accurate 3D information from 2D observations is still an important research issue for reliable computer vision systems.
Contents:
Chapter 1 Introduction: 1.1 Motivation; 1.2 Research Issues; 1.3 Organization of the Dissertation
Chapter 2 Preliminary: 2.1 Generalized Procrustes Analysis (GPA); 2.2 EM-GPA Algorithm (2.2.1 Objective function; 2.2.2 E-step; 2.2.3 M-step); 2.3 Implementation Considerations for EM-GPA (2.3.1 Preprocessing stage; 2.3.2 Small update rate for the covariance matrix); 2.4 Experiments (2.4.1 Shape alignment with the missing information; 2.4.2 3D shape modeling; 2.4.3 2D+3D active appearance models); 2.5 Chapter Summary and Discussion
Chapter 3 Procrustean Normal Distribution Mixture Model: 3.1 Non-Rigid Structure from Motion; 3.2 Procrustean Normal Distribution (PND); 3.3 PND Mixture Model; 3.4 Learning a PNDMM (3.4.1 E-step; 3.4.2 M-step); 3.5 Learning an Adaptive PNDMM; 3.6 Experiments (3.6.1 Experimental setup; 3.6.2 CMU Mocap database; 3.6.3 UMPM dataset; 3.6.4 Simple and short motions; 3.6.5 Real sequence - qualitative representation); 3.7 Chapter Summary
Chapter 4 Recovering a 3D Human Pose from a Novel Image: 4.1 Single View 3D Human Pose Estimation; 4.2 Candidate Generation (4.2.1 Initial pose generation; 4.2.2 Part recombination); 4.3 3D Shape Prior Model (4.3.1 Procrustean mixture model learning; 4.3.2 Procrustean mixture model fitting); 4.4 Model Transformation (4.4.1 Model normalization; 4.4.2 Model adaptation); 4.5 Result Selection; 4.6 Experiments (4.6.1 Implementation details; 4.6.2 Evaluation of the joint 2D and 3D pose estimation; 4.6.3 Evaluation of the 2D pose estimation; 4.6.4 Evaluation of the 3D pose estimation); 4.7 Chapter Summary
Chapter 5 Application to Action Recognition: 5.1 Appearance and Motion Based Descriptors; 5.2 2D Pose Based Descriptors; 5.3 Bag-of-Features with a Multiple Kernel Method; 5.4 Classification - Kernel Group Sparse Representation (5.4.1 Group sparse representation for classification; 5.4.2 Kernel group sparse (KGS) representation for classification); 5.5 Experiment on sub-JHMDB Dataset (5.5.1 Experimental setup; 5.5.2 3D pose based descriptor; 5.5.3 Experimental results); 5.6 Chapter Summary
Chapter 6 Conclusion and Future Work
Appendices: A Proof of Propositions in Chapter 2 (A.1 Proof of Proposition 1; A.2 Proof of Proposition 3; A.3 Proof of Proposition 4); B Calculation of p(XijDii) in Chapter 3 (B.1 Without the Dirac-delta term; B.2 With the Dirac-delta term); C Procrustean Mixture Model Learning and Fitting in Chapter 4 (C.1 Procrustean Mixture Model Learning; C.2 Procrustean Mixture Model Fitting)
Bibliography
Abstract (in Korean)

SNU Open Repository and Archive

Expressive Whole-Body 3D Multi-Person Pose and Shape Estimation from a Single Image

Author: Moon Gyeongsik (문경식)
Publication venue: Seoul National University Graduate School
Publication date: 01/02/2021

Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Kyoung Mu Lee (이경무).
Humans are the most central and interesting subjects in our lives: many human-centric techniques and studies, such as motion capture and human-computer interaction, have been proposed in both industry and academia. Recovering accurate 3D human geometry (i.e., 3D human pose and shape) is a key component of these human-centric techniques and studies. With the rapid spread of cameras, a single RGB image has become a popular input, and many single RGB-based 3D human pose and shape estimation methods have been proposed. The 3D pose and shape of the whole body, which includes hands and face, provides expressive and rich information, including human intention and feeling. Unfortunately, recovering the whole-body 3D pose and shape is highly challenging; thus, it has been attempted by only a few works, called expressive methods. Instead of directly solving expressive 3D pose and shape estimation, the literature has developed methods that recover the 3D pose and shape of each part (i.e., body, hands, and face) separately, called part-specific methods. There are several further simplifications. For example, many works estimate only 3D pose without shape because additional 3D shape estimation makes the problem much harder. In addition, most works assume a single person case and do not consider a multi-person case. Therefore, there are several ways to categorize the current literature: 1) part-specific methods and expressive methods, 2) 3D human pose estimation methods and 3D human pose and shape estimation methods, and 3) methods for a single person and methods for multiple persons. The difficulty increases as the outputs of the methods become richer, moving from part-specific to expressive, from 3D pose estimation to 3D pose and shape estimation, and from a single person case to a multi-person case.
This dissertation introduces three approaches towards expressive 3D multi-person pose and shape estimation from a single image; thus, the output can finally provide the richest information. The first approach is for 3D multi-person body pose estimation, the second one is 3D multi-person body pose and shape estimation, and the final one is expressive 3D multi-person pose and shape estimation. Each approach tackles critical limitations of previous state-of-the-art methods, thus bringing the literature closer to the real-world environment. First, a 3D multi-person body pose estimation framework is introduced. In contrast to the single person case, the multi-person case additionally requires camera-relative 3D positions of the persons. Estimating the camera-relative 3D position from a single image involves high depth ambiguity. The proposed framework utilizes a deep image feature with the camera pinhole model to recover the camera-relative 3D position. The proposed framework can be combined with any 3D single person pose and shape estimation methods for 3D multi-person pose and shape. Therefore, the following two approaches focus on the single person case and can be easily extended to the multi-person case by using the framework of the first approach. Second, a 3D multi-person body pose and shape estimation method is introduced. It extends the first approach to additionally predict accurate 3D shape while its accuracy significantly outperforms previous state-of-the-art methods by proposing a new target representation, lixel-based 1D heatmap. Finally, an expressive 3D multi-person pose and shape estimation method is introduced. It integrates the part-specific 3D pose and shape of the above approaches; thus, it can provide expressive 3D human pose and shape. In addition, it boosts the accuracy of the estimated 3D pose and shape by proposing a 3D positional pose-guided 3D rotational pose prediction system. 
The proposed approaches successfully overcome the limitations of the previous state-of-the-art methods. The extensive experimental results demonstrate the superiority of the proposed approaches both qualitatively and quantitatively.
Contents:
1 Introduction: 1.1 Background and Research Issues; 1.2 Outline of the Dissertation
2 3D Multi-Person Pose Estimation: 2.1 Introduction; 2.2 Related works; 2.3 Overview of the proposed model; 2.4 DetectNet; 2.5 PoseNet (2.5.1 Model design; 2.5.2 Loss function); 2.6 RootNet (2.6.1 Model design; 2.6.2 Camera normalization; 2.6.3 Network architecture; 2.6.4 Loss function); 2.7 Implementation details; 2.8 Experiment (2.8.1 Dataset and evaluation metric; 2.8.2 Experimental protocol; 2.8.3 Ablation study; 2.8.4 Comparison with state-of-the-art methods; 2.8.5 Running time of the proposed framework; 2.8.6 Qualitative results); 2.9 Conclusion
3 3D Multi-Person Pose and Shape Estimation: 3.1 Introduction; 3.2 Related works; 3.3 I2L-MeshNet (3.3.1 PoseNet; 3.3.2 MeshNet; 3.3.3 Final 3D human pose and mesh; 3.3.4 Loss functions); 3.4 Implementation details; 3.5 Experiment (3.5.1 Datasets and evaluation metrics; 3.5.2 Ablation study; 3.5.3 Comparison with state-of-the-art methods); 3.6 Conclusion
4 Expressive 3D Multi-Person Pose and Shape Estimation: 4.1 Introduction; 4.2 Related works; 4.3 Pose2Pose (4.3.1 PositionNet; 4.3.2 RotationNet); 4.4 Expressive 3D human pose and mesh estimation (4.4.1 Body part; 4.4.2 Hand part; 4.4.3 Face part; 4.4.4 Training the networks; 4.4.5 Integration of all parts in the testing stage); 4.5 Implementation details; 4.6 Experiment (4.6.1 Training sets and evaluation metrics; 4.6.2 Ablation study; 4.6.3 Comparison with state-of-the-art methods; 4.6.4 Running time); 4.7 Conclusion
5 Conclusion and Future Work: 5.1 Summary and Contributions of the Dissertation; 5.2 Future Directions (5.2.1 Global Context-Aware 3D Multi-Person Pose Estimation; 5.2.2 Unified Framework for Expressive 3D Human Pose and Shape Estimation; 5.2.3 Enhancing Appearance Diversity of Images Captured from Multi-View Studio; 5.2.4 Extension to the video for temporally consistent estimation; 5.2.5 3D clothed human shape estimation in the wild; 5.2.6 Robust human action recognition from a video)
Bibliography
Abstract (in Korean)
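The abstract above mentions recovering camera-relative 3D positions with the pinhole camera model. As a reminder of what that model relates, here is a minimal sketch of pinhole projection and back-projection. It is not the thesis's network or its depth estimator: the function names and the intrinsics (`fx`, `fy`, `cx`, `cy`) are hypothetical, and the depth is simply assumed to be known.

```python
def project(point, fx, fy, cx, cy):
    """Project a camera-space 3D point (x, y, z) to pixel coordinates (u, v)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with known depth to a camera-relative 3D point."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

# Hypothetical intrinsics for a 1920x1080 image and a pixel at 3 m depth.
fx = fy = 1500.0
cx, cy = 960.0, 540.0
root_3d = back_project(1100.0, 400.0, 3.0, fx, fy, cx, cy)
pixel = project(root_3d, fx, fy, cx, cy)
```

The depth ambiguity noted in the abstract is visible here: every depth value along the same pixel ray projects back to the same `(u, v)`, so the image location alone cannot determine the 3D position.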

SNU Open Repository and Archive