    Probabilistic 3D Pose Reconstruction and Action Recognition

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2016. Advisor: Songhwai Oh.

    Computer vision has become widespread and plays an important role in intelligent systems such as augmented reality and video and image analysis. Although cost-effective depth cameras such as the Microsoft Kinect have recently been developed, most computer vision algorithms assume that observations come from RGB cameras, which produce 2D observations. If 3D information can be estimated from 2D observations, it may lead to better solutions for many computer vision problems. In this dissertation, we focus on estimating 3D information from 2D observations, a problem known as non-rigid structure from motion (NRSfM). More formally, NRSfM recovers the three-dimensional structure of an object by analyzing image streams under the assumption that the object's shapes lie in a low-dimensional space. However, a human body observed over a long period can undergo complex shape variations, and the resulting increase in degrees of freedom makes the problem challenging for NRSfM. To handle complex shape variations, we propose a Procrustean normal distribution mixture model (PNDMM), which extends the recently proposed Procrustean normal distribution (PND), a model that captures the distribution of an object's non-rigid variations by excluding the effects of rigid motion. Unlike existing methods, which use a single model to solve an NRSfM problem, the proposed PNDMM decomposes complex shape variations into a collection of simpler ones, so that model learning becomes more tractable and accurate. Experiments show that the proposed method outperforms existing methods on highly complex and long human motion sequences. In addition, we extend the PNDMM to single-view 3D human pose estimation. Recovering the 3D structure of a human body from an image is important, but it is highly ambiguous because the articulated human body deforms. Moreover, an accurate 2D human pose must be obtained before a 3D human pose can be estimated from it. To address both the inaccuracy of 2D pose estimation on a single image and the ambiguity of 3D human poses, we estimate multiple 2D and 3D human pose candidates and select the one best explained by a 2D human pose detector and a 3D shape model. We also introduce a model transformation, incorporated into the 3D shape prior model, so that the proposed method can be applied to a novel test image. Experimental results show that the proposed method provides good 3D reconstructions on novel test images despite inaccurate 2D part detections and 3D shape ambiguities. Finally, we address action recognition from a video clip. Recent studies show that high-level features obtained from estimated 2D human poses enable action recognition performance beyond state-of-the-art methods that use low- and mid-level features based on appearance and motion, despite the inaccuracy of human pose estimation. Based on these findings, and because the proposed PNDMM can reconstruct 3D shapes from 2D shapes, we propose an action recognition method that uses estimated 3D human pose information. Experimental results show that 3D pose based descriptors outperform 2D pose based descriptors for action recognition, regardless of the classification method.
Considering that we use simple 3D pose descriptors based on a 3D shape model learned from 2D shapes, the results reported in this dissertation are promising, and obtaining accurate 3D information from 2D observations remains an important research issue for reliable computer vision systems.

Contents
Chapter 1 Introduction: 1.1 Motivation; 1.2 Research Issues; 1.3 Organization of the Dissertation
Chapter 2 Preliminary: 2.1 Generalized Procrustes Analysis (GPA); 2.2 EM-GPA Algorithm (2.2.1 Objective function, 2.2.2 E-step, 2.2.3 M-step); 2.3 Implementation Considerations for EM-GPA (2.3.1 Preprocessing stage, 2.3.2 Small update rate for the covariance matrix); 2.4 Experiments (2.4.1 Shape alignment with the missing information, 2.4.2 3D shape modeling, 2.4.3 2D+3D active appearance models); 2.5 Chapter Summary and Discussion
Chapter 3 Procrustean Normal Distribution Mixture Model: 3.1 Non-Rigid Structure from Motion; 3.2 Procrustean Normal Distribution (PND); 3.3 PND Mixture Model; 3.4 Learning a PNDMM (3.4.1 E-step, 3.4.2 M-step); 3.5 Learning an Adaptive PNDMM; 3.6 Experiments (3.6.1 Experimental setup, 3.6.2 CMU Mocap database, 3.6.3 UMPM dataset, 3.6.4 Simple and short motions, 3.6.5 Real sequence - qualitative representation); 3.7 Chapter Summary
Chapter 4 Recovering a 3D Human Pose from a Novel Image: 4.1 Single View 3D Human Pose Estimation; 4.2 Candidate Generation (4.2.1 Initial pose generation, 4.2.2 Part recombination); 4.3 3D Shape Prior Model (4.3.1 Procrustean mixture model learning, 4.3.2 Procrustean mixture model fitting); 4.4 Model Transformation (4.4.1 Model normalization, 4.4.2 Model adaptation); 4.5 Result Selection; 4.6 Experiments (4.6.1 Implementation details, 4.6.2 Evaluation of the joint 2D and 3D pose estimation, 4.6.3 Evaluation of the 2D pose estimation, 4.6.4 Evaluation of the 3D pose estimation); 4.7 Chapter Summary
Chapter 5 Application to Action Recognition: 5.1 Appearance and Motion Based Descriptors; 5.2 2D Pose Based Descriptors; 5.3 Bag-of-Features with a Multiple Kernel Method; 5.4 Classification - Kernel Group Sparse Representation (5.4.1 Group sparse representation for classification, 5.4.2 Kernel group sparse (KGS) representation for classification); 5.5 Experiment on sub-JHMDB Dataset (5.5.1 Experimental setup, 5.5.2 3D pose based descriptor, 5.5.3 Experimental results); 5.6 Chapter Summary
Chapter 6 Conclusion and Future Work
Appendices: A Proof of Propositions in Chapter 2 (A.1 Proof of Proposition 1, A.2 Proof of Proposition 3, A.3 Proof of Proposition 4); B Calculation of p(XijDii) in Chapter 3 (B.1 Without the Dirac-delta term, B.2 With the Dirac-delta term); C Procrustean Mixture Model Learning and Fitting in Chapter 4 (C.1 Procrustean Mixture Model Learning, C.2 Procrustean Mixture Model Fitting)
Bibliography
Abstract (in Korean)
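
The abstract describes the PND as modeling non-rigid variation after excluding rigid motion, and Chapter 2 of the outline builds on Generalized Procrustes Analysis (GPA). As a rough illustration of that alignment idea only, the sketch below runs a plain GPA loop with NumPy: shapes are centered, rotated onto the current mean with an SVD-based orthogonal Procrustes step, and the mean is re-estimated. It is not the thesis's EM-GPA or PNDMM; scale normalization and missing data are ignored, and the names `align_to`, `generalized_procrustes`, `random_rotation`, and the 15-point synthetic skeleton are hypothetical.

```python
import numpy as np

def align_to(shape, ref):
    """Rotate a centered 3xP shape onto a centered 3xP reference."""
    u, _, vt = np.linalg.svd(ref @ shape.T)
    # Force det(R) = +1 so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    return rot @ shape

def generalized_procrustes(shapes, iters=10):
    """Alternate between aligning every shape to the mean and updating the mean."""
    centered = [s - s.mean(axis=1, keepdims=True) for s in shapes]
    mean = centered[0]
    for _ in range(iters):
        centered = [align_to(s, mean) for s in centered]
        mean = np.mean(centered, axis=0)
    return np.stack(centered), mean

def random_rotation(rng):
    """Draw a random 3x3 rotation matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

# Synthetic data (an assumption for illustration): rigidly moved copies
# of one 15-point shape, so all variation is rigid motion.
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 15))
observed = [random_rotation(rng) @ base + rng.normal(size=(3, 1)) for _ in range(5)]

aligned, mean_shape = generalized_procrustes(observed)
residual = float(np.abs(aligned - mean_shape).max())
```

Because the synthetic shapes differ only by rigid motion, the aligned shapes coincide with their mean up to floating-point error. With real non-rigid data, what remains after alignment is the non-rigid variation that a distribution such as the PND is meant to describe.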

    Multi-body Non-rigid Structure-from-Motion

    Conventional structure-from-motion (SFM) research is primarily concerned with the 3D reconstruction of a single, rigidly moving object seen by a static camera, or of a static and rigid scene observed by a moving camera; in both cases there is only one relative rigid motion involved. Recent progress has extended SFM to multi-body SFM (where there are multiple rigid relative motions in the scene) and to non-rigid SFM (where there is a single non-rigid, deformable object or scene). Along this line of thinking, there is an apparent gap: "multi-body non-rigid SFM", in which the task is to jointly reconstruct and segment the 3D structures of multiple non-rigid objects or deformable scenes from images. Such multi-body non-rigid scenarios are common in reality (e.g., two persons shaking hands, or a multi-person social event), and solving them is a natural next step in SFM research. By leveraging recent results in subspace clustering, this paper proposes, for the first time, an effective framework for multi-body NRSFM that simultaneously reconstructs the 3D trajectories and segments each one into its respective low-dimensional subspace. Under our formulation, the 3D trajectories of each non-rigid structure can be well approximated by a sparse affine combination of other 3D trajectories from the same structure (self-expressiveness). We solve the resulting optimization problem with the alternating direction method of multipliers (ADMM). We demonstrate the efficacy of the proposed framework through extensive experiments on both synthetic and real data sequences. Our method clearly outperforms alternative methods, such as first clustering the 2D feature tracks into groups and then performing non-rigid reconstruction within each group, or first performing 3D reconstruction under a single-subspace assumption and then clustering the 3D trajectories into groups.
    Comment: 21 pages, 16 figures
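
The self-expressiveness property mentioned in the abstract can be illustrated on synthetic data. The sketch below is a simplification, not the paper's method: instead of a sparse affine combination solved with ADMM, it uses the closed-form minimum-Frobenius-norm coefficient matrix C with XC = X (the projector onto the row space of X), and it skips 3D reconstruction entirely. The two random 3-dimensional subspaces, the sizes, and the names `sample_subspace`, `coeffs`, and `affinity` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subspace(dim_ambient, dim_sub, count):
    """Draw `count` columns from a random `dim_sub`-dimensional subspace."""
    basis = rng.normal(size=(dim_ambient, dim_sub))
    return basis @ rng.normal(size=(dim_sub, count))

# Columns play the role of trajectories; each group lives in its own subspace.
group_a = sample_subspace(30, 3, 12)
group_b = sample_subspace(30, 3, 10)
data = np.hstack([group_a, group_b])

# Minimum-norm self-expressive coefficients: C = V_r V_r^T, where V_r holds
# the right singular vectors for the numerically nonzero singular values.
_, sing, vt = np.linalg.svd(data, full_matrices=False)
rank = int(np.sum(sing > 1e-10 * sing[0]))
v_r = vt[:rank].T
coeffs = v_r @ v_r.T

# Symmetric affinity between trajectories, as used in subspace clustering.
affinity = np.abs(coeffs) + np.abs(coeffs).T
cross = float(affinity[:12, 12:].max())   # affinity between the two groups
within = float(affinity[:12, :12].max())  # affinity inside the first group
```

For noise-free data from independent subspaces, the coefficient matrix is block diagonal, so `cross` is zero up to rounding while `within` is not. A clustering step applied to `affinity` would then separate the two groups; sparse formulations such as the one in the paper aim for the same structure under noise and for affine combinations.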

    A closed-form solution to estimate uncertainty in non-rigid structure from motion

    Semi-Definite Programming (SDP) with a low-rank prior has been widely applied in Non-Rigid Structure from Motion (NRSfM). Because it relies on a low-rank constraint, it avoids the inherent ambiguity of choosing the number of bases in conventional base-shape or base-trajectory methods. Despite its efficiency in deformable shape reconstruction, it remains unclear how to assess the uncertainty of the shape recovered by the SDP process. In this paper, we present a statistical inference on the element-wise uncertainty quantification of the estimated deforming 3D shape points in the case of the exact low-rank SDP problem. A closed-form uncertainty quantification method is proposed and tested. Moreover, we extend the exact low-rank uncertainty quantification to the approximate low-rank scenario with a numerical optimal rank selection method, which makes it applicable to practical SDP-based NRSfM problems. The proposed method is an independent module alongside the SDP method and requires only the statistical information of the input 2D tracked points. Extensive experiments show that the output 3D points follow the same normal distribution as the 2D tracks, that the proposed method quantifies the uncertainty accurately, and that it benefits routine SDP low-rank-based NRSfM solvers.
    Comment: 9 pages, 2 figures

    Image collection pop-up: 3D reconstruction and clustering of rigid and non-rigid categories

    © 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

    This paper introduces an approach that simultaneously estimates 3D shape and camera pose and clusters instances by object and by type of deformation, from partial 2D annotations in a multi-instance collection of images. Furthermore, it handles rigid and non-rigid categories in the same way. This advances existing work, which either addresses the problem for a single object or, when multiple objects are considered, assumes that they are clustered a priori. To handle this broader version of the problem, we model object deformation using a formulation based on multiple unions of subspaces, able to span from small rigid motion to complex deformations. The parameters of this model are learned via Augmented Lagrange Multipliers in a completely unsupervised manner that requires no training data at all. Extensive validation is provided in a wide variety of synthetic and real scenarios, including rigid and non-rigid categories with small and large deformations. In all cases our approach outperforms the state of the art in terms of 3D reconstruction accuracy, while also providing clustering results that allow segmenting the images into object instances and their associated type of deformation (or the action the object is performing).
    Postprint (author's final draft)

    MHR-Net: Multiple-Hypothesis Reconstruction of Non-Rigid Shapes from 2D Views

    We propose MHR-Net, a novel method for recovering Non-Rigid Shapes from Motion (NRSfM). MHR-Net aims to find a set of reasonable reconstructions for a 2D view, and it also selects the most likely reconstruction from the set. To deal with the challenging unsupervised generation of non-rigid shapes, we develop a new Deterministic Basis and Stochastic Deformation scheme in MHR-Net. The non-rigid shape is first expressed as the sum of a coarse shape basis and a flexible shape deformation, then multiple hypotheses are generated with uncertainty modeling of the deformation part. MHR-Net is optimized with reprojection loss on the basis and the best hypothesis. Furthermore, we design a new Procrustean Residual Loss, which reduces the rigid rotations between similar shapes and further improves the performance. Experiments show that MHR-Net achieves state-of-the-art reconstruction accuracy on Human3.6M, SURREAL and 300-VW datasets.
    Comment: Accepted to ECCV 202
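
The "basis plus deformation" decomposition and the idea of scoring hypotheses by reprojection can be sketched without any network. In the example below, each hypothesis is the sum of a shared basis shape and one candidate deformation, and the hypothesis with the lowest orthographic reprojection error is picked. Selecting by minimum reprojection error, the orthographic camera, the synthetic 17-point shape, and the names `reprojection_error` and `select_hypothesis` are all assumptions for illustration; the abstract does not specify how MHR-Net scores or selects hypotheses internally.

```python
import numpy as np

def reprojection_error(shape3d, rotation, points2d):
    """Mean squared 2D error under an orthographic camera (depth row dropped)."""
    projected = (rotation @ shape3d)[:2]
    return float(np.mean(np.sum((projected - points2d) ** 2, axis=0)))

def select_hypothesis(basis, deformations, rotation, points2d):
    """Score every hypothesis basis + deformation and return the best index."""
    errors = [reprojection_error(basis + d, rotation, points2d) for d in deformations]
    return int(np.argmin(errors)), errors

rng = np.random.default_rng(1)
num_points = 17
basis = rng.normal(size=(3, num_points))              # coarse shared shape
true_deformation = 0.1 * rng.normal(size=(3, num_points))

angle = 0.3
rotation = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                     [np.sin(angle),  np.cos(angle), 0.0],
                     [0.0, 0.0, 1.0]])
points2d = (rotation @ (basis + true_deformation))[:2]

# Four random deformation samples plus the true one, placed at index 2.
deformations = [0.1 * rng.normal(size=(3, num_points)) for _ in range(4)]
deformations.insert(2, true_deformation)

best, errors = select_hypothesis(basis, deformations, rotation, points2d)
```

Here the true deformation reprojects exactly onto the observed 2D points, so it wins the selection. In an unsupervised setting no true deformation is available, and reprojection error alone cannot resolve depth, which is one reason additional terms such as the Procrustean Residual Loss are useful.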

    3차원 μ‚¬λžŒ μžμ„Έ 좔정을 μœ„ν•œ 3차원 볡원, μ•½μ§€λ„ν•™μŠ΅, μ§€λ„ν•™μŠ΅ 방법

    Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Convergence Systems major), February 2019. Advisor: Nojun Kwak.

    Estimating human poses from images is one of the fundamental tasks in computer vision and underlies many applications such as action recognition, human-computer interaction, and virtual reality. In particular, estimating 3D human poses from 2D inputs is challenging because the problem is inherently under-constrained. In addition, 3D ground truth data for human poses can be obtained only in limited and restricted environments. This dissertation studies 3D human pose estimation from several angles, depending on what kind of data is available. To this end, three methods for retrieving 3D human poses from 2D observations or from RGB images are proposed: 3D reconstruction, weakly-supervised learning, and supervised learning. First, a non-rigid structure from motion (NRSfM) algorithm is proposed that reconstructs the 3D structures of non-rigid objects such as human bodies from 2D observations. In the proposed framework, named Procrustean Regression, the 3D shapes are regularized based on their aligned shapes. We show that the cost function of Procrustean Regression can be cast as an unconstrained problem or a problem with simple bound constraints, which can be solved efficiently by existing gradient descent solvers. The framework can be easily integrated with numerous existing models and assumptions, which makes it practical for a variety of real situations. The experimental results show that the proposed method gives results competitive with state-of-the-art methods for orthographic projection with much lower time complexity and memory requirements, and outperforms existing methods for perspective projection.
Second, a weakly-supervised learning method is presented that can learn 3D structures when only 2D ground truth data is available for training. Extending the Procrustean Regression framework, we propose the Procrustean Regression Network, a learning method that trains neural networks to learn 3D structures from training data with 2D ground truths. This is the first attempt to integrate an NRSfM algorithm directly into neural network training. It is also the first time a cost function containing a low-rank function has been used to train neural networks that reconstruct 3D shapes. During the test phase, 3D structures of human bodies are obtained via a feed-forward operation, which gives the framework much faster inference than 3D reconstruction algorithms. Third, a supervised learning method that infers 3D poses from 2D inputs using neural networks is proposed. The method exploits a relational unit that captures the relations between different body parts. Each pair of different body parts generates relational features, and the average of the features from all pairs is used for 3D pose estimation. We also propose a dropout method called relational dropout, which can be used in relational modules to provide robustness to occlusions. The experimental results validate that the performance of the proposed algorithm does not degrade much when points are missing, while it maintains state-of-the-art performance when every point is visible.

Abstract (translated from Korean): Human pose estimation from RGB images is important in computer vision and is a foundation for many applications. It can serve as a base technology in a wide range of fields such as action recognition, human-computer interaction, virtual reality, and augmented reality. In particular, estimating 3D human pose from 2D input is known to be difficult because the problem admits infinitely many solutions.
Moreover, 3D ground truth data can be acquired only in restricted environments such as motion capture studios, so the amount of available data is limited. This dissertation studies methods for estimating 3D human pose from several angles, depending on the type of training data available. Specifically, three methods are presented for estimating and reconstructing 3D human pose from 2D observations or RGB images: 3D reconstruction, weakly-supervised learning, and supervised learning. First, a non-rigid structure from motion algorithm is proposed that reconstructs the 3D structure of non-rigid objects such as the human body from 2D observations. In the proposed framework, named Procrustean regression, the 3D shapes are regularized by a function of their aligned shapes. The cost function of Procrustean regression folds the constraints related to 3D shape alignment into the cost itself, so it can be optimized with gradient descent. The proposed method can incorporate various models and assumptions, which makes it a practical and flexible framework. Various experiments show that the proposed method performs comparably to state-of-the-art methods while being superior to existing methods in time and space complexity. The second proposed method is a weakly-supervised learning method that reconstructs 3D structure from 2D input when only 2D training data is given. The proposed learning method, named Procrustean regression network, learns to estimate 3D pose from 2D human pose through a neural network or a convolutional neural network. This method, which trains the network with a modified version of the Procrustean regression cost function, is the first attempt to apply a cost function used for non-rigid structure from motion to neural network training. It is also the first to use the low-rank function from that cost function in neural network training.
For test data, the 3D human pose is obtained by a feed-forward pass of the network, which allows much faster 3D pose estimation than 3D reconstruction methods. Finally, a supervised learning method that estimates 3D human pose from 2D input using neural networks is presented. The method uses relational modules to learn the relations between different body parts. Relational features are extracted for each pair of different parts, and the average of all relational features is used for the final 3D pose estimation. A new training method called relational dropout is also presented, giving a 3D pose estimation method that works robustly when some 2D observations are missing because of occlusion. Experiments demonstrate that the method estimates 3D pose effectively, without a large drop in performance, even when only part of the 2D observations is given.

Contents
Front matter: Abstract; Contents; List of Tables; List of Figures
1 Introduction: 1.1 Problem Definition; 1.2 Motivation; 1.3 Challenges; 1.4 Contributions (1.4.1 3D Reconstruction of Human Bodies, 1.4.2 Weakly-Supervised Learning for 3D HPE, 1.4.3 Supervised Learning for 3D HPE); 1.5 Outline
2 Related Works: 2.1 2D Human Pose Estimation; 2.2 3D Human Pose Estimation; 2.3 Non-rigid Structure from Motion; 2.4 Learning to Reconstruct 3D Structures via Neural Networks
3 3D Reconstruction of Human Bodies via Procrustean Regression: 3.1 Formalization of NRSfM; 3.2 Procrustean Regression (3.2.1 The Cost Function of Procrustean Regression, 3.2.2 Derivatives of the Cost Function, 3.2.3 Example Functions for f and g, 3.2.4 Handling Missing Points, 3.2.5 Optimization, 3.2.6 Initialization); 3.3 Experimental Results (3.3.1 Orthographic Projection, 3.3.2 Perspective Projection); 3.4 Discussion; 3.5 Conclusion
4 Weakly-Supervised Learning of 3D Human Pose via Procrustean Regression Networks: 4.1 The Cost Function for Procrustean Regression Network; 4.2 Choosing f and g for Procrustean Regression Network; 4.3 Implementation Details; 4.4 Experimental Results; 4.5 Conclusion
5 Supervised Learning of 3D Human Pose via Relational Networks: 5.1 Relational Networks; 5.2 Relational Networks for 3D HPE; 5.3 Extensions to Multi-Frame Inputs; 5.4 Relational Dropout; 5.5 Implementation Details; 5.6 Experimental Results; 5.7 Conclusion
6 Concluding Remarks: 6.1 Summary; 6.2 Limitations; 6.3 Future Directions
Abstract (In Korean)
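
The relational unit described in the abstract (one feature vector per pair of body parts, averaged over all pairs) can be sketched in a few lines of NumPy. Everything concrete here is an assumption for illustration: a single random linear layer with `tanh` stands in for the learned pair network, each body part is a 4-dimensional vector, and relational dropout is approximated by leaving out every pair that involves a dropped part before averaging. The thesis's actual architecture and dropout rule may differ.

```python
import numpy as np
from itertools import combinations

def pair_feature(part_a, part_b, weight):
    """Hypothetical pair network: one linear layer followed by tanh."""
    return np.tanh(weight @ np.concatenate([part_a, part_b]))

def relational_feature(parts, weight, keep=None):
    """Average pair features over all pairs of kept body parts."""
    keep = [True] * len(parts) if keep is None else keep
    feats = [pair_feature(parts[i], parts[j], weight)
             for i, j in combinations(range(len(parts)), 2)
             if keep[i] and keep[j]]
    if not feats:
        raise ValueError("at least two kept parts are required")
    return np.mean(feats, axis=0)

rng = np.random.default_rng(2)
parts = [rng.normal(size=4) for _ in range(5)]  # e.g. 2D joints grouped per part
weight = rng.normal(size=(8, 8))                # maps an 8-d pair to 8 features

full = relational_feature(parts, weight)
# Simulate an occluded third part by dropping every pair that contains it.
occluded = relational_feature(parts, weight, keep=[True, True, False, True, True])
```

Because the output is an average over pairs, it has the same size whether or not a part is dropped, so a downstream pose regressor can be applied unchanged. Training with randomly dropped parts is one plausible way to obtain the robustness to missing points that the abstract reports.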

    Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion Modeling

    Most previous work on 3D human pose estimation has relied on the memorization capacity of the network to obtain suitable 2D-3D mappings from the training data. Few works have studied how to model the deformation of human posture in motion. In this paper, we propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior. Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into the estimation of a 3D reference skeleton and a frame-by-frame skeleton deformation. A mixed spatial-temporal NRSfMformer simultaneously estimates the 3D reference skeleton and the skeleton deformation of each frame from a sequence of 2D observations, and the two are then summed to obtain the pose of each frame. Subsequently, a loss term based on the diffusion model is used to ensure that the pipeline learns the correct prior motion knowledge. Finally, we evaluate the proposed method on mainstream datasets and obtain results that outperform the state of the art.
