135 research outputs found

    ํ™•๋ฅ ์ ์ธ 3์ฐจ์› ์ž์„ธ ๋ณต์›๊ณผ ํ–‰๋™์ธ์‹

    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2016. 2. ์˜ค์„ฑํšŒ.These days, computer vision technology becomes popular and plays an important role in intelligent systems, such as augment reality, video and image analysis, and to name a few. Although cost effective depth cameras, like a Microsoft Kinect, have recently developed, most computer vision algorithms assume that observations are obtained from RGB cameras, which make 2D observations. If, somehow, we can estimate 3D information from 2D observations, it might give better solutions for many computer vision problems. In this dissertation, we focus on estimating 3D information from 2D observations, which is well known as non-rigid structure from motion (NRSfM). More formally, NRSfM finds the three dimensional structure of an object by analyzing image streams with the assumption that an object lies in a low-dimensional space. However, a human body for long periods of time can have complex shape variations and it makes a challenging problem for NRSfM due to its increased degree of freedom. In order to handle complex shape variations, we propose a Procrustean normal distribution mixture model (PNDMM) by extending a recently proposed Procrustean normal distribution (PND), which captures the distribution of non-rigid variations of an object by excluding the effects of rigid motion. Unlike existing methods which use a single model to solve an NRSfM problem, the proposed PNDMM decomposes complex shape variations into a collection of simpler ones, thereby model learning can be more tractable and accurate. We perform experiments showing that the proposed method outperforms existing methods on highly complex and long human motion sequences. In addition, we extend the PNDMM to a single view 3D human pose estimation problem. While recovering a 3D structure of a human body from an image is important, it is a highly ambiguous problem due to the deformation of an articulated human body. 
    Moreover, before estimating a 3D human pose from a 2D human pose, it is important to obtain an accurate 2D human pose. In order to address the inaccuracy of 2D pose estimation on a single image and the ambiguity of 3D human poses, we estimate multiple 2D and 3D human pose candidates and select the best one that can be explained by both a 2D human pose detector and a 3D shape model. We also introduce a model transformation, incorporated into the 3D shape prior model, so that the proposed method can be applied to a novel test image. Experimental results show that the proposed method provides good 3D reconstruction results when tested on a novel test image, despite inaccuracies in 2D part detections and 3D shape ambiguities. Finally, we handle the action recognition problem from a video clip. Recent studies show that high-level features obtained from estimated 2D human poses enable action recognition performance beyond that of state-of-the-art methods using low- and mid-level features based on appearance and motion, despite the inaccuracy of human pose estimation. Based on these findings, we propose an action recognition method using estimated 3D human pose information, since the proposed PNDMM is able to reconstruct 3D shapes from 2D shapes. Experimental results show that 3D pose based descriptors are better than 2D pose based descriptors for action recognition, regardless of the classification method.
    Considering that we use simple 3D pose descriptors based on a 3D shape model learned from 2D shapes, the results reported in this dissertation are promising, and obtaining accurate 3D information from 2D observations remains an important research issue for reliable computer vision systems.
    Contents: Chapter 1 Introduction (Motivation; Research Issues; Organization of the Dissertation). Chapter 2 Preliminary (Generalized Procrustes Analysis (GPA); EM-GPA Algorithm: objective function, E-step, M-step; Implementation Considerations for EM-GPA: preprocessing stage, small update rate for the covariance matrix; Experiments: shape alignment with missing information, 3D shape modeling, 2D+3D active appearance models; Chapter Summary and Discussion). Chapter 3 Procrustean Normal Distribution Mixture Model (Non-Rigid Structure from Motion; Procrustean Normal Distribution (PND); PND Mixture Model; Learning a PNDMM: E-step, M-step; Learning an Adaptive PNDMM; Experiments: experimental setup, CMU Mocap database, UMPM dataset, simple and short motions, real sequence with qualitative representation; Chapter Summary). Chapter 4 Recovering a 3D Human Pose from a Novel Image (Single View 3D Human Pose Estimation; Candidate Generation: initial pose generation, part recombination; 3D Shape Prior Model: Procrustean mixture model learning and fitting; Model Transformation: model normalization, model adaptation; Result Selection; Experiments: implementation details, evaluation of the joint 2D and 3D pose estimation, of the 2D pose estimation, and of the 3D pose estimation; Chapter Summary). Chapter 5 Application to Action Recognition (Appearance and Motion Based Descriptors; 2D Pose Based Descriptors; Bag-of-Features with a Multiple Kernel Method; Classification with Kernel Group Sparse Representation: group sparse representation, kernel group sparse (KGS) representation; Experiment on the sub-JHMDB Dataset: experimental setup, 3D pose based descriptor, experimental results; Chapter Summary). Chapter 6 Conclusion and Future Work. Appendices: A, proofs of Propositions 1, 3, and 4 from Chapter 2; B, calculation of p(X_i | D_i) in Chapter 3, without and with the Dirac-delta term; C, Procrustean mixture model learning and fitting in Chapter 4. Bibliography. Abstract (in Korean).
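    The dissertation's Procrustean models all rest on factoring rigid motion out of a shape before modeling its non-rigid variation. As an illustration only, not the PNDMM itself, a minimal orthogonal Procrustes alignment in NumPy (all names here are mine):

```python
import numpy as np

def procrustes_align(X, Y):
    """Rotate and translate Y to best match X in the least-squares sense.

    X, Y: (n_points, 3) arrays. Translation is removed by centering;
    the rotation is the SVD solution of the orthogonal Procrustes problem.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, _, Vt = np.linalg.svd(Yc.T @ Xc)
    # Guard against reflections: force det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt                  # rotation applied to the rows of Yc
    return Yc @ R + X.mean(axis=0)

# A rotated, translated copy of a shape aligns back onto the original.
rng = np.random.default_rng(0)
shape = rng.standard_normal((15, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
moved = shape @ Rz.T + np.array([1.0, -2.0, 0.5])
aligned = procrustes_align(shape, moved)
err = np.abs(aligned - shape).max()
```

    After this alignment step, whatever variation remains between shapes is non-rigid, which is the quantity a Procrustean distribution models.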

    Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image

    We propose a unified formulation for the problem of 3D human pose estimation from a single raw RGB image that reasons jointly about 2D joint estimation and 3D pose reconstruction to improve both tasks. We take an integrated approach that fuses probabilistic knowledge of 3D human pose with a multi-stage CNN architecture and uses the knowledge of plausible 3D landmark locations to refine the search for better 2D locations. The entire process is trained end-to-end, is extremely efficient, and obtains state-of-the-art results on Human3.6M, outperforming previous approaches on both 2D and 3D errors. Comment: Paper presented at CVPR 1
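    The fusion the paper describes, a 3D pose prior refining 2D estimates, can be caricatured with a linear lifting step: fit a learned 3D basis to the 2D detections, then reproject. The basis `B`, joint count, and orthographic camera below are illustrative assumptions, not the paper's multi-stage CNN:

```python
import numpy as np

rng = np.random.default_rng(1)
J, K = 14, 4                          # joints, basis size (illustrative)
B = rng.standard_normal((K, 3 * J))   # hypothetical learned 3D pose basis
mean_pose = rng.standard_normal(3 * J)

P = np.zeros((2 * J, 3 * J))          # orthographic camera: keep x, y per joint
for j in range(J):
    P[2 * j, 3 * j] = 1.0
    P[2 * j + 1, 3 * j + 1] = 1.0

def lift_and_refine(pose2d, lam=1e-2):
    """Fit basis coefficients so the projected 3D pose matches the 2D
    detections (ridge-regularized least squares), then reproject the
    fitted 3D pose as a 'refined' 2D estimate."""
    A = P @ B.T                               # (2J, K): projected basis
    b = pose2d - P @ mean_pose
    w = np.linalg.solve(A.T @ A + lam * np.eye(K), A.T @ b)
    pose3d = mean_pose + B.T @ w
    return pose3d, P @ pose3d

# Synthetic check: a pose generated from the basis is recovered
# from its own projection when regularization is negligible.
w_true = rng.standard_normal(K)
gt3d = mean_pose + B.T @ w_true
obs2d = P @ gt3d
pose3d, refined2d = lift_and_refine(obs2d, lam=1e-8)
err2d = np.abs(refined2d - obs2d).max()
```

    In the paper this feedback loop is learned end-to-end rather than solved in closed form per image.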

    3์ฐจ์› ์‚ฌ๋žŒ ์ž์„ธ ์ถ”์ •์„ ์œ„ํ•œ 3์ฐจ์› ๋ณต์›, ์•ฝ์ง€๋„ํ•™์Šต, ์ง€๋„ํ•™์Šต ๋ฐฉ๋ฒ•

    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์œตํ•ฉ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™์› ์œตํ•ฉ๊ณผํ•™๋ถ€(์ง€๋Šฅํ˜•์œตํ•ฉ์‹œ์Šคํ…œ์ „๊ณต), 2019. 2. ๊ณฝ๋…ธ์ค€.Estimating human poses from images is one of the fundamental tasks in computer vision, which leads to lots of applications such as action recognition, human-computer interaction, and virtual reality. Especially, estimating 3D human poses from 2D inputs is a challenging problem since it is inherently under-constrained. In addition, obtaining 3D ground truth data for human poses is only possible under the limited and restricted environments. In this dissertation, 3D human pose estimation is studied in different aspects focusing on various types of the availability of the data. To this end, three different methods to retrieve 3D human poses from 2D observations or from RGB images---algorithms of 3D reconstruction, weakly-supervised learning, and supervised learning---are proposed. First, a non-rigid structure from motion (NRSfM) algorithm that reconstructs 3D structures of non-rigid objects such as human bodies from 2D observations is proposed. In the proposed framework which is named as Procrustean Regression, the 3D shapes are regularized based on their aligned shapes. We show that the cost function of the Procrustean Regression can be casted into an unconstrained problem or a problem with simple bound constraints, which can be efficiently solved by existing gradient descent solvers. This framework can be easily integrated with numerous existing models and assumptions, which makes it more practical for various real situations. The experimental results show that the proposed method gives competitive result to the state-of-the-art methods for orthographic projection with much less time complexity and memory requirement, and outperforms the existing methods for perspective projection. 
    Second, a weakly-supervised learning method that is capable of learning 3D structures when only 2D ground truth data is available as a training set is presented. Extending the Procrustean Regression framework, we propose Procrustean Regression Network, a method that trains neural networks to learn 3D structures from training data with only 2D ground truths. This is the first attempt to directly integrate an NRSfM algorithm into neural network training, and the first to use a cost function containing a low-rank function to train a neural network that reconstructs 3D shapes. During the test phase, the 3D structure of a human body is obtained via a feed-forward operation, which gives the framework much faster inference than 3D reconstruction algorithms. Third, a supervised learning method that infers 3D poses from 2D inputs using neural networks is suggested. The method exploits a relational unit that captures the relations between different body parts: each pair of body parts generates relational features, and the average of the features from all pairs is used for 3D pose estimation. We also suggest a dropout method called relational dropout, which can be used in relational modules to impose robustness to occlusions. The experimental results validate that the performance of the proposed algorithm does not degrade much when points are missing, while maintaining state-of-the-art performance when every point is visible.
    (Korean abstract.) Human pose estimation from RGB images is important in computer vision and is a foundational technique for many applications, including action recognition, human-computer interaction, virtual reality, and augmented reality. In particular, estimating a 3D human pose from 2D inputs is known to be difficult because the problem admits an unlimited number of solutions. Moreover, 3D ground truth data can be acquired only in restricted environments such as motion capture studios, so the amount of obtainable data is limited. In this dissertation, 3D human pose estimation is studied in several directions according to the kind of training data that can be obtained. Specifically, three methods for estimating and reconstructing 3D human poses from 2D observations or RGB images are presented: 3D reconstruction, weakly-supervised learning, and supervised learning. First, a non-rigid structure from motion algorithm is proposed that reconstructs the 3D structures of non-rigid objects, such as human bodies, from 2D observations. In the proposed framework, named Procrustean Regression, 3D shapes are regularized as a function of their aligned shapes. The cost function of Procrustean Regression incorporates the constraints related to 3D shape alignment, enabling optimization by gradient descent. The method can incorporate various models and assumptions, making it a practical and flexible framework. Experiments show that it performs comparably to state-of-the-art methods while being superior to existing methods in time and space complexity. The second method is a weakly-supervised learning method that reconstructs 3D structures from 2D inputs when only 2D training data is given. The proposed method, named Procrustean Regression Network, trains a neural network or a convolutional neural network to estimate 3D human poses from 2D poses; modifying the Procrustean Regression cost function to train neural networks is the first attempt to apply an NRSfM cost function to neural network training, and the low-rank function in the cost is used for neural network training for the first time. Since 3D poses for test data are obtained by a feed-forward operation of the network, 3D pose estimation is much faster than with 3D reconstruction methods. Finally, a supervised learning method is presented that estimates 3D human poses from 2D inputs using neural networks. The method uses relational modules to learn the relations between different body parts: relational features are extracted for each pair of parts, and the average of all relational features is used for the final 3D pose estimation. A new training scheme called relational dropout is also proposed, yielding 3D pose estimation that operates robustly when some 2D observations are missing due to occlusion. Experiments demonstrate that the method estimates 3D poses effectively, without a large performance drop, even when only part of the 2D observations is given.
    Contents: Front matter (Abstract; Contents; List of Tables; List of Figures). 1 Introduction (Problem Definition; Motivation; Challenges; Contributions: 3D reconstruction of human bodies, weakly-supervised learning for 3D HPE, supervised learning for 3D HPE; Outline). 2 Related Works (2D Human Pose Estimation; 3D Human Pose Estimation; Non-rigid Structure from Motion; Learning to Reconstruct 3D Structures via Neural Networks). 3 3D Reconstruction of Human Bodies via Procrustean Regression (Formalization of NRSfM; Procrustean Regression: the cost function, derivatives of the cost function, example functions for f and g, handling missing points, optimization, initialization; Experimental Results: orthographic projection, perspective projection; Discussion; Conclusion). 4 Weakly-Supervised Learning of 3D Human Pose via Procrustean Regression Networks (The Cost Function for Procrustean Regression Network; Choosing f and g; Implementation Details; Experimental Results; Conclusion). 5 Supervised Learning of 3D Human Pose via Relational Networks (Relational Networks; Relational Networks for 3D HPE; Extensions to Multi-Frame Inputs; Relational Dropout; Implementation Details; Experimental Results; Conclusion). 6 Concluding Remarks (Summary; Limitations; Future Directions). Abstract (in Korean).
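    A relational unit of the kind the third method describes, pairwise features averaged over all body-part pairs, might be sketched as below. The feature size, the single-layer pairwise map, and the way dropped joints are handled are assumptions for illustration, not the dissertation's architecture; its relational dropout operates during training, whereas this toy simply skips pairs touching an occluded joint:

```python
import numpy as np

rng = np.random.default_rng(2)
J, F = 16, 8                            # joints, relational feature size (illustrative)
W = 0.1 * rng.standard_normal((4, F))   # hypothetical weights of a pairwise map

def relational_features(pose2d, drop_mask=None):
    """Average a learned feature over all joint pairs; optionally skip
    pairs that touch a 'dropped' (occluded) joint."""
    feats, count = np.zeros(F), 0
    for i in range(J):
        for j in range(i + 1, J):
            if drop_mask is not None and (drop_mask[i] or drop_mask[j]):
                continue                              # pair is dropped
            pair = np.concatenate([pose2d[i], pose2d[j]])   # (4,) input
            feats += np.tanh(pair @ W)                # per-pair feature
            count += 1
    return feats / max(count, 1)

pose = rng.standard_normal((J, 2))
full = relational_features(pose)
mask = np.zeros(J, dtype=bool)
mask[3] = True                           # pretend joint 3 is occluded
partial = relational_features(pose, mask)
```

    Because the output is an average over pairs, removing the pairs of one joint perturbs the feature vector rather than invalidating it, which is the intuition behind robustness to missing points.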

    Real-time 3D reconstruction of non-rigid shapes with a single moving camera

    ยฉ . This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/This paper describes a real-time sequential method to simultaneously recover the camera motion and the 3D shape of deformable objects from a calibrated monocular video. For this purpose, we consider the Navier-Cauchy equations used in 3D linear elasticity and solved by finite elements, to model the time-varying shape per frame. These equations are embedded in an extended Kalman filter, resulting in sequential Bayesian estimation approach. We represent the shape, with unknown material properties, as a combination of elastic elements whose nodal points correspond to salient points in the image. The global rigidity of the shape is encoded by a stiffness matrix, computed after assembling each of these elements. With this piecewise model, we can linearly relate the 3D displacements with the 3D acting forces that cause the object deformation, assumed to be normally distributed. While standard finite-element-method techniques require imposing boundary conditions to solve the resulting linear system, in this work we eliminate this requirement by modeling the compliance matrix with a generalized pseudoinverse that enforces a pre-fixed rank. Our framework also ensures surface continuity without the need for a post-processing step to stitch all the piecewise reconstructions into a global smooth shape. We present experimental results using both synthetic and real videos for different scenarios ranging from isometric to elastic deformations. We also show the consistency of the estimation with respect to 3D ground truth data, include several experiments assessing robustness against artifacts and finally, provide an experimental validation of our performance in real time at frame rate for small mapsPeer ReviewedPostprint (author's final draft

    Online Monitoring for Neural Network Based Monocular Pedestrian Pose Estimation

    Several autonomy pipelines now have core components that rely on deep learning approaches. While these approaches work well in nominal conditions, they tend to have unexpected and severe failure modes that create concerns when used in safety-critical applications, including self-driving cars. There are several works that aim to characterize the robustness of networks offline, but currently there is a lack of tools to monitor the correctness of network outputs online during operation. We investigate the problem of online output monitoring for neural networks that estimate 3D human shapes and poses from images. Our first contribution is to present and evaluate model-based and learning-based monitors for a human-pose-and-shape reconstruction network, and to assess their ability to predict the output loss for a given test input. As a second contribution, we introduce an Adversarially-Trained Online Monitor (ATOM) that learns how to effectively predict losses from data. ATOM dominates model-based baselines and can detect bad outputs, leading to substantial improvements in human pose output quality. Our final contribution is an extensive experimental evaluation showing that discarding outputs flagged as incorrect by ATOM improves the average error by 12.5% and the worst-case error by 126.5%. Comment: Accepted to ITSC 202
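    The monitoring idea, predict the loss of each output and discard the flagged ones, can be sketched with stand-in numbers. The data and threshold below are illustrative placeholders, not ATOM's learned predictor:

```python
import numpy as np

# Deterministic stand-in data: true per-output losses and an imperfect
# monitor that predicts them with a small bias (all numbers illustrative).
true_loss = np.linspace(0.1, 3.0, 30)
predicted_loss = true_loss + 0.05

def filtered_mean_loss(losses, predictions, threshold):
    """Discard outputs the monitor flags (predicted loss above threshold)
    and report the mean loss of the kept outputs plus the kept fraction."""
    keep = predictions <= threshold
    return losses[keep].mean(), keep.mean()

raw_mean = true_loss.mean()
kept_mean, kept_frac = filtered_mean_loss(true_loss, predicted_loss, 1.5)
```

    As long as the predicted loss correlates with the true loss, filtering trades a fraction of the outputs for a lower average and worst-case error, which is the trade-off the paper quantifies.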

    A Benchmark and Evaluation of Non-Rigid Structure from Motion

    Non-rigid structure from motion (NRSfM) is a long-standing and central problem in computer vision, allowing us to obtain 3D information from multiple images when the scene is dynamic. A main issue for the further development of this important computer vision topic is the lack of high-quality data sets. We address this issue by presenting a data set compiled for this purpose, which is made publicly available and is considerably larger than previous state of the art. To validate the applicability of this data set, and to provide an investigation into the state of the art of NRSfM, including potential directions forward, we present a benchmark and a scrupulous evaluation using this data set. This benchmark evaluates 16 different methods with available code, which we argue reasonably spans the state of the art in NRSfM. We also hope that the presented public data set and evaluation will provide benchmark tools for further development in this field.
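    Benchmarking NRSfM methods needs an error metric that tolerates the ambiguities of monocular reconstruction, in particular the depth-reflection ambiguity. A generic sketch of such a metric (the paper's exact evaluation protocol may differ):

```python
import numpy as np

def nrsfm_error(gt, rec):
    """Mean per-point 3D distance between ground truth and reconstruction,
    normalized by ground-truth scale, taking the better of the
    reconstruction and its depth-flipped mirror (NRSfM results are
    typically only determined up to a reflection in depth).

    gt, rec: (n_frames, n_points, 3) arrays."""
    def mean_dist(a, b):
        return np.linalg.norm(a - b, axis=-1).mean()
    flipped = rec * np.array([1.0, 1.0, -1.0])     # mirror the z axis
    scale = np.linalg.norm(gt, axis=-1).mean()
    return min(mean_dist(gt, rec), mean_dist(gt, flipped)) / scale

rng = np.random.default_rng(4)
gt = rng.standard_normal((10, 20, 3))              # frames x points x 3
perfect = nrsfm_error(gt, gt)
mirrored = nrsfm_error(gt, gt * np.array([1.0, 1.0, -1.0]))
```

    A depth-flipped but otherwise exact reconstruction scores zero under this metric, which is the behavior a fair NRSfM benchmark wants.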

    3D hand pose estimation using convolutional neural networks

    3D hand pose estimation plays a fundamental role in natural human-computer interaction. The problem is challenging due to complicated variations caused by complex articulations, multiple viewpoints, self-similar parts, severe self-occlusions, and different shapes and sizes. To handle these challenges, this thesis makes the following contributions. First, the problems of multiple viewpoints and complex articulations are tackled by decomposing and transforming the input and output spaces with spatial transformations that follow the hand structure. This transformation reduces the variation of both the input and output spaces, which makes learning easier. The second contribution is a probabilistic framework integrating all the hierarchical regressions. Variants with and without sampling, using different regressors and optimization methods, are constructed and compared to provide insight into the components of this framework. The third contribution is based on the observation that, for images with occlusions, there exist multiple plausible configurations for the occluded parts. A hierarchical mixture density network is proposed to handle the multi-modality of the locations of occluded hand joints. It leverages state-of-the-art hand pose estimators based on convolutional neural networks to facilitate feature learning, while modeling the multiple modes in a two-level hierarchy to reconcile the single-valued (for visible joints) and multi-valued (for occluded joints) mappings in its output. In addition, a completely labeled real hand dataset is collected using a tracking system with six 6D magnetic sensors and inverse kinematics to automatically obtain 21-joint hand pose annotations for depth maps. Open Access
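    A mixture density head of the kind the thesis describes outputs several plausible locations per occluded joint rather than a single point. A toy sketch of reading off such a multi-modal output; the component count and numbers are illustrative, not the thesis's hierarchical network:

```python
import numpy as np

def mdn_modes(pi_logits, means):
    """Given mixture weights (as logits) and component means for one
    occluded joint, return the normalized weights and the mean of the
    most probable component - the kind of multi-modal output a mixture
    density head exposes."""
    w = np.exp(pi_logits - pi_logits.max())    # stable softmax
    w /= w.sum()
    return w, means[np.argmax(w)]

# Two plausible 3D locations for an occluded fingertip (illustrative numbers).
logits = np.array([1.2, 0.3])
means = np.array([[0.1, 0.2, 0.5],
                  [0.1, 0.2, 0.9]])
weights, best = mdn_modes(logits, means)
```

    Keeping all components (rather than only `best`) is what lets a downstream stage reason about several hypotheses for an occluded joint.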

    Ultrasound-Augmented Laparoscopy

    Laparoscopic surgery is perhaps the most common minimally invasive procedure for many diseases in the abdomen. Since the laparoscopic camera provides only a surface view of the internal organs, in many procedures surgeons use laparoscopic ultrasound (LUS) to visualize deep-seated surgical targets. Conventionally, the 2D LUS image is shown on a display spatially separate from the one that shows the laparoscopic video. Reasoning about the geometry of hidden targets therefore requires mentally solving the spatial alignment and resolving the modality differences, which is cognitively very challenging. Moreover, the mental representation of hidden targets in space acquired through such cognitive mediation may be error-prone and cause incorrect actions to be performed. To remedy this, advanced visualization strategies are required in which the US information is visualized in the context of the laparoscopic video. To this end, efficient computational methods are required to accurately align the US image coordinate system with the camera-centric coordinate system, and to render the registered image information in the context of the camera such that surgeons perceive the geometry of hidden targets accurately. In this thesis, such a visualization pipeline is described. A novel method to register US images with a camera-centric coordinate system is detailed, with an experimental investigation into its accuracy bounds. An improved method to blend US information with the surface view is also presented, with an experimental investigation into the accuracy of perception of target locations in space.
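    Once the registration the thesis describes is available, rendering a US point in the camera view amounts to a rigid transform followed by a pinhole projection. A minimal sketch with assumed calibration values (the intrinsics `K` and transform `T` below are placeholders, not the thesis's calibration):

```python
import numpy as np

def project_us_point(p_us, T_cam_us, K):
    """Map a 3D point from the ultrasound image frame into the camera
    frame via a rigid transform, then project it with a pinhole model.
    T_cam_us: 4x4 homogeneous US-to-camera transform (from calibration)."""
    p_h = np.append(p_us, 1.0)              # homogeneous coordinates
    p_cam = (T_cam_us @ p_h)[:3]            # point in the camera frame
    uvw = K @ p_cam                         # pinhole projection
    return uvw[:2] / uvw[2]                 # pixel coordinates

# Assumed intrinsics and a US frame sitting 10 cm in front of the camera.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
T = np.eye(4)
T[:3, 3] = [0.0, 0.0, 0.1]
uv = project_us_point(np.array([0.0, 0.0, 0.0]), T, K)
```

    The US origin here lands on the principal point, as expected for a point straight ahead of the camera; the hard part the thesis addresses is estimating `T_cam_us` accurately in the first place.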
    • โ€ฆ
    corecore