
    V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map

    Most of the existing deep learning-based methods for 3D hand and human pose estimation from a single depth map share a common framework: they take a 2D depth map and directly regress the 3D coordinates of keypoints, such as hand or human body joints, via 2D convolutional neural networks (CNNs). The first weakness of this approach is the perspective distortion in the 2D depth map. Although a depth map is intrinsically 3D data, many previous methods treat it as a 2D image, where the projection from 3D to 2D space can distort the shape of the actual object. This compels the network to perform perspective distortion-invariant estimation. The second weakness of the conventional approach is that directly regressing 3D coordinates from a 2D image is a highly non-linear mapping, which causes difficulty in the learning procedure. To overcome these weaknesses, we cast the 3D hand and human pose estimation problem from a single depth map into a voxel-to-voxel prediction for the first time: the input is a 3D voxelized grid, and the output is the per-voxel likelihood of each keypoint. We design our model as a 3D CNN that provides accurate estimates while running in real time. Our system outperforms previous methods on almost all publicly available 3D hand and human pose estimation datasets and placed first in the HANDS 2017 frame-based 3D hand pose estimation challenge. The code is available at https://github.com/mks0601/V2V-PoseNet_RELEASE.
    Comment: HANDS 2017 Challenge Frame-based 3D Hand Pose Estimation Winner (ICCV 2017); published at CVPR 2018.
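    The voxel-to-voxel formulation is easy to make concrete. The code below is a minimal, hypothetical PyTorch sketch of the idea (an occupancy grid in, per-voxel keypoint likelihoods out, an argmax to read joints off), not the official V2V-PoseNet architecture, which is a much deeper encoder-decoder; see the linked repository for the released model.

```python
import torch
import torch.nn as nn

def voxelize(points, grid=32):
    """points: (N, 3) tensor of 3D points, already normalized to [0, 1)."""
    vox = torch.zeros(1, 1, grid, grid, grid)
    idx = (points * grid).long().clamp(0, grid - 1)
    vox[0, 0, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0  # binary occupancy
    return vox

class TinyV2V(nn.Module):
    """Toy 3D CNN: occupancy grid -> per-voxel likelihood per keypoint."""
    def __init__(self, num_joints=21):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, 7, padding=3), nn.ReLU(),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv3d(32, num_joints, 1),   # per-voxel likelihood per joint
        )

    def forward(self, vox):                 # (B, 1, D, H, W)
        return self.net(vox)                # (B, J, D, H, W)

grid = 32
heat = TinyV2V()(voxelize(torch.rand(1000, 3), grid))
# Read off joint 0 as the coordinates of its most likely voxel.
flat = heat[0, 0].flatten().argmax().item()
z, y, x = flat // (grid * grid), (flat // grid) % grid, flat % grid
```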

    단일 μ΄λ―Έμ§€λ‘œλΆ€ν„° μ—¬λŸ¬ μ‚¬λžŒμ˜ ν‘œν˜„μ  μ „μ‹  3D μžμ„Έ 및 ν˜•νƒœ μΆ”μ •

    Thesis (Ph.D.) -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: Kyoung Mu Lee.

    Humans are the most central and interesting subjects in our lives: many human-centric techniques and studies, such as motion capture and human-computer interaction, have been proposed in both industry and academia. Recovering accurate 3D human geometry (i.e., 3D human pose and shape) is a key component of these human-centric techniques and studies. With the rapid spread of cameras, a single RGB image has become a popular input, and many single-RGB-based 3D human pose and shape estimation methods have been proposed. The 3D pose and shape of the whole body, which includes the hands and face, provides expressive and rich information, including human intention and feeling. Unfortunately, recovering whole-body 3D pose and shape is highly challenging; thus, it has been attempted by only a few works, called expressive methods. Instead of directly solving expressive 3D pose and shape estimation, the literature has developed around recovering the 3D pose and shape of each part (i.e., body, hands, and face) separately, via so-called part-specific methods. There are several further simplifications. For example, many works estimate only the 3D pose without the shape, because the additional 3D shape estimation makes the problem much harder. In addition, most works assume a single-person case and do not consider the multi-person case. The current literature can therefore be categorized in several ways: 1) part-specific methods vs. expressive methods, 2) 3D human pose estimation methods vs. 3D human pose and shape estimation methods, and 3) single-person methods vs. multi-person methods. The difficulty increases as the outputs become richer: from part-specific to expressive, from 3D pose estimation to 3D pose and shape estimation, and from the single-person to the multi-person case.

    This dissertation introduces three approaches toward expressive 3D multi-person pose and shape estimation from a single image, so that the output finally provides the richest information. The first approach performs 3D multi-person body pose estimation, the second 3D multi-person body pose and shape estimation, and the final one expressive 3D multi-person pose and shape estimation. Each approach tackles critical limitations of previous state-of-the-art methods, bringing the literature closer to the real-world environment.

    First, a 3D multi-person body pose estimation framework is introduced. In contrast to the single-person case, the multi-person case additionally requires the camera-relative 3D position of each person. Estimating the camera-relative 3D position from a single image involves high depth ambiguity. The proposed framework combines a deep image feature with the camera pinhole model to recover the camera-relative 3D position. The framework can be combined with any 3D single-person pose and shape estimation method; therefore, the following two approaches focus on the single-person case and can be easily extended to the multi-person case using this framework.
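    The geometric half of this first framework can be sketched in a few lines. The following is a hypothetical back-of-the-envelope version of the pinhole-model distance estimate: assuming a person has a roughly constant real-world extent, the bounding-box area in pixels pins down the absolute distance, and the deep image feature then learns a correction factor for this estimate. The function name and the 2000 mm extent here are illustrative assumptions, not the dissertation's exact constants.

```python
import math

def pinhole_distance(fx, fy, bbox_w, bbox_h,
                     real_w_mm=2000.0, real_h_mm=2000.0):
    """Rough camera-to-person distance from a person bounding box.

    Pinhole model: an object of real size S at distance d spans
    f * S / d pixels, so, using areas,
    d = sqrt(fx * fy * A_real / A_bbox).
    The constant real-world human extent is the simplifying assumption;
    a learned network then predicts a per-image correction factor.
    """
    return math.sqrt(fx * fy * real_w_mm * real_h_mm / (bbox_w * bbox_h))

# Example: 1500 px focal length, a person box of 200 x 450 px -> ~10 m.
d_mm = pinhole_distance(1500, 1500, 200, 450)
```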
    Second, a 3D multi-person body pose and shape estimation method is introduced. It extends the first approach to additionally predict an accurate 3D shape, and its accuracy significantly outperforms previous state-of-the-art methods thanks to a new target representation, the lixel-based 1D heatmap.
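    The lixel-based 1D heatmap can be illustrated compactly: instead of a memory-heavy 3D volume, each coordinate of each mesh vertex is represented as a 1D heatmap over discretized line pixels ("lixels") and decoded with a differentiable expectation. A minimal sketch, assuming a Gaussian target and a plain sum-normalization (the actual method normalizes network outputs with a softmax):

```python
import torch

def to_lixel_heatmap(coord, bins=64, sigma=1.5):
    """Encode a continuous coordinate in [0, bins) as a Gaussian 1D heatmap."""
    centers = torch.arange(bins, dtype=torch.float32)
    return torch.exp(-((centers - coord) ** 2) / (2 * sigma ** 2))

def soft_argmax_1d(heatmap):
    """Differentiable decoding: expected lixel index under the normalized heatmap."""
    probs = heatmap / heatmap.sum(dim=-1, keepdim=True)
    centers = torch.arange(heatmap.shape[-1], dtype=torch.float32)
    return (probs * centers).sum(dim=-1)

hm = to_lixel_heatmap(torch.tensor(23.7))
print(soft_argmax_1d(hm))  # ~23.7, recovered differentiably
```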
λ‹€μŒμ— μ†Œκ°œλ  두 μ ‘κ·Όλ²•μ—μ„œ μ œμ•ˆλœ 단일 μ‚¬λžŒμ„ μœ„ν•œ 방법듀은 첫 번째 μ ‘κ·Όλ²•μ—μ„œ μ†Œκ°œλ˜λŠ” μ—¬λŸ¬ μ‚¬λžŒμ„ μœ„ν•œ ν”„λ ˆμž„μ›Œν¬λ₯Ό μ‚¬μš©ν•˜μ—¬ μ‰½κ²Œ μ—¬λŸ¬ μ‚¬λžŒμ˜ 경우둜 ν™•μž₯ν•  수 μžˆλ‹€. 두 번째 접근법은 μ—¬λŸ¬ μ‚¬λžŒμ„ μœ„ν•œ 3D μžμ„Έ 및 ν˜•νƒœ μΆ”μ • 방법이닀. 이 방법은 첫 번째 접근법을 ν™•μž₯ν•˜μ—¬ 정확도λ₯Ό μœ μ§€ν•˜λ©΄μ„œ μΆ”κ°€λ‘œ 3D ν˜•νƒœλ₯Ό μΆ”μ •ν•˜κ²Œ ν•œλ‹€. 높은 정확도λ₯Ό μœ„ν•΄ λ¦­μ…€ 기반의 1D νžˆνŠΈλ§΅μ„ μ œμ•ˆν•˜κ³ , 이둜 인해 기쑴에 λ°œν‘œλœ 방법듀보닀 큰 폭으둜 높은 μ„±λŠ₯을 μ–»λŠ”λ‹€. λ§ˆμ§€λ§‰ 접근법은 μ—¬λŸ¬ μ‚¬λžŒμ„ μœ„ν•œ ν‘œν˜„μ μΈ 3D μžμ„Έ 및 ν˜•νƒœ μΆ”μ • 방법이닀. 이것은 λͺΈ, 손, 그리고 μ–Όκ΅΄λ§ˆλ‹€ 3D μžμ„Έ 및 ν˜•νƒœλ₯Ό ν•˜λ‚˜λ‘œ ν†΅ν•©ν•˜μ—¬ ν‘œν˜„μ μΈ 3D μžμ„Έ 및 ν˜•νƒœλ₯Ό μ–»λŠ”λ‹€. κ²Œλ‹€κ°€, 이것은 3D μœ„μΉ˜ 포즈 기반의 3D νšŒμ „ 포즈 좔정기법을 μ œμ•ˆν•¨μœΌλ‘œμ¨ 기쑴에 λ°œν‘œλœ 방법듀보닀 훨씬 높은 μ„±λŠ₯을 μ–»λŠ”λ‹€. μ œμ•ˆλœ 접근법듀은 기쑴에 λ°œν‘œλ˜μ—ˆλ˜ 방법듀이 κ°–λŠ” ν•œκ³„μ λ“€μ„ μ„±κ³΅μ μœΌλ‘œ κ·Ήλ³΅ν•œλ‹€. κ΄‘λ²”μœ„ν•œ μ‹€ν—˜μ  κ²°κ³Όκ°€ 정성적, μ •λŸ‰μ μœΌλ‘œ μ œμ•ˆν•˜λŠ” λ°©λ²•λ“€μ˜ νš¨μš©μ„±μ„ 보여쀀닀.1 Introduction 1 1.1 Background and Research Issues 1 1.2 Outline of the Dissertation 3 2 3D Multi-Person Pose Estimation 7 2.1 Introduction 7 2.2 Related works 10 2.3 Overview of the proposed model 13 2.4 DetectNet 13 2.5 PoseNet 14 2.5.1 Model design 14 2.5.2 Loss function 14 2.6 RootNet 15 2.6.1 Model design 15 2.6.2 Camera normalization 19 2.6.3 Network architecture 19 2.6.4 Loss function 20 2.7 Implementation details 20 2.8 Experiment 21 2.8.1 Dataset and evaluation metric 21 2.8.2 Experimental protocol 22 2.8.3 Ablation study 23 2.8.4 Comparison with state-of-the-art methods 25 2.8.5 Running time of the proposed framework 31 2.8.6 Qualitative results 31 2.9 Conclusion 34 3 3D Multi-Person Pose and Shape Estimation 35 3.1 Introduction 35 3.2 Related works 38 3.3 I2L-MeshNet 41 3.3.1 PoseNet 41 3.3.2 MeshNet 43 3.3.3 Final 3D human pose and mesh 45 3.3.4 Loss functions 45 3.4 Implementation details 47 3.5 Experiment 48 3.5.1 Datasets and evaluation metrics 48 3.5.2 Ablation study 50 3.5.3 Comparison with state-of-the-art methods 57 3.6 Conclusion 60 4 Expressive 3D Multi-Person Pose and Shape Estimation 63 4.1 Introduction 63 4.2 Related works 66 4.3 Pose2Pose 69 4.3.1 PositionNet 69 4.3.2 RotationNet 70 4.4 Expressive 3D human pose and mesh estimation 72 4.4.1 Body part 72 4.4.2 Hand part 73 4.4.3 Face part 73 4.4.4 Training the networks 74 4.4.5 Integration of all parts in the testing stage 74 4.5 Implementation details 77 4.6 Experiment 78 4.6.1 Training sets and evaluation metrics 78 4.6.2 Ablation study 78 4.6.3 Comparison with state-of-the-art methods 82 4.6.4 Running time 87 4.7 Conclusion 87 5 Conclusion and Future Work 89 5.1 Summary and Contributions of the Dissertation 89 5.2 Future Directions 90 5.2.1 Global Context-Aware 3D Multi-Person Pose Estimation 91 5.2.2 Unied Framework for Expressive 3D Human Pose and Shape Estimation 91 5.2.3 Enhancing Appearance Diversity of Images Captured from Multi-View Studio 92 5.2.4 Extension to the video for temporally consistent estimation 94 5.2.5 3D clothed human shape estimation in the wild 94 5.2.6 Robust human action recognition from a video 96 Bibliography 98 ꡭ문초둝 111Docto