64,572 research outputs found

    Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video

    Despite significant progress in single image-based 3D human mesh recovery, accurately and smoothly recovering 3D human motion from a video remains challenging. Existing video-based methods generally recover human mesh by estimating the complex pose and shape parameters from coupled image features, whose high complexity and low representation ability often result in inconsistent pose motion and limited shape patterns. To alleviate this issue, we introduce 3D pose as the intermediary and propose a Pose and Mesh Co-Evolution network (PMCE) that decouples this task into two parts: 1) video-based 3D human pose estimation and 2) mesh vertices regression from the estimated 3D pose and temporal image feature. Specifically, we propose a two-stream encoder that estimates mid-frame 3D pose and extracts a temporal image feature from the input image sequence. In addition, we design a co-evolution decoder that performs pose and mesh interactions with the image-guided Adaptive Layer Normalization (AdaLN) to make pose and mesh fit the human body shape. Extensive experiments demonstrate that the proposed PMCE outperforms previous state-of-the-art methods in terms of both per-frame accuracy and temporal consistency on three benchmark datasets: 3DPW, Human3.6M, and MPI-INF-3DHP. Our code is available at https://github.com/kasvii/PMCE.
    Comment: Accepted by ICCV 2023. Project page: https://kasvii.github.io/PMC
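
    As a rough illustration of the image-guided Adaptive Layer Normalization (AdaLN) mentioned in this abstract, the following is a minimal PyTorch sketch. The layer names, shapes, and the (1 + gamma) modulation are assumptions for illustration, not the authors' exact implementation:

        import torch
        import torch.nn as nn

        class AdaLN(nn.Module):
            """LayerNorm whose scale/shift are regressed from an image feature,
            so the normalization adapts pose/mesh tokens to the observed body."""
            def __init__(self, dim, img_dim):
                super().__init__()
                self.norm = nn.LayerNorm(dim, elementwise_affine=False)  # no fixed affine
                self.to_scale_shift = nn.Linear(img_dim, 2 * dim)        # gamma, beta

            def forward(self, tokens, img_feat):
                # tokens: (B, N, dim) pose or mesh tokens; img_feat: (B, img_dim)
                gamma, beta = self.to_scale_shift(img_feat).chunk(2, dim=-1)
                return self.norm(tokens) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)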

    Monocular 3D Object Recognition

    Object recognition is one of the fundamental tasks of computer vision. Recent advances in the field enable reliable 2D detections from a single cluttered image. However, many challenges remain. Object detection needs timely responses for real-world applications. Moreover, we are genuinely interested in estimating the 3D pose and shape of an object or human for the sake of robotic manipulation and human-robot interaction. In this thesis, a suite of solutions to these challenges is presented. First, Active Deformable Part Models (ADPM) is proposed for fast part-based object detection. ADPM dramatically accelerates detection by dynamically scheduling the part evaluations and efficiently pruning image locations. Second, we unleash the power of marrying discriminative 2D parts with an explicit 3D geometric representation. Several methods following this scheme are proposed for recovering rich 3D information of both rigid and non-rigid objects from monocular RGB images. (1) The accurate 3D pose of an object instance is recovered from cluttered images using only the CAD model. (2) A globally optimal solution for simultaneous 2D part localization, 3D pose, and shape estimation is obtained by optimizing a unified convex objective function, in which appearance and geometric compatibility are jointly maximized. (3) 3D human pose estimation from an image sequence is realized via an Expectation-Maximization algorithm; the 2D joint location uncertainties are marginalized out during inference, and 3D pose smoothness is enforced across frames. By bridging the gap between 2D and 3D, our methods provide an end-to-end solution to 3D object recognition from images. We demonstrate a range of interesting applications using only a single image or a monocular video, including autonomous robotic grasping from a single image, 3D object image pop-up, and a monocular human MoCap system. We also show empirical state-of-the-art results on a number of benchmarks for 2D detection and 3D pose and shape estimation.
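
    The part-scheduling-and-pruning idea behind ADPM can be illustrated with a simplified score-bound sketch (hypothetical, not the thesis implementation: the real ADPM learns a scheduling policy, while this sketch evaluates parts in a fixed order and prunes locations whose optimistic total score falls below a threshold):

        import numpy as np

        def active_part_detection(part_scores, upper_bounds, threshold):
            """part_scores: list of (H, W) score maps, one per part;
            upper_bounds: optimistic per-part score bounds;
            threshold: detection score needed to survive pruning."""
            H, W = part_scores[0].shape
            total = np.zeros((H, W))
            active = np.ones((H, W), dtype=bool)     # locations still alive
            remaining = float(np.sum(upper_bounds))  # best score still obtainable
            for k, score_map in enumerate(part_scores):
                remaining -= upper_bounds[k]
                total[active] += score_map[active]   # evaluate part k only where active
                active &= (total + remaining) >= threshold  # prune hopeless locations
            return total, active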

    Monocular 3D Body Shape Reconstruction under Clothing

    Estimating the 3D shape of objects from monocular images is a well-established and challenging task in the computer vision field. Further challenges arise when highly deformable objects, such as human faces or bodies, are considered. In this work, we address the problem of estimating the 3D shape of a human body from single images. In particular, we provide a solution to the problem of estimating the shape of the body when the subject is wearing clothes. This is a highly challenging scenario, as loose clothes might hide the underlying body shape to a large extent. To this end, we make use of a parametric 3D body model, the SMPL, whose parameters describe the pose and shape of the body. Our main intuition is that the shape parameters associated with an individual should not change whether or not the subject is wearing clothes. To improve shape estimation under clothing, we train a deep convolutional network to regress the shape parameters from a single image of a person. To increase robustness to clothing, we build our training dataset by associating the shape parameters of a β€œminimally clothed” person with other samples of the same person wearing looser clothes. Experimental validation shows that our approach estimates body shape parameters more accurately than state-of-the-art approaches, even in the case of loose clothes.
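
    A minimal sketch of the training idea described here: a convolutional network regresses the SMPL shape parameters (betas) from an image of a clothed person, supervised with betas obtained from the same person minimally clothed. The backbone, loss, and variable names are assumptions, not the paper's exact architecture:

        import torch
        import torch.nn as nn
        from torchvision import models

        class ShapeRegressor(nn.Module):
            def __init__(self, n_betas=10):
                super().__init__()
                backbone = models.resnet18(weights=None)
                backbone.fc = nn.Linear(backbone.fc.in_features, n_betas)
                self.net = backbone

            def forward(self, img):   # img: (B, 3, H, W)
                return self.net(img)  # predicted SMPL betas: (B, n_betas)

        model = ShapeRegressor()
        criterion = nn.MSELoss()
        # Training pair: an image of the subject in loose clothes and the betas
        # fitted to the same subject when minimally clothed:
        # loss = criterion(model(clothed_img), minimal_betas)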

    Mirror-Aware Neural Humans

    Human motion capture either requires multi-camera systems or is unreliable with single-view input due to depth ambiguities. Meanwhile, mirrors are readily available in urban environments and form an affordable alternative by recording two views with only a single camera. However, the mirror setting poses the additional challenge of handling occlusions between the real and mirrored images. Going beyond existing mirror approaches for 3D human pose estimation, we utilize mirrors for learning a complete body model, including shape and dense appearance. Our main contribution is extending articulated neural radiance fields to include a notion of a mirror, making them sample-efficient over potential occlusion regions. Together, our contributions realize a consumer-level 3D motion capture system that starts from off-the-shelf 2D poses by automatically calibrating the camera, estimating the mirror orientation, and subsequently lifting 2D keypoint detections to a 3D skeleton pose that is used to condition the mirror-aware NeRF. We empirically demonstrate the benefit of learning a body model and accounting for occlusion in challenging mirror scenes.
    Comment: Project website: https://danielajisafe.github.io/mirror-aware-neural-humans
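
    The core mirror geometry is compact: a point is reflected across a mirror plane {x : n·x = d} (unit normal n) by a Householder-style reflection. A small sketch under those assumptions, with illustrative values:

        import numpy as np

        def reflect_point(p, n, d):
            """Reflect 3D point p across the mirror plane {x : n.x = d}, |n| = 1."""
            return p - 2.0 * (np.dot(n, p) - d) * n

        def reflect_direction(v, n):
            """Reflect a direction (e.g., a camera ray) across the same plane."""
            return v - 2.0 * np.dot(n, v) * n

        # A real joint and its mirrored counterpart give two views of one skeleton
        # through a single camera:
        p = np.array([0.3, 1.2, 2.0])
        n = np.array([0.0, 0.0, 1.0])    # mirror plane z = 2.5
        print(reflect_point(p, n, 2.5))  # -> [0.3 1.2 3. ]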

    단일 μ΄λ―Έμ§€λ‘œλΆ€ν„° μ—¬λŸ¬ μ‚¬λžŒμ˜ ν‘œν˜„μ  μ „μ‹  3D μžμ„Έ 및 ν˜•νƒœ μΆ”μ •

    Doctoral dissertation, Department of Electrical and Computer Engineering, College of Engineering, Seoul National University, February 2021 (advisor: Kyoung Mu Lee). Humans are the most central and interesting subjects in our lives: many human-centric techniques and studies, such as motion capture and human-computer interaction, have been proposed in both industry and academia. Recovering the accurate 3D geometry of a human (i.e., 3D human pose and shape) is a key component of these human-centric techniques and studies. With the rapid spread of cameras, a single RGB image has become a popular input, and many single-RGB-based 3D human pose and shape estimation methods have been proposed. The 3D pose and shape of the whole body, which includes the hands and face, provides expressive and rich information, including human intention and feeling. Unfortunately, recovering the whole-body 3D pose and shape is highly challenging; thus, it has been attempted by only a few works, called expressive methods. Instead of directly solving expressive 3D pose and shape estimation, the literature has developed methods that recover the 3D pose and shape of each part (i.e., body, hands, and face) separately, called part-specific methods. There are several further simplifications. For example, many works estimate only the 3D pose without shape, because the additional shape estimation makes the problem much harder. In addition, most works assume a single-person case and do not consider the multi-person case. The current literature can therefore be categorized in several ways: 1) part-specific methods versus expressive methods, 2) 3D human pose estimation methods versus 3D human pose and shape estimation methods, and 3) methods for a single person versus methods for multiple persons. The difficulty increases as the outputs become richer: from part-specific to expressive, from 3D pose estimation to 3D pose and shape estimation, and from the single-person to the multi-person case. This dissertation introduces three approaches toward expressive 3D multi-person pose and shape estimation from a single image, so that the output finally provides the richest information. The first approach addresses 3D multi-person body pose estimation, the second 3D multi-person body pose and shape estimation, and the final one expressive 3D multi-person pose and shape estimation. Each approach tackles critical limitations of previous state-of-the-art methods, bringing the literature closer to real-world environments. First, a 3D multi-person body pose estimation framework is introduced. In contrast to the single-person case, the multi-person case additionally requires the camera-relative 3D positions of the persons, and estimating a camera-relative 3D position from a single image involves high depth ambiguity. The proposed framework utilizes a deep image feature with the camera pinhole model to recover the camera-relative 3D position, and it can be combined with any 3D single-person pose and shape estimation method. The following two approaches therefore focus on the single-person case and can easily be extended to the multi-person case using the framework of the first approach. Second, a 3D multi-person body pose and shape estimation method is introduced. It extends the first approach to additionally predict accurate 3D shape, and it significantly outperforms previous state-of-the-art methods by proposing a new target representation, the lixel-based 1D heatmap. Finally, an expressive 3D multi-person pose and shape estimation method is introduced.
    It integrates the part-specific 3D pose and shape of the above approaches and can thus provide expressive 3D human pose and shape. In addition, it boosts the accuracy of the estimated 3D pose and shape by proposing a 3D positional-pose-guided 3D rotational pose prediction system. The proposed approaches successfully overcome the limitations of previous state-of-the-art methods, and extensive experimental results demonstrate their superiority both qualitatively and quantitatively.

    ν™•λ₯ μ μΈ 3차원 μžμ„Έ 볡원과 행동인식

    Doctoral dissertation, Department of Electrical and Computer Engineering, Seoul National University, February 2016 (advisor: Songhwai Oh). These days, computer vision technology has become popular and plays an important role in intelligent systems such as augmented reality and video and image analysis, to name a few. Although cost-effective depth cameras, like the Microsoft Kinect, have recently been developed, most computer vision algorithms assume that observations are obtained from RGB cameras, which make 2D observations. If, somehow, we can estimate 3D information from 2D observations, it might give better solutions for many computer vision problems. In this dissertation, we focus on estimating 3D information from 2D observations, a problem well known as non-rigid structure from motion (NRSfM). More formally, NRSfM finds the three-dimensional structure of an object by analyzing image streams under the assumption that the object lies in a low-dimensional space. However, a human body observed over long periods of time can exhibit complex shape variations, which makes NRSfM challenging due to the increased degrees of freedom. In order to handle complex shape variations, we propose a Procrustean normal distribution mixture model (PNDMM) by extending the recently proposed Procrustean normal distribution (PND), which captures the distribution of non-rigid variations of an object while excluding the effects of rigid motion. Unlike existing methods, which use a single model to solve an NRSfM problem, the proposed PNDMM decomposes complex shape variations into a collection of simpler ones, so that model learning can be more tractable and accurate. We perform experiments showing that the proposed method outperforms existing methods on highly complex and long human motion sequences. In addition, we extend the PNDMM to the single-view 3D human pose estimation problem. While recovering the 3D structure of a human body from an image is important, it is a highly ambiguous problem due to the deformation of the articulated human body. Moreover, before estimating a 3D human pose from a 2D human pose, it is important to obtain an accurate 2D human pose. In order to address the inaccuracy of 2D pose estimation on a single image and 3D human pose ambiguities, we estimate multiple 2D and 3D human pose candidates and select the best one that can be explained by both a 2D human pose detector and a 3D shape model. We also introduce a model transformation, incorporated into the 3D shape prior model, so that the proposed method can be applied to a novel test image. Experimental results show that the proposed method provides good 3D reconstruction results on novel test images, despite inaccuracies of 2D part detections and 3D shape ambiguities. Finally, we handle the action recognition problem from a video clip. Current studies show that high-level features obtained from estimated 2D human poses enable action recognition performance beyond current state-of-the-art methods using low- and mid-level features based on appearance and motion, despite the inaccuracy of human pose estimation. Based on these findings, we propose an action recognition method using estimated 3D human pose information, since the proposed PNDMM is able to reconstruct 3D shapes from 2D shapes. Experimental results show that 3D pose based descriptors are better than 2D pose based descriptors for action recognition, regardless of the classification method.
    Considering that we use simple 3D pose descriptors based on a 3D shape model learned from 2D shapes, the results reported in this dissertation are promising, and obtaining accurate 3D information from 2D observations remains an important research issue for reliable computer vision systems.
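
    The candidate-selection step described above can be written schematically: generate multiple 2D/3D pose hypotheses and keep the one best explained jointly by the 2D part detector and the learned 3D shape prior (here a PNDMM). The scoring interface and the weighting below are assumptions for illustration:

        def select_best_pose(candidates, detector_score, shape_log_prior, w=0.5):
            """candidates: list of (pose2d, pose3d) hypotheses;
            detector_score(pose2d): 2D part-detector log-likelihood;
            shape_log_prior(pose3d): log-likelihood under the 3D shape model."""
            def score(c):
                pose2d, pose3d = c
                return w * detector_score(pose2d) + (1 - w) * shape_log_prior(pose3d)
            return max(candidates, key=score)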