26,826 research outputs found
Monocular Expressive Body Regression through Body-Driven Attention
To understand how people look, interact, or perform tasks, we need to quickly
and accurately capture their 3D body, face, and hands together from an RGB
image. Most existing methods focus only on parts of the body. A few recent
approaches reconstruct full expressive 3D humans from images using 3D body
models that include the face and hands. These methods are optimization-based
and thus slow, prone to local optima, and require 2D keypoints as input. We
address these limitations by introducing ExPose (EXpressive POse and Shape
rEgression), which directly regresses the body, face, and hands, in SMPL-X
format, from an RGB image. This is a hard problem due to the high
dimensionality of the body and the lack of expressive training data.
Additionally, hands and faces are much smaller than the body, occupying very
few image pixels. This makes hand and face estimation hard when body images are
downscaled for neural networks. We make three main contributions. First, we
account for the lack of training data by curating a dataset of SMPL-X fits on
in-the-wild images. Second, we observe that body estimation localizes the face
and hands reasonably well. We introduce body-driven attention for face and hand
regions in the original image to extract higher-resolution crops that are fed
to dedicated refinement modules. Third, these modules exploit part-specific
knowledge from existing face- and hand-only datasets. ExPose estimates
expressive 3D humans more accurately than existing optimization methods at a
small fraction of the computational cost. Our data, model and code are
available for research at https://expose.is.tue.mpg.de.
Comment: Accepted at ECCV'20. Project page: https://expose.is.tue.mpg.de
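The body-driven attention step can be sketched as follows: a body regressor localizes a part (say, a hand) in the full-resolution image, and a higher-resolution crop around that part is extracted and fed to a dedicated refinement module. This is a minimal illustration, not the ExPose implementation; all names, the square-box heuristic, and the nearest-neighbour resampling are illustrative choices.

```python
# Sketch of body-driven attention cropping (illustrative, not ExPose code):
# keypoints estimated by a body regressor locate a part, and we cut a
# higher-resolution crop from the ORIGINAL image instead of reusing the
# downscaled body input.

def part_crop_box(keypoints, scale=2.0, min_size=32):
    """Square crop box (x0, y0, size) around 2D keypoints of one part.

    keypoints: list of (x, y) image coordinates for the part (e.g. one hand).
    scale: enlargement factor so the crop safely contains the part.
    """
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    cx, cy = (min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0
    size = max(max(xs) - min(xs), max(ys) - min(ys), min_size) * scale
    return cx - size / 2.0, cy - size / 2.0, size

def crop(image, box, out_size):
    """Nearest-neighbour crop-and-resize from the full-resolution image."""
    h, w = len(image), len(image[0])
    x0, y0, size = box
    out = []
    for j in range(out_size):
        row = []
        for i in range(out_size):
            x = int(x0 + (i + 0.5) * size / out_size)
            y = int(y0 + (j + 0.5) * size / out_size)
            # Clamp to the image bounds so boxes near the border stay valid.
            x = min(max(x, 0), w - 1)
            y = min(max(y, 0), h - 1)
            row.append(image[y][x])
        out.append(row)
    return out
```

In this sketch the crop resolution (`out_size`) is independent of the body input resolution, which is the point of the attention mechanism: small parts keep enough pixels for the refinement module.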
Use of Synthetic Data for 3D Hand Pose Recognition
Thesis (Ph.D.) -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems major), 2021.8.
3D hand pose estimation (HPE) from RGB images has been studied for a long time. Relevant methods have focused mainly on optimizing neural frameworks for the graphically connected finger joints. RGB-based HPE models have been hard to train because of the scarcity of RGB hand pose datasets: unlike human body pose datasets, the finger joints that span hand postures are delicately and exquisitely structured. This structure makes it difficult to accurately annotate each joint with unique 3D world coordinates, which is why many conventional methods rely on synthetic data samples to cover large variations of hand postures.
Synthetic datasets provide very precise ground-truth annotations and allow control over the variety of data samples, so a model can be trained over a large pose space. Most studies, however, have performed frame-by-frame estimation on independent static images. Synthetic visual data can provide practically infinite diversity and rich labels while avoiding the ethical issues of privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. 3D hand pose estimation is a particularly interesting instance of this synthetic-to-real problem: learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability.
In this dissertation, we not only consider the appearance of the hand but also incorporate the temporal movement information of a hand in motion into the learning framework for better 3D hand pose estimation, which requires a large-scale dataset of sequential RGB hand images.
We propose a novel method that generates a synthetic dataset mimicking natural human hand movements by re-engineering the annotations of an existing static hand pose dataset into pose-flows. With the generated dataset, we train a newly proposed recurrent framework that exploits visuo-temporal features from sequential images of synthetic hands in motion and emphasizes temporal smoothness of the estimates with a temporal consistency constraint. Our training strategy of detaching the recurrent layer of the framework during domain finetuning from synthetic to real preserves the visuo-temporal features learned from sequential synthetic hand images. The sequentially estimated hand poses consequently form natural, smooth hand movements, which leads to more robust estimation. We show that exploiting temporal information significantly improves 3D hand pose estimation, outperforming state-of-the-art methods on hand pose estimation benchmarks.
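Two of the ideas above can be illustrated in a few lines: a temporal consistency term that penalizes frame-to-frame jumps in the estimated pose sequence, and "detaching" (freezing) the recurrent layer during synthetic-to-real finetuning. Both are minimal sketches under assumed interfaces, not the dissertation's actual code; the squared-difference form of the loss and the parameter-dict layout are illustrative.

```python
# Illustrative sketch of a temporal consistency constraint: penalize large
# frame-to-frame changes in the estimated pose sequence, encouraging the
# smooth hand motion described in the abstract.

def temporal_consistency_loss(pose_seq):
    """Mean squared difference between consecutive per-frame pose vectors."""
    if len(pose_seq) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(pose_seq, pose_seq[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur)) / len(cur)
    return total / (len(pose_seq) - 1)

def detach_recurrent_for_finetune(model_params, recurrent_prefix="rnn."):
    """Freeze ('detach') the recurrent layer during synthetic-to-real
    finetuning so its visuo-temporal features are preserved.

    model_params: mapping from parameter name to a dict with a 'trainable'
    flag (a framework-agnostic stand-in for e.g. requires_grad).
    """
    for name, p in model_params.items():
        if name.startswith(recurrent_prefix):
            p["trainable"] = False
    return model_params
```

A constant pose sequence yields zero loss, so the term only penalizes motion jitter, not the pose itself; in practice it would be added to the usual supervised pose loss with a weighting factor.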
Since a fixed dataset provides only a finite distribution of samples, the generalization of a learned pose estimation network is limited in terms of pose, RGB, and viewpoint spaces. We further propose to augment the data automatically, such that augmented poses are sampled in favor of the pose estimator's generalization performance. This pose auto-augmentation is performed within a learned feature space to avoid the computational burden of generating a synthetic sample at every update iteration. The proposed effort can be considered as generating and utilizing synthetic samples for network training in the feature space. This improves training efficiency by requiring fewer real data samples, generalization across multiple dataset domains, and estimation performance through efficient augmentation.
Abstract (Korean, translated): Research on recognizing and reconstructing the human hand shape and pose from 2D images aims to detect the 3D position of each finger joint. A hand pose consists of the finger joints: from the wrist joint through the MCP, PIP, and DIP joints, the overall elements that make up the human hand. Hand pose information can be exploited in many fields, and in hand gesture recognition it serves as an excellent input feature.
To apply hand pose estimation to real systems, a model must be highly accurate, run in real time, and be light enough for a variety of devices, and training such a neural network requires a large amount of data. However, devices that measure hand pose are fairly inaccurate, and images of hands wearing such devices differ greatly from bare skin, making them unsuitable for training. This dissertation therefore reprocesses and augments synthetically generated data for training, aiming at better learning results.
Synthetically generated hand images differ considerably from real hand skin in fine texture, so a model trained on synthetic data performs markedly worse on real data. To reduce this domain gap, we first make the network learn the structure of the human hand: hand shapes are re-engineered so that the movement structure is learned apart from strong visual cues, and only the remainder is finetuned on real hand images, which proved highly effective. This constitutes a method of mimicking real human hand movements.
Second, we align data from the two different domains in the network feature space. Moreover, instead of augmenting synthetic poses with specific data, we formulate a probabilistic model that generates poses the network has rarely seen and propose a structure that samples from it.
In summary, this dissertation proposes methods that use synthetic data more effectively, generating it without the labor of collecting more hard-to-annotate real data, and that improve pose estimation by exploiting spatial and temporal features. We also propose an automatic data augmentation method by which the network finds and learns the data it needs on its own. Combining the proposed methods further improves hand pose estimation performance.
1. Introduction 1
2. Related Works 14
3. Preliminaries: 3D Hand Mesh Model 27
4. SeqHAND: RGB-sequence-based 3D Hand Pose and Shape Estimation 31
5. Hand Pose Auto-Augment 66
6. Conclusion 85
Abstract (Korean) 101
Acknowledgements 103
Expressive Whole-Body 3D Pose and Shape Estimation of Multiple Persons from a Single Image
Thesis (Ph.D.) -- Seoul National University, College of Engineering, Department of Electrical and Computer Engineering, 2021.2. Advisor: Kyoung Mu Lee.
Humans are the most central and interesting objects in our lives: many human-centric techniques and studies, such as motion capture and human-computer interaction, have been proposed by both industry and academia. Recovering the accurate 3D geometry of humans (i.e., 3D human pose and shape) is a key component of these human-centric techniques and studies. With the rapid spread of cameras, a single RGB image has become a popular input, and many single-RGB-based 3D human pose and shape estimation methods have been proposed.
The 3D pose and shape of the whole body, which includes the hands and face, provide expressive and rich information, including human intention and feeling. Unfortunately, recovering whole-body 3D pose and shape is greatly challenging; thus, it has been attempted by only a few works, called expressive methods. Instead of directly solving expressive 3D pose and shape estimation, the literature has developed around recovering the 3D pose and shape of each part (i.e., body, hands, and face) separately, in what are called part-specific methods. There are several further simplifications. For example, many works estimate only 3D pose without shape, because additional 3D shape estimation makes the problem much harder. In addition, most works assume the single-person case and do not consider the multi-person case. The current literature can therefore be categorized in several ways: 1) part-specific methods versus expressive methods, 2) 3D human pose estimation methods versus 3D human pose and shape estimation methods, and 3) methods for a single person versus methods for multiple persons. The difficulty increases while the outputs become richer when moving from part-specific to expressive, from 3D pose estimation to 3D pose and shape estimation, and from the single-person case to the multi-person case.
This dissertation introduces three approaches toward expressive 3D multi-person pose and shape estimation from a single image, so that the output can finally provide the richest information. The first approach is 3D multi-person body pose estimation, the second is 3D multi-person body pose and shape estimation, and the final one is expressive 3D multi-person pose and shape estimation. Each approach tackles critical limitations of previous state-of-the-art methods, bringing the literature closer to the real-world environment.
First, a 3D multi-person body pose estimation framework is introduced. In contrast to the single-person case, the multi-person case additionally requires the camera-relative 3D position of each person. Estimating the camera-relative 3D position from a single image involves high depth ambiguity. The proposed framework utilizes a deep image feature together with the camera pinhole model to recover the camera-relative 3D position. The framework can be combined with any 3D single-person pose and shape estimation method; therefore, the following two approaches focus on the single-person case and can easily be extended to the multi-person case using the framework of the first approach. Second, a 3D multi-person body pose and shape estimation method is introduced. It extends the first approach to additionally predict accurate 3D shape, and its accuracy significantly outperforms previous state-of-the-art methods thanks to a new target representation, the lixel-based 1D heatmap. Finally, an expressive 3D multi-person pose and shape estimation method is introduced. It integrates the part-specific 3D pose and shape of the above approaches and can thus provide expressive 3D human pose and shape. In addition, it boosts the accuracy of the estimated 3D pose and shape with a 3D positional pose-guided 3D rotational pose prediction system.
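The pinhole-model step of the first approach can be sketched as follows. This is a minimal illustration under the simplifying assumption that a person occupies a roughly constant real-world area, so the person's bounding-box area in pixels shrinks quadratically with distance; the dissertation additionally corrects this geometric estimate with a learned image feature, which is omitted here. The function name and parameter values are illustrative.

```python
import math

# Geometric core of camera-relative depth recovery under a pinhole model.
# If a person covers a roughly constant real-world area A_real (m^2) and
# projects to a bounding box of area A_bbox (px^2) with focal lengths
# f_x, f_y (px), then:
#
#   depth = sqrt(f_x * f_y * A_real / A_bbox)   [metres]
#
# because projected lengths scale as f / depth, so areas scale as 1/depth^2.

def root_depth(f_x, f_y, area_real_m2, area_bbox_px):
    """Camera-relative depth (metres) of a person from its bbox area."""
    return math.sqrt(f_x * f_y * area_real_m2 / area_bbox_px)
```

Halving the bounding-box side (quartering its area) doubles the recovered depth, which is the expected pinhole behaviour; the learned correction in the full framework compensates for people whose real-world extent deviates from the constant-area assumption (children, crouching poses, etc.).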
The proposed approaches successfully overcome the limitations of previous state-of-the-art methods. Extensive experimental results demonstrate the superiority of the proposed approaches both qualitatively and quantitatively.
Abstract (Korean, translated): Humans are the most central and interesting objects in our daily lives. Accordingly, many human-centric techniques and studies, such as motion capture and human-computer interaction, have been proposed in industry and academia. Recovering the accurate 3D geometry of a human (i.e., 3D human pose and shape) is one of the most important parts of human-centric techniques and studies. With the rapid popularization of cameras, the single image has become an input widely used by many algorithms, and consequently many single-image-based 3D human pose and shape estimation algorithms have been proposed.
The whole-body 3D pose and shape, including the hands and face, provide expressive and rich information, including human intention and feeling. However, because recovering whole-body 3D pose and shape is very difficult, only a very small number of methods have attempted it; these are called expressive methods. Instead of recovering expressive 3D pose and shape at once, methods that recover the 3D pose and shape of the body, hands, and face separately have been proposed; these are called part-specific methods. Several further simplifications of the problem also exist. For example, many methods estimate only 3D pose, excluding 3D shape, because additional 3D shape estimation makes the problem harder. Also, most methods consider only the single-person case and not the multi-person case. Currently proposed methods can therefore be classified by several criteria: 1) part-specific methods vs. expressive methods, 2) 3D pose estimation methods vs. 3D pose and shape estimation methods, and 3) methods for a single person vs. methods for multiple persons. As one moves from part-specific to expressive, from 3D pose estimation to 3D pose and shape estimation, and from a single person to multiple persons, estimation becomes harder, but richer information can be output.
This dissertation introduces three approaches toward expressive 3D pose and shape estimation of multiple persons from a single image; the finally proposed method can thus provide the richest information. The first approach is 3D pose estimation for multiple persons, the second is 3D pose and shape estimation for multiple persons, and the last is a method for expressive 3D pose and shape estimation for multiple persons. Each approach resolves important limitations of existing methods so that the proposed methods can be used in real-world environments.
The first approach is a framework for 3D pose estimation of multiple persons. Unlike the single-person case, the multi-person case additionally requires the camera-relative 3D position of each person. Estimating the camera-relative 3D position from a single image involves very high depth ambiguity. The proposed framework recovers the camera-relative 3D position using a deep image feature and the camera pinhole model. Since this framework can be combined with any single-person 3D pose and shape estimation method, the next two approaches focus only on the single-person case; the single-person methods they propose can easily be extended to the multi-person case using the framework introduced in the first approach. The second approach is a 3D pose and shape estimation method for multiple persons. It extends the first approach to additionally estimate 3D shape while maintaining accuracy. For high accuracy, it proposes the lixel-based 1D heatmap, thereby obtaining performance significantly higher than previously published methods. The last approach is an expressive 3D pose and shape estimation method for multiple persons. It integrates the per-part 3D pose and shape of the body, hands, and face into one, obtaining expressive 3D pose and shape. Furthermore, it obtains far higher performance than previously published methods by proposing a 3D positional-pose-guided 3D rotational pose estimation technique.
The proposed approaches successfully overcome the limitations of previously published methods. Extensive experimental results show the utility of the proposed methods qualitatively and quantitatively.
1 Introduction 1
1.1 Background and Research Issues 1
1.2 Outline of the Dissertation 3
2 3D Multi-Person Pose Estimation 7
2.1 Introduction 7
2.2 Related works 10
2.3 Overview of the proposed model 13
2.4 DetectNet 13
2.5 PoseNet 14
2.5.1 Model design 14
2.5.2 Loss function 14
2.6 RootNet 15
2.6.1 Model design 15
2.6.2 Camera normalization 19
2.6.3 Network architecture 19
2.6.4 Loss function 20
2.7 Implementation details 20
2.8 Experiment 21
2.8.1 Dataset and evaluation metric 21
2.8.2 Experimental protocol 22
2.8.3 Ablation study 23
2.8.4 Comparison with state-of-the-art methods 25
2.8.5 Running time of the proposed framework 31
2.8.6 Qualitative results 31
2.9 Conclusion 34
3 3D Multi-Person Pose and Shape Estimation 35
3.1 Introduction 35
3.2 Related works 38
3.3 I2L-MeshNet 41
3.3.1 PoseNet 41
3.3.2 MeshNet 43
3.3.3 Final 3D human pose and mesh 45
3.3.4 Loss functions 45
3.4 Implementation details 47
3.5 Experiment 48
3.5.1 Datasets and evaluation metrics 48
3.5.2 Ablation study 50
3.5.3 Comparison with state-of-the-art methods 57
3.6 Conclusion 60
4 Expressive 3D Multi-Person Pose and Shape Estimation 63
4.1 Introduction 63
4.2 Related works 66
4.3 Pose2Pose 69
4.3.1 PositionNet 69
4.3.2 RotationNet 70
4.4 Expressive 3D human pose and mesh estimation 72
4.4.1 Body part 72
4.4.2 Hand part 73
4.4.3 Face part 73
4.4.4 Training the networks 74
4.4.5 Integration of all parts in the testing stage 74
4.5 Implementation details 77
4.6 Experiment 78
4.6.1 Training sets and evaluation metrics 78
4.6.2 Ablation study 78
4.6.3 Comparison with state-of-the-art methods 82
4.6.4 Running time 87
4.7 Conclusion 87
5 Conclusion and Future Work 89
5.1 Summary and Contributions of the Dissertation 89
5.2 Future Directions 90
5.2.1 Global Context-Aware 3D Multi-Person Pose Estimation 91
5.2.2 Unified Framework for Expressive 3D Human Pose and Shape Estimation 91
5.2.3 Enhancing Appearance Diversity of Images Captured from Multi-View Studio 92
5.2.4 Extension to the video for temporally consistent estimation 94
5.2.5 3D clothed human shape estimation in the wild 94
5.2.6 Robust human action recognition from a video 96
Bibliography 98
Abstract (Korean) 111