14,901 research outputs found

    Direct Prediction of 3D Body Poses from Motion Compensated Sequences

    We propose an efficient approach to exploiting motion information from consecutive frames of a video sequence to recover the 3D pose of people. Previous approaches typically compute candidate poses in individual frames and then link them in a post-processing step to resolve ambiguities. By contrast, we directly regress from a spatio-temporal volume of bounding boxes to a 3D pose in the central frame. We further show that, for this approach to achieve its full potential, it is essential to compensate for the motion in consecutive frames so that the subject remains centered. This then allows us to effectively overcome ambiguities and improve upon the state-of-the-art by a large margin on the Human3.6m, HumanEva, and KTH Multiview Football 3D human pose estimation benchmarks.
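The idea of keeping the subject centered before stacking frames can be illustrated with a minimal sketch. It assumes the per-frame subject centers are already known (the abstract does not say how they are obtained), represents frames as plain nested lists, and uses the hypothetical names `centered_crop` and `motion_compensated_volume`; it is not the authors' implementation.

```python
from typing import List, Tuple

Frame = List[List[float]]  # grayscale image stored as rows of pixel values


def centered_crop(frame: Frame, center: Tuple[int, int], size: int) -> Frame:
    """Cut a size x size window centered on (row, col); pad with 0.0 outside the frame."""
    row0 = center[0] - size // 2
    col0 = center[1] - size // 2
    height, width = len(frame), len(frame[0])
    return [
        [
            frame[r][c] if 0 <= r < height and 0 <= c < width else 0.0
            for c in range(col0, col0 + size)
        ]
        for r in range(row0, row0 + size)
    ]


def motion_compensated_volume(
    frames: List[Frame], centers: List[Tuple[int, int]], size: int
) -> List[Frame]:
    """Stack per-frame crops so the subject sits at the same place in every slice."""
    return [centered_crop(f, c, size) for f, c in zip(frames, centers)]


def make_frame(bright: Tuple[int, int], side: int = 5) -> Frame:
    """Toy frame: all zeros except one bright pixel standing in for the subject."""
    return [[1.0 if (r, c) == bright else 0.0 for c in range(side)] for r in range(side)]


# The "subject" moves from (1, 1) to (2, 3), yet stays centered in the volume.
centers = [(1, 1), (2, 3)]
volume = motion_compensated_volume([make_frame(c) for c in centers], centers, size=3)
```

In this toy case the bright pixel lands at index `[1][1]` of every crop, which is the property a regressor operating on the stacked volume would rely on.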

    Use of Synthetic Data for 3D Hand Pose Recognition

    Doctoral dissertation, Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Convergence Systems major), August 2021. 양한열. 3D hand pose estimation (HPE) based on RGB images has been studied for a long time. Relevant methods have focused mainly on optimizing neural frameworks for graphically connected finger joints. RGB-based HPE models have not been easy to train because of the scarcity of RGB hand pose datasets; unlike human body pose datasets, the finger joints that span hand postures are structured delicately and exquisitely. Such structure makes it difficult to annotate each joint accurately with unique 3D world coordinates, which is why many conventional methods rely on synthetic data samples to cover large variations of hand postures. A synthetic dataset provides very precise ground-truth annotations and further allows control over the variety of data samples, so that a model can be trained over a large pose space. Most studies, however, have performed frame-by-frame estimation based on independent static images. Synthetic visual data can provide practically infinite diversity and rich labels while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human hand pose estimation is a particularly interesting example of this synthetic-to-real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this dissertation, we consider not only the appearance of a hand but also incorporate the temporal movement information of a hand in motion into the learning framework for better 3D hand pose estimation performance, which creates the need for a large-scale dataset of sequential RGB hand images.
We propose a novel method that generates a synthetic dataset mimicking natural human hand movements by re-engineering the annotations of an extant static hand pose dataset into pose-flows. With the generated dataset, we train a newly proposed recurrent framework that exploits visuo-temporal features from sequential images of synthetic hands in motion and emphasizes temporal smoothness of the estimates with a temporal consistency constraint. Our novel training strategy of detaching the recurrent layer of the framework during domain finetuning from synthetic to real preserves the visuo-temporal features learned from sequential synthetic hand images. Hand poses that are estimated sequentially consequently produce natural and smooth hand movements, which leads to more robust estimates. We show that utilizing temporal information for 3D hand pose estimation significantly enhances general pose estimation, outperforming state-of-the-art methods in experiments on hand pose estimation benchmarks. Since a fixed dataset provides a finite distribution of data samples, the generalization of a learned pose estimation network is limited in terms of pose, RGB and viewpoint spaces. We further propose to augment the data automatically, such that the augmented poses are sampled in favor of the pose estimator's generalization performance. Such auto-augmentation of poses is performed within a learned feature space in order to avoid the computational burden of generating a synthetic sample at every update iteration. The proposed effort can be considered as generating and utilizing synthetic samples for network training in the feature space.
This improves training efficiency by requiring fewer real data samples, and the efficient augmentation enhances both generalization across multiple dataset domains and estimation performance.
Korean abstract (translated): Research on recognizing and reconstructing human hand shape and pose from 2D images aims to detect the 3D position of each finger joint. A hand pose consists of finger joints and refers to the anatomical elements that make up a human hand, from the wrist joint to the MCP, PIP, and DIP joints. Hand pose information can be used in many fields, and in hand gesture detection research it serves as an excellent input feature. Applying human hand pose detection to real systems requires high accuracy, real-time performance, and a model light enough to run on a variety of devices, and training a neural network model that meets these requirements takes a large amount of data. However, the devices that measure human hand pose are fairly unstable, and images of hands wearing these devices differ greatly from human skin color, which makes them unsuitable for training. To address these problems, this dissertation reprocesses and augments artificially generated data for use in training, with the aim of achieving better learning outcomes. Although artificially generated hand images may resemble real skin color, their fine texture differs considerably, so a model trained on synthetic data performs markedly worse on real hand data.
To reduce the domain gap between the two kinds of data, we first have the network learn the structure of the human hand: hand motions are reprocessed so that the network learns their movement structure, and then everything except the temporal information learned in this way is trained on real hand image data, which proved highly effective. For this step we present a methodology that mimics real human hand motion. Second, data from the two different domains are aligned in the network's feature space. In addition, rather than augmenting synthetic poses with specific data, we propose a structure that treats pose generation as a probabilistic model and samples from it, so that poses the network has rarely seen are generated. This dissertation proposes methods that use synthetic data more effectively, without the effort of collecting more real data that is hard to annotate; they not only generate synthetic data more effectively but also improve pose estimation performance more safely by exploiting local and temporal features. We also propose an automatic data augmentation methodology that lets the network find and learn from the data it needs on its own. Combining the proposed methods further improves hand pose estimation performance.
Contents: 1. Introduction; 2. Related Works; 3. Preliminaries: 3D Hand Mesh Model; 4. SeqHAND: RGB-sequence-based 3D Hand Pose and Shape Estimation; 5. Hand Pose Auto-Augment; 6. Conclusion; Abstract (Korean); Acknowledgements.

    Exploiting temporal information for 3D pose estimation

    In this work, we address the problem of 3D human pose estimation from a sequence of 2D human poses. Although the recent success of deep networks has led many state-of-the-art methods for 3D pose estimation to train deep networks end-to-end to predict from images directly, the top-performing approaches have shown the effectiveness of dividing the task of 3D pose estimation into two steps: using a state-of-the-art 2D pose estimator to estimate the 2D pose from images and then mapping it into 3D space. They also showed that a low-dimensional representation such as the 2D locations of a set of joints can be discriminative enough to estimate 3D pose with high accuracy. However, estimating 3D pose for individual frames leads to temporally incoherent estimates, because independent errors in each frame cause jitter. Therefore, in this work we utilize the temporal information across a sequence of 2D joint locations to estimate a sequence of 3D poses. We designed a sequence-to-sequence network composed of layer-normalized LSTM units with shortcut connections connecting the input to the output on the decoder side, and imposed a temporal smoothness constraint during training. We found that the knowledge of temporal consistency improves the best reported result on the Human3.6M dataset by approximately 12.2% and helps our network to recover temporally consistent 3D poses over a sequence of images even when the 2D pose detector fails.
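One common way to express a temporal smoothness constraint is to penalize frame-to-frame displacement of the predicted joints. The sketch below assumes a first-order, mean-squared form; the abstract does not give the exact formulation, so the function name `temporal_smoothness` and the choice of penalty are illustrative assumptions rather than the authors' loss.

```python
from typing import List, Sequence

Pose = List[Sequence[float]]  # one frame: a list of (x, y, z) joint positions


def temporal_smoothness(poses: List[Pose]) -> float:
    """Mean squared frame-to-frame displacement over all joints and coordinates."""
    if len(poses) < 2:
        return 0.0  # a single frame has no temporal neighbors to compare with
    total, count = 0.0, 0
    for prev, curr in zip(poses, poses[1:]):
        for joint_prev, joint_curr in zip(prev, curr):
            for a, b in zip(joint_prev, joint_curr):
                total += (b - a) ** 2
                count += 1
    return total / count


# A motionless one-joint sequence versus one that jumps back and forth.
static = [[(0.0, 0.0, 0.0)] for _ in range(3)]
jittery = [[(0.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
static_penalty = temporal_smoothness(static)
jitter_penalty = temporal_smoothness(jittery)
```

Added to a per-frame pose error during training, a term like this discourages the jitter that independent per-frame estimates tend to produce; a first-order penalty also discourages genuine fast motion, which is one reason its weight would need tuning.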

    Unsupervised 3D Pose Estimation with Geometric Self-Supervision

    We present an unsupervised learning approach to recover 3D human pose from 2D skeletal joints extracted from a single image. Our method does not require any multi-view image data, 3D skeletons, or correspondences between 2D-3D points, nor does it use previously learned 3D priors during training. A lifting network accepts 2D landmarks as inputs and generates a corresponding 3D skeleton estimate. During training, the recovered 3D skeleton is reprojected on random camera viewpoints to generate new "synthetic" 2D poses. By lifting the synthetic 2D poses back to 3D and re-projecting them in the original camera view, we can define a self-consistency loss both in 3D and in 2D. The training can thus be self-supervised by exploiting the geometric self-consistency of the lift-reproject-lift process. We show that self-consistency alone is not sufficient to generate realistic skeletons; however, adding a 2D pose discriminator enables the lifter to output valid 3D poses. Additionally, to learn from 2D poses "in the wild", we train an unsupervised 2D domain adapter network to allow for an expansion of 2D data. This improves results and demonstrates the usefulness of 2D pose data for unsupervised 3D lifting. Results on the Human3.6M dataset for 3D human pose estimation demonstrate that our approach improves upon the previous unsupervised methods by 30% and outperforms many weakly supervised approaches that explicitly use 3D data.
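The geometry of the lift-reproject-lift loop can be sketched in a few lines. The version below makes several simplifying assumptions that the abstract does not state: projection is orthographic, the random viewpoint is a rotation about the vertical axis only, and the lifting network is replaced by a toy function. All names (`rotate_y`, `project`, `cycle_loss_2d`, `flat_lifter`) are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Point2 = Tuple[float, float]
Point3 = Tuple[float, float, float]
Lifter = Callable[[List[Point2]], List[Point3]]


def rotate_y(points: List[Point3], angle: float) -> List[Point3]:
    """Rotate 3D points about the vertical (y) axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]


def project(points: List[Point3]) -> List[Point2]:
    """Orthographic projection (an assumption): drop the depth coordinate."""
    return [(x, y) for x, y, _ in points]


def cycle_loss_2d(pose_2d: List[Point2], lifter: Lifter, angle: float) -> float:
    """2D self-consistency: lift, view from a new angle, lift again, return, compare."""
    skeleton = lifter(pose_2d)                            # lift 2D -> 3D
    synthetic_2d = project(rotate_y(skeleton, angle))     # "synthetic" 2D pose
    skeleton_again = lifter(synthetic_2d)                 # lift the synthetic pose
    back_2d = project(rotate_y(skeleton_again, -angle))   # back to the original view
    squared = [(a - b) ** 2 for p, q in zip(pose_2d, back_2d) for a, b in zip(p, q)]
    return sum(squared) / len(squared)


def flat_lifter(pose_2d: List[Point2]) -> List[Point3]:
    """Toy stand-in for a lifting network: places every joint at depth 0."""
    return [(x, y, 0.0) for x, y in pose_2d]


pose = [(1.0, 0.0), (0.0, 1.0)]
loss_same_view = cycle_loss_2d(pose, flat_lifter, 0.0)
loss_side_view = cycle_loss_2d(pose, flat_lifter, math.pi / 2)
```

With the depth-ignoring toy lifter the loop closes only when the viewpoint is unchanged; from a side view the loss becomes positive, which is the signal a trainable lifter would use to learn consistent depth.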