756 research outputs found

    GANerated Hands for Real-time 3D Hand Tracking from Monocular RGB

    We address the highly challenging problem of real-time 3D hand tracking based on a monocular RGB-only sequence. Our tracking method combines a convolutional neural network with a kinematic 3D hand model, such that it generalizes well to unseen data, is robust to occlusions and varying camera viewpoints, and leads to anatomically plausible as well as temporally smooth hand motions. For training our CNN we propose a novel approach for the synthetic generation of training data that is based on a geometrically consistent image-to-image translation network. To be more specific, we use a neural network that translates synthetic images to "real" images, such that the generated images follow the same statistical distribution as real-world hand images. For training this translation network we combine an adversarial loss and a cycle-consistency loss with a geometric consistency loss in order to preserve geometric properties (such as hand pose) during translation. We demonstrate that our hand tracking system outperforms the current state-of-the-art on challenging RGB-only footage.
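The combination of losses described in the abstract above can be sketched as a weighted sum. This is only an illustrative sketch, not the authors' implementation: the function names, the weights `w_cyc` and `w_geo`, the non-saturating form of the adversarial term, and the use of a silhouette comparison for the geometric term are all assumptions made here for illustration.

```python
from math import log


def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def translation_loss(d_fake, synth, cycled, sil_synth, sil_translated,
                     w_cyc=10.0, w_geo=1.0):
    """Weighted sum of three loss terms (hypothetical weights).

    d_fake:         discriminator probabilities that translated images are real
    synth, cycled:  a synthetic image and its synthetic->real->synthetic cycle
    sil_synth, sil_translated: hand silhouettes before and after translation
                    (an assumed stand-in for the geometric property to preserve)
    """
    # Adversarial term (assumed non-saturating form): small when the
    # discriminator believes the translated images are real.
    adv = -sum(log(p) for p in d_fake) / len(d_fake)
    # Cycle-consistency term: translating there and back should return
    # the original image.
    cyc = l1(synth, cycled)
    # Geometric consistency term: the silhouette should not change.
    geo = l1(sil_synth, sil_translated)
    return adv + w_cyc * cyc + w_geo * geo
```

With a perfectly fooled discriminator, a perfect cycle, and an unchanged silhouette, every term is zero; any drift in pose or silhouette during translation raises the geometric term.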

    Use of Synthetic Data for 3D Hand Pose Recognition (3D 손 포즈 인식을 위한 인조 데이터의 이용)

    Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems major), 2021.8. 양한열. 3D hand pose estimation (HPE) based on RGB images has been studied for a long time. Relevant methods have focused mainly on optimizing neural frameworks for graphically connected finger joints. Training RGB-based HPE models has not been easy because of the scarcity of RGB hand pose datasets; unlike human body pose datasets, the finger joints that span hand postures are structured delicately and exquisitely. Such structure makes it difficult to annotate each joint accurately with unique 3D world coordinates, which is why many conventional methods rely on synthetic data samples to cover large variations of hand postures. A synthetic dataset provides very precise ground-truth annotations and further allows control over the variety of data samples, so that a learning model can be trained over a large pose space. Most studies, however, have performed frame-by-frame estimation based on independent static images. Synthetic visual data can provide practically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human hand pose estimation is a particularly interesting example of this synthetic-to-real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this dissertation, we attempt not only to consider the appearance of a hand but also to incorporate the temporal movement information of a hand in motion into the learning framework for better 3D hand pose estimation performance, which requires a large-scale dataset of sequential RGB hand images. 
We propose a novel method that generates a synthetic dataset mimicking natural human hand movements by re-engineering the annotations of an extant static hand pose dataset into pose-flows. With the generated dataset, we train a newly proposed recurrent framework, exploiting visuo-temporal features from sequential images of synthetic hands in motion and emphasizing temporal smoothness of estimations with a temporal consistency constraint. Our novel training strategy of detaching the recurrent layer of the framework during domain finetuning from synthetic to real preserves the visuo-temporal features learned from sequential synthetic hand images. Hand poses that are estimated sequentially consequently produce natural and smooth hand movements, which lead to more robust estimations. We show that utilizing temporal information for 3D hand pose estimation significantly enhances general pose estimations by outperforming state-of-the-art methods in experiments on hand pose estimation benchmarks. Since a fixed dataset provides a finite distribution of data samples, the generalization of a learned pose estimation network is limited in terms of pose, RGB and viewpoint spaces. We further propose to augment the data automatically such that the augmented pose sampling is performed in favor of the pose estimator's generalization performance. Such auto-augmentation of poses is performed within a learned feature space in order to avoid the computational burden of generating synthetic samples for every update iteration. The proposed effort can be considered as generating and utilizing synthetic samples for network training in the feature space. 
This improves training efficiency by requiring fewer real data samples, and it enhances generalization across multiple dataset domains and estimation performance through efficient augmentation.
Research on recognizing and reconstructing the shape and pose of a human hand from 2D images aims to detect the 3D position of each finger joint. A hand pose is made up of finger joints and refers to the physical elements that form the human hand, from the wrist joint through the MCP, PIP, and DIP joints. Hand pose information can be used in many fields, and in hand gesture detection research it serves as an excellent input feature. Applying hand pose detection to real systems requires high accuracy, real-time operation, and a model light enough to run on a variety of devices, and training a neural network model that makes this possible requires a large amount of data. However, the devices that measure human hand pose are rather unstable, and images of hands wearing these devices differ greatly from the skin color of a bare hand, which makes them unsuitable for training. For this reason, this dissertation reprocesses and augments synthetically generated data for use in training, aiming to achieve better learning results. Although synthetic hand images may resemble real skin color, their fine texture differs considerably, so a model trained on synthetic data performs markedly worse on real hand data. To narrow the gap between the two data domains, we first reprocessed hand motion so that the model learns the structure of the human hand, and then trained on real hand image data everything except the temporal information that had learned this motion structure, which proved highly effective. In doing so, we presented a methodology that mimics real human hand motion. Second, we aligned data from the two different domains in the network's feature space. Furthermore, rather than augmenting synthetic poses with specific data, we proposed a structure that sets up a probabilistic model and samples from it so that poses the network has rarely seen are generated. In this dissertation, we proposed methods that not only produce synthetic data more effectively, without the effort of collecting more real data that is hard to annotate, but also improve pose estimation performance more stably by exploiting local and temporal features. We also proposed an automatic data augmentation methodology in which the network itself finds and learns from the data it needs. Combining the proposed methods can further improve hand pose estimation performance.
1. Introduction; 2. Related Works; 3. Preliminaries: 3D Hand Mesh Model; 4. SeqHAND: RGB-sequence-based 3D Hand Pose and Shape Estimation; 5. Hand Pose Auto-Augment; 6. Conclusion; Abstract (Korean); Acknowledgments
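The temporal consistency constraint mentioned in the abstract above can be illustrated with a simple smoothness penalty on consecutive pose estimates. The abstract does not give the actual formulation, so the mean-squared-difference form and the function name `temporal_consistency_loss` below are assumptions, offered only as a minimal sketch of the idea.

```python
def temporal_consistency_loss(poses):
    """Mean squared change between consecutive pose vectors.

    poses: list of frames, each a flat list of joint coordinates
           (hypothetical representation; all frames have equal length).
    Returns 0.0 for sequences with fewer than two frames.
    """
    if len(poses) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(poses, poses[1:]):
        # Average squared per-coordinate change between neighbouring frames.
        total += sum((c - p) ** 2 for p, c in zip(prev, curr)) / len(curr)
    # Average over the number of frame-to-frame transitions.
    return total / (len(poses) - 1)
```

A sequence whose estimates do not change between frames incurs no penalty, while jittery estimates are penalized in proportion to the squared size of the jumps. In a training setup, such a term would typically be added to the per-frame pose loss with a weighting factor.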

    Detection of hand gestures with human computer recognition by using support vector machine

    Many applications, such as interactive data analysis and sign detection, can benefit from hand gesture recognition. We offer a low-cost approach based on human-computer interaction (HCI) for predicting hand movements in real time. Our technique uses a color glove to train a random forest classifier and then predicts a bare hand at the pixel level. Our algorithm predicts all pixels at a rate of around 3 frames per second and is unaffected by differences in the surroundings. It has also been shown that HCI-based data augmentation is more effective than other approaches for enhancing interactive data. In addition, the augmentation experiment was carried out on multiple subsets of the original hand skeleton sequence dataset, each with a different number of classes, as well as on the entire dataset. On practically all subsets, the proposed base architecture improved classification accuracy. When the entire dataset was used, there was still a modest improvement. Correct identification could be regarded as a quality indicator. The best accuracy score was 94.02 percent, for the HCI model with a support vector machine (SVM) classifier.