574 research outputs found

    NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding

    Research on depth-based human activity analysis has achieved outstanding performance and demonstrated the effectiveness of 3D representation for action recognition. The existing depth-based and RGB+D-based action recognition benchmarks have a number of limitations, including the lack of large-scale training samples, realistic number of distinct class categories, diversity in camera views, varied environmental conditions, and variety of human subjects. In this work, we introduce a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and contains more than 114 thousand video samples and 8 million frames. This dataset contains 120 different action classes including daily, mutual, and health-related activities. We evaluate the performance of a series of existing 3D activity analysis methods on this dataset, and show the advantage of applying deep learning methods for 3D-based human action recognition. Furthermore, we investigate a novel one-shot 3D activity recognition problem on our dataset, and a simple yet effective Action-Part Semantic Relevance-aware (APSR) framework is proposed for this task, which yields promising results for recognition of the novel action classes. We believe the introduction of this large-scale dataset will enable the community to apply, adapt, and develop various data-hungry learning techniques for depth-based and RGB+D-based human activity understanding. [The dataset is available at: http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp] Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

    Use of Synthetic Data for 3D Hand Pose Recognition

    Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems major), August 2021. 양한열. 3D hand pose estimation (HPE) based on RGB images has been studied for a long time. Relevant methods have focused mainly on optimizing neural frameworks for graphically connected finger joints. Training RGB-based HPE models has not been easy because of the scarcity of RGB hand pose datasets; unlike human body pose datasets, the finger joints that span hand postures are structured delicately and exquisitely. Such structure makes it difficult to accurately annotate each joint with unique 3D world coordinates, which is why many conventional methods rely on synthetic data samples to cover large variations of hand postures. A synthetic dataset comes with very precise ground-truth annotations and allows control over the variety of data samples, so that a model can be trained over a large pose space. Most studies, however, have performed frame-by-frame estimation on independent static images. Synthetic visual data can provide practically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human hand pose estimation is a particularly interesting example of this synthetic-to-real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this dissertation, we consider not only the appearance of a hand but also the temporal movement information of a hand in motion, incorporating both into the learning framework for better 3D hand pose estimation performance, which creates the need for a large-scale dataset of sequential RGB hand images.
We propose a novel method that generates a synthetic dataset mimicking natural human hand movements by re-engineering the annotations of an existing static hand pose dataset into pose-flows. With the generated dataset, we train a newly proposed recurrent framework that exploits visuo-temporal features from sequential images of synthetic hands in motion and encourages temporally smooth estimates through a temporal consistency constraint. Our novel training strategy of detaching the recurrent layer of the framework during domain finetuning from synthetic to real preserves the visuo-temporal features learned from sequential synthetic hand images. Hand poses estimated sequentially in this way produce natural and smooth hand movements, which leads to more robust estimates. We show that utilizing temporal information for 3D hand pose estimation significantly improves pose estimation in general, outperforming state-of-the-art methods in experiments on hand pose estimation benchmarks. Since a fixed dataset provides a finite distribution of data samples, the generalization of a learned pose estimation network is limited in terms of pose, RGB, and viewpoint spaces. We further propose to augment the data automatically such that augmented poses are sampled in favor of the pose estimator's generalization performance. This auto-augmentation of poses is performed within a learned feature space in order to avoid the computational burden of generating synthetic samples at every update iteration. The proposed approach can be considered as generating and utilizing synthetic samples for network training in the feature space.
This improves training efficiency by requiring fewer real data samples, and the efficient augmentation enhances both generalization across multiple dataset domains and estimation performance.
Research on recognizing and reconstructing the shape and pose of the human hand from 2D images aims to detect the 3D position of each finger joint. A hand pose is composed of finger joints, meaning the anatomical elements that make up the human hand, from the wrist joint to the MCP, PIP, and DIP joints. Hand pose information can be used in a variety of fields, and in hand gesture detection research it serves as a very good input feature. To apply hand pose detection research to real systems, high accuracy, real-time performance, and models lightweight enough to run on a variety of devices are needed, and training an artificial neural network model that makes this possible requires a large amount of data. However, the devices that measure human hand pose are quite unstable, and images of hands wearing these devices differ greatly from natural skin color, which makes them unsuitable for training. Therefore, to address these problems, this dissertation reprocesses and augments synthetically generated data for use in training, and thereby aims to achieve better training results. Synthetically generated hand images may resemble real skin color, but their fine texture differs considerably, so a model trained on synthetic data performs markedly worse on real hand data.
To reduce the domain gap between these two kinds of data, we first had the network learn the structure of the human hand: we reprocessed hand motion, and, setting aside the temporal component that had learned this motion structure, trained only the remaining parts on real hand images, which proved highly effective. In doing so, we presented a methodology that mimics real human hand motion. Second, we aligned the data from the two different domains in the network's feature space. Moreover, rather than augmenting synthetic poses with specific data, we proposed a structure that sets up a probabilistic model so that poses the network has rarely seen are generated, and samples from it. By using synthetic data more effectively, this dissertation proposed methods that not only generate synthetic data more effectively, without the effort of collecting more real data that is hard to annotate, but also improve pose estimation performance more safely by exploiting local and temporal features. We also proposed an automatic data augmentation methodology in which the network itself finds the data it needs and learns from it. Combining the proposed methods can further improve hand pose estimation performance.
Contents: 1. Introduction; 2. Related Works; 3. Preliminaries: 3D Hand Mesh Model; 4. SeqHAND: RGB-sequence-based 3D Hand Pose and Shape Estimation; 5. Hand Pose Auto-Augment; 6. Conclusion; Abstract (Korean); Acknowledgments
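The abstract above mentions a temporal consistency constraint that encourages smooth pose estimates across consecutive frames, but it does not give the formula. As a rough illustration only, the sketch below assumes one common form of such a term: the mean squared displacement of each joint between neighbouring frames. The function name `temporal_consistency_penalty` and the pose layout (frames of joints of `(x, y, z)` tuples) are hypothetical and are not taken from the dissertation.

```python
from typing import Sequence

# One frame: a sequence of joints, each an (x, y, z) coordinate triple.
Pose = Sequence[Sequence[float]]


def temporal_consistency_penalty(poses: Sequence[Pose]) -> float:
    """Mean squared joint displacement between consecutive frames.

    Assumed, illustrative form of a smoothness term; the dissertation's
    actual constraint may differ.
    """
    if len(poses) < 2:
        return 0.0  # a single frame has no temporal neighbours
    total = 0.0
    count = 0
    # Pair every frame with its successor and accumulate squared distances.
    for prev, curr in zip(poses, poses[1:]):
        for joint_prev, joint_curr in zip(prev, curr):
            total += sum((a - b) ** 2 for a, b in zip(joint_curr, joint_prev))
            count += 1
    return total / count


# A single joint drifting slowly versus one jumping back and forth.
smooth = [[(0.0, 0.0, 0.0)], [(0.1, 0.0, 0.0)], [(0.2, 0.0, 0.0)]]
jittery = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
print(temporal_consistency_penalty(smooth) < temporal_consistency_penalty(jittery))
```

Under this assumed form, a sequence that moves gradually is penalised less than one that jitters, which is the behaviour a smoothness constraint is meant to reward during training.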

    An original framework for understanding human actions and body language by using deep neural networks

    Advances in both Computer Vision (CV) and Artificial Neural Networks (ANNs) have allowed the development of efficient automatic systems for the analysis of people's behaviour. By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way. These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively. The processing of body movements, in turn, plays a key role in the action recognition and affective computing fields. The former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements; both are essential tasks in many computer vision applications, including event recognition and video surveillance. In this Ph.D. thesis, an original framework for understanding actions and body language is presented. The framework is composed of three main modules: in the first one, a method based on Long Short Term Memory Recurrent Neural Networks (LSTM-RNNs) is proposed for the recognition of sign language and semaphoric hand gestures; the second module presents a solution based on 2D skeletons and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, in the last module, a solution for basic non-acted emotion recognition using 3D skeletons and Deep Neural Networks (DNNs) is provided. The performance of LSTM-RNNs is explored in depth, because their ability to model the long-term contextual information of temporal sequences makes them suitable for analysing body movements. All the modules were tested on challenging datasets that are well known in the state of the art, showing remarkable results compared to current methods in the literature.