130,924 research outputs found
확률적인 3차원 자세 복원과 행동인식
학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2016. 2. 오성회.These days, computer vision technology becomes popular and plays an important role in intelligent systems, such as augment reality, video and image analysis, and to name a few. Although cost effective depth cameras, like a Microsoft Kinect, have recently developed, most computer vision algorithms assume that observations are obtained from RGB cameras, which make 2D observations. If, somehow, we can estimate 3D information from 2D observations, it might give better solutions for many computer vision problems.
In this dissertation, we focus on estimating 3D information from 2D observations, which is well known as non-rigid structure from motion (NRSfM).
More formally, NRSfM finds the three dimensional structure of an object by analyzing image streams with the assumption that an object lies in a low-dimensional space. However, a human body for long periods of time can have complex shape variations and it makes a challenging problem for NRSfM due to its increased degree of freedom. In order to handle complex shape variations, we propose a Procrustean normal distribution mixture model (PNDMM) by extending a recently proposed Procrustean normal distribution (PND), which captures the distribution of non-rigid variations of an object by excluding the effects of rigid motion.
Unlike existing methods which use a single model to solve an NRSfM problem, the proposed PNDMM decomposes complex shape variations into a collection of simpler ones, thereby model learning can be more tractable and accurate. We perform experiments showing that the proposed method outperforms existing methods on highly complex and long human motion sequences.
In addition, we extend the PNDMM to a single view 3D human pose estimation problem. While recovering a 3D structure of a human body from an image is important, it is a highly ambiguous problem due to the deformation of an articulated human body. Moreover, before estimating a 3D human pose from a 2D human pose, it is important to obtain an accurate 2D human pose. In order to address inaccuracy of 2D pose estimation on a single image and 3D human pose ambiguities, we estimate multiple 2D and 3D human pose candidates and select the best one which can be explained by a 2D human pose detector and a 3D shape model. We also introduce a model transformation which is incorporated into the 3D shape prior model, such that the proposed method can be applied to a novel test image.
Experimental results show that the proposed method can provide good 3D reconstruction results when tested on a novel test image, despite inaccuracies of 2D part detections and 3D shape ambiguities.
Finally, we handle an action recognition problem from a video clip. Current studies show that high-level features obtained from estimated 2D human poses enable action recognition performance beyond current state-of-the-art methods using low- and mid-level features based on appearance and motion, despite inaccuracy of human pose estimation. Based on these findings, we propose an action recognition method using estimated 3D human pose information since the proposed PNDMM is able to reconstruct 3D shapes from 2D shapes. Experimental results show that 3D pose based descriptors are better than 2D pose based descriptors for action recognition, regardless of classification methods. Considering the fact that we use simple 3D pose descriptors based on a 3D shape model which is learned from 2D shapes, results reported in this dissertation are promising and obtaining accurate 3D information from 2D observations is still an important research issue for reliable computer vision systems.Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Research Issues 4
1.3 Organization of the Dissertation 6
Chapter 2 Preliminary 9
2.1 Generalized Procrustes Analysis (GPA) 11
2.2 EM-GPA Algorithm 12
2.2.1 Objective function 12
2.2.2 E-step 15
2.2.3 M-step 16
2.3 Implementation Considerations for EM-GPA 18
2.3.1 Preprocessing stage 18
2.3.2 Small update rate for the covariance matrix 20
2.4 Experiments 21
2.4.1 Shape alignment with the missing information 23
2.4.2 3D shape modeling 24
2.4.3 2D+3D active appearance models 28
2.5 Chapter Summary and Discussion 32
Chapter 3 Procrustean Normal Distribution Mixture Model 33
3.1 Non-Rigid Structure from Motion 35
3.2 Procrustean Normal Distribution (PND) 38
3.3 PND Mixture Model 41
3.4 Learning a PNDMM 43
3.4.1 E-step 44
3.4.2 M-step 46
3.5 Learning an Adaptive PNDMM 48
3.6 Experiments 50
3.6.1 Experimental setup 50
3.6.2 CMU Mocap database 53
3.6.3 UMPM dataset 69
3.6.4 Simple and short motions 74
3.6.5 Real sequence - qualitative representation 77
3.7 Chapter Summary 78
Chapter 4 Recovering a 3D Human Pose from a Novel Image 83
4.1 Single View 3D Human Pose Estimation 85
4.2 Candidate Generation 87
4.2.1 Initial pose generation 87
4.2.2 Part recombination 88
4.3 3D Shape Prior Model 89
4.3.1 Procrustean mixture model learning 89
4.3.2 Procrustean mixture model fitting 91
4.4 Model Transformation 92
4.4.1 Model normalization 92
4.4.2 Model adaptation 95
4.5 Result Selection 96
4.6 Experiments 98
4.6.1 Implementation details 98
4.6.2 Evaluation of the joint 2D and 3D pose estimation 99
4.6.3 Evaluation of the 2D pose estimation 104
4.6.4 Evaluation of the 3D pose estimation 106
4.7 Chapter Summary 108
Chapter 5 Application to Action Recognition 109
5.1 Appearance and Motion Based Descriptors 112
5.2 2D Pose Based Descriptors 113
5.3 Bag-of-Features with a Multiple Kernel Method 114
5.4 Classification - Kernel Group Sparse Representation 115
5.4.1 Group sparse representation for classification 116
5.4.2 Kernel group sparse (KGS) representation for classification 118
5.5 Experiment on sub-JHMDB Dataset 120
5.5.1 Experimental setup 120
5.5.2 3D pose based descriptor 122
5.5.3 Experimental results 123
5.6 Chapter Summary 129
Chapter 6 Conclusion and Future Work 131
Appendices 135
A Proof of Propositions in Chapter 2 137
A.1 Proof of Proposition 1 137
A.2 Proof of Proposition 3 138
A.3 Proof of Proposition 4 139
B Calculation of p(XijDii) in Chapter 3 141
B.1 Without the Dirac-delta term 141
B.2 With the Dirac-delta term 142
C Procrustean Mixture Model Learning and Fitting in Chapter 4 145
C.1 Procrustean Mixture Model Learning 145
C.2 Procrustean Mixture Model Fitting 147
Bibliography 153
초 록 167Docto
NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis achieved outstanding
performance and demonstrated the effectiveness of 3D representation for action
recognition. The existing depth-based and RGB+D-based action recognition
benchmarks have a number of limitations, including the lack of large-scale
training samples, realistic number of distinct class categories, diversity in
camera views, varied environmental conditions, and variety of human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset,
and a simple yet effective Action-Part Semantic Relevance-aware (APSR)
framework is proposed for this task, which yields promising results for
recognition of the novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. [The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI
An original framework for understanding human actions and body language by using deep neural networks
The evolution of both fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour.
By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way.
These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively.
While the processing of body movements play a key role in the action recognition and affective computing fields. The former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements;
both are essential tasks in many computer vision applications, including event recognition, and video surveillance.
In this Ph.D. thesis, an original framework for understanding Actions and body language is presented. The framework is composed of three main modules: in the first one, a Long Short Term Memory Recurrent Neural Networks (LSTM-RNNs) based method for the Recognition of Sign Language and Semaphoric Hand Gestures is proposed; the second module presents a solution based on 2D skeleton and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, in the last module, a solution for basic non-acted emotion recognition by using 3D skeleton and Deep Neural Networks (DNNs) is provided.
The performances of RNN-LSTMs are explored in depth, due to their ability to model the long term contextual information of temporal sequences, making them suitable for analysing body movements.
All the modules were tested by using challenging datasets, well known in the state of the art, showing remarkable results compared to the current literature methods
Action tube extraction based 3D-CNN for RGB-D action recognition
In this paper we propose a novel action tube extractor for RGB-D action recognition in trimmed videos. The action tube extractor takes as input a video and outputs an action tube. The method consists of two parts: spatial tube extraction and temporal sampling. The first part is built upon MobileNet-SSD and its role is to define the spatial region where the action takes place. The second part is based on the structural similarity index (SSIM) and is designed to remove frames without obvious motion from the primary action tube. The final extracted action tube has two benefits: 1) a higher ratio of ROI (subjects of action) to background; 2) most frames contain obvious motion change. We propose to use a two-stream (RGB and Depth) I3D architecture as our 3D-CNN model. Our approach outperforms the state-of-the-art methods on the OA and NTU RGB-D datasets. © 2018 IEEE.Peer ReviewedPostprint (published version
Mining Mid-level Features for Action Recognition Based on Effective Skeleton Representation
Recently, mid-level features have shown promising performance in computer
vision. Mid-level features learned by incorporating class-level information are
potentially more discriminative than traditional low-level local features. In
this paper, an effective method is proposed to extract mid-level features from
Kinect skeletons for 3D human action recognition. Firstly, the orientations of
limbs connected by two skeleton joints are computed and each orientation is
encoded into one of the 27 states indicating the spatial relationship of the
joints. Secondly, limbs are combined into parts and the limb's states are
mapped into part states. Finally, frequent pattern mining is employed to mine
the most frequent and relevant (discriminative, representative and
non-redundant) states of parts in continuous several frames. These parts are
referred to as Frequent Local Parts or FLPs. The FLPs allow us to build
powerful bag-of-FLP-based action representation. This new representation yields
state-of-the-art results on MSR DailyActivity3D and MSR ActionPairs3D
Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks
Human action recognition in 3D skeleton sequences has attracted a lot of
research attention. Recently, Long Short-Term Memory (LSTM) networks have shown
promising performance in this task due to their strengths in modeling the
dependencies and dynamics in sequential data. As not all skeletal joints are
informative for action recognition, and the irrelevant joints often bring noise
which can degrade the performance, we need to pay more attention to the
informative ones. However, the original LSTM network does not have explicit
attention ability. In this paper, we propose a new class of LSTM network,
Global Context-Aware Attention LSTM (GCA-LSTM), for skeleton based action
recognition. This network is capable of selectively focusing on the informative
joints in each frame of each skeleton sequence by using a global context memory
cell. To further improve the attention capability of our network, we also
introduce a recurrent attention mechanism, with which the attention performance
of the network can be enhanced progressively. Moreover, we propose a stepwise
training scheme in order to train our network effectively. Our approach
achieves state-of-the-art performance on five challenging benchmark datasets
for skeleton based action recognition
- …