
    Probabilistic 3D Pose Reconstruction and Action Recognition

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2016. 오성회. Computer vision technology has become popular and plays an important role in intelligent systems such as augmented reality and video and image analysis, to name a few. Although cost-effective depth cameras, such as the Microsoft Kinect, have recently been developed, most computer vision algorithms assume that observations come from RGB cameras, which produce 2D observations. If we can estimate 3D information from 2D observations, it may yield better solutions to many computer vision problems. In this dissertation, we focus on estimating 3D information from 2D observations, a problem well known as non-rigid structure from motion (NRSfM). More formally, NRSfM finds the three-dimensional structure of an object by analyzing image streams under the assumption that the object lies in a low-dimensional space. However, a human body observed over a long period of time can exhibit complex shape variations, which makes NRSfM challenging because of the increased degrees of freedom. To handle complex shape variations, we propose a Procrustean normal distribution mixture model (PNDMM) by extending the recently proposed Procrustean normal distribution (PND), which captures the distribution of non-rigid variations of an object by excluding the effects of rigid motion. Unlike existing methods, which use a single model to solve an NRSfM problem, the proposed PNDMM decomposes complex shape variations into a collection of simpler ones, making model learning more tractable and accurate. We perform experiments showing that the proposed method outperforms existing methods on highly complex and long human motion sequences. In addition, we extend the PNDMM to the single-view 3D human pose estimation problem. While recovering the 3D structure of a human body from an image is important, it is a highly ambiguous problem due to the deformation of an articulated human body.
Moreover, before estimating a 3D human pose from a 2D human pose, it is important to obtain an accurate 2D human pose. To address the inaccuracy of 2D pose estimation on a single image and the ambiguity of 3D human poses, we estimate multiple 2D and 3D human pose candidates and select the one that is best explained by a 2D human pose detector and a 3D shape model. We also introduce a model transformation, incorporated into the 3D shape prior model, so that the proposed method can be applied to a novel test image. Experimental results show that the proposed method provides good 3D reconstruction results on a novel test image, despite inaccuracies in 2D part detections and 3D shape ambiguities. Finally, we address the problem of action recognition from a video clip. Recent studies show that high-level features obtained from estimated 2D human poses enable action recognition performance beyond state-of-the-art methods that use low- and mid-level features based on appearance and motion, despite the inaccuracy of human pose estimation. Based on these findings, we propose an action recognition method that uses estimated 3D human pose information, since the proposed PNDMM is able to reconstruct 3D shapes from 2D shapes. Experimental results show that 3D pose based descriptors are better than 2D pose based descriptors for action recognition, regardless of the classification method.
Considering that we use simple 3D pose descriptors based on a 3D shape model learned from 2D shapes, the results reported in this dissertation are promising, and obtaining accurate 3D information from 2D observations remains an important research issue for reliable computer vision systems.
    Contents:
    Chapter 1 Introduction: 1.1 Motivation; 1.2 Research Issues; 1.3 Organization of the Dissertation
    Chapter 2 Preliminary: 2.1 Generalized Procrustes Analysis (GPA); 2.2 EM-GPA Algorithm (2.2.1 Objective function; 2.2.2 E-step; 2.2.3 M-step); 2.3 Implementation Considerations for EM-GPA (2.3.1 Preprocessing stage; 2.3.2 Small update rate for the covariance matrix); 2.4 Experiments (2.4.1 Shape alignment with the missing information; 2.4.2 3D shape modeling; 2.4.3 2D+3D active appearance models); 2.5 Chapter Summary and Discussion
    Chapter 3 Procrustean Normal Distribution Mixture Model: 3.1 Non-Rigid Structure from Motion; 3.2 Procrustean Normal Distribution (PND); 3.3 PND Mixture Model; 3.4 Learning a PNDMM (3.4.1 E-step; 3.4.2 M-step); 3.5 Learning an Adaptive PNDMM; 3.6 Experiments (3.6.1 Experimental setup; 3.6.2 CMU Mocap database; 3.6.3 UMPM dataset; 3.6.4 Simple and short motions; 3.6.5 Real sequence - qualitative representation); 3.7 Chapter Summary
    Chapter 4 Recovering a 3D Human Pose from a Novel Image: 4.1 Single View 3D Human Pose Estimation; 4.2 Candidate Generation (4.2.1 Initial pose generation; 4.2.2 Part recombination); 4.3 3D Shape Prior Model (4.3.1 Procrustean mixture model learning; 4.3.2 Procrustean mixture model fitting); 4.4 Model Transformation (4.4.1 Model normalization; 4.4.2 Model adaptation); 4.5 Result Selection; 4.6 Experiments (4.6.1 Implementation details; 4.6.2 Evaluation of the joint 2D and 3D pose estimation; 4.6.3 Evaluation of the 2D pose estimation; 4.6.4 Evaluation of the 3D pose estimation); 4.7 Chapter Summary
    Chapter 5 Application to Action Recognition: 5.1 Appearance and Motion Based Descriptors; 5.2 2D Pose Based Descriptors; 5.3 Bag-of-Features with a Multiple Kernel Method; 5.4 Classification - Kernel Group Sparse Representation (5.4.1 Group sparse representation for classification; 5.4.2 Kernel group sparse (KGS) representation for classification); 5.5 Experiment on sub-JHMDB Dataset (5.5.1 Experimental setup; 5.5.2 3D pose based descriptor; 5.5.3 Experimental results); 5.6 Chapter Summary
    Chapter 6 Conclusion and Future Work
    Appendices: A Proof of Propositions in Chapter 2 (A.1 Proof of Proposition 1; A.2 Proof of Proposition 3; A.3 Proof of Proposition 4); B Calculation of p(XijDii) in Chapter 3 (B.1 Without the Dirac-delta term; B.2 With the Dirac-delta term); C Procrustean Mixture Model Learning and Fitting in Chapter 4 (C.1 Procrustean Mixture Model Learning; C.2 Procrustean Mixture Model Fitting)
    Bibliography
    Abstract (in Korean)
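
The abstract above describes the PND as modeling non-rigid shape variation after excluding the effects of rigid motion. The following is a minimal sketch of that alignment idea only, assuming NumPy and an ordinary orthogonal Procrustes (Kabsch) fit between two 3D point sets. It is not the dissertation's EM-GPA or PNDMM procedure, and the function name `align_rigid` is hypothetical.

```python
import numpy as np

def align_rigid(shape, reference):
    """Rotate and translate `shape` (N x 3) onto `reference` (N x 3).

    Illustrative orthogonal Procrustes fit without scaling; this is an
    assumption-laden sketch, not the dissertation's EM-GPA algorithm.
    """
    shape = np.asarray(shape, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Remove translation by centering both point sets.
    shape_c = shape - shape.mean(axis=0)
    ref_c = reference - reference.mean(axis=0)
    # The SVD of the cross-covariance matrix yields the optimal rotation.
    u, _, vt = np.linalg.svd(shape_c.T @ ref_c)
    # Flip the last axis if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    # Apply the rotation and move the shape to the reference centroid.
    return shape_c @ rot + reference.mean(axis=0)
```

Whatever difference remains between the aligned shape and the reference cannot be explained by rotation or translation, which is the kind of residual a non-rigid shape model would then have to describe.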

    NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding

    Research on depth-based human activity analysis has achieved outstanding performance and demonstrated the effectiveness of 3D representation for action recognition. The existing depth-based and RGB+D-based action recognition benchmarks have a number of limitations, including the lack of large-scale training samples, a realistic number of distinct class categories, diversity in camera views, varied environmental conditions, and variety of human subjects. In this work, we introduce a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and contains more than 114 thousand video samples and 8 million frames. This dataset contains 120 different action classes including daily, mutual, and health-related activities. We evaluate the performance of a series of existing 3D activity analysis methods on this dataset, and show the advantage of applying deep learning methods for 3D-based human action recognition. Furthermore, we investigate a novel one-shot 3D activity recognition problem on our dataset, and propose a simple yet effective Action-Part Semantic Relevance-aware (APSR) framework for this task, which yields promising results for recognition of the novel action classes. We believe the introduction of this large-scale dataset will enable the community to apply, adapt, and develop various data-hungry learning techniques for depth-based and RGB+D-based human activity understanding. [The dataset is available at: http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp] Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

    An original framework for understanding human actions and body language by using deep neural networks

    Advances in Computer Vision (CV) and Artificial Neural Networks (ANNs) have enabled efficient automatic systems for analysing people's behaviour. By studying hand movements, it is possible to recognize gestures, which people often use to communicate information non-verbally. These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest because of their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively. The processing of body movements, in turn, plays a key role in the action recognition and affective computing fields. The former is essential for understanding how people act in an environment, while the latter tries to interpret people's emotions from their poses and movements; both are essential tasks in many computer vision applications, including event recognition and video surveillance. In this Ph.D. thesis, an original framework for understanding actions and body language is presented. The framework is composed of three main modules: the first proposes a method based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) for the recognition of sign language and semaphoric hand gestures; the second presents a solution based on 2D skeletons and two-branch stacked LSTM-RNNs for action recognition in video sequences; the last provides a solution for basic non-acted emotion recognition using 3D skeletons and Deep Neural Networks (DNNs). The performance of LSTM-RNNs is explored in depth, because their ability to model the long-term contextual information of temporal sequences makes them suitable for analysing body movements. All the modules were tested on challenging, well-known datasets, showing remarkable results compared with methods in the current literature.

    Action tube extraction based 3D-CNN for RGB-D action recognition

    In this paper we propose a novel action tube extractor for RGB-D action recognition in trimmed videos. The action tube extractor takes as input a video and outputs an action tube. The method consists of two parts: spatial tube extraction and temporal sampling. The first part is built upon MobileNet-SSD and its role is to define the spatial region where the action takes place. The second part is based on the structural similarity index (SSIM) and is designed to remove frames without obvious motion from the primary action tube. The final extracted action tube has two benefits: 1) a higher ratio of ROI (subjects of action) to background; 2) most frames contain obvious motion change. We propose to use a two-stream (RGB and Depth) I3D architecture as our 3D-CNN model. Our approach outperforms the state-of-the-art methods on the OA and NTU RGB-D datasets. © 2018 IEEE. Peer Reviewed. Postprint (published version)
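
The temporal sampling step described above (dropping frames without obvious motion using SSIM) can be illustrated with a rough sketch. It assumes each frame is a flat list of grayscale pixel values, computes SSIM over a single global window rather than the usual sliding windows, and uses a hypothetical threshold of 0.95. The names `global_ssim` and `sample_moving_frames` are made up for illustration and do not come from the paper.

```python
def global_ssim(a, b, dynamic_range=255.0):
    """SSIM over one global window (a simplification of windowed SSIM)."""
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    # Standard stabilizing constants from the SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


def sample_moving_frames(frames, threshold=0.95):
    """Return indices of frames that differ enough from the last kept frame.

    The threshold is a hypothetical value, not one reported in the paper.
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame as the initial reference
    for i in range(1, len(frames)):
        # High SSIM means the frame is nearly identical to the reference,
        # so it is treated as showing no obvious motion and is skipped.
        if global_ssim(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept
```

Comparing against the last kept frame, rather than the immediately preceding one, is one possible design choice: it prevents slow motion from being discarded entirely when each individual step is small.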

    Mining Mid-level Features for Action Recognition Based on Effective Skeleton Representation

    Recently, mid-level features have shown promising performance in computer vision. Mid-level features learned by incorporating class-level information are potentially more discriminative than traditional low-level local features. In this paper, an effective method is proposed to extract mid-level features from Kinect skeletons for 3D human action recognition. First, the orientation of each limb, defined by the two skeleton joints it connects, is computed and encoded into one of 27 states indicating the spatial relationship of the joints. Second, limbs are combined into parts and the limb states are mapped into part states. Finally, frequent pattern mining is employed to mine the most frequent and relevant (discriminative, representative and non-redundant) part states over several consecutive frames. These are referred to as Frequent Local Parts, or FLPs. The FLPs allow us to build a powerful bag-of-FLP-based action representation. This new representation yields state-of-the-art results on MSR DailyActivity3D and MSR ActionPairs3D.
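
The abstract does not say how the 27 orientation states are defined. One plausible reading, offered here purely as an assumption, is that each of the three axis components of the limb vector is quantized into negative, near-zero, or positive, giving 3 x 3 x 3 = 27 combinations. The function name `orientation_state` and the tolerance `tol` are hypothetical.

```python
def orientation_state(joint_a, joint_b, tol=0.05):
    """Map the limb vector from joint_a to joint_b to a state in 0..26.

    Assumes 3D joints given as (x, y, z). Each axis contributes one base-3
    digit: 0 for negative, 1 for near zero, 2 for positive. This encoding is
    an illustrative guess, not the scheme confirmed by the paper.
    """
    state = 0
    for a, b in zip(joint_a, joint_b):
        delta = b - a
        if delta > tol:
            digit = 2
        elif delta < -tol:
            digit = 0
        else:
            digit = 1
        # Accumulate the digits as a base-3 number (x is most significant).
        state = state * 3 + digit
    return state
```

For example, a limb pointing along positive x, flat in y, and along negative z yields the digits (2, 1, 0), which is state 2*9 + 1*3 + 0 = 21.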

    Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks

    Human action recognition in 3D skeleton sequences has attracted a lot of research attention. Recently, Long Short-Term Memory (LSTM) networks have shown promising performance in this task due to their strengths in modeling the dependencies and dynamics in sequential data. Because not all skeletal joints are informative for action recognition, and irrelevant joints often introduce noise that can degrade performance, more attention should be paid to the informative ones. However, the original LSTM network does not have explicit attention ability. In this paper, we propose a new class of LSTM network, Global Context-Aware Attention LSTM (GCA-LSTM), for skeleton-based action recognition. This network is capable of selectively focusing on the informative joints in each frame of each skeleton sequence by using a global context memory cell. To further improve the attention capability of our network, we also introduce a recurrent attention mechanism, with which the attention performance of the network can be enhanced progressively. Moreover, we propose a stepwise training scheme in order to train our network effectively. Our approach achieves state-of-the-art performance on five challenging benchmark datasets for skeleton-based action recognition.
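
The idea of weighting joints by how informative they are relative to a global context can be sketched in a few lines. This is a generic dot-product attention with a softmax, assumed here only for illustration. It is not the GCA-LSTM architecture, whose scoring function, memory cell, and recurrent refinement are not specified in the abstract. The name `attend_joints` is hypothetical.

```python
import math

def attend_joints(joint_features, global_context):
    """Return (weights, pooled) for one frame of per-joint feature vectors.

    Scores are plain dot products with the global context vector; this
    choice is an assumption, not the scoring used by GCA-LSTM.
    """
    scores = [
        sum(f * g for f, g in zip(feat, global_context))
        for feat in joint_features
    ]
    # Subtract the maximum score for numerical stability of the softmax.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the joint features.
    dim = len(global_context)
    pooled = [
        sum(w * feat[k] for w, feat in zip(weights, joint_features))
        for k in range(dim)
    ]
    return weights, pooled
```

Joints whose features align with the context vector receive larger weights, so they dominate the pooled representation while less relevant joints are down-weighted rather than discarded.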