    3DFCNN: real-time action recognition using 3D deep neural networks with raw depth information

    This work describes an end-to-end approach for real-time human action recognition from raw depth image sequences. The proposal is based on a 3D fully convolutional neural network, named 3DFCNN, which automatically encodes spatio-temporal patterns from raw depth sequences and classifies actions from that spatially and temporally encoded information. Using depth data protects people's privacy during action recognition, since their identities cannot be recognized from these data. The proposed 3DFCNN has been optimized to reach good accuracy while working in real time. It has then been evaluated and compared with other state-of-the-art systems on three widely used public datasets with different characteristics. The results show that 3DFCNN outperforms all the non-DNN-based state-of-the-art methods, with a maximum accuracy of 83.6%, and obtains results comparable to the DNN-based approaches while maintaining a much lower computational cost of 1.09 seconds, which significantly increases its applicability in real-world environments. Agencia Estatal de Investigación; Universidad de Alcal
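
The idea of a fully convolutional 3D network over a depth clip can be illustrated with a minimal, dependency-free sketch. This is not the 3DFCNN architecture from the paper: the function names `conv3d_valid` and `classify_clip`, the single hand-written kernel per class, and the toy clip are all assumptions made for illustration. The sketch only shows why a network built from 3D convolutions and global average pooling, with no fully connected layer, can score clips of any length and resolution.

```python
def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation over (time, height, width).

    volume and kernel are nested lists indexed as [t][y][x].
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def classify_clip(clip, class_kernels):
    """Score a depth clip with one 3D kernel per class (hypothetical toy model).

    Each class score is the global average of the ReLU-activated feature map,
    so the output size does not depend on the clip size.
    """
    scores = []
    for kernel in class_kernels:
        fmap = conv3d_valid(clip, kernel)
        activations = [max(0.0, v) for plane in fmap for row in plane for v in row]
        scores.append(sum(activations) / len(activations))
    return scores.index(max(scores)), scores


# Toy depth clip: 3 frames of 2x2 pixels whose depth increases over time.
clip = [[[float(t)] * 2 for _ in range(2)] for t in range(3)]
rising = [[[-1.0]], [[1.0]]]   # responds to depth increasing between frames
falling = [[[1.0]], [[-1.0]]]  # responds to depth decreasing between frames
label, scores = classify_clip(clip, [rising, falling])
print(label, scores)
```

In a real system the kernels would be learned and stacked in several layers, typically with a deep learning framework rather than nested loops; the loops here only make the spatio-temporal sliding window explicit.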

    Online View-invariant Human Action Recognition Using RGB-D Spatio-temporal Matrix

    In recent years, action recognition has become a popular research topic in computer vision. We adopt a vision-based design so that the system can interpret fine-grained, complex actions in the way that is closest and most natural to humans. When people recognize the body movements of others, they do not need to stand directly in front of the actor; given enough visual information, they can recognize an action from any viewpoint. In this thesis, we therefore aim to build a vision-based action recognition system that is invariant to viewpoint and can reliably distinguish human actions whenever enough information about the body is visible. To achieve this goal, we adopt the idea of self-similarity. When video sequences record the same action from different camera views, the resulting appearances differ, so applying feature extraction directly to the raw video yields different features for each view. Instead of directly using the spatio-temporal feature vectors extracted for every frame, we compute the Euclidean distance between the feature vectors of every pair of frames and store the distances in a Self-Similarity Matrix (SSM). We then partition the SSM into multiple sub-matrices, describe each sub-matrix with our proposed Temporal-Pyramid Bag-of-Words, and represent an action by the pyramid bag-of-words of all its sub-matrices. These representations serve as input vectors for training a Support Vector Machine classifier, which performs view-invariant action recognition. Extensive experiments have been conducted to validate the proposed action recognition system.

    Table of contents:
    Acknowledgements (in Chinese)
    Acknowledgements
    Abstract (in Chinese)
    Abstract
    1 Introduction
      1.1 Motivation
      1.2 Challenges
      1.3 Related Work
        1.3.1 Action Recognition
        1.3.2 Dealing with Perspective of Camera View
      1.4 Contribution
      1.5 Thesis Organization
    2 Preliminaries
      2.1 Features
        2.1.1 Histogram of Oriented Gradient
        2.1.2 Histogram of Optical Flow
      2.2 Bag-of-Words Model
        2.2.1 Codebook Generation
        2.2.2 Histogram of Codewords
      2.3 Support Vector Machine
        2.3.1 Linear SVM
        2.3.2 Soft Margin
        2.3.3 Nonlinear SVM
      2.4 System Overview
    3 Feature Extraction and Self Similarity
      3.1 Preprocessing
      3.2 Spatio-Temporal Feature Extraction
      3.3 Spatio-Temporal Self-Similarity Matrix
        3.3.1 Self-Similarity
        3.3.2 Spatio-Temporal Self-Similarity Matrix
      3.4 Structural Stability of SSM across views
    4 SSM-Based Action Description and Action Recognition
      4.1 Local Feature Descriptor
      4.2 Temporal Pyramid Bag-of-Word Representation
      4.3 Action Recognition
        4.3.1 Off-line Training
        4.3.2 On-line Testing
    5 Experiments
      5.1 Experimental Setting
      5.2 Datasets
        5.2.1 Weizmann Dataset
        5.2.2 IXMAS Dataset
        5.2.3 ViData Dataset
      5.3 Experimental Results
        5.3.1 Temporal-Pyramid Bag-of-Words Evaluation
        5.3.2 View-Invariant Action Recognition Performance
        5.3.3 Action Spotting
        5.3.4 Computational Cost Evaluation
    6 Conclusion and Future Work
    Reference
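
The two central steps of the thesis abstract, the Self-Similarity Matrix and a temporal-pyramid bag-of-words, can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis implementation: the function names `self_similarity_matrix` and `temporal_pyramid_bow` are hypothetical, the per-frame feature vectors are taken as given, and the sketch assumes that local SSM descriptors have already been quantized into one codeword index per time step. The pyramid shown here halves the sequence at each level, which is a common construction and may differ from the one used in the thesis.

```python
import math


def self_similarity_matrix(features):
    """Return the T x T matrix of Euclidean distances between frame features."""
    return [[math.dist(a, b) for b in features] for a in features]


def temporal_pyramid_bow(codewords, vocab_size, levels=2):
    """Concatenate normalized codeword histograms over a temporal pyramid.

    Level l splits the sequence into 2**l equal segments (an assumption of
    this sketch); each segment contributes one histogram of length vocab_size.
    """
    n = len(codewords)
    descriptor = []
    for level in range(levels):
        segments = 2 ** level
        for s in range(segments):
            start = s * n // segments
            end = (s + 1) * n // segments
            hist = [0.0] * vocab_size
            for word in codewords[start:end]:
                hist[word] += 1.0
            total = sum(hist)
            if total > 0:
                hist = [h / total for h in hist]
            descriptor.extend(hist)
    return descriptor


# Three toy frame features; distances depend only on relative changes
# between frames, which is what makes the SSM structure stable across views.
ssm = self_similarity_matrix([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(ssm[0])

# Four quantized time steps over a 3-word vocabulary, two pyramid levels.
print(temporal_pyramid_bow([0, 1, 1, 2], vocab_size=3, levels=2))
```

The resulting descriptor would then be the input vector to a classifier such as a Support Vector Machine; that training step is omitted here.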