PMHI: Proposals From Motion History Images for Temporal Segmentation of Long Uncut Videos
This letter proposes a method for the generation of temporal action proposals for the segmentation of long uncut video sequences. The presence of multiple consecutive actions in video sequences makes temporal segmentation a challenging problem due to the unconstrained nature of actions in space and time. To address this issue, we exploit the non-action segments present between the actual human actions in uncut videos. From the long uncut video, we compute the energy of consecutive non-overlapping motion history images (MHIs), which provides spatiotemporal information of motion. Our proposals from MHIs (PMHI) are based on clustering the MHIs into action and non-action segments by detecting minima in the energy of the MHIs. PMHI efficiently segments long uncut videos into a small number of non-overlapping temporal action proposals. The strength of PMHI is that it is unsupervised, which alleviates the requirement for any training data. Our temporal action proposal method outperforms the existing proposal methods on the Multi-view Human Action video (MuHAVi)-uncut and Computer Vision and Pattern Recognition (CVPR) 2012 Change Detection datasets with average recall rates of 86.1% and 86.0%, respectively. Sergio A. Velastin acknowledges funding by the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509) and Banco Santander.
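As a rough illustration of the energy-of-MHIs idea described in this abstract, the sketch below computes one MHI per non-overlapping window and splits the video at local minima of the MHI energy. The window length, decay step, difference threshold and helper names are illustrative assumptions of this sketch, not values or code taken from the paper.

```python
# Hypothetical sketch of the PMHI idea: energy of non-overlapping MHIs, with local
# minima of the energy taken as boundaries between action and non-action segments.
import numpy as np

def mhi_energy_per_window(frames, window=30, tau=30, diff_thresh=25):
    """frames: (T, H, W) grayscale video as uint8. Returns one energy value per window."""
    energies = []
    for start in range(0, len(frames) - window, window):
        mhi = np.zeros(frames[0].shape, dtype=np.float32)
        for t in range(start + 1, start + window):
            motion = np.abs(frames[t].astype(np.int16) - frames[t - 1].astype(np.int16)) > diff_thresh
            mhi = np.where(motion, tau, np.maximum(mhi - 1, 0))   # standard MHI update
        energies.append(mhi.sum())                                # energy of this window's MHI
    return np.asarray(energies)

def boundaries_from_minima(energies):
    """Indices of local minima in the energy signal, used as proposal boundaries."""
    return [i for i in range(1, len(energies) - 1)
            if energies[i] < energies[i - 1] and energies[i] < energies[i + 1]]
```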
DA-VLAD: Discriminative action vector of locally aggregated descriptors for action recognition
This paper has been presented at the 25th IEEE International Conference on Image Processing (ICIP 2018). In this paper, we propose a novel encoding method for the representation of human action videos, which we call Discriminative Action Vector of Locally Aggregated Descriptors (DA-VLAD). DA-VLAD is motivated by the fact that there are many unnecessary and overlapping frames that cause non-discriminative codewords during the training process. DA-VLAD deals with this issue by extracting class-specific clusters and learning the discriminative power of these codewords in the form of informative weights. We use these discriminative action weights with standard VLAD encoding as the contribution of each codeword. DA-VLAD reduces inter-class similarity efficiently by diminishing the effect of codewords common to multiple action classes during the encoding process. We demonstrate the effectiveness of DA-VLAD on two challenging action recognition datasets, UCF101 and HMDB51, improving the state of the art with accuracies of 95.1% and 80.1%, respectively. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. We also acknowledge the support from the Directorate of Advance Studies, Research and Technological Development (ASR&TD), University of Engineering and Technology Taxila, Pakistan. Sergio A. Velastin acknowledges funding by the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509) and Banco Santander.
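The following sketch shows VLAD encoding with per-codeword weights in the spirit of DA-VLAD; how the discriminative weights are learned in the paper is not reproduced here, they are simply passed in as an array, which is an assumption of this sketch.

```python
# Illustrative weighted-VLAD encoding: each codeword's residual sum is scaled by a
# discriminative weight before the usual power and L2 normalisation.
import numpy as np
from sklearn.cluster import KMeans

def weighted_vlad(descriptors, kmeans: KMeans, codeword_weights):
    """descriptors: (N, D) local features of one video; codeword_weights: (K,)."""
    centers = kmeans.cluster_centers_                 # (K, D) codewords
    assignments = kmeans.predict(descriptors)         # nearest codeword per descriptor
    vlad = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        members = descriptors[assignments == k]
        if len(members):
            vlad[k] = codeword_weights[k] * (members - centers[k]).sum(axis=0)
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))      # power normalisation
    return vlad / (np.linalg.norm(vlad) + 1e-12)      # L2 normalisation
```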
Multi-view human action recognition using 2D motion templates based on MHIs and their HOG description
In this study, a new multi-view human action recognition approach is proposed by exploiting low-dimensional motion information of actions. Before feature extraction, pre-processing steps are performed to remove noise from silhouettes incurred due to imperfect but realistic segmentation. Two-dimensional motion templates based on the motion history image (MHI) are computed for each view/action video. Histograms of oriented gradients (HOG) are used as an efficient description of the MHIs, which are classified using a nearest neighbor (NN) classifier. Compared with existing approaches, the proposed method has three advantages: (i) it does not require a fixed number of cameras during the training and testing stages, hence missing camera views can be tolerated; (ii) it has lower memory and bandwidth requirements; and hence (iii) it is computationally efficient, which makes it suitable for real-time action recognition. As far as the authors know, this is the first report of results on the MuHAVi-uncut dataset, which has a large number of action categories and a large set of camera views with noisy silhouettes, and which can be used by future workers as a baseline to improve on. Experimental results on this multi-view dataset give a high accuracy rate of 95.4% using the leave-one-sequence-out cross-validation technique and compare well to similar state-of-the-art approaches. Sergio A. Velastin acknowledges the Chilean National Science and Technology Council (CONICYT) for its funding under grant CONICYT-Fondecyt Regular no. 1140209 ("OBSERVE"). He is currently funded by the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509) and Banco Santander.
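A minimal sketch of the MHI + HOG + nearest-neighbor pipeline described above, assuming binary silhouette masks at a fixed resolution are already available; the decay step and HOG parameters are illustrative, not the paper's settings.

```python
# Build an MHI from silhouette masks, describe it with HOG, classify with 1-NN.
import numpy as np
from skimage.feature import hog
from sklearn.neighbors import KNeighborsClassifier

def mhi_from_silhouettes(silhouettes, tau=255, delta=8):
    """silhouettes: (T, H, W) boolean masks for one action video."""
    mhi = np.zeros(silhouettes[0].shape, dtype=np.float32)
    for mask in silhouettes:
        mhi = np.where(mask, tau, np.maximum(mhi - delta, 0))
    return mhi / tau

def hog_of_mhi(mhi):
    return hog(mhi, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))

# Training/testing (all MHIs assumed to share one resolution):
# X_train = np.stack([hog_of_mhi(mhi_from_silhouettes(s)) for s in train_videos])
# clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
```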
Multi-view Human Action Recognition using Histograms of Oriented Gradients (HOG) Description of Motion History Images (MHIs)
This paper has been presented at the 13th International Conference on Frontiers of Information Technology (FIT). In this paper, a silhouette-based, view-independent human action recognition scheme is proposed for a multi-camera dataset. To overcome the high-dimensionality issue incurred due to multi-camera data, a low-dimensional representation based on the Motion History Image (MHI) is extracted. A single MHI is computed for each view/action video. For efficient description of the MHIs, Histograms of Oriented Gradients (HOG) are employed. Finally, the HOG-based descriptions of the MHIs are classified using a Nearest Neighbor (NN) classifier. The proposed method does not employ feature fusion for multi-view data and therefore does not require a fixed number of cameras during the training and testing stages. Since no feature fusion is used, the method is suitable for multi-view as well as single-view datasets. Experimental results on the multi-view MuHAVi-14 and MuHAVi-8 datasets give high accuracy rates of 92.65% and 99.26%, respectively, using the Leave-One-Sequence-Out (LOSO) cross-validation technique, comparing well with similar state-of-the-art approaches. The proposed method is computationally efficient and hence suitable for real-time action recognition systems. S.A. Velastin acknowledges funding from the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509) and Banco Santander.
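Leave-one-sequence-out cross-validation can be expressed with scikit-learn by treating each recorded sequence as a group; the sketch below assumes the feature matrix, labels and sequence identifiers have been prepared elsewhere.

```python
# LOSO evaluation: every fold holds out all samples that belong to one sequence.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier

def loso_accuracy(X, y, sequence_ids):
    """X: (N, D) features, y: (N,) labels, sequence_ids: (N,) group id per sample."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=sequence_ids):
        clf = KNeighborsClassifier(n_neighbors=1).fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```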
End-to-End Temporal Action Detection using Bag of Discriminant Snippets (BoDS)
Detecting human actions in long untrimmed videos is a challenging problem. Existing temporal action detection methods have difficulties in finding the precise starting and ending times of the actions in untrimmed videos. In this letter, we propose a temporal action detection framework based on a Bag of Discriminant Snippets (BoDS) that can detect multiple actions in an end-to-end manner. BoDS is based on the observation that multiple actions and the background classes have similar snippets, which cause incorrect classification of action regions and imprecise boundaries. We solve this issue by finding the key snippets from the training data of each class and computing their discriminative power, which is used in BoDS encoding. During testing of an untrimmed video, we find the BoDS representation for multiple candidate proposals and determine their class label based on a majority voting scheme. We test BoDS on the Thumos14 and ActivityNet datasets and obtain state-of-the-art results. For the sports subset of the ActivityNet dataset, we obtain a mean Average Precision (mAP) value of 29% at a 0.7 temporal intersection over union (tIoU) threshold. For the Thumos14 dataset, we obtain a significant gain in terms of mAP, i.e., improving from 20.8% to 31.6% at tIoU=0.7. This work was supported by the ASR&TD, University of Engineering and Technology (UET) Taxila, Pakistan. The work of S. A. Velastin was supported by the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development, and demonstration under Grant 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509), and Banco Santander.
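The majority-voting step mentioned above can be sketched as follows; the snippet classifier (i.e. the BoDS encoding plus a trained model) is abstracted away as snippet_clf, which is an assumption of this sketch rather than the paper's implementation.

```python
# Each snippet in a candidate proposal is classified independently and the proposal
# label is the most frequent vote among its snippets.
from collections import Counter
import numpy as np

def label_proposal(snippet_features, snippet_clf):
    """snippet_features: (S, D), one feature vector per snippet of a candidate proposal."""
    votes = snippet_clf.predict(np.asarray(snippet_features))
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)     # label and the fraction of agreeing snippets
```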
TAB: Temporally aggregated bag-of-discriminant-words for temporal action proposals
In this work, we propose a new method to generate temporal action proposals from long untrimmed videos, named Temporally Aggregated Bag-of-Discriminant-Words (TAB). TAB is based on the observation that there are many overlapping frames in the action and background temporal regions of untrimmed videos, which cause difficulties in segmenting actions from non-action regions. TAB solves this issue by extracting class-specific codewords from the action and background videos and extracting the discriminative weights of these codewords based on their ability to discriminate between the two classes. We integrate these discriminative weights with Bag of Words encoding, which we then call Bag-of-Discriminant-Words (BoDW). We sample the untrimmed videos into non-overlapping snippets and temporally aggregate the BoDW representations of multiple snippets into action proposals using a binary classifier trained on trimmed videos in a single pass. TAB can be used with different types of features, including those computed by deep networks. We present the effectiveness of the TAB proposal extraction method on two challenging temporal action detection datasets, MSR-II and Thumos14, where it improves upon the state of the art with recall rates of 87.0% and 82.0%, respectively, at a temporal intersection over union ratio of 0.8. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN Xp GPU used for this research. We also acknowledge the support from the Directorate of Advance Studies, Research and Technological Development (ASR&TD), University of Engineering and Technology Taxila, Pakistan. Sergio A. Velastin acknowledges funding by the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509) and Banco Santander.
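A hedged sketch of the BoDW encoding and the temporal aggregation of snippet decisions into proposals: the discriminative codeword weights and the binary snippet classifier are assumed to be given, and the grouping rule below is a simplification of the paper's aggregation.

```python
# Weighted bag-of-words per snippet, then consecutive "action" snippets merged into proposals.
import numpy as np
from sklearn.cluster import KMeans

def bodw_histogram(descriptors, kmeans: KMeans, codeword_weights):
    counts = np.bincount(kmeans.predict(descriptors), minlength=len(codeword_weights))
    hist = counts * codeword_weights                  # discriminatively weighted BoW
    return hist / (np.linalg.norm(hist) + 1e-12)

def aggregate_snippets(snippet_histograms, binary_clf, min_len=2):
    """Group consecutive snippets classified as action (label 1) into (start, end) proposals."""
    labels = binary_clf.predict(np.stack(snippet_histograms))
    proposals, start = [], None
    for i, lab in enumerate(list(labels) + [0]):      # sentinel closes the last run
        if lab == 1 and start is None:
            start = i
        elif lab != 1 and start is not None:
            if i - start >= min_len:
                proposals.append((start, i))
            start = None
    return proposals
```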
Multiple Batches of Motion History Images (MB-MHIs) for Multi-view Human Action Recognition
The recognition of human actions recorded in a multi-camera environment faces the challenging issue of viewpoint variation. Multi-view methods employ videos from different views to generate a compact, view-invariant representation of human actions. This paper proposes a novel multi-view human action recognition approach that uses multiple low-dimensional temporal templates and a reconstruction-based encoding scheme. The proposed approach is based upon the extraction of multiple 2D motion history images (MHIs) of human action videos over non-overlapping temporal windows, constructing multiple batches of motion history images (MB-MHIs). Then, two kinds of descriptions are computed for these MHI batches based on (1) a deep residual network (ResNet) and (2) histograms of oriented gradients (HOG) to effectively quantify a change in gradient. ResNet descriptions are average pooled at each batch. HOG descriptions are processed independently at each batch to learn a class-based dictionary using a K-singular value decomposition (K-SVD) algorithm. Later, the sparse codes of the feature descriptions are obtained using an orthogonal matching pursuit approach. These sparse codes are average pooled to extract encoded feature vectors. Then, the encoded feature vectors at each batch are fused to form a final view-invariant feature representation. Finally, a linear support vector machine classifier is trained for action recognition. Experimental results are given on three versions of a multi-view dataset: MuHAVi-8, MuHAVi-14, and MuHAVi-uncut. The proposed approach shows promising results when tested on a novel camera view. Results on deep features indicate that action representation by MB-MHIs is more view-invariant than by single MHIs. Muhammad Haroon Yousaf has received funding from the Higher Education Commission, Pakistan, for the Swarm Robotics Lab under the National Centre for Robotics and Automation (NCRA). S.A. Velastin is grateful for funding received from the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development, and demonstration under Grant Agreement No. 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509), and Banco Santander. The authors also acknowledge support from the Directorate of ASR&TD, University of Engineering and Technology, Taxila, Pakistan, and of NVIDIA Corporation for its donation of GPU equipment.
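A simplified sketch of the multiple-batches idea: one MHI per non-overlapping temporal batch (a simplification assumed here), one description per MHI, and concatenation across batches into the final representation. The describe callable stands in for either the ResNet or the HOG/sparse-coding branch, which are not reproduced.

```python
# Split the silhouette sequence into temporal batches, compute one MHI per batch,
# describe each MHI and concatenate the per-batch descriptions.
import numpy as np

def mb_mhi(silhouettes, n_batches=4, tau=255, delta=8):
    """silhouettes: (T, H, W) boolean masks. Returns one MHI per temporal batch."""
    mhis = []
    for batch in np.array_split(silhouettes, n_batches):
        mhi = np.zeros(silhouettes[0].shape, dtype=np.float32)
        for mask in batch:
            mhi = np.where(mask, tau, np.maximum(mhi - delta, 0))
        mhis.append(mhi / tau)
    return mhis

def mb_mhi_feature(mhis, describe):
    # one description per batch, concatenated into the final action representation
    return np.concatenate([np.asarray(describe(m)).ravel() for m in mhis])
```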
Passenger Detection and Counting during Getting on and off from Public Transport Systems
Implementing an accurate and reliable passenger detection and counting system is an important task for the correct distribution of available transport resources. The aim of this paper is to develop an accurate computer-vision-based system to track and count passengers. The proposed passenger detection system incorporates the ideas of well-established detection techniques and is optimally customised for both indoor and outdoor scenarios. Candidate foreground regions (inside an image) are extracted in the proposed method and are described using the histogram of oriented gradients descriptor. These features are trained and tested using a support vector machine classifier, and the detected passengers are tracked using a filter. The proposed counting system counts passengers automatically when they pass through a virtual line of interest. Accuracies ranging from 86.24 to 91.2 percent were found for passenger detection using the proposed detection and counting system, whereas relative counting errors varied from ten to thirteen percent.
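The virtual-line counting step can be sketched as follows; the detector, the tracker and the position of the line of interest are assumptions of this sketch rather than details taken from the paper.

```python
# A passenger is counted when the vertical position of a track crosses a horizontal
# line of interest between two consecutive frames.
def update_counts(tracks_prev_y, tracks_curr_y, line_y, counts):
    """tracks_*_y: dict track_id -> vertical centre position; counts: dict with 'in'/'out'."""
    for tid, y_now in tracks_curr_y.items():
        y_before = tracks_prev_y.get(tid)
        if y_before is None:
            continue                         # new track, no crossing decision yet
        if y_before < line_y <= y_now:       # crossed downwards: boarding
            counts["in"] += 1
        elif y_before >= line_y > y_now:     # crossed upwards: alighting
            counts["out"] += 1
    return counts
```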
Vectors of Temporally Correlated Snippets for Temporal Action Detection
Detection of human actions in long untrimmed videos is an important but challenging task due to the unconstrained nature of the actions present in untrimmed videos. We argue that untrimmed videos contain multiple snippets from action and background classes having significant correlation with each other, which results in imprecise detection of the start and end times of action regions. In this work, we propose Vectors of Temporally Correlated Snippets (VTCS), which addresses this problem by finding the snippet-centroids from each class that are discriminant for their own class. For each untrimmed video, non-overlapping snippets are temporally correlated with the snippet-centroids using VTCS encoding to find the action proposals. We evaluate the performance of VTCS on the Thumos14 and ActivityNet datasets. For Thumos14, VTCS achieves a significant gain in mean Average Precision (mAP) at a temporal Intersection over Union (tIoU) threshold of 0.5, improving from 41.5% to 44.3%. For the sports subset of the ActivityNet dataset, VTCS obtains 38.5% mAP at a 0.5 tIoU threshold. We gratefully acknowledge the funding for the Swarm Robotics Lab under the National Centre for Robotics and Automation, Pakistan. We also acknowledge the support of NVIDIA Corporation with the donation of the TITAN Xp GPU used for this research. Sergio A. Velastin acknowledges funding by the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509) and Banco Santander.
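An illustrative encoding of a single snippet against per-class snippet-centroids, using cosine similarity as the correlation measure; this choice of measure is an assumption of the sketch, not necessarily the paper's exact formulation.

```python
# Encode a snippet as its correlation with each discriminant snippet-centroid.
import numpy as np

def vtcs_like_encoding(snippet_feature, centroids):
    """snippet_feature: (D,); centroids: (K, D) discriminant snippet-centroids."""
    f = snippet_feature / (np.linalg.norm(snippet_feature) + 1e-12)
    c = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12)
    return c @ f        # one correlation value per centroid, forming the encoding vector
```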
Agglomerative Clustering and Residual-VLAD Encoding for Human Action Recognition
Human action recognition has gathered significant attention in recent years due to its high demand in various application domains. In this work, we propose a novel codebook generation and hybrid encoding scheme for the classification of action videos. The proposed scheme develops a discriminative codebook and a hybrid feature vector by encoding the features extracted from convolutional neural networks (CNNs). We explore different CNN architectures for extracting spatio-temporal features. We employ an agglomerative clustering approach for codebook generation, which is intended to combine the advantages of global and class-specific codebooks. We propose a Residual Vector of Locally Aggregated Descriptors (R-VLAD) and fuse it with locality-based coding to form a hybrid feature vector. It provides a compact representation along with high-order statistics. We evaluated our work on two publicly available standard benchmark datasets, HMDB-51 and UCF-101. The proposed method achieves 72.6% and 96.2% accuracy on HMDB-51 and UCF-101, respectively. We conclude that the proposed scheme is able to boost recognition accuracy for human action recognition.
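A sketch of a class-specific agglomerative codebook in the spirit of the scheme above: features of each class are clustered separately and the cluster means of all classes are concatenated into one codebook. Cluster counts are illustrative, and the R-VLAD encoding itself is not reproduced here.

```python
# Build a codebook by agglomerative clustering within each class and keeping cluster means.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def class_specific_codebook(features_by_class, clusters_per_class=32):
    """features_by_class: list of (N_c, D) arrays, one per class (N_c >= clusters_per_class)."""
    codewords = []
    for feats in features_by_class:
        labels = AgglomerativeClustering(n_clusters=clusters_per_class).fit_predict(feats)
        for k in range(clusters_per_class):
            codewords.append(feats[labels == k].mean(axis=0))
    return np.stack(codewords)       # (num_classes * clusters_per_class, D) codebook
```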