6 research outputs found

    Do less and achieve more: Training CNNs for action recognition utilizing action images from the Web

    Recently, attempts have been made to collect millions of videos to train Convolutional Neural Network (CNN) models for action recognition in videos. However, curating such large-scale video datasets requires immense human labor, and training CNNs on millions of videos demands huge computational resources. In contrast, collecting action images from the Web is much easier and training on images requires much less computation. In addition, labeled web images tend to contain discriminative action poses, which highlight discriminative portions of a video’s temporal progression. Through extensive experiments, we explore the question of whether we can utilize web action images to train better CNN models for action recognition in videos. We collect 23.8K manually filtered images from the Web that depict the 101 actions in the UCF101 action video dataset. We show that by utilizing web action images along with videos in training, significant performance boosts of CNN models can be achieved. We also investigate the scalability of the process by leveraging crawled web images (unfiltered) for UCF101 and ActivityNet. Using unfiltered images we can achieve performance improvements that are on par with using filtered images. This means we can further reduce annotation labor and easily scale up to larger problems. We also shed light on an artifact of finetuning CNN models that reduces the effective parameters of the CNN and show that using web action images can significantly alleviate this problem. First author draft: https://arxiv.org/pdf/1512.07155v1.pdf
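
    The core recipe lends itself to a short illustration. Below is a minimal sketch, not the paper's exact pipeline, of treating web action images as extra single-frame training samples mixed with frames sampled from videos; the dataset paths, ResNet-18 backbone, and hyperparameters are illustrative assumptions rather than the paper's choices.

        # Minimal sketch: mix web action images with video frames as training data for a frame-level CNN.
        import torch
        from torch import nn
        from torch.utils.data import ConcatDataset, DataLoader
        from torchvision import datasets, models, transforms

        tf = transforms.Compose([
            transforms.Resize(256),
            transforms.RandomCrop(224),
            transforms.ToTensor(),
        ])

        # Video frames exported as images and filtered web images, each in a
        # class-per-folder layout (hypothetical paths).
        video_frames = datasets.ImageFolder("data/ucf101_frames", transform=tf)
        web_images = datasets.ImageFolder("data/web_action_images", transform=tf)
        train_loader = DataLoader(ConcatDataset([video_frames, web_images]),
                                  batch_size=64, shuffle=True, num_workers=4)

        model = models.resnet18(weights="IMAGENET1K_V1")
        model.fc = nn.Linear(model.fc.in_features, 101)   # 101 UCF101 action classes
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
        criterion = nn.CrossEntropyLoss()

        model.train()
        for images, labels in train_loader:               # one pass over the mixed data
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()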

    Learning space-time structures for action recognition and localization

    In this thesis the problem of automatic human action recognition and localization in videos is studied. In this problem, our goal is to recognize the category of the human action that is happening in the video, and also to localize the action in space and/or time. This problem is challenging due to the complexity of human actions, the large intra-class variations and the distraction of backgrounds. Human actions are inherently structured patterns of body movements. However, past works are inadequate in learning the space-time structures in human actions and exploring them for better recognition and localization. In this thesis new methods are proposed that exploit such space-time structures for effective human action recognition and localization in videos, including sports videos, YouTube videos, TV programs and movies. A new local space-time video representation, the hierarchical Space-Time Segments, is first proposed. Using this new video representation, ensembles of hierarchical spatio-temporal trees, discovered directly from the training videos, are constructed to model the hierarchical, spatial and temporal structures of human actions. This proposed approach achieves promising performance in action recognition and localization on challenging benchmark datasets. Moreover, the discovered trees show good cross-dataset generalizability: trees learned on one dataset can be used to recognize and localize similar actions in another dataset. To handle large-scale data, a deep model is explored that learns the temporal progression of actions using Long Short-Term Memory (LSTM), which is a type of Recurrent Neural Network (RNN). Two novel ranking losses are proposed to train the model to better capture the temporal structures of actions for accurate action recognition and temporal localization. This model achieves state-of-the-art performance on a large-scale video dataset. A deep model usually employs a Convolutional Neural Network (CNN) to learn visual features from video frames. The problem of utilizing web action images for training a CNN is also studied: training a CNN typically requires a large number of training videos, but the findings of this study show that web action images can be utilized as additional training data to significantly reduce the burden of video training data collection.
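
    As a rough illustration of the LSTM component, the sketch below runs an LSTM over per-frame CNN features and adds a simple ranking term that penalizes the ground-truth class score for decreasing between consecutive frames. This is one plausible reading of a ranking loss over temporal progression, not the thesis' exact formulation; the feature dimension, hidden size, and loss weight are assumptions.

        # Sketch: LSTM over per-frame features plus a progression-style ranking loss.
        import torch
        from torch import nn

        class FrameLSTM(nn.Module):
            def __init__(self, feat_dim=2048, hidden=512, num_classes=101):
                super().__init__()
                self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
                self.fc = nn.Linear(hidden, num_classes)

            def forward(self, feats):              # feats: (batch, time, feat_dim)
                h, _ = self.lstm(feats)
                return self.fc(h)                  # per-frame scores: (batch, time, num_classes)

        def progression_ranking_loss(scores, labels):
            # Penalize any drop in the ground-truth class score between consecutive frames.
            idx = labels.view(-1, 1, 1).expand(-1, scores.size(1), 1)
            gt = scores.gather(2, idx).squeeze(2)          # (batch, time)
            drops = torch.relu(gt[:, :-1] - gt[:, 1:])     # positive where the score decreased
            return drops.mean()

        model = FrameLSTM()
        criterion = nn.CrossEntropyLoss()
        feats = torch.randn(4, 16, 2048)                   # toy batch: 4 clips, 16 frames of CNN features
        labels = torch.randint(0, 101, (4,))
        scores = model(feats)
        loss = criterion(scores[:, -1], labels) + 0.1 * progression_ranking_loss(scores, labels)
        loss.backward()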

    Grounding deep models of visual data

    Deep models are state-of-the-art for many computer vision tasks including object classification, action recognition, and captioning. As Artificial Intelligence systems that utilize deep models are becoming ubiquitous, it is also becoming crucial to explain why they make certain decisions: Grounding model decisions. In this thesis, we study: 1) Improving Model Classification. We show that by utilizing web action images along with videos in training for action recognition, significant performance boosts of convolutional models can be achieved. Without explicit grounding, labeled web action images tend to contain discriminative action poses, which highlight discriminative portions of a video’s temporal progression. 2) Spatial Grounding. We visualize spatial evidence of deep model predictions using a discriminative top-down attention mechanism, called Excitation Backprop. We show how such visualizations are equally informative for correct and incorrect model predictions, and highlight the shift of focus when different training strategies are adopted. 3) Spatial Grounding for Improving Model Classification at Training Time. We propose a guided dropout regularizer for deep networks based on the evidence of a network prediction. This approach penalizes neurons that are most relevant for model prediction. By dropping such high-saliency neurons, the network is forced to learn alternative paths in order to maintain loss minimization. We demonstrate better generalization ability, an increased utilization of network neurons, and a higher resilience to network compression. 4) Spatial Grounding for Improving Model Classification at Test Time. We propose Guided Zoom, an approach that utilizes spatial grounding to make more informed predictions at test time. Guided Zoom compares the evidence used to make a preliminary decision with the evidence of correctly classified training examples to ensure evidence-prediction consistency, and refines the prediction otherwise. We demonstrate accuracy gains for fine-grained classification. 5) Spatiotemporal Grounding. We devise a formulation that simultaneously grounds evidence in space and time, in a single pass, using top-down saliency. We visualize the spatiotemporal cues that contribute to a deep recurrent neural network’s classification/captioning output. Based on these spatiotemporal cues, we are able to localize segments within a video that correspond to a specific action, or a phrase from a caption, without explicitly optimizing/training for these tasks.
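
    The guided dropout idea can be sketched roughly as follows: measure how relevant each channel of an intermediate feature map is to the target class, zero out the most relevant channels, and train on the masked features so the network must find alternative evidence paths. In the sketch below, plain gradient-times-activation saliency stands in for Excitation Backprop, and the backbone, layer choice, and 25% drop ratio are assumptions, not the thesis' settings.

        # Sketch: drop the highest-saliency feature channels before the classifier head.
        import torch
        from torch import nn
        from torchvision import models

        model = models.resnet18(num_classes=101)
        criterion = nn.CrossEntropyLoss()
        images = torch.randn(8, 3, 224, 224)               # toy batch
        labels = torch.randint(0, 101, (8,))

        # Forward pass up to the last convolutional feature map.
        x = model.maxpool(model.relu(model.bn1(model.conv1(images))))
        feats = model.layer4(model.layer3(model.layer2(model.layer1(x))))
        logits = model.fc(torch.flatten(model.avgpool(feats), 1))

        # Per-channel relevance of the ground-truth class (gradient x activation).
        score = logits.gather(1, labels.unsqueeze(1)).sum()
        grads = torch.autograd.grad(score, feats, retain_graph=True)[0]
        saliency = (grads * feats).detach().mean(dim=(2, 3))   # (batch, channels)

        # Zero out the top 25% most relevant channels and train on the masked features.
        k = int(0.25 * saliency.size(1))
        top = saliency.topk(k, dim=1).indices
        mask = torch.ones_like(saliency).scatter_(1, top, 0.0)
        masked = feats * mask[:, :, None, None]
        loss = criterion(model.fc(torch.flatten(model.avgpool(masked), 1)), labels)
        loss.backward()   # gradients flow only through the channels that were kept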

    Privacy aware human action recognition: an exploration of temporal salience modelling and neuromorphic vision sensing

    Preserving privacy in vision-based home monitoring has emerged as a significant demand. State-of-the-art studies protect privacy by filtering or covering the most sensitive content, which in this scenario is identity. Beyond privacy, however, it remains a challenge for a machine to extract useful information, i.e., utility, from the obfuscated data. Insights from the human visual system help address this problem: a high level of visual abstraction can be obtained from a scene by constructing saliency maps that highlight the most useful content and attenuate the rest. One way of maintaining privacy while keeping useful information about the action is to discover the most significant regions and remove the redundancy. Another way to address privacy is motivated by a new visual sensor technology, the neuromorphic vision sensor. In this thesis, we first introduce a novel method for vision-based privacy preservation. In particular, we propose a new temporal salience-based anonymisation method that preserves privacy while maintaining the usefulness of the data in the anonymised domain. This anonymisation method achieves a higher level of privacy than current work. The second contribution is a new descriptor for human action recognition (HAR) that operates in the anonymised domain produced by the temporal salience method. The proposed descriptor tests the utility of the anonymised data without referring to the RGB intensities of the original data. Features extracted with the proposed descriptor improve action recognition accuracy, outperforming state-of-the-art methods by 3.04%, 3.14%, 0.83%, 3.67%, and 16.71% on the DHA, KTH, UIUC1, UCF sports, and HMDB51 datasets, respectively. The third contribution proposes a new method for the neuromorphic vision domain, in which the issue of privacy is already solved by the sensor itself. The output of this domain is exploited by further exploring the local and global details of the log intensity changes. Empirical evaluation shows that exploring the neuromorphic domain provides useful details, increasing accuracy on E-KTH, E-UCF11 and E-HMDB5 by 0.54%, 19.42% and 25.61%, respectively.
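
    To make the anonymisation idea concrete, here is a minimal sketch in which simple frame-difference motion energy stands in for the thesis' temporal salience model: only temporally salient (moving) regions are kept, and static appearance, which carries most identity cues, is suppressed. The smoothing kernel and threshold are illustrative assumptions.

        # Sketch: salience-masked anonymisation of a list of BGR video frames.
        import cv2
        import numpy as np

        def anonymise(frames, thresh=0.2):
            """Keep only temporally salient (moving) regions; zero out static appearance."""
            grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0 for f in frames]
            out = []
            for prev, cur, frame in zip(grays[:-1], grays[1:], frames[1:]):
                motion = cv2.GaussianBlur(np.abs(cur - prev), (21, 21), 0)   # crude temporal salience map
                mask = motion / (motion.max() + 1e-6) > thresh               # binary salience mask
                out.append(frame * mask[..., None].astype(frame.dtype))      # suppress non-salient pixels
            return out

        # Usage: read BGR frames with cv2.VideoCapture and pass them in as a list of arrays.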