60,918 research outputs found

    Spatiotemporal Features and Deep Learning Methods for Video Classification

    Get PDF
    Classification of human actions from real-world video data is one of the most important topics in computer vision and it has been an interesting and challenging research topic in recent decades. It is commonly used in many applications such as video retrieval, video surveillance, human-computer interaction, robotics, and health care. Therefore, robust, fast, and accurate action recognition systems are highly demanded. Deep learning techniques developed for action recognition from the image domain can be extended to the video domain. Nonetheless, deep learning solutions for two-dimensional image data cannot be directly applicable for the video domain because of the larger scale and temporal nature of the video. Specifically, each frame involves spatial information, while the sequence of frames carries temporal information. Therefore, this study focused on both spatial and temporal features, aiming to improve the accuracy of human action recognition from videos by making use of spatiotemporal information. In this thesis, several deep learning architectures were proposed to model both spatial and temporal components. Firstly, a novel deep neural network was developed for video classification by combining Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Secondly, an action template-based keyframe extraction method was proposed and temporal clues between action regions were used to extract more informative keyframes. Thirdly, a novel decision-level fusion rule was proposed to better combine spatial and temporal aspects of videos in two-stream networks. Finally, an extensive investigation was conducted to find out how to combine various information from feature and decision fusion to improve the video classification performance in multi-stream neural networks. Extensive experiments were conducted using the proposed methods and the results highlighted that using both spatial and temporal information is required in video classification architectures and employing temporal information effectively in multi-stream deep neural networks is crucial to improve video classification accuracy

    Fully-Coupled Two-Stream Spatiotemporal Networks for Extremely Low Resolution Action Recognition

    Full text link
    A major emerging challenge is how to protect people's privacy as cameras and computer vision are increasingly integrated into our daily lives, including in smart devices inside homes. A potential solution is to capture and record just the minimum amount of information needed to perform a task of interest. In this paper, we propose a fully-coupled two-stream spatiotemporal architecture for reliable human action recognition on extremely low resolution (e.g., 12x16 pixel) videos. We provide an efficient method to extract spatial and temporal features and to aggregate them into a robust feature representation for an entire action video sequence. We also consider how to incorporate high resolution videos during training in order to build better low resolution action recognition models. We evaluate on two publicly-available datasets, showing significant improvements over the state-of-the-art.Comment: 9 pagers, 5 figures, published in WACV 201
    • …
    corecore