
    Action Capsules: Human Skeleton Action Recognition

    Full text link
    Thanks to the compact and rich high-level representations it offers, skeleton-based human action recognition has recently become a highly active research topic. Previous studies have demonstrated that investigating joint relationships in the spatial and temporal dimensions provides information critical to action recognition. However, effectively encoding global dependencies of joints during spatio-temporal feature extraction remains challenging. In this paper, we introduce the Action Capsule, which identifies action-related key joints by considering the latent correlation of joints in a skeleton sequence. We show that, during inference, our end-to-end network attends to a set of joints specific to each action, whose encoded spatio-temporal features are aggregated to recognize the action. Additionally, the use of multiple stages of action capsules enhances the network's ability to distinguish similar actions. Consequently, our network outperforms state-of-the-art approaches on the N-UCLA dataset and obtains competitive results on the NTU RGB+D dataset, while having significantly lower computational requirements as measured in GFLOPs.
    Comment: 11 pages, 11 figures
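    A minimal, hedged sketch of the joint-attention idea described in this abstract: score each joint from its temporally pooled features, then aggregate the weighted joint features for classification. This is not the authors' code; the module name, shapes, and pooling scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointAttentionPooling(nn.Module):
    """Attend over skeleton joints and aggregate their features (illustrative)."""
    def __init__(self, feat_dim: int):
        super().__init__()
        # scores each joint from its temporally pooled feature
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, joints, feat_dim) -- per-joint spatio-temporal features
        joint_feat = x.mean(dim=1)                            # (B, J, D) temporal pooling
        attn = torch.softmax(self.score(joint_feat), dim=1)   # (B, J, 1) joint weights
        pooled = (joint_feat * attn).sum(dim=1)               # (B, D) action-related joints dominate
        return pooled

if __name__ == "__main__":
    feats = torch.randn(2, 64, 25, 128)        # 64 frames, 25 joints, 128-d features
    print(JointAttentionPooling(128)(feats).shape)  # torch.Size([2, 128])
```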

    Spatio-Temporal Modeling for Action Recognition in Videos

    Get PDF
    Technological innovation in the field of video action recognition drives the development of video-based real-world applications. This PhD thesis provides a new set of machine learning algorithms for processing videos efficiently, leading to outstanding results in human action recognition in videos. First, two video representation extraction methods, Temporal Squeezed Pooling (TSP) and Pixel-Wise Temporal Projection (PWTP), are proposed in order to enhance the discriminative video feature learning abilities of Deep Neural Networks (DNNs). TSP enables spatio-temporal modeling by temporally aggregating the information from long video frame sequences. PWTP is an improved version of TSP, which filters out static appearance while performing information aggregation. Secondly, we discuss how to address the long-term dependency modeling problem of video DNNs. To this end, we develop two spatio-temporal attention mechanisms, Region-based Non-local (RNL) and Convolution Pyramid Attention (CPA). We devise an attention chain by connecting the RNL or CPA module to the Squeeze-and-Excitation (SE) operation, and demonstrate how these attention mechanisms can be embedded into deep networks to alleviate optimization difficulty. Finally, we tackle the problem of heavy computational cost in video models by introducing the concept of busy-quiet video disentangling for exceedingly fast video modeling. We propose the Motion Band-Pass Module (MBPM), embedded into the Busy-Quiet Net (BQN) architecture, to reduce the information redundancy of videos in the spatial and temporal dimensions. The BQN architecture is extremely lightweight while still outperforming heavier models. Extensive experiments for all the proposed methods are provided on multiple video benchmarks, including UCF101, HMDB51, Kinetics400, Something-Something V1
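    The sketch below illustrates the temporal-aggregation idea behind a TSP-like pooling operation as described in the abstract: a long frame sequence is squeezed into a small number of learned "summary" frames via per-frame weights. It is an interpretation rather than the thesis code; the weighting scheme, names, and output size are assumptions.

```python
import torch
import torch.nn as nn

class TemporalSqueeze(nn.Module):
    """Squeeze a long frame sequence into a few summary frames (illustrative)."""
    def __init__(self, channels: int, out_frames: int = 2):
        super().__init__()
        # predicts, for every input frame, its contribution to each output frame
        self.weights = nn.Conv1d(channels, out_frames, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        pooled = x.mean(dim=(3, 4))                             # (B, C, T) spatial GAP per frame
        w_frames = torch.softmax(self.weights(pooled), dim=2)   # (B, out_frames, T)
        # weighted sum over time -> (B, out_frames, C, H, W)
        return torch.einsum('bot,bcthw->bochw', w_frames, x)

if __name__ == "__main__":
    video = torch.randn(1, 64, 32, 14, 14)     # 32 frames of 64-channel features
    print(TemporalSqueeze(64)(video).shape)    # torch.Size([1, 2, 64, 14, 14])
```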

    UntrimmedNets for Weakly Supervised Action Recognition and Detection

    Full text link
    Current action recognition methods rely heavily on trimmed videos for model training. However, acquiring a large-scale trimmed video dataset is expensive and time-consuming. This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to learn action recognition models directly from untrimmed videos without requiring temporal annotations of action instances. Our UntrimmedNet couples two important components, the classification module and the selection module, to learn the action models and to reason about the temporal duration of action instances, respectively. These two components are implemented with feed-forward networks, so UntrimmedNet is an end-to-end trainable architecture. We exploit the learned models for weakly supervised action recognition (WSR) and detection (WSD) on the untrimmed video datasets THUMOS14 and ActivityNet. Although UntrimmedNet only employs weak supervision, it achieves performance superior or comparable to that of strongly supervised approaches on these two datasets.
    Comment: camera-ready version, to appear in CVPR 2017
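    A minimal sketch of the two-branch weak-supervision idea described above: per-clip classification scores combined with soft selection weights to produce a video-level prediction trainable from video-level labels alone. Shapes and names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class UntrimmedHead(nn.Module):
    """Classification + soft-selection branches over clip proposals (illustrative)."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)  # per-clip class scores
        self.selector = nn.Linear(feat_dim, 1)              # per-clip importance

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_clips, feat_dim) from an untrimmed video
        cls = self.classifier(clip_feats)                       # (B, N, C)
        sel = torch.softmax(self.selector(clip_feats), dim=1)   # (B, N, 1) soft selection
        video_scores = (cls * sel).sum(dim=1)                   # (B, C) video-level prediction
        return video_scores  # train with video-level labels only (weak supervision)

if __name__ == "__main__":
    feats = torch.randn(4, 7, 1024)                 # 7 clip proposals per video
    print(UntrimmedHead(1024, 101)(feats).shape)    # torch.Size([4, 101])
```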

    Hierarchical Attention Network for Action Segmentation

    Full text link
    The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in video. Several attempts have been made to capture frame-level salient aspects through attention, but they lack the capacity to effectively map the temporal relationships between frames, as they only capture a limited span of temporal dependencies. To this end, we propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time, thus improving the overall segmentation performance. The proposed hierarchical recurrent attention framework analyses the input video at multiple temporal scales to form embeddings at the frame and segment levels and perform fine-grained action segmentation. This yields a simple, lightweight, yet extremely effective architecture for segmenting continuous video streams, with multiple application domains. We evaluate our system on several challenging public benchmarks, including the MERL Shopping, 50 Salads, and Georgia Tech Egocentric datasets, and achieve state-of-the-art performance. These datasets encompass numerous video capture settings, including static overhead camera views and dynamic, egocentric head-mounted camera views, demonstrating the direct applicability of the proposed framework in a variety of settings.
    Comment: Published in Pattern Recognition Letters
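    The following is a simplified sketch of a two-level recurrent attention hierarchy of the kind described above: frames are encoded within fixed windows, attention-pooled into segment-level embeddings, and a second recurrence models longer-range structure before per-frame labels are predicted. The window size, layer choices, and fusion are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class HierarchicalAttention(nn.Module):
    """Frame-level and segment-level recurrence with windowed attention (illustrative)."""
    def __init__(self, feat_dim: int, hidden: int, num_classes: int, window: int = 16):
        super().__init__()
        self.window = window
        self.frame_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)             # attention over frames in a window
        self.segment_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)   # per-frame action labels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim); assumes time is a multiple of the window size
        b, t, _ = x.shape
        frames, _ = self.frame_rnn(x)                               # (B, T, H)
        win = frames.view(b, t // self.window, self.window, -1)     # (B, S, W, H)
        a = torch.softmax(self.attn(win), dim=2)                    # attention within each window
        segments = (win * a).sum(dim=2)                             # (B, S, H) segment embeddings
        segments, _ = self.segment_rnn(segments)                    # segment-level context
        # broadcast segment context back to frames and classify each frame
        context = segments.repeat_interleave(self.window, dim=1)    # (B, T, H)
        return self.head(frames + context)                          # (B, T, num_classes)

if __name__ == "__main__":
    feats = torch.randn(2, 64, 512)
    print(HierarchicalAttention(512, 256, 10)(feats).shape)   # torch.Size([2, 64, 10])
```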

    Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification

    Full text link
    Recently, substantial research effort has focused on how to apply CNNs or RNNs to better extract temporal patterns from videos, so as to improve the accuracy of video classification. In this paper, however, we show that temporal information, especially longer-term patterns, may not be necessary to achieve competitive results on common video classification datasets. We investigate the potential of purely attention-based local feature integration. Accounting for the characteristics of such features in video classification, we propose a local feature integration framework based on attention clusters, and introduce a shifting operation to capture more diverse signals. We carefully analyze and compare the effect of different attention mechanisms, cluster sizes, and the use of the shifting operation, and also investigate the combination of attention clusters for multimodal integration. We demonstrate the effectiveness of our framework on three real-world video classification datasets, achieving competitive results on all of them. In particular, on the large-scale Kinetics dataset, our framework obtains a single-model accuracy of 79.4% top-1 and 94.0% top-5 on the validation set. Attention clusters are the backbone of our winning solution at the ActivityNet Kinetics Challenge 2017. Code and models will be released soon.
    Comment: The backbone of the winning solution at the ActivityNet Kinetics Challenge 2017
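    A small sketch of purely attention-based local feature integration with several attention "clusters" and a shifting operation, as described in the abstract. The exact form of the shift (a learned scale and offset followed by normalization) is my reading of the approach; parameter names and details should be treated as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCluster(nn.Module):
    """Multiple attention heads over local features with a shifting operation (illustrative)."""
    def __init__(self, feat_dim: int, num_clusters: int = 8):
        super().__init__()
        self.score = nn.Linear(feat_dim, num_clusters)     # one attention map per cluster
        self.alpha = nn.Parameter(torch.ones(num_clusters, 1))
        self.beta = nn.Parameter(torch.zeros(num_clusters, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_local_feats, feat_dim), e.g. frame-level features
        attn = torch.softmax(self.score(x), dim=1)          # (B, N, K)
        clusters = torch.einsum('bnk,bnd->bkd', attn, x)    # (B, K, D) one vector per cluster
        shifted = self.alpha * clusters + self.beta         # shifting operation
        shifted = F.normalize(shifted, dim=-1)              # normalize each cluster output
        return shifted.flatten(1)                           # (B, K*D) video representation

if __name__ == "__main__":
    local_feats = torch.randn(2, 25, 1024)                  # 25 frame features per video
    print(AttentionCluster(1024)(local_feats).shape)        # torch.Size([2, 8192])
```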

    DeepSignals: Predicting Intent of Drivers Through Visual Signals

    Full text link
    Detecting the intention of drivers is an essential task in self-driving, necessary to anticipate sudden events like lane changes and stops. Turn signals and emergency flashers communicate such intentions, providing seconds of potentially critical reaction time. In this paper, we propose to detect these signals in video sequences by using a deep neural network that reasons about both spatial and temporal information. Our experiments on more than a million frames show high per-frame accuracy in very challenging scenarios.
    Comment: To be presented at the IEEE International Conference on Robotics and Automation (ICRA), 2019
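    An illustrative sketch (not the authors' network) of the general recipe in the abstract: a per-frame spatial encoder followed by a recurrent model over time, with per-frame classification of the signal state. The backbone, hidden size, and class set are assumptions.

```python
import torch
import torch.nn as nn

class SignalClassifier(nn.Module):
    """Per-frame CNN features + temporal LSTM for turn-signal state (illustrative)."""
    def __init__(self, hidden: int = 128, num_states: int = 4):
        super().__init__()
        # tiny spatial encoder standing in for a full CNN backbone
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.LSTM(32, hidden, batch_first=True)   # temporal reasoning
        self.head = nn.Linear(hidden, num_states)          # e.g. left / right / flashers / off

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) crops of the observed vehicle
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)   # (B, T, 32)
        out, _ = self.rnn(feats)
        return self.head(out)                                   # (B, T, num_states) per-frame intent

if __name__ == "__main__":
    clip = torch.randn(2, 8, 3, 64, 64)
    print(SignalClassifier()(clip).shape)    # torch.Size([2, 8, 4])
```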