
    Ordered Pooling of Optical Flow Sequences for Action Recognition

    Training Convolutional Neural Networks (CNNs) on long video sequences is computationally expensive due to the substantial memory requirements and the massive number of parameters that deep architectures demand. Early fusion of video frames is thus a standard technique, in which several consecutive frames are first agglomerated into a compact representation and then fed into the CNN as a single input sample. For this purpose, a summarization approach that represents a set of consecutive RGB frames by a single dynamic image capturing pixel dynamics was recently proposed. In this paper, we introduce a novel ordered representation of consecutive optical flow frames as an alternative and argue that this representation captures the action dynamics more effectively than RGB frames. We provide intuitions on why such a representation is better suited for action recognition. We validate our claims on standard benchmark datasets and demonstrate that using summaries of flow images leads to significant improvements over RGB frames while achieving accuracy comparable to the state of the art on the UCF101 and HMDB datasets. Comment: Accepted in WACV 201
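    The summary described above builds on rank pooling of a frame sequence into a single image. As a rough illustration (not the paper's exact formulation), the sketch below computes an approximate rank-pooled summary of a stack of optical flow frames in NumPy, using the commonly cited closed-form coefficients alpha_t = 2t - T - 1; the array shapes and the coefficient choice are assumptions.

```python
# Minimal sketch (assumptions: NumPy arrays, approximate rank-pooling
# coefficients alpha_t = 2t - T - 1 from the dynamic-image approximation).
import numpy as np

def approx_rank_pool(frames):
    """Summarize a sequence of frames into one 'dynamic' image.

    frames: array of shape (T, H, W, C), e.g. T consecutive optical flow
    frames with C=2 (horizontal/vertical flow) or RGB frames with C=3.
    Returns a single (H, W, C) image whose pixels encode the temporal
    evolution of the sequence via approximate rank pooling.
    """
    T = frames.shape[0]
    t = np.arange(1, T + 1)
    # Later frames get positive weight, earlier frames negative weight.
    alpha = 2 * t - T - 1
    return np.tensordot(alpha, frames.astype(np.float32), axes=(0, 0))

# Usage: summarize 10 flow frames into one image before feeding a CNN.
flow_stack = np.random.randn(10, 224, 224, 2).astype(np.float32)
dynamic_flow = approx_rank_pool(flow_stack)   # shape (224, 224, 2)
```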

    Action recognition from RGB-D data

    In recent years, action recognition based on RGB-D data has attracted increasing attention. Unlike traditional 2D action recognition, RGB-D data contains extra depth and skeleton modalities, each with its own characteristics. This thesis presents seven novel methods that take advantage of the three modalities for action recognition. First, effective handcrafted features are designed and a frequent pattern mining method is employed to mine the most discriminative, representative and non-redundant features for skeleton-based action recognition. Second, to take advantage of powerful Convolutional Neural Networks (ConvNets), it is proposed to represent the spatio-temporal information carried in 3D skeleton sequences as three 2D images, by encoding the joint trajectories and their dynamics into the color distribution of the images, and ConvNets are adopted to learn discriminative features for human action recognition. Third, for depth-based action recognition, three data augmentation strategies are proposed so that ConvNets can be applied to small training datasets. Fourth, to take full advantage of the 3D structural information offered by the depth modality and its insensitivity to illumination variations, three simple, compact yet effective image-based representations are proposed and ConvNets are adopted for feature extraction and classification. However, both of the previous two methods are sensitive to noise and cannot differentiate fine-grained actions well. Fifth, to address this issue, it is proposed to represent a depth map sequence as three pairs of structured dynamic images at the body, part and joint levels through bidirectional rank pooling. The structured dynamic images preserve spatio-temporal information, enhance structural information across both body parts/joints and different temporal scales, and take advantage of ConvNets for action recognition. Sixth, it is proposed to extract and use scene flow for action recognition from RGB and depth data. Last, to exploit the joint information in multi-modal features arising from heterogeneous sources (RGB, depth), it is proposed to cooperatively train a single ConvNet (referred to as c-ConvNet) on both RGB and depth features, and to deeply aggregate the two modalities to achieve robust action recognition
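    One step mentioned above, encoding 3D joint trajectories into color images so a ConvNet can consume them, can be pictured with a generic mapping. The sketch below is a simplification under stated assumptions (frames along the width, joints along the height, normalized coordinates as color channels), not the thesis's actual encoding.

```python
# Minimal sketch (assumption: a generic joints-to-color-image mapping,
# not the thesis's exact encoding).
import numpy as np

def skeleton_to_image(seq):
    """seq: (T, J, 3) array of 3D joint positions over T frames.
    Returns a uint8 (J, T, 3) pseudo-color image a ConvNet can consume."""
    seq = seq.astype(np.float32)
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    norm = (seq - lo) / np.maximum(hi - lo, 1e-6)   # scale each axis to [0, 1]
    img = (norm * 255).astype(np.uint8)             # (T, J, 3)
    return img.transpose(1, 0, 2)                   # joints x time x RGB

skeleton_clip = np.random.rand(64, 25, 3)           # e.g. 64 frames, 25 joints (illustrative)
image = skeleton_to_image(skeleton_clip)            # shape (25, 64, 3)
```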

    Skeletal Human Action Recognition using Hybrid Attention based Graph Convolutional Network

    In skeleton-based action recognition, Graph Convolutional Networks model human skeletal joints as vertices and connect them through an adjacency matrix, which can be seen as a local attention mask. However, in most existing Graph Convolutional Networks the local attention mask is defined from the natural connections of the human skeleton and ignores dynamic relations, for example between the head, hand and foot joints. In addition, attention mechanisms have proven effective in Natural Language Processing and image description but are rarely investigated in existing skeleton-based methods. In this work, we propose a new adaptive spatial attention layer that extends the local attention map to a global one based on relative distance and relative angle information. Moreover, we design a new initial graph adjacency matrix that connects the head, hands and feet, which yields a visible improvement in action recognition accuracy. The proposed model is evaluated on two large-scale and challenging datasets of daily human activities: NTU-RGB+D and Kinetics skeleton. The results demonstrate that our model performs strongly on both datasets. Comment: 26th International Conference on Pattern Recognition, 202
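    A minimal illustration of the adjacency-matrix idea: start from the natural bone connections and add extra edges linking the head, hands and feet. The joint indices and bone list below are hypothetical placeholders, not the paper's actual graph.

```python
# Minimal sketch (assumption: the joint indices and bone edges here are
# illustrative; the point is augmenting natural bones with head/hand/foot links).
import numpy as np

def build_adjacency(num_joints, bone_edges, extra_edges):
    """Symmetric, self-loop-augmented adjacency used as a local attention mask."""
    A = np.eye(num_joints, dtype=np.float32)
    for i, j in list(bone_edges) + list(extra_edges):
        A[i, j] = A[j, i] = 1.0
    # Row-normalize so the mask behaves like averaged neighborhood attention.
    return A / A.sum(axis=1, keepdims=True)

bones = [(0, 1), (1, 20), (20, 2), (2, 3)]          # a few natural connections (illustrative)
head, l_hand, r_hand, l_foot, r_foot = 3, 7, 11, 15, 19
extras = [(head, l_hand), (head, r_hand), (l_hand, r_hand),
          (l_foot, r_foot), (l_hand, l_foot), (r_hand, r_foot)]
A = build_adjacency(25, bones, extras)              # 25-joint skeleton assumed
```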

    Describing Videos by Exploiting Temporal Structure

    Recent progress in using recurrent neural networks (RNNs) for image description has motivated the exploration of their application to video description. However, while images are static, working with videos requires modeling their dynamic temporal structure and properly integrating that information into a natural language description. In this context, we propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions. First, our approach incorporates a spatio-temporal 3-D convolutional neural network (3-D CNN) representation of short temporal dynamics. The 3-D CNN representation is trained on video action recognition tasks so as to produce a representation tuned to human motion and behavior. Second, we propose a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN. Our approach exceeds the current state of the art on both the BLEU and METEOR metrics on the Youtube2Text dataset. We also present results on a new, larger and more challenging dataset of paired videos and natural language descriptions. Comment: Accepted to ICCV15. This version comes with code release and supplementary material
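    The temporal attention mechanism can be pictured as a soft, learned weighting over per-segment video features conditioned on the decoder state. The sketch below uses a generic additive-attention form with randomly initialized projections; the exact parameterization in the paper may differ, and the dimensions are assumptions.

```python
# Minimal sketch (assumption: a generic additive soft-attention form,
# not the paper's exact parameterization).
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def temporal_attention(features, h, W_f, W_h, w):
    """features: (T, D) per-segment video features; h: (H,) decoder state.
    W_f: (K, D), W_h: (K, H), w: (K,) are learned projections (random here).
    Returns the attention-weighted context vector for one decoding step."""
    scores = np.tanh(features @ W_f.T + h @ W_h.T) @ w   # (T,) relevance scores
    alpha = softmax(scores)                              # normalize over time
    return alpha @ features, alpha                       # context (D,), weights (T,)

T, D, H, K = 26, 1024, 512, 128
ctx, alpha = temporal_attention(np.random.randn(T, D), np.random.randn(H),
                                np.random.randn(K, D), np.random.randn(K, H),
                                np.random.randn(K))
```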

    Deep recurrent spiking neural networks capture both static and dynamic representations of the visual cortex under movie stimuli

    In the real world, visual stimuli received by the biological visual system are predominantly dynamic rather than static. A better understanding of how the visual cortex represents movie stimuli could provide deeper insight into the information-processing mechanisms of the visual system. Although some progress has been made in modeling neural responses to natural movies with deep neural networks, the visual representations of static and dynamic information under such time-series visual stimuli remain to be explored further. In this work, motivated by the abundant recurrent connections in the mouse visual system, we design a recurrent module based on the hierarchy of the mouse cortex and add it to Deep Spiking Neural Networks (SNNs), which have been demonstrated to be a more compelling computational model of the visual cortex. Using Time-Series Representational Similarity Analysis, we measure the representational similarity between networks and mouse cortical regions under natural movie stimuli. We then compare the representational similarity across recurrent/feedforward networks and image/video training tasks. Trained on the video action recognition task, the recurrent SNN achieves the highest representational similarity, significantly outperforming the feedforward SNN trained on the same task by 15% and the recurrent SNN trained on the image classification task by 8%. We investigate how the static and dynamic representations of SNNs influence this similarity, as a way to explain the importance of these two forms of representation in biological neural coding. Taken together, our work is the first to apply deep recurrent SNNs to model the mouse visual cortex under movie stimuli, and we establish that these networks can capture both static and dynamic representations, contributing to the understanding of how the visual cortex processes movie information
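    Representational Similarity Analysis itself follows a standard recipe: build a representational dissimilarity matrix (RDM) for each system and correlate the two. The sketch below shows that recipe with correlation-distance RDMs and a Spearman comparison; the time-series variant used in the paper would repeat this per time bin, and the data shapes here are made up.

```python
# Minimal sketch (assumption: a standard RSA recipe with correlation-distance
# RDMs and Spearman comparison; not the paper's exact time-series protocol).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(responses):
    """responses: (n_stimuli, n_units). Returns the condensed RDM
    (pairwise 1 - Pearson correlation between stimulus response patterns)."""
    return pdist(responses, metric="correlation")

def representational_similarity(model_resp, brain_resp):
    """Spearman correlation between the model RDM and the cortical RDM."""
    rho, _ = spearmanr(rdm(model_resp), rdm(brain_resp))
    return rho

model = np.random.randn(100, 256)    # 100 movie frames x 256 model units (illustrative)
cortex = np.random.randn(100, 80)    # same frames x 80 recorded neurons (illustrative)
print(representational_similarity(model, cortex))
```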

    Im2Flow: Motion Hallucination from Static Images for Action Recognition

    Existing methods for recognizing actions in static images take the images at face value, learning the appearances (objects, scenes, and body poses) that distinguish each action class. However, such models are deprived of the rich dynamic structure and motion that also define human activity. We propose an approach that hallucinates the unobserved future motion implied by a single snapshot to help static-image action recognition. The key idea is to learn a prior over short-term dynamics from thousands of unlabeled videos, infer the anticipated optical flow on novel static images, and then train discriminative models that exploit both streams of information. Our main contributions are twofold. First, we devise an encoder-decoder convolutional neural network and a novel optical flow encoding that can translate a static image into an accurate flow map. Second, we show the power of hallucinated flow for recognition, successfully transferring the learned motion into a standard two-stream network for activity recognition. On seven datasets, we demonstrate the effectiveness of the approach: it not only achieves state-of-the-art accuracy for dense optical flow prediction, but also consistently enhances recognition of actions and dynamic scenes. Comment: Published in CVPR 2018, project page: http://vision.cs.utexas.edu/projects/im2flow
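    Once flow has been hallucinated for a static image, the recognition side reduces to a standard two-stream scheme. The sketch below shows only a generic late-fusion step over the two streams' class scores (the fusion weight, class count and softmax fusion are assumptions), not the Im2Flow networks themselves.

```python
# Minimal sketch (assumption: simple late fusion of softmax scores from an
# appearance stream and a motion stream fed with predicted flow).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def two_stream_predict(rgb_logits, flow_logits, w_flow=0.5):
    """Fuse per-class scores from the two streams and return predicted labels."""
    probs = (1 - w_flow) * softmax(rgb_logits) + w_flow * softmax(flow_logits)
    return probs.argmax(axis=-1)

rgb_logits = np.random.randn(4, 101)    # 4 images, 101 action classes (e.g. UCF101)
flow_logits = np.random.randn(4, 101)   # scores from the stream fed hallucinated flow
print(two_stream_predict(rgb_logits, flow_logits))
```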