72 research outputs found

    A Closer Look at Spatiotemporal Convolutions for Action Recognition

    Full text link
    In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significantly advantages in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which gives rise to CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101 and HMDB51

    Simple yet efficient real-time pose-based action recognition

    Full text link
    Recognizing human actions is a core challenge for autonomous systems as they directly share the same space with humans. Systems must be able to recognize and assess human actions in real-time. In order to train corresponding data-driven algorithms, a significant amount of annotated training data is required. We demonstrated a pipeline to detect humans, estimate their pose, track them over time and recognize their actions in real-time with standard monocular camera sensors. For action recognition, we encode the human pose into a new data format called Encoded Human Pose Image (EHPI) that can then be classified using standard methods from the computer vision community. With this simple procedure we achieve competitive state-of-the-art performance in pose-based action detection and can ensure real-time performance. In addition, we show a use case in the context of autonomous driving to demonstrate how such a system can be trained to recognize human actions using simulation data.Comment: Submitted to IEEE Intelligent Transportation Systems Conference (ITSC) 2019. Code will be available soon at https://github.com/noboevbo/ehpi_action_recognitio

    Neuromorphic Vision Sensing for CNN-based Action Recognition

    Get PDF
    Neuromorphic vision sensing (NVS) hardware is now gaining traction as a low-power/high-speed visual sensing technology that circumvents the limitations of conventional active pixel sensing (APS) cameras. While object detection and tracking models have been investigated in conjunction with NVS, there is currently little work on NVS for higher-level semantic tasks, such as action recognition. Contrary to recent work that considers homogeneous transfer between flow domains (optical flow to motion vectors), we propose to embed an NVS emulator into a multi-modal transfer learning framework that carries out heterogeneous transfer from optical flow to NVS. The potential of our framework is showcased by the fact that, for the first time, our NVS-based results achieve comparable action recognition performance to motion-vector or optical-flow based methods (i.e., accuracy on UCF-101 within 8.8% of I3D with optical flow), with the NVS emulator and NVS camera hardware offering 3 to 6 orders of magnitude faster frame generation (respectively) compared to standard Brox optical flow. Beyond this significant advantage, our CNN processing is found to have the lowest total GFLOP count against all competing methods (up to 7.7 times complexity saving compared to I3D with optical flow)

    Detecting Dairy Cow Behavior Using Vision Technology

    Get PDF
    The aim of this study was to investigate using existing image recognition techniques to predict the behavior of dairy cows. A total of 46 individual dairy cows were monitored continuously under 24 h video surveillance prior to calving. The video was annotated for the behaviors of standing, lying, walking, shuffling, eating, drinking and contractions for each cow from 10 h prior to calving. A total of 19,191 behavior records were obtained and a non-local neural network was trained and validated on video clips of each behavior. This study showed that the non-local network used correctly classified the seven behaviors 80% or more of the time in the validated dataset. In particular, the detection of birth contractions was correctly predicted 83% of the time, which in itself can be an early warning calving alert, as all cows start contractions several hours prior to giving birth. This approach to behavior recognition using video cameras can assist livestock management

    Fairer Evaluation of Zero Shot Action Recognition in Videos

    Get PDF
    Zero-shot learning (ZSL) for human action recognition (HAR) aims to recognise video action classes that have never been seen during model training. This is achieved by building mappings between visual and semantic embeddings. These visual embeddings are typically provided via a pre-trained deep neural network (DNN). The premise of ZSL is that the training and testing classes should be disjoint. In the parallel domain of ZSL for image input, the widespread poor evaluation protocol of pre-training on ZSL test classes has been highlighted. This is akin to providing a sneak preview of the evaluation classes. In this work, we investigate the extent to which this evaluation protocol has been used in ZSL for human action recognition research work. We show that in the field of ZSL for HAR, accuracies for overlapping classes are being boosted by between 5.75% to 51.94% depending on the use of visual and semantic features as a result of this flawed evaluation protocol. To assist other research ers in avoiding this problem in the future, we provide annotated versions of the relevant benchmark ZSL test datasets in the HAR field: UCF101 and HMDB51 datasets - highlighting overlaps to pre-training datasets in the field

    Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features

    Full text link
    This study investigates unsupervised anomaly action recognition, which identifies video-level abnormal-human-behavior events in an unsupervised manner without abnormal samples, and simultaneously addresses three limitations in the conventional skeleton-based approaches: target domain-dependent DNN training, robustness against skeleton errors, and a lack of normal samples. We present a unified, user prompt-guided zero-shot learning framework using a target domain-independent skeleton feature extractor, which is pretrained on a large-scale action recognition dataset. Particularly, during the training phase using normal samples, the method models the distribution of skeleton features of the normal actions while freezing the weights of the DNNs and estimates the anomaly score using this distribution in the inference phase. Additionally, to increase robustness against skeleton errors, we introduce a DNN architecture inspired by a point cloud deep learning paradigm, which sparsely propagates the features between joints. Furthermore, to prevent the unobserved normal actions from being misidentified as abnormal actions, we incorporate a similarity score between the user prompt embeddings and skeleton features aligned in the common space into the anomaly score, which indirectly supplements normal actions. On two publicly available datasets, we conduct experiments to test the effectiveness of the proposed method with respect to abovementioned limitations.Comment: CVPR 202

    Hidden Two-Stream Convolutional Networks for Action Recognition

    Full text link
    Analyzing videos of human actions involves understanding the temporal relationships among video frames. State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs. Such a two-stage approach is computationally expensive, storage demanding, and not end-to-end trainable. In this paper, we present a novel CNN architecture that implicitly captures motion information between adjacent frames. We name our approach hidden two-stream CNNs because it only takes raw video frames as input and directly predicts action classes without explicitly computing optical flow. Our end-to-end approach is 10x faster than its two-stage baseline. Experimental results on four challenging action recognition datasets: UCF101, HMDB51, THUMOS14 and ActivityNet v1.2 show that our approach significantly outperforms the previous best real-time approaches.Comment: Accepted at ACCV 2018, camera ready. Code available at https://github.com/bryanyzhu/Hidden-Two-Strea

    Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling

    Full text link
    This paper simultaneously addresses three limitations associated with conventional skeleton-based action recognition; skeleton detection and tracking errors, poor variety of the targeted actions, as well as person-wise and frame-wise action recognition. A point cloud deep-learning paradigm is introduced to the action recognition, and a unified framework along with a novel deep neural network architecture called Structured Keypoint Pooling is proposed. The proposed method sparsely aggregates keypoint features in a cascaded manner based on prior knowledge of the data structure (which is inherent in skeletons), such as the instances and frames to which each keypoint belongs, and achieves robustness against input errors. Its less constrained and tracking-free architecture enables time-series keypoints consisting of human skeletons and nonhuman object contours to be efficiently treated as an input 3D point cloud and extends the variety of the targeted action. Furthermore, we propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This trick switches the pooling kernels between the training and inference phases to detect person-wise and frame-wise actions in a weakly supervised manner using only video-level action labels. This trick enables our training scheme to naturally introduce novel data augmentation, which mixes multiple point clouds extracted from different videos. In the experiments, we comprehensively verify the effectiveness of the proposed method against the limitations, and the method outperforms state-of-the-art skeleton-based action recognition and spatio-temporal action localization methods.Comment: CVPR 202

    Lane Change Classification and Prediction with Action Recognition Networks

    Full text link
    Anticipating lane change intentions of surrounding vehicles is crucial for efficient and safe driving decision making in an autonomous driving system. Previous works often adopt physical variables such as driving speed, acceleration and so forth for lane change classification. However, physical variables do not contain semantic information. Although 3D CNNs have been developing rapidly, the number of methods utilising action recognition models and appearance feature for lane change recognition is low, and they all require additional information to pre-process data. In this work, we propose an end-to-end framework including two action recognition methods for lane change recognition, using video data collected by cameras. Our method achieves the best lane change classification results using only the RGB video data of the PREVENTION dataset. Class activation maps demonstrate that action recognition models can efficiently extract lane change motions. A method to better extract motion clues is also proposed in this paper.Comment: Accepted by ECC

    In Defense of Image Pre-Training for Spatiotemporal Recognition

    Full text link
    Image pre-training, the current de-facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to directly train with spatiotemporal convolutional neural networks (CNNs) from scratch. Nonetheless, interestingly, by taking a closer look at these from-scratch learned CNNs, we note there exist certain 3D kernels that exhibit much stronger appearance modeling ability than others, arguably suggesting appearance information is already well disentangled in learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in the decomposition of learning spatial and temporal features, and revisiting image pre-training as the appearance prior to initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to further enable a more thorough decomposition of spatiotemporal features for fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs without increasing parameters and computation on both Kinetics-400 and Something-Something V2. Moreover, this new training pipeline consistently achieves better results on video recognition with significant speedup. For instance, we achieve +0.6% top-1 of Slowfast on Kinetics-400 over the strong 256-epoch 128-GPU baseline while fine-tuning for only 50 epochs with 4 GPUs. The code and models are available at https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.Comment: Published as a conference paper at ECCV 202
    corecore