A Closer Look at Spatiotemporal Convolutions for Action Recognition
In this paper we discuss several forms of spatiotemporal convolutions for
video analysis and study their effects on action recognition. Our motivation
stems from the observation that 2D CNNs applied to individual frames of the
video have remained solid performers in action recognition. In this work we
empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within
the framework of residual learning. Furthermore, we show that factorizing the
3D convolutional filters into separate spatial and temporal components yields
significant advantages in accuracy. Our empirical study leads to the design
of a new spatiotemporal convolutional block "R(2+1)D" which gives rise to CNNs
that achieve results comparable or superior to the state-of-the-art on
Sports-1M, Kinetics, UCF101, and HMDB51.
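As a rough illustration of the factorization described in this abstract, the following PyTorch sketch replaces a full 3D convolution with a 2D spatial convolution followed by a 1D temporal convolution; the intermediate channel width and kernel sizes here are simplifying assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class R2Plus1DBlock(nn.Module):
    """A (2+1)D convolution: a spatial 2D conv followed by a temporal 1D conv.

    Minimal sketch of the factorization; the intermediate width is a free
    choice here, not the paper's exact parameterization.
    """
    def __init__(self, in_channels, out_channels, mid_channels=64):
        super().__init__()
        # Spatial convolution over (H, W) only: kernel (1, 3, 3).
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal convolution over T only: kernel (3, 1, 1).
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        x = self.relu(self.spatial(x))
        return self.relu(self.temporal(x))

# Example: an 8-frame, 112x112 RGB clip.
clip = torch.randn(2, 3, 8, 112, 112)
out = R2Plus1DBlock(3, 64)(clip)   # -> (2, 64, 8, 112, 112)
```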
Simple yet efficient real-time pose-based action recognition
Recognizing human actions is a core challenge for autonomous systems as they
directly share the same space with humans. Systems must be able to recognize
and assess human actions in real-time. In order to train corresponding
data-driven algorithms, a significant amount of annotated training data is
required. We demonstrate a pipeline to detect humans, estimate their pose,
track them over time and recognize their actions in real-time with standard
monocular camera sensors. For action recognition, we encode the human pose into
a new data format called Encoded Human Pose Image (EHPI) that can then be
classified using standard methods from the computer vision community. With this
simple procedure we achieve performance competitive with the state of the art in
pose-based action detection while ensuring real-time operation. In addition,
we show a use case in the context of autonomous driving to demonstrate how such
a system can be trained to recognize human actions using simulation data.
Comment: Submitted to IEEE Intelligent Transportation Systems Conference
(ITSC) 2019. Code will be available soon at
https://github.com/noboevbo/ehpi_action_recognitio
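The precise EHPI layout is defined in the paper and its repository; purely to illustrate the general idea of encoding a pose sequence as an image that standard image classifiers can consume, a hypothetical encoding might look like the sketch below (joints as rows, frames as columns, normalized x/y coordinates in the color channels).

```python
import numpy as np

def encode_pose_sequence(poses):
    """Hypothetical pose-to-image encoding in the spirit of EHPI.

    poses: array of shape (num_frames, num_joints, 2) with pixel coordinates.
    Returns a uint8 image of shape (num_joints, num_frames, 3) where the first
    two channels hold x and y normalized to [0, 255]. The actual EHPI format
    differs in detail; this only illustrates the idea.
    """
    poses = np.asarray(poses, dtype=np.float32)
    mins = poses.min(axis=(0, 1), keepdims=True)
    maxs = poses.max(axis=(0, 1), keepdims=True)
    norm = (poses - mins) / np.maximum(maxs - mins, 1e-6)   # -> [0, 1]
    img = np.zeros((poses.shape[1], poses.shape[0], 3), dtype=np.uint8)
    img[..., 0] = (norm[..., 0] * 255).T.astype(np.uint8)   # x coordinates
    img[..., 1] = (norm[..., 1] * 255).T.astype(np.uint8)   # y coordinates
    return img

# 32 frames of a 15-joint skeleton -> a 15x32 RGB "pose image".
image = encode_pose_sequence(np.random.rand(32, 15, 2) * 640)
```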
Neuromorphic Vision Sensing for CNN-based Action Recognition
Neuromorphic vision sensing (NVS) hardware is now gaining traction as a low-power/high-speed visual sensing technology that circumvents the limitations of conventional active pixel sensing (APS) cameras. While object detection and tracking models have been investigated in conjunction with NVS, there is currently little work on NVS for higher-level semantic tasks, such as action recognition. Contrary to recent work that considers homogeneous transfer between flow domains (optical flow to motion vectors), we propose to embed an NVS emulator into a multi-modal transfer learning framework that carries out heterogeneous transfer from optical flow to NVS. The potential of our framework is showcased by the fact that, for the first time, our NVS-based results achieve comparable action recognition performance to motion-vector or optical-flow based methods (i.e., accuracy on UCF-101 within 8.8% of I3D with optical flow), with the NVS emulator and NVS camera hardware offering 3 to 6 orders of magnitude faster frame generation (respectively) compared to standard Brox optical flow. Beyond this significant advantage, our CNN processing is found to have the lowest total GFLOP count against all competing methods (up to 7.7 times complexity saving compared to I3D with optical flow)
Detecting Dairy Cow Behavior Using Vision Technology
The aim of this study was to investigate using existing image recognition techniques to predict the behavior of dairy cows. A total of 46 individual dairy cows were monitored continuously under 24 h video surveillance prior to calving. The video was annotated for the behaviors of standing, lying, walking, shuffling, eating, drinking and contractions for each cow from 10 h prior to calving. A total of 19,191 behavior records were obtained and a non-local neural network was trained and validated on video clips of each behavior. This study showed that the non-local network correctly classified the seven behaviors 80% or more of the time on the validation dataset. In particular, birth contractions were correctly predicted 83% of the time, which in itself can serve as an early warning calving alert, as all cows start contractions several hours prior to giving birth. This approach to behavior recognition using video cameras can assist livestock management.
Fairer Evaluation of Zero Shot Action Recognition in Videos
Zero-shot learning (ZSL) for human action recognition (HAR) aims to recognise video action classes that have never been seen during model training. This is achieved by building mappings between visual and semantic embeddings. These visual embeddings are typically provided via a pre-trained deep neural network (DNN). The premise of ZSL is that the training and testing classes should be disjoint. In the parallel domain of ZSL for image input, the widespread poor evaluation protocol of pre-training on ZSL test classes has been highlighted. This is akin to providing a sneak preview of the evaluation classes. In this work, we investigate the extent to which this evaluation protocol has been used in ZSL for human action recognition research. We show that in the field of ZSL for HAR, accuracies for overlapping classes are being boosted by between 5.75% and 51.94%, depending on the use of visual and semantic features, as a result of this flawed evaluation protocol. To assist other researchers in avoiding this problem in the future, we provide annotated versions of the relevant benchmark ZSL test datasets in the HAR field, UCF101 and HMDB51, highlighting overlaps with pre-training datasets in the field.
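The fairer protocol advocated above reduces to ensuring that ZSL test classes are disjoint from the classes seen during pre-training; a minimal check of that condition, with placeholder class lists rather than the paper's released annotations, could look like this:

```python
# Minimal sketch of the disjointness check behind a fair ZSL evaluation.
# The class lists here are hypothetical placeholders.
pretraining_classes = {"archery", "playing guitar", "biking"}
zsl_test_classes = {"archery", "surfing", "juggling"}

overlap = pretraining_classes & zsl_test_classes
if overlap:
    print(f"Unfair split: {len(overlap)} test classes seen in pre-training: {sorted(overlap)}")
else:
    print("Fair split: test classes are disjoint from pre-training classes.")
```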
Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features
This study investigates unsupervised anomaly action recognition, which
identifies video-level abnormal-human-behavior events in an unsupervised manner
without abnormal samples, and simultaneously addresses three limitations in the
conventional skeleton-based approaches: target domain-dependent DNN training,
robustness against skeleton errors, and a lack of normal samples. We present a
unified, user prompt-guided zero-shot learning framework using a target
domain-independent skeleton feature extractor, which is pretrained on a
large-scale action recognition dataset. Specifically, during the training phase
using normal samples, the method models the distribution of skeleton features
of the normal actions while freezing the weights of the DNNs and estimates the
anomaly score using this distribution in the inference phase. Additionally, to
increase robustness against skeleton errors, we introduce a DNN architecture
inspired by a point cloud deep learning paradigm, which sparsely propagates the
features between joints. Furthermore, to prevent the unobserved normal actions
from being misidentified as abnormal actions, we incorporate a similarity score
between the user prompt embeddings and skeleton features aligned in the common
space into the anomaly score, which indirectly supplements normal actions. On
two publicly available datasets, we conduct experiments to test the
effectiveness of the proposed method with respect to the abovementioned
limitations.
Comment: CVPR 202
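One way to sketch the scoring step described above (not the authors' exact formulation) is to fit a Gaussian to the skeleton features of normal clips, use the Mahalanobis distance as the base anomaly score, and subtract a similarity term to user-prompt embeddings aligned in the same space; the function names and the combination weight below are assumptions.

```python
import numpy as np

def fit_normal_distribution(normal_features):
    """Fit mean/inverse-covariance of skeleton features from normal clips."""
    mean = normal_features.mean(axis=0)
    cov = np.cov(normal_features, rowvar=False) + 1e-6 * np.eye(normal_features.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_score(feature, mean, cov_inv, prompt_embeddings, weight=1.0):
    """Hypothetical score: Mahalanobis distance to the normal distribution
    minus the best cosine similarity to the user prompt embeddings."""
    diff = feature - mean
    mahalanobis = float(np.sqrt(diff @ cov_inv @ diff))
    sims = prompt_embeddings @ feature / (
        np.linalg.norm(prompt_embeddings, axis=1) * np.linalg.norm(feature) + 1e-8)
    return mahalanobis - weight * float(sims.max())

# Toy usage with random 128-d features and 5 prompt embeddings.
rng = np.random.default_rng(0)
mean, cov_inv = fit_normal_distribution(rng.normal(size=(500, 128)))
score = anomaly_score(rng.normal(size=128), mean, cov_inv, rng.normal(size=(5, 128)))
```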
Hidden Two-Stream Convolutional Networks for Action Recognition
Analyzing videos of human actions involves understanding the temporal
relationships among video frames. State-of-the-art action recognition
approaches rely on traditional optical flow estimation methods to pre-compute
motion information for CNNs. Such a two-stage approach is computationally
expensive, storage demanding, and not end-to-end trainable. In this paper, we
present a novel CNN architecture that implicitly captures motion information
between adjacent frames. We name our approach hidden two-stream CNNs because it
only takes raw video frames as input and directly predicts action classes
without explicitly computing optical flow. Our end-to-end approach is 10x
faster than its two-stage baseline. Experimental results on four challenging
action recognition datasets: UCF101, HMDB51, THUMOS14 and ActivityNet v1.2 show
that our approach significantly outperforms the previous best real-time
approaches.
Comment: Accepted at ACCV 2018, camera ready. Code available at
https://github.com/bryanyzhu/Hidden-Two-Strea
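At a high level, the idea is that a small sub-network learns flow-like motion features directly from stacked raw frames, so the temporal stream no longer needs pre-computed optical flow; the wiring below only illustrates that data path and is not the published MotionNet architecture (layer sizes are arbitrary assumptions).

```python
import torch
import torch.nn as nn

class HiddenTwoStreamSketch(nn.Module):
    """Illustrative wiring only: a motion sub-network consumes stacked raw
    frames and its output feeds an action classifier, so no optical flow is
    pre-computed. Layer sizes are arbitrary, not the published architecture."""
    def __init__(self, num_frames=11, num_classes=101):
        super().__init__()
        # Predict a flow-like 2-channel map per adjacent frame pair.
        self.motion_net = nn.Sequential(
            nn.Conv2d(3 * num_frames, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2 * (num_frames - 1), 3, padding=1))
        # Classify actions from the learned motion representation.
        self.temporal_stream = nn.Sequential(
            nn.Conv2d(2 * (num_frames - 1), 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

    def forward(self, frames):
        # frames: (batch, 3 * num_frames, H, W) stacked RGB frames.
        motion = self.motion_net(frames)
        return self.temporal_stream(motion)

logits = HiddenTwoStreamSketch()(torch.randn(2, 33, 112, 112))  # -> (2, 101)
```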
Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling
This paper simultaneously addresses three limitations associated with
conventional skeleton-based action recognition; skeleton detection and tracking
errors, poor variety of the targeted actions, as well as person-wise and
frame-wise action recognition. A point cloud deep-learning paradigm is
introduced to action recognition, and a unified framework along with a
novel deep neural network architecture called Structured Keypoint Pooling is
proposed. The proposed method sparsely aggregates keypoint features in a
cascaded manner based on prior knowledge of the data structure (which is
inherent in skeletons), such as the instances and frames to which each keypoint
belongs, and achieves robustness against input errors. Its less constrained and
tracking-free architecture enables time-series keypoints consisting of human
skeletons and nonhuman object contours to be efficiently treated as an input 3D
point cloud and extends the variety of the targeted actions. Furthermore, we
propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This
trick switches the pooling kernels between the training and inference phases to
detect person-wise and frame-wise actions in a weakly supervised manner using
only video-level action labels. This trick enables our training scheme to
naturally introduce novel data augmentation, which mixes multiple point clouds
extracted from different videos. In the experiments, we comprehensively verify
the effectiveness of the proposed method against the limitations, and the
method outperforms state-of-the-art skeleton-based action recognition and
spatio-temporal action localization methods.
Comment: CVPR 202
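As a loose illustration of sparsely aggregating keypoint features according to the frames and instances they belong to (not the published Structured Keypoint Pooling operator), one can max-pool per-keypoint features within each group, PointNet-style:

```python
import torch

def grouped_max_pool(features, group_ids):
    """Max-pool keypoint features that share the same group id (e.g. the frame
    or person instance a keypoint belongs to). Assumes every id from 0 to
    max(group_ids) occurs at least once. A sketch only, not the paper's
    cascaded pooling kernels."""
    num_groups = int(group_ids.max().item()) + 1
    return torch.stack([features[group_ids == g].max(dim=0).values
                        for g in range(num_groups)])

# Toy example: 6 keypoints spread over 3 frames, 4-d features each.
feats = torch.randn(6, 4)
frame_ids = torch.tensor([0, 0, 1, 1, 2, 2])
per_frame = grouped_max_pool(feats, frame_ids)   # (3, 4) frame-level features
clip_feature = per_frame.max(dim=0).values       # (4,) clip-level feature
```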
Lane Change Classification and Prediction with Action Recognition Networks
Anticipating lane change intentions of surrounding vehicles is crucial for
efficient and safe driving decision making in an autonomous driving system.
Previous works often adopt physical variables such as driving speed,
acceleration and so forth for lane change classification. However, physical
variables do not contain semantic information. Although 3D CNNs have been
developing rapidly, the number of methods utilising action recognition models
and appearance features for lane change recognition is low, and they all require
additional information to pre-process data. In this work, we propose an
end-to-end framework including two action recognition methods for lane change
recognition, using video data collected by cameras. Our method achieves the
best lane change classification results using only the RGB video data of the
PREVENTION dataset. Class activation maps demonstrate that action recognition
models can efficiently extract lane change motions. A method to better extract
motion cues is also proposed in this paper.
Comment: Accepted by ECC
In Defense of Image Pre-Training for Spatiotemporal Recognition
Image pre-training, the current de-facto paradigm for a wide range of visual
tasks, is generally less favored in the field of video recognition. By
contrast, a common strategy is to directly train with spatiotemporal
convolutional neural networks (CNNs) from scratch. Nonetheless, interestingly,
by taking a closer look at these from-scratch learned CNNs, we note there exist
certain 3D kernels that exhibit much stronger appearance modeling ability than
others, arguably suggesting appearance information is already well disentangled
in learning. Inspired by this observation, we hypothesize that the key to
effectively leveraging image pre-training lies in the decomposition of learning
spatial and temporal features, and in revisiting image pre-training as the
appearance prior for initializing 3D kernels. In addition, we propose
Spatial-Temporal Separable (STS) convolution, which explicitly splits the
feature channels into spatial and temporal groups, to further enable a more
thorough decomposition of spatiotemporal features for fine-tuning 3D CNNs. Our
experiments show that simply replacing 3D convolution with STS notably improves
a wide range of 3D CNNs without increasing parameters and computation on both
Kinetics-400 and Something-Something V2. Moreover, this new training pipeline
consistently achieves better results on video recognition with significant
speedup. For instance, we achieve +0.6% top-1 accuracy with SlowFast on Kinetics-400 over
the strong 256-epoch 128-GPU baseline while fine-tuning for only 50 epochs with
4 GPUs. The code and models are available at
https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
Comment: Published as a conference paper at ECCV 202
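The channel-splitting idea can be approximated as follows (a sketch under assumed split ratios and kernel shapes, not the released implementation): the input channels are divided into a spatial group processed with a 1x3x3 kernel and a temporal group processed with a 3x1x1 kernel, and the results are concatenated.

```python
import torch
import torch.nn as nn

class STSConvSketch(nn.Module):
    """Rough sketch of a Spatial-Temporal Separable (STS) convolution:
    channels are split into a spatial group and a temporal group, each
    processed with its own kernel shape, then concatenated. The split ratio
    and kernel choices here are assumptions, not the paper's."""
    def __init__(self, in_channels, out_channels, spatial_ratio=0.5):
        super().__init__()
        self.spatial_in = int(in_channels * spatial_ratio)
        spatial_out = int(out_channels * spatial_ratio)
        self.spatial = nn.Conv3d(self.spatial_in, spatial_out,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(in_channels - self.spatial_in,
                                  out_channels - spatial_out,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        xs, xt = x[:, :self.spatial_in], x[:, self.spatial_in:]
        return torch.cat([self.spatial(xs), self.temporal(xt)], dim=1)

out = STSConvSketch(64, 64)(torch.randn(2, 64, 8, 14, 14))  # -> (2, 64, 8, 14, 14)
```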