Video Understanding for Complex Activity Recognition
This paper presents a real-time video understanding system that automatically recognises activities occurring in environments observed through video-surveillance cameras. Our approach consists of three main stages: Scene Tracking, Coherence Maintenance, and Scene Understanding. The main challenges are to provide a tracking process robust enough to recognise events outdoors and under real application conditions, to allow the monitoring of a large scene through a camera network, and to automatically recognise complex events involving several actors interacting with each other. This approach has been validated for Airport Activity Monitoring in the framework of the European project AVITRACK.
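A minimal skeleton of how such a three-stage pipeline could be organised; the class and method names below are illustrative and are not taken from the AVITRACK system:

# Hypothetical skeleton of a three-stage video understanding pipeline:
# Scene Tracking -> Coherence Maintenance -> Scene Understanding.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TrackedObject:
    track_id: int
    trajectory: List[tuple] = field(default_factory=list)   # (frame, x, y) samples

class SceneTracker:
    """Detects and tracks mobile objects in a single camera view."""
    def update(self, frame) -> List[TrackedObject]:
        raise NotImplementedError   # e.g. motion detection + frame-to-frame association

class CoherenceMaintainer:
    """Fuses per-camera tracks into globally consistent tracks across the network."""
    def fuse(self, per_camera_tracks: Dict[int, List[TrackedObject]]) -> List[TrackedObject]:
        raise NotImplementedError

class SceneUnderstanding:
    """Matches fused tracks against predefined models of multi-actor events."""
    def recognise(self, tracks: List[TrackedObject]) -> List[str]:
        raise NotImplementedError   # returns the names of recognised activities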
Long Movie Clip Classification with State-Space Video Models
Most modern video recognition models are designed to operate on short video
clips (e.g., 5-10s in length). Because of this, it is challenging to apply such
models to long movie understanding tasks, which typically require sophisticated
long-range temporal reasoning capabilities. The recently introduced video
transformers partially address this issue by using long-range temporal
self-attention. However, due to the quadratic cost of self-attention, such
models are often costly and impractical to use. Instead, we propose ViS4mer, an
efficient long-range video model that combines the strengths of self-attention
and the recently introduced structured state-space sequence (S4) layer. Our
model uses a standard Transformer encoder for short-range spatiotemporal
feature extraction, and a multi-scale temporal S4 decoder for subsequent
long-range temporal reasoning. By progressively reducing the spatiotemporal
feature resolution and channel dimension at each decoder layer, ViS4mer learns
complex long-range spatiotemporal dependencies in a video. Furthermore, ViS4mer
is faster and requires less GPU memory than the
corresponding pure self-attention-based model. Additionally, ViS4mer achieves
state-of-the-art results on several of the long-form movie video classification
tasks on the LVU benchmark. Furthermore, we also show that our approach
successfully generalizes to other domains, achieving competitive results on the
Breakfast and the COIN procedural activity datasets. The code will be made
publicly available.
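A rough sketch of this encoder-decoder layout, assuming PyTorch, is shown below; a depthwise temporal convolution stands in for the S4 layer (which is not part of standard PyTorch), and all dimensions are illustrative rather than taken from the paper:

import torch
import torch.nn as nn

class TemporalMixer(nn.Module):
    """Stand-in for the structured state-space (S4) layer: any module that mixes
    information along the time axis at sub-quadratic cost could be substituted."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=5, padding=2, groups=dim)

    def forward(self, x):                       # x: (batch, time, dim)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class DecoderStage(nn.Module):
    """One decoder stage: temporal mixing, then halve the temporal resolution
    and the channel dimension, mirroring the progressive reduction described above."""
    def __init__(self, dim):
        super().__init__()
        self.mix = TemporalMixer(dim)
        self.pool = nn.AvgPool1d(kernel_size=2)
        self.proj = nn.Linear(dim, dim // 2)

    def forward(self, x):
        x = self.mix(x)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)   # downsample time
        return self.proj(x)                                # reduce channels

class LongVideoClassifier(nn.Module):
    def __init__(self, dim=256, num_classes=10, num_stages=3):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)   # short-range features
        self.decoder = nn.Sequential(*[DecoderStage(dim // 2**i) for i in range(num_stages)])
        self.head = nn.Linear(dim // 2**num_stages, num_classes)

    def forward(self, tokens):                  # tokens: (batch, time, dim) clip features
        x = self.encoder(tokens)
        x = self.decoder(x)
        return self.head(x.mean(dim=1))         # pool over time, then classify

For example, LongVideoClassifier()(torch.randn(2, 64, 256)) returns a (2, 10) tensor of class logits for a batch of two 64-step token sequences.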
3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks
Human activity understanding with 3D/depth sensors has received increasing
attention in multimedia processing and interactions. This work focuses on
developing a novel deep model for automatic activity recognition from RGB-D
videos. We represent each human activity as an ensemble of cubic-like video
segments, and learn to discover the temporal structure of each activity
category, i.e. how the activities should be decomposed for classification.
Our model can be regarded as a structured deep architecture, as
it extends the convolutional neural networks (CNNs) by incorporating structure
alternatives. Specifically, we build the network consisting of 3D convolutions
and max-pooling operators over the video segments, and introduce latent
variables in each convolutional layer that manipulate the activation of neurons.
Our model thus advances existing approaches in two aspects: (i) it acts
directly on the raw inputs (grayscale-depth data) to conduct recognition
instead of relying on hand-crafted features, and (ii) the model structure can
be dynamically adjusted accounting for the temporal variations of human
activities, i.e. the network configuration is allowed to be partially activated
during inference. For model training, we propose an EM-type optimization method
that iteratively (i) discovers the latent structure by determining the
decomposed actions for each training example, and (ii) learns the network
parameters by using the back-propagation algorithm. Our approach is validated
in challenging scenarios, and outperforms state-of-the-art methods. A large
human activity database of RGB-D videos is presented in addition.
Comment: This manuscript has 10 pages with 9 figures, and a preliminary version was published at the ACM MM'14 conference.
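A toy sketch of the 3D-convolutional part of such a model, assuming PyTorch; the latent structure variables, the segment decomposition, and the EM-style training are omitted, and all layer sizes are illustrative:

import torch.nn as nn

class SegmentNet(nn.Module):
    """Toy 3D-convolutional branch over one cubic video segment.
    Input shape: (batch, 2, frames, height, width) -- grayscale + depth channels."""
    def __init__(self, num_classes=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(2, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),            # collapse space and time
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, segment):
        return self.classifier(self.features(segment).flatten(1))

In the model described above, several such segment-level branches would be combined, with latent variables gating which parts of the network are active for a given decomposition of the activity; the sketch shows only the basic 3D convolution and pooling structure.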
Going Deeper with Semantics: Video Activity Interpretation using Semantic Contextualization
A deeper understanding of video activities extends beyond recognition of
underlying concepts such as actions and objects: constructing deep semantic
representations requires reasoning about the semantic relationships among these
concepts, often beyond what is directly observed in the data. To this end, we
propose an energy minimization framework that leverages large-scale commonsense
knowledge bases, such as ConceptNet, to provide contextual cues to establish
semantic relationships among entities directly hypothesized from video signal.
We mathematically express this using the language of Grenander's canonical
pattern generator theory. We show that the use of prior encoded commonsense
knowledge alleviates the need for large annotated training datasets and helps
tackle imbalance in the training data. Using three different
publicly available datasets - Charades, the Microsoft Visual Description Corpus,
and Breakfast Actions - we show that the proposed model can generate video
interpretations whose quality is better than those reported by state-of-the-art
approaches, which have substantial training needs. Through extensive
experiments, we show that the use of commonsense knowledge from ConceptNet
allows the proposed approach to handle various challenges such as training data
imbalance, weak features, and complex semantic relationships and visual scenes.
Comment: Accepted to WACV 201
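As an illustration of the general idea, the toy energy below combines a unary term from visual detector confidences with a pairwise commonsense compatibility term; the scores, relations, and weights are placeholders, whereas the paper derives the contextual term from ConceptNet and formalises the model with Grenander's pattern theory:

import itertools
import math

def interpretation_energy(labels, detector_scores, compatibility, alpha=1.0, beta=1.0):
    """Lower energy = more plausible interpretation (set of concept labels)."""
    # Unary term: penalise labels that the visual detectors find unlikely.
    unary = -sum(math.log(detector_scores[label]) for label in labels)
    # Pairwise term: reward label pairs that commonsense knowledge says belong together.
    pairwise = -sum(compatibility.get(frozenset(pair), 0.0)
                    for pair in itertools.combinations(labels, 2))
    return alpha * unary + beta * pairwise

# Placeholder detector confidences and commonsense compatibilities.
detector_scores = {"cut": 0.6, "knife": 0.7, "guitar": 0.2}
compatibility = {frozenset({"cut", "knife"}): 1.5, frozenset({"cut", "guitar"}): -0.5}

print(interpretation_energy(["cut", "knife"], detector_scores, compatibility))   # low energy
print(interpretation_energy(["cut", "guitar"], detector_scores, compatibility))  # higher energy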