Video Understanding for Complex Activity Recognition
This paper presents a real-time video understanding system that automatically recognises activities occurring in environments observed through video-surveillance cameras. Our approach consists of three main stages: Scene Tracking, Coherence Maintenance, and Scene Understanding. The main challenges are to provide a tracking process robust enough to recognise events outdoors and under real application conditions, to allow the monitoring of a large scene through a camera network, and to automatically recognise complex events involving several actors interacting with each other. This approach has been validated for Airport Activity Monitoring in the framework of the European project AVITRACK.
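A minimal skeleton of how such a three-stage pipeline could be organised; the class and method names below are illustrative and are not taken from the AVITRACK system:

# Hypothetical skeleton of a three-stage video understanding pipeline:
# Scene Tracking -> Coherence Maintenance -> Scene Understanding.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TrackedObject:
    track_id: int
    trajectory: List[tuple] = field(default_factory=list)   # (frame, x, y) samples

class SceneTracker:
    """Detects and tracks mobile objects in a single camera view."""
    def update(self, frame) -> List[TrackedObject]:
        raise NotImplementedError   # e.g. motion detection + frame-to-frame association

class CoherenceMaintainer:
    """Fuses per-camera tracks into globally consistent tracks across the network."""
    def fuse(self, per_camera_tracks: Dict[int, List[TrackedObject]]) -> List[TrackedObject]:
        raise NotImplementedError

class SceneUnderstanding:
    """Matches fused tracks against predefined models of multi-actor events."""
    def recognise(self, tracks: List[TrackedObject]) -> List[str]:
        raise NotImplementedError   # returns the names of recognised activities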
Long Movie Clip Classification with State-Space Video Models
Most modern video recognition models are designed to operate on short video
clips (e.g., 5-10s in length). Because of this, it is challenging to apply such
models to long movie understanding tasks, which typically require sophisticated
long-range temporal reasoning capabilities. The recently introduced video
transformers partially address this issue by using long-range temporal
self-attention. However, due to the quadratic cost of self-attention, such
models are often costly and impractical to use. Instead, we propose ViS4mer, an
efficient long-range video model that combines the strengths of self-attention
and the recently introduced structured state-space sequence (S4) layer. Our
model uses a standard Transformer encoder for short-range spatiotemporal
feature extraction, and a multi-scale temporal S4 decoder for subsequent
long-range temporal reasoning. By progressively reducing the spatiotemporal
feature resolution and channel dimension at each decoder layer, ViS4mer learns
complex long-range spatiotemporal dependencies in a video. Furthermore, ViS4mer
is faster and requires less GPU memory than the
corresponding pure self-attention-based model. Additionally, ViS4mer achieves
state-of-the-art results on several of the long-form movie video classification
tasks on the LVU benchmark. Furthermore, we also show that our approach
successfully generalizes to other domains, achieving competitive results on the
Breakfast and the COIN procedural activity datasets. The code will be made
publicly available.
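A rough sketch of this encoder-decoder layout, assuming PyTorch, is shown below; a depthwise temporal convolution stands in for the S4 layer (which is not part of standard PyTorch), and all dimensions are illustrative rather than taken from the paper:

import torch
import torch.nn as nn

class TemporalMixer(nn.Module):
    """Stand-in for the structured state-space (S4) layer: any module that mixes
    information along the time axis at sub-quadratic cost could be substituted."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=5, padding=2, groups=dim)

    def forward(self, x):                       # x: (batch, time, dim)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class DecoderStage(nn.Module):
    """One decoder stage: temporal mixing, then halve the temporal resolution
    and the channel dimension, mirroring the progressive reduction described above."""
    def __init__(self, dim):
        super().__init__()
        self.mix = TemporalMixer(dim)
        self.pool = nn.AvgPool1d(kernel_size=2)
        self.proj = nn.Linear(dim, dim // 2)

    def forward(self, x):
        x = self.mix(x)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)   # downsample time
        return self.proj(x)                                # reduce channels

class LongVideoClassifier(nn.Module):
    def __init__(self, dim=256, num_classes=10, num_stages=3):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)   # short-range features
        self.decoder = nn.Sequential(*[DecoderStage(dim // 2**i) for i in range(num_stages)])
        self.head = nn.Linear(dim // 2**num_stages, num_classes)

    def forward(self, tokens):                  # tokens: (batch, time, dim) clip features
        x = self.encoder(tokens)
        x = self.decoder(x)
        return self.head(x.mean(dim=1))         # pool over time, then classify

For example, LongVideoClassifier()(torch.randn(2, 64, 256)) returns a (2, 10) tensor of class logits for a batch of two 64-step token sequences.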
3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks
Human activity understanding with 3D/depth sensors has received increasing
attention in multimedia processing and interactions. This work focuses on
developing a novel deep model for automatic activity recognition from RGB-D
videos. We represent each human activity as an ensemble of cubic-like video
segments, and learn to discover the temporal structure of each activity
category, i.e. how the activities should be decomposed for classification.
Our model can be regarded as a structured deep architecture, as
it extends the convolutional neural networks (CNNs) by incorporating structure
alternatives. Specifically, we build the network consisting of 3D convolutions
and max-pooling operators over the video segments, and introduce latent
variables in each convolutional layer that manipulate the activation of neurons.
Our model thus advances existing approaches in two aspects: (i) it acts
directly on the raw inputs (grayscale-depth data) to conduct recognition
instead of relying on hand-crafted features, and (ii) the model structure can
be dynamically adjusted accounting for the temporal variations of human
activities, i.e. the network configuration is allowed to be partially activated
during inference. For model training, we propose an EM-type optimization method
that iteratively (i) discovers the latent structure by determining the
decomposed actions for each training example, and (ii) learns the network
parameters by using the back-propagation algorithm. Our approach is validated
in challenging scenarios, and outperforms state-of-the-art methods. A large
human activity database of RGB-D videos is presented in addition.
Comment: This manuscript has 10 pages with 9 figures, and a preliminary version was published at the ACM MM'14 conference.
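A toy sketch of the 3D-convolutional part of such a model, assuming PyTorch; the latent structure variables, the segment decomposition, and the EM-style training are omitted, and all layer sizes are illustrative:

import torch.nn as nn

class SegmentNet(nn.Module):
    """Toy 3D-convolutional branch over one cubic video segment.
    Input shape: (batch, 2, frames, height, width) -- grayscale + depth channels."""
    def __init__(self, num_classes=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(2, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),            # collapse space and time
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, segment):
        return self.classifier(self.features(segment).flatten(1))

In the model described above, several such segment-level branches would be combined, with latent variables gating which parts of the network are active for a given decomposition of the activity; the sketch shows only the basic 3D convolution and pooling structure.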
Going Deeper with Semantics: Video Activity Interpretation using Semantic Contextualization
A deeper understanding of video activities extends beyond recognition of
underlying concepts such as actions and objects: constructing deep semantic
representations requires reasoning about the semantic relationships among these
concepts, often beyond what is directly observed in the data. To this end, we
propose an energy minimization framework that leverages large-scale commonsense
knowledge bases, such as ConceptNet, to provide contextual cues to establish
semantic relationships among entities directly hypothesized from video signal.
We mathematically express this using the language of Grenander's canonical
pattern generator theory. We show that the use of prior encoded commonsense
knowledge alleviates the need for large annotated training datasets and helps
tackle imbalance in the training data. Using three different
publicly available datasets - Charades, the Microsoft Visual Description Corpus,
and Breakfast Actions - we show that the proposed model can generate video
interpretations whose quality is better than those reported by state-of-the-art
approaches, which have substantial training needs. Through extensive
experiments, we show that the use of commonsense knowledge from ConceptNet
allows the proposed approach to handle various challenges such as training data
imbalance, weak features, and complex semantic relationships and visual scenes.
Comment: Accepted to WACV 201
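As an illustration of the general idea, the toy energy below combines a unary term from visual detector confidences with a pairwise commonsense compatibility term; the scores, relations, and weights are placeholders, whereas the paper derives the contextual term from ConceptNet and formalises the model with Grenander's pattern theory:

import itertools
import math

def interpretation_energy(labels, detector_scores, compatibility, alpha=1.0, beta=1.0):
    """Lower energy = more plausible interpretation (set of concept labels)."""
    # Unary term: penalise labels that the visual detectors find unlikely.
    unary = -sum(math.log(detector_scores[label]) for label in labels)
    # Pairwise term: reward label pairs that commonsense knowledge says belong together.
    pairwise = -sum(compatibility.get(frozenset(pair), 0.0)
                    for pair in itertools.combinations(labels, 2))
    return alpha * unary + beta * pairwise

# Placeholder detector confidences and commonsense compatibilities.
detector_scores = {"cut": 0.6, "knife": 0.7, "guitar": 0.2}
compatibility = {frozenset({"cut", "knife"}): 1.5, frozenset({"cut", "guitar"}): -0.5}

print(interpretation_energy(["cut", "knife"], detector_scores, compatibility))   # low energy
print(interpretation_energy(["cut", "guitar"], detector_scores, compatibility))  # higher energy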