Learning Latent Super-Events to Detect Multiple Activities in Videos
In this paper, we introduce the concept of learning latent super-events from
activity videos, and present how it benefits activity detection in continuous
videos. We define a super-event as a set of multiple events occurring together
in videos with a particular temporal organization; it is the opposite concept
of sub-events. Real-world videos contain multiple activities and are rarely
segmented (e.g., surveillance videos), and learning latent super-events allows
the model to capture how the events are temporally related in videos. We design
temporal structure filters that enable the model to focus on particular
sub-intervals of the videos, and use them together with a soft attention
mechanism to learn representations of latent super-events. Super-event
representations are combined with per-frame or per-segment CNNs to provide
frame-level annotations. Our approach is designed to be fully differentiable,
enabling end-to-end learning of latent super-event representations jointly with
the activity detector using them. Our experiments with multiple public video
datasets confirm that the proposed concept of latent super-event learning
significantly benefits activity detection, advancing the state of the art.
Comment: CVPR 201
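As an illustration of the mechanism described in the abstract above, the following is a minimal sketch, not the authors' implementation. It assumes a Cauchy-shaped kernel for each temporal structure filter and a softmax for the soft attention; the abstract specifies neither. All function names and the toy inputs are hypothetical.

```python
import math


def temporal_filter(num_frames, center, width):
    """Normalized weights over frames, peaked near `center`.

    The Cauchy shape is an assumption of this sketch.
    """
    raw = [
        1.0 / (math.pi * width * (1.0 + ((t - center) / width) ** 2))
        for t in range(num_frames)
    ]
    total = sum(raw)
    return [w / total for w in raw]


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def super_event(frame_features, filters, attention_scores):
    """Attention-weighted sum of the filter-pooled frame features."""
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    attn = softmax(attention_scores)
    rep = [0.0] * dim
    for a, (center, width) in zip(attn, filters):
        weights = temporal_filter(num_frames, center, width)
        for t, w in enumerate(weights):
            for d in range(dim):
                rep[d] += a * w * frame_features[t][d]
    return rep


def annotate_frames(frame_features, rep):
    """Append the video-level representation to every frame's features."""
    return [list(f) + rep for f in frame_features]


# Toy input: 4 frames with 2-dimensional per-frame features (hypothetical).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two filters given as (center, width); in a trained model these are learned.
filters = [(0.5, 1.0), (2.5, 0.5)]
rep = super_event(features, filters, attention_scores=[0.2, -0.1])
combined = annotate_frames(features, rep)
```

Each row of `combined` holds a frame's own features followed by the shared super-event representation, which a per-frame classifier could then consume. Every step is a smooth function of the filter parameters and attention scores, which is what would make end-to-end training possible in a differentiable framework.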
Convolutional Drift Networks for Video Classification
Analyzing spatio-temporal data like video is a challenging task that requires
processing visual and temporal information effectively. Convolutional Neural
Networks have shown promise as baseline fixed feature extractors through
transfer learning, a technique that helps minimize the training cost on visual
information. Temporal information is often handled using hand-crafted features
or Recurrent Neural Networks, but this can be overly specific or prohibitively
complex. Building a fully trainable system that can efficiently analyze
spatio-temporal data without hand-crafted features or complex training is an
open challenge. We present a new neural network architecture to address this
challenge, the Convolutional Drift Network (CDN). Our CDN architecture combines
the visual feature extraction power of deep Convolutional Neural Networks with
the intrinsically efficient temporal processing provided by Reservoir
Computing. In this introductory paper on the CDN, we provide a very simple
baseline implementation tested on two egocentric (first-person) video activity
datasets. We achieve video-level activity classification results on par with
state-of-the-art methods. Notably, performance on this complex spatio-temporal
task was produced by training only a single feed-forward layer in the CDN.
Comment: Published in IEEE Rebooting Computin
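The idea of training only a readout layer on top of a fixed reservoir can be sketched as follows. This is a minimal illustration under stated assumptions, not the CDN itself: the reservoir is a small echo-state-style recurrent layer with random weights, the per-frame vectors stand in for CNN features, and the readout is a logistic layer trained by full-batch gradient descent. The class and function names, sizes, and toy data are all hypothetical.

```python
import math
import random


class Reservoir:
    """Fixed random recurrent layer; its weights are never trained."""

    def __init__(self, input_dim, size, seed=0):
        rng = random.Random(seed)
        self.size = size
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)]
                     for _ in range(size)]
        # Small recurrent weights keep the state from saturating.
        self.w_rec = [[rng.uniform(-0.5, 0.5) / size for _ in range(size)]
                      for _ in range(size)]

    def encode(self, frames):
        """Drive the reservoir with per-frame features; return the final state."""
        state = [0.0] * self.size
        for x in frames:
            state = [
                math.tanh(
                    sum(w * v for w, v in zip(self.w_in[i], x))
                    + sum(w * s for w, s in zip(self.w_rec[i], state))
                )
                for i in range(self.size)
            ]
        return state


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, state):
    """Probability of class 1 from the linear readout."""
    return sigmoid(sum(wi * si for wi, si in zip(w, state)) + b)


def train_readout(states, labels, epochs=200, lr=0.3):
    """Fit the single trainable layer with full-batch gradient descent."""
    dim = len(states[0])
    n = len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for s, y in zip(states, labels):
            err = predict(w, b, s) - y
            for d in range(dim):
                grad_w[d] += err * s[d] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Two toy "videos" of 5 frames each; the 2-d vectors stand in for CNN features.
videos = [[[1.0, 0.0]] * 5, [[0.0, 1.0]] * 5]
labels = [0, 1]
reservoir = Reservoir(input_dim=2, size=8)
states = [reservoir.encode(v) for v in videos]
w, b = train_readout(states, labels)
probs = [predict(w, b, s) for s in states]
```

Only `w` and `b` change during training; the reservoir weights are drawn once and left untouched, which is where the low training cost described in the abstract would come from.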