125 research outputs found
Temporal activity detection in untrimmed videos with recurrent neural networks
This work proposes a simple pipeline to classify and temporally localize activities in untrimmed videos. Our system uses features from a 3D Convolutional Neural Network (C3D) as input to train a a recurrent neural network (RNN) that learns to classify video clips of 16 frames. After clip prediction, we post-process the output of the RNN to assign a single activity label to each video, and determine the temporal boundaries of the activity within the video. We show how our system can achieve competitive results in both tasks with a simple architecture. We evaluate our method in the ActivityNet Challenge 2016, achieving a 0.5874 mAP and a 0.2237 mAP in the classification and detection tasks, respectively. Our code and models are publicly available at: https://imatge-upc.github.io/activitynet-2016-cvprw/Peer ReviewedPostprint (published version
RED: Reinforced Encoder-Decoder Networks for Action Anticipation
Action anticipation aims to detect an action before it happens. Many real
world applications in robotics and surveillance are related to this predictive
capability. Current methods address this problem by first anticipating visual
representations of future frames and then categorizing the anticipated
representations to actions. However, anticipation is based on a single past
frame's representation, which ignores the history trend. Besides, it can only
anticipate a fixed future time. We propose a Reinforced Encoder-Decoder (RED)
network for action anticipation. RED takes multiple history representations as
input and learns to anticipate a sequence of future representations. One
salient aspect of RED is that a reinforcement module is adopted to provide
sequence-level supervision; the reward function is designed to encourage the
system to make correct predictions as early as possible. We test RED on
TVSeries, THUMOS-14 and TV-Human-Interaction datasets for action anticipation
and achieve state-of-the-art performance on all datasets
Spatio-Temporal Action Detection with Cascade Proposal and Location Anticipation
In this work, we address the problem of spatio-temporal action detection in
temporally untrimmed videos. It is an important and challenging task as finding
accurate human actions in both temporal and spatial space is important for
analyzing large-scale video data. To tackle this problem, we propose a cascade
proposal and location anticipation (CPLA) model for frame-level action
detection. There are several salient points of our model: (1) a cascade region
proposal network (casRPN) is adopted for action proposal generation and shows
better localization accuracy compared with single region proposal network
(RPN); (2) action spatio-temporal consistencies are exploited via a location
anticipation network (LAN) and thus frame-level action detection is not
conducted independently. Frame-level detections are then linked by solving an
linking score maximization problem, and temporally trimmed into spatio-temporal
action tubes. We demonstrate the effectiveness of our model on the challenging
UCF101 and LIRIS-HARL datasets, both achieving state-of-the-art performance.Comment: Accepted at BMVC 2017 (oral
Cascaded Boundary Regression for Temporal Action Detection
Temporal action detection in long videos is an important problem.
State-of-the-art methods address this problem by applying action classifiers on
sliding windows. Although sliding windows may contain an identifiable portion
of the actions, they may not necessarily cover the entire action instance,
which would lead to inferior performance. We adapt a two-stage temporal action
detection pipeline with Cascaded Boundary Regression (CBR) model.
Class-agnostic proposals and specific actions are detected respectively in the
first and the second stage. CBR uses temporal coordinate regression to refine
the temporal boundaries of the sliding windows. The salient aspect of the
refinement process is that, inside each stage, the temporal boundaries are
adjusted in a cascaded way by feeding the refined windows back to the system
for further boundary refinement. We test CBR on THUMOS-14 and TVSeries, and
achieve state-of-the-art performance on both datasets. The performance gain is
especially remarkable under high IoU thresholds, e.g. map@tIoU=0.5 on THUMOS-14
is improved from 19.0% to 31.0%
- …