21,647 research outputs found
Spatio-Temporal Action Detection with Cascade Proposal and Location Anticipation
In this work, we address the problem of spatio-temporal action detection in
temporally untrimmed videos. It is an important and challenging task as finding
accurate human actions in both temporal and spatial space is important for
analyzing large-scale video data. To tackle this problem, we propose a cascade
proposal and location anticipation (CPLA) model for frame-level action
detection. There are several salient points of our model: (1) a cascade region
proposal network (casRPN) is adopted for action proposal generation and shows
better localization accuracy compared with single region proposal network
(RPN); (2) action spatio-temporal consistencies are exploited via a location
anticipation network (LAN) and thus frame-level action detection is not
conducted independently. Frame-level detections are then linked by solving an
linking score maximization problem, and temporally trimmed into spatio-temporal
action tubes. We demonstrate the effectiveness of our model on the challenging
UCF101 and LIRIS-HARL datasets, both achieving state-of-the-art performance.Comment: Accepted at BMVC 2017 (oral
Action Tubelet Detector for Spatio-Temporal Action Localization
Current state-of-the-art approaches for spatio-temporal action localization
rely on detections at the frame level that are then linked or tracked across
time. In this paper, we leverage the temporal continuity of videos instead of
operating at the frame level. We propose the ACtion Tubelet detector
(ACT-detector) that takes as input a sequence of frames and outputs tubelets,
i.e., sequences of bounding boxes with associated scores. The same way
state-of-the-art object detectors rely on anchor boxes, our ACT-detector is
based on anchor cuboids. We build upon the SSD framework. Convolutional
features are extracted for each frame, while scores and regressions are based
on the temporal stacking of these features, thus exploiting information from a
sequence. Our experimental results show that leveraging sequences of frames
significantly improves detection performance over using individual frames. The
gain of our tubelet detector can be explained by both more accurate scores and
more precise localization. Our ACT-detector outperforms the state-of-the-art
methods for frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in
particular at high overlap thresholds.Comment: 9 page
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos
In this work, we propose an approach to the spatiotemporal localisation
(detection) and classification of multiple concurrent actions within temporally
untrimmed videos. Our framework is composed of three stages. In stage 1,
appearance and motion detection networks are employed to localise and score
actions from colour images and optical flow. In stage 2, the appearance network
detections are boosted by combining them with the motion detection scores, in
proportion to their respective spatial overlap. In stage 3, sequences of
detection boxes most likely to be associated with a single action instance,
called action tubes, are constructed by solving two energy maximisation
problems via dynamic programming. While in the first pass, action paths
spanning the whole video are built by linking detection boxes over time using
their class-specific scores and their spatial overlap, in the second pass,
temporal trimming is performed by ensuring label consistency for all
constituting detection boxes. We demonstrate the performance of our algorithm
on the challenging UCF101, J-HMDB-21 and LIRIS-HARL datasets, achieving new
state-of-the-art results across the board and significantly increasing
detection speed at test time. We achieve a huge leap forward in action
detection performance and report a 20% and 11% gain in mAP (mean average
precision) on UCF-101 and J-HMDB-21 datasets respectively when compared to the
state-of-the-art.Comment: Accepted by British Machine Vision Conference 201
Detect to Track and Track to Detect
Recent approaches for high accuracy detection and tracking of object
categories in video consist of complex multistage solutions that become more
cumbersome each year. In this paper we propose a ConvNet architecture that
jointly performs detection and tracking, solving the task in a simple and
effective way. Our contributions are threefold: (i) we set up a ConvNet
architecture for simultaneous detection and tracking, using a multi-task
objective for frame-based object detection and across-frame track regression;
(ii) we introduce correlation features that represent object co-occurrences
across time to aid the ConvNet during tracking; and (iii) we link the frame
level detections based on our across-frame tracklets to produce high accuracy
detections at the video level. Our ConvNet architecture for spatiotemporal
object detection is evaluated on the large-scale ImageNet VID dataset where it
achieves state-of-the-art results. Our approach provides better single model
performance than the winning method of the last ImageNet challenge while being
conceptually much simpler. Finally, we show that by increasing the temporal
stride we can dramatically increase the tracker speed.Comment: ICCV 2017. Code and models:
https://github.com/feichtenhofer/Detect-Track Results:
https://www.robots.ox.ac.uk/~vgg/research/detect-track
- …