896 research outputs found
Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos
Deep learning has been demonstrated to achieve excellent results for image
classification and object detection. However, the impact of deep learning on
video analysis (e.g. action detection and recognition) has been limited due to
complexity of video data and lack of annotations. Previous convolutional neural
networks (CNN) based video action detection approaches usually consist of two
major steps: frame-level action proposal detection and association of proposals
across frames. Also, these methods employ two-stream CNN framework to handle
spatial and temporal feature separately. In this paper, we propose an
end-to-end deep network called Tube Convolutional Neural Network (T-CNN) for
action detection in videos. The proposed architecture is a unified network that
is able to recognize and localize action based on 3D convolution features. A
video is first divided into equal length clips and for each clip a set of tube
proposals are generated next based on 3D Convolutional Network (ConvNet)
features. Finally, the tube proposals of different clips are linked together
employing network flow and spatio-temporal action detection is performed using
these linked video proposals. Extensive experiments on several video datasets
demonstrate the superior performance of T-CNN for classifying and localizing
actions in both trimmed and untrimmed videos compared to state-of-the-arts
Detect-and-Track: Efficient Pose Estimation in Videos
This paper addresses the problem of estimating and tracking human body
keypoints in complex, multi-person video. We propose an extremely lightweight
yet highly effective approach that builds upon the latest advancements in human
detection and video understanding. Our method operates in two-stages: keypoint
estimation in frames or short clips, followed by lightweight tracking to
generate keypoint predictions linked over the entire video. For frame-level
pose estimation we experiment with Mask R-CNN, as well as our own proposed 3D
extension of this model, which leverages temporal information over small clips
to generate more robust frame predictions. We conduct extensive ablative
experiments on the newly released multi-person video pose estimation benchmark,
PoseTrack, to validate various design choices of our model. Our approach
achieves an accuracy of 55.2% on the validation and 51.8% on the test set using
the Multi-Object Tracking Accuracy (MOTA) metric, and achieves state of the art
performance on the ICCV 2017 PoseTrack keypoint tracking challenge.Comment: In CVPR 2018. Ranked first in ICCV 2017 PoseTrack challenge (keypoint
tracking in videos). Code: https://github.com/facebookresearch/DetectAndTrack
and webpage: https://rohitgirdhar.github.io/DetectAndTrack
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos
In this work, we propose an approach to the spatiotemporal localisation
(detection) and classification of multiple concurrent actions within temporally
untrimmed videos. Our framework is composed of three stages. In stage 1,
appearance and motion detection networks are employed to localise and score
actions from colour images and optical flow. In stage 2, the appearance network
detections are boosted by combining them with the motion detection scores, in
proportion to their respective spatial overlap. In stage 3, sequences of
detection boxes most likely to be associated with a single action instance,
called action tubes, are constructed by solving two energy maximisation
problems via dynamic programming. While in the first pass, action paths
spanning the whole video are built by linking detection boxes over time using
their class-specific scores and their spatial overlap, in the second pass,
temporal trimming is performed by ensuring label consistency for all
constituting detection boxes. We demonstrate the performance of our algorithm
on the challenging UCF101, J-HMDB-21 and LIRIS-HARL datasets, achieving new
state-of-the-art results across the board and significantly increasing
detection speed at test time. We achieve a huge leap forward in action
detection performance and report a 20% and 11% gain in mAP (mean average
precision) on UCF-101 and J-HMDB-21 datasets respectively when compared to the
state-of-the-art.Comment: Accepted by British Machine Vision Conference 201
Detect to Track and Track to Detect
Recent approaches for high accuracy detection and tracking of object
categories in video consist of complex multistage solutions that become more
cumbersome each year. In this paper we propose a ConvNet architecture that
jointly performs detection and tracking, solving the task in a simple and
effective way. Our contributions are threefold: (i) we set up a ConvNet
architecture for simultaneous detection and tracking, using a multi-task
objective for frame-based object detection and across-frame track regression;
(ii) we introduce correlation features that represent object co-occurrences
across time to aid the ConvNet during tracking; and (iii) we link the frame
level detections based on our across-frame tracklets to produce high accuracy
detections at the video level. Our ConvNet architecture for spatiotemporal
object detection is evaluated on the large-scale ImageNet VID dataset where it
achieves state-of-the-art results. Our approach provides better single model
performance than the winning method of the last ImageNet challenge while being
conceptually much simpler. Finally, we show that by increasing the temporal
stride we can dramatically increase the tracker speed.Comment: ICCV 2017. Code and models:
https://github.com/feichtenhofer/Detect-Track Results:
https://www.robots.ox.ac.uk/~vgg/research/detect-track
Action tube extraction based 3D-CNN for RGB-D action recognition
In this paper we propose a novel action tube extractor for RGB-D action recognition in trimmed videos. The action tube extractor takes as input a video and outputs an action tube. The method consists of two parts: spatial tube extraction and temporal sampling. The first part is built upon MobileNet-SSD and its role is to define the spatial region where the action takes place. The second part is based on the structural similarity index (SSIM) and is designed to remove frames without obvious motion from the primary action tube. The final extracted action tube has two benefits: 1) a higher ratio of ROI (subjects of action) to background; 2) most frames contain obvious motion change. We propose to use a two-stream (RGB and Depth) I3D architecture as our 3D-CNN model. Our approach outperforms the state-of-the-art methods on the OA and NTU RGB-D datasets. © 2018 IEEE.Peer ReviewedPostprint (published version
Am I Done? Predicting Action Progress in Videos
In this paper we deal with the problem of predicting action progress in
videos. We argue that this is an extremely important task since it can be
valuable for a wide range of interaction applications. To this end we introduce
a novel approach, named ProgressNet, capable of predicting when an action takes
place in a video, where it is located within the frames, and how far it has
progressed during its execution. To provide a general definition of action
progress, we ground our work in the linguistics literature, borrowing terms and
concepts to understand which actions can be the subject of progress estimation.
As a result, we define a categorization of actions and their phases. Motivated
by the recent success obtained from the interaction of Convolutional and
Recurrent Neural Networks, our model is based on a combination of the Faster
R-CNN framework, to make frame-wise predictions, and LSTM networks, to estimate
action progress through time. After introducing two evaluation protocols for
the task at hand, we demonstrate the capability of our model to effectively
predict action progress on the UCF-101 and J-HMDB datasets
- …