47 research outputs found

    The THUMOS Challenge on Action Recognition for Videos "in the Wild"

    Get PDF
    Automatically recognizing and localizing wide ranges of human actions has crucial importance for video understanding. Towards this goal, the THUMOS challenge was introduced in 2013 to serve as a benchmark for action recognition. Until then, video action recognition, including THUMOS challenge, had focused primarily on the classification of pre-segmented (i.e., trimmed) videos, which is an artificial task. In THUMOS 2014, we elevated action recognition to a more practical level by introducing temporally untrimmed videos. These also include `background videos' which share similar scenes and backgrounds as action videos, but are devoid of the specific actions. The three editions of the challenge organized in 2013--2015 have made THUMOS a common benchmark for action classification and detection and the annual challenge is widely attended by teams from around the world. In this paper we describe the THUMOS benchmark in detail and give an overview of data collection and annotation procedures. We present the evaluation protocols used to quantify results in the two THUMOS tasks of action classification and temporal detection. We also present results of submissions to the THUMOS 2015 challenge and review the participating approaches. Additionally, we include a comprehensive empirical study evaluating the differences in action recognition between trimmed and untrimmed videos, and how well methods trained on trimmed videos generalize to untrimmed videos. We conclude by proposing several directions and improvements for future THUMOS challenges.Comment: Preprint submitted to Computer Vision and Image Understandin

    TraMNet - Transition Matrix Network for Efficient Action Tube Proposals

    Full text link
    Current state-of-the-art methods solve spatiotemporal action localisation by extending 2D anchors to 3D-cuboid proposals on stacks of frames, to generate sets of temporally connected bounding boxes called \textit{action micro-tubes}. However, they fail to consider that the underlying anchor proposal hypotheses should also move (transition) from frame to frame, as the actor or the camera does. Assuming we evaluate nn 2D anchors in each frame, then the number of possible transitions from each 2D anchor to the next, for a sequence of ff consecutive frames, is in the order of O(nf)O(n^f), expensive even for small values of ff. To avoid this problem, we introduce a Transition-Matrix-based Network (TraMNet) which relies on computing transition probabilities between anchor proposals while maximising their overlap with ground truth bounding boxes across frames, and enforcing sparsity via a transition threshold. As the resulting transition matrix is sparse and stochastic, this reduces the proposal hypothesis search space from O(nf)O(n^f) to the cardinality of the thresholded matrix. At training time, transitions are specific to cell locations of the feature maps, so that a sparse (efficient) transition matrix is used to train the network. At test time, a denser transition matrix can be obtained either by decreasing the threshold or by adding to it all the relative transitions originating from any cell location, allowing the network to handle transitions in the test data that might not have been present in the training data, and making detection translation-invariant. Finally, we show that our network can handle sparse annotations such as those available in the DALY dataset. We report extensive experiments on the DALY, UCF101-24 and Transformed-UCF101-24 datasets to support our claims.Comment: 15 page

    Weakly-supervised Temporal Action Localization by Uncertainty Modeling

    Full text link
    Weakly-supervised temporal action localization aims to learn detecting temporal intervals of action classes with only video-level labels. To this end, it is crucial to separate frames of action classes from the background frames (i.e., frames not belonging to any action classes). In this paper, we present a new perspective on background frames where they are modeled as out-of-distribution samples regarding their inconsistency. Then, background frames can be detected by estimating the probability of each frame being out-of-distribution, known as uncertainty, but it is infeasible to directly learn uncertainty without frame-level labels. To realize the uncertainty learning in the weakly-supervised setting, we leverage the multiple instance learning formulation. Moreover, we further introduce a background entropy loss to better discriminate background frames by encouraging their in-distribution (action) probabilities to be uniformly distributed over all action classes. Experimental results show that our uncertainty modeling is effective at alleviating the interference of background frames and brings a large performance gain without bells and whistles. We demonstrate that our model significantly outperforms state-of-the-art methods on the benchmarks, THUMOS'14 and ActivityNet (1.2 & 1.3). Our code is available at https://github.com/Pilhyeon/WTAL-Uncertainty-Modeling.Comment: Accepted by the 35th AAAI Conference on Artificial Intelligence (AAAI 2021

    Am I Done? Predicting Action Progress in Videos

    Get PDF
    In this paper we deal with the problem of predicting action progress in videos. We argue that this is an extremely important task since it can be valuable for a wide range of interaction applications. To this end we introduce a novel approach, named ProgressNet, capable of predicting when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution. To provide a general definition of action progress, we ground our work in the linguistics literature, borrowing terms and concepts to understand which actions can be the subject of progress estimation. As a result, we define a categorization of actions and their phases. Motivated by the recent success obtained from the interaction of Convolutional and Recurrent Neural Networks, our model is based on a combination of the Faster R-CNN framework, to make frame-wise predictions, and LSTM networks, to estimate action progress through time. After introducing two evaluation protocols for the task at hand, we demonstrate the capability of our model to effectively predict action progress on the UCF-101 and J-HMDB datasets

    Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints

    Full text link
    Action detection and temporal segmentation of actions in videos are topics of increasing interest. While fully supervised systems have gained much attention lately, full annotation of each action within the video is costly and impractical for large amounts of video data. Thus, weakly supervised action detection and temporal segmentation methods are of great importance. While most works in this area assume an ordered sequence of occurring actions to be given, our approach only uses a set of actions. Such action sets provide much less supervision since neither action ordering nor the number of action occurrences are known. In exchange, they can be easily obtained, for instance, from meta-tags, while ordered sequences still require human annotation. We introduce a system that automatically learns to temporally segment and label actions in a video, where the only supervision that is used are action sets. An evaluation on three datasets shows that our method still achieves good results although the amount of supervision is significantly smaller than for other related methods.Comment: CVPR 201

    Spatio-temporal human action detection and instance segmentation in videos

    Get PDF
    With an exponential growth in the number of video capturing devices and digital video content, automatic video understanding is now at the forefront of computer vision research. This thesis presents a series of models for automatic human action detection in videos and also addresses the space-time action instance segmentation problem. Both action detection and instance segmentation play vital roles in video understanding. Firstly, we propose a novel human action detection approach based on a frame-level deep feature representation combined with a two-pass dynamic programming approach. The method obtains a frame-level action representation by leveraging recent advances in deep learning based action recognition and object detection methods. To combine the the complementary appearance and motion cues, we introduce a new fusion technique which signicantly improves the detection performance. Further, we cast the temporal action detection as two energy optimisation problems which are solved using Viterbi algorithm. Exploiting a video-level representation further allows the network to learn the inter-frame temporal correspondence between action regions and it is bound to be a more optimal solution to the action detection problem than a frame-level representation. Secondly, we propose a novel deep network architecture which learns a video-level action representation by classifying and regressing 3D region proposals spanning two successive video frames. The proposed model is end-to-end trainable and can be jointly optimised for both proposal generation and action detection objectives in a single training step. We name our new network as \AMTnet" (Action Micro-Tube regression Network). We further extend the AMTnet model by incorporating optical ow features to encode motion patterns of actions. Finally, we address the problem of action instance segmentation in which multiple concurrent actions of the same class may be segmented out of an image sequence. By taking advantage of recent work on action foreground-background segmentation, we are able to associate each action tube with class-specic segmentations. We demonstrate the performance of our proposed models on challenging action detection benchmarks achieving new state-of-the-art results across the board and signicantly increasing detection speed at test time

    Online real-time multiple spatiotemporal action localisation and prediction

    Get PDF
    We present a deep-learning framework for real-time multiple spatio-temporal (S/T) action localisation, classification and early prediction. Current state-of-the-art approaches work offline, and are too slow be useful in realworld settings. To overcome their limitations we introduce two major developments. Firstly, we adopt real-time SSD (Single Shot MultiBox Detector) convolutional neural networks to regress and classify detection boxes in each video frame potentially containing an action of interest. Secondly, we design an original and efficient online algorithm to incrementally construct and label ‘action tubes’ from the SSD frame level detections. As a result, our system is not only capable of performing S/T detection in real time, but can also perform early action prediction in an online fashion. We achieve new state-of-the-art results in both S/T action localisation and early action prediction on the challenging UCF101-24 and J-HMDB-21 benchmarks, even when compared to the top offline competitors. To the best of our knowledge, ours is the first real-time (up to 40fps) system able to perform online S/T action localisation and early action prediction on the untrimmed videos of UCF101-24
    corecore