3,990 research outputs found

    Automatic Action Annotation in Weakly Labeled Videos

    Full text link
    Manual spatio-temporal annotation of human action in videos is laborious, requires several annotators and contains human biases. In this paper, we present a weakly supervised approach to automatically obtain spatio-temporal annotations of an actor in action videos. We first obtain a large number of action proposals in each video. To capture a few most representative action proposals in each video and evade processing thousands of them, we rank them using optical flow and saliency in a 3D-MRF based framework and select a few proposals using MAP based proposal subset selection method. We demonstrate that this ranking preserves the high quality action proposals. Several such proposals are generated for each video of the same action. Our next challenge is to iteratively select one proposal from each video so that all proposals are globally consistent. We formulate this as Generalized Maximum Clique Graph problem using shape, global and fine grained similarity of proposals across the videos. The output of our method is the most action representative proposals from each video. Our method can also annotate multiple instances of the same action in a video. We have validated our approach on three challenging action datasets: UCF Sport, sub-JHMDB and THUMOS'13 and have obtained promising results compared to several baseline methods. Moreover, on UCF Sports, we demonstrate that action classifiers trained on these automatically obtained spatio-temporal annotations have comparable performance to the classifiers trained on ground truth annotation

    Saliency-guided video classification via adaptively weighted learning

    Full text link
    Video classification is productive in many practical applications, and the recent deep learning has greatly improved its accuracy. However, existing works often model video frames indiscriminately, but from the view of motion, video frames can be decomposed into salient and non-salient areas naturally. Salient and non-salient areas should be modeled with different networks, for the former present both appearance and motion information, and the latter present static background information. To address this problem, in this paper, video saliency is predicted by optical flow without supervision firstly. Then two streams of 3D CNN are trained individually for raw frames and optical flow on salient areas, and another 2D CNN is trained for raw frames on non-salient areas. For the reason that these three streams play different roles for each class, the weights of each stream are adaptively learned for each class. Experimental results show that saliency-guided modeling and adaptively weighted learning can reinforce each other, and we achieve the state-of-the-art results.Comment: 6 pages, 1 figure, accepted by ICME 201

    Video Object Detection with an Aligned Spatial-Temporal Memory

    Full text link
    We introduce Spatial-Temporal Memory Networks for video object detection. At its core, a novel Spatial-Temporal Memory module (STMM) serves as the recurrent computation unit to model long-term temporal appearance and motion dynamics. The STMM's design enables full integration of pretrained backbone CNN weights, which we find to be critical for accurate detection. Furthermore, in order to tackle object motion in videos, we propose a novel MatchTrans module to align the spatial-temporal memory from frame to frame. Our method produces state-of-the-art results on the benchmark ImageNet VID dataset, and our ablative studies clearly demonstrate the contribution of our different design choices. We release our code and models at http://fanyix.cs.ucdavis.edu/project/stmn/project.html
    • …
    corecore