43 research outputs found
Localizing Actions from Video Labels and Pseudo-Annotations
The goal of this paper is to determine the spatio-temporal location of
actions in video. Where training from hard to obtain box annotations is the
norm, we propose an intuitive and effective algorithm that localizes actions
from their class label only. We are inspired by recent work showing that
unsupervised action proposals selected with human point-supervision perform as
well as using expensive box annotations. Rather than asking users to provide
point supervision, we propose fully automatic visual cues that replace manual
point annotations. We call the cues pseudo-annotations, introduce five of them,
and propose a correlation metric for automatically selecting and combining
them. Thorough evaluation on challenging action localization datasets shows
that we reach results comparable to results with full box supervision. We also
show that pseudo-annotations can be leveraged during testing to improve weakly-
and strongly-supervised localizers.Comment: BMV
Unsupervised Action Proposal Ranking through Proposal Recombination
Recently, action proposal methods have played an important role in action
recognition tasks, as they reduce the search space dramatically. Most
unsupervised action proposal methods tend to generate hundreds of action
proposals which include many noisy, inconsistent, and unranked action
proposals, while supervised action proposal methods take advantage of
predefined object detectors (e.g., human detector) to refine and score the
action proposals, but they require thousands of manual annotations to train.
Given the action proposals in a video, the goal of the proposed work is to
generate a few better action proposals that are ranked properly. In our
approach, we first divide action proposal into sub-proposal and then use
Dynamic Programming based graph optimization scheme to select the optimal
combinations of sub-proposals from different proposals and assign each new
proposal a score. We propose a new unsupervised image-based actioness detector
that leverages web images and employs it as one of the node scores in our graph
formulation. Moreover, we capture motion information by estimating the number
of motion contours within each action proposal patch. The proposed method is an
unsupervised method that neither needs bounding box annotations nor video level
labels, which is desirable with the current explosion of large-scale action
datasets. Our approach is generic and does not depend on a specific action
proposal method. We evaluate our approach on several publicly available trimmed
and un-trimmed datasets and obtain better performance compared to several
proposal ranking methods. In addition, we demonstrate that properly ranked
proposals produce significantly better action detection as compared to
state-of-the-art proposal based methods
Zero-Annotation Object Detection with Web Knowledge Transfer
Object detection is one of the major problems in computer vision, and has
been extensively studied. Most of the existing detection works rely on
labor-intensive supervision, such as ground truth bounding boxes of objects or
at least image-level annotations. On the contrary, we propose an object
detection method that does not require any form of human annotation on target
tasks, by exploiting freely available web images. In order to facilitate
effective knowledge transfer from web images, we introduce a multi-instance
multi-label domain adaption learning framework with two key innovations. First
of all, we propose an instance-level adversarial domain adaptation network with
attention on foreground objects to transfer the object appearances from web
domain to target domain. Second, to preserve the class-specific semantic
structure of transferred object features, we propose a simultaneous transfer
mechanism to transfer the supervision across domains through pseudo strong
label generation. With our end-to-end framework that simultaneously learns a
weakly supervised detector and transfers knowledge across domains, we achieved
significant improvements over baseline methods on the benchmark datasets.Comment: Accepted in ECCV 201
Weakly-Supervised Temporal Localization via Occurrence Count Learning
We propose a novel model for temporal detection and localization which allows
the training of deep neural networks using only counts of event occurrences as
training labels. This powerful weakly-supervised framework alleviates the
burden of the imprecise and time-consuming process of annotating event
locations in temporal data. Unlike existing methods, in which localization is
explicitly achieved by design, our model learns localization implicitly as a
byproduct of learning to count instances. This unique feature is a direct
consequence of the model's theoretical properties. We validate the
effectiveness of our approach in a number of experiments (drum hit and piano
onset detection in audio, digit detection in images) and demonstrate
performance comparable to that of fully-supervised state-of-the-art methods,
despite much weaker training requirements.Comment: Accepted at ICML 201
Learning without Prejudice: Avoiding Bias in Webly-Supervised Action Recognition
Webly-supervised learning has recently emerged as an alternative paradigm to
traditional supervised learning based on large-scale datasets with manual
annotations. The key idea is that models such as CNNs can be learned from the
noisy visual data available on the web. In this work we aim to exploit web data
for video understanding tasks such as action recognition and detection. One of
the main problems in webly-supervised learning is cleaning the noisy labeled
data from the web. The state-of-the-art paradigm relies on training a first
classifier on noisy data that is then used to clean the remaining dataset. Our
key insight is that this procedure biases the second classifier towards samples
that the first one understands. Here we train two independent CNNs, a RGB
network on web images and video frames and a second network using temporal
information from optical flow. We show that training the networks independently
is vastly superior to selecting the frames for the flow classifier by using our
RGB network. Moreover, we show benefits in enriching the training set with
different data sources from heterogeneous public web databases. We demonstrate
that our framework outperforms all other webly-supervised methods on two public
benchmarks, UCF-101 and Thumos'14.Comment: Submitted to CVIU SI: Computer Vision and the We
Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos
Deep learning has been demonstrated to achieve excellent results for image
classification and object detection. However, the impact of deep learning on
video analysis (e.g. action detection and recognition) has been limited due to
complexity of video data and lack of annotations. Previous convolutional neural
networks (CNN) based video action detection approaches usually consist of two
major steps: frame-level action proposal detection and association of proposals
across frames. Also, these methods employ two-stream CNN framework to handle
spatial and temporal feature separately. In this paper, we propose an
end-to-end deep network called Tube Convolutional Neural Network (T-CNN) for
action detection in videos. The proposed architecture is a unified network that
is able to recognize and localize action based on 3D convolution features. A
video is first divided into equal length clips and for each clip a set of tube
proposals are generated next based on 3D Convolutional Network (ConvNet)
features. Finally, the tube proposals of different clips are linked together
employing network flow and spatio-temporal action detection is performed using
these linked video proposals. Extensive experiments on several video datasets
demonstrate the superior performance of T-CNN for classifying and localizing
actions in both trimmed and untrimmed videos compared to state-of-the-arts