UntrimmedNets for Weakly Supervised Action Recognition and Detection
Current action recognition methods heavily rely on trimmed videos for model
training. However, it is expensive and time-consuming to acquire a large-scale
trimmed video dataset. This paper presents a new weakly supervised
architecture, called UntrimmedNet, which is able to directly learn action
recognition models from untrimmed videos without the requirement of temporal
annotations of action instances. Our UntrimmedNet couples two important
components, the classification module and the selection module, to learn the
action models and reason about the temporal duration of action instances,
respectively. These two components are implemented with feed-forward networks,
and UntrimmedNet is therefore an end-to-end trainable architecture. We exploit
the learned models for action recognition (WSR) and detection (WSD) on the
untrimmed video datasets of THUMOS14 and ActivityNet. Although our
UntrimmedNet only employs weak supervision, it achieves performance superior
or comparable to that of strongly supervised approaches on these two datasets.
Comment: camera-ready version to appear in CVPR 2017
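The coupling of the two modules can be illustrated with a minimal numpy sketch of the soft-selection variant; the shapes and the plain softmax attention over clip proposals are our illustrative assumptions, not the paper's full deep architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def untrimmednet_video_scores(clip_scores, selection_logits):
    """Combine per-clip class scores with soft selection weights.

    clip_scores:      (num_clips, num_classes) class logits per clip proposal.
    selection_logits: (num_clips,) relevance of each clip to the action.
    Returns video-level class probabilities.
    """
    weights = softmax(selection_logits)                 # attention over clips
    video_logits = (weights[:, None] * clip_scores).sum(axis=0)
    return softmax(video_logits)
```

The selection weights let background clips contribute little to the video-level prediction, which is what allows training from untrimmed videos with only video-level labels.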
Underwater Fish Detection with Weak Multi-Domain Supervision
Given a sufficiently large training dataset, it is relatively easy to train a
modern convolutional neural network (CNN) as the required image classifier.
However, for the task of fish classification and/or fish detection, if a CNN
was trained to detect or classify particular fish species in particular
background habitats, the same CNN exhibits much lower accuracy when applied to
new/unseen fish species and/or fish habitats. Therefore, in practice, the CNN
needs to be continuously fine-tuned to improve its classification accuracy to
handle new project-specific fish species or habitats. In this work we present a
labelling-efficient method of training a CNN-based fish detector (the Xception
CNN was used as the base) on a relatively small number (4,000) of project-domain
underwater fish/no-fish images from 20 different habitats. Additionally, 17,000
known negative (that is, fish-free) general-domain (VOC2012) above-water images
were used. Two publicly available fish-domain datasets supplied an additional
27,000 above-water and underwater positive (fish) images. By using
this multi-domain collection of images, the trained Xception-based binary
(fish/not-fish) classifier achieved 0.17% false-positives and 0.61%
false-negatives on the project's 20,000 negative and 16,000 positive holdout
test images, respectively. The area under the ROC curve (AUC) was 99.94%.
Comment: Published in the 2019 International Joint Conference on Neural
Networks (IJCNN-2019), Budapest, Hungary, July 14-19, 2019,
https://www.ijcnn.org/ , https://ieeexplore.ieee.org/document/885190
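As a sanity check, the reported error rates follow directly from the holdout counts; the helper below (function name is ours, purely illustrative) recovers the percentages from raw counts:

```python
def error_rates(num_false_pos, num_negatives, num_false_neg, num_positives):
    """False-positive and false-negative rates, as percentages."""
    fpr = 100.0 * num_false_pos / num_negatives
    fnr = 100.0 * num_false_neg / num_positives
    return fpr, fnr

# 0.17% of 20,000 negatives is 34 images; 0.61% of 16,000 positives is ~98.
fpr, fnr = error_rates(34, 20000, 98, 16000)
```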
SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
In this paper, we introduce SoccerNet, a benchmark for action spotting in
soccer videos. The dataset is composed of 500 complete soccer games from six
main European leagues, covering three seasons from 2014 to 2017 and a total
duration of 764 hours. A total of 6,637 temporal annotations are automatically
parsed from online match reports at a one minute resolution for three main
classes of events (Goal, Yellow/Red Card, and Substitution). As such, the
dataset is easily scalable. These annotations are manually refined to a one
second resolution by anchoring them at a single timestamp following
well-defined soccer rules. With an average of one event every 6.9 minutes, this
dataset focuses on the problem of localizing very sparse events within long
videos. We define the task of spotting as finding the anchors of soccer events
in a video. Making use of recent developments in the realm of generic action
recognition and detection in video, we provide strong baselines for detecting
soccer events. We show that our best model for classifying temporal segments of
length one minute reaches a mean Average Precision (mAP) of 67.8%. For the
spotting task, our baseline reaches an Average-mAP of 49.7% for tolerances
ranging from 5 to 60 seconds. Our dataset and models are available at
https://silviogiancola.github.io/SoccerNet.
Comment: CVPR Workshop on Computer Vision in Sports 2018
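The spotting criterion, a prediction counting as correct when it lands within a tolerance of a ground-truth anchor, can be sketched as a greedy matcher. The one-to-one matching rule here is our simplifying assumption; the full Average-mAP additionally ranks predictions by confidence and averages over tolerances from 5 to 60 seconds:

```python
def spotting_hits(predictions, ground_truth, tolerance):
    """Greedy one-to-one matching of predicted spots to ground-truth anchors.

    predictions, ground_truth: lists of event timestamps in seconds.
    A prediction is a hit if it falls within `tolerance` seconds of a
    not-yet-matched ground-truth anchor.
    """
    unmatched = sorted(ground_truth)
    hits = 0
    for p in sorted(predictions):
        for g in unmatched:
            if abs(p - g) <= tolerance:
                hits += 1
                unmatched.remove(g)   # each anchor can be matched only once
                break
    return hits
```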
Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks
Shot boundary detection (SBD) is an important component of many video
analysis tasks, such as action recognition, video indexing, summarization and
editing. Previous work typically used a combination of low-level features like
color histograms, in conjunction with simple models such as SVMs. Instead, we
propose to learn shot detection end-to-end, from pixels to final shot
boundaries. For training such a model, we rely on our insight that all shot
boundaries are generated. Thus, we create a dataset with one million frames and
automatically generated transitions such as cuts, dissolves and fades. In order
to efficiently analyze hours of videos, we propose a Convolutional Neural
Network (CNN) which is fully convolutional in time, thus allowing the use of a
large temporal context without repeatedly processing frames. With this
architecture our method obtains state-of-the-art results while running at an
unprecedented speed of more than 120x real-time.
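The insight that shot boundaries are generated, and can therefore be synthesised for training, can be sketched as follows; the frame counts and the linear dissolve are our illustrative assumptions:

```python
import numpy as np

def make_transition(clip_a, clip_b, kind="cut", length=8):
    """Synthesise a labelled shot transition between two clips.

    clip_a, clip_b: (frames, H, W, C) float arrays in [0, 1].
    Returns the joined clip and the frame index where the transition starts,
    which serves as the training label for the boundary detector.
    """
    if kind == "cut":
        # Hard cut: simple concatenation, boundary at the join.
        return np.concatenate([clip_a, clip_b]), len(clip_a)
    if kind == "dissolve":
        # Cross-fade the last `length` frames of A into the first of B.
        alphas = np.linspace(0.0, 1.0, length)[:, None, None, None]
        blend = (1 - alphas) * clip_a[-length:] + alphas * clip_b[:length]
        joined = np.concatenate([clip_a[:-length], blend, clip_b[length:]])
        return joined, len(clip_a) - length
    raise ValueError(kind)
```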
Weakly-Supervised Dense Action Anticipation
Dense anticipation aims to forecast future actions and their durations for
long horizons. Existing approaches rely on fully-labelled data, i.e. sequences
labelled with all future actions and their durations. We present a (semi-)
weakly supervised method using only a small number of fully-labelled sequences
and predominantly sequences in which only the (one) upcoming action is
labelled. To this end, we propose a framework that generates pseudo-labels for
future actions and their durations and adaptively refines them through a
refinement module. Given only the upcoming action label as input, these
pseudo-labels guide action/duration prediction for the future. We further
design an attention mechanism to predict context-aware durations. Experiments
on the Breakfast and 50Salads benchmarks verify our method's effectiveness; we
are competitive even when compared to fully supervised state-of-the-art models.
We will make our code available at:
https://github.com/zhanghaotong1/WSLVideoDenseAnticipation.
Comment: BMVC 2021
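The pseudo-labelling idea can be sketched generically; truncating at the first low-confidence step is one simple refinement rule chosen here for illustration, whereas the paper's refinement module is learned:

```python
def refine_pseudo_labels(pred_actions, pred_durations, confidences,
                         threshold=0.8):
    """Keep only confident future-step predictions as pseudo-labels.

    pred_actions:   predicted future action labels, one per step.
    pred_durations: predicted durations, same length.
    confidences:    model confidence per future step.
    Returns (actions, durations) truncated at the first low-confidence step,
    so pseudo-supervision never extends past a point the model is unsure of.
    """
    actions, durations = [], []
    for a, d, c in zip(pred_actions, pred_durations, confidences):
        if c < threshold:
            break
        actions.append(a)
        durations.append(d)
    return actions, durations
```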
Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective
This paper takes a problem-oriented perspective and presents a comprehensive
review of transfer learning methods, both shallow and deep, for cross-dataset
visual recognition. Specifically, it categorises the cross-dataset recognition
into seventeen problems based on a set of carefully chosen data and label
attributes. Such a problem-oriented taxonomy has allowed us to examine how
different transfer learning approaches tackle each problem and how well each
problem has been researched to date. This comprehensive problem-oriented
review of advances in transfer learning has revealed not only the challenges
in transfer learning for visual recognition, but also which problems (eight of
the seventeen) have been scarcely studied. This survey thus offers both an
up-to-date technical review for researchers and a systematic reference that
lets a machine learning practitioner categorise a real problem and look up a
possible solution accordingly.
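A toy sketch of the attribute-driven categorisation the survey proposes; the three binary attributes and four coarse families below are our simplification, not the survey's actual seventeen-problem taxonomy:

```python
def categorise_transfer_problem(labelled_target, same_label_space,
                                same_feature_space):
    """Map data/label attributes to a coarse transfer-learning family.

    Purely illustrative: the survey distinguishes seventeen fine-grained
    problems; this lookup shows only how attribute combinations index them.
    """
    if same_feature_space and same_label_space:
        if labelled_target:
            return "supervised domain adaptation"
        return "unsupervised domain adaptation"
    if same_feature_space and not same_label_space:
        return "zero/few-shot or partial transfer"
    return "heterogeneous transfer learning"
```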
TraMNet - Transition Matrix Network for Efficient Action Tube Proposals
Current state-of-the-art methods solve spatiotemporal action localisation by
extending 2D anchors to 3D-cuboid proposals on stacks of frames, to generate
sets of temporally connected bounding boxes called "action micro-tubes".
However, they fail to consider that the underlying anchor proposal hypotheses
should also move (transition) from frame to frame, as the actor or the camera
does. Assuming we evaluate n 2D anchors in each frame, the number of possible
transitions from each 2D anchor to the next, for a sequence of f consecutive
frames, is in the order of n^f, expensive even for small values of f. To avoid
this problem, we introduce a Transition-Matrix-based
Network (TraMNet) which relies on computing transition probabilities between
anchor proposals while maximising their overlap with ground truth bounding
boxes across frames, and enforcing sparsity via a transition threshold. As the
resulting transition matrix is sparse and stochastic, this reduces the proposal
hypothesis search space from n^f to the cardinality of the thresholded
matrix. At training time, transitions are specific to cell locations of the
feature maps, so that a sparse (efficient) transition matrix is used to train
the network. At test time, a denser transition matrix can be obtained either by
decreasing the threshold or by adding to it all the relative transitions
originating from any cell location, allowing the network to handle transitions
in the test data that might not have been present in the training data, and
making detection translation-invariant. Finally, we show that our network can
handle sparse annotations such as those available in the DALY dataset. We
report extensive experiments on the DALY, UCF101-24 and Transformed-UCF101-24
datasets to support our claims.
Comment: 15 pages
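The sparsification step can be sketched as follows; the transition counts are hypothetical, and for simplicity each row is assumed to keep at least one above-threshold entry:

```python
import numpy as np

def transition_matrix(counts, threshold=0.05):
    """Row-stochastic transition matrix between anchors, then sparsified.

    counts[i, j]: how often the best ground-truth match moved from anchor i
    in one frame to anchor j in the next. Entries below `threshold` are
    zeroed and rows renormalised, shrinking the hypothesis space from n**f
    to the number of surviving transitions.
    """
    probs = counts / counts.sum(axis=1, keepdims=True)   # normalise rows
    probs[probs < threshold] = 0.0                       # enforce sparsity
    probs /= probs.sum(axis=1, keepdims=True)            # re-normalise
    return probs
```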