10,449 research outputs found
SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
In this paper, we introduce SoccerNet, a benchmark for action spotting in
soccer videos. The dataset is composed of 500 complete soccer games from six
main European leagues, covering three seasons from 2014 to 2017 and a total
duration of 764 hours. A total of 6,637 temporal annotations are automatically
parsed from online match reports at a one minute resolution for three main
classes of events (Goal, Yellow/Red Card, and Substitution). As such, the
dataset is easily scalable. These annotations are manually refined to a one
second resolution by anchoring them at a single timestamp following
well-defined soccer rules. With an average of one event every 6.9 minutes, this
dataset focuses on the problem of localizing very sparse events within long
videos. We define the task of spotting as finding the anchors of soccer events
in a video. Making use of recent developments in the realm of generic action
recognition and detection in video, we provide strong baselines for detecting
soccer events. We show that our best model for classifying temporal segments of
length one minute reaches a mean Average Precision (mAP) of 67.8%. For the
spotting task, our baseline reaches an Average-mAP of 49.7% for tolerances
ranging from 5 to 60 seconds. Our dataset and models are available at
https://silviogiancola.github.io/SoccerNet.Comment: CVPR Workshop on Computer Vision in Sports 201
Multi-Context Attention for Human Pose Estimation
In this paper, we propose to incorporate convolutional neural networks with a
multi-context attention mechanism into an end-to-end framework for human pose
estimation. We adopt stacked hourglass networks to generate attention maps from
features at multiple resolutions with various semantics. The Conditional Random
Field (CRF) is utilized to model the correlations among neighboring regions in
the attention map. We further combine the holistic attention model, which
focuses on the global consistency of the full human body, and the body part
attention model, which focuses on the detailed description for different body
parts. Hence our model has the ability to focus on different granularity from
local salient regions to global semantic-consistent spaces. Additionally, we
design novel Hourglass Residual Units (HRUs) to increase the receptive field of
the network. These units are extensions of residual units with a side branch
incorporating filters with larger receptive fields, hence features with various
scales are learned and combined within the HRUs. The effectiveness of the
proposed multi-context attention mechanism and the hourglass residual units is
evaluated on two widely used human pose estimation benchmarks. Our approach
outperforms all existing methods on both benchmarks over all the body parts.Comment: The first two authors contribute equally to this wor
Social Scene Understanding: End-to-End Multi-Person Action Localization and Collective Activity Recognition
We present a unified framework for understanding human social behaviors in
raw image sequences. Our model jointly detects multiple individuals, infers
their social actions, and estimates the collective actions with a single
feed-forward pass through a neural network. We propose a single architecture
that does not rely on external detection algorithms but rather is trained
end-to-end to generate dense proposal maps that are refined via a novel
inference scheme. The temporal consistency is handled via a person-level
matching Recurrent Neural Network. The complete model takes as input a sequence
of frames and outputs detections along with the estimates of individual actions
and collective activities. We demonstrate state-of-the-art performance of our
algorithm on multiple publicly available benchmarks
DeepPose: Human Pose Estimation via Deep Neural Networks
We propose a method for human pose estimation based on Deep Neural Networks
(DNNs). The pose estimation is formulated as a DNN-based regression problem
towards body joints. We present a cascade of such DNN regressors which results
in high precision pose estimates. The approach has the advantage of reasoning
about pose in a holistic fashion and has a simple but yet powerful formulation
which capitalizes on recent advances in Deep Learning. We present a detailed
empirical analysis with state-of-art or better performance on four academic
benchmarks of diverse real-world images.Comment: IEEE Conference on Computer Vision and Pattern Recognition, 201
- …