1,208 research outputs found
Unsupervised video summarization framework using keyframe extraction and video skimming
Video is one of the robust sources of information and the consumption of
online and offline videos has reached an unprecedented level in the last few
years. A fundamental challenge of extracting information from videos is that a
viewer has to go through the complete video to understand the context, as
opposed to an image where the viewer can extract information from a single
frame. Apart from context understanding, it is almost impossible to create a
universal summarized video for everyone, as everyone has their own bias toward
keyframes. For example, in a soccer game, a coach might consider frames
that contain information on player placement, techniques, etc.; however, a
person with less knowledge of soccer will focus more on frames that
contain goals and the scoreboard. Therefore, if we were to tackle the problem of video
summarization through a supervised learning path, it would require extensive
personalized labeling of data. In this paper, we attempt to solve video
summarization through unsupervised learning by employing traditional
vision-based algorithmic methodologies for accurate feature extraction from
video frames. We also propose deep learning-based feature extraction
followed by multiple clustering methods to find an effective way of summarizing
a video by extracting interesting keyframes. We compare the performance
of these approaches on the SumMe dataset and show that deep
learning-based feature extraction performs better on
dynamic-viewpoint videos.
Comment: 5 pages, 3 figures. Technical Report
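The abstract above describes extracting a feature vector per frame and then clustering to pick keyframes. A minimal sketch of that idea follows, assuming plain k-means with one representative frame per cluster; the abstract does not say which clustering methods were used, and the function names, the evenly spaced initialization, and the toy features are all illustrative assumptions rather than the authors' implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the given feature vector."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(feature, centroids[c]))


def kmeans(features, k, iterations=20):
    """Plain k-means. Assumption: initial centroids are frames evenly
    spaced in time, which keeps the sketch deterministic."""
    n = len(features)
    centroids = [list(features[i * n // k]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[nearest_centroid(f, centroids)].append(f)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def select_keyframes(features, k):
    """Return, for each non-empty cluster, the index of the frame
    closest to the cluster centroid, in temporal order."""
    centroids = kmeans(features, k)
    assignments = [nearest_centroid(f, centroids) for f in features]
    keyframes = []
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignments) if a == c]
        if members:
            keyframes.append(
                min(members, key=lambda i: squared_distance(features[i], centroid)))
    return sorted(keyframes)


# Toy 2-D "features" for six frames forming two visually distinct scenes.
features = [(0, 0), (1, 0), (2, 0), (10, 10), (11, 10), (12, 10)]
print(select_keyframes(features, k=2))
```

In practice the feature vectors would come from a hand-crafted descriptor or a pretrained network, and the number of clusters would control the summary length.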
Indirect Match Highlights Detection with Deep Convolutional Neural Networks
Highlights in a sports video are usually referred to as actions that stimulate
excitement or attract the attention of the audience. Considerable effort is spent
designing techniques that automatically find highlights, in order to
automate the otherwise manual editing process. Most state-of-the-art
approaches try to solve the problem by training a classifier using
information extracted from the TV-like framing of players on the game
pitch, learning to detect game actions that are labeled by human observers
according to their perception of a highlight. Obviously, this is long and
expensive work. In this paper, we reverse the paradigm: instead of looking at
the gameplay, inferring what could be exciting for the audience, we directly
analyze the audience behavior, which we assume is triggered by events happening
during the game. We apply a deep 3D Convolutional Neural Network (3D-CNN) to
extract visual features from cropped video recordings of the supporters that
are attending the event. Outputs of the crops belonging to the same frame are
then accumulated to produce a value indicating the Highlight Likelihood (HL)
which is then used to discriminate between positive (i.e. when a highlight
occurs) and negative samples (i.e. standard play or time-outs). Experimental
results on a public dataset of ice-hockey matches demonstrate the effectiveness
of our method and promote further research in this new exciting direction.
Comment: "Social Signal Processing and Beyond" workshop, in conjunction with
ICIAP 201
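The accumulation step in the abstract above (per-crop outputs combined into a per-frame Highlight Likelihood, which is then used to separate positives from negatives) can be sketched as follows. This is only an illustration: the abstract does not state how the outputs are accumulated or how the decision is made, so the use of a mean, the threshold value, and the function names here are assumptions.

```python
from collections import defaultdict


def highlight_likelihood(crop_scores):
    """Accumulate per-crop scores into one value per frame.

    crop_scores: iterable of (frame_index, score) pairs, one per
    audience crop. Assumption: accumulation is a simple mean.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for frame, score in crop_scores:
        totals[frame] += score
        counts[frame] += 1
    return {frame: totals[frame] / counts[frame] for frame in sorted(totals)}


def detect_highlights(crop_scores, threshold=0.5):
    """Frames whose accumulated likelihood exceeds a (hypothetical) threshold."""
    hl = highlight_likelihood(crop_scores)
    return [frame for frame, value in hl.items() if value > threshold]


# Toy scores standing in for 3D-CNN outputs on audience crops.
scores = [(0, 0.1), (0, 0.3), (1, 0.9), (1, 0.7), (2, 0.4)]
print(highlight_likelihood(scores))
print(detect_highlights(scores))
```

Averaging keeps the value comparable across frames with different numbers of crops; a sum or a maximum would be equally plausible readings of "accumulated".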
SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
In this paper, we introduce SoccerNet, a benchmark for action spotting in
soccer videos. The dataset is composed of 500 complete soccer games from six
main European leagues, covering three seasons from 2014 to 2017 and a total
duration of 764 hours. A total of 6,637 temporal annotations are automatically
parsed from online match reports at a one-minute resolution for three main
classes of events (Goal, Yellow/Red Card, and Substitution). As such, the
dataset is easily scalable. These annotations are manually refined to a
one-second resolution by anchoring them at a single timestamp following
well-defined soccer rules. With an average of one event every 6.9 minutes, this
dataset focuses on the problem of localizing very sparse events within long
videos. We define the task of spotting as finding the anchors of soccer events
in a video. Making use of recent developments in the realm of generic action
recognition and detection in video, we provide strong baselines for detecting
soccer events. We show that our best model for classifying temporal segments of
length one minute reaches a mean Average Precision (mAP) of 67.8%. For the
spotting task, our baseline reaches an Average-mAP of 49.7% for tolerances
ranging from 5 to 60 seconds. Our dataset and models are available at
https://silviogiancola.github.io/SoccerNet.
Comment: CVPR Workshop on Computer Vision in Sports 201
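To make the spotting task in the abstract above concrete, the sketch below checks the stated event density (764 hours over 6,637 annotations) and scores predicted timestamps against ground-truth anchors at a given tolerance. It is not the benchmark's official Average-mAP computation: the greedy one-to-one matching, the rule that a prediction counts when it lies within `tolerance` seconds of an anchor, and all function names are assumptions made for illustration.

```python
def match_spots(predictions, anchors, tolerance):
    """Greedily match each predicted timestamp (seconds) to the closest
    unmatched anchor; count it as a hit if within the tolerance."""
    unmatched = sorted(anchors)
    hits = 0
    for p in sorted(predictions):
        best = min(unmatched, key=lambda a: abs(a - p), default=None)
        if best is not None and abs(best - p) <= tolerance:
            unmatched.remove(best)  # each anchor can be matched only once
            hits += 1
    return hits


def precision_recall(predictions, anchors, tolerance):
    """Precision and recall of spotted events at one tolerance."""
    hits = match_spots(predictions, anchors, tolerance)
    precision = hits / len(predictions) if predictions else 0.0
    recall = hits / len(anchors) if anchors else 0.0
    return precision, recall


# Event density implied by the figures in the abstract.
minutes_per_event = 764 * 60 / 6637
print(round(minutes_per_event, 1))

# Toy anchors and predictions, in seconds from kick-off.
anchors = [100, 500, 900]
predictions = [103, 480, 700]
for tolerance in (5, 60):
    print(tolerance, precision_recall(predictions, anchors, tolerance))
```

Sweeping the tolerance, as in the loop above, shows why results are reported over a range from 5 to 60 seconds: a looser tolerance accepts more predictions as correct.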