SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
In this paper, we introduce SoccerNet, a benchmark for action spotting in
soccer videos. The dataset is composed of 500 complete soccer games from six
main European leagues, covering three seasons from 2014 to 2017 and a total
duration of 764 hours. A total of 6,637 temporal annotations are automatically
parsed from online match reports at a one-minute resolution for three main
classes of events (Goal, Yellow/Red Card, and Substitution). As such, the
dataset is easily scalable. These annotations are manually refined to a
one-second resolution by anchoring them at a single timestamp following
well-defined soccer rules. With an average of one event every 6.9 minutes, this
dataset focuses on the problem of localizing very sparse events within long
videos. We define the task of spotting as finding the anchors of soccer events
in a video. Making use of recent developments in the realm of generic action
recognition and detection in video, we provide strong baselines for detecting
soccer events. We show that our best model for classifying temporal segments of
length one minute reaches a mean Average Precision (mAP) of 67.8%. For the
spotting task, our baseline reaches an Average-mAP of 49.7% for tolerances
ranging from 5 to 60 seconds. Our dataset and models are available at
https://silviogiancola.github.io/SoccerNet.
Comment: CVPR Workshop on Computer Vision in Sports 201
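The spotting evaluation described above can be sketched as a simple matching procedure: a prediction counts as a true positive if it lands within a given tolerance of a ground-truth anchor that has not already been matched. The function below is an illustrative reimplementation under our own names and conventions, not the SoccerNet reference code:

```python
def spot_matches(predictions, ground_truth, tolerance):
    """Greedily match predicted timestamps (highest confidence first) to
    ground-truth anchors within +/- tolerance seconds.

    predictions  -- list of (timestamp_s, confidence) tuples
    ground_truth -- list of anchor timestamps in seconds
    tolerance    -- matching window in seconds (e.g. 5 to 60)
    Returns a confidence-ordered list of booleans (true positive or not),
    suitable for feeding into a standard average-precision computation.
    """
    unmatched = set(range(len(ground_truth)))
    results = []
    for ts, _conf in sorted(predictions, key=lambda p: -p[1]):
        hit = None
        for i in unmatched:
            if abs(ground_truth[i] - ts) <= tolerance:
                hit = i
                break
        results.append(hit is not None)
        if hit is not None:
            unmatched.remove(hit)  # each anchor can be matched only once
    return results
```

Averaging the resulting AP over tolerances from 5 to 60 seconds gives the Average-mAP figure quoted above.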
Video summarisation: A conceptual framework and survey of the state of the art
This is the post-print (final draft post-refereeing) version of the article. Copyright © 2007 Elsevier Inc.
Video summaries provide condensed and succinct representations of the content of a video stream through a combination of still images, video segments, graphical representations and textual descriptors. This paper presents a conceptual framework for video summarisation derived from the research literature and used as a means for surveying that literature. The framework distinguishes between video summarisation techniques (the methods used to process content from a source video stream to achieve a summarisation of that stream) and video summaries (the outputs of video summarisation techniques). Video summarisation techniques are considered within three broad categories: internal (analysing information sourced directly from the video stream), external (analysing information not sourced directly from the video stream) and hybrid (analysing a combination of internal and external information). Video summaries are considered as a function of the type of content they are derived from (object, event, perception or feature based) and the functionality offered to the user for their consumption (interactive or static, personalised or generic). It is argued that video summarisation would benefit from greater incorporation of external information, particularly unobtrusively sourced user-based information, in order to overcome longstanding challenges such as the semantic gap and to provide video summaries that have greater relevance to individual users.
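A minimal example of an "internal" technique in the sense of this framework is keyframe selection driven only by the video stream itself, e.g. keeping a frame whenever its content has drifted far enough from the last kept keyframe. The sketch below works on stand-in per-frame feature vectors; all names and the threshold heuristic are illustrative, not drawn from the surveyed literature:

```python
import numpy as np

def keyframes_by_change(features, threshold):
    """Select keyframe indices from per-frame feature vectors: a frame is
    kept when its distance from the previously kept keyframe exceeds
    `threshold`. The first frame is always kept."""
    kept = [0]
    for i in range(1, len(features)):
        # Compare against the last *kept* frame, not the previous frame,
        # so slow drift eventually triggers a new keyframe.
        if np.linalg.norm(features[i] - features[kept[-1]]) > threshold:
            kept.append(i)
    return kept
```

An external or hybrid technique, by contrast, would fold in signals not derivable from the frames themselves, such as viewer interaction logs.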
Self-Supervised Multi-Modal Sequential Recommendation
With the increasing development of e-commerce and online services,
personalized recommendation systems have become crucial for enhancing user
satisfaction and driving business revenue. Traditional sequential
recommendation methods that rely on explicit item IDs encounter challenges in
handling item cold start and domain transfer problems. Recent approaches have
attempted to use modal features associated with items as a replacement for item
IDs, enabling the transfer of learned knowledge across different datasets.
However, these methods typically calculate the correlation between the model's
output and item embeddings, which may suffer from inconsistencies between
high-level feature vectors and low-level feature embeddings, thereby hindering
further model learning. To address this issue, we propose a dual-tower
retrieval architecture for sequence recommendation. In this architecture, the
predicted embedding from the user encoder is used to retrieve the generated
embedding from the item encoder, thereby alleviating the issue of inconsistent
feature levels. Moreover, in order to further improve the retrieval performance
of the model, we also propose a self-supervised multi-modal pretraining method
inspired by the consistency property of contrastive learning. This pretraining
method enables the model to align various feature combinations of items,
thereby effectively generalizing to diverse datasets with different item
features. We evaluate the proposed method on five publicly available datasets
and conduct extensive experiments. The results demonstrate that our method
achieves significant performance improvements.
Human Pose Driven Object Effects Recommendation
In this paper, we research the new topic of object effects recommendation in
micro-video platforms, which is a challenging but important task for many
practical applications such as advertisement insertion. To avoid the problem of
introducing background bias caused by directly learning video content from
image frames, we propose to utilize the meaningful body language hidden in 3D
human pose for recommendation. To this end, in this work, a novel human pose
driven object effects recommendation network termed PoseRec is introduced.
PoseRec leverages the advantages of 3D human pose detection and learns
information from multi-frame 3D human pose for video-item registration,
resulting in high quality object effects recommendation performance. Moreover,
to solve the inherent ambiguity and sparsity issues that exist in object
effects recommendation, we further propose a novel item-aware implicit
prototype learning module and a novel pose-aware transductive hard-negative
mining module to better learn pose-item relationships. What's more, to
benchmark methods for the new research topic, we build a new dataset for object
effects recommendation named Pose-OBE. Extensive experiments on Pose-OBE
demonstrate that our method can achieve superior performance than strong
baselines
- …