Efficient tracking of team sport players with few game-specific annotations
One of the requirements for team sports analysis is to track and recognize players. Many tracking and re-identification methods have been proposed in the context of video surveillance, and they show very convincing results when tested on public datasets such as the MOT challenge. However, the performance of these methods is less satisfactory when applied to player tracking. Indeed, in addition to moving very quickly and often being occluded, the players wear the same jersey, which makes re-identification very complex. Some recent tracking methods have been developed specifically for the team sport context, but due to the lack of public data, they rely on private datasets that make a direct comparison impossible. In this paper, we propose a new generic method to track team sport players during a full game using few human annotations collected via a semi-interactive system. Non-ambiguous tracklets and their appearance features are automatically generated with a detection network and a re-identification network, both pre-trained on public datasets. Then an incremental learning mechanism trains a Transformer to classify identities using few game-specific human annotations. Finally, tracklets are linked by an association algorithm. We demonstrate the efficiency of our approach on a challenging rugby sevens dataset. To overcome the lack of public sports tracking datasets, we publicly release this dataset at https://kalisteo.cea.fr/index.php/free-resources/. We also show that our method is able to track rugby sevens players during a full match, provided they are observable at a minimal resolution, with the annotation of only six tracklets of a few seconds each per player.
Comment: Accepted to the 2022 8th International Workshop on Computer Vision in Sports (CVsports 2022)
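The pipeline above ends with an association step that links non-ambiguous tracklets into full-match tracks. A minimal sketch of such a step, assuming cosine similarity on hypothetical per-tracklet appearance features and a greedy matching rule (the paper's actual association algorithm may differ):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two appearance feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def associate_tracklets(tracklets, threshold=0.7):
    """Greedily link each tracklet to a later, temporally non-overlapping
    tracklet whose appearance feature is most similar (illustrative
    simplification, not the paper's exact algorithm)."""
    links = []
    for i, t1 in enumerate(tracklets):
        best_j, best_sim = None, threshold
        for j, t2 in enumerate(tracklets):
            if j <= i or t1["end"] >= t2["start"]:
                continue  # only link forward, without temporal overlap
            sim = cosine_sim(t1["feat"], t2["feat"])
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            links.append((i, best_j))
    return links
```

Real systems typically solve this matching globally (e.g. with the Hungarian algorithm) rather than greedily; the sketch only illustrates the tracklet-linking idea.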
Classifying All Interacting Pairs in a Single Shot
In this paper, we introduce a novel human interaction detection approach,
based on CALIPSO (Classifying ALl Interacting Pairs in a Single shOt), a
classifier of human-object interactions. This new single-shot interaction
classifier estimates interactions simultaneously for all human-object pairs,
regardless of their number and class. State-of-the-art approaches adopt a
multi-shot strategy based on a pairwise estimate of interactions for a set of
human-object candidate pairs, which leads to a complexity depending, at least,
on the number of interactions or, at most, on the number of candidate pairs. In
contrast, the proposed method estimates the interactions on the whole image.
Indeed, it simultaneously estimates all interactions between all human subjects
and object targets by performing a single forward pass throughout the image.
Consequently, it leads to a constant complexity and computation time
independent of the number of subjects, objects or interactions in the image. In
detail, interaction classification is achieved on a dense grid of anchors
thanks to a joint multi-task network that learns three complementary tasks
simultaneously: (i) prediction of the types of interaction, (ii) estimation of
the presence of a target and (iii) learning of an embedding which maps
interacting subjects and targets to the same representation, using a metric
learning strategy. In addition, we introduce an object-centric passive-voice
verb estimation which significantly improves results. Evaluations on the two
well-known Human-Object Interaction image datasets, V-COCO and HICO-DET,
demonstrate the competitiveness of the proposed method (2nd place) compared to
the state-of-the-art while having constant computation time regardless of the
number of objects and interactions in the image.
Comment: WACV 2020 (to appear)
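The constant-complexity claim follows from the fact that a single forward pass produces all three task outputs at every anchor of the dense grid, regardless of how many pairs are present. A toy sketch, with illustrative array shapes and a hypothetical decoding rule that pairs a subject anchor with the nearest target embedding:

```python
import numpy as np

rng = np.random.default_rng(0)

H, W = 8, 8   # dense anchor grid (toy size)
K, D = 5, 16  # number of interaction classes, embedding dimension

# One "forward pass" yields all three task outputs at every anchor,
# so the cost does not depend on the number of subject/object pairs.
interaction_logits = rng.normal(size=(H, W, K))  # task (i): interaction types
target_presence    = rng.normal(size=(H, W))     # task (ii): target presence
embeddings         = rng.normal(size=(H, W, D))  # task (iii): pairing embedding

def match_subject_to_target(subj_yx, embeddings, presence):
    """Pair a subject anchor with the target anchor whose embedding is
    closest, among anchors predicted to contain a target (toy decoding,
    not the paper's exact procedure)."""
    subj = embeddings[subj_yx]
    dists = np.linalg.norm(embeddings - subj, axis=-1)
    dists[~(presence > 0.0)] = np.inf  # ignore anchors without a target
    return np.unravel_index(np.argmin(dists), dists.shape)
```

The metric-learning task (iii) is what makes this decoding possible: interacting subjects and targets are trained to share a representation, so nearest-neighbour lookup in embedding space recovers the pairing.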
Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning
Contrastive representation learning has proven to be an effective
self-supervised learning method for images and videos. Most successful
approaches are based on Noise Contrastive Estimation (NCE) and use different
views of an instance as positives that should be contrasted with other
instances, called negatives, that are considered as noise. However, several
instances in a dataset are drawn from the same distribution and share
underlying semantic information. A good data representation should capture the
relations between instances, i.e., semantic similarity and dissimilarity, which
contrastive learning harms by treating all negatives as noise. To circumvent
this issue, we propose a novel formulation of contrastive learning using
semantic similarity between instances called Similarity Contrastive Estimation
(SCE). Our training objective is a soft contrastive one that brings the
positives closer and estimates a continuous distribution to push or pull
negative instances based on their learned similarities. We empirically validate
our approach on both image and video representation learning. We show that SCE
performs competitively with the state of the art on the ImageNet linear
evaluation protocol for fewer pretraining epochs and that it generalizes to
several downstream image tasks. We also show that SCE reaches state-of-the-art
results for video representation pretraining and that the learned
representation generalizes to video downstream tasks.
Comment: Extended version of our WACV 2023 paper to video self-supervised learning
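The soft contrastive objective can be sketched as a cross-entropy against a relational target: the target distribution mixes the one-hot positive with a softened distribution over inter-instance similarities. Symbol names, temperatures, and the way inter-instance similarities are computed here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sce_loss(z1, z2, lam=0.5, tau=0.1, tau_t=0.07):
    """Toy Similarity Contrastive Estimation loss for one batch.
    z1, z2: L2-normalized embeddings of two views, shape (N, D).
    lam balances the one-hot positive against the relational term."""
    sim = z1 @ z2.T                    # cross-view pairwise similarities
    inter = z1 @ z1.T                  # within-view inter-instance similarities
    np.fill_diagonal(inter, -np.inf)   # mask the trivial self-similarity
    # Soft target: one-hot positive mixed with estimated relations.
    target = lam * np.eye(len(z1)) + (1 - lam) * softmax(inter / tau_t)
    logp = np.log(softmax(sim / tau))
    return float(-(target * logp).sum(axis=1).mean())
```

Because the target is a distribution rather than a hard label, negatives that are semantically close to the anchor are pushed away less strongly, which is the mechanism the abstract describes.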
COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers
We present COMEDIAN, a novel pipeline to initialize spatio-temporal
transformers for action spotting, which involves self-supervised learning and
knowledge distillation. Action spotting is a timestamp-level temporal action
detection task. Our pipeline consists of three steps, with two initialization
stages. First, we perform self-supervised initialization of a spatial
transformer using short videos as input. Additionally, we initialize a temporal
transformer that enhances the spatial transformer's outputs with global context
through knowledge distillation from a pre-computed feature bank aligned with
each short video segment. In the final step, we fine-tune the transformers to
the action spotting task. The experiments, conducted on the SoccerNet-v2
dataset, demonstrate state-of-the-art performance and validate the
effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several
advantages of our pretraining pipeline, including improved performance and
faster convergence compared to non-pretrained models.
Comment: Source code is available at https://github.com/juliendenize/eztorc
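The two initialization stages can be caricatured with stand-in encoders and a plain mean-squared error as the distillation objective; the actual architectures and loss are those of the paper, not of this sketch:

```python
import numpy as np

def spatial_encoder(clip):
    # Stand-in for the self-supervised spatial transformer (step 1):
    # maps a short clip of frame features to one segment feature.
    return clip.mean(axis=0)

def temporal_encoder(segment_feats):
    # Stand-in for the temporal transformer (step 2): enriches each
    # segment feature with context from preceding segments (running mean).
    counts = np.arange(1, len(segment_feats) + 1)[:, None]
    return segment_feats.cumsum(axis=0) / counts

def distillation_loss(student, bank):
    """Pull the temporal transformer's outputs towards the pre-computed
    feature bank aligned with each short segment (MSE is an
    illustrative choice of distillation objective)."""
    return float(((student - bank) ** 2).mean())
```

Step 3 of the pipeline, fine-tuning on action spotting, would then replace the distillation target with timestamp-level supervision.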
SoccerNet 2023 Challenges Results
The SoccerNet 2023 challenges were the third annual video understanding
challenges organized by the SoccerNet team. For this third edition, the
challenges were composed of seven vision-based tasks split into three main
themes. The first theme, broadcast video understanding, is composed of three
high-level tasks related to describing events occurring in the video
broadcasts: (1) action spotting, focusing on retrieving all timestamps related
to global actions in soccer, (2) ball action spotting, focusing on retrieving
all timestamps related to the soccer ball change of state, and (3) dense video
captioning, focusing on describing the broadcast with natural language and
anchored timestamps. The second theme, field understanding, relates to the
single task of (4) camera calibration, focusing on retrieving the intrinsic and
extrinsic camera parameters from images. The third and last theme, player
understanding, is composed of three low-level tasks related to extracting
information about the players: (5) re-identification, focusing on retrieving
the same players across multiple views, (6) multiple object tracking, focusing
on tracking players and the ball through unedited video streams, and (7) jersey
number recognition, focusing on recognizing the jersey number of players from
tracklets. Compared to the previous editions of the SoccerNet challenges, tasks
(2-3-7) are novel, including new annotations and data, task (4) was enhanced
with more data and annotations, and task (6) now focuses on end-to-end
approaches. More information on the tasks, challenges, and leaderboards is
available on https://www.soccer-net.org. Baselines and development kits can be
found on https://github.com/SoccerNet
Learning methods applied to vision-based human behaviour analysis
The analysis of human behavior by vision is a widely studied research topic. Despite the progress brought by deep learning in computer vision, finely understanding what is happening in a scene is a task far from solved, because it requires a very high semantic level. In this thesis we focus on two applications: the recognition of temporally long activities in videos and the detection of interactions in images. The first contribution of this work is the development of the first database of daily activities with high intra-class variability. The second contribution is a new method for interaction detection in a single shot over the image, which makes it much faster than state-of-the-art two-step methods that apply pairwise reasoning over instances. Finally, the third contribution of this thesis is the constitution of a new interaction dataset composed of interactions both between people and objects and between people, which did not exist until now and which enables an exhaustive analysis of human interactions. In order to provide baseline results on this new dataset, the previous interaction detection method has been improved with multi-task learning, which reaches the best results on the public dataset widely used by the community.
KaliCalib: A Framework for Basketball Court Registration
Tracking the players and the ball in team sports is key to analysing performance or to enhancing the game-watching experience with augmented reality. When the only sources for this data are broadcast videos, sports-field registration systems are required to estimate the homography and re-project the ball or the players from image space to field space. This paper describes a new basketball court registration framework in the context of the MMSports 2022 camera calibration challenge. The method is based on an encoder-decoder network that estimates the positions of keypoints sampled with perspective-aware constraints. The regression of the basket positions and heavy data augmentation techniques make the model robust to different arenas. Ablation studies show the positive effects of our contributions on the challenge test set. Our method divides the mean squared error by 4.7 compared to the challenge baseline.
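The re-projection step that motivates court registration is a plain homography application: once the 3x3 matrix H is estimated, pixel coordinates map to field coordinates through homogeneous coordinates. A minimal sketch:

```python
import numpy as np

def project_to_field(H, points_img):
    """Re-project image-space points to court coordinates with a
    homography H (3x3), as sports-field registration systems do.
    points_img: (N, 2) array of pixel coordinates."""
    # Lift to homogeneous coordinates, apply H, then divide by w.
    pts = np.hstack([points_img, np.ones((len(points_img), 1))])
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

The registration framework's job is to produce a reliable H per frame; the keypoints predicted by the encoder-decoder network provide the correspondences from which H is fitted.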