Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization
We propose a self-supervised method for learning motion-focused video
representations. Existing approaches minimize distances between temporally
augmented videos, which maintain high spatial similarity. We instead propose to
learn similarities between videos with identical local motion dynamics but an
otherwise different appearance. We do so by adding synthetic motion
trajectories to videos, which we refer to as tubelets. By simulating different
tubelet motions and applying transformations, such as scaling and rotation, we
introduce motion patterns beyond what is present in the pretraining data. This
allows us to learn a video representation that is remarkably data efficient:
our approach maintains performance when using only 25% of the pretraining
videos. Experiments on 10 diverse downstream settings demonstrate our
competitive performance and generalizability to new domains and fine-grained
actions. Code is available at https://github.com/fmthoker/tubelet-contrast
Comment: Accepted at ICCV 2023
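For illustration, the pairing idea described above (two clips that share the same synthetic tubelet motion but differ in appearance are treated as positives) can be written as a standard InfoNCE contrastive objective. The sketch below uses placeholder PyTorch embeddings and is not the authors' implementation; see the linked repository for that.

import torch
import torch.nn.functional as F

def tubelet_contrastive_loss(z_a, z_b, tau=0.07):
    # z_a[i], z_b[i]: embeddings of two clips sharing the same synthetic
    # tubelet motion but differing in appearance (positive pair); all other
    # pairs in the batch serve as negatives.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / tau            # (B, B) cosine-similarity logits
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random features standing in for a video encoder's output.
z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)
loss = tubelet_contrastive_loss(z_a, z_b)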
Unsupervised Learning of Video Representations via Dense Trajectory Clustering
This paper addresses the task of unsupervised learning of representations for
action recognition in videos. Previous works proposed to utilize future
prediction, or other domain-specific objectives to train a network, but
achieved only limited success. In contrast, in the relevant field of image
representation learning, simpler, discrimination-based methods have recently
bridged the gap to fully-supervised performance. We first propose to adapt two
top-performing objectives in this class, instance recognition and local
aggregation, to the video domain. In particular, the latter approach iterates
between clustering the videos in the feature space of a network and updating it
to respect the cluster assignments with a non-parametric classification loss. We observe
promising performance, but qualitative analysis shows that the learned
representations fail to capture motion patterns, grouping the videos based on
appearance. To mitigate this issue, we turn to the heuristic-based IDT
(improved dense trajectory) descriptors, which were manually designed to encode
motion patterns in videos. We form the clusters in the IDT space, using these
descriptors as an unsupervised prior in the iterative local aggregation
algorithm. Our experiments demonstrate that this approach outperforms prior
work on the UCF101 and HMDB51 action recognition benchmarks. We also
qualitatively analyze the learned representations and show that they
successfully capture video dynamics.
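As a rough illustration of the clustering prior described above, the sketch below clusters hand-crafted motion descriptors and uses the cluster ids as pseudo-labels for the video network. The descriptor dimensionality and cluster count are placeholders, and the actual method uses an iterative local-aggregation objective rather than plain k-means.

import numpy as np
from sklearn.cluster import KMeans

def idt_prior_labels(idt_descriptors, num_clusters=100):
    # Cluster hand-crafted IDT descriptors; the resulting cluster ids act as
    # a motion-aware, unsupervised prior (pseudo-labels) for training the
    # video network with a classification-style loss.
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    return km.fit_predict(idt_descriptors)

# Toy example: 500 videos with 426-dimensional IDT-like descriptors.
descriptors = np.random.randn(500, 426)
pseudo_labels = idt_prior_labels(descriptors, num_clusters=10)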
Self-Supervised Relative Depth Learning for Urban Scene Understanding
As an agent moves through the world, the apparent motion of scene elements is
(usually) inversely proportional to their depth. It is natural for a learning
agent to associate image patterns with the magnitude of their displacement over
time: as the agent moves, faraway mountains don't move much; nearby trees move
a lot. This natural relationship between the appearance of objects and their
motion is a rich source of information about the world. In this work, we start
by training a deep network, using fully automatic supervision, to predict
relative scene depth from single images. The relative depth training images are
automatically derived from simple videos of cars moving through a scene, using
recent motion segmentation techniques, and no human-provided labels. This proxy
task of predicting relative depth from a single image induces features in the
network that result in large improvements in a set of downstream tasks
including semantic segmentation, joint road segmentation and car detection, and
monocular (absolute) depth estimation, over a network trained from scratch. The
improvement on the semantic segmentation task is greater than that produced by
any other automatically supervised method. Moreover, for monocular depth
estimation, our unsupervised pre-training method even outperforms supervised
pre-training with ImageNet. In addition, we demonstrate benefits from learning
to predict (unsupervised) relative depth in the specific videos associated with
various downstream tasks. We adapt to the specific scenes in those tasks in an
unsupervised manner to improve performance. In summary, for semantic
segmentation, we present state-of-the-art results among methods that do not use
supervised pre-training, and we even exceed the performance of supervised
ImageNet pre-trained models for monocular depth estimation, achieving results
that are comparable with state-of-the-art methods.
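The inverse relationship between apparent motion and depth invoked above is the standard motion-field result (not specific to this paper): for a camera undergoing pure translation T = (T_x, T_y, T_z) with focal length f, a scene point at depth Z projected to image coordinates (x, y) induces the flow

\[
  u = \frac{x\,T_z - f\,T_x}{Z}, \qquad
  v = \frac{y\,T_z - f\,T_y}{Z}, \qquad
  \sqrt{u^2 + v^2} \;\propto\; \frac{1}{Z},
\]

so nearby points move a lot in the image while distant points barely move, which is what makes flow magnitude a usable proxy signal for relative depth.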
BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis
Generative models for audio-conditioned dance motion synthesis map music
features to dance movements. Models are trained to associate motion patterns to
audio patterns, usually without an explicit knowledge of the human body. This
approach relies on a few assumptions: strong music-dance correlation,
controlled motion data and relatively simple poses and movements. These
characteristics are found in all existing datasets for dance motion synthesis,
and indeed recent methods can achieve good results. We introduce a new dataset
aiming to challenge these common assumptions, compiling a set of dynamic dance
sequences displaying complex human poses. We focus on breakdancing which
features acrobatic moves and tangled postures. We source our data from the Red
Bull BC One competition videos. Estimating human keypoints from these videos is
difficult due to the complexity of the dance, as well as the recording setup
with multiple moving cameras. We adopt a hybrid labelling pipeline leveraging deep
estimation models as well as manual annotations to obtain good quality keypoint
sequences at a reduced cost. Our efforts produced the BRACE dataset, which
contains over 3 hours and 30 minutes of densely annotated poses. We test
state-of-the-art methods on BRACE, showing their limitations when evaluated on
complex sequences. Our dataset can readily foster advances in dance motion
synthesis. With intricate poses and swift movements, models are forced to go
beyond learning a mapping between modalities and reason more effectively about
body structure and movements.
Comment: ECCV 2022. Dataset available at https://github.com/dmoltisanti/brace