Relational Self-Supervised Learning
Self-supervised Learning (SSL) including the mainstream contrastive learning
has achieved great success in learning visual representations without data
annotations. However, most methods mainly focus on instance-level information
(i.e., the different augmented images of the same instance should have the
same features or cluster into the same class), paying little attention to the
relationships between different instances. In this paper, we introduce a novel
SSL paradigm, which we term the relational self-supervised learning (ReSSL)
framework, that learns representations by modeling the relationship between
different instances. Specifically, our method employs a sharpened distribution
of pairwise similarities among different instances as a relation metric, which
is then used to match the feature embeddings of different augmentations. To
boost performance, we argue that weak augmentations matter for representing a
more reliable relation, and we leverage a momentum strategy for practical
efficiency. A designed asymmetric predictor head and an InfoNCE warm-up
strategy enhance robustness to hyper-parameters and benefit the resulting
performance. Experimental results
show that our proposed ReSSL substantially outperforms the state-of-the-art
methods across different network architectures, including various lightweight
networks (e.g., EfficientNet and MobileNet).
Comment: Extended version of NeurIPS 2021 paper. arXiv admin note: substantial text overlap with arXiv:2107.0928
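The relation metric described above can be sketched in a few lines: each view's embedding is compared against a set of other instances, and the softmax over those similarities (sharpened by a low temperature) forms the target distribution that the other augmentation must match. The temperature values (0.04 for the weak/teacher view, 0.1 for the strong/student view) and the memory-queue setup here are illustrative assumptions, not values taken from the abstract.

```python
import numpy as np

def relation_distribution(feats, queue, tau):
    """Softmax over cosine similarities between a batch of embeddings and a
    memory queue of other instances; a lower tau sharpens the distribution."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    logits = feats @ queue.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def ressl_loss(z_weak, z_strong, queue, tau_teacher=0.04, tau_student=0.1):
    """Cross-entropy from the sharpened relation distribution of the weakly
    augmented view (teacher) to that of the strongly augmented view (student).
    Temperatures are hypothetical defaults for illustration."""
    p_t = relation_distribution(z_weak, queue, tau_teacher)    # sharper target
    p_s = relation_distribution(z_strong, queue, tau_student)  # student relation
    return -np.mean(np.sum(p_t * np.log(p_s + 1e-12), axis=1))
```

Because the teacher distribution is sharpened more aggressively than the student's, the student is pushed toward the teacher's most confident instance relationships rather than a uniform match.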
Connectionist Temporal Modeling for Weakly Supervised Action Labeling
We propose a weakly-supervised framework for action labeling in video, where
only the order of occurring actions is required during training time. The key
challenge is that the per-frame alignments between the input (video) and label
(action) sequences are unknown during training. We address this by introducing
the Extended Connectionist Temporal Classification (ECTC) framework to
efficiently evaluate all possible alignments via dynamic programming and
explicitly enforce their consistency with frame-to-frame visual similarities.
This protects the model from the distraction of visually inconsistent or
degenerate alignments without the need for temporal supervision. We further
extend our framework to the semi-supervised case when a few frames are sparsely
annotated in a video. With less than 1% of labeled frames per video, our method
is able to outperform existing semi-supervised approaches and achieve
comparable performance to that of fully supervised approaches.
Comment: To appear in ECCV 201
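The dynamic-programming evaluation of all alignments can be illustrated with a minimal blank-free, CTC-style forward pass: given frame-wise action posteriors and the ordered label sequence, it sums the probability of every monotone frame-to-label alignment. This is a simplified sketch of the general idea; ECTC's visual-similarity consistency weighting is omitted, and the function and variable names are hypothetical.

```python
import numpy as np

def forward_alignment(probs, labels):
    """Sum the probability of all monotone frame-to-label alignments via
    dynamic programming. probs[t, a] is the frame-wise posterior for action a;
    labels is the ordered sequence of action indices to align against."""
    T, K = probs.shape[0], len(labels)
    alpha = np.zeros((T, K))
    alpha[0, 0] = probs[0, labels[0]]  # first frame must start the first label
    for t in range(1, T):
        for k in range(K):
            stay = alpha[t - 1, k]                        # remain on label k
            move = alpha[t - 1, k - 1] if k > 0 else 0.0  # advance to label k
            alpha[t, k] = (stay + move) * probs[t, labels[k]]
    return alpha[T - 1, K - 1]  # alignments that end on the last label
```

The recurrence visits each (frame, label) cell once, so the cost is O(T·K) even though the number of alignments grows combinatorially, which is what makes evaluating all of them tractable.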