Semi-Supervised Temporal Action Detection with Proposal-Free Masking
Existing temporal action detection (TAD) methods rely on a large amount of
training data with segment-level annotations. Collecting and annotating such a
training set is highly expensive and does not scale. Semi-supervised TAD
(SS-TAD) alleviates this problem by leveraging unlabeled videos freely
available at scale. However, SS-TAD is also a much more challenging problem
than supervised TAD, and consequently remains much under-studied. Prior SS-TAD
methods directly combine an existing proposal-based TAD method with a
semi-supervised learning (SSL) method. Due to their sequential localization
(e.g., proposal generation) and classification design, they are prone to
proposal error propagation. To overcome this
limitation, in this work we propose a novel Semi-supervised Temporal action
detection model based on PropOsal-free Temporal mask (SPOT) with a parallel
localization (mask generation) and classification architecture. Such a novel
design effectively eliminates the dependence between localization and
classification by cutting off the route for error propagation in-between. We
further introduce an interaction mechanism between classification and
localization for prediction refinement, and a new pretext task for
self-supervised model pre-training. Extensive experiments on two standard
benchmarks show that our SPOT outperforms state-of-the-art alternatives, often
by a large margin. The PyTorch implementation of SPOT is available at
https://github.com/sauradip/SPOT

Comment: ECCV 2022
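The abstract attributes SPOT's robustness to its parallel design: the localization (mask generation) and classification heads consume the same features independently, so neither can feed errors into the other. The decoupling can be sketched in a few lines of NumPy; the dimensions, random weights, and final gating step here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T video snippets, D-dim features, C action classes.
T, D, C = 100, 64, 20
features = rng.normal(size=(T, D))

# Two independent linear heads (untrained, illustrative weights).
W_cls = rng.normal(size=(D, C))   # per-snippet classification head
W_mask = rng.normal(size=(D, 1))  # per-snippet foreground-mask head

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The heads run in parallel on the same features: neither consumes the
# other's output, so a localization error cannot propagate into
# classification -- the property the abstract ascribes to SPOT's design.
class_logits = features @ W_cls        # shape (T, C)
mask = sigmoid(features @ W_mask)      # shape (T, 1), foreground probability

# Final per-snippet action scores: classification gated by the mask.
scores = sigmoid(class_logits) * mask  # shape (T, C)
```

By contrast, a proposal-based pipeline would first commit to proposal boundaries and then classify only the proposed segments, so a missed proposal is unrecoverable downstream.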
Learning Generalizable Visual Patterns Without Human Supervision
Owing to the existence of large labeled datasets, Deep Convolutional Neural Networks have ushered in a renaissance in computer vision. However, almost all of the visual data we generate daily - several human lifetimes' worth of it - remains unlabeled and thus out of reach of today's dominant supervised learning paradigm. This thesis focuses on techniques that steer deep models towards learning generalizable visual patterns without human supervision. Our primary tool in this endeavor is the design of Self-Supervised Learning tasks, i.e., pretext tasks whose labels require no human labor. Besides enabling learning from large amounts of unlabeled data, we demonstrate how self-supervision can capture relevant patterns that supervised learning largely misses. For example, we design learning tasks that yield deep representations capturing shape from images, motion from video, and 3D pose features from multi-view data. Notably, these tasks' design follows a common principle: the recognition of data transformations. The strong performance of the learned representations on downstream vision tasks such as classification, segmentation, action recognition, and pose estimation validates this pretext-task design.
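The "recognition of data transformations" principle can be illustrated with the classic rotation-prediction pretext task, where the label comes for free from the transformation itself. A toy sketch follows; the 90-degree rotation task is one well-known instance of the principle, and whether the thesis uses this exact task is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rotation_example(image, rng):
    """Apply a random 90-degree rotation and return the rotation index as
    the (free) label. A network trained to predict k must learn shape and
    orientation cues from the image -- no human annotation involved."""
    k = int(rng.integers(0, 4))      # 0, 90, 180, or 270 degrees
    rotated = np.rot90(image, k)
    return rotated, k

image = rng.normal(size=(32, 32, 3))  # toy stand-in for an image
rotated, label = make_rotation_example(image, rng)
```

A classifier trained on (rotated, label) pairs supplies the pretext supervision; the features it learns are then reused for downstream tasks.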
This thesis also explores the use of Generative Adversarial Networks (GANs) for unsupervised representation learning. Besides leveraging generative adversarial learning to define image transformations for self-supervised learning tasks, we also address the training instabilities of GANs through the use of noise.
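One common noise-based recipe for stabilizing GAN training is instance noise: perturbing both real and generated discriminator inputs with Gaussian noise so that the two distributions overlap. A minimal sketch, assuming this is the flavor of noise meant (the sigma value and batch shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def with_instance_noise(batch, sigma, rng):
    """Add Gaussian 'instance noise' to a batch of discriminator inputs.
    Blurring real and fake samples this way gives their distributions
    overlapping support, which smooths the discriminator's job; sigma is
    typically annealed toward 0 over the course of training."""
    return batch + rng.normal(scale=sigma, size=batch.shape)

real = rng.normal(size=(8, 32, 32, 3))          # toy batch of "images"
noisy_real = with_instance_noise(real, sigma=0.1, rng=rng)
```

The same perturbation would be applied to generator samples before they reach the discriminator, so both sides of the adversarial game see noisy inputs.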
While unsupervised techniques can significantly reduce the burden of supervision, in the end we still rely on some annotated examples to fine-tune learned representations towards a target task. To improve learning from scarce or noisy labels, we describe a supervised learning algorithm with improved generalization in these challenging settings.