9 research outputs found

    Annotation Efficient Visual Recognition: from Semi-Supervised to Few-Shot Learning

    In recent years, supervised deep learning has achieved remarkable success in solving a wide range of visual recognition problems. Large-scale labeled datasets have been crucial for this success, and progress has primarily been limited to controlled environments. In this dissertation, we present methods to improve the annotation efficiency of deep visual recognition models and propose methods to improve the performance of annotation-efficient models in unconstrained open-world settings. To address the annotation bottleneck in supervised learning, we introduce a pseudo-labeling framework for semi-supervised learning. While consistency regularization methods dominate the field, they rely heavily on domain-specific data augmentations, which limits their applicability. We argue that although pseudo-labeling is a general approach, it performs poorly because poorly calibrated models produce high-confidence yet incorrect predictions, leading to noisy training. To overcome this, we propose an uncertainty-aware pseudo-label selection method that greatly reduces the amount of noisy pseudo-labels. Furthermore, our framework generalizes the pseudo-labeling process to allow the creation of negative pseudo-labels; these negative pseudo-labels can be used for multi-label classification as well as for negative learning to improve single-label classification. Even though this semi-supervised learning method is very effective at reducing annotation costs, it is not suitable for real-world scenarios where there is limited control over the data collection process. Hence, our next focus is open-world semi-supervised learning, which assumes that labeled and unlabeled data come from different distributions and that unlabeled data may contain samples from unknown classes. We propose a method that utilizes a pairwise similarity loss to discover novel classes by implicitly clustering them while recognizing samples from known classes. Using a bi-level optimization rule, this pairwise similarity loss exploits the information available in the labeled set. After discovering novel classes, our method transforms the open-world semi-supervised learning problem into a standard semi-supervised learning problem to achieve additional performance gains with existing semi-supervised learning methods. Despite being effective, this solution relies on multiple objective functions, requires prior knowledge of the number of unknown classes, and only works on class-balanced data. To overcome these limitations, we propose a second, more practical, and streamlined solution for open-world semi-supervised learning. This solution utilizes sample uncertainty and incorporates prior knowledge about the class distribution to generate reliable, class-distribution-aware pseudo-labels for unlabeled data belonging to both known and unknown classes. Our constrained, class-distribution-aware pseudo-label generation is an instance of the optimal transport problem, which we solve using the Sinkhorn-Knopp algorithm. This method works with any class distribution and does not require knowing the number of novel classes, making it more practical for deployment. In the above two works, we assume that samples from novel classes are available during training. However, this is difficult to satisfy in many real-world scenarios. Therefore, to study how models can achieve human-like performance in open-world settings, i.e., identifying new concepts with only a few examples, we shift our focus to few-shot learning.
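    The paragraph above describes selecting both positive and negative pseudo-labels while filtering out predictions from poorly calibrated models. A minimal sketch of what such uncertainty-aware selection could look like is given below; the function name, the use of a per-class uncertainty estimate (e.g., variation over MC-dropout passes), and all threshold values are illustrative assumptions rather than the dissertation's actual settings.

```python
import torch

def select_pseudo_labels(probs, uncertainties,
                         tau_pos=0.9, tau_neg=0.05, kappa=0.05):
    """Uncertainty-aware selection of positive and negative pseudo-labels.

    probs:         (N, C) softmax outputs on unlabeled samples
    uncertainties: (N, C) per-class uncertainty estimates, e.g. the
                   standard deviation over several MC-dropout passes
    """
    # Positive pseudo-labels: confident AND low-uncertainty predictions.
    pos_mask = (probs >= tau_pos) & (uncertainties <= kappa)
    # Negative pseudo-labels: classes the model confidently rules out.
    # They can supervise multi-label classification directly or drive
    # negative learning for single-label classification.
    neg_mask = (probs <= tau_neg) & (uncertainties <= kappa)
    return pos_mask, neg_mask
```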
    Learning good generalizable features is crucial for solving this problem. To this end, we propose a novel training mechanism that simultaneously enforces equivariance and invariance to a general set of geometric transformations. Simultaneously optimizing these two contrasting objectives allows the model to jointly learn features that are not only independent of the input transformation but also encode the structure of geometric transformations. These complementary sets of features generalize well to novel classes with only a few labeled samples. We achieve additional improvements by incorporating a novel self-supervised distillation objective.
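    A rough sketch of how equivariance and invariance to a transformation set can be enforced jointly is shown below; the head names, the transformation bank, and the simple mean-embedding consistency term are illustrative assumptions and do not reproduce the dissertation's exact objectives.

```python
import torch
import torch.nn.functional as F

def equi_inv_loss(backbone, equiv_head, inv_head, x, transform_bank):
    """Jointly enforce equivariance and invariance to geometric transforms.

    transform_bank: list of callables (e.g. the four 90-degree rotations);
    the equivariance head classifies which transform was applied, while the
    invariance head's embeddings are pulled together across transforms.
    """
    feats, logits, targets = [], [], []
    for t_idx, t in enumerate(transform_bank):
        z = backbone(t(x))                        # features of transformed input
        logits.append(equiv_head(z))              # predict the applied transform
        feats.append(F.normalize(inv_head(z), dim=-1))
        targets.append(torch.full((x.size(0),), t_idx, device=x.device))
    # Equivariance: features must encode which transformation was applied.
    l_equi = F.cross_entropy(torch.cat(logits), torch.cat(targets))
    # Invariance: embeddings of the same image under different transforms
    # should agree (mean-embedding consistency as a simple stand-in).
    stacked = torch.stack(feats)                  # (T, N, D)
    l_inv = ((stacked - stacked.mean(0, keepdim=True)) ** 2).sum(-1).mean()
    return l_equi + l_inv
```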

    TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition

    Semi-supervised learning can be more beneficial for the video domain than for images because of video's higher annotation cost and dimensionality. Moreover, any video understanding task requires reasoning over both the spatial and temporal dimensions. To learn both static and motion-related features for semi-supervised action recognition, existing methods rely on hard input-level inductive biases, such as using two modalities (RGB and optical flow) or two streams with different playback rates. Instead of utilizing unlabeled videos through diverse input streams, we rely on self-supervised video representations; in particular, we utilize temporally-invariant and temporally-distinctive representations. We observe that these representations complement each other depending on the nature of the action. Based on this observation, we propose a student-teacher semi-supervised learning framework, TimeBalance, in which we distill knowledge from a temporally-invariant and a temporally-distinctive teacher. Depending on the nature of the unlabeled video, we dynamically combine the knowledge of these two teachers using a novel temporal similarity-based reweighting scheme. Our method achieves state-of-the-art performance on three action recognition benchmarks: UCF101, HMDB51, and Kinetics400. Code: https://github.com/DAVEISHAN/TimeBalance (CVPR 2023)
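    A minimal sketch of the core distillation idea described above, assuming a per-clip weight w derived from some temporal-similarity score that decides how much to trust each teacher; the weighting rule and function signature are illustrative, not TimeBalance's actual interface.

```python
import torch
import torch.nn.functional as F

def distill_from_two_teachers(student_logits, inv_logits, dis_logits, w):
    """Blend a temporally-invariant and a temporally-distinctive teacher.

    student_logits, inv_logits, dis_logits: (N, C) logits for N unlabeled clips
    w: (N,) weight in [0, 1] per clip, e.g. high when the clip looks similar
       across time (lean on the invariant teacher), low otherwise.
    """
    teacher_probs = (w[:, None] * F.softmax(inv_logits, dim=-1)
                     + (1 - w[:, None]) * F.softmax(dis_logits, dim=-1))
    # Standard distillation: match the student to the blended teacher.
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    teacher_probs, reduction="batchmean")
```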

    Preserving Modality Structure Improves Multi-Modal Learning

    Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations. These joint embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data because they ignore the semantic structure present in the modality-specific embeddings. In this context, we propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space. To capture modality-specific semantic relationships between samples, we propose to learn multiple anchors and to represent the multifaceted relationship between samples through their relationships with these anchors. To assign multiple anchors to each sample, we propose a novel Multi-Assignment Sinkhorn-Knopp algorithm. Our experiments demonstrate that the proposed approach learns semantically meaningful anchors in a self-supervised manner. Furthermore, our evaluation on the MSR-VTT and YouCook2 datasets demonstrates that the proposed multi-anchor assignment-based solution achieves state-of-the-art performance and generalizes to both in- and out-of-domain datasets. Code: https://github.com/Swetha5/Multi_Sinkhorn_Knopp (Accepted at ICCV 2023)
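    For reference, below is a sketch of the standard Sinkhorn-Knopp normalization commonly used to compute balanced soft assignments of samples to anchors or prototypes; the Multi-Assignment variant proposed in the paper adds constraints beyond this plain version, and the epsilon and iteration values here are illustrative.

```python
import torch

@torch.no_grad()
def sinkhorn_knopp(scores, eps=0.05, n_iters=3):
    """Standard Sinkhorn-Knopp normalization of sample-to-anchor scores.

    scores: (N, K) similarities of N samples to K anchors.
    Returns an (N, K) soft assignment whose rows sum to 1 and whose
    anchor (column) masses are balanced.
    """
    q = torch.exp(scores / eps).T           # (K, N)
    q /= q.sum()
    K, N = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)     # balance mass across anchors
        q /= K
        q /= q.sum(dim=0, keepdim=True)     # normalize per sample
        q /= N
    return (q * N).T                        # (N, K), rows sum to 1
```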

    An Automatic Bleeding Frame and Region Detection Scheme for Wireless Capsule Endoscopy Videos Based on Interplane Intensity Variation Profile in Normalized RGB Color Space

    Wireless capsule endoscopy (WCE) is an effective video technology for diagnosing gastrointestinal (GI) diseases such as bleeding. To avoid the conventional tedious and risky manual review of long-duration WCE videos, automatic bleeding detection schemes are gaining importance. In this paper, the analysis of WCE images for bleeding is carried out in the normalized RGB color space, as human perception of bleeding is associated with different shades of red. In the proposed method, an efficient region of interest (ROI) is first extracted from each WCE image frame based on the interplane intensity variation profile in the normalized RGB space. Next, the variation in the normalized green plane of the extracted ROI is summarized with a histogram, and features are extracted from these normalized green-plane histograms. For classification, a K-nearest neighbors classifier is employed. Moreover, bleeding zones within a bleeding image are extracted using morphological operations. For performance evaluation, 2300 WCE images obtained from 30 publicly available WCE videos are used in a tenfold cross-validation scheme, and the proposed method outperforms four existing methods, achieving an accuracy of 97.86%, a sensitivity of 95.20%, and a specificity of 98.32%.
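    A minimal sketch of the feature idea described above, assuming the normalized green plane g = G/(R+G+B) is summarized by a histogram and classified with K-nearest neighbors; the bin count, the neighbor count, and the omission of the ROI-extraction and morphological post-processing steps are simplifications, not the paper's actual configuration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def green_plane_histogram(frame_rgb, bins=32):
    """Histogram of the normalized green plane of a WCE frame.

    frame_rgb: (H, W, 3) uint8 image. ROI extraction is skipped here; the
    histogram is computed over the whole frame for illustration.
    """
    rgb = frame_rgb.astype(np.float64)
    denom = rgb.sum(axis=-1) + 1e-8
    g_norm = rgb[..., 1] / denom            # g = G / (R + G + B)
    hist, _ = np.histogram(g_norm, bins=bins, range=(0.0, 1.0), density=True)
    return hist

# Classification with K-nearest neighbors, as in the abstract
# (X_train / y_train are assumed precomputed histogram features and labels):
# knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# prediction = knn.predict([green_plane_histogram(test_frame)])
```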

    Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video

    Self-supervised learning has unlocked the potential of scaling up pretraining to billions of images, since annotation is unnecessary. But are we making the best use of data? How much more economical can we be? In this work, we attempt to answer this question by making two contributions. First, we investigate first-person videos and introduce a "Walking Tours" dataset. These videos are high-resolution, hours-long, captured in a single uninterrupted take, and depict a large number of objects and actions with natural scene transitions. They are unlabeled and uncurated, thus realistic for self-supervision and comparable with human learning. Second, we introduce a novel self-supervised image pretraining method tailored for learning from continuous videos. Existing methods typically adapt image-based pretraining approaches to incorporate more frames. Instead, we advocate a "tracking to learn to recognize" approach. Our method, called DORA, leads to attention maps that Discover and tRAck objects over time in an end-to-end manner, using transformer cross-attention. We derive multiple views from the tracks and use them in a classical self-supervised distillation loss. Using our novel approach, a single Walking Tours video remarkably becomes a strong competitor to ImageNet for several image and video downstream tasks.
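    A very loose sketch of the "tracking to learn to recognize" idea as described above: object tracks, abstracted here as per-frame attention masks, define object-centric views that feed a teacher-student distillation loss. The function names, the mask interface, and the temperatures are hypothetical placeholders, not DORA's actual implementation.

```python
import torch
import torch.nn.functional as F

def track_and_distill(student, teacher, frames, get_object_masks,
                      temp_s=0.1, temp_t=0.04):
    """Distill from object-centric views derived from tracked objects.

    frames: list of (C, H, W) tensors from one continuous video
    get_object_masks: callable returning (num_objects, H, W) soft masks per
        frame, standing in for the cross-attention tracking step.
    """
    losses = []
    for frame in frames:
        masks = get_object_masks(frame)               # tracked-object attention maps
        with torch.no_grad():
            p_t = F.softmax(teacher(frame.unsqueeze(0)) / temp_t, dim=-1)
        for m in masks:
            view = frame * m.unsqueeze(0)             # object-centric view of the frame
            p_s = F.log_softmax(student(view.unsqueeze(0)) / temp_s, dim=-1)
            losses.append(-(p_t * p_s).sum(-1).mean())  # cross-entropy to teacher
    return torch.stack(losses).mean()
```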

    Dora WalkingTours Dataset (ICLR 2024)

    Self-supervised learning has unlocked the potential of scaling up pretraining to billions of images, since annotation is unnecessary. But are we making the best use of data? How much more economical can we be? In this work, we attempt to answer this question by making two contributions. First, we investigate first-person videos and introduce a "Walking Tours" dataset. These videos are high-resolution, hours-long, captured in a single uninterrupted take, and depict a large number of objects and actions with natural scene transitions. They are unlabeled and uncurated, thus realistic for self-supervision and comparable with human learning. Second, we introduce a novel self-supervised image pretraining method tailored for learning from continuous videos. Reference: Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video. Shashanka Venkataramanan, Mamshad Nayeem Rizve, João Carreira, Yuki M. Asano, Yannis Avrithis. In: International Conference on Learning Representations 2024.