
    DETReg: Unsupervised Pretraining with Region Priors for Object Detection

    Recent self-supervised pretraining methods for object detection largely focus on pretraining the backbone of the object detector, neglecting key parts of the detection architecture. Instead, we introduce DETReg, a new self-supervised method that pretrains the entire object detection network, including the object localization and embedding components. During pretraining, DETReg predicts object localizations to match the localizations from an unsupervised region proposal generator and simultaneously aligns the corresponding feature embeddings with embeddings from a self-supervised image encoder. We implement DETReg using the DETR family of detectors and show that it improves over competitive baselines when finetuned on the COCO, PASCAL VOC, and Airbus Ship benchmarks. In low-data regimes, including semi-supervised and few-shot learning settings, DETReg establishes many state-of-the-art results, e.g., on COCO we see a +6.0 AP improvement for 10-shot detection and improvements of over 2 AP when training with only 1% of the labels. For code and pretrained models, visit the project page at https://amirbar.net/detreg. Comment: CVPR 2022 Camera Ready
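    As a rough illustration of the objective described above, here is a minimal PyTorch-style sketch of the two pretraining terms: an L1 box loss against unsupervised region proposals, and an embedding-alignment loss against targets from a frozen self-supervised encoder. The function and tensor names are illustrative assumptions, and the sketch is simplified (the actual method also performs Hungarian matching between queries and proposals and adds an objectness term, both omitted here).

```python
# Illustrative sketch of DETReg-style pretraining terms (not the authors' code).
# Assumes detector queries have already been matched 1:1 to region proposals.
import torch
import torch.nn.functional as F

def detreg_pretrain_loss(pred_boxes, pred_embeds, proposal_boxes, target_embeds,
                         lambda_emb=1.0):
    """pred_boxes/proposal_boxes: (N, 4); pred_embeds/target_embeds: (N, D).
    target_embeds come from a frozen self-supervised image encoder (e.g., SwAV)."""
    loss_box = F.l1_loss(pred_boxes, proposal_boxes)   # match proposal localizations
    loss_emb = F.mse_loss(pred_embeds, target_embeds)  # align feature embeddings
    return loss_box + lambda_emb * loss_emb
```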

    Co-training for On-board Deep Object Detection

    Providing ground-truth supervision to train visual models has been a bottleneck over the years, exacerbated by domain shifts that degrade the performance of such models. This was the case when visual tasks relied on handcrafted features and shallow machine learning and, despite its unprecedented performance gains, the problem remains open within the deep learning paradigm due to its data-hungry nature. The best-performing deep vision-based object detectors are trained in a supervised manner, relying on human-labeled bounding boxes that localize class instances (i.e., objects) within the training images. Thus, object detection is one such task for which human labeling is a major bottleneck. In this paper, we assess co-training as a semi-supervised learning method for self-labeling objects in unlabeled images, thereby reducing the human-labeling effort required to develop deep object detectors. Our study pays special attention to a scenario involving domain shift; in particular, one where we have automatically generated virtual-world images with object bounding boxes and real-world images that are unlabeled. Moreover, we are particularly interested in using co-training for deep object detection in the context of driver assistance systems and/or self-driving vehicles. Thus, using well-established datasets and protocols for object detection in these application contexts, we show that co-training is a paradigm worth pursuing for alleviating object labeling, working both alone and together with task-agnostic domain adaptation.
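    To make the co-training idea concrete, below is a hedged Python sketch of one self-labeling cycle between two detectors, in which each model's confident detections on unlabeled images become pseudo-labels for the other. The `predict`/`fit` interface, the confidence threshold, and the cycle count are placeholder assumptions, not the paper's implementation.

```python
# Hedged sketch of a co-training self-labeling cycle for object detection.
CONF_THRESH = 0.8  # keep only confident pseudo-labeled boxes (illustrative value)

def co_training(model_a, model_b, labeled, unlabeled, n_cycles=5):
    for _ in range(n_cycles):
        pseudo_a, pseudo_b = [], []
        for image in unlabeled:
            boxes_a = [d for d in model_a.predict(image) if d.score > CONF_THRESH]
            boxes_b = [d for d in model_b.predict(image) if d.score > CONF_THRESH]
            if boxes_a:
                pseudo_b.append((image, boxes_a))  # A's confident detections teach B
            if boxes_b:
                pseudo_a.append((image, boxes_b))  # B's confident detections teach A
        model_a.fit(labeled + pseudo_a)
        model_b.fit(labeled + pseudo_b)
    return model_a, model_b
```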

    Unsupervised and semi-supervised co-salient object detection via segmentation frequency statistics

    In this paper, we address the detection of co-occurring salient objects (CoSOD) in an image group using frequency statistics in an unsupervised manner, which further enables us to develop a semi-supervised method. While previous works have mostly focused on fully supervised CoSOD, less attention has been paid to detecting co-salient objects when limited segmentation annotations are available for training. Our simple yet effective unsupervised method, US-CoSOD, combines the object co-occurrence frequency statistics of unsupervised single-image semantic segmentations with salient foreground detections obtained using self-supervised feature learning. For the first time, we show that a large unlabeled dataset, e.g., ImageNet-1k, can be effectively leveraged to significantly improve unsupervised CoSOD performance. Our unsupervised model is a strong pre-training initialization for our semi-supervised model, SS-CoSOD, especially when very limited labeled data is available for training. To avoid propagating erroneous signals from predictions on unlabeled data, we propose a confidence estimation module to guide our semi-supervised training. Extensive experiments on three CoSOD benchmark datasets show that both our unsupervised and semi-supervised models outperform the corresponding state-of-the-art models by a significant margin (e.g., on the Cosal2015 dataset, our US-CoSOD model has an 8.8% F-measure gain over a SOTA unsupervised co-segmentation model, and our SS-CoSOD model has an 11.81% F-measure gain over a SOTA semi-supervised CoSOD model). Comment: Accepted at IEEE WACV 2024
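    The unsupervised scoring idea can be sketched roughly as follows: count how often each pseudo-semantic segment label occurs across the image group, then keep segments that are both frequent in the group and overlap the salient foreground. All data structures, names, and thresholds below are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: score segments by co-occurrence frequency across an image group,
# then keep those overlapping the salient foreground. `segmentations` maps
# image name -> {segment_id: (label, boolean mask)}; masks are boolean np arrays.
from collections import Counter
import numpy as np

def co_salient_masks(segmentations, saliency_masks, min_freq=0.5, min_overlap=0.5):
    n_images = len(segmentations)
    # frequency of each pseudo-semantic label across the group (once per image)
    freq = Counter()
    for segs in segmentations.values():
        for label in {lab for lab, _ in segs.values()}:
            freq[label] += 1
    results = {}
    for name, segs in segmentations.items():
        sal = saliency_masks[name]
        keep = np.zeros_like(sal, dtype=bool)
        for label, mask in segs.values():
            common = freq[label] / n_images >= min_freq             # co-occurs often
            salient = (mask & sal).sum() / max(mask.sum(), 1) >= min_overlap
            if common and salient:
                keep |= mask
        results[name] = keep
    return results
```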

    Effective Training and Efficient Inference of Deep Neural Networks for Visual Understanding

    Since the phenomenal success of deep neural networks (DNNs) on image classification, the research community has been developing wider and deeper networks with complex components for a variety of visual understanding tasks. While such “heavy” models achieve excellent performance, they pose two main challenges: (1) training requires a significant amount of computational resources as well as large-scale labeled datasets acquired through a time-consuming and labor-intensive human annotation process; and (2) inference can be slow even with expensive graphics cards due to the high model complexity. To address these challenges, we explore improving the effectiveness of training DNNs so that better performance is achieved under the same computation and/or annotation cost, and improving the efficiency of inference by reducing the computational cost of DNNs while maintaining high accuracy.

    In this dissertation, we first propose several approaches, including devising noise-aware supervisory signals, developing better semi-supervised learning methods, and analyzing different pre-training techniques, for training object recognition and detection models more effectively. In the second part, we present two adaptive computation frameworks that improve the inference efficiency of 3D convolutional networks and attention-based vision Transformers for the tasks of image and video classification.

    Specifically, we first introduce NoisyAnchor, in which we identify the intrinsic label noise generated by the harsh, binary IoU-based (Intersection-over-Union) foreground/background split of training samples in object detection, and mitigate such noise by deriving a cleanliness score from the detector's output and down-weighting noisy training samples with further-derived soft category labels and loss re-weighting coefficients. We then seek to boost object detection performance with readily available unannotated images, and propose improved semi-supervised learning (SSL) techniques that address two unique challenges of semi-supervised object detection, i.e., the lack of localization quality estimation and the amplified class imbalance when generating pseudo labels. In the third work, we empirically analyze how pre-training on image classification versus object detection affects downstream tasks, providing intuitions and actionable practices for effective task-specific pre-training.

    To improve inference efficiency, we explore adaptive computation methods that produce input-specific inference policies for an overall reduced computational cost, and present Ada3D and AdaViT. In particular, Ada3D learns to adaptively allocate computational resources by selectively keeping informative input frames and activating 3D convolutional layers on a per-input basis for video classification; AdaViT exploits the redundancy of the self-attention mechanism in Vision Transformers for image classification and improves their efficiency by deriving input-specific usage policies for which patches, self-attention heads, and transformer blocks to use throughout the backbone. Such adaptive computation methods tend to allocate less computation to “easy” images and “static” videos, resulting in a reduced computational cost.
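    To illustrate the cleanliness-score idea of down-weighting noisy training samples, here is a hedged sketch in which a soft score mixes localization quality (IoU with the matched ground truth) and classification confidence, then scales each anchor's loss. The mixing coefficient and the normalization are illustrative assumptions, not the dissertation's exact formulation.

```python
# Hedged sketch of cleanliness-style loss re-weighting for positive anchors.
import torch

def cleanliness_weights(pred_scores, pred_ious, alpha=0.5):
    """pred_scores, pred_ious: (N,) tensors in [0, 1] for positive anchors."""
    c = alpha * pred_ious + (1.0 - alpha) * pred_scores  # soft cleanliness score
    return c / c.mean().clamp(min=1e-6)                  # weights average ~1

def reweighted_loss(per_anchor_loss, pred_scores, pred_ious):
    w = cleanliness_weights(pred_scores, pred_ious)
    return (w.detach() * per_anchor_loss).mean()         # down-weight noisy anchors
```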

    Deep Active Learning for Autonomous Perception

    Traditional supervised learning requires significant amounts of labeled training data to achieve satisfactory results. As autonomous perception systems collect continuous data, the labeling process becomes expensive and time-consuming. Active learning is a specialized semi-supervised learning strategy that allows a machine learning model to achieve high performance using less training data, thereby minimizing the cost of manual annotation. We explore active learning for autonomous vehicles and propose a novel deep active learning framework for object detection and instance segmentation. We review prominent active learning approaches, study their performance on the aforementioned computer vision tasks, and perform several experiments using state-of-the-art R-CNN-based models on datasets from the self-driving domain. Our empirical experiments on a number of datasets show that active learning reduces the amount of training data required. We observe that early exploration with instance-rich training sets leads to good performance, and that false positives can have a negative impact if not dealt with appropriately. Furthermore, we perform a qualitative evaluation using autonomous driving data collected in Trondheim, illustrating that active learning can help in selecting more informative images to annotate.
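    As a concrete example of the kind of acquisition function studied in such work, the sketch below scores each unlabeled image by the maximum entropy of its detection confidences and selects the most uncertain images for annotation. The detector interface and the max-aggregation are illustrative assumptions, one common choice among several in the literature.

```python
# Hedged sketch of entropy-based image selection for active learning.
import math

def detection_entropy(p, eps=1e-6):
    """Binary entropy of a detection confidence p."""
    p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def select_for_labeling(model, unlabeled, budget):
    def image_uncertainty(image):
        dets = model.predict(image)  # placeholder detector interface
        return max((detection_entropy(d.score) for d in dets), default=0.0)
    # annotate the `budget` most uncertain images first
    return sorted(unlabeled, key=image_uncertainty, reverse=True)[:budget]
```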

    Downstream Task Self-Supervised Learning for Object Recognition and Tracking

    This dissertation addresses three limitations of deep learning methods in machine vision applications based on image and video understanding. Firstly, although deep convolutional neural networks (CNNs) are effective for image recognition tasks such as object detection and segmentation, they perform poorly under perspective distortion. In real-world applications, camera perspective is a common problem that can otherwise be addressed only by annotating large amounts of data, which limits the applicability of deep learning models. Secondly, the typical approach to single-camera tracking problems is to use separate motion and appearance models, which are expensive in terms of computation and training data requirements. Finally, conventional multi-camera video understanding techniques use supervised learning algorithms to determine temporal relationships among objects. In large-scale applications, these methods are also limited by the requirement for extensive manually annotated data and computational resources.

    To address these limitations, we develop an uncertainty-aware self-supervised learning (SSL) technique that captures a model's instance or semantic segmentation uncertainty from overhead images and guides the model to learn the impact of the new perspective on object appearance. The test-time data augmentation-based pseudo-label refinement technique continuously trains a model until convergence on new-perspective images. The proposed method can be applied for both self-supervision and semi-supervision, thus increasing the effectiveness of a deep pre-trained model in new domains. Extensive experiments demonstrate the effectiveness of the SSL technique on both object detection and semantic segmentation problems. In video understanding applications, we introduce simultaneous segmentation and tracking as an unsupervised spatio-temporal latent feature clustering problem. The jointly learned multi-task features leverage task-dependent uncertainty to generate discriminative features in multi-object videos. Experiments have shown that the proposed tracker outperforms several state-of-the-art supervised methods. Finally, we propose an unsupervised multi-camera tracklet association (MCTA) algorithm to track multiple objects in real time. MCTA leverages the self-supervised detector model for single-camera tracking and solves the multi-camera tracking problem using multiple pair-wise camera associations modeled as a connected graph. The graph optimization method generates a global solution for partially or fully overlapping camera networks.
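    The test-time-augmentation pseudo-label refinement mentioned above can be sketched roughly as follows: segmentation predictions are averaged over augmented views, and only high-confidence, low-variance pixels are kept as pseudo-labels. The flip-only augmentation and the threshold values are illustrative assumptions, not the dissertation's exact procedure.

```python
# Hedged sketch of TTA-based pseudo-label refinement for semantic segmentation.
import torch

@torch.no_grad()
def refine_pseudo_labels(model, image, conf_thresh=0.9, var_thresh=0.05):
    """image: (C, H, W); model is assumed to return per-pixel logits (1, K, H, W)."""
    views = [image, torch.flip(image, dims=[-1])]  # identity + horizontal flip
    probs = []
    for k, v in enumerate(views):
        p = model(v.unsqueeze(0)).softmax(dim=1)[0]
        if k == 1:
            p = torch.flip(p, dims=[-1])           # undo the flip before averaging
        probs.append(p)
    probs = torch.stack(probs)                     # (V, K, H, W)
    mean, var = probs.mean(0), probs.var(0).mean(0)
    conf, label = mean.max(0)                      # per-pixel confidence and class
    label[(conf < conf_thresh) | (var > var_thresh)] = -1  # ignore uncertain pixels
    return label
```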