Consistency Regularization with High-dimensional Non-adversarial Source-guided Perturbation for Unsupervised Domain Adaptation in Segmentation
Unsupervised domain adaptation for semantic segmentation has been intensively studied because pixel-level annotations for synthetic data can be obtained at low cost.
The most common approaches try to generate images or features mimicking the
distribution of the target domain while preserving the semantic content of the
source domain so that a model can be trained with annotations from the latter.
However, such methods rely heavily on an image translator or feature extractor trained through an elaborate mechanism, often involving adversarial training, which introduces extra complexity and instability into the adaptation process. Furthermore,
these methods mainly focus on taking advantage of the labeled source dataset,
leaving the unlabeled target dataset not fully utilized. In this paper, we
propose a bidirectional style-induced domain adaptation method, called BiSIDA,
that employs consistency regularization to efficiently exploit information from
the unlabeled target domain dataset, requiring only a simple neural style
transfer model. BiSIDA aligns domains by not only transferring source images
into the style of target images but also transferring target images into the
style of source images to perform high-dimensional perturbation on the
unlabeled target images, which is crucial to successfully applying consistency regularization to segmentation tasks. Extensive experiments show
that BiSIDA achieves a new state of the art on two commonly used synthetic-to-real domain adaptation benchmarks: GTA5-to-CityScapes and SYNTHIA-to-CityScapes.
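As a rough sketch of how such source-guided consistency regularization can be wired up (an illustrative simplification, not the authors' training loop: the style_transfer callable, the confidence threshold, and the averaging over stylized views are assumptions), in PyTorch:

    # Hypothetical sketch: consistency regularization with source-styled perturbations
    # of unlabeled target images. Not the official BiSIDA code.
    import torch
    import torch.nn.functional as F

    def consistency_loss(model, target_img, source_styles, style_transfer, conf_thresh=0.9):
        """target_img: (B,3,H,W) unlabeled target batch; source_styles: list of (B,3,H,W)
        source-style reference batches; style_transfer(content, style) -> stylized image."""
        with torch.no_grad():
            # Pseudo-labels: average predictions over several source-stylized views.
            probs = torch.stack([
                torch.softmax(model(style_transfer(target_img, s)), dim=1)
                for s in source_styles
            ]).mean(0)
            conf, pseudo_label = probs.max(dim=1)              # (B,H,W)
            mask = (conf > conf_thresh).float()                # keep only confident pixels
        # Student prediction on one more randomly stylized (perturbed) view.
        logits = model(style_transfer(target_img, source_styles[0]))
        loss = F.cross_entropy(logits, pseudo_label, reduction='none')  # (B,H,W)
        return (loss * mask).sum() / mask.sum().clamp(min=1.0)

In practice the pseudo-label branch would typically use a frozen or slowly updated copy of the model; that detail is omitted here for brevity.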
Plug and Play Active Learning for Object Detection
Annotating data for supervised learning is expensive and tedious, and we want
to do as little of it as possible. To make the most of a given "annotation
budget" we can turn to active learning (AL) which aims to identify the most
informative samples in a dataset for annotation. Active learning algorithms are
typically uncertainty-based or diversity-based. Both have seen success in image
classification, but fall short when it comes to object detection. We
hypothesise that this is because: (1) it is difficult to quantify uncertainty
for object detection as it consists of both localisation and classification,
where some classes are harder to localise, and others are harder to classify;
(2) it is difficult to measure similarities for diversity-based AL when images
contain different numbers of objects. We propose a two-stage active learning
algorithm, Plug and Play Active Learning (PPAL), that overcomes these
difficulties. It consists of (1) Difficulty Calibrated Uncertainty Sampling, in
which we use a category-wise difficulty coefficient that takes both
classification and localisation into account to re-weight object uncertainties
for uncertainty-based sampling; (2) Category Conditioned Matching Similarity to
compute the similarities of multi-instance images as ensembles of their
instance similarities. PPAL is highly generalisable because it makes no change
to model architectures or detector training pipelines. We benchmark PPAL on the
MS-COCO and Pascal VOC datasets using different detector architectures and show
that our method outperforms the prior state-of-the-art. Code is available at
https://github.com/ChenhongyiYang/PPA
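A minimal sketch of the difficulty-calibrated scoring idea (illustrative only; the exact form of the difficulty coefficient and the image-level aggregation are assumptions, not the paper's definitions):

    # Hypothetical sketch of difficulty-calibrated uncertainty sampling for selecting
    # images to annotate. Not the official PPAL implementation.
    import numpy as np

    def image_uncertainty(detections, difficulty):
        """detections: list of (category_id, uncertainty) pairs, one per predicted box;
        difficulty: dict mapping category_id -> a coefficient reflecting how hard the
        class is to classify and localise (larger = harder)."""
        if not detections:
            return 0.0
        return float(np.mean([difficulty.get(c, 1.0) * u for c, u in detections]))

    def select_for_annotation(unlabeled, difficulty, budget):
        """unlabeled: dict image_id -> detections; returns the `budget` highest-scoring ids."""
        scores = {img: image_uncertainty(dets, difficulty) for img, dets in unlabeled.items()}
        return sorted(scores, key=scores.get, reverse=True)[:budget]

In PPAL this uncertainty-based stage is followed by the diversity-oriented Category Conditioned Matching Similarity stage; only the first stage is sketched here.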
Contrastive Object-level Pre-training with Spatial Noise Curriculum Learning
The goal of contrastive learning based pre-training is to leverage large
quantities of unlabeled data to produce a model that can be readily adapted
downstream. Current approaches revolve around solving an image discrimination
task: given an anchor image, an augmented counterpart of that image, and some
other images, the model must produce representations such that the distance
between the anchor and its counterpart is small, and the distances between the
anchor and the other images are large. There are two significant problems with
this approach: (i) by contrasting representations at the image-level, it is
hard to generate detailed object-sensitive features that are beneficial to
downstream object-level tasks such as instance segmentation; (ii) the
augmentation strategy of producing an augmented counterpart is fixed, making
learning less effective at the later stages of pre-training. In this work, we
introduce Curricular Contrastive Object-level Pre-training (CCOP) to tackle
these problems: (i) we use selective search to find rough object regions and
use them to incorporate an inter-image object-level contrastive loss and an intra-image object-level discrimination loss into our pre-training objective;
(ii) we present a curriculum learning mechanism that adaptively augments the
generated regions, which allows the model to consistently acquire a useful
learning signal, even in the later stages of pre-training. Our experiments show
that our approach improves on the MoCo v2 baseline by a large margin on
multiple object-level tasks when pre-training on multi-object scene image
datasets. Code is available at https://github.com/ChenhongyiYang/CCOP
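To make the two ingredients concrete, here is an illustrative sketch of curriculum box jitter and an object-level contrastive (InfoNCE) loss (the linear noise schedule, jitter range, and temperature are assumptions, not CCOP's actual settings):

    # Hypothetical sketch: curriculum-scaled jitter of proposal boxes plus an
    # object-level InfoNCE loss between two augmented views. Not the official CCOP code.
    import torch
    import torch.nn.functional as F

    def jitter_boxes(boxes, step, total_steps, max_scale=0.3):
        """boxes: (N,4) xyxy proposals (e.g. from selective search). The jitter
        magnitude grows with training progress, a simple stand-in for the curriculum."""
        scale = max_scale * step / total_steps
        wh = boxes[:, 2:] - boxes[:, :2]                      # (N,2) widths and heights
        noise = (torch.rand_like(boxes) * 2 - 1) * scale * wh.repeat(1, 2)
        return boxes + noise

    def object_info_nce(q, k, temperature=0.2):
        """q, k: (N,D) L2-normalised embeddings of the same N regions in two views."""
        logits = q @ k.t() / temperature                      # (N,N) similarity matrix
        targets = torch.arange(q.size(0), device=q.device)    # region i matches region i
        return F.cross_entropy(logits, targets)

The region embeddings q and k would come from RoI-pooled features of the two augmented views; that machinery is omitted here.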
GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation
We present the Group Propagation Vision Transformer (GPViT): a novel non-hierarchical (i.e. non-pyramidal) transformer model designed for general visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details, such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because of the way self-attention scales. We provide a highly efficient alternative, the Group Propagation Block (GP Block), to exchange global information. In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation, where global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned to the image features through a transformer decoder. We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves significant performance gains over previous works across all tasks, especially on tasks that require high-resolution outputs; for example, our GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with only half as many parameters.
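A simplified sketch of the group-propagate-ungroup pattern described above (illustrative only: the number of group tokens, the MLP used for propagation, and the use of standard multi-head attention are assumptions, not the exact GP Block design):

    # Hypothetical sketch of a group-propagation style block. Not the official GPViT code.
    import torch
    import torch.nn as nn

    class GroupPropagationBlock(nn.Module):
        def __init__(self, dim, num_groups=64, num_heads=8):
            super().__init__()
            self.group_tokens = nn.Parameter(torch.randn(num_groups, dim) * 0.02)
            self.group_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.mix = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim),
                                     nn.GELU(), nn.Linear(dim, dim))
            self.ungroup_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, x):
            """x: (B, N, dim) high-resolution image tokens."""
            g = self.group_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
            # Grouping: a fixed number of learnable group tokens attend to all image tokens.
            g, _ = self.group_attn(query=g, key=x, value=x)
            # Propagation: exchange global information among the (few) group tokens.
            g = g + self.mix(g)
            # Ungrouping: cross-attention returns global context to every image token.
            out, _ = self.ungroup_attn(query=x, key=g, value=g)
            return x + out

Because the expensive interaction happens among a small, fixed number of group tokens rather than between all pairs of high-resolution tokens, the cost of this pattern grows linearly rather than quadratically with the number of image tokens.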
Learning to separate: detecting heavily-occluded objects in urban scenes
While visual object detection with deep learning has received much attention in the past decade, cases where heavy intra-class occlusion occurs have not been studied thoroughly. In this work, we propose a Non-Maximum-Suppression (NMS) algorithm that dramatically improves detection recall while maintaining high precision in scenes with heavy occlusions. Our NMS algorithm is derived from a novel embedding mechanism in which the semantic and geometric features of the detected boxes are jointly exploited. The embedding makes it possible to determine whether two heavily overlapping boxes belong to the same object in the physical world. Our approach is particularly useful for car and pedestrian detection in urban scenes where occlusions often happen. We show the effectiveness of our approach by creating a model called SG-Det (short for Semantics and Geometry Detection) and testing it on two widely adopted datasets, KITTI and CityPersons, on which it achieves state-of-the-art performance. Our code is available at https://github.com/ChenhongyiYang/SG-NMS.
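A schematic sketch of an embedding-aware suppression rule in this spirit (illustrative only; the thresholds and the dot-product similarity test are assumptions, not the paper's SG-NMS formulation):

    # Hypothetical sketch of NMS that suppresses a box only when a kept box both
    # overlaps it heavily and appears to describe the same physical object.
    # Not the official SG-NMS implementation.
    import torch
    from torchvision.ops import box_iou

    def embedding_nms(boxes, scores, embeds, iou_thresh=0.5, emb_thresh=0.5):
        """boxes: (N,4) xyxy; scores: (N,); embeds: (N,D) L2-normalised per-box embeddings."""
        order = scores.argsort(descending=True)
        keep = []
        for i in order.tolist():
            suppressed = False
            for j in keep:
                overlap = box_iou(boxes[i:i + 1], boxes[j:j + 1]).item()
                same_object = (embeds[i] @ embeds[j]).item() > emb_thresh
                if overlap > iou_thresh and same_object:
                    suppressed = True
                    break
            if not suppressed:
                keep.append(i)
        return torch.tensor(keep, dtype=torch.long)

Plain greedy NMS corresponds to dropping the same_object test; keeping it lets two heavily overlapping boxes survive when they describe different occluded objects.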
Prediction-Guided Distillation for Dense Object Detection
Real-world object detection models should be cheap and accurate. Knowledge
distillation (KD) can boost the accuracy of a small, cheap detection model by
leveraging useful information from a larger teacher model. However, a key
challenge is identifying the most informative features produced by the teacher
for distillation. In this work, we show that only a very small fraction of
features within a ground-truth bounding box are responsible for a teacher's
high detection performance. Based on this, we propose Prediction-Guided
Distillation (PGD), which focuses distillation on these key predictive regions
of the teacher and yields considerable gains in performance over many existing
KD baselines. In addition, we propose an adaptive weighting scheme over the key
regions to smooth out their influence and achieve even better performance. Our
proposed approach outperforms current state-of-the-art KD baselines on a
variety of advanced one-stage detection architectures. Specifically, on the
COCO dataset, our method achieves between +3.1% and +4.6% AP improvement using
ResNet-101 and ResNet-50 as the teacher and student backbones, respectively. On
the CrowdHuman dataset, we achieve +3.2% and +2.0% improvements in MR and AP,
also using these backbones. Our code is available at
https://github.com/ChenhongyiYang/PGD.
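As a rough sketch of prediction-guided feature distillation (illustrative only; the quality map, the top-fraction cutoff, and the per-image normalisation stand in for the paper's key-region selection and adaptive weighting):

    # Hypothetical sketch: distill student features towards teacher features only at
    # the teacher's most predictive locations. Not the official PGD implementation.
    import torch
    import torch.nn.functional as F

    def prediction_guided_distill(student_feat, teacher_feat, teacher_quality, top_frac=0.05):
        """student_feat, teacher_feat: (B,C,H,W) features; teacher_quality: (B,H,W)
        per-location teacher prediction quality (e.g. max class score inside ground-truth
        boxes, zero elsewhere)."""
        B, C, H, W = teacher_feat.shape
        q = teacher_quality.reshape(B, -1)                          # (B, H*W)
        k = max(1, int(top_frac * q.size(1)))
        thresh = q.topk(k, dim=1).values[:, -1:]                    # k-th largest value per image
        weight = torch.where(q >= thresh, q, torch.zeros_like(q))   # keep only key regions
        weight = (weight / weight.sum(dim=1, keepdim=True).clamp(min=1e-6)).reshape(B, 1, H, W)
        per_loc = F.mse_loss(student_feat, teacher_feat, reduction='none').mean(dim=1, keepdim=True)
        return (per_loc * weight).sum() / B

The paper additionally applies an adaptive weighting scheme over the key regions; the normalised quality scores above are only a simple fixed stand-in for that component.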