Exploiting Unlabeled Data in CNNs by Self-supervised Learning to Rank
For many applications the collection of labeled data is expensive and laborious.
Exploitation of unlabeled data during training is thus a long pursued objective
of machine learning. Self-supervised learning addresses this by positing an
auxiliary task (different, but related to the supervised task) for which data
is abundantly available. In this paper, we show how ranking can be used as a
proxy task for some regression problems. As another contribution, we propose an
efficient backpropagation technique for Siamese networks which prevents the
redundant computation introduced by the multi-branch network architecture. We
apply our framework to two regression problems: Image Quality Assessment (IQA)
and Crowd Counting. For both we show how to automatically generate ranked image
sets from unlabeled data. Our results show that networks trained to regress to
the ground truth targets for labeled data and to simultaneously learn to rank
unlabeled data obtain significantly better, state-of-the-art results for both
IQA and crowd counting. In addition, we show that measuring network uncertainty
on the self-supervised proxy task is a good measure of informativeness of
unlabeled data. This can be used to drive an algorithm for active learning and
we show that this reduces labeling effort by up to 50%.
Comment: Accepted at TPAMI. (Keywords: Learning from rankings, image quality
assessment, crowd counting, active learning). arXiv admin note: text overlap
with arXiv:1803.0309
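
The training objective described in the abstract above, regressing to ground truth on labeled data while learning to rank unlabeled data, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the hinge form of the ranking term, and the `rank_weight` and `margin` parameters are assumptions made for illustration, and the ranked pairs are assumed to be given rather than generated from images.

```python
def pairwise_ranking_hinge(score_high, score_low, margin=0.0):
    """Hinge loss for one ranked pair (hypothetical helper).

    Zero when the sample known to rank higher scores at least
    `margin` above the sample known to rank lower.
    """
    return max(0.0, score_low - score_high + margin)


def combined_loss(labeled_preds, labeled_targets, ranked_pairs,
                  rank_weight=1.0, margin=0.0):
    """Mean squared error on labeled data plus a weighted ranking term.

    ranked_pairs holds (score_high, score_low) network outputs for
    unlabeled samples whose relative order is known.
    """
    mse = sum((p - t) ** 2
              for p, t in zip(labeled_preds, labeled_targets)) / len(labeled_preds)
    rank = sum(pairwise_ranking_hinge(h, l, margin)
               for h, l in ranked_pairs) / len(ranked_pairs)
    return mse + rank_weight * rank


# Two labeled predictions and two ranked pairs; the second pair is mis-ordered.
loss = combined_loss([1.0, 3.0], [1.0, 2.0], [(5.0, 3.0), (2.0, 4.0)])
```

Only the mis-ordered pair contributes to the ranking term, so the unlabeled data shapes the network without requiring any target values.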
CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model
Supervised crowd counting relies heavily on costly manual labeling, which is
difficult and expensive, especially in dense scenes. To alleviate the problem,
we propose a novel unsupervised framework for crowd counting, named CrowdCLIP.
The core idea is built on two observations: 1) the recent contrastive
pre-trained vision-language model (CLIP) has presented impressive performance
on various downstream tasks; 2) there is a natural mapping between crowd
patches and count text. To the best of our knowledge, CrowdCLIP is the first to
investigate vision-language knowledge to solve the counting problem.
Specifically, in the training stage, we exploit the multi-modal ranking loss by
constructing ranking text prompts to match the size-sorted crowd patches to
guide the image encoder learning. In the testing stage, to deal with the
diversity of image patches, we propose a simple yet effective progressive
filtering strategy that first selects the patches most likely to contain crowds and then
maps them into the language space with various counting intervals. Extensive
experiments on five challenging datasets demonstrate that the proposed
CrowdCLIP achieves superior performance compared to previous unsupervised
state-of-the-art counting methods. Notably, CrowdCLIP even surpasses some
popular fully-supervised methods under the cross-dataset setting. The source
code will be available at https://github.com/dk-liang/CrowdCLIP.
Comment: Accepted by CVPR 202
Fine-grained Domain Adaptive Crowd Counting via Point-derived Segmentation
Due to domain shift, a large performance drop is usually observed when a
trained crowd counting model is deployed in the wild. While existing
domain-adaptive crowd counting methods achieve promising results, they
typically regard each crowd image as a whole and reduce domain discrepancies in
a holistic manner, thus limiting further improvement of domain adaptation
performance. To this end, we propose to untangle domain-invariant crowd
and domain-specific background from crowd images and design a
fine-grained domain adaptation method for crowd counting. Specifically, to
disentangle crowd from background, we propose to learn crowd segmentation from
point-level crowd counting annotations in a weakly-supervised manner. Based on
the derived segmentation, we design a crowd-aware domain adaptation mechanism
consisting of two crowd-aware adaptation modules, i.e., Crowd Region Transfer
(CRT) and Crowd Density Alignment (CDA). The CRT module is designed to guide
the transfer of crowd features across domains beyond background distractions. The CDA
module is dedicated to regularising target-domain crowd density generation by its
own crowd density distribution. Our method outperforms previous approaches
consistently in the widely-used adaptation scenarios.
Comment: 10 pages, 5 figures, and 9 table
Unsupervised Methods for Camera Pose Estimation and People Counting in Crowded Scenes
Most visual crowd counting methods rely on training with labeled data to learn a mapping between features in the image and the number of people in the scene. However, the exact nature of this mapping may change as a function of different scene and viewing conditions, limiting the ability of such supervised systems to generalize to novel conditions, and thus preventing broad deployment. Here I propose an alternative, unsupervised strategy anchored on a 3D simulation that automatically learns how groups of people appear in the image and adapts to the signal processing parameters of the current viewing scenario. To implement this 3D strategy, knowledge of the camera parameters is required. Most methods for automatic camera calibration make assumptions about regularities in scene structure or motion patterns, which do not always apply. I propose a novel motion-based approach for recovering camera tilt that does not require tracking. Having an automatic camera calibration method allows for the implementation of an accurate crowd counting algorithm that reasons in 3D. The system is evaluated on various datasets and compared against state-of-the-art methods.
Learning Segmentation Masks with the Independence Prior
An instance with a bad mask might make a composite image that uses it look
fake. This encourages us to learn segmentation by generating realistic
composite images. To achieve this, we propose a novel framework that exploits a
newly proposed prior called the independence prior based on Generative
Adversarial Networks (GANs). The generator produces an image with multiple
category-specific instance providers, a layout module and a composition module.
Firstly, each provider independently outputs a category-specific instance image
with a soft mask. Then the provided instances' poses are corrected by the
layout module. Lastly, the composition module combines these instances into a
final image. Training with adversarial loss and penalty for mask area, each
provider learns a mask that is as small as possible but enough to cover a
complete category-specific instance. Weakly supervised semantic segmentation
methods widely use grouping cues modeling the association between image parts,
which are either artificially designed or learned with costly segmentation
labels or only modeled on local pairs. Unlike them, our method automatically
models the dependence between any parts and learns instance segmentation. We
apply our framework in two cases: (1) Foreground segmentation on
category-specific images with box-level annotation. (2) Unsupervised learning
of instance appearances and masks with only one image of a homogeneous object
cluster (HOC). We get appealing results in both tasks, which shows the
independence prior is useful for instance segmentation and it is possible to
learn instance masks without supervision from only one image.
Comment: 7+5 pages, 13 figures, Accepted to AAAI 201
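
The composition step and the mask-area penalty mentioned in the abstract above can be illustrated with standard alpha blending. This is a hedged sketch, not the paper's generator: the function names are hypothetical, images are simplified to single-channel nested lists, and the penalty is assumed to be the mean mask value.

```python
def composite(instance, mask, background):
    """Blend an instance onto a background with a soft mask in [0, 1].

    Each output pixel is mask * instance + (1 - mask) * background.
    """
    return [
        [m * f + (1.0 - m) * b for f, m, b in zip(f_row, m_row, b_row)]
        for f_row, m_row, b_row in zip(instance, mask, background)
    ]


def mask_area_penalty(mask):
    """Mean mask value (assumed form); smaller masks give a smaller penalty."""
    total = sum(sum(row) for row in mask)
    count = sum(len(row) for row in mask)
    return total / count


# A 1x2 toy image: the mask keeps the first pixel and discards the second.
instance = [[1.0, 1.0]]
mask = [[1.0, 0.0]]
background = [[0.0, 0.5]]
image = composite(instance, mask, background)
penalty = mask_area_penalty(mask)
```

Under these assumptions, a mask that is too small cuts off part of the instance and makes the composite look fake, while the area penalty discourages a mask that is larger than needed.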