Pedestrian Detection aided by Deep Learning Semantic Tasks
Deep learning methods have achieved great success in pedestrian detection,
owing to their ability to learn features from raw pixels. However, they mainly
capture middle-level representations, such as the pose of a pedestrian, and
often confuse positives with hard negative samples, which are highly ambiguous;
for example, the shape and appearance of a `tree trunk' or `wire pole' can
resemble a pedestrian from certain viewpoints. This ambiguity can be resolved
by high-level representations. To this end, this work jointly optimizes
pedestrian detection
with semantic tasks, including pedestrian attributes (e.g. `carrying backpack')
and scene attributes (e.g. `road', `tree', and `horizontal'). Rather than
expensively annotating scene attributes, we transfer attribute information
from existing scene segmentation datasets to the pedestrian dataset, by
proposing a novel deep model to learn high-level features from multiple tasks
and multiple data sources. Since distinct tasks have distinct convergence rates
and data from different datasets have different distributions, a multi-task
objective function is carefully designed to coordinate tasks and reduce
discrepancies among datasets. The importance coefficients of tasks and network
parameters in this objective function can be iteratively estimated. Extensive
evaluations show that the proposed approach outperforms the state-of-the-art on
the challenging Caltech and ETH datasets, where it reduces the miss rates of
previous deep models by 17 and 5.5 percent, respectively.
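To make the joint objective concrete, below is a minimal PyTorch-style sketch of a multi-task loss with per-task importance coefficients, in the spirit of the approach described above. The specific losses, label formats, and the coefficient update rule (down-weighting tasks whose loss has already converged) are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: weighted multi-task objective with iteratively re-estimated
# importance coefficients (assumed update rule, not the paper's).
import torch
import torch.nn.functional as F

def multi_task_loss(det_logits, det_labels,
                    ped_attr_logits, ped_attr_labels,
                    scene_attr_logits, scene_attr_labels,
                    coeffs):
    """coeffs: dict of per-task importance weights, re-estimated periodically.
    Attribute labels are multi-hot float tensors."""
    losses = {
        "detection":   F.cross_entropy(det_logits, det_labels),
        "ped_attrs":   F.binary_cross_entropy_with_logits(ped_attr_logits, ped_attr_labels),
        "scene_attrs": F.binary_cross_entropy_with_logits(scene_attr_logits, scene_attr_labels),
    }
    total = sum(coeffs[name] * loss for name, loss in losses.items())
    return total, {k: v.detach() for k, v in losses.items()}

def update_coeffs(running_losses, eps=1e-8):
    # Hypothetical re-weighting: down-weight tasks whose loss is already small,
    # so faster-converging auxiliary tasks do not dominate the gradient.
    inv = {k: 1.0 / (v + eps) for k, v in running_losses.items()}
    norm = sum(inv.values())
    return {k: len(inv) * v / norm for k, v in inv.items()}
```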
StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners
We investigate the potential of learning visual representations using
synthetic images generated by text-to-image models. This is a natural question
in light of the excellent performance of such models in generating
high-quality images. We specifically consider Stable Diffusion, one of the
leading open-source text-to-image models. We show that (1) when the generative
model is configured with a proper classifier-free guidance scale, training
self-supervised methods on synthetic images can match or beat the real image
counterpart; (2) by treating the multiple images generated from the same text
prompt as positives for each other, we develop a multi-positive contrastive
learning method, which we call StableRep. With solely synthetic images, the
representations learned by StableRep surpass the performance of representations
learned by SimCLR and CLIP using the same set of text prompts and corresponding
real images, on large-scale datasets. When we further add language supervision,
StableRep trained with 20M synthetic images achieves better accuracy than CLIP
trained with 50M real images.
Code is available at: https://github.com/google-research/syn-rep-lear
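As an illustration of the multi-positive idea, below is a minimal sketch of a contrastive loss in which images generated from the same text prompt are treated as positives for one another. The temperature, normalization, and batching details are assumptions rather than the paper's reference implementation.

```python
# Sketch: multi-positive contrastive loss over image embeddings, where
# images sharing a prompt are positives (assumed hyperparameters).
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings, prompt_ids, temperature=0.1):
    """embeddings: (N, D) features; prompt_ids: (N,) prompt index per image."""
    z = F.normalize(embeddings, dim=1)
    logits = z @ z.t() / temperature                    # (N, N) similarity matrix
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits.masked_fill_(self_mask, float("-inf"))       # exclude self-similarity

    # Ground-truth distribution: uniform over the other images from the same prompt.
    pos = (prompt_ids.unsqueeze(0) == prompt_ids.unsqueeze(1)) & ~self_mask
    target = pos.float() / pos.float().sum(dim=1, keepdim=True).clamp(min=1)

    log_prob = F.log_softmax(logits, dim=1)
    return -(target * log_prob).sum(dim=1).mean()
```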
Improving CLIP Training with Language Rewrites
Contrastive Language-Image Pre-training (CLIP) stands as one of the most
effective and scalable methods for training transferable vision models using
paired image and text data. CLIP models are trained using contrastive loss,
which typically relies on data augmentations to prevent overfitting and
shortcuts. However, in the CLIP training paradigm, data augmentations are
exclusively applied to image inputs, while language inputs remain unchanged
throughout the entire training process, limiting the exposure of diverse texts
to the same image. In this paper, we introduce Language augmented CLIP
(LaCLIP), a simple yet highly effective approach to enhance CLIP training
through language rewrites. Leveraging the in-context learning capability of
large language models, we rewrite the text descriptions associated with each
image. These rewritten texts exhibit diversity in sentence structure and
vocabulary while preserving the original key concepts and meanings. During
training, LaCLIP randomly selects either the original texts or the rewritten
versions as text augmentations for each image. Extensive experiments on the
CC3M, CC12M, RedCaps, and LAION-400M datasets show that CLIP pre-training with
language rewrites significantly improves the transfer performance without
computation or memory overhead during training. Specifically for ImageNet
zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on
LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP
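The augmentation step itself is simple: at each training iteration, each image is paired with either its original caption or one of the LLM-rewritten variants. Below is a minimal sketch of that sampling step; the uniform sampling and the data layout are illustrative assumptions, not the released implementation.

```python
# Sketch: LaCLIP-style text augmentation by randomly choosing among the
# original caption and its rewrites (assumed uniform sampling).
import random

def sample_caption(original_caption, rewritten_captions):
    """Uniformly choose among the original text and its rewrites."""
    candidates = [original_caption] + list(rewritten_captions)
    return random.choice(candidates)

# Example usage inside a data-loading pipeline (hypothetical captions):
caption = sample_caption(
    "a dog running on the beach",
    ["a dog sprints along the shoreline", "a puppy dashes across the sand"],
)
```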
Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need?
The focus of recent meta-learning research has been on the development of
learning algorithms that can quickly adapt to test time tasks with limited data
and low computational cost. Few-shot learning is widely used as one of the
standard benchmarks in meta-learning. In this work, we show that a simple
baseline: learning a supervised or self-supervised representation on the
meta-training set, followed by training a linear classifier on top of this
representation, outperforms state-of-the-art few-shot learning methods. An
additional boost can be achieved through the use of self-distillation. This
demonstrates that using a good learned embedding model can be more effective
than sophisticated meta-learning algorithms. We believe that our findings
motivate a rethinking of few-shot image classification benchmarks and the
associated role of meta-learning algorithms. (First two authors contributed
equally.)
Project page: https://people.csail.mit.edu/yuewang/projects/rfs/
Code: http://github.com/WangYueFt/rfs
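The baseline can be summarized in a few lines: embed the support and query images with the frozen backbone, fit a linear classifier on the support embeddings, and classify the queries. Below is a minimal sketch using scikit-learn's logistic regression; the feature normalization and solver settings are assumptions, and the self-distillation stage is omitted.

```python
# Sketch: few-shot classification as linear evaluation on frozen embeddings
# (assumed normalization and classifier settings).
import numpy as np
from sklearn.linear_model import LogisticRegression

def solve_few_shot_task(support_feats, support_labels, query_feats):
    """support_feats: (n_way * k_shot, D) embeddings from the frozen backbone."""
    # L2-normalize features, a common choice for linear evaluation.
    support_feats = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    query_feats = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(support_feats, support_labels)
    return clf.predict(query_feats)
```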
Training-Free Uncertainty Estimation for Dense Regression: Sensitivity as a Surrogate
Uncertainty estimation is an essential step in evaluating the robustness of
deep learning models in computer vision, especially when they are applied in
risk-sensitive areas. However, most state-of-the-art deep learning models
either fail to obtain uncertainty estimation or need significant modification
(e.g., formulating a proper Bayesian treatment) to obtain it. Most previous
methods are not able to take an arbitrary model off the shelf and generate
uncertainty estimation without retraining or redesigning it. To address this
gap, we perform a systematic exploration into training-free uncertainty
estimation for dense regression, an unrecognized yet important problem, and
provide a theoretical construction justifying such estimations. We propose
three simple and scalable methods to analyze the variance of outputs from a
trained network under tolerable perturbations: infer-transformation,
infer-noise, and infer-dropout. They operate solely during inference, without
the need to re-train, re-design, or fine-tune the model, as typically required
by state-of-the-art uncertainty estimation methods. Surprisingly, even without
involving such perturbations in training, our methods produce comparable or
even better uncertainty estimation when compared to training-required
state-of-the-art methods.
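As an illustration, below is a minimal sketch of one of the three variants, infer-noise: the already-trained model is run several times on slightly perturbed copies of the input, and the variance of the dense outputs serves as the uncertainty surrogate. The noise scale and number of passes are illustrative assumptions.

```python
# Sketch: training-free uncertainty for dense regression via repeated
# inference under small input perturbations (assumed noise scale).
import torch

@torch.no_grad()
def infer_noise_uncertainty(model, image, n_passes=8, noise_std=0.01):
    """image: (1, C, H, W); returns mean prediction and per-pixel variance."""
    model.eval()
    outputs = []
    for _ in range(n_passes):
        perturbed = image + noise_std * torch.randn_like(image)
        outputs.append(model(perturbed))
    outputs = torch.stack(outputs, dim=0)      # (n_passes, 1, C_out, H, W)
    return outputs.mean(dim=0), outputs.var(dim=0)
```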