The effectiveness of MAE pre-pretraining for billion-scale pretraining
This paper revisits the standard pretrain-then-finetune paradigm used in
computer vision for visual recognition tasks. Typically, state-of-the-art
foundation models are pretrained using large-scale (weakly) supervised datasets
with billions of images. We introduce an additional pre-pretraining stage that
is simple and uses the self-supervised MAE technique to initialize the model.
While MAE has only been shown to scale with the size of models, we find that it
scales with the size of the training dataset as well. Thus, our MAE-based
pre-pretraining scales with both model and data size, making it applicable for
training foundation models. Pre-pretraining consistently improves both the
model convergence and the downstream transfer performance across a range of
model scales (millions to billions of parameters), and dataset sizes (millions
to billions of images). We measure the effectiveness of pre-pretraining on 10
different visual recognition tasks spanning image classification, video
recognition, object detection, low-shot classification and zero-shot
recognition. Our largest model achieves new state-of-the-art results on
iNaturalist-18 (91.3%), 1-shot ImageNet-1k (62.1%), and zero-shot transfer on
Food-101 (96.2%). Our study reveals that model initialization plays a
significant role, even for web-scale pretraining with billions of images.
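The pre-pretraining stage initializes the model with MAE, which hides a large fraction of image patches and trains the model to reconstruct them. A minimal sketch of the random-masking step at MAE's default 75% ratio (the helper name and shapes are illustrative, not the paper's code):

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """MAE-style random masking: keep a random subset of patches.

    patches: (num_patches, dim) array of flattened image patches.
    Returns (visible_patches, keep_indices, mask), where mask[i] is True
    for patches hidden from the encoder and reconstructed by the decoder.
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    num_keep = int(n * (1 - mask_ratio))
    # Shuffle patch indices and keep the first num_keep as visible.
    perm = rng.permutation(n)
    keep = np.sort(perm[:num_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep] = False
    return patches[keep], keep, mask

# Example: 196 patches (a 14x14 grid) of dimension 768, as in ViT-B/16.
patches = np.random.randn(196, 768)
visible, keep, mask = random_masking(patches)
print(visible.shape)  # (49, 768): only 25% of patches reach the encoder
```

Because the encoder only processes the visible 25% of patches, this objective stays cheap even at billion-image scale, which is what makes it usable as an extra stage before large-scale supervised pretraining.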
Evolution of a Web-Scale Near Duplicate Image Detection System
Detecting near duplicate images is fundamental to the content ecosystem of
photo sharing web applications. However, such a task is challenging when
involving a web-scale image corpus containing billions of images. In this
paper, we present an efficient system for detecting near duplicate images
across 8 billion images. Our system consists of three stages: candidate
generation, candidate selection, and clustering. We also demonstrate that this
system can be used to greatly improve the quality of recommendations and search
results across a number of real-world applications.
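The three stages map naturally onto a hashing-plus-verification pipeline. A minimal sketch under illustrative assumptions (the toy embeddings, random-projection signatures, and 0.99 threshold are stand-ins, not the system's actual features):

```python
import numpy as np
from collections import defaultdict

# Toy stand-ins for image embeddings (a real system would use a learned
# feature extractor; dimensions and thresholds here are illustrative).
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 32))
embs = np.stack([
    base[0], base[0] * 1.001,  # a near-duplicate pair (rescaled copy)
    base[1], base[1] * 1.001,  # another near-duplicate pair
    base[2], base[3],          # two unrelated images
])

# Stage 1: candidate generation. A coarse LSH-style signature (signs of
# random projections) buckets images that are likely similar, so we
# never compare all O(n^2) pairs at billion scale.
proj = rng.normal(size=(32, 8))
signatures = embs @ proj > 0
buckets = defaultdict(list)
for i, sig in enumerate(signatures):
    buckets[sig.tobytes()].append(i)

# Stage 2: candidate selection. Verify in-bucket pairs with an exact
# cosine-similarity check against a tight threshold.
def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

pairs = [(i, j)
         for bucket in buckets.values()
         for i in bucket for j in bucket
         if i < j and cosine(embs[i], embs[j]) > 0.99]

# Stage 3: clustering. Union-find merges verified pairs into
# near-duplicate groups.
parent = list(range(len(embs)))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x
for i, j in pairs:
    parent[find(i)] = find(j)

groups = defaultdict(set)
for i in range(len(embs)):
    groups[find(i)].add(i)
print(sorted(sorted(g) for g in groups.values()))
```

The key design choice this illustrates is that only the cheap hashing stage touches every image; the exact verification and clustering stages run on the much smaller candidate set.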
In addition, we describe the evolution of the system over the course of six
years, sharing experiences and lessons on how new systems were designed to
accommodate organic content growth as well as the latest technology. Finally,
we are releasing a human-labeled dataset of ~53,000 pairs of images introduced
in this paper.