The effectiveness of MAE pre-pretraining for billion-scale pretraining
This paper revisits the standard pretrain-then-finetune paradigm used in
computer vision for visual recognition tasks. Typically, state-of-the-art
foundation models are pretrained using large scale (weakly) supervised datasets
with billions of images. We introduce an additional pre-pretraining stage that
is simple and uses the self-supervised MAE technique to initialize the model.
While MAE has only been shown to scale with the size of models, we find that it
scales with the size of the training dataset as well. Thus, our MAE-based
pre-pretraining scales with both model and data size, making it applicable for
training foundation models. Pre-pretraining consistently improves both the
model convergence and the downstream transfer performance across a range of
model scales (millions to billions of parameters) and dataset sizes (millions
to billions of images). We measure the effectiveness of pre-pretraining on 10
different visual recognition tasks spanning image classification, video
recognition, object detection, low-shot classification and zero-shot
recognition. Our largest model achieves new state-of-the-art results on
iNaturalist-18 (91.3%), 1-shot ImageNet-1k (62.1%), and zero-shot transfer on
Food-101 (96.2%). Our study reveals that model initialization plays a
significant role, even for web-scale pretraining with billions of images.
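
As a rough illustration of the three-stage recipe the abstract describes (self-supervised MAE pre-pretraining, then supervised pretraining or finetuning), here is a minimal PyTorch sketch. The toy dimensions, module names, and random data are hypothetical stand-ins, not the paper's actual models or datasets.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins; the paper uses ViT backbones and billions of images.
PATCHES, PATCH_DIM, ENC_DIM, NUM_CLASSES = 16, 48, 128, 10

encoder = nn.Sequential(nn.Linear(PATCH_DIM, ENC_DIM), nn.GELU(),
                        nn.Linear(ENC_DIM, ENC_DIM))
decoder = nn.Linear(ENC_DIM, PATCH_DIM)     # MAE reconstruction head
cls_head = nn.Linear(ENC_DIM, NUM_CLASSES)  # head for the supervised stage

def mae_step(patches, mask_ratio=0.75):
    """Stage 1 (pre-pretraining): mask most patches, reconstruct them."""
    mask = torch.rand(patches.shape[:2]) < mask_ratio
    # Real MAE drops masked tokens before the encoder; zeroing them out
    # here is a simplification.
    visible = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = decoder(encoder(visible))
    # As in MAE, the loss is computed only on the masked positions.
    return F.mse_loss(recon[mask], patches[mask])

def supervised_step(patches, labels):
    """Stage 2 (weakly supervised pretraining) or downstream finetuning."""
    feats = encoder(patches).mean(dim=1)  # pool patch features
    return F.cross_entropy(cls_head(feats), labels)

params = (list(encoder.parameters()) + list(decoder.parameters())
          + list(cls_head.parameters()))
opt = torch.optim.AdamW(params, lr=1e-4)
patches = torch.randn(4, PATCHES, PATCH_DIM)  # fake patchified images
labels = torch.randint(0, NUM_CLASSES, (4,))

# Run stage 1 first so the supervised stage starts from the MAE weights.
loss = mae_step(patches)
opt.zero_grad(); loss.backward(); opt.step()

loss = supervised_step(patches, labels)
opt.zero_grad(); loss.backward(); opt.step()

The key design point in the paper is simply the ordering: the supervised stage inherits the MAE-initialized encoder weights rather than a random initialization.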
Is an Object-Centric Video Representation Beneficial for Transfer?
The objective of this work is to learn an object-centric video
representation, with the aim of improving transferability to novel tasks, i.e.,
tasks different from the pre-training task of action classification. To this
end, we introduce a new object-centric video recognition model based on a
transformer architecture. The model learns a set of object-centric summary
vectors for the video, and uses these vectors to fuse the visual and
spatio-temporal trajectory 'modalities' of the video clip. We also introduce a
novel trajectory contrast loss to further enhance objectness in these summary
vectors. With experiments on four datasets -- SomethingSomething-V2,
SomethingElse, Action Genome and EpicKitchens -- we show that the
object-centric model outperforms prior video representations (both
object-agnostic and object-aware), when: (1) classifying actions on unseen
objects and in unseen environments; (2) low-shot learning of novel classes;
(3) linear probing on other downstream tasks; and (4) standard action
classification.

Comment: Accepted to ACCV 2022
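
The abstract does not give the exact form of the trajectory contrast loss, but a generic InfoNCE-style objective conveys the idea: pull each clip's object-centric summary representation toward its own trajectory embedding and away from other clips'. The function and tensor names below are hypothetical; this is a sketch under that assumption, not the paper's implementation.

import torch
import torch.nn.functional as F

def trajectory_contrast_loss(summary, trajectory, temperature=0.07):
    # summary, trajectory: (batch, dim) pooled per-clip embeddings.
    s = F.normalize(summary, dim=-1)
    t = F.normalize(trajectory, dim=-1)
    logits = s @ t.T / temperature  # pairwise cosine similarities
    targets = torch.arange(len(s))  # positives on the diagonal
    # Symmetrize over both matching directions, as is common in
    # contrastive objectives.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

# Usage with random features standing in for model outputs:
summary = torch.randn(8, 256)     # object-centric summary vectors
trajectory = torch.randn(8, 256)  # pooled spatio-temporal trajectory features
print(trajectory_contrast_loss(summary, trajectory).item())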