55 research outputs found
How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
Vision Transformers (ViT) have been shown to attain highly competitive
performance for a wide range of vision applications, such as image
classification, object detection and semantic image segmentation. In comparison
to convolutional neural networks, the Vision Transformer's weaker inductive
bias is generally found to cause an increased reliance on model regularization
or data augmentation (``AugReg'' for short) when training on smaller training
datasets. We conduct a systematic empirical study in order to better understand
the interplay between the amount of training data, AugReg, model size and
compute budget. As one result of this study we find that the combination of
increased compute and AugReg can yield models with the same performance as
models trained on an order of magnitude more training data: we train ViT models
of various sizes on the public ImageNet-21k dataset which either match or
outperform their counterparts trained on the larger, but not publicly available
JFT-300M dataset.Comment: Andreas, Alex, Xiaohua and Lucas contributed equally. We release more
than 50'000 ViT models trained under diverse settings on various datasets. We
believe this to be a treasure trove for model analysis. Available at
https://github.com/google-research/vision_transformer and
https://github.com/rwightman/pytorch-image-model
FlexiViT: One Model for All Patch Sizes
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller
patches leading to higher accuracy at greater computational cost, but changing
the patch size typically requires retraining the model. In this paper, we
demonstrate that simply randomizing the patch size at training time leads to a
single set of weights that performs well across a wide range of patch sizes,
making it possible to tailor the model to different compute budgets at
deployment time. We extensively evaluate the resulting model, which we call
FlexiViT, on a wide range of tasks, including classification, image-text
retrieval, open-world detection, panoptic segmentation, and semantic
segmentation, concluding that it usually matches, and sometimes outperforms,
standard ViT models trained at a single patch size in an otherwise identical
setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that
makes it easy to add compute-adaptive capabilities to most models relying on a
ViT backbone architecture. Code and pre-trained models are available at
https://github.com/google-research/big_visionComment: Code and pre-trained models available at
https://github.com/google-research/big_vision. All authors made significant
technical contributions. CVPR 202
- …