Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Scaling laws have been recently employed to derive compute-optimal model size
(number of parameters) for a given compute duration. We advance and refine such
methods to infer compute-optimal model shapes, such as width and depth, and
successfully implement this in vision transformers. Our shape-optimized vision
transformer, SoViT, achieves results competitive with models that exceed twice
its size, despite being pre-trained with an equivalent amount of compute. For
example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012,
surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical
settings, while incurring less than half the inference cost. We conduct a thorough
evaluation across multiple tasks, such as image classification, captioning, VQA
and zero-shot transfer, demonstrating the effectiveness of our model across a
broad range of domains and identifying limitations. Overall, our findings
challenge the prevailing approach of blindly scaling up vision models and pave
a path for more informed scaling.
Comment: 10 pages, 7 figures, 9 tables. Version 2: Layout fixed.
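As a rough illustration of the shape-optimization recipe described above, here is a minimal Python sketch that fits a per-shape scaling curve and extrapolates each candidate shape to a target compute budget, keeping the shape with the best predicted loss. The saturating power-law form, the toy numbers, and all names are illustrative assumptions, not the functional form, data, or procedure used in the paper.

```python
# Hypothetical sketch: pick a compute-optimal model shape from per-shape scaling-law fits.
# The power-law form and all numbers below are illustrative, not the paper's actual fit.
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b, c):
    # Saturating power law: loss falls with compute and approaches an irreducible floor c.
    return a * compute ** (-b) + c

# Toy observations: (compute, validation loss) measured for two candidate shapes.
observations = {
    "width=768, depth=12": ([1e3, 1e4, 1e5], [2.10, 1.85, 1.72]),
    "width=1024, depth=24": ([1e3, 1e4, 1e5], [2.30, 1.90, 1.68]),
}
target_compute = 1e6  # budget at which we want the compute-optimal shape

best_shape, best_pred = None, float("inf")
for shape, (compute, loss) in observations.items():
    params, _ = curve_fit(power_law, compute, loss, p0=(1.0, 0.1, 1.0), maxfev=10_000)
    pred = power_law(target_compute, *params)
    if pred < best_pred:
        best_shape, best_pred = shape, pred

print(f"predicted best shape at {target_compute:.0e} compute: {best_shape} "
      f"(extrapolated loss {best_pred:.3f})")
```

In this toy setup the wider, deeper shape only wins once the budget is large enough for its curve to cross the smaller shape's, which is exactly the kind of trade-off a shape-aware scaling law is meant to expose.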
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?
We study the effectiveness of data-balancing for mitigating biases in
contrastive language-image pretraining (CLIP), identifying areas of strength
and limitation. First, we reaffirm prior conclusions that CLIP models can
inadvertently absorb societal stereotypes. To counter this, we present a novel
algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both
representation and association biases (i.e. in first- and second-order
statistics) in multimodal data. We use M4 to conduct an in-depth analysis
taking into account various factors, such as the model, representation, and
data size. Our study also explores the dynamic nature of how CLIP learns and
unlearns biases. In particular, we find that fine-tuning is effective in
countering representation biases, though its impact diminishes for association
biases. Also, data balancing has a mixed impact on quality: it tends to improve
classification but can hurt retrieval. Interestingly, data and architectural
improvements seem to mitigate the negative impact of data balancing on
performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves
COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and
ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with
recommendations for improving the efficacy of data balancing in multimodal
systems.
Comment: 32 pages, 20 figures, 7 tables.
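To make the first-/second-order distinction concrete, the following Python sketch reweights a toy dataset so that a binary attribute's marginal hits a target rate (representation bias) and its covariance with a concept label is pushed toward zero (association bias). The cell-reweighting scheme and all variable names are assumptions for illustration; this is not the M4 algorithm itself.

```python
# Hypothetical sketch of data balancing via first- and second-order statistics.
# The reweighting scheme below is illustrative, not the paper's M4 algorithm.
import numpy as np

rng = np.random.default_rng(0)

# Toy metadata: a binary attribute (e.g. a sensitive proxy) and a binary concept per example.
n = 10_000
attribute = rng.binomial(1, 0.7, size=n)           # skewed representation
concept = rng.binomial(1, 0.2 + 0.5 * attribute)   # associated with the attribute

# Joint counts over the four (attribute, concept) cells.
joint = np.zeros((2, 2))
np.add.at(joint, (attribute, concept), 1)

# Target joint: attribute marginal at 0.5 (first-order) and concept independent of the
# attribute (second-order), keeping the overall concept rate unchanged.
concept_rate = concept.mean()
target_joint = np.outer([0.5, 0.5], [1 - concept_rate, concept_rate]) * n

# Per-example weights proportional to target / observed mass of the example's cell.
cell_weight = target_joint / np.maximum(joint, 1)
weights = cell_weight[attribute, concept]

print(f"attribute rate: {attribute.mean():.3f} -> {np.average(attribute, weights=weights):.3f}")
print(f"attribute-concept cov: {np.cov(attribute, concept)[0, 1]:.3f} -> "
      f"{np.cov(attribute, concept, aweights=weights)[0, 1]:.3f}")
```

In practice such weights (or a subsampling scheme derived from them) would then be applied when drawing training batches for the contrastive objective.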
Fair Wrapping for Black-box Predictions
We introduce a new family of techniques to post-process ("wrap") a black-box
classifier in order to reduce its bias. Our technique builds on the recent
analysis of improper loss functions whose optimization can correct any twist in
prediction, unfairness being treated as a twist. In the post-processing, we
learn a wrapper function which we define as an α-tree, which modifies
the prediction. We provide two generic boosting algorithms to learn
α-trees. We show that our modification has appealing properties in terms
of composition of α-trees, generalization, interpretability, and KL
divergence between modified and original predictions. We exemplify the use of
our technique in three fairness notions: conditional value-at-risk, equality of
opportunity, and statistical parity; and provide experiments on several readily
available datasets.
Comment: Published in Advances in Neural Information Processing Systems 35
(NeurIPS 2022).
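The sketch below illustrates the post-processing setting in its simplest form: a frozen black-box score is wrapped by a learned, group-dependent correction chosen to shrink a statistical-parity gap. The score-shift wrapper and the grid search are illustrative stand-ins, not the paper's α-tree construction or its boosting algorithms.

```python
# Hypothetical sketch: wrap a frozen black-box score with a group-dependent correction
# to reduce a statistical-parity gap. This stands in for, and is not, the alpha-tree wrapper.
import numpy as np

rng = np.random.default_rng(1)

# Black-box scores for two groups, with group 1 systematically scored lower.
group = rng.integers(0, 2, size=5_000)
scores = np.clip(rng.normal(0.55 - 0.15 * group, 0.2), 0.0, 1.0)

def positive_rate(s, mask, shift=0.0, threshold=0.5):
    # Fraction of the masked group predicted positive after applying a score shift.
    return np.mean((s[mask] + shift) >= threshold)

# Choose the shift for group 1 that best closes the parity gap against group 0.
base_rate = positive_rate(scores, group == 0)
candidates = np.linspace(-0.3, 0.3, 121)
gaps = [abs(positive_rate(scores, group == 1, shift=d) - base_rate) for d in candidates]
best_shift = candidates[int(np.argmin(gaps))]

wrapped = scores + best_shift * (group == 1)
print(f"parity gap before: {abs(positive_rate(scores, group == 1) - base_rate):.3f}")
print(f"parity gap after:  {abs(positive_rate(wrapped, group == 1) - base_rate):.3f}")
```

The paper replaces this single additive shift with a richer, tree-structured wrapper whose composition, generalization, interpretability, and KL-divergence properties are analyzed explicitly.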
FlexiViT: One Model for All Patch Sizes
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller
patches leading to higher accuracy at greater computational cost, but changing
the patch size typically requires retraining the model. In this paper, we
demonstrate that simply randomizing the patch size at training time leads to a
single set of weights that performs well across a wide range of patch sizes,
making it possible to tailor the model to different compute budgets at
deployment time. We extensively evaluate the resulting model, which we call
FlexiViT, on a wide range of tasks, including classification, image-text
retrieval, open-world detection, panoptic segmentation, and semantic
segmentation, concluding that it usually matches, and sometimes outperforms,
standard ViT models trained at a single patch size in an otherwise identical
setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that
makes it easy to add compute-adaptive capabilities to most models relying on a
ViT backbone architecture. Code and pre-trained models are available at
https://github.com/google-research/big_vision
Comment: Code and pre-trained models available at
https://github.com/google-research/big_vision. All authors made significant
technical contributions. CVPR 2023.
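A minimal PyTorch sketch of the training trick, under the assumption of a plain convolutional patch embedding: at each step a patch size is drawn from a fixed grid and the single embedding kernel is resized to match, so one set of weights serves every patch size. The bilinear resize is a simplification of the resizing used in the paper, and the shapes, patch-size grid, and deterministic loop are illustrative.

```python
# Hypothetical sketch of FlexiViT-style patch-size randomization for a ViT patch embedding.
# Bilinear kernel resizing is a simplification; shapes and the patch-size grid are illustrative.
import torch
import torch.nn.functional as F

embed_dim, base_patch, image_size = 192, 32, 240
base_kernel = torch.randn(embed_dim, 3, base_patch, base_patch)  # the single learnable kernel
images = torch.randn(8, 3, image_size, image_size)

patch_sizes = [16, 24, 30, 48]  # all divide the 240-pixel image size

for step in range(len(patch_sizes)):
    p = patch_sizes[step]  # during training this would be sampled at random per step
    # Resize the shared kernel to the sampled patch size, then embed patches with a strided conv.
    kernel = F.interpolate(base_kernel, size=(p, p), mode="bilinear", align_corners=False)
    tokens = F.conv2d(images, kernel, stride=p)       # (8, embed_dim, image_size/p, image_size/p)
    tokens = tokens.flatten(2).transpose(1, 2)        # (8, num_patches, embed_dim)
    print(f"patch size {p:2d} -> {tokens.shape[1]:3d} tokens per image")
```

Note that the position embeddings would also need resizing to the new token grid at each step; the sketch omits that (and the transformer itself) for brevity.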
Adapting to Latent Subgroup Shifts via Concepts and Proxies
We address the problem of unsupervised domain adaptation when the source domain differs from the target domain because of a shift in the distribution of a latent subgroup. When this subgroup confounds all observed data, neither covariate shift nor label shift assumptions apply. We show that the optimal target predictor can be non-parametrically identified with the help of concept and proxy variables available only in the source domain, and unlabeled data from the target. The identification results are constructive, immediately suggesting an algorithm for estimating the optimal predictor in the target. For continuous observations, where this algorithm becomes impractical, we propose a latent variable model specific to the data generation process at hand. We show how the approach degrades as the size of the shift changes, and verify that it outperforms both covariate shift and label shift adjustments.