The recent work CLIPA presents an inverse scaling law for CLIP training --
whereby the larger the image/text encoders used, the shorter the sequence
length of image/text tokens that can be applied in training. This finding
enables us to train high-performance CLIP models with significantly reduced
computations. Building upon this work, we hereby present CLIPA-v2 with two key
contributions. Technically, we find this inverse scaling law is also applicable
in the finetuning stage, enabling further reduction in computational needs.
Empirically, we explore CLIPA at scale, extending the experiments up to the
H/14 model with ~13B image-text pairs seen during training.
Our results are exciting -- by only allocating a budget of $10,000, our CLIP model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) while reducing the computational cost by ~39X. Moreover, with an additional investment of $4,000, we can further elevate the zero-shot ImageNet accuracy to 81.8%. Our
code and models are available at https://github.com/UCSC-VLAA/CLIPA.
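The inverse scaling law above concerns shortening the image/text token sequences fed to larger encoders during training. As a minimal sketch of one such length-reduction strategy (random patch-token masking on the image tower), the snippet below shortens a ViT patch-token sequence before encoding; the function name, shapes, and keep ratio are illustrative assumptions, not the authors' implementation.

```python
import torch

def randomly_mask_image_tokens(image_tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep a random subset of patch tokens, shortening the sequence the encoder sees.

    image_tokens: (batch, num_tokens, dim) patch embeddings.
    keep_ratio:   fraction of tokens to keep, e.g. 0.25 keeps one quarter of the patches.
    """
    batch, num_tokens, dim = image_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Independently sample which tokens to keep for each example in the batch.
    scores = torch.rand(batch, num_tokens, device=image_tokens.device)
    keep_idx = scores.topk(num_keep, dim=1).indices            # (batch, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)      # (batch, num_keep, dim)
    return torch.gather(image_tokens, dim=1, index=keep_idx)

# Example: a ViT-H/14 on 224x224 images yields 256 patch tokens; keeping 25% leaves 64.
tokens = torch.randn(8, 256, 1280)
short_tokens = randomly_mask_image_tokens(tokens, keep_ratio=0.25)
print(short_tokens.shape)  # torch.Size([8, 64, 1280])
```

Processing only the retained tokens is what shrinks the per-step cost of a large encoder; an analogous truncation can be applied to the text tokens.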