PaLI-X: On Scaling up a Multilingual Vision and Language Model

Alabdulmohsin, Ibrahim; Amelot, Julien; Angelova, Anelia; Arnab, Anurag; Beyer, Lucas; Changpinyo, Soravit; Chen, Xi; Dehghani, Mostafa; Djolonga, Josip; Goodman, Sebastian; Houlsby, Neil; Hu, Hexiang; Joshi, Mandar; Keysers, Daniel; Kolesnikov, Alexander; Lee, Kenton; Li, Gang; Li, Yang; Lucic, Mario; Minderer, Matthias; Montgomery, Ceslee; Mustafa, Basil; Nagrani, Arsha; Padlewski, Piotr; Pang, Bo; Pavetic, Filip; Piergiovanni, AJ; Pietrzyk, Paulina; Ritter, Marvin; Rong, Keran; Ruiz, Carlos Riquelme; Salz, Daniel; Seyedhosseini, Mojtaba; Shakeri, Siamak; Soricut, Radu; Steiner, Andreas Peter; Tay, Yi; Tschannen, Michael; Wang, Xiao; Waters, Austin; Wu, Jialin; Xu, Yuanzhong; Zhai, Xiaohua

Abstract

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state of the art on most of the vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, which are tasks not explicitly included in the training mix.

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2305.18565

Last updated on 02/06/2023