2 research outputs found
PaLI-X: On Scaling up a Multilingual Vision and Language Model
We present the training recipe and results of scaling up PaLI-X, a
multilingual vision and language model, both in terms of size of the components
and the breadth of its training task mixture. Our model achieves new levels of
performance on a wide-range of varied and complex tasks, including multiple
image-based captioning and question-answering tasks, image-based document
understanding and few-shot (in-context) learning, as well as object detection,
video question answering, and video captioning. PaLI-X advances the
state-of-the-art on most vision-and-language benchmarks considered (25+ of
them). Finally, we observe emerging capabilities, such as complex counting and
multilingual object detection, tasks that are not explicitly in the training
mix