Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Bai, Jinze; Bai, Shuai; Lin, Junyang; Tan, Sinan; Wang, Peng; Wang, Shijie; Yang, Shusheng; Zhou, Chang; Zhou, Jingren

Publication date: 24 August 2023

Abstract

We introduce the Qwen-VL series, a set of large-scale vision-language models designed to perceive and understand both text and images. Comprising Qwen-VL and Qwen-VL-Chat, these models exhibit remarkable performance in tasks like image captioning, question answering, visual localization, and flexible interaction. The evaluation covers a wide range of tasks including zero-shot captioning, visual or document visual question answering, and grounding. We demonstrate that Qwen-VL outperforms existing Large Vision-Language Models (LVLMs). We present their architecture, training, capabilities, and performance, highlighting their contributions to advancing multimodal artificial intelligence. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2308.12966

Last time updated on 08/09/2023