Instruction tuning unlocks the superior capability of Large Language Models
(LLMs) to interact with humans. Furthermore, recent instruction-following
datasets include images as visual inputs, collecting responses for image-based
instructions. However, existing visual instruction-tuned models struggle to
comprehend textual details within images. This work enhances the current
visual instruction tuning pipeline with text-rich images (e.g., movie posters
and book covers). Specifically, we first use publicly available OCR tools to
collect results on 422K text-rich images from the LAION dataset. We then
prompt text-only GPT-4 with the recognized text and image captions to generate
16K conversations, each containing question-answer pairs for text-rich images. By
combining our collected data with previous multi-modal instruction-following
data, our model, LLaVAR, substantially improves the LLaVA model's capability on
text-based VQA datasets (up to 20% accuracy improvement) while achieving an
accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following
evaluation also demonstrates our model's improvement on both natural and
text-rich images. Qualitative analysis further shows that LLaVAR exhibits
promising interaction skills with humans (e.g., reasoning, writing, and
elaboration) grounded in the latest real-world online content that combines
text and images. We make our code, data, and models publicly available at
https://llavar.github.io/.
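To make the data-generation step described above concrete, the following is a minimal sketch, not the authors' released pipeline, of how one might prompt text-only GPT-4 with OCR output and an image caption to produce question-answer pairs. The prompt wording, the `generate_qa` helper, the model name, and the sample inputs are illustrative assumptions.

```python
# Minimal sketch: prompt text-only GPT-4 with OCR output and a caption to
# generate QA pairs for a text-rich image. The prompt wording, helper name,
# and sample data are illustrative assumptions, not the paper's actual code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_qa(caption: str, ocr_text: str) -> str:
    """Ask GPT-4 to write QA pairs grounded in the recognized text."""
    system_msg = (
        "You are given a description of an image and the text recognized "
        "inside it. Write a short conversation of question-answer pairs "
        "about the textual content of the image, answering as if you were "
        "looking at the image directly."
    )
    user_msg = f"Caption: {caption}\nRecognized text: {ocr_text}"
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
    )
    return response.choices[0].message.content


# Example usage with a hypothetical book-cover sample.
print(generate_qa(
    caption="A book cover showing a lighthouse at dusk",
    ocr_text="THE SILENT COAST / A NOVEL / J. R. HALE",
))
```

The generated conversations can then be paired with the corresponding images and merged with existing multi-modal instruction-following data for visual instruction tuning.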