We study the problem of completing various visual document understanding
(VDU) tasks, e.g., question answering and information extraction, on real-world
documents through human-written instructions. To this end, we propose
InstructDoc, the first large-scale collection of 30 publicly available VDU
datasets, each with diverse instructions in a unified format, which together
cover a wide range of 12 tasks and open document types/formats. Furthermore,
to enhance the generalization performance on VDU tasks, we design a new
instruction-based document reading and understanding model, InstructDr, which
connects document images, image encoders, and large language models (LLMs)
through a trainable bridging module. Experiments demonstrate that InstructDr
can effectively adapt to new VDU datasets, tasks, and domains via given
instructions and outperforms existing multimodal LLMs and ChatGPT without
task-specific training.

Comment: Accepted by AAAI 2024; project page:
https://github.com/nttmdlab-nlp/InstructDo
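
The abstract states only that InstructDr connects document images, image encoders, and LLMs through a trainable bridging module. Below is a minimal PyTorch sketch of one way such a bridge can work, assuming a Q-Former-style design with learnable query tokens; the dimensions, query count, and stand-in tensors are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the abstract does not specify these.
VISION_DIM, BRIDGE_DIM, LLM_DIM = 768, 512, 4096


class BridgingModule(nn.Module):
    """Trainable bridge between a frozen image encoder and a frozen LLM.

    A set of learnable query tokens cross-attends to the visual features,
    and a linear layer projects the result into the LLM's embedding space.
    """

    def __init__(self, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, BRIDGE_DIM))
        self.vision_proj = nn.Linear(VISION_DIM, BRIDGE_DIM)
        self.cross_attn = nn.MultiheadAttention(
            BRIDGE_DIM, num_heads=8, batch_first=True
        )
        self.to_llm = nn.Linear(BRIDGE_DIM, LLM_DIM)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, VISION_DIM) from the frozen encoder
        kv = self.vision_proj(vision_feats)
        q = self.queries.unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)
        # (batch, num_queries, LLM_DIM): soft "visual tokens" the LLM consumes
        return self.to_llm(out)


# Usage: prepend the bridged visual tokens to the embedded instruction tokens,
# then feed the combined sequence to the frozen LLM. Both inputs here are
# random stand-ins for a real encoder output and instruction embeddings.
bridge = BridgingModule()
vision_feats = torch.randn(2, 196, VISION_DIM)
instr_embeds = torch.randn(2, 24, LLM_DIM)
llm_inputs = torch.cat([bridge(vision_feats), instr_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 56, 4096])
```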