We study the problem of completing various visual document understanding
(VDU) tasks, e.g., question answering and information extraction, on real-world
documents through human-written instructions. To this end, we propose
InstructDoc, the first large-scale collection of 30 publicly available VDU
datasets, each with diverse instructions in a unified format, which together
cover a wide range of 12 tasks and open document types/formats. Furthermore,
to enhance the generalization performance on VDU tasks, we design a new
instruction-based document reading and understanding model, InstructDr, which
connects document images, image encoders, and large language models (LLMs)
through a trainable bridging module. Experiments demonstrate that InstructDr
can effectively adapt to new VDU datasets, tasks, and domains via given
instructions and outperforms existing multimodal LLMs and ChatGPT without
task-specific training.

Comment: Accepted by AAAI 2024; project page:
https://github.com/nttmdlab-nlp/InstructDo
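
The abstract states only that InstructDr connects document images, image encoders, and LLMs through a trainable bridging module. Below is a minimal PyTorch sketch of one way such a bridge can work, assuming a Q-Former-style design with learnable query tokens; the dimensions, query count, and stand-in tensors are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the abstract does not specify these.
VISION_DIM, BRIDGE_DIM, LLM_DIM = 768, 512, 4096


class BridgingModule(nn.Module):
    """Trainable bridge between a frozen image encoder and a frozen LLM.

    A set of learnable query tokens cross-attends to the visual features,
    and a linear layer projects the result into the LLM's embedding space.
    """

    def __init__(self, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, BRIDGE_DIM))
        self.vision_proj = nn.Linear(VISION_DIM, BRIDGE_DIM)
        self.cross_attn = nn.MultiheadAttention(
            BRIDGE_DIM, num_heads=8, batch_first=True
        )
        self.to_llm = nn.Linear(BRIDGE_DIM, LLM_DIM)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, VISION_DIM) from the frozen encoder
        kv = self.vision_proj(vision_feats)
        q = self.queries.unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)
        # (batch, num_queries, LLM_DIM): soft "visual tokens" the LLM consumes
        return self.to_llm(out)


# Usage: prepend the bridged visual tokens to the embedded instruction tokens,
# then feed the combined sequence to the frozen LLM. Both inputs here are
# random stand-ins for a real encoder output and instruction embeddings.
bridge = BridgingModule()
vision_feats = torch.randn(2, 196, VISION_DIM)
instr_embeds = torch.randn(2, 24, LLM_DIM)
llm_inputs = torch.cat([bridge(vision_feats), instr_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 56, 4096])
```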