Document-based Visual Question Answering examines the understanding of
document images conditioned on natural language questions. We propose a
new document-based VQA dataset, PDF-VQA, to comprehensively examine
document understanding from multiple aspects, including document element
recognition, document layout structural understanding, contextual
understanding, and key information extraction. Our PDF-VQA dataset extends the
current scope of document understanding, which is limited to a single document
page, to asking questions over full documents of multiple pages.
We also propose a new graph-based VQA model that explicitly integrates the
spatial and hierarchical structural relationships among different document
elements to boost document structural understanding. Its performance is
compared with several baselines over different question types and
tasks\footnote{The full dataset will be released after paper acceptance.}.