This paper introduces a deep learning model tailored for document information
analysis, emphasizing document classification, entity relation extraction, and
document visual question answering. The proposed model leverages
transformer-based models to encode all the information present in a document
image, including textual, visual, and layout information. The model is
pre-trained and subsequently fine-tuned for various document image analysis
tasks. The proposed model incorporates three additional tasks during the
pre-training phase: identifying the reading order of the layout segments in a
document image, categorizing layout segments according to the PubLayNet taxonomy,
and generating the text sequence within a given layout segment (text block).
The model also employs a collective pre-training scheme in which the losses of
all tasks, spanning both pre-training and fine-tuning across all datasets, are
jointly optimized. Additional encoder and decoder blocks are added to the
RoBERTa network to produce outputs for all tasks. The proposed
model achieved impressive results across all tasks, with an accuracy of 95.87%
on the RVL-CDIP dataset for document classification, F1 scores of 0.9306,
0.9804, 0.9794, and 0.8742 on the FUNSD, CORD, SROIE, and Kleister-NDA datasets
respectively for entity relation extraction, and an ANLS score of 0.8468 on the
DocVQA dataset for visual question answering. The results highlight the
effectiveness of the proposed model in understanding and interpreting complex
document layouts and content, making it a promising tool for document analysis
tasks.
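The collective pre-training objective described above can be sketched as a weighted sum of per-task losses; the weighting coefficients \(\lambda_t\) are an assumption for illustration and are not specified in the abstract:

```latex
% Joint objective over the task set T (pre-training and fine-tuning tasks,
% across all datasets). The per-task weights lambda_t are hypothetical.
\mathcal{L}_{\text{total}} = \sum_{t \in \mathcal{T}} \lambda_t \, \mathcal{L}_t
```

Under this scheme, a single optimization step updates the shared encoder with gradients from every task simultaneously rather than pre-training and fine-tuning in strictly separate stages.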