TransDocAnalyser: A Framework for Offline Semi-structured Handwritten Document Analysis in the Legal Domain
State-of-the-art offline Optical Character Recognition (OCR) frameworks
perform poorly on semi-structured handwritten domain-specific documents due to
their inability to localize and label form fields with domain-specific
semantics. Existing techniques for semi-structured document analysis have
primarily used datasets comprising invoices, purchase orders, receipts, and
identity-card documents for benchmarking. In this work, we build the first
semi-structured document analysis dataset in the legal domain by collecting a
large number of First Information Report (FIR) documents from several police
stations in India. This dataset, which we call the FIR dataset, is more
challenging than most existing document analysis datasets, since it combines a
wide variety of handwritten text with printed text. We also propose an
end-to-end framework for offline processing of handwritten semi-structured
documents, and benchmark it on our novel FIR dataset. Our framework uses an
encoder-decoder architecture for localizing and labelling the form fields and
for recognizing the handwritten content. The encoder consists of Faster R-CNN
and Vision Transformers. Further, the Transformer-based decoder is
trained with a domain-specific tokenizer. We also propose a post-correction
method to handle recognition errors pertaining to the domain-specific terms.
Our proposed framework achieves state-of-the-art results on the FIR dataset,
outperforming several existing models.
Comment: This paper has been accepted at the 17th International Conference on
Document Analysis and Recognition (ICDAR) as an oral presentation.
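
The abstract mentions a post-correction method for domain-specific terms but does not describe how it works. As a rough illustration of the general idea only, and not the authors' method, the sketch below snaps recognized tokens to the closest entry of a small lexicon using fuzzy string matching from Python's standard library. The lexicon contents, the function name `post_correct`, and the similarity cutoff are all assumptions made for this example.

```python
import difflib

# Hypothetical domain lexicon; these terms are illustrative and are not
# taken from the FIR dataset or the paper.
DOMAIN_LEXICON = ["complainant", "accused", "section", "police", "station", "offence"]


def post_correct(tokens, lexicon=DOMAIN_LEXICON, cutoff=0.8):
    """Replace each token with its closest lexicon entry when similar enough.

    Tokens already in the lexicon, or with no entry at or above `cutoff`
    similarity, are returned unchanged.
    """
    corrected = []
    for tok in tokens:
        if tok.lower() in lexicon:
            corrected.append(tok)
            continue
        # get_close_matches ranks candidates by SequenceMatcher ratio and
        # keeps only those scoring at least `cutoff`.
        matches = difflib.get_close_matches(tok.lower(), lexicon, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else tok)
    return corrected


if __name__ == "__main__":
    print(post_correct(["complainent", "at", "polce", "station"]))
```

With this lexicon, the misrecognized tokens "complainent" and "polce" are mapped to "complainant" and "police", while "at" is left alone because no lexicon entry is similar enough. A high cutoff is a deliberate choice in this sketch: it keeps ordinary words from being overwritten by domain terms, at the cost of missing heavily garbled ones.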