Adaptive Methods for Robust Document Image Understanding
A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding are a pressing necessity. We propose a generic framework for document image understanding systems, applicable to practically any document type available in digital form. Following the introduced workflow, we turn our attention to each processing stage in turn: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation, and logical layout analysis. We review the state of the art in each area, identify current deficiencies, point out promising directions, and give specific guidelines for future investigation. We address some of the identified issues by means of novel algorithmic solutions, with special focus on generality, computational efficiency, and the exploitation of all available sources of information. More specifically, we introduce the following original methods: fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot-metal typeset prints, a theoretically optimal solution to the document binarization problem from both a computational-complexity and a threshold-selection point of view, layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front-page detection algorithm, and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets consisting of real-life heterogeneous document scans. The obtained results show that a document understanding system combining these modules can robustly process a wide variety of documents with good overall accuracy.
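The binarization stage above hinges on threshold selection. As a hedged illustration only (the abstract does not specify the authors' optimal algorithm), the classic Otsu criterion shows the kind of computation involved: pick the global threshold that maximizes between-class variance of the grayscale histogram. The function name and NumPy implementation below are our own sketch, not the paper's method:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the global threshold maximizing between-class variance
    over an 8-bit grayscale image (Otsu's criterion)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                # class-0 mass up to each threshold
    mu = np.cumsum(prob * np.arange(256))  # cumulative intensity mean
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denom > 0                      # skip degenerate all-in-one-class splits
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))
```

On a cleanly bimodal document scan this recovers a threshold between the ink and background modes; real historical material usually needs the local or adaptive refinements the abstract alludes to.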
MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding
Document images are a ubiquitous source of data where the text is organized
in a complex hierarchical structure ranging from fine granularity (e.g.,
words), medium granularity (e.g., regions such as paragraphs or figures), to
coarse granularity (e.g., the whole page). The spatial hierarchical
relationships between content at different levels of granularity are crucial
for document image understanding tasks. Existing methods learn features from
either word-level or region-level but fail to consider both simultaneously.
Word-level models are restricted by the fact that they originate from pure-text
language models, which only encode the word-level context. In contrast,
region-level models attempt to encode regions corresponding to paragraphs or
text blocks into a single embedding, but they perform worse with additional
word-level features. To deal with these issues, we propose MGDoc, a new
multi-modal multi-granular pre-training framework that encodes page-level,
region-level, and word-level information at the same time. MGDoc uses a unified
text-visual encoder to obtain multi-modal features across different
granularities, which makes it possible to project the multi-granular features
into the same hyperspace. To model the region-word correlation, we design a
cross-granular attention mechanism and specific pre-training tasks that
reinforce the model's learning of the hierarchy between regions and
words. Experiments demonstrate that our proposed model can learn better
features that perform well across granularities and lead to improvements in
downstream tasks.
Comment: EMNLP 202
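As a rough sketch of what a cross-granular attention step can look like (the function, shapes, and single-head form below are our illustration, not MGDoc's actual architecture), word queries can attend over region embeddings so that each word representation is enriched with its region context:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_granular_attention(words: np.ndarray, regions: np.ndarray) -> np.ndarray:
    """Single-head attention where word vectors (queries) attend over
    region vectors (keys/values), mixing region context into each word."""
    d_k = words.shape[-1]
    scores = words @ regions.T / np.sqrt(d_k)   # (n_words, n_regions)
    weights = softmax(scores, axis=-1)          # rows sum to 1
    return weights @ regions                    # (n_words, d)
```

In a full model this would use learned query/key/value projections and multiple heads; the point here is only the word-to-region attention pattern.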
LayoutLM: Pre-training of Text and Layout for Document Image Understanding
Pre-training techniques have been verified successfully in a variety of NLP
tasks in recent years. Despite the widespread use of pre-training models for
NLP applications, they almost exclusively focus on text-level manipulation,
while neglecting layout and style information that is vital for document image
understanding. In this paper, we propose the \textbf{LayoutLM} to jointly model
interactions between text and layout information across scanned document
images, which is beneficial for a great number of real-world document image
understanding tasks such as information extraction from scanned documents.
Furthermore, we also leverage image features to incorporate words' visual
information into LayoutLM. To the best of our knowledge, this is the first time
that text and layout are jointly learned in a single framework for
document-level pre-training. It achieves new state-of-the-art results in
several downstream tasks, including form understanding (from 70.72 to 79.27),
receipt understanding (from 94.02 to 95.24) and document image classification
(from 93.07 to 94.42). The code and pre-trained LayoutLM models are publicly
available at \url{https://aka.ms/layoutlm}.
Comment: KDD 202
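LayoutLM's core idea of jointly modeling text and layout rests on adding 2-D position embeddings for each word's bounding box (coordinates normalized to a 0-1000 grid) to its token embedding. The table sizes, dimensions, and random initialization below are illustrative, not the released model's:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, GRID, DIM = 30522, 1024, 64  # coords pre-scaled into [0, 1000]

# Separate lookup tables: token embeddings plus x- and y-coordinate embeddings.
tok_emb = rng.normal(size=(VOCAB, DIM))
x_emb = rng.normal(size=(GRID, DIM))
y_emb = rng.normal(size=(GRID, DIM))

def embed(token_id: int, box: tuple) -> np.ndarray:
    """Sum the token embedding with the 2-D position embeddings of the
    word's bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return (tok_emb[token_id]
            + x_emb[x0] + x_emb[x1]
            + y_emb[y0] + y_emb[y1])
```

Because the layout signal enters as plain additive embeddings, the rest of the Transformer stack is unchanged, which is what lets the framework reuse standard text pre-training machinery.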
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
We propose SeRum (SElective Region Understanding Model), a novel end-to-end
document understanding model for extracting meaningful information from
document images, with applications in document analysis, retrieval, and
office automation.
Unlike state-of-the-art approaches that rely on multi-stage technical schemes
and are computationally expensive,
SeRum converts document image understanding and recognition tasks into a
local decoding process of the visual tokens of interest, using a content-aware
token merge module.
This mechanism enables the model to pay more attention to regions of interest
generated by the query decoder, improving the model's effectiveness and
speeding up decoding in the generative scheme.
We also design several pre-training tasks to enhance the model's
understanding and local awareness.
Experimental results demonstrate that SeRum achieves state-of-the-art
performance on document understanding tasks and competitive results on text
spotting tasks.
SeRum represents a substantial advancement towards enabling efficient and
effective end-to-end document understanding.
Comment: Accepted to ICCV 2023 main conference
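The content-aware token merge can be pictured, in a very simplified form (our own sketch; the abstract does not describe SeRum's actual module), as keeping the top-scoring visual tokens intact and collapsing the remainder into a single background token:

```python
import numpy as np

def merge_tokens(tokens: np.ndarray, scores: np.ndarray, k: int) -> np.ndarray:
    """Keep the k highest-scoring visual tokens and average the rest
    into one background token, shrinking the sequence the decoder sees."""
    order = np.argsort(scores)[::-1]          # indices, best score first
    keep, rest = order[:k], order[k:]
    merged = tokens[rest].mean(axis=0, keepdims=True)
    return np.concatenate([tokens[keep], merged], axis=0)  # (k + 1, d)
```

Shrinking the token sequence before generative decoding is what yields the speed-up the abstract claims; the scoring itself would come from the query decoder's attention in the real model.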