486 research outputs found
Advanced document data extraction techniques to improve supply chain performance
In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies.The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model is a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bonding boxes. For text extraction from the bounding box, a novel data extraction framework consisting of various processes including XML processing in case of existing OCR engine, bounding box pre-processing, text clean up, OCR error correction, spell check, type check, pattern-based matching, and finally, a learning mechanism for automatizing future data extraction was designed. Whichever fields the system can extract successfully are provided in key-value format.The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices that were collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks and later, a rule-based engine is used to extract relevant data. While the system’s methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing their clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation. This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with the process of performing similar tasks manually using excel sheets and Business Intelligence (BI) tools.The intention behind the development of this novel methodology was twofold. First, to test and develop a novel solution that does not depend on any specific OCR technology. Second, to increase the information extraction accuracy factor over that of existing methodologies. Finally, it evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information
TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents
Recently, automatically extracting information from visually rich documents
(e.g., tickets and resumes) has become a hot and vital research topic due to
its widespread commercial value. Most existing methods divide this task into
two subparts: the text reading part for obtaining the plain text from the
original document images and the information extraction part for extracting key
contents. These methods mainly focus on improving the second, while neglecting
that the two parts are highly correlated. This paper proposes a unified
end-to-end information extraction framework from visually rich documents, where
text reading and information extraction can reinforce each other via a
well-designed multi-modal context block. Specifically, the text reading part
provides multi-modal features like visual, textual and layout features. The
multi-modal context block is developed to fuse the generated multi-modal
features and even the prior knowledge from the pre-trained language model for
better semantic representation. The information extraction part is responsible
for generating key contents with the fused context features. The framework can
be trained in an end-to-end trainable manner, achieving global optimization.
What is more, we define and group visually rich documents into four categories
across two dimensions, the layout and text type. For each document category, we
provide or recommend the corresponding benchmarks, experimental settings and
strong baselines for remedying the problem that this research area lacks the
uniform evaluation standard. Extensive experiments on four kinds of benchmarks
(from fixed layout to variable layout, from full-structured text to
semi-unstructured text) are reported, demonstrating the proposed method's
effectiveness. Data, source code and models are available
Filled-in document image identification using landmarks
A technique is proposed to classify documents based on landmarks, with a reject option.Carrión Robles, D. (2011). Filled-in document image identification using landmarks. http://hdl.handle.net/10251/15836Archivo delegad
LayoutLM: Pre-training of Text and Layout for Document Image Understanding
Pre-training techniques have been verified successfully in a variety of NLP
tasks in recent years. Despite the widespread use of pre-training models for
NLP applications, they almost exclusively focus on text-level manipulation,
while neglecting layout and style information that is vital for document image
understanding. In this paper, we propose the \textbf{LayoutLM} to jointly model
interactions between text and layout information across scanned document
images, which is beneficial for a great number of real-world document image
understanding tasks such as information extraction from scanned documents.
Furthermore, we also leverage image features to incorporate words' visual
information into LayoutLM. To the best of our knowledge, this is the first time
that text and layout are jointly learned in a single framework for
document-level pre-training. It achieves new state-of-the-art results in
several downstream tasks, including form understanding (from 70.72 to 79.27),
receipt understanding (from 94.02 to 95.24) and document image classification
(from 93.07 to 94.42). The code and pre-trained LayoutLM models are publicly
available at \url{https://aka.ms/layoutlm}.Comment: KDD 202
- …