Development of an Artificial Intelligence-based Solution for Document Processing Automation Using Machine Learning and NLP Techniques

Abstract

This thesis focuses on Intelligent Document Processing (IDP), which aims to automate document-processing activities using Artificial Intelligence technologies, particularly Machine Learning and Natural Language Processing techniques. The proposed solution seeks to improve the efficiency and quality of document processing in a range of business and organizational contexts by automating tasks such as classification, information extraction, validation, and verification of consistency between documents. The work comprises four phases: text identification, OCR, invoice data extraction, and quality assurance. For document files, data extraction is performed in the first phase. The thesis details the IDP solution developed, analyses the processing results and the quality of the extracted information, and evaluates the accuracy and efficiency of the system. It concentrates on extracting information from key fields of invoices using two different methods based on sequence labeling. Invoices are unstructured documents in which data can be located based on context. Extraction methods of this kind are expected to perform well on document templates they have been trained on, but processing new templates often requires new manual annotations, produced with tools such as Prodigy, and creating this labeled data is tedious and time-consuming. The thesis presents a set of experiments using neural network methods to examine the trade-off between data requirements and effectiveness in extracting information from key invoice fields (such as invoice date, invoice number, order number, amount, and supplier name). The main contribution is a system that achieves competitive results with a small amount of data, compared with state-of-the-art systems that need to be trained on large datasets, by using a custom Named Entity Recognition (NER) model to extract the relevant information from non-uniform commercial invoice formats.

Similar works