Advanced document data extraction techniques to improve supply chain performance
In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews conducted at selected companies. The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model consists of a generator network built on the Faster R-CNN model and a discriminator based on PatchGAN. The output of the GAN model is text data with bounding boxes.
For text extraction from the bounding boxes, a novel data extraction framework was designed, consisting of various processes including XML processing (in the case of an existing OCR engine), bounding box pre-processing, text clean-up, OCR error correction, spell checking, type checking, pattern-based matching, and finally a learning mechanism for automating future data extraction. Whichever fields the system can extract successfully are provided in key-value format. The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks, and later a rule-based engine is used to extract relevant data. While this methodology is robust, the companies surveyed were not satisfied with its accuracy and thus sought new, optimised solutions. To confirm the results, the engines were used to return XML-based files with the identified text and metadata. The output XML data was then fed into the new system for information extraction. This system uses both the existing OCR engine and a novel, self-adaptive, learning-based OCR engine, built on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company, based in London, that specialises in reducing its clients' procurement costs. This data was fed into the system to obtain a deeper level of spend classification and categorisation.
This helped the company reduce its reliance on human effort and allowed for greater efficiency compared with performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools. The intention behind the development of this novel methodology was twofold: first, to develop and test a novel solution that does not depend on any specific OCR technology; second, to increase information extraction accuracy over that of existing methodologies. The thesis also evaluates the real-world need for the system and the impact it would have on SCM. The newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimising SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.
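The pattern-based matching and OCR error correction stages of such an extraction framework can be sketched as follows. This is a minimal, self-contained Python illustration; the field names, regular expressions, and character-confusion table are hypothetical stand-ins for the framework's actual learned rules:

```python
import re

# Common OCR digit confusions, applied only inside digit-heavy tokens.
OCR_CONFUSIONS = str.maketrans({"O": "0", "l": "1", "S": "5"})

# Hypothetical per-field patterns standing in for the learned rule set.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No|Number)[.:#]?\s*([A-Z0-9-]+)", re.I),
    "total": re.compile(r"Total\s*(?:Due)?[.:]?\s*\$?([0-9][0-9.,]*)", re.I),
    "date": re.compile(r"Date[.:]?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", re.I),
}

def normalise_digits(token: str) -> str:
    """Fix common OCR confusions, but only in tokens that are mostly digits."""
    if sum(c.isdigit() for c in token) >= len(token) // 2:
        return token.translate(OCR_CONFUSIONS)
    return token

def extract_fields(ocr_text: str) -> dict:
    """Return whichever fields match, in key-value format."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            fields[name] = normalise_digits(match.group(1))
    return fields
```

Fields that fail to match are simply omitted, mirroring the "whichever fields the system can extract successfully" behaviour described above.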
Combination of deep neural networks and logical rules for record segmentation in historical handwritten registers using few examples
This work focuses on the layout analysis of historical handwritten registers in which local religious ceremonies were recorded. The aim of this work is to delimit each record in these registers. To this end, two approaches are proposed. Firstly, object detection networks are explored, and three state-of-the-art architectures are compared. Further experiments are then conducted on Mask R-CNN, as it yields the best performance. Secondly, we introduce and investigate Deep Syntax, a hybrid system that takes advantage of recurrent patterns to delimit each record by combining U-shaped networks and logical rules. Finally, these two approaches are evaluated on 3708 French records (16th-18th centuries), as well as on the Esposalles public database, containing 253 Spanish records (17th century). While both systems perform well on homogeneous documents, we observe a significant drop in performance with Mask R-CNN on heterogeneous documents, especially when it is trained on a non-representative subset. By contrast, Deep Syntax relies on steady patterns and is therefore able to process a wider range of documents with less training data. Not only does Deep Syntax produce 15% more match configurations and reduce the ZoneMap surface error metric by 30% when both systems are trained on 120 images, but it also outperforms Mask R-CNN when trained on a database three times smaller. As Deep Syntax generalises better, we believe it can be used in the context of massive document processing, where collecting and annotating a sufficiently large and representative set of training data is not always achievable.
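The logical-rule half of such a hybrid system can be illustrated with a minimal sketch: assuming a network has already assigned a class to each text line (the labels below are hypothetical), a simple rule exploiting the recurrent record-opening pattern groups the lines into records:

```python
# Illustrative sketch of the rule component of a hybrid segmentation system.
# A neural network is assumed to have labelled each text line; the logical
# rule "a record opens with a 'name' line" then delimits the records.
def segment_records(line_labels):
    """Group line indices into records; each record starts at a 'name' line."""
    records, current = [], []
    for idx, label in enumerate(line_labels):
        if label == "name" and current:  # recurrent pattern: a new record begins
            records.append(current)
            current = []
        current.append(idx)
    if current:
        records.append(current)
    return records
```

Because the rule encodes a steady structural pattern rather than visual appearance, it degrades less on heterogeneous pages, which is the intuition behind Deep Syntax's better generalisation.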
Towards robust real-world historical handwriting recognition
In this thesis, we make a bridge from the past to the future by using artificial-intelligence methods for text recognition in a historical Dutch collection of the Natuurkundige Commissie, which explored Indonesia (1820-1850). In spite of the successes of systems like ChatGPT, reading historical handwriting is still quite challenging for AI. Whereas GPT-like methods work on digital texts, historical manuscripts are only available as extremely diverse collections of (pixel) images. Despite their strong results, current deep learning methods are very data-greedy and time-consuming, depend heavily on humanities experts for labelling, and require machine-learning experts for designing the models. Ideally, the use of deep learning methods should require minimal human effort, have an algorithm observe the evolution of the training process, and avoid inefficient use of the already sparse amount of labelled data. We present several approaches towards dealing with these problems, aiming to improve the robustness of current methods and the autonomy of training. We applied our novel word and line text recognition approaches on nine data sets differing in time period, language, and difficulty: three locally collected historical Latin-based data sets from Naturalis, Leiden; four public Latin-based benchmark data sets for comparability with other approaches; and two Arabic data sets. Using ensemble voting of just five neural networks, a level of accuracy was achieved which required hundreds of neural networks in earlier studies. Moreover, we increased the speed of evaluation of each training epoch without the need for labelled data.
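The ensemble voting idea can be sketched in a few lines: each of the five recognisers (here just lists of word hypotheses, not actual networks) proposes a transcription per word image, and the majority hypothesis wins, with ties falling back to the first model:

```python
from collections import Counter

def vote(hypotheses):
    """hypotheses: transcriptions for one word, one per model."""
    best, count = Counter(hypotheses).most_common(1)[0]
    # Fall back to the first model's output when there is no clear majority.
    return best if count > 1 else hypotheses[0]

def ensemble_transcribe(per_model_outputs):
    """per_model_outputs: one word list per model, all of equal length."""
    return [vote(list(words)) for words in zip(*per_model_outputs)]
```

The heavy lifting remains in the individual recognisers; the sketch only shows why a handful of diverse models can correct each other's isolated errors.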
Article Segmentation in Digitised Newspapers
Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction. The lack of article segmentation impedes these applications. We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We solve these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches. Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve these problems by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces character-level article segmentations nearly as well as costly human annotators. We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering.
We propose several techniques for block representation and contribute a novel, highly compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti's but still falls short of human performance. Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art approach of Bansal et al. (2014). We contribute an innovative 2D Markov model approach that captures reading order dependencies and reduces the structured labelling problem to a Markov chain that we decode with the Viterbi (1967) algorithm. Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation. Our task definition, evaluation framework, and distant supervision dataset will encourage progress on the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities.
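The Viterbi decoding step of such a reading-order Markov chain can be illustrated with a toy sketch. The two labels and all probabilities below are illustrative placeholders, not the values learned in the thesis:

```python
import math

STATES = ("headline", "body")
# Illustrative start and transition probabilities over block labels.
START = {"headline": 0.6, "body": 0.4}
TRANS = {
    "headline": {"headline": 0.2, "body": 0.8},
    "body": {"headline": 0.3, "body": 0.7},
}

def viterbi(emission_scores):
    """emission_scores: per block (in reading order), dict state -> P(obs | state).
    Returns the most probable label sequence."""
    score = {s: math.log(START[s] * emission_scores[0][s]) for s in STATES}
    back = []
    for obs in emission_scores[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s] * obs[s])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Follow back-pointers from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Working in log space avoids numerical underflow on long pages, a standard choice for Viterbi decoding.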
Advances in Character Recognition
This book presents advances in character recognition. It consists of 12 chapters covering a wide range of topics on different aspects of character recognition. Hopefully, it will serve as a reference source for academic research, for professionals working in the character recognition field, and for all who are interested in the subject.
B!SON: A Tool for Open Access Journal Recommendation
Finding a suitable open access journal in which to publish scientific work is a complex task: researchers have to navigate a constantly growing number of journals, institutional agreements with publishers, funders' conditions, and the risk of predatory publishers. To help with these challenges, we introduce B!SON, a web-based journal recommendation system. It is developed on the basis of a systematic requirements analysis, is built on open data, gives publisher-independent recommendations, and works across domains. It suggests open access journals based on the title, abstract, and references provided by the user. The recommendation quality has been evaluated using a large test set of 10,000 articles. Development by two German scientific libraries ensures the longevity of the project.
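The core text-similarity idea behind such a recommender can be sketched with plain bag-of-words cosine similarity. B!SON's actual ranking model is considerably more elaborate, and the journal corpora below are hypothetical:

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as a token-count dictionary."""
    return Counter(text.lower().split())

def cosine(a, b):
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(abstract, journal_corpora, top_k=3):
    """journal_corpora: journal name -> concatenated text of its published abstracts."""
    query = bow(abstract)
    ranked = sorted(journal_corpora,
                    key=lambda j: cosine(query, bow(journal_corpora[j])),
                    reverse=True)
    return ranked[:top_k]
```

A production system would weight terms (e.g. TF-IDF), use the references as an additional signal, and filter by open-access status, but the ranking skeleton is the same.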
Jewish Studies in the Digital Age
The digitisation boom of the last two decades, and the rapid advancement of digital tools to analyse data in myriad ways, have opened up new avenues for humanities research. This volume discusses how the so-called digital turn has affected the field of Jewish Studies, explores the current state of the art and probes how digital developments can be harnessed to address the specific questions, challenges and problems in the field.
Deep Learning Methods for Dialogue Act Recognition using Visual Information
Dialogue act (DA) recognition is an important step of dialogue management and understanding. The task is to automatically assign a label to an utterance (or a part of it) based on its function in a dialogue (e.g. statement, question, backchannel, etc.). Such utterance-level classification helps to model and identify the structure of spontaneous dialogues. Even though DA recognition is usually performed on audio data using an automatic speech recognition engine, dialogues also exist in the form of images (e.g. comic books).
This thesis deals with automatic dialogue act recognition from image documents.
To the best of our knowledge, this is the first attempt to propose DA recognition approaches that use images as input.
For this task, it is necessary to extract the text from the images.
Therefore, we employ algorithms from the fields of computer vision and image processing, such as image thresholding, text segmentation, and optical character recognition (OCR). The main contribution in this field is the design and implementation of a custom OCR model based on convolutional and recurrent neural networks. We also explore different strategies for training such a model, including synthetic data generation and data augmentation techniques. We achieve new state-of-the-art OCR results in settings where only a small amount of training data is available. Summing up, our contribution also includes an overview of how to create an efficient OCR system with minimal annotation costs.
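One of the augmentation techniques alluded to above can be sketched on a binary glyph bitmap. The small horizontal jitter below is an illustrative stand-in for a full augmentation pipeline (rotation, scaling, elastic distortion, noise):

```python
import random

def shift_bitmap(bitmap, max_shift=2, rng=random):
    """bitmap: list of rows (lists of 0/1). Returns a copy shifted horizontally
    by a random amount, padded with background (0) so dimensions are preserved."""
    shift = rng.randint(-max_shift, max_shift)
    width = len(bitmap[0])
    shifted = []
    for row in bitmap:
        if shift >= 0:
            shifted.append([0] * shift + row[:width - shift])
        else:
            shifted.append(row[-shift:] + [0] * (-shift))
    return shifted

def augment(bitmap, n_copies=10, seed=0):
    """Generate several jittered copies of one labelled sample."""
    rng = random.Random(seed)
    return [shift_bitmap(bitmap, rng=rng) for _ in range(n_copies)]
```

Each jittered copy keeps its original transcription label, so a handful of annotated line images can be multiplied into a much larger training set at no annotation cost.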
We further deal with multilinguality in the DA recognition field. We successfully employ one general model trained on data from all available languages, as well as several models trained on a single language, where cross-linguality is achieved using semantic space transformations. Moreover, we explore transfer learning for DA recognition where only a small amount of annotated data is available. We use word-level and utterance-level features, and our models employ deep neural network architectures, including Transformers. We obtain new state-of-the-art results in multi- and cross-lingual DA recognition.
For DA recognition from image documents, we propose and implement a novel multimodal model based on convolutional and recurrent neural networks. This model combines text and image inputs: the text part is fed with text tokens from OCR, while the visual part extracts image features that serve as an auxiliary input. Text extracted from dialogues is often erroneous and contains typos or other lexical errors. We show that the multimodal model copes with the erroneous text and that the visual information partially compensates for this loss of information.
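The fusion idea can be shown schematically: a text branch and a visual branch each yield a fixed-size feature vector, and the classifier sees their concatenation, so visual features can compensate when the OCR text is noisy. The linear scorer below is a toy stand-in for the network's actual recurrent and convolutional branches:

```python
def fuse(text_features, visual_features):
    """Late fusion by concatenating the two feature vectors."""
    return list(text_features) + list(visual_features)

def classify(fused, weights, labels):
    """A linear scorer standing in for the network's classification head;
    weights has one row of length len(fused) per candidate label."""
    scores = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    return labels[scores.index(max(scores))]
```

Because both feature vectors enter the same classifier, a degraded text signal simply lowers the weight of the text dimensions rather than breaking the prediction outright.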