285 research outputs found

    Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations

    Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important for building scalable digital library search engines. Most existing methods are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as ETDs. Traditional sequence tagging methods mainly rely on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a new ground truth corpus consisting of 500 ETD cover pages with human-validated metadata. Our experiments show that the CRF with visual features outperformed both a heuristic and a CRF model with only text-based features. The proposed model achieved F1 measures of 81.3% to 96% on seven metadata fields. The data and source code are publicly available on Google Drive (https://tinyurl.com/y8kxzwrp) and a GitHub repository (https://github.com/lamps-lab/ETDMiner/tree/master/etd_crf), respectively. Comment: 7 pages, 4 figures, 1 table. Accepted by JCDL '21 as a short paper.

    Literature-driven Curation for Taxonomic Name Databases

    Digitized biodiversity literature provides a wealth of content that lets machines make use of biodiversity knowledge. However, identifying taxonomic names and the associated semantic metadata is a difficult and labour-intensive process. We present a system to support human-assisted creation of semantic metadata. Information extraction techniques automatically identify taxonomic names in scanned documents. The names are then presented to users for manual correction or verification. The tools that support the curation process include taxonomic name identification and mapping, and community-driven taxonomic name verification. Our research shows the potential for these information extraction techniques to support research and curation in disciplines dependent upon scanned documents.

    Comparative study of NER using Bi-LSTM-CRF with different word vectorisation techniques on DNB documents

    The presence of huge volumes of unstructured data in the form of PDF documents poses a challenge to organizations trying to extract valuable information from it. In this thesis, we address this problem according to the requirements of DNB by building an automatic information extraction system that retrieves from the PDF documents only the key information in which the company is interested. This is achieved by comparing the performance of named entity recognition (NER) models for automatic text extraction, built using a Bi-directional Long Short Term Memory (Bi-LSTM) network with a Conditional Random Field (CRF) in combination with three variations of word vectorisation techniques. The word vectorisation techniques compared in this thesis are randomly initialised word embeddings from the Keras embedding layer, pre-trained static word embeddings (focusing on 100-dimensional GloVe embeddings) and, finally, deep-contextual ELMo word embeddings. Comparing these models helps us identify the advantages and disadvantages of different word embeddings by analysing their effect on NER performance. This study was performed on a data set provided by DNB. The comparative study showed that the NER system built using Bi-LSTM-CRF with GloVe embeddings gave the best results, with a micro F1 score of 0.868 and a macro F1 score of 0.872 on unseen data, in comparison to Bi-LSTM-CRF based NER using the Keras embedding layer and ELMo embeddings, which gave micro F1 scores of 0.858 and 0.796 and macro F1 scores of 0.848 and 0.776 respectively. This result is contrary to our assumption that NER using deep contextualised word embeddings performs better than NER using other word embeddings. We hypothesised that this unexpected result is due to the high dimensionality, and we analysed it by using a lower-dimensional word embedding. It was found that using 50-dimensional GloVe embeddings instead of 100-dimensional GloVe embeddings improved the overall micro and macro F1 scores from 0.87 to 0.88. Additionally, optimising the best model, which was the Bi-LSTM-CRF using 100-dimensional GloVe embeddings, by tuning in a small hyperparameter search space did not improve on the existing micro F1 score of 0.87 and macro F1 score of 0.87. M30-DV Master's Thesis. M-D

    Fine-tuning a transformers-based model to extract relevant fields from invoices

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. Extraction of relevant fields from documents has been an important problem for decades. Although well-established algorithms for this task have existed since the late 20th century, the field has attracted renewed attention with the fast growth of deep learning models and transfer learning. One of these models is LayoutLM, a Transformer-based architecture pre-trained with additional features that represent the 2D position of the words. In this dissertation, LayoutLM is fine-tuned on a set of invoices to extract some of their relevant fields, such as company name, address, and document date, among others. Given the objective of deploying the model in a company’s internal accounting software, an end-to-end machine learning pipeline is presented. The training layer receives batches of document images and their corresponding annotations and fine-tunes the model for a sequence labeling task. The production layer takes images as input and predicts the relevant fields. The images are pre-processed by extracting the whole document text and bounding boxes using OCR. To automatically label the samples in the Transformers-based input format, the text is labeled using an algorithm that searches for parts of the text equal or highly similar to the annotations. Also, a new dataset to support this work is created and made publicly available. The dataset consists of 813 pictures and the annotation text for every relevant field: company name, company address, document date, document number, buyer tax number, seller tax number, total amount and tax amount. The models are fine-tuned and compared with two baseline models, showing a performance very close to that reported by the model authors. A sensitivity analysis is made to understand the impact of two datasets with different characteristics. In addition, the learning curves for different datasets show empirically that 100 to 200 samples are enough to fine-tune the model and achieve top performance. Based on the results, a strategy for model deployment is defined. Empirical results show that the already fine-tuned model is enough to guarantee top performance in production without the need for online learning algorithms.

    Advanced document data extraction techniques to improve supply chain performance

    In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies. The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. A Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model consists of a generator network implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes. For text extraction from the bounding boxes, a novel data extraction framework was designed, consisting of XML processing (in the case of an existing OCR engine), bounding box pre-processing, text clean-up, OCR error correction, spell check, type check, pattern-based matching, and finally, a learning mechanism for automating future data extraction. Whichever fields the system can extract successfully are provided in key-value format. The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices that were collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks, and later a rule-based engine is used to extract relevant data. While this existing methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into the new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing its clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation. This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools. The intention behind the development of this novel methodology was twofold. First, to test and develop a novel solution that does not depend on any specific OCR technology. Second, to increase the information extraction accuracy over that of existing methodologies. Finally, the thesis evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.