
    Advanced document data extraction techniques to improve supply chain performance

    In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis, and converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time and cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews conducted with selected companies.

    The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, its performance is poor when image quality is low. A Generative Adversarial Network (GAN) model is proposed in response to this limitation: a generator network implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes. For text extraction from the bounding boxes, a novel data extraction framework was designed, consisting of XML processing (where an existing OCR engine is available), bounding-box pre-processing, text clean-up, OCR error correction, spell checking, type checking, pattern-based matching and, finally, a learning mechanism that automates future data extraction. Fields the system extracts successfully are provided in key-value format.

    The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. It was also validated on real invoices collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks, and a rule-based engine is then used to extract the relevant data. While this methodology is robust, the companies surveyed were not satisfied with its accuracy and sought new, optimised solutions. To confirm the results, the engines were used to return XML files with the identified text and metadata. The output XML data was then fed into the new system for information extraction. This system uses the existing OCR engine alongside a novel, self-adaptive, learning-based OCR engine built on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company, based in London, with expertise in reducing its clients' procurement costs. This data was fed into the system to obtain a deeper level of spend classification and categorisation.
This helped the company reduce its reliance on human effort and achieve greater efficiency compared with performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools.

The intention behind developing this novel methodology was threefold: first, to develop and test a solution that does not depend on any specific OCR technology; second, to increase information extraction accuracy over that of existing methodologies; and third, to evaluate the real-world need for the system and the impact it would have on SCM. The newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimising SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.
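    As a rough illustration of the pattern-matching stage of the extraction framework described above, the following Python sketch takes OCR'd text blocks with bounding boxes and returns matched fields in key-value format. The field names, regular expressions, and sample blocks are hypothetical; the actual system layers OCR error correction, spell and type checks, and a learning mechanism around this step.

        import re

        # Hypothetical patterns for a few common invoice fields; the real
        # framework adds OCR error correction and type checks around this.
        FIELD_PATTERNS = {
            "invoice_number": re.compile(
                r"invoice\s*(?:no\.?|number|#)?\s*[:\-]?\s*(\w[\w\-/]*)", re.I),
            "date": re.compile(
                r"date\s*[:\-]?\s*(\d{1,2}[./-]\d{1,2}[./-]\d{2,4})", re.I),
            "total": re.compile(
                r"(?:grand\s+)?total\s*[:\-]?\s*\$?\s*([\d.,]+)", re.I),
        }

        def clean_text(text):
            """Text clean-up: drop ruled-line artifacts, collapse whitespace."""
            text = re.sub(r"[|_]{2,}", " ", text)
            return re.sub(r"\s+", " ", text).strip()

        def extract_fields(blocks):
            """blocks: list of {"text": str, "bbox": (x0, y0, x1, y1)} dicts.
            Returns whichever fields matched, in key-value format."""
            results = {}
            for block in blocks:
                text = clean_text(block["text"])
                for field, pattern in FIELD_PATTERNS.items():
                    match = pattern.search(text)
                    if match and field not in results:
                        results[field] = match.group(1)
            return results

        blocks = [
            {"text": "Invoice No: INV-2041", "bbox": (40, 30, 310, 55)},
            {"text": "Date: 12/03/2021", "bbox": (40, 60, 250, 85)},
            {"text": "Grand Total: $1,284.50", "bbox": (40, 700, 330, 725)},
        ]
        print(extract_fields(blocks))
        # {'invoice_number': 'INV-2041', 'date': '12/03/2021', 'total': '1,284.50'}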

    Implementation of an information retrieval system within a central knowledge management system

    Numbered pages: I-XIII, 14-126. Internship carried out at Wipro Portugal SA, supervised by Eng. Hugo Neto. Integrated master's thesis. Informatics and Computing Engineering. Faculty of Engineering, University of Porto. 201

    Matching XML schemas by a new tree matching algorithm.

    Schema matching is one of the key operations in XML-based information integration and exchange applications. Automatic or semi-automatic matching of XML schemas has attracted considerable attention in academia and industry due to the extensive adoption of XML techniques. One of the most difficult tasks in this problem is identifying the structural relations between two XML schemas. This thesis builds an automatic XML schema matching system that generates two types of output: element mappings and schema similarity. In this system, an XML schema is modeled as a tree. We propose a new tree matching algorithm that computes the structural relation by extracting the most similar common substructures. The algorithm is designed to achieve a trade-off between matching optimality and time complexity. Experimental results show that the system can be used to match large XML schemas.

    Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .W366. Source: Masters Abstracts International, Volume: 43-05, page: 1758. Adviser: Jianguo Lu. Thesis (M.Sc.)--University of Windsor (Canada), 2004
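    To make the notion of a structural relation concrete, here is a minimal Python sketch of bottom-up similarity between two schema trees, using greedy best-pair matching of children. This is an illustration only, not the thesis's common-substructure algorithm, and the node labels are invented.

        from dataclasses import dataclass, field

        @dataclass
        class Node:
            """A schema element modeled as a labeled tree node."""
            label: str
            children: list = field(default_factory=list)

        def similarity(a, b):
            """Score in [0, 1]: label agreement blended with a greedy
            best-pair matching of the two nodes' children."""
            label_score = 1.0 if a.label.lower() == b.label.lower() else 0.0
            if not a.children and not b.children:
                return label_score
            remaining = list(b.children)
            child_total = 0.0
            for ca in a.children:
                if not remaining:
                    break
                best = max(remaining, key=lambda cb: similarity(ca, cb))
                child_total += similarity(ca, best)
                remaining.remove(best)
            child_score = child_total / max(len(a.children), len(b.children))
            return 0.5 * label_score + 0.5 * child_score

        s1 = Node("book", [Node("title"), Node("author", [Node("name")])])
        s2 = Node("book", [Node("title"), Node("writer", [Node("name")])])
        print(f"{similarity(s1, s2):.2f}")  # 0.88: structures largely agree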

    Improving e-commerce product recommendation using semantic context and sequential historical purchases

    Collaborative Filtering (CF)-based recommendation methods suffer from (i) sparsity (few user–item interactions) and (ii) cold start (an item cannot be recommended if no ratings exist). Systems that combine clustering and pattern mining (frequent and sequential) with similarity measures between clicks and purchases for next-item recommendation cannot perform well when the matrix is sparse, owing to the rapid increase in the number of items. Additionally, they suffer from (i) a lack of personalization, since patterns are not targeted at a specific customer, and (ii) a lack of semantics among recommended items, since they can only recommend items that match a rule generated from frequent sequential purchase patterns. To better understand users' preferences and to infer the inherent meaning of items, this paper proposes a method that explores semantic associations between items, obtained by utilizing item (product) metadata such as title, description, and brand based on their semantic context (co-purchased and co-reviewed products). The semantics of these interactions are obtained through the distributional hypothesis, which learns an item's representation by analyzing the context (neighborhood) in which it is used. The idea is that items co-occurring in a context are likely to be semantically similar to each other (e.g., items in a user's purchase sequence). The semantics are then integrated into different phases of the recommendation process: (i) preprocessing, to learn associations between items; (ii) candidate generation, while mining sequential patterns and, in collaborative filtering, to select top-N neighbors; and (iii) output (recommendation). Experiments performed on a publicly available e-commerce data set show that the proposed model performs well and reflects user preferences by recommending semantically similar and sequential products.
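    A minimal sketch of the distributional-hypothesis step, assuming the gensim library is available: each user's purchase sequence is treated as a "sentence" so that items co-occurring in the same context learn similar embeddings (item2vec-style). The toy sequences are invented, and the paper additionally folds product metadata (title, description, brand) into the representations.

        from gensim.models import Word2Vec

        # Each inner list is one user's purchase sequence (the "context").
        purchase_sequences = [
            ["laptop", "laptop_sleeve", "wireless_mouse"],
            ["laptop", "usb_hub", "wireless_mouse"],
            ["phone", "phone_case", "screen_protector"],
            ["phone", "wireless_charger", "phone_case"],
        ]

        model = Word2Vec(
            sentences=purchase_sequences,
            vector_size=32,   # small embeddings for a toy corpus
            window=2,         # context window: neighboring items
            min_count=1,
            sg=1,             # skip-gram, the usual item2vec choice
            epochs=200,
            seed=42,
        )

        # Nearest neighbors by co-purchase context; these can seed candidate
        # generation or help select top-N neighbors in CF.
        print(model.wv.most_similar("phone_case", topn=3))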

    ICSEA 2022: the seventeenth international conference on software engineering advances

    The Seventeenth International Conference on Software Engineering Advances (ICSEA 2022), held between October 16th and October 20th, 2022, continued a series of events covering a broad spectrum of software-related topics. The conference covered fundamentals of designing, implementing, testing, validating, and maintaining various kinds of software. Several tracks were proposed to treat the topics from theory to practice, in terms of methodologies, design, implementation, testing, use cases, tools, and lessons learned. The conference topics covered classical and advanced methodologies, open source, agile software, software deployment, and software economics and education. Other advanced aspects relate to practical run-time concerns, such as run-time vulnerability checking, rejuvenation processes, updates, partial or temporary feature deprecation, software deployment and configuration, and online software updates. These aspects trigger implications related to patenting, licensing, engineering education, new ways for software adoption and improvement, and, ultimately, software knowledge management.

    Many advanced applications require robust, safe, and secure software: disaster recovery applications, vehicular systems, biomedical software, biometrics software, mission-critical software, e-health software, and crisis-situation software. These applications require appropriate software engineering techniques, metrics, and formalisms, such as software reuse, appropriate software quality metrics, composition and integration, consistency checking, model checking, provers, and reasoning.

    The nature of research in software varies slightly with the specific discipline researchers work in, yet there is much common ground and room for sharing best practice, frameworks, tools, languages, and methodologies. Despite the number of experts available, little work is done at the meta level, that is, examining how we go about our research and how this process can be improved. There are questions related to the choice of programming language, IDEs, and documentation styles and standards. Reuse can be of great benefit to research projects, yet reuse of prior research projects introduces special problems that need to be mitigated. The research environment is a mix of creativity and systematic approach, which leads to a creative tension that needs to be managed or at least monitored. Much of the coding in any university is undertaken by research students or young researchers, so issues of skills training, development, and quality control can have significant effects on an entire department. In an industrial research setting, the environment is not quite that of industry as a whole, nor does it follow the pattern set by the university; the unique approaches and issues of industrial research may hold lessons for researchers in other domains.

    We take this opportunity to warmly thank all the members of the ICSEA 2022 technical program committee, as well as all the reviewers. The creation of such a high-quality conference program would not have been possible without their involvement. We also kindly thank all the authors who dedicated much of their time and effort to contribute to ICSEA 2022. We truly believe that, thanks to all these efforts, the final conference program consisted of top-quality contributions. We also thank the members of the ICSEA 2022 organizing committee for their help in handling the logistics of this event. We hope that ICSEA 2022 was a successful international forum for the exchange of ideas and results between academia and industry and for the promotion of progress in software engineering advances.

    Analysis and detection of security vulnerabilities in contemporary software

    Contemporary application systems are implemented using an assortment of high-level programming languages, software frameworks, and third-party components. While this may help to lower development time and cost, the result is a complex system of interoperating parts whose behavior is difficult to fully and properly comprehend. This difficulty of comprehension often manifests itself in the form of program coding errors that are not directly related to security requirements but can have an impact on the security of the system. The thesis of this dissertation is that many security vulnerabilities in contemporary software may be attributed to unintended behavior due to unexpected execution paths resulting from the accidental misuse of software components. Unlike typical programmer errors such as missed boundary checks or missing user input validation, these software bugs are not easy to detect and avoid. While typical secure coding best practices, such as code reviews and dynamic and static analysis, offer little protection against such vulnerabilities, we argue that runtime verification of software execution against a specified expected behavior can help to identify unexpected behavior in the software.

    The dissertation explores how building software systems from components may lead to the emergence of unexpected software behavior that results in security vulnerabilities. The thesis is supported by a study of the evolution of a popular software product over a period of twelve years. While anomaly detection techniques could be applied to verify software execution at runtime, there are several practical challenges to using them in large-scale contemporary software. A model of expected application execution paths, and a methodology for building it during the software development cycle, are proposed. The dissertation explores the model's effectiveness in detecting exploits of vulnerabilities enabled by software errors in a popular enterprise software product.
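    As a toy illustration of checking software execution against a model of expected execution paths, the following Python sketch learns the set of observed call transitions from training traces and flags transitions never seen before. The function names and traces are hypothetical, and the dissertation's model and methodology are considerably richer than this bigram check.

        from itertools import pairwise  # Python 3.10+

        def build_model(training_traces):
            """Learn the set of expected (caller -> callee) transitions."""
            allowed = set()
            for trace in training_traces:
                allowed.update(pairwise(trace))
            return allowed

        def check_trace(model, trace):
            """Return the transitions in `trace` the model never observed."""
            return [t for t in pairwise(trace) if t not in model]

        training = [
            ["parse_request", "authenticate", "load_profile", "render"],
            ["parse_request", "authenticate", "render_error"],
        ]
        model = build_model(training)

        # An unexpected path that skips authentication is reported.
        suspect = ["parse_request", "load_profile", "render"]
        print(check_trace(model, suspect))
        # [('parse_request', 'load_profile')]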

    Knowledge-based systems for knowledge management in enterprises: Workshop held at the 21st Annual German Conference on AI (KI-97)
