
    XTaGe: a flexible generation system for complex XML collections

    We introduce XTaGe (XML Tester and Generator), a system for the synthesis of XML collections meant for testing and micro-benchmarking applications. In contrast with existing approaches, XTaGe focuses on complex collections by providing a highly extensible framework for introducing controlled variability into XML structures. In this paper we present the theoretical foundation, internal architecture and main features of our generator; we describe its implementation, which includes a GUI to facilitate the specification of collections; we discuss how XTaGe's features compare with those in other XML generation systems; finally, we illustrate its usage by presenting a use case in the bioinformatics domain
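
    The core idea of controlled structural variability can be illustrated with a small sketch. This is not XTaGe's actual API; the template, the variation operator and the element names are all hypothetical, intended only to show how a generator can bound randomness in an XML structure:

```python
import copy
import random
import xml.etree.ElementTree as ET

def repeat_child(parent, tag, min_n, max_n, rng):
    """Variation operator: clone a template child a bounded random number of times."""
    template = parent.find(tag)
    parent.remove(template)
    for i in range(rng.randint(min_n, max_n)):
        clone = copy.deepcopy(template)
        clone.set("id", str(i))
        parent.append(clone)

def generate(rng):
    root = ET.Element("protein")
    ET.SubElement(root, "domain")             # template child to be varied
    repeat_child(root, "domain", 1, 5, rng)   # controlled variability: 1-5 domains
    return ET.tostring(root, encoding="unicode")

print(generate(random.Random(42)))
```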

    OQAFMA Querying Agent for the Foundational Model of Anatomy: a Prototype for Providing Flexible and Efficient Access to Large Semantic Networks

    The development of large semantic networks, such as the UMLS, which are intended to support a variety of applications, requires a flexible and efficient query interface for the extraction of information. Using one of the source vocabularies of UMLS as a test bed, we have developed such a prototype query interface. We first identify common classes of queries needed by applications that access these semantic networks. Next, we survey STRUQL, an existing query language that we adopted, which supports all of these classes of queries. We then describe the OQAFMA Querying Agent for the Foundational Model of Anatomy (OQAFMA), which provides an efficient implementation of a subset of STRUQL by pre-computing a variety of indices. We describe how OQAFMA leverages database optimization by converting STRUQL queries to SQL. We evaluate the flexibility and efficiency of our implementation using English queries written by anatomists. This evaluation verifies that OQAFMA provides flexible, efficient access to one such large semantic network, the Foundational Model of Anatomy, and suggests that OQAFMA could be an efficient query interface to other large biomedical knowledge bases, such as the Unified Medical Language System
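
    The translation of path queries over a semantic network into SQL can be sketched with a recursive common table expression. This is a minimal illustration assuming a simple edges(parent, child, relation) table and toy anatomy data; OQAFMA's actual translation and its precomputed path indices are considerably more elaborate:

```python
import sqlite3

# Sketch: a STRUQL-style transitive query ("all parts of the Heart")
# expressed as SQL over an edge table, evaluated in-memory with SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE edges(parent TEXT, child TEXT, relation TEXT);
INSERT INTO edges VALUES
  ('Heart', 'Left ventricle', 'part_of'),
  ('Left ventricle', 'Mitral valve', 'part_of');
""")

rows = conn.execute("""
WITH RECURSIVE parts(name) AS (
  SELECT child FROM edges WHERE parent = 'Heart' AND relation = 'part_of'
  UNION
  SELECT e.child FROM edges e JOIN parts p ON e.parent = p.name
  WHERE e.relation = 'part_of'
)
SELECT name FROM parts;
""").fetchall()

print([r[0] for r in rows])   # transitive closure of part_of from 'Heart'
```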

    Advanced document data extraction techniques to improve supply chain performance

    In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies.
    The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN comprises a generator network built with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes. For text extraction from the bounding boxes, a novel data extraction framework was designed, consisting of various processes: XML processing where an OCR engine already exists, bounding box pre-processing, text clean-up, OCR error correction, spell check, type check, pattern-based matching, and finally a learning mechanism for automating future data extraction. Fields the system extracts successfully are provided in key-value format.
    The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks, and a rule-based engine is then used to extract relevant data. While the system's methodology is robust, the companies surveyed were not satisfied with its accuracy and thus sought out new, optimised solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine; the new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing its clients' procurement costs. This data was fed into our system to obtain a deeper level of spend classification and categorisation. This helped the company to reduce its reliance on human effort and allowed for greater efficiency compared with performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools.
    The intention behind the development of this novel methodology was twofold: first, to test and develop a novel solution that does not depend on any specific OCR technology; second, to increase the information extraction accuracy factor over that of existing methodologies. Finally, the thesis evaluates the real-world need for the system and the impact it would have on SCM. The newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimising SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information
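
    The post-OCR stages the abstract lists (text clean-up, OCR error correction, pattern-based matching, key-value output) can be sketched as a small pipeline. The field names, patterns and OCR confusion table below are hypothetical stand-ins, not the thesis's actual rules:

```python
import re

# Hypothetical sketch of the post-OCR extraction stages: clean-up,
# OCR error correction, pattern-based matching, key-value output.
OCR_FIXES = {"O": "0", "l": "1"}          # common OCR confusions in numeric fields
PATTERNS = {
    # character classes tolerate OCR confusions ('O', 'l') so matching succeeds
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*([\w-]+)", re.I),
    "total": re.compile(r"Total\s*:?\s*([\dOl.,]+)", re.I),
}

def clean_numeric(text):
    """OCR error correction for digit fields (e.g. 'O' -> '0')."""
    return "".join(OCR_FIXES.get(c, c) for c in text)

def extract(block_text):
    """Pattern-based matching over a text block; returns key-value pairs."""
    fields = {}
    for key, pat in PATTERNS.items():
        m = pat.search(block_text)
        if m:
            value = m.group(1)
            if key == "total":
                value = clean_numeric(value)
            fields[key] = value
    return fields

print(extract("Invoice No: INV-2031  Total: 1,45O.00"))
# {'invoice_number': 'INV-2031', 'total': '1,450.00'}
```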

    New Path Based Index Structure for Processing CAS Queries over XML Database

    Querying nested data has become one of the most challenging issues in retrieving desired information from the Web. Today, diverse applications generate a tremendous amount of data in different formats, and the data and information exchanged on the Web are commonly expressed in nested representations such as XML, JSON, etc. Unlike in traditional database systems, they don't have a rigid schema. In general, nested data is managed by storing the data and its structure separately, which significantly reduces the performance of data retrieval. Ensuring the efficiency of queries that locate the exact positions of elements has become a major challenge. Different index structures have been proposed in the literature to improve the performance of query processing on nested structures, but most past research concentrates on the structure alone. This paper proposes a new index structure that combines siblings of terminal nodes into one path, which efficiently processes twig queries with fewer lookups and joins. The proposed approach is compared with some existing approaches, and the results show that queries are processed with better performance than with the existing ones
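
    The idea of grouping sibling leaves of a terminal node under one path key can be sketched in a few lines. This is an illustrative toy, not the paper's index: the point is that a twig query such as //book[year='2013']/title is answered with a single path lookup instead of separate lookups plus a structural join:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

doc = ET.fromstring(
    "<lib>"
    "<book><title>XML</title><year>2004</year></book>"
    "<book><title>JSON</title><year>2013</year></book>"
    "</lib>"
)

index = defaultdict(list)

def build(elem, path):
    # store sibling leaf values of one terminal node together on a single path key
    leaves = {c.tag: c.text for c in elem if len(c) == 0}
    if leaves:
        index[path].append(leaves)
    for child in elem:
        if len(child):
            build(child, path + "/" + child.tag)

build(doc, doc.tag)
# twig query //book[year='2013']/title: one lookup, no structural join
print([r["title"] for r in index["lib/book"] if r.get("year") == "2013"])
```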

    Geographical queries reformulation using a parallel association rules generator to build spatial taxonomies

    Geographical queries need a special reformulation process in information retrieval systems (IRS) due to their specificities and hierarchical structure, a fact ignored by most web search engines. In this paper, we propose an automatic approach for building a spatial taxonomy that models the notion of adjacency, to be used in the reformulation of the spatial part of a geographical query. This approach exploits the documents at the top of the retrieved list when a spatial entity, composed of a spatial relation and the name of a city, is submitted. A transactional database is then constructed, considering each extracted document as a transaction containing the names of the cities that share the country of the submitted query's city. The frequent pattern growth algorithm (FP-growth) is applied to this database in its parallel version (parallel FP-growth: PFP) to generate association rules, which form the country's taxonomy in a Big Data context. Experiments have been conducted on Spark, and the results show that query reformulation using the taxonomy built with our approach improves the precision and effectiveness of the IRS
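
    Spark's MLlib ships a parallel FP-growth implementation, which is the kind of PFP run the paper describes. A minimal sketch follows; the transactions (city names co-occurring in a retrieved document) are illustrative, not the paper's data:

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("spatial-taxonomy").getOrCreate()

# Each transaction: city names found in one retrieved document
transactions = spark.createDataFrame([
    (0, ["Rabat", "Casablanca", "Sale"]),
    (1, ["Rabat", "Sale"]),
    (2, ["Casablanca", "Rabat"]),
], ["id", "items"])

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(transactions)

# Association rules such as Rabat => Sale become adjacency links in the taxonomy
model.associationRules.show()
```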

    Automaton Meets Algebra: A Hybrid Paradigm for Efficiently Processing XQuery over XML Stream

    XML stream applications bring the challenge of efficiently processing queries on sequentially accessible, token-based data streams. The automaton paradigm is naturally suited for pattern retrieval on tokenized XML streams, but requires patches to implement the filtering or restructuring functionalities common in XML query languages. In contrast, the algebraic paradigm is well established for processing self-contained tuples, but does not traditionally support token inputs. This dissertation proposes a framework called Raindrop, which accommodates both the automaton and algebra paradigms to take advantage of both. First, we propose an architecture for Raindrop. Raindrop is an algebra framework that models queries at different abstraction levels. We represent the token-based automaton computations as an algebraic subplan at the high level while exposing the automaton details at the low level. The algebraic subplan modeling automaton computations can thus be integrated with the algebraic subplan modeling the non-automaton computations. Second, we explore a novel optimization opportunity. Other XML stream processing systems always retrieve all the patterns in a query in the automaton. In contrast, Raindrop allows a plan to perform some of the pattern retrieval in the automaton and some outside of it. This opens up an automaton-in-or-out optimization opportunity. We study this optimization in two types of run-time environments, one with stable data characteristics and one with fluctuating data characteristics, and provide search strategies catering to each environment. We also describe how to migrate from a currently running plan to a new plan at run-time. Third, we optimize the automaton computations using schema knowledge. A set of criteria is established to decide which schema constraints are useful to a given query. Optimization rules utilizing different types of schema constraints are proposed based on these criteria. We design a rule application algorithm which ensures both completeness (i.e., no optimization is missed) and minimality (i.e., no redundant optimization is introduced). Experiments on both real and synthetic data illustrate that these techniques bring significant performance improvement with little overhead
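
    The automaton side of the paradigm, pattern retrieval over a tokenized stream, can be sketched with a stack-based matcher. This toy recognizes the linear path /book/title over SAX-style tokens; Raindrop's automaton/algebra integration is far richer than this:

```python
# SAX-style token stream for <book><title>XML Streams</title></book>
tokens = [
    ("start", "book"), ("start", "title"), ("text", "XML Streams"),
    ("end", "title"), ("end", "book"),
]

PATTERN = ["book", "title"]          # linear path pattern to retrieve

stack, matches = [], []
for kind, value in tokens:
    if kind == "start":
        stack.append(value)          # automaton transition on open tag
    elif kind == "end":
        stack.pop()                  # transition back on close tag
    elif kind == "text" and stack == PATTERN:
        matches.append(value)        # token satisfies /book/title

print(matches)                       # ['XML Streams']
```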

    Automated integration of public transport timetable information

    The ever-growing Web contains a large amount of data. This data is useful when combined with applications that can refine it and use it to improve its users' lives. However, using the available data is not an easy task, since most of the information is not represented in machine-friendly formats. Instead, it is represented in formats ideal for human users, resulting in additional effort to have machines interpret, extract, and integrate it, while at the same time ensuring the consistency of information from different sources. In this project, a solution combining ontology-based integration with extraction by web robots automates the process of updating information on public transport schedules. An existing application receives that information and uses it to calculate efficient routes for commuters. The proposed solution can extract information from multiple online sources, including PDFs and HTML, and transform it into different formats. The system provides a web service through which a route optimization system obtains these formats. This document details the design and construction of the integration system, describes the alternatives and choices that led to the resulting application, and finally evaluates the solution by performing extraction from several sources relevant to the project's domain
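
    The web-robot extraction step can be sketched as scraping an HTML timetable into a neutral structure that the integration layer and the route optimizer can consume. The markup, field names and line identifier below are illustrative, not taken from a real operator's page or from the project's schema:

```python
from bs4 import BeautifulSoup

html = """
<table id="timetable">
  <tr><th>Stop</th><th>Departure</th></tr>
  <tr><td>Estacao Central</td><td>08:15</td></tr>
  <tr><td>Hospital</td><td>08:22</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for row in soup.select("#timetable tr")[1:]:       # skip header row
    stop, departure = (td.get_text(strip=True) for td in row.find_all("td"))
    records.append({"stop": stop, "departure": departure, "line": "L1"})

print(records)   # neutral format, ready for export through the web service
```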

    A process model in platform independent and neutral formal representation for design engineering automation

    An engineering design process, as part of product development (PD), needs to satisfy ever-changing customer demands by striking a balance between time, cost and quality. In order to achieve a faster lead-time, improved quality and reduced PD costs for increased profits, automation methods have been developed with the help of virtual engineering. There are various methods of achieving Design Engineering Automation (DEA) with Computer-Aided (CAx) tools such as CAD/CAE/CAM, Product Lifecycle Management (PLM) and Knowledge Based Engineering (KBE). For example, Computer Aided Design (CAD) tools enable Geometry Automation (GA), while PLM systems allow for sharing and exchange of product knowledge throughout the PD lifecycle. Traditional automation methods are specific to individual products, hard-coded, and bound by proprietary tool formats. Moreover, existing CAx tools and PLM systems offer bespoke islands of automation compared to KBE. KBE as a design method incorporates complete design intent by including re-usable geometric and non-geometric product knowledge as well as engineering process knowledge for DEA, covering processes such as mechanical design, analysis and manufacturing. It has been recognised, through an extensive literature review, that a research gap exists in the form of a generic and structured method for knowledge modelling, both informal and formal, of the mechanical design process with manufacturing knowledge (DFM/DFA) as part of model-based systems engineering (MBSE) for DEA with a KBE approach. There is a lack of a structured technique for knowledge modelling that can provide a standardised method to use platform-independent and neutral formal standards for DEA with generative modelling for the mechanical product design process and DFM with preserved semantics. A neutral formal representation in a computer- or machine-understandable format provides open standard usage. This thesis contributes to knowledge by addressing this gap in two steps:
    • In the first step, a coherent process model, GPM-DEA, is developed as part of MBSE, which can be used for modelling mechanical design with manufacturing knowledge, utilising a hybrid approach based on the strengths of existing modelling standards such as IDEF0, UML and SysML, with additional constructs as per the author's metamodel. The structured process model is highly granular, with complex interdependencies such as activity, object, function and rule associations, and captures the effect of the process model on the product at both component and geometric attribute levels.
    • In the second step, a method is provided to map the schema of the process model to equivalent platform-independent and neutral formal standards using an OWL/SWRL ontology, developed with the Protégé tool, enabling machine interpretability with semantic clarity for DEA with generative modelling by building queries and reasoning on a set of generic SWRL functions developed by the author.
    Model development has been performed with the aid of literature analysis and pilot use cases. Experimental verification with test use cases has confirmed the reasoning and querying capability on formal axioms in generating accurate results. Other key strengths are that the knowledge base is generic, scalable and extensible, and hence provides re-usability and wider design space exploration. The generative modelling capability allows the model to generate activities and objects based on the functional requirements of the mechanical design process with DFM/DFA and rules based on logic. With the help of an application programming interface, a platform-specific DEA system, such as a KBE tool or a CAD tool enabling GA, or a web page incorporating engineering knowledge for decision support, can consume the relevant parts of the knowledge base
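
    The flavour of the second step, an OWL ontology with a SWRL rule that a reasoner applies, can be sketched with owlready2 instead of Protégé. The class names (Feature, Hole, DrilledFeature) and the rule are illustrative stand-ins, not the thesis's GPM-DEA schema or its SWRL functions:

```python
from owlready2 import get_ontology, Thing, Imp, sync_reasoner_pellet

onto = get_ontology("http://example.org/gpm-dea.owl")

with onto:
    class Feature(Thing): pass
    class Hole(Feature): pass
    class DrilledFeature(Feature): pass   # membership inferred by the rule

    # DFM-style rule: every hole feature implies a drilling requirement
    rule = Imp()
    rule.set_as_rule("Hole(?f) -> DrilledFeature(?f)")

    h = Hole("boss_hole_1")

# Reasoning applies the SWRL rule (requires Java and the bundled Pellet reasoner)
sync_reasoner_pellet(infer_property_values=True)
print(h.is_a)   # boss_hole_1 is now also classified as a DrilledFeature
```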

    Perspectives on Languages for Specifying Simulation Experiments

    While domain-specific languages are well established for describing the system of interest in modeling and simulation, recent years have increasingly seen domain-specific languages exploited for specifying experiments as well. This development, whose application areas range from computational biology to network simulation, is motivated by the desire to facilitate the reproducibility of simulation results. The experimentation process is thereby treated as a first-class object of simulation studies. As the experimentation process contains different tasks, such as configuration, observation, analysis, and evaluation, domain-specific languages can be exploited to specify experiments as well as individual sub-tasks or even the goal of the experiment, thus opening up new avenues of research. Our discussion focuses on what information to express, drawing on existing approaches. Regarding how to express the required information, we sketch some of the pros and cons of external and embedded domain-specific languages
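
    An embedded experiment-specification DSL can be sketched by letting a host language carry the configuration, observation and replication settings as one declarative object. The model name, parameters and run function below are hypothetical, meant only to show the embedded style the paper contrasts with external DSLs:

```python
from dataclasses import dataclass, field

@dataclass
class Experiment:
    """Declarative experiment specification: configuration plus observation."""
    model: str
    stop_time: float
    parameters: dict = field(default_factory=dict)
    observe: list = field(default_factory=list)
    replications: int = 1

exp = Experiment(
    model="predator_prey.xml",
    stop_time=100.0,
    parameters={"growth_rate": 0.3, "predation": 0.02},
    observe=["prey_population", "predator_population"],
    replications=10,
)

def run(experiment):
    # stand-in for handing the specification to a simulator back end
    print(f"Running {experiment.replications} replications of {experiment.model}")

run(exp)
```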