64,662 research outputs found

    Layout Inference and Table Detection in Spreadsheet Documents

    Spreadsheets have found wide use in many different domains and settings. They provide a broad range of both basic and advanced functionalities and can thereby support data collection, transformation, analysis, and reporting. At the same time, spreadsheets maintain a friendly and intuitive interface, and they come at little to no cost. Well-known spreadsheet applications, such as OpenOffice, LibreOffice, Google Sheets, and Gnumeric, are free to use, while Microsoft Excel is widely available, with millions of users worldwide. Thus, spreadsheets are not only powerful tools, but also have a very low entrance barrier, and they have become very popular with novices and professionals alike. As a result, a large volume of valuable data resides in these documents. Of particular interest are data in tabular form, since they provide concise, factual, and to a large extent structured information. One natural progression is to transfer tabular data from spreadsheets to databases. This would allow spreadsheets to become a direct source of data for existing or new business processes, making it easier to ingest them into data warehouses and to integrate them with other sources. Besides databases, there are other means to work with spreadsheet data: new paradigms, like NoDB, advocate querying directly from raw documents, and going one step further, spreadsheets together with other raw documents can be stored in a sophisticated centralized repository, i.e., a data lake, from which they can serve various tasks and applications on demand. All in all, by making spreadsheet data easily accessible, we can prevent information silos, i.e., valuable knowledge being isolated and scattered across multiple spreadsheet documents.

    Yet, there are considerable challenges to the automatic processing and understanding of these documents. After all, spreadsheets are designed primarily for human consumption, and as such, they favor customization and visual comprehension. Data are often intermingled with formatting, formulas, layout artifacts, and textual metadata, which carry domain-specific or even user-specific information (i.e., personal preferences). Multiple tables, with different layout and structure, can be found on the same sheet. Most importantly, the structure of the tables is not known, i.e., not explicitly given by the spreadsheet documents. Altogether, spreadsheets are better described as partially structured, with a significant degree of implicit information. In the literature, the automatic understanding of spreadsheet data has been only scarcely investigated, often under the assumption of a single uniform table layout. However, due to the manifold possibilities of structuring tabular data in spreadsheets, the assumption of a uniform layout either excludes a substantial number of tables from the extraction process or leads to inaccurate results.

    In this thesis, we primarily address two fundamental tasks that can lead to more accurate information extraction from spreadsheet documents: we propose intuitive and effective approaches for layout analysis and table detection in spreadsheets. Our overall solution is designed as a processing pipeline, in which specialized steps build on top of each other to discover the tabular data. One of our main objectives is to eliminate most of the assumptions made in related work. Instead, we target highly diverse sheet layouts, with one or multiple tables, and we account for the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we make use of sophisticated machine learning and optimization techniques. This brings flexibility to our approach, allowing it to work even with complex or malformed tables. Moreover, this intended flexibility makes our approaches transferable to new spreadsheet datasets; we are thus not bound to specific domains or settings.

    Table of contents:
    1 INTRODUCTION (1.1 Motivation; 1.2 Contributions; 1.3 Outline)
    2 FOUNDATIONS AND RELATED WORK (2.1 The Evolution of Spreadsheet Documents; 2.1.1 Spreadsheet User Interface and Functionalities; 2.1.2 Spreadsheet File Formats; 2.1.3 Spreadsheets Are Partially-Structured; 2.2 Analysis and Recognition in Electronic Documents; 2.2.1 A General Overview of DAR; 2.2.2 DAR in Spreadsheets; 2.3 Spreadsheet Research Areas; 2.3.1 Layout Inference and Table Recognition; 2.3.2 Unifying Databases and Spreadsheets; 2.3.3 Spreadsheet Software Engineering; 2.3.4 Data Wrangling Approaches)
    3 AN EMPIRICAL STUDY OF SPREADSHEET DOCUMENTS (3.1 Available Corpora; 3.2 Creating a Gold Standard Dataset; 3.2.1 Initial Selection; 3.2.2 Annotation Methodology; 3.3 Dataset Analysis; 3.3.1 Takeaways from Business Spreadsheets; 3.3.2 Comparison Between Domains; 3.4 Summary and Discussion; 3.4.1 Datasets for Experimental Evaluation; 3.4.2 A Processing Pipeline)
    4 LAYOUT ANALYSIS (4.1 A Method for Layout Analysis in Spreadsheets; 4.2 Feature Extraction; 4.2.1 Content Features; 4.2.2 Style Features; 4.2.3 Font Features; 4.2.4 Formula and Reference Features; 4.2.5 Spatial Features; 4.2.6 Geometrical Features; 4.3 Cell Classification; 4.3.1 Classification Datasets; 4.3.2 Classifiers and Assessment Methods; 4.3.3 Optimum Under-Sampling; 4.3.4 Feature Selection; 4.3.5 Parameter Tuning; 4.3.6 Classification Evaluation; 4.4 Layout Regions; 4.5 Summary and Discussions)
    5 CLASSIFICATION POST-PROCESSING (5.1 Dataset for Post-Processing; 5.2 Pattern-Based Revisions; 5.2.1 Misclassification Patterns; 5.2.2 Relabeling Cells; 5.2.3 Evaluating the Patterns; 5.3 Region-Based Revisions; 5.3.1 Standardization Procedure; 5.3.2 Extracting Features from Regions; 5.3.3 Identifying Misclassified Regions; 5.3.4 Relabeling Misclassified Regions; 5.4 Summary and Discussion)
    6 TABLE DETECTION (6.1 A Method for Table Detection in Spreadsheets; 6.2 Preliminaries; 6.2.1 Introducing a Graph Model; 6.2.2 Graph Partitioning for Table Detection; 6.2.3 Pre-Processing for Table Detection; 6.3 Rule-Based Detection; 6.3.1 Remove and Conquer; 6.4 Genetic-Based Detection; 6.4.1 Undirected Graph; 6.4.2 Header Cluster; 6.4.3 Quality Metrics; 6.4.4 Objective Function; 6.4.5 Weight Tuning; 6.4.6 Genetic Search; 6.5 Experimental Evaluation; 6.5.1 Testing Datasets; 6.5.2 Training Datasets; 6.5.3 Tuning Rounds; 6.5.4 Search and Assessment; 6.5.5 Evaluation Results; 6.6 Summary and Discussions)
    7 XLINDY: A RESEARCH PROTOTYPE (7.1 Interface and Functionalities; 7.1.1 Front-end Walkthrough; 7.2 Implementation Details; 7.2.1 Interoperability; 7.2.2 Efficient Reads; 7.3 Information Extraction; 7.4 Summary and Discussions)
    8 CONCLUSION (8.1 Summary of Contributions; 8.2 Directions of Future Work)
    BIBLIOGRAPHY; LIST OF FIGURES; LIST OF TABLES
    A ANALYSIS OF REDUCED SAMPLES
    B TABLE DETECTION WITH TIRS (B.1 Tables in TIRS; B.2 Pairing Fences with Data Regions; B.3 Heuristics Framework)
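
    The cell-classification step at the heart of the layout analysis can be pictured with a small sketch: each cell is mapped to a feature vector (content, style, font, formula, and spatial signals) and a supervised classifier assigns it a layout label. The snippet below is only an illustration of that idea, assuming openpyxl for reading the workbook, a RandomForestClassifier, and a handful of hypothetical features and labels; it is not the feature set or model used in the thesis.

        from openpyxl import load_workbook
        from sklearn.ensemble import RandomForestClassifier

        def cell_features(cell):
            """Map one worksheet cell to a small numeric feature vector (illustrative features)."""
            value = cell.value
            return [
                int(value is None),                                      # empty cell
                int(isinstance(value, str)),                             # textual content
                int(isinstance(value, (int, float))),                    # numeric content
                int(isinstance(value, str) and value.startswith("=")),   # formula cell
                int(bool(cell.font.bold)),                               # bold font, often a header cue
                int(cell.border.bottom.style is not None),               # bottom border, often closes a header row
                cell.row,                                                # spatial position: row index
                cell.column,                                             # spatial position: column index
            ]

        def sheet_features(path):
            """Collect coordinates and feature vectors for every cell of the first sheet."""
            sheet = load_workbook(path, data_only=False).active
            coords, vectors = [], []
            for row in sheet.iter_rows():
                for cell in row:
                    coords.append(cell.coordinate)
                    vectors.append(cell_features(cell))
            return coords, vectors

        # A classifier trained on cells annotated with hypothetical layout labels
        # such as "header", "data", or "metadata" could then label unseen sheets:
        clf = RandomForestClassifier(n_estimators=100)
        # clf.fit(X_train, y_train)                    # annotated training cells
        # coords, X = sheet_features("report.xlsx")    # hypothetical workbook path
        # labels = clf.predict(X)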

    Automatic categorization of diverse experimental information in the bioscience literature

    Background: Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify, from all published literature, the papers that contain results for the specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest and is therefore usually time consuming. We developed an automatic method for identifying papers containing these curation data types among a large pool of published scientific papers, based on the machine learning method Support Vector Machine (SVM). This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in production use for automatic categorization of 10 different experimental data types in the biocuration process at WormBase for the past two years, and it is in the process of being adopted in the biocuration processes at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community and thereby greatly reduce the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach has great significance because, for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.

    Results: We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase, and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types, including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.

    Conclusions: Our method is applicable to a variety of data types with training sets containing several hundred to a few thousand documents. It is completely automatic and can thus be readily incorporated into the workflows of different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort
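
    The SVM classification described here can be pictured with a short sketch: papers are turned into term vectors and a linear SVM decides whether each paper is relevant to one curation data type. The snippet below is only an illustration of that idea, assuming a scikit-learn tf-idf pipeline with LinearSVC and toy placeholder texts; it is not the classifier actually deployed at WormBase, FlyBase, or SGD.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        # Toy placeholder corpus: full texts paired with a binary label,
        # 1 = paper reports the data type of interest, 0 = it does not.
        papers = [
            "RNAi knockdown of the gene produced a clear locomotion defect ...",
            "We review recent advances in microscopy hardware ...",
        ]
        labels = [1, 0]

        # Tf-idf features feeding a linear SVM; one such classifier per data type.
        classifier = make_pipeline(
            TfidfVectorizer(sublinear_tf=True, stop_words="english"),
            LinearSVC(C=1.0),
        )
        classifier.fit(papers, labels)

        # Newly published papers would be routed to curators only when flagged as relevant.
        print(classifier.predict(["The mutant allele sequence showed a premature stop codon ..."]))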

    Methodologies for the Automatic Location of Academic and Educational Texts on the Internet

    Traditionally, online databases of web resources have been compiled by a human editor or through the submissions of authors or interested parties. Considerable resources are needed to maintain a constant level of input and relevance in the face of increasing material quantity and quality, and much of what is in databases is of an ephemeral nature. These pressures dictate that many databases stagnate after an initial period of enthusiastic data entry. The solution to this problem would seem to be the automatic harvesting of resources; however, this process necessitates the automatic classification of resources as ‘appropriate’ to a given database, a problem only solved by complex text content analysis. This paper outlines the component methodologies necessary to construct such an automated harvesting system, including a number of novel approaches. In particular, it looks at the specific problems of automatically identifying academic research work and Higher Education pedagogic materials. Where appropriate, experimental data are presented from searches in the field of Geography as well as the Earth and Environmental Sciences. In addition, appropriate software is reviewed where it exists, and future directions are outlined
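
    As a rough illustration of the harvest-then-filter idea sketched in this abstract, the snippet below fetches a candidate page and applies a crude keyword heuristic to judge whether it looks like academic material. The cue words, threshold, and helper names are invented for this example; the paper's actual methodologies rely on considerably richer text content analysis.

        from urllib.request import urlopen

        # Cue words and threshold are illustrative assumptions, not taken from the paper.
        ACADEMIC_CUES = ("abstract", "references", "doi", "bibliography", "methodology")

        def looks_academic(url, min_hits=2):
            """Fetch a page and return True if it mentions enough academic cue words."""
            with urlopen(url) as response:
                text = response.read().decode("utf-8", errors="ignore").lower()
            hits = sum(cue in text for cue in ACADEMIC_CUES)
            return hits >= min_hits

        # Candidate URLs would normally come from a crawler or search-engine results:
        # for url in candidate_urls:
        #     if looks_academic(url):
        #         record_in_database(url)    # hypothetical persistence step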

    Extraction of Keyphrases from Text: Evaluation of Four Algorithms

    Get PDF
    This report presents an empirical evaluation of four algorithms for automatically extracting keywords and keyphrases from documents. The four algorithms are compared using five different collections of documents. For each document, we have a target set of keyphrases, which were generated by hand. The target keyphrases were generated for human readers; they were not tailored for any of the four keyphrase extraction algorithms. Each of the algorithms was evaluated by the degree to which the algorithm's keyphrases matched the manually generated keyphrases. The four algorithms were (1) the AutoSummarize feature in Microsoft's Word 97, (2) an algorithm based on Eric Brill's part-of-speech tagger, (3) the Summarize feature in Verity's Search 97, and (4) NRC's Extractor algorithm. For all five document collections, NRC's Extractor yields the best match with the manually generated keyphrases
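
    The evaluation described here boils down to comparing each algorithm's extracted keyphrases against the hand-made target set for every document. The snippet below sketches one way to score such a comparison with precision and recall, assuming exact case-insensitive matching and invented example phrases; the report's own matching criterion may be more permissive (e.g. stemmed matches).

        def precision_recall(extracted, target):
            """Score one extractor's keyphrases against the hand-made target set."""
            extracted = {phrase.lower().strip() for phrase in extracted}
            target = {phrase.lower().strip() for phrase in target}
            matches = len(extracted & target)
            precision = matches / len(extracted) if extracted else 0.0
            recall = matches / len(target) if target else 0.0
            return precision, recall

        # Hypothetical keyphrases for a single document:
        print(precision_recall(
            ["keyphrase extraction", "machine learning", "summarization"],
            ["keyphrase extraction", "evaluation", "machine learning"],
        ))  # -> (0.666..., 0.666...)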