
    Segmenting Tables via Indexing of Value Cells by Table Headers

    Correct segmentation of a web table into its component regions is the essential first step to understanding tabular data. Our algorithmic solution to the segmentation problem relies on the property that strings defining row and column header paths uniquely index each data cell in the table. We segment the table using only “logical layout analysis” without resorting to any appearance features or natural language understanding. We start with a CSV table that preserves the 2-dimensional structure and contents of the original source table (e.g., an HTML table) but not font size, font weight, and color. The indexing property of table headers implies a four-quadrant partitioning of the table about a minimum index point. The algorithm finds the index point through an efficient guided search. Experimental results on a 200-table benchmark demonstrate the generality of the algorithm in handling a variety of table styles and forms.

    A Focused Crawler in order to Get Semantic Web Resources (CSR)

    This paper presents a Focused Crawler in order to Get Semantic Web Resources (CSR). Structured web data are available in formats such as Extensible Markup Language (XML), Resource Description Framework (RDF), and Web Ontology Language (OWL) that can be used for processing. One of the main challenges of manually searching for and downloading semantic web resources is that the task consumes a lot of time. Our research work proposes a focused crawler that downloads these resources automatically and stores them on disk to build a collection that will be used for data processing. CSR consists of three layers: (a) the User Interface Layer, (b) the Focus Crawler Layer, and (c) the Base Crawler Layer. CSR uses the Shark-Search method as its selection policy. CSR was evaluated in two experiments. The first started on December 15, 2012 at 7:11 am and ended on December 16, 2012 at 4:01, obtaining 448,123,537 bytes of data. CSR stopped by itself after analyzing 80,4375 seeds with unlimited depth. It collected 16,576 semantic resource files, of which 89% were RDF, 10% were XML, and 1% were OWL. The second experiment was based on the Web Data Commons work of the Research Group Data and Web Science at the University of Mannheim and the Institute AIFB at the Karlsruhe Institute of Technology. It began at 4:46 am on June 2, 2013 and ended at 1:37 am on June 9, 2013. After 162.51 hours of execution, the result was 285,279 semantic resources, predominantly XML (99%), with OWL and RDF at 1% each.

    PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents


    Multi-Hypothesis Parsing of Tabular Data in Comma-Separated Values (CSV) Files

    Tabular data on the web comes in various formats and shapes. Preparing data for analysis and integration requires manual steps that go beyond simple parsing of the data. The preparation includes steps such as correctly configuring the parser, removing meaningless rows, casting data types, and reshaping the table structure. The goal of this thesis is the development of a robust and modular system that is able to automatically transform messy CSV data sources into a tidy tabular data structure. The highly diverse corpus of CSV files from the UK open data hub will serve as a basis for the evaluation of the system.
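
    The parser-configuration step mentioned in this abstract can be pictured as trying several candidate configurations and keeping the most plausible one. The sketch below is an illustration only, not the thesis's system: it assumes a hypothetical scoring rule (the share of rows that have the most common column count) and hypothetical helper names score_delimiter and best_delimiter, built on Python's standard csv module.

```python
import csv
import io
from collections import Counter


def score_delimiter(text, delimiter):
    """Score one parsing hypothesis: the fraction of non-empty rows whose
    column count equals the most common column count.

    Assumption: a hypothesis yielding a single column is scored 0.0,
    since it suggests the delimiter never split anything.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if not rows:
        return 0.0
    widths = Counter(len(row) for row in rows)
    modal_width, modal_count = widths.most_common(1)[0]
    if modal_width < 2:
        return 0.0
    return modal_count / len(rows)


def best_delimiter(text, delimiters=(",", ";", "\t", "|")):
    """Return the candidate delimiter with the highest consistency score."""
    return max(delimiters, key=lambda d: score_delimiter(text, d))


sample = "name;amount\nalice;1,5\nbob;2,25\n"
chosen = best_delimiter(sample)
```

    For the sample above, a comma splits only the two rows that contain decimal commas, so the semicolon hypothesis scores higher. A fuller system would also weigh quoting, encoding, and header detection.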

    Handbook of Document Image Processing and Recognition

    The Handbook of Document Image Processing and Recognition provides a consistent, comprehensive resource on the available methods and techniques in document image processing and recognition. It includes unified comparison and contrast analysis of algorithms in standard table formats. Thus, it educates readers to help them make informed decisions on their particular problems. The handbook is divided into several parts. Each part starts with an introduction written by the two editors. These introductions set the general framework for the main topic of each part and introduce the contribution of each chapter within the framework. The introductions are followed by several chapters written by established experts of the field. Each chapter provides the reader with a clear overview of the topic and of the state of the art in techniques used (including elements of comparison between them). Each chapter is structured in the same way: it starts with an introductory text, concludes with a summary of the main points addressed in the chapter, and ends with a comprehensive list of references. Whenever appropriate, the authors include specific sections describing and pointing to consolidated software and/or reference datasets. Numerous cross-references between the chapters ensure this is a truly integrated work, without unnecessary duplications and overlaps between chapters. This reference work is intended for use by a wide audience of readers from around the world, such as graduate students, researchers, librarians, lecturers, professionals, and many other people.

    Variations on a theme: patterns of congruence and divergence among 18th century chemical affinity theories

    The doctrine of affinity deserves to be recognised by historians of chemistry as the foundational basis of the discipline of chemistry as it was practiced in Britain during the 18th century. It attained this status through its crucial structural role in the pedagogy of the discipline. The importance of pedagogy and training in the practice of science is currently being reassessed by a number of historians, and my research contributes to this historiographical endeavour. My analysis of the variety of theories sheltered under the umbrella term ‘affinity theory’ has emphasised the role of pedagogy in influencing both the structure and the content of knowledge. I have shown that there were wide-ranging discrepancies between many of the components of individual affinity theories. Nevertheless, the scope of divergence was limited. This underlying organisation resulted from the unifying hub of affinity theory, the logical common ground. This was the essence of the doctrine of affinity, encompassing the law of affinity and the conceptualisation of the table that brought together the relations described in the law. The doctrine of affinity thus provided a disciplinary common ground between chemists, offering a mediating level of understanding and communication for all those who subscribed to it, in spite of their detailed differences.

    DSG: An End-to-End Document Structure Generator

    Information in industry, research, and the public sector is widely stored as rendered documents (e.g., PDF files, scans). Hence, to enable downstream tasks, systems are needed that map rendered documents onto a structured hierarchical format. However, existing systems for this task are limited by heuristics and are not end-to-end trainable. In this work, we introduce the Document Structure Generator (DSG), a novel system for document parsing that is fully end-to-end trainable. DSG combines a deep neural network for parsing (i) entities in documents (e.g., figures, text blocks, headers, etc.) and (ii) relations that capture the sequence and nested structure between entities. Unlike existing systems that rely on heuristics, our DSG is trained end-to-end, making it effective and flexible for real-world applications. We further contribute a new, large-scale dataset called E-Periodica comprising real-world magazines with complex document structures for evaluation. Our results demonstrate that our DSG outperforms commercial OCR tools and, on top of that, achieves state-of-the-art performance. To the best of our knowledge, our DSG system is the first end-to-end trainable system for hierarchical document parsing. Comment: Accepted at ICDM 202