468 research outputs found
Segmenting Tables via Indexing of Value Cells by Table Headers
Correct segmentation of a web table into its component regions is the essential first step to understanding tabular data. Our algorithmic solution to the segmentation problem relies on the property that strings defining row and column header paths uniquely index each data cell in the table. We segment the table using only “logical layout analysis” without resorting to any appearance features or natural language understanding. We start with a CSV table that preserves the 2-dimensional structure and contents of the original source table (e.g., an HTML table) but not font size, font weight, and color. The indexing property of table headers implies a four-quadrant partitioning of the table about a minimum index point. The algorithm finds the index point through an efficient guided search. Experimental results on a 200-table benchmark demonstrate the generality of the algorithm in handling a variety of table styles and forms.
A Focused Crawler in order to Get Semantic Web Resources (CSR)
This paper presents a Focused Crawler in order to Get Semantic Web Resources (CSR). Structured web data is available in formats such as Extensible Markup Language (XML), Resource Description Framework (RDF) and Web Ontology Language (OWL) that can be used for processing. One of the main challenges of manually searching for and downloading semantic web resources is that the task consumes a lot of time. Our research work proposes a focused crawler that downloads these resources automatically and stores them on disk, building a collection that will be used for data processing. CSR consists of three layers: (a) the User Interface Layer, (b) the Focus Crawler Layer and (c) the Base Crawler Layer. CSR uses the Shark-Search method as its selection policy. CSR was evaluated in two experiments. The first started on December 15, 2012 at 7:11 am and ended on December 16, 2012 at 4:01, obtaining 448,123,537 bytes of data. CSR stopped by itself after analyzing 80,4375 seeds with unlimited depth. It retrieved 16,576 semantic resource files, of which 89% were RDF, 10% were XML and 1% were OWL. The second experiment was based on the Web Data Commons work of the Research Group Data and Web Science at the University of Mannheim and the Institute AIFB at the Karlsruhe Institute of Technology. It began at 4:46 am on June 2, 2013 and ended at 1:37 am on June 9, 2013. After 162.51 hours of execution the result was 285,279 semantic resources, predominantly XML with 99%, with OWL and RDF at 1% each.
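As a rough illustration of a Shark-Search style selection policy, the sketch below scores outgoing links by combining a score inherited from the parent page with the relevance of the link's anchor text and surrounding context, then visits links best-first. It is a simplified reading of the general Shark-Search idea, not CSR's implementation: the bag-of-words `cosine` similarity, the function names, the parameter values `delta`, `beta` and `gamma`, and the example URLs are all assumptions.

```python
import heapq
import math
from collections import Counter


def cosine(a, b):
    """Bag-of-words cosine similarity (a stand-in relevance measure)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def potential_score(query, parent_text, parent_inherited, anchor, context,
                    delta=0.5, beta=0.8, gamma=0.5):
    """Score one outgoing link; all parameter values are illustrative."""
    parent_rel = cosine(query, parent_text)
    # A relevant parent passes on its own relevance, decayed by delta;
    # otherwise the link inherits the parent's decayed inherited score.
    inherited = delta * parent_rel if parent_rel > 0 else delta * parent_inherited
    anchor_score = cosine(query, anchor)
    # The context is only consulted when the anchor text itself is irrelevant.
    context_score = 1.0 if anchor_score > 0 else cosine(query, context)
    neighbourhood = beta * anchor_score + (1 - beta) * context_score
    return gamma * inherited + (1 - gamma) * neighbourhood


def best_first(scored_links):
    """Return URLs ordered from highest to lowest potential score."""
    heap = [(-score, url) for url, score in scored_links]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


query = "rdf ontology"
links = [
    ("http://example.org/contact",
     potential_score(query, "rdf data sets", 0.0, "contact us", "about the team")),
    ("http://example.org/onto.owl",
     potential_score(query, "rdf data sets", 0.0, "ontology download", "")),
]
crawl_order = best_first(links)
```

A priority queue keeps the frontier ordered, so the crawler always expands the most promising link next.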
Multi-Hypothesis Parsing of Tabular Data in Comma-Separated Values (CSV) Files
Tabular data on the web comes in various formats and shapes. Preparing data for data
analysis and integration requires manual steps which go beyond simple parsing of the
data. The preparation includes steps like correct configuration of the parser, removal
of meaningless rows, casting of data types and reshaping of the table structure. The
goal of this thesis is the development of a robust and modular system which is able
to automatically transform messy CSV data sources into a tidy tabular data structure.
The highly diverse corpus of CSV files from the UK open data hub will serve as a basis
for the evaluation of the system.
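One way to picture the multi-hypothesis idea for the parser-configuration step is to parse the same text under several candidate delimiters and rank the results. The sketch below is an assumption-laden illustration, not the thesis's system: the function name `parse_hypotheses`, the candidate delimiter list and the consistency score (the share of rows that have the most common field count) are all hypothetical.

```python
import csv
import io
from collections import Counter


def parse_hypotheses(text, delimiters=(",", ";", "\t", "|")):
    """Parse `text` once per candidate delimiter and rank the hypotheses.

    Each hypothesis is a tuple (score, width, delimiter, rows), sorted
    best-first. The scoring rule is an illustrative assumption.
    """
    hypotheses = []
    for delimiter in delimiters:
        rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
        if not rows:
            continue
        # Most common number of fields per row and how often it occurs.
        width, freq = Counter(len(r) for r in rows).most_common(1)[0]
        # A single-column result usually means the delimiter never matched.
        score = freq / len(rows) if width > 1 else 0.0
        hypotheses.append((score, width, delimiter, rows))
    return sorted(hypotheses, key=lambda h: (h[0], h[1]), reverse=True)


sample = "name;city;age\nAnn;Oslo;31\nBob;Rome, Italy;45\n"
best_score, best_width, best_delimiter, best_rows = parse_hypotheses(sample)[0]
```

Keeping all ranked hypotheses, rather than only the winner, lets later steps such as type casting revisit the choice if the top-ranked parse turns out to be implausible.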
Handbook of Document Image Processing and Recognition
The Handbook of Document Image Processing and Recognition provides a consistent, comprehensive resource on the available methods and techniques in document image processing and recognition. It includes unified comparison and contrast analysis of algorithms in standard table formats. Thus, it educates the reader in order to help them to make informed decisions on their particular problems. The handbook is divided into several parts. Each part starts with an introduction written by the two editors. These introductions set the general framework for the main topic of each part and introduce the contribution of each chapter within the framework. The introductions are followed by several chapters written by established experts of the field. Each chapter provides the reader with a clear overview of the topic and of the state of the art in techniques used (including elements of comparison between them). Each chapter is structured in the same way: It starts with an introductory text, concludes with a summary of the main points addressed in the chapter and ends with a comprehensive list of references. Whenever appropriate, the authors include specific sections describing and pointing to consolidated software and/or reference datasets. Numerous cross-references between the chapters ensure this is a truly integrated work, without unnecessary duplications and overlaps between chapters. This reference work is intended for use by a wide audience of readers from around the world such as graduate students, researchers, librarians, lecturers, professionals, and many other people.
Variations on a theme: patterns of congruence and divergence among 18th century chemical affinity theories
The doctrine of affinity deserves to be recognised by historians of chemistry as
the foundational basis of the discipline of chemistry as it was practiced in
Britain during the 18th century. It attained this status through its crucial
structural role in the pedagogy of the discipline. The importance of pedagogy
and training in the practice of science is currently being reassessed by a number
of historians, and my research contributes to this historiographical endeavour.
My analysis of the variety of theories sheltered under the umbrella term ‘affinity
theory’ has emphasised the role of pedagogy in influencing both the structure
and the content of knowledge. I have shown that there were wide ranging
discrepancies between many of the components of individual affinity theories.
Nevertheless, the scope of divergence was limited. This underlying organisation
resulted from the unifying hub of affinity theory, the logical common ground.
This was the essence of the doctrine of affinity, encompassing the law of affinity
and the conceptualisation of the table that brought together the relations
described in the law. The doctrine of affinity thus provided a disciplinary
common ground between chemists, providing a mediating level of
understanding and communication for all those who subscribed to the doctrine
of affinity, in spite of their detailed differences.
DSG: An End-to-End Document Structure Generator
Information in industry, research, and the public sector is widely stored as
rendered documents (e.g., PDF files, scans). Hence, to enable downstream tasks,
systems are needed that map rendered documents onto a structured hierarchical
format. However, existing systems for this task are limited by heuristics and
are not end-to-end trainable. In this work, we introduce the Document Structure
Generator (DSG), a novel system for document parsing that is fully end-to-end
trainable. DSG combines a deep neural network for parsing (i) entities in
documents (e.g., figures, text blocks, headers, etc.) and (ii) relations that
capture the sequence and nested structure between entities. Unlike existing
systems that rely on heuristics, our DSG is trained end-to-end, making it
effective and flexible for real-world applications. We further contribute a
new, large-scale dataset called E-Periodica comprising real-world magazines
with complex document structures for evaluation. Our results demonstrate that
our DSG outperforms commercial OCR tools and, on top of that, achieves
state-of-the-art performance. To the best of our knowledge, our DSG system is
the first end-to-end trainable system for hierarchical document parsing.
Comment: Accepted at ICDM 202