6 research outputs found

    Segmenting Tables via Indexing of Value Cells by Table Headers

    Correct segmentation of a web table into its component regions is the essential first step to understanding tabular data. Our algorithmic solution to the segmentation problem relies on the property that the strings defining row and column header paths uniquely index each data cell in the table. We segment the table using only "logical layout analysis", without resorting to any appearance features or natural language understanding. We start with a CSV table that preserves the two-dimensional structure and contents of the original source table (e.g., an HTML table) but not its font size, font weight, or color. The indexing property of table headers implies a four-quadrant partitioning of the table about a minimum index point. The algorithm finds the index point through an efficient guided search. Experimental results on a 200-table benchmark demonstrate the generality of the algorithm in handling a variety of table styles and forms.
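
The indexing property described above can be illustrated with a minimal sketch. It is not the authors' guided search: it is a brute-force scan, and the function names, the smallest-header-area ordering, and the assumption that spanning header labels are repeated in every cell they cover are all hypothetical choices made for illustration.

```python
from itertools import product


def header_paths_unique(grid, r, c):
    """Return True if the top r rows and left c columns index every data cell uniquely."""
    # Column header path: the stack of header cells above each data column.
    col_paths = [tuple(grid[i][j] for i in range(r)) for j in range(c, len(grid[0]))]
    # Row header path: the run of header cells to the left of each data row.
    row_paths = [tuple(grid[i][j] for j in range(c)) for i in range(r, len(grid))]
    return (len(set(col_paths)) == len(col_paths)
            and len(set(row_paths)) == len(row_paths))


def find_index_point(grid):
    """Brute-force search for the smallest (r, c) whose header paths are unique.

    Candidates are tried in order of header area r * c (an illustrative
    assumption, not the paper's search strategy). Returns None if no split works.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    candidates = sorted(product(range(1, n_rows), range(1, n_cols)),
                        key=lambda rc: (rc[0] * rc[1], rc))
    for r, c in candidates:
        if header_paths_unique(grid, r, c):
            return r, c
    return None


# Hypothetical CSV-like table with a two-level column header.
grid = [
    ["",      "2019", "2019", "2020", "2020"],
    ["",      "Q1",   "Q2",   "Q1",   "Q2"],
    ["North", "10",   "12",   "11",   "14"],
    ["South", "7",    "9",    "8",    "10"],
]
index_point = find_index_point(grid)
```

Here the returned pair splits the grid into four quadrants: the stub (top left), the column headers (top right), the row headers (bottom left), and the data cells (bottom right).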

    A Novel Approach to Data Extraction on Hyperlinked Webpages

    The World Wide Web has an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages that provide further detail about certain attribute values. Extracting the schema of such relational tables is a challenge due to the absence of a standard format and a lack of published algorithms. We downloaded 15,000 web pages from various web sites using our in-house web crawler. Tables were extracted from the HTML code, and table rows were labeled with appropriate class labels. Conditional random fields (CRF) were used to classify table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables. A simple schema was extracted for non-linked tables, and for linked tables a relational schema in the form of primary and foreign keys (PK and FK) was developed. Child tables were concatenated with the parent table's attribute value (PK), which serves as the foreign key (FK). As a result, these tables support richer queries using the join operation. Manual checking of the linked web table results showed 99% precision and 68% recall. Our downloadable 15,000-page corpus and novel algorithm will provide a basis for further research in this field.
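
The primary/foreign-key step described above can be sketched in a few lines. This is only an illustration under stated assumptions: tables are modeled as lists of dictionaries, and the helper names `attach_foreign_key` and `join_tables`, the column names, and the sample rows are hypothetical and do not come from the paper.

```python
def attach_foreign_key(child_rows, parent_key, fk_name):
    """Prefix every child row with the parent attribute value, used as a foreign key."""
    return [{fk_name: parent_key, **row} for row in child_rows]


def join_tables(parent_rows, child_rows, pk, fk):
    """Inner join: merge each child row with the parent row its foreign key points to."""
    by_pk = {row[pk]: row for row in parent_rows}
    return [{**by_pk[child[fk]], **child}
            for child in child_rows if child[fk] in by_pk]


# Hypothetical parent table whose "country" cells link to detail pages.
parent = [
    {"country": "Norway", "capital": "Oslo"},
    {"country": "Chile", "capital": "Santiago"},
]
# Hypothetical child table found behind the "Norway" hyperlink.
child = [{"city": "Bergen", "region": "Vestland"}]

linked = attach_foreign_key(child, "Norway", fk_name="country")
joined = join_tables(parent, linked, pk="country", fk="country")
```

Carrying the parent value into each child row is what makes the join possible; without it the child table has no column that refers back to the page it was linked from.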

    Definição e avaliação de métodos para determinação de similaridade entre tabelas na web (Definition and evaluation of methods for determining similarity between web tables)

    Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2015. Abstract: The Web is a huge information source. Large amounts of data are published daily, and a great part of them is available as HTML tables. Some works have proposed approaches to extract and integrate the content of Web tables in order to make it more accessible for human consumption. However, this is a complex task and still an open issue, given that Web tables do not have a standard representation pattern. In addition, the use of synonyms and abbreviations makes it hard to compare the content of these tables. We therefore propose a new approach to determine similarity between Web tables that is able to deal with distinct structures and synonymous terms. Related works do not deal with both problems at the same time. Experimental evaluations show that the approach is promising.
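
The abstract does not describe the similarity measure itself, so the following is only a generic sketch of the idea of comparing tables independently of structure and synonyms. It swaps in plain Jaccard similarity over normalized cell terms; the function names, the synonym dictionary, and the sample tables are all hypothetical.

```python
def normalize(term, synonyms):
    """Lowercase and trim a cell value, then map it to a canonical synonym if one is known."""
    key = term.strip().lower()
    return synonyms.get(key, key)


def table_similarity(table_a, table_b, synonyms):
    """Jaccard similarity over the sets of normalized cell terms of two tables.

    Cell positions are ignored, so reordered rows or columns do not change the score.
    """
    terms_a = {normalize(cell, synonyms) for row in table_a for cell in row}
    terms_b = {normalize(cell, synonyms) for row in table_b for cell in row}
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


# Hypothetical tables: same content, different column order and header wording.
table_a = [["Country", "Pop."], ["Brazil", "214M"]]
table_b = [["Population", "Nation"], ["214M", "Brazil"]]
synonyms = {"pop.": "population", "nation": "country"}

with_synonyms = table_similarity(table_a, table_b, synonyms)
without_synonyms = table_similarity(table_a, table_b, {})
```

Comparing the two scores shows how much of the apparent difference between the tables comes from wording alone rather than from content.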