Segmenting Tables via Indexing of Value Cells by Table Headers

Author: Nagy George
Seth Sharad C.
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 01/01/2013

Correct segmentation of a web table into its component regions is the essential first step to understanding tabular data. Our algorithmic solution to the segmentation problem relies on the property that strings defining row and column header paths uniquely index each data cell in the table. We segment the table using only “logical layout analysis” without resorting to any appearance features or natural language understanding. We start with a CSV table that preserves the 2-dimensional structure and contents of the original source table (e.g., an HTML table) but not font size, font weight, and color. The indexing property of table headers implies a four-quadrant partitioning of the table about a minimum index point. The algorithm finds the index point through an efficient guided search. Experimental results on a 200-table benchmark demonstrate the generality of the algorithm in handling a variety of table styles and forms.
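The indexing property described in this abstract can be illustrated with a minimal sketch. This is not the authors' guided search: it is a brute-force scan, and it assumes the CSV grid is rectangular and that spanning header labels have already been copied into every cell they cover. The function names `header_paths_unique` and `find_index_point` are hypothetical.

```python
from itertools import product


def header_paths_unique(grid, r, c):
    """Check the indexing property for a candidate split.

    The top r rows are treated as column headers and the left c columns
    as row headers. The property holds when every data column has a
    distinct column-header path and every data row has a distinct
    row-header path.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    col_paths = [tuple(grid[i][j] for i in range(r)) for j in range(c, n_cols)]
    row_paths = [tuple(grid[i][j] for j in range(c)) for i in range(r, n_rows)]
    return (len(set(col_paths)) == len(col_paths)
            and len(set(row_paths)) == len(row_paths))


def find_index_point(grid):
    """Return the smallest (r, c) split whose header paths index the data.

    Assumption: "smallest" means the fewest header rows plus header
    columns. A brute-force scan stands in for the paper's guided search.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    candidates = sorted(
        product(range(1, n_rows), range(1, n_cols)),
        key=lambda rc: (rc[0] + rc[1], rc),
    )
    for r, c in candidates:
        if header_paths_unique(grid, r, c):
            return r, c
    return None


# Illustrative table: two header rows (year, quarter), one header column.
example = [
    ["", "2012", "2012", "2013", "2013"],
    ["", "Q1", "Q2", "Q1", "Q2"],
    ["North", "10", "12", "11", "15"],
    ["South", "7", "9", "8", "10"],
]
split = find_index_point(example)
```

With one header row the year labels repeat, so the column paths are not unique; two header rows and one header column are needed before every data cell has a unique pair of header paths. The resulting split divides the grid into the four quadrants the abstract mentions: stub, column headers, row headers, and data.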

DigitalCommons@University of Nebraska

A Novel Approach to Data Extraction on Hyperlinked Webpages

Author: Khushi Matloob
Masood Nayyer
Shaukat Kamran
Publication venue: 'MDPI AG'
Publication date: 01/01/2019

The World Wide Web holds an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages that provide further detail on certain attribute values. Extracting the schema of such relational tables is challenging because there is no standard format and a lack of published algorithms. We downloaded 15,000 web pages from various websites using our in-house web crawler. Tables were extracted from the HTML code, and table rows were labeled with appropriate class labels. Conditional random fields (CRF) were used to classify table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables. A simple schema was extracted for non-linked tables; for linked tables, a relational schema in the form of primary and foreign keys (PK and FK) was developed. Child tables were concatenated with the parent table’s attribute value (PK), which serves as the foreign key (FK). As a result, these tables support better and stronger queries using the join operation. Manual checking of the linked web table results revealed 99% precision and 68% recall. Our 15,000-strong downloadable corpus and novel algorithm will provide a basis for further research in this field.
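The key-linking step mentioned in this abstract can be sketched as follows. This covers only the final step (attaching the parent's PK value to child rows as an FK, then joining); the CRF row classification and the NFA are not reproduced. The function names, the `fk_` prefix, and the sample data are all hypothetical.

```python
def link_child_rows(parent_rows, pk_column, child_tables):
    """Attach the parent's key value to every row of its child table.

    child_tables maps a parent key value to the rows of the table found
    behind that value's hyperlink (an assumed representation).
    """
    linked = []
    for parent in parent_rows:
        key = parent[pk_column]
        for child in child_tables.get(key, []):
            linked.append({"fk_" + pk_column: key, **child})
    return linked


def join(parent_rows, pk_column, linked_children):
    """Inner-join child rows to their parent row on the PK/FK pair."""
    fk = "fk_" + pk_column
    by_key = {p[pk_column]: p for p in parent_rows}
    joined = []
    for child in linked_children:
        parent = by_key.get(child[fk])
        if parent is not None:
            rest = {k: v for k, v in child.items() if k != fk}
            joined.append({**parent, **rest})
    return joined


# Illustrative data: a parent table whose "team" cell links to a child table.
parents = [
    {"team": "Lions", "city": "Lahore"},
    {"team": "Eagles", "city": "Karachi"},
]
children = {
    "Lions": [
        {"player": "A", "runs": 40},
        {"player": "B", "runs": 12},
    ],
}
linked = link_child_rows(parents, "team", children)
result = join(parents, "team", linked)
```

Parents without a linked child table (here "Eagles") produce no joined rows, which matches inner-join semantics.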

Multidisciplinary Digital Publishing Institute

NORA - Norwegian Open Research Archives

UiS Brage

Segmenting Tables via Indexing of Value Cells by Table Headers

Author
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date

Crossref

Clustering header categories extracted from web tables

Author: Abu-Tarif
Adelfio
Astrakhantsev
Ball
Bing
Chandran
Costa e Silva
Dalvi
Embley
Gatterbauer
Gonzalez
Handley
Hirayama
Hu
Hurst
Hurst
Itonori
Kieninger
Kieninger
Laurentini
Long
McQueen
Pinto
Pyreddy
Shamalian
Venetis
Wang
Zuyev
Publication venue: 'SPIE-Intl Soc Optical Eng'
Publication date

Crossref

Definição e avaliação de métodos para determinação de similaridade entre tabelas na web (Definition and evaluation of methods for determining similarity between web tables)

Author: Silva Filipe Roberto
Publication venue
Publication date: 01/01/2015

Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2015.

Abstract: The Web is a huge information source. Large amounts of data are published daily, and a great part of them is available as HTML tables. Some works have proposed approaches to extract and integrate the content of Web tables in order to make it more accessible for human consumption. However, this is a complex task and still an open issue, given that Web tables do not have a single representation pattern. In addition, the use of synonyms and abbreviations makes it hard to compare the content of these tables. Given that, we propose a new approach to determine similarity between Web tables that is able to deal with distinct structures and synonymous terms. Related works do not deal with both problems at the same time. Experimental evaluations show that the approach is promising.

Repositório Institucional da UFSC