11 research outputs found
HTML table wrapper based on table components
Tables are a common model for representing data on the internet. Table data is often
harvested by copy-and-paste, but this method becomes a problem when there is a huge
number of tables drawn from many internet sources. This paper presents an approach to
locating the table area and to wrapping, or extracting, table components (cells and
properties) from HTML tables. The paper evaluates the approach by testing three
algorithms: Algorithm 1 determines the actual number of columns and rows of the table,
Algorithm 2 determines the boundary line of the property region, and Algorithm 3
extracts the content of the table. Tests were conducted on 100 tables in HTML format.
The resulting F-measure is 100.00% for Algorithm 1, 97.67% for Algorithm 2, and 94.91%
for Algorithm 3.
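As a rough illustration of what an algorithm like Algorithm 1 must handle, the sketch below counts a table's actual rows and columns while expanding `colspan` attributes. It is a minimal illustration, not the paper's algorithm; all names are ours, and `rowspan` handling is deliberately omitted for brevity.

```python
from html.parser import HTMLParser

class TableShape(HTMLParser):
    """Estimate the actual row and column count of an HTML table,
    expanding colspan attributes. Simplified: rowspan is ignored,
    so tables using it would need extra bookkeeping."""
    def __init__(self):
        super().__init__()
        self.rows = 0
        self.cols = 0
        self._row_cols = 0  # columns seen so far in the current row

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self.rows += 1
            self._row_cols = 0
        elif tag in ("td", "th"):
            # a cell spanning N columns counts as N logical columns
            self._row_cols += int(attrs.get("colspan", 1))
            self.cols = max(self.cols, self._row_cols)

def table_shape(html):
    parser = TableShape()
    parser.feed(html)
    return parser.rows, parser.cols
```

For example, a table whose header cell spans two columns still reports two logical columns, matching the data rows beneath it.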
Segmenting Tables via Indexing of Value Cells by Table Headers
Correct segmentation of a web table into its component regions is the essential first step to understanding tabular data. Our algorithmic solution to the segmentation problem relies on the property that strings defining row and column header paths uniquely index each data cell in the table. We segment the table using only “logical layout analysis” without resorting to any appearance features or natural language understanding. We start with a CSV table that preserves the 2-dimensional structure and contents of the original source table (e.g., an HTML table) but not font size, font weight, and color. The indexing property of table headers implies a four-quadrant partitioning of the table about a minimum index point. The algorithm finds the index point through an efficient guided search. Experimental results on a 200-table benchmark demonstrate the generality of the algorithm in handling a variety of table styles and forms.
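The uniqueness property the abstract relies on can be illustrated with a small sketch: given a CSV-like grid and a candidate split into header rows and header columns, each data cell is keyed by its row- and column-header paths, and the split is rejected if two cells collide. This is an illustration of the indexing property only, not the paper's guided search; the function and variable names are ours.

```python
def index_cells(grid, header_rows, header_cols):
    """Index each data cell by its (row-header-path, column-header-path).
    Returns the index dict if the paths uniquely identify every cell
    (i.e., the candidate segmentation is consistent), else None."""
    index = {}
    for r in range(header_rows, len(grid)):
        for c in range(header_cols, len(grid[r])):
            col_path = tuple(grid[h][c] for h in range(header_rows))
            row_path = tuple(grid[r][h] for h in range(header_cols))
            key = (row_path, col_path)
            if key in index:
                return None  # collision: headers do not uniquely index cells
            index[key] = grid[r][c]
    return index
```

A search over candidate `(header_rows, header_cols)` splits, accepting the minimal one for which `index_cells` succeeds, gives the flavor of locating a minimum index point.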
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather large amounts of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users, which
offers unprecedented opportunities to analyze human behavior at a very large
scale. We also discuss the potential for cross-fertilization, i.e., the
possibility of reusing Web Data Extraction techniques originally designed for
one domain in other domains.
Using Visual Cues for Extraction of Tabular Data from Arbitrary HTML Documents
We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model, which contains the positional data for all HTML elements of a given web page.
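A simplified sketch of the X-Y cut idea the paper adapts, operating on plain bounding boxes rather than Mozilla's actual box model. It recursively splits a set of element boxes at the widest whitespace gap, alternating between vertical and horizontal cuts; all names and the `min_gap` threshold are illustrative assumptions.

```python
def xy_cut(boxes, axis=0, min_gap=5):
    """Recursively partition bounding boxes (x0, y0, x1, y1) by cutting
    through whitespace gaps, alternating axes (0 = x, 1 = y). Returns a
    list of box groups; each group is a candidate layout region."""
    if len(boxes) <= 1:
        return [boxes]
    lo, hi = axis, axis + 2
    spans = sorted((b[lo], b[hi]) for b in boxes)
    # scan the merged projection for the first sufficiently wide gap
    end = spans[0][1]
    cut = None
    for s, e in spans[1:]:
        if s - end >= min_gap:
            cut = (end + s) / 2  # cut lies in empty space, so no box straddles it
            break
        end = max(end, e)
    if cut is None:
        if axis == 0:
            return xy_cut(boxes, axis=1, min_gap=min_gap)  # try the other axis
        return [boxes]  # no gap on either axis: indivisible region
    left = [b for b in boxes if b[hi] <= cut]
    right = [b for b in boxes if b[lo] >= cut]
    return (xy_cut(left, 1 - axis, min_gap)
            + xy_cut(right, 1 - axis, min_gap))
```

Applied to the positions of rendered HTML elements, the resulting rectangular regions are the candidates that a table detector would then classify.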
Visual tools for searching, information retrieval, and finding relevant content in web pages
The information accessible on the internet grows uncontrollably and is updated faster
than ever. Search engines, with Google as the leading example, appeared in response to
the problem of finding up-to-date, desired information. These engines can find specific
pages in the vastness of the web from a search criterion, but once a page is found they
cannot show the user precisely where the sought information is located within it,
forcing the user to search for it and so wasting part of their time. This problem is
accentuated by the complexity of the information and appearance of web pages, turning
the search into a tedious task. In this thesis we propose a technique, delivered as an
installable Firefox extension, that makes the user's search faster and more convenient
by letting them display information according to the need of the moment. After
obtaining the information relevant to the user's search criterion, we observed that
some of the information contained in the web page, such as menus or the page footer,
could degrade the user experience by showing more information than the user really
needs. To address this problem, we propose an algorithm that finds the main relevant
content block of a web page, ignoring or hiding the rest of the irrelevant page.
López Romero, S. (2010). Herramientas visuales para la búsqueda, la recuperación de información y la búsqueda de contenido relevante aplicado a páginas web. http://hdl.handle.net/10251/13667
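The main-content detection step in the last paragraph can be sketched with a common link-density heuristic. This is an assumption on our part: the abstract does not specify the thesis's scoring scheme, and the names below are illustrative.

```python
def main_content_block(blocks):
    """Pick the likely main-content block from a page. Each block is a
    (text, link_text) pair extracted from a candidate DOM node. Scores
    by text length penalized by link density, a common heuristic for
    filtering out menus and footers, whose text is mostly anchors."""
    def score(block):
        text, link_text = block
        if not text:
            return 0.0
        link_density = len(link_text) / len(text)
        return len(text) * (1.0 - link_density)
    return max(blocks, key=score)
```

Navigation blocks score near zero because almost all of their text sits inside links, so the article body wins even when it is not the longest block on the page.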
A teachable semi-automatic web information extraction system based on evolved regular expression patterns
This thesis explores Web Information Extraction (WIE) and how it has been used in decision making and to support businesses in their daily operations. The research focuses on a WIE system based on Genetic Programming (GP) with an extensible model to enhance the automatic extractor. The system uses a human as a teacher to identify and extract relevant information from semi-structured HTML web pages.
Regular expressions, chosen as the pattern-matching tool, are automatically generated from the training data to provide an improved grammar and lexicon. This particularly benefits the GP system, which may need to extend its lexicon in the presence of new tokens in the web pages; these tokens allow the GP method to produce new extraction patterns for new requirements.
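An evolver like the one described needs a fitness function to rank candidate regular expressions against the teacher's labeled examples. The sketch below scores a pattern by F1 over extracted versus expected matches; the scoring scheme and names are illustrative assumptions, not the thesis's exact design.

```python
import re

def fitness(pattern, examples):
    """Score a candidate regex against labeled training snippets.
    `examples` is a list of (text, expected_matches) pairs; returns
    the F1 score of re.findall's output against the expected set."""
    tp = fp = fn = 0
    for text, expected in examples:
        found = set(re.findall(pattern, text))
        expected = set(expected)
        tp += len(found & expected)   # correct extractions
        fp += len(found - expected)   # spurious extractions
        fn += len(expected - found)   # missed extractions
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a GP loop, this score would drive selection: patterns that over-match are penalized through precision, and patterns that miss the teacher's highlighted tokens through recall.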