11 research outputs found
HTML table wrapper based on table components
Tables are a common model for representing data on the internet. Table data is often
harvested by copy-and-paste, but this method becomes a problem when there is a huge
number of tables drawn from many internet sources. This paper presents an approach to
locating the table area and to wrapping, or extracting, table components (cells and
properties) from HTML tables. The paper evaluates the approach by testing three
algorithms: Algorithm 1 determines the actual number of columns and rows of the table,
Algorithm 2 determines the boundary line of the property region, and Algorithm 3
extracts the content of the table. Tests were conducted on 100 tables in HTML format.
The resulting F-measure is 100.00% for Algorithm 1, 97.67% for Algorithm 2, and 94.91%
for Algorithm 3.
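As a rough illustration of what an algorithm like Algorithm 1 must handle, the sketch below counts a table's actual rows and columns while expanding `colspan` attributes. It is a minimal illustration, not the paper's algorithm; all names are ours, and `rowspan` handling is deliberately omitted for brevity.

```python
from html.parser import HTMLParser

class TableShape(HTMLParser):
    """Estimate the actual row and column count of an HTML table,
    expanding colspan attributes. Simplified: rowspan is ignored,
    so tables using it would need extra bookkeeping."""
    def __init__(self):
        super().__init__()
        self.rows = 0
        self.cols = 0
        self._row_cols = 0  # columns seen so far in the current row

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self.rows += 1
            self._row_cols = 0
        elif tag in ("td", "th"):
            # a cell spanning N columns counts as N logical columns
            self._row_cols += int(attrs.get("colspan", 1))
            self.cols = max(self.cols, self._row_cols)

def table_shape(html):
    parser = TableShape()
    parser.feed(html)
    return parser.rows, parser.cols
```

For example, a table whose header cell spans two columns still reports two logical columns, matching the data rows beneath it.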
Segmenting Tables via Indexing of Value Cells by Table Headers
Correct segmentation of a web table into its component regions is the essential first step to understanding tabular data. Our algorithmic solution to the segmentation problem relies on the property that strings defining row and column header paths uniquely index each data cell in the table. We segment the table using only “logical layout analysis” without resorting to any appearance features or natural language understanding. We start with a CSV table that preserves the 2-dimensional structure and contents of the original source table (e.g., an HTML table) but not font size, font weight, and color. The indexing property of table headers implies a four-quadrant partitioning of the table about a minimum index point. The algorithm finds the index point through an efficient guided search. Experimental results on a 200-table benchmark demonstrate the generality of the algorithm in handling a variety of table styles and forms.
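The uniqueness property the abstract relies on can be illustrated with a small sketch: given a CSV-like grid and a candidate split into header rows and header columns, each data cell is keyed by its row- and column-header paths, and the split is rejected if two cells collide. This is an illustration of the indexing property only, not the paper's guided search; the function and variable names are ours.

```python
def index_cells(grid, header_rows, header_cols):
    """Index each data cell by its (row-header-path, column-header-path).
    Returns the index dict if the paths uniquely identify every cell
    (i.e., the candidate segmentation is consistent), else None."""
    index = {}
    for r in range(header_rows, len(grid)):
        for c in range(header_cols, len(grid[r])):
            col_path = tuple(grid[h][c] for h in range(header_rows))
            row_path = tuple(grid[r][h] for h in range(header_cols))
            key = (row_path, col_path)
            if key in index:
                return None  # collision: headers do not uniquely index cells
            index[key] = grid[r][c]
    return index
```

A search over candidate `(header_rows, header_cols)` splits, accepting the minimal one for which `index_cells` succeeds, gives the flavor of locating a minimum index point.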
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather large amounts of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users, which
offers unprecedented opportunities to analyze human behavior at a very large
scale. We also discuss the potential for cross-fertilization, i.e., the
possibility of reusing Web Data Extraction techniques originally designed for
one domain in other domains.
Using Visual Cues for Extraction of Tabular Data from Arbitrary HTML Documents
We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model, which contains the positional data for all HTML elements of a given web page.
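A simplified sketch of the X-Y cut idea the paper adapts, operating on plain bounding boxes rather than Mozilla's actual box model. It recursively splits a set of element boxes at the widest whitespace gap, alternating between vertical and horizontal cuts; all names and the `min_gap` threshold are illustrative assumptions.

```python
def xy_cut(boxes, axis=0, min_gap=5):
    """Recursively partition bounding boxes (x0, y0, x1, y1) by cutting
    through whitespace gaps, alternating axes (0 = x, 1 = y). Returns a
    list of box groups; each group is a candidate layout region."""
    if len(boxes) <= 1:
        return [boxes]
    lo, hi = axis, axis + 2
    spans = sorted((b[lo], b[hi]) for b in boxes)
    # scan the merged projection for the first sufficiently wide gap
    end = spans[0][1]
    cut = None
    for s, e in spans[1:]:
        if s - end >= min_gap:
            cut = (end + s) / 2  # cut lies in empty space, so no box straddles it
            break
        end = max(end, e)
    if cut is None:
        if axis == 0:
            return xy_cut(boxes, axis=1, min_gap=min_gap)  # try the other axis
        return [boxes]  # no gap on either axis: indivisible region
    left = [b for b in boxes if b[hi] <= cut]
    right = [b for b in boxes if b[lo] >= cut]
    return (xy_cut(left, 1 - axis, min_gap)
            + xy_cut(right, 1 - axis, min_gap))
```

Applied to the positions of rendered HTML elements, the resulting rectangular regions are the candidates that a table detector would then classify.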
Visual tools for searching, information retrieval, and finding relevant content in web pages
The information accessible on the internet grows uncontrollably and is updated faster
than ever. Search engines, with Google as the leading example, appeared in response to
the problem of finding up-to-date, desired information. These engines can find specific
pages in the vastness of the web from a search criterion, but once a page is found they
cannot show the user precisely where the sought information is located within it,
forcing the user to search for it and so wasting part of their time. This problem is
accentuated by the complexity of the information and appearance of web pages, turning
the search into a tedious task. In this thesis we propose a technique, delivered as an
installable Firefox extension, that makes the user's search faster and more convenient
by letting them display information according to the need of the moment. After
obtaining the information relevant to the user's search criterion, we observed that
some of the information contained in the web page, such as menus or the page footer,
could degrade the user experience by showing more information than the user really
needs. To address this problem, we propose an algorithm that finds the main relevant
content block of a web page, ignoring or hiding the rest of the irrelevant page.
López Romero, S. (2010). Herramientas visuales para la búsqueda, la recuperación de información y la búsqueda de contenido relevante aplicado a páginas web. http://hdl.handle.net/10251/13667
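The main-content detection step in the last paragraph can be sketched with a common link-density heuristic. This is an assumption on our part: the abstract does not specify the thesis's scoring scheme, and the names below are illustrative.

```python
def main_content_block(blocks):
    """Pick the likely main-content block from a page. Each block is a
    (text, link_text) pair extracted from a candidate DOM node. Scores
    by text length penalized by link density, a common heuristic for
    filtering out menus and footers, whose text is mostly anchors."""
    def score(block):
        text, link_text = block
        if not text:
            return 0.0
        link_density = len(link_text) / len(text)
        return len(text) * (1.0 - link_density)
    return max(blocks, key=score)
```

Navigation blocks score near zero because almost all of their text sits inside links, so the article body wins even when it is not the longest block on the page.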
A teachable semi-automatic web information extraction system based on evolved regular expression patterns
This thesis explores Web Information Extraction (WIE) and how it has been used in decision making and to support businesses in their daily operations. The research focuses on a WIE system based on Genetic Programming (GP) with an extensible model to enhance the automatic extractor. The system uses a human as a teacher to identify and extract relevant information from semi-structured HTML web pages.
Regular expressions, chosen as the pattern-matching tool, are automatically generated from the training data to provide an improved grammar and lexicon. This particularly benefits the GP system, which may need to extend its lexicon in the presence of new tokens in the web pages; these tokens allow the GP method to produce new extraction patterns for new requirements.
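An evolver like the one described needs a fitness function to rank candidate regular expressions against the teacher's labeled examples. The sketch below scores a pattern by F1 over extracted versus expected matches; the scoring scheme and names are illustrative assumptions, not the thesis's exact design.

```python
import re

def fitness(pattern, examples):
    """Score a candidate regex against labeled training snippets.
    `examples` is a list of (text, expected_matches) pairs; returns
    the F1 score of re.findall's output against the expected set."""
    tp = fp = fn = 0
    for text, expected in examples:
        found = set(re.findall(pattern, text))
        expected = set(expected)
        tp += len(found & expected)   # correct extractions
        fp += len(found - expected)   # spurious extractions
        fn += len(expected - found)   # missed extractions
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a GP loop, this score would drive selection: patterns that over-match are penalized through precision, and patterns that miss the teacher's highlighted tokens through recall.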