    HTML table wrapper based on table components

    Tables are a common model for representing data on the internet. Table data is often harvested by copy-and-paste, but this becomes impractical when large numbers of tables must be collected from many internet sources. This paper presents an approach for locating the table area and for wrapping, i.e., extracting, the cell and property components of HTML tables. The approach is evaluated by testing Algorithms 1, 2, and 3: Algorithm 1 determines the actual number of columns and rows of the table, Algorithm 2 determines the boundary line of the property region, and Algorithm 3, applied at the end of the extraction process, retrieves the table content. Tests were conducted on 100 HTML tables. The resulting F-measure is 100.00% for Algorithm 1, 97.67% for Algorithm 2, and 94.91% for Algorithm 3.
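
    To make Algorithm 1 concrete: the actual row and column counts of an HTML table differ from the raw tag counts whenever cells carry rowspan or colspan attributes. The paper does not publish its code, so the following Python sketch only illustrates that computation; BeautifulSoup, the function name, and the grid-filling strategy are assumptions.

        # A minimal sketch, in the spirit of Algorithm 1: count the actual
        # rows and columns of an HTML table while expanding row/colspans.
        from bs4 import BeautifulSoup

        def actual_table_size(table_html: str) -> tuple[int, int]:
            """Return (rows, cols) of an HTML table, expanding spans."""
            soup = BeautifulSoup(table_html, "html.parser")
            occupied = set()      # (row, col) slots already claimed by a span
            n_rows, n_cols = 0, 0
            for r, tr in enumerate(soup.find("table").find_all("tr")):
                c = 0
                for cell in tr.find_all(["td", "th"]):
                    while (r, c) in occupied:   # skip slots covered from above
                        c += 1
                    rowspan = int(cell.get("rowspan", 1))
                    colspan = int(cell.get("colspan", 1))
                    for dr in range(rowspan):   # claim every slot the cell spans
                        for dc in range(colspan):
                            occupied.add((r + dr, c + dc))
                    c += colspan
                while (r, c) in occupied:       # trailing slots from spans above
                    c += 1
                n_rows, n_cols = r + 1, max(n_cols, c)
            return n_rows, n_cols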

    Segmenting Tables via Indexing of Value Cells by Table Headers

    Correct segmentation of a web table into its component regions is the essential first step to understanding tabular data. Our algorithmic solution to the segmentation problem relies on the property that strings defining row and column header paths uniquely index each data cell in the table. We segment the table using only “logical layout analysis”, without resorting to any appearance features or natural language understanding. We start with a CSV table that preserves the 2-dimensional structure and contents of the original source table (e.g., an HTML table) but not font size, font weight, or color. The indexing property of table headers implies a four-quadrant partitioning of the table about a minimum index point. The algorithm finds the index point through an efficient guided search. Experimental results on a 200-table benchmark demonstrate the generality of the algorithm in handling a variety of table styles and forms.
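
    As an illustration of the indexing property, the sketch below tests candidate index points on a CSV-style grid by brute force. The paper's guided search is more efficient than this scan; the function names and row-major scan order are illustrative assumptions.

        # A minimal sketch of the indexing test: a candidate point (r, c) is
        # valid if the pair (row header path, column header path) is unique
        # for every data cell below and to the right of it.
        from itertools import product

        def is_index_point(grid, r, c):
            """True if rows 0..r-1 and cols 0..c-1 uniquely index each data cell."""
            seen = set()
            for i in range(r, len(grid)):
                row_path = tuple(grid[i][:c])        # row header path of row i
                for j in range(c, len(grid[i])):
                    col_path = tuple(grid[k][j] for k in range(r))
                    if (row_path, col_path) in seen:
                        return False                 # two cells share one index
                    seen.add((row_path, col_path))
            return True

        def find_index_point(grid):
            """First (r, c) in a row-major scan satisfying the indexing test."""
            for r, c in product(range(1, len(grid)), range(1, len(grid[0]))):
                if is_index_point(grid, r, c):
                    return r, c
            return None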

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of application domains. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains; others heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims to provide a structured and comprehensive overview of the literature on Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes: applications at the Enterprise level and applications at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems, as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media, and Online Social Network users, offering unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed for one domain in other domains.

    Using Visual Cues for Extraction of Tabular Data from Arbitrary HTML Documents

    We describe a method for extracting tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables that are not explicitly marked up with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm used in the OCR community. We implemented the system by directly accessing Mozilla's box model, which contains the positional data for all HTML elements of a given web page.
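
    The X-Y cut variant can be sketched independently of Mozilla's box model: given bounding boxes (x0, y0, x1, y1) of rendered elements, recursively split the set wherever a horizontal or vertical gap crosses the whole region. The gap threshold and recursion scheme below are illustrative assumptions, not the paper's parameters.

        # A minimal sketch of X-Y cut over rendered element boxes.
        def xy_cut(boxes, min_gap=5):
            """Recursively partition boxes into visually separated groups."""
            if len(boxes) <= 1:
                return [boxes]
            for lo, hi in ((0, 2), (1, 3)):          # x-axis, then y-axis
                spans = sorted((b[lo], b[hi]) for b in boxes)
                end = spans[0][1]
                for start, stop in spans[1:]:
                    if start - end >= min_gap:       # a gap crosses the region
                        cut = (end + start) / 2
                        first = [b for b in boxes if b[hi] <= cut]
                        rest = [b for b in boxes if b[lo] > cut]
                        return xy_cut(first, min_gap) + xy_cut(rest, min_gap)
                    end = max(end, stop)
            return [boxes]                           # atomic block: no gap found

    Each returned group is a candidate table region whose members can then be aligned into rows and columns by their coordinates.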

    Visual tools for search, information retrieval, and finding relevant content, applied to web pages

    The information accessible on the internet grows uncontrollably and is updated faster than ever. The problem of finding up-to-date, relevant information gave rise to search engines, with Google as the leading example. These engines can find specific pages in the vastness of the web from a search query, but once a page is found they cannot show the user precisely where the sought content sits within it, forcing the user to hunt for it and wasting part of their time. The problem is aggravated by the complexity of the information and the visual design of web pages, turning the search into a tedious task. In this thesis we propose a technique, delivered as an installable Firefox extension, that makes the user's search faster and more convenient by letting them display information according to the need of the moment. After retrieving the information relevant to the user's query, we observed that parts of the page content, such as menus or the page footer, could degrade the user experience by showing more information than the user actually needs. To address this problem, we propose an algorithm that locates the main relevant content block of a web page, ignoring or hiding the rest of the irrelevant page. López Romero, S. (2010). Herramientas visuales para la búsqueda, la recuperación de información y la búsqueda de contenido relevante aplicado a páginas web. http://hdl.handle.net/10251/13667
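
    The thesis does not publish its block-detection algorithm; as a rough illustration of the idea, the sketch below scores candidate blocks by text-to-link density, a common heuristic for separating main content from link-heavy menus and footers. The candidate tags and the scoring formula are assumptions.

        # A minimal sketch of a main-content heuristic: menus and footers
        # are link-heavy, so favor blocks with long text and few links.
        from bs4 import BeautifulSoup

        def main_content_block(html: str):
            """Return the element most likely to hold the main content."""
            soup = BeautifulSoup(html, "html.parser")
            best, best_score = None, 0.0
            for block in soup.find_all(["div", "article", "section", "main"]):
                text_len = len(block.get_text(strip=True))
                link_len = sum(len(a.get_text(strip=True))
                               for a in block.find_all("a"))
                score = (text_len - link_len) / (1 + link_len)
                if score > best_score:
                    best, best_score = block, score
            return best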

    A teachable semi-automatic web information extraction system based on evolved regular expression patterns

    This thesis explores Web Information Extraction (WIE) and how it has been used in decision making and in supporting businesses in their daily operations. The research focuses on a WIE system based on Genetic Programming (GP) with an extensible model that enhances the automatic extractor. The system uses a human as a teacher to identify and extract relevant information from semi-structured HTML webpages. Regular expressions, chosen as the pattern matching tool, are automatically generated from the training data to provide an improved grammar and lexicon. This particularly benefits the GP system, which may need to extend its lexicon in the presence of new tokens in the web pages; these tokens allow the GP method to produce new extraction patterns for new requirements.
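
    To give a flavor of the approach, the sketch below evolves flat token sequences into regular expressions scored against labeled examples. A real GP system would use tree-structured individuals and crossover; the lexicon, mutation operators, and fitness function here are illustrative assumptions.

        # A minimal sketch of evolving regular expressions from examples.
        import random
        import re

        LEXICON = [r"\d+", r"\w+", r"[A-Z]\w*", r"\s", r"-", r"/", r":"]

        def fitness(pattern, positives, negatives):
            """Fraction of examples the pattern classifies correctly."""
            try:
                rx = re.compile(pattern)
            except re.error:
                return 0.0                    # invalid individuals die off
            hits = sum(1 for s in positives if rx.fullmatch(s))
            misses = sum(1 for s in negatives if not rx.fullmatch(s))
            return (hits + misses) / (len(positives) + len(negatives))

        def mutate(tokens):
            """Replace, insert, or drop one token of the pattern."""
            tokens = list(tokens)
            op = random.choice(["replace", "insert", "delete"])
            if op == "replace":
                tokens[random.randrange(len(tokens))] = random.choice(LEXICON)
            elif op == "insert":
                tokens.insert(random.randrange(len(tokens) + 1),
                              random.choice(LEXICON))
            elif len(tokens) > 1:             # delete, keeping one token
                tokens.pop(random.randrange(len(tokens)))
            return tokens

        def evolve(positives, negatives, pop_size=30, generations=50):
            """Truncation-selection loop over token-list individuals."""
            score = lambda t: fitness("".join(t), positives, negatives)
            pop = [[random.choice(LEXICON)] for _ in range(pop_size)]
            for _ in range(generations):
                survivors = sorted(pop, key=score, reverse=True)[:pop_size // 2]
                pop = survivors + [mutate(random.choice(survivors))
                                   for _ in survivors]
            return "".join(max(pop, key=score))

        # e.g. evolve(["2023-01-15", "1999-12-31"], ["hello", "12 34"]) tends
        # to converge on something like r"\d+-\d+-\d+".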