9 research outputs found

    Design of Automatically Adaptable Web Wrappers

    Get PDF
    Nowadays, the huge amount of information distributed through the Web motivates the study of techniques for extracting relevant data in an efficient and reliable way. Both academia and industry have developed several approaches to Web data extraction, for example using techniques from artificial intelligence or machine learning. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision of the information extracted from Web pages and, at the same time, have to prove robust in order not to compromise the quality and reliability of the data themselves. In this paper we focus on some experimental aspects related to the robustness of the data extraction process and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for finding similarities between two different versions of a Web page, in order to handle modifications, avoiding the failure of data extraction tasks and ensuring the reliability of the extracted information. Our purpose is to evaluate the performance, advantages, and drawbacks of our novel system for automatic wrapper adaptation.
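
    To make the adaptation idea concrete, here is a minimal sketch (not the authors' system) of how a wrapper could re-anchor a broken extraction rule by comparing element "signatures" between the old and new version of a page. The Element type, the signature function, and the 0.6 threshold are illustrative assumptions.

```python
# Minimal sketch: re-anchor a broken selector by comparing element signatures
# (tag + attributes + text) between the old and new page version.
# Element is a simplified stand-in for a parsed DOM node, not the authors' model.
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Element:
    tag: str
    attrs: dict
    text: str

def signature(e: Element) -> str:
    attrs = " ".join(f'{k}="{v}"' for k, v in sorted(e.attrs.items()))
    return f"<{e.tag} {attrs}>{e.text.strip()}"

def similarity(a: Element, b: Element) -> float:
    # String similarity over the signatures; a real system would also weigh
    # tree position, but this is enough to illustrate the matching idea.
    return SequenceMatcher(None, signature(a), signature(b)).ratio()

def readapt(target: Element, new_page: list[Element], threshold: float = 0.6):
    """Find the element in the new page version most similar to the one the
    wrapper used to extract; return None if nothing is similar enough."""
    best = max(new_page, key=lambda e: similarity(target, e))
    return best if similarity(target, best) >= threshold else None

# Example: the price cell changed its class name but kept its tag and text shape.
old = Element("span", {"class": "price"}, "199.00 EUR")
new_page = [
    Element("div", {"class": "title"}, "Item name"),
    Element("span", {"class": "price-tag"}, "205.00 EUR"),
]
print(readapt(old, new_page))
```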

    Intelligent Self-Repairable Web Wrappers

    Get PDF
    The amount of information available on the Web grows at an incredibly high rate. Systems and procedures devised to extract these data from Web sources already exist, and different approaches and techniques have been investigated in recent years. On the one hand, reliable solutions should provide robust Web data mining algorithms that can automatically cope with possible malfunctioning or failures. On the other hand, the literature lacks solutions for the maintenance of these systems. Procedures that extract Web data may be tightly coupled to the structure of the data source itself; thus, malfunctioning or the acquisition of corrupted data could be caused, for example, by structural modifications of data sources made by their owners. Nowadays, verification of data integrity and maintenance are mostly managed manually, in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to creating procedures able to extract data from Web sources -- the so-called Web wrappers -- which can cope with possible malfunctioning caused by modifications of the structure of the data source and can automatically repair themselves.
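
    A minimal sketch of the verify-then-repair control flow described above. The hooks extract(), looks_valid() and relocate_rule() are hypothetical stand-ins for the wrapper's extraction rule, its validity check, and an adaptation step such as the similarity matching sketched earlier.

```python
# Hedged sketch of a self-repairing wrapper loop; all callables are injected
# so the control flow, not any particular extraction library, is illustrated.
def run_wrapper(page_html: str, rule: str,
                extract, looks_valid, relocate_rule):
    values = extract(page_html, rule)
    if values and looks_valid(values):
        return rule, values                       # source unchanged, nothing to repair
    new_rule = relocate_rule(page_html, rule)     # attempt automatic self-repair
    if new_rule is None:
        raise RuntimeError("wrapper could not repair itself; manual maintenance needed")
    return new_rule, extract(page_html, new_rule)
```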

    Automatic Wrapper Adaptation by Tree Edit Distance Matching

    Get PDF
    Information distributed through the Web keeps growing faster day by day, and for this reason several techniques for extracting Web data have been suggested in recent years. Often, extraction tasks are performed through so-called wrappers, procedures that extract information from Web pages, e.g. implementing logic-based techniques. Many fields of application today require a strong degree of robustness of wrappers, in order not to compromise assets of information or the reliability of the extracted data. Unfortunately, wrappers may fail in the task of extracting data from a Web page if its structure changes, sometimes even slightly, thus requiring new techniques to be applied automatically to adapt the wrapper to the new structure of the page in case of failure. In this work we present a novel approach to automatic wrapper adaptation based on measuring the similarity of trees through improved tree edit distance matching techniques.
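
    As an illustration of tree-based similarity, the sketch below implements the classic "simple tree matching" measure, often used as a cheap stand-in for full tree edit distance when comparing two DOM trees. The Node type is a hypothetical minimal tree, and the paper's weighting refinements are omitted.

```python
# Simple tree matching: size of the largest common subtree preserving order
# and ancestry between two ordered, labeled trees (e.g., simplified DOM trees).
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def simple_tree_matching(a: Node, b: Node) -> int:
    if a.label != b.label:
        return 0
    m, n = len(a.children), len(b.children)
    # Dynamic program over the two ordered child sequences.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j], M[i][j - 1],
                          M[i - 1][j - 1] + simple_tree_matching(a.children[i - 1],
                                                                 b.children[j - 1]))
    return M[m][n] + 1   # +1 for the matched roots

def tree_similarity(a: Node, b: Node) -> float:
    def size(t): return 1 + sum(size(c) for c in t.children)
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))

# A table row gained an extra cell; the trees are still largely similar.
old = Node("table", [Node("tr", [Node("td"), Node("td")])])
new = Node("table", [Node("tr", [Node("td"), Node("td"), Node("td")])])
print(round(tree_similarity(old, new), 2))
```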

    A Review on Multilevel Wrapper Verification System with Maintenance Model Enhancement

    Full text link
    Online data sources have led to an increased use of wrappers for extracting data from Web sources. We present a unique idea to address the stated problems and formally demonstrate its correctness. Conventional research has concentrated on the quick and efficient generation of wrappers; the development of tools for wrapper maintenance has received less attention, and there is no provision for self-maintenance. Our approach enables wrappers to be learned in a completely unsupervised way from automatically and cheaply prepared training data, e.g., using dictionaries and regular expressions. This becomes a research issue because Web sources frequently change dynamically in ways that prevent wrappers from extracting data correctly. Our goal is to help software engineers build wrapping agents that translate queries written in a high-level structured language. This work introduces an efficient concept for learning structural information about data from positive examples alone. The system uses this information for wrapper maintenance applications: a wrapper verification and induction mechanism is used to design a maintenance model. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. Websites are constantly evolving and being upgraded, and structural changes happen without warning, which usually results in wrappers working incorrectly. Unfortunately, wrappers may fail in the task of extracting data from a Web page if its structure changes, sometimes even slightly, thus requiring new techniques to be applied automatically to adapt the wrapper to the new structure of the page in case of failure.
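
    One simple way to realize the verification idea above is to profile known-good extractions and flag a new batch whose profile drifts too far. The sketch below is illustrative only; the field pattern, features and thresholds are assumptions, not the reviewed system.

```python
# Hedged sketch of statistical wrapper verification: compare value length and
# pattern-match rate of a new batch against a baseline of known-good values.
import re
from statistics import mean

PRICE_RE = re.compile(r"^\d+[.,]\d{2}\s*(EUR|USD)?$")   # assumed field pattern

def profile(values):
    return {
        "avg_len": mean(len(v) for v in values),
        "pattern_rate": mean(bool(PRICE_RE.match(v)) for v in values),
    }

def verify(baseline_values, new_values, max_len_drift=0.5, min_pattern_rate=0.8):
    base, new = profile(baseline_values), profile(new_values)
    len_drift = abs(new["avg_len"] - base["avg_len"]) / base["avg_len"]
    return len_drift <= max_len_drift and new["pattern_rate"] >= min_pattern_rate

good = ["199.00 EUR", "1200.50 EUR", "89.99 EUR"]
suspicious = ["<div class='ad'>", "undefined", "89.99 EUR"]
print(verify(good, good), verify(good, suspicious))
```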

    Web data extraction systems versus research collaboration in sustainable planning for housing: Smart governance takes it all

    Get PDF
    To date, there are no clear insights into the spatial patterns and micro-dynamics of the housing market. The objective of this study is to collect real estate micro-data for the development of policy-support indicators on housing market dynamics at the local scale. These indicators can provide the requested insights into spatial patterns and micro-dynamics of the housing market. Because the required real estate data are not systematically published as statistical data or open data, innovative forms of data collection are needed. This paper is based on a case study approach of the greater Leuven area (Belgium). The research question is which methods or strategies are suitable for collecting data on the micro-dynamics of the housing market. The methodology includes a technical approach for data collection, being Web data extraction, and a governance approach, being explorative interviews. A Web data extraction system collects and extracts unstructured or semi-structured data that are stored or published on Web sources. Most of the required data are publicly and readily available as Web data on real estate portal websites. Web data extraction at the scale of the case study succeeded in collecting the required micro-data, but a trial run at the regional scale encountered a number of practical and legal issues. Simultaneously with the Web data extraction, a dialogue with two real estate portal websites was initiated, using purposive sampling and explorative semi-structured interviews. The interviews were considered the start of a transdisciplinary research collaboration process. Both companies indicated that the development of indicators about housing market dynamics was a good and relevant idea, yet a challenging task. The companies were familiar with Web data extraction systems, but considered it a suboptimal technique for collecting real estate data for the development of housing dynamics indicators. They preferred an active collaboration instead of passive Web scraping. Within the frame of a users’ agreement, we received one company’s dataset and calculated the indicators for the case study based on this dataset. The unique micro-data provided by the company proved to be the start of a collaborative planning approach between private partners, the academic world and the Flemish government. All three win from this collaboration in the long run. Smart governance can gain from smart technologies, but should not lose sight of active collaborations.
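
    For readers unfamiliar with what such a collection step looks like, here is an illustrative sketch of pulling a few micro-data fields (price, surface, municipality) from a hypothetical listing page. The URL and CSS classes are invented, the third-party packages requests and beautifulsoup4 are assumed, and, as the paper notes, any real run must respect the portal's terms of use.

```python
# Illustrative only: extract basic listing fields from an assumed page layout.
import requests
from bs4 import BeautifulSoup

def extract_listings(url: str) -> list[dict]:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    records = []
    for card in soup.select("article.listing"):          # hypothetical markup
        records.append({
            "price": card.select_one(".price").get_text(strip=True),
            "surface_m2": card.select_one(".surface").get_text(strip=True),
            "municipality": card.select_one(".location").get_text(strip=True),
        })
    return records

# records = extract_listings("https://example-portal.be/leuven/for-sale")
```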

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, and this offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains.

    Personalizing the web: A tool for empowering end-users to customize the web through browser-side modification

    Get PDF
    Web applications delegate to the browser the final rendering of their pages. This permits browser-based transcoding (a.k.a. Web Augmentation) that can ultimately be singularized for each browser installation. This creates an opportunity for Web consumers to customize their Web experiences. This vision requires providing adequate tooling that makes Web Augmentation affordable to laymen. We consider this a special class of End-User Development, integrating Web Augmentation paradigms. The dominant paradigm in End-User Development is scripting languages through visual languages. This thesis advocates a Google Chrome browser extension for Web Augmentation. This is carried out through WebMakeup, a visual DSL programming tool for end-users to customize their own websites. WebMakeup removes, moves and adds web nodes from different web pages in order to avoid tab switching, scrolling, extra clicks, and cutting and pasting. Moreover, Web Augmentation extensions have difficulties finding web elements after a website update. As a consequence, browser extensions stop working and users might abandon them. This is why two different locators have been implemented with the aim of improving web locator robustness.
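
    The multi-locator idea can be illustrated independently of the browser: try an exact structural locator first, then fall back to a more change-tolerant one. WebMakeup itself runs as browser-side JavaScript; this Python sketch only shows the fallback logic, and find_by_xpath / find_by_text are hypothetical hooks for whatever DOM access the extension actually uses.

```python
# Hedged sketch of fallback-based element location for improved robustness.
def locate(dom, locators):
    """Return the first element found by any locator strategy, else None."""
    for strategy in locators:
        element = strategy(dom)
        if element is not None:
            return element
    return None

# Assumed usage, with a brittle structural locator tried before a tolerant one:
# locators = [
#     lambda dom: find_by_xpath(dom, "/html/body/div[2]/span[1]"),     # brittle
#     lambda dom: find_by_text(dom, tag="span", text="Add to cart"),   # tolerant
# ]
# element = locate(dom, locators)
```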

    Mejorando las técnicas de verificación de wrappers web mediante técnicas bioinspiradas y de clasificación

    Get PDF
    Many enterprise applications require wrappers to deal with information from the deep Web. Wrappers are automated systems that make it possible to navigate, extract, structure and verify relevant information from the Web. One of their components, the information extractor, consists of a set of extraction rules that are usually based on HTML tags. Therefore, if the sources change, the wrapper may in some cases return information the company does not want and cause, at best, delays in its decision-making. Several wrapper verification systems have been developed to automatically detect when a wrapper is extracting incorrect data. These systems have a number of shortcomings whose origin lies in assuming that the data to be verified follow a set of pre-established statistical characteristics. This dissertation analyzes these systems, designs a framework for developing verifiers, and approaches the verification problem from two different points of view. Initially, we place it within the branch of computational optimization and solve it by applying bio-inspired metaheuristics such as ant colony optimization, specifically the BWAS algorithm; subsequently, we formulate and solve it as an unsupervised classification problem. The result of this second approach is MAVE, a multilevel verifier based primarily on one-class classifiers.
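
    A hedged sketch of the one-class view of verification: fit a one-class model on features of known-good extracted values, then flag later extractions that the model treats as outliers. The features, parameters and use of scikit-learn are illustrative assumptions; MAVE itself is multilevel and far richer than this.

```python
# One-class classification over simple per-value features of extracted strings.
import numpy as np
from sklearn.svm import OneClassSVM

def features(values):
    # Per-value features: length, digit fraction, token count.
    return np.array([[len(v),
                      sum(c.isdigit() for c in v) / max(len(v), 1),
                      len(v.split())]
                     for v in values], dtype=float)

good = ["199.00 EUR", "1200.50 EUR", "89.99 EUR", "450.00 EUR", "75.10 EUR"]
model = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale").fit(features(good))

new_batch = ["320.00 EUR", "<div class='ad'>Buy now!</div>"]
# +1 means the value resembles the training data, -1 flags a likely outlier.
print(model.predict(features(new_batch)))
```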