30 research outputs found

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims to provide a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains. Comment: Knowledge-based System

    Building Intelligent Web Applications Using Lightweight Wrappers

    The Web has so far been incredibly successful at delivering information to human users. So successful, in fact, that there is now an urgent need to go beyond a browsing human. Unfortunately, the Web is not yet a well-organized repository of nicely structured documents but rather a conglomerate of volatile HTML pages. To address this problem, we present the World Wide Web Wrapper Factory (W4F), a toolkit for the generation of wrappers for Web sources that offers: (1) an expressive language to specify the extraction of complex structures from HTML pages; (2) a declarative mapping to various data formats like XML; (3) some visual tools to make the engineering of wrappers faster and easier.

    SEMI-SUPERVISED INFORMATION EXTRACTION FROM VARIABLE-LENGTH WEB-PAGE LISTS


    Integrating financial data over the Internet

    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999. Includes bibliographical references (leaves 65-66). This thesis examines the issues and value-added, from both the technical and economic perspective, of solving the information integration problem in the retail banking industry. In addition, we report on an implementation of a prototype for the Universal Banking Application using currently available technologies. We report on some of the issues we discovered and the suggested improvements for future work. By Howard W. Pan. M.Eng

    Entity Ranking in Wikipedia

    The traditional entity extraction problem is that of extracting named entities from plain text using natural language processing techniques and intensive training on large document collections. Examples of named entities include organisations, people, locations, or dates. There are many research activities involving named entities; we are interested in entity ranking in the field of information retrieval. In this paper, we describe our approach to identifying and ranking entities from the INEX Wikipedia document collection. We first introduce a number of interesting features that Wikipedia offers for entity identification and ranking. We then describe the principles and the architecture of our entity ranking system, and introduce our methodology for evaluation. Our preliminary results show that the use of categories and the link structure of Wikipedia, together with entity examples, can significantly improve retrieval effectiveness. Comment: to appea

    DETC2002/CIE-34462 WEB-BASED INNOVATION ALERT SERVICES TO SUPPORT PRODUCT DESIGN EVOLUTION

    Technological innovations provide an opportunity to improve product performance and reduce cost. Therefore, design organizations are interested in monitoring technological innovations. A large number of innovations are announced every year, and monitoring them manually is very time-consuming. We are developing web-based innovation-alert services that can be used to monitor and communicate information about innovations relevant to a particular product design. In this paper, we discuss the required infrastructure, relevant design issues, and our approach to developing web-based innovation-alert services to support product design evolution. We also describe a prototype innovation monitoring service for computer components and an interactive tool to transform semi-structured web contents into semantic representations in XML.

    Mujeres y universidad en El País (1977-2011): Una propuesta metodológica para el uso de las TIC en el análisis histórico

    The practice of historical research in recent years has been substantially affected by the emergence of the so-called digital humanities. New computer tools have been appearing: software systems capable of processing vast quantities of information in ways that until recently were inconceivable. Text mining and social network analysis techniques are sophisticated instruments that can help render a richer reading of the available data and draw useful conclusions. We reflect on this in the first part of this article, and then apply these tools to a practical case: quantifying and identifying the women who appear in university-related articles in the newspaper El País from its founding until 2011.

    ViDE: A Visual Data Extraction Environment for the Web


    Managing semantic content for the Web
