Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather a large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users, which
offers unprecedented opportunities to analyze human behavior at a very large
scale. We also discuss the potential of cross-fertilization, i.e., the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains. Comment: Knowledge-based System
Building Intelligent Web Applications Using Lightweight Wrappers
The Web so far has been incredibly successful at delivering information to human users. So successful, in fact, that there is now an urgent need to go beyond a browsing human. Unfortunately, the Web is not yet a well-organized repository of nicely structured documents but rather a conglomerate of volatile HTML pages. To address this problem, we present the World Wide Web Wrapper Factory (W4F), a toolkit for the generation of wrappers for Web sources that offers: (1) an expressive language to specify the extraction of complex structures from HTML pages; (2) a declarative mapping to various data formats like XML; (3) some visual tools to make the engineering of wrappers faster and easier.
Integrating financial data over the Internet
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999. Includes bibliographical references (leaves 65-66). This thesis examines the issues and value-added, from both the technical and economic perspectives, of solving the information integration problem in the retail banking industry. In addition, we report on an implementation of a prototype for the Universal Banking Application using currently available technologies. We report on some of the issues we discovered and the suggested improvements for future work. By Howard W. Pan. M.Eng
Entity Ranking in Wikipedia
The traditional entity extraction problem is that of extracting
named entities from plain text using natural language processing techniques and
intensive training from large document collections. Examples of named entities
include organisations, people, locations, or dates. There are many research
activities involving named entities; we are interested in entity ranking in the
field of information retrieval. In this paper, we describe our approach to
identifying and ranking entities from the INEX Wikipedia document collection.
Wikipedia offers a number of interesting features for entity identification and
ranking that we first introduce. We then describe the principles and the
architecture of our entity ranking system, and introduce our methodology for
evaluation. Our preliminary results show that the use of categories and the
link structure of Wikipedia, together with entity examples, can significantly
improve retrieval effectiveness. Comment: to appea
DETC2002/CIE-34462 WEB-BASED INNOVATION ALERT SERVICES TO SUPPORT PRODUCT DESIGN EVOLUTION
Technological innovations provide an opportunity to improve product performance and reduce cost. Therefore, design organizations are interested in monitoring technological innovations. A large number of innovations are announced every year. Monitoring them manually is very time-consuming. We are developing web-based innovation-alert services that can be used to monitor and communicate information about innovations relevant to a particular product design. In this paper, we discuss the required infrastructure, relevant design issues, and our approach to developing web-based innovation alert services to support product design evolution. We also describe a prototype innovation monitoring service for computer components and an interactive tool to transform semi-structured web contents into semantic representations in XML.
Women and University in El País (1977-2011): A Methodological Proposal for the Use of ICT in Historical Analysis
The practice of historical research in recent years has been substantially affected by the emergence of the so-called digital humanities. New computer tools have been appearing, software systems capable of processing vast quantities of information in ways that until recently were inconceivable. Text mining and social network analysis techniques are sophisticated instruments that can help render a more enriching reading of the available data and draw useful conclusions. We reflect on this in the first part of this article, and then apply these tools to a practical case: quantifying and identifying the women who appear in university-related articles in the newspaper El País from its founding until 2011.