
30 research outputs found

Web Data Extraction, Applications and Techniques: A Survey

Author: Emilio Ferrara
Giacomo Fiumara
Pasquale De Meo
Robert Baumgartner
Publication venue: 'Elsevier BV'
Publication date: 09/06/2014

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims to provide a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for data analysis in Business and Competitive Intelligence systems, as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed for a given domain in other domains. Comment: Knowledge-based System

arXiv.org e-Print Archive

Crossref

Building Intelligent Web Applications Using Lightweight Wrappers

Author: Azavant Fabien
Sahuguet Arnaud
Publication venue: ScholarlyCommons
Publication date: 01/01/2000

The Web so far has been incredibly successful at delivering information to human users. So successful, in fact, that there is now an urgent need to go beyond a browsing human. Unfortunately, the Web is not yet a well-organized repository of nicely structured documents but rather a conglomerate of volatile HTML pages. To address this problem, we present the World Wide Web Wrapper Factory (W4F), a toolkit for generating wrappers for Web sources that offers: (1) an expressive language to specify the extraction of complex structures from HTML pages; (2) a declarative mapping to various data formats such as XML; (3) visual tools that make the engineering of wrappers faster and easier.

CiteSeerX

ScholarlyCommons@Penn

SEMI-SUPERVISED INFORMATION EXTRACTION FROM VARIABLE-LENGTH WEB-PAGE LISTS

Author
Publication venue: 'Scitepress'
Publication date: 01/01/2009

Crossref

Integrating financial data over the Internet

Author: Pan Howard W. (Howard Weihao), 1973-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1999

Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999. Includes bibliographical references (leaves 65-66). This thesis examines the issues and added value, from both the technical and economic perspectives, of solving the information integration problem in the retail banking industry. In addition, we report on an implementation of a prototype for the Universal Banking Application using currently available technologies. We report on some of the issues we discovered and on suggested improvements for future work. By Howard W. Pan. M.Eng

DSpace@MIT

Entity Ranking in Wikipedia

Author: Pehcevski Jovan
Thom James A.
Vercoustre Anne-Marie
Publication venue
Publication date: 01/01/2007

The traditional entity extraction problem lies in the ability to extract named entities from plain text using natural language processing techniques and intensive training on large document collections. Examples of named entities include organisations, people, locations, and dates. There are many research activities involving named entities; we are interested in entity ranking in the field of information retrieval. In this paper, we describe our approach to identifying and ranking entities from the INEX Wikipedia document collection. We first introduce a number of features of Wikipedia that are useful for entity identification and ranking. We then describe the principles and the architecture of our entity ranking system, and introduce our methodology for evaluation. Our preliminary results show that the use of categories and the link structure of Wikipedia, together with entity examples, can significantly improve retrieval effectiveness. Comment: to appea

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

RMIT Research Repository

DETC2002/CIE-34462 WEB-BASED INNOVATION ALERT SERVICES TO SUPPORT PRODUCT DESIGN EVOLUTION

Author: Alexander J Lo
Changxin Xu
Edward Lin
Satyandra K Gupta
Publication venue
Publication date: 24/04/2020

Technological innovations provide an opportunity to improve product performance and reduce cost. Therefore, design organizations are interested in monitoring technological innovations. A large number of innovations are announced every year, and monitoring them manually is very time-consuming. We are developing web-based innovation-alert services that can be used to monitor and communicate information about innovations relevant to a particular product design. In this paper, we discuss the required infrastructure, relevant design issues, and our approach to developing web-based innovation-alert services to support product design evolution. We also describe a prototype innovation monitoring service for computer components and an interactive tool to transform semi-structured web content into semantic representations in XML.

CiteSeerX

Mujeres y universidad en El País (1977-2011): Una propuesta metodológica para el uso de las TIC en el análisis histórico

Author: Bingham
Kirschenbaum
Olston
Seco
Publication venue: 'Editorial CSIC'
Publication date: 30/06/2019

The practice of historical research has been substantially affected in recent years by the emergence of the so-called digital humanities. New computer tools have appeared: software systems capable of processing vast quantities of information in ways that until recently were inconceivable. Text mining and social network analysis techniques are sophisticated instruments that can support a richer reading of the available data and help draw useful conclusions. We reflect on this in the first part of this article, and then apply these tools to a practical case: quantifying and identifying the women who appear in university-related articles in the newspaper El País from its founding until 2011.

Crossref

Culture & History Digital Journal

ViDE: A Visual Data Extraction Environment for the Web

Author: LI Yi
LIM Ee Peng
NG Wee-Keong
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/09/2001

Institutional Knowledge at Singapore Management University

Managing semantic content for the Web

Author: A. Sheth
B. Hammond
C. Bertram
D. Avant
K. Kochut
Y. Warke
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date

Crossref