Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool for performing data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather the large amounts of structured data continuously generated
and disseminated by Web 2.0, Social Media and Online Social Network users,
which offers unprecedented opportunities to analyze human behavior at a very
large scale. We also discuss the potential for cross-fertilization, i.e., the
possibility of reusing Web Data Extraction techniques originally designed for
one domain in other domains.
Comment: Knowledge-based System
Harvesting Entities from the Web Using Unique Identifiers -- IBEX
In this paper we study the prevalence of unique entity identifiers on the
Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs
(for documents), email addresses, and others. We show how these identifiers can
be harvested systematically from Web pages, and how they can be associated with
human-readable names for the entities at large scale.
Starting with a simple extraction of identifiers and names from Web pages, we
show how we can use the properties of unique identifiers to filter out noise
and clean up the extraction result on the entire corpus. The end result is a
database of millions of uniquely identified entities of different types, with
an accuracy of 73--96% and a very high coverage compared to existing knowledge
bases. We use this database to compute novel statistics on the presence of
products, people, and other entities on the Web.
Comment: 30 pages, 5 figures, 9 tables. Complete technical report for A.
Talaika, J. A. Biega, A. Amarilli, and F. M. Suchanek. IBEX: Harvesting
Entities from the Web Using Unique Identifiers. WebDB workshop, 201
Semantic and Syntactic Matching of Heterogeneous e-Catalogues
In e-procurement, companies use e-catalogues to exchange product information with business partners. Matching e-catalogues with product requests helps suppliers identify the best business opportunities in B2B e-Marketplaces. But the various ways of specifying products and the large variety of e-catalogue formats used by different business actors make this difficult.
This Ph.D. thesis aims to discover potential syntactic and semantic relationships among product data in procurement documents and exploit them to find similar e-catalogues. Using a Concept-based Vector Space Model, product data and its semantic interpretation are used to measure the correlation between product descriptions. In order to identify important terms in procurement documents, standard e-catalogues and e-tenders are used as a resource to train a Product Named Entity Recognizer that finds B2B product mentions in e-catalogues.
The proposed approach makes it possible to use the benefits of all available semantic resources and schemas without depending on any specific assumption. The solution can serve as a B2B product search system in e-Procurement platforms and e-Marketplaces.
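The core of a concept-based vector space model is that surface terms are first mapped to shared concepts, so that two catalogues using different vocabulary (e.g. "laptop" vs. "notebook") still produce similar vectors. A minimal sketch of that idea follows; the toy `CONCEPTS` lexicon and function names are illustrative assumptions, not the thesis's actual resources.

```python
import math
from collections import Counter

# Toy lexicon mapping surface terms to concepts (illustrative assumption).
CONCEPTS = {
    "laptop": "computer", "notebook": "computer", "pc": "computer",
    "printer": "printer", "inkjet": "printer",
}

def concept_vector(text: str) -> Counter:
    """Map each known term to its concept and count concept occurrences."""
    return Counter(CONCEPTS[w] for w in text.lower().split() if w in CONCEPTS)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Here `cosine(concept_vector("laptop notebook"), concept_vector("pc"))` is 1.0 even though the texts share no terms, because both map onto the same "computer" concept; a purely term-based vector model would score them 0.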