23,717 research outputs found
Automatic Wrapper Adaptation by Tree Edit Distance Matching
Information distributed through the Web keeps growing faster day by day,
and for this reason several techniques for extracting Web data have been suggested
in recent years. Extraction tasks are often performed by so-called wrappers,
procedures that extract information from Web pages, e.g. by implementing logic-based
techniques. Many fields of application today require wrappers to be highly robust,
so as not to compromise information assets or the reliability of the extracted
data.
Unfortunately, a wrapper may fail to extract data from a Web page if
the page's structure changes, sometimes even slightly. New techniques are then required
to adapt the wrapper automatically to the new structure
of the page in case of failure. In this work we present a novel approach to automatic wrapper adaptation based on measuring the similarity of trees through
improved tree edit distance matching techniques.
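To make the idea of tree similarity concrete, here is a minimal sketch of the classic Simple Tree Matching algorithm, which counts the largest set of matching nodes between two ordered trees. It is not the improved technique the abstract refers to, whose details are not given here. The `Node` class, the function names, and the normalisation by average tree size are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM-like node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving matching."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them is matched
    m, n = len(a.children), len(b.children)
    # best[i][j] = best matching between the first i children of a
    # and the first j children of b (dynamic programming, like LCS).
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(best[i - 1][j], best[i][j - 1], best[i - 1][j - 1] + pair)
    return best[m][n] + 1  # +1 for the matched roots


def size(t: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Matching size divided by the average tree size; 1.0 means identical trees."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))
```

A wrapper adaptation step could, under these assumptions, compare the subtree a wrapper used to select against candidate subtrees in the changed page and pick the one with the highest similarity score.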
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather a large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users and this
offers unprecedented opportunities to analyze human behavior at a very large
scale. We also discuss the potential of cross-fertilization, i.e., the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains.
Comment: Knowledge-based System
Use of Wikipedia Categories in Entity Ranking
Wikipedia is a useful source of knowledge that has many applications in
language processing and knowledge representation. The Wikipedia category graph
can be compared with the class hierarchy in an ontology; it has some
characteristics in common as well as some differences. In this paper, we
present our approach for answering entity ranking queries from Wikipedia.
In particular, we explore how to make use of Wikipedia categories to improve
entity ranking effectiveness. Our experiments show that using categories of
example entities works significantly better than using loosely defined target
categories.
Sustaining Economic Exploitation of Complex Ecosystems in Computational Models of Coupled Human-Natural Networks
Understanding ecological complexity has stymied scientists for decades. Recent elucidation of the famously coined "devious strategies for stability in enduring natural systems" has opened up a new field of computational analyses of complex ecological networks where the nonlinear dynamics of many interacting species can be more realistically modeled and understood. Here, we describe the first extension of this field to include coupled human-natural systems. This extension elucidates new strategies for sustaining extraction of biomass (e.g., fish, forests, fiber) from ecosystems that account for ecological complexity and can pursue multiple goals such as maximizing economic profit, employment and carbon sequestration by ecosystems. Our more realistic modeling of ecosystems helps explain why simpler "maximum sustainable yield" bioeconomic models underpinning much natural resource extraction policy lead to less profit, biomass, and biodiversity than predicted by those simple models. Current research directions of this integrated natural and social science include applying artificial intelligence, cloud computing, and multiplayer online games.
Text Extraction from Web Images Based on A Split-and-Merge Segmentation Method Using Color Perception
This paper describes a complete approach to the segmentation and extraction of text from Web images for subsequent recognition, to ultimately achieve both effective indexing and presentation by non-visual means (e.g., audio). The method described here (the first in the authors’ systematic approach to exploit human colour perception) enables the extraction of text in complex situations such as in the presence of varying colour (characters and background). More precisely, in addition to using structural features, the segmentation follows a split-and-merge strategy based on the Hue-Lightness-Saturation (HLS) representation of colour as a first approximation of an anthropocentric expression of the differences in chromaticity and lightness. Character-like components are then extracted as forming textlines in a number of orientations and along curves.
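As a rough illustration of how an HLS representation can drive a merge decision, the sketch below converts RGB pixels to HLS with Python's standard `colorsys` module and tests whether two colours are close in hue and lightness. The function names and the two thresholds are hypothetical and are not taken from the paper, which does not state its merge criteria here.

```python
import colorsys


def to_hls(rgb):
    """Convert an (R, G, B) tuple in 0..255 to (hue, lightness, saturation) in 0..1."""
    r, g, b = (c / 255.0 for c in rgb)
    return colorsys.rgb_to_hls(r, g, b)


def hue_distance(h1, h2):
    """Distance between two hues on the colour circle (hue wraps around at 1.0)."""
    d = abs(h1 - h2)
    return min(d, 1.0 - d)


def similar_colour(rgb_a, rgb_b, max_hue=0.05, max_light=0.10):
    """Hypothetical merge test: True if both hue and lightness differences are small.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    ha, la, _ = to_hls(rgb_a)
    hb, lb, _ = to_hls(rgb_b)
    return hue_distance(ha, hb) <= max_hue and abs(la - lb) <= max_light
```

Hue becomes unreliable for nearly grey pixels, so a fuller version would probably fall back to lightness alone when saturation is low; this sketch leaves that out for brevity.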
CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information
Open Information Extraction (OpenIE) methods extract (noun phrase, relation
phrase, noun phrase) triples from text, resulting in the construction of large
Open Knowledge Bases (Open KBs). The noun phrases (NPs) and relation phrases in
such Open KBs are not canonicalized, leading to the storage of redundant and
ambiguous facts. Recent research has posed canonicalization of Open KBs as
clustering over manually defined feature spaces. Manual feature engineering is
expensive and often sub-optimal. In order to overcome this challenge, we
propose Canonicalization using Embeddings and Side Information (CESI) - a novel
approach which performs canonicalization over learned embeddings of Open KBs.
CESI extends recent advances in KB embedding by incorporating relevant NP and
relation phrase side information in a principled manner. Through extensive
experiments on multiple real-world datasets, we demonstrate CESI's
effectiveness.
Comment: Accepted at WWW 201