305,859 research outputs found
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provided a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques allow
to gather a large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users and this
offers unprecedented opportunities to analyze human behavior at a very large
scale. We discuss also the potential of cross-fertilization, i.e., on the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains.Comment: Knowledge-based System
Ontology: A Linked Data Hub for Mathematics
In this paper, we present an ontology of mathematical knowledge concepts that
covers a wide range of the fields of mathematics and introduces a balanced
representation between comprehensive and sensible models. We demonstrate the
applications of this representation in information extraction, semantic search,
and education. We argue that the ontology can be a core of future integration
of math-aware data sets in the Web of Data and, therefore, provide mappings
onto relevant datasets, such as DBpedia and ScienceWISE.Comment: 15 pages, 6 images, 1 table, Knowledge Engineering and the Semantic
Web - 5th International Conferenc
Fusing Data with Correlations
Many applications rely on Web data and extraction systems to accomplish
knowledge-driven tasks. Web information is not curated, so many sources provide
inaccurate, or conflicting information. Moreover, extraction systems introduce
additional noise to the data. We wish to automatically distinguish correct data
and erroneous data for creating a cleaner set of integrated data. Previous work
has shown that a na\"ive voting strategy that trusts data provided by the
majority or at least a certain number of sources may not work well in the
presence of copying between the sources. However, correlation between sources
can be much broader than copying: sources may provide data from complementary
domains (\emph{negative correlation}), extractors may focus on different types
of information (\emph{negative correlation}), and extractors may apply common
rules in extraction (\emph{positive correlation, without copying}). In this
paper we present novel techniques modeling correlations between sources and
applying it in truth finding.Comment: Sigmod'201
Mining the Web for Lexical Knowledge to Improve Keyphrase Extraction: Learning from Labeled and Unlabeled Data.
A journal article is often accompanied by a list of keyphrases, composed of about five to fifteen important words and phrases that capture the articles main topics. Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. Good performance on this task has been obtained by approaching it as a supervised learning problem. An input document is treated as a set of candidate phrases that must be classified as either keyphrases or non-keyphrases. To classify a candidate phrase as a keyphrase, the most important features (attributes) appear to be the frequency and location of the candidate phrase in the document. Recent work has demonstrated that it is also useful to know the frequency of the candidate phrase as a manually assigned keyphrase for other documents in the same domain as the given document (e.g., the domain of computer science). Unfortunately, this keyphrase-frequency feature is domain-specific (the learning process must be repeated for each new domain) and training-intensive (good performance requires a relatively large number of training documents in the given domain, with manually assigned keyphrases). The aim of the work described here is to remove these limitations. In this paper, I introduce new features that are conceptually related to keyphrase-frequency and I present experiments that show that the new features result in improved keyphrase extraction, although they are neither domain-specific nor training-intensive. The new features are generated by issuing queries to a Web search engine, based on the candidate phrases in the input document. The feature values are calculated from the number of hits for the queries (the number of matching Web pages). In essence, these new features are derived by mining lexical knowledge from a very large collection of unlabeled data, consisting of approximately 350 million Web pages without manually assigned keyphrases
User driven information extraction with LODIE
Information Extraction (IE) is the technique for transforming unstructured or semi-structured data into structured representation
that can be understood by machines. In this paper we use a user-driven
Information Extraction technique to wrap entity-centric Web pages. The
user can select concepts and properties of interest from available Linked
Data. Given a number of websites containing pages about the concepts of
interest, the method will exploit (i) recurrent structures in the Web pages
and (ii) available knowledge in Linked data to extract the information
of interest from the Web pages
- …