53,999 research outputs found
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provided a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques allow
to gather a large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users and this
offers unprecedented opportunities to analyze human behavior at a very large
scale. We discuss also the potential of cross-fertilization, i.e., on the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains.Comment: Knowledge-based System
Incorporating the knowledge management cycle in e-business
In e-business, knowledge can be extracted from the recorded information by intelligent data analysis and then utilised in the business transaction. E-knowledge is a foundation for e-business. E-business can be supported by an intelligent information system that provides intelligent business process support and advanced support of the e-knowledge management cycle. Knowledge is stored as knowledge models that can be updated in the e-knowledge management cycle. As illustrated in examples, the e-knowledge cycle aids in the business decision taking, production management, and costs management
Wrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use
of wrappers for extracting data from Web sources. While most of the previous
research has focused on quick and efficient generation of wrappers, the
development of tools for wrapper maintenance has received less attention. This
is an important research problem because Web sources often change in ways that
prevent the wrappers from extracting data correctly. We present an efficient
algorithm that learns structural information about data from positive examples
alone. We describe how this information can be used for two wrapper maintenance
applications: wrapper verification and reinduction. The wrapper verification
system detects when a wrapper is not extracting correct data, usually because
the Web source has changed its format. The reinduction algorithm automatically
recovers from changes in the Web source by identifying data on Web pages so
that a new wrapper may be generated for this source. To validate our approach,
we monitored 27 wrappers over a period of a year. The verification algorithm
correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes,
resulting in precision of 0.73 and recall of 0.95. We validated the reinduction
algorithm on ten Web sources. We were able to successfully reinduce the
wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data
extraction task
Automatically assembling a full census of an academic field
The composition of the scientific workforce shapes the direction of
scientific research, directly through the selection of questions to
investigate, and indirectly through its influence on the training of future
scientists. In most fields, however, complete census information is difficult
to obtain, complicating efforts to study workforce dynamics and the effects of
policy. This is particularly true in computer science, which lacks a single,
all-encompassing directory or professional organization. A full census of
computer science would serve many purposes, not the least of which is a better
understanding of the trends and causes of unequal representation in computing.
Previous academic census efforts have relied on narrow or biased samples, or on
professional society membership rolls. A full census can be constructed
directly from online departmental faculty directories, but doing so by hand is
prohibitively expensive and time-consuming. Here, we introduce a topical web
crawler for automating the collection of faculty information from web-based
department rosters, and demonstrate the resulting system on the 205
PhD-granting computer science departments in the U.S. and Canada. This method
constructs a complete census of the field within a few minutes, and achieves
over 99% precision and recall. We conclude by comparing the resulting 2017
census to a hand-curated 2011 census to quantify turnover and retention in
computer science, in general and for female faculty in particular,
demonstrating the types of analysis made possible by automated census
construction.Comment: 11 pages, 6 figures, 2 table
Web Mining Functions in an Academic Search Application
This paper deals with Web mining and the different categories of Web mining like content, structure and usage mining. The application of Web mining in an academic search application has been discussed. The paper concludes with open problems related to Web mining. The present work can be a useful input to Web users, Web Administrators in a university environment.Database, HITS, IR, NLP, Web mining
- …