Self-supervised automated wrapper generation for weblog data extraction
Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
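The feed-matching idea is easy to picture in code. Below is a minimal sketch, not the paper's implementation: it matches a known feed value (say, a post title from the RSS/Atom feed) against the text nodes of the corresponding HTML page and votes for the tag path that matches most often across posts; all function names are illustrative.

```python
# Minimal sketch of deriving an extraction rule by matching feed values
# against HTML elements; names are illustrative, not from the paper.
from collections import Counter
from bs4 import BeautifulSoup

def element_path(el):
    """Root-to-element tag path, used as a candidate extraction rule."""
    parts = []
    while getattr(el, "name", None) and el.name != "[document]":
        parts.append(el.name)
        el = el.parent
    return "/".join(reversed(parts))

def candidate_paths(feed_value, page_html):
    """Paths of elements whose text equals a known feed value (e.g. a post title)."""
    soup = BeautifulSoup(page_html, "html.parser")
    return {
        element_path(node.parent)
        for node in soup.find_all(string=True)
        if node.strip() == feed_value.strip()
    }

def derive_rule(feed_values, pages):
    """Vote across posts; the majority path becomes the rule (the probabilistic step)."""
    votes = Counter()
    for value, page_html in zip(feed_values, pages):
        votes.update(candidate_paths(value, page_html))
    path, hits = votes.most_common(1)[0]
    return path, hits / len(pages)  # the rule plus its empirical confidence
```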
Information extraction from template-generated hidden web documents
Much of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (such as Google and Yahoo). Such databases dynamically generate a list of documents in response to a user query and are referred to as Hidden Web databases. The documents are typically presented to users as template-generated Web pages. This paper presents a new approach that identifies Web page templates in order to extract query-related information from documents. We propose two forms of representation to analyse the content of a document: Text with Immediate Adjacent Tag Segments (TIATS) and Text with Neighbouring Adjacent Tag Segments (TNATS). Our techniques exploit the tag structures that surround the textual contents of documents in order to detect Web page templates and thereby extract query-related information. Experimental results demonstrate that TNATS detects Web page templates most effectively and extracts information with high recall and precision.
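As an illustration of such representations, here is a rough sketch of building a TNATS-like view of a document with BeautifulSoup: each text fragment is paired with the tags that surround it, so fragments whose (text, context) pair recurs unchanged across documents can be flagged as template content. The window size and context shape are assumptions of this sketch, not the paper's exact definitions.

```python
# Rough sketch of a TNATS-like representation: pair each text fragment
# with its neighbouring tag context. Context shape/window are assumptions.
from bs4 import BeautifulSoup

def tag_context(text_node, window=2):
    """Names of up to `window` ancestor tags plus the adjacent sibling tags."""
    ancestors = [
        a.name for a in list(text_node.parents)[:window]
        if a.name and a.name != "[document]"
    ]
    prev_el = text_node.parent.find_previous_sibling()
    next_el = text_node.parent.find_next_sibling()
    return (tuple(ancestors),
            prev_el.name if prev_el is not None else None,
            next_el.name if next_el is not None else None)

def text_segments(page_html):
    """(text, context) pairs; pairs repeated verbatim across pages suggest template text."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [(node.strip(), tag_context(node))
            for node in soup.find_all(string=True) if node.strip()]
```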
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction.
This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains.
Automatic supervised information extraction of structured web data
The overall purpose of this project is, in short, to create a system able to extract vital information from product web pages just as a human would: information like the name of the product, its description, price tag, the company that produces it, and so on. At first glance, this may not seem extraordinary or technically difficult, since web scraping techniques have existed for a long time (like the Python library Beautiful Soup, an HTML parser released in 2004). But let us think for a second about what it actually means to be able to extract desired information from any given web source: the way information is displayed can be extremely varied, not only visually, but also semantically. For instance, some hotel booking web pages display all prices for the different room types at once, while websites like Amazon present medium-sized consumer products with the main product in detail and then smaller product recommendations further down the page, the latter being the preferred way of displaying assets for most retail companies. And each comes with its own styling and search engines. With the above said, the task of mining valuable data from the web no longer sounds as easy as it first seemed. Hence the purpose of this project is to shine some light on the Automatic Supervised Information Extraction of Structured Web Data problem.
It is important to consider whether developing such a solution is really valuable at all. Such an endeavour, both in time and computing resources, should lead to a useful end result, at least on paper, to justify it. The opinion of this author is that it does lead to a potentially valuable result. The targeted extraction of publicly available consumer-oriented content at large scale, in an accurate, reliable and future-proof manner, could provide an incredibly useful and large amount of data. This data, if kept updated, could create endless opportunities for Business Intelligence, although exactly which ones is beyond the scope of this work. A simple metaphor explains the potential value of this work: if an oil company were told where all the oil reserves on the planet are, it would still need to invest in machinery, workers and time to successfully exploit them, but half of the job would already have been done.
As the reader will see in this work, the issue is tackled by building a somewhat complex architecture that ends in an Artificial Neural Network. A quick overview of this architecture is as follows: first, find the URLs that lead to the product pages containing the desired data within a given site (like URLs that lead to "action figure" products on the site ebay.com); second, for each URL, extract its HTML and take a screenshot of the page, then store this data in a suitable and scalable fashion (a sketch of these first two steps follows below); third, label the data that will be fed to the NN; fourth, prepare the aforementioned data to be input to the NN; fifth, train the NN; and sixth, deploy the NN to make [hopefully accurate] predictions.
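To make the first two steps concrete, here is a minimal sketch of fetching each product URL's HTML together with a full-page screenshot using Playwright; the storage layout and function names are assumptions of this illustration, not the project's actual code.

```python
# Minimal sketch of pipeline steps one and two: fetch each product page's
# HTML and a screenshot, stored side by side for later labelling.
# Storage layout and names are assumptions, not the project's actual code.
import json
from pathlib import Path
from playwright.sync_api import sync_playwright

def fetch_pages(product_urls, out_dir="dataset"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        for i, url in enumerate(product_urls):
            page.goto(url)
            (out / f"{i}.html").write_text(page.content(), encoding="utf-8")
            page.screenshot(path=str(out / f"{i}.png"), full_page=True)
            (out / f"{i}.json").write_text(json.dumps({"url": url}))
        browser.close()
```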
Sample-based XPath Ranking for Web Information Extraction
Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. The approach is a wrapper induction approach which uses a small and easily obtainable set of sample data for ranking XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute.
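A bare-bones version of the ranking step might look as follows; it is only a sketch under the assumption that candidates are plain XPath strings and samples are (page, expected value) pairs, with lxml standing in for whatever tooling the authors use.

```python
# Bare-bones sketch of sample-based XPath ranking: score each candidate
# XPath by how often it reproduces the known sample values. Illustrative only.
from lxml import html

def rank_xpaths(candidates, samples):
    """candidates: XPath strings; samples: (page_html, expected_value) pairs."""
    scored = []
    for xp in candidates:
        hits = 0
        for page_html, expected in samples:
            tree = html.fromstring(page_html)
            values = [v.strip() if isinstance(v, str) else v.text_content().strip()
                      for v in tree.xpath(xp)
                      if isinstance(v, str) or hasattr(v, "text_content")]
            hits += expected.strip() in values
        scored.append((hits / len(samples), xp))
    return sorted(scored, reverse=True)  # best-scoring XPath first
```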
Web Mail Information Extraction
This project is conducted to deliver the background of study, problem statements, objectives, scope, literature review, methodology of choice for the development process, results and discussion, conclusion, recommendations and references used throughout its completion. The objective of this project is to extract relevant and useful information from Google Mail (GMail) by performing Information Extraction (IE) using the Java programming language. After several rounds of testing, the system developed is able to successfully extract relevant and useful information from a GMail account, with emails coming from different folders such as All Mail, Inbox, Drafts, Starred, Sent Mail, Spam and Trash. The focus is to extract email information such as the sender, recipient, subject and content. The extracted information is presented in two mediums, as a text file or stored inside a database, in order to better suit users who come from different backgrounds and have different needs.
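The project itself is implemented in Java; purely as an illustration of the same folder-by-folder header extraction, here is a minimal Python sketch using the standard imaplib and email modules (credentials and folder names are placeholders).

```python
# Illustration only: the project uses Java, but the same sender/recipient/
# subject extraction can be sketched with Python's imaplib and email modules.
import imaplib
from email import message_from_bytes
from email.header import decode_header, make_header

def extract_headers(user, app_password, folder="INBOX"):
    """Yield (sender, recipient, subject) for every message in `folder`."""
    imap = imaplib.IMAP4_SSL("imap.gmail.com")
    imap.login(user, app_password)        # placeholder credentials
    imap.select(folder, readonly=True)    # e.g. '"[Gmail]/Sent Mail"'
    _, data = imap.search(None, "ALL")
    for num in data[0].split():
        _, parts = imap.fetch(num, "(RFC822.HEADER)")
        msg = message_from_bytes(parts[0][1])
        subject = str(make_header(decode_header(msg["Subject"] or "")))
        yield msg["From"], msg["To"], subject
    imap.logout()
```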
Design of Automatically Adaptable Web Wrappers
Nowadays, the huge amount of information distributed through the Web motivates studying techniques to be adopted in order to extract relevant data in an efficient and reliable way. Both academia and enterprises have developed several approaches to Web data extraction, for example using techniques from artificial intelligence or machine learning. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision of the information extracted from Web pages and, at the same time, have to prove robust in order not to compromise the quality and reliability of the data themselves.
In this paper we focus on some experimental aspects related to the robustness of the data extraction process and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for finding similarities between two different versions of a Web page, in order to handle modifications, avoiding the failure of data extraction tasks and ensuring the reliability of the extracted information. Our purpose is to evaluate the performance, advantages and drawbacks of our novel system of automatic wrapper adaptation.
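The adaptation idea can be sketched in a few lines: when a wrapper's XPath no longer matches the new version of a page, look for the element in the new version that is most similar to what the old wrapper used to extract, and re-derive the rule from it. The sketch below uses difflib's string similarity as a stand-in for the paper's own similarity algorithms; names are illustrative.

```python
# Sketch of automatic wrapper adaptation: relocate the most similar element
# in the new page version when the old XPath fails. difflib similarity is a
# stand-in for the paper's own algorithms; names are illustrative.
from difflib import SequenceMatcher
from lxml import html

def adapt_xpath(old_xpath, old_page, new_page):
    tree_old = html.fromstring(old_page)
    tree_new = html.fromstring(new_page)
    if tree_new.xpath(old_xpath):          # wrapper still works: nothing to adapt
        return old_xpath
    target = tree_old.xpath(old_xpath)[0].text_content().strip()
    candidates = [el for el in tree_new.iter() if isinstance(el.tag, str)]
    best = max(candidates,
               key=lambda el: SequenceMatcher(None, target,
                                              el.text_content().strip()).ratio())
    return tree_new.getroottree().getpath(best)  # new absolute XPath for the element
```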