813 research outputs found

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow the gathering of large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains.
    Comment: Knowledge-Based Systems

    Design of Automatically Adaptable Web Wrappers

    Nowadays, the huge amount of information distributed through the Web motivates the study of techniques for extracting relevant data in an efficient and reliable way. Both academia and enterprises have developed several approaches to Web data extraction, for example using techniques from artificial intelligence or machine learning. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision of the information extracted from Web pages and, at the same time, have to prove robust in order not to compromise the quality and reliability of the data themselves. In this paper we focus on some experimental aspects related to the robustness of the data extraction process and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for finding similarities between two different versions of a Web page, in order to handle modifications, avoiding the failure of data extraction tasks and ensuring the reliability of the extracted information. Our purpose is to evaluate the performance, advantages and drawbacks of our novel system for automatic wrapper adaptation
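    A minimal sketch of the wrapper-adaptation idea described in this abstract, not the authors' implementation: when a stored XPath no longer matches the current version of a page, the new DOM is scanned for the element most similar to the one the wrapper previously extracted. The similarity weights, the acceptance threshold and all helper names are illustrative assumptions.

```python
from difflib import SequenceMatcher

from lxml import html


def signature(el):
    """Features used to compare elements across page versions."""
    return (el.tag, el.get("class", ""), el.get("id", ""),
            el.text_content().strip()[:80])


def similarity(a, b):
    """Weighted mix of tag equality plus attribute and text similarity."""
    tag = 1.0 if a[0] == b[0] else 0.0
    attrs = SequenceMatcher(None, a[1] + a[2], b[1] + b[2]).ratio()
    text = SequenceMatcher(None, a[3], b[3]).ratio()
    return 0.3 * tag + 0.3 * attrs + 0.4 * text


def adapt_xpath(old_page, new_page, xpath, threshold=0.6):
    """Relocate the element selected by xpath in old_page inside new_page.

    Returns an XPath into new_page, or None if no element is similar
    enough (threshold is an assumed tolerance, not taken from the paper).
    """
    old_tree, new_tree = html.fromstring(old_page), html.fromstring(new_page)
    old_match = old_tree.xpath(xpath)
    if not old_match:
        return None
    target = signature(old_match[0])
    best, best_score = None, 0.0
    for el in new_tree.iter():
        if not isinstance(el.tag, str):  # skip comments and PIs
            continue
        score = similarity(target, signature(el))
        if score > best_score:
            best, best_score = el, score
    if best is not None and best_score >= threshold:
        return new_tree.getroottree().getpath(best)
    return None
```

    Matching on a combination of tag, attributes and text keeps the rule usable when only the layout shifts; if the content itself changes, the score falls below the threshold and the wrapper is reported as broken rather than silently extracting the wrong element.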

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
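    As a concrete illustration of the microdata point, the sketch below (not BlogForever code; the sample markup and names are invented) harvests schema.org itemprop attributes from blog markup into property/value pairs, preferring machine-readable attribute values over visible text.

```python
from lxml import html

# Invented sample of microdata-annotated blog markup (schema.org BlogPosting)
SAMPLE = """
<article itemscope itemtype="http://schema.org/BlogPosting">
  <h1 itemprop="headline">Post title</h1>
  <span itemprop="author">A. Blogger</span>
  <time itemprop="datePublished" datetime="2012-05-01">1 May 2012</time>
</article>
"""


def extract_microdata(page):
    """Map itemprop attributes to property/value pairs."""
    tree = html.fromstring(page)
    item = {}
    for el in tree.xpath("//*[@itemprop]"):
        # prefer machine-readable attribute values over the visible text
        value = el.get("datetime") or el.get("content") or el.text_content().strip()
        item[el.get("itemprop")] = value
    return item


print(extract_microdata(SAMPLE))
# {'headline': 'Post title', 'author': 'A. Blogger', 'datePublished': '2012-05-01'}
```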

    Self-supervised automated wrapper generation for weblog data extraction

    Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffered from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts, and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives
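    A sketch of the feed-to-HTML matching step under assumed inputs: each post is supplied as a (feed title, page HTML) pair, and the element path that most often contains the feed title across posts is kept as the title-extraction rule. The frequency ratio stands in for the paper's probabilistic score; the function names and the 0.5 threshold are illustrative, not the authors' code.

```python
from collections import Counter

from lxml import html


def derive_title_rule(posts, threshold=0.5):
    """posts: iterable of (feed_title, page_html) pairs for one weblog.

    Returns the element path that most often holds the feed title across
    posts, or None if no path is frequent enough (threshold is assumed).
    """
    paths = Counter()
    for feed_title, page in posts:
        tree = html.fromstring(page)
        for el in tree.iter():
            if not isinstance(el.tag, str):  # skip comments and PIs
                continue
            if el.text_content().strip() == feed_title.strip():
                # document order: first hit is the outermost matching element
                paths[tree.getroottree().getpath(el)] += 1
                break
    if not paths:
        return None
    path, hits = paths.most_common(1)[0]
    return path if hits / sum(paths.values()) >= threshold else None
```

    Repeating the same matching for author, date and content fields yields a full wrapper; because paths are voted on across many posts, occasional noisy pages do not corrupt the derived rules.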

    Web Data Extraction For Content Aggregation From E-Commerce Websites

    The World Wide Web has become an unlimited source of data, and search engines have made this information available to the everyday Internet user. Still, some information is not easily accessible through existing search engines, so there remains a need to create new search engines that present information better than before. In order to present data in a way that creates added value, the data must first be collected and then processed and analyzed. This master's thesis focuses on the data collection phase. We present ZedBot, a modern information extraction system that converts the semi-structured data found on Web pages into highly structured form. The system fulfils most of the requirements set for a modern data extraction system: it is platform independent, it provides a powerful rule description system with semi-automatic wrapper generation, and it offers an easy-to-use user interface for annotating data. A specially designed web crawler allows extraction to be performed over a whole website without human interaction. We show that the presented tool is suitable for extracting highly accurate data from a large number of websites and that the resulting datasets can be used as a data source for product aggregation and for creating new added value
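    For illustration, a rule-driven extractor of the kind such a system could apply to an e-commerce page might look like the sketch below. The field names and CSS selectors are invented for an assumed page layout; this is not the ZedBot implementation or its rule language.

```python
from lxml import html  # cssselect() also requires the cssselect package

RULES = {  # field -> CSS selector (assumed page layout)
    "name": "h1.product-title",
    "price": "span.price",
    "sku": "span.sku",
}


def extract_product(page_html, rules=RULES):
    """Apply a declarative rule set to one product page."""
    tree = html.fromstring(page_html)
    record = {}
    for field, selector in rules.items():
        nodes = tree.cssselect(selector)
        record[field] = nodes[0].text_content().strip() if nodes else None
    return record
```

    Keeping the rules as data rather than code is what makes such a system platform independent and lets a semi-automatic generator or an annotation interface produce them without touching the extractor itself.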
