Self-supervised automated wrapper generation for weblog data extraction

A. Laender; B. Adelberg; C. Kohlschütter; I. Muslea; N. Kushmerick; P. Geibel; R. Baumgartner

Publication date: 1 January 2013
Publisher: Springer Science and Business Media LLC

Abstract

Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
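The core idea in the abstract, matching feed values against HTML elements across several posts and keeping the rule that matches most consistently, can be illustrated with a minimal sketch. This is not the authors' model: the rule representation (tag name plus class attribute), the exact-text matching, and the frequency score used in place of the paper's probabilistic scoring are all simplifying assumptions, and the names `TextCollector`, `derive_rule`, and `apply_rule` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TextCollector(HTMLParser):
    """Collects (rule, text) pairs, where rule = (tag, class attribute).

    Assumption: a "rule" is just the innermost enclosing tag and its class.
    """

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as (tag, class) tuples
        self.found = []  # (rule, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and self.stack:
            self.found.append((self.stack[-1], text))


def elements(html):
    """Return every (rule, text) pair found in an HTML document."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.found


def derive_rule(posts):
    """Pick the rule that locates the feed value in the most posts.

    posts: iterable of (feed_value, html) pairs, one per weblog post.
    Returns (rule, score), where score is the fraction of posts in which
    the rule matched, or None if no element ever matched a feed value.
    """
    votes = Counter()
    total = 0
    for feed_value, html in posts:
        total += 1
        target = " ".join(feed_value.split())
        # A set ensures each distinct rule votes at most once per post.
        votes.update({rule for rule, text in elements(html) if text == target})
    if not votes:
        return None
    rule, count = votes.most_common(1)[0]
    return rule, count / total


def apply_rule(rule, html):
    """Extract the text of every element matching the rule in a new post."""
    return [text for r, text in elements(html) if r == rule]
```

In this sketch, `derive_rule` would be called with the title (or another field) taken from each feed entry together with the HTML of the post that entry links to, and `apply_rule` would then be run on posts that are no longer present in the feed. A rule that matches the feed value by coincidence in a single post, such as a navigation link repeating the title, receives a lower score than one that matches in every post.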

Available Versions

Warwick Research Archives Portal Repository

oai:wrap.warwick.ac.uk:59173

Last time updated on 25/02/2014

Crossref

info:doi/10.1007%2F978-3-642-3...

Last time updated on 27/02/2019