Query-related data extraction of hidden web documents

Hedley, Y.; Younas, M.; James, A.; Sanderson, M.

oai:eprints.whiterose.ac.uk:4545

Authors: Y. Hedley, M. Younas, A. James, M. Sanderson
Publication date: 1 January 2004
Publisher: ACM

Abstract

The larger part of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (e.g., Google and Yahoo). Such information is dynamically generated by querying these databases, which are referred to as Hidden Web databases. Documents returned in response to a user query are typically presented using template-generated Web pages. This paper proposes a novel approach that identifies Web page templates by analysing the textual contents and the adjacent tag structures of a document in order to extract query-related data. Preliminary results demonstrate that our approach effectively detects templates and retrieves data with high recall and precision.
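
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes that a text segment which recurs with the same adjacent tags on every result page belongs to the template, and that the remaining segments are query-related data. The names SegmentParser, segments_of and extract_query_related, the min_share parameter, and the sample pages are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class SegmentParser(HTMLParser):
    """Collect (preceding tag, text, following tag) triples from one page."""

    def __init__(self):
        super().__init__()
        self.segments = []     # completed triples
        self._last_tag = None  # most recent tag seen
        self._pending = None   # (preceding tag, text) waiting for its following tag

    def _close_pending(self, tag):
        # A tag ends the current text segment, if there is one.
        if self._pending is not None:
            before, text = self._pending
            self.segments.append((before, text, tag))
            self._pending = None
        self._last_tag = tag

    def handle_starttag(self, tag, attrs):
        self._close_pending("<" + tag + ">")

    def handle_endtag(self, tag):
        self._close_pending("</" + tag + ">")

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if not text:
            return
        if self._pending is None:
            self._pending = (self._last_tag, text)
        else:
            before, earlier = self._pending
            self._pending = (before, earlier + " " + text)

    def close(self):
        super().close()
        self._close_pending(None)  # flush trailing text, if any


def segments_of(html):
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    return parser.segments


def extract_query_related(pages, min_share=1.0):
    """Return, per page, the text segments not shared across pages.

    Assumption: a segment (text plus adjacent tags) found on at least
    min_share of the pages is treated as template content.
    """
    per_page = [segments_of(page) for page in pages]
    counts = Counter()
    for segs in per_page:
        counts.update(set(segs))  # count each segment once per page
    threshold = min_share * len(pages)
    template = {seg for seg, n in counts.items() if n >= threshold}
    return [[seg[1] for seg in segs if seg not in template] for segs in per_page]


# Two made-up result pages that share a heading and a navigation link.
page_a = "<html><body><h1>Search results</h1><p>Heart disease overview</p><a>Next page</a></body></html>"
page_b = "<html><body><h1>Search results</h1><p>Asthma treatment guide</p><a>Next page</a></body></html>"
results = extract_query_related([page_a, page_b])
```

In this sketch the shared heading and link are discarded as template content, and only the paragraph text that differs between the two pages is kept. Comparing text together with its neighbouring tags, rather than text alone, is the design choice that mirrors the abstract's mention of adjacent tag structures.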

Last time updated on 28/06/2012

This paper was published in White Rose Research Online.
