Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages

By Mathias Géry and Jean-Pierre Chevallet

Abstract

The World Wide Web is a distributed, heterogeneous, and semi-structured information space. As the volume of available data grows, retrieving interesting information is becoming quite difficult, and classical search engines often give very poor results. The Web is changing very quickly, yet search engines mainly rely on old and well-known IR techniques. One of the main problems is the lack of explicit structure in HTML pages and, more generally, in Web sites. We show in this paper that it is possible to extract such a structure, which can be explicit or implicit: hypertext links between pages, implicit relations between pages, HTML tags describing structure, etc. We present some preliminary results of an analysis of a Web sample that extracts several levels of structure (a hierarchical tree structure, a graph-like structure).
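
As a rough illustration of the two explicit structure sources the abstract names (HTML tags describing structure, and hypertext links between pages), the sketch below collects a heading outline and the outgoing links of a single page. It is not the authors' method: the class `StructureExtractor`, the function `extract_structure`, and the choice of `h1` to `h6` tags as the structural signal are all assumptions made for this example, and it uses only Python's standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StructureExtractor(HTMLParser):
    """Hypothetical helper: gathers a heading outline and outgoing links."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outline = []   # (heading level, heading text): tree-like structure
        self.links = []     # absolute target URLs: edges of a page graph
        self._level = None  # level of the heading currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._text = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def extract_structure(html, base_url):
    """Return (outline, links) for one page of HTML text."""
    parser = StructureExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.outline, parser.links
```

Running `extract_structure` over every page of a sample would give one outline per page and a list of link targets per page; the former approximates a hierarchical view, the latter the edges of a graph-like view. Implicit relations between pages, which the abstract also mentions, are outside the scope of this sketch.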

Topics: Web Information Retrieval, Web Pages Analysis, Structure Extraction, Statistics
Year: 2001
OAI identifier: oai:CiteSeerX.psu:10.1.1.135.6058
Provided by: CiteSeerX