CSM-399 - Providing Robust Access to Data in Web Pages

Robinson, Jerome


Authors: Jerome Robinson
Publication date: 1 January 2004
Publisher: CSM-399

Abstract

Much useful e-commerce information is available on web pages, especially those generated by queries to web servers. For a program to use that information, the data must be ‘screen-scraped’ off the web page into machine-usable data structures. Wrappers for web data sources rely on knowledge of the page layout to extract data accurately, so they fail if the page format changes. This paper describes a fast method for wrapper production and a method for automatically detecting page format change before it causes data access to fail. The method works for pages that contain collections of items, such as lists, tables and hierarchical structures. It uses a representation of HTML documents that makes repetitive features apparent. This provides fully automatic wrapper production for one class of web pages, and rapid interactive production for others.
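The abstract does not specify the representation the paper uses, so the following is only a minimal sketch of the general idea, not the author's method. It assumes that a page can be summarised by counting root-to-element tag paths, that paths occurring more than once mark a collection of items, and that a change in the set of repeated paths signals a format change. All names (`TagPathCounter`, `tag_paths`, `repeated_paths`, `format_changed`) and the sample pages are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCounter(HTMLParser):
    """Counts how often each root-to-element tag path occurs in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, outermost first
        self.paths = Counter()   # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return a Counter of tag paths for the given HTML text."""
    parser = TagPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def repeated_paths(html, min_count=2):
    """Keep only paths that repeat, i.e. candidates for item collections."""
    return {p: n for p, n in tag_paths(html).items() if n >= min_count}


def format_changed(reference_html, current_html):
    """Assumed heuristic: the layout changed if the repeated paths differ."""
    return set(repeated_paths(reference_html)) != set(repeated_paths(current_html))


# Hypothetical sample pages: same layout with new data, then a new layout.
old_page = "<table><tr><td>A</td><td>1</td></tr><tr><td>B</td><td>2</td></tr></table>"
new_data = "<table><tr><td>C</td><td>3</td></tr><tr><td>D</td><td>4</td></tr><tr><td>E</td><td>5</td></tr></table>"
new_layout = "<ul><li>A: 1</li><li>B: 2</li></ul>"

print(repeated_paths(old_page))
print(format_changed(old_page, new_data))
print(format_changed(old_page, new_layout))
```

Comparing only the set of repeated paths, rather than their counts, keeps the check insensitive to the number of items returned by a query. A page that returns a single item would still look different under this heuristic, so a real detector would need a more tolerant comparison.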


Available Versions

University of Essex Research Repository

oai:repository.essex.ac.uk:868...

Last time updated on 09/03/2014