The Itemizer: From Search Results to Spreadsheets

Abstract

We propose an algorithm for extracting fields from search results. The algorithm outputs a database table, a data structure that is easy to manipulate. It combines tree alignment and string alignment to match semantically related data across individual search results. Applications of the approach include hidden web crawling, semantic tagging, and federated search. We explain how tree alignment works and show how Support Vector Machines are used to learn tree alignment cost parameters from unlabelled data. We then explain how to use tree alignment to find fields. Finally, we show how string distance is exploited in several ways to refine the structure of the output tables.
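
The abstract does not say which string distance the paper uses. As a rough illustration only, the sketch below assumes plain Levenshtein edit distance with unit costs, plus a length-normalized variant for comparing two cell values that might belong to the same table column. The function names `edit_distance` and `normalized_distance` are hypothetical and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit insert/delete/substitute costs.

    Assumption: the paper's actual string distance is not specified in
    the abstract; this is a generic stand-in.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

Under this assumption, two price strings such as `"$12.99"` and `"$15.99"` are one substitution apart, so a low normalized distance could serve as weak evidence that both values belong in the same field.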

Year: 2009
OAI identifier: oai:CiteSeerX.psu:10.1.1.134.4157
Provided by: CiteSeerX