
Automatically Learning Gazetteers from the Deep Web

By Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart and Cheng Wang

Abstract

Wrapper induction faces a dilemma: to reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain-specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records rather than relying on the noisy structure of the DOM. With this approach, AMBER is able to identify records and their attributes with almost perfect accuracy (> 98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in 4 iterations: from a small seed sample, we achieve 94.4% accuracy in recognizing UK locations in the 4th iteration.
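The bootstrapping idea in the abstract can be illustrated with a minimal sketch. This is not AMBER's actual algorithm, which the abstract does not specify; it assumes that records have already been extracted as tuples of text fields, that an attribute corresponds to a field position, and that a position is trusted once a fixed share of its values is already in the gazetteer. The function name `learn_gazetteer`, the `support` threshold, and the sample data are all hypothetical.

```python
def learn_gazetteer(seed, pages, support=0.5, max_iterations=4):
    """Grow a gazetteer from a seed set (illustrative only, not AMBER's method).

    pages: list of result pages; each page is a list of records, and each
           record is a tuple of text fields in a consistent order.
    support: assumed minimum share of known values in a field position
             before the remaining values at that position are accepted.
    """
    gazetteer = {term.strip().lower() for term in seed}
    for _ in range(max_iterations):
        learned = set()
        for records in pages:
            if not records:
                continue
            # Only compare positions that every record on the page has.
            width = min(len(record) for record in records)
            for position in range(width):
                values = [record[position].strip().lower() for record in records]
                known = sum(value in gazetteer for value in values)
                # Repeated structure: if enough values at this position are
                # known terms, treat the unknown ones as new terms.
                if known / len(values) >= support:
                    learned.update(v for v in values if v not in gazetteer)
        if not learned:
            break  # nothing new was learned, so stop early
        gazetteer |= learned
    return gazetteer


# Hypothetical sample data: two result pages with (location, detail) records.
pages = [
    [("Oxford", "£250,000"), ("Reading", "£180,000"), ("Banbury", "£210,000")],
    [("Banbury", "3 bed"), ("Didcot", "2 bed")],
]
print(sorted(learn_gazetteer({"Oxford", "Reading"}, pages)))
```

In this toy run, "Banbury" is learned from the first page, and only then does the second page have enough known values for "Didcot" to be accepted in the following iteration. A human-in-the-loop variant could review the newly learned terms before they are merged into the gazetteer.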

Topics: Categories and Subject Descriptors: H.3.5 [Information Storage and Retrieval, Data and Content Management]: Web-based services; General Terms: Languages, Experimentation; Keywords: gazetteer learning
Year: 2013
OAI identifier: oai:CiteSeerX.psu:10.1.1.309.6674
Provided by: CiteSeerX