Automatic Wrappers for Large Scale Web Extraction

By Nilesh Dalvi, Ravi Kumar and Mohamed Soliman

Abstract

We present a generic framework to make wrapper induction algorithms tolerant of noise in the training data. This enables us to learn wrappers in a completely unsupervised manner from noisy training data that is obtained automatically and cheaply, e.g., using dictionaries and regular expressions. By removing the site-level supervision that wrapper-based techniques require, we are able to perform information extraction at web scale, with accuracy unattained by existing unsupervised extraction techniques. Our system is used in production at Yahoo! and powers live applications.

Comment: VLDB201
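The idea of learning from cheap, noisy annotations can be illustrated with a toy sketch. This is not the authors' algorithm: the dictionary `KNOWN_NAMES`, the pattern `PHONE_RE`, the flattened `(path, text)` page representation, and the majority-vote ranking in `best_path` are all assumptions made for illustration. A dictionary and a regular expression label text nodes imperfectly, and the path that collects the most labels is kept as the extraction rule.

```python
import re
from collections import Counter

# Hypothetical cheap annotators: a small dictionary and a phone-number regex.
KNOWN_NAMES = {"Blue Door Cafe", "Luigi's"}
PHONE_RE = re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}")

def is_name(text):
    """Dictionary annotator: exact lookup, so it misses unknown names."""
    return text in KNOWN_NAMES

def is_phone(text):
    """Regex annotator: fires on any phone-like string, wanted or not."""
    return PHONE_RE.search(text) is not None

def best_path(nodes, annotator):
    """Return the path that receives the most noisy labels, or None."""
    counts = Counter(path for path, text in nodes if annotator(text))
    return counts.most_common(1)[0][0] if counts else None

def extract(nodes, path):
    """Apply the learned rule: return every text found at the chosen path."""
    return [text for p, text in nodes if p == path]

# Toy pages from one site, flattened to (path, text) pairs.
nodes = [
    ("/html/body/h1", "Blue Door Cafe"),
    ("/html/body/div/span", "(415) 555-0100"),
    ("/html/body/h1", "Luigi's"),
    ("/html/body/div/span", "(415) 555-0199"),
    ("/html/body/h1", "Casa Verde"),            # missed by the dictionary
    ("/html/body/div/span", "(415) 555-0142"),
    ("/html/body/div/a", "Blue Door Cafe"),     # false positive: a related link
    ("/html/body/p", "Fax (415) 555-0177"),     # false positive: not the main number
]

name_path = best_path(nodes, is_name)
phone_path = best_path(nodes, is_phone)
names = extract(nodes, name_path)
phones = extract(nodes, phone_path)
```

In this toy data the rule chosen for names also returns "Casa Verde", which the dictionary missed, and skips the related-link false positive. That is the kind of noise tolerance the abstract describes, although a real system would rank a far richer space of candidate wrappers.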

Topics: Computer Science - Databases
Year: 2011
OAI identifier: oai:arXiv.org:1103.2406
Full text: http://arxiv.org/abs/1103.2406

