Skip to main content
Article thumbnail
Location of Repository

Scaling up the ALIAS Duplicate Elimination System: A Demonstration

By 

Abstract

Duplicate elimination is an important stage in integrating data from multiple sources. The challenges involved are finding a robust deduplication function that can identify when two records are duplicates and efficiently applying the function on very large lists of records. In ALIAS the task of designing a deduplication function is eased by learning the function from examples of duplicates and nonduplicates and by using active learning to spot such examples effectively [1]. Here we investigate the issues involved in efficiently applying the learnt deduplication system on large lists of records. We demonstrate the working of the ALIAS evaluation engine and highlight the optimizations it uses to significantly cut down the number of record pairs that need to be explicitly materialized.

Year: 2009
OAI identifier: oai:CiteSeerX.psu:10.1.1.134.5894
Provided by: CiteSeerX
Download PDF:
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • http://citeseerx.ist.psu.edu/v... (external link)
  • http://www.it.iitb.ac.in/~suni... (external link)
  • Suggested articles


    To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.