Active sampling for entity matching

Abstract

In entity matching, a fundamental issue when training a classifier to label pairs of entities as either duplicates or non-duplicates is that of selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0-1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more non-duplicate pairs than duplicate pairs). To address this, a recent work [1] proposes to maximize the recall of the classifier under the constraint that its precision should exceed a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst case. Our main result is an active learning algorithm that approximately maximizes the recall of the classifier under a precision constraint with provably sub-linear label complexity (under certain distributional assumptions). Our algorithm uses as a black-box any active learning approach that minimizes 0-1 loss. We show that the label complexity of our algorithm is at most log n times the label complexity of the black-box, and we also bound the difference between the recall of the classifier learned by our algorithm and the recall of the optimal classifier satisfying the precision constraint. We provide an empirical evaluation of our algorithm on several real-world matching data sets that demonstrates the effectiveness of our approach.
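The abstract does not spell out the reduction, but the overall pattern it describes — repeatedly invoking a black-box learner that minimizes (weighted) 0-1 loss while searching for an operating point that satisfies the precision constraint and maximizes recall — can be illustrated with a small sketch. The snippet below is a hedged illustration, not the authors' algorithm: it uses a weighted logistic regression fit on all data as a stand-in for the black-box active learner, evaluates precision and recall on the training labels rather than on actively acquired labels, and the names `blackbox_learner`, `tau`, and the search range are assumptions made for the sketch.

```python
# Illustrative sketch only: binary-search the weight on duplicate (positive)
# pairs so that the classifier returned by a black-box weighted-0-1-loss
# learner meets a precision threshold while keeping recall as high as possible.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

def blackbox_learner(X, y, pos_weight):
    """Placeholder for any (active) learner minimizing weighted 0-1 loss.
    A logistic regression with per-example sample weights stands in here."""
    w = np.where(y == 1, pos_weight, 1.0)
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

def precision_constrained_recall(X, y, tau=0.9, iters=20):
    """Geometric binary search over the positive-class weight; keep the
    classifier with the best recall among those with precision >= tau."""
    lo, hi = 1e-3, 1e3          # assumed search range for the positive weight
    best, best_recall = None, -1.0
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # geometric midpoint of the weight range
        clf = blackbox_learner(X, y, mid)
        pred = clf.predict(X)
        prec = precision_score(y, pred, zero_division=0)
        rec = recall_score(y, pred, zero_division=0)
        if prec >= tau:
            if rec > best_recall:
                best, best_recall = clf, rec
            lo = mid            # precision is satisfied: raise the weight to gain recall
        else:
            hi = mid            # precision too low: lower the positive weight
    return best, best_recall
```

Each iteration of the search corresponds to one call to the black-box learner, which is what gives the log n multiplicative factor on label complexity described in the abstract when the stand-in learner is replaced by a genuine active learner.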
