In entity matching, a fundamental issue when training a classifier to label pairs of entities as either duplicates or non-duplicates is that of selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0-1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more non-duplicate pairs than duplicate pairs). To address this, a recent work [1] proposes to maximize the recall of the classifier under the constraint that its precision be greater than a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst case. Our main result is an active learning algorithm that approximately maximizes the recall of the classifier under a precision constraint with provably sub-linear label complexity (under certain distributional assumptions). Our algorithm uses as a black box any active learning approach that minimizes 0-1 loss. We show that the label complexity of our algorithm is at most log n times the label complexity of the black box, and we also bound the difference between the recall of the classifier learnt by our algorithm and the recall of the optimal classifier satisfying the precision constraint. We provide an empirical evaluation of our algorithm on several real-world matching data sets that demonstrates the effectiveness of our approach.
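To make the objective concrete, the following is a minimal sketch of recall maximization under a precision constraint on a toy, fully labeled set of candidate pairs. It is not the active learning algorithm described above: it assumes every label is already known, and the similarity scores, the threshold classifier, and the helper names `precision_recall` and `best_threshold` are hypothetical choices made only for illustration.

```python
from typing import List, Optional, Tuple


def precision_recall(labels: List[bool], preds: List[bool]) -> Tuple[float, float]:
    """Return (precision, recall) of predictions against true duplicate labels."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def best_threshold(
    scores: List[float], labels: List[bool], min_precision: float
) -> Optional[Tuple[float, float, float]]:
    """Among classifiers 'score >= t', return (t, precision, recall) with the
    highest recall whose precision is at least min_precision, or None."""
    best = None
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        p, r = precision_recall(labels, preds)
        if p >= min_precision and (best is None or r > best[2]):
            best = (t, p, r)
    return best


# Toy data (made up): 3 duplicate pairs among 10 candidates, i.e. class imbalance.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [True, True, False, True, False, False, False, False, False, False]

if __name__ == "__main__":
    print(best_threshold(scores, labels, min_precision=0.75))
    print(best_threshold(scores, labels, min_precision=0.9))
```

On this toy data, a classifier that labels every pair a non-duplicate is correct on 7 of 10 pairs yet has zero recall, which illustrates why 0-1 loss is a poor target under class imbalance. The sketch scans all labels exhaustively, which is exactly the cost that an active learning approach aims to avoid.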