Learnable similarity functions and their applications to record linkage and clustering. Doctoral Dissertation Proposal

Abstract

Many machine learning tasks require similarity functions that estimate likeness between observations. Similarity computations are particularly important for clustering and record linkage algorithms that depend on accurate estimates of the distance between datapoints. However, standard measures such as string edit distance and Euclidean distance often fail to capture an appropriate notion of similarity for a particular domain or dataset. This problem can be alleviated by employing learnable similarity functions that adapt using training data. In this proposal, we introduce two adaptive string similarity measures: (1) Learnable Edit Distance with Affine Gaps, and (2) Learnable Vector-Space Similarity Based on Pairwise Classification. These similarity functions can be trained using a corpus of labeled pairs of equivalent and non-equivalent strings. We illustrate the accuracy improvements obtained with these measures using MARLIN, our system for record linkage in databases that learns to combine adaptive and static string similarity functions in a two-level learning framework. Obtaining useful training examples for learnable similarity functions can be problematic due to scarcity of informative similar and dissimilar object pairs. We propose two strategies, Static-Active Selection and Weakly-Labeled Selection, that facilitate efficient training data collection for record linkage

    Similar works

    Full text

    thumbnail-image

    Available Versions