4 research outputs found
Scaling Multiple-Source Entity Resolution using Statistically Efficient Transfer Learning
We consider a serious, previously unexplored challenge facing almost all
approaches to scaling up entity resolution (ER) to multiple data sources: the
prohibitive cost of labeling training data for supervised learning of
similarity scores for each pair of sources. While there exists a rich
literature describing almost all aspects of pairwise ER, this new challenge is
arising now due to the unprecedented ability to acquire and store data from
online sources, features driven by ER such as enriched search verticals, and
the uniqueness of noisy and missing data characteristics for each source. We
show on real-world and synthetic data that, for state-of-the-art techniques, the reality of heterogeneous sources means the amount of labeled training data must grow quadratically in the number of sources just to maintain constant precision/recall. We address this challenge with a new transfer learning
algorithm which requires far less training data (or equivalently, achieves
superior accuracy with the same data) and is trained using fast convex
optimization. The intuition behind our approach is to adaptively share
structure learned about one scoring problem with all other scoring problems
sharing a data source in common. We demonstrate that our theoretically motivated approach incurs no runtime cost and maintains constant precision/recall while the cost of labeling grows only linearly with the number of sources.
Comment: Short version to appear in CIKM'2012; 10 pages, 7 figures
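Since the abstract states its method only at the level of intuition, here is one minimal illustrative sketch of what adaptive sharing could look like, not the authors' actual algorithm: the scorer for source pair (i, j) decomposes as w + u_i + u_j, where w is shared globally and u_i is a per-source correction, so labels for any pair touching source i also train u_i. The decomposition and all names below are assumptions made for illustration.

```python
# Illustrative sketch only (assumed decomposition, not the paper's algorithm):
# the scorer for source pair (i, j) is w + u[i] + u[j], so labeled pairs from
# any task touching source i also train u[i]. Logistic loss + L2 is jointly
# convex in (w, u), matching the abstract's "fast convex optimization".
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_shared_scorers(tasks, n_sources, dim, lam=0.1, lr=0.1, epochs=200):
    """tasks: dict mapping (i, j) -> (X, y), labeled record-pair features."""
    w = np.zeros(dim)               # structure shared across all source pairs
    u = np.zeros((n_sources, dim))  # per-source corrections
    for _ in range(epochs):
        gw, gu = lam * w, lam * u   # L2-regularization gradients
        for (i, j), (X, y) in tasks.items():
            p = sigmoid(X @ (w + u[i] + u[j]))
            g = X.T @ (p - y) / len(y)      # logistic-loss gradient
            gw = gw + g
            gu[i] += g
            gu[j] += g
        w -= lr * gw
        u -= lr * gu
    return w, u

def score_pair(w, u, i, j, X):
    """Match probability for record pairs drawn from sources i and j."""
    return sigmoid(X @ (w + u[i] + u[j]))
```

Under such a decomposition, a source pair with few labels still inherits structure through u_i and u_j learned from the other pairs, which is one concrete reading of the linear (rather than quadratic) labeling cost the abstract reports.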
In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling
Entity resolution (ER) presents unique challenges for evaluation methodology.
While crowdsourcing platforms make it practical to acquire ground-truth labels, sound sampling approaches must drive the labelling effort. In ER, extreme class imbalance between
matching and non-matching records can lead to enormous labelling requirements
when seeking statistically consistent estimates for rigorous evaluation. This
paper addresses this important challenge with the OASIS algorithm: a sampler
and F-measure estimator for ER evaluation. OASIS draws samples from a (biased)
instrumental distribution, chosen to ensure estimators with optimal asymptotic
variance. As new labels are collected, OASIS updates this instrumental distribution via a Bayesian latent variable model of the annotator oracle, quickly focusing on the unlabelled items that provide the most information. We prove that the resulting estimates of F-measure, precision, and recall converge to the true population values. Thorough comparisons of sampling methods on a variety of ER datasets demonstrate significant labelling reductions of up to 83% without loss of estimation accuracy.
Comment: 13 pages, 5 figures
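For readers unfamiliar with the setup, the following stripped-down sketch shows only the importance-sampling core of such an estimator; the Bayesian adaptive update of the instrumental distribution that defines OASIS is omitted, and the particular biasing rule and names are assumptions for illustration.

```python
# Sketch of the importance-sampling core only; OASIS additionally adapts the
# instrumental distribution with a Bayesian latent variable model (omitted).
import numpy as np

def is_f_measure(scores, preds, oracle, budget, seed=0):
    """scores: classifier match probabilities per candidate pair;
    preds: 0/1 predicted labels; oracle(i) -> true 0/1 label (costly).
    Returns an F-measure estimate from `budget` oracle queries."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    # Bias sampling toward likely matches: under extreme class imbalance
    # they carry most of the information about F-measure. (Assumed rule.)
    q = 0.9 * np.asarray(scores) + 0.1 / n
    q /= q.sum()
    idx = rng.choice(n, size=budget, replace=True, p=q)
    tp = fp = fn = 0.0
    for i in idx:
        w = 1.0 / (budget * q[i])   # importance weight corrects the bias
        y = oracle(i)
        tp += w * (preds[i] == 1 and y == 1)
        fp += w * (preds[i] == 1 and y == 0)
        fn += w * (preds[i] == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0
```

The importance weights keep the TP/FP/FN totals consistent despite the biased sampling; choosing and adapting q so that the resulting estimator has optimal asymptotic variance is the contribution the abstract describes.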
Principled Graph Matching Algorithms for Integrating Multiple Data Sources
This paper explores combinatorial optimization for problems of max-weight
graph matching on multi-partite graphs, which arise in integrating multiple
data sources. Entity resolution, the data integration problem of performing noisy joins on structured data, typically proceeds by first hashing each record
into zero or more blocks, scoring pairs of records that are co-blocked for
similarity, and then matching pairs of sufficient similarity. In the most
common case of matching two sources, it is often desirable for the final
matching to be one-to-one (a record may be matched with at most one other);
members of the database and statistical record linkage communities accomplish
such matchings in the final stage by weighted bipartite graph matching on
similarity scores. Such matchings are intuitively appealing: they leverage a natural global property of many real-world entity stores, that of being nearly deduplicated, and are known to provide significant improvements to precision and recall. Unfortunately, unlike the bipartite case, exact max-weight matching on multi-partite graphs is NP-hard. We make two algorithmic contributions that approximate multi-partite max-weight matching: our first
algorithm borrows optimization techniques common to Bayesian probabilistic
inference; our second is a greedy approximation algorithm. In addition to a
theoretical guarantee on the latter, we present comparisons on a real-world ER problem from Bing that is significantly larger than those typically found in the literature, on publication data, and on a series of synthetic problems. Our results quantify
significant improvements due to exploiting multiple sources, which are made
possible by global one-to-one constraints linking otherwise independent
matching sub-problems. We also discover that our algorithms are complementary: one is much more robust under noise, while the other is simple to implement and very fast to run.
Comment: 14 pages, 11 figures
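As a concrete illustration of the second contribution's flavor, here is a hedged sketch of a greedy approximation to multi-partite max-weight matching; it is not necessarily the paper's exact algorithm, and all names are assumptions: scan candidate pairs in decreasing similarity and merge their clusters whenever the merged entity still holds at most one record per source.

```python
# Greedy sketch (assumed variant, not necessarily the paper's algorithm):
# merge clusters of high-similarity pairs while respecting the global
# one-to-one constraint of at most one record per source per entity.
def greedy_multipartite_match(pairs, source_of):
    """pairs: iterable of (score, rec_a, rec_b) candidate matches;
    source_of: record -> source id. Returns record -> cluster (a set)."""
    cluster = {}

    def get(r):
        if r not in cluster:
            cluster[r] = {r}
        return cluster[r]

    for score, a, b in sorted(pairs, key=lambda p: p[0], reverse=True):
        ca, cb = get(a), get(b)
        if ca is cb:
            continue  # already resolved to the same entity
        # One-to-one: the merged cluster may not repeat a source.
        if {source_of(r) for r in ca} & {source_of(r) for r in cb}:
            continue
        merged = ca | cb
        for r in merged:
            cluster[r] = merged
    return cluster
```

This shows the sense in which one-to-one constraints link otherwise independent matching sub-problems: a merge accepted for one source pair restricts which later merges remain feasible for every other source pair sharing those records.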
Reuse and Adaptation for Entity Resolution through Transfer Learning
Entity resolution (ER) is one of the fundamental problems in data
integration, where machine learning (ML) based classifiers often provide state-of-the-art results. Considerable human effort goes into feature
engineering and training data creation. In this paper, we investigate a new
problem: Given a dataset D_T for ER with limited or no training data, is it
possible to train a good ML classifier on D_T by reusing and adapting the
training data of a dataset D_S from the same or a related domain? Our major
contributions include (1) a distributed representation based approach to encode
each tuple from diverse datasets into a standard feature space; (2)
identification of common scenarios where the reuse of training data can be
beneficial; and (3) five algorithms for handling each of the aforementioned
scenarios. We have performed comprehensive experiments on 12 datasets from 5
different domains (publications, movies, songs, restaurants, and books). Our
experiments show that our algorithms provide significant benefits, such as superior performance for a fixed training data size.
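The first contribution lends itself to a short illustration; the sketch below is an assumed simplification, with a hypothetical `word_vecs` lookup standing in for pretrained vectors such as GloVe: flatten each tuple to text and average word vectors, so that tuples from D_S and D_T land in one shared feature space despite having different schemas.

```python
# Assumed simplification of the distributed-representation idea: average
# pretrained word vectors over a tuple's flattened text. `word_vecs` is a
# hypothetical dict-like map from token to vector (e.g. loaded from GloVe).
import numpy as np

def encode_tuple(tup, word_vecs, dim=300):
    """tup: dict of attribute -> value, from any dataset or schema."""
    tokens = " ".join(str(v) for v in tup.values()).lower().split()
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def pair_features(t1, t2, word_vecs):
    """Schema-agnostic features for a candidate record pair; because the
    feature space is shared, a classifier trained on pairs from D_S can be
    reused or adapted for pairs from D_T."""
    a, b = encode_tuple(t1, word_vecs), encode_tuple(t2, word_vecs)
    return np.concatenate([np.abs(a - b), a * b])
```

Because `pair_features` never consults attribute names, training data created for one dataset remains usable for another, which is the precondition for the reuse and adaptation scenarios the abstract enumerates.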