Query-Driven Sampling for Collective Entity Resolution
Probabilistic databases play a preeminent role in the processing and
management of uncertain data. Recently, many database research efforts have
integrated probabilistic models into databases to support tasks such as
information extraction and labeling. Many of these efforts are based on
batch-oriented inference, which inhibits a real-time workflow. One important
task is entity resolution (ER): the process of determining which records
(mentions) in a database correspond to the same real-world entity. Traditional
pairwise ER methods can lead to inconsistencies and low accuracy because they
make localized decisions. Leading ER systems solve this problem by
collectively resolving all records using a probabilistic graphical model and
Markov chain Monte Carlo (MCMC) inference. For large datasets, however, this
is an extremely expensive process. A key observation is that such an
exhaustive ER process incurs a huge up-front cost, which is wasteful in
practice because most users are interested in only a small subset of entities.
In this paper, we advocate pay-as-you-go entity resolution by developing a
number of query-driven collective ER techniques. We introduce two classes of
SQL queries that involve ER operators: selection-driven ER and join-driven ER.
We implement novel variations of the MCMC Metropolis-Hastings algorithm to
generate biased samples, and selectivity-based scheduling algorithms to
support the two classes of ER queries. Finally, we show that query-driven ER
algorithms converge and return results within minutes over a database
populated with extractions from a newswire dataset containing 71 million
mentions.
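The abstract only names the idea of biased sampling; as a rough illustration,
the Python sketch below biases Metropolis-Hastings proposals toward mentions
that matter to a query. Every name here (query_driven_mh, log_score, bias) and
the proposal scheme are assumptions for illustration, not the paper's
implementation.

```python
import math
import random

def query_driven_mh(mentions, log_score, is_query_relevant,
                    n_steps=10000, bias=0.9):
    """Toy query-driven Metropolis-Hastings over entity assignments.

    mentions: list of mention ids (each starts as its own entity).
    log_score(assignment): unnormalized log-probability of an assignment.
    is_query_relevant(m): True if mention m can affect the query answer.
    bias: probability of proposing a move on a query-relevant mention.
    """
    assignment = {m: m for m in mentions}        # singleton entities
    relevant = [m for m in mentions if is_query_relevant(m)]
    current = log_score(assignment)
    for _ in range(n_steps):
        # Biased proposal: usually pick a query-relevant mention,
        # occasionally any mention so the chain still explores globally.
        pool = relevant if relevant and random.random() < bias else mentions
        m = random.choice(pool)
        proposal = dict(assignment)
        proposal[m] = random.choice(list(assignment.values()))
        new = log_score(proposal)
        # Accept/reject on the model score. For brevity the proposal is
        # treated as symmetric; a full sampler would add the Hastings
        # correction for the biased proposal distribution.
        if random.random() < math.exp(min(0.0, new - current)):
            assignment, current = proposal, new
    return assignment
```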
Modeling Scalability of Distributed Machine Learning
Present-day machine learning is computationally intensive and processes large
amounts of data. To address these scalability issues, it is implemented in a
distributed fashion, with the work parallelized across a number of computing
nodes. It is usually hard to estimate in advance how many nodes to use for a
particular workload. We propose a simple framework for estimating the
scalability of distributed machine learning algorithms, measuring scalability
by means of the speedup an algorithm achieves with more nodes. We propose time
complexity models for gradient descent and graphical model inference, and
validate our models with experiments on deep learning training and belief
propagation. This framework was used to study the scalability of machine
learning algorithms in Apache Spark.

Comment: 6 pages, 4 figures, appears at ICDE 201
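The abstract describes the cost models only at a high level; below is a
minimal sketch of one common model of this kind, in which compute divides
evenly across nodes while communication grows with node count. The linear
communication term and the constants are assumptions, not the paper's fitted
model.

```python
def iteration_time(n_nodes, t_compute, t_comm):
    """Toy per-iteration cost model for synchronous data-parallel
    gradient descent: compute divides across nodes, communication
    grows with the number of nodes (parameter-server style).
    """
    return t_compute / n_nodes + t_comm * n_nodes

def speedup(n_nodes, t_compute, t_comm):
    """Speedup relative to a single node under the same cost model."""
    return iteration_time(1, t_compute, t_comm) / iteration_time(
        n_nodes, t_compute, t_comm)

if __name__ == "__main__":
    # With these made-up constants, speedup peaks and then degrades
    # once communication starts to dominate compute.
    for n in (1, 2, 4, 8, 16, 32):
        print(n, round(speedup(n, t_compute=100.0, t_comm=0.5), 2))
```

Under such a model the speedup curve has a maximum, which is what makes
choosing the node count in advance non-trivial.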
DOPE: Distributed Optimization for Pairwise Energies
We formulate an Alternating Direction Method of Multipliers (ADMM) that
systematically distributes the computations of any technique for optimizing
pairwise functions, including non-submodular potentials. Such discrete
functions are very useful in segmentation and a breadth of other vision
problems. Our method decomposes the problem into a large set of small
sub-problems, each involving a sub-region of the image domain, which can be
solved in parallel. We achieve consistency between the sub-problems through a
novel constraint that applies to a large class of pairwise functions. We give
an iterative numerical solution that alternates between solving the
sub-problems and updating consistency variables until convergence. We report
comprehensive experiments that demonstrate the benefit of our general
distributed solution in the case of the popular serial algorithm of Boykov and
Kolmogorov (the BK algorithm) and also in the context of non-submodular
functions.

Comment: Accepted at CVPR 201
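DOPE's sub-problems are discrete pairwise energies over image sub-regions; a
full solver is beyond an abstract, but the alternation it describes (parallel
sub-problem solves, then consistency updates) follows the standard
consensus-ADMM pattern. The continuous quadratic example below is purely
illustrative of that pattern and is not the paper's formulation.

```python
import numpy as np

def consensus_admm(local_solvers, n_vars, rho=1.0, n_iters=50):
    """Generic consensus-ADMM skeleton. Each local_solvers[i](v, rho)
    returns argmin_x f_i(x) + (rho/2)*||x - v||^2; agreement between
    sub-problems is enforced via the consensus variable z and duals u.
    """
    k = len(local_solvers)
    x = np.zeros((k, n_vars))
    z = np.zeros(n_vars)
    u = np.zeros((k, n_vars))
    for _ in range(n_iters):
        # 1) Sub-problems are independent here, hence parallelizable.
        for i, solve in enumerate(local_solvers):
            x[i] = solve(z - u[i], rho)
        # 2) Consensus step: average the local estimates.
        z = (x + u).mean(axis=0)
        # 3) Dual updates push the locals toward agreement.
        u += x - z
    return z

if __name__ == "__main__":
    # Two toy quadratic sub-problems f_i(x) = 0.5*||x - c_i||^2, whose
    # proximal step has the closed form (c_i + rho*v) / (1 + rho).
    targets = [np.array([0.0, 1.0]), np.array([2.0, 3.0])]
    solvers = [lambda v, rho, c=c: (c + rho * v) / (1.0 + rho)
               for c in targets]
    print(consensus_admm(solvers, n_vars=2))  # converges to ~[1.0, 2.0]
```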