Aggregate Estimation Over Dynamic Hidden Web Databases
Many databases on the web are "hidden" behind (i.e., accessible only through)
their restrictive, form-like, search interfaces. Recent studies have shown that
it is possible to estimate aggregate query answers over such hidden web
databases by issuing a small number of carefully designed search queries
through the restrictive web interface. A problem with these existing works, however, is that they all assume the underlying database to be static, whereas most real-world web databases (e.g., Amazon, eBay) are frequently updated. In
this paper, we study the novel problem of estimating/tracking aggregates over
dynamic hidden web databases while adhering to the stringent query-cost
limitation they enforce (e.g., at most 1,000 search queries per day).
Theoretical analysis and extensive real-world experiments demonstrate the
effectiveness of our proposed algorithms and their superiority over baseline
solutions (e.g., the repeated execution of algorithms designed for static web
databases).
Discovering the Skyline of Web Databases
Many web databases are "hidden" behind proprietary search interfaces that enforce the top-k output constraint, i.e., each query returns at most k of all matching tuples, preferentially selected and returned according to a proprietary ranking function. In this paper, we initiate research into the novel problem of skyline discovery over top-k hidden web databases. Since
skyline tuples provide critical insights into the database and include the
top-ranked tuple for every possible ranking function following the monotonic
order of attribute values, skyline discovery from a hidden web database can
enable a wide variety of innovative third-party applications over one or
multiple web databases. Our research shows that the critical factor affecting the cost of skyline discovery is the type of search interface controls provided by the website. As such, we develop efficient algorithms for the three most popular types, i.e., one-ended range, free range, and point predicates, and then combine them to support web databases that feature a
mixture of these types. Rigorous theoretical analysis and extensive real-world
online and offline experiments demonstrate the effectiveness of our proposed
techniques and their superiority over baseline solutions.
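For reference, the skyline consists of exactly those tuples not dominated by any other tuple, where tuple a dominates tuple b if a is at least as good on every attribute and strictly better on at least one. The minimal Python sketch below illustrates that definition with a brute-force in-memory check (assuming smaller attribute values are preferred); it is not the paper's discovery algorithm, which must operate through a top-k search interface without access to the full data.

    def dominates(a, b):
        # a dominates b if a is no worse on every attribute and strictly better on one
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    def skyline(tuples):
        # keep the tuples that no other tuple dominates
        return [t for t in tuples if not any(dominates(o, t) for o in tuples if o != t)]

    # Example: flights described by (price, duration in hours)
    flights = [(300, 5.0), (250, 6.5), (400, 3.0), (320, 5.0)]
    print(skyline(flights))   # -> [(300, 5.0), (250, 6.5), (400, 3.0)]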
Data Curation with Deep Learning [Vision]
Data curation - the process of discovering, integrating, and cleaning data -
is one of the oldest, hardest, yet unavoidable data management problems. Despite decades of effort from both researchers and practitioners, it remains one of the most time-consuming and least enjoyable tasks for data scientists. In most organizations, data curation plays a critical role in fully unlocking the value of big data. Unfortunately, current solutions are not keeping up with the ever-changing data ecosystem, because they often incur substantial human cost. Meanwhile, deep learning is achieving remarkable
successes in multiple areas, such as image recognition, natural language
processing, and speech recognition. In this vision paper, we explore how some
of the fundamental innovations in deep learning could be leveraged to improve
existing data curation solutions and to help build new ones. In particular, we
provide a thorough overview of the current deep learning landscape, and
identify interesting research opportunities and dispel common myths. We hope
that the synthesis of these important domains will unleash a series of research
activities that will lead to significantly improved solutions for many data
curation tasks.
Multi-Attribute Selectivity Estimation Using Deep Learning
Selectivity estimation - the problem of estimating the result size of queries
- is a fundamental problem in databases. Accurate estimation of query
selectivity involving multiple correlated attributes is especially challenging.
Poor cardinality estimates could result in the selection of bad plans by the
query optimizer. We investigate the feasibility of using deep learning based
approaches for both point and range queries and propose two complementary
approaches. Our first approach considers selectivity as an unsupervised deep
density estimation problem. We successfully introduce techniques from neural
density estimation for this purpose. The key idea is to decompose the joint
distribution into a set of tractable conditional probability distributions such
that they satisfy the autoregressive property. Our second approach formulates
selectivity estimation as a supervised deep learning problem that predicts the
selectivity of a given query. We also introduce and address a number of
practical challenges arising when adapting deep learning for relational data.
These include query/data featurization, incorporating query workload
information in a deep learning framework and the dynamic scenario where both
data and workload queries could be updated. Our extensive experiments, with a special emphasis on queries with a large number of predicates and/or small result sizes, demonstrate that our proposed techniques provide fast and accurate selectivity estimates with minimal space overhead.
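For concreteness, the autoregressive decomposition referred to above is the standard chain-rule factorization of the joint attribute distribution, with each conditional learned by a neural density estimator; the notation below is illustrative rather than taken verbatim from the paper:

    P(a_1, \dots, a_n) = \prod_{i=1}^{n} P(a_i \mid a_1, \dots, a_{i-1})

The selectivity of a conjunctive point query a_1 = v_1 AND ... AND a_n = v_n over a table of N rows can then be estimated as N * P(v_1, \dots, v_n); range predicates can, in principle, be handled by summing or sampling over the values that satisfy them.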
Aggregate Estimations over Location Based Services
Location based services (LBS) have become very popular in recent years. They
range from map services (e.g., Google Maps) that store geographic locations of
points of interests, to online social networks (e.g., WeChat, Sina Weibo,
FourSquare) that leverage user geographic locations to enable various
recommendation functions. The public query interfaces of these services may be
abstractly modeled as a kNN interface over a database of two-dimensional points
on a plane: given an arbitrary query point, the system returns the k points in
the database that are nearest to the query point. In this paper we consider the
problem of obtaining approximate estimates of SUM and COUNT aggregates by only
querying such databases via their restrictive public interfaces. We distinguish
between interfaces that return location information of the returned tuples
(e.g., Google Maps), and interfaces that do not return location information
(e.g., Sina Weibo). For both types of interfaces, we develop aggregate
estimation algorithms that are based on novel techniques for precisely
computing or approximately estimating the Voronoi cell of tuples. We discuss a
comprehensive set of real-world experiments for testing our algorithms,
including experiments on Google Maps, WeChat, and Sina Weibo.
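To give a flavor of why Voronoi cells matter here, consider the k = 1 case: a query issued at a uniformly random point returns the tuple whose Voronoi cell contains that point, so each tuple is sampled with probability proportional to its cell's area, and weighting by the inverse of that probability yields an unbiased COUNT estimator. The sketch below illustrates this principle only; nn_query and cell_area are hypothetical stand-ins, and the paper's contribution is precisely how to compute or approximate the cell area through the restrictive interface.

    import random

    def estimate_count(nn_query, cell_area, num_samples=1000):
        # Monte-Carlo COUNT estimator over a k=1 nearest-neighbour interface,
        # assuming all tuples lie in the unit square [0, 1] x [0, 1].
        #   nn_query(q)  -- hypothetical wrapper around the LBS: tuple nearest to q
        #   cell_area(t) -- hypothetical (estimated) area of t's Voronoi cell
        total = 0.0
        for _ in range(num_samples):
            q = (random.random(), random.random())   # uniform query point
            t = nn_query(q)
            # t is returned with probability cell_area(t), so weighting each
            # sample by 1 / cell_area(t) gives an unbiased estimate of COUNT(*)
            total += 1.0 / cell_area(t)
        return total / num_samples

A SUM aggregate follows the same pattern, weighting the measured attribute of the returned tuple by 1 / cell_area(t).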
DeepER -- Deep Entity Resolution
Entity resolution (ER) is a key data integration problem. Despite 70+ years of effort on all aspects of ER, there is still a high demand for
democratizing ER - humans are heavily involved in labeling data, performing
feature engineering, tuning parameters, and defining blocking functions. With
the recent advances in deep learning, in particular distributed representation
of words (a.k.a. word embeddings), we present a novel ER system, called DeepER,
that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human effort). For accuracy, we use sophisticated composition
methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple to a
distributed representation (i.e., a vector), which can in turn be used to
effectively capture similarities between tuples. We consider both the case
where pre-trained word embeddings are available as well as the case where they are
not; we present ways to learn and tune the distributed representations. For
efficiency, we propose a locality sensitive hashing (LSH) based blocking
approach that uses distributed representations of tuples; it takes all
attributes of a tuple into consideration and produces much smaller blocks,
compared with traditional methods that consider only a few attributes. For
ease-of-use, DeepER requires much less human-labeled data and does not need feature engineering, compared with traditional machine learning based approaches, which require handcrafted features and similarity functions along with their associated thresholds. We evaluate our algorithms on multiple
datasets (including benchmarks, biomedical data, as well as multi-lingual data)
and the extensive experimental results show that DeepER outperforms existing
solutions. Comment: Accepted to PVLDB 2018 as "Distributed Representations of Tuples for Entity Resolution". This version corrects a minor issue in Example 4 pointed out by Andrew Borthwick and Matthias Boeh
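To illustrate the kind of LSH-based blocking described above (not DeepER's exact implementation), the sketch below hashes each tuple's embedding with random hyperplanes; tuples whose signatures collide fall into the same block and are the only pairs compared. Here embed_tuple is a hypothetical stand-in for the distributed representation produced by the RNN composition.

    import numpy as np
    from collections import defaultdict

    def lsh_blocks(tuples, embed_tuple, dim=300, num_bits=16, seed=0):
        # Group tuples into blocks via random-hyperplane LSH on their embeddings.
        #   embed_tuple(t) -- hypothetical function mapping a tuple to a dim-d vector
        rng = np.random.default_rng(seed)
        hyperplanes = rng.standard_normal((num_bits, dim))
        blocks = defaultdict(list)
        for t in tuples:
            v = embed_tuple(t)
            # the signature is the sign pattern of v against each hyperplane;
            # vectors that are close in cosine similarity tend to share signatures
            signature = tuple((hyperplanes @ v > 0).astype(int))
            blocks[signature].append(t)
        return blocks

    # Candidate pairs are generated only within each block, so the number of
    # comparisons drops from O(n^2) to the sum of squared block sizes.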
"The Whole Is Greater Than the Sum of Its Parts": Optimization in Collaborative Crowdsourcing
In this work, we initiate the investigation of optimization opportunities in
collaborative crowdsourcing. Many popular applications, such as collaborative document editing, sentence translation, or citizen science, resort to this special form of human-based computing, where crowd workers with appropriate skills and expertise are required to form groups to solve complex tasks.
Central to any collaborative crowdsourcing process is the aspect of successful
collaboration among the workers, which, for the first time, is formalized and
then optimized in this work. Our formalism considers two main
collaboration-related human factors, affinity and upper critical mass,
appropriately adapted from organizational science and social theories. Our contributions are (a) a comprehensive model for collaborative crowdsourcing optimization, (b) rigorous theoretical analyses to understand the hardness of the proposed problems, and (c) an array of efficient exact and approximation algorithms with provable theoretical guarantees. Finally, we
present a detailed set of experimental results stemming from two real-world collaborative crowdsourcing applications using Amazon Mechanical Turk, as well as synthetic data analyses on scalability and qualitative aspects of our proposed algorithms. Our experimental results successfully demonstrate the efficacy of our proposed solutions.
Reuse and Adaptation for Entity Resolution through Transfer Learning
Entity resolution (ER) is one of the fundamental problems in data
integration, where machine learning (ML) based classifiers often provide the
state-of-the-art results. Considerable human effort goes into feature
engineering and training data creation. In this paper, we investigate a new
problem: Given a dataset D_T for ER with limited or no training data, is it
possible to train a good ML classifier on D_T by reusing and adapting the
training data of a dataset D_S from the same or a related domain? Our major
contributions include (1) a distributed representation based approach to encode
each tuple from diverse datasets into a standard feature space; (2)
identification of common scenarios where the reuse of training data can be
beneficial; and (3) five algorithms for handling each of the aforementioned
scenarios. We have performed comprehensive experiments on 12 datasets from 5
different domains (publications, movies, songs, restaurants, and books). Our
experiments show that our algorithms provide significant benefits, such as superior performance for a fixed training data size.
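As a hedged illustration of the reuse idea (not the paper's five algorithms), the sketch below encodes labeled tuple pairs from both datasets into one shared embedding-based feature space, trains a classifier on the source pairs plus whatever target labels exist, and applies it to the unlabeled target pairs; encode_pair is a hypothetical stand-in for the distributed-representation encoder.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def transfer_er(encode_pair, source_pairs, source_labels,
                    target_labeled=(), target_labels=(), target_unlabeled=()):
        # Reuse labeled ER data from source dataset D_S for target dataset D_T.
        #   encode_pair(a, b) -- hypothetical encoder mapping a tuple pair into a
        #   fixed-length feature space shared by both datasets
        X = [encode_pair(a, b) for a, b in list(source_pairs) + list(target_labeled)]
        y = list(source_labels) + list(target_labels)
        clf = LogisticRegression(max_iter=1000).fit(np.array(X), y)

        # classify the unlabeled target pairs as match / non-match
        X_u = np.array([encode_pair(a, b) for a, b in target_unlabeled])
        return clf.predict(X_u)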
Walk, Not Wait: Faster Sampling Over Online Social Networks
In this paper, we introduce a novel, general-purpose technique for faster sampling of nodes over an online social network. Specifically, unlike traditional random walks, which wait for the sampling distribution to converge to a predetermined target distribution - a waiting process that incurs a high query cost - we develop WALK-ESTIMATE, which starts with a much shorter random walk and then proactively estimates the sampling probability of the node it reaches, before using acceptance-rejection sampling to adjust the sampling probability to the predetermined target distribution. We present a novel backward random
walk technique which provides provably unbiased estimations for the sampling
probability, and demonstrate the superiority of WALK-ESTIMATE over traditional
random walks through theoretical analysis and extensive experiments over real-world online social networks.
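As a minimal sketch of the acceptance-rejection step described above (not the paper's backward-walk estimator), suppose a short walk ends at node v, that p_hat(v) estimates the probability of ending at v, and that pi(v) is the desired target probability; accepting v with probability pi(v) / (c * p_hat(v)) corrects the sample toward pi. All names below are illustrative assumptions.

    import random

    def accept_node(v, pi, p_hat, c):
        # Acceptance-rejection: keep v with probability pi(v) / (c * p_hat(v)),
        # where c is chosen so that pi(v) <= c * p_hat(v) for all nodes v.
        return random.random() < pi(v) / (c * p_hat(v))

    def walk_estimate_sample(short_walk, pi, p_hat, c):
        # Repeat short walks until a terminal node is accepted; the accepted node
        # is then (approximately) distributed according to the target pi.
        while True:
            v = short_walk()          # run a short random walk, return its end node
            if accept_node(v, pi, p_hat, c):
                return v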
Are Outlier Detection Methods Resilient to Sampling?
Outlier detection is a fundamental task in data mining and has many
applications including detecting errors in databases. While there has been
extensive prior work on methods for outlier detection, modern datasets are often too large for commonly used methods to process within a reasonable time. To overcome this issue, outlier detection
methods can be trained over samples of the full-sized dataset. However, it is
not clear how a model trained on a sample compares with one trained on the
entire dataset. In this paper, we introduce the notion of resilience to
sampling for outlier detection methods. Orthogonal to traditional performance
metrics such as precision/recall, resilience represents the extent to which the outliers detected by a method applied to samples from a sampling scheme match those detected when the method is applied to the whole dataset. We propose a novel approach for
estimating the resilience to sampling of both individual outlier methods and
their ensembles. We performed an extensive experimental study on synthetic and
real-world datasets where we study seven diverse and representative outlier
detection methods, compare results obtained from samples versus those obtained
from the whole datasets and evaluate the accuracy of our resilience estimates.
We observed that the methods are not equally resilient to a given sampling
scheme and it is often the case that careful joint selection of both the
sampling scheme and the outlier detection method is necessary. It is our hope
that the paper initiates research on designing outlier detection algorithms
that are resilient to sampling.
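As an illustrative proxy (not the paper's estimator), resilience can be gauged by comparing the outlier set a method reports on a random sample against the set it reports on the full data, restricted to the sampled rows and averaged over repeated samples; the detector below is a placeholder.

    import random

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 1.0

    def resilience_estimate(row_ids, detect_outliers, sample_frac=0.1, trials=20, seed=0):
        # Rough resilience proxy: agreement between outliers flagged on samples and
        # outliers flagged on the whole dataset, restricted to the sampled rows.
        #   detect_outliers(ids) -- placeholder detector returning the ids it flags
        rng = random.Random(seed)
        full_outliers = set(detect_outliers(row_ids))
        overlaps = []
        for _ in range(trials):
            sample = rng.sample(list(row_ids), int(sample_frac * len(row_ids)))
            sample_outliers = set(detect_outliers(sample))
            overlaps.append(jaccard(sample_outliers, full_outliers & set(sample)))
        return sum(overlaps) / len(overlaps)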