Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach
Finding joinable tables in data lakes is a key procedure in many applications
such as data integration, data augmentation, data analysis, and data market.
Traditional approaches that find equi-joinable tables are unable to deal with
misspellings and different formats, nor do they capture any semantic joins. In
this paper, we propose PEXESO, a framework for joinable table discovery in data
lakes. We embed textual values as high-dimensional vectors and join columns
under similarity predicates on high-dimensional vectors, thereby addressing the
limitations of equi-join approaches and identifying more meaningful results. To
efficiently find joinable tables with similarity, we propose a block-and-verify
method that utilizes pivot-based filtering. A partitioning technique is
developed to cope with the case when the data lake is large and the index
cannot fit in main memory. An experimental evaluation on real datasets shows
that our solution identifies substantially more tables than equi-joins and
outperforms other similarity-based options, and the join results are useful in
data enrichment for machine learning tasks. The experiments also demonstrate
the efficiency of the proposed method.
Comment: Full version of paper in ICDE 202
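
The pivot-based filtering idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not PEXESO's block-and-verify algorithm; it only shows the generic triangle-inequality bound that pivot-based filters rely on. The Euclidean metric, the function names (`pivot_lower_bound`, `range_search`), and the toy vectors are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def pivot_lower_bound(q_to_pivots, x_to_pivots):
    """Lower bound on d(q, x): by the triangle inequality,
    |d(q, p) - d(x, p)| <= d(q, x) for every pivot p."""
    return max(abs(dq - dx) for dq, dx in zip(q_to_pivots, x_to_pivots))

def range_search(query, vectors, pivots, tau):
    """Return (indices of vectors within distance tau, number verified)."""
    # Distances from each data vector to each pivot; in practice these
    # would be precomputed once and stored in an index.
    index = [[euclidean(x, p) for p in pivots] for x in vectors]
    q_dists = [euclidean(query, p) for p in pivots]
    matches, verified = [], 0
    for i, x in enumerate(vectors):
        # Filter: if the lower bound already exceeds tau, x cannot match.
        if pivot_lower_bound(q_dists, index[i]) > tau:
            continue
        # Verify: compute the exact distance for surviving candidates.
        verified += 1
        if euclidean(query, x) <= tau:
            matches.append(i)
    return matches, verified

# Hypothetical toy data: three 2-D vectors and a single pivot at the origin.
vectors = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
pivots = [(0.0, 0.0)]
matches, verified = range_search((0.5, 0.0), vectors, pivots, tau=1.0)
```

In this toy run the distant vector is discarded by the pivot bound alone, so only two exact distance computations are needed.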
FINEX: A Fast Index for Exact & Flexible Density-Based Clustering (Extended Version with Proofs)*
Density-based clustering aims to find groups of similar objects (i.e.,
clusters) in a given dataset. Applications include, e.g., process mining and
anomaly detection. It comes with two user parameters (ε, MinPts) that
determine the clustering result, but are typically unknown in advance. Thus,
users need to interactively test various settings until satisfying clusterings
are found. However, existing solutions suffer from the following limitations:
(a) Ineffective pruning of expensive neighborhood computations. (b) Approximate
clustering, where objects are falsely labeled noise. (c) Restricted parameter
tuning that is limited to ε whereas MinPts is constant, which reduces
the explorable clusterings. (d) Inflexibility in terms of applicable data types
and distance functions. We propose FINEX, a linear-space index that overcomes
these limitations. Our index provides exact clusterings and can be queried with
either of the two parameters. FINEX avoids neighborhood computations where
possible and reduces the complexities of the remaining computations by
leveraging fundamental properties of density-based clusters. Hence, our
solution is efficient and flexible regarding data types and distance functions.
Moreover, FINEX respects the original and straightforward notion of
density-based clustering. In our experiments on 12 large real-world datasets
from various domains, FINEX frequently outperforms state-of-the-art techniques
for exact clustering by orders of magnitude.
GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search
In this paper, we study the problem of approximate containment similarity
search. Given two records Q and X, the containment similarity between Q and X
with respect to Q is |Q ∩ X| / |Q|. Given a query record Q and a set of
records S, the containment similarity search finds a set of records from S
whose containment similarity regarding Q is not less than the given threshold.
This problem has many important applications in commercial and scientific
fields such as record matching and domain search. The existing solution relies on
the asymmetric LSH method by transforming the containment similarity to
well-studied Jaccard similarity. In this paper, we use a different framework by
transforming the containment similarity to set intersection. We propose a novel
augmented KMV sketch technique, namely GB-KMV, which is data-dependent and can
achieve a good trade-off between the sketch size and the accuracy. We provide
theoretical analysis to underpin the proposed augmented KMV sketch
technique, and show that it outperforms the state-of-the-art technique LSH-E in
terms of estimation accuracy under practical assumptions. Our comprehensive
experiments on real-life datasets verify that GB-KMV is superior to LSH-E in
terms of the space-accuracy trade-off, time-accuracy trade-off, and the sketch
construction time. For instance, with similar estimation accuracy (F-1 score),
GB-KMV is over 100 times faster than LSH-E on some real-life datasets.
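
The containment similarity defined in the abstract above, and the threshold search built on it, can be written out exactly as a small sketch. This is the brute-force exact computation, not the GB-KMV sketch or LSH-E; the function names and the toy records are assumptions made for illustration.

```python
def containment(q, x):
    """Containment similarity of q in x: |q ∩ x| / |q|."""
    if not q:
        raise ValueError("query record must be non-empty")
    return len(q & x) / len(q)

def containment_search(q, records, threshold):
    """Indices of records whose containment w.r.t. q is >= threshold."""
    return [i for i, x in enumerate(records) if containment(q, x) >= threshold]

# Hypothetical toy data: records are sets of tokens.
Q = {"a", "b", "c", "d"}
S = [
    {"a", "b", "c", "x"},       # shares 3 of Q's 4 elements -> 0.75
    {"a", "y"},                 # shares 1 of 4 -> 0.25
    {"a", "b", "c", "d", "e"},  # contains all of Q -> 1.0
]
result = containment_search(Q, S, threshold=0.75)
```

Note that the measure is asymmetric: it is normalized by |Q| only, so a large record that fully contains Q scores 1.0 regardless of how many extra elements it has.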