Intrinsic dimension of a dataset: what properties does one expect?
We propose an axiomatic approach to the concept of an intrinsic dimension of
a dataset, based on a viewpoint of geometry of high-dimensional structures. Our
first axiom postulates that high values of dimension be indicative of the
presence of the curse of dimensionality (in a certain precise mathematical
sense). The second axiom requires the dimension to depend smoothly on a
distance between datasets (so that the dimension of a dataset and that of an
approximating principal manifold would be close to each other). The third axiom
is a normalization condition: the dimension of the Euclidean $n$-sphere $\mathbb{S}^n$
is $\Theta(n)$. We give an example of a dimension function satisfying our
axioms, even though it is in general computationally unfeasible, and discuss a
computationally cheap function satisfying most but not all of our axioms (the
"intrinsic dimensionality" of Chávez et al.).
Comment: 6 pages, 6 figures, 1 table, latex with IEEE macros, final submission
to Proceedings of the 22nd IJCNN (Orlando, FL, August 12-17, 2007)
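The computationally cheap function mentioned at the end of this abstract is usually stated as rho = mu^2 / (2 sigma^2), where mu and sigma^2 are the mean and variance of the pairwise distances in the dataset. The following is a minimal sketch of that statistic, assuming Euclidean distance and synthetic Gaussian data; the function names and the test data are illustrative and do not come from the paper.

```python
import math
import random
from itertools import combinations
from statistics import mean, pvariance


def intrinsic_dimensionality(points, dist=math.dist):
    """Return mu^2 / (2 * sigma^2) over all pairwise distances."""
    distances = [dist(p, q) for p, q in combinations(points, 2)]
    mu = mean(distances)
    sigma_sq = pvariance(distances, mu)  # population variance of the distances
    return mu * mu / (2.0 * sigma_sq)


def gaussian_cloud(n_points, dim, rng):
    """Hypothetical test data: i.i.d. standard normal coordinates."""
    return [tuple(rng.gauss(0.0, 1.0) for _ in range(dim)) for _ in range(n_points)]


rng = random.Random(0)
rho_low = intrinsic_dimensionality(gaussian_cloud(200, 2, rng))
rho_high = intrinsic_dimensionality(gaussian_cloud(200, 32, rng))
print(f"rho for 2-d cloud:  {rho_low:.2f}")
print(f"rho for 32-d cloud: {rho_high:.2f}")
```

The statistic grows as distances concentrate around their mean, which is the behaviour associated with the curse of dimensionality. Computing it exactly needs all pairwise distances, so on large datasets it is typically estimated from a sample.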
Indexing Metric Spaces for Exact Similarity Search
With the continued digitalization of societal processes, we are seeing an
explosion in available data. This is referred to as big data. In a research
setting, three aspects of the data are often viewed as the main sources of
challenges when attempting to enable value creation from big data: volume,
velocity and variety. Many studies address volume or velocity, while far fewer
studies concern variety. A metric space is well suited to addressing variety
because it can accommodate any type of data as long as its associated distance
notion satisfies the triangle inequality. To accelerate search in metric space,
a collection of indexing techniques for metric data has been proposed.
However, existing surveys each offer only narrow coverage, and no
comprehensive empirical study of those techniques exists. We offer a survey of
all the existing metric indexes that can support exact similarity search, by i)
summarizing all the existing partitioning, pruning and validation techniques
used for metric indexes, ii) providing time and storage complexity analyses
of index construction, and iii) reporting on a comprehensive empirical
comparison of their similarity query processing performance. Empirical
comparisons are used to evaluate index performance during search because
complexity analysis alone reveals little difference in similarity query
processing, and query performance depends on the pruning and validation
abilities, which in turn depend on the data distribution. This article aims to
reveal the strengths and weaknesses of different indexing techniques in order to
offer guidance on selecting an appropriate indexing technique for a given
setting and to direct future research on metric indexes.
HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces
Nearest neighbor searching of large databases in high-dimensional spaces is
inherently difficult due to the curse of dimensionality. A flavor of
approximation is, therefore, necessary to practically solve the problem of
nearest neighbor search. In this paper, we propose a novel yet simple indexing
scheme, HD-Index, to solve the problem of approximate k-nearest neighbor
queries in massive high-dimensional databases. HD-Index consists of a set of
novel hierarchical structures called RDB-trees built on Hilbert keys of
database objects. The leaves of the RDB-trees store distances of database
objects to reference objects, thereby allowing efficient pruning using distance
filters. In addition to the triangular inequality, we also use the Ptolemaic
inequality to produce better lower bounds. Experiments on massive (up to billion
scale) high-dimensional (up to 1000+ dimensions) datasets show that HD-Index is
effective, efficient, and scalable.
Comment: PVLDB 11(8):906-919, 201
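The distance filters described in this abstract can be illustrated with the two standard lower bounds on d(q, o) that follow from stored object-to-reference distances. The triangular bound uses one reference object p: |d(q,p) - d(o,p)| <= d(q,o). The Ptolemaic bound uses two reference objects p_i, p_j and holds in Euclidean space: |d(q,p_i) d(o,p_j) - d(q,p_j) d(o,p_i)| / d(p_i,p_j) <= d(q,o). The sketch below is a flat-array illustration under those assumptions; it does not model RDB-trees or Hilbert keys, and all names, the random data, and the radius are hypothetical.

```python
import math
import random
from itertools import combinations


def triangle_lower_bound(dq, do):
    """Best triangular bound over all reference objects."""
    return max(abs(a - b) for a, b in zip(dq, do))


def ptolemaic_lower_bound(dq, do, dref):
    """Best Ptolemaic bound over all pairs of reference objects."""
    best = 0.0
    for i, j in combinations(range(len(dq)), 2):
        if dref[i][j] > 0.0:
            best = max(best, abs(dq[i] * do[j] - dq[j] * do[i]) / dref[i][j])
    return best


def random_point(rng, dim):
    return tuple(rng.uniform(-1.0, 1.0) for _ in range(dim))


rng = random.Random(1)
dim, n_refs = 16, 4
refs = [random_point(rng, dim) for _ in range(n_refs)]
objects = [random_point(rng, dim) for _ in range(500)]
query = random_point(rng, dim)

# Distances that would be precomputed and stored alongside each object.
dref = [[math.dist(a, b) for b in refs] for a in refs]
stored = [[math.dist(o, p) for p in refs] for o in objects]
dq = [math.dist(query, p) for p in refs]

radius = 2.5
survivors = []
for o, do in zip(objects, stored):
    bound = max(triangle_lower_bound(dq, do), ptolemaic_lower_bound(dq, do, dref))
    if bound <= radius:  # cannot be ruled out, so it needs an exact distance
        survivors.append(o)

answers = [o for o in survivors if math.dist(query, o) <= radius]
print(f"{len(survivors)} of {len(objects)} objects survive the filters")
```

Because both quantities are lower bounds, taking their maximum never discards a true answer; it only reduces the number of exact distance computations.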
QDR-Tree: An Efficient Index Scheme for Complex Spatial Keyword Query
With the popularity of mobile devices and the development of geo-positioning
technology, location-based services (LBS) attract much attention and top-k
spatial keyword queries become increasingly complex. For example, a client may
issue a query for a restaurant that serves pizza and steak and is, in
particular, low in price and noise level. However, most prior work focused only
on spatial keywords while ignoring such independent numerical attributes.
In this paper we demonstrate, for the first time, the Attributes-Aware Spatial
Keyword Query (ASKQ), and devise a two-layer hybrid index structure called
Quad-cluster Dual-filtering R-Tree (QDR-Tree). In the keyword cluster layer, a
Quad-Cluster Tree (QC-Tree) is built based on the hierarchical clustering
algorithm using kernel k-means to classify keywords. In the spatial layer, for
each leaf node of the QC-Tree, we attach a Dual-Filtering R-Tree (DR-Tree) with
two filtering algorithms, namely, keyword bitmap-based and attributes
skyline-based filtering. Accordingly, efficient query processing algorithms are
proposed. Through theoretical analysis, we verify the improvement in both
processing time and space consumption. Finally, extensive experiments with
real data demonstrate the efficiency and effectiveness of QDR-Tree.
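The two filtering ideas named in this abstract can be illustrated in isolation. The sketch below assumes that keyword filtering tests whether an object's keyword bitmap covers the query's bitmap, and that skyline filtering keeps only objects not dominated on their numerical attributes (lower is better). These are generic formulations, not the paper's algorithms; the vocabulary, the `Place` type, and the sample data are hypothetical, and the spatial R-tree layer is omitted.

```python
from dataclasses import dataclass

# Hypothetical keyword vocabulary: each keyword owns one bit position.
VOCAB = {"pizza": 0, "steak": 1, "sushi": 2}


def to_bitmap(keywords):
    mask = 0
    for word in keywords:
        mask |= 1 << VOCAB[word]
    return mask


@dataclass(frozen=True)
class Place:
    name: str
    keywords: int   # keyword bitmap
    attrs: tuple    # (price, noise level), lower is better


def covers(place, query_mask):
    """Keyword filter: the place must contain every query keyword."""
    return (place.keywords & query_mask) == query_mask


def dominates(a, b):
    """a dominates b if it is no worse everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(places):
    """Attribute filter: drop every place dominated by another one."""
    return [p for p in places
            if not any(dominates(q.attrs, p.attrs) for q in places)]


places = [
    Place("A", to_bitmap(["pizza", "steak"]), (10, 3)),
    Place("B", to_bitmap(["pizza", "steak"]), (12, 5)),
    Place("C", to_bitmap(["pizza"]), (5, 1)),
    Place("D", to_bitmap(["pizza", "steak", "sushi"]), (8, 6)),
]

query_mask = to_bitmap(["pizza", "steak"])
matching = [p for p in places if covers(p, query_mask)]
result = [p.name for p in skyline(matching)]
print(result)
```

Here C fails the keyword test, and B is dominated by A on both price and noise, so only A and D remain as candidates.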
Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach
Finding joinable tables in data lakes is a key procedure in many applications
such as data integration, data augmentation, data analysis, and data market.
Traditional approaches that find equi-joinable tables are unable to deal with
misspellings and different formats, nor do they capture any semantic joins. In
this paper, we propose PEXESO, a framework for joinable table discovery in data
lakes. We embed textual values as high-dimensional vectors and join columns
under similarity predicates on high-dimensional vectors, thereby addressing the
limitations of equi-join approaches and identifying more meaningful results. To
efficiently find joinable tables with similarity, we propose a block-and-verify
method that utilizes pivot-based filtering. A partitioning technique is
developed to cope with the case when the data lake is large and the index
cannot fit in main memory. An experimental evaluation on real datasets shows
that our solution identifies substantially more tables than equi-joins and
outperforms other similarity-based options, and the join results are useful in
data enrichment for machine learning tasks. The experiments also demonstrate
the efficiency of the proposed method.
Comment: Full version of paper in ICDE 202
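As a rough illustration of joining columns under a similarity predicate with pivot-based filtering, the sketch below scores a query column against a target column. It assumes that a query vector "matches" when some target vector lies within Euclidean distance tau, and that joinability is the fraction of query vectors with a match. Both definitions, the single pivot, and the toy 2-d vectors are assumptions for illustration; the block-and-verify method and the partitioning technique of the paper are not reproduced.

```python
import math


def joinability(query_col, target_col, tau, pivot):
    """Return (fraction of query vectors with a match, exact distances computed)."""
    # Target-to-pivot distances would be precomputed when the index is built.
    target_to_pivot = [math.dist(x, pivot) for x in target_col]
    matched = 0
    verified = 0
    for q in query_col:
        dqp = math.dist(q, pivot)
        for x, dxp in zip(target_col, target_to_pivot):
            # Pivot filter: |d(q,p) - d(x,p)| > tau implies d(q,x) > tau.
            if abs(dqp - dxp) > tau:
                continue
            verified += 1
            if math.dist(q, x) <= tau:
                matched += 1
                break  # one match is enough for this query vector
    return matched / len(query_col), verified


# Toy embeddings standing in for embedded textual values.
query_col = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
target_col = [(0.1, 0.0), (1.0, 0.1), (9.0, 9.0)]

score, verified = joinability(query_col, target_col, tau=0.2, pivot=(0.0, 0.0))
print(f"joinability = {score:.3f}, exact distance checks = {verified}")
```

In this toy case the filter skips seven of the nine candidate pairs and only two exact distances are computed. In practice several pivots would be used, since each one tightens the filter.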