KOIOS: Top-k Semantic Overlap Set Search
We study the top-k set similarity search problem using semantic overlap.
While vanilla overlap requires exact matches between set elements, semantic
overlap allows elements that are syntactically different but semantically
related to increase the overlap. The semantic overlap is the maximum matching
score of a bipartite graph, where an edge weight between two set elements is
defined by a user-defined similarity function, e.g., cosine similarity between
embeddings. Common techniques like token indexes fail for semantic search since
similar elements may be unrelated at the character level. Further, verifying
candidates is expensive (cubic versus linear for syntactic overlap), calling
for highly selective filters. We propose KOIOS, the first exact and efficient
algorithm for semantic overlap search. KOIOS leverages sophisticated filters to
minimize the number of required graph-matching calculations. Our experiments
show that for medium to large sets less than 5% of the candidate sets need
verification, and more than half of those sets are further pruned without
requiring the expensive graph matching. We show the efficiency of our algorithm
on four real datasets and demonstrate the improved result quality of semantic
over vanilla set similarity search.
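
To make the difference between vanilla and semantic overlap concrete, here is a minimal sketch. It is not the KOIOS algorithm: it computes the maximum bipartite matching score by brute force, which is only feasible for tiny sets, and the similarity table `TOY_SIM` and function `toy_sim` are hypothetical stand-ins for a real similarity function such as cosine similarity between embeddings.

```python
from itertools import permutations

def vanilla_overlap(a, b):
    """Number of elements that match exactly."""
    return len(set(a) & set(b))

def semantic_overlap(a, b, sim):
    """Maximum matching score of the bipartite graph between a and b.

    Brute force over all one-to-one assignments of the smaller set into
    the larger one; an exact verifier would use a polynomial-time
    matching algorithm instead.
    """
    small, large = sorted((list(a), list(b)), key=len)
    best = 0.0
    for perm in permutations(large, len(small)):
        score = sum(sim(x, y) for x, y in zip(small, perm))
        best = max(best, score)
    return best

# Hypothetical similarity scores (assumption, for illustration only).
TOY_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("bike", "bicycle")): 0.8,
}

def toy_sim(x, y):
    if x == y:
        return 1.0
    return TOY_SIM.get(frozenset((x, y)), 0.0)

query = {"car", "bike", "tree"}
candidate = {"automobile", "bicycle", "tree"}

print(vanilla_overlap(query, candidate))            # only "tree" matches exactly
print(semantic_overlap(query, candidate, toy_sim))  # related pairs also contribute
```

In this toy example only one element matches exactly, while the semantic overlap also credits the two related pairs, which is why a token index built on exact matches would miss the candidate.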
Speeding Up Reachability Queries in Public Transport Networks Using Graph Partitioning
Computing path queries such as the shortest path in public transport networks is challenging because the path costs between nodes change over time. A reachability query from a node at a given start time on such a network retrieves all points of interest (POIs) that are reachable within a given cost budget. Reachability queries are essential building blocks in many applications, for example, group recommendations, ranking spatial queries, or geomarketing. We propose an efficient solution for reachability queries in public transport networks. Currently, there are two options to solve reachability queries. (1) Execute a modified version of Dijkstra’s algorithm that supports time-dependent edge traversal costs; this solution is slow since it must expand edge by edge and does not use an index. (2) Issue a separate path query for each single POI, i.e., a single reachability query requires answering many path queries. Neither of these solutions scales to large networks with many POIs. We propose a novel and lightweight reachability index. The key idea is to partition the network into cells. Then, in contrast to other approaches, we expand the network cell by cell. Empirical evaluations on synthetic and real-world networks confirm the efficiency and the effectiveness of our index-based reachability query solution.
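
The baseline option (1) above can be sketched as follows. This is a minimal illustration of a time-dependent Dijkstra expansion, not the proposed cell-based index. It assumes that cost is travel time and that the network is given as a hypothetical timetable `connections` mapping each stop to `(next_stop, departure, arrival)` triples; all names and the toy data are assumptions.

```python
import heapq

def reachable_pois(connections, pois, source, start_time, budget):
    """Return {poi: earliest_arrival} for POIs reachable within the budget.

    connections: dict mapping a stop to a list of
                 (next_stop, departure_time, arrival_time) triples.
    A connection is usable only if it departs at or after the time
    at which we arrive at its origin stop.
    """
    deadline = start_time + budget
    earliest = {source: start_time}
    heap = [(start_time, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if t > earliest.get(u, float("inf")):
            continue  # stale queue entry
        for v, dep, arr in connections.get(u, []):
            if dep >= t and arr <= deadline and arr < earliest.get(v, float("inf")):
                earliest[v] = arr
                heapq.heappush(heap, (arr, v))
    return {p: earliest[p] for p in pois if p in earliest}

# Toy timetable (hypothetical): times are minutes after the query start.
connections = {
    "A": [("B", 5, 10), ("C", 0, 30)],
    "B": [("C", 12, 20), ("D", 8, 15)],
    "C": [("E", 25, 50)],
}
pois = {"C", "D", "E"}

print(reachable_pois(connections, pois, "A", start_time=0, budget=30))
```

The sketch expands the network connection by connection and uses no index, which is the scalability limitation the abstract attributes to this option.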
FINEX: A Fast Index for Exact & Flexible Density-Based Clustering (Extended Version with Proofs)
Density-based clustering aims to find groups of similar objects (i.e.,
clusters) in a given dataset. Applications include, e.g., process mining and
anomaly detection. It comes with two user parameters (ε, MinPts) that
determine the clustering result, but are typically unknown in advance. Thus,
users need to interactively test various settings until satisfying clusterings
are found. However, existing solutions suffer from the following limitations:
(a) Ineffective pruning of expensive neighborhood computations. (b) Approximate
clustering, where objects are falsely labeled noise. (c) Restricted parameter
tuning that is limited to ε whereas MinPts is constant, which reduces
the explorable clusterings. (d) Inflexibility in terms of applicable data types
and distance functions. We propose FINEX, a linear-space index that overcomes
these limitations. Our index provides exact clusterings and can be queried with
either of the two parameters. FINEX avoids neighborhood computations where
possible and reduces the complexities of the remaining computations by
leveraging fundamental properties of density-based clusters. Hence, our
solution is efficient and flexible regarding data types and distance functions.
Moreover, FINEX respects the original and straightforward notion of
density-based clustering. In our experiments on 12 large real-world datasets
from various domains, FINEX frequently outperforms state-of-the-art techniques
for exact clustering by orders of magnitude.
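
To show how the two parameters (ε, MinPts) determine the result, here is a minimal sketch of plain density-based clustering in the style of DBSCAN. It is not the FINEX index: it computes every neighborhood from scratch on each call, which is exactly the cost an index would try to avoid. The function name `dbscan`, the label convention (-1 for noise), and the toy data are assumptions for illustration.

```python
def dbscan(points, eps, min_pts, dist):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    A point is a core point if at least min_pts points (itself included)
    lie within distance eps. Works with any distance function.
    """
    n = len(points)
    # Quadratic neighborhood computation: the expensive part.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not yet visited
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, is not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                frontier.extend(neighbors[j])  # expand from core points only
        cluster += 1
    return labels

points = [1.0, 1.2, 1.4, 5.0, 5.1, 5.3, 9.0]
absdiff = lambda a, b: abs(a - b)

print(dbscan(points, eps=0.5, min_pts=3, dist=absdiff))
print(dbscan(points, eps=0.5, min_pts=4, dist=absdiff))
```

Changing MinPts from 3 to 4 turns both groups in this toy dataset into noise, which illustrates why users need to try several settings interactively.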
Proceedings of the 21st International Conference on Extending Database Technology (EDBT) / A Roadmap towards Declarative Similarity Queries
Despite many research efforts, similarity queries are still poorly supported by current systems. We analyze the mainstream research in processing similarity queries and argue that a general-purpose query processor for similarity queries is required. We identify three goals for the evaluation of similarity queries (declarative, efficient, combinable) and identify the main research challenges that must be solved to achieve these goals.
Preserving Contextual Information in Relational Matrix Operations
There exist large amounts of numerical data that are stored in databases and must be analyzed. Database tables come with a schema and include non-numerical attributes; this is crucial contextual information that is needed for interpreting the numerical values. We propose relational matrix operations that support the analysis of data stored in tables and that preserve contextual information. The results of our approach are precisely defined relational matrix operations and a system implementation in MonetDB that illustrates the seamless integration of relational matrix operations into a relational DBMS.