42 research outputs found

Approximate Matching of Hierarchical Data

Author: Augsten Nikolaus
Publication venue: Aalborg Universitet
Publication date: 01/01/2008

KOIOS: Top-k Semantic Overlap Set Search

Author: Augsten Nikolaus
Mundra Pranay
Nargesian Fatemeh
Zhang Jianhao
Publication date: 20/04/2023

We study the top-k set similarity search problem using semantic overlap. While vanilla overlap requires exact matches between set elements, semantic overlap allows elements that are syntactically different but semantically related to increase the overlap. The semantic overlap is the maximum matching score of a bipartite graph, where the weight of an edge between two set elements is given by a user-defined similarity function, e.g., cosine similarity between embeddings. Common techniques like token indexes fail for semantic search since similar elements may be unrelated at the character level. Further, verifying candidates is expensive (cubic versus linear for syntactic overlap), calling for highly selective filters. We propose KOIOS, the first exact and efficient algorithm for semantic overlap search. KOIOS leverages sophisticated filters to minimize the number of required graph-matching calculations. Our experiments show that, for medium to large sets, less than 5% of the candidate sets need verification, and more than half of those sets are further pruned without requiring the expensive graph matching. We show the efficiency of our algorithm on four real datasets and demonstrate the improved result quality of semantic over vanilla set similarity search.
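To make the definition of semantic overlap concrete, the following is a minimal sketch, not the KOIOS algorithm itself. It computes the maximum matching score by brute force over all assignments, which is only feasible for tiny sets. The function names, the `RELATED` table, and the similarity values are hypothetical, and the similarity function is assumed to be symmetric and non-negative.

```python
from itertools import permutations

# Hypothetical similarity scores for semantically related pairs (illustration only).
RELATED = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("nyc", "new york")): 0.8,
}

def toy_sim(x, y):
    """Toy user-defined similarity: 1.0 for equal elements, table lookup otherwise."""
    if x == y:
        return 1.0
    return RELATED.get(frozenset((x, y)), 0.0)

def semantic_overlap(a, b, sim):
    """Maximum-weight bipartite matching score between sets a and b.

    Brute force: try every one-to-one assignment of the smaller set into the
    larger one. Assumes sim is symmetric and non-negative.
    """
    small, large = sorted((list(a), list(b)), key=len)
    best = 0.0
    for perm in permutations(large, len(small)):
        score = sum(sim(x, y) for x, y in zip(small, perm))
        best = max(best, score)
    return best

query = {"car", "nyc", "red"}
candidate = {"automobile", "new york", "red", "blue"}

print(len(query & candidate))                                 # vanilla overlap: 1
print(round(semantic_overlap(query, candidate, toy_sim), 2))  # semantic overlap: 2.7
```

The vanilla overlap counts only the exact match on "red", while the semantic overlap also credits the two related pairs. A practical implementation would replace the brute-force loop with a polynomial-time matching algorithm, which is the cubic verification cost the abstract refers to.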

arXiv.org e-Print Archive

Speeding Up Reachability Queries in Public Transport Networks Using Graph Partitioning

Author: Augsten Nikolaus
Böhlen Michael H.
Jensen Christian S.
Pawlik Mateusz
Tesfaye Bezaye
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2022

Computing path queries such as the shortest path in public transport networks is challenging because the path costs between nodes change over time. A reachability query from a node at a given start time on such a network retrieves all points of interest (POIs) that are reachable within a given cost budget. Reachability queries are essential building blocks in many applications, for example, group recommendations, ranking spatial queries, or geomarketing. We propose an efficient solution for reachability queries in public transport networks. Currently, there are two options to solve reachability queries. (1) Execute a modified version of Dijkstra’s algorithm that supports time-dependent edge traversal costs; this solution is slow since it must expand edge by edge and does not use an index. (2) Issue a separate path query for each single POI, i.e., a single reachability query requires answering many path queries. Neither of these solutions scales to large networks with many POIs. We propose a novel and lightweight reachability index. The key idea is to partition the network into cells. Then, in contrast to other approaches, we expand the network cell by cell. Empirical evaluations on synthetic and real-world networks confirm the efficiency and the effectiveness of our index-based reachability query solution.
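Baseline option (1) above can be sketched as follows. This is a minimal illustration of a time-dependent Dijkstra expansion, not the paper's index. The `timetable` edge model, the function names, and the example network are assumptions made for illustration, and the sketch assumes that leaving later never leads to an earlier arrival (FIFO edges), with time as the cost measure.

```python
import heapq
from bisect import bisect_left

def timetable(departures, duration):
    """Hypothetical edge model: board the next departure at or after time t."""
    deps = sorted(departures)
    def arrival(t):
        i = bisect_left(deps, t)
        # No further departure means the edge cannot be used any more.
        return deps[i] + duration if i < len(deps) else float("inf")
    return arrival

def reachable_pois(graph, source, start_time, budget, pois):
    """Return the POIs reachable from source within the time budget.

    graph maps a node to a list of (neighbor, arrival_fn) pairs, where
    arrival_fn(t) is the arrival time at the neighbor when leaving at time t.
    """
    deadline = start_time + budget
    best = {source: start_time}      # earliest known arrival time per node
    heap = [(start_time, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if t > best.get(u, float("inf")):
            continue                 # stale heap entry
        for v, arrival in graph.get(u, []):
            t_v = arrival(t)
            if t_v <= deadline and t_v < best.get(v, float("inf")):
                best[v] = t_v
                heapq.heappush(heap, (t_v, v))
    return {p for p in pois if p in best}

# Hypothetical network: A -> B departs at 0, 10, 20 (5 minutes); B -> C at 8, 30 (4 minutes).
network = {
    "A": [("B", timetable([0, 10, 20], 5))],
    "B": [("C", timetable([8, 30], 4))],
}

print(sorted(reachable_pois(network, "A", 2, 15, {"B", "C"})))  # ['B']
print(sorted(reachable_pois(network, "A", 2, 40, {"B", "C"})))  # ['B', 'C']
```

The expansion visits the network edge by edge, which is the cost the abstract attributes to this baseline.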

PubMed Central

VBN

ZORA

FINEX: A Fast Index for Exact & Flexible Density-Based Clustering (Extended Version with Proofs)

Author: Augsten Nikolaus
Hütter Thomas
Kocher Daniel
Mann Willi
Schmitt Daniel Ulrich
Thiel Konstantin Emil
Publication date: 10/04/2023

Density-based clustering aims to find groups of similar objects (i.e., clusters) in a given dataset. Applications include, e.g., process mining and anomaly detection. Density-based clustering comes with two user parameters (ε, MinPts) that determine the clustering result, but are typically unknown in advance. Thus, users need to interactively test various settings until satisfying clusterings are found. However, existing solutions suffer from the following limitations: (a) Ineffective pruning of expensive neighborhood computations. (b) Approximate clustering, where objects are falsely labeled noise. (c) Restricted parameter tuning that is limited to ε whereas MinPts is constant, which reduces the explorable clusterings. (d) Inflexibility in terms of applicable data types and distance functions. We propose FINEX, a linear-space index that overcomes these limitations. Our index provides exact clusterings and can be queried with either of the two parameters. FINEX avoids neighborhood computations where possible and reduces the complexities of the remaining computations by leveraging fundamental properties of density-based clusters. Hence, our solution is efficient and flexible regarding data types and distance functions. Moreover, FINEX respects the original and straightforward notion of density-based clustering. In our experiments on 12 large real-world datasets from various domains, FINEX frequently outperforms state-of-the-art techniques for exact clustering by orders of magnitude.
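The role of the two parameters can be illustrated with a minimal sketch of plain density-based clustering in the style of DBSCAN. It is not FINEX and uses no index: every neighborhood is computed by a full scan. The function name, the data, and the convention that a point counts as its own neighbor are assumptions made for illustration.

```python
NOISE = -1

def dbscan(points, eps, min_pts, dist):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    A point is a core point if at least min_pts points (itself included)
    lie within distance eps. Works with any user-supplied distance function.
    """
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs_j = neighbors(j)
            if len(nbrs_j) >= min_pts:   # core point: expand the cluster through it
                queue.extend(nbrs_j)
        cluster += 1
    return labels

data = [0.0, 0.5, 1.0, 10.0, 10.4, 10.8, 50.0]
abs_dist = lambda a, b: abs(a - b)

print(dbscan(data, eps=0.6, min_pts=2, dist=abs_dist))  # [0, 0, 0, 1, 1, 1, -1]
print(dbscan(data, eps=0.6, min_pts=4, dist=abs_dist))  # [-1, -1, -1, -1, -1, -1, -1]
```

Changing MinPts from 2 to 4 turns two clusters into pure noise, which shows why users typically need to explore both parameters interactively.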

arXiv.org e-Print Archive

A Roadmap towards Declarative Similarity Queries

Author: Augsten Nikolaus
Publication venue: Proceedings of the 21st International Conference on Extending Database Technology (EDBT)

Despite many research efforts, similarity queries are still poorly supported by current systems. We analyze the mainstream research in processing similarity queries and argue that a general-purpose query processor for similarity queries is required. We identify three goals for the evaluation of similarity queries (declarative, efficient, combinable) and identify the main research challenges that must be solved to achieve these goals.

eplus (Univ. of Salzburg)

Preserving Contextual Information in Relational Matrix Operations

Author: Augsten Nikolaus
Böhlen Michael Hanspeter
Dolmatova Oksana
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 24/04/2020

Large amounts of numerical data are stored in databases and must be analyzed. Database tables come with a schema and include non-numerical attributes; this is crucial contextual information that is needed for interpreting the numerical values. We propose relational matrix operations that support the analysis of data stored in tables and that preserve contextual information. The results of our approach are precisely defined relational matrix operations and a system implementation in MonetDB that illustrates the seamless integration of relational matrix operations into a relational DBMS.

Crossref

ZORA