30 research outputs found

    FINEX: A Fast Index for Exact & Flexible Density-Based Clustering (Extended Version with Proofs)*

    Full text link
    Density-based clustering aims to find groups of similar objects (i.e., clusters) in a given dataset. Applications include, e.g., process mining and anomaly detection. It comes with two user parameters ({\epsilon}, MinPts) that determine the clustering result, but are typically unknown in advance. Thus, users need to interactively test various settings until satisfying clusterings are found. However, existing solutions suffer from the following limitations: (a) Ineffective pruning of expensive neighborhood computations. (b) Approximate clustering, where objects are falsely labeled noise. (c) Restricted parameter tuning that is limited to {\epsilon} whereas MinPts is constant, which reduces the explorable clusterings. (d) Inflexibility in terms of applicable data types and distance functions. We propose FINEX, a linear-space index that overcomes these limitations. Our index provides exact clusterings and can be queried with either of the two parameters. FINEX avoids neighborhood computations where possible and reduces the complexities of the remaining computations by leveraging fundamental properties of density-based clusters. Hence, our solution is effcient and flexible regarding data types and distance functions. Moreover, FINEX respects the original and straightforward notion of density-based clustering. In our experiments on 12 large real-world datasets from various domains, FINEX frequently outperforms state-of-the-art techniques for exact clustering by orders of magnitude

    Proceedings of the 21st International Conference on Extending Database Technology (EDBT) / A Roadmap towards Declarative Similarity Queries

    No full text
    Despite many research efforts, similarity queries are still poorly supported by current systems. We analyze the main stream research in processing similarity queries and argue that a general-purpose query processor for similarity queries is required. We identify three goals for the evaluation of similarity queries (declarative, efficient, combinable) and identify the main research challenges that must be solved to achieve these goals(VLID)441130

    Preserving Contextual Information in Relational Matrix Operations

    Full text link
    There exist large amounts of numerical data that are stored in databases and must be analyzed. Database tables come with a schema and include non-numerical attributes; this is crucial contextual information that is needed for interpreting the numerical values. We propose relational matrix operations that support the analysis of data stored in tables and that preserve contextual information. The result of our approach are precisely defined relational matrix operations and a system implementation in MonetDB that illustrates the seamless integration of relational matrix operations into a relational DBMS

    Efficient top-k approximate subtree matching in small memory

    Full text link
    We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree within a large document tree using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity, and, thus, do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper bound for the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results

    Grundlagen von Datenbanken 2016 / Reachability Queries in Public Transport Networks

    No full text
    Given a query point in a spatial network, the reachability query retrieves all points of interest that are reachable from the query point in a specific amount of time respecting net- work constraints. Reachability queries have a number of interesting applications, and for road networks efficient solutions have been proposed. Road networks are time-independent, i.e., the cost for traversing an edge is constant over time. Efficient algorithms for road networks heavily rely on pre- computing shortest paths. In public transport networks, however, the edge costs depend on schedules, which renders most solutions for road networks inefficient. In particular, shortest paths between node pairs cannot be easily precom- puted because they change over time. The goal of this work is to develop efficient solutions for reachability queries in public transport networks. The core idea is to partition the network into cells and compute time upper and lower bounds to traverse a cell. At query time, the reachable region is expanded cell by cell (instead of expanding edge by edge). All points of interest that are reachable using upper bound expansion are part of the result; all points that are not reachable in a lower bound expansion can safely be discarded; all other nodes are candidates and must be verified. This paper presents the expansion algorithm and and discusses interesting research directions regarding good network partitions, effective bounds, and efficient candidate verification.(VLID)441025

    ACM Transactions on Database Systems / Efficient Computation of the Tree Edit Distance

    No full text
    We consider the classical tree edit distance between ordered labelled trees, which is defined as the minimum-cost sequence of node edit operations that transform one tree into another. The state-of-the- art solutions for the tree edit distance are not satisfactory. The main competitors in the field either have optimal worst-case complexity, but the worst case happens frequently, or they are very efficient for some tree shapes, but degenerate for others. This leads to unpredictable and often infeasible runtimes. There is no obvious way to choose between the algorithms. In this paper we present RTED, a robust tree edit distance algorithm. The asymptotic complexity of our algorithm is smaller or equal to the complexity of the best competitors for any input instance, i.e., our algorithm is both efficient and worst-case optimal. This is achieved by computing a dynamic decomposition strategy that depends on the input trees. RTED is shown to be optimal among all algorithms that use LRH (Left-Right-Heavy) strategies, which include RTED and the fastest tree edit distance algorithms presented in literature. In our experiments on synthetic and real world data we empirically evaluate our solution and compare it to the state of the art.(VLID)448679

    Grundlagen von Datenbanken 2014 / PEL: Position-Enhanced Length Filter for Set Similarity Joins

    No full text
    Set similarity joins compute all pairs of similar sets from two collections of sets. Set similarity joins are typically implemented in a filter-verify framework: a filter generates candidate pairs, possibly including false positives, which must be verified to produce the final join result. Good filters produce a small number of false positives, while they reduce the time they spend on hopeless candidates. The best known algorithms generate candidates using the so-called prefix filter in conjunction with length- and position-based filters. In this paper we show that the potential of length and position have only partially been leveraged. We propose a new filter, the position-enhanced length filter, which exploits the matching position to incrementally tighten the length filter; our filter identifies hopeless candidates and avoids processing them. The filter is very efficient, requires no change in the data structures of most prefix filter algorithms, and is particularly effective for foreign joins, i.e., joins between two different collections of sets.(VLID)441022
    corecore