    Graph set data mining

    Graphs are among the most versatile abstract data types in computer science. With this versatility comes wide adoption in application fields such as chemistry, biology, social analysis, logistics, and computer science itself. With the growing capacities of digital storage, the collection of large amounts of data has become the norm in many application fields. Data mining, i.e., the automated extraction of non-trivial patterns from data, is a key step in extracting knowledge from these datasets and generating value. This thesis is dedicated to concurrent scalable data mining algorithms beyond traditional notions of efficiency for large-scale datasets of small labeled graphs; more precisely, structural clustering and representative subgraph pattern mining. It is motivated by, but not limited to, the need to analyze molecular libraries of ever-increasing size in the drug discovery process. Structural clustering makes use of graph theoretical concepts, such as (common) subgraph isomorphisms and frequent subgraphs, to model cluster commonalities directly in the application domain. It is considered computationally demanding for non-restricted graph classes, and with very few exceptions, prior algorithms are only suitable for very small datasets. This thesis discusses the first truly scalable structural clustering algorithm, StruClus, with linear worst-case complexity. At the same time, StruClus embraces the inherent values of structural clustering algorithms, i.e., interpretable, consistent, and high-quality results. A novel two-fold sampling strategy with stochastic error bounds for frequent subgraph mining is presented. It enables fast extraction of cluster commonalities in the form of common subgraph representative sets. StruClus is the first structural clustering algorithm with a directed selection of structural cluster-representative patterns regarding homogeneity and separation aspects in the high-dimensional subgraph pattern space. 
Furthermore, a novel concept of cluster homogeneity balancing using dynamically-sized representatives is discussed. The second part of this thesis discusses the representative subgraph pattern mining problem in more general terms. A novel objective function maximizes the number of represented graphs for a cardinality-constrained representative set. It is shown that the problem is a special case of the maximum coverage problem and is NP-hard. Based on the greedy approximation of Nemhauser, Wolsey, and Fisher for submodular set function maximization, a novel sampling approach is presented. It mines candidate sets that contain an optimal greedy solution with a probabilistic maximum error. This leads to a constant-time algorithm to generate the candidate sets given a fixed-size sample of the dataset. In combination with a cheap single-pass streaming evaluation of the candidate sets, this enables scalability to datasets with billions of molecules on a single machine. Ultimately, the sampling approach leads to the first distributed subgraph pattern mining algorithm that distributes the pattern space and the dataset graphs at the same time.
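The greedy approximation for cardinality-constrained maximum coverage mentioned in this abstract can be illustrated with a minimal sketch. This is not the thesis's sampling algorithm: it assumes the candidate patterns and the set of graphs each one represents are already known, and the names `greedy_max_coverage`, `candidates`, and `k` are hypothetical. For monotone submodular objectives such as coverage, this greedy rule is known to achieve at least a (1 - 1/e) fraction of the optimal value.

```python
from typing import Dict, Hashable, List, Set


def greedy_max_coverage(candidates: Dict[Hashable, Set[int]], k: int) -> List[Hashable]:
    """Pick up to k candidates, each time taking the one that covers
    the most graphs not yet covered (ties go to the first candidate seen).

    `candidates` maps a pattern identifier to the set of graph ids it
    represents. Both names are illustrative assumptions, not from the thesis.
    """
    covered: Set[int] = set()
    chosen: List[Hashable] = []
    for _ in range(k):
        best = None
        best_gain = 0
        for name, graphs in candidates.items():
            if name in chosen:
                continue
            gain = len(graphs - covered)  # marginal coverage gain
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no remaining candidate adds coverage
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen


def coverage(candidates: Dict[Hashable, Set[int]], chosen: List[Hashable]) -> int:
    """Number of distinct graphs represented by the chosen patterns."""
    represented: Set[int] = set()
    for name in chosen:
        represented |= candidates[name]
    return len(represented)


if __name__ == "__main__":
    patterns = {"p1": {1, 2, 3}, "p2": {3, 4}, "p3": {4, 5, 6, 7}, "p4": {1}}
    picked = greedy_max_coverage(patterns, k=2)
    print(picked, coverage(patterns, picked))
```

Each round scans every candidate, so the sketch costs O(k · |candidates|) set operations; the abstract's contribution lies in restricting that candidate set by sampling, which is not modeled here.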

    Using behavioral context in process mining : exploration, preprocessing and analysis of event data

    Managing and analyzing phylogenetic databases

    The ever-growing availability of phylogenomic data makes it increasingly possible to study and analyze phylogenetic relationships across a wide range of species. Indeed, current phylogenetic analyses are now producing enormous collections of trees that vary greatly in size. Our proposed research addresses the challenges posed by storing, querying, and analyzing such phylogenetic databases. Our first contribution is the further development of STBase, a phylogenetic tree database consisting of a billion trees whose leaf sets range from four to 20000. STBase applies techniques from different areas of computer science for efficient tree storage and retrieval. It also introduces new ideas that are specific to tree databases. STBase provides a unique opportunity to explore innovative ways to analyze the results from queries on large sets of phylogenetic trees. We propose new ways of extracting consensus information from a collection of phylogenetic trees. Specifically, this involves extending the maximum agreement subtree problem. We greatly improve upon an existing approach based on frequent subtrees and propose two new approaches based on agreement subtrees and frequent subtrees, respectively. The final part of our proposed work deals with the problem of simplifying multi-labeled trees and handling rogue taxa. We propose a novel technique to extract conflict-free information from multi-labeled trees as a much smaller single-labeled tree. We show that the inherent problem in identifying rogue taxa is NP-hard and give fixed-parameter tractable and integer linear programming solutions.

    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
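The profiling tasks named in this abstract can be made concrete with a small sketch. It is a naive illustration rather than any method from the survey: rows are assumed to be plain Python dictionaries, `None` stands in for a null and is treated as an ordinary value in the multi-column checks, and the function names `profile_column`, `is_unique`, and `holds_fd` are hypothetical.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def profile_column(rows: Sequence[Row], column: str) -> Dict[str, Any]:
    """Single-column statistics: null count, distinct count, observed types."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": sorted({type(v).__name__ for v in non_null}),
    }


def is_unique(rows: Sequence[Row], columns: List[str]) -> bool:
    """True if no two rows agree on all of `columns`
    (i.e., `columns` is a unique column combination)."""
    seen = set()
    for row in rows:
        key = tuple(row.get(c) for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows: Sequence[Row], lhs: List[str], rhs: str) -> bool:
    """True if the functional dependency lhs -> rhs holds:
    rows that agree on `lhs` also agree on `rhs`."""
    mapping: Dict[tuple, Any] = {}
    for row in rows:
        key = tuple(row.get(c) for c in lhs)
        value = row.get(rhs)
        # setdefault stores the first value seen for this key and returns it
        if mapping.setdefault(key, value) != value:
            return False
    return True


if __name__ == "__main__":
    table = [
        {"id": 1, "zip": "10115", "city": "Berlin"},
        {"id": 2, "zip": "10115", "city": "Berlin"},
        {"id": 3, "zip": "80331", "city": None},
    ]
    print(profile_column(table, "city"))
    print(is_unique(table, ["id"]), holds_fd(table, ["zip"], "city"))
```

Checking one candidate is a single pass over the rows; the hard part of dependency discovery is that the number of candidate column combinations grows exponentially with the number of columns, which this sketch does not address.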

    Refinement of Interval Approximations for Fully Commutative Quivers

    A fundamental challenge in multiparameter persistent homology is the absence of a complete and discrete invariant. To address this issue, we propose an enhanced framework that realizes a holistic understanding of a fully commutative quiver's representation by synthesizing interpretations obtained from intervals. Additionally, it provides a mechanism to tune the balance between approximation resolution and computational complexity. This framework is evaluated on commutative ladders of both finite-type and infinite-type. For the former, we discover an efficient method for the indecomposable decomposition leveraging solely one-parameter persistent homology. For the latter, we introduce a new invariant that reveals persistence in the second parameter by connecting two standard persistence diagrams using interval approximations. We subsequently present several models for constructing commutative ladder filtrations, offering fresh insights into random filtrations and demonstrating our toolkit's effectiveness in analyzing the topology of materials.

    Efficient Data Structures for Partial Orders, Range Modes, and Graph Cuts

    This thesis considers the study of data structures from the perspective of the theoretician, with a focus on simplicity and practicality. We consider both the time complexity and the space usage of proposed solutions. Topics discussed fall into three main categories: partial order representation, range modes, and graph cuts. We consider two problems in partial order representation. The first is a data structure to represent a lattice. A lattice is a partial order in which the elements larger than both of any two elements x and y are all larger than or equal to a single element z, known as the join of x and y; a similar condition holds for elements smaller than any two elements. Our data structure is the first correct solution that can simultaneously compute joins and the inverse meet operation in sublinear time while also using subquadratic space. The second is a data structure to support queries on a dynamic set of one-dimensional ordered data; that is, essentially any operation computable on a binary search tree. We develop a data structure that is able to interpolate between binary search trees and efficient priority queues, offering more efficient insertion times than the former when the query distribution is non-uniform. We also consider static and dynamic exact and approximate range mode. Given one-dimensional data, the range mode problem is to compute the mode of a subinterval of the data. In the dynamic range mode problem, insertions and deletions are permitted. For the approximate problem, the element returned must have frequency within a factor (1+epsilon) of that of the true mode, for some epsilon > 0. Our results include a linear-space dynamic exact range mode data structure that simultaneously improves on the best previous operation complexity and an exact dynamic range mode data structure that breaks the Theta(n^(2/3)) time per operation barrier. 
For approximate range mode, we develop a static succinct data structure offering a logarithmic-factor space improvement and give the first dynamic approximate range mode data structure. We also consider approximate range selection. The final category discussed is graph and dynamic graph algorithms. We develop an optimal offline data structure for dynamic 2- and 3-edge and vertex connectivity. Here, the data structure is given the entire sequence of operations in advance, and the dynamic operations are edge insertion and removal. Finally, we give a simplification of Karger's near-linear time minimum cut algorithm, utilizing heavy-light decomposition and iteration in place of dynamic programming in the subroutine to find a minimum cut of a graph G that cuts at most two edges of a spanning tree T of G.
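To pin down the range mode problem stated in this abstract, here is a naive baseline sketch. It is not one of the thesis's data structures: it recounts the queried subinterval on every call, so each query takes time linear in the interval length. The function names are hypothetical, and the approximation check assumes the reading that a returned element is acceptable when its frequency times (1 + epsilon) is at least the true mode's frequency.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def range_mode(data: Sequence[Hashable], lo: int, hi: int) -> Tuple[Hashable, int]:
    """Return (element, frequency) of a most frequent element in
    data[lo..hi], inclusive. Ties go to the element that appears first
    in the interval. Assumes 0 <= lo <= hi < len(data)."""
    counts = Counter(data[lo:hi + 1])
    return max(counts.items(), key=lambda kv: kv[1])


def is_valid_approximation(
    data: Sequence[Hashable], lo: int, hi: int, candidate: Hashable, epsilon: float
) -> bool:
    """Check whether `candidate` is an acceptable (1 + epsilon)-approximate
    mode of data[lo..hi], under the assumed reading described above."""
    counts = Counter(data[lo:hi + 1])
    true_frequency = max(counts.values())
    return counts[candidate] * (1 + epsilon) >= true_frequency


if __name__ == "__main__":
    values = [1, 2, 2, 3, 3, 3, 2]
    print(range_mode(values, 3, 5))
    print(is_valid_approximation(values, 0, 6, 1, 0.5))
```

A baseline like this is mainly useful as a correctness oracle when testing a faster structure against random queries.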