5 research outputs found

    High Performance Frequent Subgraph Mining on Transactional Datasets

    Get PDF
    Graph data mining has been a crucial as well as inevitable area of research. Large amounts of graph data are produced in many areas, such as Bioinformatics, Cheminformatics, Social Networks, and Web etc. Scalable graph data mining methods are getting increasingly popular and necessary due to increased graph complexities. Frequent subgraph mining is one such area where the task is to find overly recurring patterns/subgraphs. To tackle this problem, many main memory-based methods were proposed, which proved to be inefficient as the data size grew exponentially over time. In the past few years several research groups have attempted to handle the frequent subgraph mining (FSM) problem in multiple ways. Many authors have tried to achieve better performance using Graphic Processing Units (GPUs) which has multi-fold improvement over in-memory while dealing with large datasets. Later, Google\u27s MapReduce model with the Hadoop framework proved to be a major breakthrough in high performance large batch processing. Although MapReduce came with many benefits, its disk I/O and non-iterative style model could not help much for FSM domain since subgraph mining process is an iterative approach. In recent years, Spark has emerged to be the De Facto industry standard with its distributed in-memory computing capability. This is a right fit solution for iterative style of programming as well. In this work, we cover how high-performance computing has helped in improving the performance tremendously in the transactional directed and undirected aspect of graphs and performance comparisons of various FSM techniques are done based on experimental results

    Parallele Graph-Mining-Techniken zur Auswertung von Hypergraph-Strukturen

    Get PDF
    Neben dem Feld Data-Mining nimmt auch das Graph-Mining eine immer zentralere Stellung in Forschung und Wirtschaft ein. Durch die Speicherung der Daten als Graph ergeben sich neue Möglichkeiten in der Datenanalyse. Diese Arbeit beschäftigt sich mit eben diesen Algorithmen zur Auswertung von Graph- und Hypergraph-Strukturen. Neben den verschiedenen Algorithmen des Graph-Mining, wird zusätzlich auch die Parallelisierbarkeit dieser untersucht. Mit PaSiGraM wird ein eigener Algorithmus vorgestellt, der es ermöglicht, die für das Frequent-Subgraph-Mining benötigten Berechnungen zu parallelisieren

    Graph set data mining

    Get PDF
    Graphs are among the most versatile abstract data types in computer science. With the variety comes great adoption in various application fields, such as chemistry, biology, social analysis, logistics, and computer science itself. With the growing capacities of digital storage, the collection of large amounts of data has become the norm in many application fields. Data mining, i.e., the automated extraction of non-trivial patterns from data, is a key step to extract knowledge from these datasets and generate value. This thesis is dedicated to concurrent scalable data mining algorithms beyond traditional notions of efficiency for large-scale datasets of small labeled graphs; more precisely, structural clustering and representative subgraph pattern mining. It is motivated by, but not limited to, the need to analyze molecular libraries of ever-increasing size in the drug discovery process. Structural clustering makes use of graph theoretical concepts, such as (common) subgraph isomorphisms and frequent subgraphs, to model cluster commonalities directly in the application domain. It is considered computationally demanding for non-restricted graph classes and with very few exceptions prior algorithms are only suitable for very small datasets. This thesis discusses the first truly scalable structural clustering algorithm StruClus with linear worst-case complexity. At the same time, StruClus embraces the inherent values of structural clustering algorithms, i.e., interpretable, consistent, and high-quality results. A novel two-fold sampling strategy with stochastic error bounds for frequent subgraph mining is presented. It enables fast extraction of cluster commonalities in the form of common subgraph representative sets. StruClus is the first structural clustering algorithm with a directed selection of structural cluster-representative patterns regarding homogeneity and separation aspects in the high-dimensional subgraph pattern space. Furthermore, a novel concept of cluster homogeneity balancing using dynamically-sized representatives is discussed. The second part of this thesis discusses the representative subgraph pattern mining problem in more general terms. A novel objective function maximizes the number of represented graphs for a cardinality-constrained representative set. It is shown that the problem is a special case of the maximum coverage problem and is NP-hard. Based on the greedy approximation of Nemhauser, Wolsey, and Fisher for submodular set function maximization a novel sampling approach is presented. It mines candidate sets that contain an optimal greedy solution with a probabilistic maximum error. This leads to a constant-time algorithm to generate the candidate sets given a fixed-size sample of the dataset. In combination with a cheap single-pass streaming evaluation of the candidate sets, this enables scalability to datasets with billions of molecules on a single machine. Ultimately, the sampling approach leads to the first distributed subgraph pattern mining algorithm that distributes the pattern space and the dataset graphs at the same time

    Efficient Frequent Subtree Mining Beyond Forests

    Get PDF
    A common paradigm in distance-based learning is to embed the instance space into some appropriately chosen feature space equipped with a metric and to define the dissimilarity between instances by the distance of their images in the feature space. If the instances are graphs, then frequent connected subgraphs are a well-suited pattern language to define such feature spaces. Identifying the set of frequent connected subgraphs and subsequently computing embeddings for graph instances, however, is computationally intractable. As a result, existing frequent subgraph mining algorithms either restrict the structural complexity of the instance graphs or require exponential delay between the output of subsequent patterns. Hence distance-based learners lack an efficient way to operate on arbitrary graph data. To resolve this problem, in this thesis we present a mining system that gives up the demand on the completeness of the pattern set to instead guarantee a polynomial delay between subsequent patterns. Complementing this, we devise efficient methods to compute the embedding of arbitrary graphs into the Hamming space spanned by our pattern set. As a result, we present a system that allows to efficiently apply distance-based learning methods to arbitrary graph databases. To overcome the computational intractability of the mining step, we consider only frequent subtrees for arbitrary graph databases. This restriction alone, however, does not suffice to make the problem tractable. We reduce the mining problem from arbitrary graphs to forests by replacing each graph by a polynomially sized forest obtained from a random sample of its spanning trees. This results in an incomplete mining algorithm. However, we prove that the probability of missing a frequent subtree pattern is low. We show empirically that this is true in practice even for very small sized forests. As a result, our algorithm is able to mine frequent subtrees in a range of graph databases where state-of-the-art exact frequent subgraph mining systems fail to produce patterns in reasonable time or even at all. Furthermore, the predictive performance of our patterns is comparable to that of exact frequent connected subgraphs, where available. The above method considers polynomially many spanning trees for the forest, while many graphs have exponentially many spanning trees. The number of patterns found by our mining algorithm can be negatively influenced by this exponential gap. We hence propose a method that can (implicitly) consider forests of exponential size, while remaining computationally tractable. This results in a higher recall for our incomplete mining algorithm. Furthermore, the methods extend the known positive results on the tractability of exact frequent subtree mining to a novel class of transaction graphs. We conjecture that the next natural extension of our results to a larger transaction graph class is at least as difficult as proving whether P = NP, or not. Regarding the graph embedding step, we apply a similar strategy as in the mining step. We represent a novel graph by a forest of its spanning trees and decide whether the frequent trees from the mining step are subgraph isomorphic to this forest. As a result, the embedding computation has one-sided error with respect to the exact subgraph isomorphism test but is computationally tractable. Furthermore, we show that we can leverage a partial order on the pattern set. This structure can be used to reduce the runtime of the embedding computation dramatically. For the special case of Jaccard-similarity between graph embeddings, a further substantial reduction of runtime can be achieved using min-hashing. The Jaccard-distance can be approximated using small sketch vectors that can be computed fast, again using the partial order on the tree patterns
    corecore