34 research outputs found

    A survey of frequent subgraph mining algorithms

    Get PDF
    AbstractGraph mining is an important research area within the domain of data mining. The field of study concentrates on the identification of frequent subgraphs within graph data sets. The research goals are directed at: (i) effective mechanisms for generating candidate subgraphs (without generating duplicates) and (ii) how best to process the generated candidate subgraphs so as to identify the desired frequent subgraphs in a way that is computationally efficient and procedurally effective. This paper presents a survey of current research in the field of frequent subgraph mining and proposes solutions to address the main research issues.</jats:p

    Is Frequent Pattern Mining useful in building predictive models?

    Get PDF
    Abstract. The recent studies of pattern mining have given more attention to discovering patterns that are interesting, significant, discriminative and so forth, than simply frequent. Does this imply that the frequent patterns are not useful anymore? In this paper we carry out a survey of frequent pattern mining and, using an empirical study , show how far the frequent pattern mining is useful in building predictive models

    Frequent subgraph mining algorithms on weighted graphs

    Get PDF
    This thesis describes research work undertaken in the field of graph-based knowledge discovery (or graph mining). The objective of the research is to investigate the benefits that the concept of weighted frequent subgraph mining can offer in the context of the graph model based classification. Weighted subgraphs are graphs where some of the vertexes/edges are considered to be more significant than others. How to discover frequent sub-structures with different strengths is the main issue to be resolved in this thesis. The main approach to addressing this issue is to integrate weight constraints into the frequent subgraph mining process. It is suggested that the utilization of weighted frequent subgraph mining generates more discriminate and significant subgraphs, which will have application in, for example, the classification and clustering of graph data

    A graph-based knowledge representation and pattern mining supporting the Digital Twin creation of existing manufacturing systems

    Full text link
    The creation of a Digital Twin for existing manufacturing systems, so-called brownfield systems, is a challenging task due to the needed expert knowledge about the structure of brownfield systems and the effort to realize the digital models. Several approaches and methods have already been proposed that at least partially digitalize the information about a brownfield manufacturing system. A Digital Twin requires linked information from multiple sources. This paper presents a graph-based approach to merge information from heterogeneous sources. Furthermore, the approach provides a way to automatically identify templates using graph structure analysis to facilitate further work with the resulting Digital Twin and its further enhancement.Comment: 4 pages, 3 figures. Accepted at IEEE ETFA 202

    Significant Subgraph Mining with Multiple Testing Correction

    Full text link
    The problem of finding itemsets that are statistically significantly enriched in a class of transactions is complicated by the need to correct for multiple hypothesis testing. Pruning untestable hypotheses was recently proposed as a strategy for this task of significant itemset mining. It was shown to lead to greater statistical power, the discovery of more truly significant itemsets, than the standard Bonferroni correction on real-world datasets. An open question, however, is whether this strategy of excluding untestable hypotheses also leads to greater statistical power in subgraph mining, in which the number of hypotheses is much larger than in itemset mining. Here we answer this question by an empirical investigation on eight popular graph benchmark datasets. We propose a new efficient search strategy, which always returns the same solution as the state-of-the-art approach and is approximately two orders of magnitude faster. Moreover, we exploit the dependence between subgraphs by considering the effective number of tests and thereby further increase the statistical power.Comment: 18 pages, 5 figure, accepted to the 2015 SIAM International Conference on Data Mining (SDM15

    Effiziente Prozessmodellanalyse mit Algorithmen der Subgraphisomorphie

    Get PDF
    In der Literatur existiert eine Vielzahl verschiedener Ansätze, um Prozessmodelle strukturell zu analysieren. Ein Unterproblem, das oft in vielen dieser Ansätze auftritt, ist die Identifikation von (häufig auftretenden) Subgraphen innerhalb der Modellgraphen. Um diese Problemstellung zu lösen, können graphentheoretische Algorithmen genutzt werden. Der vorliegende Artikel demonstriert, dass derartige Algorithmen in der Lage sind, große Mengen von Prozessmodellen innerhalb von (Milli-)Sekunden zu analysieren. Sie können folglich als Unterkomponente in bestehende Analyseansätze integriert werden, um (potenziell aufwändigere) Eigenentwicklungen zu ersetzen. Der Vorteil dieser Algorithmen liegt in ihrer breiten, nicht auf konkrete Modellierungssprachen oder Analysezwecke beschränkten Anwendbarkeit

    GraphMDL : sélection de motifs de graphes avec le principe MDL

    Get PDF
    International audienceMany graph pattern mining algorithms have been designed to identify recurring structures in graphs. The main drawback of these approaches is that they often extract too many patterns for human analysis. Recently, pattern mining methods using the Minimum Description Length (MDL) principle have been proposed to select a characteristic subset of patterns from transactional, sequential and relational data. In this paper, we propose a MDL-based approach for selecting a characteristic subset of patterns on labeled graphs. A key notion in this paper is the introduction of ports to encode connections between pattern occurrences without any loss of information. Experiments show that the number of patterns is drastically reduced, and the selected patterns can have complex shapes.Plusieurs algorithmes de fouille de motifs ont été proposés pour iden-tifier des structures récurrentes dans les graphes. Le principal défaut de ces ap-proches est qu'elles produisent généralement trop de motifs pour qu'une analyse humaine soit possible. Récemment, des méthodes de fouille de motifs ont traité ce problème sur des données transactionnelles, séquentielles et relationnelles en utilisant le principe MDL (Minimum Description Length). Dans ce papier, nous proposons une approche MDL pour sélectionner un sous-ensemble représentatif de motifs sur des graphes étiquetés. Une notion clé de notre approche est l'in-troduction de ports pour encoder les connections entre occurrences de motifs, sans perte d'information. Nos expériences montrent que le nombre de motifs est drastiquement réduit et que les motifs sélectionnés peuvent avoir des formes complexes

    FS^3: A Sampling based method for top-k Frequent Subgraph Mining

    Get PDF
    Mining labeled subgraph is a popular research task in data mining because of its potential application in many different scientific domains. All the existing methods for this task explicitly or implicitly solve the subgraph isomorphism task which is computationally expensive, so they suffer from the lack of scalability problem when the graphs in the input database are large. In this work, we propose FS^3, which is a sampling based method. It mines a small collection of subgraphs that are most frequent in the probabilistic sense. FS^3 performs a Markov Chain Monte Carlo (MCMC) sampling over the space of a fixed-size subgraphs such that the potentially frequent subgraphs are sampled more often. Besides, FS^3 is equipped with an innovative queue manager. It stores the sampled subgraph in a finite queue over the course of mining in such a manner that the top-k positions in the queue contain the most frequent subgraphs. Our experiments on database of large graphs show that FS^3 is efficient, and it obtains subgraphs that are the most frequent amongst the subgraphs of a given size

    Frequent Subgraph Mining via Sampling with Rigorous Guarantees

    Get PDF
    Frequent subgraph mining is a fundamental task in the analysis of collections of graphs that aims at finding all the subgraphs that appear with more than a user-specified frequency in the dataset. While several exact approaches have been proposed to solve the task, it remains computationally challenging on large graph datasets due to the complexity of the subgraph isomorphism problem inherent in the task and the huge number of candidate patterns even for fairly small subgraphs. In this thesis, we study two statistical learning measures of complexity, VC-dimension and Rademacher averages, for subgraphs, and derive efficiently computable bounds for both. We then show how such bounds can be applied to devise efficient sampling-based approaches for rigorously approximating the solutions of the frequent subgraph mining problem, providing sample sizes which are much tighter than what would be obtained by a straightforward application of Chernoff and union bounds. We also show that our bounds can be used for true frequent subgraph mining, which requires to identify subgraphs generated with probability above a given threshold using samples from an unknown generative process. Moreover, we carried out an extensive experimental evaluation of our methods on real datasets, which shows that our bounds lead to efficiently computable and high-quality approximations for both applications.Frequent subgraph mining is a fundamental task in the analysis of collections of graphs that aims at finding all the subgraphs that appear with more than a user-specified frequency in the dataset. While several exact approaches have been proposed to solve the task, it remains computationally challenging on large graph datasets due to the complexity of the subgraph isomorphism problem inherent in the task and the huge number of candidate patterns even for fairly small subgraphs. In this thesis, we study two statistical learning measures of complexity, VC-dimension and Rademacher averages, for subgraphs, and derive efficiently computable bounds for both. We then show how such bounds can be applied to devise efficient sampling-based approaches for rigorously approximating the solutions of the frequent subgraph mining problem, providing sample sizes which are much tighter than what would be obtained by a straightforward application of Chernoff and union bounds. We also show that our bounds can be used for true frequent subgraph mining, which requires to identify subgraphs generated with probability above a given threshold using samples from an unknown generative process. Moreover, we carried out an extensive experimental evaluation of our methods on real datasets, which shows that our bounds lead to efficiently computable and high-quality approximations for both applications
    corecore