34 research outputs found
A survey of frequent subgraph mining algorithms
Graph mining is an important research area within the domain of data mining. The field of study concentrates on the identification of frequent subgraphs within graph data sets. The research goals are directed at: (i) effective mechanisms for generating candidate subgraphs (without generating duplicates) and (ii) how best to process the generated candidate subgraphs so as to identify the desired frequent subgraphs in a way that is computationally efficient and procedurally effective. This paper presents a survey of current research in the field of frequent subgraph mining and proposes solutions to address the main research issues.
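As a rough illustration of the two goals named in this abstract, the sketch below counts support for the smallest possible candidates, single labeled edges, and uses a canonical form so that the same undirected edge is never treated as two different candidates. The graph encoding (one list of `(vertex_label, edge_label, vertex_label)` triples per graph) and the function names are assumptions made for this example; they are not taken from the survey.

```python
from collections import Counter


def canonical_edge(label_u, edge_label, label_v):
    """Order the endpoint labels so (C, -, O) and (O, -, C) are one pattern."""
    a, b = sorted((label_u, label_v))
    return (a, edge_label, b)


def frequent_edges(graphs, min_support):
    """Return single-edge patterns occurring in at least min_support graphs.

    Support is counted per graph: an edge pattern that appears several
    times inside one graph still contributes only 1 to its support.
    """
    support = Counter()
    for edges in graphs:
        patterns_in_graph = {canonical_edge(*edge) for edge in edges}
        support.update(patterns_in_graph)
    return {p: s for p, s in support.items() if s >= min_support}


# Hypothetical toy database of three small labeled graphs.
database = [
    [("C", "-", "O"), ("O", "-", "C"), ("C", "=", "O")],
    [("O", "-", "C")],
    [("C", "-", "C")],
]
print(frequent_edges(database, min_support=2))
```

Larger candidates would be grown from these frequent edges, which is where duplicate-free generation and subgraph isomorphism tests become the dominant cost.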
Is Frequent Pattern Mining useful in building predictive models?
Recent studies of pattern mining have given more attention to discovering patterns that are interesting, significant, discriminative and so forth than to patterns that are simply frequent. Does this imply that frequent patterns are no longer useful? In this paper we carry out a survey of frequent pattern mining and, using an empirical study, show how far frequent pattern mining is useful in building predictive models.
Frequent subgraph mining algorithms on weighted graphs
This thesis describes research work undertaken in the field of graph-based knowledge
discovery (or graph mining). The objective of the research is to investigate the benefits
that the concept of weighted frequent subgraph mining can offer in the context of
graph-model-based classification. Weighted subgraphs are graphs where some of the
vertices/edges are considered to be more significant than others. How to discover
frequent sub-structures with different strengths is the main issue to be resolved in this
thesis. The main approach to addressing this issue is to integrate weight constraints into
the frequent subgraph mining process. It is suggested that the utilization of weighted
frequent subgraph mining generates more discriminative and significant subgraphs, which
will have application in, for example, the classification and clustering of graph data.
A graph-based knowledge representation and pattern mining supporting the Digital Twin creation of existing manufacturing systems
The creation of a Digital Twin for existing manufacturing systems, so-called
brownfield systems, is a challenging task due to the needed expert knowledge
about the structure of brownfield systems and the effort to realize the digital
models. Several approaches and methods have already been proposed that at least
partially digitalize the information about a brownfield manufacturing system. A
Digital Twin requires linked information from multiple sources. This paper
presents a graph-based approach to merge information from heterogeneous
sources. Furthermore, the approach provides a way to automatically identify
templates using graph structure analysis to facilitate further work with the
resulting Digital Twin and its further enhancement.
Comment: 4 pages, 3 figures. Accepted at IEEE ETFA 202
Significant Subgraph Mining with Multiple Testing Correction
The problem of finding itemsets that are statistically significantly enriched
in a class of transactions is complicated by the need to correct for multiple
hypothesis testing. Pruning untestable hypotheses was recently proposed as a
strategy for this task of significant itemset mining. It was shown to lead to
greater statistical power, the discovery of more truly significant itemsets,
than the standard Bonferroni correction on real-world datasets. An open
question, however, is whether this strategy of excluding untestable hypotheses
also leads to greater statistical power in subgraph mining, in which the number
of hypotheses is much larger than in itemset mining. Here we answer this
question by an empirical investigation on eight popular graph benchmark
datasets. We propose a new efficient search strategy, which always returns the
same solution as the state-of-the-art approach and is approximately two orders
of magnitude faster. Moreover, we exploit the dependence between subgraphs by
considering the effective number of tests and thereby further increase the
statistical power.
Comment: 18 pages, 5 figures, accepted to the 2015 SIAM International
Conference on Data Mining (SDM15)
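To make the idea of pruning untestable hypotheses concrete, here is a minimal sketch in the spirit of Tarone's correction. It assumes a one-sided Fisher's exact test for enrichment in one class, and the function names and toy supports are hypothetical. It does not reproduce the search strategy or the effective-number-of-tests refinement proposed in the paper.

```python
from math import comb


def min_p_value(support, n, n1):
    """Smallest one-sided Fisher p-value reachable by a pattern.

    support: number of graphs containing the pattern
    n:       total number of graphs
    n1:      number of graphs in the class tested for enrichment
    The minimum is reached when as many occurrences as possible
    fall into that class.
    """
    a = min(support, n1)
    return comb(n1, a) * comb(n - n1, support - a) / comb(n, support)


def tarone_factor(supports, n, n1, alpha=0.05):
    """Smallest k such that at most k patterns are testable at level alpha / k."""
    min_ps = [min_p_value(x, n, n1) for x in supports]
    k = 1
    while sum(p <= alpha / k for p in min_ps) > k:
        k += 1
    return k


# Hypothetical supports of nine candidate patterns in 20 graphs, 10 per class.
supports = [1, 1, 2, 2, 3, 3, 5, 8, 10]
k = tarone_factor(supports, n=20, n1=10)
print("Bonferroni factor:", len(supports), "Tarone factor:", k)
```

With these made-up supports only three patterns can ever reach significance, so the correction factor drops from 9 to 3 and the per-test threshold becomes correspondingly less strict.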
Effiziente Prozessmodellanalyse mit Algorithmen der Subgraphisomorphie (Efficient process model analysis with subgraph isomorphism algorithms)
The literature offers a wide variety of approaches for analysing process models structurally. A subproblem that arises in many of these approaches is the identification of (frequently occurring) subgraphs within the model graphs. Graph-theoretic algorithms can be used to solve this problem. This article demonstrates that such algorithms are able to analyse large sets of process models within (milli)seconds. They can therefore be integrated as a subcomponent into existing analysis approaches to replace (potentially more costly) in-house developments. The advantage of these algorithms lies in their broad applicability, which is not restricted to specific modelling languages or analysis purposes.
GraphMDL : sélection de motifs de graphes avec le principe MDL (GraphMDL: graph pattern selection with the MDL principle)
Many graph pattern mining algorithms have been designed to identify recurring structures in graphs. The main drawback of these approaches is that they often extract too many patterns for human analysis. Recently, pattern mining methods using the Minimum Description Length (MDL) principle have been proposed to select a characteristic subset of patterns from transactional, sequential and relational data. In this paper, we propose an MDL-based approach for selecting a characteristic subset of patterns on labeled graphs. A key notion in this paper is the introduction of ports to encode connections between pattern occurrences without any loss of information. Experiments show that the number of patterns is drastically reduced, and the selected patterns can have complex shapes.
FS^3: A Sampling based method for top-k Frequent Subgraph Mining
Mining labeled subgraphs is a popular research task in data mining because of
its potential application in many different scientific domains. All the
existing methods for this task explicitly or implicitly solve the subgraph
isomorphism task, which is computationally expensive, so they scale poorly
when the graphs in the input database are large. In
this work, we propose FS^3, which is a sampling based method. It mines a small
collection of subgraphs that are most frequent in the probabilistic sense. FS^3
performs a Markov Chain Monte Carlo (MCMC) sampling over the space of
fixed-size subgraphs such that the potentially frequent subgraphs are sampled
more often. Besides, FS^3 is equipped with an innovative queue manager. It
stores the sampled subgraphs in a finite queue over the course of mining in such
a manner that the top-k positions in the queue contain the most frequent
subgraphs. Our experiments on databases of large graphs show that FS^3 is
efficient, and it obtains subgraphs that are the most frequent amongst the
subgraphs of a given size.
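The abstract does not say how the queue manager works internally. As a loose illustration of keeping only the k most frequent sampled patterns in bounded memory, the sketch below uses a min-heap. The class name, the use of strings as canonical pattern codes, and the eviction rule are all assumptions for this example and should not be read as FS^3's actual design.

```python
import heapq


class TopKQueue:
    """Bounded container that retains the k patterns with the highest support."""

    def __init__(self, k):
        self.k = k
        self._heap = []        # min-heap of (support, pattern)
        self._members = set()  # patterns currently held

    def offer(self, pattern, support):
        """Insert a sampled pattern, evicting the weakest one if the queue is full."""
        if pattern in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (support, pattern))
            self._members.add(pattern)
        elif support > self._heap[0][0]:
            _, evicted = heapq.heapreplace(self._heap, (support, pattern))
            self._members.discard(evicted)
            self._members.add(pattern)

    def top(self):
        """Return the held (support, pattern) pairs, most frequent first."""
        return sorted(self._heap, reverse=True)


# Hypothetical stream of sampled patterns with their estimated supports.
queue = TopKQueue(k=2)
for code, support in [("a", 3), ("b", 5), ("c", 4), ("a", 3)]:
    queue.offer(code, support)
print(queue.top())
```

A heap keeps each insertion at O(log k), which matters when the sampler proposes many more patterns than the queue can hold.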
Frequent Subgraph Mining via Sampling with Rigorous Guarantees
Frequent subgraph mining is a fundamental task in the analysis of collections of graphs that aims at finding all the subgraphs that appear with more than a user-specified frequency in the dataset. While several exact approaches have been proposed to solve the task, it remains computationally challenging on large graph datasets due to the complexity of the subgraph isomorphism problem inherent in the task and the huge number of candidate patterns even for fairly small subgraphs.
In this thesis, we study two statistical learning measures of complexity, VC-dimension and Rademacher averages, for subgraphs, and derive efficiently computable bounds for both. We then show how such bounds can be applied to devise efficient sampling-based approaches for rigorously approximating the solutions of the frequent subgraph mining problem, providing sample sizes which are much tighter than what would be obtained by a straightforward application of Chernoff and union bounds. We also show that our bounds can be used for true frequent subgraph mining, which requires identifying subgraphs generated with probability above a given threshold using samples from an unknown generative process.
Moreover, we carried out an extensive experimental evaluation of our methods on real datasets, which shows that our bounds lead to efficiently computable and high-quality approximations for both applications.
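For reference, the baseline that the abstract compares against can be written down directly. Using Hoeffding's inequality (a common form of the Chernoff bound) together with a union bound over N candidate patterns, a sample of m graphs estimates every frequency to within eps with probability at least 1 - delta once m >= ln(2N / delta) / (2 eps^2). The sketch below only evaluates this formula; the function name and example parameters are assumptions, and the thesis's tighter VC-dimension and Rademacher bounds are not reproduced here.

```python
from math import ceil, log


def union_bound_sample_size(eps, delta, n_patterns):
    """Sample size from Hoeffding's inequality plus a union bound.

    eps:        maximum allowed absolute error on any pattern frequency
    delta:      allowed probability that some estimate is off by more than eps
    n_patterns: number of candidate patterns covered by the union bound
    """
    if not (0 < eps < 1 and 0 < delta < 1 and n_patterns >= 1):
        raise ValueError("need 0 < eps < 1, 0 < delta < 1 and n_patterns >= 1")
    return ceil(log(2 * n_patterns / delta) / (2 * eps ** 2))


# Hypothetical setting: 1000 candidate patterns, error 0.05, failure probability 0.1.
print(union_bound_sample_size(eps=0.05, delta=0.1, n_patterns=1000))
```

The sample size grows only logarithmically in the number of candidates, but that number is itself enormous for subgraphs, which is the looseness that complexity-based bounds aim to avoid.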