Mining Representative Unsubstituted Graph Patterns Using Prior Similarity Matrix
One of the most powerful techniques to study protein structures is to look
for recurrent fragments (also called substructures or spatial motifs), then use
them as patterns to characterize the proteins under study. An emergent trend
consists in parsing proteins' three-dimensional (3D) structures into graphs of
amino acids. Hence, the search for recurrent spatial motifs is formulated as a
process of frequent subgraph discovery where each subgraph represents a spatial
motif. In this scope, several efficient approaches for frequent subgraph
discovery have been proposed in the literature. However, the set of discovered
frequent subgraphs is too large to be efficiently analyzed and explored in any
further process. In this paper, we propose a novel pattern selection approach
that shrinks the large number of discovered frequent subgraphs by selecting the
representative ones. Existing pattern selection approaches do not exploit the
domain knowledge. Yet, in our approach we incorporate the evolutionary
information of amino acids defined in the substitution matrices in order to
select the representative subgraphs. We show the effectiveness of our approach
on a number of real datasets. The results issued from our experiments show that
our approach is able to considerably decrease the number of motifs while
enhancing their interestingness.
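
The selection idea in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: patterns are simplified to tuples of amino-acid labels with the same topology and node order, `TOY_SCORES` holds made-up substitution scores rather than a real substitution matrix, and the greedy rule "drop a pattern if an already kept one can substitute for it" is one plausible reading of "selecting the representative ones".

```python
# Toy symmetric substitution scores (hypothetical values, not a real matrix).
TOY_SCORES = {
    frozenset("IL"): 2, frozenset("IV"): 3, frozenset("LV"): 1,
    frozenset("DE"): 2, frozenset("KR"): 2,
}


def substitutable(a, b, threshold=2):
    """Two labels are interchangeable if identical or scored >= threshold."""
    return a == b or TOY_SCORES.get(frozenset((a, b)), -1) >= threshold


def similar(p, q, threshold=2):
    """Patterns match if every aligned pair of labels is substitutable.

    Assumes both patterns share the same structure and node ordering.
    """
    return len(p) == len(q) and all(
        substitutable(a, b, threshold) for a, b in zip(p, q)
    )


def select_representatives(patterns, threshold=2):
    """Greedily keep a pattern only if no kept pattern can stand in for it."""
    representatives = []
    for p in patterns:
        if not any(similar(p, r, threshold) for r in representatives):
            representatives.append(p)
    return representatives


patterns = [("I", "D", "K"), ("L", "E", "R"), ("V", "D", "K"), ("A", "D", "K")]
selected = select_representatives(patterns)
```

In this toy input the second and third patterns are absorbed by the first, while the fourth survives because its first label has no substitution score with `I`.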
Inductive queries for a drug designing robot scientist
It is increasingly clear that machine learning algorithms need to be integrated in an iterative scientific discovery loop, in which data is queried repeatedly by means of inductive queries and where the computer provides guidance to the experiments that are being performed. In this chapter, we summarise several key challenges in achieving this integration of machine learning and data mining algorithms in methods for the discovery of Quantitative Structure-Activity Relationships (QSARs). We introduce the concept of a robot scientist, in which all steps of the discovery process are automated; we discuss the representation of molecular data such that knowledge discovery tools can analyse it, and we discuss the adaptation of machine learning and data mining algorithms to guide QSAR experiments.
Significant Subgraph Mining with Multiple Testing Correction
The problem of finding itemsets that are statistically significantly enriched
in a class of transactions is complicated by the need to correct for multiple
hypothesis testing. Pruning untestable hypotheses was recently proposed as a
strategy for this task of significant itemset mining. It was shown to lead to
greater statistical power (that is, the discovery of more truly significant
itemsets) than the standard Bonferroni correction on real-world datasets. An open
question, however, is whether this strategy of excluding untestable hypotheses
also leads to greater statistical power in subgraph mining, in which the number
of hypotheses is much larger than in itemset mining. Here we answer this
question by an empirical investigation on eight popular graph benchmark
datasets. We propose a new efficient search strategy, which always returns the
same solution as the state-of-the-art approach and is approximately two orders
of magnitude faster. Moreover, we exploit the dependence between subgraphs by
considering the effective number of tests and thereby further increase the
statistical power. Comment: 18 pages, 5 figures, accepted to the 2015 SIAM International
Conference on Data Mining (SDM15)
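
The "pruning untestable hypotheses" strategy mentioned in this abstract is usually attributed to Tarone. The following is a minimal sketch of that idea, not the search strategy proposed in the paper. It assumes a one-sided Fisher's exact test, that each pattern is summarised only by its support (the number of graphs containing it), and the function names `min_pvalue` and `tarone_threshold` are hypothetical.

```python
from math import comb


def min_pvalue(support, n_pos, n_total):
    """Smallest p-value a pattern with this support could ever reach.

    The most extreme 2x2 table puts as many occurrences as possible in the
    positive class; its hypergeometric probability is the minimum attainable
    one-sided Fisher p-value.
    """
    a_max = min(support, n_pos)
    return (
        comb(n_pos, a_max)
        * comb(n_total - n_pos, support - a_max)
        / comb(n_total, support)
    )


def tarone_threshold(supports, n_pos, n_total, alpha=0.05):
    """Return (k, alpha / k) for the smallest k such that at most k
    patterns are testable at level alpha / k."""
    p_mins = [min_pvalue(s, n_pos, n_total) for s in supports]
    k = 1
    while sum(p <= alpha / k for p in p_mins) > k:
        k += 1
    return k, alpha / k


# Six candidate patterns in 20 graphs, 10 of which are in the positive class.
k, threshold = tarone_threshold([1, 1, 1, 2, 5, 10], n_pos=10, n_total=20)
```

With these toy inputs only the two patterns with supports 5 and 10 can ever be significant, so the correction factor is 2 instead of the Bonferroni factor of 6.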
GTRACE-RS: Efficient Graph Sequence Mining using Reverse Search
The mining of frequent subgraphs from labeled graph data has been studied
extensively. Furthermore, much attention has recently been paid to frequent
pattern mining from graph sequences. A method, called GTRACE, has been proposed
to mine frequent patterns from graph sequences under the assumption that
changes in graphs are gradual. Although GTRACE mines the frequent patterns
efficiently, it still needs substantial computation time to mine the patterns
from graph sequences containing large graphs and long sequences. In this paper,
we propose a new version of GTRACE that enables efficient mining of frequent
patterns based on the principle of a reverse search. The underlying concept of
the reverse search is a general scheme for designing efficient algorithms for
hard enumeration problems. Our performance study shows that the proposed method
is efficient and scalable for mining both long and large graph sequence
patterns and is several orders of magnitude faster than the original GTRACE.
Mining Brain Networks using Multiple Side Views for Neurological Disorder Identification
Mining discriminative subgraph patterns from graph data has attracted great
interest in recent years. It has a wide variety of applications in disease
diagnosis, neuroimaging, etc. Most research on subgraph mining focuses on the
graph representation alone. However, in many real-world applications, the side
information is available along with the graph data. For example, for
neurological disorder identification, in addition to the brain networks derived
from neuroimaging data, hundreds of clinical, immunologic, serologic and
cognitive measures may also be documented for each subject. These measures
compose multiple side views encoding a tremendous amount of supplemental
information for diagnostic purposes, yet are often ignored. In this paper, we
study the problem of discriminative subgraph selection using multiple side
views and propose a novel solution to find an optimal set of subgraph features
for graph classification by exploring a plurality of side views. We derive a
feature evaluation criterion, named gSide, to estimate the usefulness of
subgraph patterns based upon side views. Then we develop a branch-and-bound
algorithm, called gMSV, to efficiently search for optimal subgraph features by
integrating the subgraph mining process and the procedure of discriminative
feature selection. Empirical studies on graph classification tasks for
neurological disorders using brain networks demonstrate that subgraph patterns
selected by the multi-side-view guided subgraph selection approach can
effectively boost graph classification performances and are relevant to disease
diagnosis. Comment: in Proceedings of IEEE International Conference on Data Mining (ICDM)
201
FREQUENT PATTERN MINING ON UNCERTAIN GRAPHS
Uncertainty is inherent in a wide range of real-life applications, and this inevitably extends to graph data. Representative uncertain graphs are found in bioinformatics, social networks, and so forth. This paper motivates the problem of frequent subgraph mining on single uncertain graphs, and investigates two different semantics, probabilistic and expected, in terms of support definitions. Graph data are subject to uncertainty in many applications because of the incompleteness and imprecision of data. Mining uncertain graph data is semantically different from, and computationally more challenging than, mining exact graph data. This paper examines the problem of mining frequent subgraph patterns from uncertain graph data. The frequent subgraph pattern mining problem is formalized by designing a new measure called expected support. An approximate mining algorithm is proposed to find an approximate set of frequent subgraph patterns by allowing an error tolerance on the expected supports of the discovered subgraph patterns. The algorithm uses an efficient approximation algorithm to decide whether a subgraph pattern can be output or not. The analytical and experimental results show that the algorithm is efficient, accurate and scalable for large uncertain graph databases.
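
The expected-support semantics for uncertain graphs can be made concrete with a small sketch. It rests on assumptions that are not stated in the abstract: edges exist independently with given probabilities, the embeddings of a pattern in each graph are already known and given as lists of edges, and expected support is taken as the average, over the graphs in the database, of the probability that the pattern occurs. The exact enumeration of possible worlds below is exponential and only suitable for toy inputs; it is not the approximation algorithm the paper describes.

```python
from itertools import product


def occurrence_probability(edge_probs, embeddings):
    """Probability that at least one embedding has all of its edges present.

    Enumerates every possible world over the edges used by the embeddings.
    """
    edges = sorted({e for emb in embeddings for e in emb})
    total = 0.0
    for world in product([False, True], repeat=len(edges)):
        present = {e for e, keep in zip(edges, world) if keep}
        if any(set(emb) <= present for emb in embeddings):
            prob = 1.0
            for e, keep in zip(edges, world):
                prob *= edge_probs[e] if keep else 1.0 - edge_probs[e]
            total += prob
    return total


def expected_support(database):
    """Average occurrence probability over (edge_probs, embeddings) pairs."""
    return sum(
        occurrence_probability(probs, embs) for probs, embs in database
    ) / len(database)


# Graph 1: the pattern is a single edge and can map to either of two edges.
g1 = ({("a", "b"): 0.5, ("b", "c"): 0.5}, [[("a", "b")], [("b", "c")]])
# Graph 2: the pattern is a two-edge path with exactly one embedding.
g2 = ({("a", "b"): 0.9, ("b", "c"): 0.5}, [[("a", "b"), ("b", "c")]])

support = expected_support([g1, g2])
```

Here the pattern occurs in the first graph with probability 1 - 0.5 * 0.5 = 0.75 and in the second with probability 0.9 * 0.5 = 0.45, so the expected support is about 0.6.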
Comparative Study of Frequent Subgraph Mining Algorithms (Estudio comparativo de algoritmos de minería de subgrafos frecuentes)
Graph mining techniques include those for searching for frequent subgraphs. Several algorithms aim to recognize common substructures across a set of graphs, most notably FSG, FFSM, gSpan and GASTON. The goal of this research is to analyze the behavior of these algorithms through different experiments designed to identify whether one algorithm is superior to the rest and, if not, to determine in which scenarios each one is the more advisable choice. XIII Workshop Bases de datos y Minería de Datos (WBDMD). Red de Universidades con Carreras en Informática (RedUNCI)