Significant Subgraph Mining with Multiple Testing Correction
The problem of finding itemsets that are statistically significantly enriched
in a class of transactions is complicated by the need to correct for multiple
hypothesis testing. Pruning untestable hypotheses was recently proposed as a
strategy for this task of significant itemset mining. It was shown to lead to
greater statistical power, that is, the discovery of more truly significant itemsets,
than the standard Bonferroni correction on real-world datasets. An open
question, however, is whether this strategy of excluding untestable hypotheses
also leads to greater statistical power in subgraph mining, in which the number
of hypotheses is much larger than in itemset mining. Here we answer this
question by an empirical investigation on eight popular graph benchmark
datasets. We propose a new efficient search strategy, which always returns the
same solution as the state-of-the-art approach and is approximately two orders
of magnitude faster. Moreover, we exploit the dependence between subgraphs by
considering the effective number of tests and thereby further increase the
statistical power. Comment: 18 pages, 5 figures, accepted to the 2015 SIAM International
Conference on Data Mining (SDM15)
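The idea of pruning untestable hypotheses mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a one-sided Fisher's exact test and a Tarone-style correction factor. The function names (`min_attainable_pvalue`, `tarone_threshold`) and the toy numbers are hypothetical, and this is not the search strategy proposed in the paper. A pattern that occurs in `x` of `n` graphs, `n1` of which belong to the positive class, can never reach a p-value below a bound that depends only on `x`, `n`, and `n1`. Patterns whose bound exceeds the corrected threshold are untestable and need not be counted.

```python
from math import comb


def min_attainable_pvalue(x, n, n1):
    """Smallest one-sided Fisher p-value reachable by a pattern with support x.

    Assumption: one-sided test, so the minimum is the hypergeometric
    probability of the most extreme table in either direction.
    """
    n0 = n - n1
    a_hi = min(x, n1)       # as many occurrences as possible in class 1
    a_lo = max(0, x - n0)   # as few occurrences as possible in class 1
    total = comb(n, x)
    p_hi = comb(n1, a_hi) * comb(n0, x - a_hi) / total
    p_lo = comb(n1, a_lo) * comb(n0, x - a_lo) / total
    return min(p_hi, p_lo)


def tarone_threshold(supports, n, n1, alpha=0.05):
    """Return (k, alpha / k) for the smallest k such that at most k
    hypotheses are testable at level alpha / k."""
    min_p = [min_attainable_pvalue(x, n, n1) for x in supports]
    k = 1
    # The number of testable hypotheses never grows as k grows, and it is
    # at most len(supports), so this loop terminates.
    while sum(p <= alpha / k for p in min_p) > k:
        k += 1
    return k, alpha / k


# Hypothetical toy data: supports of seven patterns in 20 graphs (10 per class).
supports = [1, 1, 1, 2, 2, 5, 10]
k, threshold = tarone_threshold(supports, n=20, n1=10)
bonferroni = 0.05 / len(supports)
```

In this toy setting only the two patterns with supports 5 and 10 are testable, so the correction factor is 2 instead of the Bonferroni factor 7, which yields a less conservative threshold.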
Mining Brain Networks using Multiple Side Views for Neurological Disorder Identification
Mining discriminative subgraph patterns from graph data has attracted great
interest in recent years. It has a wide variety of applications in disease
diagnosis, neuroimaging, etc. Most research on subgraph mining focuses on the
graph representation alone. However, in many real-world applications, the side
information is available along with the graph data. For example, for
neurological disorder identification, in addition to the brain networks derived
from neuroimaging data, hundreds of clinical, immunologic, serologic and
cognitive measures may also be documented for each subject. These measures
compose multiple side views encoding a tremendous amount of supplemental
information for diagnostic purposes, yet are often ignored. In this paper, we
study the problem of discriminative subgraph selection using multiple side
views and propose a novel solution to find an optimal set of subgraph features
for graph classification by exploring a plurality of side views. We derive a
feature evaluation criterion, named gSide, to estimate the usefulness of
subgraph patterns based upon side views. Then we develop a branch-and-bound
algorithm, called gMSV, to efficiently search for optimal subgraph features by
integrating the subgraph mining process and the procedure of discriminative
feature selection. Empirical studies on graph classification tasks for
neurological disorders using brain networks demonstrate that subgraph patterns
selected by the multi-side-view guided subgraph selection approach can
effectively boost graph classification performances and are relevant to disease
diagnosis. Comment: in Proceedings of IEEE International Conference on Data Mining (ICDM)
201
Top-k Differential Queries in Graph Databases
The sheer volume and schema complexity of today’s graph databases make it hard for users to formulate queries against them and often cause queries to “fail” by delivering empty answers. To support users in such situations, differential queries can be used to bridge the gap between an unexpected result (e.g. an empty result set) and the user’s query intention. These queries deliver the missing parts of a query graph and therefore apply to scenarios that require users to specify a query graph. Based on the discovered information about a missing query subgraph, users can understand which vertices and edges cause a query to unexpectedly return an empty answer, and can reformulate the query if needed. A study showed that the result sets of differential queries are often too large to be inspected manually by users, so the number of results must be reduced and the results ranked. To address these issues, we extend the concept of differential queries and introduce top-k differential queries, which rank results based on users’ preferences and thereby significantly support users’ understanding of query database management systems. The idea is that users assign relevance weights to vertices or edges of a query graph; these weights steer the graph search and are used in the scoring function for top-k differential results. Along with the novel concept of top-k differential queries, we further propose a strategy for propagating relevance weights and we model the search along the most relevant paths
- …
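The relevance-weight ranking described in the top-k differential queries abstract can be sketched as follows. The excerpt does not give the actual scoring function, so this sketch assumes that a result's score is simply the sum of the user-assigned weights of the query vertices and edges it covers. The names `score` and `top_k` and the element identifiers are hypothetical.

```python
import heapq


def score(matched_elements, weights):
    """Sum the relevance weights of the query-graph elements a result covers.

    Assumption: elements without a user-assigned weight contribute 0.
    """
    return sum(weights.get(element, 0.0) for element in matched_elements)


def top_k(results, weights, k):
    """Return the k highest-scoring results, best first."""
    return heapq.nlargest(k, results, key=lambda r: score(r, weights))


# Hypothetical query graph with two vertices and one edge, weighted by the user.
weights = {"v1": 1.0, "v2": 0.2, "e12": 0.5}

# Each candidate result lists the query elements it was able to match.
results = [("v1",), ("v2", "e12"), ("v1", "e12"), ("v2",)]

best = top_k(results, weights, k=2)
```

Using `heapq.nlargest` avoids sorting the full result set, which matters when differential queries return many candidates and only a few are shown to the user.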