302 research outputs found
Subgraph covers -- An information theoretic approach to motif analysis in networks
Many real world networks contain a statistically surprising number of certain
subgraphs, called network motifs. In the prevalent approach to motif analysis,
network motifs are detected by comparing subgraph frequencies in the original
network with a statistical null model. In this paper we propose an alternative
approach to motif analysis where network motifs are defined to be connectivity
patterns that occur in a subgraph cover that represents the network using
minimal total information. A subgraph cover is defined to be a set of subgraphs
such that every edge of the graph is contained in at least one of the subgraphs
in the cover. Some recently introduced random graph models that can incorporate
significant densities of motifs have natural formulations in terms of subgraph
covers and the presented approach can be used to match networks with such
models. To prove the practical value of our approach we also present a
heuristic for the resulting NP-hard optimization problem and give results for
several real world networks.Comment: 10 pages, 7 tables, 1 Figur
Information content of colored motifs in complex networks
We study complex networks in which the nodes of the network are tagged with
different colors depending on the functionality of the nodes (colored graphs),
using information theory applied to the distribution of motifs in such
networks. We find that colored motifs can be viewed as the building blocks of
the networks (much more so than the uncolored structural motifs can be) and
that the relative frequency with which these motifs appear in the network can
be used to define the information content of the network. This information is
defined in such a way that a network with random coloration (but keeping the
relative number of nodes with different colors the same) has zero color
information content. Thus, colored motif information captures the
exceptionality of coloring in the motifs that is maintained via selection. We
study the motif information content of the C. elegans brain as well as the
evolution of colored motif information in networks that reflect the interaction
between instructions in genomes of digital life organisms. While we find that
colored motif information appears to capture essential functionality in the C.
elegans brain (where the color assignment of nodes is straightforward) it is
not obvious whether the colored motif information content always increases
during evolution, as would be expected from a measure that captures network
complexity. For a single choice of color assignment of instructions in the
digital life form Avida, we find rather that colored motif information content
increases or decreases during evolution, depending on how the genomes are
organized, and therefore could be an interesting tool to dissect genomic
rearrangements.Comment: 21 pages, 8 figures, to appear in Artificial Lif
The Graph Motif problem parameterized by the structure of the input graph
The Graph Motif problem was introduced in 2006 in the context of biological
networks. It consists of deciding whether or not a multiset of colors occurs in
a connected subgraph of a vertex-colored graph. Graph Motif has been mostly
analyzed from the standpoint of parameterized complexity. The main parameters
which came into consideration were the size of the multiset and the number of
colors. Though, in the many applications of Graph Motif, the input graph
originates from real-life and has structure. Motivated by this prosaic
observation, we systematically study its complexity relatively to graph
structural parameters. For a wide range of parameters, we give new or improved
FPT algorithms, or show that the problem remains intractable. For the FPT
cases, we also give some kernelization lower bounds as well as some ETH-based
lower bounds on the worst case running time. Interestingly, we establish that
Graph Motif is W[1]-hard (while in W[P]) for parameter max leaf number, which
is, to the best of our knowledge, the first problem to behave this way.Comment: 24 pages, accepted in DAM, conference version in IPEC 201
Coping with new Challenges in Clustering and Biomedical Imaging
The last years have seen a tremendous increase of data acquisition in different scientific fields such as molecular biology, bioinformatics or biomedicine. Therefore, novel methods are needed for automatic data processing and analysis of this large amount of data. Data mining is the process of applying methods like clustering or classification to large databases in order to uncover hidden patterns. Clustering is the task of partitioning points of a data set into distinct groups in order to minimize the intra cluster similarity and to maximize the inter cluster similarity. In contrast to unsupervised learning like clustering, the classification problem is known as supervised learning that aims at the prediction of group membership of data objects on the basis of rules learned from a training set where the group membership is known.
Specialized methods have been proposed for hierarchical and partitioning clustering. However, these methods suffer from several drawbacks. In the first part of this work, new clustering methods are proposed that cope with problems from conventional clustering algorithms. ITCH (Information-Theoretic Cluster Hierarchies) is a hierarchical clustering method that is based on a hierarchical variant of the Minimum Description Length (MDL) principle which finds hierarchies of clusters without requiring input parameters. As ITCH may converge only to a local optimum we propose GACH (Genetic Algorithm for Finding Cluster Hierarchies) that combines the benefits from genetic algorithms with information-theory. In this way the search space is explored more effectively.
Furthermore, we propose INTEGRATE a novel clustering method for data with mixed numerical and categorical attributes. Supported by the MDL principle our method integrates the information provided by heterogeneous numerical and categorical attributes and thus naturally balances the influence of both sources of information. A competitive evaluation illustrates that INTEGRATE is more effective than existing clustering methods for mixed type data. Besides clustering methods for single data objects we provide a solution for clustering different data sets that are represented by their skylines. The skyline operator is a well-established database primitive for finding database objects which minimize two or more attributes with an unknown weighting between these attributes. In this thesis, we define a similarity measure, called SkyDist, for comparing skylines of different data sets that can directly be integrated into different data mining tasks such as clustering or classification. The experiments show that SkyDist in combination with different clustering algorithms can give useful insights into many applications.
In the second part, we focus on the analysis of high resolution magnetic resonance images (MRI) that are clinically relevant and may allow for an early detection and diagnosis of several diseases. In particular, we propose a framework for the classification of Alzheimer's disease in MR images combining the data mining steps of feature selection, clustering and classification. As a result, a set of highly selective features discriminating patients with Alzheimer and healthy people has been identified. However, the analysis of the high dimensional MR images is extremely time-consuming. Therefore we developed JGrid, a scalable distributed computing solution designed to allow for a large scale analysis of MRI and thus an optimized prediction of diagnosis. In another study we apply efficient algorithms for motif discovery to task-fMRI scans in order to identify patterns in the brain that are characteristic for patients with somatoform pain disorder. We find groups of brain compartments that occur frequently within the brain networks and discriminate well among healthy and diseased people
Temporal Networks
A great variety of systems in nature, society and technology -- from the web
of sexual contacts to the Internet, from the nervous system to power grids --
can be modeled as graphs of vertices coupled by edges. The network structure,
describing how the graph is wired, helps us understand, predict and optimize
the behavior of dynamical systems. In many cases, however, the edges are not
continuously active. As an example, in networks of communication via email,
text messages, or phone calls, edges represent sequences of instantaneous or
practically instantaneous contacts. In some cases, edges are active for
non-negligible periods of time: e.g., the proximity patterns of inpatients at
hospitals can be represented by a graph where an edge between two individuals
is on throughout the time they are at the same ward. Like network topology, the
temporal structure of edge activations can affect dynamics of systems
interacting through the network, from disease contagion on the network of
patients to information diffusion over an e-mail network. In this review, we
present the emergent field of temporal networks, and discuss methods for
analyzing topological and temporal structure and models for elucidating their
relation to the behavior of dynamical systems. In the light of traditional
network theory, one can see this framework as moving the information of when
things happen from the dynamical system on the network, to the network itself.
Since fundamental properties, such as the transitivity of edges, do not
necessarily hold in temporal networks, many of these methods need to be quite
different from those for static networks
Balanced Connected Subgraph Problem in Geometric Intersection Graphs
We study the Balanced Connected Subgraph(shortly, BCS) problem on geometric
intersection graphs such as interval, circular-arc, permutation, unit-disk,
outer-string graphs, etc. Given a vertex-colored graph , where each
vertex in is colored with either ``red'' or ``blue'', the BCS problem seeks
a maximum cardinality induced connected subgraph of such that is
color-balanced, i.e., contains an equal number of red and blue vertices. We
study the computational complexity landscape of the BCS problem while
considering geometric intersection graphs. On one hand, we prove that the BCS
problem is NP-hard on the unit disk, outer-string, complete grid, and unit
square graphs. On the other hand, we design polynomial-time algorithms for the
BCS problem on interval, circular-arc and permutation graphs. In particular, we
give algorithm for the Steiner Tree problem on both the interval graphs and
circular arc graphs, that is used as a subroutine for solving BCS problem on
same graph classes. Finally, we present a FPT algorithm for the BCS problem on
general graphs.Comment: 17 pages, 3 figure
Towards comprehensive structural motif mining for better fold annotation in the "twilight zone" of sequence dissimilarity
Background: Automatic identification of structure fingerprints from a group of diverse protein structures is challenging, especially for proteins whose divergent amino acid sequences may fall into the “twilight-” or “midnight– ” zones where pair-wise sequence identities to known sequences fall below 25 % and sequence-based functional annotations often fail. Results: Here we report a novel graph database mining method and demonstrate its application to protein structure pattern identification and structure classification. The biologic motivation of our study is to recognize common structure patterns in “immunoevasins”, proteins mediating virus evasion of host immune defense. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method. Conclusions: We present a theoretic framework, offer a practical software implementation for incorporating prior domain knowledge, such as substitution matrices as studied here, and devise an efficient algorithm to identify approximate matched frequent subgraphs. By doing so, we significantly expanded the analytical power of sophisticated data mining algorithms in dealing with large volume of complicated and noisy protein structure data. And without loss of generality, choice of appropriate compatibility matrices allows our method to be easily employed in domains where subgraph labels have some uncertainty
- …