302 research outputs found

    Subgraph covers -- An information theoretic approach to motif analysis in networks

    Many real-world networks contain a statistically surprising number of certain subgraphs, called network motifs. In the prevalent approach to motif analysis, network motifs are detected by comparing subgraph frequencies in the original network with a statistical null model. In this paper we propose an alternative approach to motif analysis where network motifs are defined to be connectivity patterns that occur in a subgraph cover that represents the network using minimal total information. A subgraph cover is defined to be a set of subgraphs such that every edge of the graph is contained in at least one of the subgraphs in the cover. Some recently introduced random graph models that can incorporate significant densities of motifs have natural formulations in terms of subgraph covers, and the presented approach can be used to match networks with such models. To demonstrate the practical value of our approach we also present a heuristic for the resulting NP-hard optimization problem and give results for several real-world networks. Comment: 10 pages, 7 tables, 1 figure
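The definition of a subgraph cover in the abstract above can be checked mechanically. The following is a minimal sketch, not the paper's method: it only tests the covering condition and does not attempt the information-minimizing optimization. The function name `is_subgraph_cover` and the edge-list representation are assumptions made for illustration.

```python
from typing import Iterable, Tuple

Edge = Tuple[int, int]


def _norm(edges: Iterable[Edge]) -> set:
    """Treat edges as undirected by storing each one as a frozenset."""
    return {frozenset(e) for e in edges}


def is_subgraph_cover(graph_edges: Iterable[Edge],
                      subgraphs: Iterable[Iterable[Edge]]) -> bool:
    """Return True if every edge of the graph lies in at least one subgraph.

    Each subgraph is given by its edge list (hypothetical representation);
    a candidate containing an edge that is not in the graph is rejected.
    """
    all_edges = _norm(graph_edges)
    covered = set()
    for sub in subgraphs:
        sub_edges = _norm(sub)
        if not sub_edges <= all_edges:
            return False  # not a subgraph of the input graph
        covered |= sub_edges
    return covered == all_edges


# A triangle on {1, 2, 3} with a pendant edge (3, 4).
graph = [(1, 2), (2, 3), (1, 3), (3, 4)]
triangle = [(1, 2), (2, 3), (1, 3)]
pendant = [(3, 4)]

covers_all = is_subgraph_cover(graph, [triangle, pendant])
misses_edge = is_subgraph_cover(graph, [triangle])
```

In this toy example the triangle together with the pendant edge forms a cover, while the triangle alone leaves the edge (3, 4) uncovered.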

    Information content of colored motifs in complex networks

    We study complex networks in which the nodes of the network are tagged with different colors depending on the functionality of the nodes (colored graphs), using information theory applied to the distribution of motifs in such networks. We find that colored motifs can be viewed as the building blocks of the networks (much more so than the uncolored structural motifs can be) and that the relative frequency with which these motifs appear in the network can be used to define the information content of the network. This information is defined in such a way that a network with random coloration (but keeping the relative number of nodes with different colors the same) has zero color information content. Thus, colored motif information captures the exceptionality of coloring in the motifs that is maintained via selection. We study the motif information content of the C. elegans brain as well as the evolution of colored motif information in networks that reflect the interaction between instructions in genomes of digital life organisms. While we find that colored motif information appears to capture essential functionality in the C. elegans brain (where the color assignment of nodes is straightforward), it is not obvious whether the colored motif information content always increases during evolution, as would be expected from a measure that captures network complexity. For a single choice of color assignment of instructions in the digital life form Avida, we find rather that colored motif information content increases or decreases during evolution, depending on how the genomes are organized, and therefore could be an interesting tool to dissect genomic rearrangements. Comment: 21 pages, 8 figures, to appear in Artificial Life

    The Graph Motif problem parameterized by the structure of the input graph

    The Graph Motif problem was introduced in 2006 in the context of biological networks. It consists of deciding whether or not a multiset of colors occurs in a connected subgraph of a vertex-colored graph. Graph Motif has been mostly analyzed from the standpoint of parameterized complexity. The main parameters that have been considered are the size of the multiset and the number of colors. However, in many applications of Graph Motif, the input graph originates from real-life data and has structure. Motivated by this prosaic observation, we systematically study its complexity relative to graph structural parameters. For a wide range of parameters, we give new or improved FPT algorithms, or show that the problem remains intractable. For the FPT cases, we also give some kernelization lower bounds as well as some ETH-based lower bounds on the worst-case running time. Interestingly, we establish that Graph Motif is W[1]-hard (while in W[P]) for parameter max leaf number, which is, to the best of our knowledge, the first problem to behave this way. Comment: 24 pages, accepted in DAM, conference version in IPEC 201
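The decision problem stated in the abstract above can be illustrated with a brute-force check. This is a sketch under stated assumptions, not one of the paper's FPT algorithms: it enumerates all vertex subsets of the motif's size, so it runs in exponential time and is only suitable for tiny inputs. The names `graph_motif`, `adj`, and `color` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def is_connected(vertices, adj):
    """Depth-first search restricted to the given vertex subset."""
    vertices = set(vertices)
    if not vertices:
        return False
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj.get(u, ()):
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


def graph_motif(adj, color, motif):
    """Decide whether some connected vertex subset has exactly the
    multiset of colors given by `motif` (brute force, exponential time)."""
    target = Counter(motif)
    k = sum(target.values())
    for subset in combinations(sorted(adj), k):
        if Counter(color[v] for v in subset) == target and is_connected(subset, adj):
            return True
    return False


# Path a - b - c - d with colors red, blue, red, green.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
color = {"a": "red", "b": "blue", "c": "red", "d": "green"}

found = graph_motif(adj, color, ["red", "red", "blue"])      # subset {a, b, c}
not_found = graph_motif(adj, color, ["red", "red", "green"])  # {a, c, d} is disconnected
```

The second query fails because the only vertices with the required colors are a, c, and d, and a is not adjacent to either of the other two.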

    Coping with new Challenges in Clustering and Biomedical Imaging

    Recent years have seen a tremendous increase in data acquisition in different scientific fields such as molecular biology, bioinformatics or biomedicine. Therefore, novel methods are needed for automatic data processing and analysis of this large amount of data. Data mining is the process of applying methods like clustering or classification to large databases in order to uncover hidden patterns. Clustering is the task of partitioning the points of a data set into distinct groups in order to maximize the intra-cluster similarity and to minimize the inter-cluster similarity. In contrast to unsupervised learning like clustering, the classification problem is known as supervised learning that aims at the prediction of group membership of data objects on the basis of rules learned from a training set where the group membership is known. Specialized methods have been proposed for hierarchical and partitioning clustering. However, these methods suffer from several drawbacks. In the first part of this work, new clustering methods are proposed that cope with problems from conventional clustering algorithms. ITCH (Information-Theoretic Cluster Hierarchies) is a hierarchical clustering method that is based on a hierarchical variant of the Minimum Description Length (MDL) principle which finds hierarchies of clusters without requiring input parameters. As ITCH may converge only to a local optimum, we propose GACH (Genetic Algorithm for Finding Cluster Hierarchies), which combines the benefits of genetic algorithms with information theory. In this way the search space is explored more effectively. Furthermore, we propose INTEGRATE, a novel clustering method for data with mixed numerical and categorical attributes. Supported by the MDL principle, our method integrates the information provided by heterogeneous numerical and categorical attributes and thus naturally balances the influence of both sources of information. A competitive evaluation illustrates that INTEGRATE is more effective than existing clustering methods for mixed-type data. Besides clustering methods for single data objects, we provide a solution for clustering different data sets that are represented by their skylines. The skyline operator is a well-established database primitive for finding database objects which minimize two or more attributes with an unknown weighting between these attributes. In this thesis, we define a similarity measure, called SkyDist, for comparing skylines of different data sets that can directly be integrated into different data mining tasks such as clustering or classification. The experiments show that SkyDist in combination with different clustering algorithms can give useful insights into many applications. In the second part, we focus on the analysis of high-resolution magnetic resonance images (MRI) that are clinically relevant and may allow for an early detection and diagnosis of several diseases. In particular, we propose a framework for the classification of Alzheimer's disease in MR images combining the data mining steps of feature selection, clustering and classification. As a result, a set of highly selective features discriminating patients with Alzheimer's disease from healthy people has been identified. However, the analysis of the high-dimensional MR images is extremely time-consuming. Therefore we developed JGrid, a scalable distributed computing solution designed to allow for a large-scale analysis of MRI and thus an optimized prediction of diagnosis. In another study we apply efficient algorithms for motif discovery to task-fMRI scans in order to identify patterns in the brain that are characteristic of patients with somatoform pain disorder. We find groups of brain compartments that occur frequently within the brain networks and discriminate well between healthy and diseased people.

    Temporal Networks

    A great variety of systems in nature, society and technology -- from the web of sexual contacts to the Internet, from the nervous system to power grids -- can be modeled as graphs of vertices coupled by edges. The network structure, describing how the graph is wired, helps us understand, predict and optimize the behavior of dynamical systems. In many cases, however, the edges are not continuously active. As an example, in networks of communication via email, text messages, or phone calls, edges represent sequences of instantaneous or practically instantaneous contacts. In some cases, edges are active for non-negligible periods of time: e.g., the proximity patterns of inpatients at hospitals can be represented by a graph where an edge between two individuals is on throughout the time they are at the same ward. Like network topology, the temporal structure of edge activations can affect dynamics of systems interacting through the network, from disease contagion on the network of patients to information diffusion over an e-mail network. In this review, we present the emergent field of temporal networks, and discuss methods for analyzing topological and temporal structure and models for elucidating their relation to the behavior of dynamical systems. In the light of traditional network theory, one can see this framework as moving the information of when things happen from the dynamical system on the network, to the network itself. Since fundamental properties, such as the transitivity of edges, do not necessarily hold in temporal networks, many of these methods need to be quite different from those for static networks

    Balanced Connected Subgraph Problem in Geometric Intersection Graphs

    We study the Balanced Connected Subgraph (BCS, for short) problem on geometric intersection graphs such as interval, circular-arc, permutation, unit-disk, outer-string graphs, etc. Given a vertex-colored graph G = (V, E), where each vertex in V is colored either "red" or "blue", the BCS problem seeks a maximum-cardinality induced connected subgraph H of G such that H is color-balanced, i.e., H contains an equal number of red and blue vertices. We study the computational complexity landscape of the BCS problem while considering geometric intersection graphs. On the one hand, we prove that the BCS problem is NP-hard on unit-disk, outer-string, complete grid, and unit-square graphs. On the other hand, we design polynomial-time algorithms for the BCS problem on interval, circular-arc and permutation graphs. In particular, we give an algorithm for the Steiner Tree problem on both interval graphs and circular-arc graphs, which is used as a subroutine for solving the BCS problem on the same graph classes. Finally, we present an FPT algorithm for the BCS problem on general graphs. Comment: 17 pages, 3 figures
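The BCS objective described in the abstract above can be made concrete with an exhaustive search. This is a minimal sketch for illustration only, not the paper's polynomial-time or FPT algorithms: it tries every even-sized vertex subset from largest to smallest, so it is exponential in the number of vertices. The names `max_balanced_connected` and `_connected` are hypothetical.

```python
from itertools import combinations


def _connected(vertices, adj):
    """Check that the subgraph induced by `vertices` is connected."""
    vertices = set(vertices)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj.get(u, ()):
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


def max_balanced_connected(adj, color):
    """Return a largest vertex set inducing a connected subgraph with
    equally many "red" and "blue" vertices (empty set if none exists)."""
    nodes = sorted(adj)
    largest_even = len(nodes) - (len(nodes) % 2)
    # A balanced set has even size, so only even sizes are tried.
    for size in range(largest_even, 0, -2):
        for subset in combinations(nodes, size):
            reds = sum(1 for v in subset if color[v] == "red")
            if 2 * reds == size and _connected(subset, adj):
                return set(subset)
    return set()


# Path 1 - 2 - 3 - 4 - 5 colored red, blue, blue, blue, red.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
color = {1: "red", 2: "blue", 3: "blue", 4: "blue", 5: "red"}

best = max_balanced_connected(adj, color)
```

On this path the connected vertex sets are contiguous intervals, and no interval of four vertices is balanced, so the best solution has two vertices.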

    Mining subjectively interesting patterns in rich data


    Towards comprehensive structural motif mining for better fold annotation in the "twilight zone" of sequence dissimilarity

    Background: Automatic identification of structure fingerprints from a group of diverse protein structures is challenging, especially for proteins whose divergent amino acid sequences may fall into the “twilight” or “midnight” zones, where pair-wise sequence identities to known sequences fall below 25% and sequence-based functional annotations often fail. Results: Here we report a novel graph database mining method and demonstrate its application to protein structure pattern identification and structure classification. The biological motivation of our study is to recognize common structure patterns in “immunoevasins”, proteins mediating virus evasion of host immune defense. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method. Conclusions: We present a theoretical framework, offer a practical software implementation for incorporating prior domain knowledge, such as the substitution matrices studied here, and devise an efficient algorithm to identify approximately matched frequent subgraphs. By doing so, we significantly expand the analytical power of sophisticated data mining algorithms in dealing with large volumes of complicated and noisy protein structure data. Without loss of generality, the choice of appropriate compatibility matrices allows our method to be easily employed in domains where subgraph labels have some uncertainty.