54 research outputs found

    Counting small patterns in networks

    Networks are a frequently used tool for visualizing and analyzing binary relationships: the entities are represented as a set of nodes and the relations between them as edges in the network. In bioinformatics, one type of relation that is often modeled by networks is the interaction between pairs of proteins. Recent studies have focused on analyzing the local structure of such networks by observing small connected patterns consisting of 4 or 5 nodes, which are also known as graphlets. The nodes of graphlets are further divided into orbits by their "roles" or symmetries. The number of times a node from the network participates in each orbit forms a signature of the node's local network topology. Working under the assumption that the node's local topology is correlated with its function in the network, researchers have successfully used graphlets to predict new protein functions. The bottleneck of graphlet-based approaches is usually the time required to count them. This restriction is becoming even more pronounced with the growing amount of available data. This dissertation focuses on improving existing graphlet counting techniques that are based on simple exhaustive enumeration. We present the algorithm Orca that counts graphlets and their orbits instead of enumerating them. It exploits relations between orbit counts to construct a system of equations that can be set up efficiently. Orca achieves this by enumerating (k-1)-node graphlets to count k-node graphlets, effectively obtaining a speed-up by a factor proportional to the maximum degree of a node in the network. In practical terms, it counts graphlets in larger protein-protein interaction networks about 50-100 times faster. Orca was designed for counting node-orbits of graphlets with 4 and 5 nodes. We also adapt the approach to counting edge-orbits, with the same gains in run time.
We also show that this approach can be generalized to graphlets of arbitrary size by identifying the necessary conditions and proving that these conditions can be fulfilled even for larger graphlets. Finally, we consider the problem of generating random graphs with prescribed graphlet distributions. This motivated the adaptation of Orca for dynamic or changing networks, where edges can be added or removed. These changes can be a consequence of the procedure for generating a random graph or can be inherent in the network and the process it models. The generated graphs closely match the desired graphlet counts and as a consequence approximate other structural measures as well. The developed algorithm is a valuable tool for graphlet-based network analysis and a significant stepping stone towards analyzing larger and denser networks. As the fastest graphlet counting method it also presents a basis for further development of efficient pattern counting methods in graphs. This doctoral dissertation is based on three published papers that together with a chapter containing some unpublished work form the core of the dissertation.
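The idea of deriving orbit counts from relations rather than enumerating every graphlet can be illustrated at the smallest scale. The sketch below is not Orca's implementation; it is a minimal example for the four node-orbits of 2- and 3-node graphlets, assuming the graph is given as a dict mapping each node to a set of neighbours. The function name `orbit_counts_3` and the orbit numbering (0 = edge endpoint, 1 = end of an induced path, 2 = middle of an induced path, 3 = triangle) are choices made for this illustration. Only triangles are enumerated; the two path orbits follow from degree-based equations.

```python
from itertools import combinations

def orbit_counts_3(adj):
    """Return {node: (o0, o1, o2, o3)} for an undirected simple graph.

    adj maps each node to the set of its neighbours (assumed input format).
    """
    deg = {v: len(ns) for v, ns in adj.items()}
    counts = {}
    for v, ns in adj.items():
        # Enumerate only the triangles through v.
        o3 = sum(1 for u, w in combinations(ns, 2) if w in adj[u])
        # Every pair of neighbours of v forms either a triangle or an
        # induced path with v in the middle: o2 + o3 = C(deg(v), 2).
        o2 = deg[v] * (deg[v] - 1) // 2 - o3
        # Walks v-u-w with w != v either close a triangle (counted twice,
        # once per direction) or make v the end of an induced path.
        o1 = sum(deg[u] - 1 for u in ns) - 2 * o3
        counts[v] = (deg[v], o1, o2, o3)
    return counts

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(orbit_counts_3(example))
```

The same pattern (enumerate the smaller structure, then solve for the larger one) is what the abstract describes for 4- and 5-node graphlets, where the system of equations is considerably larger.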

    Graphettes: Constant-time determination of graphlet and orbit identity including (possibly disconnected) graphlets up to size 8.

    Graphlets are small connected induced subgraphs of a larger graph G. Graphlets are now commonly used to quantify local and global topology of networks in the field. Methods exist to exhaustively enumerate all graphlets (and their orbits) in large networks as efficiently as possible using orbit counting equations. However, the number of graphlets in G is exponential in both the number of nodes and edges in G. Enumerating them all is already unacceptably expensive on existing large networks, and the problem will only get worse as networks continue to grow in size and density. Here we introduce an efficient method designed to aid statistical sampling of graphlets up to size k = 8 from a large network. We define graphettes as the generalization of graphlets allowing for disconnected graphlets. Given a particular (undirected) graphette g, we introduce the idea of the canonical graphette as a representative member of the isomorphism group Iso(g) of g. We compute the mapping, in the form of a lookup table, from all 2^(k(k - 1)/2) undirected graphettes g of size k ≤ 8 to their canonical representatives, as well as the permutation that transforms g to its canonical representative. We also compute all automorphism orbits for each canonical graphette. Thus, given any k ≤ 8 nodes in a graph G, we can in constant time infer which graphette it is, as well as which orbit each of the k nodes belongs to. Sampling a large number N of such k-sets of nodes provides an approximation of both the distribution of graphlets and orbits across G, and the orbit degree vector at each node.
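The lookup-table idea can be sketched for small k. The code below is an illustration under stated assumptions, not the authors' implementation: a graphette on k labelled nodes is encoded as an integer whose bits mark which of the k(k-1)/2 node pairs are edges, and the canonical representative is taken to be the smallest such integer over all relabelings. The names `build_canonical_table` and `encode` are hypothetical, and orbit computation is omitted. Brute force over all permutations is only practical for very small k; it is used here purely to show what the table contains.

```python
from itertools import combinations, permutations

def build_canonical_table(k):
    """Map every k-node graphette code to (canonical code, permutation)."""
    pairs = list(combinations(range(k), 2))
    bit = {p: i for i, p in enumerate(pairs)}
    perms = list(permutations(range(k)))
    table = {}
    for g in range(1 << len(pairs)):
        best, best_perm = None, None
        for perm in perms:
            h = 0
            for (u, v), i in bit.items():
                if (g >> i) & 1:
                    a, b = perm[u], perm[v]
                    h |= 1 << bit[(a, b) if a < b else (b, a)]
            if best is None or h < best:
                best, best_perm = h, perm
        table[g] = (best, best_perm)
    return table

def encode(adj, nodes):
    """Encode the subgraph induced by the node tuple as a bit vector."""
    g = 0
    for i, (a, b) in enumerate(combinations(range(len(nodes)), 2)):
        if nodes[b] in adj[nodes[a]]:
            g |= 1 << i
    return g

table = build_canonical_table(4)
# Number of non-isomorphic graphs on 4 nodes, connected or not.
print(len({canon for canon, _ in table.values()}))
```

Once the table exists, identifying a sampled k-set costs one `encode` call and one dictionary lookup, which is the constant-time step the abstract refers to.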

    Combinatorial algorithm for counting small induced graphs and orbits

    Graphlet analysis is an approach to network analysis that is particularly popular in bioinformatics. We show how to set up a system of linear equations that relate the orbit counts and can be used in an algorithm that is significantly faster than the existing approaches based on direct enumeration of graphlets. The algorithm requires the existence of a vertex with certain properties; we show that such a vertex exists for graphlets of arbitrary size, except for complete graphs and $C_4$, which are treated separately. Empirical analysis of running time agrees with the theoretical results.

    A Fast Counting Method for 6-motifs with Low Connectivity

    A k-motif (or graphlet) is a subgraph on k nodes in a graph or network. Counting of motifs in complex networks has been a well-studied problem in network analysis of various real-world graphs arising from the study of social networks and bioinformatics. In particular, the triangle counting problem has received much attention due to its significance in understanding the behavior of social networks. Similarly, subgraphs with more than 3 nodes have received much attention recently. While there have been successful methods developed on this problem, most of the existing algorithms are not scalable to large networks with millions of nodes and edges. The main contribution of this paper is a preliminary study that generalizes the exact counting algorithm provided by Pinar, Seshadhri and Vishal to a collection of 6-motifs. This method uses the counts of motifs with smaller size to obtain the counts of 6-motifs with low connectivity, that is, containing a cut-vertex or a cut-edge. Therefore, it circumvents the combinatorial explosion that naturally arises when counting subgraphs in large networks.
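The principle of counting a motif with a cut-vertex from the counts of smaller motifs can be shown on a 4-node case. The sketch below is not the paper's 6-motif method; it is a minimal example for the tailed triangle (a triangle with one pendant edge), counted as a non-induced subgraph. The cut-vertex v splits the motif into a triangle through v and one extra edge at v, so the total is the sum over v of t(v) * (d(v) - 2), where t(v) is the number of triangles through v and d(v) its degree. The function name and the dict-of-sets input format are assumptions made for this illustration.

```python
from itertools import combinations

def tailed_triangle_count(adj):
    """Count non-induced tailed triangles in an undirected simple graph.

    adj maps each node to the set of its neighbours (assumed input format).
    """
    total = 0
    for v, ns in adj.items():
        # Smaller motif: triangles that contain the cut-vertex v.
        t_v = sum(1 for u, w in combinations(ns, 2) if w in adj[u])
        # The tail can be any neighbour of v outside the chosen triangle.
        total += t_v * (len(ns) - 2)
    return total

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(tailed_triangle_count(example))
```

No 4-node subset is ever enumerated: the work is bounded by triangle counting plus one pass over the degrees.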

    Beyond Triangles: A Distributed Framework for Estimating 3-profiles of Large Graphs

    We study the problem of approximating the 3-profile of a large graph. 3-profiles are generalizations of triangle counts that specify the number of times a small graph appears as an induced subgraph of a large graph. Our algorithm uses the novel concept of 3-profile sparsifiers: sparse graphs that can be used to approximate the full 3-profile counts for a given large graph. Further, we study the problem of estimating local and ego 3-profiles, two graph quantities that characterize the local neighborhood of each vertex of a graph. Our algorithm is distributed and operates as a vertex program over the GraphLab PowerGraph framework. We introduce the concept of edge pivoting which allows us to collect 2-hop information without maintaining an explicit 2-hop neighborhood list at each vertex. This enables the computation of all the local 3-profiles in parallel with minimal communication. We test our implementation in several experiments scaling up to 640 cores on Amazon EC2. We find that our algorithm can estimate the 3-profile of a graph in approximately the same time as triangle counting. For the harder problem of ego 3-profiles, we introduce an algorithm that can estimate profiles of hundreds of thousands of vertices in parallel, in the timescale of minutes. Comment: To appear in part at KDD'1
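To make the definition of a 3-profile concrete, the sketch below computes the exact global 3-profile of a small graph: the number of vertex triples that induce 0, 1, 2, or 3 edges. It is a single-machine illustration, not the distributed sparsifier-based estimator described in the abstract. It assumes a dict-of-sets adjacency format, and the function name `three_profile` is hypothetical. Only triangles are enumerated; the other three counts follow from the edge count and the degrees.

```python
from itertools import combinations
from math import comb

def three_profile(adj):
    """Return (n0, n1, n2, n3): triples inducing 0, 1, 2, 3 edges."""
    n = len(adj)
    m = sum(len(ns) for ns in adj.values()) // 2
    # Each triangle is found once from each of its three vertices.
    n3 = sum(1 for ns in adj.values()
             for u, w in combinations(ns, 2) if w in adj[u]) // 3
    # Pairs of edges sharing a vertex; a triangle contains three of them.
    wedges = sum(comb(len(ns), 2) for ns in adj.values())
    n2 = wedges - 3 * n3
    # (edge, third vertex) pairs count a triple once per induced edge.
    n1 = m * (n - 2) - 2 * n2 - 3 * n3
    n0 = comb(n, 3) - n1 - n2 - n3
    return n0, n1, n2, n3

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(three_profile(example))
```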

    Distributed Estimation of Graph 4-Profiles

    We present a novel distributed algorithm for counting all four-node induced subgraphs in a big graph. These counts, called the 4-profile, describe a graph's connectivity properties and have found several uses ranging from bioinformatics to spam detection. We also study the more complicated problem of estimating the local 4-profiles centered at each vertex of the graph. The local 4-profile embeds every vertex in an 11-dimensional space that characterizes the local geometry of its neighborhood: vertices that connect different clusters will have different local 4-profiles compared to those that are only part of one dense cluster. Our algorithm is a local, distributed message-passing scheme on the graph and computes all the local 4-profiles in parallel. We rely on two novel theoretical contributions: we show that local 4-profiles can be calculated using compressed two-hop information and also establish novel concentration results that show that graphs can be substantially sparsified and still retain good approximation quality for the global 4-profile. We empirically evaluate our algorithm using a distributed GraphLab implementation that we scaled up to 640 cores. We show that our algorithm can compute global and local 4-profiles of graphs with millions of edges in a few minutes, significantly improving upon the previous state of the art. Comment: To appear in part at WWW'1

    Graphlet and Orbit Computation on Heterogeneous Graphs

    Many applications, ranging from natural to social sciences, rely on graphlet analysis for the intuitive and meaningful characterization of networks employing micro-level structures as building blocks. However, it has not been thoroughly explored in heterogeneous graphs, which comprise various types of nodes and edges. Finding graphlets and orbits for heterogeneous graphs is difficult because of the heterogeneity and abundance of semantic information. We consider heterogeneous graphs, which can be treated as colored graphs. By applying the canonical label technique, we determine the graph isomorphism problem with multiple states on nodes and edges. With minimal parameters, we build all non-isomorphic graphs and associated orbits. We provide a Python package that can be used to generate orbits for colored directed graphs and determine the frequency of orbit occurrence. Finally, we provide four examples to illustrate the use of the Python package. Comment: 13 pages, 7 figure

    Quasipolynomiality of the Smallest Missing Induced Subgraph

    We study the problem of finding the smallest graph that does not occur as an induced subgraph of a given graph. This missing induced subgraph has at most logarithmic size and can be found by a brute-force search, in an n-vertex graph, in time $n^{O(\log n)}$. We show that under the Exponential Time Hypothesis this quasipolynomial time bound is optimal. We also consider variations of the problem in which either the missing subgraph or the given graph comes from a restricted graph family; for instance, we prove that the smallest missing planar induced subgraph of a given planar graph can be found in polynomial time. Comment: 10 pages, 1 figure. To appear in J. Graph Algorithms Appl. This version updates an author affiliatio
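The brute-force search mentioned in the abstract can be sketched directly. The code below is a naive illustration, not the paper's algorithm or analysis: for k = 1, 2, ... it lists every graph on k vertices up to isomorphism, collects the isomorphism classes induced by all k-subsets of the input graph, and returns the first class that never appears. The canonical form (the lexicographically smallest sorted edge tuple over all relabelings) and the function names are assumptions made for this example, and the exhaustive canonicalization is only usable on very small inputs.

```python
from itertools import combinations, permutations

def canon(k, edges):
    """Smallest sorted edge tuple over all relabelings of k vertices."""
    best = None
    for perm in permutations(range(k)):
        relabeled = tuple(sorted(
            tuple(sorted((perm[a], perm[b]))) for a, b in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best

def smallest_missing_induced_subgraph(adj):
    """Return (k, canonical edge tuple) of a smallest missing graph."""
    nodes = sorted(adj)
    k = 1
    while True:
        pairs = list(combinations(range(k), 2))
        # All isomorphism classes of graphs on k vertices.
        all_forms = {canon(k, edges)
                     for r in range(len(pairs) + 1)
                     for edges in combinations(pairs, r)}
        # Classes that actually occur as induced subgraphs.
        present = set()
        for subset in combinations(nodes, k):
            edges = [(i, j) for i, j in pairs
                     if subset[j] in adj[subset[i]]]
            present.add(canon(k, edges))
        missing = all_forms - present
        if missing:
            return k, min(missing)
        k += 1

# Triangle 0-1-2 with a pendant node 3: three independent vertices
# never occur, so the answer is the empty graph on 3 vertices.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(smallest_missing_induced_subgraph(example))
```

The loop always terminates, because once k exceeds the number of vertices no k-subset exists and every k-vertex graph is missing.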

    An Introductory Guide to Aligning Networks Using SANA, the Simulated Annealing Network Aligner.

    Sequence alignment has had an enormous impact on our understanding of biology, evolution, and disease. The alignment of biological networks holds similar promise. Biological networks generally model interactions between biomolecules such as proteins, genes, metabolites, or mRNAs. There is strong evidence that the network topology (the "structure" of the network) is correlated with the functions performed, so that network topology can be used to help predict or understand function. However, unlike sequence comparison and alignment, which is an essentially solved problem, network comparison and alignment is an NP-complete problem for which heuristic algorithms must be used. Here we introduce SANA, the Simulated Annealing Network Aligner. SANA is one of many algorithms proposed for the arena of biological network alignment. In the context of global network alignment, SANA stands out for its speed, memory efficiency, ease-of-use, and flexibility in producing alignments between two or more networks. SANA produces better alignments in minutes on a laptop than most other algorithms can produce in hours or days of CPU time on large server-class machines. We walk the user through how to use SANA for several types of biomolecular networks.