54 research outputs found
Counting small patterns in networks
Networks are a widely used tool for visualizing and analyzing binary relationships: entities are represented as nodes and the relations between them as edges. In bioinformatics, one type of relation often modeled this way is the interaction between pairs of proteins. Recent studies have focused on analyzing the local structure of such networks by observing small connected patterns consisting of 4 or 5 nodes, which are also known as graphlets. The nodes of graphlets are further divided into orbits by their "roles" or symmetries. The number of times a node from the network participates in each orbit forms a signature of the node's local network topology. Working under the assumption that a node's local topology is correlated with its function in the network, researchers have successfully used graphlets to predict new protein functions.
The bottleneck of graphlet-based approaches is usually in the time required to count them. This restriction is becoming even more pronounced with a growing amount of available data. This dissertation focuses on improving existing graphlet counting techniques that are based on simple exhaustive enumeration.
We present the algorithm Orca that counts graphlets and their orbits instead of enumerating them. It exploits relations between orbit counts to construct a system of equations that can be set up efficiently. Orca achieves this by enumerating (k-1)-node graphlets to count k-node graphlets, effectively obtaining a speed-up by a factor proportional to the maximum degree of a node in the network. In practical terms, it counts graphlets in larger protein-protein interaction networks about 50-100 times faster.
Orca was designed for counting graphlets with 4 and 5 nodes. However, we adapt the approach to counting edge-orbits in addition to the original node-orbits with the same gains in run time. We also show that this approach can be generalized to graphlets of arbitrary size by identifying the necessary conditions and proving that these conditions can be fulfilled even for larger graphlets.
Finally, we consider the problem of generating random graphs with prescribed graphlet distributions. This motivated the adaptation of Orca for dynamic or changing networks, where edges can be added or removed. These changes can be a consequence of the procedure for generating a random graph or can be inherent in the network and the process it models. The generated graphs closely match the desired graphlet counts and as a consequence approximate other structural measures as well.
The developed algorithm is a valuable tool for graphlet-based network analysis and a significant stepping stone towards analyzing larger and denser networks. As the fastest graphlet counting method it also presents a basis for further development of efficient pattern counting methods in graphs.
This doctoral dissertation is based on three published papers that, together with a chapter containing some unpublished work, form the core of the dissertation.
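The idea of deriving some orbit counts from equations rather than enumerating every pattern can be illustrated on the smallest case, 3-node graphlets. The sketch below is not Orca itself; it is a toy under the assumption that the graph is given as a dictionary `adj` mapping each node to the set of its neighbours (a hypothetical input format). It enumerates only edges (2-node graphlets), counts common neighbours to obtain triangles, and then obtains the two path orbits from simple linear relations.

```python
def three_node_orbits(adj):
    """Return {node: (o0, o1, o2, o3)} for an undirected simple graph.

    o0 = degree, o1 = end of an induced 3-node path,
    o2 = middle of an induced 3-node path, o3 = triangle.
    """
    orbits = {}
    for v, nbrs in adj.items():
        deg = len(nbrs)
        # Enumerate edges (v, u) only; every triangle through v is seen
        # twice, once from each of its other two corners.
        tri = sum(len(nbrs & adj[u]) for u in nbrs) // 2
        # Relation 1: every pair of neighbours of v is either a path
        # centred at v or a triangle, so C(deg, 2) = o2 + o3.
        mid = deg * (deg - 1) // 2 - tri
        # Relation 2: walks v-u-w with w != v number sum(deg(u) - 1);
        # each triangle through v contributes two such walks.
        end = sum(len(adj[u]) - 1 for u in nbrs) - 2 * tri
        orbits[v] = (deg, end, mid, tri)
    return orbits


# A triangle 0-1-2 with a pendant node 3 attached to node 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(three_node_orbits(example))
```

For larger graphlets the same principle applies with more equations, but the relations themselves are those given in the dissertation and are not reproduced here.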
Graphettes: Constant-time determination of graphlet and orbit identity including (possibly disconnected) graphlets up to size 8.
Graphlets are small connected induced subgraphs of a larger graph G. Graphlets are now commonly used to quantify local and global topology of networks in the field. Methods exist to exhaustively enumerate all graphlets (and their orbits) in large networks as efficiently as possible using orbit counting equations. However, the number of graphlets in G is exponential in both the number of nodes and edges in G. Enumerating them all is already unacceptably expensive on existing large networks, and the problem will only get worse as networks continue to grow in size and density. Here we introduce an efficient method designed to aid statistical sampling of graphlets up to size k = 8 from a large network. We define graphettes as the generalization of graphlets allowing for disconnected graphlets. Given a particular (undirected) graphette g, we introduce the idea of the canonical graphette [Formula: see text] as a representative member of the isomorphism group Iso(g) of g. We compute the mapping [Formula: see text], in the form of a lookup table, from all 2^(k(k-1)/2) undirected graphettes g of size k ≤ 8 to their canonical representatives [Formula: see text], as well as the permutation that transforms g to [Formula: see text]. We also compute all automorphism orbits for each canonical graphette. Thus, given any k ≤ 8 nodes in a graph G, we can in constant time infer which graphette it is, as well as which orbit each of the k nodes belongs to. Sampling a large number N of such k-sets of nodes provides an approximation of both the distribution of graphlets and orbits across G, and the orbit degree vector at each node.
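A lookup table of this kind can be sketched by brute force for very small k. The code below is an illustration only, not the authors' construction: it encodes each graph on k labelled nodes as a bit-vector over the k(k-1)/2 node pairs, tries every permutation, and stores the smallest resulting bit-vector as the canonical form together with one permutation that produces it. The function names and the encoding are assumptions made for this example, automorphism orbits are omitted, and the brute-force approach is only practical for small k.

```python
from itertools import combinations, permutations


def build_canonical_table(k):
    """Map every labelled graph on k nodes to (canonical code, permutation)."""
    pairs = list(combinations(range(k), 2))
    index = {p: i for i, p in enumerate(pairs)}
    perms = list(permutations(range(k)))
    table = {}
    for g in range(1 << len(pairs)):
        best = None
        for perm in perms:
            h = 0
            for (a, b), i in index.items():
                if (g >> i) & 1:
                    # Relabel edge (a, b) and set the bit of its image.
                    x, y = sorted((perm[a], perm[b]))
                    h |= 1 << index[(x, y)]
            if best is None or h < best[0]:
                best = (h, perm)
        table[g] = best
    return pairs, table


def encode(adj, nodes, pairs):
    """Bit-vector of the subgraph of `adj` induced by the tuple `nodes`."""
    g = 0
    for i, (a, b) in enumerate(pairs):
        if nodes[b] in adj[nodes[a]]:
            g |= 1 << i
    return g


pairs4, table4 = build_canonical_table(4)
# Number of distinct canonical forms = number of graphettes on 4 nodes.
print(len({canon for canon, _ in table4.values()}))
```

Once the table is built, identifying the graphette on a sampled k-set is a single `encode` call followed by a dictionary lookup, which is the constant-time step the abstract refers to.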
Combinatorial algorithm for counting small induced graphs and orbits
Graphlet analysis is an approach to network analysis that is particularly
popular in bioinformatics. We show how to set up a system of linear equations
that relate the orbit counts and can be used in an algorithm that is
significantly faster than the existing approaches based on direct enumeration
of graphlets. The algorithm requires the existence of a vertex with certain
properties; we show that such a vertex exists for graphlets of arbitrary size,
except for complete graphs and , which are treated separately. Empirical
analysis of running time agrees with the theoretical results
A Fast Counting Method for 6-motifs with Low Connectivity
A k-motif (or graphlet) is a subgraph on k nodes in a graph or network.
Counting of motifs in complex networks has been a well-studied problem in
network analysis of various real-world graphs arising from the study of social
networks and bioinformatics. In particular, the triangle counting problem has
received much attention due to its significance in understanding the behavior
of social networks. Similarly, subgraphs with more than 3 nodes have received
much attention recently. While there have been successful methods developed on
this problem, most of the existing algorithms are not scalable to large
networks with millions of nodes and edges.
The main contribution of this paper is a preliminary study that generalizes
the exact counting algorithm provided by Pinar, Seshadhri and Vishal to a
collection of 6-motifs. This method uses the counts of motifs with smaller size
to obtain the counts of 6-motifs with low connectivity, that is, containing a
cut-vertex or a cut-edge. Therefore, it circumvents the combinatorial explosion
that naturally arises when counting subgraphs in large networks
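The principle of obtaining the count of a motif with a cut-vertex from counts of smaller motifs can be shown on a 4-node example rather than the 6-motifs of the paper. In the sketch below (an illustration under the assumption that the graph is a dictionary `adj` of neighbour sets with sortable node labels), the tailed triangle is split at its cut-vertex into a triangle and a pendant edge, so the number of not necessarily induced copies is the sum over vertices of (triangles at v) times (deg(v) - 2). A brute-force count is included only to check the formula.

```python
from itertools import combinations


def triangles_per_vertex(adj):
    """Number of triangles containing each vertex."""
    return {
        v: sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        for v, nbrs in adj.items()
    }


def tailed_triangles(adj):
    """Count (not necessarily induced) tailed triangles via the cut-vertex."""
    t = triangles_per_vertex(adj)
    # At cut-vertex v: choose a triangle through v, then any neighbour of v
    # outside that triangle as the tail; two neighbours are already used.
    return sum(t[v] * (len(adj[v]) - 2) for v in adj)


def tailed_triangles_brute(adj):
    """Same count by explicit enumeration, for checking only."""
    count = 0
    for a, b, c in combinations(sorted(adj), 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            for v in (a, b, c):
                count += sum(1 for w in adj[v] if w not in (a, b, c))
    return count


# Complete graph on 4 nodes.
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
print(tailed_triangles(k4), tailed_triangles_brute(k4))
```

Turning such non-induced counts into induced counts requires further correction terms, which are not shown here.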
Beyond Triangles: A Distributed Framework for Estimating 3-profiles of Large Graphs
We study the problem of approximating the 3-profile of a large graph.
3-profiles are generalizations of triangle counts that specify the number of
times a small graph appears as an induced subgraph of a large graph. Our
algorithm uses the novel concept of 3-profile sparsifiers: sparse graphs that
can be used to approximate the full 3-profile counts for a given large graph.
Further, we study the problem of estimating local and ego 3-profiles, two
graph quantities that characterize the local neighborhood of each vertex of a
graph.
Our algorithm is distributed and operates as a vertex program over the
GraphLab PowerGraph framework. We introduce the concept of edge pivoting which
allows us to collect -hop information without maintaining an explicit
-hop neighborhood list at each vertex. This enables the computation of all
the local 3-profiles in parallel with minimal communication.
We test our implementation in several experiments scaling up to cores
on Amazon EC2. We find that our algorithm can estimate the 3-profile of a
graph in approximately the same time as triangle counting. For the harder
problem of ego 3-profiles, we introduce an algorithm that can estimate
profiles of hundreds of thousands of vertices in parallel, in the timescale of
minutes. Comment: To appear in part at KDD'1
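To make the notion of a global 3-profile concrete, the sketch below computes it exactly on a small graph from three simpler quantities: the number of edges, the number of wedges (paths of length two), and the number of triangles. This is a single-machine illustration, not the distributed or sparsified algorithm of the paper; the input format (a dictionary `adj` of neighbour sets with comparable node labels) and the function name are assumptions.

```python
from math import comb


def three_profile(adj):
    """Return (n0, n1, n2, n3): the number of vertex triples inducing
    exactly 0, 1, 2 and 3 edges."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2
    wedges = sum(comb(len(nb), 2) for nb in adj.values())
    # Common neighbours summed over edges count every triangle 3 times.
    tri = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v) // 3
    n3 = tri
    # A triangle contains 3 wedges, an induced path exactly 1.
    n2 = wedges - 3 * tri
    # (edge, third vertex) pairs count a triple once per edge it contains.
    n1 = m * (n - 2) - 2 * n2 - 3 * n3
    n0 = comb(n, 3) - n1 - n2 - n3
    return (n0, n1, n2, n3)


# A triangle 0-1-2 with a pendant node 3 attached to node 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(three_profile(example))
```

The four numbers always sum to the number of vertex triples, which gives a quick sanity check on any implementation.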
Distributed Estimation of Graph 4-Profiles
We present a novel distributed algorithm for counting all four-node induced
subgraphs in a big graph. These counts, called the 4-profile, describe a
graph's connectivity properties and have found several uses ranging from
bioinformatics to spam detection. We also study the more complicated problem of
estimating the local 4-profiles centered at each vertex of the graph. The
local 4-profile embeds every vertex in an -dimensional space that
characterizes the local geometry of its neighborhood: vertices that connect
different clusters will have different local 4-profiles compared to those
that are only part of one dense cluster.
Our algorithm is a local, distributed message-passing scheme on the graph and
computes all the local 4-profiles in parallel. We rely on two novel
theoretical contributions: we show that local 4-profiles can be calculated
using compressed two-hop information and also establish novel concentration
results that show that graphs can be substantially sparsified and still retain
good approximation quality for the global 4-profile.
We empirically evaluate our algorithm using a distributed GraphLab
implementation that we scaled up to cores. We show that our algorithm can
compute global and local 4-profiles of graphs with millions of edges in a few
minutes, significantly improving upon the previous state of the art. Comment: To appear in part at WWW'1
Graphlet and Orbit Computation on Heterogeneous Graphs
Many applications, ranging from natural to social sciences, rely on graphlet
analysis for the intuitive and meaningful characterization of networks
employing micro-level structures as building blocks. However, it has not been
thoroughly explored in heterogeneous graphs, which comprise various types of
nodes and edges. Finding graphlets and orbits for heterogeneous graphs is
difficult because of the heterogeneity and abundance of semantic information.
We consider heterogeneous graphs, which can be treated as colored graphs. By
applying the canonical label technique, we determine the graph isomorphism
problem with multiple states on nodes and edges. With minimal parameters, we
build all non-isomorphic graphs and associated orbits. We provide a Python
package that can be used to generate orbits for colored directed graphs and
determine the frequency of orbit occurrence. Finally, we provide four examples
to illustrate the use of the Python package. Comment: 13 pages, 7 figures
Quasipolynomiality of the Smallest Missing Induced Subgraph
We study the problem of finding the smallest graph that does not occur as an
induced subgraph of a given graph. This missing induced subgraph has at most
logarithmic size and can be found by a brute-force search, in an n-vertex
graph, in time n^{O(log n)}. We show that under the Exponential Time
Hypothesis this quasipolynomial time bound is optimal. We also consider
variations of the problem in which either the missing subgraph or the given
graph comes from a restricted graph family; for instance, we prove that the
smallest missing planar induced subgraph of a given planar graph can be found
in polynomial time. Comment: 10 pages, 1 figure. To appear in J. Graph Algorithms Appl. This
version updates an author affiliation
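The brute-force search mentioned in this abstract can be sketched directly for tiny inputs. The code below is an illustration under stated assumptions (a dictionary `adj` of neighbour sets with sortable labels, and a naive canonical form that tries every relabeling); it is not the authors' procedure and makes no attempt at efficiency. For k = 1, 2, ... it collects the canonical forms of all induced k-vertex subgraphs of the input and returns the first k-vertex graph whose canonical form is absent.

```python
from itertools import combinations, permutations


def canon(k, edges):
    """Naive canonical form: the smallest sorted edge tuple over all
    relabelings of the vertices 0..k-1."""
    best = None
    for perm in permutations(range(k)):
        relabeled = tuple(
            sorted(tuple(sorted((perm[a], perm[b]))) for a, b in edges)
        )
        if best is None or relabeled < best:
            best = relabeled
    return best


def smallest_missing_induced_subgraph(adj):
    """Return (k, edges) for a smallest graph that is not an induced
    subgraph of the input graph."""
    nodes = sorted(adj)
    k = 1
    while True:
        present = set()
        for subset in combinations(nodes, k):
            edges = [
                (i, j)
                for i, j in combinations(range(k), 2)
                if subset[j] in adj[subset[i]]
            ]
            present.add(canon(k, edges))
        pairs = list(combinations(range(k), 2))
        # Try candidate graphs on k vertices in order of edge count.
        for r in range(len(pairs) + 1):
            for edges in combinations(pairs, r):
                c = canon(k, edges)
                if c not in present:
                    return k, c
        k += 1


triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(smallest_missing_induced_subgraph(triangle))
```

The loop always terminates, because once k exceeds the number of vertices no induced subgraph of that size exists and the first candidate is reported as missing.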
An Introductory Guide to Aligning Networks Using SANA, the Simulated Annealing Network Aligner.
Sequence alignment has had an enormous impact on our understanding of biology, evolution, and disease. The alignment of biological networks holds similar promise. Biological networks generally model interactions between biomolecules such as proteins, genes, metabolites, or mRNAs. There is strong evidence that the network topology (the "structure" of the network) is correlated with the functions performed, so that network topology can be used to help predict or understand function. However, unlike sequence comparison and alignment, which is an essentially solved problem, network comparison and alignment is an NP-complete problem for which heuristic algorithms must be used. Here we introduce SANA, the Simulated Annealing Network Aligner. SANA is one of many algorithms proposed for biological network alignment. In the context of global network alignment, SANA stands out for its speed, memory efficiency, ease of use, and flexibility in producing alignments between two or more networks. SANA produces better alignments in minutes on a laptop than most other algorithms can produce in hours or days of CPU time on large server-class machines. We walk the user through how to use SANA for several types of biomolecular networks.