Diffusion Component Analysis: Unraveling Functional Topology in Biological Networks
Complex biological systems have been successfully modeled by biochemical and
genetic interaction networks, typically gathered from high-throughput (HTP)
data. These networks can be used to infer functional relationships between
genes or proteins. Using the intuition that the topological role of a gene in a
network relates to its biological function, local or diffusion-based
"guilt-by-association" and graph-theoretic methods have had success in
inferring gene functions. Here we seek to improve function prediction by
integrating diffusion-based methods with a novel dimensionality reduction
technique to overcome the incomplete and noisy nature of network data. In this
paper, we introduce diffusion component analysis (DCA), a framework that plugs
in a diffusion model and learns a low-dimensional vector representation of each
node to encode the topological properties of a network. As a proof of concept,
we demonstrate DCA's substantial improvement over state-of-the-art
diffusion-based approaches in predicting protein function from molecular
interaction networks. Moreover, our DCA framework can integrate multiple
networks from heterogeneous sources, consisting of genomic information,
biochemical experiments and other resources, to even further improve function
prediction. Yet another layer of performance gain is achieved by integrating
the DCA framework with support vector machines that take our node vector
representations as features. Overall, our DCA framework provides a novel
representation of nodes in a network that can be used as a plug-in architecture
to other machine learning algorithms to decipher topological properties of and
obtain novel insights into interactomes.
Comment: RECOMB 201
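The diffusion step that such a framework plugs in can be illustrated with a short sketch. The abstract does not say which diffusion model is used, so the code below assumes a random walk with restart; the function name `rwr`, the `restart` probability and the iteration count are hypothetical choices for illustration, and the dimensionality-reduction step that DCA adds on top is not shown.

```python
def rwr(adj, start, restart=0.5, iters=200):
    """Random walk with restart from one node (an assumed diffusion model).

    adj is a symmetric adjacency matrix given as a list of lists.
    Returns the approximate stationary visiting probabilities, one per node.
    Assumes every node has at least one edge; mass on isolated nodes is dropped.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    state = [1.0 if i == start else 0.0 for i in range(n)]
    for _ in range(iters):
        # With probability `restart` the walker jumps back to the start node.
        nxt = [restart if j == start else 0.0 for j in range(n)]
        for i in range(n):
            if degree[i] == 0:
                continue
            # Otherwise it moves to a neighbour chosen in proportion to edge weight.
            share = (1.0 - restart) * state[i] / degree[i]
            for j in range(n):
                if adj[i][j]:
                    nxt[j] += share * adj[i][j]
        state = nxt
    return state


# Toy path network 0 - 1 - 2 - 3 (not data from the paper).
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
diffusion_state = rwr(path, start=0)
```

Each node's diffusion state is a probability vector over the whole network, so nodes with similar topological roles get similar vectors; a low-dimensional representation would then be fitted to these vectors.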
Coding limits on the number of transcription factors
Transcription factor proteins bind specific DNA sequences to control the
expression of genes. They contain DNA binding domains which belong to several
super-families, each with a specific mechanism of DNA binding. The total number
of transcription factors encoded in a genome increases with the number of genes
in the genome. Here, we examined the number of transcription factors from each
super-family in diverse organisms.
We find that the number of transcription factors from most super-families
appears to be bounded. For example, the number of winged helix factors does not
generally exceed 300, even in very large genomes. The magnitude of the maximal
number of transcription factors from each super-family seems to correlate with
the number of DNA bases effectively recognized by the binding mechanism of that
super-family. Coding theory predicts that such upper bounds on the number of
transcription factors should exist, in order to minimize cross-binding errors
between transcription factors. This theory further predicts that factors with
similar binding sequences should tend to have similar biological effect, so
that errors based on mis-recognition are minimal. We present evidence that
transcription factors with similar binding sequences tend to regulate genes
with similar biological functions, supporting this prediction.
The present study suggests limits on the transcription factor repertoire of
cells, and suggests coding constraints that might apply more generally to the
mapping between binding sites and biological function.
Comment: http://www.weizmann.ac.il/complex/tlusty/papers/BMCGenomics2006.pdf
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1590034/
http://www.biomedcentral.com/1471-2164/7/23
Biological network comparison using graphlet degree distribution
Analogous to biological sequence comparison, comparing cellular networks is
an important problem that could provide insight into biological understanding
and therapeutics. For technical reasons, comparing large networks is
computationally infeasible, and thus heuristics such as the degree distribution
have been sought. It is easy to demonstrate that two networks are different by
simply showing a short list of properties in which they differ. It is much
harder to show that two networks are similar, as it requires demonstrating
their similarity in all of their exponentially many properties. Clearly, it is
computationally prohibitive to analyze all network properties, but the larger
the number of constraints we impose in determining network similarity, the more
likely it is that the networks will truly be similar.
We introduce a new systematic measure of a network's local structure that
imposes a large number of similarity constraints on networks being compared. In
particular, we generalize the degree distribution, which measures the number of
nodes 'touching' k edges, into distributions measuring the number of nodes
'touching' k graphlets, where graphlets are small connected non-isomorphic
subgraphs of a large network. Our new measure of network local structure
consists of 73 graphlet degree distributions (GDDs) of graphlets with 2-5
nodes, but it is easily extendible to a greater number of constraints (i.e.
graphlets). Furthermore, we show a way to combine the 73 GDDs into a network
'agreement' measure. Based on this new network agreement measure, we show that
almost all of the 14 eukaryotic PPI networks, including human, are better
modeled by geometric random graphs than by Erdos-Renyi, random scale-free, or
Barabasi-Albert scale-free networks.
Comment: Proceedings of the 2006 European Conference on Computational Biology,
ECCB'06, Eilat, Israel, January 21-24, 200
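The generalization from "nodes touching k edges" to "nodes touching k graphlets" can be sketched on a toy graph. The code below is only an illustration under simplifying assumptions: it computes the ordinary degree distribution and the analogous distribution for a single graphlet (the triangle), not the 73 distributions described in the abstract, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Maps k to the number of nodes touching exactly k edges."""
    return Counter(len(neighbours) for neighbours in adj.values())


def triangle_degree_distribution(adj):
    """Maps k to the number of nodes touching exactly k triangles."""
    per_node = {}
    for node, neighbours in adj.items():
        # A triangle at `node` is a pair of its neighbours that are also adjacent.
        per_node[node] = sum(
            1 for a, b in combinations(neighbours, 2) if b in adj[a]
        )
    return Counter(per_node.values())


# Toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 3.
toy = build_adjacency([(1, 2), (2, 3), (1, 3), (3, 4)])
edge_dist = degree_distribution(toy)
triangle_dist = triangle_degree_distribution(toy)
```

Two networks can share the same degree distribution yet differ in the triangle distribution, which is why adding graphlet-based distributions imposes extra similarity constraints.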
Global alignment of protein-protein interaction networks by graph matching methods
Aligning protein-protein interaction (PPI) networks of different species has
drawn considerable interest recently. This problem is important to
investigate evolutionarily conserved pathways or protein complexes across
species, and to help in the identification of functional orthologs through the
detection of conserved interactions. It is, however, a difficult combinatorial
problem, for which only heuristic methods have been proposed so far. We
reformulate the PPI alignment as a graph matching problem, and investigate how
state-of-the-art graph matching algorithms can be used for that purpose. We
differentiate between two alignment problems, depending on whether strict
constraints on protein matches are given, based on sequence similarity, or
whether the goal is instead to find an optimal compromise between sequence
similarity and interaction conservation in the alignment. We propose new
methods for both cases, and assess their performance on the alignment of the
yeast and fly PPI networks. The new methods consistently outperform
state-of-the-art algorithms, retrieving in particular 78% more conserved
interactions than IsoRank for a given level of sequence similarity.
Availability: http://cbio.ensmp.fr/proj/graphm_ppi/; additional data and code
are available upon request. Contact: [email protected]
Comment: Preprint versio
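The compromise between sequence similarity and interaction conservation can be made concrete with a small scoring sketch. The abstract does not give the objective function, so the weighted sum below, the trade-off parameter `lam` and all function names are assumptions for illustration, not the authors' formulation; the sketch only scores a given alignment and does not search for the best one.

```python
def conserved_interactions(edges_a, edges_b, mapping):
    """Counts edges of network A whose mapped endpoints also interact in B."""
    edge_set_b = {frozenset(edge) for edge in edges_b}
    count = 0
    for u, v in edges_a:
        if u in mapping and v in mapping:
            if frozenset((mapping[u], mapping[v])) in edge_set_b:
                count += 1
    return count


def alignment_score(edges_a, edges_b, mapping, similarity, lam=0.5):
    """Assumed objective: lam * conserved edges + (1 - lam) * sequence similarity."""
    conserved = conserved_interactions(edges_a, edges_b, mapping)
    sequence = sum(similarity.get(pair, 0.0) for pair in mapping.items())
    return lam * conserved + (1.0 - lam) * sequence


# Toy networks and similarity scores (hypothetical, not data from the paper).
edges_a = [("a1", "a2"), ("a2", "a3")]
edges_b = [("b1", "b2"), ("b2", "b3"), ("b1", "b3")]
mapping = {"a1": "b1", "a2": "b2", "a3": "b3"}
similarity = {("a1", "b1"): 0.9, ("a2", "b2"): 0.8}

n_conserved = conserved_interactions(edges_a, edges_b, mapping)
score = alignment_score(edges_a, edges_b, mapping, similarity, lam=0.5)
```

Setting `lam` close to 1 favours interaction conservation, while the strictly constrained variant mentioned in the abstract would instead restrict `mapping` to pairs allowed by sequence similarity.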