Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners
In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully incorporated into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to that achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partners in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/cod
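The idea of replacing discrete amino-acid variables with Gaussian ones can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`one_hot`, `gaussian_contact_scores`), the `shrinkage` regularizer, and the Frobenius-norm scoring are assumptions chosen for brevity, and the published method may use a different prior, sequence reweighting, and score correction.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus gap (assumed encoding)
Q = len(ALPHABET) - 1               # the gap state is dropped to avoid redundancy


def one_hot(msa):
    """Encode an alignment (list of equal-length strings) as an (M, L*Q) 0/1 matrix."""
    index = {a: i for i, a in enumerate(ALPHABET)}
    n_seqs, length = len(msa), len(msa[0])
    x = np.zeros((n_seqs, length * Q))
    for m, seq in enumerate(msa):
        for i, a in enumerate(seq):
            k = index[a]
            if k < Q:  # the last symbol (gap) maps to the all-zero vector
                x[m, i * Q + k] = 1.0
    return x


def gaussian_contact_scores(msa, shrinkage=0.5):
    """Return an (L, L) matrix of coupling strengths between alignment columns."""
    x = one_hot(msa)
    length = x.shape[1] // Q
    cov = np.cov(x, rowvar=False, bias=True)
    # Hypothetical regularizer: shrink toward a scaled identity so that the
    # covariance matrix is positive definite and can be inverted.
    cov = (1.0 - shrinkage) * cov + shrinkage * np.eye(length * Q) / Q
    # For a Gaussian model, the couplings are minus the inverse covariance.
    couplings = -np.linalg.inv(cov).reshape(length, Q, length, Q)
    # Collapse each Q x Q coupling block to one number (Frobenius norm).
    scores = np.sqrt((couplings ** 2).sum(axis=(1, 3)))
    np.fill_diagonal(scores, 0.0)
    return scores
```

Given a toy alignment in which two columns always mutate together, the corresponding entry of the returned matrix should be larger than the entries for independent column pairs. The only expensive step is a single matrix inversion, which is what makes the Gaussian formulation exactly solvable.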
SPRINT: Ultrafast protein-protein interaction prediction of the entire human interactome
Proteins usually perform their functions by interacting with other proteins.
Predicting which proteins interact is a fundamental problem. Experimental
methods are slow, expensive, and have a high rate of error. Many computational
methods have been proposed among which sequence-based ones are very promising.
However, so far no such method is able to predict effectively the entire human
interactome: they require too much time or memory. We present SPRINT (Scoring
PRotein INTeractions), a new sequence-based algorithm and tool for predicting
protein-protein interactions. We comprehensively compare SPRINT with
state-of-the-art programs on the seven most reliable human PPI datasets and show
that it is more accurate while running orders of magnitude faster and using
very little memory. SPRINT is the only program that can predict the entire
human interactome. Our goal is to transform the very challenging problem of
predicting the entire human interactome into a routine task. The source code of
SPRINT is freely available from github.com/lucian-ilie/SPRINT/ and the datasets
and predicted PPIs from www.csd.uwo.ca/faculty/ilie/SPRINT/
On Weight Matrix and Free Energy Models for Sequence Motif Detection
The problem of motif detection can be formulated as the construction of a
discriminant function to separate sequences of a specific pattern from
background. In computational biology, motif detection is used to predict DNA
binding sites of a transcription factor (TF), mostly based on the weight matrix
(WM) model or the Gibbs free energy (FE) model. However, despite the wide
applications, theoretical analysis of these two models and their predictions is
still lacking. We derive asymptotic error rates of prediction procedures based
on these models under different data generation assumptions. This allows a
theoretical comparison between the WM-based and the FE-based predictions in
terms of asymptotic efficiency. Applications of the theoretical results are
demonstrated with empirical studies on ChIP-seq data and protein binding
microarray data. We find that, irrespective of underlying data generation
mechanisms, the FE approach shows higher or comparable predictive power
relative to the WM approach when the number of observed binding sites used for
constructing a discriminant decision is not too small.
Comment: 23 pages, 1 figure and 4 tables
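For readers unfamiliar with the weight matrix model mentioned in this abstract, a generic log-odds weight matrix scorer can be sketched as below. It is a textbook-style illustration and not the paper's procedure: the function names, the uniform background, and the pseudocount of 1 are assumptions, and the free energy model is not covered here.

```python
import math

BASES = "ACGT"


def build_weight_matrix(sites, background=None, pseudocount=1.0):
    """Build a log-odds weight matrix from aligned binding sites of equal length."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform background
    matrix = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Log ratio of the motif frequency to the background frequency.
        matrix.append({b: math.log((counts[b] / total) / background[b]) for b in BASES})
    return matrix


def score(matrix, seq):
    """Sum the per-position log-odds weights for a sequence of motif width."""
    return sum(column[base] for column, base in zip(matrix, seq))


def best_site(matrix, seq):
    """Scan a longer sequence and return (score, start) of the best-scoring window."""
    width = len(matrix)
    return max((score(matrix, seq[i:i + width]), i) for i in range(len(seq) - width + 1))
```

A sequence is then classified as a candidate binding site when its score exceeds a chosen threshold, which is the discriminant-function view described in the abstract.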
Inverse Statistical Physics of Protein Sequences: A Key Issues Review
In the course of evolution, proteins undergo important changes in their amino
acid sequences, while their three-dimensional folded structure and their
biological function remain remarkably conserved. Thanks to modern sequencing
techniques, sequence data accumulate at an unprecedented pace. This provides large
sets of so-called homologous, i.e. evolutionarily related, protein sequences, to
which methods of inverse statistical physics can be applied. Using sequence
data as the basis for the inference of Boltzmann distributions from samples of
microscopic configurations or observables, it is possible to extract
information about evolutionary constraints and thus protein function and
structure. Here we give an overview of some biologically important questions,
and how statistical-mechanics inspired modeling approaches can help to answer
them. Finally, we discuss some open questions, which we expect to be addressed
over the next years.
Comment: 18 pages, 7 figures
Identification of functionally related enzymes by learning-to-rank methods
Enzyme sequences and structures are routinely used in the biological sciences
as queries to search for functionally related enzymes in online databases. To
this end, one usually starts from some notion of similarity, comparing two
enzymes by looking for correspondences in their sequences, structures or
surfaces. For a given query, the search operation results in a ranking of the
enzymes in the database, from very similar to dissimilar enzymes, while
information about the biological function of annotated database enzymes is
ignored.
In this work we show that rankings of that kind can be substantially improved
by applying kernel-based learning algorithms. This approach enables the
detection of statistical dependencies between similarities of the active cleft
and the biological function of annotated enzymes. This is in contrast to
search-based approaches, which do not take annotated training data into
account. Similarity measures based on the active cleft are known to outperform
sequence-based or structure-based measures under certain conditions. We
consider the Enzyme Commission (EC) classification hierarchy for obtaining
annotated enzymes during the training phase. The results of a set of sizeable
experiments indicate a consistent and significant improvement for a set of
similarity measures that exploit information about small cavities in the
surface of enzymes.