61 research outputs found
The Statistical Mechanics Approach to Protein Sequence Data: Beyond Contact Prediction
The recent application of models from inverse statistical mechanics to protein sequence data has been a large success. In my thesis, I will build upon these models but also use them beyond their original aim of residue contact prediction. This includes the improvement of contact prediction itself by extending the models, the application of the methods in the wider scope of protein interaction networks, and the prediction of further biological characteristics from the extracted information.
Inter-protein sequence co-evolution predicts known physical interactions in bacterial ribosomes and the trp operon
Interaction between proteins is a fundamental mechanism that underlies
virtually all biological processes. Many important interactions are conserved
across a large variety of species. The need to maintain interaction leads to a
high degree of co-evolution between residues in the interface between partner
proteins. The inference of protein-protein interaction networks from the
rapidly growing sequence databases is one of the most formidable tasks in
systems biology today. We propose here a novel approach based on the
Direct-Coupling Analysis of the co-evolution between inter-protein residue
pairs. We use ribosomal and trp operon proteins as test cases. For the small
and large ribosomal subunits, our approach predicts protein-interaction
partners at true-positive rates of 70% and 90%, respectively, within the first
10 predictions, with areas of 0.69 and 0.81, respectively, under the ROC curves
for all predictions. In the trp operon, it assigns the two largest interaction
scores to the only two interactions experimentally known. On the level of
residue interactions we show that for the small and the large ribosomal subunit
our approach predicts interacting residues in the system with true-positive
rates of 60% and 85%, respectively, in the first 20 predictions. We use artificial data to show
that the performance of our approach depends crucially on the size of the joint
multiple sequence alignments and analyze how many sequences would be necessary
for a perfect prediction if the sequences were sampled from the same model that
we use for prediction. Given the performance of our approach on the test data
we speculate that it can be used to detect new interactions, especially in the
light of the rapid growth of available sequence data.
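The two evaluation figures quoted in this abstract (the fraction of true interactions among the top-ranked predictions, and the area under the ROC curve) can be computed from any list of scored protein pairs. The following is a minimal sketch only: the function names and the toy scores and labels are made up for illustration and are not taken from the paper.

```python
def top_k_true_positive_rate(scores, labels, k):
    """Fraction of true interactions among the k highest-scoring pairs."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    Equals the probability that a randomly chosen true pair scores higher
    than a randomly chosen false pair; ties count as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): interaction scores for six protein pairs,
# with label 1 marking a known interaction and 0 a non-interaction.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 0, 1, 1, 0, 0]
print(top_k_true_positive_rate(toy_scores, toy_labels, k=3))
print(roc_auc(toy_scores, toy_labels))
```

On the toy data, two of the three top-ranked pairs are true interactions (rate 2/3), and seven of the nine positive-negative score comparisons are ordered correctly (AUC 7/9).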
Inverse Statistical Physics of Protein Sequences: A Key Issues Review
In the course of evolution, proteins undergo important changes in their amino
acid sequences, while their three-dimensional folded structure and their
biological function remain remarkably conserved. Thanks to modern sequencing
techniques, sequence data accumulate at unprecedented pace. This provides large
sets of so-called homologous, i.e. evolutionarily related, protein sequences, to
which methods of inverse statistical physics can be applied. Using sequence
data as the basis for the inference of Boltzmann distributions from samples of
microscopic configurations or observables, it is possible to extract
information about evolutionary constraints and thus protein function and
structure. Here we give an overview of some biologically important questions,
and how statistical-mechanics inspired modeling approaches can help to answer
them. Finally, we discuss some open questions, which we expect to be addressed
over the next years. Comment: 18 pages, 7 figures
Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners
In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to that achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partners in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/cod
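The core idea of treating one-hot encoded alignment columns as Gaussian variables can be sketched in a few lines: in a multivariate Gaussian, the couplings are the negative inverse of the covariance matrix. This is a hedged illustration, not the authors' implementation. The shrinkage regularizer, the function names and the toy alignment are assumptions, and steps a practical pipeline would typically include (such as sequence reweighting) are omitted.

```python
import numpy as np


def one_hot(msa, q):
    """Encode an integer MSA (M sequences x L columns, states 0..q-1).

    The last state is dropped per column so the covariance is not singular
    by construction; the result has shape (M, L * (q - 1)).
    """
    msa = np.asarray(msa)
    m, length = msa.shape
    x = np.zeros((m, length, q - 1))
    for a in range(q - 1):
        x[:, :, a] = (msa == a)
    return x.reshape(m, length * (q - 1))


def gaussian_coupling_scores(msa, q, shrinkage=0.5):
    """Score column pairs by the Frobenius norm of their coupling block."""
    x = one_hot(msa, q)
    length = np.asarray(msa).shape[1]
    cov = np.cov(x, rowvar=False, bias=True)
    # Assumed regularizer (not from the paper): shrink toward a scaled
    # identity so that the covariance matrix is safely invertible.
    target = np.eye(cov.shape[0]) / q
    cov = (1.0 - shrinkage) * cov + shrinkage * target
    couplings = -np.linalg.inv(cov)  # Gaussian couplings J = -C^{-1}
    blocks = couplings.reshape(length, q - 1, length, q - 1)
    scores = np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    np.fill_diagonal(scores, 0.0)  # self-pairs are not predictions
    return scores


# Toy alignment (hypothetical): columns 0 and 1 always agree,
# column 2 varies independently of them.
toy_msa = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1],
           [0, 0, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0]]
toy_scores = gaussian_coupling_scores(toy_msa, q=2)
print(toy_scores)
```

In the toy alignment only the pair (0, 1) receives a non-zero score, because the other column pairs have zero covariance. The whole inference is a single matrix inversion, which is what makes the Gaussian variant fast.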
Entropic gradient descent algorithms and wide flat minima
The properties of flat minima in the empirical risk landscape of neural
networks have been debated for some time. Increasing evidence suggests they
possess better generalization capabilities with respect to sharp ones. First,
we discuss Gaussian mixture classification models and show analytically that
there exist Bayes optimal pointwise estimators which correspond to minimizers
belonging to wide flat regions. These estimators can be found by applying
maximum flatness algorithms either directly on the classifier (which is norm
independent) or on the differentiable loss function used in learning. Next, we
extend the analysis to the deep learning scenario by extensive numerical
validations. Using two algorithms, Entropy-SGD and Replicated-SGD, that
explicitly include in the optimization objective a non-local flatness measure
known as local entropy, we consistently improve the generalization error for
common architectures (e.g. ResNet, EfficientNet). An easy-to-compute flatness
measure shows a clear correlation with test accuracy. Comment: updated version focusing on numerical experiments
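To make the notion of local entropy concrete, the sketch below evaluates one common form of the local-entropy objective on a hypothetical one-dimensional loss with a sharp and a wide minimum of equal depth. Sign and normalization conventions differ between papers, and the toy loss, parameter values and function names are assumptions. Brute-force integration is feasible here only because the problem is one-dimensional; it is not how Entropy-SGD or Replicated-SGD work.

```python
import math


def toy_loss(w):
    """Hypothetical 1D loss: sharp minimum at w = -2, wide minimum at w = +2.

    Both minima have loss 0, so the plain loss cannot tell them apart.
    """
    return min(50.0 * (w + 2.0) ** 2, 0.5 * (w - 2.0) ** 2)


def local_entropy_loss(w, beta=1.0, gamma=1.0, half_width=12.0, steps=24001):
    """Return -(1/beta) * log of the integral over w' of
    exp(-beta * (loss(w') + gamma/2 * (w - w')**2)), by the trapezoid rule.

    Lower values mean that more low-loss volume lies close to w.
    """
    lo, hi = w - half_width, w + half_width
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for i in range(steps):
        wp = lo + i * h
        weight = 0.5 if i in (0, steps - 1) else 1.0
        total += weight * math.exp(
            -beta * (toy_loss(wp) + 0.5 * gamma * (w - wp) ** 2)
        )
    return -math.log(total * h) / beta


sharp = local_entropy_loss(-2.0)
wide = local_entropy_loss(2.0)
print("sharp minimum:", sharp)
print("wide minimum:", wide)
```

Although the two minima have identical loss, the wide one gets the lower local-entropy value, so an optimizer that minimizes this objective would prefer it.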
Reconstruction of pairwise interactions using Energy-Based Models
No abstract available
Interpretable pairwise distillations for generative protein sequence models
Many different types of generative models for protein sequences have been proposed in the literature. Their uses include the prediction of mutational effects, protein design and the prediction of structural properties. Neural network (NN) architectures have shown strong performance, commonly attributed to their capacity to extract non-trivial higher-order interactions from the data. In this work, we analyze two different NN models and assess how close they are to simple pairwise distributions, which have been used in the past for similar problems. We present an approach for extracting pairwise models from more complex ones using an energy-based modeling framework. We show that for the tested models the extracted pairwise models can replicate the energies of the original models and are also close in performance in tasks like mutational effect prediction. In addition, we show that even simpler, factorized models often come close in performance to the original models.
Improving contact prediction along three dimensions
Correlation patterns in multiple sequence alignments of homologous proteins can be exploited to infer information on the three-dimensional structure of their members. The typical pipeline to address this task, which in this paper we refer to as the three dimensions of contact prediction, is to (i) filter and align the raw sequence data representing the evolutionarily related proteins; (ii) choose a predictive model to describe a sequence alignment; (iii) infer the model parameters and interpret them in terms of structural properties, such as an accurate contact map. We show here that all three dimensions are important for overall prediction success. In particular, we show that it is possible to improve significantly along the second dimension by going beyond the pairwise Potts models from statistical physics, which have hitherto been the focus of the field. These (simple) extensions are motivated by multiple sequence alignments often containing long stretches of gaps which, as a data feature, would be rather atypical for independent samples drawn from a Potts model. Using a large test set of proteins we show that the combined improvements along the three dimensions are as large as any reported to date.
Improving Contact Prediction along Three Dimensions
Correlation patterns in multiple sequence alignments of homologous proteins can be exploited to infer information on the three-dimensional structure of their members. The typical pipeline to address this task, which in this paper we refer to as the three dimensions of contact prediction, is to (i) filter and align the raw sequence data representing the evolutionarily related proteins; (ii) choose a predictive model to describe a sequence alignment; (iii) infer the model parameters and interpret them in terms of structural properties, such as an accurate contact map. We show here that all three dimensions are important for overall prediction success. In particular, we show that it is possible to improve significantly along the second dimension by going beyond the pairwise Potts models from statistical physics, which have hitherto been the focus of the field. These (simple) extensions are motivated by multiple sequence alignments often containing long stretches of gaps which, as a data feature, would be rather atypical for independent samples drawn from a Potts model. Using a large test set of proteins we show that the combined improvements along the three dimensions are as large as any reported to date. Peer reviewed