48 research outputs found

Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models

Author: Aurell Erik
Ekeberg Magnus
Lan Yueheng
Lövkvist Cecilia
Weigt Martin
Publication venue: 'American Physical Society (APS)'
Publication date: 01/01/2013

Spatially proximate amino acids in a protein tend to coevolve. A protein's three-dimensional (3D) structure hence leaves an echo of correlations in the evolutionary record. Reverse engineering 3D structures from such correlations is an open problem in structural biology, pursued with increasing vigor as more and more protein sequences continue to fill the data banks. Within this task lies a statistical inference problem, rooted in the following: correlation between two sites in a protein sequence can arise from firsthand interaction but can also be network-propagated via intermediate sites; observed correlation is not enough to guarantee proximity. To separate direct from indirect interactions is an instance of the general problem of inverse statistical mechanics, where the task is to learn model parameters (fields, couplings) from observables (magnetizations, correlations, samples) in large systems. In the context of protein sequences, the approach has been referred to as direct-coupling analysis. Here we show that the pseudolikelihood method, applied to 21-state Potts models describing the statistical properties of families of evolutionarily related proteins, significantly outperforms existing approaches to the direct-coupling analysis, the latter being based on standard mean-field techniques. This improved performance also relies on a modified score for the coupling strength. The results are verified using known crystal structures of specific sequence instances of various protein families. Code implementing the new method can be found at http://plmdca.csc.kth.se/.

Comment: 19 pages, 16 figures, published version
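
To make the pseudolikelihood idea above concrete, here is a minimal sketch of the quantity such a method maximizes: the probability of each site's observed state conditioned on all other sites of the same sequence, under a Potts model with fields and couplings. The function names and the nested-list layout of `h` and `J` are assumptions made for illustration; this is not the code distributed by the authors, and it omits regularization, sequence reweighting and the scoring step.

```python
import math

def site_conditional(seq, r, h, J, q):
    """Return P(sigma_r = a | all other sites) for a = 0..q-1 under a Potts model.

    Hypothetical layout: h[i][a] is the field for state a at site i, and
    J[i][j][a][b] is the coupling between state a at site i and state b at site j.
    """
    energies = []
    for a in range(q):
        e = h[r][a]
        for i, s in enumerate(seq):
            if i != r:
                e += J[r][i][a][s]
        energies.append(e)
    m = max(energies)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(e - m) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def neg_log_pseudolikelihood(seq, h, J, q):
    """Sum over sites of -log P(observed state at the site | rest of the sequence)."""
    return -sum(
        math.log(site_conditional(seq, r, h, J, q)[seq[r]])
        for r in range(len(seq))
    )
```

In an actual inference, this objective would be averaged over all sequences of an alignment (with q = 21 for the 20 amino acids plus the gap) and minimized over `h` and `J`; each site's normalization runs over only q states, which is what keeps the problem tractable.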

arXiv.org e-Print Archive

Crossref

HAL Descartes

Hal-Diderot

Multimodal Data Fusion and Machine Learning for Deciphering Protein-Protein Interactions

Author: Talukder Arghamitra
Publication date: 15/11/2023

Protein-protein interactions (PPIs) often underlie important biological processes. Because of the vast number of potential PPIs in living organisms, identifying each PPI experimentally is expensive, if not daunting, so computational methods have been developed in parallel to facilitate the task. Despite the various experimental and computational methods for determining or predicting PPIs, a knowledge gap remains in understanding the three-dimensional interactions in atomic-level detail. This research project aims to leverage existing protein data and emerging machine learning tools to both predict and explain protein-protein interactions. Specifically, using various modalities of protein data, including 1D sequences and 2D structures, several hierarchical recurrent neural network (HRNN) and joint-attention-based models have been developed. These models predict whether two proteins interact (the probability of PPI) and, if they do, how they interact (the probabilities of their residue-residue contacts (RRC)). The PPI prediction from model I (which uses only 1D sequences) has an Area under the Precision-Recall Curve (AUPRC) of 0.738. In a comparative analysis of model I with the state-of-the-art PPI-detect [1], precision, sensitivity and accuracy increased by 7.8%, 9.5% and 6.2%, respectively. For predicting the inter RRC map, a gradual improvement was observed from model I to model II (which uses sequence pre-training and inter RRC maps to fine-tune) and model III (which uses both sequences and intra RRC maps). The best AUPRC on the validation set reached 2.69e-4 (model III), up from 2.49e-4 (model II) and 1.02e-4 (model I). Thus, model III shows a 163% AUPRC improvement over model I and 8.03% over model II; additionally, model II shows a 144% improvement over model I.
The performance evaluations of these models show that the advantage of big data for the 1D modality is not enough to predict inter RRC maps; rather, a slight combination of structure information with sequence, as done in model II, gives much better inter RRC predictions. The full combination of sequences and intra RRC maps shows the best result.
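
The percentage improvements quoted in this abstract follow from the three validation AUPRC values by simple relative-change arithmetic. The short sketch below reproduces that calculation; the function name and the dictionary are illustrative, and only the three AUPRC values come from the abstract.

```python
def relative_improvement(new, baseline):
    """Percentage by which `new` exceeds `baseline`."""
    return (new - baseline) / baseline * 100.0

# Validation-set AUPRC values for the inter RRC maps, as quoted in the abstract.
auprc = {"model I": 1.02e-4, "model II": 2.49e-4, "model III": 2.69e-4}

iii_vs_i = relative_improvement(auprc["model III"], auprc["model I"])
iii_vs_ii = relative_improvement(auprc["model III"], auprc["model II"])
ii_vs_i = relative_improvement(auprc["model II"], auprc["model I"])

print(f"model III vs model I:  {iii_vs_i:.2f}%")
print(f"model III vs model II: {iii_vs_ii:.2f}%")
print(f"model II vs model I:   {ii_vs_i:.2f}%")
```

The results agree with the quoted 163%, 8.03% and 144% once the first and last values are truncated to whole percentages.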

Texas A&M Repository

Evolutionary Covariant Positions within Calmodulin EF-hand Sequences Promote Ligand Binding

Author: Vaidyanathan Uma
Publication date: 01/01/2018

Intracellular calcium signaling is an essential regulatory mechanism through calcium-mediated signal transduction pathways involved in many cell processes, such as exocytosis, motility, apoptosis, excitability, transcription, and muscle contraction. The calcium-binding, ubiquitous, and highly conserved protein calmodulin (CaM) is an important regulator of hundreds of target proteins involved in cellular calcium signaling. CaM comprises two pairs of EF-hand calcium-binding domains, and these structural regions of the protein are highly conserved. Studying the molecular mechanisms underlying the binding of calcium to the EF-hands of CaM is critical to understanding calcium-mediated cellular processes and how improper binding of calcium can lead to various human pathologies. Previous site-specific binding measurements indicate that each of the four EF-hands of CaM has a distinct affinity for calcium. In this study, we have utilized covariance patterns and site-specific mutagenesis to analyze calcium affinity in the two EF-hands of the N-lobe of CaM in order to determine the specific amino acids that are evolutionarily conserved to coordinate calcium. The specific amino acids in CaM that we studied are theorized to coevolve, which means that when a mutation occurs in their protein-coding genes, a compensatory mutation is likely to follow to conserve the structure and function of CaM. Since CaM is a highly conserved protein with a known structure, covariance analyses will help in understanding which amino acid contacts are most important for the coordination of calcium in the EF-hands of CaM and in determining which amino acids are under evolutionary constraint. Covariance algorithms, multiple sequence analyses and accompanying protein structure analyses were used to identify the two high-scoring amino acid pairs in the N-lobe EF-hands: positions 22 and 24 in EF-hand site 1 and positions 58 and 60 in EF-hand site 2.
The amino acids in these locations were mutated and the resulting calcium binding was measured to better understand the effects of the mutations on calcium binding. We have found that both the D24N mutation in site 1 and the D58N mutation in site 2 disrupt binding, likely due to the removal of a necessary aspartate in the binding site. However, the combined D58N and N60D mutations restore binding in site 2 by providing the necessary aspartate in the covariant location. The N60D mutation by itself has little impact on calcium binding in site 2. Therefore, it is evident that evolution conserves at least one aspartate in the covariant positions of the binding site, and the presence of two aspartates in the covariant positions of the binding site has little effect on calcium binding. We are currently studying the covariant positions in site 1, and future work includes structurally analyzing the covariant positions in the C-lobe of CaM and studying covariance patterns of other calcium-binding proteins with EF-hand binding domains.

Biochemistry

Texas ScholarWorks

Molecular Recognition between Cadherins Studied by a Coarse-Grained Model Interacting with a Coevolutionary Potential

Author: G. Tiana
S. Terzoli
Publication venue: 'American Chemical Society (ACS)'
Publication date: 21/05/2020

Studying the conformations involved in the dimerization of cadherins is highly relevant to understanding the development of tissues and its failure, which is associated with tumors and metastases. Experimental techniques, like X-ray crystallography, can usually report only the most stable conformations, missing minority states that could nonetheless be important for the recognition mechanism. Computer simulations could be a valid complement to the experimental approach. However, standard all-atom protein models in explicit solvent are computationally too demanding to search thoroughly the conformational space of multiple chains composed of several hundred amino acids. To reach this goal, we resorted to a coarse-grained model in implicit solvent. The standard problem with this kind of model is to find a realistic potential to describe its interactions. We used coevolutionary information from cadherin alignments, corrected by a statistical potential, to build an interaction potential that is agnostic about the experimental conformations of the protein. Using this model, we explored the conformational space of multichain systems and validated the results by comparing them with experimental data. We identified dimeric conformations that are sequence specific and that can be useful to rationalize the mechanism of recognition between cadherins.

AIR Universita degli studi di Milano

Covariance models for RNA structure prediction

Author: Cuturello Francesca
Publication venue: Trieste
Publication date: 14/10/2019

Many non-coding RNAs are known to play a role in the cell that is directly linked to their structure. Structure prediction based on sequence alone is, however, a challenging task. On the other hand, thanks to the low cost of sequencing technologies, a very large number of homologous sequences are becoming available for many RNA families. In the protein community, the idea has emerged over the last decade of exploiting the covariance of mutations within a family to predict protein structure using the direct-coupling analysis (DCA) method. The application of DCA to RNA systems has been limited so far. We here perform an assessment of the DCA method on 17 riboswitch families, comparing it with the commonly used mutual information analysis. We also compare different flavors of DCA, including mean-field, pseudo-likelihood, and a proposed stochastic procedure (Boltzmann learning) for solving exactly the DCA inverse problem. Boltzmann learning outperforms the other methods in predicting contacts observed in high-resolution crystal structures. In order to enhance the prediction of both RNA secondary and tertiary contacts, we discuss the possibility of including a number of informed priors in the estimation of the couplings for the DCA statistical model. We observe a systematic improvement of the DCA performance by embedding in the prior distribution the pairing probability matrices calculated using secondary-structure prediction algorithms.
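
For readers unfamiliar with the mutual information baseline that this abstract compares DCA against, the sketch below computes the plain mutual information between two alignment columns from their empirical frequencies. It is a textbook illustration with a hypothetical function name, not the thesis's implementation, and it leaves out common refinements such as pseudocounts, sequence reweighting and background corrections.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two equally long alignment columns."""
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    freq_i = Counter(col_i)             # single-column counts
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))  # joint counts of co-occurring symbols
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi

# Toy columns: a perfectly covarying pair (A-U / G-C) and an independent pair.
covarying = mutual_information("AAGG", "UUCC")
independent = mutual_information("AAGG", "UCUC")
```

Mutual information scores each pair of columns in isolation, so it cannot separate direct from indirect correlations; that limitation is what DCA-style global models are meant to address.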

Sissa Digital Library

Interacting networks of resistance, virulence and core machinery genes identified by genome-wide epistasis analysis.

Author: Aurell Erik
Bentley Stephen D
Beres Stephen B
Chewapreecha Claire
Corander Jukka
Croucher Nicholas J
Harris Simon R
Musser James M
Parkhill Julian
Pesonen Maiju
Puranen Santeri
Skwark Marcin J
Turner Paul
Xu Ying Ying
Publication venue: PLoS Genet
Publication date: 25/08/2016

Recent advances in the scale and diversity of population genomic datasets for bacteria now provide the potential for genome-wide patterns of co-evolution to be studied at the resolution of individual bases. Here we describe a new statistical method, genomeDCA, which uses recent advances in computational structural biology to identify the polymorphic loci under the strongest co-evolutionary pressures. We apply genomeDCA to two large population data sets representing the major human pathogens Streptococcus pneumoniae (pneumococcus) and Streptococcus pyogenes (group A Streptococcus). For pneumococcus we identified 5,199 putative epistatic interactions between 1,936 sites. Over three-quarters of the links were between sites within the pbp2x, pbp1a and pbp2b genes, the sequences of which are critical in determining non-susceptibility to beta-lactam antibiotics. A network-based analysis found these genes were also coupled to that encoding dihydrofolate reductase, changes to which underlie trimethoprim resistance. Distinct from these antibiotic resistance genes, a large network component of 384 protein coding sequences encompassed many genes critical in basic cellular functions, while another distinct component included genes associated with virulence. The group A Streptococcus (GAS) data set population represents a clonal population with relatively little genetic variation and a high level of linkage disequilibrium across the genome. Despite this, we were able to pinpoint two RNA pseudouridine synthases, which were each strongly linked to a separate set of loci across the chromosome, representing biologically plausible targets of co-selection. The population genomic analysis method applied here identifies statistically significantly co-evolving locus pairs, potentially arising from fitness selection interdependence reflecting underlying protein-protein interactions, or genes whose product activities contribute to the same phenotype. 
This discovery approach greatly enhances the future potential of epistasis analysis for systems biology, and can complement genome-wide association studies as a means of formulating hypotheses for targeted experimental work

ZENODO

Directory of Open Access Journals

Electronic Archiving System

Spiral - Imperial College Digital Repository

Helsingin yliopiston digitaalinen arkisto

FigShare

Crossref

Dryad Digital Repository (Duke University)

PubMed Central

Aaltodoc Publication Archive

Oxford University Research Archive

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Apollo (Cambridge)

Accurate contact predictions using covariation techniques and machine learning.

Author: Jones DT
Kosciolek T
Publication date: 01/09/2016

Here we present the results of residue-residue contact predictions achieved in CASP11 by the CONSIP2 server, which is based around our MetaPSICOV contact prediction method. On a set of 40 target domains with a median family size of around 40 effective sequences, our server achieved an average top-L/5 long-range contact precision of 27%. The MetaPSICOV method is based on a combination of classical contact prediction features, enhanced with three distinct covariation methods embedded in a two-stage neural network predictor. Some unique features of our approach are (1) the tuning between the classical and covariation features depending on the depth of the input alignment and (2) a hybrid approach to generate the deepest possible multiple-sequence alignments by combining jackHMMer and HHblits. We discuss the CONSIP2 pipeline and our results, and show that where the method underperformed, the major factor was relying on a fixed set of parameters for the initial sequence alignments and not attempting to perform domain splitting as a preprocessing step. Proteins 2015. © 2015 The Authors. Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc.
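
The "top-L/5 long-range contact precision" reported above is the fraction of true contacts among the L/5 highest-scoring long-range residue pairs, where L is the sequence length. The sketch below shows one way to compute it. The function name, the dictionary-based inputs and the default separation threshold of 24 residues are assumptions for illustration (the threshold is a commonly used convention, not stated in the abstract).

```python
def top_l_over_k_precision(scores, true_contacts, length, k=5, min_sep=24):
    """Precision of the top length//k predicted pairs with |i - j| >= min_sep.

    scores: dict mapping (i, j) residue-index pairs to predicted contact scores.
    true_contacts: set of (i, j) pairs with i < j that are contacts in the structure.
    """
    candidates = [
        (s, i, j) for (i, j), s in scores.items() if abs(i - j) >= min_sep
    ]
    candidates.sort(reverse=True)  # highest score first
    n_top = max(1, length // k)
    top = candidates[:n_top]
    if not top:
        return 0.0
    hits = sum(1 for _, i, j in top if (min(i, j), max(i, j)) in true_contacts)
    return hits / len(top)

# Toy example with a short chain and a reduced separation threshold.
toy_scores = {(0, 5): 0.9, (1, 8): 0.8, (2, 3): 0.99, (0, 9): 0.1}
toy_precision = top_l_over_k_precision(
    toy_scores, {(0, 5)}, length=10, k=5, min_sep=3
)
```

In the toy example the pair (2, 3) has the highest score but is excluded as short-range, so the two evaluated pairs are (0, 5) and (1, 8), of which one is a true contact.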

UCL Discovery

PubMed Central

Inter-protein sequence co-evolution predicts known physical interactions in bacterial ribosomes and the trp operon

Author: Feinauer Christoph
Pagnani Andrea
Szurmant Hendrik
Weigt Martin
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2016

Interaction between proteins is a fundamental mechanism that underlies virtually all biological processes. Many important interactions are conserved across a large variety of species. The need to maintain interaction leads to a high degree of co-evolution between residues in the interface between partner proteins. The inference of protein-protein interaction networks from the rapidly growing sequence databases is one of the most formidable tasks in systems biology today. We propose here a novel approach based on the Direct-Coupling Analysis of the co-evolution between inter-protein residue pairs. We use ribosomal and trp operon proteins as test cases: for the small and large ribosomal subunits, our approach predicts protein-interaction partners at true-positive rates of 70% and 90%, respectively, within the first 10 predictions, with areas under the ROC curves of 0.69 and 0.81, respectively, for all predictions. In the trp operon, it assigns the two largest interaction scores to the only two interactions experimentally known. On the level of residue interactions, we show that for the small and large ribosomal subunits our approach predicts interacting residues in the system with true-positive rates of 60% and 85%, respectively, in the first 20 predictions. We use artificial data to show that the performance of our approach depends crucially on the size of the joint multiple sequence alignments, and we analyze how many sequences would be necessary for a perfect prediction if the sequences were sampled from the same model that we use for prediction. Given the performance of our approach on the test data, we speculate that it can be used to detect new interactions, especially in light of the rapid growth of available sequence data.
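
The two evaluation figures quoted in this abstract can be illustrated with a short sketch: the fraction of true interactions among the k highest-scoring predictions, and the area under the ROC curve over all predictions. Reading "true-positive rate within the first N predictions" as the fraction of correct pairs among the top N is an interpretation, and the function names and toy data are hypothetical. The AUC is computed here through its rank interpretation, the probability that a randomly chosen true pair scores higher than a randomly chosen false pair.

```python
def top_k_true_positive_rate(scores, labels, k):
    """Fraction of true interactions among the k highest-scoring predictions."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)[:k]
    if not ranked:
        raise ValueError("need at least one prediction and k >= 1")
    return sum(1 for _, y in ranked if y) / len(ranked)

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy example: four candidate protein pairs with made-up scores.
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_labels = [1, 0, 1, 0]
toy_top2 = top_k_true_positive_rate(toy_scores, toy_labels, k=2)
toy_auc = roc_auc(toy_scores, toy_labels)
```

The pairwise AUC is quadratic in the number of predictions, which is fine for a sketch; a sort-based rank formula would be preferable for large prediction lists.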

arXiv.org e-Print Archive

Archivio istituzionale della Ricerca - Bocconi

Crossref

Directory of Open Access Journals

PubMed Central

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

FigShare

PORTO Publications Open Repository TOrino