256 research outputs found

    Simultaneous identification of specifically interacting paralogs and inter-protein contacts by Direct-Coupling Analysis

    Full text link
    Understanding protein-protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein-protein interactions are urgently needed. Such methods should connect multiple scales from evolutionary conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue-residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has in turn been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being co-localized in operons. Here we show that the Direct-Coupling Analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify inter-protein residue-residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far.Comment: Main Text 19 pages Supp. Inf. 16 page

    From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction

    Get PDF
    Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant 'patterns' of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold.Comment: Supporting information can be downloaded from: http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.100317

    Self-Organizing Genetic Algorithm for Multiple Sequence Alignment

    Get PDF
    Genetic algorithm (GA) used to solve the optimization problem is self-organized and applied to Multiple Sequence Alignment (MSA), an essential process in molecular sequence analysis. This paper presents the first attempt in applying Self-Organizing Genetic Algorithm for MSA. Self-organizing genetic algorithm (SOGA) can be developed with the complete knowledge about the problem and its parameters. In SOGA, values of various parameters are decided based on the problem and fitness value obtained in each generation. The proposed algorithm undergoes a self-organizing crossover operation by selecting an appropriate rate or a point and a self-organizing cyclic mutation for the required number of generations. The advantages of the proposed algorithm are (i) reduce the time requirement for optimizing the parameter values (ii) prevent execution with default values (iii) avoid premature convergence by the cyclic mutation operation. To validate the efficiency, SOGA is applied to MSA, and the resulting alignment is evaluated using the column score (CS). The comparison result shows that the alignment produced by SOGA is better than the widely used tools like Dialign and Multalin. It is also evident that the proposed algorithm can produce optimal or closer-to-optimal alignment compared to tools like ClustalW, Mafft, Dialign and Multalin

    Dissecting the Specificity of Protein-Protein Interaction in Bacterial Two-Component Signaling: Orphans and Crosstalks

    Get PDF
    Predictive understanding of the myriads of signal transduction pathways in a cell is an outstanding challenge of systems biology. Such pathways are primarily mediated by specific but transient protein-protein interactions, which are difficult to study experimentally. In this study, we dissect the specificity of protein-protein interactions governing two-component signaling (TCS) systems ubiquitously used in bacteria. Exploiting the large number of sequenced bacterial genomes and an operon structure which packages many pairs of interacting TCS proteins together, we developed a computational approach to extract a molecular interaction code capturing the preferences of a small but critical number of directly interacting residue pairs. This code is found to reflect physical interaction mechanisms, with the strongest signal coming from charged amino acids. It is used to predict the specificity of TCS interaction: Our results compare favorably to most available experimental results, including the prediction of 7 (out of 8 known) interaction partners of orphan signaling proteins in Caulobacter crescentus. Surveying among the available bacterial genomes, our results suggest 15~25% of the TCS proteins could participate in out-of-operon "crosstalks". Additionally, we predict clusters of crosstalking candidates, expanding from the anecdotally known examples in model organisms. The tools and results presented here can be used to guide experimental studies towards a system-level understanding of two-component signaling.Comment: Supplementary information available on http://www.plosone.org/article/info:doi/10.1371/journal.pone.001972

    The Statistical Mechanics Approach to Protein Sequence Data: Beyond Contact Prediction

    Get PDF
    The recent application of models from inverse statistical mechanics to protein sequence data in has been a large success. In my thesis, I will build upon these models but also use them beyond their original aim of residue contact prediction. This includes the improvement of contact prediction itself by extending the models, the application of the methods in the wider scope of protein interaction networks and the prediction of further biological characteristics from the extracted information

    Multiple Biolgical Sequence Alignment: Scoring Functions, Algorithms, and Evaluations

    Get PDF
    Aligning multiple biological sequences such as protein sequences or DNA/RNA sequences is a fundamental task in bioinformatics and sequence analysis. These alignments may contain invaluable information that scientists need to predict the sequences\u27 structures, determine the evolutionary relationships between them, or discover drug-like compounds that can bind to the sequences. Unfortunately, multiple sequence alignment (MSA) is NP-Complete. In addition, the lack of a reliable scoring method makes it very hard to align the sequences reliably and to evaluate the alignment outcomes. In this dissertation, we have designed a new scoring method for use in multiple sequence alignment. Our scoring method encapsulates stereo-chemical properties of sequence residues and their substitution probabilities into a tree-structure scoring scheme. This new technique provides a reliable scoring scheme with low computational complexity. In addition to the new scoring scheme, we have designed an overlapping sequence clustering algorithm to use in our new three multiple sequence alignment algorithms. One of our alignment algorithms uses a dynamic weighted guidance tree to perform multiple sequence alignment in progressive fashion. The use of dynamic weighted tree allows errors in the early alignment stages to be corrected in the subsequence stages. Other two algorithms utilize sequence knowledge-bases and sequence consistency to produce biological meaningful sequence alignments. To improve the speed of the multiple sequence alignment, we have developed a parallel algorithm that can be deployed on reconfigurable computer models. Analytically, our parallel algorithm is the fastest progressive multiple sequence alignment algorithm

    Covariance models for RNA structure prediction

    Get PDF
    Many non-coding RNAs are known to play a role in the cell directly linked to their structure. Structure prediction based on the sole sequence is however a challenging task. On the other hand, thanks to the low cost of sequencing technologies, a very large number of homologous sequences are becoming available for many RNA families. In the protein community, it has emerged in the last decade the idea of exploiting the covariance of mutations within a family to predict the protein structure using the direct- coupling-analysis (DCA) method. The application of DCA to RNA systems has been limited so far. We here perform an assessment of the DCA method on 17 riboswitch families, comparing it with the commonly used mutual information analysis. We also compare different flavors of DCA, including mean-field, pseudo-likelihood, and a proposed stochastic procedure (Boltzmann learning) for solving exactly the DCA inverse problem. Boltzmann learning outperforms the other methods in predicting contacts observed in high resolution crystal structures. In order to enhance the prediction of both RNA secondary and tertiary contacts, we discuss the possibility to include of a number of informed priors in the estimation of the couplings for the DCA statistical model. We observe a systematic improvement of the DCA performance by embedding in the prior distribution the pairing probability matrices calculated using secondary-structure prediction algorithms
    • …
    corecore