
    Disorder drives cooperative folding in a multidomain protein

    Many human proteins contain intrinsically disordered regions, and disorder in these proteins can be fundamental to their function - for example, facilitating transient but specific binding, promoting allostery, or allowing efficient posttranslational modification. SasG, a multidomain protein implicated in host colonization and biofilm formation in Staphylococcus aureus, provides another example of how disorder can play an important role. Approximately one-half of the domains in the extracellular repetitive region of SasG are intrinsically unfolded in isolation, but these E domains fold in the context of their neighboring folded G5 domains. We have previously shown that the intrinsic disorder of the E domains mediates long-range cooperativity between nonneighboring G5 domains, allowing SasG to form a long, rod-like, mechanically strong structure. Here, we show that the disorder of the E domains, coupled with the remarkable stability of the interdomain interface, results in cooperative folding kinetics across long distances. Formation of a small structural nucleus at one end of the molecule results in rapid structure formation over a distance of 10 nm, which is likely to be important for the maintenance of the structural integrity of SasG. Moreover, if this normal folding nucleus is disrupted by mutation, the interdomain interface is sufficiently stable to drive the folding of adjacent E and G5 domains along a parallel folding pathway, thus maintaining cooperative folding.

    Selective Constraints on Amino Acids Estimated by a Mechanistic Codon Substitution Model with Multiple Nucleotide Changes

    Empirical substitution matrices represent the average tendencies of substitutions over various protein families by sacrificing gene-level resolution. We develop a codon-based model, in which mutational tendencies of codon, a genetic code, and the strength of selective constraints against amino acid replacements can be tailored to a given gene. First, selective constraints averaged over proteins are estimated by maximizing the likelihood of each 1-PAM matrix of empirical amino acid (JTT, WAG, and LG) and codon (KHG) substitution matrices. Then, selective constraints specific to given proteins are approximated as a linear function of those estimated from the empirical substitution matrices. Akaike information criterion (AIC) values indicate that a model allowing multiple nucleotide changes fits the empirical substitution matrices significantly better. Also, the ML estimates of transition-transversion bias obtained from these empirical matrices are not so large as previously estimated. The selective constraints are characteristic of proteins rather than species. However, their relative strengths among amino acid pairs can be approximated not to depend very much on protein families but amino acid pairs, because the present model, in which selective constraints are approximated to be a linear function of those estimated from the JTT/WAG/LG/KHG matrices, can provide a good fit to other empirical substitution matrices including cpREV for chloroplast proteins and mtREV for vertebrate mitochondrial proteins. 
    The present codon-based model, with the ML estimates of selective constraints and with adjustable nucleotide mutation rates, would be useful as a simple substitution model in ML and Bayesian inferences of molecular phylogenetic trees, and enables us to obtain biologically meaningful information at both nucleotide and amino acid levels from codon and protein sequences. Comment: Table 9 in this article includes corrections for errata in the Table 9 published in 10.1371/journal.pone.0017244. Supporting information is attached at the end of the article, and a computer-readable dataset of the ML estimates of selective constraints is available from 10.1371/journal.pone.001724
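
    The AIC comparison mentioned in this abstract can be illustrated with a minimal Python sketch. The log-likelihoods and parameter counts below are made-up placeholders chosen for illustration; they are not values reported in the article.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical numbers for illustration only (not taken from the article).
aic_single = aic(log_likelihood=-1050.0, n_params=10)    # single nucleotide changes only
aic_multiple = aic(log_likelihood=-1000.0, n_params=16)  # multiple changes allowed

# A positive difference means the richer model is preferred despite its extra parameters.
delta_aic = aic_single - aic_multiple
```

    With these placeholder inputs the richer model wins because its gain in log-likelihood outweighs the penalty of six additional parameters.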

    Non-Negative Matrix Factorization for Learning Alignment-Specific Models of Protein Evolution

    Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data are available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties. We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models
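
    The idea of building an alignment-specific model as a weighted sum of basis matrices can be sketched as follows. This is not the authors' procedure: the basis matrices here are toy 2x2 examples rather than learned 20x20 matrices, and the weights are fitted by a generic least-squares multiplicative update with the bases held fixed. The function names `weighted_sum` and `fit_weights` are hypothetical.

```python
def weighted_sum(bases, weights):
    """Combine same-shaped basis matrices into one model matrix."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [[sum(w * b[i][j] for w, b in zip(weights, bases))
             for j in range(cols)] for i in range(rows)]


def fit_weights(bases, target, n_iter=200, eps=1e-12):
    """Fit non-negative weights to a target matrix, keeping the bases fixed.

    Uses the multiplicative update w <- w * (B^T v) / (B^T B w), which keeps
    every weight non-negative when bases and target are non-negative.
    """
    flat_bases = [[x for row in b for x in row] for b in bases]
    flat_target = [x for row in target for x in row]
    k = len(bases)
    btv = [sum(b * t for b, t in zip(fb, flat_target)) for fb in flat_bases]
    btb = [[sum(x * y for x, y in zip(fa, fb)) for fb in flat_bases]
           for fa in flat_bases]
    weights = [1.0] * k
    for _ in range(n_iter):
        denom = [sum(btb[a][c] * weights[c] for c in range(k)) for a in range(k)]
        weights = [w * n / (d + eps) for w, n, d in zip(weights, btv, denom)]
    return weights


# Toy bases (assumed for illustration): "stay" and "swap" patterns.
toy_bases = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
toy_target = weighted_sum(toy_bases, [0.7, 0.3])
toy_weights = fit_weights(toy_bases, toy_target)
```

    Because only the weights are estimated, the number of free parameters equals the number of basis matrices, which is the sense in which such a model needs far less data than a full specialist matrix.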

    PhyloSim - Monte Carlo simulation of sequence evolution in the R statistical computing environment

    Background: The Monte Carlo simulation of sequence evolution is routinely used to assess the performance of phylogenetic inference methods and sequence alignment algorithms. Progress in the field of molecular evolution fuels the need for more realistic and hence more complex simulations, adapted to particular situations, yet current software makes unreasonable assumptions such as homogeneous substitution dynamics or a uniform distribution of indels across the simulated sequences. This calls for an extensible simulation framework written in a high-level functional language, offering new functionality and making it easy to incorporate further complexity. Results: PhyloSim is an extensible framework for the Monte Carlo simulation of sequence evolution, written in R, using the Gillespie algorithm to integrate the actions of many concurrent processes such as substitutions, insertions and deletions. Uniquely among sequence simulation tools, PhyloSim can simulate arbitrarily complex patterns of rate variation and multiple indel processes, and allows for the incorporation of selective constraints on indel events. User-defined complex patterns of mutation and selection can be easily integrated into simulations, allowing PhyloSim to be adapted to specific needs. Conclusions: Close integration with R and the wide range of features implemented offer unmatched flexibility, making it possible to simulate sequence evolution under a wide range of realistic settings. We believe that PhyloSim will be useful to future studies involving simulated alignments.
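
    For readers unfamiliar with the Gillespie algorithm named in this abstract, here is a minimal sketch in Python (not R, and not PhyloSim's API). It simulates only substitutions along a single branch, with per-site rates that do not depend on the current state and a uniform choice among the other nucleotides; both are simplifying assumptions made here. The function name `gillespie_substitutions` is hypothetical.

```python
import random


def gillespie_substitutions(seq, site_rates, t_max, alphabet="ACGT", rng=None):
    """Simulate substitutions on one branch of length t_max.

    Site i mutates at rate site_rates[i]. Waiting times between events are
    exponential with the total rate; the site hit by each event is chosen
    with probability proportional to its rate.
    """
    rng = rng or random.Random()
    seq = list(seq)
    total_rate = sum(site_rates)
    if total_rate <= 0:
        return "".join(seq), 0  # nothing can happen
    t, n_events = 0.0, 0
    while True:
        t += rng.expovariate(total_rate)  # time to the next event
        if t > t_max:
            break
        site = rng.choices(range(len(seq)), weights=site_rates, k=1)[0]
        # Assumed uniform choice among the other letters (Jukes-Cantor-like).
        seq[site] = rng.choice([c for c in alphabet if c != seq[site]])
        n_events += 1
    return "".join(seq), n_events


evolved, events = gillespie_substitutions(
    "ACGTACGT", [1.0] * 8, t_max=0.5, rng=random.Random(1)
)
```

    In a fuller simulator the rates would be recomputed after every event, since insertions, deletions and state-dependent substitution rates all change the total rate.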

    Advantages of a Mechanistic Codon Substitution Model for Evolutionary Analysis of Protein-Coding Sequences

    A mechanistic codon substitution model, in which each codon substitution rate is proportional to the product of a codon mutation rate and the average fixation probability depending on the type of amino acid replacement, has advantages over nucleotide, amino acid, and empirical codon substitution models in evolutionary analysis of protein-coding sequences. It can approximate a wide range of codon substitution processes. If no selection pressure on amino acids is taken into account, it will become equivalent to a nucleotide substitution model. If mutation rates are assumed not to depend on the codon type, then it will become essentially equivalent to an amino acid substitution model. Mutation at the nucleotide level and selection at the amino acid level can be separately evaluated. The present scheme for single nucleotide mutations is equivalent to the general time-reversible model, but multiple nucleotide changes in infinitesimal time are allowed. Selective constraints on the respective types of amino acid replacements are tailored to each gene in a linear function of a given estimate of selective constraints. Their good estimates are those calculated by maximizing the respective likelihoods of empirical amino acid or codon substitution frequency matrices. Akaike and Bayesian information criteria indicate that the present model performs far better than the other substitution models for all five phylogenetic trees of highly-divergent to highly-homologous sequences of chloroplast, mitochondrial, and nuclear genes. It is also shown that multiple nucleotide changes in infinitesimal time are significant in long branches, although they may be caused by compensatory substitutions or other mechanisms. The variation of selective constraint over sites fits the datasets significantly better than variable mutation rates, except for 10 slow-evolving nuclear genes of 10 mammals.
    A critical finding for phylogenetic analysis is that assuming variable mutation rates over sites leads to the overestimation of branch lengths
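
    The rate definition in the first sentence of this abstract can be written compactly. The symbols are notation assumed here for illustration and are not necessarily the article's own: $R_{\mu\nu}$ is the substitution rate from codon $\mu$ to codon $\nu$, $M_{\mu\nu}$ is the codon mutation rate, $a(\mu)$ is the amino acid encoded by $\mu$, and $\bar{F}_{ab}$ is the average fixation probability for replacing amino acid $a$ by $b$.

```latex
R_{\mu\nu} \;\propto\; M_{\mu\nu}\,\bar{F}_{a(\mu)\,a(\nu)}, \qquad \mu \neq \nu
```

    The two limiting cases described in the abstract follow directly: if $\bar{F}$ is the same for every amino acid pair, the rates reduce to $R_{\mu\nu} \propto M_{\mu\nu}$, a purely mutational (nucleotide-level) model; if $M_{\mu\nu}$ does not depend on the codon type, the rates depend only on the amino acid pair, as in an amino acid substitution model.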

    Comparative genomics of the class 4 histone deacetylase family indicates a complex evolutionary history

    BACKGROUND: Histone deacetylases are enzymes that modify core histones and play key roles in transcriptional regulation, chromatin assembly, DNA repair, and recombination in eukaryotes. Three types of related histone deacetylases (classes 1, 2, and 4) are widely found in eukaryotes, and structurally related proteins have also been found in some prokaryotes. Here we focus on the evolutionary history of the class 4 histone deacetylase family. RESULTS: Through sequence similarity searches against sequenced genomes and expressed sequence tag data, we identified members of the class 4 histone deacetylase family in 45 eukaryotic and 37 eubacterial species representative of very distant evolutionary lineages. Multiple phylogenetic analyses indicate that the phylogeny of these proteins is, in many respects, at odds with the phylogeny of the species in which they are found. In addition, the eukaryotic members of the class 4 histone deacetylase family clearly display an anomalous phyletic distribution. CONCLUSION: The unexpected phylogenetic relationships within the class 4 histone deacetylase family and the anomalous phyletic distribution of these proteins within eukaryotes might be explained by two mechanisms: ancient gene duplication followed by differential gene losses and/or horizontal gene transfer. We discuss both possibilities in this report, and suggest that the evolutionary history of the class 4 histone deacetylase family may have been shaped by horizontal gene transfers

    A model-independent approach to infer hierarchical codon substitution dynamics

    Background: Codon substitution constitutes a fundamental process in molecular biology that has been studied extensively. However, prior studies rely on various assumptions, e.g. regarding the relevance of specific biochemical properties, or on conservation criteria for defining substitution groups. Ideally, one would instead like to analyze the substitution process in terms of raw dynamics, independently of underlying system specifics. In this paper we propose a method for doing this by identifying groups of codons and amino acids such that these groups imply closed dynamics. The approach relies on recently developed spectral and agglomerative techniques for identifying hierarchical organization in dynamical systems. Results: We have applied the techniques on an empirically derived Markov model of the codon substitution process that is provided in the literature. Without system specific knowledge of the substitution process, the techniques manage to "blindly" identify multiple levels of dynamics; from amino acid substitutions (via the standard genetic code) to higher order dynamics on the level of amino acid groups. We hypothesize that the acquired groups reflect earlier versions of the genetic code. Conclusions: The results demonstrate the applicability of the techniques. Due to their generality, we believe that they can be used to coarse grain and identify hierarchical organization in a broad range of other biological systems and processes, such as protein interaction networks, genetic regulatory networks and food webs.

    Protein expression profiles indicative for drug resistance of non-small cell lung cancer

    Data obtained from multiple sources indicate that no single mechanism can explain the resistance to chemotherapy exhibited by non-small cell lung carcinomas. The multi-factorial nature of drug resistance implies that the analysis of comprising expression profiles may predict drug resistance with higher accuracy than single gene or protein expression studies. Forty cellular parameters (drug resistance proteins, proliferative, apoptotic, and angiogenic factors, products of proto-oncogenes, and suppressor genes) were evaluated mainly by immunohistochemistry in specimens of primary non-small cell lung carcinoma of 94 patients and compared with the response of the tumours to doxorubicin in vitro. The protein expression profile of non-small cell lung carcinoma was determined by hierarchical cluster analysis and clustered image mapping. The cluster analysis revealed three different resistance profiles. The frequency of each profile was different (77, 14 and 9%, respectively). In the most frequent drug resistance profile, the resistance proteins P-glycoprotein/MDR1 (MDR1, ABCB1), thymidylate-synthetase, glutathione-S-transferase-π, metallothionein, O6-methylguanine-DNA-methyltransferase and major vault protein/lung resistance-related protein were up-regulated. Microvessel density, the angiogenic factor vascular endothelial growth factor and its receptor FLT1, and ECGF1 as well were down-regulated. In addition, the proliferative factors proliferating cell nuclear antigen and cyclin A were reduced compared to the sensitive non-small cell lung carcinoma. In this resistance profile, FOS was up-regulated and NM23 down-regulated. In the second profile, only three resistance proteins were increased (glutathione-S-transferase-π, O6-methylguanine-DNA-methyltransferase, major vault protein/lung resistance-related protein). The angiogenic factors were reduced. 
In the third profile, only five of the resistance factors were increased (MDR1, thymidylate-synthetase, glutathione-S-transferase-π, O6-methylguanine-DNA-methyltransferase, major vault protein/lung resistance-related protein)

    AST: An Automated Sequence-Sampling Method for Improving the Taxonomic Diversity of Gene Phylogenetic Trees

    A challenge in phylogenetic inference of gene trees is how to properly sample a large pool of homologous sequences to derive a good representative subset of sequences. Such a need arises in various applications, e.g. when (1) accuracy-oriented phylogenetic reconstruction methods may not be able to deal with a large pool of sequences due to their high demand in computing resources; (2) applications analyzing a collection of gene trees may prefer to use trees with fewer operational taxonomic units (OTUs), for instance for the detection of horizontal gene transfer events by identifying phylogenetic conflicts; and (3) the pool of available sequences is biased towards extensively studied species. In the past, the creation of subsamples often relied on manual selection. Here we present an Automated sequence-Sampling method for improving the Taxonomic diversity of gene phylogenetic trees, AST, to obtain representative sequences that maximize the taxonomic diversity of the sampled sequences. To demonstrate the effectiveness of AST, we have applied it to four problems, namely, inference of the evolutionary histories of the small ribosomal subunit protein S5 of E. coli, 16S ribosomal RNAs and glycosyl-transferase gene family 8, and a study of ancient horizontal gene transfers from bacteria to plants. Our results show that the resolution of our computational results is almost as good as that of manual inference by domain experts, hence making the tool generally useful to phylogenetic studies by non-phylogeny specialists. The program is available at http://csbl.bmb.uga.edu/~zhouchan/AST.php