31 research outputs found
Selective Constraints on Amino Acids Estimated by a Mechanistic Codon Substitution Model with Multiple Nucleotide Changes
Empirical substitution matrices represent the average tendencies of
substitutions over various protein families by sacrificing gene-level
resolution. We develop a codon-based model in which the mutational tendencies of
codons, the genetic code, and the strength of selective constraints against amino
acid replacements can be tailored to a given gene. First, selective constraints
averaged over proteins are estimated by maximizing the likelihood of each 1-PAM
matrix of empirical amino acid (JTT, WAG, and LG) and codon (KHG) substitution
matrices. Then, selective constraints specific to given proteins are
approximated as a linear function of those estimated from the empirical
substitution matrices.
Akaike information criterion (AIC) values indicate that a model allowing
multiple nucleotide changes fits the empirical substitution matrices
significantly better. Also, the ML estimates of transition-transversion bias
obtained from these empirical matrices are not as large as previously
estimated. The selective constraints are characteristic of proteins rather than
species. However, their relative strengths among amino acid pairs can be
approximated as depending mainly on the amino acid pair rather than on the
protein family, because the present model, in which selective constraints are
approximated as a linear function of those estimated from the JTT/WAG/LG/KHG
matrices, provides a good fit to other empirical substitution matrices, including
cpREV for chloroplast proteins and mtREV for vertebrate mitochondrial proteins.
The present codon-based model with the ML estimates of selective constraints
and with adjustable nucleotide mutation rates would be useful as a simple
substitution model in ML and Bayesian inferences of molecular phylogenetic
trees, and enables us to obtain biologically meaningful information at both
nucleotide and amino acid levels from codon and protein sequences.
Comment: Table 9 in this article includes corrections for errata in the Table
9 published in 10.1371/journal.pone.0017244. Supporting information is
attached at the end of the article, and a computer-readable dataset of the ML
estimates of selective constraints is available from
10.1371/journal.pone.001724
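The model comparison by AIC mentioned in the abstract above can be illustrated with a minimal sketch. The formula AIC = 2k - 2 ln L is the standard definition; the log-likelihoods and parameter counts below are hypothetical placeholders, not values from the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical numbers, for illustration only (not taken from the paper).
# A model restricted to single nucleotide changes, with fewer parameters:
aic_single = aic(log_likelihood=-1250.0, n_params=10)
# A model allowing multiple nucleotide changes, with more parameters:
aic_multiple = aic(log_likelihood=-1200.0, n_params=16)

# A positive difference favors the multiple-change model despite its
# extra parameters.
delta_aic = aic_single - aic_multiple
print(aic_single, aic_multiple, delta_aic)
```

The penalty term 2k is what keeps a richer model from winning on likelihood alone: here the extra six parameters cost 12 AIC units, which the hypothetical likelihood gain more than offsets.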
The extraordinary evolutionary history of the reticuloendotheliosis viruses
The reticuloendotheliosis viruses (REVs) comprise several closely related amphotropic retroviruses isolated from birds. These viruses exhibit several highly unusual characteristics that have not so far been adequately explained, including their extremely close relationship to mammalian retroviruses, and their presence as endogenous sequences within the genomes of certain large DNA viruses. We present evidence for an iatrogenic origin of REVs that accounts for these phenomena. Firstly, we identify endogenous retroviral fossils in mammalian genomes that share a unique recombinant structure with REVs, unequivocally demonstrating that REVs derive directly from mammalian retroviruses. Secondly, through sequencing of archived REV isolates, we confirm that contaminated Plasmodium lophurae stocks have been the source of multiple REV outbreaks in experimentally infected birds. Finally, we show that both phylogenetic and historical evidence support a scenario wherein REVs originated as mammalian retroviruses that were accidentally introduced into avian hosts in the late 1930s, during experimental studies of P. lophurae, and subsequently integrated into the fowlpox virus (FWPV) and gallid herpesvirus type 2 (GHV-2) genomes, generating recombinant DNA viruses that now circulate in wild birds and poultry. Our findings provide a novel perspective on the origin and evolution of REV, and indicate that horizontal gene transfer between virus families can expand the impact of iatrogenic transmission events.
HIV-Specific Probabilistic Models of Protein Evolution
Comparative sequence analyses, including such fundamental bioinformatics techniques as similarity searching, sequence alignment and phylogenetic inference, have become a mainstay for researchers studying type 1 Human Immunodeficiency Virus (HIV-1) genome structure and evolution. Implicit in comparative analyses is an underlying model of evolution, and the chosen model can significantly affect the results. In general, evolutionary models describe the probabilities of replacing one amino acid character with another over a period of time. Most widely used evolutionary models for protein sequences have been derived from curated alignments of hundreds of proteins, usually based on mammalian genomes. It is unclear to what extent these empirical models are generalizable to a very different organism, such as HIV-1, the most extensively sequenced organism in existence. We developed a maximum likelihood model fitting procedure, applied it to a collection of HIV-1 alignments sampled from different viral genes, and inferred two empirical substitution models, suitable for describing between- and within-host evolution. Our procedure pools the information from multiple sequence alignments, and the provided software implementation can be run efficiently in parallel on a computer cluster. We describe how the inferred substitution models can be used to generate scoring matrices suitable for alignment and similarity searches. Our models had a consistently superior fit relative to the best existing models and to parameter-rich data-driven models when benchmarked on independent HIV-1 alignments, demonstrating evolutionary biases in amino-acid substitution that are unique to HIV, and that are not captured by the existing models. The scoring matrices derived from the models showed a marked difference from common amino-acid scoring matrices. The use of an appropriate evolutionary model recovered a known viral transmission history, whereas a poorly chosen model introduced phylogenetic error.
We argue that our model derivation procedure is immediately applicable to other organisms with extensive sequence data available, such as Hepatitis C and Influenza A viruses.
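One common way to turn a substitution model into an alignment scoring matrix is the log-odds construction, sketched below. This is a generic illustration and not necessarily the procedure used by the authors; the function name `log_odds_scores`, the two-letter toy alphabet, and the probabilities are all assumptions made for the example.

```python
import math


def log_odds_scores(transition_probs, equilibrium_freqs, scale=2.0):
    """Generic log-odds scores: S[i][j] = round(scale * log2(P[i][j] / pi[j])).

    transition_probs[i][j] is the probability of observing state j in place
    of state i after some fixed evolutionary distance; equilibrium_freqs[j]
    is the background frequency of state j.
    """
    n = len(equilibrium_freqs)
    return [
        [
            round(scale * math.log2(transition_probs[i][j] / equilibrium_freqs[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy two-state example with made-up numbers (a real amino acid model
# would use a 20 x 20 matrix derived from the fitted rate matrix).
pi = [0.5, 0.5]
P = [[0.9, 0.1],
     [0.1, 0.9]]
scores = log_odds_scores(P, pi)
print(scores)
```

Positive scores mark replacements seen more often than chance under the model, negative scores mark replacements seen less often, which is why a model fitted to one organism can yield scores that differ markedly from general-purpose matrices.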
An Endogenous Foamy-like Viral Element in the Coelacanth Genome
Little is known about the origin and long-term evolutionary mode of retroviruses. Retroviruses can integrate into their hosts' genomes, providing a molecular fossil record for studying their deep history. Here we report the discovery of an endogenous foamy virus-like element, which we designate "coelacanth endogenous foamy-like virus" (CoeEFV), within the genome of the coelacanth (Latimeria chalumnae). Phylogenetic analyses place CoeEFV basal to all known foamy viruses, strongly suggesting an ancient ocean origin of this major retroviral lineage, which had previously been known to infect only land mammals. The discovery of CoeEFV reveals the presence of foamy-like viruses in species outside the Mammalia. We show that foamy-like viruses have likely codiverged with their vertebrate hosts for more than 407 million years and underwent an evolutionary transition from water to land with their vertebrate hosts. These findings suggest an ancient marine origin of retroviruses and have important implications in understanding foamy virus biology.
Non-Negative Matrix Factorization for Learning Alignment-Specific Models of Protein Evolution
Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data are available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties. We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models
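The final step described in the abstract above, an alignment-specific model obtained as a weighted sum of basis matrices, can be sketched as follows. The factorization that learns the basis is not shown; the function name `combine_basis`, the 2 x 2 toy matrices, and the weights are hypothetical and chosen only to make the arithmetic visible.

```python
def combine_basis(weights, basis_matrices):
    """Return sum_k weights[k] * basis_matrices[k], element by element.

    Weights must be non-negative, mirroring the non-negativity constraint
    of NNMF. All basis matrices are assumed square and of the same size.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if len(weights) != len(basis_matrices):
        raise ValueError("need exactly one weight per basis matrix")
    n = len(basis_matrices[0])
    return [
        [
            sum(w * b[i][j] for w, b in zip(weights, basis_matrices))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy 2 x 2 "exchangeability" bases (real ones would be 20 x 20).
basis = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 2.0], [3.0, 0.0]],
]
# Hypothetical alignment-specific weights; only these would be fitted
# on the smaller dataset.
model = combine_basis([0.5, 2.0], basis)
print(model)
```

With K basis matrices, only K weights are estimated per alignment, which is the source of the parameter savings over a full specialist model.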
Changing Selective Pressure during Antigenic Changes in Human Influenza H3
The rapid evolution of influenza viruses presents difficulties in maintaining the optimal efficiency of vaccines. Amino acid substitutions result in antigenic drift, a process whereby antisera raised in response to one virus have reduced effectiveness against future viruses. Interestingly, while amino acid substitutions occur at a relatively constant rate, the antigenic properties of H3 move in a discontinuous, step-wise manner. It is not clear why this punctuated evolution occurs, whether this represents simply the fact that some substitutions affect these properties more than others, or if this is indicative of a changing relationship between the virus and the host. In addition, the role of changing glycosylation of the haemagglutinin in these shifts in antigenic properties is unknown. We analysed the antigenic drift of HA1 from human influenza H3 using a model of sequence change that allows for variation in selective pressure at different locations in the sequence, as well as at different parts of the phylogenetic tree. We detect significant changes in selective pressure that occur preferentially during major changes in antigenic properties. Despite the large increase in glycosylation during the past 40 years, changes in glycosylation did not correlate either with changes in antigenic properties or with significantly more rapid changes in selective pressure. The locations that undergo changes in selective pressure are largely in places undergoing adaptive evolution, in antigenic locations, and in locations or near locations undergoing substitutions that characterise the change in antigenicity of the virus. Our results suggest that the relationship of the virus to the host changes with time, with the shifts in antigenic properties representing changes in this relationship. This suggests that the virus and host immune system are evolving different methods to counter each other. 
While we are able to characterise the rapid increase in glycosylation of the haemagglutinin over time in human influenza H3, an increase not present in influenza in birds, this increase seems unrelated to the observed changes in antigenic properties.
Parallel Germline Infiltration of a Lentivirus in Two Malagasy Lemurs
Retroviruses normally infect the somatic cells of their host and are transmitted horizontally, i.e., in an exogenous way. Occasionally, however, some retroviruses can also infect and integrate into the genome of germ cells, which may allow for their vertical inheritance and fixation in a given species; a process known as endogenization. Lentiviruses, a group of mammalian retroviruses that includes HIV, are known to infect primates, ruminants, horses, and cats. Unlike many other retroviruses, these viruses have not been demonstrably successful at germline infiltration. Here, we report on the discovery of endogenous lentiviral insertions in seven species of Malagasy lemurs from two different genera, Cheirogaleus and Microcebus. Combining molecular clock analyses and cross-species screening of orthologous insertions, we show that the presence of this endogenous lentivirus in six species of Microcebus is the result of one endogenization event that occurred about 4.2 million years ago. In addition, we demonstrate that this lentivirus independently infiltrated the germline of Cheirogaleus and that the two endogenization events occurred quasi-simultaneously. Using multiple proviral copies, we derive and characterize an apparently full length and intact consensus for this lentivirus. These results provide evidence that lentiviruses have repeatedly infiltrated the germline of prosimian species and that primates have been exposed to lentiviruses for a much longer time than what can be inferred based on sequence comparison of circulating lentiviruses. The study sets the stage for an unprecedented opportunity to reconstruct an ancestral primate lentivirus and thereby advance our knowledge of host-virus interactions.
Fast and Robust Characterization of Time-Heterogeneous Sequence Evolutionary Processes Using Substitution Mapping
Genes and genomes do not evolve similarly in all branches of the tree of life. Detecting and characterizing the heterogeneity in time, and between lineages, of the nucleotide (or amino acid) substitution process is an important goal of current molecular evolutionary research. This task is typically achieved through the use of non-homogeneous models of sequence evolution, which, being highly parameterized and computationally demanding, are not appropriate for large-scale analyses. Here we investigate an alternative methodological option based on probabilistic substitution mapping. The idea is to first reconstruct the substitutional history of each site of an alignment under a homogeneous model of sequence evolution, then to characterize variations in the substitution process across lineages based on substitution counts. Using simulated and published datasets, we demonstrate that probabilistic substitution mapping is robust in that it typically provides accurate reconstruction of sequence ancestry even when the true process is heterogeneous, but a homogeneous model is adopted. Consequently, we show that the new approach is essentially as efficient as, and far faster than (up to 25 000 times), existing methods, thus paving the way for a systematic survey of substitution process heterogeneity across genes and lineages.
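The second stage of the approach above, characterizing lineages by substitution counts, can be sketched in a simplified form. The sketch assumes a single fixed ancestral reconstruction per node, whereas probabilistic mapping would weight the possible histories by their probabilities; the tree, node names, and sequences are made up for illustration.

```python
from collections import Counter


def count_substitutions(edges, states):
    """Count substitution types on each branch of a tree.

    edges  : list of (parent, child) node names
    states : dict mapping node name -> sequence (observed or reconstructed)
    Returns a dict mapping each edge to a Counter of (from, to) pairs.
    """
    counts = {}
    for parent, child in edges:
        branch = Counter()
        for a, b in zip(states[parent], states[child]):
            if a != b:
                branch[(a, b)] += 1
        counts[(parent, child)] = branch
    return counts


# Hypothetical four-site alignment with reconstructed internal nodes.
edges = [("root", "n1"), ("root", "C"), ("n1", "A"), ("n1", "B")]
states = {
    "root": "ACGT",
    "n1": "ACGT",
    "A": "GCGT",
    "B": "ACAT",
    "C": "ACGC",
}
counts = count_substitutions(edges, states)
for edge, branch in counts.items():
    print(edge, dict(branch))
```

Comparing the per-branch counters (for example, the share of changes toward G or C) is what lets lineage-specific differences show up without fitting a non-homogeneous model.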
Advantages of a Mechanistic Codon Substitution Model for Evolutionary Analysis of Protein-Coding Sequences
A mechanistic codon substitution model, in which each codon substitution rate is proportional to the product of a codon mutation rate and the average fixation probability depending on the type of amino acid replacement, has advantages over nucleotide, amino acid, and empirical codon substitution models in evolutionary analysis of protein-coding sequences. It can approximate a wide range of codon substitution processes. If no selection pressure on amino acids is taken into account, it will become equivalent to a nucleotide substitution model. If mutation rates are assumed not to depend on the codon type, then it will become essentially equivalent to an amino acid substitution model. Mutation at the nucleotide level and selection at the amino acid level can be separately evaluated. The present scheme for single nucleotide mutations is equivalent to the general time-reversible model, but multiple nucleotide changes in infinitesimal time are allowed. Selective constraints on the respective types of amino acid replacements are tailored to each gene as a linear function of a given estimate of selective constraints. Good estimates of these constraints are those calculated by maximizing the respective likelihoods of empirical amino acid or codon substitution frequency matrices. Akaike and Bayesian information criteria indicate that the present model performs far better than the other substitution models for all five phylogenetic trees of highly-divergent to highly-homologous sequences of chloroplast, mitochondrial, and nuclear genes. It is also shown that multiple nucleotide changes in infinitesimal time are significant in long branches, although they may be caused by compensatory substitutions or other mechanisms. The variation of selective constraint over sites fits the datasets significantly better than variable mutation rates, except for 10 slow-evolving nuclear genes of 10 mammals.
A critical finding for phylogenetic analysis is that assuming variable mutation rates over sites leads to the overestimation of branch lengths.
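The central idea of the abstract above, a codon substitution rate proportional to a mutation rate times a fixation probability that depends on the amino acid replacement, can be sketched as below. The multiplicative treatment of multi-nucleotide mutations, the neutral treatment of synonymous changes, and all numeric values are assumptions made for illustration and are not the paper's parameterization.

```python
# Small excerpt of the standard genetic code, enough for the demo.
GENETIC_CODE = {"AAA": "K", "AAG": "K", "AAC": "N", "GAA": "E"}


def mutation_rate(codon_from, codon_to, nuc_rate):
    """Hypothetical codon mutation rate: product of the nucleotide rates
    over the positions that differ (an assumption, not the paper's scheme)."""
    rate = 1.0
    for a, b in zip(codon_from, codon_to):
        if a != b:
            rate *= nuc_rate[(a, b)]
    return rate


def substitution_rate(codon_from, codon_to, nuc_rate, fixation):
    """Rate = mutation rate x fixation probability of the replacement.

    Synonymous changes are treated as neutral (fixation factor 1.0),
    which is an assumption of this sketch.
    """
    aa_from = GENETIC_CODE[codon_from]
    aa_to = GENETIC_CODE[codon_to]
    if aa_from == aa_to:
        factor = 1.0
    else:
        factor = fixation[frozenset((aa_from, aa_to))]
    return mutation_rate(codon_from, codon_to, nuc_rate) * factor


# Made-up nucleotide rates (transition A<->G faster than transversion A<->C).
nuc_rate = {("A", "G"): 2.0, ("G", "A"): 2.0, ("A", "C"): 0.5, ("C", "A"): 0.5}
# Made-up fixation factors per amino acid pair.
fixation = {frozenset(("K", "N")): 0.4, frozenset(("K", "E")): 0.1}

synonymous = substitution_rate("AAA", "AAG", nuc_rate, fixation)  # K -> K
to_asn = substitution_rate("AAA", "AAC", nuc_rate, fixation)      # K -> N
to_glu = substitution_rate("AAA", "GAA", nuc_rate, fixation)      # K -> E
print(synonymous, to_asn, to_glu)
```

Setting every fixation factor to 1.0 leaves only the nucleotide-level rates, which mirrors the abstract's remark that the model reduces to a nucleotide substitution model when selection on amino acids is ignored.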
Site-specific time heterogeneity of the substitution process and its impact on phylogenetic inference
Background: Model violations constitute the major limitation in inferring accurate phylogenies. Characterizing properties of the data that are not being correctly handled by current models is therefore of prime importance. One of the properties of protein evolution is the variation of the relative rate of substitutions across sites and over time, the latter being the phenomenon called heterotachy. Its effect on phylogenetic inference has recently received considerable attention, which led to the development of new models of sequence evolution. However, thus far the focus has been on the quantitative heterogeneity of the evolutionary process, thereby overlooking more qualitative variations.
Results: We studied the importance of variation of the site-specific amino-acid substitution process over time and its possible impact on phylogenetic inference. We used the CAT model to define an infinite mixture of substitution processes characterized by equilibrium frequencies over the twenty amino acids, a useful proxy for qualitatively estimating the evolutionary process. Using two large datasets, we show that qualitative changes in site-specific substitution properties over time occurred significantly. To test whether this unaccounted qualitative variation can lead to an erroneous phylogenetic tree, we analyzed a concatenation of mitochondrial proteins in which Cnidaria and Porifera were erroneously grouped. The progressive removal of the sites with the most heterogeneous CAT profiles across clades led to the recovery of the monophyly of Eumetazoa (Cnidaria+Bilateria), suggesting that this heterogeneity can negatively influence phylogenetic inference.
Conclusion: The time-heterogeneity of the amino-acid replacement process is therefore an important evolutionary aspect that should be incorporated in future models of sequence change.