5,432 research outputs found
Ambiguity Coding Allows Accurate Inference of Evolutionary Parameters from Alignments in an Aggregated State-Space
How can we best learn the history of a protein’s evolution? Ideally, a model of sequence evolution should capture both the process that generates genetic variation and the functional constraints determining which changes are fixed. However, in practical terms the most suitable approach may simply be the one that combines the convenience of easily available input data with the ability to return useful parameter estimates. For example, we might be interested in a measure of the strength of selection (typically obtained using a codon model) or an ancestral structure (obtained using structural modelling based on inferred amino acid sequence and side chain configuration).
But what if data in the relevant state-space are not readily available? We show that it is possible to obtain accurate estimates of the outputs of interest using an established method for handling missing data. Encoding observed characters in an alignment as ambiguous representations of characters in a larger state-space allows the application of models with the desired features to data that lack the resolution that is normally required. This strategy is viable because the evolutionary path taken through the observed space contains information about states that were likely visited in the “unseen” state-space. To illustrate this, we consider two examples with amino acid sequences as input.
We show that ω, a parameter describing the relative strength of selection on non-synonymous and synonymous changes, can be estimated in an unbiased manner using an adapted version of a standard 61-state codon model. Using simulated and empirical data, we find that ancestral amino acid side chain configuration can be inferred by applying a 55-state empirical model to 20-state amino acid data. Where feasible, combining inputs from both ambiguity-coded and fully resolved data improves accuracy. Adding structural information to as few as 12.5% of the sequences in an amino acid alignment results in remarkable ancestral reconstruction performance compared to a benchmark that considers the full rotamer state information. These examples show that our methods permit the recovery of evolutionary information from sequences where it has previously been inaccessible
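The ambiguity coding described in the abstract above can be illustrated with a minimal sketch. It assumes the standard genetic code and a 61-state sense-codon space. Each observed amino acid becomes a 0/1 tip vector marking every codon that could encode it, which is the usual way ambiguous characters enter a pruning-style likelihood calculation. The names `CODON_TABLE`, `SENSE_CODONS` and `tip_vector` are hypothetical and are not taken from the authors' software.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its amino acid, then keep the 61 sense codons.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE_CODONS = [c for c, aa in CODON_TABLE.items() if aa != "*"]

# Assumption: these symbols denote a gap or a fully unknown residue.
UNKNOWN = {"-", "?", "X"}


def tip_vector(amino_acid):
    """Return a 61-element 0/1 vector over the sense codons.

    An entry is 1.0 when the codon is compatible with the observed amino
    acid, so the observation is treated as an ambiguous codon state.
    """
    if amino_acid in UNKNOWN:
        return [1.0] * len(SENSE_CODONS)  # every codon is compatible
    vec = [1.0 if CODON_TABLE[c] == amino_acid else 0.0 for c in SENSE_CODONS]
    if not any(vec):
        raise ValueError(f"unrecognised amino acid: {amino_acid!r}")
    return vec


if __name__ == "__main__":
    for aa in "MLW-":
        print(aa, int(sum(tip_vector(aa))), "compatible codons")
```

Vectors of this kind would replace the usual one-hot codon observations at the tips of the tree. The substitution model and the likelihood recursion themselves are unchanged.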
Phylogenetic influence of complex, evolutionary models: a Bayesian approach
Molecular evolution recovers the history of living species by comparing genetic information, exploring genome structure and function from an evolutionary perspective. Here we infer substitution rates and ancestral reconstructions to better understand mutation responses to some known biochemical phenomena. Mutation processes are commonly inferred using parsimony, maximum likelihood and Bayesian methods. Parsimony is not explicitly model-based, and is statistically biased due to unrealistic assumptions. Model-based maximum likelihood approaches become computationally inefficient when analyzing large or high-dimensional datasets, leaving little opportunity to incorporate complex evolutionary models. We implemented a posterior probability (Bayesian) approach that evaluates evolutionary models, applying it to primate mitochondrial genomes. The species nucleotide sequence data were augmented with ancestral states at the internal nodes of the phylogeny. We simplified probability calculations for substitution events along the branches by assuming that only up to one or two substitution events occurred per branch per site. These conditional pathway calculations introduce very little bias into the inferred reconstructions, while increasing the feasibility of incorporating complex evolutionary models with higher dimensions. Compositional bias tests, including functional predictions of ancestral tRNAs, show that ancestral sequences from the Bayesian approach are more biologically realistic than those reconstructed by maximum likelihood. To explore further model complexity, we allowed substitution rates to vary among sites by having a different model at each site. With a strand-symmetric model as the base model, asymmetric substitution probabilities for specific substitution types were varied among sites. This model would not be feasible with standard matrix exponentiation methods, particularly maximum likelihood.
For A-->G and C-->T substitutions we observed almost linear and almost asymptotic responses, respectively (with some regional deviations). Note that the HMM models had no a priori response built into them. Observed responses fitted predictions from earlier gene-by-gene likelihood analyses. For A-->G substitutions, deviations from the expected linear response correlated positively with the loop-forming propensity of the corresponding site in the mRNA secondary structure. In the COI region, C-->T substitutions have a prominent dip, suggesting protection against mutations. The C-->T substitution responses differed significantly between primate sub-groups defined based on their single genome A-->G responses
The Influence of Structural Constraints on Protein Evolution
Few mathematical models of sequence evolution incorporate parameters describing protein structure, despite its high conservation, essential functional role and the increasing availability of structural data. The primary goal of my PhD project was to create a structurally aware amino acid substitution model in which proteins are represented using an expanded alphabet that relays both amino acid identity and structural information. Each character in this alphabet specifies an amino acid as well as information about the rotamer configuration of its side chain: the discrete geometric pattern of permitted side chain atomic positions, as defined by the dihedral angles between covalently linked atoms. I generated a 55-state “Dayhoff-like” substitution model (RAM55) by assigning rotamer states in 79,558 structures (∼50% of all PDBe entries) and identifying substitutions between closely related sequences. RAM55’s rotamer state exchange patterns clearly show that the evolutionary properties of amino acids depend strongly upon side chain geometry. Exploiting knowledge of these patterns assists in phylogenetic analyses: I show that RAM55 performs as well as or better than traditional 20-state models on simulated and empirical data for divergence time estimation, tree inference, side chain configuration prediction and ancestral sequence reconstruction. Further, encoding observed characters in an alignment as ambiguous representations of characters in a larger state-space allows the application of RAM55 to 20-state amino acid data for which structures are not known.
Adding structural information to as few as 12.5% of the sequences in an amino acid alignment results in excellent ancestral reconstruction performance compared to a benchmark that considers the full rotamer state information. This strategy significantly expands the applicability of RAM55 to real-world scenarios where structure might only be available for some of the sequences of interest. Thus, not only is rotamer configuration a valuable source of information for phylogenetic studies, but modelling the concomitant evolution of sequence and structure may have important implications for understanding protein folding and function
The inference of gene trees with species trees
Molecular phylogeny has focused mainly on improving models for the
reconstruction of gene trees based on sequence alignments. Yet, most
phylogeneticists seek to reveal the history of species. Although the histories
of genes and species are tightly linked, they are seldom identical, because
genes duplicate, are lost or horizontally transferred, and because alleles can
co-exist in populations for periods that may span several speciation events.
Building models describing the relationship between gene and species trees can
thus improve the reconstruction of gene trees when a species tree is known, and
vice-versa. Several approaches have been proposed to solve the problem in one
direction or the other, but in general neither gene trees nor species trees are
known. Only a few studies have attempted to jointly infer gene trees and
species trees. In this article we review the various models that have been used
to describe the relationship between gene trees and species trees. These models
account for gene duplication and loss, transfer or incomplete lineage sorting.
Some of them consider several types of events together, but none exists
currently that considers the full repertoire of processes that generate gene
trees along the species tree. Simulations as well as empirical studies on
genomic data show that combining gene tree-species tree models with models of
sequence evolution improves gene tree reconstruction. In turn, these better
gene trees provide a better basis for studying genome evolution or
reconstructing ancestral chromosomes and ancestral gene sequences. We predict
that gene tree-species tree methods that can deal with genomic data sets will
be instrumental to advancing our understanding of genomic evolution.
Comment: Review article in relation to the "Mathematical and Computational
Evolutionary Biology" conference, Montpellier, 201
Combining genomics and epidemiology to track mumps virus transmission in the United States.
Unusually large outbreaks of mumps across the United States in 2016 and 2017 raised questions about the extent of mumps circulation and the relationship between these and prior outbreaks. We paired epidemiological data from public health investigations with analysis of mumps virus whole genome sequences from 201 infected individuals, focusing on Massachusetts university communities. Our analysis suggests continuous, undetected circulation of mumps locally and nationally, including multiple independent introductions into Massachusetts and into individual communities. Despite the presence of these multiple mumps virus lineages, the genomic data show that one lineage has dominated in the US since at least 2006. Widespread transmission was surprising given high vaccination rates, but we found no genetic evidence that variants arising during this outbreak contributed to vaccine escape. Viral genomic data allowed us to reconstruct mumps transmission links not evident from epidemiological data or standard single-gene surveillance efforts and also revealed connections between apparently unrelated mumps outbreaks
Evolution of substrate specificity in a recipient's enzyme following horizontal gene transfer
Despite the prominent role of horizontal gene transfer (HGT) in shaping bacterial metabolism, little is known about the impact of HGT on the evolution of enzyme function. Specifically, what is the influence of a recently acquired gene on the function of an existing gene? For example, certain members of the genus Corynebacterium have horizontally acquired a whole L-tryptophan biosynthetic operon, whereas in certain closely related actinobacteria, for example, Mycobacterium, the trpF gene is missing. In Mycobacterium, the function of the trpF gene is performed by a dual-substrate (βα)8 phosphoribosyl isomerase (priA gene) also involved in L-histidine (hisA gene) biosynthesis. We investigated the effect of a HGT-acquired TrpF enzyme upon PriA’s substrate specificity in Corynebacterium through comparative genomics and phylogenetic reconstructions. After comprehensive in vivo and enzyme kinetic analyses of selected PriA homologs, a novel (βα)8 isomerase subfamily with a specialized function in L-histidine biosynthesis, termed subHisA, was confirmed. X-ray crystallography was used to reveal active-site mutations in subHisA important for narrowing of substrate specificity, which when mutated to the naturally occurring amino acid in PriA led to gain of function. Moreover, in silico molecular dynamic analyses demonstrated that the narrowing of substrate specificity of subHisA is concomitant with loss of ancestral protein conformational states. Our results show the importance of HGT in shaping enzyme evolution and metabolism
Phylogenetic systematics, biogeography, and evolutionary ecology of the true crocodiles (Eusuchia: Crocodylidae: Crocodylus)
Modern crocodylian systematics has been dominated by investigations of higher-level relationships aimed at resolving the disparity between morphological and molecular data, especially regarding the true gharial (Gavialis). Consequently, no studies to date have provided adequate resolution of the interspecific relationships within the most broadly distributed, ecologically diverse, and species-rich crocodylian genus, Crocodylus. In this study, Bayesian and ML partitioned phylogenetic analyses were performed on a DNA sequence dataset of 7,282 base pairs representing four mitochondrial regions, nine nuclear loci, and all 23 crocodylian species. The analyses were performed on a suite of partitioning strategies to investigate the modeling effects of partition choice in phylogenetic analyses. Bayesian lognormal relaxed-clock dating analyses also were performed on the dataset, calibrated from the rich crocodylian fossil record. A robust interspecific phylogeny of Crocodylus is reconstructed, and subsequently used in ML and Bayesian ancestral character-state reconstructions to test hypotheses about the biogeographic history and evolutionary ecology of the genus. The results demonstrate that the genus originated from an ancestor in the tropics of the Late Miocene Indo-Pacific, and rapidly radiated and dispersed around the globe during a period marked by mass extinctions of fellow crocodylians. The results also prove paraphyly of Crocodylus, and reveal more diversity within the genus than recognized by current taxonomy. This study also establishes a baseline for assessing the utility of various model selection criteria for objectively selecting the optimal partitioning strategy within ML and Bayesian frameworks. The results indicate that gene identity is a poor method of partition choice. Furthermore, the results of the ancestral character-state reconstructions suggest ML and Bayesian methods produce more realistic and reliable results than parsimony
A Comparison of Phylogenetic Network Methods Using Computer Simulation
Background: We present a series of simulation studies that explore the relative performance of several phylogenetic network approaches (statistical parsimony, split decomposition, union of maximum parsimony trees, neighbor-net, simulated history recombination upper bound, median-joining, reduced median-joining and minimum spanning network) compared to standard tree approaches (neighbor-joining and maximum parsimony) in the presence and absence of recombination. Principal Findings: In the absence of recombination, all methods recovered the correct topology and branch lengths nearly all of the time when the substitution rate was low, except for minimum spanning networks, which did considerably worse. At a higher substitution rate, maximum parsimony and union of maximum parsimony trees were the most accurate. With recombination, the ability to infer the correct topology was halved for all methods and no method could accurately estimate branch lengths. Conclusions: Our results highlight the need for more accurate phylogenetic network methods and the importance of detecting and accounting for recombination in phylogenetic studies. Furthermore, we provide useful information for choosing a network algorithm and a framework in which to evaluate improvements to existing methods and nove
A Methodological Framework for the Reconstruction of Contiguous Regions of Ancestral Genomes and Its Application to Mammalian Genomes
The reconstruction of ancestral genome architectures and gene orders from homologies between extant species is a long-standing problem, considered by both cytogeneticists and bioinformaticians. The two approaches were recently compared and discussed in a series of papers, sometimes with diverging points of view regarding their performance. We describe a general methodological framework for reconstructing ancestral genome segments from conserved syntenies in extant genomes. We show that this problem, from a computational point of view, is naturally related to physical mapping of chromosomes and benefits from using combinatorial tools developed in this scope. We develop this framework into a new reconstruction method considering conserved gene clusters with similar gene content, mimicking principles used in most cytogenetic studies, although on a different kind of data. We implement and apply it to datasets of mammalian genomes. We perform intensive theoretical and experimental comparisons with other bioinformatics methods for reconstructing ancestral genome segments. We show that the method that we propose is stable and reliable: it gives convergent results using several kinds of data at different levels of resolution, and all predicted ancestral regions are well supported. The results ultimately come very close to those of cytogenetic studies. This suggests that the comparison of methods for ancestral genome reconstruction should include the algorithmic aspects of the methods as well as the disciplinary differences in data acquisition