Search CORE

Benchmarking multi-rate codon models

Author: Delport Wayne
Gravenor Mike B.
Muse Spencer V.
Pond Sergei Kosakovsky
Scheffler Konrad
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 21/07/2010
Field of study

CITATION: Delport, W. et al. 2010. Benchmarking multi-rate codon models. PLoS ONE, 5(7): e11587, doi:10.1371/journal.pone.0011587.The original publication is available at http://journals.plos.org/plosoneThe single rate codon model of non-synonymous substitution is ubiquitous in phylogenetic modeling. Indeed, the use of a non-synonymous to synonymous substitution rate ratio parameter has facilitated the interpretation of selection pressure on genomes. Although the single rate model has achieved wide acceptance, we argue that the assumption of a single rate of non-synonymous substitution is biologically unreasonable, given observed differences in substitution rates evident from empirical amino acid models. Some have attempted to incorporate amino acid substitution biases into models of codon evolution and have shown improved model performance versus the single rate model. Here, we show that the single rate model of non-synonymous substitution is easily outperformed by a model with multiple non-synonymous rate classes, yet in which amino acid substitution pairs are assigned randomly to these classes. We argue that, since the single rate model is so easy to improve upon, new codon models should not be validated entirely on the basis of improved model fit over this model. Rather, we should strive to both improve on the single rate model and to approximate the general time-reversible model of codon substitution, with as few parameters as possible, so as to reduce model over-fitting. We hint at how this can be achieved with a Genetic Algorithm approach in which rate classes are assigned on the basis of sequence information content. © 2010 Delport et al.http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0011587Publisher's versio

Bern Open Repository and Information System (BORIS)

Somatic genome architecture and molecular evolution are decoupled in "young" linage-specific gene families in ciliates.

Author: Cote-L'Heureux Auden
Katz Laura A
Kosakovsky Pond Sergei L
Maurer-Alcalá Xyrus X
Publication venue: Public Library of Science
Publication date: 01/01/2024
Field of study

The evolution of lineage-specific gene families remains poorly studied across the eukaryotic tree of life, with most analyses focusing on the recent evolution of de novo genes in model species. Here we explore the origins of lineage-specific genes in ciliates, a ~1 billion year old clade of microeukaryotes that are defined by their division of somatic and germline functions into distinct nuclei. Previous analyses on conserved gene families have shown the effect of ciliates' unusual genome architecture on gene family evolution: extensive genome processing-the generation of thousands of gene-sized somatic chromosomes from canonical germline chromosomes-is associated with larger and more diverse gene families. To further study the relationship between ciliate genome architecture and gene family evolution, we analyzed lineage specific gene families from a set of 46 transcriptomes and 12 genomes representing x species from eight ciliate classes. We assess how the evolution lineage-specific gene families occurs among four groups of ciliates: extensive fragmenters with gene-size somatic chromosomes, non-extensive fragmenters with "large'' multi-gene somatic chromosomes, Heterotrichea with highly polyploid somatic genomes and Karyorelictea with 'paradiploid' somatic genomes. Our analyses demonstrate that: 1) most lineage-specific gene families are found at shallow taxonomic scales; 2) extensive genome processing (i.e., gene unscrambling) during development likely influences the size and number of young lineage-specific gene families; and 3) the influence of somatic genome architecture on molecular evolution is increasingly apparent in older gene families. Altogether, these data highlight the influences of genome architecture on the evolution of lineage-specific gene families in eukaryotes

A First Look at ARFome: Dual-Coding Genes in Mammalian Genomes

Author: Anton Nekrutenko
Radek Szklarczyk
Samir Wadhawan
Sergei Kosakovsky Pond
Wen-Hsiung Li
Wen-Yu Chung
Publication venue: Public Library of Science
Publication date: 01/05/2007
Field of study

Coding of multiple proteins by overlapping reading frames is not a feature one would associate with eukaryotic genes. Indeed, codependency between codons of overlapping protein-coding regions imposes a unique set of evolutionary constraints, making it a costly arrangement. Yet in cases of tightly coexpressed interacting proteins, dual coding may be advantageous. Here we show that although dual coding is nearly impossible by chance, a number of human transcripts contain overlapping coding regions. Using newly developed statistical techniques, we identified 40 candidate genes with evolutionarily conserved overlapping coding regions. Because our approach is conservative, we expect mammals to possess more dual-coding genes. Our results emphasize that the skepticism surrounding eukaryotic dual coding is unwarranted: rather than being artifacts, overlapping reading frames are often hallmarks of fascinating biology

Assigning and visualizing germline genes in antibody repertoires.

Author: Frost Simon DW
Hossain AS Md Mukarram
Murrell Ben
Pond Sergei L Kosakovsky
Silverman Gregg J
Publication venue: Philos Trans R Soc Lond B Biol Sci
Publication date: 20/07/2015
Field of study

Identifying the germline genes involved in immunoglobulin rearrangements is an essential first step in the analysis of antibody repertoires. Based on our prior work in analysing diverse recombinant viruses, we present IgSCUEAL (Immunoglobulin Subtype Classification Using Evolutionary ALgorithms), a phylogenetic approach to assign V and J regions of immunoglobulin sequences to their corresponding germline alleles, with D regions assigned using a simple pairwise alignment algorithm. We also develop an interactive web application for viewing the results, allowing the user to explore the frequency distribution of sequence assignments and CDR3 region length statistics, which is useful for summarizing repertoires, as well as a detailed viewer of rearrangements and region alignments for individual query sequences. We demonstrate the accuracy and utility of our method compared with sequence similarity-based approaches and other non-phylogenetic model-based approaches, using both simulated data and a set of evaluation datasets of human immunoglobulin heavy chain sequences. IgSCUEAL demonstrates the highest accuracy of V and J assignment amongst existing approaches, even when the reassorted sequence is highly mutated, and can successfully cluster sequences on the basis of shared V/J germline alleles.S.K.L.P. and B.M. were supported in part by the U.S. National Institutes of Health (AI110181, AI90970, AI100665, DA34978, GM93939, HL108460, GM110749, LM7092, MH97520, MH83552), the UCSD Center for AIDS Research (Developmental Grant, AI36214, Bioinformatics and Information Technologies Core), the International AIDS Vaccine Initiative (through AI90970), the UC Laboratory Fees Research Program (grant no. 12-LR-236617). G.J.S. was supported in part the U.S. National Institute of Health (AI90118, AI68063, AI40305, and NIAID HHS N272201400019C), and a grant from the Lupus Research Institute. A.S.M.M.H. was supported by an Islamic Development Bank Scholarship, and S.D.W.F. was supported in part by the UK MRC Methodology Research Programme (grant no. MR/J013862/1).This is the final published version. It first appeared at http://rstb.royalsocietypublishing.org/content/370/1676/20140240

Apollo (Cambridge)

Evolutionary Interactions between N-Linked Glycosylation Sites in the HIV-1 Envelope

Author: Art F. Y Poon
Fraser I Lewis
Sebastion Bonhoeffer
Sergei L. Kosakovsky Pond
Simon D. W Frost
Publication venue: Public Library of Science
Publication date: 01/01/2007
Field of study

The addition of asparagine (N)-linked polysaccharide chains (i.e., glycans) to the gp120 and gp41 glycoproteins of human immunodeficiency virus type 1 (HIV-1) envelope is not only required for correct protein folding, but also may provide protection against neutralizing antibodies as a “glycan shield.” As a result, strong host-specific selection is frequently associated with codon positions where nonsynonymous substitutions can create or disrupt potential N-linked glycosylation sites (PNGSs). Moreover, empirical data suggest that the individual contribution of PNGSs to the neutralization sensitivity or infectivity of HIV-1 may be critically dependent on the presence or absence of other PNGSs in the envelope sequence. Here we evaluate how glycan–glycan interactions have shaped the evolution of HIV-1 envelope sequences by analyzing the distribution of PNGSs in a large-sequence alignment. Using a “covarion”-type phylogenetic model, we find that the rates at which individual PNGSs are gained or lost vary significantly over time, suggesting that the selective advantage of having a PNGS may depend on the presence or absence of other PNGSs in the sequence. Consequently, we identify specific interactions between PNGSs in the alignment using a new paired-character phylogenetic model of evolution, and a Bayesian graphical model. Despite the fundamental differences between these two methods, several interactions are jointly identified by both. Mapping these interactions onto a structural model of HIV-1 gp120 reveals that negative (exclusive) interactions occur significantly more often between colocalized glycans, while positive (inclusive) interactions are restricted to more distant glycans. Our results imply that the adaptive repertoire of alternative configurations in the HIV-1 glycan shield is limited by functional interactions between the N-linked glycans. This represents a potential vulnerability of rapidly evolving HIV-1 populations that may provide useful glycan-based targets for neutralizing antibodies

Modeling HIV-1 Drug Resistance as Episodic Directional Selection

The evolution of substitutions conferring drug resistance to HIV-1 is both episodic, occurring when patients are on antiretroviral therapy, and strongly directional, with site-specific resistant residues increasing in frequency over time. While methods exist to detect episodic diversifying selection and continuous directional selection, no evolutionary model combining these two properties has been proposed. We present two models of episodic directional selection (MEDS and EDEPS) which allow the a priori specification of lineages expected to have undergone directional selection. The models infer the sites and target residues that were likely subject to directional selection, using either codon or protein sequences. Compared to its null model of episodic diversifying selection, MEDS provides a superior fit to most sites known to be involved in drug resistance, and neither one test for episodic diversifying selection nor another for constant directional selection are able to detect as many true positives as MEDS and EDEPS while maintaining acceptable levels of false positives. This suggests that episodic directional selection is a better description of the process driving the evolution of drug resistance

FigShare

CodonTest: Modeling Amino Acid Substitution Preferences in Coding Sequences

Codon models of evolution have facilitated the interpretation of selective forces operating on genomes. These models, however, assume a single rate of non-synonymous substitution irrespective of the nature of amino acids being exchanged. Recent developments have shown that models which allow for amino acid pairs to have independent rates of substitution offer improved fit over single rate models. However, these approaches have been limited by the necessity for large alignments in their estimation. An alternative approach is to assume that substitution rates between amino acid pairs can be subdivided into rate classes, dependent on the information content of the alignment. However, given the combinatorially large number of such models, an efficient model search strategy is needed. Here we develop a Genetic Algorithm (GA) method for the estimation of such models. A GA is used to assign amino acid substitution pairs to a series of rate classes, where is estimated from the alignment. Other parameters of the phylogenetic Markov model, including substitution rates, character frequencies and branch lengths are estimated using standard maximum likelihood optimization procedures. We apply the GA to empirical alignments and show improved model fit over existing models of codon evolution. Our results suggest that current models are poor approximations of protein evolution and thus gene and organism specific multi-rate models that incorporate amino acid substitution biases are preferred. We further anticipate that the clustering of amino acid substitution rates into classes will be biologically informative, such that genes with similar functions exhibit similar clustering, and hence this clustering will be useful for the evolutionary fingerprinting of genes

Cronfa at Swansea University

An Evolutionary-Network Model Reveals Stratified Interactions in the V3 Loop of the HIV-1 Envelope

Author: Art F. Y Poon
Eugene I Shakhnovich
Fraser I Lewis
Sergei L. Kosakovsky Pond
Simon D. W Frost
Publication venue: Public Library of Science
Publication date: 01/01/2007
Field of study

The third variable loop (V3) of the human immunodeficiency virus type 1 (HIV-1) envelope is a principal determinant of antibody neutralization and progression to AIDS. Although it is undoubtedly an important target for vaccine research, extensive genetic variation in V3 remains an obstacle to the development of an effective vaccine. Comparative methods that exploit the abundance of sequence data can detect interactions between residues of rapidly evolving proteins such as the HIV-1 envelope, revealing biological constraints on their variability. However, previous studies have relied implicitly on two biologically unrealistic assumptions: (1) that founder effects in the evolutionary history of the sequences can be ignored, and; (2) that statistical associations between residues occur exclusively in pairs. We show that comparative methods that neglect the evolutionary history of extant sequences are susceptible to a high rate of false positives (20%–40%). Therefore, we propose a new method to detect interactions that relaxes both of these assumptions. First, we reconstruct the evolutionary history of extant sequences by maximum likelihood, shifting focus from extant sequence variation to the underlying substitution events. Second, we analyze the joint distribution of substitution events among positions in the sequence as a Bayesian graphical model, in which each branch in the phylogeny is a unit of observation. We perform extensive validation of our models using both simulations and a control case of known interactions in HIV-1 protease, and apply this method to detect interactions within V3 from a sample of 1,154 HIV-1 envelope sequences. Our method greatly reduces the number of false positives due to founder effects, while capturing several higher-order interactions among V3 residues. By mapping these interactions to a structural model of the V3 loop, we find that the loop is stratified into distinct evolutionary clusters. We extend our model to detect interactions between the V3 and C4 domains of the HIV-1 envelope, and account for the uncertainty in mapping substitutions to the tree with a parametric bootstrap

CiteSeerX

HIV-Specific Probabilistic Models of Protein Evolution

Author: D Jones
D Posada
David C. Nickle
DC Nickle
DF Feng
DG George
H Tang
J Adachi
J Felsenstein
James I. Mullins
JT Herbeck
K Tamura
KP Burnham
L Stanfel
Laura Heath
LM Mansky
LY Yampolsky
M Hasegawa
Mark A. Jensen
MO Dayhoff
MW Dimmic
MW Nachman
N Goldman
N Saitou
N Sugiura
Oliver Pybus
Peter B. Gilbert
R Shankarappa
S Henikoff
S Karlin
S Whelan
Sergei L. Kosakovsky Pond
SL Kosakovsky Pond
SL Kosakovsky Pond
SL Kosakovsky Pond
SL Kosakovsky Pond
SL Kosakovsky Pond
SLK Pond
SV Muse
T Leitner
TC Friedrich
TM Allen
Y Liu
Z Yang
Publication venue: Public Library of Science
Publication date: 01/06/2007
Field of study

Comparative sequence analyses, including such fundamental bioinformatics techniques as similarity searching, sequence alignment and phylogenetic inference, have become a mainstay for researchers studying type 1 Human Immunodeficiency Virus (HIV-1) genome structure and evolution. Implicit in comparative analyses is an underlying model of evolution, and the chosen model can significantly affect the results. In general, evolutionary models describe the probabilities of replacing one amino acid character with another over a period of time. Most widely used evolutionary models for protein sequences have been derived from curated alignments of hundreds of proteins, usually based on mammalian genomes. It is unclear to what extent these empirical models are generalizable to a very different organism, such as HIV-1–the most extensively sequenced organism in existence. We developed a maximum likelihood model fitting procedure to a collection of HIV-1 alignments sampled from different viral genes, and inferred two empirical substitution models, suitable for describing between-and within-host evolution. Our procedure pools the information from multiple sequence alignments, and provided software implementation can be run efficiently in parallel on a computer cluster. We describe how the inferred substitution models can be used to generate scoring matrices suitable for alignment and similarity searches. Our models had a consistently superior fit relative to the best existing models and to parameter-rich data-driven models when benchmarked on independent HIV-1 alignments, demonstrating evolutionary biases in amino-acid substitution that are unique to HIV, and that are not captured by the existing models. The scoring matrices derived from the models showed a marked difference from common amino-acid scoring matrices. The use of an appropriate evolutionary model recovered a known viral transmission history, whereas a poorly chosen model introduced phylogenetic error. We argue that our model derivation procedure is immediately applicable to other organisms with extensive sequence data available, such as Hepatitis C and Influenza A viruses