56 research outputs found
Simultaneous Reconstruction of Duplication Episodes and Gene-Species Mappings
We present a novel problem, called MetaEC, which aims to infer gene-species assignments in a collection of gene trees with missing labels by minimizing the size of duplication episode clustering (EC). This problem is particularly relevant in metagenomics, where incomplete data often poses a challenge to the accurate reconstruction of gene histories. To solve MetaEC, we propose a polynomial-time dynamic programming (DP) formulation that verifies the existence of a set of duplication episodes from a predefined set of episode candidates. We then demonstrate how to use DP to design an algorithm that solves MetaEC. Although the algorithm is exponential in the worst case, we introduce a heuristic modification that provides a solution along with the knowledge that it is exact. To evaluate our method, we perform two computational experiments on simulated and empirical data containing whole genome duplication events, showing that our algorithm is able to accurately infer the corresponding events.
Consensus properties for the deep coalescence problem and their application for scalable tree search
Evolutionary systems biology of virus-host interactions
The evolution of virus-host interactions occurs at multiple levels of biological complexity, such as the organismal, genetic, and molecular levels. In the first part of this study, the evolution of associations between herpesviruses (HVs) and their hosts is examined across more than 400 million years. Recent studies have demonstrated that cospeciations are not always the main event driving HV evolution, as interhost speciations and host switches also play important roles. The present study shows that, more than topological incongruences, mismatches in divergence times are the main source of disagreement between host and viral phylogenies, revealing host switches, intrahost speciations and viral losses along the evolution of HVs. Herpesviruses have large genomes encoding dozens of proteins. Apart from amino acid substitutions, these viruses also evolve by acquiring, duplicating and losing protein domains. Although the domain repertoires of HVs differ across species, a core set of domains is shared among all of them. The second part of this study reveals that 28 out of 41 core domains encoded by HV ancestors are still found in present-day repertoires, which over time were expanded by domain gains and duplications. Distinct evolutionary strategies led HVs to develop very specific domain repertoires, which may explain their host range and tissue tropism, and provide hints on the origins of herpesviruses.
Although most mutations in proteins are deleterious, a few of them end up improving viral fitness and defining how viruses interact with their hosts. Using an integrative approach, the third part of this study investigates the evolution of protein-protein interactions (PPIs) involving the membrane proteins Nectins and the herpesviral envelope glycoproteins D/G. By means of ancestral sequence reconstruction and homology modelling, ancestral structures of these protein complexes were generated, and analysis of their interaction energies revealed important differences in binding affinity along their evolution.
Méthodes et algorithmes pour l'amélioration de l'inférence de l'histoire évolutive des génomes
Gene phylogenies provide an ideal framework for the comparative study of genomes.
Not only do they incorporate the evolution of species through speciation, but they also
capture the expansion and contraction of gene families through gene gains and losses.
Determining the order and nature of these events amounts to inferring the evolutionary
history of gene families, and is a prerequisite for several analyses in comparative
genomics. Indeed, it is required to efficiently determine orthology relationships
between genes, which are important for predicting protein structure and function and for
phylogenetic analyses, to name only these applications.
Methods for inferring the evolutionary histories of gene families assume that the
phylogenies considered are free of errors. These gene phylogenies, often reconstructed
from amino acid or nucleotide sequences, are however only an estimate of the true gene
tree and are subject to errors from varied but well-documented sources. To guarantee the
accuracy of the inferred histories, one must therefore ensure that the gene trees contain
no errors. In this thesis, we study this problem from two angles.
The first part of this thesis concerns the identification of deviations from the genetic
code, one of the causes of annotation errors that then propagate into phylogenies. To this
end, we develop a methodology for inferring deviations from the standard genetic code by
analysing coding sequences and tRNAs. This methodology is centred on an algorithm for
predicting codon reassignments, called CoreTracker. We first show the effectiveness of our
method, then use it to demonstrate the evolution of the genetic code in the mitochondrial
genomes of green algae.
The second part of the thesis concerns the development of efficient methods for
correcting and constructing phylogenetic gene trees. We present two methods that exploit
information on species evolution. The first, ProfileNJ, is deterministic and very fast.
It corrects gene trees by exclusively targeting subtrees with weak statistical support.
Its application to the gene families of Ensembl Compara shows a clear improvement in tree
quality compared with the trees provided by the database. The second, GATC, uses a genetic
algorithm and treats the problem as a multi-objective optimisation of gene tree topology,
given constraints on the evolution of gene families through sequence mutation and gene
gain/loss. We show that such an approach is not only effective, but also appropriate for
constructing sets of reference trees.
Gene trees offer a proper framework for comparative genomics. Not only do they provide
information about species evolution through speciation events, but they also capture gene
family expansion and contraction by gene gains and losses. They are thus used to infer
the evolutionary history of gene families and accurately predict the orthologous relationship
between genes, on which several biological analyses rely.
Methods for inferring gene family evolution explicitly assume that gene trees are known
without errors. However, standard phylogenetic methods for tree construction based on
sequence data are well documented as error-prone. Gene trees constructed using these methods
will usually introduce biases during the inference of gene family histories. In this thesis, we
present new methods aiming to improve the quality of phylogenetic gene trees and thereby
the accuracy of underlying evolutionary histories of their corresponding gene families.
We start by providing a framework to study genetic code deviations, one possible cause
of annotation errors that could then spread to the phylogeny reconstruction. Our framework
is based on analysing coding sequences and tRNAs to predict codon reassignments. We first
show its efficiency, then apply it to green plant mitochondrial genomes.
The second part of this thesis focuses on the development of efficient species-tree-aware
methods for gene tree construction. We present ProfileNJ, a fast and deterministic correction
method that targets weakly supported branches of a gene tree. When applied to the gene
families of the Ensembl Compara database, ProfileNJ produces an arguably better set of
gene trees compared to the ones available in Ensembl Compara. We later use a different
strategy, based on a genetic algorithm, allowing both construction and correction of gene
trees. This second method, called GATC, treats the problem as a multi-objective optimisation
problem in which we are looking for the set of gene trees optimal for both sequence data
and information of gene family evolution through gene gain and loss. We show that this
approach yields accurate trees and is suitable for the construction of reference datasets to benchmark other methods.
Evolutionary genomics: statistical and computational methods
This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.
Graph-based modeling and evolutionary analysis of microbial metabolism
Microbial organisms are responsible for most of the metabolic innovations on Earth. Understanding microbial metabolism helps shed light on questions that are central to biology, biomedicine, energy and the environment. Graph-based modeling is a powerful tool that has been used extensively for elucidating the organising principles of microbial metabolism and the underlying evolutionary forces that act upon it. Nevertheless, various graph-theoretic representations and techniques have been applied to metabolic networks, rendering the modeling aspect ad hoc and highlighting the conflicting conclusions based on the different representations.
The contribution of this dissertation is two-fold. In the first half, I revisit the modeling aspect of metabolic networks, and present novel techniques for their representation and analysis. In particular, I explore the limitations of standard graph representations, and the utility of a more appropriate model, hypergraphs, for capturing metabolic network properties. Further, I address the task of metabolic pathway inference and the necessity to account for chemical symmetries and alternative tracings in this crucial task.
In the second part of the dissertation, I focus on two evolutionary questions. First, I investigate the evolutionary underpinnings of the formation of communities in metabolic networks, a phenomenon that has been reported in the literature and implicated in an organism's adaptation to its environment. I find that the metabolome size better explains the observed community structures. Second, I correlate evolution at the genome level with emergent properties at the metabolic network level. In particular, I quantify the various evolutionary events (e.g., gene duplication, loss, transfer, fusion, and fission) in a group of proteobacteria, and analyze their role in shaping the metabolic networks and determining the organismal fitness.
As metabolism gains an increasingly prominent role in biomedical, energy, and environmental research, understanding how to model this process and how it came about during evolution becomes more crucial. My dissertation provides important insights in both directions.
Evaluating, Accelerating and Extending the Multispecies Coalescent Model of Evolution
So much research builds on evolutionary histories of species and
genes. They are used in genomics to infer synteny, in ecology to
describe and predict biodiversity, and in molecular biology to
transfer knowledge acquired in model organisms to humans and
crops. Beyond downstream applications, expanding our knowledge of
life on Earth is important in its own right. From Naturalis
Historia to On the Origin of Species, the acquisition of this
knowledge has been a part of human development.
Evolutionary histories are commonly represented as trees, where a
common ancestor progressively splits into descendant species or
alleles. Time trees add more information by using height to
represent genetic distance or elapsed time. Species and gene
trees can be inferred from molecular sequences using methods
which are explicitly model-based, or implicitly assume or are
statistically consistent with a particular model of evolution.
One such model, the multispecies coalescent (MSC), is the topic
of my thesis. Under this model, separate trees are inferred for
the species history and for each gene's history. Gene trees are
embedded within the species tree according to a coalescent
process.
Researchers often avoid the MSC when reconstructing time trees
because of claims that available implementations are too
computationally demanding. Instead, the species history is
inferred using a single tree by concatenating the sequences from
each gene. I began my thesis research by evaluating the effect of
this approximation. In a realistic simulation based on parameters
inferred from empirical data, concatenation was grossly
inaccurate, especially when estimating recent species divergence
times. In a later simulation study I demonstrated that when using
concatenation, credible intervals often excluded the true
values.
To address reluctance towards using the MSC, I developed a faster
implementation of the model. StarBEAST2 is a Markov chain Monte
Carlo (MCMC) method, meaning it characterizes the probability
distribution over trees by randomly walking the parameter space.
I improved computational performance by developing more efficient
proposals used to traverse the space, and reducing the number of
parameters in the model through analytical integration of
population sizes.
Despite its sophistication, the MSC has theoretical limitations.
One is that the substitution rate is assumed to stay constant, or
uncorrelated between lineages of different genes. However,
substitution rates do vary and are associated with species traits
like body size. I addressed this assumption in StarBEAST2 by
extending the MSC to estimate substitution rates for each
species. Another assumption is that genetic material cannot be
transferred horizontally, but a more general model called the
multispecies network coalescent (MSNC) permits introgression of
alleles across species boundaries. My collaborators and I have
developed and evaluated an MCMC implementation of the MSNC.
My final thesis project was to combine the MSC with the
fossilized birth-death (FBD) process, which models how species
are fossilized and sampled through time. To demonstrate the
utility of the FBD-MSC model, I used it to reconstruct the
evolutionary history of Caninae (dogs and foxes) using fossil
data and molecular sequences.
Integrating phylogenomics, biogeography and systematics to explore the taxonomy and the rise of the ratsnakes
Understanding the evolutionary processes that create the spectacular diversity of organisms, both in species numbers and form, is a primary goal for biologists. Global ratsnakes are a species-rich assemblage with high morphological and ecological diversity and a distribution that encompasses both the Old World (OW) and the New World (NW). To explore the mechanism leading to the divergence of the ratsnakes, I tested the hypotheses regarding the area of origin and global dispersal, and examined the patterns of diversification and trait evolution. Given adaptive radiation via ecological opportunity, a diversity-dependent diversification pattern and an early burst trait evolutionary pattern are expected, with rapid divergence triggered by the appearance of new resources, extinction of competitors, colonization of new areas or the appearance of key innovations. Thus, I tested whether the radiation of ratsnakes follows diversity-dependent diversification with an early burst in speciation and trait divergence, and whether the variation in diversification is associated with OW-NW dispersal or changes in traits. Further, trait convergence between OW and NW lineages was investigated to determine whether, given similar environmental conditions, rapid speciation via ecological opportunity is repeatable. To answer these questions, a robust phylogenetic tree is fundamental. Due to potential gene tree/species tree discordance, hundreds of loci sampled across the entire genome were generated using the anchored hybrid enrichment approach, and multi-species coalescent methods were used to build the species phylogeny.
Then, given this phylogenetic context, taxonomic changes were made to reflect named monophyletic groups, and divergence times and ancestral areas were estimated to 1) infer the processes leading to the current global distribution of ratsnakes, 2) assess the best-fitting diversification and trait evolution models, and 3) determine whether ecomorphological convergence occurs with adaptive regimes of traits on the phylogeny. Among all of the inferred species trees, by comparing the extent of tree discordance and the gene tree errors, the species tree generated in the program MPEST with summary statistics of posterior probability gene trees was selected for further analysis. First, it was determined that the traditional ratsnake genera Gonyosoma and Coelognathus are excluded from the monophyletic ratsnake group, with the remaining monophyletic group defined as Coronellini. The reconstructed ancestral areas supported an origin of ratsnakes in the OW Eastern Palearctic, with a single dispersal to the NW via Beringia. Two subclades, each defined by a single genus, Lampropeltis and Elaphe, were found to have exclusively elevated species diversification and trait evolutionary rates. As the rate accelerations occurred only in recently diverged lineages, colonization of the NW and rapid speciation of the NW lineages were decoupled. A general diversity-dependent radiation pattern in both OW and NW lineages was supported, with a recent sharp elevation in diversification about 6.5 Ma, mainly within the genera Lampropeltis and Elaphe. Three morphological convergence events were detected among OW and NW lineages, corresponding to the previously defined morphological taxonomies (i.e., Elaphe and Pantherophis), indicating that, without a robust molecular phylogeny, morphological convergence positively misleads taxonomy.
This research demonstrates the advantages and challenges of phylogenetic inference using genome-scale datasets, highlights the importance of incorporating biogeographic history and trait evolution in studies of diversification, and indicates that oversimplified models are insufficient to describe the complexity of processes shaping the diversity in a species-rich assemblage.