8 research outputs found
Using ESTs for phylogenomics: Can one accurately infer a phylogenetic tree from a gappy alignment?
Background: While full genome sequences are still only available for a handful of taxa, large collections of partial gene sequences are available for many more. The alignment of partial gene sequences results in a multiple sequence alignment containing large gaps that are arranged in a staggered pattern. The consequences of this pattern of missing data on the accuracy of phylogenetic analysis are not well understood. We conducted a simulation study to determine the accuracy of phylogenetic trees obtained from gappy alignments using three commonly used phylogenetic reconstruction methods (Neighbor Joining, Maximum Parsimony, and Maximum Likelihood) and studied ways to improve the accuracy of trees obtained from such datasets.
Results: We found that the pattern of gappiness in multiple sequence alignments derived from partial gene sequences substantially compromised phylogenetic accuracy even in the absence of alignment error. The decline in accuracy was beyond what would be expected based on the amount of missing data. The decline was particularly dramatic for Neighbor Joining and Maximum Parsimony, where the majority of gappy alignments contained 25% to 40% incorrect quartets. To improve the accuracy of the trees obtained from a gappy multiple sequence alignment, we examined two approaches. In the first approach, alignment masking, potentially problematic columns and input sequences are excluded from the dataset. Even in the absence of alignment error, masking improved phylogenetic accuracy up to 100-fold. However, masking retained, on average, only 83% of the input sequences. In the second approach, alignment subdivision, the missing data is statistically modelled in order to retain as many sequences as possible in the phylogenetic analysis. Subdivision resulted in more modest improvements to alignment accuracy, but succeeded in including almost all of the input sequences.
Conclusion: These results demonstrate that partial gene sequences and gappy multiple sequence alignments can pose a major problem for phylogenetic analysis. The concern will be greatest for high-throughput phylogenomic analyses, in which Neighbor Joining is often the preferred method due to its computational efficiency. Both approaches can be used to increase the accuracy of phylogenetic inference from a gappy alignment. The choice between the two approaches will depend upon how robust the application is to the loss of sequences from the input set, with alignment masking generally giving a much greater improvement in accuracy but at the cost of discarding a larger number of the input sequences.
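The abstract above measures accuracy as the share of incorrect quartets. A minimal Python sketch of such a measure follows, assuming trees are written as nested tuples of taxon names and that both trees carry the same taxa. The function names and the tuple representation are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

def leaf_sets(tree):
    """Return (leaves, clades) for a nested-tuple tree; clades are leaf sets."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, clades = frozenset(), []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades

def quartet_topology(quartet, clades):
    """Return the 2-vs-2 pairing the tree induces on four taxa, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clades:
        inside = q & clade
        if len(inside) == 2:  # this clade separates two taxa from the other two
            return frozenset([inside, q - inside])
    return None

def incorrect_quartet_fraction(true_tree, est_tree):
    """Fraction of four-taxon subsets whose induced topology differs between trees."""
    leaves, true_clades = leaf_sets(true_tree)
    est_leaves, est_clades = leaf_sets(est_tree)
    if leaves != est_leaves:
        raise ValueError("trees must contain the same taxa")
    quartets = list(combinations(sorted(leaves), 4))
    wrong = sum(
        quartet_topology(q, true_clades) != quartet_topology(q, est_clades)
        for q in quartets
    )
    return wrong / len(quartets)

true_tree = ((("A", "B"), "C"), ("D", "E"))
est_tree = ((("A", "C"), "B"), ("D", "E"))
print(incorrect_quartet_fraction(true_tree, est_tree))  # 0.4
```

In this toy case two of the five quartets (those containing A, B and C together) are resolved differently, which gives 0.4. The sketch enumerates every quartet, so it is only practical for small trees.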
Using Trees: Myrmecocystus Phylogeny and Character Evolution and New Methods for Investigating Trait Evolution and Species Delimitation (PhD Dissertation)
1) Rates of phenotypic evolution have changed throughout the history of life, producing variation in levels of morphological, functional, and ecological diversity among groups. Testing for the presence of these rate shifts is a key component of evaluating hypotheses about what causes them. General predictions regarding changes in phenotypic diversity as a function of evolutionary history and rates are developed, and tests are derived to evaluate rate changes. Simulations show that these tests are more powerful than existing tests using standardized contrasts. 
2) Species delimitation and species tree inference are difficult problems in the case of recent divergences, especially when different loci have different histories. I quantify the difficulty of the problem and introduce a non-parametric method for simultaneously dividing anonymous samples into different species and inferring a species tree, using individual gene trees as input. This heuristic method seeks to both minimize gene tree – species tree discordance and excess population structure within a species. Analyses suggest that the method may provide useful insights for systematists working at the species level with molecular data.
3) The phylogeny of Myrmecocystus ants is estimated using nine loci, finding that none of the three subgenera are monophyletic, implying repeated evolution of foraging times and particular morphologies. A new partitioned likelihood program, MrFisher, is created from MrBayes to aid analysis of multilocus datasets without assuming priors. Simulations show that using a partitioned likelihood approach in the presence of rate heterogeneity and missing data, as is common in supermatrix analyses, can recover correct branch lengths where non-partitioned likelihood gives predictably biased estimates of branch lengths but the correct topology.
4) Evolution of foraging time and coevolution of behavior and morphology in Myrmecocystus ants are examined. New models for reconstructing discrete states along branches of a tree and for examining continuous trait evolution and coevolution with discrete traits are developed and implemented. Transitions between diurnal and nocturnal foraging evidently go through crepuscular intermediates. There is some evidence for increased rates of morphological character evolution associated with changes in foraging regime, but little evidence for particular optimum values for morphological traits associated with foraging.
Parallelization of the maximum likelihood approach to phylogenetic inference
Phylogenetic inference refers to the reconstruction of evolutionary relationships among various species, usually presented in the form of a tree. DNA sequences are most often used to determine these relationships. The results of phylogenetic inference have many important applications, including protein function determination, drug discovery, disease tracking and forensics. There are several popular computational methods used for phylogenetic inference, among them distance-based (e.g. neighbor joining), maximum parsimony, maximum likelihood, and Bayesian methods. This thesis focuses on the maximum likelihood method, which is regarded as one of the most accurate methods, with its computational demand being the main hindrance to its widespread use. Maximum likelihood is generally considered to be a heuristic method providing a statistical evaluation of the results, where potential tree topologies are judged by how well they predict the observed sequences. While there have been several previous efforts to parallelize the maximum likelihood method, sequential implementations are more widely used in the biological research community. This is due to a lack of confidence in the results produced by the more recent, parallel programs. However, because phylogenetic inference can be extremely computationally intensive, with the number of possible tree topologies growing exponentially with the number of species, parallelization is necessary to reduce the computation time to a reasonable amount. A parallel program was developed for phylogenetic inference based on the trusted algorithms of fastDNAml, a sequential program for phylogenetic inference utilizing the maximum likelihood approach. Parallelization is achieved using the popular master/workers scheme, where workers evaluate potential tree topologies in parallel.
Three innovative optimizations are employed to alleviate the communication bottleneck encountered when using the master/workers technique with large-scale systems and problems. First, message packing reduces the number of messages sent out by the master, along with the associated overheads. Second, allowing workers to keep the best trees evaluated reduces the number of messages received by the master, as low-scoring results are discarded by the workers. Finally, multiple masters are utilized to parallelize the responsibilities of what is traditionally a single master process. These last two optimizations led to a dramatic improvement in performance over the unoptimized parallelization under the conditions tested. Message packing, however, demonstrated a slight reduction in performance. Although testing with large-scale systems and problems was not possible, results for all three optimizations suggested likely performance enhancement under such conditions, potentially leading to relief of the bottleneck.
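To make the growth in the number of tree topologies concrete, the sketch below counts unrooted, fully bifurcating topologies with the standard formula (2n - 5)!! for n labelled taxa. The function name is an illustrative assumption; the formula is a textbook result and is not taken from the thesis.

```python
def unrooted_topologies(n_taxa):
    """Number of unrooted binary tree topologies on n labelled taxa: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, unrooted_topologies(n))
```

Ten taxa already give 2,027,025 topologies, and the count multiplies by 2n - 3 each time the taxon count rises from n to n + 1, which is why exhaustive evaluation is not feasible and heuristic search is parallelized instead.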
A new method for identifying site-specific evolutionary rates and its applications.
In this thesis, I discuss each stage in the development of a new method for identifying
site-specific evolutionary rates, from conception of the idea, through
implementation, to its application to data. TIGER, or tree independent generation of
evolutionary rates, is based largely on the works of LeQuesne (1989), Wilkinson
(1998) and Pisani (2004) and on the premise that sites in a multi-state character matrix
can be scored based on the level of agreement they display with the other sites. In
these earlier studies, however, agreement was measured in a binary manner: sites were
either compatible with each other or they were not. TIGER allows various degrees of
agreement between two sites, allowing it to pick up more subtle signals in the
data.
Once the method had been implemented in a software program, it was applied to data.
Using a combination of simulated and empirical datasets, TIGER was shown to
produce desirable results. In particular, removal of sites identified by TIGER was
shown to improve phylogenetic reconstruction of deeply diverging lineages and of
taxa displaying compositional attraction. Additionally, TIGER was applied to a gene
content matrix in order to identify HGT signals and integrated into the analysis of a
current phylogenetic problem, the origin of the mitochondria.
Although it is widely accepted that eukaryotes have a chimeric genome, the specific
"parent" of the mitochondria is, as yet, unclear. Previous studies have failed to
reach agreement regarding this issue for a number of reasons. Exploration of the
signals using TIGER and heterogeneous modelling reveals that multiple signals and
compositional heterogeneity are among the biggest problems with datasets containing
both mitochondrial and α-proteobacterial sequences.
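The idea of graded rather than binary agreement between sites can be sketched as follows. The scoring rule used here (the fraction of one site's taxon groups that fit inside a single group of the other site) is an assumption chosen for illustration; the abstract does not state TIGER's formula, and the function names are hypothetical. Gaps and missing data are not handled.

```python
def partition(site):
    """Group taxon indices by the character state they carry at this site."""
    groups = {}
    for taxon, state in enumerate(site):
        groups.setdefault(state, set()).add(taxon)
    return list(groups.values())

def agreement(site_i, site_j):
    """Fraction of site_j's groups that are contained in a single group of site_i."""
    groups_i, groups_j = partition(site_i), partition(site_j)
    nested = sum(any(g_j <= g_i for g_i in groups_i) for g_j in groups_j)
    return nested / len(groups_j)

def site_scores(columns):
    """Mean agreement of each site with every other site (higher = more consistent)."""
    scores = []
    for i, site_i in enumerate(columns):
        others = [agreement(site_i, s) for j, s in enumerate(columns) if j != i]
        scores.append(sum(others) / len(others))
    return scores

# Each string is one alignment column read across four taxa.
print(agreement("AACC", "AAAC"))             # 0.5, a partial agreement
print(site_scores(["AACC", "AACC", "ACAC"]))  # [0.5, 0.5, 0.0]
```

Under this rule the third column, which conflicts with the other two, receives the lowest score, which is the kind of site a rate-based filter would flag.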
Estimation des longueurs de branche et artefact sur la datation moléculaire
Molecular phylogeny provides a tool complementary to paleontological and geological studies, allowing the reconstruction of phylogenetic relationships between species and the estimation of their divergence times. When a phylogenetic tree is inferred, however, researchers mainly focus on the topology, i.e. the relative branching order of the different nodes. The branch lengths of the phylogeny are often considered by-products, nuisance parameters carrying little information. Yet they are the primary information for molecular dating. Saturation, the presence of multiple substitutions at the same position, is an artifact that leads to a systematic underestimation of branch lengths. We therefore set out to estimate the extent of saturation and its impact on the estimation of divergence ages.
We chose to study the mammalian mitochondrial genome, which is available for many species and is thought to display a high level of saturation. Furthermore, the phylogenetic relationships of mammals are known, which allowed us to fix the topology and so control one of the parameters influencing branch lengths. We used two main approaches to improve the detection of multiple substitutions: (i) increasing the number of species to break the longest branches of the tree, and (ii) using more or less realistic models of sequence evolution. The results demonstrate that there is a very pronounced underestimation of branch lengths (up to a factor of 3). Furthermore, the use of a large number of species influences the detection of multiple substitutions far more than improving the model of sequence evolution does. This suggests that even the most complex evolutionary models currently available, such as the CAT+Covarion model, which takes into account the heterogeneity of the substitution process between sites and of the rates of evolution over time, are still far from capturing the entire complexity of biological processes.
Despite the substantial underestimation of branch lengths, the impact on dating appeared to be relatively limited, because the underestimation is more or less homothetic. This is particularly true for the complex evolutionary models. However, since multiple substitutions are most effectively detected by breaking the branches into the shortest possible fragments through the addition of species, the problem of bias in taxonomic sampling arises, a bias due to extinction during the history of life on Earth. Because this kind of bias leads to a non-homothetic underestimation, we consider it essential to improve models of sequence evolution and suggest that the protocol developed in this work will make it possible to evaluate their effectiveness against saturation.
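Why multiple substitutions lead to underestimated branch lengths can be illustrated with the Jukes-Cantor (JC69) model, the simplest nucleotide substitution model. JC69 is used here only as an assumption for illustration; it is not one of the models evaluated in the thesis, and the function names are hypothetical.

```python
import math

def expected_p_distance(d):
    """Expected proportion of differing sites after d substitutions per site (JC69)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def jc69_distance(p):
    """Substitutions per site implied by an observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for d in (0.1, 0.5, 1.5):
    p = expected_p_distance(d)
    print(f"true d = {d:.1f}  observed p = {p:.3f}  corrected = {jc69_distance(p):.3f}")
```

The observed proportion of differences flattens out toward 0.75 as the true distance grows, so uncorrected or under-corrected estimates fall increasingly short on long branches. Splitting a long branch by adding species keeps each segment in the range where the correction is still reliable.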
An efficient program for phylogenetic inference using simulated annealing
Inference of phylogenetic trees comprising thousands of organisms based on the maximum likelihood method is computationally expensive. A new program, RAxML-SA (Randomized Axelerated Maximum Likelihood with Simulated Annealing), is presented that combines simulated annealing and hill-climbing techniques to improve the quality of final trees. In addition to the ability to perform backward steps and potentially escape local maxima provided by simulated annealing, a large number of "good" alternative topologies is generated, which can be used to build a consensus tree on the fly. Though slower than some of the fastest hill-climbing programs such as RAxML-III and PHYML, RAxML-SA finds better trees for large real data alignments containing more than 250 sequences. Furthermore, the performance on 40 simulated 500-taxon alignments is reasonable in comparison to PHYML. Finally, a straightforward and efficient OpenMP parallelization of RAxML is presented.
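The backward steps mentioned above come from the acceptance rule of simulated annealing. The sketch below shows a generic Metropolis-style rule and a geometric cooling schedule on a toy score function. The parameter values, function names and schedule are assumptions for illustration and do not reproduce RAxML-SA's actual settings.

```python
import math
import random

def accept(delta_score, temperature, rng):
    """Always accept improvements; accept a worse state with probability exp(delta/T)."""
    if delta_score >= 0:
        return True
    return rng.random() < math.exp(delta_score / temperature)

def anneal(score, propose, start, t_start=1.0, cooling=0.95, steps=200, seed=0):
    """Maximize score by simulated annealing, remembering the best state seen."""
    rng = random.Random(seed)
    current = best = start
    temperature = t_start
    for _ in range(steps):
        candidate = propose(current, rng)
        if accept(score(candidate) - score(current), temperature, rng):
            current = candidate
            if score(current) > score(best):
                best = current
        temperature *= cooling  # geometric cooling: backward steps become rarer
    return best

# Toy stand-in for a likelihood surface: a single peak at x = 3.
toy_score = lambda x: -(x - 3) ** 2
toy_move = lambda x, rng: x + rng.choice([-1, 1])
print(anneal(toy_score, toy_move, start=0))
```

In a phylogenetic setting the state would be a tree, the proposal a topological rearrangement, and the score a log-likelihood; early in the run the high temperature lets the search leave local maxima, and later it behaves like hill climbing.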
Computational statistics in molecular phylogenetics
Simulation remains a very important approach to testing the robustness and accuracy of phylogenetic
inference methods. However, current simulation programs are limited, especially concerning realistic
models for simulating insertions and deletions (indels). In this thesis I implement a new, portable and
flexible application, named INDELible, which can be used to generate nucleotide, amino acid and
codon sequence data by simulating indels (under several models of indel length distribution) as well
as substitutions (under a rich repertoire of substitution models).
In particular, I introduce a simulation study that makes use of one of INDELible's many unique
features to simulate data with indels under codon models that allow the nonsynonymous/synonymous
substitution rate ratio to vary among sites and branches. This data is used to quantify, for the first
time, the precise effects of indels and alignment errors on the false-positive rate and power of the
widely used branch-site test of positive selection. Several alignment programs are used and assessed
in this context. Through the simulation experiment, I show that insertions and deletions do not cause
the test to generate excessive false positives if the alignment is correct, but alignment errors can lead
to unacceptably high false positives. Previous selection studies that use inferior alignment programs
are revisited to demonstrate the applicability of my results in real world situations.
Further work uses simulated data from INDELible to examine the effects of tree-shape and branch
length on the alignment accuracy of several alignment programs, and the impact of alignment errors
on different methods of phylogeny reconstruction. In particular, analysis is performed to explore
which programs avoid generating the kind of alignment errors that are most detrimental to the process
of phylogeny reconstruction.
Transmission dynamics of Avian Influenza A virus
Avian influenza A virus (AIV) has an extremely high rate of mutation. Frequent
exchanges of gene segments between different AIVs (reassortment) have been
responsible for major pandemics in recent human history. The presence of a wild bird
reservoir maintains the threat of incursion of AIV into domestic birds, humans and
other animals. In this thesis, I addressed unanswered questions of how diverse AIV
subtypes (classified according to antigenicity of the two surface proteins,
haemagglutinin and neuraminidase) evolve and interact among different bird
populations in different parts of the world, using Bayesian phylogenetic methods
with large datasets of full genome sequences.
Firstly, I explored the reassortment patterns of AIV internal segments among
different subtypes by quantifying evolutionary parameters including reassortment
rate, evolutionary rate and selective constraint in time-resolved Bayesian tree
phylogenies. A major conclusion was that reassortment rate is negatively associated
with selective constraint and that infection of wild rather than domestic birds was
associated with a higher reassortment rate. Secondly, I described the spatial
transmission pattern of AIV in China. Clustering of related viruses in particular
geographic areas and economic zones was identified from the viral phylogeographic
diffusion networks. The results indicated that Central China and the Pearl River
Delta are the two main sources of viral outflow, while the East Coast, especially the
Yangtze River Delta, is the major recipient area. Simultaneously, by applying a
general linear model, the predictors that have the strongest impact on viral spatial
diffusion were identified, including economic (agricultural) activity, climate, and
ecology. Thirdly, I determined the genetic and phylogeographic origin of a recent
H7N3 highly pathogenic avian influenza outbreak in Mexico. Location, subtype,
avian host species and pathogenicity were modelled as discrete traits and jointly
analysed using all eight viral gene segments. The results indicated that the outbreak
AIV is a novel reassortant carried by wild waterfowl from different migration
flyways in North America during the time period studied. Importantly, I concluded
that Mexico, and Central America in general, might be a potential hotspot for AIV
reassortment events, a possibility which to date has not attracted widespread
attention.
Overall, the work carried out in this thesis described the evolutionary dynamics of
AIV from which important conclusions regarding its epidemiological impact in both
Eurasia and North America can be drawn.