8 research outputs found

    Using ESTs for phylogenomics: Can one accurately infer a phylogenetic tree from a gappy alignment?

    Background: While full genome sequences are still only available for a handful of taxa, large collections of partial gene sequences are available for many more. The alignment of partial gene sequences results in a multiple sequence alignment containing large gaps that are arranged in a staggered pattern. The consequences of this pattern of missing data on the accuracy of phylogenetic analysis are not well understood. We conducted a simulation study to determine the accuracy of phylogenetic trees obtained from gappy alignments using three commonly used phylogenetic reconstruction methods (Neighbor Joining, Maximum Parsimony, and Maximum Likelihood) and studied ways to improve the accuracy of trees obtained from such datasets.
    Results: We found that the pattern of gappiness in multiple sequence alignments derived from partial gene sequences substantially compromised phylogenetic accuracy even in the absence of alignment error. The decline in accuracy was beyond what would be expected based on the amount of missing data. The decline was particularly dramatic for Neighbor Joining and Maximum Parsimony, where the majority of gappy alignments contained 25% to 40% incorrect quartets. To improve the accuracy of the trees obtained from a gappy multiple sequence alignment, we examined two approaches. In the first approach, alignment masking, potentially problematic columns and input sequences are excluded from the dataset. Even in the absence of alignment error, masking improved phylogenetic accuracy up to 100-fold. However, masking retained, on average, only 83% of the input sequences. In the second approach, alignment subdivision, the missing data is statistically modelled in order to retain as many sequences as possible in the phylogenetic analysis. Subdivision resulted in more modest improvements to alignment accuracy, but succeeded in including almost all of the input sequences.
    Conclusion: These results demonstrate that partial gene sequences and gappy multiple sequence alignments can pose a major problem for phylogenetic analysis. The concern will be greatest for high-throughput phylogenomic analyses, in which Neighbor Joining is often the preferred method due to its computational efficiency. Both approaches can be used to increase the accuracy of phylogenetic inference from a gappy alignment. The choice between the two approaches will depend upon how robust the application is to the loss of sequences from the input set, with alignment masking generally giving a much greater improvement in accuracy but at the cost of discarding a larger number of the input sequences.
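
The abstract above does not say how masking decides which columns and sequences are "potentially problematic". As a rough illustration only, the sketch below assumes a simple gap-fraction threshold; the function names, the 0.5 cutoff, and the gap characters are hypothetical choices, not the authors' procedure.

```python
def mask_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-?"):
    """Drop columns whose fraction of gap characters exceeds the threshold.

    `alignment` maps sequence names to equal-length aligned strings.
    The threshold rule is an assumption made for illustration.
    """
    if not alignment:
        return {}
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")
    keep = []
    for col in range(length):
        gaps = sum(alignment[name][col] in gap_chars for name in names)
        if gaps / len(names) <= max_gap_fraction:
            keep.append(col)
    return {name: "".join(alignment[name][c] for c in keep) for name in names}


def drop_gappy_sequences(alignment, max_gap_fraction=0.5, gap_chars="-?"):
    """Drop sequences that are mostly gaps (same assumed threshold rule)."""
    kept = {}
    for name, seq in alignment.items():
        if not seq:
            continue
        gaps = sum(ch in gap_chars for ch in seq)
        if gaps / len(seq) <= max_gap_fraction:
            kept[name] = seq
    return kept


toy = {"a": "AC--GT", "b": "ACG-GT", "c": "A---GT", "d": "------"}
masked = drop_gappy_sequences(mask_gappy_columns(toy))
```

On this toy input, two mostly empty columns and the all-gap sequence `d` are removed, which mirrors the trade-off the abstract describes: a cleaner alignment at the cost of discarding some input sequences.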

    Using Trees: Myrmecocystus Phylogeny and Character Evolution and New Methods for Investigating Trait Evolution and Species Delimitation (PhD Dissertation)

    1) Rates of phenotypic evolution have changed throughout the history of life, producing variation in levels of morphological, functional, and ecological diversity among groups. Testing for the presence of these rate shifts is a key component of evaluating hypotheses about what causes them. General predictions regarding changes in phenotypic diversity as a function of evolutionary history and rates are developed, and tests are derived to evaluate rate changes. Simulations show that these tests are more powerful than existing tests using standardized contrasts.
    2) Species delimitation and species tree inference are difficult problems in the case of recent divergences, especially when different loci have different histories. I quantify the difficulty of the problem and introduce a non-parametric method for simultaneously dividing anonymous samples into different species and inferring a species tree, using individual gene trees as input. This heuristic method seeks to both minimize gene tree – species tree discordance and excess population structure within a species. Analyses suggest that the method may provide useful insights for systematists working at the species level with molecular data.
    3) The phylogeny of Myrmecocystus ants is estimated using nine loci, finding that none of the three subgenera are monophyletic, implying repeated evolution of foraging times and particular morphologies. A new partitioned likelihood program, MrFisher, is created from MrBayes to aid analysis of multilocus datasets without assuming priors. Simulations show that using a partitioned likelihood approach in the presence of rate heterogeneity and missing data, as is common in supermatrix analyses, can recover correct branch lengths where non-partitioned likelihood gives predictably biased estimates of branch lengths but the correct topology.
    4) Evolution of foraging time and coevolution of behavior and morphology in Myrmecocystus ants is examined. New models for reconstructing discrete states along branches of a tree and for examining continuous trait evolution and coevolution with discrete traits are developed and implemented. Foraging transitions between diurnal and nocturnal foraging evidently go through crepuscular intermediates. There is some evidence for increased rates of morphological character evolution associated with changes in foraging regime, but little evidence for particular optimum values for morphological traits associated with foraging.

    Parallelization of the maximum likelihood approach to phylogenetic inference

    Phylogenetic inference refers to the reconstruction of evolutionary relationships among various species, usually presented in the form of a tree. DNA sequences are most often used to determine these relationships. The results of phylogenetic inference have many important applications, including protein function determination, drug discovery, disease tracking and forensics. There are several popular computational methods used for phylogenetic inference, among them distance-based (e.g. neighbor joining), maximum parsimony, maximum likelihood, and Bayesian methods. This thesis focuses on the maximum likelihood method, which is regarded as one of the most accurate methods, with its computational demand being the main hindrance to its widespread use. Maximum likelihood is generally considered to be a heuristic method providing a statistical evaluation of the results, where potential tree topologies are judged by how well they predict the observed sequences. While there have been several previous efforts to parallelize the maximum likelihood method, sequential implementations are more widely used in the biological research community. This is due to a lack of confidence in the results produced by the more recent, parallel programs. However, because phylogenetic inference can be extremely computationally intensive, with the number of possible tree topologies growing exponentially with the number of species, parallelization is necessary to reduce the computation time to a reasonable amount. A parallel program was developed for phylogenetic inference based on the trusted algorithms of fastDNAml, a sequential program for phylogenetic inference utilizing the maximum likelihood approach. Parallelization is achieved using the popular master/workers scheme, where workers evaluate potential tree topologies in parallel. 
Three innovative optimizations are employed to alleviate the associated communication bottleneck encountered when using the master/workers technique with large-scale systems and problems. First, message packing reduces the number of messages sent out by the master, along with the associated overheads. Secondly, allowing workers to keep the best trees evaluated reduces the number of messages received by the master, as low-scoring results are discarded by the workers. Finally, multiple masters are utilized to parallelize the responsibilities of what is traditionally a single master process. These last two optimizations led to a dramatic improvement in performance over the unoptimized parallelization under the conditions tested. Message packing, however, demonstrated a slight reduction in performance. Although testing with large-scale systems and problems was not possible, results for all three optimizations suggested likely performance enhancement under such conditions, potentially leading to relief of the bottleneck
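
The first two optimizations described above, message packing and letting workers keep only their best result, can be sketched in a few lines. This is a toy under stated assumptions: threads stand in for message-passing processes, the names `pack`, `worker_best`, and `master` are hypothetical, the scoring function is a placeholder for a likelihood evaluation, and the multiple-masters optimization is not shown.

```python
from concurrent.futures import ThreadPoolExecutor


def pack(items, batch_size):
    """Group candidates into batches so each 'message' carries several of them."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def worker_best(batch, score):
    """Worker side: score the whole batch, report back only the best candidate."""
    return max(((score(c), c) for c in batch), key=lambda pair: pair[0])


def master(candidates, score, batch_size=4, workers=2):
    """Master side: send packed batches out, then compare the workers' local bests."""
    batches = pack(list(candidates), batch_size)
    if not batches:
        raise ValueError("no candidates to evaluate")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_bests = list(pool.map(lambda b: worker_best(b, score), batches))
    return max(local_bests, key=lambda pair: pair[0])


# Toy usage: integers stand in for tree topologies, and the score peaks at 6.
best_score, best_candidate = master(range(10), lambda x: -(x - 6) ** 2)
```

With ten candidates and a batch size of four, the master sends three messages instead of ten and receives three results instead of ten, which is the communication saving the abstract attributes to these two optimizations.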

    A new method for identifying site-specific evolutionary rates and its applications.

    In this thesis, I discuss each stage in the development of a new method for identifying site-specific evolutionary rates, from conception of the idea, through implementation, to its application to data. TIGER, or tree independent generation of evolutionary rates, is based largely on the works of LeQuesne (1989), Wilkinson (1998) and Pisani (2004) and on the premise that each site in a multi-state character matrix can be scored by the level of agreement it displays with the other sites. In these earlier studies, however, agreement was measured in a binary manner: sites were either compatible with each other or they were not. TIGER allows various degrees of agreement between two sites, allowing it to pick up more subtle signals in the data. After the method was implemented in a software program, it was applied to data. Using a combination of simulated and empirical datasets, TIGER was shown to produce desirable results. In particular, removal of sites identified by TIGER was shown to improve phylogenetic reconstruction of deeply diverging lineages and of taxa displaying compositional attraction. Additionally, TIGER was applied to a gene content matrix in order to identify HGT signals and integrated into the analysis of a current phylogenetic problem, the origin of the mitochondria. Although it is widely accepted that eukaryotes have a chimeric genome, the specific “parent” of the mitochondria is, as yet, unclear. Previous studies have failed to reach agreement regarding this issue for a number of reasons. Exploration of the signals using TIGER and heterogeneous modelling reveals that multiple signals and compositional heterogeneity are among the biggest problems with datasets containing both mitochondrial and α-proteobacterial sequences.
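
The abstract does not give the scoring formula, so the following is only one possible reading of "degrees of agreement" between sites. It assumes each site is treated as a partition of the taxa by character state, and that site i agrees with site j to the extent that the groups of j fit inside groups of i. All function names are hypothetical, and gaps and ambiguity codes are ignored.

```python
def site_partition(column):
    """Group taxon indices by the character state they show at this site."""
    groups = {}
    for taxon_index, state in enumerate(column):
        groups.setdefault(state, set()).add(taxon_index)
    return list(groups.values())


def agreement(part_i, part_j):
    """Fraction of groups in part_j that sit inside some group of part_i.

    This graded score (anywhere from 0 to 1) replaces a yes/no compatibility test.
    """
    hits = sum(any(group <= other for other in part_i) for group in part_j)
    return hits / len(part_j)


def site_scores(columns):
    """Average agreement of each site with every other site."""
    parts = [site_partition(col) for col in columns]
    n = len(parts)
    if n < 2:
        raise ValueError("need at least two sites to compare")
    return [
        sum(agreement(parts[i], parts[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


# Three sites over four taxa: one constant site and two sites that conflict.
scores = site_scores(["AAAA", "AACC", "ACAC"])
```

Under this assumed scoring, the constant site gets the highest score and the two mutually conflicting sites get the lowest, so low-scoring sites are the candidates for removal as fast-evolving or noisy.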

    Estimation des longueurs de branche et artefact sur la datation moléculaire

    Molecular phylogeny provides a tool complementary to paleontological and geological studies, allowing the reconstruction of phylogenetic relationships between species and the estimation of their divergence times. When a phylogenetic tree is inferred, however, researchers mainly focus on the topology, i.e. the relative branching order of the different nodes. The branch lengths of the phylogeny are often treated as by-products, nuisance parameters carrying little information. Yet branch lengths are the primary information for molecular dating. Saturation, the presence of multiple substitutions at the same position, is an artifact that leads to a systematic underestimation of branch lengths. We therefore set out to estimate the magnitude of saturation and its impact on estimated divergence times. We chose to study the mammalian mitochondrial genome, which is available for many species and is thought to display a high level of saturation. Furthermore, the phylogenetic relationships of mammals are known, which allowed us to fix the topology and thus control one of the parameters influencing branch lengths. We used two main approaches to improve the detection of multiple substitutions: (i) increasing the number of species in order to break the longest branches of the tree, and (ii) using more or less realistic models of sequence evolution. The results show that the underestimation of branch lengths is very pronounced (up to a factor of 3), and that using a large number of species influences the detection of multiple substitutions far more than improving the model of sequence evolution. This suggests that even the most complex evolutionary models currently available, such as the CAT+Covarion model, which takes into account the heterogeneity of the substitution process between sites and of evolutionary rates over time, are still far from capturing the full complexity of biological processes. Despite the size of the underestimation of branch lengths, the impact on dating appeared to be relatively small, because the underestimation is more or less homothetic. This is particularly true for the complex evolutionary models. However, since multiple substitutions are most effectively detected by breaking branches into the shortest possible fragments through the addition of species, the problem of bias in taxonomic sampling arises, a bias due to extinction during the history of life on Earth. Because this bias leads to a non-homothetic underestimation, we consider it essential to improve models of sequence evolution, and we propose that the protocol developed in this work will make it possible to evaluate their effectiveness with respect to saturation.
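
Why multiple substitutions at the same position shorten apparent branch lengths can be seen with the simplest substitution model. The sketch below uses the Jukes-Cantor (JC69) model purely as a stand-in; it is far simpler than the models discussed in the abstract (such as CAT+Covarion), and the function names are hypothetical.

```python
import math


def jc69_expected_p(d):
    """Expected proportion of differing sites after d substitutions per site (JC69)."""
    if d < 0:
        raise ValueError("distance must be non-negative")
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """JC69-corrected distance from an observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# A branch with 1.5 true substitutions per site shows far fewer visible differences.
true_d = 1.5
observed_p = jc69_expected_p(true_d)    # roughly 0.65 differences per site
recovered_d = jc69_distance(observed_p)  # the correction restores 1.5 under JC69
```

The observed proportion of differences levels off near 0.75 however long the branch is, so uncorrected counts understate long branches the most. A correction recovers the true length only when the model matches the real substitution process, which is why model misspecification leaves a residual underestimate.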

    An efficient program for phylogenetic inference using simulated annealing

    Inference of phylogenetic trees comprising thousands of organisms based on the maximum likelihood method is computationally expensive. A new program, RAxML-SA (Randomized Axelerated Maximum Likelihood with Simulated Annealing), is presented that combines simulated annealing and hill-climbing techniques to improve the quality of final trees. In addition to the ability to perform backward steps and potentially escape local maxima provided by simulated annealing, a large number of “good” alternative topologies is generated, which can be used to build a consensus tree on the fly. Though slower than some of the fastest hill-climbing programs such as RAxML-III and PHYML, RAxML-SA finds better trees for large real data alignments containing more than 250 sequences. Furthermore, the performance on 40 simulated 500-taxon alignments is reasonable in comparison to PHYML. Finally, a straightforward and efficient OpenMP parallelization of RAxML is presented.
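
The "backward steps" mentioned above are what separate simulated annealing from plain hill climbing. The sketch below shows the textbook acceptance rule with a geometric cooling schedule on a toy problem. It is not RAxML-SA's actual schedule or move set: the names, the parameters, and the integer search space are all assumptions for illustration.

```python
import math
import random


def accept_move(current_score, proposed_score, temperature, rng=random):
    """Always accept improvements; accept a worse score with a temperature-dependent chance."""
    if proposed_score >= current_score:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp((proposed_score - current_score) / temperature)


def anneal(initial, score, propose, start_temp=1.0, cooling=0.95, steps=1000, seed=0):
    """Maximise `score`, remembering the best state seen during the search."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    temperature = start_temp
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        if accept_move(current_score, candidate_score, temperature, rng):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temperature *= cooling  # geometric cooling: backward steps become rarer
    return best, best_score


# Toy usage: integers stand in for tree topologies and the score for a log-likelihood.
best, best_score = anneal(
    initial=0,
    score=lambda x: -(x - 3) ** 2,
    propose=lambda x, rng: x + rng.choice((-1, 1)),
)
```

Early on, at a high temperature, the search can step to a worse state and so leave a local maximum. As the temperature falls, the rule approaches hill climbing, matching the combination of the two techniques the abstract describes.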

    Computational statistics in molecular phylogenetics

    Simulation remains a very important approach to testing the robustness and accuracy of phylogenetic inference methods. However, current simulation programs are limited, especially concerning realistic models for simulating insertions and deletions (indels). In this thesis I implement a new, portable and flexible application, named INDELible, which can be used to generate nucleotide, amino acid and codon sequence data by simulating indels (under several models of indel length distribution) as well as substitutions (under a rich repertoire of substitution models). In particular, I introduce a simulation study that makes use of one of INDELible’s many unique features to simulate data with indels under codon models that allow the nonsynonymous/synonymous substitution rate ratio to vary among sites and branches. This data is used to quantify, for the first time, the precise effects of indels and alignment errors on the false-positive rate and power of the widely used branch-site test of positive selection. Several alignment programs are used and assessed in this context. Through the simulation experiment, I show that insertions and deletions do not cause the test to generate excessive false positives if the alignment is correct, but alignment errors can lead to unacceptably high false positives. Previous selection studies that use inferior alignment programs are revisited to demonstrate the applicability of my results in real world situations. Further work uses simulated data from INDELible to examine the effects of tree-shape and branch length on the alignment accuracy of several alignment programs, and the impact of alignment errors on different methods of phylogeny reconstruction. In particular, analysis is performed to explore which programs avoid generating the kind of alignment errors that are most detrimental to the process of phylogeny reconstruction

    Transmission dynamics of Avian Influenza A virus

    Avian influenza A virus (AIV) has an extremely high rate of mutation. Frequent exchanges of gene segments between different AIV (reassortment) have been responsible for major pandemics in recent human history. The presence of a wild bird reservoir maintains the threat of incursion of AIV into domestic birds, humans and other animals. In this thesis, I addressed unanswered questions of how diverse AIV subtypes (classified according to antigenicity of the two surface proteins, haemagglutinin and neuraminidase) evolve and interact among different bird populations in different parts of the world, using Bayesian phylogenetic methods with large datasets of full genome sequences. Firstly, I explored the reassortment patterns of AIV internal segments among different subtypes by quantifying evolutionary parameters including reassortment rate, evolutionary rate and selective constraint in time-resolved Bayesian phylogenies. A major conclusion was that reassortment rate is negatively associated with selective constraint and that infection of wild rather than domestic birds was associated with a higher reassortment rate. Secondly, I described the spatial transmission pattern of AIV in China. Clustering of related viruses in particular geographic areas and economic zones was identified from the viral phylogeographic diffusion networks. The results indicated that Central China and the Pearl River Delta are the two main sources of viral outflow, while the East Coast, especially the Yangtze River Delta, is the major recipient area. In addition, by applying a general linear model, the predictors that have the strongest impact on viral spatial diffusion were identified, including economic (agricultural) activity, climate, and ecology. Thirdly, I determined the genetic and phylogeographic origin of a recent H7N3 highly pathogenic avian influenza outbreak in Mexico. 
Location, subtype, avian host species and pathogenicity were modelled as discrete traits and jointly analysed using all eight viral gene segments. The results indicated that the outbreak AIV is a novel reassortant carried by wild waterfowl from different migration flyways in North America during the time period studied. Importantly, I concluded that Mexico, and Central America in general, might be a potential hotspot for AIV reassortment events, a possibility which to date has not attracted widespread attention. Overall, the work carried out in this thesis described the evolutionary dynamics of AIV from which important conclusions regarding its epidemiological impact in both Eurasia and North America can be drawn