8 research outputs found

Using ESTs for phylogenomics: Can one accurately infer a phylogenetic tree from a gappy alignment?

Author: Hartmann Stefanie
Vision Todd J
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract. Background: While full genome sequences are still only available for a handful of taxa, large collections of partial gene sequences are available for many more. The alignment of partial gene sequences results in a multiple sequence alignment containing large gaps that are arranged in a staggered pattern. The consequences of this pattern of missing data on the accuracy of phylogenetic analysis are not well understood. We conducted a simulation study to determine the accuracy of phylogenetic trees obtained from gappy alignments using three commonly used phylogenetic reconstruction methods (Neighbor Joining, Maximum Parsimony, and Maximum Likelihood) and studied ways to improve the accuracy of trees obtained from such datasets. Results: We found that the pattern of gappiness in multiple sequence alignments derived from partial gene sequences substantially compromised phylogenetic accuracy even in the absence of alignment error. The decline in accuracy was beyond what would be expected based on the amount of missing data. The decline was particularly dramatic for Neighbor Joining and Maximum Parsimony, where the majority of gappy alignments contained 25% to 40% incorrect quartets. To improve the accuracy of the trees obtained from a gappy multiple sequence alignment, we examined two approaches. In the first approach, alignment masking, potentially problematic columns and input sequences are excluded from the dataset. Even in the absence of alignment error, masking improved phylogenetic accuracy up to 100-fold. However, masking retained, on average, only 83% of the input sequences. In the second approach, alignment subdivision, the missing data is statistically modelled in order to retain as many sequences as possible in the phylogenetic analysis. Subdivision resulted in more modest improvements to alignment accuracy, but succeeded in including almost all of the input sequences.
Conclusion: These results demonstrate that partial gene sequences and gappy multiple sequence alignments can pose a major problem for phylogenetic analysis. The concern will be greatest for high-throughput phylogenomic analyses, in which Neighbor Joining is often the preferred method due to its computational efficiency. Both approaches can be used to increase the accuracy of phylogenetic inference from a gappy alignment. The choice between the two approaches will depend upon how robust the application is to the loss of sequences from the input set, with alignment masking generally giving a much greater improvement in accuracy but at the cost of discarding a larger number of the input sequences.
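
The masking idea described in this abstract (excluding problematic columns and input sequences) can be sketched roughly as follows. This is a minimal illustration and not the authors' procedure: the function name `mask_alignment`, the two thresholds, and the gap character are assumptions chosen for the example.

```python
def mask_alignment(seqs, max_col_gap=0.5, min_seq_cover=0.5, gap="-"):
    """Drop gappy columns, then drop sequences left with too little data.

    seqs: dict mapping sequence name -> aligned string (all equal length).
    max_col_gap and min_seq_cover are hypothetical thresholds.
    """
    if not seqs:
        return {}
    names = list(seqs)
    length = len(seqs[names[0]])
    if any(len(s) != length for s in seqs.values()):
        raise ValueError("all aligned sequences must have equal length")

    # Keep a column only if its fraction of gaps is at most max_col_gap.
    keep_cols = [
        c for c in range(length)
        if sum(seqs[n][c] == gap for n in names) / len(names) <= max_col_gap
    ]

    masked = {}
    for n in names:
        row = "".join(seqs[n][c] for c in keep_cols)
        covered = sum(ch != gap for ch in row)
        # Keep a sequence only if enough of the retained columns are non-gap.
        if keep_cols and covered / len(keep_cols) >= min_seq_cover:
            masked[n] = row
    return masked


alignment = {"a": "ACGT--", "b": "ACGTAC", "c": "--GTAC", "d": "----AC"}
masked = mask_alignment(alignment)
```

With the default thresholds in this toy input, every column is retained and only the sparsest sequence is discarded, which mirrors the trade-off the abstract reports: masking improves the data that remain at the cost of losing some input sequences.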

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Carolina Digital Repository

MPG.PuRe

Using Trees: Myrmecocystus Phylogeny and Character Evolution and New Methods for Investigating Trait Evolution and Species Delimitation (PhD Dissertation)

Author: Brian C. O'
Publication date: 05/09/2008

1) Rates of phenotypic evolution have changed throughout the history of life, producing variation in levels of morphological, functional, and ecological diversity among groups. Testing for the presence of these rate shifts is a key component of evaluating hypotheses about what causes them. General predictions regarding changes in phenotypic diversity as a function of evolutionary history and rates are developed, and tests are derived to evaluate rate changes. Simulations show that these tests are more powerful than existing tests using standardized contrasts. 
2) Species delimitation and species tree inference are difficult problems in the case of recent divergences, especially when different loci have different histories. I quantify the difficulty of the problem and introduce a non-parametric method for simultaneously dividing anonymous samples into different species and inferring a species tree, using individual gene trees as input. This heuristic method seeks to both minimize gene tree – species tree discordance and excess population structure within a species. Analyses suggest that the method may provide useful insights for systematists working at the species level with molecular data.
3) The phylogeny of Myrmecocystus ants is estimated using nine loci, finding that none of the three subgenera are monophyletic, implying repeated evolution of foraging times and particular morphologies. A new partitioned likelihood program, MrFisher, is created from MrBayes to aid analysis of multilocus datasets without assuming priors. Simulations show that using a partitioned likelihood approach in the presence of rate heterogeneity and missing data, as is common in supermatrix analyses, can recover correct branch lengths where non-partitioned likelihood gives predictably biased estimates of branch lengths but the correct topology.
4) Evolution of foraging time and coevolution of behavior and morphology in Myrmecocystus ants is examined. New models for reconstructing discrete states along branches of a tree and for examining continuous trait evolution and coevolution with discrete traits are developed and implemented. Foraging transitions between diurnal and nocturnal foraging evidently go through crepuscular intermediates. There is some evidence for increased rates of morphological character evolution associated with changes in foraging regime, but little evidence for particular optimum values for morphological traits associated with foraging

Crossref

Nature Precedings

Parallelization of the maximum likelihood approach to phylogenetic inference

Author: Garnham Janine
Publication venue: RIT Scholar Works
Publication date: 01/08/2007

Phylogenetic inference refers to the reconstruction of evolutionary relationships among various species, usually presented in the form of a tree. DNA sequences are most often used to determine these relationships. The results of phylogenetic inference have many important applications, including protein function determination, drug discovery, disease tracking and forensics. There are several popular computational methods used for phylogenetic inference, among them distance-based (e.g. neighbor joining), maximum parsimony, maximum likelihood, and Bayesian methods. This thesis focuses on the maximum likelihood method, which is regarded as one of the most accurate methods, with its computational demand being the main hindrance to its widespread use. Maximum likelihood is generally considered to be a heuristic method providing a statistical evaluation of the results, where potential tree topologies are judged by how well they predict the observed sequences. While there have been several previous efforts to parallelize the maximum likelihood method, sequential implementations are more widely used in the biological research community. This is due to a lack of confidence in the results produced by the more recent, parallel programs. However, because phylogenetic inference can be extremely computationally intensive, with the number of possible tree topologies growing exponentially with the number of species, parallelization is necessary to reduce the computation time to a reasonable amount. A parallel program was developed for phylogenetic inference based on the trusted algorithms of fastDNAml, a sequential program for phylogenetic inference utilizing the maximum likelihood approach. Parallelization is achieved using the popular master/workers scheme, where workers evaluate potential tree topologies in parallel.
Three innovative optimizations are employed to alleviate the associated communication bottleneck encountered when using the master/workers technique with large-scale systems and problems. First, message packing reduces the number of messages sent out by the master, along with the associated overheads. Secondly, allowing workers to keep the best trees evaluated reduces the number of messages received by the master, as low-scoring results are discarded by the workers. Finally, multiple masters are utilized to parallelize the responsibilities of what is traditionally a single master process. These last two optimizations led to a dramatic improvement in performance over the unoptimized parallelization under the conditions tested. Message packing, however, demonstrated a slight reduction in performance. Although testing with large-scale systems and problems was not possible, results for all three optimizations suggested likely performance enhancement under such conditions, potentially leading to relief of the bottleneck
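
Two of the optimizations named above, message packing and letting workers keep only their best result, can be illustrated with a small sketch. Everything here is an assumption made for illustration: threads stand in for the distributed worker processes, `score_topology` is a hypothetical placeholder for a likelihood evaluation, and none of this is taken from the thesis's program or from fastDNAml.

```python
from concurrent.futures import ThreadPoolExecutor


def score_topology(topology):
    # Hypothetical stand-in for a likelihood evaluation: higher is better.
    return -sum((x - 3) ** 2 for x in topology)


def pack(items, batch_size):
    # Message packing: group several topologies into one message per worker call.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def worker(batch):
    # The worker scores its whole batch but reports only its best candidate,
    # so low-scoring results never travel back to the master.
    return max((score_topology(t), t) for t in batch)


def master(topologies, batch_size=4, n_workers=2):
    if not topologies:
        raise ValueError("no topologies to evaluate")
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        local_bests = list(pool.map(worker, pack(topologies, batch_size)))
    # The master only compares one result per batch.
    return max(local_bests)


candidates = [(i,) for i in range(10)]
best_score, best_topology = master(candidates)
```

In this sketch the master receives one message per batch rather than one per topology, which is the kind of reduction in message count the abstract attributes to these optimizations.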

RIT Scholar Works

A new method for identifying site-specific evolutionary rates and its applications.

Author: Cummins Carla A.
Publication date: 01/10/2011

In this thesis, I discuss each stage in the development of a new method for identifying site-specific evolutionary rates, from conception of the idea, through the implementation, to its application to data. TIGER, or tree independent generation of evolutionary rates, is based largely on the works of LeQuesne (1989), Wilkinson (1998) and Pisani (2004) and the premise that sites in a multi-state character matrix can be scored based on the level of agreement they display with the other sites. In these earlier studies, however, agreement was measured in a binary manner: sites were either compatible with each other or they were not. TIGER allows various degrees of agreement to occur between two sites, allowing it to pick up more subtle signals in the data. After the method was implemented in a software program, it could be applied to data. Using a combination of simulated and empirical datasets, TIGER was shown to produce desirable results. In particular, removal of sites identified by TIGER was shown to improve phylogenetic reconstruction of deeply diverging lineages and of taxa displaying compositional attraction. Additionally, TIGER was applied to a gene content matrix in order to identify HGT signals and integrated into the analysis of a current phylogenetic problem, the origin of the mitochondria. Although it is widely accepted that eukaryotes have a chimeric genome, the specific “parent” of the mitochondria is, as of yet, unclear. Previous studies have failed to reach agreement regarding this issue for a number of reasons. Exploration of the signals using TIGER and heterogeneous modelling reveals that multiple signals and compositional heterogeneity are among the biggest problems with datasets containing both mitochondrial and α-proteobacterial sequences
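
The contrast between binary compatibility and graded agreement can be made concrete with a toy score. The sketch below is only an illustration of the general idea of scoring a site by how well the other sites agree with it. The partition-nesting measure, the function names, and the averaging are assumptions, and the abstract does not specify TIGER's actual formula.

```python
def site_partition(column):
    # Group taxa (by index) that share the same character state at this site.
    groups = {}
    for taxon, state in enumerate(column):
        groups.setdefault(state, set()).add(taxon)
    return list(groups.values())


def agreement(col_i, col_j):
    # Fraction of site j's groups that fit entirely inside some group of site i.
    # This yields values between 0 and 1 rather than a yes/no answer.
    part_i = site_partition(col_i)
    part_j = site_partition(col_j)
    nested = sum(any(g <= h for h in part_i) for g in part_j)
    return nested / len(part_j)


def site_scores(columns):
    # Score each site by its mean agreement with every other site.
    scores = []
    for i, ci in enumerate(columns):
        others = [agreement(ci, cj) for j, cj in enumerate(columns) if j != i]
        scores.append(sum(others) / len(others) if others else 1.0)
    return scores


# Each string is one alignment column over four taxa.
scores = site_scores(["AACC", "AACC", "ACAC"])
```

Here the third column conflicts with the other two and receives the lowest score, so a filter based on such scores would flag it first.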

MURAL - Maynooth University Research Archive Library

Estimation des longueurs de branche et artefact sur la datation moléculaire

Author: El Alaoui Wafae
Publication date: 01/08/2008

Molecular phylogeny complements paleontological and geological studies by allowing the reconstruction of phylogenetic relationships between species and the estimation of their divergence times. When a phylogenetic tree is inferred, however, researchers focus mainly on the topology, i.e. the relative branching order of the nodes. The branch lengths of the phylogeny are often treated as by-products, nuisance parameters carrying little information. Yet branch lengths are the primary information for molecular dating. Saturation, the presence of multiple substitutions at the same position, is an artifact that leads to a systematic underestimation of branch lengths. We therefore set out to estimate the magnitude of this phenomenon and its impact on estimated divergence times.
We chose to study the mammalian mitochondrial genome, which is available for many species and is assumed to display a high level of saturation. Furthermore, the phylogenetic relationships of mammals are known, which allowed us to fix the topology and thus control one of the parameters influencing branch lengths. We used two main approaches to improve the detection of multiple substitutions: (i) increasing the number of species in order to break the longest branches of the tree, and (ii) using more or less realistic models of sequence evolution. The results show that the underestimation of branch lengths is very pronounced (up to a factor of 3), and that using a large number of species influences the detection of multiple substitutions far more than improving the model of sequence evolution does. This suggests that even the most complex evolutionary models currently available, such as the CAT+Covarion model, which takes into account the heterogeneity of the substitution process between sites and of evolutionary rates over time, are still far from capturing the full complexity of biological processes. Despite the substantial underestimation of branch lengths, the impact on dating appeared to be relatively limited, because the underestimation is more or less homothetic. This is particularly true for the complex evolutionary models. However, because multiple substitutions are most effectively detected by breaking branches into the shortest possible fragments through the addition of species, the problem of bias in taxonomic sampling arises, a bias due to extinction over the history of life on Earth. Because this kind of bias leads to a non-homothetic underestimation, we consider it essential to improve models of sequence evolution, and we suggest that the protocol developed in this work will make it possible to evaluate their effectiveness against saturation.
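
Why saturation leads to underestimated branch lengths can be seen in the simplest textbook substitution model, Jukes-Cantor (JC69). The abstract does not use this model, so it is an assumption made purely for illustration, and the function names are hypothetical. Under JC69, the observed proportion of differing sites levels off at 0.75 however many substitutions have really occurred.

```python
import math


def expected_p_distance(d):
    """Expected proportion of differing sites under JC69.

    d is the true distance in substitutions per site.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """Invert the relation above to correct an observed proportion p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Uncorrected differences fall further behind the true distance as d grows.
for true_d in (0.1, 0.5, 1.0, 3.0):
    observed = expected_p_distance(true_d)
    print(f"true d = {true_d:.1f}  observed p = {observed:.3f}")
```

Splitting a long branch into shorter ones keeps each segment in the regime where observed and true distances stay close, which is consistent with the abstract's finding that adding species helps detect multiple substitutions.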

Dépôt Institutionnel Numérique

An efficient program for phylogenetic inference using simulated annealing

Author: Alexandros Stamatakis

Inference of phylogenetic trees comprising thousands of organisms based on the maximum likelihood method is computationally expensive. A new program, RAxML-SA (Randomized Axelerated Maximum Likelihood with Simulated Annealing), is presented that combines simulated annealing and hill-climbing techniques to improve the quality of final trees. In addition to the ability to perform backward steps and potentially escape local maxima provided by simulated annealing, a large number of “good” alternative topologies is generated, which can be used to build a consensus tree on the fly. Though slower than some of the fastest hill-climbing programs such as RAxML-III and PHYML, RAxML-SA finds better trees for large real data alignments containing more than 250 sequences. Furthermore, the performance on 40 simulated 500-taxon alignments is reasonable in comparison to PHYML. Finally, a straightforward and efficient OpenMP parallelization of RAxML is presented.
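
The "backward steps" mentioned above are what separate simulated annealing from plain hill-climbing. The following is a generic sketch of the standard Metropolis acceptance rule on a toy one-dimensional problem. It is not RAxML-SA's search: the cooling schedule, parameter values, and function names are assumptions for illustration.

```python
import math
import random


def accept(delta, temperature, rng):
    # Improvements are always accepted. A worse candidate (a backward step)
    # is accepted with probability exp(delta / T), which shrinks as T cools.
    if delta >= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(delta / temperature)


def anneal(score, neighbor, start, t0=1.0, cooling=0.95, steps=500, seed=0):
    rng = random.Random(seed)
    current = best = start
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        if accept(score(candidate) - score(current), temperature, rng):
            current = candidate
            if score(current) > score(best):
                best = current
        temperature *= cooling  # geometric cooling (an assumed schedule)
    return best


def toy_score(x):
    # Toy objective with a single maximum at x = 7.
    return -(x - 7) ** 2


def toy_neighbor(x, rng):
    # Propose a step of one unit in either direction.
    return x + rng.choice((-1, 1))


best = anneal(toy_score, toy_neighbor, start=0)
```

As the temperature falls, the procedure behaves more and more like hill-climbing, which matches the combination of the two techniques that the abstract describes.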

CiteSeerX

Computational statistics in molecular phylogenetics

Author: Fletcher W.A.J.
Publication venue: UCL (University College London)
Publication date: 28/03/2011

Simulation remains a very important approach to testing the robustness and accuracy of phylogenetic inference methods. However, current simulation programs are limited, especially concerning realistic models for simulating insertions and deletions (indels). In this thesis I implement a new, portable and flexible application, named INDELible, which can be used to generate nucleotide, amino acid and codon sequence data by simulating indels (under several models of indel length distribution) as well as substitutions (under a rich repertoire of substitution models). In particular, I introduce a simulation study that makes use of one of INDELible’s many unique features to simulate data with indels under codon models that allow the nonsynonymous/synonymous substitution rate ratio to vary among sites and branches. This data is used to quantify, for the first time, the precise effects of indels and alignment errors on the false-positive rate and power of the widely used branch-site test of positive selection. Several alignment programs are used and assessed in this context. Through the simulation experiment, I show that insertions and deletions do not cause the test to generate excessive false positives if the alignment is correct, but alignment errors can lead to unacceptably high false positives. Previous selection studies that use inferior alignment programs are revisited to demonstrate the applicability of my results in real world situations. Further work uses simulated data from INDELible to examine the effects of tree-shape and branch length on the alignment accuracy of several alignment programs, and the impact of alignment errors on different methods of phylogeny reconstruction. In particular, analysis is performed to explore which programs avoid generating the kind of alignment errors that are most detrimental to the process of phylogeny reconstruction

UCL Discovery

Transmission dynamics of Avian Influenza A virus

Author: Leigh Brown Andrew J.
Lu Lu
Lycett Samantha J.
Publication venue: The University of Edinburgh
Publication date: 01/01/2014

Avian influenza A virus (AIV) has an extremely high rate of mutation. Frequent exchanges of gene segments between different AIV (reassortment) have been responsible for major pandemics in recent human history. The presence of a wild bird reservoir maintains the threat of incursion of AIV into domestic birds, humans and other animals. In this thesis, I addressed unanswered questions of how diverse AIV subtypes (classified according to antigenicity of the two surface proteins, haemagglutinin and neuraminidase) evolve and interact among different bird populations in different parts of the world, using Bayesian phylogenetic methods with large datasets of full genome sequences. Firstly, I explored the reassortment patterns of AIV internal segments among different subtypes by quantifying evolutionary parameters including reassortment rate, evolutionary rate and selective constraint in time-resolved Bayesian tree phylogenies. A major conclusion was that reassortment rate is negatively associated with selective constraint and that infection of wild rather than domestic birds was associated with a higher reassortment rate. Secondly, I described the spatial transmission pattern of AIV in China. Clustering of related viruses in particular geographic areas and economic zones was identified from the viral phylogeographic diffusion networks. The results indicated that Central China and the Pearl River Delta are two main sources of viral outflow, while the East Coast, especially the Yangtze River Delta, is the major recipient area. Simultaneously, by applying a general linear model, the predictors that have the strongest impact on viral spatial diffusion were identified, including economic (agricultural) activity, climate, and ecology. Thirdly, I determined the genetic and phylogeographic origin of a recent H7N3 highly pathogenic avian influenza outbreak in Mexico.
Location, subtype, avian host species and pathogenicity were modelled as discrete traits and jointly analysed using all eight viral gene segments. The results indicated that the outbreak AIV is a novel reassortant carried by wild waterfowl from different migration flyways in North America during the time period studied. Importantly, I concluded that Mexico, and Central America in general, might be a potential hotspot for AIV reassortment events, a possibility which to date has not attracted widespread attention. Overall, the work carried out in this thesis described the evolutionary dynamics of AIV from which important conclusions regarding its epidemiological impact in both Eurasia and North America can be drawn

ZENODO

Directory of Open Access Journals

Edinburgh Research Archive

Edinburgh Research Explorer

Electronic Archiving System

Enlighten

FigShare

Crossref

Springer - Publisher Connector

Dryad Digital Repository (Duke University)

PubMed Central

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY