18 research outputs found

    Improvements to the Parallel Solution Based on Collective Operations for the Bootstrap of Phylogenetic Tree Reconstruction in PhyML 3.0

    Phylogenetics determines the evolutionary relationships between groups of species. Its product is a hypothesis relating the collection of species analysed, usually represented as a phylogenetic tree. PhyML is among the main programs for the reconstruction of phylogenetic trees. Bootstrap is a statistical method used to measure the confidence of a given data set, and it is usually applied in the analysis of inferred phylogenetic trees. In PhyML this method has two parallel MPI implementations: one with point-to-point operations and one with collective operations. The second version is more efficient than the first; however, its higher memory consumption limits the number of bootstrap replicates that can be used. To solve this problem, three solutions were developed. The objective of this work was to validate these versions and to test their performance. The validation showed that the proposed solutions produce results equivalent to those of the point-to-point version. In the performance simulations, two solutions were superior to the point-to-point version, with the best one achieving gains of 28.46% and 39.64% for 32 and 64 processes, respectively. The enhancements therefore provide alternatives to the point-to-point version without the memory limitation.

    Supertree-like methods for genome-scale species tree estimation

    A critical step in many biological studies is the estimation of evolutionary trees (phylogenies) from genomic data. Of particular interest is the species tree, which illustrates how a set of species evolved from a common ancestor. While species trees were previously estimated from a few regions of the genome (genes), it is now widely recognized that biological processes can cause the evolutionary histories of individual genes to differ from each other and from the species tree. This heterogeneity across the genome is phylogenetic signal that can be leveraged to estimate species evolution with greater accuracy. Hence, species tree estimation is expected to be greatly aided by current large-scale sequencing efforts, including the 5000 Insect Genomes Project, the 10000 Plant Genomes Project, the (~60000) Vertebrate Genomes Project, and the Earth BioGenome Project, which aims to assemble genomes (or at least genome-scale data) for 1.5 million eukaryotic species in the next ten years. To analyze these forthcoming datasets, species tree estimation methods must scale to thousands of species and tens of thousands of genes; however, many of the current leading methods, which are heuristics for NP-hard optimization problems, can be prohibitively expensive on datasets of this size. In this dissertation, we argue that new methods are needed to enable scalable and statistically rigorous species tree estimation pipelines; we then seek to address this challenge through the introduction of three supertree-like methods: NJMerge, TreeMerge, and FastMulRFS. For these methods, we present theoretical results (worst-case running time analyses and proofs of statistical consistency) as well as empirical results on simulated datasets (and a fungal dataset for FastMulRFS). Overall, these methods enable statistically consistent species tree estimation pipelines that achieve comparable accuracy to the dominant optimization-based approaches while dramatically reducing running time

    Algorithmic Advancements and Massive Parallelism for Large-Scale Datasets in Phylogenetic Bayesian Markov Chain Monte Carlo

    Datasets used for the inference of the "tree of life" grow at unprecedented rates, thus inducing a high computational burden for analytic methods. First, we introduce a scalable software package that allows us to conduct state-of-the-art Bayesian analyses on datasets of almost arbitrary size. Second, we derive a proposal mechanism for MCMC that is substantially more efficient than traditional branch length proposals. Third, we present an efficient algorithm for solving the rogue taxon problem.

    Annotation of marine eukaryotic genomes


    Statistical Methods for Identifying Demographic Structure in DNA Sequence Alignments

    All life on Earth, from viruses and bacteria, trees and flowers, to birds and human beings, can be traced back to a single common ancestor. However, the evolutionary history that led to this diversity of life is a complicated story that we do not yet fully understand. Since the discovery of the structure of deoxyribonucleic acid (DNA) in 1953, and the development of DNA sequencing technology, researchers have been using similarities and differences in the genomes of organisms to better understand the relationships between species. However, due to the complexity of the evolutionary history of life, simplifying assumptions must be made to make mathematical models tractable. It must then be of paramount importance for researchers to be able to identify when the simplifying assumptions of a specific model are unreasonable. In this thesis we present two projects, and although they are different in implementation, both attempt to investigate simplifying assumptions in the closely related fields of population genetics and phylogenetics. However, we also present applications of our projects where the results of our work are not used in assessing assumptions for further analyses, but are of standalone interest to researchers. Our first project is concerned with the development of a method for constructing coordinate representations for single-copy DNA, such as mitochondrial DNA (mtDNA) or Y-chromosomal DNA, analogous to the use of PCA for nuclear DNA. We construct a coordinate system that, given p informative sites in an alignment of n individuals, returns p-dimensional coordinates for each of the n individuals. We order the dimensions by the proportion of variability each dimension captures in the overall genetic diversity. From these coordinates in "genetic space" researchers may perform a number of downstream analyses. It is possible to optimally visualise high-dimensional sequence data in two or three dimensions.
One may use our method to identify closely related individuals, identify sites in the alignment that are closely linked, or use the same coordinate space to find sites that are closely linked with groups of individuals. Finally, one may choose to test for significant relationships between the structure of the coordinates in genetic space and metadata recorded on sequenced individuals, indicating demographic variables that are highly related to the evolutionary history of an alignment. This final application of our method, where one may test for demographic structure in sequence data, is of key importance to the theme of discovering when simplifying assumptions of analyses are not reasonable. Through the comparison of coordinates in gene space and any demographic variables of interest, researchers may explore whether or not the individuals in the alignment indicate population substructure. For example, one may investigate whether there appears to be a phylogeographic structure to the individuals forming distinct subpopulations, and whether migration appears to occur between subpopulations. Through empirical data, we show that our method can readily recover tree-like structure, identify strong genetic groupings based on qualitative traits, and recover phylogeographic signal given provenanced sampling information. We show that our method can even be used to suggest routes of migration based on mtDNA. Finally, we apply our method to modern Aboriginal Australian mtDNA to show strong evidence for discrete geographic populations of Aboriginal Australian peoples that display permanence on the Australian landscape dating back to the original colonisation of Australia 50 thousand years before present (kya). Our second project is concerned with identifying departures from a tree-like evolutionary history at the species level.
It is not uncommon for closely related species (Species A and C, say) to still be capable of interbreeding and producing viable "hybrid" offspring (Species B, say). Under these conditions, a phylogenetic tree cannot describe the evolutionary history of the hybrid species, and instead an admixture graph may be a better description. We begin by considering the evolutionary history of three species: a hybrid organism that has undergone some independent evolution (Species B), and two "parent" organisms, Species A and C. Relatively long, contiguous regions of the genome of Species B will have undergone no recombination since the admixture event. These regions will have been contributed by either Species A (and hence will be more closely related to Species A) or Species C. We aim to estimate the proportion of the genome contributed by Species A, denoted γ, by considering the proportion of informative site patterns that indicate evidence for the two possible ancestries. The mixing proportion γ is the parameter of interest in our analyses. However, due to the classical problem of the non-identifiability of mixing parameters in multinomial distributions, we describe two Bayesian methods for estimating γ. Our first method places prior distributions on the parameters of the model, and uses Approximate Bayesian Computation (ABC) to estimate the marginal posterior distribution of γ. Our second, closely related method instead estimates the marginal posterior distribution of γ via numerical integration. We show via a simulation study that our methods can accurately estimate the true value of γ, and perform well under biologically reasonable scenarios. However, we also find that our methods suffer from a relatively small positive bias for small values of γ, i.e., when one of the parent species contributes very little to the genome of the hybrid species. We compare the performance of our method to the popular method of the ratio of f4 statistics.
We do this by estimating the proportion of Neanderthal ancestry in pre-ice age European human samples and comparing our results to the findings of Fu et al. [18]. We show that our method recovers extremely similar estimates of Neanderthal ancestry with no apparent systematic bias when compared to the results of Fu et al. Finally, we apply our method to the genomes of Late Pleistocene European bison (Bison bonasus) and Steppe bison (Bison priscus) to understand the evolutionary history of bovid megafauna in Europe over the last seventy thousand years. It was thought that before 10 kya the only bovid present in Europe was the Steppe bison. However, from bone samples dating from the present day back to approximately 70 kya, mtDNA indicated that a second bison species was also roaming Europe before 10 kya, more closely related to modern cattle than the Steppe bison. After nuclear DNA was sequenced, we were able to show that this new species of bovid was actually a hybrid offspring of Aurochs (the ancestor of modern cattle) and Steppe bison, an event that occurred approximately 120 kya. We used our method, in concert with the ratio of f4 statistics, to show that the hybrid species contained approximately 10% Aurochs and 90% Steppe bison ancestry. Thesis (Ph.D.) -- University of Adelaide, School of Mathematical Sciences, 201

    Exploring the integration of traditional and molecular epidemiological methods for infectious disease outbreaks

    BACKGROUND: Understanding the transmission dynamics of infectious pathogens is critical to developing effective public health strategies. Traditionally, time-consuming epidemiological methods were used, often limited by incomplete or inaccurate datasets. Novel phylogenetic techniques can determine transmission events, but have rarely been used in real-time outbreak settings to inform interventions and limit the impact of outbreaks. METHODS: I undertook a series of novel studies to explore the utility of combining phylogenetics with traditional epidemiological analysis to enhance the understanding of transmission dynamics. I investigated HIV in an endemic South African setting and Ebola in an acute outbreak in Sierra Leone. The strengths and limitations of this combined approach are explored, ethical issues investigated and recommendations made regarding the implications of this work for public health. RESULTS: Phylogenetics provides an exciting and synergistic tool to epidemiological analysis in outbreak investigation and control. These combined methods enable a more detailed understanding than is possible through either discipline alone. My key findings include: • Identification of infection source: Phylogenetics gives new insight into the role of external introductions (e.g. migrators) in driving and sustaining the high incidence of HIV. • Earlier identification of new emerging clusters: I identified a new cluster of HIV from around a mining community. This is one of the first examples of molecular methods detecting a previously unknown outbreak. • Identification of novel mechanisms of transmission: This work suggests that children may have been infected by playing in puddles contaminated with Ebola, a previously unrecognised route of transmission. CONCLUSION: The integration of these two methods facilitates sophisticated real-time techniques to maximise understanding of transmission dynamics, allowing faster and more effectively targeted interventions.
Moving forwards, sequence data should be incorporated into standard outbreak investigation. This is critical at a time when infectious disease outbreaks have led to some of the most significant global health threats of the recent past.

    Phylogenetics in the Genomic Era

    Molecular phylogenetics was born in the middle of the 20th century, when the advent of protein and DNA sequencing offered a novel way to study the evolutionary relationships between living organisms. The first 50 years of the discipline can be seen as a long quest for resolving power. The goal – reconstructing the tree of life – seemed to be unreachable, the methods were heavily debated, and the data limiting. Maybe for these reasons, even the relevance of the whole approach was repeatedly questioned, as part of the so-called molecules versus morphology debate. Controversies often crystalized around long-standing conundrums, such as the origin of land plants, the diversification of placental mammals, or the prokaryote/eukaryote divide. Some of these questions were resolved as gene and species samples increased in size. Over the years, molecular phylogenetics has gradually evolved from a brilliant, revolutionary idea to a mature research field centred on the problem of reliably building trees. This logical progression was abruptly interrupted in the late 2000s. High-throughput sequencing arose and the field suddenly moved into something entirely different. Access to genome-scale data profoundly reshaped the methodological challenges, while opening an amazing range of new application perspectives. Phylogenetics left the realm of systematics to occupy a central place in one of the most exciting research fields of this century – genomics. This is what this book is about: how we do trees, and what we do with trees, in the current phylogenomic era. One obvious, practical consequence of the transition to genome-scale data is that the most widely used tree-building methods, which are based on probabilistic models of sequence evolution, require intensive algorithmic optimization to be applicable to current datasets. 
This problem is considered in Part 1 of the book, which includes a general introduction to Markov models (Chapter 1.1) and a detailed description of how to optimally design and implement Maximum Likelihood (Chapter 1.2) and Bayesian (Chapter 1.4) phylogenetic inference methods. The importance of the computational aspects of modern phylogenomics is such that efficient software development is a major activity of numerous research groups in the field. We acknowledge this and have included seven "How to" chapters presenting recent updates of major phylogenomic tools – RAxML (Chapter 1.3), PhyloBayes (Chapter 1.5), MACSE (Chapter 2.3), Bgee (Chapter 4.3), RevBayes (Chapter 5.2), Beagle (Chapter 5.4), and BPP (Chapter 5.6). Genome-scale data sets are so large that statistical power, which had been the main limiting factor of phylogenetic inference during previous decades, is no longer a major issue. Massive data sets instead tend to amplify the signal they deliver – be it biological or artefactual – so that bias and inconsistency, instead of sampling variance, are the main problems with phylogenetic inference in the genomic era. Part 2 covers the issues of data quality and model adequacy in phylogenomics. Chapter 2.1 provides an overview of current practice and makes recommendations on how to avoid the more common biases. Two chapters review the challenges and limitations of two key steps of phylogenomic analysis pipelines, sequence alignment (Chapter 2.2) and orthology prediction (Chapter 2.4), which largely determine the reliability of downstream inferences. The performance of tree building methods is also the subject of Chapter 2.5, in which a new approach is introduced to assess the quality of gene trees based on their ability to correctly predict ancestral gene order. Analyses of multiple genes typically recover multiple, distinct trees. 
Maybe the biggest conceptual advance induced by the phylogenetic to phylogenomic transition is the suggestion that one should not simply aim to reconstruct “the” species tree, but rather to be prepared to make sense of forests of gene trees. Chapter 3.1 reviews the numerous reasons why gene trees can differ from each other and from the species tree, and what the implications are for phylogenetic inference. Chapter 3.2 focuses on gene trees/species trees reconciliation methods that account for gene duplication/loss and horizontal gene transfer among lineages. Incomplete lineage sorting is another major source of phylogenetic incongruence among loci, which recently gained attention and is covered by Chapter 3.3. Chapter 3.4 concludes this part by taking a user’s perspective and examining the pros and cons of concatenation versus separate analysis of gene sequence alignments. Modern genomics is comparative and phylogenetic methods are key to a wide range of questions and analyses relevant to the study of molecular evolution. This is covered by Part 4. We argue that genome annotation, either structural or functional, can only be properly achieved in a phylogenetic context. Chapters 4.1 and 4.2 review the power of these approaches and their connections with the study of gene function. Molecular substitution rates play a key role in our understanding of the prevalence of nearly neutral versus adaptive molecular evolution, and the influence of species traits on genome dynamics (Chapter 4.4). The analysis of substitution rates, and particularly the detection of positive selection, requires sophisticated methods and models of coding sequence evolution (Chapter 4.5). Phylogenomics also offers a unique opportunity to explore evolutionary convergence at a molecular level, thus addressing the long-standing question of predictability versus contingency in evolution (Chapter 4.6). 
The development of phylogenomics, as reviewed in Parts 1 through 4, has resulted in a powerful conceptual and methodological corpus, which is often reused for addressing problems of interest to biologists from other fields. Part 5 illustrates this application potential via three selected examples. Chapter 5.1 addresses the link between phylogenomics and palaeontology; i.e., how to optimally combine molecular and fossil data for estimating divergence times. Chapter 5.3 emphasizes the importance of the phylogenomic approach in virology and its potential to trace the origin and spread of infectious diseases in space and time. Finally, Chapter 5.5 recalls why phylogenomic methods and the multi-species coalescent model are key in addressing the problem of species delimitation – one of the major goals of taxonomy. It is hard to predict where phylogenomics as a discipline will stand in even 10 years. Maybe a novel technological revolution will bring it to yet another level? We strongly believe, however, that tree thinking will remain pivotal in the treatment and interpretation of the deluge of genomic data to come. Perhaps a prefiguration of the future of our field is provided by the daily monitoring of the current Covid-19 outbreak via the phylogenetic analysis of coronavirus genomic data in quasi real time – a topic of major societal importance, contemporary to the publication of this book, in which phylogenomics is instrumental in helping to fight disease

    Evolutionary genomics : statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.

    Evolutionary Genomics

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.

    Statistical Population Genomics

    This open access volume presents state-of-the-art inference methods in population genomics, focusing on data analysis based on rigorous statistical techniques. After introducing general concepts related to the biology of genomes and their evolution, the book covers state-of-the-art methods for the analysis of genomes in populations, including demography inference, population structure analysis and detection of selection, using both model-based inference and simulation procedures. Last but not least, it offers an overview of the current knowledge acquired by applying such methods to a large variety of eukaryotic organisms. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, pointers to the relevant literature, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls. Authoritative and cutting-edge, Statistical Population Genomics aims to promote and ensure successful applications of population genomic methods to an increasing number of model systems and biological questions