66 research outputs found
Algorithmic Advancements and Massive Parallelism for Large-Scale Datasets in Phylogenetic Bayesian Markov Chain Monte Carlo
Datasets used for the inference of the "tree of life" grow at unprecedented rates, thus inducing a high computational burden for analytic methods. First, we introduce a scalable software package that allows us to conduct state-of-the-art Bayesian analyses on datasets of almost arbitrary size. Second, we derive a proposal mechanism for MCMC that is substantially more efficient than traditional branch-length proposals. Third, we present an efficient algorithm for solving the rogue taxon problem.
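One family of branch-length proposals often used as the baseline in such comparisons is the classic multiplier ("proportional scaling") move. The sketch below applies it to a one-dimensional toy target; the target density, tuning constant, and all names are illustrative choices and are not taken from the thesis.

```python
import math
import random

# Toy example: sample a branch length b from an Exp(1) prior times a
# likelihood peaked near b = 0.1, using the classic multiplier proposal
# b' = b * exp(lambda * (u - 0.5)). Illustrative only.

def log_target(b):
    # log prior (Exp(1)) + toy log likelihood (Gaussian in log space)
    return -b - 0.5 * ((math.log(b) - math.log(0.1)) / 0.5) ** 2

def multiplier_step(b, lam=1.0):
    u = random.random()
    m = math.exp(lam * (u - 0.5))     # multiplier, always > 0
    b_new = b * m
    # For a multiplier proposal, the Hastings ratio is m itself
    log_alpha = log_target(b_new) - log_target(b) + math.log(m)
    if math.log(random.random()) < log_alpha:
        return b_new
    return b

random.seed(1)
b, samples = 0.5, []
for i in range(20000):
    b = multiplier_step(b)
    if i > 2000:                      # discard burn-in
        samples.append(b)
print(sum(samples) / len(samples))    # posterior mean, roughly 0.1
```

The key efficiency property of the multiplier move is that it rescales a strictly positive parameter without ever proposing invalid (negative) branch lengths.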
Models, Optimizations, and Tools for Large-Scale Phylogenetic Inference, Handling Sequence Uncertainty, and Taxonomic Validation
The concept of evolution is of central importance in modern biology. Phylogenetics, the study of the relationships and descent of organisms and species, therefore provides crucial clues for deciphering a wide range of biological processes. Phylogenetic trees matter for basic research, with applications ranging from studies of the diversification and environmental adaptation of individual groups of organisms (e.g., insects or birds) to the grand challenge of representing the origin and evolution of all life forms in a single comprehensive evolutionary tree (the so-called Tree of Life). Phylogenetic methods are also used in applied settings, for instance to understand the transmission dynamics of HIV infections or the heterogeneity of the cancer cells within a tumor.
The current state of the art in tree reconstruction comprises Maximum Likelihood (ML) and Bayesian Inference (BI) methods, which analyze molecular sequence data (DNA and proteins) under probabilistic models of evolution. These methods exhibit high computational complexity (the underlying optimization problems are NP-hard), which makes the development of efficient heuristics indispensable. In addition, evaluating the objective function (the so-called Phylogenetic Likelihood Function, PLF) requires a large amount of memory as well as a vast number of floating-point operations, and is therefore extremely compute-intensive.
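As a concrete illustration of the PLF, the following toy sketch evaluates a single alignment site on a three-taxon star tree under the Jukes-Cantor model using Felsenstein's pruning algorithm. The model choice, tree shape, and all names are ours; production codes vectorize this same recursion over millions of sites and partial likelihood vectors.

```python
import math

# Phylogenetic likelihood of one alignment site under Jukes-Cantor on
# a three-taxon star tree, via Felsenstein pruning. Illustrative only.

STATES = "ACGT"

def jc_prob(i, j, t):
    # Jukes-Cantor transition probability P(j | i, branch length t)
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def tip_vector(base):
    # Indicator vector for the observed base at a tip
    return [1.0 if s == base else 0.0 for s in STATES]

def site_likelihood(tip_bases, branch_lengths):
    # Conditional likelihoods at the root of a star tree:
    # L_root(x) = prod_k sum_y P(y | x, t_k) * L_tip_k(y)
    root = [1.0, 1.0, 1.0, 1.0]
    for base, t in zip(tip_bases, branch_lengths):
        tip = tip_vector(base)
        for xi, x in enumerate(STATES):
            root[xi] *= sum(jc_prob(x, y, t) * tip[yi]
                            for yi, y in enumerate(STATES))
    # Uniform stationary frequencies under Jukes-Cantor
    return 0.25 * sum(root)

lnl = math.log(site_likelihood("AAG", [0.1, 0.1, 0.2]))
print(lnl)
```

Because the inner loops are identical independent floating-point recurrences over sites and states, they map naturally onto the wide SIMD units discussed below.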
Recent advances in DNA sequencing (Next-Generation Sequencing, NGS) continuously increase throughput while lowering sequencing costs by orders of magnitude. For phylogenetics, this means that the dimensions of the datasets to be analyzed grow by an order of magnitude every two to three years. While it used to be common to analyze a few dozen to a few hundred species based on a single gene or a handful of genes (sequence length: 1–10 kilobases), studies with thousands of sequences or genes are no longer rare. Within the next one to two years, analyses of thousands to tens of thousands of complete genomes or transcriptomes (sequence length: 1–100 megabases and beyond) can be expected. To meet these challenges, existing methods must be further developed and optimized, above all to make optimal use of supercomputers and novel hardware architectures.
Furthermore, the accelerating deposition of sequences in public databases such as NCBI GenBank (and its derivatives) means that high quality of sequence annotations (e.g., organism or species name, taxonomic classification, gene name, etc.) is no longer necessarily guaranteed. Among other reasons, this is because timely curation by the respective experts is no longer feasible as long as they lack adequate software tools.
In this doctoral thesis, we make several contributions to addressing the challenges outlined above.
First, we ported ExaML, a dedicated software package for ML-based tree reconstruction on supercomputers, to the Intel Xeon Phi hardware accelerator. Compared to classical x86 CPUs, the Xeon Phi offers higher compute performance, which can, however, only be fully exploited via architecture-specific optimizations. We therefore restructured and optimized the PLF computation for the 512-bit vector unit of the Xeon Phi, and replaced the existing pure MPI parallelization in ExaML with a hybrid MPI/OpenMP solution. This hybrid solution scales substantially better to large numbers of cores and threads within a single compute node (>100 hardware threads on the Xeon Phi).
Furthermore, we developed a new tool for ML-based tree inference called RAxML-NG. Apart from minor adjustments, it implements the same search algorithm as the widely used RAxML program, but offers several advantages over it: (a) thanks to careful optimization of the PLF computation, runtimes were reduced by a factor of two to three; (b) scalability on extremely large input datasets was improved by eliminating or optimizing inefficient topological operations; (c) features relevant for large datasets that were previously only available in ExaML, such as checkpointing and a dedicated data distribution algorithm, were re-implemented; (d) users can choose from a larger set of statistical models of DNA evolution, which can also be combined and parameterized more flexibly; and (e) future development is substantially simplified by the modular architecture (the PLF computation routines were factored out into a separate library).
Next, we investigated how sequencing errors affect the accuracy of phylogenetic tree reconstruction. We modified the RAxML and RAxML-NG code such that error probabilities can either be specified explicitly or estimated automatically from the data via the ML method. Our simulations show that: (a) if the errors are uniformly distributed, the error rate can be estimated directly from the sequence data; and (b) for error rates above approximately 1%, tree reconstruction under the error model yields more accurate results than the classical approach, which assumes error-free input.
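The uniform error model described above can be sketched by modifying the tip likelihood vectors: an observed base b gets weight 1 − ε, every other base ε/3, and the pruning recursion is otherwise unchanged. The two-taxon toy below then recovers ε with a crude grid-search ML estimate; all data and parameter choices are illustrative, not the RAxML-NG implementation.

```python
import math

# Error-aware tips under Jukes-Cantor on a two-taxon tree: the observed
# base arose from the true base with probability 1 - eps, else eps / 3.
# Toy example; names and the grid search are illustrative.

STATES = "ACGT"

def jc_prob(i, j, t):
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def tip_vector(base, eps):
    # Error-aware tip likelihoods instead of a 0/1 indicator
    return [1.0 - eps if s == base else eps / 3.0 for s in STATES]

def pair_likelihood(b1, b2, t, eps):
    # Two tips joined by a branch of length t:
    # L = sum_x pi(x) * L1(x) * sum_y P(y | x, t) * L2(y)
    v1, v2 = tip_vector(b1, eps), tip_vector(b2, eps)
    lik = 0.0
    for xi, x in enumerate(STATES):
        inner = sum(jc_prob(x, y, t) * v2[yi] for yi, y in enumerate(STATES))
        lik += 0.25 * v1[xi] * inner
    return lik

def estimate_eps(pairs, t):
    # Crude grid-search ML estimate of the uniform error rate
    grid = [i / 1000.0 for i in range(0, 300)]
    def lnl(eps):
        return sum(math.log(pair_likelihood(a, b, t, eps)) for a, b in pairs)
    return max(grid, key=lnl)

# Mostly identical base pairs with ~10% mismatches on a very short branch
data = [("A", "A")] * 90 + [("A", "C")] * 10
print(estimate_eps(data, t=0.001))
```

On this toy input the estimate lands near 5%, since each of the two reads contributes roughly half of the observed 10% mismatch rate.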
A further contribution of this thesis is SATIVA, a software pipeline for the computer-assisted identification and correction of erroneous taxonomic annotations in large sequence databases. The algorithm works as follows: for each sequence, the placement in the tree with the highest likelihood score is determined, and it is then checked whether this placement agrees with the given taxonomic classification. If it does not, for instance because a sequence is placed within a different genus, the sequence is reported as mislabeled and a corresponding reclassification is suggested. On simulated datasets with randomly introduced errors, our pipeline achieved a high identification rate (>90%) as well as high precision (>95%). For an evaluation on empirical data, we examined four public rRNA databases that are frequently used as references for the classification of bacteria. Depending on the database, we identified between 0.2% and 2.5% of all sequences as potentially mislabeled.
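The placement-versus-annotation check at the heart of this pipeline can be sketched as follows. SATIVA itself uses maximum-likelihood placement on a reference tree; here a Hamming-distance nearest neighbor stands in as a toy placement engine, so all data and names are illustrative.

```python
# Toy SATIVA-style consistency check: "place" each query among the
# references and compare the taxonomy of its best placement with its
# recorded annotation. Illustrative stand-in, not the real pipeline.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def best_placement(query, references):
    # references: {seq_id: (sequence, genus)}
    return min(references, key=lambda rid: hamming(query, references[rid][0]))

def flag_mislabels(queries, references):
    flagged = []
    for qid, (seq, genus) in queries.items():
        hit = best_placement(seq, references)
        placed_genus = references[hit][1]
        if placed_genus != genus:
            # Report the sequence and propose a relabeling
            flagged.append((qid, genus, placed_genus))
    return flagged

refs = {
    "r1": ("ACGTACGT", "Bacillus"),
    "r2": ("TTGTACCA", "Clostridium"),
}
queries = {
    "q1": ("ACGTACGA", "Bacillus"),   # consistent with its annotation
    "q2": ("TTGTACCT", "Bacillus"),   # places next to Clostridium -> flagged
}
print(flag_mislabels(queries, refs))  # [('q2', 'Bacillus', 'Clostridium')]
```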
Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data
Spring 2021. Includes bibliographical references. Genetic sequence classification is the task of assigning a known genetic label to an unknown genetic sequence. Often, this is the first step in genetic sequence analysis and is critical to understanding data produced by molecular techniques like high throughput sequencing. Here, we explore an algorithm called naive Bayes that was historically successful in classifying 16S ribosomal gene sequences for microbiome analysis. We extend the naive Bayes classifier to perform the task of general sequence classification by leveraging advancements in computational parallelism and the statistical distributions that underlie naive Bayes. In Chapter 2, we show that our implementation of naive Bayes, called WarpNL, performs within a margin of error of modern classifiers like Kraken2 and local alignment. We discuss five crucial aspects of genetic sequence classification and show how these areas affect classifier performance: the query data, the reference sequence database, the feature encoding method, the classification algorithm, and access to computational resources. In Chapter 3, we cover the critical computational advancements introduced in WarpNL that make it efficient in a modern computing framework. This includes efficient feature encoding, introduction of a log-odds ratio for comparison of naive Bayes posterior estimates, description of schema for parallel and distributed naive Bayes architectures, and use of machine learning classifiers to perform outgroup sequence classification. Finally, in Chapter 4, we explore a variant of the Dirichlet multinomial distribution that underlies the naive Bayes likelihood, called the beta-Liouville multinomial. We show that the beta-Liouville multinomial can be used to enhance classifier performance, and we provide mathematical proofs regarding its convergence during maximum likelihood estimation.
Overall, this work explores the naive Bayes algorithm in a modern context and shows that it is competitive for genetic sequence classification.
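A minimal version of the naive Bayes sequence classifier discussed above can be written over k-mer counts. The pseudocount smoothing, k = 2, and all names below are our illustrative choices rather than WarpNL's actual feature encoding.

```python
import math
from collections import Counter
from itertools import product

# Naive Bayes over k-mer count features for DNA sequences. Toy sketch:
# add-one smoothing, uniform class prior, k = 2. Illustrative only.

K = 2
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def kmer_counts(seq):
    return Counter(seq[i:i + K] for i in range(len(seq) - K + 1))

def train(labeled_seqs):
    # labeled_seqs: list of (sequence, label)
    # Returns per-class log P(kmer | class) with add-one smoothing
    model = {}
    for label in {lab for _, lab in labeled_seqs}:
        counts = Counter()
        for seq, lab in labeled_seqs:
            if lab == label:
                counts.update(kmer_counts(seq))
        total = sum(counts.values()) + len(KMERS)
        model[label] = {km: math.log((counts[km] + 1) / total) for km in KMERS}
    return model

def log_posteriors(model, seq):
    # Unnormalized naive Bayes log score per class (uniform prior)
    counts = kmer_counts(seq)
    return {label: sum(n * logp[km] for km, n in counts.items())
            for label, logp in model.items()}

train_data = [("ACGTACGTACGT", "genA"), ("TTTTGGGGTTTT", "genB")]
model = train(train_data)
scores = log_posteriors(model, "ACGTACGAACGT")
best = max(scores, key=scores.get)
print(best)  # "genA"
```

A two-class log-odds ratio, as used for comparing posterior estimates, is then simply the difference of the two log scores.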
The prospects of quantum computing in computational molecular biology
Quantum computers can in principle solve certain problems exponentially more
quickly than their classical counterparts. We have not yet reached the advent
of useful quantum computation, but when we do, it will affect nearly all
scientific disciplines. In this review, we examine how current quantum
algorithms could revolutionize computational biology and bioinformatics. There
are potential benefits across the entire field, from the ability to process
vast amounts of information and run machine learning algorithms far more
efficiently, to algorithms for quantum simulation that are poised to improve
computational calculations in drug discovery, to quantum algorithms for
optimization that may advance fields from protein structure prediction to
network analysis. However, these exciting prospects are susceptible to "hype",
and it is also important to recognize the caveats and challenges in this new
technology. Our aim is to introduce the promise and limitations of emerging
quantum computing technologies in the areas of computational molecular biology
and bioinformatics. (23 pages, 3 figures)
Phylogenetics in the Genomic Era
Molecular phylogenetics was born in the middle of the 20th century, when the advent of protein and DNA sequencing offered a novel way to study the evolutionary relationships between living organisms. The first 50 years of the discipline can be seen as a long quest for resolving power. The goal – reconstructing the tree of life – seemed to be unreachable, the methods were heavily debated, and the data limiting. Maybe for these reasons, even the relevance of the whole approach was repeatedly questioned, as part of the so-called molecules versus morphology debate. Controversies often crystalized around long-standing conundrums, such as the origin of land plants, the diversification of placental mammals, or the prokaryote/eukaryote divide. Some of these questions were resolved as gene and species samples increased in size. Over the years, molecular phylogenetics has gradually evolved from a brilliant, revolutionary idea to a mature research field centred on the problem of reliably building trees.
This logical progression was abruptly interrupted in the late 2000s. High-throughput sequencing arose and the field suddenly moved into something entirely different. Access to genome-scale data profoundly reshaped the methodological challenges, while opening an amazing range of new application perspectives. Phylogenetics left the realm of systematics to occupy a central place in one of the most exciting research fields of this century – genomics. This is what this book is about: how we do trees, and what we do with trees, in the current phylogenomic era.
One obvious, practical consequence of the transition to genome-scale data is that the most widely used tree-building methods, which are based on probabilistic models of sequence evolution, require intensive algorithmic optimization to be applicable to current datasets. This problem is considered in Part 1 of the book, which includes a general introduction to Markov models (Chapter 1.1) and a detailed description of how to optimally design and implement Maximum Likelihood (Chapter 1.2) and Bayesian (Chapter 1.4) phylogenetic inference methods. The importance of the computational aspects of modern phylogenomics is such that efficient software development is a major activity of numerous research groups in the field. We acknowledge this and have included seven "How to" chapters presenting recent updates of major phylogenomic tools – RAxML (Chapter 1.3), PhyloBayes (Chapter 1.5), MACSE (Chapter 2.3), Bgee (Chapter 4.3), RevBayes (Chapter 5.2), Beagle (Chapter 5.4), and BPP (Chapter 5.6).
Genome-scale data sets are so large that statistical power, which had been the main limiting factor of phylogenetic inference during previous decades, is no longer a major issue. Massive data sets instead tend to amplify the signal they deliver – be it biological or artefactual – so that bias and inconsistency, instead of sampling variance, are the main problems with phylogenetic inference in the genomic era. Part 2 covers the issues of data quality and model adequacy in phylogenomics. Chapter 2.1 provides an overview of current practice and makes recommendations on how to avoid the more common biases. Two chapters review the challenges and limitations of two key steps of phylogenomic analysis pipelines, sequence alignment (Chapter 2.2) and orthology prediction (Chapter 2.4), which largely determine the reliability of downstream inferences. The performance of tree building methods is also the subject of Chapter 2.5, in which a new approach is introduced to assess the quality of gene trees based on their ability to correctly predict ancestral gene order.
Analyses of multiple genes typically recover multiple, distinct trees. Maybe the biggest conceptual advance induced by the phylogenetic to phylogenomic transition is the suggestion that one should not simply aim to reconstruct “the” species tree, but rather to be prepared to make sense of forests of gene trees. Chapter 3.1 reviews the numerous reasons why gene trees can differ from each other and from the species tree, and what the implications are for phylogenetic inference. Chapter 3.2 focuses on gene trees/species trees reconciliation methods that account for gene duplication/loss and horizontal gene transfer among lineages. Incomplete lineage sorting is another major source of phylogenetic incongruence among loci, which recently gained attention and is covered by Chapter 3.3. Chapter 3.4 concludes this part by taking a user’s perspective and examining the pros and cons of concatenation versus separate analysis of gene sequence alignments.
Modern genomics is comparative and phylogenetic methods are key to a wide range of questions and analyses relevant to the study of molecular evolution. This is covered by Part 4. We argue that genome annotation, either structural or functional, can only be properly achieved in a phylogenetic context. Chapters 4.1 and 4.2 review the power of these approaches and their connections with the study of gene function. Molecular substitution rates play a key role in our understanding of the prevalence of nearly neutral versus adaptive molecular evolution, and the influence of species traits on genome dynamics (Chapter 4.4). The analysis of substitution rates, and particularly the detection of positive selection, requires sophisticated methods and models of coding sequence evolution (Chapter 4.5). Phylogenomics also offers a unique opportunity to explore evolutionary convergence at a molecular level, thus addressing the long-standing question of predictability versus contingency in evolution (Chapter 4.6).
The development of phylogenomics, as reviewed in Parts 1 through 4, has resulted in a powerful conceptual and methodological corpus, which is often reused for addressing problems of interest to biologists from other fields. Part 5 illustrates this application potential via three selected examples. Chapter 5.1 addresses the link between phylogenomics and palaeontology; i.e., how to optimally combine molecular and fossil data for estimating divergence times. Chapter 5.3 emphasizes the importance of the phylogenomic approach in virology and its potential to trace the origin and spread of infectious diseases in space and time. Finally, Chapter 5.5 recalls why phylogenomic methods and the multi-species coalescent model are key in addressing the problem of species delimitation – one of the major goals of taxonomy.
It is hard to predict where phylogenomics as a discipline will stand in even 10 years. Maybe a novel technological revolution will bring it to yet another level? We strongly believe, however, that tree thinking will remain pivotal in the treatment and interpretation of the deluge of genomic data to come. Perhaps a prefiguration of the future of our field is provided by the daily monitoring of the current Covid-19 outbreak via the phylogenetic analysis of coronavirus genomic data in quasi real time – a topic of major societal importance, contemporary to the publication of this book, in which phylogenomics is instrumental in helping to fight disease.
Evolutionary genomics: statistical and computational methods
This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.
Fundamentals
Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters
White Paper 11: Artificial intelligence, robotics & data science
198 p. : 17 cm. This CSIC white paper on Artificial Intelligence, Robotics and Data Science sketches a preliminary roadmap for addressing current R&D challenges associated with automated and autonomous machines. More than 50 research challenges investigated all over Spain by more than 150 experts within CSIC are presented in eight chapters. Chapter One introduces key concepts and tackles the issue of integrating knowledge (representation), reasoning and learning in the design of artificial entities. Chapter Two analyses challenges associated with the development of theories – and supporting technologies – for modelling the behaviour of autonomous agents; specifically, it pays attention to the interplay between elements at the micro level (individual autonomous agent interactions) and the macro world (the properties we seek in large and complex societies). Chapter Three discusses the variety of data science applications currently used in all fields of science, paying particular attention to Machine Learning (ML) techniques, while Chapter Four presents current developments in various areas of robotics. Chapter Five explores the challenges associated with computational cognitive models. Chapter Six pays attention to the ethical, legal, economic and social challenges coming alongside the development of smart systems. Chapter Seven engages with the problem of the environmental sustainability of deploying intelligent systems at large scale. Finally, Chapter Eight deals with the complexity of ensuring the security, safety, resilience and privacy protection of smart systems against cyber threats.
Topic Coordinators: Sara Degli Esposti (IPP-CCHS, CSIC) and Carles Sierra (IIIA, CSIC).
Challenge 1 – Integrating Knowledge, Reasoning and Learning: Felip Manyà (IIIA, CSIC) and Adrià Colomé (IRI, CSIC – UPC).
Challenge 2 – Multiagent Systems: N. Osman (IIIA, CSIC) and D. López (IFS, CSIC).
Challenge 3 – Machine Learning and Data Science: J. J. Ramasco Sukia (IFISC) and L. Lloret Iglesias (IFCA, CSIC).
Challenge 4 – Intelligent Robotics: G. Alenyà (IRI, CSIC – UPC) and J. Villagra (CAR, CSIC).
Challenge 5 – Computational Cognitive Models: M. D. del Castillo (CAR, CSIC) and M. Schorlemmer (IIIA, CSIC).
Challenge 6 – Ethical, Legal, Economic, and Social Implications: P. Noriega (IIIA, CSIC) and T. Ausín (IFS, CSIC).
Challenge 7 – Low-Power Sustainable Hardware for AI: T. Serrano (IMSE-CNM, CSIC – US) and A. Oyanguren (IFIC, CSIC – UV).
Challenge 8 – Smart Cybersecurity: D. Arroyo Guardeño (ITEFI, CSIC) and P. Brox Jiménez (IMSE-CNM, CSIC – US).
Peer reviewed.