113 research outputs found

    Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms

    Genome assembly is a critical first step for biological discovery. All current sequencing technologies share the fundamental limitation that segments read from a genome are much shorter than even the smallest genomes. Traditionally, whole-genome shotgun (WGS) sequencing over-samples a single clonal (or inbred) target chromosome with segments from random positions. The amount of over-sampling is known as the coverage. Assembly software then reconstructs the target. So-called next-generation (or second-generation) sequencing has reduced the cost and increased throughput exponentially over first-generation sequencing. Unfortunately, next-generation sequences present their own challenges to genome assembly: (1) they require amplification of source DNA prior to sequencing, leading to artifacts and biased coverage of the genome; (2) they produce relatively short reads of 100-700 bp; (3) the sizeable runtime of most second-generation instruments is prohibitive for applications requiring rapid analysis, with an Illumina HiSeq 2000 instrument requiring 11 days for the sequencing reaction. Recently, successors to the second-generation instruments (third-generation) have become available. These instruments promise to alleviate many of the downsides of second-generation sequencing and can generate multi-kilobase sequences. The long sequences have the potential to dramatically improve genome and transcriptome assembly. However, the high error rate of these reads is challenging and has limited their use. To address this limitation, we introduce a novel correction algorithm and assembly strategy that utilizes shorter, high-identity sequences to correct the error in single-molecule sequences. Our approach achieves over 99% read accuracy and produces substantially better assemblies than current sequencing strategies. The availability of cheaper sequencing has made new sequencing targets, such as multiple displacement amplified (MDA) single cells and metagenomes, popular. Current algorithms assume assembly of a single clonal target, an assumption that is violated in these sequencing projects. We developed Bambus 2, a new scaffolder that works for metagenomic and single-cell datasets. It can accurately detect repeats without assumptions about the taxonomic composition of a dataset. It can also identify biological variations present in a sample. We have developed a novel end-to-end analysis pipeline leveraging Bambus 2. Due to its modular nature, it is applicable to clonal, metagenomic, and MDA single-cell targets and allows a user to rapidly go from sequences to assembly, annotation, genes, and taxonomic information. We have incorporated a novel viewer, allowing a user to interactively explore the variation present in a genomic project on a laptop. Together, these developments make genome assembly applicable to novel targets while utilizing emerging sequencing technologies. As genome assembly is critical for all aspects of bioinformatics, these developments will enable novel biological discovery.
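The notion of coverage mentioned in the abstract above can be made concrete with a small calculation. The sketch below uses the standard definition of average coverage (total sequenced bases divided by genome length); the function name and the example numbers are hypothetical and are not taken from the thesis.

```python
def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sampled: C = N * L / G."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical example: 2 million reads of 100 bp drawn from a 5 Mbp genome.
coverage = expected_coverage(2_000_000, 100, 5_000_000)
print(coverage)  # 40.0
```

This is only the average: amplification bias, as noted in the abstract, can make the actual per-base coverage far from uniform.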

    Towards more complete metagenomic analyses through circularized genomes and conjugative elements

    Advancements in sequencing technologies have revolutionized the biological sciences and led to the emergence of a number of research fields. One such field is metagenomics, the study of the genomic content of complex communities of bacteria. The goal of this thesis was to contribute computational methodology that can maximize the data generated in these studies and to apply these protocols to human and environmental metagenomic samples. Standard metagenomic analyses include a step for binning of assembled contigs, which has previously been shown to exclude mobile genetic elements, and I demonstrated that this phenomenon extends to all conjugative elements, which are a subset of mobile genetic elements. I proposed two separate methodologies that could detect contigs that are potential conjugative elements: a curated set of profile hidden Markov models that are very efficient to run, or annotation using the full UniRef90 database, a slower but more sensitive method. I then applied this framework to a large population-based cohort and to a study examining the association between the maternal human gut microbiota and the development of spina bifida. Broadly, the composition and abundances of conjugative elements were discriminatory between the age and geographic cohorts. In the spina bifida cohort, there was an enrichment of Campylobacter hominis and of a conjugative element belonging to Campylobacter hominis, which was excluded from the metagenomic bins. Next, I characterized a novel species belonging to the recently discovered manganese-oxidizing genus Manganitrophus growing on oil refinery carbon filters. I successfully circularized the genomes of three strains and obtained quality assemblies for the remaining two samples. Furthermore, I identified a previously uncharacterized conjugative plasmid belonging to the species using the framework I developed in chapter 2. Finally, I developed an assembly pipeline to perform a secondary assembly on binned assemblies using long reads. The secondary assemblies yielded a number of additional circularized sequences that would be useful as scaffolds in future metatranscriptomic, variation analysis, and community dynamics studies. The methodologies and applications in this thesis provide a framework for more complete metagenomic analyses going forward that will aid in our understanding of microbial ecology.

    Metagenomics : beyond the horizon of current implementations and methods

    The current field of metagenomics can be summarised by three main questions: "Who is in the metagenome?" (or "How complex is the metagenome?"), "What are they doing?" and "What is the difference between two metagenomes?". This research was dedicated to creating new methods and evaluating existing ones for answering these questions. This work is part of the research programme "Forensic Science" with project number 727.011.002, which is financed by the Dutch Research Council (NWO). LUMC / Geneeskund

    Inferring phylogenetic trees under the general Markov model via a minimum spanning tree backbone

    Phylogenetic trees are models of the evolutionary relationships among species, with species typically placed at the leaves of trees. We address the following problems regarding the calculation of phylogenetic trees. (1) Leaf-labeled phylogenetic trees may not be appropriate models of evolutionary relationships among rapidly evolving pathogens, which may contain ancestor-descendant pairs. (2) The widely used models of gene evolution unrealistically assume that the base composition of DNA sequences does not evolve. Regarding problem (1), we present a method for inferring generally labeled phylogenetic trees that allow sampled species to be placed at non-leaf nodes of the tree. Regarding problem (2), we present a structural expectation maximization method (SEM-GM) for inferring leaf-labeled phylogenetic trees under the general Markov model (GM), which is the most complex model of DNA substitution that allows the evolution of base composition. In order to improve the scalability of SEM-GM, we present a minimum spanning tree (MST) framework called MST-backbone. MST-backbone scales linearly with the number of leaves. However, the unrealistic location of the root as inferred on empirical data suggests that the GM model may be overtrained. MST-backbone was inspired by the topological relationship between MSTs and phylogenetic trees that was introduced by Choi et al. (2011). We discovered that the topological relationship does not necessarily hold if there is no unique MST. We propose so-called vertex-order based MSTs (VMSTs) that guarantee a topological relationship with phylogenetic trees.
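The remark above that a graph may have no unique MST can be illustrated on a toy example. The sketch below is a brute-force enumeration written for illustration only; it is not the MST-backbone or VMST method from the thesis, and the function name and the small edge lists are hypothetical.

```python
from itertools import combinations


def minimum_spanning_trees(num_vertices, edges):
    """Return (weight, trees) for all minimum-weight spanning trees.

    edges is a list of (u, v, weight) tuples. Every subset of
    num_vertices - 1 edges is checked, so this is only for tiny graphs.
    """
    best_weight, best_trees = None, []
    for subset in combinations(edges, num_vertices - 1):
        parent = list(range(num_vertices))  # union-find forest

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        acyclic = True
        for u, v, _ in subset:
            root_u, root_v = find(u), find(v)
            if root_u == root_v:  # this edge would close a cycle
                acyclic = False
                break
            parent[root_u] = root_v
        if not acyclic:
            continue  # n - 1 edges with a cycle cannot span the graph
        weight = sum(w for _, _, w in subset)
        if best_weight is None or weight < best_weight:
            best_weight, best_trees = weight, [subset]
        elif weight == best_weight:
            best_trees.append(subset)
    return best_weight, best_trees


# Triangle with tied weights: every pair of edges is a minimum spanning tree.
tied = [(0, 1, 1), (1, 2, 1), (0, 2, 1)]
# Triangle with distinct weights: the minimum spanning tree is unique.
distinct = [(0, 1, 1), (1, 2, 2), (0, 2, 3)]

print(len(minimum_spanning_trees(3, tied)[1]))      # 3
print(len(minimum_spanning_trees(3, distinct)[1]))  # 1
```

When pairwise distances between sequences are tied, as in the first graph, an algorithm that returns a single MST makes an arbitrary choice among several equally good trees.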

    Metagenomic Systems Biology of the Human Microbiome

    Résurrection du passé à l’aide de modèles hétérogènes d’évolution des séquences protéiques

    The molecular reconstruction and resurrection of ancestral proteins is the major issue tackled in this thesis manuscript. While fossil molecular data are almost nonexistent, phylogenetic methods make it possible to estimate the most likely ancestral protein sequences along a phylogenetic tree describing the relationships between extant sequences. With these ancestral sequences, several biological hypotheses can be tested, from the evolution of protein function to the inference of the ancient environments to which the ancestors were adapted. These probabilistic estimations of ancestral sequences depend on substitution models giving the probabilities of substitution between all pairs of amino acids. Classically, substitution models make the simplifying assumption that the evolutionary process remains homogeneous (constant) among sites of the multiple sequence alignment and between lineages. During the last decade, several methodological improvements were made, with the description of substitution models that account for the heterogeneity of the process among sites and in time. During my thesis, I developed new heterogeneous substitution models in Maximum Likelihood that were shown to fit the data better than any other homogeneous or heterogeneous model. I also demonstrated their better performance regarding the accuracy of ancestral sequence reconstruction. Using these models to reconstruct or resurrect ancestral proteins, my coworkers and I showed that adaptation to temperature is a major determinant of evolutionary rates in Archaea. Furthermore, we deciphered the nature of the phylogenetic signal informing substitution models to infer a non-parsimonious scenario for the adaptation to temperature during early life on Earth, with a non-hyperthermophilic last universal common ancestor living at lower temperatures than its two descendants, namely the bacterial and archaeal ancestors. Finally, we showed that the use of heterogeneous models makes it possible to improve the functionality of resurrected proteins, opening the way to a better understanding of the evolutionary mechanisms acting on biological sequences.