From millions to one: theoretical and concrete approaches to De Novo assembly using short read DNA sequences

Abstract

One of the most significant advances in biology has been the ability to sequence the DNA of organisms. Even after the completion of the human genome, intractable regions of the genome remain incomplete. Next-generation high-throughput short read sequencing technologies are now available and can generate millions of short read DNA sequences per run. Although greater coverage depths are possible, de novo sequence assembly with these shorter sequences is significantly more complex than resequencing; handling them presents new computational problems and opportunities. Identifying repetitive regions, coping with sequencing errors, and manipulating millions of short reads simultaneously are some of the difficulties that must be overcome. Motivated by these complexities, and working with short read sequences from the Waksman SOLiD sequencing platform, this work explores the problem of de novo assembly. First, we develop tools for filtering short read sequence data based on quality scores and find that this procedure is critical for the success of the subsequent de novo assembly. Next, we analyze the key phenomena responsible for producing contigs that are much shorter than theoretical estimates predict. Finally, we explore two different routes to circumventing the difficulty imposed by short contigs. The first uses information from multiple orthologous genomes in a comparative assembly. In particular, we developed a pipeline that uses the reference genome of a closely related species to improve genome assembly. The second approach uses paired read information to build scaffolds that are two orders of magnitude larger than the original contigs. For typical bacterial genomes, fewer than one hundred of these scaffolds are required to cover the entire genome. The combination of short reads from various platforms, assembly, and recovery pipelines brings mid-sized genomes close to completion.
As a result, minimal additional work using conventional sequencing technologies is enough to close the remaining small gaps and return a finished single genome. Current advancements in sequencing technologies leave us hopeful that it will be possible to provide fairly complete assemblies for complex genomes via these approaches.

Ph.D.
Includes bibliographical references
Includes vita
by Ariella Syma Sasso
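
The quality-based filtering step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's tool: the record layout, the function names `passes_quality` and `filter_reads`, and the thresholds `min_mean` and `min_base` are all assumptions chosen for illustration, and the toy reads are written in base space rather than the color space that SOLiD instruments produce.

```python
from statistics import mean


def passes_quality(quals, min_mean=20.0, min_base=10):
    """Return True if a read's quality values clear both thresholds.

    min_mean and min_base are hypothetical cutoffs, not values from the thesis.
    """
    if not quals:
        return False  # a read with no quality values cannot be trusted
    return mean(quals) >= min_mean and min(quals) >= min_base


def filter_reads(reads, min_mean=20.0, min_base=10):
    """Keep (read_id, sequence, quals) records that pass the quality check."""
    return [r for r in reads if passes_quality(r[2], min_mean, min_base)]


# Toy input: three reads with per-position quality scores.
reads = [
    ("r1", "ACGTAC", [30, 32, 28, 31, 29, 30]),  # uniformly high quality
    ("r2", "ACGTAC", [30, 5, 28, 31, 29, 30]),   # one very poor position
    ("r3", "ACGTAC", [12, 14, 11, 15, 13, 12]),  # low mean quality
]

kept = filter_reads(reads)
print([r[0] for r in kept])  # prints ['r1']
```

Combining a mean cutoff with a per-position floor is only one possible policy; trimming low-quality read ends instead of discarding whole reads is another common choice.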
