
    Assessment of Next Generation Sequencing Technologies for De novo and Hybrid Assemblies of Challenging Bacterial Genomes

    In the past decade, tremendous progress has been made in DNA sequencing methodologies in terms of throughput, speed, and read length, along with a sharp decrease in per-base cost. These technologies, commonly referred to as next-generation sequencing (NGS), are complemented by the development of hybrid assembly approaches that can utilize multiple NGS platforms. In the first part of my dissertation I performed systematic evaluations and optimizations of nine de novo and hybrid assembly protocols across four novel microbial genomes. While each had strengths and weaknesses, optimization using multiple strategies yielded dramatic improvements in overall assembly size and quality. To select the best assembly, I also proposed a novel rDNA operon validation approach for evaluating assembly accuracy. Additionally, I investigated the third-generation PacBio sequencing platform and achieved automated finishing of the Clostridium autoethanogenum genome without any accessory data. These complete genome sequences facilitated comparisons that revealed rDNA operons as a major limitation for short-read technologies, and also enabled comparative and functional genomics analyses. To facilitate future assessments of NGS technologies and algorithm development, we publicly released the sequence datasets for C. autoethanogenum, which span three generations of sequencing technologies and contain six types of data from four NGS platforms. To assess the limitations of NGS technologies, unassembled regions within Illumina and PacBio assemblies were examined across eight microbial genomes. This analysis confirmed rDNA operons as major breakpoints within Illumina assemblies, whereas gaps within PacBio assemblies appear to arise from as-yet unexplained events, with assembly quality being the cumulative effect of read depth, read quality, sample DNA quality, and the presence of phage DNA or mobile genetic elements. In a final collaborative study, an enrichment protocol was applied to isolate live endophytic bacteria from roots of the tree Populus deltoides. This protocol achieved a significant reduction in contaminating plant DNA and enabled the use of these samples for single-cell genomics analysis for the first time. Whole-genome sequencing of selected single-cell genomes was performed, assembly and contamination removal were optimized, and bioinformatic, phylogenetic, and comparative genomics analyses followed to identify unique characteristics of these uncultured microorganisms.
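    The dissertation does not spell out its evaluation metrics here, but the standard contiguity statistic for comparing overall assembly size and quality is N50. A minimal Python sketch, illustrative only and not the dissertation's actual pipeline:

        def n50(contig_lengths):
            """Return the N50: the length L such that contigs of length >= L
            together cover at least half of the total assembly size."""
            total = sum(contig_lengths)
            running = 0
            for length in sorted(contig_lengths, reverse=True):
                running += length
                if running * 2 >= total:
                    return length
            return 0

        # Example: nine contigs from a toy assembly
        print(n50([100, 200, 300, 400, 500, 600, 700, 800, 900]))  # -> 700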

    MICRO$EC: Cost-Effective, Whole-Genome Sequencing

    While the feasibility of whole human genome sequencing was proven by the success of the Human Genome Project several years ago, the prevalence of personal genome sequencing in the medical industry is still elusive due to its unrealistic cost and time requirements. Micro$eq is a startup company with the goal of overcoming these limitations by sequencing a minimum of 12 complete human genomes per day, at an error rate of less than ten parts per million, at a profitable market price of less than US$1,000 per genome. To overcome the technology bottlenecks hindering current biotech companies from achieving these target throughput, error rate, and market price goals, Micro$eq has developed an innovative sequencing technique that uses short read fragments with high coverage on a microfluidics platform. Short, amplified DNA fragments are generated from an input of customer saliva. Six-base-pair (6 bp) sequence hybridization is used to sequence each of the DNA fragments individually. These hybridization reads are then assembled via de Bruijn graph theory, and the graphical reconstructions of each fragment's sequence are then assembled into a complete genome via shotgun sequencing with an expected error rate of less than 1 in 100,000 bp. Upon completion of financial analysis, both a small-scale business model producing 72 genomes per day at US$999 per genome and a large-scale business model producing 52.2 genomes per year at a market price of US$299 per genome were found to be profitable, yielding Micro$eq investors return margins of ~90% and 300% for the small- and large-scale models, respectively. At this market price, Micro$eq offers personal genome sequencing at one-tenth of its nearest potential competitor's cost. Additionally, its capacity for bulk sequencing allows it to profitably venture into the previously untapped pharmaceutical industry market sector, enabling the creation of large-scale genome databases, which are the next step forward in the quest for truly personalized medicine.
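    The de Bruijn assembly step described above can be illustrated with a toy Python sketch. It assumes error-free, fully covering 6 bp hybridization reads and a non-branching graph; real data would require an Eulerian-path search and error handling, and Micro$eq's actual algorithm is not described beyond this abstract:

        from collections import defaultdict

        def debruijn_reconstruct(kmers, k=6):
            """Rebuild a fragment by walking a de Bruijn graph of its k-mers."""
            graph = defaultdict(list)              # (k-1)-mer prefix -> suffixes
            for kmer in kmers:
                graph[kmer[:k-1]].append(kmer[1:])
            # Start from a prefix that never appears as a suffix.
            suffixes = {s for outs in graph.values() for s in outs}
            start = next(p for p in graph if p not in suffixes)
            sequence = start
            while graph[sequence[-(k-1):]]:
                sequence += graph[sequence[-(k-1):]].pop()[-1]
            return sequence

        fragment = "ACGTACGGTA"
        reads = [fragment[i:i+6] for i in range(len(fragment) - 5)]
        print(debruijn_reconstruct(reads))         # -> ACGTACGGTA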

    Focus: A Graph Approach for Data-Mining and Domain-Specific Assembly of Next Generation Sequencing Data

    Next Generation Sequencing (NGS) has emerged as a key technology leading to revolutionary breakthroughs in numerous biomedical research areas. These technologies produce millions to billions of short DNA reads, each representing a small fraction of the original target DNA sequence. These short reads contain little information individually but are produced at a high coverage of the original sequence such that many reads overlap. Overlap relationships allow the reads to be linearly ordered and merged by computational programs called assemblers into long stretches of contiguous sequence called contigs, which can be used for research applications. Although the assembly of the reads produced by NGS remains a difficult task, it is the process of extracting useful knowledge from these relatively short sequences that has become one of the most exciting and challenging problems in bioinformatics. The assembly of short reads is an aggregative process in which critical information is lost as reads are merged into contigs. In addition, the assembly process is treated as a black box, with generic assembler tools that do not adapt to input data set characteristics. Finally, as NGS data throughput continues to increase, there is an increasing need for smart parallel assembler implementations. In this dissertation, a new assembly approach called Focus is proposed. Unlike previous assemblers, Focus relies on a novel hybrid graph constructed from multiple graphs at different levels of granularity to represent the assembly problem, facilitating information capture and dynamic adjustment to input data set characteristics. This work is composed of four specific aims: 1) the implementation of a robust assembly and analysis tool built on the hybrid graph platform; 2) the development and application of graph mining to extract biologically relevant features in NGS data sets; 3) the integration of domain-specific knowledge to improve the assembly and analysis process; and 4) the construction of smart parallel computing approaches, including the application of energy-aware computing for NGS assembly and knowledge integration to improve algorithm performance. In conclusion, this dissertation presents a complete parallel assembler called Focus that is capable of extracting biologically relevant features directly from its hybrid assembly graph.
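    Focus itself is built on a hybrid multi-granularity graph that the abstract does not detail. As a point of reference, the basic suffix-prefix overlap-and-merge step underlying all assemblers can be sketched in Python; this greedy toy version is not the Focus algorithm:

        def overlap(a, b, min_len=3):
            """Length of the longest suffix of a that matches a prefix of b."""
            for ln in range(min(len(a), len(b)), min_len - 1, -1):
                if a.endswith(b[:ln]):
                    return ln
            return 0

        def greedy_assemble(reads):
            """Repeatedly merge the pair of reads with the longest overlap."""
            reads = list(reads)
            while len(reads) > 1:
                ln, a, b = max(((overlap(x, y), x, y)
                                for x in reads for y in reads if x != y),
                               key=lambda t: t[0])
                if ln == 0:
                    break                          # no remaining overlaps
                reads.remove(a)
                reads.remove(b)
                reads.append(a + b[ln:])           # merge into a longer contig
            return reads

        print(greedy_assemble(["GTACGG", "ACGTAC", "TACGGTA"]))  # -> ['ACGTACGGTA']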

    Impacts of Genome Sequencing Technologies: Developing Educational Materials to Create Greater Public Awareness

    Educators often use online material to aid in student learning. After reviewing academic and corporate sources, the team determined the primary effects of advances in genome sequencing technologies. Survey responses indicated that many people are not familiar with these effects. Consequently, the team developed a series of educational videos and supplemental assignments that introduce genome sequencing technologies, as well as their worldwide impacts on legislation, the economy, forensics, health care, and biological research.

    Bioinformatic approaches for genome finishing

    Husemann P, Tauch A. Bioinformatic approaches for genome finishing. Bielefeld: Universitätsbibliothek Bielefeld; 2011.

    Improved sequence-read simulation for (meta)genomics

    There are many programs available for generating simulated whole-genome shotgun sequence reads. The data generated by many of these programs follow predefined models, which limits their use to the authors' original intentions. For example, many models assume that read lengths follow a uniform or normal distribution. Other programs generate models from actual sequencing data, but are limited to reads from single-genome studies. To our knowledge, no existing program allows a user to generate simulated data for metagenomics applications following empirical read-length distributions and quality profiles derived from actual sequencing data. We present BEAR (Better Emulation for Artificial Reads), a program that uses a machine-learning approach to generate reads with lengths and quality values that closely match empirically derived distributions. BEAR can emulate reads from various sequencing platforms, including Illumina, 454, and Ion Torrent. BEAR requires minimal user input, as it automatically determines appropriate parameter settings from user-supplied data. BEAR also uses a unique method for deriving run-specific error rates and extracts useful statistics, such as quality-error models, from the metagenomic data itself. Many existing simulators are specific to a particular sequencing technology; BEAR is not restricted in this way. Because of its flexibility, BEAR is particularly useful for emulating the behaviour of technologies like Ion Torrent, for which no dedicated sequencing simulators are currently available. BEAR is also the first metagenomic sequencing simulator program that automates the process of generating abundances, which can be an arduous task. BEAR is useful for evaluating data processing tools in genomics. It has many advantages over existing comparable software, such as generating more realistic reads and being independent of sequencing technology, and has features particularly useful for metagenomics work.
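    BEAR's central idea, deriving simulation parameters from real data rather than a predefined model, can be shown in miniature: draw read lengths from the length distribution observed in an actual run. The Python sketch below is hypothetical and omits the learned quality and error models that BEAR adds on top:

        import random

        def simulate_reads(reference, observed_lengths, n_reads, seed=1):
            """Sample reads whose lengths follow an empirical distribution."""
            rng = random.Random(seed)
            reads = []
            for _ in range(n_reads):
                length = rng.choice(observed_lengths)   # empirical length draw
                start = rng.randrange(len(reference) - length + 1)
                reads.append(reference[start:start + length])
            return reads

        reference = "ACGT" * 250                        # toy 1 kb genome
        observed = [87, 102, 110, 145, 98, 120, 87, 99] # lengths from a real run
        for read in simulate_reads(reference, observed, 3):
            print(len(read), read[:20] + "...")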

    Genome sequencing of bacteria: sequencing, de novo assembly and rapid analysis using open source tools


    Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms

    Genome assembly is a critical first step for biological discovery. All current sequencing technologies share the fundamental limitation that the segments read from a genome are much shorter than even the smallest genomes. Traditionally, whole-genome shotgun (WGS) sequencing over-samples a single clonal (or inbred) target chromosome with segments from random positions. The amount of over-sampling is known as the coverage. Assembly software then reconstructs the target. So-called next-generation (or second-generation) sequencing has reduced the cost and increased the throughput exponentially over first-generation sequencing. Unfortunately, next-generation sequences present their own challenges to genome assembly: (1) they require amplification of source DNA prior to sequencing, leading to artifacts and biased coverage of the genome; (2) they produce relatively short reads of 100 bp to 700 bp; (3) the sizeable runtime of most second-generation instruments is prohibitive for applications requiring rapid analysis, with an Illumina HiSeq 2000 instrument requiring 11 days for the sequencing reaction. Recently, successors to the second-generation instruments (third-generation) have become available. These instruments promise to alleviate many of the downsides of second-generation sequencing and can generate multi-kilobase sequences. The long sequences have the potential to dramatically improve genome and transcriptome assembly, but their high error rate is challenging and has limited their use. To address this limitation, we introduce a novel correction algorithm and assembly strategy that utilizes shorter, high-identity sequences to correct the errors in single-molecule sequences. Our approach achieves over 99% read accuracy and produces substantially better assemblies than current sequencing strategies. The availability of cheaper sequencing has made new sequencing targets, such as multiple displacement amplified (MDA) single cells and metagenomes, popular. Current algorithms assume assembly of a single clonal target, an assumption that is violated in these sequencing projects. We developed Bambus 2, a new scaffolder that works for metagenomic and single-cell datasets. It can accurately detect repeats without assumptions about the taxonomic composition of a dataset, and it can identify biological variations present in a sample. We have developed a novel end-to-end analysis pipeline leveraging Bambus 2. Due to its modular nature, it is applicable to clonal, metagenomic, and MDA single-cell targets and allows a user to rapidly go from sequences to assembly, annotation, genes, and taxonomic information. We have incorporated a novel viewer, allowing a user to interactively explore the variation present in a genomic project on a laptop. Together, these developments make genome assembly applicable to novel targets while utilizing emerging sequencing technologies. As genome assembly is critical for all aspects of bioinformatics, these developments will enable novel biological discovery.
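    Two quantitative notions in this abstract reduce to simple arithmetic: coverage as over-sampling, and repeat detection from read depth. The Python sketch below flags repeats with an assumed 2x-mean depth threshold; Bambus 2's actual repeat detection uses richer graph statistics:

        from statistics import mean

        def coverage(num_reads, read_length, genome_size):
            """Average over-sampling: total sequenced bases per genome base."""
            return num_reads * read_length / genome_size

        def flag_repeats(contig_depths, factor=2.0):
            """Flag contigs whose depth is well above the assembly mean."""
            avg = mean(contig_depths.values())
            return [name for name, depth in contig_depths.items()
                    if depth > factor * avg]

        # 40M 100 bp reads over-sample a 3 Gb genome only ~1.3-fold
        print(coverage(num_reads=40_000_000, read_length=100,
                       genome_size=3_000_000_000))
        depths = {"ctg1": 28.0, "ctg2": 31.0, "ctg3": 95.0, "ctg4": 30.0}
        print(flag_repeats(depths))                 # -> ['ctg3']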

    Viral population estimation using pyrosequencing

    The diversity of virus populations within single infected hosts presents a major difficulty for the natural immune response as well as for vaccine design and antiviral drug therapy. Recently developed pyrophosphate-based sequencing technologies (pyrosequencing) can be used to quantify this diversity by ultra-deep sequencing of virus samples. We present computational methods for the analysis of such sequence data and apply these techniques to pyrosequencing data obtained from HIV populations within patients harboring drug-resistant virus strains. Our main result is the estimation of the population structure of the sample from the pyrosequencing reads. This inference is based on a statistical approach to error correction, followed by a combinatorial algorithm for constructing a minimal set of haplotypes that explain the data. Using this set of explaining haplotypes, we apply a statistical model to infer the frequencies of the haplotypes in the population via an EM algorithm. We demonstrate that pyrosequencing reads allow for effective population reconstruction through extensive simulations and by comparison to 165 sequences obtained directly from clonal sequencing of four independent, diverse HIV populations. Thus, pyrosequencing can be used for cost-effective estimation of the structure of virus populations, promising new insights into viral evolutionary dynamics and disease control strategies.
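    The frequency-inference step described above can be sketched as a generic mixture-proportion EM in Python. This is simplified from the paper's model, which also accounts for sequencing error; compat[i][j] indicates whether read i is consistent with candidate haplotype j, and each read is assumed consistent with at least one haplotype:

        def em_frequencies(compat, n_iter=100):
            """Infer haplotype frequencies from read-haplotype compatibility."""
            n_reads, n_haps = len(compat), len(compat[0])
            freqs = [1.0 / n_haps] * n_haps
            for _ in range(n_iter):
                counts = [0.0] * n_haps
                for row in compat:
                    # E-step: responsibility of each haplotype for this read
                    weights = [c * f for c, f in zip(row, freqs)]
                    total = sum(weights)
                    for j, w in enumerate(weights):
                        counts[j] += w / total
                # M-step: frequencies proportional to expected read counts
                freqs = [c / n_reads for c in counts]
            return freqs

        # 3 reads fit only haplotype A, 1 fits only B, 2 are ambiguous
        compat = [[1, 0], [1, 0], [1, 0], [0, 1], [1, 1], [1, 1]]
        print(em_frequencies(compat))               # -> approximately [0.75, 0.25]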