1,392 research outputs found

    Circlator: automated circularization of genome assemblies using long sequencing reads

    The assembly of DNA sequence data is undergoing a renaissance thanks to emerging technologies capable of producing reads tens of kilobases long. Assembling complete bacterial and small eukaryotic genomes is now possible, but the final step of circularizing sequences remains unsolved. Here we present Circlator, the first tool to automate assembly circularization and produce accurate linear representations of circular sequences. Using Pacific Biosciences and Oxford Nanopore data, Circlator correctly circularized 26 of 27 circularizable sequences, comprising 11 chromosomes and 12 plasmids from bacteria, the apicoplast and mitochondrion of Plasmodium falciparum and a human mitochondrion. Circlator is available at http://sanger-pathogens.github.io/circlator/

    Insights into the biology of Candidate Division OP3 LiM populations

    The candidate division OP3, recently entitled candidate phylum Omnitrophica, is characterized by 16S rRNA gene sequences from a broad range of anoxic habitats with a broad phylogeny of up to 26% sequence dissimilarity. The 16S rRNA phylotype OP3 LiM had previously been detected in limonene-degrading, methanogenic enrichment cultures and represented small coccoid cells. Neither isolation experiments nor physiological experiments had provided insights into the metabolism of this bacterium within the complex methanogenic community. This doctoral thesis aimed at the characterization of populations of the phylotype OP3 LiM to discover its biology. Metagenomes usually yield draft population genomes. To obtain the complete closed OP3 LiM genome, in silico methods were explored to improve draft assemblies. Large genomes of planctomycete strains were assembled with a variety of methods. A taxonomic classification of contig sequences was used to differentiate and separate contigs of draft assemblies into taxon-specific groups. Reassemblies of reads obtained from mapping onto taxon-specific contigs yielded improved draft assemblies. This knowledge was used to obtain a closed genome of OP3 LiM from a metagenome of physically enriched OP3 LiM cells. Finishing the OP3 LiM genome required the combination of data from different sequencing technologies, a variety of assembly and mapping software, over 15 reassemblies with intensive manual quality controls by read and contig mapping and, finally, laboratory work with combinatorial PCR to solve the genome puzzle. The population genome of OP3 LiM is the first closed genome of a member of candidate phylum Omnitrophica and comprises 1,974,501 bp with a GC content of 52.9%. Its 23S rRNA contains a group I intron. The genome offers a syntrophic life on hydrogen or formate; however, the metaproteome indicated that OP3 LiM uses glycolysis together with pyruvate oxidation as its major catabolic pathway. 
The metaproteome also identified high levels of proteins potentially involved in the degradation of polymers as well as in the uptake of foreign nucleic acids. The genomic information was combined with observations of cells of the methanogenic community by different visualization methods. Images of OP3 LiM required electron microscopy due to the small cell size of 0.2–0.3 µm in diameter. In situ hybridizations revealed two physiological stages, free-living OP3 LiM cells with low ribosome content and OP3 LiM cells attached to either bacteria or archaea, which showed strong signals. This observation indicated a higher metabolic activity of OP3 LiM cells during the attachment and, likewise, that the bacterium utilizes surface polysaccharides as preferred substrate. In situ hybridizations revealed that the methanogen Methanosaeta in the enrichment culture contained cells in the filaments that lacked DNA and rRNA, suggesting that these cells lost their cellular content. We also observed faint signals of the OP3 LiM 16S rRNA in Methanosaeta cells. The presence of the intron RNA of the 23S rRNA of OP3 LiM was visualized in Methanosaeta cells devoid of DNA and rRNA. This first direct observation of an intron transfer from a bacterium to an archaeon, together with metaproteomic observations, indicates the lifestyle of an epibiotic bacterium for OP3 LiM. OP3 LiM is the first predatory bacterium that preys on Archaea. We propose to name OP3 LiM "Candidatus Vampirococcus archaeovorus".

    Effort required to finish shotgun-generated genome sequences differs significantly among vertebrates

    Background: The approaches for shotgun-based sequencing of vertebrate genomes are now well-established, and have resulted in the generation of numerous draft whole-genome sequence assemblies. In contrast, the process of refining those assemblies to improve contiguity and increase accuracy (known as 'sequence finishing') remains tedious, labor-intensive, and expensive. As a result, the vast majority of vertebrate genome sequences generated to date remain at a draft stage. Results: To date, our genome sequencing efforts have focused on comparative studies of targeted genomic regions, requiring sequence finishing of large blocks of orthologous sequence (average size 0.5-2 Mb) from various subsets of 75 vertebrates. This experience has provided a unique opportunity to compare the relative effort required to finish shotgun-generated genome sequence assemblies from different species, which we report here. Importantly, we found that the sequence assemblies generated for the same orthologous regions from various vertebrates show substantial variation with respect to misassemblies and, in particular, the frequency and characteristics of sequence gaps. As a consequence, the work required to finish different species' sequences varied greatly. Application of the same standardized methods for finishing provided a novel opportunity to "assay" characteristics of genome sequences among many vertebrate species. It is important to note that many of the problems we have encountered during sequence finishing reflect unique architectural features of a particular vertebrate's genome, which in some cases may have important functional and/or evolutionary implications. 
Finally, based on our analyses, we have been able to improve our procedures to overcome some of these problems and to increase the overall efficiency of the sequence-finishing process, although significant challenges still remain. Conclusion: Our findings have important implications for the eventual finishing of the draft whole-genome sequences that have now been generated for a large number of vertebrates.

    Assessment of Next Generation Sequencing Technologies for De novo and Hybrid Assemblies of Challenging Bacterial Genomes

    In the past decade, tremendous progress has been made in DNA sequencing methodologies in terms of throughput, speed and read length, along with a sharp decrease in per-base cost. These technologies, commonly referred to as next-generation sequencing (NGS), are complemented by the development of hybrid assembly approaches which can utilize multiple NGS platforms. In the first part of my dissertation I performed systematic evaluations and optimizations of nine de novo and hybrid assembly protocols across four novel microbial genomes. While each had strengths and weaknesses, via optimization using multiple strategies I obtained dramatic improvements in overall assembly size and quality. To select the best assembly, I also proposed the novel rDNA operon validation approach to evaluate assembly accuracy. Additionally, I investigated the ability of the third-generation PacBio sequencing platform and achieved automated finishing of Clostridium autoethanogenum without any accessory data. These complete genome sequences facilitated comparisons which revealed rDNA operons as a major limitation for short-read technologies, and also enabled comparative and functional genomics analysis. To facilitate future assessment and algorithm development for NGS technologies, we publicly released the sequence datasets for C. autoethanogenum, which span three generations of sequencing technologies and contain six types of data from four NGS platforms. To assess limitations of NGS technologies, assessment of unassembled regions within Illumina and PacBio assemblies was performed using eight microbial genomes. This analysis confirmed rDNA operons as major breakpoints within Illumina assemblies, while gaps within PacBio assemblies appear to be unaccounted-for events, and assembly quality is the cumulative effect of read depth, read quality, sample DNA quality and the presence of phage DNA or mobile genetic elements. 
In a final collaborative study, an enrichment protocol was applied for isolation of live endophytic bacteria from roots of the tree Populus deltoides. This protocol achieved a significant reduction in contaminating plant DNA and enabled the use of these samples for single-cell genomics analysis for the first time. Whole genome sequencing of selected single-cell genomes was performed and assembly and contamination removal were optimized, followed by bioinformatics, phylogenetic and comparative genomics analyses to identify unique characteristics of these uncultured microorganisms.

    Comparison of bacterial genome assembly software for MinION data and their applicability to medical microbiology.

    Translating the Oxford Nanopore MinION sequencing technology into medical microbiology requires on-going analysis that keeps pace with technological improvements to the instrument and release of associated analysis software. Here, we use a multidrug-resistant Enterobacter kobei isolate as a model organism to compare open source software for the assembly of genome data, and relate this to the time taken to generate actionable information. Three software tools (PBcR, Canu and miniasm) were used to assemble MinION data and a fourth (SPAdes) was used to combine MinION and Illumina data to produce a hybrid assembly. All four had a similar number of contigs and were more contiguous than the assembly using Illumina data alone, with SPAdes producing a single chromosomal contig. Evaluation of the four assemblies to represent the genome structure revealed a single large inversion in the SPAdes assembly, which also incorrectly integrated a plasmid into the chromosomal contig. Almost 50 %, 80 % and 90 % of MinION pass reads were generated in the first 6, 9 and 12 h, respectively. Using data from the first 6 h alone led to a less accurate, fragmented assembly, but data from the first 9 or 12 h generated similar assemblies to that from 48 h sequencing. Assemblies were generated in 2 h using Canu, indicating that going from isolate to assembled data is possible in less than 48 h. MinION data identified that genes responsible for resistance were carried by two plasmids encoding resistance to carbapenem and to sulphonamides, rifampicin and aminoglycosides, respectively. Health Innovation Challenge Fund (WT098600, HICF-T5-342) (Department of Health, Wellcome Trust). This is the final version of the article. It first appeared from the Microbiology Society via http://dx.doi.org/10.1099/mgen.0.00008

    Finishing the finished human chromosome 22 sequence

    A combination of approaches was used to close 8 of the 11 gaps in the original sequence of human chromosome 22, and to generate a total of 1.018 Mb of new sequence.

    Finished sequence and assembly of the DUF1220-rich 1q21 region using a haploid human genome

    Background: Although the reference human genome sequence was declared finished in 2003, some regions of the genome remain incomplete due to their complex architecture. One such region, 1q21.1-q21.2, is of increasing interest due to its relevance to human disease and evolution. Elucidation of the exact variants behind these associations has been hampered by the repetitive nature of the region and its incomplete assembly. This region also contains 238 of the 270 human DUF1220 protein domains, which are implicated in human brain evolution and neurodevelopment. Additionally, examinations of this protein domain have been challenging due to the incomplete 1q21 build. To address these problems, a single-haplotype hydatidiform mole BAC library (CHORI-17) was used to produce the first complete sequence of the 1q21.1-q21.2 region. Results: We found and addressed several inaccuracies in the GRCh37 sequence of the 1q21 region on large and small scales, including genomic rearrangements and inversions, and incorrect gene copy number estimates and assemblies. The DUF1220-encoding NBPF genes required the most corrections, with 3 genes removed, 2 genes reassigned to the 1p11.2 region, 8 genes requiring assembly corrections for DUF1220 domains (~91 DUF1220 domains were misassigned), and multiple instances of nucleotide changes that reassigned the domain to a different DUF1220 subtype. These corrections resulted in an overall increase in DUF1220 copy number, yielding a haploid total of 289 copies. Approximately 20 of these new DUF1220 copies were the result of a segmental duplication from 1q21.2 to 1p11.2 that included two NBPF genes. Interestingly, this duplication may have been the catalyst for the evolutionarily important human lineage-specific chromosome 1 pericentric inversion. Conclusions: Through the hydatidiform mole genome sequencing effort, the 1q21.1-q21.2 region is complete and misassemblies involving inter- and intra-region duplications have been resolved. 
The availability of this single haploid sequence path will aid in the investigation of many genetic diseases linked to 1q21, including several associated with DUF1220 copy number variations. Finally, the corrected sequence identified a recent segmental duplication that added approximately 20 DUF1220 copies to the human genome, and may have facilitated the chromosome 1 pericentric inversion that is among the most notable human-specific genomic landmarks.

    High quality draft sequences for prokaryotic genomes using a mix of new sequencing technologies

    Background: Massively parallel DNA sequencing instruments are enabling the decoding of whole genomes at significantly lower cost and higher throughput than classical Sanger technology. Each of these technologies has been estimated to yield assemblies with more problematic features than the standard method. These problems are of a different nature depending on the techniques used. An appropriate mix of technologies may therefore help resolve most difficulties, and eventually provide assemblies of high quality without requiring any Sanger-based input. Results: We compared assemblies obtained using Sanger data with those from different inputs from New Sequencing Technologies. The assemblies were systematically compared with a reference finished sequence. We found that the 454 GSFLX can efficiently produce high continuity when used at high coverage. The potential to enhance continuity by scaffolding was tested using 454 sequences from circularized genomic fragments. Finally, we explore the use of Solexa-Illumina short reads to polish the genome draft by implementing a technique to correct 454 consensus errors. Conclusion: High quality drafts can be produced for small genomes without any Sanger data input. We found that 454 GSFLX and Solexa/Illumina show great complementarity in producing large contigs and supercontigs with a low error rate.

    De Novo Assembly of the Complete Genome of an Enhanced Electricity-Producing Variant of Geobacter sulfurreducens Using Only Short Reads

    State-of-the-art DNA sequencing technologies are transforming the life sciences due to their ability to generate nucleotide sequence information with a speed and quantity that is unapproachable with traditional Sanger sequencing. Genome sequencing is a principal application of this technology, where the ultimate goal is the full and complete sequence of the organism of interest. Due to the nature of the raw data produced by these technologies, a full genomic sequence attained without the aid of Sanger sequencing has yet to be demonstrated