28 research outputs found

    Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies

    Background: Next-generation sequencing technologies allow genomes to be sequenced more quickly and less expensively than ever before. However, as sequencing technology has improved, the difficulty of de novo genome assembly has increased, due in large part to the shorter reads generated by the new technologies. The use of mated sequences (referred to as mate-pairs) is a standard means of disambiguating assemblies to obtain a more complete picture of the genome without resorting to manual finishing. Here, we examine the effectiveness of mate-pair information in resolving repeated sequences in the DNA (a paramount issue to overcome). While it has been empirically accepted that mate-pairs improve assemblies, and a variety of assemblers use mate-pairs in the context of repeat resolution, the effectiveness of mate-pairs in this context has not been systematically evaluated in previous literature. Results: We show that, in high-coverage prokaryotic assemblies, libraries of short mate-pairs (about 4-6 times the read-length) more effectively disambiguate repeat regions than the libraries that are commonly constructed in current genome projects. We also demonstrate that the best assemblies can be obtained by 'tuning' mate-pair libraries to accommodate the specific repeat structure of the genome being assembled - information that can be obtained through an initial assembly using unpaired reads. These results are shown across 360 simulations on 'ideal' prokaryotic data as well as assembly of 8 bacterial genomes using SOAPdenovo. The simulation results provide an upper-bound on the potential value of mate-pairs for resolving repeated sequences in real prokaryotic data sets. 
The assembly results show that our method of tuning mate-pairs exploits fundamental properties of these genomes, leading to better assemblies even when using an off-the-shelf assembler in the presence of base-call errors. Conclusions: Our results demonstrate that dramatic improvements in prokaryotic genome assembly quality can be achieved by tuning mate-pair sizes to the actual repeat structure of a genome, suggesting the possible need to change the way sequencing projects are designed. We propose that a two-tiered approach - first generate an assembly of the genome with unpaired reads in order to evaluate the repeat structure of the genome; then generate the mate-pair libraries that provide most information towards the resolution of repeats in the genome being assembled - is not only possible, but likely also more cost-effective as it will significantly reduce downstream manual finishing costs. In future work we intend to address the question of whether this result can be extended to larger eukaryotic genomes, where repeat structure can be quite different.

    A gene-by-gene population genomics platform: de novo assembly, annotation and genealogical analysis of 108 representative Neisseria meningitidis genomes

    Background: Highly parallel, ‘second generation’ sequencing technologies have rapidly expanded the number of bacterial whole genome sequences available for study, permitting the emergence of the discipline of population genomics. Most of these data are publicly available as unassembled short-read sequence files that require extensive processing before they can be used for analysis. The provision of data in a uniform format, which can be easily assessed for quality, linked to provenance and phenotype and used for analysis, is therefore necessary. Results: The performance of de novo short-read assembly followed by automatic annotation using the PubMLST.org Neisseria database was assessed and evaluated for 108 diverse, representative, and well-characterised Neisseria meningitidis isolates. High-quality sequences were obtained for >99% of known meningococcal genes among the de novo assembled genomes and four resequenced genomes, and less than 1% of reassembled genes had sequence discrepancies or misassembled sequences. A core genome of 1600 loci, present in at least 95% of the population, was determined using the Genome Comparator tool. Genealogical relationships compatible with, but at a higher resolution than, those identified by multilocus sequence typing were obtained with core genome comparisons and ribosomal protein gene analysis, which revealed a genomic structure for a number of previously described phenotypes. This unified system for cataloguing Neisseria genetic variation in the genome was implemented and used for multiple analyses, and the data are publicly available in the PubMLST Neisseria database. Conclusions: The de novo assembly, combined with automated gene-by-gene annotation, generates high quality draft genomes in which the majority of protein-encoding genes are present with high accuracy. 
The approach catalogues diversity efficiently, permits analyses of a single genome or multiple genome comparisons, and is a practical approach to interpreting WGS data for large bacterial population samples. The method generates novel insights into the biology of the meningococcus and improves our understanding of the whole population structure, not just disease causing lineages.

    Bioinformatics approaches for hybrid de novo genome assembly

    De novo genome assembly, the computational process of reconstructing a genomic sequence from scratch by stitching together overlapping reads, plays a key role in computational biology and, to date, cannot be considered a solved problem. Many bioinformatics approaches are available to deal with the different types of data generated by diverse technologies. Assemblies relying on short-read data tend to be highly fragmented, yielding short contigs interrupted in repetitive regions; on the other hand, long-read-based approaches still suffer from a high sequencing error rate, which lowers the quality of the final consensus. This thesis aimed to assess the impact of different assembly approaches on the reconstruction of a highly repetitive genome, identifying the strengths and limiting the weaknesses of such approaches through the integration of orthogonal data types. Moreover, a benchmarking study was undertaken to improve the contiguity of this genome, describing the improvements obtained through the integration of additional data layers. Assemblies performed using short reads confirmed the limitation in the reconstruction of long sequences for both software packages adopted. The use of long reads improved the contiguity of the genome assembly and also reconstructed a greater number of gene models. Despite the improved contiguity, the base-level accuracy of the long-read-based assembly still could not reach high levels. Therefore, short reads were integrated within the assembly process to reduce the base-level errors present in the reconstructed sequences by up to 96%. To order and orient the polished contigs into longer scaffolds, data derived from three different technologies (linked reads, chromosome conformation capture and optical mapping) were analysed. The best contiguity metrics were obtained using chromosome conformation data, which made it possible to obtain chromosome-scale scaffolds. 
To evaluate the obtained results, data derived from linked reads and optical mapping were used to identify putative misassemblies in the scaffolds. Both datasets allowed the identification of misassemblies, highlighting the importance of integrating data derived from orthogonal technologies in the de novo assembly process. This work underlines the importance of adopting bioinformatics approaches able to deal with the data types generated by different technologies. In this way, results can be validated more accurately, yielding assemblies that could eventually be considered reference genomes.

    Next-generation sequencing and its new possibilities in medicine

    Next-Generation Sequencing (NGS) originally referred to high-throughput, massively parallel sequencing methods that allow up to billions of small (50-1000 bp), amplified DNA fragments to be sequenced at the same time; nowadays, however, there are also NGS techniques that determine the sequence of long (up to 50 kbp) single molecules. Over the past years, NGS technologies have become widely available, with increasing throughput and decreasing sequencing costs per base making them more cost-effective than the previously used capillary sequencing methods based on Sanger biochemistry. High-throughput DNA sequencing is now routinely used in a wide range of important fields of biology and medicine, enabling large-scale sequencing projects such as the analysis of complete genomes, disease association studies, whole transcriptomes and methylomes, and providing new insights into complex biological systems. In addition, more and more NGS-based diagnostic tools are being introduced into clinical practice, for example in the fields of oncology, inherited and infectious diseases, and pre-implantation and prenatal genetic screening.

    Lost in plasmids: next generation sequencing and the complex genome of the tick-borne pathogen Borrelia burgdorferi

    Background: Borrelia (B.) burgdorferi sensu lato, including the tick-transmitted agents of human Lyme borreliosis, have particularly complex genomes, consisting of a linear main chromosome and numerous linear and circular plasmids. The number and structure of plasmids is variable even in strains within a single genospecies. Genes on these plasmids are known to play essential roles in virulence and pathogenicity as well as host and vector associations. For this reason, it is essential to explore methods for rapid and reliable characterisation of molecular level changes on plasmids. In this study we used three strains: a low passage isolate of B. burgdorferi sensu stricto strain B31(-NRZ) and two closely related strains (PAli and PAbe) that were isolated from human patients. Sequences of these strains were compared to the previously sequenced reference strain B31 (available in GenBank) to obtain proof-of-principle information on the suitability of next generation sequencing (NGS) library construction and sequencing methods on the assembly of bacterial plasmids. We tested the effectiveness of different short read assemblers on Illumina sequences, and of long read generation methods on sequence data from Pacific Bioscience single-molecule real-time (SMRT) and nanopore (Oxford Nanopore Technologies) sequencing technology. Results: Inclusion of mate pair library reads improved the assembly in some plasmids as did prior enrichment of plasmids. While cp32 plasmids remained refractory to assembly using only short reads they were effectively assembled by long read sequencing methods. The long read SMRT and nanopore sequences came, however, at the cost of indels (insertions or deletions) appearing in an unpredictable manner. Using long and short read technologies together allowed us to show that the three B. burgdorferi s.s. 
strains investigated here, whilst having similar plasmid structures to each other (apart from fusion of cp32 plasmids), differed significantly from the reference strain B31-GB, especially in the case of cp32 plasmids. Conclusion: Short read methods are sufficient to assemble the main chromosome and many of the plasmids in B. burgdorferi. However, a combination of short and long read sequencing methods is essential for proper assembly of all plasmids including cp32 and thus, for gaining an understanding of host- or vector adaptations. An important conclusion from our work is that the evolution of Borrelia plasmids appears to be dynamic. This has important implications for the development of useful research strategies to monitor the risk of Lyme disease occurrence and how to medically manage it

    Genome-based natural product biosynthetic gene cluster discovery : from sequencing to mining

    Natural products are small molecules produced by a range of living organisms. They may be toxic or have pharmaceutical applications as antibiotics, anticancer, antiparasitic and anti-fungal agents. Natural products such as microcystins are commonly synthesized by nonribosomal peptide synthetases (NRPSs) and polyketide synthases (PKSs). Ribosomal pathways in cyanobacteria are also known for the synthesis of bacteriocins, lantibiotics, cyanobactins and microviridins. Genes encoding biosynthetic enzymes of these systems are often found together and form gene clusters. The filamentous cyanobacterium Anabaena sp. strain 90, a hepatotoxin producer isolated from a bloom in a Finnish lake, was selected for genome sequencing in order to explore its full capacity for bioactive compound production. The 5.3-Mb Anabaena sp. 90 genome displays a multi-chromosomal composition with five circular replicons: two chromosomes and three plasmids. A total of four non-ribosomal biosynthetic gene clusters, which are responsible for the production of anabaenopeptilides, anabaenopeptins, microcystins and the novel glycolipopeptides hassallidins, were identified in chromosome I. Genome annotation revealed that the Anabaena sp. 90 genome also harbors an anacyclamide-encoding cyanobactin gene cluster and seven putative bacteriocin gene clusters, which belong to the ribosomal pathways. These biosynthetic gene clusters amount to a total of ~250 kb, or 5% of the genome. Analysis of the Anabaena sp. 90 genome suggested that cyanobacteria might produce bacteriocins. A thorough genome mining at the phylum level was conducted, targeting the discovery of cyanobacterial bacteriocin biosynthetic pathways. The results demonstrated the common presence of bacteriocin gene clusters in cyanobacteria. A total of 145 bacteriocin gene clusters were discovered, the majority of which were previously unknown. 
Based on their gene organization and domain composition, these gene clusters were classified into seven groups. This classification is supported by the phylogenetic analysis, which also indicates independent evolutionary trajectories of the gene clusters in different groups. By scrutinizing the surrounding regions of these gene clusters, a total of 290 putative precursors were located. They showed diverse structures and very little sequence conservation of the core peptide. To explore the distribution of NRPSs and PKSs, a comprehensive genome-mining study was carried out and demonstrated their widespread occurrence across the three domains of life, with the discovery of 3,339 gene clusters from 991 organisms, by examining a total of 2,699 genomes. The majority of these gene clusters were found in bacteria, in which high correlation between bacterial genome size and the capacity of NRPS and PKS biosynthetic pathways was observed. Currently, PKSs are classified into three types. Type I PKSs and NRPSs are known to share a modular scheme with a multidomain structure. Surprisingly, a large number (8,906) of enzymes encoding a single NRPS or type I PKS functional domain were found. These monodomain enzymes have a similar genetic organization to type II PKSs, which are nonmodular enzymes. The finding of common occurrence of nonmodular NRPSs and type I PKSs substantially differs from the current knowledge. Furthermore, a total of 314 gene clusters comprised mostly of monodomain enzymes were found. In addition, sequence analysis suggested that the evolution of NRPS machineries was a combination of common descent and horizontal gene transfer.
Natural products are bioactive compounds produced by living organisms. They have diverse chemical structures and broad biological activities, which lend themselves to pharmaceutical applications such as drug lead candidates. Nonribosomal peptides and polyketides are the most commonly utilized natural products. 
Recently, ribosomally synthesized natural products, such as bacteriocins, lantibiotics and cyanobactins, were also found to have interesting activities and emerged as potential sources of novel medical agents. Given the advancement of DNA sequencing techniques and exponential growth of genomic data, more than three thousand biosynthetic pathways of nonribosomal peptides, polyketides and bacteriocins were discovered in this study by complete genome sequencing and systematic genome mining. The majority of these pathways have unknown end-products, which highlights the power of genome mining in discovering novel secondary metabolite biosynthetic machinery. Genome sequencing revealed that 5% of the Anabaena sp. 90 genome is dedicated to the production of bioactive peptides. Genome mining demonstrated the widespread occurrence of bacteriocin gene clusters in cyanobacteria, which were shown to be a rich source of natural products biosynthesized by both nonribosomal and ribosomal pathways. Furthermore, a comprehensive genomic survey of nonribosomal peptide and polyketide biosynthetic pathways demonstrated their widespread distribution across three domains of life. This atlas showed that Proteobacteria, Actinobacteria, Firmicutes and Cyanobacteria among bacteria, and the phylum Ascomycota among fungi, contained higher numbers of these gene clusters and may produce a vast array of nonribosomal peptides and polyketides. The common occurrence of non-canonical nonmodular biosynthetic enzymes of peptide synthetase and type I polyketide synthase was also revealed. The knowledge discovered in this study provides a solid basis for the exploration of natural product biosynthetic capacity, for example to aid drug discovery.

    Genomic analyses of Paenibacillus polymyxa CR1, a bacterium with potential applications in biomass degradation and biofuel production

    Lignin is a polyphenolic heteropolymer constituting 18 to 35% of lignocellulose and is recognized as an obstacle to cellulosic biofuel commercialization. Paenibacillus polymyxa CR1 was isolated from naturally degrading corn stover and shown to produce alcohols using lignin as a sole carbon source. Genome sequencing and comparative genomics of P. polymyxa CR1 identified two homologs, a Dyp-type peroxidase and a laccase, which have previously been implicated in lignin metabolism in other bacteria. Knockout mutants of the identified genes displayed no growth deficiency, and P. polymyxa CR1 is incapable of metabolizing common aromatic intermediates of lignin, suggesting the bacterium employs a novel catabolic pathway. To identify genes involved in lignin metabolism, a transposon library was generated and screened for abnormal lignin growth phenotypes. The results presented here will help elucidate the genetic basis of known functions and delineate regulatory pathways and metabolic versatility in P. polymyxa relevant to lignin metabolism.

    Sequencing the nuclear genomes of ‘primitive’ unicellular eukaryotes: the jakobids

    Eukaryotes are chimeric organisms that arose from endosymbiosis between an archaebacterium and an α-proteobacterium. During this process, these organisms evolved to acquire many of the characteristics observed in modern eukaryotes, notably a mitochondrion, a nucleus, an endomembrane system, a splicing system, and linear chromosomes ending in telomeres. Although the characteristics of the last common ancestor of eukaryotes have largely been identified, the sequence of evolutionary events that led to the appearance of this organism remains poorly understood. To better reconstruct this sequence of events, analysing the genomes of organisms basal to the eukaryotes will be necessary to identify traces of this evolution. We therefore propose that analysing a collection of genomes from "primitive" eukaryotes, the jakobids and malawimonads, unicellular flagellated eukaryotes that feed on bacteria, could allow a better understanding of this process. In addition, it has been suggested that the genome of one of these organisms, Andalucia godoyi, may possess circular chromosomes, an atypical feature among eukaryotes that can be confirmed by producing a highly contiguous genome assembly. To obtain high-quality genome assemblies, the jakobids A. godoyi, Jakoba bahamiensis, Seculamonas ecuadoriensis and Stygiella incarcerata and the malawimonad Malawimonas californiana were sequenced by nanopore. Nanopore sequencing gave mixed results, and J. bahamiensis and M. californiana showed a low sequencing yield, possibly due to contamination by polysaccharides. For the other organisms, we developed an assembly pipeline using the assemblers Flye and Shasta, which allowed us to produce genome assemblies. Analysis of the A. godoyi genome identified four circular chromosomes, possibly located in the nucleus, that contain several genes related to metabolism, transport and signalling and that possibly constitute a type of circular chromosome different from those previously observed in eukaryotes. Overall, this work established a collection of genomes of "primitive" eukaryotes that can be used for comparative genomic analyses to better understand the evolution of eukaryotes.
Eukaryotes are chimeric organisms that are the product of an endosymbiotic event between an archaebacterium and an α-proteobacterium. During eukaryogenesis, these organisms gained many characteristics that define modern eukaryotes, such as a mitochondrion, a nucleus, an endomembrane system, the splicing machinery, and linear chromosomes with telomeres. While most characteristics of the last common eukaryotic ancestor have been identified, most of the evolutionary process that led to this organism is still unknown. To reconstruct this string of events, we must analyse the genomes of “primitive” basal eukaryotes that have a slow evolutionary rate and a lifestyle like that of the last common eukaryotic ancestor, and are thus most likely to contain remains of ancestral mechanisms that have been lost in most known eukaryotes. We propose that analysing the genomes of the jakobids and malawimonads, two groups of free-living flagellates that feed on bacteria, could provide such clues on the evolution of eukaryotes. Using nanopore sequencing, a collection of high-quality genomes has been built to help in this analysis. 
Furthermore, it has been supposed that the genome of the jakobid Andalucia godoyi could be composed of both linear and circular chromosomes, a genomic structure that has not been identified in other eukaryotes, which was investigated using the high-quality nanopore assembly. To generate a collection of high-quality genome assemblies, we sequenced the genomes of the jakobids A. godoyi, Jakoba bahamiensis, Seculamonas ecuadoriensis and Stygiella incarcerata as well as the malawimonad Malawimonas californiana by nanopore. While the yields were too low for J. bahamiensis and M. californiana, probably due to contamination by polysaccharides, we were able to assemble chromosome-level genomes for A. godoyi and S. incarcerata and high-quality draft genomes for S. ecuadoriensis and R. americana. Using this assembly, we were able to identify four circular chromosomes in the genome of A. godoyi. The circular chromosomes are likely to be located in the nucleus and encode genes with functions related to metabolism, ion and macromolecule transport, and signaling. Furthermore, these molecules differ from known circular chromosomes in eukaryotes, as they are unlikely to be selfish DNA elements, such as known eukaryotic plasmids, or the circular by-products of replication identified in other eukaryotes. Overall, this work lays the foundation for larger-scale comparative genomics of the jakobids and malawimonads by generating a small collection of genomes that will be used in future studies to better understand the origin of the eukaryotes.