311 research outputs found

    PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach

    Since the read lengths of high throughput sequencing (HTS) technologies are short, de novo assembly, which plays a significant role in many applications, remains a great challenge. Most state-of-the-art approaches are based on the de Bruijn graph strategy or the overlap-layout strategy. However, these approaches, which depend on k-mers or read overlaps, do not fully utilize the information in paired-end and single-end reads when resolving branches. Because they treat all single-end reads whose overlap length exceeds a fixed threshold equally, they fail to favor the more reliable long-overlap reads during assembly and mix them with the relatively short-overlap reads. Moreover, these approaches have not been specially designed to handle tandem repeats (repeats that occur adjacently in the genome), and they usually break contigs near tandem repeats. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach, which adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds, using paired-end reads and read overlap sizes ranging from Omax down to Omin to resolve gaps and branches. By constructing a decision model with a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA tries to extend the contig by all feasible extensions and determines the correct one using a look-ahead approach. Many hard-to-resolve branches are due to tandem repeats, which lie close together in the genome. PERGA detects the different copies of such repeats to resolve the branches, making extensions much longer and more accurate. We evaluated PERGA on both real and simulated Illumina datasets, ranging from small bacterial genomes to a large human chromosome, and it constructed longer and more accurate contigs and scaffolds than other state-of-the-art assemblers.
PERGA can be freely downloaded at https://github.com/hitbio/PERGA.
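
The idea of preferring long read overlaps and falling back to shorter ones can be illustrated with a minimal sketch. This is not PERGA's implementation: it uses no paired-end information, no SVM decision model, and no look-ahead, and the names `next_base`, `extend_contig`, `o_max`, and `o_min` are hypothetical. It simply extends a contig one base at a time, trying overlaps from `o_max` down to `o_min`, and stops at the first unresolved branch.

```python
from collections import Counter


def next_base(contig, reads, o_max, o_min):
    """Return the next base supported by the longest usable overlap, or None."""
    for overlap in range(o_max, o_min - 1, -1):
        if overlap > len(contig):
            continue
        suffix = contig[-overlap:]
        votes = Counter()
        for read in reads:
            # A read overlaps the contig end if it starts with the suffix
            # and carries at least one base beyond it.
            if len(read) > overlap and read.startswith(suffix):
                votes[read[overlap]] += 1
        if len(votes) == 1:
            return next(iter(votes))  # unambiguous extension
        if len(votes) > 1:
            return None  # branch: a real assembler would resolve it here
        # No supporting read at this overlap size: try a shorter overlap.
    return None


def extend_contig(contig, reads, o_max, o_min, max_steps=10_000):
    """Greedily extend the contig to the right until no base can be chosen."""
    for _ in range(max_steps):
        base = next_base(contig, reads, o_max, o_min)
        if base is None:
            break
        contig += base
    return contig


if __name__ == "__main__":
    genome = "ACGTTGCATGGA"  # toy sequence, assumed for illustration only
    reads = [genome[i:i + 6] for i in range(len(genome) - 5)]
    print(extend_contig("ACGTTG", reads, o_max=5, o_min=3))
```

On the toy reads above, every extension is decided at the longest overlap; a branch such as reads `AAAC` and `AAAG` following a contig ending in `AAA` halts the extension, which is the point where the abstract describes a decision model and look-ahead taking over.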

    Functional Comparison of Current Software Tools for Genomic Assembly from High Throughput Sequencing Data

    De novo genomic sequencing, the process of discovering the sequence of a genome that has not previously been elucidated, presents unique challenges, especially for larger genomes. Modern high-throughput sequencing technologies have addressed the issue of covering the entire genome in a reasonable time by fragmenting the genome into portions that can be examined in a massively parallel approach. While these have saved considerable time and cost for the chemical process of determining the sequence of a genome, they result in sets of many tens of millions of sequence fragments called reads, each of which is typically on the order of just 100 to 300 bases long. Assembling these reads into a genomic sequence is highly computationally complex. A variety of assembly software packages are readily available for this purpose. In this project, a set of genomic assemblers was selected for examination. These programs were then tested with an Illumina data set for the grape species Vitis romanetii. Experimental runs with this dataset were performed to evaluate the run time required as well as the contiguity, completeness, and accuracy of the resulting assemblies. Different approaches to quality control preprocessing of the sequence data were also explored and evaluated. The results strongly recommend the use of the program MaSuRCA, run with data that has not been preprocessed for quality control. The second highest recommendation is ABySS with data preprocessed via QuorUM error correction. In the process of these tests, it was also hoped that at least the beginnings of a draft genome for V. romanetii would be produced. The assemblies that came closest to publication quality were produced by MaSuRCA. Examination of these using the assessment software BUSCO suggests that the best of these assemblies may well be approaching publishable quality.

    Genome Sequence of the Native Apiculate Wine Yeast Hanseniaspora vineae T02/19AF

    The use of novel yeast strains for winemaking improves quality and provides variety, including subtle characteristic differences in fine wines. Here we report the first genome of a yeast strain native to Uruguay, Hanseniaspora vineae T02/19AF, which has been shown to positively contribute to aroma and wine quality.
    Affiliation: Giorello, Facundo M. Universidad de la República; Uruguay
    Affiliation: Berná, Luisa. Instituto Pasteur de Montevideo; Uruguay
    Affiliation: Greif, Gonzalo. Instituto Pasteur de Montevideo; Uruguay
    Affiliation: Camesasca, Laura. Inst. de Investigaciones Biológicas Clemente Estable; Uruguay
    Affiliation: Salzman, Valentina. Instituto Pasteur de Montevideo; Uruguay. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
    Affiliation: Medina, Karina. Universidad de la República. Facultad de Química; Uruguay
    Affiliation: Robello, Carlos. Instituto Pasteur de Montevideo; Uruguay
    Affiliation: Gaggero, Carina. Inst. de Investigaciones Biológicas Clemente Estable; Uruguay
    Affiliation: Aguilar, Pablo S. Instituto Pasteur de Montevideo; Uruguay
    Affiliation: Carrau, Francisco. Sección Enología; Urugua

    First draft genome assembly of the Argane tree (Argania spinosa)

    Background: The Argane tree (Argania spinosa L. Skeels) is an endemic tree of southwestern Morocco that plays an important socioeconomic and ecologic role for a dense human population in an arid zone. Several studies confirmed the importance of this species as a food and feed source and as a resource for both pharmaceutical and cosmetic compounds. Unfortunately, the argane tree ecosystem is facing significant threats from environmental changes (global warming, over-population) and over-exploitation. Limited research has been conducted, however, on argane tree genetics and genomics, which hinders its conservation and genetic improvement. Methods: Here, we present a draft genome assembly of A. spinosa. A reliable reference genome of A. spinosa was created using a hybrid de novo assembly approach combining short and long sequencing reads. Results: In total, 144 Gb of Illumina HiSeq reads and 7.2 Gb of PacBio reads were produced and assembled. The final draft genome comprises 75 327 scaffolds totaling 671 Mb with an N50 of 49 916 kb. The draft assembly is close to the genome size estimated from the k-mer distribution and covers 89% of complete and 4.3% of partial Arabidopsis orthologous groups in BUSCO. Conclusion: The A. spinosa genome will be useful for assessing biodiversity, leading to efficient conservation of this endangered endemic tree. Furthermore, the genome may enable genome-assisted cultivar breeding and provide a better understanding of important metabolic pathways and their underlying genes for both cosmetic and pharmacological purposes.

    The Douglas-Fir Genome Sequence Reveals Specialization of the Photosynthetic Apparatus in Pinaceae.

    A reference genome sequence for Pseudotsuga menziesii var. menziesii (Mirb.) Franco (Coastal Douglas-fir) is reported, thus providing a reference sequence for a third genus of the family Pinaceae. The contiguity and quality of the genome assembly (contig N50 = 44,136 bp and scaffold N50 = 340,704 bp) far exceed those of other conifer reference genome sequences. Incremental improvements in sequencing and assembly technologies are in part responsible for the higher quality reference genome, but it may also be due to a slightly lower exact repeat content in Douglas-fir vs. pine and spruce. Comparative genome annotation with angiosperm species reveals gene-family expansion and contraction in Douglas-fir and other conifers, which may account for some of the major morphological and physiological differences between the two major plant groups. Notable differences in the size of the NDH-complex gene family and genes underlying the functional basis of shade tolerance/intolerance were observed. This reference genome sequence not only provides an important resource for Douglas-fir breeders and geneticists but also sheds additional light on the evolutionary processes that have led to the divergence of modern angiosperms from the more ancient gymnosperms.
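
Several of the abstracts in this listing report contig or scaffold N50 as a contiguity measure. As a minimal sketch, assuming the usual definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length), N50 can be computed as follows. The function name `n50` and the example lengths are illustrative and are not taken from any of the listed papers.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up lengths
    print(n50(toy_lengths))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstracts pair it with completeness checks such as BUSCO.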

    A Chromosome-Scale Assembly of the Garden Orach (Atriplex hortensis L.) Genome Using Oxford Nanopore Sequencing

    Atriplex hortensis (2n = 2x = 18, 1C genome size 1.1 gigabases), also known as garden orach and mountain-spinach, is a highly nutritious, broadleaf annual of the Amaranthaceae-Chenopodiaceae alliance (Chenopodiaceae sensu stricto, subfam. Chenopodioideae) that has spread in cultivation from its native primary domestication area in Eurasia to other temperate and subtropical regions worldwide. Atriplex L. is a highly complex but, as understood now, a monophyletic group of mainly halophytic and/or xerophytic plants, of which A. hortensis has been a vegetable of minor importance in some areas of Eurasia (from Central Asia to the Mediterranean) at least since antiquity. Nonetheless, it is a crop with tremendous nutritional potential due primarily to its exceptional leaf and seed protein quantities (approaching 30%) and quality (high levels of lysine). Although there is some literature describing the taxonomy and production of A. hortensis, there is a general lack of genetic and genomic data that would otherwise help elucidate the genetic variation, phylogenetic positioning, and future potential of the species. Here, we report the assembly of the first high-quality, chromosome-scale reference genome for A. hortensis cv. “Golden.” Long-read data from Oxford Nanopore’s MinION DNA sequencer was assembled with the program Canu and polished with Illumina short reads. Contigs were scaffolded to chromosome scale using chromatin-proximity maps (Hi-C) yielding a final assembly containing 1,325 scaffolds with a N50 of 98.9 Mb – with 94.7% of the assembly represented in the nine largest, chromosome-scale scaffolds. Sixty-six percent of the genome was classified as highly repetitive DNA, with the most common repetitive elements being Gypsy- (32%) and Copia-like (11%) long-terminal repeats. The annotation was completed using MAKER which identified 37,083 gene models and 2,555 tRNA genes. 
Completeness of the genome, assessed using the Benchmarking Universal Single Copy Orthologs (BUSCO) metric, identified 97.5% of the conserved orthologs as complete, with only 2.2% being duplicated, reflecting the diploid nature of A. hortensis. A resequencing panel of 21 wild, unimproved and cultivated A. hortensis accessions revealed three distinct populations with little variation within subpopulations. These resources provide vital information to better understand A. hortensis and facilitate future study.

    Hybrid Assembly Provides Improved Resolution of Plasmids, Antimicrobial Resistance Genes, and Virulence Factors in Escherichia coli and Klebsiella pneumoniae Clinical Isolates

    Emerging new sequencing technologies have provided researchers with a unique opportunity to study factors related to microbial pathogenicity, such as antimicrobial resistance (AMR) genes and virulence factors. However, the use of whole-genome sequence (WGS) data requires good knowledge of the bioinformatics involved, as well as the necessary techniques. In this study, a total of nine Escherichia coli and Klebsiella pneumoniae isolates from Norwegian clinical samples were sequenced using both MinION and Illumina platforms. Three out of nine samples were sequenced directly from blood culture, and one sample was sequenced from a mixed-blood culture. For genome assembly, several long-read (Canu, Flye, Unicycler, and Miniasm), short-read (ABySS, Unicycler, and SPAdes), and hybrid assemblers (Unicycler, hybridSPAdes, and MaSuRCA) were tested. Assembled genomes from the best-performing assemblers (according to quality checks using QUAST and BUSCO) were subjected to downstream analyses. Flye and Unicycler performed best for the assembly of long and short reads, respectively. For hybrid assembly, Unicycler was the top-performing assembler and produced more circularized and complete genome assemblies. Hybrid assembled genomes performed substantially better in downstream analyses to predict putative plasmids, AMR genes, and β-lactamase gene variants, compared to MinION and Illumina assemblies. Thus, hybrid assembly has the potential to reveal factors related to microbial pathogenicity in clinical and mixed samples.