
    Extreme Scale De Novo Metagenome Assembly

    Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into an accurate representation of the underlying microbiomes' genomes. State-of-the-art tools require large shared-memory machines and cannot handle contemporary metagenome datasets that exceed terabytes in size. In this paper, we introduce the MetaHipMer pipeline, a high-quality and high-performance metagenome assembler that employs an iterative de Bruijn graph approach. MetaHipMer leverages a specialized scaffolding algorithm that produces long scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is end-to-end parallelized using the Unified Parallel C language and can therefore run seamlessly on shared- and distributed-memory systems. Experimental results show that MetaHipMer matches or outperforms the state-of-the-art tools in terms of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and is able to assemble previously intractable grand challenge metagenomes. We demonstrate the unprecedented capability of MetaHipMer by computing the first full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion reads (2.6 TBytes in size). Comment: Accepted to SC1
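The de Bruijn graph idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not MetaHipMer's implementation: the function name `build_de_bruijn`, the toy reads, and the choice of k are all hypothetical, and error correction, iteration over several values of k, and parallelism are omitted.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer in a read adds an edge from its
    prefix (first k-1 bases) to its suffix (last k-1 bases).
    Hypothetical helper for illustration only.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two overlapping toy reads (assumed data, not from the paper).
graph = build_de_bruijn(["ACGTAC", "GTACGG"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

A node with more than one successor (here `CG`) marks a branch that an assembler would have to resolve, for example by revisiting the graph with a larger k.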

    RSEARCH: Finding homologs of single structured RNA sequences

    BACKGROUND: For many RNA molecules, secondary structure rather than primary sequence is the evolutionarily conserved feature. No programs have yet been published that allow searching a sequence database for homologs of a single RNA molecule on the basis of secondary structure. RESULTS: We have developed a program, RSEARCH, that takes a single RNA sequence with its secondary structure and uses a local alignment algorithm to search a database for homologous RNAs. For this purpose, we have developed a series of base pair and single nucleotide substitution matrices for RNA sequences called RIBOSUM matrices. RSEARCH reports the statistical confidence for each hit as well as the structural alignment of the hit. We show several examples in which RSEARCH outperforms the primary sequence search programs BLAST and SSEARCH. The primary drawback of the program is that it is slow. The C code for RSEARCH is freely available from our lab's website. CONCLUSION: RSEARCH outperforms primary sequence programs in finding homologs of structured RNA sequences.
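To make the notion of local alignment with a substitution score concrete, here is a minimal sketch of a plain Smith-Waterman scorer over sequence only. It is not RSEARCH's algorithm, which also scores base pairs from the secondary structure; the `toy_score` values and the gap penalty are assumptions and are not RIBOSUM entries.

```python
def local_alignment_score(a, b, score, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    `score(x, y)` is a substitution function; `gap` is a linear gap
    penalty. Both are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            h[i][j] = max(
                0,  # local alignment may restart anywhere
                h[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                h[i - 1][j] + gap,
                h[i][j - 1] + gap,
            )
            best = max(best, h[i][j])
    return best


def toy_score(x, y):
    """Hypothetical match/mismatch scores standing in for a real matrix."""
    return 2 if x == y else -1


print(local_alignment_score("GGAUCC", "AUC", toy_score))
```

Swapping `toy_score` for a lookup into a real substitution matrix changes the scores but not the recurrence.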

    Specimens at the Center: An Informatics Workflow and Toolkit for Specimen-level analysis of Public DNA database data

    Major public DNA databases, namely NCBI GenBank, the DNA DataBank of Japan (DDBJ), and the European Molecular Biology Laboratory (EMBL), are invaluable biodiversity libraries. Systematists and other biodiversity scientists commonly mine these databases for sequence data to use in phylogenetic studies, but such studies generally use only the taxonomic identity of the sequenced tissue, not the specimen identity. Thus studies that use DNA supermatrices to construct phylogenetic trees with species at the tips typically do not take advantage of the fact that for many individuals in the public DNA databases, several DNA regions have been sampled, and that for many species, two or more individuals have been sampled. As a result, these studies typically do not make full use of the multigene datasets in public DNA databases to test species coherence and select optimal sequences to represent a species. In this study, we introduce a set of tools developed in the R programming language to construct individual-based trees from NCBI GenBank data and present a set of trees for the genus Carex (Cyperaceae) constructed using these methods. For the more than 770 species for which we found sequence data, our approach recovered an average of 1.85 gene regions per specimen, up to seven for some specimens, and more than 450 species represented by two or more specimens. Depending on the subset of genes analyzed, we found up to 42% of species monophyletic. We introduce a simple tree statistic, the Taxonomic Disparity Index (TDI), to assist in curating specimen-level datasets and provide code for selecting maximally informative (or, conversely, minimally misleading) sequences as species exemplars. While tailored to the Carex dataset, the approach and code presented in this paper can readily be generalized to constructing individual-level trees from large amounts of data for any species group.

    A new reference genome assembly for the microcrustacean Daphnia pulex

    Comparing genomes of closely related genotypes from populations with distinct demographic histories can help reveal the impact of effective population size on genome evolution. For this purpose, we present a high-quality genome assembly of Daphnia pulex (PA42), and compare this with the first sequenced genome of this species (TCO), which was derived from an isolate from a population with >90% reduction in nucleotide diversity. PA42 has numerous similarities to TCO at the gene level, with an average amino acid sequence identity of 98.8% and >60% of orthologous proteins identical. Nonetheless, there is a highly elevated number of genes in the TCO genome annotation, with approximately 7000 excess genes appearing to be false positives. This view is supported by the high GC content, lack of introns, and short length of these suspicious gene annotations. Consistent with the view that reduced effective population size can facilitate the accumulation of slightly deleterious genomic features, we observe more proliferation of transposable elements (TEs) and a higher frequency of gained introns in the TCO genome.

    Bio-inspired call-stack reconstruction for performance analysis

    The correlation of performance bottlenecks with their associated source code has become a cornerstone of performance analysis. It helps explain why the efficiency of an application falls behind the computer's peak performance and ultimately enables optimizations of the code. To this end, performance analysis tools collect the processor call-stack and then combine this information with measurements to allow the analyst to comprehend the application behavior. Some tools modify the call-stack during run-time to reduce the collection expense, but at the cost of non-portable solutions. In this paper, we present a novel portable approach to associate performance issues with their source code counterpart. To do so, we capture a reduced segment of the call-stack (up to three levels) and then process the segments using an algorithm inspired by multi-sequence alignment techniques. The results of our approach are easily mapped to detailed performance views, enabling the analyst to unveil the application behavior and its corresponding region of code. To demonstrate the usefulness of our approach, we have applied the algorithm to several in-production applications seen for the first time, describing them finely and optimizing them with tiny modifications based on the analyses. We thankfully acknowledge Mathis Bode for giving us access to the Arts CF binaries, and Miguel Castrillo and Kim Serradell for their valuable insight regarding Nemo. We would like to thank Forschungszentrum Jülich for the computation time on their Blue Gene/Q system. This research has been partially funded by the CICYT under contracts No. TIN2012-34557 and TIN2015-65316-P. Peer Reviewed. Postprint (author's final draft)
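As a rough illustration of how short call-stack segments might be stitched into a longer one, the sketch below joins two segments on their longest overlap, in the spirit of overlap-based sequence alignment. It is not the algorithm from the paper above; the function `merge_segments`, the frame names, and the outermost-first ordering are all assumptions.

```python
def merge_segments(outer, inner):
    """Join two call-stack segments on their longest overlap.

    Each segment is a list of frame names, outermost frame first.
    Returns the merged list, or None when the end of `outer` shares
    no frames with the start of `inner`. Hypothetical helper.
    """
    max_len = min(len(outer), len(inner))
    for size in range(max_len, 0, -1):
        if outer[-size:] == inner[:size]:
            return outer + inner[size:]
    return None


# Two three-level segments with assumed frame names.
a = ["main", "solve", "step"]
b = ["solve", "step", "kernel"]
print(merge_segments(a, b))
```

Real traces would also need to tolerate recursion and ambiguous overlaps, which this sketch ignores.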

    Resurrection and emendation of the Hypoxylaceae, recognised from a multigene phylogeny of the Xylariales

    A multigene phylogeny was constructed, including a significant number of representative species of the main lineages in the Xylariaceae and four DNA loci: the internal transcribed spacer region (ITS), the large subunit (LSU) of the nuclear rDNA, the second largest subunit of the RNA polymerase II (RPB2), and beta-tubulin (TUB2). Specimens were selected based on more than a decade of intensive morphological and chemotaxonomic work, and cautious taxon sampling was performed to cover the major lineages of the Xylariaceae, with emphasis on hypoxyloid species. The comprehensive phylogenetic analysis revealed a clear-cut segregation of the Xylariaceae into several major clades, which was well in accordance with previously established morphological and chemotaxonomic concepts. One of these clades contained Annulohypoxylon, Hypoxylon, Daldinia, and other related genera that have stromatal pigments and a nodulisporium-like anamorph. They are accommodated in the family Hypoxylaceae, which is resurrected and emended. Representatives of genera with a nodulisporium-like anamorph and bipartite stromata, lacking stromatal pigments (i.e. Biscogniauxia, Camillea, and Obolarina) appeared in a clade basal to the xylarioid taxa. As they clustered with Graphostroma platystomum, they are accommodated in the Graphostromataceae. The new genus Jackrogersella with J. multiformis as type species is segregated from Annulohypoxylon. The genus Pyrenopolyporus is resurrected for Hypoxylon polyporus and allied species. The genus Daldinia and its allies Entonaema, Rhopalostroma, Ruwenzoria, and Thamnomyces appeared in two separate subclades, which may warrant further splitting of Daldinia in the future, and even Hypoxylon was divided into several clades. However, more species of these genera need to be studied before a conclusive taxonomic rearrangement can be envisaged. Epitypes were designated for several important species for which living cultures and molecular data are available, in order to stabilise the taxonomy of the Xylariales.
    Fil: Wendt, Lucile. Helmholtz-Zentrum für Infektionsforschung GmbH. Department of Microbial Drugs; Alemania. German Centre for Infection Research; Alemania
    Fil: Sir, Esteban Benjamin. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Tucuman. Unidad Ejecutora Lillo. Fundación Miguel Lillo. Unidad Ejecutora Lillo; Argentina
    Fil: Kuhnert, Eric. Helmholtz-Zentrum für Infektionsforschung GmbH. Department of Microbial Drugs; Alemania. German Centre for Infection Research; Alemania
    Fil: Heitkämper, Simone. Helmholtz-Zentrum für Infektionsforschung GmbH. Department of Microbial Drugs; Alemania. German Centre for Infection Research; Alemania
    Fil: Lambert, Christopher. Helmholtz-Zentrum für Infektionsforschung GmbH. Department of Microbial Drugs; Alemania. German Centre for Infection Research; Alemania
    Fil: Hladki, Adriana I. Fundación Miguel Lillo. Dirección de Botánica. Instituto de Micologia; Argentina
    Fil: Romero, Andrea Irene. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Micología y Botánica. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Micología y Botánica; Argentina
    Fil: Luangsa-Ard, Janet Jennifer. National Center for Genetic Engineering and Biotechnology; Tailandia
    Fil: Srikitikulchai, Prasert. National Center for Genetic Engineering and Biotechnology; Tailandia
    Fil: Peršoh, Derek. Ruhr-Universität Bochum; Alemania
    Fil: Stadler, Marc. Helmholtz-Zentrum für Infektionsforschung GmbH. Department of Microbial Drugs; Alemania. German Centre for Infection Research; Alemani