36 research outputs found

    genomeRxiv : a microbial whole-genome database and diagnostic marker design resource for classification, identification, and data sharing

    genomeRxiv is a newly-funded US-UK collaboration to provide a public, web-accessible database of public genome sequences, accurately catalogued and classified by whole-genome similarity independent of their taxonomic affiliation. Our goal is to supply the basic and applied research community with rapid, precise and accurate identification of unknown isolates based on genome sequence alone, and with molecular tools for environmental analysis. The DNA sequencing revolution enabled the use of cultured and uncultured microorganism genomes for fast and precise identification. However, precise identification is impossible without (1) reference databases that precisely circumscribe classes of microorganisms and label these with their uniquely-shared characteristics, and (2) fast algorithms that can handle the volumes of genome data. Our approach integrates the highly-resolved classification framework of Life Identification Numbers (LINs) with the speed and computational efficiency of sourmash and k-mer hashing algorithms, and the precision and filtering of average nucleotide identity (ANI). We aim to construct a single genome-based indexing scheme that extends from phylum to strain, enabling the unique and consistent placement of any sequenced prokaryote genome. genomeRxiv includes protocols for confidentiality, allowing groups to identify and announce the identities of newly-sequenced organisms without sharing genome data directly. This protects communities working with commercially- and ethically-sensitive organisms (e.g. production engineering strains and potential bioweapons) and enables benefit sharing with indigenous communities. genomeRxiv will also provide online capability to design molecular diagnostic tools for metabarcoding and qPCR, to enable tracking of specific groupings of bacteria directly in the environment.

    Biogeographic distribution of five Antarctic cyanobacteria using large-scale k-mer searching with sourmash branchwater

    Cyanobacteria form diverse communities and are important primary producers in Antarctic freshwater environments, but their geographic distribution patterns in Antarctica and globally are still unresolved. There are, however, few genomes of cultured cyanobacteria from Antarctica available, and therefore metagenome-assembled genomes (MAGs) from Antarctic cyanobacteria microbial mats provide an opportunity to explore the distribution of uncultured taxa. These MAGs also allow comparison with metagenomes of cyanobacteria-enriched communities from a range of habitats, geographic locations, and climates. However, most MAGs do not contain 16S rRNA gene sequences, making a 16S rRNA gene-based biogeography comparison difficult. An alternative technique is to use large-scale k-mer searching to find genomes of interest in public metagenomes. This paper presents the results of k-mer based searches for 5 Antarctic cyanobacteria MAGs from Lake Fryxell and Lake Vanda, assigned the names Phormidium pseudopriestleyi FRX01, Microcoleus sp. MP8IB2.171, Leptolyngbya sp. BulkMat.35, Pseudanabaenaceae cyanobacterium MP8IB2.15, and Leptolyngbyaceae cyanobacterium MP9P1.79, in 498,942 unassembled metagenomes from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). The Microcoleus sp. MP8IB2.171 MAG was found in a wide variety of environments, the P. pseudopriestleyi MAG was found in environments with challenging conditions, the Leptolyngbyaceae cyanobacterium MP9P1.79 MAG was only found in Antarctica, and the Leptolyngbya sp. BulkMat.35 and Pseudanabaenaceae cyanobacterium MP8IB2.15 MAGs were found in Antarctic and other cold environments. The findings based on metagenome matches and global comparisons suggest that these Antarctic cyanobacteria have distinct distribution patterns, ranging from locally restricted to global distribution across the cold biosphere and other climatic zones.

    Context-aware genomic surveillance reveals hidden transmission of a carbapenemase-producing Klebsiella pneumoniae

    Genomic surveillance can inform effective public health responses to pathogen outbreaks. However, integration of non-local data is rarely done. We investigate two large hospital outbreaks of a carbapenemase-carrying Klebsiella pneumoniae strain in Germany and show the value of contextual data. By screening about 10 000 genomes, over 400 000 metagenomes and two culture collections using in silico and in vitro methods, we identify a total of 415 closely related genomes reported in 28 studies. We identify the relationship between the two outbreaks through time-dated phylogeny, including their respective origins. One of the outbreaks presents extensive hidden transmission, with descendant isolates only identified in other studies. We then leverage the genome collection from this meta-analysis to identify genes under positive selection. We thereby identify an inner membrane transporter (ynjC) with a putative role in colistin resistance. Contextual data from other sources can thus enhance local genomic surveillance at multiple levels and should be integrated by default when available.

    The khmer software package: enabling efficient nucleotide sequence analysis

    The khmer package is a freely available software library for working efficiently with fixed length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/
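    The "probabilistic k-mer counting data structure" mentioned above can be illustrated with a Count-Min Sketch. The following is a minimal sketch of the general technique only, not khmer's API or implementation: the class name `CountMinSketch`, the helper `kmers`, the table sizes, and the use of salted BLAKE2b hashes are all assumptions made for illustration. It also ignores reverse complements, which real k-mer counters usually handle.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: may overestimate a count, never underestimates it."""

    def __init__(self, width=1009, depth=4):
        # width and depth are illustrative values, not tuned for real data.
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One salted hash per row, so each row uses a different hash function.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


def kmers(seq, k):
    """Yield every overlapping substring of length k (forward strand only)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


if __name__ == "__main__":
    sketch = CountMinSketch()
    for kmer in kmers("ACGTACGTAC", 4):
        sketch.add(kmer)
    print(sketch.count("ACGT"))  # at least 2, the true count in this sequence
```

    Memory use is fixed by `width` and `depth` regardless of how many distinct k-mers are seen, at the cost of occasional overcounts when k-mers collide in every row.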

    The khmer software package: enabling efficient nucleotide sequence analysis [version 1; referees: 2 approved, 1 approved with reservations]

    The khmer package is a freely available software library for working efficiently with fixed length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/

    Qualifying Examination: Overlap graph-based sequence assembly in bioinformatics

    This is my qualifying examination final report and presentation for the oral exam, presented at Michigan State University on August 29th. It reviews how the formulation of the genome assembly problem has changed since the foundation of genome sequencing, with special attention to overlap graph-based sequence assembly. Three assemblers implementing this theoretical framework are reviewed, and a fourth one based on de Bruijn graphs serves as a comparison, highlighting the methods the approaches share and where they differ. Finally, possible directions for future work are discussed.

    Decentralizing Indices for Genomic Data

    Biology as a field is being transformed by the increasing availability of data, especially genomic sequencing data. Computational methods that can adapt and take advantage of this data deluge are essential for exploring and providing insights for new hypotheses, helping to unveil biological processes that were previously expensive or even impossible to study. This dissertation introduces data structures and approaches for scaling data analysis to hundreds of thousands of DNA sequencing datasets using Scaled MinHash sketches, a reduced space representation of the original datasets that can lower computational requirements for similarity and containment estimation; MHBT and LCA indices, structures for indexing and searching large collections of Scaled MinHash sketches; gather, a new top-down approach for decomposing datasets into a collection of reference components that can be implemented efficiently with Scaled MinHash sketches and MHBT and LCA indices; wort, a distributed system for large scale sketch computation across heterogeneous systems, from laptops to academic clusters and cloud instances, including prototypes for containment searches across millions of datasets; as well as explorations on how to facilitate sharing and increase the resilience of sketch collections built from public genomic data.
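    The idea of a Scaled MinHash sketch used for containment estimation can be shown in a few lines. This is a minimal sketch under stated assumptions, not the dissertation's or sourmash's implementation: the function names, the default `k` and `scaled` values, and the BLAKE2b hash are placeholders chosen for illustration (sourmash itself uses a different hash function and canonical k-mers). A sketch keeps only the k-mer hashes that fall below `2**64 / scaled`, so roughly one in `scaled` distinct k-mers is retained.

```python
import hashlib

HASH_SPACE = 2 ** 64  # size of the 64-bit hash range used below


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, forward strand only)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_minhash(seq, k=21, scaled=10):
    """Keep the hashes of all k-mers in seq that fall below HASH_SPACE / scaled."""
    threshold = HASH_SPACE // scaled
    hashes = (hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < threshold}


def containment(query, reference):
    """Estimated fraction of the query's k-mers that also occur in the reference."""
    if not query:
        return 0.0
    return len(query & reference) / len(query)


def jaccard(a, b):
    """Estimated Jaccard similarity between the two underlying k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    genome = "ACGTTGCAAGCTTAGGCTAACGTTAGCCATGGATCCGATTACAGGCTA"
    fragment = genome[5:40]
    # scaled=1 keeps every hash, which suits a toy sequence this short.
    q = scaled_minhash(fragment, k=11, scaled=1)
    r = scaled_minhash(genome, k=11, scaled=1)
    print(containment(q, r), jaccard(q, r))
```

    Because the same threshold is applied to every dataset, the sketch of a subsequence is always a subset of the sketch of the sequence that contains it, which is what makes containment (and not only similarity) estimable from the sketches alone.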

    castelao/seabird: 0.10

    Python parser for Sea-Bird CTD outputs, usually .cnv files

    Oceansound demonstration

    This is the code for the demonstration of OceanSound, using Flask to create a backend and OpenLayers + MIDI.js in the browser to select lat/lon and play music. Live demo at http://ocean.datasounds.org/

    ImSound - Access sound of 2D dataset with mouse movements.

    Access sound of 2D dataset with mouse movements. Program implemented in Python for free use inside other scripts.