28 research outputs found

    genomeRxiv: a microbial whole-genome database and diagnostic marker design resource for classification, identification, and data sharing

    genomeRxiv is a newly funded US-UK collaboration to provide a public, web-accessible database of public genome sequences, accurately catalogued and classified by whole-genome similarity independently of their taxonomic affiliation. Our goal is to supply the basic and applied research community with rapid, precise, and accurate identification of unknown isolates based on genome sequence alone, and with molecular tools for environmental analysis. The DNA sequencing revolution enabled the use of genomes from cultured and uncultured microorganisms for fast and precise identification. However, precise identification is impossible without (1) reference databases that precisely circumscribe classes of microorganisms and label them with their uniquely shared characteristics, and (2) fast algorithms that can handle the volume of genome data. Our approach integrates the highly resolved classification framework of Life Identification Numbers (LINs) with the speed and computational efficiency of sourmash and k-mer hashing algorithms, and with the precision and filtering of average nucleotide identity (ANI). We aim to construct a single genome-based indexing scheme that extends from phylum to strain, enabling the unique and consistent placement of any sequenced prokaryote genome. genomeRxiv includes protocols for confidentiality, allowing groups to identify and announce the identities of newly sequenced organisms without sharing genome data directly. This protects communities working with commercially and ethically sensitive organisms (e.g., production engineering strains or potential bioweapons) and enables benefit sharing with indigenous communities. genomeRxiv will also provide online capability to design molecular diagnostic tools for metabarcoding and qPCR, enabling the tracking of specific groups of bacteria directly in the environment.
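
    As a rough illustration of the kind of k-mer comparison the service builds on, here is a minimal sketch using sourmash's Python API (with screed for FASTA parsing) to build scaled MinHash sketches for two genomes and estimate their containment. The filenames, k-mer size, and scaled value are hypothetical, and the closing ANI figure is only a simple point estimate derived from containment, not genomeRxiv's classification pipeline.

```python
# Hypothetical inputs: genome_A.fa and genome_B.fa are local FASTA files.
import screed     # FASTA/FASTQ reader installed alongside sourmash
import sourmash

KSIZE = 31

def sketch_genome(path, ksize=KSIZE, scaled=1000):
    """Build a scaled (Frac)MinHash sketch over every sequence in a FASTA file."""
    mh = sourmash.MinHash(n=0, ksize=ksize, scaled=scaled)
    for record in screed.open(path):
        mh.add_sequence(record.sequence, force=True)   # force=True skips non-ACGT characters
    return mh

mh_a = sketch_genome("genome_A.fa")
mh_b = sketch_genome("genome_B.fa")

jaccard = mh_a.jaccard(mh_b)            # symmetric similarity of the two sketches
containment = mh_a.contained_by(mh_b)   # fraction of A's k-mer hashes found in B

# Crude ANI point estimate from k-mer containment (ANI ~ containment^(1/k)).
ani_estimate = containment ** (1.0 / KSIZE)
print(f"Jaccard={jaccard:.3f}  containment={containment:.3f}  ~ANI={ani_estimate:.4f}")
```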

    The khmer software package: enabling efficient nucleotide sequence analysis

    The khmer package is a freely available software library for working efficiently with fixed-length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/
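
    A minimal sketch of the probabilistic k-mer counting piece, using the khmer Python API (Countgraph, consume, get); the reads and table sizes below are purely illustrative, not tuned recommendations.

```python
import khmer

K = 21
# ksize, starting table size, number of tables -- sizes here are just for demonstration.
countgraph = khmer.Countgraph(K, 1_000_000, 4)

reads = [
    "ACGTACGTACGTACGTACGTACGT",
    "TTGCAACGTACGTACGTACGTACC",
]
for read in reads:
    countgraph.consume(read)          # count every k-mer in the read

query = reads[0][:K]                  # first 21-mer of the first read
print(query, countgraph.get(query))   # approximate count; may overestimate, never undercounts
```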

    Qualifying Examination: Overlap graph-based sequence assembly in bioinformatics

    This is my qualifying examination final report and the presentation for the oral exam, given at Michigan State University on August 29th. It reviews how the formulation of the genome assembly problem has changed since the early days of genome sequencing, with special attention to overlap graph-based sequence assembly. Three assemblers implementing this theoretical framework are reviewed, and a fourth, based on de Bruijn graphs, serves as a point of comparison, highlighting the methods shared by and the differences between the two approaches. Finally, possible directions for future work are discussed.
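
    As a toy companion to the report (not code taken from it), the snippet below computes suffix-prefix overlaps between short reads and records them as a directed overlap graph, which is the basic construction underlying overlap-layout-consensus assemblers.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (>= min_len), else 0."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)   # candidate anchor of b's prefix inside a
        if start == -1:
            return 0
        if b.startswith(a[start:]):          # the rest of a matches the start of b
            return len(a) - start
        start += 1

reads = ["ATGGCGT", "GCGTGCA", "GTGCAAT", "CAATTGA"]

# Overlap graph: nodes are reads, directed edges carry the overlap length.
overlap_graph = {}
for a in reads:
    for b in reads:
        if a == b:
            continue
        olen = suffix_prefix_overlap(a, b)
        if olen:
            overlap_graph[(a, b)] = olen

for (a, b), olen in sorted(overlap_graph.items(), key=lambda kv: -kv[1]):
    print(f"{a} -> {b}  overlap = {olen}")
```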

    Decentralizing Indices for Genomic Data

    Biology as a field is being transformed by the increasing availability of data, especially genomic sequencing data. Computational methods that can adapt to and take advantage of this data deluge are essential for exploring and providing insights for new hypotheses, helping to unveil biological processes that were previously expensive or even impossible to study. This dissertation introduces data structures and approaches for scaling data analysis to hundreds of thousands of DNA sequencing datasets: Scaled MinHash sketches, a reduced-space representation of the original datasets that lowers the computational requirements for similarity and containment estimation; MHBT and LCA indices, structures for indexing and searching large collections of Scaled MinHash sketches; gather, a new top-down approach for decomposing datasets into a collection of reference components that can be implemented efficiently with Scaled MinHash sketches and the MHBT and LCA indices; and wort, a distributed system for large-scale sketch computation across heterogeneous systems, from laptops to academic clusters and cloud instances, including prototypes for containment searches across millions of datasets. It also explores how to facilitate sharing and increase the resilience of sketch collections built from public genomic data.
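
    The following is a from-scratch illustration of two of these ideas, Scaled MinHash (FracMinHash) sketching and a greedy, gather-style decomposition. It is a simplified sketch of the concepts rather than the dissertation's implementation (which lives in sourmash), and the hash function, k-mer size, and scaled value are placeholders.

```python
import hashlib

MAX_HASH = 2 ** 64

def kmer_hash(kmer):
    """Stable 64-bit hash of a k-mer (md5-based here purely for portability)."""
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def scaled_sketch(seq, ksize=21, scaled=100):
    """Keep only hashes below MAX_HASH / scaled: a fixed fraction of all k-mers."""
    threshold = MAX_HASH // scaled
    hashes = set()
    for i in range(len(seq) - ksize + 1):
        h = kmer_hash(seq[i:i + ksize])
        if h < threshold:
            hashes.add(h)
    return hashes

def containment(query, reference):
    """Fraction of the query sketch's hashes that also appear in the reference sketch."""
    return len(query & reference) / len(query) if query else 0.0

def gather(query, references):
    """Greedy, gather-style decomposition: repeatedly pick the reference sketch that
    covers the most still-unassigned query hashes, until nothing more matches."""
    remaining = set(query)
    assignments = []
    while remaining and references:
        name, sketch = max(references.items(), key=lambda kv: len(kv[1] & remaining))
        covered = sketch & remaining
        if not covered:
            break
        assignments.append((name, len(covered)))
        remaining -= covered
    return assignments
```

    The key property is that the sketch keeps a fixed fraction (1/scaled) of all k-mer hashes, so set operations on sketches approximate the same operations on the full k-mer sets at a fraction of the cost.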

    ImSound - Access sound of 2D dataset with mouse movements.

    Access the sound of a 2D dataset through mouse movements. The program is implemented in Python and is free to use inside other scripts.

    DataSounds - Sonification of temporal series.

    Generate sounds from time series or other sequential data.
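
    In the same spirit (though not the DataSounds API), a generic sonification sketch: map a numeric series onto a pentatonic scale and write the result as a 16-bit mono WAV using numpy and the standard-library wave module.

```python
import wave
import numpy as np

def sonify(series, path="series.wav", rate=44100, note_dur=0.25):
    """Map each value in the series to a pentatonic note and write a WAV file."""
    # C-major pentatonic frequencies (Hz) spanning two octaves.
    scale = [261.63, 293.66, 329.63, 392.00, 440.00,
             523.25, 587.33, 659.25, 783.99, 880.00]
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0
    t = np.linspace(0, note_dur, int(rate * note_dur), endpoint=False)
    notes = []
    for value in series:
        idx = int((value - lo) / span * (len(scale) - 1))   # value -> scale degree
        notes.append(0.5 * np.sin(2 * np.pi * scale[idx] * t))
    pcm = (np.concatenate(notes) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(pcm.tobytes())

sonify([3.1, 3.4, 2.9, 4.8, 5.2, 4.4, 3.0, 2.2])
```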

    Streamlining data-intensive biology with workflow systems

    As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these practices in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.
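
    As a concrete (hypothetical) example of what such a workflow looks like, the Snakefile below, in Snakemake's Python-based rule syntax, chains a read-trimming step (fastp) into a sketching step (sourmash); the sample names and paths are placeholders.

```
# Snakefile -- hypothetical two-step pipeline: quality-trim reads, then sketch them.
SAMPLES = ["sample1", "sample2"]

rule all:
    input:
        expand("sketches/{sample}.sig", sample=SAMPLES)

rule trim:
    input:
        "data/{sample}.fastq.gz"
    output:
        "trimmed/{sample}.fastq.gz"
    threads: 4
    shell:
        "fastp --thread {threads} -i {input} -o {output}"

rule sketch:
    input:
        "trimmed/{sample}.fastq.gz"
    output:
        "sketches/{sample}.sig"
    shell:
        "sourmash sketch dna -p k=31,scaled=1000 -o {output} {input}"
```

    Running snakemake --cores 8 builds only the outputs that are missing or out of date; the declared inputs and outputs are what let the workflow system track intermediate files, parallelize independent samples, and rerun the affected steps when a tool or parameter changes.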