632 research outputs found

    These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

    K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer
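    As an illustration of the data structure named in this abstract, the following is a minimal Count-Min Sketch for k-mer counting. It is a sketch under stated assumptions, not khmer's implementation: the class name, the table dimensions, and the use of SHA-256 with a per-row prefix as the hash family are all hypothetical choices made for readability. It shows the two properties the abstract describes: only counts are stored (never the k-mers themselves), and a reported count can exceed the true count but never fall below it.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch; sizes and hashing are illustrative assumptions."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function; k-mers are never stored.
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # Derive one counter index per row by prefixing the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def get(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._slots(item))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


sketch = CountMinSketch()
for kmer in kmers("ACGTACGT", 4):
    sketch.add(kmer)

# "ACGT" occurs twice in the read; the estimate is at least the true count.
print(sketch.get("ACGT"))
```

    Because counters are updated in place, counts can be queried at any point while reads are still streaming in, which is what makes the structure usable online.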

    A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data

    Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified single-cell genomes, and metagenomes has enabled investigation of a wide range of organisms and ecosystems. However, sampling variation in short-read data sets and high sequencing error rates of modern sequencers present many new computational challenges in data interpretation. These challenges have led to the development of new classes of mapping tools and de novo assemblers. These algorithms are challenged by the continued improvement in sequencing throughput. Here we describe digital normalization, a single-pass computational algorithm that systematizes coverage in shotgun sequencing data sets, thereby decreasing sampling variation, discarding redundant data, and removing the majority of errors. Digital normalization substantially reduces the size of shotgun data sets and decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting content of the generated contigs. We apply digital normalization to the assembly of microbial genomic data, amplified single-cell genomic data, and transcriptomic data. Our implementation is freely available for use and modification
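    The single-pass idea in this abstract can be sketched roughly as follows. This is an illustrative simplification rather than the authors' implementation: it assumes a read's coverage is estimated by the median count of its k-mers among the reads kept so far, and that a read is discarded once that estimate reaches a cutoff. The function name, the exact `Counter` used in place of a probabilistic counter, and the tiny `k` and `cutoff` values are hypothetical choices for the example.

```python
from collections import Counter
from statistics import median


def normalize_reads(reads, k=4, cutoff=2):
    """Keep a read only while its estimated coverage is below `cutoff`."""
    counts = Counter()  # exact counts here; a sketch could be swapped in
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate from
        # Median k-mer count among already-kept reads estimates coverage.
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
        # Otherwise the read is treated as redundant and dropped.
    return kept


reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
kept = normalize_reads(reads)
print(len(reads), "reads in,", len(kept), "reads kept")
```

    Each read is examined once and the decision depends only on reads already seen, so no reference genome or second pass is needed in this simplified form.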

    Challenges and opportunities in understanding microbial communities with metagenome assembly (accompanied by IPython Notebook tutorial)

    Metagenomic investigations hold great promise for informing the genetics, physiology, and ecology of environmental microorganisms. Current challenges for metagenomic analysis are related to our ability to connect the dots between sequencing reads, their population of origin, and their encoding functions. Assembly-based methods reduce dataset size by extending overlapping reads into larger contiguous sequences (contigs), providing contextual information for genetic sequences that does not rely on existing references. These methods, however, tend to be computationally intensive and are again challenged by sequencing errors as well as by genomic repeats. While numerous tools have been developed based on these methodological concepts, they present confounding choices and training requirements to metagenomic investigators. To help with accessibility to assembly tools, this review also includes an IPython Notebook metagenomic assembly tutorial. This tutorial has instructions for execution on any operating system using Amazon Elastic Cloud Compute and guides users through downloading, assembly, and mapping reads to contigs of a mock microbiome metagenome. Despite its challenges, metagenomic analysis has already revealed novel insights into many environments on Earth. As software, training, and data continue to emerge, metagenomic data access and its discoveries will continue to grow

    Assembling large, complex environmental metagenomes

    The large volumes of sequencing data required to sample complex environments deeply pose new challenges to sequence analysis approaches. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires significant computational resources. We apply two pre-assembly filtering approaches, digital normalization and partitioning, to make large metagenome assemblies more computationally tractable. Using a human gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes from matched Iowa corn and native prairie soils. The predicted functional content and phylogenetic origin of the assembled contigs indicate significant taxonomic differences despite similar function. The assembly strategies presented are generic and can be extended to any metagenome; full source code is freely available under a BSD license. Comment: Includes supporting information

    Avalanches and the Distribution of Reconnection Events in Magnetized Circumstellar Disks

    Cosmic rays produced by young stellar objects can potentially alter the ionization structure, heating budget, chemical composition, and accretion activity in circumstellar disks. The inner edges of these disks are truncated by strong magnetic fields, which can reconnect and produce flaring activity that accelerates cosmic radiation. The resulting cosmic rays can provide a source of ionization and produce spallation reactions that alter the composition of planetesimals. This reconnection and particle acceleration are analogous to the physical processes that produce flaring in and heating of stellar coronae. Flaring events on the surface of the Sun exhibit a power-law distribution of energy, reminiscent of those measured for earthquakes and avalanches. Numerical lattice-reconnection models are capable of reproducing the observed power-law behavior of solar flares under the paradigm of self-organized criticality. One interpretation of these experiments is that the solar corona maintains a nonlinear attractor (or "critical") state by balancing energy input via braided magnetic fields and output via reconnection events. Motivated by these results, we generalize the lattice-reconnection formalism for applications in the truncation region of magnetized disks. Our numerical experiments demonstrate that these nonlinear dynamical systems are capable of both attaining and maintaining criticality in the presence of Keplerian shear and other complications. The resulting power-law spectrum of flare energies in the equilibrium attractor state is found to be nearly universal in magnetized disks. This finding indicates that magnetic reconnection and flaring in the inner regions of circumstellar disks occur in a manner similar to activity on stellar surfaces

    Probing the Catalytic Roles of Arg548 and Gln552 in the Carboxyl Transferase Domain of the Rhizobium etli Pyruvate Carboxylase by Site-directed Mutagenesis

    The roles of Arg548 and Gln552 residues in the active site of the carboxyl transferase domain of Rhizobium etli pyruvate carboxylase were investigated using site-directed mutagenesis. Mutation of Arg548 to alanine or glutamine resulted in the destabilization of the quaternary structure of the enzyme, suggesting that this residue has a structural role. Mutations R548K, Q552N, and Q552A resulted in a loss of the ability to catalyze pyruvate carboxylation, biotin-dependent decarboxylation of oxaloacetate, and the exchange of protons between pyruvate and water. These mutants retained the ability to catalyze reactions that occur at the active site of the biotin carboxylase domain, i.e., bicarbonate-dependent ATP cleavage and ADP phosphorylation by carbamoyl phosphate. The effects of oxamate on the catalysis in the biotin carboxylase domain by the R548K and Q552N mutants were similar to those on the catalysis of reactions by the wild-type enzyme. However, the presence of oxamate had no effect on the reactions catalyzed by the Q552A mutant. We propose that Arg548 and Gln552 facilitate the binding of pyruvate and the subsequent transfer of protons between pyruvate and biotin in the partial reaction catalyzed in the active site of the carboxyl transferase domain of Rhizobium etli pyruvate carboxylase