    Multiple Comparative Metagenomics using Multiset k-mer Counting

    Background. Large-scale metagenomic projects aim to extract biodiversity knowledge between different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomic or functional assignment rely on a small subset of the sequences that can be associated with known organisms. On the other hand, de novo methods, which compare the whole sets of sequences, either do not scale up to ambitious metagenomic projects or do not provide precise and exhaustive results. Methods. These limitations motivated the development of a new de novo metagenomic comparative method, called Simka. This method computes a large collection of standard ecological distances by replacing species counts with k-mer counts. Simka scales up to today's metagenomic projects thanks to a new parallel k-mer counting strategy on multiple datasets. Results. Experiments on public Human Microbiome Project datasets demonstrate that Simka captures the essential underlying biological structure. Simka was able to compute in a few hours both qualitative and quantitative ecological distances on hundreds of metagenomic samples (690 samples, 32 billion reads). We also demonstrate that analyzing metagenomes at the k-mer level is highly correlated with extremely precise de novo comparison techniques that rely on an all-versus-all sequence alignment strategy or that are based on taxonomic profiling.
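    The core idea above, replacing species counts with k-mer counts inside a standard ecological distance, can be illustrated with a minimal sketch. The `kmer_counts` and `bray_curtis` helpers below are hypothetical stand-ins for illustration, not Simka's parallel multi-dataset implementation.

```python
from collections import Counter

def kmer_counts(seq, k):
    # Hypothetical helper: count every substring of length k in one sequence.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def bray_curtis(c1, c2):
    # Quantitative Bray-Curtis dissimilarity computed over k-mer count vectors
    # instead of species abundance vectors: 1 - 2*shared / (total1 + total2).
    shared = sum(min(c1[m], c2[m]) for m in c1.keys() & c2.keys())
    total = sum(c1.values()) + sum(c2.values())
    return 1 - 2 * shared / total

a = kmer_counts("ACGTACGT", 4)
b = kmer_counts("ACGTTTGT", 4)
d = bray_curtis(a, b)   # 0 for identical samples, 1 for disjoint k-mer content
```

    Any of the other standard distances Simka computes (Jaccard, Chao, etc.) can be substituted for `bray_curtis` over the same count vectors.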

    Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU-Support

    A basic task in bioinformatics is the counting of k-mers in genome strings. The k-mer counting problem is to build a histogram of all substrings of length k in a given genome sequence. We present the open source k-mer counting software Gerbil that has been designed for the efficient counting of k-mers for k ≥ 32. Given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important. While existing k-mer counting tools suffer from excessive memory consumption or degrading performance for large k, Gerbil is able to efficiently support large k without much loss of performance. Our software implements a two-disk approach. In the first step, DNA reads are loaded from disk and distributed to temporary files that are stored on a working disk. In a second step, the temporary files are read again, split into k-mers and counted via a hash table approach. In addition, Gerbil can optionally use GPUs to accelerate the counting step. For large k, we outperform state-of-the-art open source k-mer counting tools for large genome data sets. Comment: A short version of this paper will appear in the proceedings of WABI 201
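    The two-step pipeline described above (distribute k-mers into temporary partitions, then count each partition independently with a hash table) can be sketched in miniature. The in-memory buckets below are a stand-in for Gerbil's temporary files on the working disk; all names are illustrative, not Gerbil's API.

```python
from collections import defaultdict

def partition_kmers(reads, k, n_parts):
    # Step 1: distribute k-mers into buckets by hash, so each bucket can
    # later be counted on its own (in Gerbil, buckets are files on disk).
    parts = [[] for _ in range(n_parts)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            parts[hash(kmer) % n_parts].append(kmer)
    return parts

def count_partition(part):
    # Step 2: count one bucket with a plain hash table.
    counts = defaultdict(int)
    for kmer in part:
        counts[kmer] += 1
    return counts

parts = partition_kmers(["ACGTACGT", "ACGTT"], 4, 3)
merged = defaultdict(int)
for part in parts:
    for kmer, n in count_partition(part).items():
        merged[kmer] += n
```

    Because a given k-mer always hashes to the same bucket, the per-bucket counts are globally correct and the buckets can be processed one at a time, which is what bounds peak memory.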

    These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

    K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
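    The Count-Min Sketch behavior described above (online updates, counts that can be overestimated but never underestimated, and no storage of the k-mers themselves) can be shown with a minimal sketch. This toy class is an illustration of the data structure, not khmer's C++ implementation.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: depth independent hash rows over a
    fixed-width counter table; returned counts are >= the true count."""

    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One derived hash function per row (salted SHA-256 for simplicity).
        h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        # Online update: increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def count(self, item):
        # The minimum across rows bounds the overcount from collisions.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch(width=997, depth=4)
for _ in range(3):
    cms.add("ACGT")
cms.add("CGTA")
```

    Note that only counters are stored; asking for a k-mer that was never added can still return a nonzero value, which is the systematic overcount the abstract refers to.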

    Indexing arbitrary-length k-mers in sequencing reads

    We propose a lightweight data structure for indexing and querying collections of NGS read data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive with existing algorithms in space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.
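    The counting-and-locating interface mentioned above can be illustrated with a plain suffix array over a text, queried by binary search. This is a generic sketch of suffix-array k-mer location, not PgSA's pseudogenome construction (which first merges overlapping reads into one text); both helpers are illustrative.

```python
def suffix_array(text):
    # Naive O(n^2 log n) construction; fine for a sketch.
    return sorted(range(len(text)), key=lambda i: text[i:])

def locate(text, sa, pattern):
    """Return sorted start positions of pattern via two binary searches
    for the block of suffixes that begin with pattern."""
    n, m = len(text), len(pattern)
    lo, hi = 0, n
    while lo < hi:                       # lower bound of the block
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, n
    while lo < hi:                       # upper bound of the block
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "ACGTACGT"
sa = suffix_array(text)
hits = locate(text, sa, "ACGT")   # counting is len(hits), locating is hits
```

    The same index answers both query types for k-mers of arbitrary length, which is the interface property the abstract emphasizes.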

    A double-lined spectroscopic orbit for the young star HD 34700

    We report high-resolution spectroscopic observations of the young star HD 34700, which confirm it to be a double-lined spectroscopic binary. We derive an accurate orbital solution with a period of 23.4877 +/- 0.0013 days and an eccentricity of e = 0.2501 +/- 0.0068. The stars are found to be of similar mass (M2/M1 = 0.987 +/- 0.014) and luminosity. We also derive the effective temperatures (5900 K and 5800 K) and projected rotational velocities (28 km/s and 22 km/s) of the components. These values of v sin i are much higher than expected for main-sequence stars of similar spectral type (G0), and are not due to tidal synchronization. We also discuss the indicators of youth available for the object. Although there is considerable evidence that the system is young -- strong infrared excess, X-ray emission, Li I 6708 absorption (0.17 Angstroms equivalent width), H alpha emission (0.6 Angstroms), rapid rotation -- the precise age cannot yet be established because the distance is unknown. Comment: 17 pages, including 2 figures and 2 tables. Accepted for publication in AJ, to appear in February 200

    Indexing large genome collections on a PC

    Motivation: The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with existing algorithms due to their large memory requirements. Results: We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in both exact and approximate matching models, against a collection of thousand(s) of genomes. Its unique feature is a small index size, fitting in a standard computer with 16-32 GB, or even 8 GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, exact matching queries are handled in an average time of 39 μs, and queries with up to 3 mismatches in 373 μs, on the test PC with an index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 μs and 917 μs. Availability: Software and Supplementary material: http://sun.aei.polsl.pl/mugi
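    The query interface described above, reporting all occurrences of a pattern across a genome collection, can be sketched with a toy seed-and-verify index: seed on the pattern's first k-mer, then verify the full match at each candidate position. This is a hypothetical illustration of the query model only; MuGI's actual compressed index is far more space-efficient.

```python
def build_index(genomes, k):
    # Toy multi-genome index: k-mer -> list of (genome_id, position).
    index = {}
    for gid, g in enumerate(genomes):
        for i in range(len(g) - k + 1):
            index.setdefault(g[i:i + k], []).append((gid, i))
    return index

def exact_query(index, genomes, pattern, k):
    # Seed with the pattern's first k-mer, then verify the full pattern.
    hits = []
    for gid, pos in index.get(pattern[:k], []):
        if genomes[gid][pos:pos + len(pattern)] == pattern:
            hits.append((gid, pos))
    return hits

genomes = ["ACGTACGT", "TTACGTTT"]
index = build_index(genomes, 4)
hits = exact_query(index, genomes, "ACGTA", 4)
```

    Approximate matching (the up-to-3-mismatch queries the abstract reports) would replace the equality check in the verify step with a bounded Hamming-distance comparison.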

    Cuspy Dark-Matter Haloes and the Galaxy

    The microlensing optical depth to Baade's Window constrains the minimum total mass in baryonic matter within the Solar circle to be greater than 3.9 x 10^{10} solar masses, assuming the inner Galaxy is barred with a viewing angle of roughly 20 degrees. From the kinematics of solar neighbourhood stars, the local surface density of dark matter is about 30 +/- 15 solar masses per square parsec. We construct cuspy haloes normalised to the local dark matter density and calculate the circular-speed curve of the halo in the inner Galaxy. This is added in quadrature to the rotation curve provided by the stellar and ISM discs, together with a bar sufficiently massive that the baryonic matter in the inner Galaxy reproduces the microlensing optical depth. Such models violate the observational constraint provided by the tangent-velocity data in the inner Galaxy (typically at radii 2-4 kpc). The high baryonic contribution required by the microlensing is consistent with implications from hydrodynamical modelling and the pattern speed of the Galactic bar. We conclude that the cuspy haloes favoured by the Cold Dark Matter cosmology (and its variants) are inconsistent with the observational data on the Galaxy. Comment: 5 pages, 1 figure, MNRAS (submitted
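    The "added in quadrature" step above follows from the circular-speed contributions of independent mass components summing as v_tot^2 = sum_i v_i^2. A one-line sketch, with purely illustrative (not fitted) speeds:

```python
import math

def total_circular_speed(components):
    # Circular speeds of independent mass components (halo, discs, bar)
    # add in quadrature: v_tot = sqrt(sum of v_i^2).
    return math.sqrt(sum(v * v for v in components))

# Illustrative contributions in km/s at some inner-Galaxy radius
# (hypothetical numbers, not values from the paper).
v_tot = total_circular_speed([180.0, 120.0, 60.0])
```

    The paper's test is whether such a summed curve stays below the tangent-velocity data at radii of 2-4 kpc once the bar is massive enough to reproduce the microlensing optical depth.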