Multiple Comparative Metagenomics using Multiset k-mer Counting
Background. Large-scale metagenomic projects aim to extract biodiversity
knowledge across different environmental conditions. Current methods for
comparing microbial communities face important limitations. Those based on
taxonomic or functional assignment rely on the small subset of sequences
that can be associated with known organisms. De novo methods, which compare
the whole sets of sequences, either do not scale up to ambitious
metagenomic projects or do not provide precise and exhaustive results.
Methods. These limitations motivated the development of a new de novo
metagenomic comparative method, called Simka. This method computes a large
collection of standard ecological distances by replacing species counts with
k-mer counts. Simka scales up to today's metagenomic projects thanks to a new
strategy for counting k-mers in parallel across multiple datasets.
Results. Experiments on public Human Microbiome Project datasets demonstrate
that Simka captures the essential underlying biological structure. Simka was
able to compute, in a few hours, both qualitative and quantitative ecological
distances on hundreds of metagenomic samples (690 samples, 32 billion
reads). We also demonstrate that k-mer-level comparison of metagenomes
correlates strongly with extremely precise de novo comparison techniques that
rely on all-versus-all sequence alignment or on taxonomic profiling.
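The core idea, substituting k-mer counts for species counts inside a standard ecological distance, can be sketched in a few lines. This is an illustrative toy, not Simka's parallel implementation; the quantitative Bray-Curtis form used here is just one of the many distances such a method can compute:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all k-mers (length-k substrings) of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def bray_curtis(c1, c2):
    """Quantitative Bray-Curtis dissimilarity between two count vectors.

    With species counts this is the classical ecological distance; here
    the "species" are simply k-mers.
    """
    shared = sum(min(c1[m], c2[m]) for m in c1.keys() & c2.keys())
    total = sum(c1.values()) + sum(c2.values())
    return 1.0 - 2.0 * shared / total

# Two toy "samples": identical sequences give distance 0,
# unrelated ones approach 1.
a = kmer_counts("ACGTACGTAC", 4)
b = kmer_counts("ACGTTTGTAC", 4)
d = bray_curtis(a, b)
```

A real tool must count k-mers for hundreds of samples simultaneously and stream them from disk; the distance computation itself, however, reduces to exactly this kind of per-k-mer min/sum aggregation.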
Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU Support
A basic task in bioinformatics is the counting of k-mers in genome strings.
The k-mer counting problem is to build a histogram of all substrings of
length k in a given genome sequence. We present the open-source k-mer
counting software Gerbil, designed for the efficient counting of
k-mers for large k. Given the technology trend towards long reads from
next-generation sequencers, support for large k becomes increasingly
important. While existing k-mer counting tools suffer from excessive memory
consumption or degrading performance for large k, Gerbil is able to
support large k efficiently without much loss of performance. Our software
implements a two-disk approach. In the first step, DNA reads are loaded from
disk and distributed to temporary files stored on a working disk. In a
second step, the temporary files are read again, split into k-mers, and
counted via a hash-table approach. In addition, Gerbil can optionally use GPUs
to accelerate the counting step. For large k, we outperform state-of-the-art
open-source k-mer counting tools on large genome data sets.
Comment: A short version of this paper will appear in the proceedings of WABI
201
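The two-disk approach described above can be illustrated with a toy sketch: phase 1 partitions k-mers into temporary files by hash; phase 2 re-reads each partition and counts it with an in-memory hash table. File naming and the partition count are hypothetical, and the real tool works with packed binary encodings, buffering, and optional GPU acceleration:

```python
import os
import tempfile
from collections import Counter

def two_disk_count(reads, k, parts=4):
    """Toy two-phase k-mer counter in the spirit of a two-disk approach."""
    tmpdir = tempfile.mkdtemp()
    files = [open(os.path.join(tmpdir, f"part{i}"), "w") for i in range(parts)]
    # Phase 1: stream reads from disk, distribute k-mers to partition files
    # on the working disk, keyed by a hash of the k-mer.
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            files[hash(kmer) % parts].write(kmer + "\n")
    for f in files:
        f.close()
    # Phase 2: re-read each partition and count it via a hash table.
    # Each k-mer lands in exactly one partition, so partitions are
    # independent and can be processed one at a time (or in parallel).
    counts = Counter()
    for i in range(parts):
        path = os.path.join(tmpdir, f"part{i}")
        with open(path) as f:
            counts.update(line.strip() for line in f)
        os.remove(path)
    os.rmdir(tmpdir)
    return counts

c = two_disk_count(["ACGTACGT", "TTTACGTA"], 4)
```

The point of the partitioning step is that peak memory is bounded by the largest single partition rather than by the full k-mer spectrum.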
These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure
K-mer abundance analysis is widely used for many purposes in nucleotide
sequence analysis, including data preprocessing for de novo assembly, repeat
detection, and sequencing coverage estimation. We present the khmer software
package for fast and memory-efficient online counting of k-mers in sequencing
data sets. Unlike previous methods based on data structures such as hash
tables, suffix arrays, and trie structures, khmer relies entirely on a simple
probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits
online updating and retrieval of k-mer counts in memory, which is necessary to
support online k-mer analysis algorithms. On sparse data sets this data
structure is considerably more memory efficient than any exact data structure.
In exchange, the use of a Count-Min Sketch introduces a systematic overcount
for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we
analyze the speed, the memory usage, and the miscount rate of khmer for
generating k-mer frequency distributions and retrieving k-mer counts for
individual k-mers. We also compare the performance of khmer to several other
k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC,
Turtle and KAnalyze. Finally, we examine the effectiveness of profiling
sequencing error, k-mer abundance trimming, and digital normalization of reads
in the context of high khmer false positive rates. khmer is implemented in C++
and wrapped in a Python interface, offers a tested and robust API, and is freely
available under the BSD license at github.com/ged-lab/khmer.
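A minimal Count-Min Sketch, the data structure khmer relies on, can be written in a few lines. This is a generic sketch of the structure, not khmer's implementation; the hash construction (salted BLAKE2) is an arbitrary choice for illustration:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: `depth` hash rows of `width` counters.

    Counts are never undercounted; hash collisions cause the systematic
    overcount mentioned in the abstract. Only counters are stored, never
    the k-mers themselves, so memory use is fixed up front.
    """

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One independent-ish hash per row, derived by salting BLAKE2.
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(),
                                salt=row.to_bytes(8, "big")).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, item):
        for row, col in self._indexes(item):
            self.table[row][col] += 1

    def count(self, item):
        # The minimum over rows is an upper bound on the true count.
        return min(self.table[row][col] for row, col in self._indexes(item))

cms = CountMinSketch()
for kmer in ["ACGT", "ACGT", "CGTA"]:
    cms.add(kmer)
```

Because updates touch a fixed number of counters, the structure supports the online (streaming) counting the abstract describes, at the price of one-sided error.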
Indexing arbitrary-length k-mers in sequencing reads
We propose a lightweight data structure for indexing and querying collections
of NGS read data in main memory. The data structure supports the interface
proposed in the pioneering work by Philippe et al. for counting and locating
k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array),
based on finding overlapping reads, is competitive with existing algorithms
in space use, query times, or both. The main applications of our index
include variant calling, error correction, and analysis of reads from RNA-seq
experiments.
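The count/locate interface for k-mers in a read collection can be mimicked with a plain hash index. Note that this toy uses a dictionary rather than PgSA's pseudogenome suffix array, so it illustrates the query interface, not the space-efficient solution:

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Toy count/locate index: maps each k-mer to its occurrences,
    recorded as (read id, offset) pairs."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((rid, pos))
    return index

idx = build_kmer_index(["ACGTAC", "GTACGT"], 3)
hits = idx["TAC"]        # locate: every (read, position) occurrence
n = len(hits)            # count: number of occurrences
```

A dictionary of explicit k-mers must be rebuilt per k; the appeal of a suffix-array-style index is that it answers these queries for arbitrary-length k-mers from a single structure.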
A double-lined spectroscopic orbit for the young star HD 34700
We report high-resolution spectroscopic observations of the young star HD
34700, which confirm it to be a double-lined spectroscopic binary. We derive an
accurate orbital solution with a period of 23.4877 +/- 0.0013 days and an
eccentricity of e = 0.2501 +/- 0.0068. The stars are found to be of similar
mass (M2/M1 = 0.987 +/- 0.014) and luminosity. We also derive the effective
temperatures (5900 K and 5800 K) and projected rotational velocities (28 km/s
and 22 km/s) of the components. These values of v sin i are much higher than
expected for main-sequence stars of similar spectral type (G0), and are not due
to tidal synchronization. We also discuss the indicators of youth available for
the object. Although there is considerable evidence that the system is young
--strong infrared excess, X-ray emission, Li I 6708 absorption (0.17 Angstroms
equivalent width), H alpha emission (0.6 Angstroms), rapid rotation-- the
precise age cannot yet be established because the distance is unknown.
Comment: 17 pages, including 2 figures and 2 tables. Accepted for publication
in AJ, to appear in February 200
Indexing large genome collections on a PC
Motivation: The availability of thousands of individual genomes of one species
should boost rapid progress in personalized medicine or understanding of the
interaction between genotype and phenotype, to name a few applications. A key
operation useful in such analyses is aligning sequencing reads against a
collection of genomes, which is costly with the use of existing algorithms due
to their large memory requirements.
Results: We present MuGI, Multiple Genome Index, which reports all
occurrences of a given pattern, in exact and approximate matching model,
against a collection of thousand(s) genomes. Its unique feature is the small
index size fitting in a standard computer with 16--32\,GB, or even 8\,GB, of
RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is
also fast. For example, exact matching queries are handled in an average time
of 39\,s and with up to 3 mismatches in 373\,s on the test PC with
the index size of 13.4\,GB. For a smaller index, occupying 7.4\,GB in memory,
the respective times grow to 76\,s and 917\,s.
Availability: Software and Supplementary material:
\url{http://sun.aei.polsl.pl/mugi}
Cuspy Dark-Matter Haloes and the Galaxy
The microlensing optical depth to Baade's Window constrains the minimum total
mass in baryonic matter within the Solar circle to be greater than 3.9 x
10^{10} solar masses, assuming the inner Galaxy is barred with viewing angle of
roughly 20 degrees. From the kinematics of solar neighbourhood stars, the local
surface density of dark matter is about 30 +/- 15 solar masses per square
parsec. We construct cuspy haloes normalised to the local dark matter density
and calculate the circular-speed curve of the halo in the inner Galaxy. This is
added in quadrature to the rotation curve provided by the stellar and ISM
discs, together with a bar sufficiently massive so that the baryonic matter in
the inner Galaxy reproduces the microlensing optical depth. Such models violate
the observational constraint provided by the tangent-velocity data in the inner
Galaxy (typically at radii 2-4 kpc). The high baryonic contribution required by
the microlensing is consistent with implications from hydrodynamical modelling
and the pattern speed of the Galactic bar. We conclude that the cuspy haloes
favoured by the Cold Dark Matter cosmology (and its variants) are inconsistent
with the observational data on the Galaxy.
Comment: 5 pages, 1 figure, MNRAS (submitted)
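The quadrature addition of rotation-curve components described above can be written out explicitly; the symbols below are assumed for illustration and do not appear in the original:

```latex
v_c^2(R) \;=\; v_{\mathrm{halo}}^2(R) \;+\; v_{\mathrm{disc}}^2(R) \;+\; v_{\mathrm{bar}}^2(R),
```

where $v_{\mathrm{disc}}$ combines the stellar and ISM discs. The argument is then that once the baryonic terms are normalised to reproduce the microlensing optical depth, the cuspy-halo term pushes the total $v_c(R)$ above the tangent-velocity data at radii of roughly 2-4 kpc.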