Multiple Comparative Metagenomics using Multiset k-mer Counting
Background. Large-scale metagenomic projects aim to extract biodiversity
knowledge across different environmental conditions. Current methods for
comparing microbial communities face important limitations. Those based on
taxonomic or functional assignment rely on the small subset of sequences
that can be associated with known organisms. De novo methods, which compare
the whole sets of sequences, either do not scale up to ambitious
metagenomic projects or do not provide precise and exhaustive results.
Methods. These limitations motivated the development of a new de novo
metagenomic comparative method, called Simka. This method computes a large
collection of standard ecological distances by replacing species counts with
k-mer counts. Simka scales up to today's metagenomic projects thanks to a new
strategy for counting k-mers in parallel across multiple datasets.
Results. Experiments on public Human Microbiome Project datasets demonstrate
that Simka captures the essential underlying biological structure. Simka was
able to compute, in a few hours, both qualitative and quantitative ecological
distances on hundreds of metagenomic samples (690 samples, 32 billion
reads). We also demonstrate that k-mer-level comparison of metagenomes
correlates strongly with extremely precise de novo comparison techniques that
rely on all-versus-all sequence alignment or on taxonomic profiling.
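The core idea, substituting k-mer counts for species counts inside a standard ecological distance, can be sketched in a few lines. This is an illustrative toy, not Simka's parallel implementation; the quantitative Bray-Curtis form used here is just one of the many distances such a method can compute:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all k-mers (length-k substrings) of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def bray_curtis(c1, c2):
    """Quantitative Bray-Curtis dissimilarity between two count vectors.

    With species counts this is the classical ecological distance; here
    the "species" are simply k-mers.
    """
    shared = sum(min(c1[m], c2[m]) for m in c1.keys() & c2.keys())
    total = sum(c1.values()) + sum(c2.values())
    return 1.0 - 2.0 * shared / total

# Two toy "samples": identical sequences give distance 0,
# unrelated ones approach 1.
a = kmer_counts("ACGTACGTAC", 4)
b = kmer_counts("ACGTTTGTAC", 4)
d = bray_curtis(a, b)
```

A real tool must count k-mers for hundreds of samples simultaneously and stream them from disk; the distance computation itself, however, reduces to exactly this kind of per-k-mer min/sum aggregation.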
Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU Support
A basic task in bioinformatics is the counting of k-mers in genome strings.
The k-mer counting problem is to build a histogram of all substrings of
length k in a given genome sequence. We present the open-source k-mer
counting software Gerbil, designed for the efficient counting of
k-mers for large k. Given the technology trend towards long reads from
next-generation sequencers, support for large k becomes increasingly
important. While existing k-mer counting tools suffer from excessive memory
consumption or degrading performance for large k, Gerbil is able to
support large k efficiently without much loss of performance. Our software
implements a two-disk approach. In the first step, DNA reads are loaded from
disk and distributed to temporary files stored on a working disk. In a
second step, the temporary files are read again, split into k-mers, and
counted via a hash-table approach. In addition, Gerbil can optionally use GPUs
to accelerate the counting step. For large k, we outperform state-of-the-art
open-source k-mer counting tools on large genome data sets.
Comment: A short version of this paper will appear in the proceedings of WABI
201
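The two-disk approach described above can be illustrated with a toy sketch: phase 1 partitions k-mers into temporary files by hash; phase 2 re-reads each partition and counts it with an in-memory hash table. File naming and the partition count are hypothetical, and the real tool works with packed binary encodings, buffering, and optional GPU acceleration:

```python
import os
import tempfile
from collections import Counter

def two_disk_count(reads, k, parts=4):
    """Toy two-phase k-mer counter in the spirit of a two-disk approach."""
    tmpdir = tempfile.mkdtemp()
    files = [open(os.path.join(tmpdir, f"part{i}"), "w") for i in range(parts)]
    # Phase 1: stream reads from disk, distribute k-mers to partition files
    # on the working disk, keyed by a hash of the k-mer.
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            files[hash(kmer) % parts].write(kmer + "\n")
    for f in files:
        f.close()
    # Phase 2: re-read each partition and count it via a hash table.
    # Each k-mer lands in exactly one partition, so partitions are
    # independent and can be processed one at a time (or in parallel).
    counts = Counter()
    for i in range(parts):
        path = os.path.join(tmpdir, f"part{i}")
        with open(path) as f:
            counts.update(line.strip() for line in f)
        os.remove(path)
    os.rmdir(tmpdir)
    return counts

c = two_disk_count(["ACGTACGT", "TTTACGTA"], 4)
```

The point of the partitioning step is that peak memory is bounded by the largest single partition rather than by the full k-mer spectrum.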
These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure
K-mer abundance analysis is widely used for many purposes in nucleotide
sequence analysis, including data preprocessing for de novo assembly, repeat
detection, and sequencing coverage estimation. We present the khmer software
package for fast and memory-efficient online counting of k-mers in sequencing
data sets. Unlike previous methods based on data structures such as hash
tables, suffix arrays, and trie structures, khmer relies entirely on a simple
probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits
online updating and retrieval of k-mer counts in memory, which is necessary to
support online k-mer analysis algorithms. On sparse data sets this data
structure is considerably more memory efficient than any exact data structure.
In exchange, the use of a Count-Min Sketch introduces a systematic overcount
for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we
analyze the speed, the memory usage, and the miscount rate of khmer for
generating k-mer frequency distributions and retrieving k-mer counts for
individual k-mers. We also compare the performance of khmer to several other
k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC,
Turtle and KAnalyze. Finally, we examine the effectiveness of profiling
sequencing error, k-mer abundance trimming, and digital normalization of reads
in the context of high khmer false positive rates. khmer is implemented in C++
and wrapped in a Python interface, offers a tested and robust API, and is freely
available under the BSD license at github.com/ged-lab/khmer.
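A minimal Count-Min Sketch, the data structure khmer relies on, can be written in a few lines. This is a generic sketch of the structure, not khmer's implementation; the hash construction (salted BLAKE2) is an arbitrary choice for illustration:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: `depth` hash rows of `width` counters.

    Counts are never undercounted; hash collisions cause the systematic
    overcount mentioned in the abstract. Only counters are stored, never
    the k-mers themselves, so memory use is fixed up front.
    """

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One independent-ish hash per row, derived by salting BLAKE2.
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(),
                                salt=row.to_bytes(8, "big")).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, item):
        for row, col in self._indexes(item):
            self.table[row][col] += 1

    def count(self, item):
        # The minimum over rows is an upper bound on the true count.
        return min(self.table[row][col] for row, col in self._indexes(item))

cms = CountMinSketch()
for kmer in ["ACGT", "ACGT", "CGTA"]:
    cms.add(kmer)
```

Because updates touch a fixed number of counters, the structure supports the online (streaming) counting the abstract describes, at the price of one-sided error.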
Indexing arbitrary-length k-mers in sequencing reads
We propose a lightweight data structure for indexing and querying collections
of NGS read data in main memory. The data structure supports the interface
proposed in the pioneering work by Philippe et al. for counting and locating
k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array),
based on finding overlapping reads, is competitive with existing algorithms
in space use, query times, or both. The main applications of our index
include variant calling, error correction, and analysis of reads from RNA-seq
experiments.
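The count/locate interface for k-mers in a read collection can be mimicked with a plain hash index. Note that this toy uses a dictionary rather than PgSA's pseudogenome suffix array, so it illustrates the query interface, not the space-efficient solution:

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Toy count/locate index: maps each k-mer to its occurrences,
    recorded as (read id, offset) pairs."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((rid, pos))
    return index

idx = build_kmer_index(["ACGTAC", "GTACGT"], 3)
hits = idx["TAC"]        # locate: every (read, position) occurrence
n = len(hits)            # count: number of occurrences
```

A dictionary of explicit k-mers must be rebuilt per k; the appeal of a suffix-array-style index is that it answers these queries for arbitrary-length k-mers from a single structure.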
A double-lined spectroscopic orbit for the young star HD 34700
We report high-resolution spectroscopic observations of the young star HD
34700, which confirm it to be a double-lined spectroscopic binary. We derive an
accurate orbital solution with a period of 23.4877 +/- 0.0013 days and an
eccentricity of e = 0.2501 +/- 0.0068. The stars are found to be of similar
mass (M2/M1 = 0.987 +/- 0.014) and luminosity. We also derive the effective
temperatures (5900 K and 5800 K) and projected rotational velocities (28 km/s
and 22 km/s) of the components. These values of v sin i are much higher than
expected for main-sequence stars of similar spectral type (G0), and are not due
to tidal synchronization. We also discuss the indicators of youth available for
the object. Although there is considerable evidence that the system is young
--strong infrared excess, X-ray emission, Li I 6708 absorption (0.17 Angstroms
equivalent width), H alpha emission (0.6 Angstroms), rapid rotation-- the
precise age cannot yet be established because the distance is unknown.
Comment: 17 pages, including 2 figures and 2 tables. Accepted for publication
in AJ, to appear in February 200
Indexing large genome collections on a PC
Motivation: The availability of thousands of individual genomes of one species
should boost rapid progress in personalized medicine or understanding of the
interaction between genotype and phenotype, to name a few applications. A key
operation useful in such analyses is aligning sequencing reads against a
collection of genomes, which is costly with the use of existing algorithms due
to their large memory requirements.
Results: We present MuGI, Multiple Genome Index, which reports all
occurrences of a given pattern, in exact and approximate matching model,
against a collection of thousand(s) genomes. Its unique feature is the small
index size fitting in a standard computer with 16--32\,GB, or even 8\,GB, of
RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is
also fast. For example, exact matching queries are handled in an average time
of 39\,s and with up to 3 mismatches in 373\,s on the test PC with
the index size of 13.4\,GB. For a smaller index, occupying 7.4\,GB in memory,
the respective times grow to 76\,s and 917\,s.
Availability: Software and Supplementary material:
\url{http://sun.aei.polsl.pl/mugi}
Cuspy Dark-Matter Haloes and the Galaxy
The microlensing optical depth to Baade's Window constrains the minimum total
mass in baryonic matter within the Solar circle to be greater than 3.9 x
10^{10} solar masses, assuming the inner Galaxy is barred with viewing angle of
roughly 20 degrees. From the kinematics of solar neighbourhood stars, the local
surface density of dark matter is about 30 +/- 15 solar masses per square
parsec. We construct cuspy haloes normalised to the local dark matter density
and calculate the circular-speed curve of the halo in the inner Galaxy. This is
added in quadrature to the rotation curve provided by the stellar and ISM
discs, together with a bar sufficiently massive so that the baryonic matter in
the inner Galaxy reproduces the microlensing optical depth. Such models violate
the observational constraint provided by the tangent-velocity data in the inner
Galaxy (typically at radii 2-4 kpc). The high baryonic contribution required by
the microlensing is consistent with implications from hydrodynamical modelling
and the pattern speed of the Galactic bar. We conclude that the cuspy haloes
favoured by the Cold Dark Matter cosmology (and its variants) are inconsistent
with the observational data on the Galaxy.
Comment: 5 pages, 1 figure, MNRAS (submitted)
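The quadrature addition of rotation-curve components described above can be written out explicitly; the symbols below are assumed for illustration and do not appear in the original:

```latex
v_c^2(R) \;=\; v_{\mathrm{halo}}^2(R) \;+\; v_{\mathrm{disc}}^2(R) \;+\; v_{\mathrm{bar}}^2(R),
```

where $v_{\mathrm{disc}}$ combines the stellar and ISM discs. The argument is then that once the baryonic terms are normalised to reproduce the microlensing optical depth, the cuspy-halo term pushes the total $v_c(R)$ above the tangent-velocity data at radii of roughly 2-4 kpc.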