Full-fledged Real-Time Indexing for Constant Size Alphabets
In this paper we describe a data structure that supports pattern matching
queries on a dynamically arriving text over an alphabet of constant size. Each
new symbol can be prepended to the text $T$ in $O(1)$ worst-case time. At any
moment, we can report all occurrences of a pattern $P$ in the current text in
$O(|P|+k)$ time, where $|P|$ is the length of $P$ and $k$ is the number of
occurrences.
This resolves, under assumption of constant-size alphabet, a long-standing open
problem of existence of a real-time indexing method for string matching (see
\cite{AmirN08}).
Reconsidering the significance of genomic word frequency
We propose that the distribution of DNA words in genomic sequences can be
primarily characterized by a double Pareto-lognormal distribution, which
explains lognormal and power-law features found across all known genomes. Such
a distribution may be the result of completely random sequence evolution by
duplication processes. The parametrization of genomic word frequencies allows
for an assessment of significance for frequent or rare sequence motifs.
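To make the notion of a word-frequency distribution concrete, here is a minimal sketch that counts overlapping DNA words of a fixed length and then tabulates how many distinct words occur a given number of times. The function names, the word length, and the toy sequence are illustrative assumptions, not taken from the paper, and no distribution fitting is attempted.

```python
from collections import Counter


def word_counts(sequence, k):
    """Count every overlapping word of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def frequency_spectrum(counts):
    """Map each occurrence count n to the number of distinct words seen n times."""
    return Counter(counts.values())


# Toy input (hypothetical); a real analysis would use a whole genome.
toy = "ACGTACGTAC"
counts = word_counts(toy, 2)
spectrum = frequency_spectrum(counts)
```

The spectrum is the empirical object to which a parametric model such as the one proposed in the abstract would be fitted.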
Estimating seed sensitivity on homogeneous alignments
We address the problem of estimating the sensitivity of seed-based similarity
search algorithms. In contrast to approaches based on Markov models [18, 6, 3,
4, 10], we study the estimation based on homogeneous alignments. We describe an
algorithm for counting and random generation of those alignments and an
algorithm for exact computation of the sensitivity for a broad class of seed
strategies. We provide experimental results demonstrating a bias introduced by
ignoring the homogeneousness condition.
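As a rough illustration of what seed sensitivity measures, the sketch below checks whether a seed hits an alignment and computes the fraction of alignments that are hit. It assumes an encoding that is not taken from the paper: an alignment is a string of '1' (match) and '0' (mismatch), and a seed is a string of '#' (required match) and '-' (ignored position). It enumerates a given set of alignments directly and does not implement the counting or generation algorithms for homogeneous alignments described in the abstract.

```python
def seed_hits(seed, alignment):
    """Return start positions where every '#' of the seed falls on a match ('1')."""
    span = len(seed)
    required = [i for i, c in enumerate(seed) if c == "#"]
    return [
        start
        for start in range(len(alignment) - span + 1)
        if all(alignment[start + i] == "1" for i in required)
    ]


def sensitivity(seed, alignments):
    """Fraction of the given alignments hit at least once by the seed."""
    detected = sum(1 for a in alignments if seed_hits(seed, a))
    return detected / len(alignments)


# Hypothetical toy alignments, chosen only for illustration.
alignments = ["1101101", "1110000"]
spaced_hits = seed_hits("##-#", "1101101")      # spaced seed of weight 3
contiguous_hits = seed_hits("###", "1101101")   # contiguous seed of weight 3
```

Restricting the set of alignments passed to `sensitivity` (for example, to homogeneous ones only) changes the estimate, which is the kind of bias the abstract refers to.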
RNF: a general framework to evaluate NGS read mappers
Aligning reads to a reference sequence is a fundamental step in numerous
bioinformatics pipelines. As a consequence, the sensitivity and precision of
the mapping tool, applied with certain parameters to certain data, can
critically affect the accuracy of produced results (e.g., in variant calling
applications). Therefore, there has been an increasing demand for methods of
comparing mappers and measuring the effects of their parameters.
Read simulators combined with alignment evaluation tools provide the most
straightforward way to evaluate and compare mappers. Simulation of reads is
accompanied by information about their positions in the source genome. This
information is then used to evaluate alignments produced by the mapper.
Finally, reports containing statistics of successful read alignments are
created.
In the absence of standards for encoding read origins, every evaluation tool
has to be made explicitly compatible with the simulator used to generate reads.
To overcome this obstacle, we have created a generic format, RNF (Read Naming
Format), for assigning read names with encoded information about original
positions.
Furthermore, we have developed an associated software package, RNF, containing
two principal components. MIShmash applies one of several popular read
simulation tools (DwgSim, Art, Mason, CuReSim, etc.) and transforms the
generated reads into RNF format. LAVEnder then evaluates a given read mapper
using simulated reads in RNF format. Special attention is paid to mapping
qualities, which serve to parametrize ROC curves, and to evaluating the effect
of read sample contamination.
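The idea of parametrizing a ROC curve by mapping quality can be sketched as follows. This is only an illustration under stated assumptions, not LAVEnder's implementation: each read is reduced to a hypothetical pair (mapping quality, correctness flag), where the flag is assumed to have been obtained by comparing the reported position with the origin encoded in the read name. The RNF name syntax itself is not reproduced here.

```python
def roc_points(reads):
    """Return (threshold, correct, incorrect) for each mapping-quality threshold.

    `reads` is a list of (mapq, is_correct) pairs. For each distinct threshold,
    taken in descending order, only reads with mapq >= threshold are counted.
    """
    thresholds = sorted({q for q, _ in reads}, reverse=True)
    points = []
    for t in thresholds:
        kept = [ok for q, ok in reads if q >= t]
        correct = sum(kept)
        points.append((t, correct, len(kept) - correct))
    return points


# Hypothetical evaluation results for five simulated reads.
reads = [(60, True), (60, True), (30, True), (30, False), (0, False)]
points = roc_points(reads)
```

Lowering the threshold admits more reads, so both counts can only grow along the curve; plotting correct against incorrect counts gives one curve per mapper.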
On the combinatorics of suffix arrays
We prove several combinatorial properties of suffix arrays, including a
characterization of suffix arrays through a bijection with a certain
well-defined class of permutations. Our approach is based on the
characterization of Burrows-Wheeler arrays given in [1], which we apply by
reducing suffix sorting to cyclic shift sorting through the use of an
additional sentinel symbol. We show that the characterization of suffix arrays
for a special case of binary alphabet given in [2] easily follows from our
characterization. Based on our results, we also provide simple proofs for the
enumeration results for suffix arrays, obtained in [3]. Our approach to
characterizing suffix arrays is the first that exploits their relationship with
Burrows-Wheeler permutations.
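The reduction from suffix sorting to cyclic shift sorting mentioned above can be illustrated with a small, deliberately naive sketch. It assumes the sentinel is '$' and that it is smaller than every character of the text; the function names are illustrative and the quadratic-time sorting is chosen for clarity, not efficiency.

```python
def suffix_array(text):
    """Naive suffix array: start positions of suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def cyclic_shift_order(text):
    """Start positions of cyclic shifts of text in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:] + text[:i])


def suffix_array_via_shifts(text, sentinel="$"):
    """Sort cyclic shifts of text + sentinel, then drop the sentinel's own shift."""
    assert sentinel not in text and all(sentinel < c for c in text)
    order = cyclic_shift_order(text + sentinel)
    return order[1:]  # order[0] is the shift that starts with the sentinel


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform: last column of the sorted cyclic shifts."""
    padded = text + sentinel
    return "".join(padded[i - 1] for i in cyclic_shift_order(padded))
```

Because the sentinel is unique and smallest, any comparison between two shifts is decided no later than the first sentinel encountered, so the order of shifts coincides with the order of suffixes.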
Spaced seeds improve k-mer-based metagenomic classification
Metagenomics is a powerful approach to studying the genetic content of
environmental samples, and it has been strongly promoted by NGS technologies.
To cope with the massive data involved in modern metagenomic projects, recent
tools [4, 39] rely
on the analysis of k-mers shared between the read to be classified and sampled
reference genomes. Within this general framework, we show in this work that
spaced seeds provide a significant improvement of classification accuracy as
opposed to traditional contiguous k-mers. We support this thesis through a
series of different computational experiments, including simulations of
large-scale metagenomic projects. Scripts and programs used in this study, as
well as supplementary material, are available from
http://github.com/gregorykucherov/spaced-seeds-for-metagenomics.
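To show how a spaced seed changes which k-mers a read shares with a reference, here is a minimal sketch. The seed encoding ('#' for a sampled position, '-' for a skipped one), the function names, and the toy sequences are assumptions made for illustration; this is not the code from the repository above, and it makes no claim about which seed classifies better.

```python
def spaced_kmers(sequence, seed):
    """Set of spaced k-mers: at each offset, keep only the '#' positions of the seed."""
    keep = [i for i, c in enumerate(seed) if c == "#"]
    span = len(seed)
    return {
        "".join(sequence[start + i] for i in keep)
        for start in range(len(sequence) - span + 1)
    }


def shared_fraction(read, reference, seed):
    """Fraction of the read's spaced k-mers that also occur in the reference."""
    read_kmers = spaced_kmers(read, seed)
    if not read_kmers:
        return 0.0
    return len(read_kmers & spaced_kmers(reference, seed)) / len(read_kmers)


# Hypothetical toy data: the read differs from the reference at one position.
reference = "ACGTTGCA"
read = "ACGTAGCA"
contiguous = shared_fraction(read, reference, "###")
spaced = shared_fraction(read, reference, "##-#")
```

A seed made only of '#' characters reduces to ordinary contiguous k-mers, so the same two functions cover both settings compared in the abstract.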